Discussion on the paper by Cule, Samworth and Stewart
Kaspar Rufibach (University of Zurich)
The authors are to be congratulated on the extension of log-concave density estimation to more than one dimension. Their work marks the temporary culmination of substantial research activity in shape-constrained density estimation over the last decade and directs attention to (at least) two directions that previously had received little or no regard. First, apart from very recent concurrent papers by Schuhmacher et al. (2009), Seregin and Wellner (2009), Koenker and Mizera (2010), Dümbgen et al. (2010) and Schuhmacher and Dümbgen (2010), nonparametric estimation of shape-constrained densities in dimension d ≥ 2 has received virtually no attention. Apart from the theoretical obstacles that are related to these problems, this neglect may be attributed to the difficulty of implementing algorithms to maximize the underlying likelihood function. The development of an algorithm and its implementation in R (Cule et al., 2009; R Development Core Team, 2009) for the log-concave case is certainly the first highlight of this paper. In dimension d=1, after realizing that the maximizer of the likelihood function must be piecewise linear with kinks only at the observations, finding the log-concave density estimate boils down to maximizing a concave functional subject to linear constraints; see Rufibach (2007). In the multivariate case, however, it is not clear how to parameterize the class of concave tent functions, which hampers the formulation of a (linearly) constrained maximization problem similar to the univariate scenario. To circumvent this problem the authors modified the initial likelihood function to receive an updated functional whose unconstrained maximizer gives rise to the tent function that corresponds to the log-concave density estimate. This updated functional is concave but non-differentiable, disallowing the use of standard optimization algorithms.
Instead, the authors successfully implemented (Cule et al., 2009) an algorithm due to Shor (1985) which can handle non-differentiable target functionals.
As a second highlight the authors show that the estimator converges to the log-concave density f^{*} that minimizes the Kullback–Leibler divergence to f_{0}, the density of the observations. To the best of my knowledge, this general set-up has not previously been considered in shape-constrained density estimation, not even for d=1: for example, Groeneboom et al. (2001) and Dümbgen and Rufibach (2009) dealt only with the well-specified case where f_{0} is log-concave. However, to assess robustness properties an analysis of the misspecified model is particularly valuable. A natural question in this connection is: what are the limit distribution results under misspecification? In the well-specified univariate case the pointwise limiting distribution is known (see Balabdaoui et al. (2009)) and it seems worthwhile to generalize these results to the misspecified setting.
Having shown consistency in some strong norms, the natural next question is: what rates of convergence can be expected for the log-concave density estimator? For d=1, rates of convergence, either in sup-norm (Dümbgen and Rufibach, 2009) or pointwise (Balabdaoui et al., 2009), have been derived.
If we assume that f_{0} belongs to a Hölder class with exponent α ∈ [1,2], the minimax optimal rate for estimators within such a class can be derived from the entropy structure of the underlying function space and is n^{−α/(2α+d)}; see Birgé and Massart (1993). However, Birgé and Massart (1993) also showed that the rate of convergence for minimum contrast estimators, a class that contains maximum likelihood estimators (MLEs) as a special case, is only n^{−α/(2d)} once d > 2α.
The dependence of the exponents of these rates of convergence on dimension for α=2, i.e. for densities with uniformly bounded second derivative, is displayed in Fig. 10, which reveals that up to dimension d=4 we can conjecture the MLE to be rate efficient, but beyond that the MLE can no longer be expected to reach the minimax optimal rate. Future work should aim at closing this gap.
How to achieve this goal is another open issue: (additional) penalization comes to mind, or consideration of classes of densities that are smaller than the class of log-concave densities but yet nonparametric.
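As a quick sanity check, the two rate exponents just discussed can be tabulated against the dimension; a minimal sketch (α denotes the Hölder exponent, with α = 2 matching the case displayed in Fig. 10):

```python
# Compare the minimax-optimal rate exponent alpha/(2*alpha + d) with the
# minimum-contrast (MLE-type) exponent from Birge and Massart (1993),
# which drops to alpha/(2*d) once d > 2*alpha.

def rate_exponents(alpha, d):
    """Return (minimax exponent, minimum-contrast exponent) in dimension d."""
    optimal = alpha / (2 * alpha + d)
    mle = optimal if d <= 2 * alpha else alpha / (2 * d)
    return optimal, mle

alpha = 2.0  # densities with uniformly bounded second derivative
for d in range(1, 9):
    opt, mle = rate_exponents(alpha, d)
    print(f"d={d}: minimax n^-{opt:.3f}, minimum contrast n^-{mle:.3f}")
```

For α = 2 the two exponents agree up to d = 4 and separate from d = 5 onwards, which is the conjectured loss of rate efficiency of the MLE.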
In addition to solving some important questions this paper has opened up new directions for research in shape-constrained density estimation, and I am convinced that it will stimulate further research in the area. Consequently, I have great pleasure in proposing the vote of thanks.
Aurore Delaigle (University of Melbourne)
I congratulate the authors for a very stimulating, innovative and carefully written paper on the topic of nonparametric estimation of a multivariate density f. A popular estimator in this context is the kernel density estimator (KDE). Although this estimator is consistent under mild smoothness conditions on f, its quality degrades quickly as the dimension increases. The authors suggest imposing a structural assumption of log-concavity on a nonparametric (maximum likelihood) estimator, to improve performance in practice. They develop a nice theoretical study and discuss a variety of interesting applications of their method, encompassing plain density estimation, hypothesis testing and clustering problems.
Compared with the KDE, the numerical improvement that is achieved by the new procedure is impressive. However, one may question the suitability of comparison between these two methods. In particular, the KDE that is discussed in the paper is an unrestricted nonparametric estimator, whereas the methodology suggested incorporates a strong log-concavity shape constraint. Are we surprised to do better by incorporating a priori knowledge of the density? Perhaps it would be more appropriate to compare the new procedure with a KDE which satisfies the same log-concavity constraint. Such shape-restricted KDEs were developed in the literature more than a decade ago (see, for example, the tilting method of Hall and Presnell (1999) and the discussion in Braun and Hall (2001)). Moreover, these modified KDEs are not restricted to log-concavity constraints; they can be used to impose a variety of shapes. Can the method that is suggested by the authors be extended to more general constraints?
When comparing their procedure with the KDE, the authors highlight the fact that their method does not require a choice of a smoothing parameter. This attracts at least two comments.
 (a)
If we were to use a shape-constrained KDE, which would make the comparison between methods fairer, then it is not clear that the choice of a bandwidth would be critical.
 (b)
A consequence of the fact that the authors do not use a smoothing parameter is that their estimator is not smooth, and in fact not differentiable. (See for example their Fig. 2.)
One might say that the estimator is not visually attractive, whereas by introducing a smoothing parameter the authors could make it smooth and differentiable. A simple approach could be to take the convolution between their estimator and a function K_{H}(x)=H^{−1} K(x/H), where the kernel function K is a smooth and symmetric density, and H denotes a small smoothing parameter. In fact, the approach that is discussed in Section 9 is of this type, where the kernel function is the standard normal density and H is a matrix of smoothing parameters. Hence, to make their estimator smooth, the authors suggest introducing a kernel function and smoothing parameters. We may wonder how different from the tilted KDE the resulting estimator is. Incidentally, it is not clear that the theoretical mean integrated squared error bandwidth that is used by the authors in their numerical work is systematically better than a data-driven bandwidth (if the authors had employed the integrated squared error bandwidth, this would not have been questionable).
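Such a convolution smoother is easy to prototype numerically; a one-dimensional sketch, where the piecewise-exponential `f_hat` below is only a stand-in for the actual shape-constrained estimator:

```python
import math

# Sketch (hypothetical 1-d example): smooth a non-differentiable density
# estimate by convolving it with a Gaussian kernel, i.e. compute
# f_smooth(x) = integral of f_hat(t) * phi_h(x - t) dt.

def f_hat(t):
    """Toy log-concave, piecewise-exponential density on [-1, 1]:
    exp(-|t|), truncated and renormalised; it has a kink at 0."""
    if abs(t) > 1.0:
        return 0.0
    return math.exp(-abs(t)) / (2.0 * (1.0 - math.exp(-1.0)))

def smooth(x, h=0.1, grid=2001):
    """Convolution of f_hat with N(0, h^2), approximated by a Riemann sum."""
    lo, hi = -1.0, 1.0
    dt = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        t = lo + i * dt
        phi = math.exp(-0.5 * ((x - t) / h) ** 2) / (h * math.sqrt(2 * math.pi))
        total += f_hat(t) * phi * dt
    return total

# The smoothed estimate is differentiable at 0, where f_hat has a kink,
# and its total mass remains approximately 1.
```

The price, as noted above, is the reintroduction of a smoothing parameter h and an enlarged (in the Gaussian case, infinite) support.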
In Section 1, application (d), the authors suggest that their shaperestricted estimator be employed to assess the validity of a parametric model. Since their estimator already contains a rather strong shape constraint, it is not clear that this procedure would be appropriate. In most cases their estimator will contain a systematic bias (which does not vanish as the sample size increases) and it seems a little odd to use such an estimator to infer the validity of a parametric model. For example, an incorrect shape constraint can give the erroneous impression that the systematic bias of a wrong parametric model is smaller than it really is.
In Section 7, it is a little surprising that the authors consider examples (a) and (b) as potential applications of their method. Clearly, in both cases, one could employ empirical estimators which, unlike the authors’ procedure, do not rely on any shape restriction. For the other examples that are treated in that section (where an empirical procedure is not available), again, the authors compare their method with the KDE, but the comparison does not seem fully satisfactory. First, as already noted above, the authors did not use the shape-restricted version of the KDE (the choice of the bandwidth is perhaps also questionable). Second, is it clear that, in the examples considered, imposing a shape constraint brings as much improvement as in the context of density estimation? This is particularly questionable for integrated quantities, which are easier to estimate than a full multivariate density (in such problems KDEs can usually achieve very good performance by undersmoothing, i.e. by using a bandwidth that is much smaller than for density estimation). How robust is the new estimator against non-log-concavity in such problems? Is it clear that the gain that can be obtained by imposing the right shape constraint is worth the loss that can occur by imposing a wrong constraint?
The vote of thanks was passed by acclamation.
Wenyang Zhang (University of Bath) and Jialiang Li (National University of Singapore)
We congratulate Dr Cule, Dr Samworth and Dr Stewart for such a brilliant paper. We believe that this paper will have a big influence on the estimation of multivariate density functions and will stimulate much further research in this direction.
The commonly used approach to estimate density functions is based on kernel smoothing. The authors take a completely different approach; by making use of the log-concavity of the density function, they transform the density estimation problem into a non-differentiable convex optimization problem, which is quite interesting.
As the authors rightly point out, kernel density estimation suffers from boundary effects. There are boundary correction methods to allay this problem. It would be interesting to see a comparison between the method proposed and kernel density estimation with boundary correction. We can envisage that, even with the boundary correction, kernel density estimation would still not perform as well as the method proposed does, as kernel density estimation does not make use of the log-concavity information. We suspect that it is probably not easy to incorporate log-concavity information into kernel density estimation.
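One standard boundary correction is the reflection method; a sketch for a density supported on [0, ∞), with an illustrative sample and bandwidth:

```python
import math, random

# Sketch: reflection boundary correction for a kernel density estimate of a
# density on [0, infinity).  Each point x_i is mirrored to -x_i, so mass that
# the plain KDE would leak across the boundary is folded back in.

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Plain kernel density estimate with a Gaussian kernel."""
    return sum(gauss((x - xi) / h) for xi in data) / (len(data) * h)

def kde_reflected(x, data, h):
    """Reflection-corrected estimate for a density on [0, inf)."""
    if x < 0:
        return 0.0
    return kde(x, data, h) + kde(-x, data, h)

random.seed(1)
data = [random.expovariate(1.0) for _ in range(500)]  # true density exp(-x)
h = 0.3
# Near the boundary the plain KDE underestimates f(0) = 1 by roughly a
# factor of 2, while the reflected version is noticeably closer.
```

Even so, the correction only repairs the boundary bias; it does not exploit log-concavity, which is the point made above.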
In multivariate kernel density estimation, the dimensionality could be a problem. Would the dimensionality be an issue in the method proposed? How would the dimensionality affect the convergence of the algorithm and the accuracy of the estimator proposed?
In real life, we often want to estimate the conditional density function of a response variable or vector given some covariates. Does the method proposed apply to the case where there are some covariates?
The basic idea of kernel estimation for conditional density functions is as follows: suppose that (X_{i}^{T}, Y_{i}), i=1,…,n, are independent and identically distributed copies of (X^{T},Y). By a simple calculation, we have
 E{K_{h}(Y − y) | X = x} ≈ p(y|X=x).    (1)
where p(y|X=x) is the conditional density function of Y given X=x, K_{h}(·)=K(·/h)/h, K(·) is a kernel function such that ∫K(u) du=1 and h is a bandwidth. Expression (1) leads to the nonparametric regression model
 K_{h}(Y_{i} − y) = p(y|X=X_{i}) + ɛ_{i},   i=1,…,n.    (2)
The conditional density function estimation is now transformed to a nonparametric regression problem, and the estimator of conditional density functions can be obtained by nonparametric regression. As in standard nonparametric modelling, we must impose some conditions on p(y|X=x) when the dimension of X is not very small, owing to the ‘curse of dimensionality’. Which conditions should be imposed depends on the data set that we analyse and the problem that we are interested in.
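The regression route can be sketched with a local constant (Nadaraya–Watson) fit of the synthetic responses K_h(Y_i − y) on X_i; all names, bandwidths and the toy model below are illustrative:

```python
import math, random

# Sketch of the regression route to conditional density estimation:
# regress the synthetic responses K_h(Y_i - y) on X_i, here with a
# Nadaraya-Watson (local constant) smoother, to estimate p(y | X = x).

def K(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def cond_density(y, x, xs, ys, h_y, h_x):
    """Estimate p(y | X = x) by smoothing K_{h_y}(Y_i - y) against X_i."""
    num = den = 0.0
    for xi, yi in zip(xs, ys):
        w = K((x - xi) / h_x)             # local weight in the X direction
        num += w * K((y - yi) / h_y) / h_y  # synthetic response K_h(Y_i - y)
        den += w
    return num / den

random.seed(2)
n = 2000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [xi + random.gauss(0.0, 0.5) for xi in xs]  # Y | X=x ~ N(x, 0.25)

# The estimate of p(y | X=0) should peak near y = 0 with sd about 0.5.
```

With multivariate X the same formula applies with a product kernel, which is exactly where the curse of dimensionality mentioned above bites.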
Beaumont et al. (2002) proposed this regression approach to compute the posterior density function in approximate Bayesian computation with Y being the parameter concerned and X being the vector of selected statistics.
Vikneswaran Gopal and George Casella (University of Florida, Gainesville)
In Appendix B.3, the authors suggest an accept–reject (AR) algorithm to sample from the fitted maximum likelihood estimate of the density. We note the following.
 (a)
As the true density moves away from log-concavity, the acceptance rate falls.
 (b)
When both algorithms use an equal number of random variables, we show empirically that MH sampling yields standard errors at least as small as those from AR sampling.
If MH and AR algorithms use the same generating candidate, MH sampling always has a higher acceptance rate (Robert and Casella (2004), lemma 7.9).
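The ordering in the lemma can be checked empirically on a toy target; a minimal sketch, with target, candidate and constants all invented for illustration:

```python
import random

# Empirical check of the ordering (Robert and Casella, 2004, lemma 7.9):
# with the SAME candidate density, an independence Metropolis-Hastings
# sampler accepts at least as often as accept-reject.
# Toy target f(x) = 2x on [0, 1], uniform candidate, envelope constant M = 2.

random.seed(3)
f = lambda x: 2.0 * x
M = 2.0
n = 20000

# Accept-reject: accept candidate y with probability f(y) / M.
ar_accepts = sum(random.random() < f(random.random()) / M for _ in range(n))

# Independence MH: accept y with probability min(1, f(y) / f(x)).
x, mh_accepts = 0.5, 0
for _ in range(n):
    y = random.random()
    if random.random() < min(1.0, f(y) / f(x)):
        x = y
        mh_accepts += 1

ar_rate = ar_accepts / n  # theoretical value 1/M = 0.5
mh_rate = mh_accepts / n  # theoretical value 2/3 here, strictly higher
```

For this target the AR acceptance rate is 1/M = 0.5 while the MH rate works out to 2/3, in line with the lemma.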
Metropolis–Hastings candidate density
As in the main paper, denote the estimated density by f̂_{n}. A natural modification of the authors’ AR candidate gives our MH proposal density Q,
 Q(y) = Σ_{j=1}^{J} q_{j} 1{y ∈ C_{n,j}} / λ(C_{n,j}),    (3)
where the C_{n,j} are defined in the paper. The volume of a simplex in ℝ^{d} is λ(C_{n,j})=(1/d!)|det(A_{j})| (Stein, 1966), and the resulting MH algorithm at iteration n is given by the following steps.
 Step 1:
given X_{n}=x, pick a simplex C_{n,j} according to the probabilities (q_{1},q_{2},…,q_{J}) and sample from the uniform distribution on this simplex to obtain the candidate Y_{n+1}=y.
 Step 2:
compute the MH ratio, which, for x ∈ C_{n,i} and y ∈ C_{n,j}, is explicitly given by

 ρ(x, y) = min[1, {f̂_{n}(y) q_{i} λ(C_{n,j})} / {f̂_{n}(x) q_{j} λ(C_{n,i})}].    (4)
 Step 3:
set X_{n+1}=y with probability given by the MH ratio (4), and X_{n+1}=x otherwise.
We considered five sampling densities (Table 4): a correlated bivariate normal, two gammas and two t-distributions, and generated 150 observations from each. Both algorithms simulated n=5000 candidates and computed the mean of the random variables returned. To obtain standard error estimates, each mean was computed 100 times. (To avoid burn-in issues, the MH algorithm was started with an AR step.)
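The simplex volume formula quoted above, λ(C) = |det A|/d!, is easy to verify on toy examples; a minimal Python sketch (the vertices are invented, not the C_{n,j} of the paper):

```python
import math

# Sketch: the volume of a d-simplex C with vertices v_0, ..., v_d is
# |det(A)| / d!, where the columns of A are v_1 - v_0, ..., v_d - v_0.

def det(m):
    """Determinant by cofactor expansion along the first row (fine for small d)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def simplex_volume(vertices):
    """Volume of the simplex spanned by d+1 points in R^d."""
    v0 = vertices[0]
    d = len(v0)
    a = [[vertices[j + 1][i] - v0[i] for j in range(d)] for i in range(d)]
    return abs(det(a)) / math.factorial(d)

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]           # area 1/2
tetrahedron = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
               (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]          # volume 1/6
```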
Table 4. True means and estimated means (with standard errors in parentheses) for AR and MH algorithms, from a total of 5000 random variables for each algorithm

                          Results when log-concave           Results when not log-concave
                          Bivariate normal    Γ(1.1, 2)      Γ(0.9, 2)    t(1)       Bivariate t
True sample mean           1.888   −1.980      0.519          0.414       −0.683     −1.144   0.208
Estimate of mean, AR       1.888   −1.980      0.519          0.414       −0.698     −1.117   0.237
                          (0.018) (0.018)     (0.010)        (0.016)     (0.134)    (0.501) (0.433)
Estimate of mean, MH       1.891   −1.979      0.517          0.414       −0.677     −1.135   0.254
                          (0.017) (0.016)     (0.010)        (0.014)     (0.115)    (0.296) (0.293)
Acceptance rate, AR        0.575               0.402          0.120        0.121      0.020
Acceptance rate, MH        0.856               0.677          0.246        0.266      0.158
As seen in Table 4, the acceptance rate falls as we move from left to right. However, MH sampling is consistently better than AR sampling and provides standard errors at least as good as those of the AR algorithm, with more pronounced differences towards the right-hand side of Table 4. Note also that switching from Γ(1.1, 2) to Γ(0.9, 2), which crosses the log-concave border, causes the acceptance rate to plummet.
Fig. 11 provides some insight into the relative MH–AR performance. If the AR scheme picks simplex 2, it then samples from the conditional density on that simplex via a uniform proposal. The disparity between the uniform and the conditional density results in a large number of rejected random variables.
When the underlying density is not log-concave, the AR approach has problems because the fitted density will be log-concave, and hence have light tails. This corresponds to steep slopes on the boundary simplices of the convex hull defining the support of f̂_{n}. The fat tails of the true distribution cause the q_{i}s to be large for these simplices, which the AR scheme picks often but does not generate from efficiently.
JingHao Xue (University College London) and D. M. Titterington (University of Glasgow)
We congratulate the authors on this most impressive paper. In this contribution, we discuss four issues that are related to applying the LogConcDEAD method to clustering or classification problems.
First, as pointed out by the authors, a shortcoming of the LogConcDEAD method is its performance for small samples, which is mainly caused by the restriction of the support of the underlying density estimate to the convex hull C_{n}, which is determined purely by the observed data. This C_{n} is almost inevitably a proper subset of the true support, in which case the integral in equation (3.2) is less than, rather than equal to, 1. To mitigate the negative effect of such an underestimated support, it is reasonable to postprocess the estimated density. One way of postprocessing, as suggested by the authors in Section 9, is to use a Gaussian kernel to smooth the estimate. However, this leads to a virtually infinite support. Alternatively, we may consider postprocessing by extending the lowest exponential surfaces of the estimate downwards to zero, such that a finite support larger than C_{n} can be obtained naturally.
Secondly, classification is challenging when there is class imbalance in data: in the case of two-group discrimination, there are often a majority group and a minority group, with the size of the former being very much larger than that of the latter. We may consider using LogConcDEAD for the majority group while using a kernel-based method for the minority group. Nevertheless, it would be attractive to use LogConcDEAD throughout, if the small sample performance of LogConcDEAD is comparable with that of kernel-based methods.
Thirdly, the authors note from Table 1 an interesting pattern: the number of iterations decreases as the dimension d increases. Is this pattern influenced by the termination criteria, given that we note that in the experiments the criteria are not adaptive to the dimension d? For example, when d increases, is it possible that the integral goes to 1 faster than in the case of a smaller d? If such behaviour implies an undesired convergence, it might be better to make the parameters δ, ɛ or η adaptive to d; however, this leads to further complexity, which may not be worthwhile.
Finally, it is common for clustering and classification to involve data of moderate or high dimension. Therefore, for the method to be attractive the computational complexity of the LogConcDEAD method must be reduced substantially.
Kevin Lu and Alastair Young (Imperial College London)
This is a clever, elegant paper and, reassuringly, it gives proper consideration to the question of what happens if the central assumption of log-concavity is violated. But perhaps the authors undersell the full power of their method in these circumstances: high accuracy can often be obtained if we avoid interpreting model constraints too rigidly.
In the context of likelihoodbased parametric inference, conventional aspirations of what might be achieved under model misspecification are typically limited to ensuring asymptotic validity, rather than small sample accuracy.
Let Y={Y_{1},…,Y_{n}} be a random sample from an underlying density g(y), modelled (perhaps incorrectly) by a parametric density f(y;θ), with θ=(ψ,φ) and scalar ψ. Let θ_{0}=(ψ_{0},φ_{0}) maximize T(θ)=∫ log {f(y;θ)} g(y) dy and suppose that we test H_{0}:ψ=ψ_{0} against a one-sided alternative, say ψ>ψ_{0}. Inference may be based on the signed root likelihood ratio statistic R, refined by either
 (a)
simulating the distribution of R under the assumed model, or
 (b)
using an adjusted form of R, such as Barndorff-Nielsen's R^{*}-statistic (Barndorff-Nielsen, 1986).
Under model misspecification, none of these procedures is asymptotically valid. Using an estimate of the asymptotic variance v of R we can, however, construct a statistic R^{′} which is asymptotically N(0,1), whether the distributional assumption is correct or not. Inference based on an N(0,1) approximation to the distribution of R^{′} is a sensible safeguard against misspecification and still achieves an O(n^{−1/2}) error rate, albeit with some loss of efficiency for small n. But we can do much better for small n by simulating the distribution of R^{′} under the assumed (wrong) distribution, as this distribution typically does not change much with the underlying (true) distribution g.
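Simulating the distribution of a signed root statistic under the assumed model can be sketched for a toy exponential model (not the inverse Gaussian example considered below; all names and constants are illustrative):

```python
import math, random

# Sketch: simulate the null distribution of the signed root likelihood ratio
# statistic R rather than relying on its N(0,1) limit.  Toy model:
# Y_i ~ exponential with mean theta, testing H0: theta = theta0.

def signed_root(ys, theta0):
    """R = sign(thetahat - theta0) * sqrt(2 * (l(thetahat) - l(theta0)))."""
    n = len(ys)
    ybar = sum(ys) / n  # the MLE of theta
    llr = n * (math.log(theta0 / ybar) + ybar / theta0 - 1.0)
    return math.copysign(math.sqrt(max(2.0 * llr, 0.0)), ybar - theta0)

def simulated_pvalue(r_obs, theta0, n, reps=4000, seed=4):
    """P(R >= r_obs) estimated by Monte Carlo under the assumed null model."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        ys = [rng.expovariate(1.0 / theta0) for _ in range(n)]
        if signed_root(ys, theta0) >= r_obs:
            count += 1
    return count / reps

# For small n the simulated null distribution of R is visibly skewed, so the
# Monte Carlo p-value can differ appreciably from the Phi(R) approximation.
```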
For example, suppose that our parametric assumption is of the inverse Gaussian distribution,

 f(y; μ, λ) = {λ/(2πy^{3})}^{1/2} exp{−λ(y−μ)^{2}/(2μ^{2}y)},   y>0,
with interest parameter the shape λ, and the mean μ as nuisance, whereas the true distribution g is gamma, with scale parameter 1. Fig. 12 shows, for n=10, the densities of the various statistics both under the model assumption and under various cases of the gamma distribution g. The stability of the distribution of R^{′}, and the fact that this distribution is far from its N(0,1) limit for n=10, are apparent. In Table 5 we compare, from a series of 50000 replications, the nominal and actual size properties of tests derived by normal approximation to the distributions of R, R^{′} and R^{*} (Φ(R), Φ(R^{′}) and Φ(R^{*})) with those of the procedure which simulates the distribution of the relevant statistic. The simulation procedure performs well compared with the N(0,1) approximation under model misspecification, though with noticeable loss of accuracy compared with the case of correct specification. Harnessing the stability of R^{′} allows excellent small sample accuracy.
Table 5. Actual sizes of tests of different nominal size, inverse Gaussian shape example, for the two cases: g misspecified and g correctly specified

                      Results for the following nominal sizes:
                      0.010   0.050   0.100   0.900   0.950   0.990

g is gamma (scale = 1, shape = 5.5): misspecified
Φ(R^{′})              0.022   0.056   0.090   0.696   0.776   0.886
‘Bootstrap’ R^{′}     0.012   0.055   0.106   0.871   0.931   0.984
Φ(R^{*})              0.032   0.082   0.132   0.859   0.923   0.982

g is inverse Gaussian (mean = 1, shape = 2): correctly specified
Φ(R)                  0.004   0.023   0.051   0.808   0.890   0.971
‘Bootstrap’ R         0.010   0.050   0.101   0.902   0.950   0.990
Φ(R^{*})              0.009   0.050   0.099   0.900   0.950   0.990
Mervyn Stone (University College London)
The authors of this theoretically impressive paper say that theorem 3 has, in itself, a ‘desirable robustness property’. The same should therefore apply to the close analogue of theorem 3, for least squares estimation with an untrue model (Table 6).
Table 6. Analogous questionable robustnesses

Step   Density estimation of f                                  Least squares estimation of β
I      X_{1},…,X_{n} from unknown true probability density f    E(Y)=Xβ: true X and β
II     {f_{0}}: untrue (‘misspecified’) log-concave model       {Zγ}: untrue model
III    f^{*}: the f_{0} that is Kullback–Leibler closest to f   Zγ^{*}: the Zγ vector closest to Xβ
IV     f̂_{n} → f^{*}: theorem 3                                 Zγ̂ → Zγ^{*}
The proviso that the data-generating f be ‘not too far from’ log-concavity rather begs the question of robustness. If ‘robustness’ means anything, it must accommodate what the real world dictates, and that is not a theoretical question. For empirical least squares, most statisticians would think that there is no such robustness in the research underlying the formulae for funding England's primary care trusts. At the heart of the Department of Health's case for reallocating £10 billion (13%) of England's primary care trust funding (Stone, 2010), there was a supposedly plausible linear model {Zγ} with Z=(V v) and γ^{T}=(δ^{T} ɛ), in which the least squares estimate of ɛ (the coefficient of the variable v) was, somewhat paradoxically, held to have a ‘wrong’, implausibly negative sign. The dependent variable y was a local measure of healthcare need. The negative sign was taken to reveal ‘unmet need’ in areas with high values of the socioeconomic variable v, justifying a later reallocation of the £10 billion to favour those areas. However, there is no intrinsic robustness in such a conclusion. Simply extend Z to X=(Z a) by including just one of the variables omitted for one reason or another (the research did plenty of that). The consequences for the estimation of ɛ in the model E(y)=Zγ+αa are not now ascertainable without a historical reanalysis of the data. The outcome would depend on both the magnitude of the omitted component αa and its orientation to the subspace {Zγ}, and to the vector v in particular. The ‘wrong sign’ might be restored to ‘plausible’ positivity and the case for moving billions would have been weakened.
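The omitted-variable point can be reproduced in a few lines of least squares; a sketch with invented numbers (not the primary care trust data):

```python
# Sketch: the sign of a fitted coefficient is not robust to an omitted
# variable.  Regress y on v alone, then on v together with an omitted
# variable a correlated with v, and watch the coefficient of v flip sign.

def ols(X, y):
    """Solve the normal equations X'X b = X'y by Gaussian elimination."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for i in range(p):  # forward elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for k in range(i + 1, p):
            m = A[k][i] / A[i][i]
            for j in range(i, p):
                A[k][j] -= m * A[i][j]
            b[k] -= m * b[i]
    coef = [0.0] * p
    for i in reversed(range(p)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, p))) / A[i][i]
    return coef

v = [1.0, 2.0, 3.0, 4.0, 5.0]
a = [1.2, 2.1, 2.9, 4.2, 5.1]                       # omitted variable, correlated with v
y = [2.0 * ai - 1.0 * vi for vi, ai in zip(v, a)]   # true model: y = 2a - v

slope_short = ols([[vi] for vi in v], y)[0]                 # y on v alone: positive
slope_long = ols([[vi, ai] for vi, ai in zip(v, a)], y)[0]  # y on (v, a): negative
```

The short regression assigns v a confidently positive coefficient; adding the omitted variable restores the true negative one, which is exactly the fragility described above.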
Yingcun Xia (National University of Singapore) and Howell Tong (London School of Economics and Political Science)
We congratulate the authors on their breathtaking paper. We have two questions and two comments.
MingYen Cheng (University College London)
This interesting paper establishes existence and uniqueness of a nonparametric maximum likelihood estimator (MLE) for a multidimensional log-concave density function and suggests the use of Shor's r-algorithm to compute the MLE. Some characterization of a multidimensional log-concave density function is also given. Compared with the univariate case, for which the authors provide a comprehensive literature review, such investigations are much more challenging, although of no less importance in many areas of inference. By giving illustrative examples, this paper further exploits statistical problems where multidimensional log-concave modelling may be useful; this includes classification, clustering, validation of a smaller (parametric) model and detecting mixing. In what follows, a few questions are raised in the hope of stimulating interest in future studies on multidimensional log-concave densities and applications.
Basically, log-concavity is a stronger assumption than the unimodal shape constraint. In using log-concave constrained estimators to assess suitability of a smaller model, it is sensible to require that the model under investigation is log-concave, in which case the present approach is expected to be more powerful than using unimodal smoothing or simply nonparametric smoothing; both have been extensively studied in the literature. Some types of mixing can be distinguished from log-concavity, whereas others cannot. For example, a mixture of two log-concave densities can be either log-concave or not, and the method can detect only mixtures that are no longer log-concave. Further characterization of log-concave densities and their mixtures seems necessary to gain more insight into this problem.
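The claim that a two-component mixture of log-concave densities may or may not be log-concave can be checked numerically; a sketch using normal mixtures (the grid and tolerance are ad hoc):

```python
import math

# Numerical check: an equal-weight mixture of two unit normals is log-concave
# when the means are close, and loses log-concavity once they are far apart.

def normal_mixture(x, mu1, mu2, w=0.5):
    phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return w * phi(x - mu1) + (1 - w) * phi(x - mu2)

def is_log_concave(f, lo=-8.0, hi=8.0, grid=1600):
    """Check that second differences of log f are <= 0 on a grid."""
    dx = (hi - lo) / grid
    logs = [math.log(f(lo + i * dx)) for i in range(grid + 1)]
    return all(logs[i - 1] - 2 * logs[i] + logs[i + 1] <= 1e-9
               for i in range(1, grid))

close = lambda x: normal_mixture(x, -0.5, 0.5)  # unimodal, log-concave
far = lambda x: normal_mixture(x, -3.0, 3.0)    # bimodal, not log-concave
```

Only the second mixture would be flagged by a log-concavity-based test, which is precisely the limitation noted above.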
Shifting back to the MLE, although the authors acknowledge that the computational burden remains an issue, it is worthwhile to seek approximations that allow fast implementation or dimension reduction techniques for log-concave densities; in practice the performance deteriorates quickly when the dimension becomes larger, and usually one does not go beyond d=3. A question is whether the log-concavity framework allows certain simple dimension reduction transformations. Of course, studying the rate of convergence of the MLE and of estimators of functionals of the density is important to understanding or assuring the performance from the theoretical viewpoint. Finally, there is an abundant literature on shape-constrained estimation based on alternative approaches, such as kernel smoothing and penalized likelihood. Interesting questions include what the differences and similarities between these approaches are and whether ideas for one approach can be used in another.
Peter Hall (University of Melbourne and University of California, Davis)
This paper contains fascinating, elegant results, and the authors are to be congratulated on a lovely piece of work. Of course, the paper also generates a hunger for still more, e.g. for information about the rate of convergence, but I assume that this will appear in the fullness of time. One cannot help but conjecture that, since log-concavity is essentially a property of the second derivative of the density, the rate of convergence will be the same as for a kernel estimator when the density has two derivatives and the bandwidth is chosen optimally.
The implications of log-concavity for nonparametric inference are perhaps a little unclear, because the severity of the constraint seems difficult to judge. The fact that log-concavity implies unique maximization of the likelihood suggests that it is rather confining, although the authors can perhaps contradict this. Is there an interesting class of constraints that imply unique maximization of the likelihood, and for which analogues of the authors’ results can be derived?
Log-concavity can be enforced by using a variety of other approaches, including the biased bootstrap and data sharpening methods of Hall and Presnell (1999) and Braun and Hall (2001) respectively. I have tried the first method in the log-concave case, and, like other applications of the biased bootstrap to impose shape on function estimators (e.g. the constraints of unimodality and monotonicity in the context of density estimation), it makes the estimator much less susceptible than usual to the choice of the smoothing parameter, e.g. to the selection of the bandwidth in a kernel estimator. This property resonates with the authors’ result that a log-concave density estimator can be constructed by ‘maximum likelihood’ without the need for a smoothing parameter.
Jon Wellner (University of Washington, Seattle)
I congratulate the authors on their very interesting paper. It has already stimulated considerable further research in an area which deserves much further investigation and which promises to be useful from several perspectives.
I shall focus my comments on some possible avenues for further developments and briefly mention some related work.
An alternative to the log-concave class
The classes of hyperbolically completely monotone and hyperbolically k-monotone densities that were studied by Bondesson (1990, 1992, 1997) offer one way of introducing a very interesting family of shape-constrained densities with a range of smoothness and useful preservation properties on (0,∞). As Bondesson (1997) showed,
 (a)
the hyperbolically monotone densities of order k on (0,∞) are closed under formation of products of the corresponding (independent) random variables, and hence under sums of the logarithms of these random variables in the transformed classes on (−∞,∞);
 (b)
the logarithm of a random variable with a hyperbolically monotone density of order 1 has a density which is log-concave on ℝ;
 (c)
the logarithms of the class of random variables with hyperbolically completely monotone densities yield a class of random variables which contains the Gaussian densities on ℝ.
These facts suggest several further problems and questions.
 (i)
Can we estimate a hyperbolically monotone density of order k nonparametrically for k ≥ 2, and hence their natural log-transforms on ℝ? (For k=1 such nonparametric estimators follow from the existence of nonparametric estimators of a log-concave density as studied in Dümbgen and Rufibach (2009) and Balabdaoui et al. (2009).)
 (ii)
Do there exist ‘natural’ generalizations of the hyperbolically k-monotone classes to (0,∞)^{d} which, when transformed to ℝ^{d}, include the Gaussian densities? Such classes, if they exist, would generalize the multidimensional log-concave class that was studied by the authors and give the possibility of trading off smoothness and dimension, with smaller classes of densities offering many of the advantages of the log-concave class but with more smoothness.
These possibilities might be related to the authors’ nice observation in (b) of their discussion concerning the possibility of further smoothing of the maximum likelihood estimator.
Multivariate convex regression, including some work on the algorithmic side, has recently been studied by Seijo and Sen (2010).
Arseni Seregin (University of Washington, Seattle)
I thank the authors for their stimulating contribution to shapeconstrained estimation and inference. I shall limit my comments to a brief discussion of related classes of shapeconstrained families which may be of interest.
The log-concave class may be too small
As mentioned by the authors, log-concave densities have tails which decline at least exponentially fast. Larger classes of densities, the classes of s-concave densities, were introduced in both econometrics and probability in the 1970s and connected with the theory of s-concave measures by Borell (1975). A useful summary of the properties of these classes, including preservation properties under marginalization, formation of products and convolution, has been given by Dharmadhikari and Joag-Dev (1988). An initial study of estimation in such classes via likelihood methods is given in Seregin and Wellner (2010), and via minimum contrast estimation methods in Koenker and Mizera (2010). Much more research concerning properties of the estimators and development of efficient algorithms for various estimators in these classes is needed.
The log-concave class may be too big
Theory for smooth functionals
A large number of interesting problems arise from the authors’ Section 7 concerning the proposed plug-in estimators of (smooth) functionals of a log-concave density f. To the best of our knowledge the corresponding class of estimators has not been studied thoroughly even in the case of Grenander’s maximum likelihood estimator of a monotone decreasing density on [0,∞).
Qiwei Yao (London School of Economics and Political Science)
This paper provides an elegant solution to an important statistical problem. The extension to the estimation of regression functions that is presented in Dümbgen et al. (2010) is also attractive. I would like to make two remarks and to pose one open-ended question.
I wonder whether it is necessary for theorem 1 to assume that the observations are independent and identically distributed, as the result is largely geometric. Is it enough to assume that all X_{i} share the same distribution? If so, the method proposed would be applicable to, for example, vector time series data.
Estimation of the conditional density f(y|x) is another important and difficult problem. Since log {f(y|x)} = log {f(y,x)} − log {f(x)}, the method proposed provides an estimator for f(y|x) by estimating log {f(y,x)} and log {f(x)} separately, and the support of the conditional density f(·|x) is identified as {y : f(y,x)>0}. All this involves no smoothing.
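The identity above can be sketched numerically. In the sketch below a simple Gaussian product-kernel estimator stands in for the log-concave maximum likelihood estimator (which would come from the LogConcDEAD package in R); the toy data, the bandwidth h and all function names are illustrative choices of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint sample: X ~ N(0,1) and Y | X = x ~ N(x,1), so the true
# conditional density f(y|x) is N(x,1)
n = 5000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

h = 0.25  # fixed illustrative bandwidth


def kde_marginal(x0, data, h):
    # One-dimensional Gaussian kernel density estimate at the point x0
    u = (x0 - data) / h
    return np.exp(-0.5 * u ** 2).sum() / (len(data) * h * np.sqrt(2 * np.pi))


def kde_joint(x0, ygrid, dx, dy, h):
    # Product-Gaussian kernel estimate of f(y, x0) on a grid of y values
    ux = (x0 - dx) / h
    uy = (ygrid[:, None] - dy[None, :]) / h
    return np.exp(-0.5 * (ux[None, :] ** 2 + uy ** 2)).sum(axis=1) / (
        len(dx) * h ** 2 * 2 * np.pi)


x0 = 0.5
ygrid = np.linspace(-4.0, 5.0, 400)

# log f(y|x) = log f(y,x) - log f(x), then exponentiate the difference
log_cond = np.log(kde_joint(x0, ygrid, x, y, h)) - np.log(kde_marginal(x0, x, h))
f_cond = np.exp(log_cond)

step = ygrid[1] - ygrid[0]
mass = (f_cond * step).sum()               # total mass: should be close to 1
cond_mean = (ygrid * f_cond * step).sum()  # should be close to x0 = 0.5
print(mass, cond_mean)
```

Because the marginal of a product-kernel joint estimate is the corresponding marginal kernel estimate, f_cond integrates to (approximately) 1 automatically; with the log-concave estimator in place of the kernel stand-in, no bandwidth would be needed at all.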
Smoothing is a tricky technical issue in multivariate nonparametric estimation and is associated with many practical difficulties. As illustrated in this paper, we are better off without it if possible. But, if the density function to be estimated is known to be smooth, say continuously differentiable, is it possible to incorporate this information in the algorithm?
Roger Koenker (University of Illinois, Urbana–Champaign) and Ivan Mizera (University of Alberta, Edmonton)
We are pleased to have this opportunity to congratulate the authors on this contribution to the growing literature on log-concave density estimation. Having begun to explore regularization of multivariate density estimation via concavity constraints several years ago (Mizera and Koenker, 2006), we can also sympathize with the prolonged gestation period for publication of such work.
We feel that the authors may be too pessimistic about Newton-type methods when rationalizing their gradient descent approach to computation. Interior point algorithms for convex optimization have been remarkably successful in adapting barrier function methods to a variety of non-smooth problems while still employing Newton steps. Linear programming has served as a prototype for these developments, but there has been enormous progress throughout the full range of convex optimization.
Our computational experience has focused on finite difference methods that impose both the concavity and the integrability constraints on a grid, with increments controlling the accuracy of the approximation. Even on rather fine grids this approach combined with modern interior point optimization is quite quick. For the bivariate example in Koenker and Mizera (2010) with 3000 points, computing the Hellinger estimate subject to the f^{−1/2} concavity constraint takes about 23 s, whereas the maximum likelihood estimate with the log-concavity constraint required 45 min on the same machine with the LogConcDEAD package implementing the authors’ algorithm.
The authors express the hope that their results for maximum likelihood estimation of log-concave densities may offer ideas that can be transferred to more general settings. Koenker and Mizera (2010) establish a polyhedral characterization, kindred to that exemplified by Fig. 1, for a class of maximum entropy estimators imposing concavity on corresponding transformations of densities. Particular special cases include maximum likelihood estimation of log-concave densities; the instance that we find especially appealing amounts to minimizing a Hellinger entropy criterion for densities f such that f^{−1/2} is concave. This class of densities covers the Student t-densities with degrees of freedom ν ≥ 1. Whether any similar polyhedral representation holds for maximum likelihood estimation subject to such concavity requirements, as recently proposed by Seregin and Wellner (2010), is not clear.
In view of this common polyhedral characterization, it would be interesting to know whether the Shor approach can be adapted to this broader class of quasi-concave estimation problems. We look forward to the authors’ opinion on this.
The following contributions were received in writing after the meeting.
Christoforos Anagnostopoulos (University of Cambridge)
The multidimensional density estimator that is proposed in this work is a key contribution to the field of nonparametric statistics, owing to its automated operation, computational simplicity and theoretical properties. It represents the culmination of a recent body of work on log-concave probability densities and will certainly stimulate further research into the properties and applications of shape-constrained estimators.
The authors mention classification and clustering as two possible application areas of their method. Indeed, nonparametric class descriptions (or cluster descriptions) have been an increasingly active area of research in the machine learning community (e.g. Fukunaga and Mantock (1983) and Roberts (1997)). A further challenge that such algorithms face when deployed in real-time environments is the need to process data on line without revisiting the data history. Unfortunately, the requirement of a constant time update clearly clashes with the infinite-dimensional nature of nonparametric estimators such as that proposed in the paper. It is consequently of great practical interest to investigate the extent to which an online approximation could be devised.
A working candidate may be constructed readily by performing a fixed number of iterations of Shor’s r-algorithm per time step, initialized at the previous time step’s pole height estimates. In Section 3.2, the number of iterations required for convergence in the offline case is reported to increase approximately linearly with n. This suggests that, on arrival of each novel data point, a constant number of iterations may indeed suffice for convergence, but early stopping may be employed if necessary. To handle the increasing sample size, we may fix the number of pole heights to a constant number w. In an online context, this means dynamically maintaining an active set of w data points, and replacing (at most) one data point per time step. The selection of which data point to replace could be arbitrary (as in a sliding window where, at time n, x_{n−w} is replaced by x_{n}), geometric (for example replace the data point whose removal has the smallest effect on the shape of the estimator) or information theoretic.
Similar work on sequential kernel density estimation has attracted great attention in the machine learning community (e.g. Han et al. (2007)). Notably, the lack of bandwidth parameters for the estimator of Cule, Samworth and Stewart represents a crucial comparative advantage, even more so in online than in offline contexts. There is consequently little doubt that a theoretical argument concerning the error in the approximation above as a function of w and the data selection mechanism would be of great interest. Finally, it should be noted that the extension to online estimation of mixtures of log-concave densities can be handled by using recent work on online expectation–maximization (Cappé and Moulines, 2009).
Dankmar Böhning (University of Reading) and Yong Wang (University of Auckland)
We congratulate the authors on their excellent contribution to multivariate nonparametric density estimation under a log-concavity restriction. This restriction appears to be quite realistic for many practical problems and we expect to see many successful applications of this new methodology.
We acknowledge the authors’ detailed use of convex geometry that led to the existence and uniqueness of the log-concave maximum likelihood estimator (LCMLE). Evidently, the algorithmic approach still lacks computational power, as their Table 1 indicates, and there is likely room for improvement; see Wang (2007) for a fast algorithm for nonparametric mixture estimation, which is a somewhat related problem. Also, the authors point out that the kernel density estimator is a natural competitor but has difficulty with bandwidth selection, especially in the multivariate case.
We are also intrigued by the clustering example that the authors discuss in Section 6. For a long time, the mixture community has been looking for a nonparametric replacement for the parametric mixture component distribution. This paper gives a very interesting new solution to this problem. It appears to us, however, that the misclassification rate might be competitive only in comparison with the Gaussian mixture, if cross-validation assessment is used. We also wonder whether the new method can be extended to nonparametric clustering.
Finally, we have explored the predictive performance of the LCMLE in the setting of supervised learning and compared it with that of three others: a Gaussian density estimator, logistic regression and a kernel estimator. The same Wisconsin breast cancer data set as shown in Fig. 6(a) was used, but for classification purposes here. Except for logistic regression, which is fitted to all observations, observations in each class are modelled by each specific distribution estimator. A new observation is then classified according to its posterior probability. For bandwidth selection of the kernel estimator, we simply use Silverman’s rule of thumb h_j = s_j {4/((d+2)n)}^{1/(d+4)}, where s_{j} is the sample standard deviation along the jth coordinate (Silverman (1986), page 87).
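Silverman's rule of thumb is easy to state in code; the sketch below assumes the coordinatewise form h_j = s_j {4/((d+2)n)}^{1/(d+4)} for a Gaussian product kernel, and the function name and example data are illustrative.

```python
import numpy as np


def silverman_bandwidths(X):
    """Rule-of-thumb bandwidths for a d-variate Gaussian product kernel:
    h_j = s_j * (4 / ((d + 2) * n)) ** (1 / (d + 4)),
    where s_j is the sample standard deviation of the jth coordinate
    (Silverman, 1986, page 87)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    s = X.std(axis=0, ddof=1)  # coordinatewise sample standard deviations
    return s * (4.0 / ((d + 2.0) * n)) ** (1.0 / (d + 4.0))


# Example: 500 bivariate observations with very different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) * np.array([1.0, 3.0])
h = silverman_bandwidths(X)
print(h)  # larger bandwidth along the more spread-out coordinate
```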
Table 7 gives both the resubstitution and the ten-fold cross-validation (averaged over 20 replications) classification errors. Despite its low resubstitution error, the LCMLE performs only comparably with the less appealing kernel estimator and the apparently biased Gaussian estimator in terms of cross-validation error. With only a small increase over its resubstitution error, the parametric logistic regression gives a remarkably smaller prediction error than the other three. Note that the LCMLE is far more expensive to compute than the others.
Table 7. Numbers of misclassified observations, with standard errors in parentheses

                    Parametric results          Nonparametric results
Method              Gaussian      Logistic      Kernel        Log-concave
Resubstitution      35            25            32            25
Cross-validation    37.1 (0.35)   27.5 (0.30)   37.5 (0.21)   37.1 (0.32)
It is likely that the merely fair performance of the LCMLE in this example is due to its nonparametric nature and to the truncation of the density to zero outside the convex hull of the training data. Both increase the estimation variance, which may significantly outweigh the reduction in bias.
José E. Chacón (Universidad de Extremadura, Badajoz)
First of all, I congratulate the authors on their thorough and interesting paper. Sometimes, the extension of a univariate technique to its multivariate analogue is taken as a trivial or incremental step, but most of the time this is not so, and this paper provides a nice example of the latter type of advance.
Even though the present study is quite exhaustive, I would just like to add further avenues for future research to those which the authors already propose.
Although most of the literature on maximum likelihood (ML) estimation of the density is devoted to the univariate case, there is another recent reference which provides a multivariate method: Carando et al. (2009). There, the class of densities is constrained to be Lipschitz continuous, so the problem is of a different nature, but both resulting ML estimators present some similarities in shape. In the univariate case the two estimators are piecewise linear, although the log-concave estimator allows for different slopes; in contrast, the Lipschitz estimate is not necessarily unimodal. In any case, the connections between the two methods surely deserve to be explored.
Probably an undesirable feature of the ML estimate is that it is not smooth (i.e. differentiable). Dümbgen and Rufibach (2009) amended this via convolution, but perhaps it would be more natural in this setting to investigate the ML estimator imposing some smoothness condition on the class of log-concave densities. In the univariate case, for instance, we could think of a smoothness constraint leading to a piecewise quadratic or cubic (instead of linear) ML estimate.
Another possible research direction points to a comparison with kernel methods. I agree with the authors that general bandwidth matrix selection is a difficult task, yet the plug-in method that was recently introduced in Chacón and Duong (2010) looks promising from a practical point of view, being the multivariate analogue of the method of Sheather and Jones (1991). On the theoretical side, it would be interesting to obtain the mean integrated squared error rates (and the asymptotic distribution) for the multivariate log-concave ML estimator, since it seems from the simulations that they might be faster than for the kernel estimator. Nevertheless, in the supersmooth case of, say, the standard d-variate normal density, it looks like this rate should be slower than n^{−1}{ log (n)}^{d/2}, which can be deduced to be the rate for a superkernel estimator, reasoning as in Chacón et al. (2007).
Yining Chen (University of Cambridge)
I congratulate the authors for developing an innovative and attractive method for nonparametric density estimation. The power of this method was well illustrated in their simulation examples by comparing its mean integrated squared error (MISE) with those of kernel-based approaches. To improve its performance at small sample sizes, the authors proposed a smoothed (yet still fully automatic) version of their estimator via convolution in Section 9. Below, we give some justification for this new estimator and argue that it has some favourable properties.
We consider the same simulation examples as in Section 5, for d=2 and d=3, and for small to moderate sample sizes n=100, 200, 500. Results are given in Figs 14 and 15. We see that for cases (a), (b), (d) and (e), where the true density is log-concave and has full support, the smoothed log-concave maximum likelihood estimator has a much smaller MISE than the original estimator. The improvement is most significant (around 60%) for d=3 with small sample sizes, i.e. n=100 and n=200, but is still around 20% even when d=2 and n=500. Interestingly, this new estimator outperforms most kernel-based estimators (including those based on MISE-optimal bandwidths, which would be unknown in practice) even at small sample sizes, where the original estimator performs relatively poorly. As shown in case (f), even if the log-concavity assumption is violated, the smoothing process still offers some mild reduction in MISE for small sample sizes.
However, as demonstrated in case (c), this modification can sometimes lead to an increased MISE at large sample sizes. This is mainly due to the boundary effect. Recall that in case (c) the underlying gamma distribution does not have full support. Convolution with the multivariate normal distribution shifts some mass of the estimated density outside the support of the true distribution and thus results in a higher MISE. It is a nice feature of the original estimator that it handles cases of restricted support effectively and automatically.
Finally, we note that the smoothed log-concave maximum likelihood estimator also offers a natural way of estimating the derivative of a density.
Frank Critchley (The Open University, Milton Keynes)
It is a great pleasure to congratulate the authors on a splendid paper: I only regret that I could not be there to say this in person!
Like all good papers read to the Society, its depth and originality raise many interesting further questions. The authors themselves allude to a variety of these, implicitly if not explicitly, and I hope that they will forgive any overlap with the following.
 (a)
What can be said about which (mixtures of) shapes admit maximum likelihood estimators?
 (b)
With log-concavity as target, what scope is there for transformation–retransformation methods?
 (c)
Notwithstanding the overall thrust of the paper, are there contexts in which there is some advantage to smoothing the maximum likelihood estimator that is produced?
 (d)
Are there potential links with dimension reduction methods in regression?
Jörn Dannemann and Axel Munk (University of Göttingen)
We congratulate the authors for their very interesting and stimulating paper, which demonstrates that multivariate estimation with a log-concave shape constraint is computationally feasible. Conceptually, this approach seems very appealing, since it is much more flexible than parametric models, but sufficiently restrictive to preserve relevant data structures. Further, we believe that the extension to finite mixtures of log-concave densities for clustering, as addressed in Section 6, is of particular practical importance.
As for classical mixture models, identifiability is essential for model analysis and interpretation, and, as almost nothing is known for log-concave mixture models, we would like to comment on this issue here. First, note that classical parametric mixtures, namely mixtures of multivariate Gaussian (log-concave) or t-distributions (not log-concave), are identifiable (Yakowitz and Spragins, 1968; Holzmann et al., 2006; Dümbgen et al., 2008).
We would like to draw the authors’ attention to mixture models where one component is modelled as a log-concave density, whereas the others belong to some parametric family, e.g. a Gaussian or t-distribution. For example, consider a two-component mixture with a log-concave component f_{LC} and a Gaussian component f_{Gauss}. This model is identifiable if there is an interval I for which I∩supp(f_{LC})=∅ is known a priori. The EM algorithm that was suggested by the authors can easily be adapted to this semiparametric model. Applying it to the Wisconsin breast cancer data in such a way that the component that is associated with the malignant cases is modelled as a Gaussian (or multivariate t-distribution) shows that it is intermediate between the purely Gaussian and the purely log-concave EM algorithm, with 55 misclassified instances (51 for t-distributions with ν=3 degrees of freedom). The estimated mixture densities are presented in Fig. 17.
David Draper (University of California, Santa Cruz)
The potential usefulness of this interesting paper is indicated by, among other things, the existence of the rather infelicitously named LogConcDEAD package in R that the authors have already made available for implementing their point estimate of an underlying data-generating density f. I would like to suggest a potentially fruitful area of future work by adding to the paper’s reference list a few pointers into the Bayesian nonparametric density estimation literature; this may be seen as a possible small-sample competitor to a bootstrapped version of the authors’ point estimate in creating well-calibrated uncertainty bands for density estimates and functionals based on them. This parallel literature dates back at least to the early 1960s (Freedman, 1963, 1965; Ferguson, 1973, 1974) and has burgeoned since the advent of Markov chain Monte Carlo methods (Escobar and West, 1995): main lines include Dirichlet process mixture modelling (e.g. Hjort et al. (2010)) and (mixtures of) Pólya trees (e.g. Hanson and Johnson (2001)). Advantages of the Bayesian nonparametric approach to density estimation include
 (a)
the automatic creation of a full posterior distribution on the space of all cumulative distribution functions, with built-in uncertainty bands arising directly from the Markov chain Monte Carlo sampling, and
 (b)
a guarantee of asymptotic consistency of the posterior distribution in estimating f (when the prior distribution is chosen sensibly: see, for example, Walker (2004)), whether f is log-concave or not.
From the authors’ viewpoint, with their emphasis on the lack of smoothing parameters in their point estimate, disadvantages of the Bayesian approach may include the need to specify hyperparameters in the construction of the prior distribution, which act like user-specified tuning constants. The small-sample performance of these two rather different approaches, both in terms of calibration (e.g. nominal 95% intervals include the data-generating truth x% of the time; x = ?) and in terms of useful information obtained per central processor unit second, would seem to be an open problem that is worth exploring.
Martin L. Hazelton (Massey University, Palmerston North)
Nonparametric density estimation in high dimensions is a difficult business. It is therefore natural to look at restricted versions of the problem, e.g. by placing shape constraints on the target density f. The authors are to be congratulated on their progress in the case where f is assumed to be log-concave. I offer two (loosely connected) comments on this work: the first with regard to practical performance for bivariate data, and the second to suggest an alternative test for log-concavity.
I would expect the log-concave maximum likelihood estimator to improve markedly on kernel methods when the data are highly multivariate. However, the situation is less clear for bivariate data, where the curse of dimensionality has not really begun to bite. In that important case, kernel estimation using plug-in bandwidth selection is generally very competitive against the log-concave maximum likelihood estimator for n ≤ 500, and only slightly worse when n=2000. Arguably the extra smoothness properties of the kernel estimate are a fair swap for the small loss in performance with respect to mean integrated squared error. The only bivariate setting in which the log-concave maximum likelihood estimator appears much better is for test density (c). However, this is almost certainly a result of boundary bias in the kernel estimator, for which corrections are available (e.g. Hazelton and Marshall (2009)).
Of course, if we are convinced that f is log-concave then kernel estimation with a standard bandwidth selector may be unattractive because it is not guaranteed to produce a density of that form. However, if the kernel is log-concave then so also will be the density estimate for a sufficiently large bandwidth h, although this might result in significant oversmoothing from most standpoints. This observation motivates a test for log-concavity of f.
Suppose that we construct a kernel estimate by using an isotropic Gaussian kernel. Then there will be a (scalar) bandwidth h_{0}>0 such that the estimate will be log-concave if and only if h ≥ h_{0} (because log-concavity is preserved under convolution). This bandwidth is a plausible test statistic for log-concavity, since the larger its value the more we have had to (over)smooth the data to enforce log-concavity. This idea mirrors the bump-hunting test that was developed by Silverman (1981). Following Silverman’s approach, bootstrapping could be employed to assess significance, although for practical application it would be necessary to refine the basic methodology to mitigate the effects of tail wiggles that are generated by isolated data points.
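A rough numerical version of the critical bandwidth h_0 is easy to sketch in one dimension. The code below is our own construction, not an implementation from the discussion: it declares a Gaussian kernel estimate log-concave when the second differences of its log-density are non-positive on a grid, and bisects on h; the grid, tolerance and search interval are all illustrative assumptions.

```python
import numpy as np


def kde_log_density(grid, data, h):
    # Log of a one-dimensional Gaussian kernel density estimate on a grid
    u = (grid[:, None] - data[None, :]) / h
    dens = np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return np.log(np.clip(dens, 1e-300, None))


def is_logconcave(data, h, grid):
    # Log-concave (numerically) if second differences are non-positive
    return np.all(np.diff(kde_log_density(grid, data, h), 2) <= 1e-8)


def critical_bandwidth(data, lo=0.01, hi=10.0, iters=30):
    """Bisect for the smallest h making the estimate log-concave. This is
    monotone in h: a Gaussian KDE with a larger bandwidth equals the
    log-concave one convolved with a further Gaussian, which preserves
    log-concavity."""
    grid = np.linspace(data.min(), data.max(), 600)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_logconcave(data, mid, grid):
            hi = mid
        else:
            lo = mid
    return hi


rng = np.random.default_rng(2)
unimodal = rng.normal(size=200)
bimodal = np.concatenate([rng.normal(-4, 1, 100), rng.normal(4, 1, 100)])
h0_uni = critical_bandwidth(unimodal)
h0_bi = critical_bandwidth(bimodal)
# Well-separated bimodal data need far more smoothing to look log-concave
print(h0_uni, h0_bi)
```

In a bootstrap test along Silverman's lines, h_0 computed from the data would then be compared with its distribution under resampling from the h_0-smoothed estimate.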
Woncheol Jang (University of Georgia, Athens) and Johan Lim (Seoul National University)
We congratulate the authors for an interesting and stimulating paper. The methodology in the paper is well supported in theory and is nicely applied to classification and clustering. Here we consider the application of the proposed method to bagging, which is popular in the machine learning literature.
The main idea of bagging (Breiman, 1996) is to use a committee network approach. Instead of using a single predictor, bootstrap samples are generated from the original data and the bagged predictions are calculated as averages of the models fitted to the bootstrap samples.
Clyde and Lee (2001) proposed a Bayesian version of bagging based on the Bayesian bootstrap (Rubin, 1981) and proved a variance reduction under Bayesian bagging. A key idea of Bayesian bagging is to use smoothed weights for the bootstrap samples whereas the weights in the original bagging can be considered as being generated from a discrete multinomial(n;1/n,…,1/n) distribution.
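The contrast between the two weighting schemes is easy to see in simulation; in this minimal sketch a sample mean stands in for the predictor being bagged, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)

# Ordinary bagging: weights are multinomial(n; 1/n, ..., 1/n) counts over n
counts = rng.multinomial(n, np.ones(n) / n)
w_bagging = counts / n

# Bayesian bagging: smooth Dirichlet(1, ..., 1) weights from the
# Bayesian bootstrap (Rubin, 1981)
w_bayes = rng.dirichlet(np.ones(n))

# Both weight vectors sum to 1, but the Bayesian weights are never
# exactly 0, so every observation retains some influence
bagged_mean = w_bagging @ x
bayes_mean = w_bayes @ x
print(bagged_mean, bayes_mean)
```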
Other related ideas are output smearing of Breiman (2000) and input smearing of Frank and Pfahringer (2006). They suggested adding Gaussian noise to the output and input respectively and applied the bagging to these noiseadded data sets. Both smearing methods were shown empirically to work very well in their simulation studies. However, the optimal magnitude (the variance) of the noise to be added is not well understood.
The idea behind smearing methods is indeed equivalent to generating resamples with the smoothed bootstrap and the issue of the choice of magnitude of the noise is the same as that of bandwidth selection of the multivariate kernel density estimator that is used in the smoothed bootstrap procedure. In Bayesian bagging, there is a similar issue with the choice of the hyperparameter of the Dirichlet prior that is used in the Bayesian bootstrap.
An advantage of the proposed method over the aforementioned methods is that it needs no tuning. The authors also propose a procedure to sample from the estimated log-concave density in Appendix B.3. Thus, bagging based on resamples from the estimated log-concave density would be a good alternative to the Bayesian bagging or smearing methods.
Hanna K. Jankowski (York University, Toronto)
I congratulate the authors on an important and thought-provoking paper. This work will certainly be a catalyst for further research in the area of shape-constrained estimation, and the authors themselves suggest several open problems towards the end of the paper. I shall restrict my discussion to adding another question to this list.
One of the identifying features of nonparametric shape-constrained estimators is their rates of convergence, which are slower than the typical n^{1/2} rate that is achieved by parametric estimators. In one dimension, the Grenander estimator of a decreasing density converges at a local rate of n^{1/3}, whereas the estimator of a convex decreasing density converges locally at rate n^{2/5} (Prakasa Rao, 1970; Groeneboom, 1989; Groeneboom et al., 2001). A similar rate holds for the one-dimensional nonparametric maximum likelihood estimator of a log-concave density, which was recently proved to converge locally at rate n^{2/5} as long as the density is strictly log-concave (Balabdaoui et al., 2009). A heuristic justification of how different local rates arise has been given by Kim and Pollard (1990). The global convergence rates, in contrast, can be quite different. For the Grenander estimator, the convergence rate for functionals is known to be
where Z is a standard normal random variable (Groeneboom, 1985; Groeneboom et al., 1999; Kulikov and Lopuhaä, 2005). Here, f_{0} denotes the true underlying monotone density. Thus, smooth functionals with μ(f_{0})=0 (such as plugin estimators of the moments) converge at rate n^{1/2} and recover the faster rate characteristic of parametric estimators.
Global and local convergence rates for the log-concave nonparametric maximum likelihood estimator are sure to be of much interest in the near future. Indeed, it is already conjectured in Seregin and Wellner (2009) that the local convergence rate for the estimator that is introduced here is n^{2/(4+d)} when d=2, 3. In Section 7, the authors consider plug-in estimators of the moments and the differential entropy. What would the convergence rate be for these functionals? Preliminary simulations for d=1 indicate that the n^{1/2} rate may continue to hold for the log-concave maximum likelihood estimator (Fig. 18). Further investigation is needed in higher dimensions. A rate of n^{1/2} would, naturally, be very attractive in the application of these methods.
Theodore Kypraios and Simon P. Preston (University of Nottingham) and Simon R. White (Medical Research Council Biostatistics Unit, Cambridge, and University of Nottingham)
We congratulate the authors for this interesting paper. In this discussion, we would like to hear the authors’ views on the applicability of their approach in the following context.
Suppose that we are interested in a distribution whose probability density function, say p(x), is proportional to a product of other probability density functions f_{i}(x), i=1,…,k, i.e.

p(x) = c ∏_{i=1}^{k} f_{i}(x),   (5)

with c being a normalizing constant. Suppose that none of the f_{i}(x) is known explicitly but that we can draw independent and identically distributed samples from each. How should we best calculate functionals of p(x), or draw samples from it?
White et al. (2010) consider an exact method of sampling-based Bayesian inference in the context of stochastic population models. This gives rise to a posterior distribution of the parameters of the form (5). Their approach is to use a kernel density estimator for each f_{i}(x), and then to estimate p(x) as the normalized pointwise product of kernel density estimators. But, if the f_{i}(x) are log-concave, would the methodology that is presented in this paper provide a better alternative? If so, then a clear advantage would be that we could draw samples from p(x) by making use of the rejection sampling method in Appendix B.3 of this paper. Can the authors comment on the applicability of their method to product densities such as density (5), in particular on issues that arise as k increases?
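As a toy illustration of rejection sampling from a product density of the form (5), take k=2 with known Gaussian factors standing in for fitted log-concave estimates: propose from f_1 and accept with probability f_2(x)/sup f_2. This is our own sketch, not the scheme of White et al. (2010) or of Appendix B.3; note that a naive acceptance probability of this form degrades as k grows.

```python
import math

import numpy as np

rng = np.random.default_rng(4)

# Two log-concave factors: f1 = N(0,1), f2 = N(2,1).
# Their normalized product p(x) = c f1(x) f2(x) is the N(1, 1/2) density.


def f2(x):
    return math.exp(-0.5 * (x - 2.0) ** 2) / math.sqrt(2.0 * math.pi)


f2_max = 1.0 / math.sqrt(2.0 * math.pi)  # sup of f2, attained at x = 2

samples = []
for _ in range(40000):
    x = rng.normal(0.0, 1.0)             # propose from f1
    if rng.uniform() < f2(x) / f2_max:   # accept w.p. f2(x)/sup f2
        samples.append(x)

samples = np.array(samples)
# Accepted draws follow p: mean should be near 1 and variance near 0.5
print(samples.mean(), samples.var(), len(samples))
```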
Chenlei Leng (National University of Singapore) and Yongho Jeon (Yonsei University, Seoul)
Multidimensional density estimation without any parametric distributional assumption is known to be difficult. We congratulate Cule, Samworth and Stewart for an impressive piece of work, in which they show that log-concavity is an attractive option compared with nonparametric smoothing. Here we focus on an alternative formulation, which may greatly facilitate numerical implementation. In the following discussion, we use the notation that is used in the paper.
Consider an alternative objective function to function (3.1),

(1/n) ∑_{i=1}^{n} exp{−g(X_{i})} + ∫ g(x) dx,   (6)

where g is a concave function. Jeon and Lin (2006) showed that its population minimizer is g = log (f_{0}). It is easy to see that the sample minimizer of this function is a least concave (tent) function. An application of theorem 2 in the paper leads to our alternative formulation
 (7)
Following Appendix B, it is easy to see that the subgradient corresponding to equation (B.1) can be written down explicitly.
Note that this formulation requires ∫_{T_{d}} w_{1} dw = 1/(d+1)! to be computed only once, and exactly, whereas the authors require the quantities defined in Appendix B.2 to be computed iteratively, using a Taylor series expansion to approximate the integral where necessary to avoid singularity problems. The new formulation can be straightforwardly extended to other types of constrained density estimation, by replacing the function exp{−g(x)} in expression (6) with some appropriate function ψ{g(x)}, which can be formulated to correspond to the quasi-concave functions in Koenker and Mizera (2010) or the convex-transformed densities in Seregin and Wellner (2010). The computation of our estimator remains effectively the same with respect to the integral.
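For what it is worth, the population-minimizer claim can be checked pointwise if, as the later reference to exp{−g(x)} suggests, expression (6) is the empirical average of exp{−g(X_i)} plus the integral of g; this assumed form is ours, reconstructed from the surrounding text.

```latex
% Population version of the assumed criterion (6)
L(g) \;=\; \operatorname{E}\exp\{-g(X)\} + \int g(x)\,\mathrm{d}x
     \;=\; \int \bigl[\, f_0(x)\,e^{-g(x)} + g(x) \,\bigr]\,\mathrm{d}x .
% Minimizing the integrand pointwise in t = g(x):
\frac{\partial}{\partial t}\bigl[ f_0(x)\,e^{-t} + t \bigr]
     = -f_0(x)\,e^{-t} + 1 = 0
\;\Longrightarrow\; e^{t} = f_0(x)
\;\Longrightarrow\; g(x) = \log f_0(x),
% and the second derivative f_0(x)\,e^{-t} > 0 confirms a minimum.
```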
Another interesting problem is to introduce structure into the density. For example, we may decompose the log-density as an analysis-of-variance model by writing

log f(x) = h_{0} + Σ_{j=1}^{d} h_{j}(x_{j}) + Σ_{j<k} h_{jk}(x_{j}, x_{k}),

where the h_{j}s are the main effects and the h_{jk}s are the two-way interactions. Higher order interactions may be considered as well. Some side conditions are assumed to ensure the identifiability of this decomposition. It is known that h_{jk} = 0 corresponds to conditional independence of the jth variable and the kth variable. This conditional independence structure corresponds to a graphical model, as discussed by Jeon and Lin (2006). Using our formulation, we may decompose y_{i} as

y_{i} = y_{i,0} + Σ_{j=1}^{d} y_{i,j} + Σ_{j<k} y_{i,jk},
and minimize expression (7) with this decomposition. For graphical model building, we may apply the group lasso penalty (Yuan and Lin, 2006), which can estimate y_{i,jk}, i = 1,…,n, as exactly 0.
In ongoing work, we are investigating this new density estimator and will report the result elsewhere.
Dominic Schuhmacher (University of Bern)
It was a pleasure to read this interesting and elegant paper, which covers so much ground on multivariate log-concave density estimation. I would like to comment on two central points.
First, the log-concave maximum likelihood estimator that is studied by the authors may be written as the unique maximizer of ∫ log f dP_{n} over all log-concave densities f, where P_{n} = n^{−1} Σ_{i=1}^{n} δ_{X_{i}} denotes the empirical distribution of X_{1},…,X_{n}. We know now from joint work with one of the authors that this is a special case of a more universal approximation scheme, in which P_{n} is replaced by a general probability measure P on ℝ^{d}. It is shown in Dümbgen et al. (2010) that, for a probability measure P that has a first moment and is not concentrated on any hyperplane of ℝ^{d}, a unique maximizer of ∫ log f dP over log-concave densities f exists and depends continuously on P in Mallows distance. If P has a log-concave density f, then the estimator converges to f almost surely; if f is a general density, the maximizer minimizes the Kullback–Leibler divergence d_{KL}(·, f).
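The Mallows continuity statement can be illustrated numerically. In one dimension the Mallows (Wasserstein-1) distance between two equal-size empirical distributions is simply the mean absolute difference of the order statistics; the sketch below (our own illustration, with standard Gaussian samples) shows empirical measures of growing samples drawing together in this distance, which is the mechanism behind consistency statements of this type.

```python
import numpy as np

def mallows_1d(x, y):
    """Mallows (Wasserstein-1) distance between two equal-size empirical
    distributions on the real line: mean absolute difference of the
    sorted samples."""
    x, y = np.sort(x), np.sort(y)
    assert len(x) == len(y)
    return np.mean(np.abs(x - y))

rng = np.random.default_rng(1)
# Average distance between pairs of independent N(0,1) samples of a
# given size: it shrinks as the sample size grows.
d_small = np.mean([mallows_1d(rng.standard_normal(50),
                              rng.standard_normal(50)) for _ in range(200)])
d_large = np.mean([mallows_1d(rng.standard_normal(5000),
                              rng.standard_normal(5000)) for _ in range(20)])
print(d_small, d_large)
```

The larger samples are much closer in Mallows distance, consistent with the empirical measure converging to P and hence, by continuity, the fitted log-concave density converging to the population maximizer.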
My second point concerns computation. I congratulate the authors on their algorithm and its implementation in the package LogConcDEAD, which in view of the adversity of the problem in the multivariate case is surprisingly fast and reliable. However, the computation times in Table 1 mean that the algorithm cannot realistically be applied in higher dimensions, with large samples, or many times sequentially. The second of these limitations is also relevant for an approximate computation of the maximizer when P is non-discrete; the third, in particular, for the multivariate version of the regression setting that was outlined above.
Guenther Walther (Stanford University)
I started looking at log-concave distributions when I was searching for an appropriate model for subpopulations of multivariate flow cytometry data about 10 years ago. The use of log-concave distributions is appealing for this purpose since their unimodal character is commonly associated with a single component population. In addition, log-concave distributions have a certain nonparametric flexibility that is helpful in many problems, but they can still be estimated without having to deal with a tuning parameter. When I worked out how to compute the maximum likelihood estimator (MLE) in the univariate case, I realized that the multivariate case would be much more daunting, requiring a more involved optimization algorithm and a considerable computational overhead for the construction of multivariate tessellations. I considered the task to be too challenging and decided not to pursue it further beyond the univariate work that I had done at that time.
Cule, Samworth and Stewart have shown in their paper how to compute the multivariate MLE by using Shor's r-algorithm, and they provide an accompanying software package that implements their algorithm. I congratulate them on this work and I believe that the paper will inspire much further research into the multivariate case. In particular, they show how, by modifying the objective function for the MLE, the problem becomes amenable to known, albeit slow, convex optimization algorithms. It is desirable to improve on the computation times that are given in Table 1, especially for the higher dimensional cases. I expect that the groundwork that the paper lays in terms of the optimization problem will inspire new research into faster algorithms. Another intriguing result is the outstanding performance of the MLE vis-à-vis other nonparametric methods as reported in their simulation study. These results provide a strong motivation to establish theoretical results about the finite sample and asymptotic performance of the MLE.
The authors replied later, in writing, as follows.
We are very grateful to all the discussants for their many helpful comments, insights and suggestions, which will no doubt inspire plenty of future work. Unfortunately we cannot respond to all of the issues raised in this brief rejoinder, but we offer the following thoughts related to some of these contributions.
Other shape constraints and methods
Several discussants (Delaigle, Hall, Wellner, Seregin, Chacón and Critchley) ask about other possible shape constraints. Indeed, Seregin and Wellner (2010) have recently shown that a maximum likelihood estimator exists within the class of d-variate densities of the form f = h○g, where h is a known monotone function and g is an unknown convex function. Certain conditions are required on h, but taking h(y) = exp(−y) recovers log-concavity, whereas another choice of h (with −1/d < r < 0) yields the larger class of r-concave densities. Questions of uniqueness and computation of the estimate for these larger classes are still open. Of course, such larger classes must still rule out the spiking problem that was mentioned on the second page of the paper. Koenker and Mizera (2010) study maximum entropy estimators within these larger classes, whereas Leng and Jeon propose in their discussion an alternative M-estimation method which again has wide applicability.
As pointed out in Chacón's discussion, Carando et al. (2009) have considered maximum likelihood estimation of a multidimensional Lipschitz continuous density. The Lipschitz constant κ must be specified in advance and the estimator will be as rough as is allowed by the class, but consistency, e.g. in L_{1}-distance, is achievable provided that κ is chosen sufficiently large (we are not required to let κ → ∞). Given the size of the class, slower rates of convergence are to be expected.
Shape-constrained kernel methods, as studied in Braun and Hall (2001) and mentioned by Delaigle, Cheng and Hall, offer a further alternative. The idea here is to choose a distance (or divergence) between an original data point and a perturbed version of it. Starting with a standard kernel estimate, we then minimize the sum of these distances subject to the shape constraint being satisfied by the kernel estimate applied to the perturbed data set. Attractive features are smoothness of the resulting estimates and the generality of the method for incorporating different shape constraints; difficulties include the need to choose a distance as well as a bandwidth matrix and the challenges that are involved in solving the optimization problem, particularly in multidimensional cases. Similarly, the related biased bootstrap method of Hall and Presnell (1999) warrants further study in multidimensional density estimation contexts.
Wellner mentions the interesting class of hyperbolically k-monotone (completely monotone) densities on (0,∞). To answer one of his questions, it seems that the natural generalization to higher dimensions is to say that a density f on (0,∞)^{d} is hyperbolically k-monotone (completely monotone) if, for all u ∈ (0,∞)^{d}, the function f(uv) f(u/v) is k-monotone (completely monotone) in w = v + v^{−1} ∈ [2,∞). We would then be interested, for instance, in the class of densities of random vectors X = (X_{1},…,X_{d})^{T} such that the density of exp(X) = (exp(X_{1}),…, exp(X_{d}))^{T} is hyperbolically completely monotone. It can be shown that this class does indeed contain the Gaussian densities on ℝ^{d} and, given the attractive closure and other properties, maximum likelihood estimation within the class would seem to be an exciting avenue for future research.
We wholeheartedly agree with the many discussants (Rufibach, Zhang and Li, Cheng, Hall, Seregin, Chacón and Jankowski) who identify the problem of establishing the rates of convergence of the log-concave maximum likelihood estimator (and the corresponding functional estimates) when d > 1 as a key future challenge. The well-known conjectured rates (e.g. Seregin and Wellner (2010)) suggest a suboptimal rate when d ≥ 4. Although this certainly motivates the search for modified rate-optimal estimates involving penalization, or working with smaller classes of densities, as mentioned by both Rufibach and Seregin, it is also important not to lose sight of the computational demands of these higher dimensional problems. With this in mind, dimension reduction techniques, as mentioned by both Cheng and Critchley, are especially valuable, as are methods which introduce further structure into the density, such as the analysis-of-variance decomposition of the log-density that was mentioned by Leng and Jeon. The fact that log-concavity is preserved under marginalization and conditioning, as described in proposition 1 of the paper, suggests viable methods that certainly deserve further exploration.
Theory for the plug-in functional estimators that were introduced in Section 7 and discussed by Delaigle, Seregin and Jankowski is also of considerable interest, and the simulations by Jankowski suggesting an n^{−1/2} rate of convergence in one case are noteworthy in this respect. To answer a question that was raised by Delaigle, the plug-in estimator will be robust to misspecification of log-concavity in cases where the true density f_{0} is close to the Kullback–Leibler minimizing density f^{*} and/or where the functional θ(f) varies only slowly as f moves from f_{0} to f^{*}. In a different context, Lu and Young argue that simulating the distribution of a scaled version of the signed root likelihood ratio statistic under an incorrect fitted distribution is robust to model misspecification. The disturbing story that was recounted by Stone regarding the allocation of primary care trust funding by the Department of Health emphasizes the need for much greater understanding of the properties of statistical procedures under model misspecification.
Zhang and Li, Xia and Tong, and Yao ask about conditional density estimation. In low dimensional contexts, one could use the log-concave maximum likelihood estimate (or its smoothed version) of the joint density and then obtain a conditional density estimate by taking the relevant normalized 'slice' through the joint density estimate. Proposition 1 of course guarantees that this conditional density estimate is log-concave. In the specific time series settings that were mentioned by both Xia and Tong, and Yao, where the likelihood may be expressed as a product of conditional likelihoods, we can in fact extend our ideas to handle these cases. For instance, take the simple example of an autoregressive model of order 1, where X_{0} = 0 and

X_{t} = aX_{t−1} + ɛ_{t},   t = 1,…,n.

Assuming that the innovations ɛ_{1},…,ɛ_{n} are independent with common density f, the likelihood function in this semiparametric model is

L(a, f) = Π_{t=1}^{n} f(X_{t} − aX_{t−1}).
Dümbgen et al. (2010) discuss algorithms for maximizing similar functions to obtain the joint maximizer under the assumption that f is log-concave. These ideas can be extended to certain other types of dependence, which greatly increases the scope of our methodology. Heuristic arguments indicate that consistency results of the sort given for independent data in Dümbgen et al. (2010) should continue to hold for these sorts of dependent data, though these require formal verification.
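As a toy numerical illustration of this semiparametric idea, the Python sketch below profiles out the innovation density along a grid of autoregression coefficients. A Laplace maximum likelihood fit (itself log-concave) stands in for the full log-concave density estimate, and the names a0 and profile_loglik are our own, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate an AR(1) path X_t = a0 * X_{t-1} + eps_t with X_0 = 0.
a0, n = 0.6, 2000
eps = rng.laplace(size=n)            # a log-concave innovation density
x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = a0 * x[t - 1] + eps[t - 1]

def profile_loglik(a):
    """Plug-in profile log-likelihood of the AR coefficient: form the
    residuals and score them under a density fitted to them.  A Laplace
    fit stands in here for a log-concave maximum likelihood fit."""
    r = x[1:] - a * x[:-1]
    m = np.median(r)                  # Laplace location MLE
    b = np.mean(np.abs(r - m))        # Laplace scale MLE
    return np.sum(-np.abs(r - m) / b - np.log(2.0 * b))

grid = np.linspace(0.3, 0.9, 121)
a_hat = grid[np.argmax([profile_loglik(a) for a in grid])]
print(a_hat)   # should land close to the true coefficient 0.6
```

In the full procedure the inner fit would be the log-concave maximum likelihood estimator of the residual density, with the coefficient and the density maximized jointly.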
Both Xue and Titterington and Xia and Tong discuss the possibility of modifying the log-concave maximum likelihood estimate so that it is positive beyond the boundary of the convex hull of the data by extending the lowest exponential surfaces (and presumably renormalizing so that the density has unit integral). Unfortunately, in certain cases such an extension is not well defined: for instance, if d = 1 and the data are uniformly spaced, the log-concave maximum likelihood estimate is the uniform distribution between the minimum and the maximum data points; extending this density yields a function which cannot be renormalized. The smoothed log-concave estimator that is proposed in Section 9 offers an alternative method for obtaining an estimate with full support.
Gopal and Casella show that the Metropolis–Hastings method for sampling from the fitted log-concave maximum likelihood estimator results in a higher acceptance rate and smaller standard errors than the rejection sampling method that is proposed in Appendix B.3. The (weak) dependence that is introduced into successive sampled observations by this method is probably insignificant for most purposes, so we have incorporated the algorithm into the latest version of the R package LogConcDEAD (Cule et al., 2010).
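For readers who wish to experiment, a generic random-walk Metropolis sampler needs only log-density evaluations, which a fitted log-concave estimate provides. The sketch below is our own illustration (not Gopal and Casella's implementation), with a standard Laplace log-density standing in for the fitted estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(x):
    # A generic log-concave log-density, up to an additive constant:
    # here the standard Laplace, standing in for the fitted estimate.
    return -abs(x)

def rw_metropolis(n, step=1.0, x0=0.0):
    """Random-walk Metropolis sampler; only log-density evaluations
    are required, and normalizing constants cancel in the ratio."""
    xs = np.empty(n)
    x, lx = x0, log_target(x0)
    accepted = 0
    for i in range(n):
        y = x + step * rng.standard_normal()   # symmetric proposal
        ly = log_target(y)
        if np.log(rng.uniform()) < ly - lx:    # accept/reject step
            x, lx = y, ly
            accepted += 1
        xs[i] = x
    return xs, accepted / n

xs, rate = rw_metropolis(50000)
# The standard Laplace target has mean 0 and variance 2.
print(rate, xs.mean(), xs.var())
```

The weak serial dependence mentioned above is visible in the autocorrelation of such chains; thinning or longer runs reduce it if required.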
To answer a question of Xia and Tong, the triangulation of the convex hull of the data into simplices which underpins the maximum likelihood estimator is not unique; however, there is a unique set of maximal polytopes (whose vertices correspond to the set of 'critically supporting tent poles') on which the logarithm of the estimator is linear. Schuhmacher comments on identifying these maximal polytopes. Indeed, in one dimension, Dümbgen and Rufibach (2009) showed that, under sufficient smoothness and other conditions, the maximal distance between consecutive knots of the estimator converges to zero at a rate given by a power of ρ_{n}, where ρ_{n} = n^{−1} log(n). An analogous result in higher dimensions would certainly be of interest. It would remain a challenge to exploit this information to yield a faster algorithm but, along with Xue and Titterington, Böhning and Wang, Schuhmacher and Walther, we strongly encourage further developments in this area. Such developments may even facilitate online algorithms, which, as described by Anagnostopoulos, are of great interest particularly in the machine learning community.
Koenker and Mizera report impressive time savings for computing their maximum entropy estimator in a bivariate example. Their algorithm is based on interior point methods for convex programming, which enforce convexity on a finite grid through a discrete Hessian, and uses Riemann sum and linear interpolation approximations to estimate the integral in their analogue of equation (3.2) in our paper. It may be desirable, instead of only computing the estimator at grid points, to obtain the triangulation into simplices C_{n,j} and the quantities involved in the polyhedral characterization of the estimator (see Appendix B); in this case it seems that it should be possible to adapt Shor's r-algorithm to handle r-concave estimators, though some numerical approximation of the integral term may be necessary. It would be interesting to know whether Koenker and Mizera have had success with their method in more than two dimensions, and whether it is possible to control the error in their approximations in terms of the mesh size of the grid.
Several discussants (Delaigle, Chacón, Chen, Hazelton and Walther) discuss the simulation results. Of course the maximum likelihood estimator makes use of additional log-concavity information, but what makes the results interesting is the fact that maximum likelihood estimators are not designed specifically to perform well against integrated squared error (ISE) criteria. Moreover, the log-concave maximum likelihood estimator has other desirable properties, such as affine equivariance, which many other methods do not have.
It is gratifying to see from the additional simulations that are provided by Chen that the smoothed log-concave estimator in Section 9 does appear to offer quite substantial ISE improvements over its unsmoothed analogue for small or even moderate sample sizes. In Fig. 19 we give further detail on these results in the case of density (a), the standard Gaussian density, by providing boxplots of the ISE for various methods based on 50 replications. Apart from giving another demonstration of the performance of the smoothed log-concave estimator, two points are particularly worth noting: first, in most cases the variability of the ISE does not appear to be larger for the two log-concave methods compared with the kernel methods (this addresses a question that was raised by Chacón in a personal communication); secondly, using the optimal ISE bandwidth for the kernel method (which would again be unknown in practice) offers very little improvement over the optimal mean ISE bandwidth. This agrees with the findings for other distributions in a study by Chacón (personal communication) and addresses a point that was raised by Delaigle.
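For completeness, ISE comparisons of the kind reported here reduce to a simple numerical integral once the two densities can be evaluated. The sketch below (our own notation, not the simulation code behind Fig. 19) approximates ∫(f̂ − f)² on a fine grid for a deliberately mis-scaled Gaussian.

```python
import numpy as np

def ise(f_hat, f_true, grid):
    """Integrated squared error of a density estimate, approximated by
    a Riemann sum on a uniform grid covering the effective support."""
    dx = grid[1] - grid[0]
    return dx * np.sum((f_hat(grid) - f_true(grid))**2)

# Illustration: ISE of a slightly mis-scaled Gaussian against N(0,1).
phi = lambda x, s=1.0: np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))
grid = np.linspace(-8.0, 8.0, 4001)
print(ise(lambda x: phi(x, 1.1), phi, grid))  # small but positive
```

In a simulation study this computation is repeated over replications, and the boxplots summarize the resulting ISE samples per method.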
Both Zhang and Li, and Hazelton mention using boundary kernels (Wand and Jones (1995), pages 46–49) to improve the ISE performance of kernel methods in cases where the true density does not have full support. Indeed, as Fig. 20 indicates for the one-dimensional Γ(2,1) true density, some improvements are possible when the bandwidth for the linear boundary kernel is chosen to minimize the ISE (though the method also assumes knowledge of the support of the true density). As envisaged by Zhang and Li, however, even then we can do better with our proposed methods (except in the case of a small sample size, for the unsmoothed log-concave estimator).
Both Xue and Titterington, and Böhning and Wang discuss applications of the log-concave maximum likelihood estimator to classification and clustering. Chen (2010) has also observed competitive performance from the log-concave maximum likelihood estimator in classification problems. Using the smoothed log-concave estimator (Section 9) can further improve matters, and finesses the issue of how to classify observations outside the convex hulls of the training data in each class.
Dannemann and Munk make insightful remarks about the identifiability of mixtures of log-concave densities, and their Fig. 16 with two mixture components is particularly instructive. One sensible alternative, as Dannemann and Munk suggest, is to model one of the mixture components parametrically; another possibility in some circumstances might be to model the logarithm of each of the mixture components as a tent function (requiring no change to the algorithm).
Critchley asks a very pertinent question about the possibility of transforming to log-concavity. In this context, Wellner mentions that logarithmic transformations of random variables with hyperbolically monotone densities of order 1 have log-concave densities, but this is an area which deserves much greater exploration.
Draper provides several pointers to the parallel Bayesian nonparametric density estimation literature. As he points out, these methods offer small sample competitors to confidence intervals or bands for densities or functionals of densities that are constructed using the bootstrap or asymptotic theory.
Hazelton presents a nice extension of Silverman's bump hunting idea as an alternative test for log-concavity. It may be that taking bootstrap samples from the fitted smoothed log-concave estimator (which is very straightforward to do) when computing the critical value of the test is a sensible option here. More generally, as mentioned by Jang and Lim, taking bootstrap samples from the fitted smoothed log-concave estimator, or its unsmoothed analogue, can form the basis for many other smoothed bootstrap (bagging with smearing) procedures, which certainly deserve further investigation. Sampling from the smoothed version has a clear advantage in the product density scenario of Kypraios, Preston and White, since, when using the unsmoothed maximum likelihood estimator, the product density would be positive only on the intersection of the convex hulls of the samples. The strategy is viable in principle regardless of the number of terms in the product, though, as with all related methods, estimates in the tails (where the product density is very small) are likely to be highly variable when the number of terms in the product is large.
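The smoothed bootstrap idea has a particularly simple generic form: resample the data with replacement and perturb each draw with Gaussian noise. In the sketch below (our own illustration; resampling the raw data stands in for drawing from the unsmoothed log-concave fit, and the bandwidth plays the role of the smoothing covariance), the variance of the smoothed draws exceeds that of the data by the squared bandwidth.

```python
import numpy as np

rng = np.random.default_rng(4)

def smoothed_bootstrap(data, n_out, bandwidth):
    """Smoothed bootstrap draw ('bagging with smearing'): resample the
    data with replacement, then add Gaussian noise of the given scale.
    Drawing from a smoothed density fit works in the same spirit."""
    idx = rng.integers(0, len(data), size=n_out)
    return data[idx] + bandwidth * rng.standard_normal(n_out)

data = rng.standard_normal(500)
boot = smoothed_bootstrap(data, 100000, bandwidth=0.3)
# Variance of the smoothed draws is Var(data) + bandwidth**2.
print(boot.mean(), boot.var())
```

Because every smoothed draw has full support, a product of densities estimated from such samples avoids the empty-intersection problem described above.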
We thank Yining Chen for his help with the simulations that are reported in this rejoinder. Finally, we record our gratitude to the Research Section for their handling of the paper, and the Royal Statistical Society for organizing the Ordinary Meeting.