Analysis of interval‐grouped data in weed science: The binnednp Rcpp package

Abstract Weed scientists are usually interested in the study of the distribution and density functions of the random variable that relates weed emergence with environmental indices like the hydrothermal time (HTT). However, in many situations, experimental data are presented in a grouped way and, therefore, the standard nonparametric kernel estimators cannot be computed. Kernel estimators for the density and distribution functions for interval‐grouped data, as well as bootstrap confidence bands for these functions, have been proposed and implemented in the binnednp package. Analysis with different treatments can also be performed using a bootstrap approach and a Cramér‐von Mises type distance. Several bandwidth selection procedures were also implemented. This package also allows to estimate different emergence indices that measure the shape of the data distribution. The values of these indices are useful for the selection of the soil depth at which HTT should be measured which, in turn, would maximize the predictive power of the proposed methods. This paper presents the functions of the package and provides an example using an emergence data set of Avena sterilis (wild oat). The binnednp package provides investigators with a unique set of tools allowing the weed science research community to analyze interval‐grouped data.

Alternatively, the problem of studying the relation between HTT and weed emergence has been dealt with through nonparametric estimation of the distribution and density functions of cumulative HTT (CHTT) at emergence (Cao, Francisco-Fernández, Anand, Bastida, & González-Andújar, 2013;. These nonparametric methods have been recently proven to outperform the usual regression approaches in terms of prediction error (González-Andújar, Francisco-Fernández, et al., 2016).
In addition, when gathering experimental data, a different problem arises due to the fact that seedlings are generally buried at different depths and, therefore, the best depth at which HTT should be measured has to be selected. For this task, emergence indices have been defined and nonparametric estimators for them have been constructed (Cao, Francisco-Fernández, Anand, Bastida, & González-Andújar, 2011).
The techniques required for both, the nonparametric estimation of the density and distribution functions and the emergence indices, have been implemented in the binnednp Rcpp package (Barreiro et al., 2019).

| Density and distribution estimation
Let us suppose that modeling the emergence of a certain weed seedling, based on CHTT at emergence, is being investigated. Denote by n the number of seedlings that have emerged at the end of the monitoring process, and by X the random variable measuring the CHTT at emergence (with density function f and distribution function F). Since the inspections to count the number of emerged seedlings are performed at a limited number of instants, say k, the values X 1 , X 2 , …, X n , measuring the CHTT at emergence of every single seedling, cannot be observed.
However, what is observed is the total number of seedlings that have emerged in the intervals between consecutive inspection times, n 1 , n 2 , …, n k , or the corresponding sample proportions, w 1 , w 2 , …, w k , with w i = n i /n. In this sense, this type of data is called interval-grouped data.
In this interval-group framework, if the interest is to estimate the density function f, the standard kernel density estimator (Parzen, 1962;Rosenblatt, 1956), cannot be computed. An appropriate version of this estimator for interval-grouped data has been proposed in Cao et al. (2011), where t i , i = 1, …, k, are the central values between every pair of consecutive observed CHTT. In Equation (1), the function K(·) is the kernel function, and h is the bandwidth or smoothing parameter, controlling the amount of smoothing.
Similarly, to estimate the distribution function F, a kernel distribution estimator adapted for interval-grouped data, derived from (1), was proposed in Cao et al. (2013), It has been proven that the selection of the kernel function, K(·), is of secondary importance in terms of efficiency. However, the selection of the bandwidth, h, is crucial in the behavior of estimators (1) and (2).
As pointed out in the Introduction, nonparametric estimators (1) and (2) are novel approaches to model weed emergence, presenting some advantages over the parametric regression techniques traditionally used in this framework (Cao et al., 2013;González-Andújar, Francisco-Fernández, et al., 2016). These estimators, jointly with two types of bandwidth selectors, plug-in and bootstrap, have been implemented in the binnednp package. Moreover, a new and successful method to select the pilot bandwidth for the bootstrap bandwidths has been proposed and also implemented. Plug-in bandwidth selectors estimate the unknown terms in the expression of the asymptotically optimal bandwidth, whereas bootstrap bandwidths try to directly estimate the optimal bandwidth by mimicking the sampling process through resampling.
The binnednp package also allows to compute bootstrap confidence bands for the density and distribution functions that can be used to assess the uncertainty of the corresponding estimates.
Additionally, the binnednp package includes a function to evaluate the effect of a specific factor on weed emergence, for example, when considering different treatments. This procedure is based on a bootstrap approach properly designed to address the multiple testing problem (Westfall & Young, 1993). The idea of this approach is the following. (a) Split the data set into subsets according to the different levels of the factor under study. (b) Compute the nonparametric estimator of the emergence curve considering the pooled sample and, for each level, the nonparametric estimators of the emergence curves using the corresponding data subsets. (c) A reasonable statistic, D, to test the null hypothesis that the factor effect is not significant is defined based on a Cramér-von Mises distance between the nonparametric estimator with the pooled sample and the nonparametric estimators in the different groups. This distance is inspired in the generalization of the Cramér-von Mises statistic to the problem of comparing k independent samples, proposed by Kiefer (1959 A more detailed description of this algorithm is given in Appendix 3. Both bandwidth selection methods, plug-in and bootstrap, for the nonparametric estimators (1) and (2) implemented in the corresponding functions of the binnednp package are described in Appendix 1. Moreover, Appendix 2 contains the steps of the bootstrap algorithm used in the binnednp package to compute confidence bands for the distribution function. Finally, Appendix 3 describes the statistical procedure implemented in the binnednp package to test whether a factor can be statistically significant.

| Emergence indices estimation
In this context, another interesting problem is that of finding the best soil depth at which to measure the HTT. For this, momentbased indices and probability density-based indices were proposed in Cao et al. (2011), and estimates of them are also implemented in the binnednp package. Some of these index estimators are based on nonparametric methods and require the selection of a bandwidth. Different techniques to automatically obtain approximately optimal bandwidths have also been included in the package.
In order to maximize the predictive power of the weed emergence models considered, one should choose the depth such that the density function of X, measuring the CHTT at emergence, is as flatter as possible (or the distribution of X has as much spread as possible). Taking this into account, two indices based on the moments of X, the coefficient of variation and the kurtosis of X, have been have been also considered.
The indices J 1 and J 2 measure the curvature of the distribution and density functions of X, respectively. Therefore, small values for both J 1 and J 2 are desirable.

| PACK AG E ARCHITEC TURE
binnednp is an R package (R Development Core Team, 2019) designed for nonparametric estimation of both density and distribution functions of interval-grouped data. Although binnednp can be used for the analysis of any variable presented as grouped in intervals, the package and its structure were designed for its use by the weed science research community.
The package was developed using the Rcpp API (Eddelbuettel & François, 2011) This function computes the plug-in and bootstrap bandwidths for the density estimator (1). Regarding the plug-in bandwidth, with the parameter plugin.type, the iterative process to estimate the bandwidth can be chosen. As for the bootstrap bandwidth, the parameter pilot.type allows the user to select the method to automatically compute the pilot bandwidth needed for the calculation of the bootstrap bandwidth, whereas the parameter gboot allows to manually select that pilot bandwidth. In most situations, it is recommended to employ the default values of these parameters. Additionally, the estimation process can be further personalized using parameters like hn, that determines the number of iterations done during the optimi- This function computes estimates for grouped data of the moment-based and density-based emergence indices presented in Section 2.2. In the case of the density-based indices, with the parameter method, the method to select the bandwidth used for their estimation can be chosen: if method = "plugin", a plug-in approach is considered, whereas if method = "mix" or method = "np", a parametric or nonparametric bootstrap approach is considered, respectively. If confint = TRUE, bootstrap confidence intervals are constructed considering B resamples and using parallel computing, in the case that parallel = TRUE.

| Analysis with different treatments
anv.binned(n, y, trt.w, abs.values = FALSE, B = 500) This function allows to analyze whether a factor has a significant effect on the emergence curve. The idea behind this approach was briefly explained in Section 2.1, and it is described in detail in Appendix 3. It consists in using a bootstrap approach and a Cramérvon-Mises type distance.
In this case, n is a vector composed of the sizes of the complete samples corresponding to each treatment, and trt.w is a matrix each of whose columns contains the proportion of observations lying within each of the intervals for the corresponding treatment.
If instead of proportions the user wants to provide absolute values, the parameter abs.values must be set to TRUE. Furthermore, the user can choose the number of bootstrap resamples through the parameter B. The function anv.binned returns the p-value of the test.

| E X AMPLE
In this section, an unpublished data set of wild oat (Avena sterilis L.) emergence is considered to illustrate the use of the binnednp Rcpp package. These data were taken from an experiment performed dur- weed seedlings were recorded once or twice a week and then removed by cutting seedling stems at ground level with minimum disturbance of the substrate. All the data for the cumulative numbers of seedling emergence from the field were converted to a square meter basis. Additionally, following the same procedure as that described in Cao et al. (2011), the CHTT at emergence in the different inspection days, at three depths (10, 20, and 50 mm), was calculated.
The observed emergence data are shown in Table 1. As it can be seen, the cumulative hydrothermal time at emergence cannot be observed for every individual seed, but just in an aggregated way.
In the first part of the study, we use the function anv.binned to evaluate whether a factor (in this case, "the cylinder factor") has a significant effect on the emergence curve, for any of the three depths.
As pointed out in Section 2.1, the test implemented in the function anv.binned is based on a Cramér-von Mises distance between the nonparametric estimator with the pooled sample and the nonparametric estimators in the different levels of the factor. For example, in the case of depth 10 mm, the distance between all these curves could be visually observed comparing Figures 1 and 3. Figure 1 contains the nonparametric emergence curve estimates with bootstrap bandwidths (jointly with the corresponding 95% bootstrap confidence bands) for each one of the four cylinders. Figure 3 depicts the emergence curve estimates using the nonparametric approach with the pooled sample. However, it is clear that a reliable and formal solution for this problem requires the application of a statistical test, such as that implemented in the function anv.binned.
Note that, in this case, the identical experimentation conditions carried out in the four cylinders seem to support the idea that the null hypothesis could be true and, for this, only a very strong evidence against it will lead us to reject the null hypothesis of "nonsignificant cylinder effect." Denoting by y1, y2, and y3 the CHTT was calculated for 10, 20, and 50 mm, respectively, and after applying the function anv.
binned for the three depths, using the following code: # Observed values of CHTT for each depth y1 = c (100,160,218,265,352,405,459,509) y2 = c (92,146,199,217,261,340,421,505,571) y3 = c (67,105,143,155,185,199,204,232,287,343,421,538)  Given these weed emergence data, a first interesting issue is to find out what is the best depth, among the three possibilities available in this case, 10, 20, and 50 mm, to measure the CHTT in order to have more prediction power. Denoting, as before, by y1, y2, and y3 the CHTT calculated for 10, 20, and 50 mm, respectively, and by ni To maximize the predictive power a specific weed emergence model, the HTT should be measured at a depth producing a density function as flat as possible or, equivalently, a distribution function as dispersed as possible. Therefore, small values of J 1 and J 2 are preferable. On the other hand, CHTT samples with higher coefficient of variation (higher value of I 1 ) and a lower kurtosis (lower value of I 2 ) will improve weed emergence prediction. Consequently, 10 mm seems to be the best soil depth to predict weed emergence in terms of indices I 1 , I 2 , J 1 , and J 2 and, therefore, only observations at 10 mm are considered in what follows. Now, the function bw.dens.binned, described in Section 3.1, can be applied using the pooled sample and the CHTT at 10 mm to compute the plug-in and bootstrap bandwidths for the kernel density estimator (1).
# computing bandwidths for kernel density estimator dens <-bw.dens.binned (n, y, ni, plot = FALSE) #  Figure 2 shows, in the left panel, the kernel density estimates computed using (1) with the obtained plug-in (green lines) and bootstrap (red lines) bandwidths. Moreover, the parameter model was used in bw.dens.binned to fit parametric logistic and Weibull densities to the emergence data. The right panel of Figure 2 shows the corresponding density estimators (green and blue lines for the Weibull and the logistic densities, respectively). Additionally, the nonparametric estimator computed with the bootstrap bandwidth is also included in this figure (red line) for the sake of comparison.
Next, using the functions bw.dist.binned and bw.dist.
binned.boot, described in Section 3.2, the plug-in and bootstrap bandwidths for the kernel distribution estimator (2)

| CON CLUS ION
The binnednp R package gives the weed science research community a simple tool to analyze interval-grouped data. This is useful to study, for example, the CHTT at seedling emergence. Using Additionally, analyses to test the effect of a specific factor on weed emergence can be also performed.

CO N FLI C T O F I NTE R E S T
None declared.

APPENDIX 1
In this appendix, the plug-in and bootstrap bandwidth selection methods implemented in the functions bw.dens.binned, bw.dist.

BA N DWI DTH S E LEC TI O N FO R TH E D E N S IT Y FU N C TI O N
The function bw.dens.binned of the binnednp package computes plug-in and bootstrap bandwidths for the density estimator (1). Plug-in bandwidths are obtained minimizing in h the expression of the asymptotic mean squared error (AMISE) of (1) and estimating the unknown quantities in the minimizer of the AMISE. Under suitable assumptions, asymptotic properties of (1) were obtained in Reyes et al. (2016). In that paper, it is obtained that the AMISE of (1) is: From (A1), it follows that the asymptotically optimal global bandwidth for (1) is Using (A2) and estimating the unknown quantity A(f′′), a plug-in bandwidth for estimator (1) can be derived. In Cao et al. (2011), a nonparametric estimator for A(f′′), denoted by Â g , depending on an auxiliary smoothing parameter was proposed. Plugging Â g in (A2) gives the plug-in bandwidth selector for f g h : In practice, to calculate Â g a new bandwidth has to be selected.
This pilot smoothing parameter can be selected using again a plugin procedure, appearing new auxiliary bandwidths, and so on. The usual strategy is to stop this iterative process after two steps and estimate the unknown quantities assuming that f is Gaussian.
As for the bootstrap bandwidth selector for (1) implemented in bw.dens.binned, this is obtained minimizing the bootstrap version of the mean integrated squared error (MISE) of f g h .
Using standard calculations, a closed form expression for the MISE of estimator (1) is given by: where K h (u) = 1 h K u h , p j = F(y j ) − F(y j − 1 ), for j = 1, …, k, and symbol * stands for convolution.
The bootstrap version MISE * of the MISE is obtained as follows.
Let be the estimator (1) based on a pilot bandwidth . Draw a bootstrap sample X * 1 , X * 2 , … , X * n from f g and, given a bandwidth h, consider the analogue of the kernel density estimator, where * denotes the bootstrap expectation (with respect to f g ).
Using a parallel process to that followed to obtain (3), it is possible to derive a closed representation for (4). This expression is given by where f mix is a normal mixture model with a maximum number of r = 5 components fitted with the grouped-data sample. In practice, the expectation-maximization (EM) method was used to estimate the parameters of the mixture model, using the BIC criterion to select the best fit.

BA N DWI DTH S E LEC TI O N FO R TH E D I S TR I B UTI O N FU N C TI O N
Following analogous arguments to those described for the case of the density function, plug-in and bootstrap bandwidth selectors can be proposed for the nonparametric distribution estimator given in (2). They are implemented in the functions bw.dist.binned and bw.dist.binned.boot, respectively.
Regarding the plug-in bandwidth, under some assumptions, it can be obtained that the AMISE of (2) is: In this case, using standard calculations and assuming that F(y k ) = 1 and F(y 0 ) = 0, it is easy to prove that the expectation and the variance of F g h (x) are, respectively, and From (A8) and (A9), it is straightforward to obtain a closed expression for the MISE of the estimator defined in (2): where, denotes the integrated squared bias and is the integrated variance.
To build a bootstrap version of MISE, we consider a pilot bandwidth, , and construct the grouped-data smooth estimator of F as defined in (2), but replacing h by . The idea is to draw resamples from F g , to group the data and to compute the estimator F g h with those bootstrap samples. The bootstrap resampling plan proceeds as follows.
1. Fix some pilot bandwidth, , and consider the grouped-data smooth cdf estimator, F g .
4. Compute, with the same bootstrap resamples, the proportion of bootstrap curves that are included in each of these confidence bands. These proportions satisfy Otherwise increase i in one unit and repeat Steps 2-5.
The final approximate (1-α) simultaneous confidence intervals are those obtained for level 1 − (i) mean in the last iteration.

APPENDIX 3
In this appendix, the statistical procedure implemented in the function anv.binned, presented in Section 3.4, is described in detail.
Consider an interval-grouped sample of size n. This means that given a set of k intervals [y j-1 , y j ), j = 1, …, k, only the number of observations (n 1 , …, n k ) within each interval (instead of the value of every single observation) is known. Note that n = n 1 + … +n k . For example, as described in Section 2.1, in the weed emergence problem studied in this paper, the vector (y 0 , y 1 , …, y k ) denotes the CHTT at emergence and the vector (n 1 , …, n k ) represents the number of seedlings that have emerged in each interval. Additionally, consider that there exists a factor of interest, with G levels or treatments, and we want to test whether this factor has a significant effect in the emergence. Let us denote by N i j the number of observations in the interval j considering the treatment i, with j = 1, …, k and i = 1, …, G.
Then, ∑ k j=1 N i j = N i , where N i denotes the number of observations associated with treatment i, i = 1, …, G (n = N 1 +⋯ + N G ), and ∑ G i=1 N i j = n j , j = 1, …, k.
Denoting by F i the emergence curve considering treatment i, i = 1, …, G, the problem of testing whether the factor has a significant effect in the emergence can be formulated as: A reasonable statistic to address this hypothesis testing is, for example, the following one, of Cramér-von Mises type, based on the statistic to compare k independent samples, proposed by Kiefer (1959): where F g i ( ⋅ ) denotes the corresponding nonparametric estimator of F i ( ⋅ ), using (2), computed using N i and (N i 1 , … ,N i k ), for every i = 1, …, G, and where Fg ( ⋅ ) represents the nonparametric estimator (2) using the pooled sample n and (n 1 , …, n k ).
Intuitively, the null hypothesis H 0 will be accepted for small values of D and rejected for large values of D. To calibrate the test, a bootstrap procedure is employed to approximate the sampling distribution of D. The specific steps are the following: 1. Using a bandwidth h (e.g., the one provided by bw.dist. binned.boot), consider the grouped-data nonparametric estimator, F g h , using the pooled data set with n and (n 1 , …, n k ).
2. For each treatment i, i = 1, …, G, draw (N i * 1 , … ,N i * k ) from a multinomial distribution M k (N i ;p h 1 , … ,p h k ), with p h j =F g h (y j ) −F g h (y j−1 ), j = 1, …, k, and define w i * j = N i * j ∕N i . 3. Using the weights w i * 1 , … ,w i * k , compute the grouped-data nonparametric estimator F g * i , for each treatment i = 1, …, G, and the nonparametric estimator Fg * using the pooled bootstrap resam- arranged in increasing order of magnitude. Additionally, the p-value of this test can be approximated by: where {⋅} is the indicator function.
It is important to note that with the resampling process described in Step 2, the bootstrap resamples for each treatment are generated under the null hypothesis.