In this paper, the maximum spacing method is considered for multivariate observations. Nearest neighbour balls are used as a multidimensional analogue of univariate spacings. A class of information-type measures is used to generalize the concept of maximum spacing estimators. Weak and strong consistency of these generalized maximum spacing estimators are proved both when the assigned model class is correct and when the true density is not a member of the model class. An example of the generalized maximum spacing method in a model validation context is discussed.
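The spacing idea is easiest to see in one dimension. The following sketch, with an illustrative exponential model and hypothetical sample size, true rate and search grid (none of which come from the paper), maximizes the mean log-spacing of the model CDF evaluated at the order statistics:

```python
import numpy as np

# Univariate illustration of the maximum spacing principle for an
# exponential model F(x) = 1 - exp(-lam * x); sample size, true rate
# and grid are illustrative choices.
rng = np.random.default_rng(0)
x = np.sort(rng.exponential(scale=1 / 2.0, size=500))  # true rate 2.0

def spacing_criterion(lam, x):
    """Mean log-spacing of the model CDF at the order statistics."""
    f = 1.0 - np.exp(-lam * x)
    d = np.diff(np.concatenate(([0.0], f, [1.0])))  # n + 1 spacings
    return np.mean(np.log(np.maximum(d, 1e-300)))

grid = np.linspace(0.5, 5.0, 451)
lam_hat = grid[np.argmax([spacing_criterion(l, x) for l in grid])]
```

The nearest neighbour balls of the paper play the role of these one-dimensional spacings when the observations are multivariate.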

Estimating the fibre length distribution in composite materials is of practical relevance in materials science. We propose an estimator for the fibre length distribution using the point process of fibre endpoints as input. Assuming that this point process is a realization of a Neyman–Scott process, we use results for the reduced second moment measure to derive a consistent and unbiased estimator for the fibre length distribution. We introduce various versions of the estimator taking anisotropy or errors in the observation into account. The estimator is evaluated using a heuristic for its mean squared error as well as a simulation study. Finally, the estimator is applied to the fibre endpoint process extracted from a tomographic image of a glass fibre composite.

In this paper, we consider non-parametric copula inference under bivariate censoring. Based on an estimator of the joint cumulative distribution function, we define a discrete and two smooth estimators of the copula. The construction that we propose is valid for a large range of estimators of the distribution function and therefore for a large range of bivariate censoring frameworks. Under some conditions on the tails of the distributions, the weak convergence of the corresponding copula processes is obtained in *l*^{∞}([0,1]^{2}). We derive the uniform convergence rates of the copula density estimators deduced from our smooth copula estimators. Investigation of the practical behaviour of these estimators is performed through a simulation study and two real data applications, corresponding to different censoring settings. We use our non-parametric estimators to define a goodness-of-fit procedure for parametric copula models. A new bootstrap scheme is proposed to compute the critical values.
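For intuition, the discrete (empirical) copula in the uncensored case is obtained by plugging marginal ranks into the joint empirical distribution; the sample below is an illustrative, fully observed one, whereas the estimators in the paper replace the empirical CDF by a censoring-adjusted estimator:

```python
import numpy as np

# Empirical copula from marginal ranks (uncensored illustration only).
rng = np.random.default_rng(1)
n = 200
z = rng.standard_normal((n, 2))
x = z[:, 0]
y = 0.8 * z[:, 0] + 0.6 * z[:, 1]   # positively dependent pair

r = x.argsort().argsort() + 1        # marginal ranks of x
s = y.argsort().argsort() + 1        # marginal ranks of y

def emp_copula(u, v):
    """C_n(u, v) = proportion of points with both rescaled ranks below (u, v)."""
    return np.mean((r / n <= u) & (s / n <= v))
```

Positive dependence shows up as `emp_copula(0.5, 0.5)` exceeding the independence value 0.25.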

We discuss the problem of selecting among alternative parametric models within the Bayesian framework. For model selection problems involving non-nested models, the common objective choice of a prior on the model space is the uniform distribution; the same applies when the models are nested. It is our contention that assigning equal prior probability to each model is oversimplistic. Consequently, we introduce a novel approach to objectively determine model prior probabilities, conditionally on the choice of priors for the parameters of the models. The idea is based on the notion of the *worth* of having each model within the selection process. At the heart of the procedure is the measurement of this *worth*, using the Kullback–Leibler divergence between densities from different models.

We propose a random partition model that implements prediction with many candidate covariates and interactions. The model is based on a modified product partition model that includes a regression on covariates by favouring homogeneous clusters in terms of these covariates. Additionally, the model allows for a cluster-specific choice of the covariates that are included in this evaluation of homogeneity. The variable selection is implemented by introducing a set of cluster-specific latent indicators that include or exclude covariates. The proposed model is motivated by an application to predicting mortality in an intensive care unit in Lisboa, Portugal.

We propose a flexible prior model for the parameters of binary Markov random fields (MRF), defined on rectangular lattices and with maximal cliques defined from a template maximal clique. The prior model allows higher-order interactions to be included. We also define a reversible jump Markov chain Monte Carlo algorithm to sample from the associated posterior distribution. The number of possible parameters for a higher-order MRF becomes large, even for small template maximal cliques. We define a flexible parametric form in which the parameters have an interpretation as potentials for clique configurations, and we limit the effective number of parameters by assigning, a priori, discrete probabilities to events where groups of parameter values are equal. To cope with the computationally intractable normalising constant of MRFs, we adopt a previously defined approximation of binary MRFs. We demonstrate the flexibility of our prior formulation with simulated and real data examples.

In this paper, we consider a mixed compound Poisson process, that is, a random sum of independent and identically distributed (*i.i.d*.) random variables where the number of terms is a Poisson process with random intensity. We study nonparametric estimators of the jump density by specific deconvolution methods. Firstly, assuming that the random intensity has exponential distribution with unknown expectation, we propose two types of estimators based on the observation of an *i.i.d*. sample. Risk bounds and adaptive procedures are provided. Then, with no assumption on the distribution of the random intensity, we propose two non-parametric estimators of the jump density based on the joint observation of the number of jumps and the random sum of jumps. Risk bounds are provided, leading to unusual rates for one of the two estimators. The methods are implemented and compared via simulations.
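As a minimal illustration of the observation scheme, a mixed compound Poisson sum at a fixed time can be simulated as follows; the exponential intensity with mean 2 and the standard normal jumps are hypothetical choices, not those of the paper:

```python
import numpy as np

# Simulate replicates of a mixed compound Poisson sum at time t = 1:
# random intensity Lambda ~ Exp(mean 2), then N | Lambda ~ Poisson(Lambda),
# then S = sum of N i.i.d. standard normal jumps. All values illustrative.
rng = np.random.default_rng(2)
m = 20_000                                   # number of replicates
lam = rng.exponential(scale=2.0, size=m)     # random intensity, mean 2
n_jumps = rng.poisson(lam)                   # jump count given intensity
s = np.array([rng.standard_normal(k).sum() for k in n_jumps])
```

Marginally the counts are then geometric (a Poisson mixed over an exponential intensity), with mean equal to the mean intensity.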

We give a simple proof of Bell's inequality in quantum mechanics using the theory of causal interaction, which, in conjunction with experiments, demonstrates that the local hidden variable assumption is false. The proof sheds light on relationships between the notion of causal interaction and interference between treatments.

In their recent work, Jiang and Yang studied six classical likelihood ratio test statistics in a high-dimensional setting. Assuming that a random sample of size *n* is observed from a *p*-dimensional normal population, they derived central limit theorems (CLTs) when *p* and *n* are proportional to each other; these differ from the classical chi-square limits obtained as *n* goes to infinity while *p* remains fixed. In this paper, by developing a new tool, we prove that the six CLTs hold in a more applicable setting: *p* goes to infinity, and *p* can be very close to *n*. This is an almost necessary and sufficient condition for the CLTs. Histogram simulations, comparisons of sizes and powers with those of the classical chi-square approximations, and discussions are presented afterwards.

COGARCH models are continuous-time versions of the well-known GARCH models of financial returns. The first aim of this paper is to show how the method of prediction-based estimating functions can be applied to draw statistical inference from observations of a COGARCH(1,1) model, once the higher-order structure of the process has been clarified. A second aim of the paper is to provide recursive expressions for the joint moments of any fixed order of the process. Asymptotic results are given, and a simulation study shows that the method of prediction-based estimating functions outperforms the other available estimation methods.

In measurement error problems, two major consistent estimation methods are the conditional score and the corrected score. They are functional methods that require no parametric assumptions on the mismeasured covariates. The conditional score requires that a suitable sufficient statistic for the mismeasured covariate can be found, while the corrected score requires that the objective score function can be estimated without bias. These assumptions limit their ranges of application. The extensively corrected score proposed here is an extension of the corrected score. It yields consistent estimators in many cases where neither the conditional score nor the corrected score is feasible. We demonstrate its construction in generalized linear models and the Cox proportional hazards model, assess its performance by simulation studies and illustrate its implementation with two real examples.

The paper describes a generalized iterative proportional fitting procedure that can be used for maximum likelihood estimation in a special class of the general log-linear model. The models in this class, called relational, apply to multivariate discrete sample spaces that do not necessarily have a Cartesian product structure and may not contain an overall effect. When applied to the cell probabilities, the models without the overall effect are curved exponential families and the values of the sufficient statistics are reproduced by the MLE only up to a constant of proportionality. The paper shows that Iterative Proportional Fitting, Generalized Iterative Scaling, and Improved Iterative Scaling fail to work for such models. The algorithm proposed here is based on iterated Bregman projections. As a by-product, estimates of the multiplicative parameters are also obtained. An implementation of the algorithm is available as an R-package.

This work provides a class of non-Gaussian spatial Matérn fields which are useful for analysing geostatistical data. The models are constructed as solutions to stochastic partial differential equations driven by generalized hyperbolic noise and are incorporated in a standard geostatistical setting with irregularly spaced observations, measurement errors and covariates. A maximum likelihood estimation technique based on the Monte Carlo expectation-maximization algorithm is presented, and a Monte Carlo method for spatial prediction is derived. Finally, an application to precipitation data is presented, and the performance of the non-Gaussian models is compared with standard Gaussian and transformed Gaussian models through cross-validation.

We present a novel methodology for estimating the parameters of a finite mixture model (FMM) based on partially rank-ordered set (PROS) sampling and use it in a fishery application. A PROS sampling design first selects a simple random sample of fish and creates partially rank-ordered judgement subsets by dividing units into subsets of prespecified sizes. The final measurements are then obtained from these partially ordered judgement subsets. The traditional expectation–maximization algorithm is not directly applicable for these observations. We propose a suitable expectation–maximization algorithm to estimate the parameters of the FMMs based on PROS samples. We also study the problem of classification of the PROS sample into the components of the FMM. We show that the maximum likelihood estimators based on PROS samples perform substantially better than their simple random sample counterparts even with small samples. The results are used to classify a fish population using the length-frequency data.
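For reference, the classical EM iteration for a two-component normal mixture fitted to a simple random sample looks as follows; the PROS-based algorithm of the paper modifies this scheme to exploit the judgement ranking information, and all numerical values here are illustrative:

```python
import numpy as np

# Plain EM for a two-component normal mixture on a simple random sample.
# Component parameters and sample size are illustrative choices.
rng = np.random.default_rng(3)
n = 1000
comp = rng.random(n) < 0.4
x = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n))

def npdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

# Initialize from sample quantiles to separate the components.
w, m1, m2, s1, s2 = 0.5, np.quantile(x, 0.25), np.quantile(x, 0.75), 1.0, 1.0
for _ in range(200):
    # E-step: posterior probability of component 1 for each observation.
    p1 = w * npdf(x, m1, s1)
    p2 = (1 - w) * npdf(x, m2, s2)
    r = p1 / (p1 + p2)
    # M-step: weighted updates of the weight, means and standard deviations.
    w = r.mean()
    m1 = (r * x).sum() / r.sum()
    m2 = ((1 - r) * x).sum() / (1 - r).sum()
    s1 = np.sqrt((r * (x - m1) ** 2).sum() / r.sum())
    s2 = np.sqrt(((1 - r) * (x - m2) ** 2).sum() / (1 - r).sum())
```

Under PROS sampling the observations are no longer an i.i.d. draw from the mixture, which is why this E-step must be replaced by one that conditions on the partial ranking.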

The particle Gibbs sampler is a systematic way of using a particle filter within Markov chain Monte Carlo. This results in an off-the-shelf Markov kernel on the space of state trajectories, which can be used to simulate from the full joint smoothing distribution for a state space model in a Markov chain Monte Carlo scheme. We show that the particle Gibbs Markov kernel is uniformly ergodic under rather general assumptions, which we carefully review and discuss. In particular, we provide an explicit rate of convergence, which reveals that (i) for a fixed number of data points, the convergence rate can be made arbitrarily good by increasing the number of particles and (ii) under general mixing assumptions, the convergence rate can be kept constant by increasing the number of particles superlinearly with the number of observations. We illustrate the applicability of our result by studying in detail a common stochastic volatility model with a non-compact state space.

We develop statistical procedures for estimating shape and orientation of arbitrary three-dimensional particles. We focus on the case where particles cannot be observed directly, but only via sections. Volume tensors are used for describing particle shape and orientation, and we derive stereological estimators of the tensors. These estimators are combined to provide consistent estimators of the moments of the so-called particle cover density. The covariance structure associated with the particle cover density depends on the orientation and shape of the particles. For instance, if the distribution of the typical particle is invariant under rotations, then the covariance matrix is proportional to the identity matrix. We develop a non-parametric test for such isotropy. A flexible Lévy-based particle model is proposed, which may be analysed using a generalized method of moments in which the volume tensors enter. The developed methods are used to study the cell organization in the human brain cortex.

The linear regression model for right censored data, also known as the accelerated failure time model using the logarithm of survival time as the response variable, is a useful alternative to the Cox proportional hazards model. Empirical likelihood as a non-parametric approach has been demonstrated to have many desirable merits thanks to its robustness against model misspecification. However, the linear regression model with right censored data cannot directly benefit from the empirical likelihood for inference, mainly because of the dependent elements in the estimating equations of the conventional approach. In this paper, we propose an empirical likelihood approach with a new estimating equation for linear regression with right censored data. A nested coordinate algorithm with majorization is used for solving the optimization problems with non-differentiable objective function. We show that Wilks' theorem holds for the new empirical likelihood. We also consider the variable selection problem with empirical likelihood when the number of predictors can be large. Because the new estimating equation is non-differentiable, a quadratic approximation is applied to study the asymptotic properties of penalized empirical likelihood. We prove the oracle properties and evaluate the properties with simulated data. We apply our method to a Surveillance, Epidemiology, and End Results small intestine cancer dataset.

We propose a new method for risk-analytic benchmark dose (BMD) estimation in a dose-response setting when the responses are measured on a continuous scale. For each dose level *d*, the observation *X*(*d*) is assumed to follow a normal distribution: *X*(*d*) ∼ N(*μ*(*d*), *σ*^{2}). No specific parametric form is imposed upon the mean *μ*(*d*), however. Instead, nonparametric maximum likelihood estimates of *μ*(*d*) and *σ* are obtained under a monotonicity constraint on *μ*(*d*). For purposes of quantitative risk assessment, a ‘hybrid’ form of risk function is defined for any dose *d* as *R*(*d*) = *P*[*X*(*d*) < *c*], where *c* > 0 is a constant independent of *d*. The BMD is then determined by inverting the *additional risk function* *R*_{A}(*d*) = *R*(*d*) − *R*(0) at some specified value of the benchmark response. Asymptotic theory for the point estimators is derived, and a finite-sample study is conducted, using both real and simulated data. When a large number of doses are available, we propose an adaptive grouping method for estimating the BMD, which is shown to have optimal mean integrated squared error under appropriate designs.
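The risk inversion step can be made concrete with a fully parametric toy example; the decreasing mean function, sigma, cutoff *c* and benchmark response below are hypothetical choices, and a bisection search inverts the additional risk:

```python
import math

# Toy hybrid-risk BMD computation with a hypothetical decreasing mean
# mu(d) = 1 - 0.1 d, sigma = 0.2, cutoff c = 0.5 and benchmark
# response 0.1 -- all values are illustrative.
SIGMA, C, BMR = 0.2, 0.5, 0.1

def mu(d):
    return 1.0 - 0.1 * d

def risk(d):
    """R(d) = P[X(d) < c] for X(d) ~ N(mu(d), sigma^2)."""
    z = (C - mu(d)) / SIGMA
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def additional_risk(d):
    return risk(d) - risk(0.0)

# Invert R_A(d) = BMR by bisection (R_A is increasing for decreasing mu).
lo, hi = 0.0, 10.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if additional_risk(mid) < BMR:
        lo = mid
    else:
        hi = mid
bmd = 0.5 * (lo + hi)
```

In the paper, the parametric `mu` and `SIGMA` above are replaced by their monotonicity-constrained nonparametric estimates before the inversion.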

Many model-free dimension reduction methods have been developed for high-dimensional regression data, but little attention has been paid to problems with non-linear confounding. In this paper, we propose an inverse-regression method of dependent variable transformation for detecting the presence of non-linear confounding. The benefit of using geometrical information from our method is highlighted. A ratio estimation strategy is incorporated in our approach to enhance the interpretation of variable selection. This approach can be implemented not only in principal Hessian directions (PHD) but also in other recently developed dimension reduction methods. Several simulation examples are reported for illustration, and comparisons are made with sliced inverse regression and with PHD applied in ignorance of the non-linear confounding. An illustrative application to a real data set is also presented.

Supremum score test statistics are often used to evaluate hypotheses with unidentifiable nuisance parameters under the null hypothesis. Although these statistics provide an attractive framework to address non-identifiability under the null hypothesis, little attention has been paid to their distributional properties in small to moderate sample size settings. In situations where there are identifiable nuisance parameters under the null hypothesis, these statistics may behave erratically in realistic samples as a result of a non-negligible bias induced by substituting these nuisance parameters by their estimates under the null hypothesis. In this paper, we propose an adjustment to the supremum score statistics by subtracting the expected bias from the score processes and show that this adjustment does not alter the limiting null distribution of the supremum score statistics. Using a simple example from the class of zero-inflated regression models for count data, we show empirically and theoretically that the adjusted tests are superior in terms of size and power. The practical utility of this methodology is illustrated using count data in HIV research.

Panel count data arise in many fields, and a number of estimation procedures have been developed, along with two procedures for variable selection. In this paper, we discuss model selection and parameter estimation together. For the former, a focused information criterion (FIC) is presented, and for the latter, a frequentist model average (FMA) estimation procedure is developed. A main advantage of the FIC, and its difference from existing model selection methods, is that it emphasizes the accuracy of the estimation of the parameters of interest rather than of all parameters. Further efficiency gains can be achieved by the FMA estimation procedure because, unlike existing methods, it takes into account the variability introduced at the model selection stage. Asymptotic properties of the proposed estimators are established, and a simulation study suggests that the proposed methods work well in practical situations. An illustrative example is also provided. © 2014 Board of the Foundation of the Scandinavian Journal of Statistics

Partial linear models have been widely used as a flexible method for modelling linear components in conjunction with non-parametric ones. Despite the presence of the non-parametric part, the linear, parametric part can, under certain conditions, be estimated at the parametric rate. In this paper, we consider a high-dimensional linear part. We show that it can be estimated with oracle rates, using the least absolute shrinkage and selection operator penalty for the linear part and a smoothness penalty for the non-parametric part.

We extend the log-mean linear parameterization for binary data to discrete variables with an arbitrary number of levels and show that, also in this case, it can be used to parameterize bi-directed graph models. Furthermore, we show that the log-mean linear parameterization allows one to simultaneously represent marginal independencies among variables and marginal independencies that only appear when certain levels are collapsed into a single one. We illustrate the application of this property by means of an example based on genetic association studies involving single-nucleotide polymorphisms. More generally, this feature provides a natural way to reduce the parameter count, while preserving the independence structure, by means of substantive constraints that give additional insight into the association structure of the variables.

We propose the Laplace Error Penalty (LEP) function for variable selection in high-dimensional regression. Unlike penalty functions using piecewise spline construction, the LEP is constructed as an exponential function with two tuning parameters and is infinitely differentiable everywhere except at the origin. With this construction, the LEP-based procedure acquires extra flexibility in variable selection, admits a unified derivative formula in optimization and is able to approximate the *L*_{0} penalty arbitrarily closely. We show that the LEP procedure can identify relevant predictors in exponentially high-dimensional regression with normal errors. We also establish the oracle property for the LEP estimator. Although not convex itself, the LEP yields a convex penalized least squares function under mild conditions if *p* is no greater than *n*. A coordinate descent majorization–minimization algorithm is introduced to implement the LEP procedure. In simulations and a real data analysis, the LEP methodology performs favourably among competitive procedures.

This paper develops a Bayesian control chart for the percentiles of the Weibull distribution when both its in-control and out-of-control parameters are unknown. The Bayesian approach enhances parameter estimates for the small sample sizes that occur when monitoring rare events, such as in high-reliability applications. The chart monitors the parameters of the Weibull distribution directly, instead of transforming the data as most Weibull-based charts do to meet the normality assumption. The chart uses the accumulated knowledge resulting from the likelihood of the current sample combined with the information given by both the initial prior and all past samples. The chart is adaptive because its control limits change (e.g. narrow) during Phase I. An example is presented, and good average run length properties are demonstrated.

The bootstrap variance estimate is widely used in semiparametric inferences. However, its theoretical validity is a well-known open problem. In this paper, we provide a *first* theoretical study on the bootstrap moment estimates in semiparametric models. Specifically, we establish the bootstrap moment consistency of the Euclidean parameter, which immediately implies the consistency of *t*-type bootstrap confidence set. It is worth pointing out that the only additional cost to achieve the bootstrap moment consistency in contrast with the distribution consistency is to simply strengthen the *L*_{1} maximal inequality condition required in the latter to the *L*_{p} maximal inequality condition for *p*≥1. The general *L*_{p} multiplier inequality developed in this paper is also of independent interest. These general conclusions hold for the bootstrap methods with exchangeable bootstrap weights, for example, non-parametric bootstrap and Bayesian bootstrap. Our general theory is illustrated in the celebrated Cox regression model.
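The t-type construction can be sketched with the sample mean standing in for the Euclidean parameter of a semiparametric model (the paper's setting is, e.g., the regression coefficient in the Cox model); the data, sample size and number of bootstrap replicates below are illustrative:

```python
import numpy as np

# Nonparametric bootstrap moment (variance) estimate and the resulting
# t-type confidence interval, with the sample mean as a stand-in
# Euclidean parameter. All sizes are illustrative.
rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=200)
theta_hat = x.mean()

B = 500
boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(B)])
se_boot = boot.std(ddof=1)                 # bootstrap second-moment estimate
ci = (theta_hat - 1.96 * se_boot, theta_hat + 1.96 * se_boot)
```

The point of the paper is precisely that `se_boot` is a *moment* of the bootstrap distribution, so its consistency requires more than the usual distributional consistency.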

We consider hypothesis testing problems for low-dimensional coefficients in a high-dimensional additive hazard model. A variance reduced partial profiling estimator (VRPPE) is proposed and its asymptotic normality is established, which enables us to test the significance of each single coefficient when the data dimension is much larger than the sample size. Based on the *p*-values obtained from the proposed test statistics, we then apply a multiple testing procedure to identify significant coefficients and show that the false discovery rate can be controlled at the desired level. The proposed method is also extended to testing a low-dimensional sub-vector of coefficients. The finite sample performance of the proposed testing procedure is evaluated by simulation studies. We also apply it to two real data sets, with one focusing on testing low-dimensional coefficients and the other focusing on identifying significant coefficients through the proposed multiple testing procedure.

Assessing the absolute risk of a future disease event in presently healthy individuals has an important role in the primary prevention of cardiovascular diseases (CVD) and other chronic conditions. In this paper, we study the use of non-parametric Bayesian hazard regression techniques and posterior predictive inferences in the risk assessment task. We generalize our previously published Bayesian multivariate monotonic regression procedure to a survival analysis setting, combined with a computationally efficient estimation procedure utilizing case–base sampling. To achieve parsimony in the model fit, we allow for multidimensional relationships within specified subsets of risk factors, determined either on an *a priori* basis or as a part of the estimation procedure. We apply the proposed methods to 10-year CVD risk assessment in a Finnish population.

The aim of the paper is to study the problem of estimating the quantile function of a finite population. Attention is first focused on point estimation, and asymptotic results are obtained. Confidence intervals are then constructed based on (i) the asymptotic results and (ii) a resampling technique that rescales the ‘usual’ bootstrap. Finally, a simulation study comparing the asymptotic and resampling-based results is performed, together with an application to a real population.
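A minimal version of the point estimator inverts the (design-weighted) empirical CDF of a sample drawn without replacement; the synthetic population, the sizes and the use of simple random sampling (under which the weights are equal) are illustrative assumptions:

```python
import numpy as np

# Finite-population quantile estimation by inverting a weighted
# empirical CDF; population and sample sizes are illustrative.
rng = np.random.default_rng(5)
N, n = 10_000, 500
population = rng.standard_normal(N)
sample = rng.choice(population, size=n, replace=False)  # SRS w/o replacement

def est_quantile(p, y, w=None):
    """Invert the (weighted) empirical CDF of y at level p."""
    order = np.argsort(y)
    y = y[order]
    w = np.ones(y.size) if w is None else np.asarray(w)[order]
    cdf = np.cumsum(w) / w.sum()
    return y[np.searchsorted(cdf, p)]

q_hat = est_quantile(0.5, sample)      # equal weights under SRSWOR
pop_median = np.median(population)
```

Under an unequal-probability design, `w` would carry the inverse inclusion probabilities instead of equal weights.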

We propose a new summary statistic for inhomogeneous intensity-reweighted moment stationary spatio-temporal point processes. The statistic is defined in terms of the *n*-point correlation functions of the point process, and it generalizes the *J*-function when stationarity is assumed. We show that our statistic can be represented in terms of the generating functional and that it is related to the spatio-temporal *K*-function. We further discuss its explicit form under some specific model assumptions and derive ratio-unbiased estimators. We finally illustrate the use of our statistic in practice.

This paper studies the asymptotic behaviour of the false discovery and non-discovery proportions of the dynamic adaptive procedure under some dependence structure. A Bahadur-type representation of the cut point when a large number of tests are performed simultaneously is presented. The asymptotic bias decompositions of the false discovery and non-discovery proportions are given under some dependence structure. In addition to the existing literature, we find that the randomness due to the dynamic selection of the tuning parameter in estimating the true null rate serves as a source of approximation error in the Bahadur representation and enters into the asymptotic bias terms of both the false discovery and the false non-discovery proportions. The theory explains, to some extent, why some seemingly attractive dynamic adaptive procedures do not substantially outperform the competing fixed adaptive procedures in some situations. Simulations support our theory and findings.

Self-regulating processes are stochastic processes whose local regularity, as measured by the pointwise Hölder exponent, is a function of amplitude. They seem to provide relevant models for various signals arising for example in geophysics or biomedicine. We propose in this work an estimator of the self-regulating function (that is, the function relating amplitude and Hölder regularity) of the self-regulating midpoint displacement process and study some of its properties. We prove that it is almost surely convergent and obtain a central limit theorem. Numerical simulations show that the estimator behaves well in practice.

We develop an easy and direct way to define and compute the fiducial distribution of a real parameter for both continuous and discrete exponential families. Furthermore, such a distribution satisfies the requirements to be considered a confidence distribution. Many examples are provided for models which, although very simple, are widely used in applications. A characterization of the families for which the fiducial distribution coincides with a Bayesian posterior is given, and the strict connection with Jeffreys prior is shown. Asymptotic expansions of fiducial distributions are obtained without any further assumptions, and again, the relationship with the objective Bayesian analysis is pointed out. Finally, using the Edgeworth expansions, we compare the coverage of the fiducial intervals with that of other common intervals, proving the good behaviour of the former.

Length-biased and right-censored failure time data arise in many fields, and their analysis has recently attracted a great deal of attention. Two examples of the areas that often produce such data are epidemiological studies and cancer screening trials. In this paper, we discuss regression analysis of such data in the presence of missing covariates, for which no established inference procedure seems to exist. For the problem, we consider the data arising from the proportional hazards model and propose two inverse probability weighted estimation procedures. The asymptotic properties of the resulting estimators are established, and the extensive simulation study conducted for the evaluation of the proposed methods suggests that they work well in practical situations.

The problem of interest is to estimate the concentration curve and the area under the curve (AUC) by estimating the parameters of a linear regression model with an autocorrelated error process. We introduce a simple linear unbiased estimator of the concentration curve and the AUC. We show that this estimator constructed from a sampling design generated by an appropriate density is asymptotically optimal in the sense that it has exactly the same asymptotic performance as the best linear unbiased estimator. Moreover, we prove that the optimal design is robust with respect to a minimax criterion. When repeated observations are available, this estimator is consistent and has an asymptotic normal distribution. Finally, a simulated annealing algorithm is applied to a pharmacokinetic model with correlated errors.
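A generalized least squares sketch under a known AR(1) error correlation illustrates the setting; the exponential basis functions, the true coefficients and the correlation parameter are hypothetical choices, and the AUC then follows from the exact integrals of the basis functions:

```python
import numpy as np

# Linear concentration model C(t) = b1*exp(-t) + b2*exp(-2t) observed
# with AR(1) errors of known correlation rho; estimated by GLS.
# Basis, coefficients, rho and grid are illustrative assumptions.
rng = np.random.default_rng(6)
t = np.linspace(0.0, 4.0, 40)
X = np.column_stack([np.exp(-t), np.exp(-2.0 * t)])
beta_true = np.array([2.0, -1.0])

rho, sigma_eps = 0.5, 0.05
eps = np.empty(t.size)
eps[0] = rng.normal(0.0, sigma_eps)
for i in range(1, t.size):  # stationary AR(1) error process
    eps[i] = rho * eps[i - 1] + rng.normal(0.0, sigma_eps * np.sqrt(1 - rho**2))
y = X @ beta_true + eps

# GLS: beta = (X' S^-1 X)^-1 X' S^-1 y with S_ij = rho^|i-j|
idx = np.arange(t.size)
S = rho ** np.abs(idx[:, None] - idx[None, :])
Si_X = np.linalg.solve(S, X)
beta_hat = np.linalg.solve(X.T @ Si_X, Si_X.T @ y)

# AUC over [0, 4] from the fitted coefficients (exact basis integrals).
auc_hat = (beta_hat[0] * (1 - np.exp(-4.0))
           + beta_hat[1] * 0.5 * (1 - np.exp(-8.0)))
```

The paper's contribution is choosing the sampling design (the points `t`) so that a *simple* linear unbiased estimator matches the asymptotic performance of this best linear unbiased one.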

Mixture models are commonly used in biomedical research to account for possible heterogeneity in a population. In this paper, we consider tests for homogeneity between two groups in the exponential tilt mixture models. A novel pairwise pseudolikelihood approach is proposed to eliminate the unknown nuisance function. We show that the corresponding pseudolikelihood ratio test has an asymptotic distribution given by the supremum of two squared Gaussian processes under the null hypothesis. To maintain the appeal of simplicity of conventional likelihood ratio tests, we propose two alternative tests, both shown to have simple asymptotic null distributions. Simulation studies show that the proposed class of pseudolikelihood ratio tests performs well in controlling type I error and has competitive power compared with the current tests. The proposed tests are illustrated by an example of partial differential expression detection using microarray data from prostate cancer patients.

Small area estimators in linear models are typically expressed as a convex combination of direct estimators and synthetic estimators from a suitable model. When auxiliary information used in the model is measured with error, a new estimator, accounting for the measurement error in the covariates, has been proposed in the literature. Recently, for the area-level model, Ybarra & Lohr (Biometrika, 95, 2008, 919) suggested a suitable modification of the estimates of small area means, based on the Fay & Herriot (J. Am. Stat. Assoc., 74, 1979, 269) model, where some of the covariates are measured with error. They used a frequentist approach based on the method of moments. Adopting a Bayesian approach, we propose to rewrite the measurement error model as a hierarchical model; we use improper non-informative priors on the model parameters and show, under a mild condition, that the joint posterior distribution is proper and that the marginal posterior distributions of the model parameters have finite variances. We conduct a simulation study exploring different scenarios. The Bayesian predictors we propose show smaller empirical mean squared errors than the frequentist predictors of Ybarra & Lohr (Biometrika, 95, 2008, 919), and they also appear more stable in terms of variability and bias. We apply the proposed methodology to two real examples.

In this paper, we propose to use a special class of bivariate frailty models to study dependent censored data. The proposed models are closely linked to Archimedean copula models. We give sufficient conditions for the identifiability of this type of competing risks model. The proposed conditions are derived based on a property shared by Archimedean copula models and satisfied by several well-known bivariate frailty models. Compared with the models studied by Heckman and Honoré and Abbring and van den Berg, our models are more restrictive but can be identified with a discrete (even finite) covariate. Under our identifiability conditions, the expectation–maximization (EM) algorithm provides consistent estimates of the unknown parameters. Simulation studies show that our estimation procedure works quite well. We fit the Clayton copula model to a dependently censored leukaemia data set and end our paper with some discussion.

In this work, we develop a method of adaptive non-parametric estimation, based on ‘warped’ kernels. The aim is to estimate a real-valued function *s* from a sample of random couples (*X*,*Y*). We deal with transformed data (Φ(*X*),*Y*), with Φ a one-to-one function, to build a collection of kernel estimators. The data-driven bandwidth selection is performed with a method inspired by Goldenshluger and Lepski (Ann. Statist., 39, 2011, 1608). The method makes it possible to handle various problems such as additive and multiplicative regression, conditional density estimation, hazard rate estimation based on randomly right-censored data, and cumulative distribution function estimation from current-status data. The interest of the approach is threefold. First, the squared-bias/variance trade-off is automatically realized. Next, non-asymptotic risk bounds are derived. Lastly, the estimator is easy to compute thanks to its simple expression; a short simulation study is presented.
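As a rough illustration of the warping idea (not the paper's exact construction), one common choice of Φ is the empirical cdf of *X*, so that smoothing is done on a uniformized scale. The sketch below, with illustrative names, applies an ordinary Gaussian kernel smoother in the warped coordinate:

```python
import numpy as np

def warped_kernel_regression(x0, X, Y, h):
    """Warped-kernel regression sketch: transform the design points by
    their empirical cdf (one possible one-to-one map Phi), then apply a
    plain Gaussian kernel smoother in the warped scale."""
    ranks = np.argsort(np.argsort(X))
    Phi = (ranks + 1) / (len(X) + 1)        # empirical cdf of X at X_i
    phi_x0 = np.mean(X <= x0)               # Phi evaluated at x0
    w = np.exp(-0.5 * ((Phi - phi_x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)
```

Warping makes the design uniform, so a single bandwidth *h* behaves comparably across the whole support; the paper's contribution is selecting that bandwidth adaptively in the Goldenshluger–Lepski style.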

The causal assumptions, the study design and the data are the elements required for scientific inference in empirical research. The research is adequately communicated only if all of these elements and their relations are described precisely. Causal models with design describe the study design and the missing-data mechanism together with the causal structure and allow the direct application of causal calculus in the estimation of the causal effects. The flow of the study is visualized by ordering the nodes of the causal diagram in two dimensions by their causal order and the time of the observation. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph. Causal models with design offer a systematic and unifying view to scientific inference and increase the clarity and speed of communication. Examples on the causal models for a case–control study, a nested case–control study, a clinical trial and a two-stage case–cohort study are presented.

For right-censored survival data, it is well known that the mean survival time can be consistently estimated when the support of the censoring time contains the support of the survival time. In practice, however, this condition can be easily violated because the follow-up of a study is usually within a finite window. In this article, we show that the mean survival time is still estimable from a linear model when the support of some covariate(s) with non-zero coefficient(s) is unbounded, regardless of the length of follow-up. This implies that the mean survival time can be well estimated in practice when the support of the linear predictor is wide. The theoretical finding is further verified for finite samples by simulation studies. Simulations also show that, when both models are correctly specified, the linear model yields reasonable mean squared prediction errors and outperforms the Cox model, particularly with heavy censoring and short follow-up time.

For many stochastic models, it is difficult to make inference about the model parameters because it is impossible to write down a tractable likelihood given the observed data. A common solution is data augmentation in a Markov chain Monte Carlo (MCMC) framework. However, there are statistical problems where this approach has proved infeasible but where simulation from the model is straightforward leading to the popularity of the approximate Bayesian computation algorithm. We introduce a forward simulation MCMC (fsMCMC) algorithm, which is primarily based upon simulation from the model. The fsMCMC algorithm formulates the simulation of the process explicitly as a data augmentation problem. By exploiting non-centred parameterizations, an efficient MCMC updating schema for the parameters and augmented data is introduced, whilst maintaining straightforward simulation from the model. The fsMCMC algorithm is successfully applied to two distinct epidemic models including a birth–death–mutation model that has only previously been analysed using approximate Bayesian computation methods.

The Cox-Aalen model, obtained by replacing the baseline hazard function in the well-known Cox model with a covariate-dependent Aalen model, allows for both fixed and dynamic covariate effects. In this paper, we examine maximum likelihood estimation for a Cox-Aalen model based on interval-censored failure times with fixed covariates. The resulting estimator converges globally to the truth at a slower-than-parametric rate, but its finite-dimensional component is asymptotically efficient. Numerical studies show that estimation via a constrained Newton method performs well in terms of both finite sample properties and processing time for moderate-to-large samples with few covariates. We conclude with an application of the proposed methods to assess risk factors for disease progression in psoriatic arthritis.

We study semiparametric time series models with innovations following a log-concave distribution. We propose a general maximum likelihood framework that allows us to estimate simultaneously the parameters of the model and the density of the innovations. This framework can be easily adapted to many well-known models, including autoregressive moving average (ARMA), generalized autoregressive conditionally heteroscedastic (GARCH), and ARMA-GARCH models. Furthermore, we show that the estimator under our new framework is consistent in both ARMA and ARMA-GARCH settings. We demonstrate its finite sample performance via a thorough simulation study and apply it to model the daily log-return of the FTSE 100 index.

We consider classification in the situation of two groups with normally distributed data in the ‘large *p* small *n*’ framework. To counterbalance the high number of variables, we consider the thresholded independence rule. An upper bound on the classification error is established that is tailored to a mean value of interest in biological applications.
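A minimal sketch of a thresholded independence rule of the kind described above (an illustrative version, not the paper's exact estimator): features are screened by their standardized mean difference, and a diagonal discriminant is applied on the survivors only.

```python
import numpy as np

def thresholded_independence_rule(X1, X2, x, threshold):
    """Classify x into group 1 or 2 with the independence (diagonal)
    rule, keeping only features whose standardized mean difference
    exceeds `threshold`. Variable names are illustrative."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled per-feature variances; the independence rule ignores correlations
    s2 = ((n1 - 1) * X1.var(axis=0, ddof=1)
          + (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    d = (m1 - m2) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    keep = np.abs(d) > threshold          # hard-threshold weak features
    # diagonal linear discriminant score on the surviving coordinates
    score = np.sum((x[keep] - (m1[keep] + m2[keep]) / 2)
                   * (m1[keep] - m2[keep]) / s2[keep])
    return 1 if score > 0 else 2
```

Thresholding discards the many noise coordinates whose accumulated variance would otherwise swamp the signal when *p* ≫ *n*.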

We study estimation and prediction in linear models where the response and the regressor variable both take values in some Hilbert space. Our main objective is to obtain consistency of a principal component-based estimator for the regression operator under minimal assumptions. In particular, we avoid some inconvenient technical restrictions that have been used throughout the literature. We develop our theory in a time-dependent setup that comprises as important special case the autoregressive Hilbertian model.

In this paper, we consider the deterministic trend model where the error process is allowed to be weakly or strongly correlated and subject to non-stationary volatility. Extant estimators of the trend coefficient are analysed. We find that under heteroskedasticity, the Cochrane–Orcutt-type estimator (with some initial condition) could be less efficient than Ordinary Least Squares (OLS) when the process is highly persistent, whereas it is asymptotically equivalent to OLS when the process is less persistent. An efficient non-parametrically weighted Cochrane–Orcutt-type estimator is then proposed. The efficiency is uniform over weak or strong serial correlation and non-stationary volatility of unknown form. The feasible estimator relies on non-parametric estimation of the volatility function, and the asymptotic theory is provided. We use a data-dependent smoothing bandwidth that can automatically adjust for the strength of non-stationarity in volatilities. The implementation does not require pretesting persistence of the process or specification of non-stationary volatility. Finite-sample evaluation via simulations and an empirical application demonstrates the good performance of the proposed estimators.
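For context, a textbook (unweighted) Cochrane–Orcutt-type trend estimator, the baseline the abstract compares against, can be sketched as follows; this omits the paper's non-parametric volatility weighting entirely:

```python
import numpy as np

def cochrane_orcutt_trend(y):
    """Cochrane-Orcutt-type estimate of a linear trend slope under AR(1)
    errors: fit OLS, estimate the AR coefficient from the residuals,
    quasi-difference, and refit. A textbook sketch only."""
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # initial OLS fit
    e = y - X @ beta
    rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)   # AR(1) coefficient
    y_star = y[1:] - rho * y[:-1]                        # quasi-difference
    X_star = X[1:] - rho * X[:-1]
    return np.linalg.lstsq(X_star, y_star, rcond=None)[0][1]  # trend slope
```

The paper's point is that under non-stationary volatility this quasi-differencing alone can be inefficient, motivating the non-parametrically weighted version.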

Informative dropout is a vexing problem for any biomedical study. Most existing statistical methods attempt to correct estimation bias related to this phenomenon by specifying unverifiable assumptions about the dropout mechanism. We consider a cohort study in Africa that uses an outreach programme to ascertain the vital status of dropout subjects. These data can be used to identify a number of relevant distributions. However, as only a subset of dropout subjects were followed, vital status ascertainment was incomplete. We use semi-competing risk methods as our analysis framework to address this specific case where the terminal event is incompletely ascertained, and consider various procedures for estimating the marginal distribution of dropout and the marginal and conditional distributions of survival. We also consider model selection and estimation efficiency in our setting. Performance of the proposed methods is demonstrated via simulations, asymptotic analysis and an analysis of the study data.

We study errors-in-variables problems when the response is binary and instrumental variables are available. We construct consistent estimators by taking advantage of the prediction relation between the unobservable variables and the instruments. The asymptotic properties of the new estimator are established and illustrated through simulation studies. We also demonstrate that the method extends readily to generalized linear models and beyond. The usefulness of the method is illustrated through a real data example.
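A generic substitution-style sketch of the idea, predicting the latent regressor from the instrument and plugging the prediction into a binary regression, might look as follows; this is an illustrative two-stage construction, not the paper's exact estimator, and all names are hypothetical:

```python
import numpy as np

def iv_binary_two_stage(y, x_obs, z):
    """Illustrative two-stage estimator for a binary response with an
    error-prone regressor x_obs and instrument z: predict the latent
    regressor from z by least squares, then fit a logistic regression
    of y on the prediction via Newton-Raphson."""
    # stage 1: linear projection of the mismeasured regressor on z
    Z = np.column_stack([np.ones_like(z), z])
    x_hat = Z @ np.linalg.lstsq(Z, x_obs, rcond=None)[0]
    # stage 2: logistic regression of y on the predicted regressor
    X = np.column_stack([np.ones_like(x_hat), x_hat])
    beta = np.zeros(2)
    for _ in range(50):                       # Newton-Raphson iterations
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta
```

Using the projection on the instrument, rather than the error-prone measurement itself, removes the attenuation bias that regression on `x_obs` would suffer.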

This paper discusses regression analysis of current status or case I interval-censored failure time data arising from the additive hazards model. In this situation, some covariates could be missing because of various reasons, but there may exist some auxiliary information about the missing covariates. To address the problem, we propose an estimated partial likelihood approach for estimation of regression parameters, which makes use of the available auxiliary information. The method can be easily implemented, and the asymptotic properties of the resulting estimates are established. To assess the finite sample performance of the proposed method, an extensive simulation study is conducted and indicates that the method works well.

We consider the Whittle likelihood estimation of seasonal autoregressive fractionally integrated moving-average models in the presence of an additional measurement error and show that the spectral maximum Whittle likelihood estimator is asymptotically normal. We illustrate by simulation that ignoring measurement errors may result in incorrect inference. Hence, it is pertinent to test for the presence of measurement errors, which we do by developing a likelihood ratio (LR) test within the framework of Whittle likelihood. We derive the non-standard asymptotic null distribution of this LR test and the limiting distribution of the LR test under a sequence of local alternatives. Because, in practice, we do not know the order of the seasonal autoregressive fractionally integrated moving-average model, we consider three modifications of the LR test that take model uncertainty into account. We study the finite-sample size and power of the LR test and its modifications. The efficacy of the proposed approach is illustrated by a real-life example.
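The Whittle likelihood itself compares the periodogram with a candidate spectral density at the Fourier frequencies. A generic sketch (the paper applies this with a seasonal ARFIMA-plus-measurement-error spectral density; here `spec_density` is an arbitrary user-supplied function):

```python
import numpy as np

def whittle_neg_loglik(x, spec_density):
    """Whittle's approximation to the negative log-likelihood:
    sum over Fourier frequencies of log f(w) + I(w)/f(w), where I is
    the periodogram and f the candidate spectral density on (0, pi)."""
    n = len(x)
    freqs = 2 * np.pi * np.arange(1, (n - 1) // 2 + 1) / n
    dft = np.fft.fft(x - np.mean(x))
    I = np.abs(dft[1:(n - 1) // 2 + 1]) ** 2 / (2 * np.pi * n)  # periodogram
    f = np.array([spec_density(w) for w in freqs])
    return float(np.sum(np.log(f) + I / f))
```

An LR test in this framework contrasts the minimized Whittle objective with and without the measurement-error variance term in `spec_density`.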

We consider the problem of supplementing survey data with additional information from a population. The framework we use is very general; examples are missing data problems, measurement error models and combining data from multiple surveys. We do not require the survey data to be a simple random sample of the population of interest. The key assumption we make is that there exists a set of common variables between the survey and the supplementary data. Thus, the supplementary data serve the dual role of providing adjustments to the survey data for model consistency and also enriching the survey data for improved efficiency. We propose a semi-parametric approach using empirical likelihood to combine data from the two sources. The method possesses favourable large and moderate sample properties. We use the method to investigate wage regression using data from the National Longitudinal Survey of Youth Study.

We present a statistical methodology for fitting time-varying rankings, by estimating the strength parameters of the Plackett–Luce multiple comparisons model at regularly spaced times for each ranked item. We use the little-known method of barycentric rational interpolation to interpolate between the strength parameters so that a competitor's strength can be evaluated at any time. We choose the time-varying strengths to evolve deterministically rather than stochastically, a preference that we argue often has merit. There are many statistical and computational problems to overcome in fitting anything beyond ‘toy’ data sets. The methodological innovations here include a method for maximizing a likelihood function for many parameters, approximations for modelling tied data and an approach to the elimination of secular drift of the estimated ‘strengths’. The methodology has obvious applications to fields such as marketing, although we demonstrate our approach by analysing a large data set of golf tournament results, in search of an answer to the question ‘who is the greatest golfer of all time?’
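The static Plackett–Luce log-likelihood at a single time point can be sketched as follows; a ranking is built by repeatedly selecting the next-best item with probability proportional to its exponentiated strength (the time-varying version evaluates the interpolated strengths at each event's date):

```python
import math

def plackett_luce_loglik(rankings, strength):
    """Plackett-Luce log-likelihood. `rankings` is a list of item-index
    sequences, best first; `strength` maps item index to a real-valued
    strength parameter. Illustrative parametrization via exp(strength)."""
    ll = 0.0
    for ranking in rankings:
        remaining = list(ranking)
        while len(remaining) > 1:
            winner = remaining[0]
            # probability the current winner beats everyone still in the pool
            denom = sum(math.exp(strength[i]) for i in remaining)
            ll += strength[winner] - math.log(denom)
            remaining.pop(0)
    return ll
```

With equal strengths every ordering of *k* items has probability 1/*k*!, which gives a quick sanity check on an implementation.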

This article studies a new procedure to test for the equality of *k* regression curves in a fully non-parametric context. The test is based on the comparison of empirical estimators of the characteristic functions of the regression residuals in each population. The asymptotic behaviour of the test statistic is studied in detail. It is shown that under the null hypothesis, the distribution of the test statistic converges to a finite combination of independent chi-squared random variables with one degree of freedom. The coefficients in this linear combination can be consistently estimated. The proposed test is able to detect contiguous alternatives converging to the null at the rate *n*^{−1/2}. The practical performance of the test based on the asymptotic null distribution is investigated by means of simulations.
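An illustrative building block for such a comparison (in the spirit of the test described, but not the paper's exact statistic) is an L2 distance between empirical characteristic functions of two residual samples over a frequency grid:

```python
import numpy as np

def ecf_distance(res1, res2, t_grid):
    """Average squared modulus of the difference between the empirical
    characteristic functions of two residual samples, evaluated on a
    grid of frequencies. An illustrative two-sample statistic."""
    phi1 = np.mean(np.exp(1j * np.outer(t_grid, res1)), axis=1)
    phi2 = np.mean(np.exp(1j * np.outer(t_grid, res2)), axis=1)
    return float(np.mean(np.abs(phi1 - phi2) ** 2))
```

The statistic is zero when the two residual samples coincide and grows when their distributions differ, e.g. under a location shift.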

Parametrically guided non-parametric regression is an appealing method that can reduce the bias of a non-parametric regression function estimator without increasing the variance. In this paper, we adapt this method to the censored data case using an unbiased transformation of the data and a local linear fit. The asymptotic properties of the proposed estimator are established, and its performance is evaluated via finite sample simulations.

In this paper, we consider the linear autoregressive model with varying coefficients *θ*_{n}∈[0,1). When *θ*_{n} tends to the unit root, the moderate deviation principle for the empirical covariance is discussed, and as statistical applications, we provide moderate deviation estimates for the least squares and Yule–Walker estimators of the parameter *θ*_{n}.

Based on the idea of the Nadaraya–Watson (NW) kernel smoother and the technique of the local linear (LL) smoother, we construct NW and LL estimators of conditional mean functions and their derivatives for a left-truncated and right-censored model. The target function includes the regression function, the conditional moment and the conditional distribution function as special cases. It is assumed that the lifetime observations with covariates form a stationary *α*-mixing sequence. Asymptotic normality of the estimators is established. Finite sample behaviour of the estimators is investigated via simulations. A real data illustration is also included.
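For reference, the plain (uncensored, untruncated) NW smoother that the paper adapts can be sketched in a few lines; the truncated-censored version reweights the observations accordingly:

```python
import numpy as np

def nw_estimate(x0, X, Y, h):
    """Nadaraya-Watson kernel estimate of E[Y | X = x0] with a Gaussian
    kernel and bandwidth h. Plain uncensored sketch of the smoother."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)
```

The LL variant fits a local weighted line instead of a local weighted constant, which reduces boundary bias and delivers derivative estimates as a by-product.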

Technical advances in many areas have produced high-dimensional data sets more complicated than the usual high-dimensional data matrix, such as fMRI data collected over a period for independent trials, or expression levels of genes measured in different tissues. Multiple measurements exist for each variable in each sample unit of such data. Regarding the multiple measurements as an element of a Hilbert space, we propose Principal Component Analysis (PCA) in Hilbert space. The principal components (PCs) thus defined carry information about not only the patterns of variation in individual variables but also the relationships between variables. To extract the features with greatest contributions to the explained variation in PCs for high-dimensional data, we also propose sparse PCA in Hilbert space by imposing a generalized elastic-net constraint. Efficient algorithms to solve the optimization problems in our methods are provided. We also propose a criterion for selecting the tuning parameter.
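A simplified sketch of the unpenalized construction, under the assumption that each variable's multiple measurements are treated as a vector in R^m and inner products replace scalar products in the covariance (the elastic-net penalty and the authors' algorithms are omitted):

```python
import numpy as np

def hilbert_pca(X):
    """PCA when each variable carries a vector of measurements:
    X has shape (n_samples, p_variables, m_measurements). Each
    (sample, variable) cell is an element of R^m; the p x p covariance
    is built from inner products of these elements."""
    n, p, m = X.shape
    Xc = X - X.mean(axis=0, keepdims=True)           # centre per variable
    # p x p covariance of Hilbert-valued variables via inner products
    C = np.einsum('nim,njm->ij', Xc, Xc) / (n - 1)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]             # loadings over variables
```

The resulting loadings live on the *p* variables, so each PC summarizes joint variation of whole measurement profiles rather than of single scalar entries.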

Menarche, the onset of menstruation, is an important maturational event of female childhood. Most studies of age at menarche make use of dichotomous (status quo) data. More information can be harnessed from recall data, but such data are often censored in an informative way. We show that the usual maximum likelihood estimator based on interval-censored data, which ignores the informative nature of censoring, can be biased and inconsistent. We propose a parametric estimator of the menarcheal age distribution on the basis of a realistic model of the recall phenomenon. We identify the additional information contained in the recall data and demonstrate theoretically as well as through simulations the advantage of the maximum likelihood estimator based on recall data over that based on status quo data.

This paper focuses on efficient estimation, optimal rates of convergence and effective algorithms in the partly linear additive hazards regression model with current status data. We use polynomial splines to estimate both the cumulative baseline hazard function, under a monotonicity constraint, and the nonparametric regression functions, with no such constraint. We propose simultaneous sieve maximum likelihood estimation of the regression parameters and nuisance parameters and show that the resultant estimator of the regression parameter vector is asymptotically normal and achieves the semiparametric information bound. In addition, we show that the rates of convergence for the estimators of the nonparametric functions are optimal. We implement the proposed estimation through a backfitting algorithm on generalized linear models. We conduct simulation studies to examine the finite-sample performance of the proposed estimation method and present an analysis of renal function recovery data for illustration.