We propose a non-parametric variable selection method which does not rely on any regression model or predictor distribution. The method is based on a new statistical relationship, called *additive conditional independence*, that has been introduced recently for graphical models. Unlike most existing variable selection methods, which target the mean of the response, the method proposed targets a set of attributes of the response, such as its mean, variance or entire distribution. In addition, the additive nature of this approach offers non-parametric flexibility without employing multi-dimensional kernels. As a result it retains high accuracy for high dimensional predictors. We establish estimation consistency, convergence rate and variable selection consistency of the method proposed. Through simulation comparisons we demonstrate that the method proposed performs better than existing methods when the predictors affect several attributes of the response, and that it performs competently in the classical setting where the predictors affect the mean only. We apply the new method to a data set concerning how gene expression levels affect the weight of mice.

Although recovering a Euclidean distance matrix from noisy observations is a common problem in practice, how well it can be done remains largely unknown. To fill this void, we study a simple distance matrix estimate based on the so-called regularized kernel estimate. We show that such an estimate can be characterized as simply applying a constant amount of shrinkage to all observed pairwise distances. This fact allows us to establish risk bounds for the estimate, implying that the true distances can be estimated consistently in an average sense as the number of objects increases. In addition, this characterization suggests an efficient algorithm to compute the distance matrix estimate, as an alternative to the usual second-order cone programming, which is known not to scale well for large problems. Numerical experiments and an application to visualizing the diversity of Vpu protein sequences from a recent study of human immunodeficiency virus type 1 further demonstrate the practical merits of the method proposed.
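
The characterization above suggests a one-line computation. The following Python sketch is a purely hypothetical illustration: it assumes that "constant shrinkage" means subtracting a common constant lam, tied to the regularization level, from every observed pairwise distance and clipping at zero; the paper's exact shrinkage rule may differ.

```python
import numpy as np

def shrink_distance_matrix(D_obs, lam):
    """Hypothetical illustration: shrink every observed pairwise distance
    by a common constant lam (clipping at zero), keeping the diagonal at
    zero. Here lam stands in for a quantity tied to the regularization
    level; the paper's exact shrinkage rule should be consulted."""
    D = np.maximum(D_obs - lam, 0.0)  # uniform shrinkage of all entries
    np.fill_diagonal(D, 0.0)          # distances to self remain zero
    return D
```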

We consider testing regression coefficients in high dimensional generalized linear models. By modifying the test statistic of Goeman and his colleagues for large but fixed dimensional settings, we propose a new test, based on an asymptotic analysis, that is applicable for diverging dimensions and robust in accommodating a wide range of link functions. The power properties of the tests are evaluated asymptotically under two families of alternative hypotheses. In addition, a test in the presence of nuisance parameters is also proposed. The tests can provide *p*-values for testing the significance of multiple gene sets, whose application is demonstrated in a case study on lung cancer.

We study the problem of estimating a compact set *S* from a trajectory of a Brownian motion in *S* with reflections on the boundary of *S*. We establish consistency and rates of convergence for various estimators of *S* and its boundary. This problem has relevant applications in ecology in estimating the home range of an animal on the basis of tracking data. There are a variety of studies on the habitat of animals that employ the notion of home range. The paper offers theoretical foundations for a new methodology that, under fairly unrestrictive shape assumptions, allows us to find flexible regions that are close to the true home range. The theoretical findings are illustrated on simulated and real data examples.

We consider a two-component mixture model with one known component. We develop methods for estimating the mixing proportion and the unknown distribution non-parametrically, given independent and identically distributed data from the mixture model, using ideas from shape-restricted function estimation. We establish the consistency of our estimators. We find the rate of convergence and asymptotic limit of the estimator for the mixing proportion. Completely automated distribution-free honest finite sample lower confidence bounds are developed for the mixing proportion. Connection to the problem of multiple testing is discussed. The identifiability of the model and the estimation of the density of the unknown distribution are also addressed. We compare the estimators proposed, which are easily implementable, with some of the existing procedures through simulation studies and analyse two data sets: one arising from an application in astronomy and the other from a microarray experiment.

We develop new statistical theory for probabilistic principal component analysis models in high dimensions. The focus is the estimation of the noise variance, which is an important and unresolved issue when the number of variables is large in comparison with the sample size. We first unveil the reasons for an observed downward bias of the maximum likelihood estimator of the noise variance when the data dimension is high. We then propose a bias-corrected estimator by using random-matrix theory and establish its asymptotic normality. The superiority of the bias-corrected estimator over existing alternatives is checked by Monte Carlo experiments with various combinations of (*p*, *n*) (the dimension and sample size). Next, we construct a new criterion based on the bias-corrected estimator to determine the number of principal components, and a consistent estimator is obtained. Its good performance is confirmed by a simulation study and real data analysis. The bias-corrected estimator is also used to derive new asymptotics for the related goodness-of-fit statistic under the high dimensional scheme.
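
For context, the maximum likelihood estimator in question has a simple closed form (Tipping and Bishop's probabilistic principal component analysis): the average of the trailing sample covariance eigenvalues. The sketch below computes this classical estimator, whose downward bias in high dimensions is the paper's starting point; the bias correction itself is not reproduced here.

```python
import numpy as np

def ppca_noise_variance_mle(X, q):
    """Classical maximum likelihood noise variance in probabilistic
    principal component analysis: the average of the (p - q) smallest
    eigenvalues of the sample covariance, with q retained components.
    When p is comparable with n, this estimator is biased downwards,
    which is the bias that the paper's random-matrix correction targets."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)               # p x p sample covariance
    eigvals = np.sort(np.linalg.eigvalsh(S))  # ascending order
    return eigvals[: p - q].mean()            # mean of the trailing eigenvalues
```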

Local smoothing testing based on multivariate non-parametric regression estimation is one of the main model checking methodologies in the literature. However, the relevant tests suffer from the typical curse of dimensionality: their statistics converge slowly to their limits under the null hypothesis and deviate only slightly from their null behaviour under alternative hypotheses. This prevents the tests from maintaining the significance level well and makes them less sensitive to alternatives. In the paper, a model adaptation concept in lack-of-fit testing is introduced and a dimension reduction model-adaptive test procedure is proposed for parametric single-index models. The test behaves like a local smoothing test, as if the model were univariate. It is consistent against any global alternative hypothesis and can detect local alternative hypotheses distinct from the null hypothesis at a fast rate that existing local smoothing tests can achieve only when the model is univariate. Simulations are conducted to examine the performance of our methodology. An analysis of real data is shown for illustration. The method can readily be extended to global smoothing methodology and other testing problems.

The problem of non-random sample selectivity often occurs in practice in many fields. The classical estimators introduced by Heckman are the backbone of the standard statistical analysis of these models. However, these estimators are very sensitive to small deviations from the distributional assumptions which are often not satisfied in practice. We develop a general framework to study the robustness properties of estimators and tests in sample selection models. We derive the influence function and the change-of-variance function of Heckman's two-stage estimator, and we demonstrate the non-robustness of this estimator and its estimated variance to small deviations from the model assumed. We propose a procedure for robustifying the estimator, prove its asymptotic normality and give its asymptotic variance. Both cases with and without an exclusion restriction are covered. This allows us to construct a simple robust alternative to the sample selection bias test. We illustrate the use of our new methodology in an analysis of ambulatory expenditures and we compare the performance of the classical and robust methods in a Monte Carlo simulation study.

We propose a likelihood ratio scan method for estimating multiple change points in piecewise stationary processes. Using scan statistics reduces the computationally infeasible global multiple-change-point estimation problem to a number of single-change-point detection problems in various local windows. The computation can be performed efficiently, in *O*{*n* log (*n*)} time. Consistency of the estimated number and locations of the change points is established. Moreover, a procedure is developed for constructing confidence intervals for each of the change points. Simulation experiments and real data analyses are conducted to illustrate the efficiency of the likelihood ratio scan method.
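
To convey the windowing logic, here is a toy Python sketch that substitutes a simple Gaussian mean-change statistic for the paper's likelihood ratio for piecewise stationary processes; local maxima of the scan statistic above a threshold flag candidate change points.

```python
import numpy as np

def scan_statistics(x, h):
    """Toy scan: at each centre t, compare the means of the half-windows
    x[t-h:t] and x[t:t+h] through a standardized difference (a Gaussian
    mean-change statistic). The paper instead uses likelihood ratios
    suited to piecewise stationary time series."""
    n = len(x)
    stats = np.full(n, -np.inf)
    for t in range(h, n - h):
        left, right = x[t - h:t], x[t:t + h]
        pooled_var = (left.var(ddof=1) + right.var(ddof=1)) / 2
        stats[t] = abs(left.mean() - right.mean()) / np.sqrt(2 * pooled_var / h)
    return stats

def candidate_change_points(stats, h, threshold):
    """Keep local maxima of the scan statistic that exceed the threshold,
    enforcing a minimum separation of h between accepted peaks."""
    picked = []
    for t in np.argsort(stats)[::-1]:
        if stats[t] < threshold:
            break
        if all(abs(t - s) >= h for s in picked):
            picked.append(int(t))
    return sorted(picked)
```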

Principal stratification is a causal framework for analysing randomized experiments with a post-treatment variable lying between the treatment and end point variables. Because the principal strata defined by the potential outcomes of the post-treatment variable are not observable, we generally cannot identify the causal effects within principal strata. Motivated by a real data set from phase III adjuvant colon cancer clinical trials, we propose approaches to identifying and estimating the principal causal effects via multiple trials. To achieve identifiability, we replace the commonly used exclusion restriction assumption with the assumption that the principal causal effects are homogeneous across these trials. To remove another commonly used assumption, monotonicity, we give a necessary condition for local identifiability, which requires at least three trials. Applying our approaches to the data from the adjuvant colon cancer clinical trials, we find that the commonly used monotonicity assumption is untenable, and that disease-free survival with 3-year follow-up is a valid surrogate end point for overall survival with 5-year follow-up, satisfying both causal necessity and causal sufficiency. We also propose a sensitivity analysis approach based on Bayesian hierarchical models to investigate the effect of deviations from the homogeneity assumption.

The paper investigates the estimation of a wide class of multivariate volatility models. Instead of estimating an *m*-dimensional multivariate volatility model directly, a much simpler and numerically more efficient method consists in estimating *m* univariate generalized auto-regressive conditional heteroscedasticity type models equation by equation in a first step, and a correlation matrix in a second step. Strong consistency and asymptotic normality of the equation-by-equation estimator are established in a very general framework, including dynamic conditional correlation models. The equation-by-equation estimator can be used to test the restrictions imposed by a particular multivariate generalized auto-regressive conditional heteroscedasticity specification. For general constant conditional correlation models, we obtain the consistency and asymptotic normality of the two-step estimator. Comparisons with the global method, in which the model parameters are estimated in one step, are provided. Monte Carlo experiments and applications to financial series illustrate the usefulness of the approach.
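
The two-step recipe is easy to sketch in the simplest constant conditional correlation case. The sketch below assumes the third-party Python package arch for the univariate fits; it illustrates the equation-by-equation idea rather than reproducing the authors' exact estimator.

```python
import numpy as np
from arch import arch_model  # third-party package, assumed available

def equation_by_equation_ccc(returns):
    """Step 1: fit a univariate GARCH(1, 1) to each column and form the
    standardized residuals. Step 2: estimate a constant conditional
    correlation matrix from those residuals."""
    n, m = returns.shape
    std_resid = np.empty((n, m))
    for j in range(m):
        res = arch_model(returns[:, j], vol="GARCH", p=1, q=1).fit(disp="off")
        std_resid[:, j] = res.resid / res.conditional_volatility
    R = np.corrcoef(std_resid, rowvar=False)  # constant correlation estimate
    return R, std_resid
```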

Variable selection is a challenging issue in statistical applications when the number of predictors *p* far exceeds the number of observations *n*. In this ultrahigh dimensional setting, the sure independence screening procedure was introduced to reduce the dimensionality significantly by preserving the true model with overwhelming probability, before a refined second-stage analysis. However, the aforementioned sure screening property strongly relies on the assumption that the important variables in the model have large marginal correlations with the response, which rarely holds in reality. To overcome this, we propose a novel and simple screening technique called high dimensional ordinary least squares projection, which we refer to as ‘HOLP’. We show that HOLP has the sure screening property and gives consistent variable selection without the strong correlation assumption, and that it has low computational complexity. A ridge-type HOLP procedure is also discussed. A simulation study shows that HOLP performs competitively compared with many other marginal correlation-based methods. An application to a mammalian eye disease data set illustrates the attractiveness of HOLP.
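
The HOLP screener itself is essentially a one-line computation: with *p* > *n*, project *y* through X'(XX')^{-1} and rank predictors by the magnitude of the resulting coefficients (adding a ridge term gives the ridge-type variant). A minimal sketch:

```python
import numpy as np

def holp_screen(X, y, k, ridge=0.0):
    """Rank predictors by |beta_hat|, where beta_hat = X'(XX' + rI)^{-1} y.
    With ridge r = 0 this is the HOLP estimator (requiring p > n so that
    XX' is invertible); r > 0 gives the ridge-type variant."""
    n, p = X.shape
    G = X @ X.T + ridge * np.eye(n)            # n x n Gram matrix
    beta = X.T @ np.linalg.solve(G, y)         # projected coefficients
    return np.argsort(np.abs(beta))[::-1][:k]  # indices of the top k predictors
```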

The estimation of average treatment effects based on observational data is extremely important in practice and has been studied by generations of statisticians under different frameworks. Existing globally efficient estimators require non-parametric estimation of a propensity score function, an outcome regression function or both, but their performance can be poor at practical sample sizes. Without explicitly estimating either function, we consider a wide class of calibration weights constructed to attain an exact three-way balance of the moments of observed covariates among the treated, the control and the combined group. The wide class includes exponential tilting, empirical likelihood and generalized regression as important special cases, and extends survey calibration estimators to different statistical problems, albeit with important distinctions. Global semiparametric efficiency for the estimation of average treatment effects is established for this general class of calibration estimators. The results show that efficiency can be achieved by solely balancing the covariate distributions, without resorting to direct estimation of the propensity score or outcome regression function. We also propose a consistent estimator for the efficient asymptotic variance, which does not involve additional functional estimation of either the propensity score or the outcome regression function. The variance estimator proposed outperforms existing estimators that require a direct approximation of the efficient influence function.
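
As a simplified illustration of moment-balancing calibration (balancing a single group's covariate means to given targets, rather than the paper's full three-way balance), exponential tilting weights can be obtained by solving a smooth convex dual problem:

```python
import numpy as np
from scipy.optimize import minimize

def exponential_tilting_weights(X_group, target_means):
    """Find weights w_i proportional to exp(lambda' x_i) on one group so
    that the weighted covariate means equal target_means. Minimizing the
    log-sum-exp dual is convex, and its first-order condition is exactly
    the moment balance (a solution exists when the targets lie in the
    interior of the convex hull of the group's covariates)."""
    Z = X_group - target_means  # centre covariates at the target moments
    dual = lambda lam: np.log(np.exp(Z @ lam).sum())
    lam = minimize(dual, np.zeros(Z.shape[1]), method="BFGS").x
    w = np.exp(Z @ lam)
    return w / w.sum()          # normalized calibration weights
```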

Identifying leading measurement units from a large collection is a common task in large-scale inference. Testing approaches, which measure evidence against a null hypothesis rather than effect magnitude, tend to overpopulate lists of leading units with those associated with low measurement error. By contrast, local maximum likelihood approaches tend to favour units with high measurement error. Available Bayesian and empirical Bayesian approaches rely on specialized loss functions that result in similar deficiencies. We describe and evaluate a generic empirical Bayesian ranking procedure that populates the list of top units in a way that maximizes the expected overlap between the true and reported top lists for all list sizes. The procedure relates unit-specific posterior upper tail probabilities with their empirical distribution to yield a ranking variable. It discounts high variance units less than popular non-maximum-likelihood methods do and thus achieves improved operating characteristics in the models considered.
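
Such tail probability ranking is easy to illustrate in a normal-normal empirical Bayes model (an assumed model for illustration, not necessarily the paper's specification): rank units by the posterior probability that their parameter exceeds a reference value *c*.

```python
import numpy as np
from scipy import stats

def posterior_tail_ranking(y, se, c):
    """Normal-normal empirical Bayes: theta_i ~ N(mu, tau2) and
    y_i | theta_i ~ N(theta_i, se_i^2). Rank units by the posterior upper
    tail probability P(theta_i > c | y_i), which discounts high variance
    units less aggressively than ranking on shrunken point estimates."""
    mu = y.mean()
    tau2 = max(y.var(ddof=1) - np.mean(se**2), 1e-8)  # method of moments
    shrink = tau2 / (tau2 + se**2)
    post_mean = mu + shrink * (y - mu)
    post_sd = np.sqrt(shrink * se**2)
    tail = 1 - stats.norm.cdf(c, loc=post_mean, scale=post_sd)
    return np.argsort(tail)[::-1], tail  # units ordered by tail probability
```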

The paper brings together the theory and practice of local linear kernel hazard estimation. Bandwidth selection is fully analysed, including double one-sided cross-validation that is shown to have good practical and theoretical properties. Insight is provided into the choice of the weighting function in the local linear minimization and it is pointed out that classical weighting sometimes lacks stability. A new semiparametric hazard estimator transforming the survival data before smoothing is introduced and shown to have good practical properties.

We study the properties of points in [0, 1]^*d* generated by applying Hilbert's space filling curve to uniformly distributed points in [0, 1]. For deterministic sampling we obtain a discrepancy of O(n^{-1/d}) for *d*⩾2. For random stratified sampling, and scrambled van der Corput points, we derive a mean-squared error of O(n^{-1-2/d}) for integration of Lipschitz continuous integrands, when *d*⩾3. These rates are the same as those obtained by sampling on *d*-dimensional grids and they show a deterioration with increasing *d*. The rate for Lipschitz functions is, however, the best possible at that level of smoothness and is better than plain independent and identically distributed sampling. Unlike grids, space filling curve sampling provides points at any desired sample size, and the van der Corput version is extensible in *n*. We also introduce a class of piecewise Lipschitz functions whose discontinuities are in rectifiable sets described via Minkowski content. Although these functions may have infinite variation in the sense of Hardy and Krause, they can be integrated with a mean-squared error of O(n^{-1-1/d}). It was previously known only that the rate was o(n^{-1}). Other space filling curves, such as those due to Sierpiński and Peano, also attain these rates, whereas upper bounds for the Lebesgue curve are somewhat worse, as if the dimension were log₂ 3 times as high.
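
In two dimensions the construction is simple to reproduce: map stratified uniform points through the classical integer Hilbert curve and rescale. The curve-walking routine below is the standard distance-to-coordinate conversion; the jitter within grid cells is an illustrative choice, not the paper's exact scheme.

```python
import numpy as np

def hilbert_d2xy(order, d):
    """Classical conversion of a distance d along the Hilbert curve to
    coordinates (x, y) on a 2^order x 2^order grid."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                        # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_stratified_sample(n, order=10, rng=None):
    """Map stratified uniforms u_i = (i + U_i)/n through the curve and
    jitter within each grid cell, yielding points in [0, 1]^2."""
    rng = np.random.default_rng(rng)
    m = 1 << order                           # grid side length
    u = (np.arange(n) + rng.random(n)) / n   # stratified uniforms in [0, 1]
    dist = (u * m * m).astype(int)           # positions along the curve
    xy = np.array([hilbert_d2xy(order, d) for d in dist], dtype=float)
    return (xy + rng.random((n, 2))) / m
```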

We study generalized additive models, with shape restrictions (e.g. monotonicity, convexity and concavity) imposed on each component of the additive prediction function. We show that this framework facilitates a non-parametric estimator of each additive component, obtained by maximizing the likelihood. The procedure is free of tuning parameters and under mild conditions is proved to be uniformly consistent on compact intervals. More generally, our methodology can be applied to generalized additive index models. Here again, the procedure can be justified on theoretical grounds and, like the original algorithm, has highly competitive finite sample performance. Practical utility is illustrated through the use of these methods in the analysis of two real data sets. Our algorithms are publicly available in the R package scar, short for shape-constrained additive regression.

We develop a connection between mixture and envelope representations of objective functions that arise frequently in statistics. We refer to this connection by using the term ‘hierarchical duality’. Our results suggest an interesting and previously underexploited relationship between marginalization and profiling, or equivalently between the Fenchel–Moreau theorem for convex functions and the Bernstein–Widder theorem for Laplace transforms. We give several different sets of conditions under which such a duality result obtains. We then extend existing work on envelope representations in several ways, including novel generalizations to variance–mean models and to multivariate Gaussian location models. This turns out to provide an elegant missing data interpretation of the proximal gradient method, which is a widely used algorithm in machine learning. We show several statistical applications in which the framework proposed leads to easily implemented algorithms, including a robust version of the fused lasso, non-linear quantile regression via trend filtering and the binomial fused double-Pareto model. Code for the examples is available on GitHub at https://github.com/jgscott/hierduals.
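
Because the proximal gradient method anchors the discussion, a minimal reference implementation may be useful. It is specialized here to the ordinary lasso, where the proximal operator of the ℓ₁ penalty is soft thresholding; the paper's fused lasso and double-Pareto examples would use different proximal operators.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: elementwise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, iters=500):
    """Minimize (1/2)||y - Xb||^2 + lam * ||b||_1 by iterating
    b <- prox(b - step * gradient), with step = 1/L for the Lipschitz
    constant L of the gradient of the smooth part."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ b - y)             # gradient of the smooth part
        b = soft_threshold(b - step * grad, step * lam)
    return b
```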

Variable order Markov chains have been used to model discrete sequential data in a variety of fields. A host of methods exist to estimate the history-dependent lengths of memory which characterize these models and to predict new sequences. In several applications, the data-generating mechanism is known to be reversible, but combining this information with the procedures mentioned is far from trivial. We introduce a Bayesian analysis for reversible dynamics, which takes into account uncertainty in the lengths of memory. The model proposed is applied to the analysis of molecular dynamics simulations and compared with several popular algorithms.

Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation that is not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, which is generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start impact study, which is a large-scale randomized evaluation of a Federal preschool programme, finding that there is indeed significant unexplained treatment effect variation.

A major challenge in many modern superresolution fluorescence microscopy techniques at the nanoscale lies in the correct alignment of long sequences of sparse but spatially and temporally highly resolved images. This is caused by the temporal drift of the protein structure, e.g. due to thermal inhomogeneity of the object of interest or its supporting area during the observation process. We develop a simple semiparametric model for drift correction in single-marker switching microscopy. Then we propose an *M*-estimator for the drift and show its asymptotic normality. This is used to correct the final image and it is shown that this purely statistical method is competitive with state of the art calibration techniques that require the incorporation of fiducial markers in the specimen. Moreover, a simple bootstrap algorithm allows us to quantify the precision of the drift estimate and its effect on the final image estimation. We argue that purely statistical drift correction is even more robust than fiducial tracking, rendering the latter superfluous in many applications. The practicability of our method is demonstrated by a simulation study and by a single-marker switching application. This serves as a prototype for many other typical imaging techniques where sparse observations with high temporal resolution are blurred by motion of the object to be reconstructed.

A conventional linear model for functional data involves expressing a response variable *Y* in terms of the explanatory function *X*(*t*), via the model *Y* = *a* + ∫_*I* *b*(*t*)*X*(*t*) d*t* + error, where *a* is a scalar, *b* is an unknown function and *I* = [0, *α*] is a compact interval. However, in some problems the support of *b* or *X*, say *I*₀, is a proper and unknown subset of *I*, and is a quantity of particular practical interest. Motivated by a real data example involving particulate emissions, we develop methods for estimating *I*₀. We give particular emphasis to the case *I*₀ = [0, *θ*], where *θ* ∈ (0, *α*], and suggest two methods for estimating *a*, *b* and *θ* jointly; we introduce techniques for selecting tuning parameters; and we explore properties of our methodology by using both simulation and the real data example mentioned above. Additionally, we derive theoretical properties of the methodology and discuss implications of the theory. Our theoretical arguments give particular emphasis to the problem of identifiability.

We define an empirical likelihood approach which gives consistent design-based confidence intervals that can be calculated without the need for variance estimates, design effects, resampling, joint inclusion probabilities or linearization, even when the point estimator is not linear. It can be used to construct confidence intervals for a large class of sampling designs and estimators which are solutions of estimating equations. It can be used for means, regression coefficients, quantiles, totals or counts even when the population size is unknown. It can be used with large sampling fractions and naturally includes calibration constraints. It can be viewed as an extension of the empirical likelihood approach to complex survey data. This approach is computationally simpler than the pseudoempirical likelihood and the bootstrap approaches. A simulation study shows that the confidence interval proposed may give better coverage than the confidence intervals based on linearization, the bootstrap and the pseudoempirical likelihood. It also shows that, under complex sampling designs, standard confidence intervals based on normality may have poor coverage, because point estimators may not follow a normal sampling distribution and their variance estimators may be biased.

We propose simple methods for multivariate diffusion bridge simulation, which plays a fundamental role in simulation-based likelihood and Bayesian inference for stochastic differential equations. By a novel application of classical coupling methods, the new approach generalizes a previously proposed simulation method for one-dimensional bridges to the multivariate setting. First a method of simulating approximate, but often very accurate, diffusion bridges is proposed. These approximate bridges are used as proposals for easily implementable Markov chain Monte Carlo algorithms that, apart from a small discretization error, produce exact diffusion bridges. The new method is more generally applicable than previous methods. Another advantage is that the new method works well for diffusion bridges in long intervals because the computational complexity of the method is linear in the length of the interval. In a simulation study the new method performs well, and its usefulness is illustrated by an application to Bayesian estimation for the multivariate hyperbolic diffusion model.

Most of the literature on change point analysis by means of hypothesis testing considers hypotheses of the form *H*₀: *θ*₁ = *θ*₂ *versus* *H*₁: *θ*₁ ≠ *θ*₂, where *θ*₁ and *θ*₂ denote parameters of the process before and after a change point. The paper takes a different perspective and investigates the null hypothesis of *no relevant changes*, i.e. *H*₀: ‖*θ*₁ − *θ*₂‖ ⩽ Δ, where ‖·‖ is an appropriate norm. This formulation of the testing problem is motivated by the fact that in many applications a modification of the statistical analysis might not be necessary if the difference between the parameters before and after the change point is small. A general approach to problems of this type is developed which is based on the cumulative sum principle. For the asymptotic analysis, weak convergence of the sequential empirical process must be established under the alternative of non-stationarity, and it is shown that the resulting test statistic is asymptotically normally distributed. The results can also be used to establish similarity of the parameters, i.e. ‖*θ*₁ − *θ*₂‖ < Δ, at a controlled type I error and to estimate the magnitude of the change with a corresponding confidence interval. Several applications of the methodology are given, including tests for relevant changes in the mean, the variance, the parameter in a linear regression model and the distribution function, among others. The finite sample properties of the new tests are investigated by means of a simulation study and illustrated by analysing a data example from portfolio management.

The upper bounds on the coverage probabilities of the confidence regions based on blockwise empirical likelihood and non-standard expansive empirical likelihood methods for time series data are investigated via studying the probability of violating the convex hull constraint. The large sample bounds are derived on the basis of the pivotal limit of the blockwise empirical log-likelihood ratio obtained under fixed *b* asymptotics, which has recently been shown to provide a more accurate approximation to the finite sample distribution than the conventional χ²-approximation. Our theoretical and numerical findings suggest that both the finite sample and the large sample upper bounds for coverage probabilities are strictly less than 1, and that the blockwise empirical likelihood confidence region can exhibit serious undercoverage when the dimension of moment conditions is moderate or large, the time series dependence is positively strong or the block size is large relative to the sample size. A similar finite sample coverage problem occurs for non-standard expansive empirical likelihood. To alleviate the coverage bound problem, we propose to penalize both empirical likelihood methods by relaxing the convex hull constraint. Numerical simulations and data illustrations demonstrate the effectiveness of our proposed remedies in terms of delivering confidence sets with more accurate coverage. Some technical details and additional simulation results are included in the on-line supplemental material.

We consider a multiple-hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point *k*. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection by using recent results on *p*-values in sequential model selection settings.
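
One well-known rule of this kind is the ForwardStop procedure, which rejects the longest initial block whose transformed *p*-values average below the target level; the Python sketch below illustrates the ordered testing setting, without claiming to reproduce the paper's exact two procedures.

```python
import numpy as np

def forward_stop(pvalues, alpha):
    """ForwardStop-type rule: reject hypotheses 1, ..., k_hat, where k_hat
    is the largest k with -(1/k) * sum_{i <= k} log(1 - p_i) <= alpha.
    Returns 0 when no initial block qualifies."""
    p = np.asarray(pvalues, dtype=float)
    running = np.cumsum(-np.log(1 - p)) / np.arange(1, len(p) + 1)
    below = np.nonzero(running <= alpha)[0]
    return 0 if below.size == 0 else int(below[-1]) + 1  # stopping point
```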

Motivated both by the shortcomings of high order density estimators and by the increasingly large data sets in many areas of modern science, we introduce new high order non-parametric density estimators that are guaranteed to be positive and do not have highly oscillatory tails. Our approach is based on data perturbation, e.g. by tilting or data sharpening. It leads to new estimators that are more accurate than conventional kernel techniques using positive kernels, but which nevertheless enjoy the positivity property, and are far less ‘wiggly’ than high order kernel estimators. We investigate performance by theoretical analysis and in a numerical study.

The paper uses a random-weighting (RW) method to bootstrap the critical values for the Ljung–Box or Monti portmanteau tests and weighted Ljung–Box or Monti portmanteau tests in weak auto-regressive moving average models. Unlike the existing methods, no user-chosen parameter is needed to implement the RW method. As an application, these four tests are used to check the model adequacy in power generalized auto-regressive conditional heteroscedasticity models. Simulation evidence indicates that the weighted portmanteau tests have a power advantage over other existing tests. A real example on the Standard and Poor's 500 index illustrates the merits of our testing procedure. As an extension, the blockwise RW method is also studied.

We consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from *n* subjects, each of whom contributes *T* possibly dependent observations. The graphical models of the subjects vary, but are assumed to change smoothly according to a measure of closeness between subjects. We propose a kernel-based method for jointly estimating all the graphical models. Theoretically, under a double asymptotic framework, where both (*T*, *n*) and the dimension *d* can increase, we provide an explicit rate of convergence in parameter estimation. It characterizes the strength that one can borrow across different individuals and the effect of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging data illustrate the effectiveness of the method proposed.
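
The kernel idea can be sketched as follows: form a kernel-weighted average of subject-level sample covariances, so that nearby subjects borrow strength, and feed it to a sparse precision estimator. The sketch uses scikit-learn's graphical_lasso as a stand-in for the paper's estimator; the Gaussian kernel and bandwidth are illustrative choices.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def kernel_weighted_precision(data, positions, i, bandwidth, alpha):
    """data: list of (T x d) arrays, one per subject; positions: subject
    locations on a closeness measure. Estimate subject i's precision
    matrix from a kernel-weighted average of subject-level sample
    covariances, so that similar subjects borrow strength."""
    positions = np.asarray(positions, dtype=float)
    w = np.exp(-0.5 * ((positions - positions[i]) / bandwidth) ** 2)
    w /= w.sum()                              # Gaussian kernel weights
    d = data[0].shape[1]
    S = np.zeros((d, d))
    for wj, Xj in zip(w, data):
        Xc = Xj - Xj.mean(axis=0)
        S += wj * (Xc.T @ Xc) / Xj.shape[0]   # weighted sample covariance
    _, precision = graphical_lasso(S, alpha=alpha)
    return precision
```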