Practitioners are interested in not only the average causal effect of a treatment on the outcome but also the underlying causal mechanism in the presence of an intermediate variable between the treatment and outcome. However, in many cases we cannot randomize the intermediate variable, resulting in sample selection problems even in randomized experiments. Therefore, we view randomized experiments with intermediate variables as semiobservational studies. In parallel with the analysis of observational studies, we provide a theoretical foundation for conducting objective causal inference with an intermediate variable under the principal stratification framework, with principal strata defined as the joint potential values of the intermediate variable. Our strategy constructs weighted samples based on principal scores, defined as the conditional probabilities of the latent principal strata given covariates, without access to any outcome data. This principal stratification analysis yields robust causal inference without relying on any model assumptions on the outcome distributions. We also propose approaches to conducting sensitivity analysis for violations of the ignorability and monotonicity assumptions: the very crucial but untestable identification assumptions in our theory. When the assumptions required by the classical instrumental variable analysis cannot be justified by background knowledge or cannot be made because of scientific questions of interest, our strategy serves as a useful alternative tool to deal with intermediate variables. We illustrate our methodologies by using two real data examples and find scientifically meaningful conclusions.

The non-causal auto-regressive process with heavy-tailed errors has non-linear causal dynamics, which allow for local explosion or asymmetric cycles that are often observed in economic and financial time series. It provides a new model for multiple local explosions in a strictly stationary framework. The causal predictive distribution displays surprising features, such as higher moments than for the marginal distribution, or the presence of a unit root in the Cauchy case. Aggregating such models can yield complex dynamics with local and global explosion as well as variation in the rate of explosion. The asymptotic behaviour of a vector of sample auto-correlations is studied in a semiparametric non-causal AR(1) framework with Pareto-like tails, and diagnostic tests are proposed. Empirical results based on the Nasdaq composite price index are provided.
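
As a rough illustration of the dynamics described above, the following Python sketch simulates a non-causal AR(1) path with Cauchy errors by running the recursion backward in time. The recursion `x[t] = rho * x[t+1] + eps[t]`, the coefficient value, the zero terminal value, the burn-in length and the function name are all assumptions made for illustration, not the authors' specification.

```python
import math
import random

def simulate_noncausal_ar1(n, rho, rng, burn=200):
    """Simulate x_t = rho * x_{t+1} + eps_t with standard Cauchy eps_t.

    The recursion is run backward from an arbitrary terminal value of 0;
    the `burn` latest time points are discarded to reduce its influence.
    """
    total = n + burn
    # Standard Cauchy draws via the inverse-CDF transform.
    eps = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(total)]
    x = [0.0] * (total + 1)
    for t in range(total - 1, -1, -1):
        x[t] = rho * x[t + 1] + eps[t]
    return x[:n], eps[:n]

rng = random.Random(1)
path, shocks = simulate_noncausal_ar1(n=500, rho=0.8, rng=rng)
```

Read in calendar time, a large future shock is anticipated by the path: values build up geometrically before the shock date and drop afterwards, which is one way to picture the local explosions mentioned above.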

We consider causal mediation analysis when exposures and mediators vary over time. We give non-parametric identification results, discuss parametric implementation and also provide a weighting approach to direct and indirect effects based on combining the results of two marginal structural models. We also discuss how our results give rise to a causal interpretation of the effect estimates produced from longitudinal structural equation models. When there are time varying confounders affected by prior exposure and a mediator, natural direct and indirect effects are not identified. However, we define a randomized interventional analogue of natural direct and indirect effects that are identified in this setting. The formula that identifies these effects we refer to as the ‘mediational *g*-formula’. When there is no mediation, the mediational *g*-formula reduces to Robins's regular *g*-formula for longitudinal data. When there are no time varying confounders affected by prior exposure and mediator values, then the mediational *g*-formula reduces to a longitudinal version of Pearl's mediation formula. However, the mediational *g*-formula itself can accommodate both mediation and time varying confounders and constitutes a general approach to mediation analysis with time varying exposures and mediators.

We propose novel optimal designs for longitudinal data for the common situation where the resources for longitudinal data collection are limited, by determining the optimal locations in time where measurements should be taken. As for all optimal designs, some prior information is needed to implement the optimal designs proposed. We demonstrate that this prior information may come from a pilot longitudinal study that has irregularly measured and noisy measurements, where for each subject one has available a small random number of repeated measurements that are randomly located on the domain. A second possibility of interest is that a pilot study consists of densely measured functional data and one intends to take only a few measurements at strategically placed locations in the domain for the future collection of similar data. We construct optimal designs by targeting two criteria: optimal designs to recover the unknown underlying smooth random trajectory for each subject from a few optimally placed measurements such that squared prediction errors are minimized; optimal designs that minimize prediction errors for functional linear regression with functional or longitudinal predictors and scalar responses, again from a few optimally placed measurements. The optimal designs proposed address the need for sparse data collection when planning longitudinal studies, by taking advantage of the close connections between longitudinal and functional data analysis. We demonstrate in simulations that the designs perform considerably better than randomly chosen design points and include a motivating data example from the Baltimore Longitudinal Study of Aging. The designs are shown to have an asymptotic optimality property.

It is common that, in multiarm randomized trials, the outcome of interest is ‘truncated by death’, meaning that it is only observed or well-defined conditioning on an intermediate outcome. In this case, in addition to pairwise contrasts, the joint inference for all treatment arms is also of interest. Under a monotonicity assumption we present methods for both pairwise and joint causal analyses of ordinal treatments and binary outcomes in the presence of truncation by death. We illustrate via examples the appropriateness of our assumptions in different scientific contexts.

Statistical inferences for sample correlation matrices are important in high dimensional data analysis. Motivated by this, the paper establishes a new central limit theorem for a linear spectral statistic of high dimensional sample correlation matrices for the case where the dimension *p* and the sample size *n* are comparable. This result is of independent interest in large dimensional random-matrix theory. We also further investigate the sample correlation matrices of a high dimensional vector whose elements have a special correlated structure and the corresponding central limit theorem is developed. Meanwhile, we apply the linear spectral statistic to an independence test for *p* random variables, and then an equivalence test for *p* factor loadings and *n* factors in a factor model. The finite sample performance of the test proposed shows its applicability and effectiveness in practice. An empirical application to test the independence of household incomes from various cities in China is also conducted.

We propose a novel sparse tensor decomposition method, namely the tensor truncated power method, that incorporates variable selection in the estimation of decomposition components. The sparsity is achieved via an efficient truncation step embedded in the tensor power iteration. Our method applies to a broad family of high dimensional latent variable models, including high dimensional Gaussian mixtures and mixtures of sparse regressions. A thorough theoretical investigation is further conducted. In particular, we show that the final decomposition estimator is guaranteed to achieve a local statistical rate, and we further strengthen it to the global statistical rate by introducing a proper initialization procedure. In high dimensional regimes, the statistical rate obtained significantly improves those shown in the existing non-sparse decomposition methods. The empirical advantages of tensor truncated power are confirmed in extensive simulation results and two real applications of click-through rate prediction and high dimensional gene clustering.

In the Gaussian linear regression model (with unknown mean and variance), we show that the standard confidence set for one or two regression coefficients is admissible in the sense of Joshi. This solves a long-standing open problem in mathematical statistics, and this has important implications on the performance of modern inference procedures post model selection or post shrinkage, particularly in situations where the number of parameters is larger than the sample size. As a technical contribution of independent interest, we introduce a new class of conjugate priors for the Gaussian location–scale model.

A non-parametric extension of control variates is presented. These leverage gradient information on the sampling density to achieve substantial variance reduction. It is not required that the sampling density be normalized. The novel contribution of this work is based on two important insights: a trade-off between random sampling and deterministic approximation and a new gradient-based function space derived from Stein's identity. Unlike classical control variates, our estimators improve rates of convergence, often requiring orders of magnitude fewer simulations to achieve a fixed level of precision. Theoretical and empirical results are presented, the latter focusing on integration problems arising in hierarchical models and models based on non-linear ordinary differential equations.

Most time series that are encountered in practice contain non-zero trend, yet textbook approaches to time series analysis are typically focused on zero-mean stationary auto-regressive moving average (ARMA) processes. Trend is often estimated by *ad hoc* methods and subtracted from time series, and the residuals are used as the true ARMA noise for data analysis and inference, including parameter estimation, lag selection and prediction. We propose a theoretically justified two-step method to analyse time series consisting of a smooth trend function and ARMA error term, which is computationally efficient and easy for practitioners to implement. The trend is estimated by *B*-spline regression, and the maximum likelihood estimator based on residuals is shown to be oracally efficient in the sense that it is asymptotically as efficient as if the true trend function were known and then removed to obtain the ARMA errors. In addition, consistency of the Bayesian information criterion for model selection is established for the detrended residual sequence. Finite sample performance of the procedure is illustrated by simulation studies and real data analysis.

The modifiable areal unit problem and the ecological fallacy are known problems that occur when modelling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By ‘regionalization’ we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error, which we minimize to obtain an optimal regionalization. To define the criterion we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen–Loève expansion. This relationship between the criterion for spatial aggregation error and the multiscale Karhunen–Loève expansion leads to illuminating theoretical developments including connections between spatial aggregation error, squared prediction error, spatial variance and a novel extension of Obled–Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two data sets: one using the American Community Survey and one related to environmental ocean winds.

Using properties of shuffles of copulas and tools from combinatorics we solve the open question about the exact region Ω determined by all possible values of Kendall's *τ* and Spearman's *ρ*. In particular, we prove that the well-known inequality established by Durbin and Stuart in 1951 is not sharp outside a countable set, give a simple analytic characterization of Ω in terms of a continuous, strictly increasing piecewise concave function and show that Ω is compact and simply connected, but not convex. The results also show that for each (*x*,*y*) ∈ Ω there are mutually completely dependent random variables *X* and *Y* whose *τ*- and *ρ*-values coincide with *x* and *y* respectively.
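
For orientation, the classical bounds that the exact region Ω refines can be written as follows. This is the usual textbook statement of the Daniels and Durbin–Stuart inequalities, given here in assumed notation with *τ* and *ρ* the population values; it is not taken from the paper itself.

$$
|3\tau - 2\rho| \le 1,
\qquad
\frac{(1+\tau)^2}{2} - 1 \;\le\; \rho \;\le\; 1 - \frac{(1-\tau)^2}{2}.
$$

The first inequality is attributed to Daniels and the second pair to Durbin and Stuart; together they describe a set that contains Ω.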

Sparse regression techniques have been popular in recent years because of their ability to handle high dimensional data with built-in variable selection. The lasso is perhaps one of the best-known examples. Despite intensive work in this direction, how to provide valid inference for sparse regularized methods remains a challenging statistical problem. We take a unique point of view of this problem and propose to make use of stochastic variational inequality techniques in optimization to derive confidence intervals and regions for the lasso. Some theoretical properties of the procedure are obtained. Both simulated and real data examples are used to demonstrate the performance of the method.

The paper considers the computer model calibration problem and provides a general frequentist solution. Under the framework proposed, the data model is semiparametric with a non-parametric discrepancy function which accounts for any discrepancy between physical reality and the computer model. In an attempt to solve a fundamentally important (but often ignored) identifiability issue between the computer model parameters and the discrepancy function, the paper proposes a new and identifiable parameterization of the calibration problem. It also develops a two-step procedure for estimating all the relevant quantities under the new parameterization. This estimation procedure is shown to enjoy excellent rates of convergence and can be straightforwardly implemented with existing software. For uncertainty quantification, bootstrapping is adopted to construct confidence regions for the quantities of interest. The practical performance of the methodology is illustrated through simulation examples and an application to a computational fluid dynamics model.

We propose a multi-resolution scanning approach to identifying two-sample differences. Windows of multiple scales are constructed through nested dyadic partitioning on the sample space and a hypothesis regarding the two-sample difference is defined on each window. Instead of testing the hypotheses on different windows independently, we adopt a joint graphical model, namely a Markov tree, on the null or alternative states of these hypotheses to incorporate spatial correlation across windows. The induced dependence allows borrowing strength across nearby and nested windows, which we show is critical for detecting high resolution local differences. We evaluate the performance of the method through simulation and show that it substantially outperforms other state of the art two-sample tests when the two-sample difference is local, involving only a small subset of the data. We then apply it to a flow cytometry data set from immunology, in which it successfully identifies highly local differences. In addition, we show how to control properly for multiple testing in a decision theoretic approach as well as how to summarize and report the inferred two-sample difference. We also construct hierarchical extensions of the framework to incorporate adaptivity into the construction of the scanning windows to improve inference further.

A formal likelihood ratio hypothesis test for the validity of a parametric regression function is proposed, using a large dimensional, non-parametric *double-cone* alternative. For example, the test against a constant function uses the alternative of increasing or decreasing regression functions, and the test against a linear function uses the convex or concave alternative. The test proposed is exact and unbiased and the critical value is easily computed. The power of the test increases to 1 as the sample size increases, under very mild assumptions—even when the alternative is misspecified, i.e. the power of the test converges to 1 for any true regression function that deviates (in a non-degenerate way) from the parametric null hypothesis. We also formulate tests for the linear *versus* partial linear model and consider the special case of the additive model. Simulations show that our procedure behaves well consistently when compared with other methods. Although the alternative fit is non-parametric, no tuning parameters are involved. Supplementary materials with proofs and technical details are available on line.

Sampling from various kinds of distribution is an issue of paramount importance in statistics since it is often the key ingredient for constructing estimators, test procedures or confidence intervals. In many situations, exact sampling from a given distribution is impossible or computationally expensive and, therefore, one needs to resort to approximate sampling strategies. However, there is no well-developed theory providing meaningful non-asymptotic guarantees for the approximate sampling procedures, especially in high dimensional problems. The paper makes some progress in this direction by considering the problem of sampling from a distribution having a smooth and log-concave density defined on $\mathbb{R}^p$, for some integer *p*>0. We establish non-asymptotic bounds for the error of approximating the target distribution by the distribution obtained by the Langevin Monte Carlo method and its variants. We illustrate the effectiveness of the established guarantees with various experiments. Underlying our analysis are insights from the theory of continuous time diffusion processes, which may be of interest beyond the framework of log-concave densities that are considered in the present work.
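
To make the Langevin Monte Carlo iteration concrete, here is a minimal Python sketch of the unadjusted Langevin update for a target density proportional to exp(-f). The standard Gaussian target, the step size, the number of iterations and all function names are assumptions chosen for illustration; the paper's variants and tuning are not reproduced here.

```python
import math
import random

def langevin_monte_carlo(grad_f, theta0, step, n_iter, rng):
    """Run the unadjusted Langevin iteration and return the final state.

    Update: theta <- theta - step * grad_f(theta) + sqrt(2 * step) * noise,
    where noise is a vector of independent standard normal draws.
    """
    theta = list(theta0)
    scale = math.sqrt(2.0 * step)
    for _ in range(n_iter):
        g = grad_f(theta)
        theta = [t - step * gi + scale * rng.gauss(0.0, 1.0)
                 for t, gi in zip(theta, g)]
    return theta

def grad_gaussian(theta):
    # Gradient of f(theta) = ||theta||^2 / 2 (standard Gaussian target).
    return list(theta)

rng = random.Random(0)
# Independent chains, each started far from the mode.
draws = [langevin_monte_carlo(grad_gaussian, [3.0, -3.0], 0.05, 200, rng)
         for _ in range(500)]
mean0 = sum(d[0] for d in draws) / len(draws)
var0 = sum((d[0] - mean0) ** 2 for d in draws) / (len(draws) - 1)
```

For this Gaussian example the chain's stationary variance is 1 / (1 - step / 2) rather than exactly 1, a small discretization bias of the kind that non-asymptotic error bounds are meant to quantify.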

Data subject to heavy-tailed errors are commonly encountered in various scientific fields. To address this problem, procedures based on quantile regression and least absolute deviation regression have been developed in recent years. These methods essentially estimate the conditional median (or quantile) function. They can be very different from the conditional mean functions, especially when distributions are asymmetric and heteroscedastic. How can we efficiently estimate the mean regression functions in ultrahigh dimensional settings when only the second moment exists? To solve this problem, we propose a penalized Huber loss with diverging parameter to reduce biases created by the traditional Huber loss. Such a penalized robust approximate (RA) quadratic loss will be called the RA lasso. In the ultrahigh dimensional setting, where the dimensionality can grow exponentially with the sample size, our results reveal that the RA lasso estimator produces a consistent estimator at the same rate as the optimal rate under the light tail situation. We further study the computational convergence of the RA lasso and show that the composite gradient descent algorithm indeed produces a solution that admits the same optimal rate after sufficient iterations. As a by-product, we also establish the concentration inequality for estimating the population mean when only the second moment exists. We compare the RA lasso with other regularized robust estimators based on quantile regression and least absolute deviation regression. Extensive simulation studies demonstrate the satisfactory finite sample performance of the RA lasso.
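
The role of the Huber threshold can be seen in a small Python sketch. It uses the common textbook parameterization of the Huber loss with threshold `tau` (an assumption; the paper's own parameterization and the penalized regression setting are not reproduced), and a simple iteratively reweighted average for a location estimate. The data and function names are hypothetical.

```python
def huber_loss(r, tau):
    """Textbook Huber loss: quadratic for |r| <= tau, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= tau else tau * (a - 0.5 * tau)

def huber_location(xs, tau, n_iter=200):
    """Minimize the summed Huber loss over a location m by reweighting."""
    m = sum(xs) / len(xs)
    for _ in range(n_iter):
        # Residuals beyond tau are downweighted in proportion to their size.
        weights = [1.0 if abs(x - m) <= tau else tau / abs(x - m) for x in xs]
        m = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return m

data = [0.0, 0.0, 0.0, 0.0, 10.0]          # asymmetric sample: mean 2, median 0
loc_small = huber_location(data, tau=0.1)   # close to the median
loc_large = huber_location(data, tau=100.0) # coincides with the mean
```

With a small threshold the estimate sits near the median, and with a large threshold it recovers the mean, which is the intuition behind letting the parameter diverge to reduce bias for mean regression under asymmetric errors.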

The cumulative incidence is the probability of failure from the cause of interest over a certain time period in the presence of other risks. A semiparametric regression model proposed by Fine and Gray has become the method of choice for formulating the effects of covariates on the cumulative incidence. Its estimation, however, requires modelling of the censoring distribution and is not statistically efficient. We present a broad class of semiparametric transformation models which extends the Fine and Gray model, and we allow for unknown causes of failure. We derive the non-parametric maximum likelihood estimators and develop simple and fast numerical algorithms using the profile likelihood. We establish the consistency, asymptotic normality and semiparametric efficiency of the non-parametric maximum likelihood estimators. In addition, we construct graphical and numerical procedures to evaluate and select models. Finally, we demonstrate the advantages of the proposed methods over the existing methods through extensive simulation studies and an application to a major study on bone marrow transplantation.

A new class of dependent random measures which we call *compound random measures* is proposed and the use of normalized versions of these random measures as priors in Bayesian non-parametric mixture models is considered. Their tractability allows the properties of both compound random measures and normalized compound random measures to be derived. In particular, we show how compound random measures can be constructed with gamma, *σ*-stable and generalized gamma process marginals. We also derive several forms of the Laplace exponent and characterize dependence through both the Lévy copula and the correlation function. An augmented Pólya urn scheme sampler and a slice sampler are described for posterior inference when a normalized compound random measure is used as the mixing measure in a non-parametric mixture model and a data example is discussed.

We propose new methodology for two-sample testing in high dimensional models. The methodology provides a high dimensional analogue to the classical likelihood ratio test and is applicable to essentially any model class where sparse estimation is feasible. Sparse structure is used in the construction of the test statistic. In the general case, testing then involves non-nested model comparison, and we provide asymptotic results for the high dimensional setting. We put forward computationally efficient procedures based on data splitting, including a variant of the permutation test that exploits sparse structure. We illustrate the general approach in two-sample comparisons of high dimensional regression models (‘differential regression’) and graphical models (‘differential network’), showing results on simulated data as well as data from two recent cancer studies.

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semiparametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood, which prevent efficient implementation of traditional Markov chain Monte Carlo methods, such as random-walk Metropolis–Hastings sampling, with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

For the analysis of clustered survival data, two different types of model that take the association into account are commonly used: frailty models and copula models. Frailty models assume that, conditionally on a frailty term for each cluster, the hazard functions of individuals within that cluster are independent. These unknown frailty terms with their imposed distribution are used to express the association between the different individuals in a cluster. Copula models in contrast assume that the joint survival function of the individuals within a cluster is given by a copula function, evaluated in the marginal survival function of each individual. It is the copula function which describes the association between the lifetimes within a cluster. A major disadvantage of the present copula models over the frailty models is that the size of the different clusters must be small and equal to set up manageable estimation procedures for the different model parameters. We describe a copula model for clustered survival data where the clusters are allowed to be moderate to large and varying in size by considering the class of Archimedean copulas with completely monotone generator. We develop both one- and two-stage estimators for the copula parameters. Furthermore we show the consistency and asymptotic normality of these estimators. Finally, we perform a simulation study to investigate the finite sample properties of the estimators. We illustrate the method on a data set containing the time to first insemination in cows, with cows clustered in herds.

We propose a semiparametric latent Gaussian copula model for modelling mixed multivariate data, which contain a combination of both continuous and binary variables. The model assumes that the observed binary variables are obtained by dichotomizing latent variables that satisfy the Gaussian copula distribution. The goal is to infer the conditional independence relationship between the latent random variables, based on the observed mixed data. Our work has two main contributions: we propose a unified rank-based approach to estimate the correlation matrix of latent variables; we establish the concentration inequality of the proposed rank-based estimator. Consequently, our methods achieve the same rates of convergence for precision matrix estimation and graph recovery, as if the latent variables were observed. The methods proposed are numerically assessed through extensive simulation studies, and real data analysis.

Envelope tests are a popular tool in spatial statistics, where they are used in goodness-of-fit testing. These tests graphically compare an empirical function *T*(*r*) with its simulated counterparts from the null model. However, the type I error probability *α* is conventionally controlled for a fixed distance *r* only, whereas the functions are inspected on an interval of distances *I*. In this study, we propose two approaches related to Barnard's Monte Carlo test for building global envelope tests on *I*: ordering the empirical and simulated functions on the basis of their *r*-wise ranks among each other, and the construction of envelopes for a deviation test. These new tests allow the *a priori* choice of the global *α* and they yield *p*-values. We illustrate these tests by using simulated and real point pattern data.
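
The rank-based ordering can be sketched in a few lines of Python. This is one simple reading of the idea, assuming each function is evaluated on a common grid of distances, that `curves[0]` is the empirical function, and that ties are handled conservatively; the function names and the toy numbers are hypothetical, and the authors' exact definitions may differ.

```python
def extreme_ranks(curves):
    """For each curve, the smallest two-sided pointwise rank over the grid."""
    n = len(curves)
    ranks = [n] * n
    for k in range(len(curves[0])):
        col = [c[k] for c in curves]
        for i in range(n):
            low = 1 + sum(v < col[i] for v in col)   # rank counted from below
            high = 1 + sum(v > col[i] for v in col)  # rank counted from above
            ranks[i] = min(ranks[i], low, high)
    return ranks

def conservative_p_value(curves):
    """Share of curves at least as extreme as the empirical curve (index 0)."""
    r = extreme_ranks(curves)
    return sum(rj <= r[0] for rj in r) / len(r)

curves = [
    [5.0, 5.0, 5.0],  # empirical function on three distances
    [1.0, 2.0, 3.0],  # simulations under the null model
    [2.0, 3.0, 1.0],
    [3.0, 1.0, 2.0],
    [0.0, 0.0, 0.0],
]
ranks = extreme_ranks(curves)
p_value = conservative_p_value(curves)
```

A curve that is the most extreme at any single distance receives rank 1, so the ordering reflects the whole interval rather than one fixed distance.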

Hierarchical models allow for heterogeneous behaviours in a population while simultaneously borrowing estimation strength across all subpopulations. Unfortunately, existing likelihood-based methods for fitting hierarchical models have high computational demands, and these demands have limited their adoption in large-scale prediction and inference problems. The paper proposes a moment-based procedure for estimating the parameters of a hierarchical model which has its roots in a method originally introduced by Cochran in 1937. The method trades statistical efficiency for computational efficiency. It gives consistent parameter estimates, competitive prediction error performance and substantial computational improvements. When applied to a large-scale recommender system application and compared with a standard maximum likelihood procedure, the method delivers competitive prediction performance while reducing the sequential computation time from hours to minutes.

The paper develops a general regression framework for the analysis of manifold-valued response in a Riemannian symmetric space (RSS) and its association with multiple covariates of interest, such as age or gender, in Euclidean space. Such RSS-valued data arise frequently in medical imaging, surface modelling and computer vision, among many other fields. We develop an intrinsic regression model solely based on an intrinsic conditional moment assumption, avoiding specifying any parametric distribution in RSS. We propose various link functions to map from the Euclidean space of multiple covariates to the RSS of responses. We develop a two-stage procedure to calculate the parameter estimates and determine their asymptotic distributions. We construct the Wald and geodesic test statistics to test hypotheses of unknown parameters. We systematically investigate the geometric invariant property of these estimates and test statistics. Simulation studies and a real data analysis are used to evaluate the finite sample properties of our methods.

A common feature in large-scale scientific studies is that signals are sparse and it is desirable to narrow down significantly the focus to a much smaller subset in a sequential manner. We consider two related data screening problems: one is to find the smallest subset such that it virtually contains all signals and another is to find the largest subset such that it essentially contains only signals. These screening problems are closely connected to but distinct from the more conventional signal detection or multiple-testing problems. We develop phase transition diagrams to characterize the fundamental limits in simultaneous inference and derive data-driven screening procedures which control the error rates with near optimality properties. Applications in the context of multistage high throughput studies are discussed.

The analysis of spatial data is based on a set of assumptions, which in practice need to be checked. A commonly used assumption is that the spatial random field is second-order stationary. In the paper, a test for spatial stationarity for irregularly sampled data is proposed. The test is based on a transformation of the data (a type of Fourier transform), where the correlations between the transformed data are close to 0 if the random field is second-order stationary. However, if the random field is second-order non-stationary, this property does not hold. Using this property a test for second-order stationarity is constructed. The test statistic is based on measuring the degree of correlation in the transformed data. The asymptotic sampling properties of the test statistic are derived under both stationarity and non-stationarity of the random field. These results motivate a graphical tool which allows a visual representation of the non-stationary features. The method is illustrated with simulations and a real data example.

We consider the problem of Laplace deconvolution with noisy discrete non-equally spaced observations on a finite time interval. We propose a new method for Laplace deconvolution which is based on expansions of the convolution kernel, the unknown function and the observed signal over a Laguerre functions basis (which acts as a surrogate eigenfunction basis of the Laplace convolution operator) using a regression setting. The expansion results in a small system of linear equations with the matrix of the system being triangular and Toeplitz. Because of this triangular structure, there is a common number *m* of terms in the function expansions to control, which is realized via a complexity penalty. The advantage of this methodology is that it leads to very fast computations, produces no boundary effects due to extension at zero and cut-off at *T* and provides an estimator with the risk within a logarithmic factor of *m* of the oracle risk. We emphasize that we consider the true observational model with possibly non-equispaced observations which are available on a finite interval of length *T* which appears in many different contexts, and we account for the bias associated with this model (which is not present in the case *T* = ∞). The study is motivated by perfusion imaging using a short injection of contrast agent, a procedure which is applied for medical assessment of microcirculation within tissues such as cancerous tumours. The presence of a tuning parameter *a* allows the choice of the most advantageous time units, so that both the kernel and the unknown right-hand side of the equation are well represented for the deconvolution. The methodology is illustrated by an extensive simulation study and a real data example which confirms that the technique proposed is fast, efficient, accurate, usable from a practical point of view and very competitive.

Consider the extreme quantile region induced by the half-space depth function HD of the form , such that for a given, very small *p*>0. Since this involves extrapolation outside the data cloud, this region can hardly be estimated through a fully non-parametric procedure. Using extreme value theory we construct a natural semiparametric estimator of this quantile region and prove a refined consistency result. A simulation study clearly demonstrates the good performance of our estimator. We use the procedure for risk management by applying it to stock market returns.

We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered as a special case. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data-generating mechanism. For instance, when the object of interest is low dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.
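
A minimal sketch of a loss-based belief update of this kind is given below. It assumes the update takes the commonly quoted form posterior(θ) ∝ prior(θ) exp{−w Σ l(θ, x_i)}, evaluated on a grid of candidate values; this form, the weight `w`, the choice of absolute loss (which targets the median) and all function names are illustrative assumptions rather than details taken from the abstract.

```python
import math

def general_bayes_grid(data, grid, log_prior, loss, weight=1.0):
    """Return normalised posterior weights on `grid` for a loss-based update.

    Assumed update (not stated in the abstract):
        posterior(theta) ∝ prior(theta) * exp(-weight * sum_i loss(theta, x_i))
    """
    log_post = [
        log_prior(theta) - weight * sum(loss(theta, x) for x in data)
        for theta in grid
    ]
    shift = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def absolute_loss(theta, x):
    """Loss whose cumulative minimiser is a sample median."""
    return abs(theta - x)

def flat_log_prior(theta):
    """Log of an (unnormalised) flat prior on the grid."""
    return 0.0

# Hypothetical data with one gross outlier; no full data model is specified.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
grid = [i / 10 for i in range(0, 1001)]  # candidate values 0.0, 0.1, ..., 100.0
posterior = general_bayes_grid(data, grid, flat_log_prior, absolute_loss)
posterior_mode = grid[max(range(len(grid)), key=posterior.__getitem__)]
```

Passing a negative log-likelihood as `loss` with `weight=1.0` reduces this grid update to ordinary Bayesian updating, which mirrors the special case mentioned in the abstract.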

Simulation from the truncated multivariate normal distribution in high dimensions is a recurrent problem in statistical computing and is typically only feasible by using approximate Markov chain Monte Carlo sampling. We propose a minimax tilting method for exact independently and identically distributed data simulation from the truncated multivariate normal distribution. The new methodology provides both a method for simulation and an efficient estimator to hitherto intractable Gaussian integrals. We prove that the estimator has a rare vanishing relative error asymptotic property. Numerical experiments suggest that the scheme proposed is accurate in a wide range of set-ups for which competing estimation schemes fail. We give an application to exact independently and identically distributed data simulation from the Bayesian posterior of the probit regression model.

We investigate a longitudinal data model with non-parametric regression functions that may vary across the observed individuals. In a variety of applications, it is natural to impose a group structure on the regression curves. Specifically, we may suppose that the observed individuals can be grouped into a number of classes whose members all share the same regression function. We develop a statistical procedure to estimate the unknown group structure from the data. Moreover, we derive the asymptotic properties of the procedure and investigate its finite sample performance by means of a simulation study and a real data example.

Different dependence scenarios can arise in multivariate extremes, entailing careful selection of an appropriate class of models. In bivariate extremes, the variables are either asymptotically dependent or are asymptotically independent. Most available statistical models suit one or other of these cases, but not both, resulting in a stage in the inference that is unaccounted for but can substantially impact subsequent extrapolation. Existing modelling solutions to this problem are either applicable only on subdomains or appeal to multiple limit theories. We introduce a unified representation for bivariate extremes that encompasses a wide variety of dependence scenarios and applies when at least one variable is large. Our representation motivates a parametric model that encompasses both dependence classes. We implement a simple version of this model and show that it performs well in a range of settings.

We introduce a simple and interpretable model for functional data analysis for situations where the observations at each location are functional rather than scalar. This new approach is based on a tensor product representation of the function-valued process and utilizes eigenfunctions of marginal kernels. The resulting marginal principal components and product principal components are shown to have nice properties. Given a sample of independent realizations of the underlying function-valued stochastic process, we propose straightforward fitting methods to obtain the components of this model and to establish asymptotic consistency and rates of convergence for the estimates proposed. The methods are illustrated by modelling the dynamics of annual fertility profile functions for 17 countries. This analysis demonstrates that the approach proposed leads to insightful interpretations of the model components and interesting conclusions.

The paper develops inferential methodology for detecting a change in the annual pattern of an environmental variable measured at fixed locations in a spatial region. Using a framework built on functional data analysis, we model observations as a collection of function-valued time sequences available at many sites. Each sequence is modelled as an annual mean function, which may change, plus a sequence of error functions, which are spatially correlated. The test statistics extend the cumulative sum paradigm to this more complex setting. Their asymptotic distributions are not parameter free because of the spatial dependence but can be effectively approximated by Monte Carlo simulations. The new methodology is applied to precipitation data. Its finite sample performance is assessed by a simulation study.

We propose a non-parametric variable selection method which does not rely on any regression model or predictor distribution. The method is based on a new statistical relationship, called *additive conditional independence*, that has been introduced recently for graphical models. Unlike most existing variable selection methods, which target the mean of the response, the method proposed targets a set of attributes of the response, such as its mean, variance or entire distribution. In addition, the additive nature of this approach offers non-parametric flexibility without employing multi-dimensional kernels. As a result it retains high accuracy for high dimensional predictors. We establish estimation consistency, convergence rate and variable selection consistency of the method proposed. Through simulation comparisons we demonstrate that the method proposed performs better than existing methods when the predictor affects several attributes of the response, and it performs competently in the classical setting where the predictors affect the mean only. We apply the new method to a data set concerning how gene expression levels affect the weight of mice.

Although recovering a Euclidean distance matrix from noisy observations is a common problem in practice, how well this could be done remains largely unknown. To fill in this void, we study a simple distance matrix estimate based on the so-called regularized kernel estimate. We show that such an estimate can be characterized as simply applying a constant amount of shrinkage to all observed pairwise distances. This fact allows us to establish risk bounds for the estimate, implying that the true distances can be estimated consistently in an average sense as the number of objects increases. In addition, such a characterization suggests an efficient algorithm to compute the distance matrix estimator, as an alternative to the usual second-order cone programming which is known not to scale well for large problems. Numerical experiments and an application in visualizing the diversity of Vpu protein sequences from a recent study of human immunodeficiency virus type 1 further demonstrate the practical merits of the method proposed.

We consider testing regression coefficients in high dimensional generalized linear models. By modifying the test statistic of Goeman and his colleagues for large but fixed dimensional settings, we propose a new test, based on an asymptotic analysis, that is applicable for diverging dimensions and is robust to accommodate a wide range of link functions. The power properties of the tests are evaluated asymptotically under two families of alternative hypotheses. In addition, a test in the presence of nuisance parameters is also proposed. The tests can provide *p*-values for testing significance of multiple gene sets, whose application is demonstrated in a case-study on lung cancer.

We study the problem of estimating a compact set *S* from a trajectory of a reflected Brownian motion in *S* with reflections on the boundary of *S*. We establish consistency and rates of convergence for various estimators of *S* and its boundary. This problem has relevant applications in ecology in estimating the home range of an animal on the basis of tracking data. There are a variety of studies on the habitat of animals that employ the notion of home range. The paper offers theoretical foundations for a new methodology that, under fairly unrestrictive shape assumptions, allows us to find flexible regions close to reality. The theoretical findings are illustrated on simulated and real data examples.

We consider a two-component mixture model with one known component. We develop methods for estimating the mixing proportion and the unknown distribution non-parametrically, given independent and identically distributed data from the mixture model, using ideas from shape-restricted function estimation. We establish the consistency of our estimators. We find the rate of convergence and asymptotic limit of the estimator for the mixing proportion. Completely automated distribution-free honest finite sample lower confidence bounds are developed for the mixing proportion. Connection to the problem of multiple testing is discussed. The identifiability of the model and the estimation of the density of the unknown distribution are also addressed. We compare the estimators proposed, which are easily implementable, with some of the existing procedures through simulation studies and analyse two data sets: one arising from an application in astronomy and the other from a microarray experiment.

We develop new statistical theory for probabilistic principal component analysis models in high dimensions. The focus is the estimation of the noise variance, which is an important and unresolved issue when the number of variables is large in comparison with the sample size. We first unveil the reasons for an observed downward bias of the maximum likelihood estimator of the noise variance when the data dimension is high. We then propose a bias-corrected estimator by using random-matrix theory and establish its asymptotic normality. The superiority of the new and bias-corrected estimator over existing alternatives is checked by Monte Carlo experiments with various combinations of (*p*,*n*) (the dimension and sample size). Next, we construct a new criterion based on the bias-corrected estimator to determine the number of the principal components, and a consistent estimator is obtained. Its good performance is confirmed by a simulation study and real data analysis. The bias-corrected estimator is also used to derive new asymptotics for the related goodness-of-fit statistic under the high dimensional scheme.

Local smoothing testing based on multivariate non-parametric regression estimation is one of the main model checking methodologies in the literature. However, the relevant tests suffer from the typical curse of dimensionality, resulting in slow rates of convergence to their limits under the null hypothesis and less deviation from the null hypothesis under alternative hypotheses. This problem prevents tests from maintaining the level of significance well and makes tests less sensitive to alternative hypotheses. In the paper, a model adaptation concept in lack-of-fit testing is introduced and a dimension reduction model-adaptive test procedure is proposed for parametric single-index models. The test behaves like a local smoothing test, as if the model were univariate. It is consistent against any global alternative hypothesis and can detect local alternative hypotheses distinct from the null hypothesis at a fast rate that existing local smoothing tests can achieve only when the model is univariate. Simulations are conducted to examine the performance of our methodology. An analysis of real data is shown for illustration. The method can be readily extended to global smoothing methodology and other testing problems.

The problem of non-random sample selectivity often occurs in practice in many fields. The classical estimators introduced by Heckman are the backbone of the standard statistical analysis of these models. However, these estimators are very sensitive to small deviations from the distributional assumptions which are often not satisfied in practice. We develop a general framework to study the robustness properties of estimators and tests in sample selection models. We derive the influence function and the change-of-variance function of Heckman's two-stage estimator, and we demonstrate the non-robustness of this estimator and its estimated variance to small deviations from the model assumed. We propose a procedure for robustifying the estimator, prove its asymptotic normality and give its asymptotic variance. Both cases with and without an exclusion restriction are covered. This allows us to construct a simple robust alternative to the sample selection bias test. We illustrate the use of our new methodology in an analysis of ambulatory expenditures and we compare the performance of the classical and robust methods in a Monte Carlo simulation study.

We propose a likelihood ratio scan method for estimating multiple change points in piecewise stationary processes. Using scan statistics reduces the computationally infeasible global multiple-change-point estimation problem to a number of single-change-point detection problems in various local windows. The computation can be efficiently performed with order *O*{*npt* log (*n*)}. Consistency for the estimated numbers and locations of the change points is established. Moreover, a procedure is developed for constructing confidence intervals for each of the change points. Simulation experiments and real data analysis are conducted to illustrate the efficiency of the likelihood ratio scan method.

Principal stratification is a causal framework to analyse randomized experiments with a post-treatment variable between the treatment and end point variables. Because the principal strata defined by the potential outcomes of the post-treatment variable are not observable, we generally cannot identify the causal effects within principal strata. Motivated by a real data set of phase III adjuvant colon cancer clinical trials, we propose approaches to identifying and estimating the principal causal effects via multiple trials. For the identifiability, we remove the commonly used exclusion restriction assumption by stipulating that the principal causal effects are homogeneous across these trials. To remove another commonly used monotonicity assumption, we give a necessary condition for the local identifiability, which requires at least three trials. Applying our approaches to the data from adjuvant colon cancer clinical trials, we find that the commonly used monotonicity assumption is untenable, and disease-free survival with 3-year follow-up is a valid surrogate end point for overall survival with 5-year follow-up, which satisfies both causal necessity and causal sufficiency. We also propose a sensitivity analysis approach based on Bayesian hierarchical models to investigate the effect of the deviation from the homogeneity assumption.

Identifying leading measurement units from a large collection is a common inference task in various domains of large-scale inference. Testing approaches, which measure evidence against a null hypothesis rather than effect magnitude, tend to overpopulate lists of leading units with those associated with low measurement error. By contrast, local maximum likelihood approaches tend to favour units with high measurement error. Available Bayesian and empirical Bayesian approaches rely on specialized loss functions that result in similar deficiencies. We describe and evaluate a generic empirical Bayesian ranking procedure that populates the list of top units in a way that maximizes the expected overlap between the true and reported top lists for all list sizes. The procedure relates unit-specific posterior upper tail probabilities with their empirical distribution to yield a ranking variable. It discounts high variance units less than popular non-maximum-likelihood methods and thus achieves improved operating characteristics in the models considered.

The paper brings together the theory and practice of local linear kernel hazard estimation. Bandwidth selection is fully analysed, including double one-sided cross-validation that is shown to have good practical and theoretical properties. Insight is provided into the choice of the weighting function in the local linear minimization and it is pointed out that classical weighting sometimes lacks stability. A new semiparametric hazard estimator transforming the survival data before smoothing is introduced and shown to have good practical properties.

We study the properties of points in the *d*-dimensional unit cube generated by applying Hilbert's space filling curve to uniformly distributed points in [0,1]. For deterministic sampling we obtain a discrepancy of for *d*⩾2. For random stratified sampling, and scrambled van der Corput points, we derive a mean-squared error of for integration of Lipschitz continuous integrands, when *d*⩾3. These rates are the same as those obtained by sampling on *d*-dimensional grids and they show a deterioration with increasing *d*. The rate for Lipschitz functions is, however, the best possible at that level of smoothness and is better than plain independent and identically distributed sampling. Unlike grids, space filling curve sampling provides points at any desired sample size, and the van der Corput version is extensible in *n*. We also introduce a class of piecewise Lipschitz functions whose discontinuities are in rectifiable sets described via Minkowski content. Although these functions may have infinite variation in the sense of Hardy and Krause, they can be integrated with a mean-squared error of . It was previously known only that the rate was . Other space filling curves, such as those due to Sierpinski and Peano, also attain these rates, whereas upper bounds for the Lebesgue curve are somewhat worse, as if the dimension were times as high.
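
The sampling idea can be illustrated with a rough two-dimensional sketch: unscrambled base-2 van der Corput points in [0,1) are pushed through a finite-order approximation to the Hilbert curve. The restriction to two dimensions, the absence of scrambling, the cell-centre approximation and the names `van_der_corput`, `hilbert_cell` and `hilbert_point` are all simplifying assumptions for illustration; the abstract itself concerns general *d* and scrambled points.

```python
def van_der_corput(i, base=2):
    """Radical inverse of the non-negative integer i in the given base."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value

def hilbert_cell(order, index):
    """Map index in [0, 4**order) to a cell (x, y) of a 2**order by 2**order grid.

    Standard iterative construction of the two-dimensional Hilbert curve.
    """
    x = y = 0
    t = index
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the sub-square before transposing it.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_point(u, order=8):
    """Approximate the image of u in [0, 1) under the curve by a cell centre."""
    side = 1 << order
    index = min(int(u * side * side), side * side - 1)
    x, y = hilbert_cell(order, index)
    return (x + 0.5) / side, (y + 0.5) / side

# The first 64 van der Corput points, mapped into the unit square.
points = [hilbert_point(van_der_corput(i)) for i in range(64)]
```

Because the van der Corput sequence can be continued indefinitely, more points can be appended to `points` without discarding earlier ones, which is the sense in which such a construction is extensible in *n*.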

We study generalized additive models, with shape restrictions (e.g. monotonicity, convexity and concavity) imposed on each component of the additive prediction function. We show that this framework facilitates a non-parametric estimator of each additive component, obtained by maximizing the likelihood. The procedure is free of tuning parameters and under mild conditions is proved to be uniformly consistent on compact intervals. More generally, our methodology can be applied to generalized additive index models. Here again, the procedure can be justified on theoretical grounds and, like the original algorithm, has highly competitive finite sample performance. Practical utility is illustrated through the use of these methods in the analysis of two real data sets. Our algorithms are publicly available in the R package scar, short for shape-constrained additive regression.

We develop a connection between mixture and envelope representations of objective functions that arise frequently in statistics. We refer to this connection by using the term ‘hierarchical duality’. Our results suggest an interesting and previously underexploited relationship between marginalization and profiling, or equivalently between the Fenchel–Moreau theorem for convex functions and the Bernstein–Widder theorem for Laplace transforms. We give several different sets of conditions under which such a duality result obtains. We then extend existing work on envelope representations in several ways, including novel generalizations to variance–mean models and to multivariate Gaussian location models. This turns out to provide an elegant missing data interpretation of the proximal gradient method, which is a widely used algorithm in machine learning. We show several statistical applications in which the framework proposed leads to easily implemented algorithms, including a robust version of the fused lasso, non-linear quantile regression via trend filtering and the binomial fused double-Pareto model. Code for the examples is available on GitHub at https://github.com/jgscott/hierduals.

Variable order Markov chains have been used to model discrete sequential data in a variety of fields. A host of methods exist to estimate the history-dependent lengths of memory which characterize these models and to predict new sequences. In several applications, the data-generating mechanism is known to be reversible, but combining this information with the procedures mentioned is far from trivial. We introduce a Bayesian analysis for reversible dynamics, which takes into account uncertainty in the lengths of memory. The model proposed is applied to the analysis of molecular dynamics simulations and compared with several popular algorithms.

In the practice of point prediction, it is desirable that forecasters receive a directive in the form of a statistical functional. For example, forecasters might be asked to report the mean or a quantile of their predictive distributions. When evaluating and comparing competing forecasts, it is then critical that the scoring function used for these purposes be consistent for the functional at hand, in the sense that the expected score is minimized when following the directive. We show that any scoring function that is consistent for a quantile or an expectile functional can be represented as a mixture of elementary or extremal scoring functions that form a linearly parameterized family. Scoring functions for the mean value and probability forecasts of binary events constitute important examples. The extremal scoring functions admit appealing economic interpretations of quantiles and expectiles in the context of betting and investment problems. The Choquet-type mixture representations give rise to simple checks of whether a forecast dominates another in the sense that it is preferable under any consistent scoring function. In empirical settings it suffices to compare the average scores for only a finite number of extremal elements. Plots of the average scores with respect to the extremal scoring functions, which we call Murphy diagrams, permit detailed comparisons of the relative merits of competing forecasts.
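
A Murphy diagram for quantile forecasts can be sketched as below. The sketch assumes that the elementary score for the *α*-quantile at threshold *θ* has the form (1{*y* < *x*} − *α*)(1{*θ* < *x*} − 1{*θ* < *y*}); this formula is recalled from the literature rather than stated in the abstract, and the function names and toy data are hypothetical.

```python
def elementary_quantile_score(x, y, alpha, theta):
    """Assumed elementary score for forecast x, outcome y, level alpha, threshold theta."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))

def pinball_loss(x, y, alpha):
    """Familiar asymmetric piecewise linear score for the alpha-quantile."""
    return ((y < x) - alpha) * (x - y)

def murphy_curve(forecasts, observations, alpha, thetas):
    """Average elementary score at each threshold: one curve of a Murphy diagram."""
    n = len(observations)
    return [
        sum(elementary_quantile_score(x, y, alpha, theta)
            for x, y in zip(forecasts, observations)) / n
        for theta in thetas
    ]

# Toy data: forecast A is always correct, forecast B always issues the same value.
observations = [1.0, 2.0, 3.0]
forecast_a = [1.0, 2.0, 3.0]
forecast_b = [2.0, 2.0, 2.0]
alpha = 0.5

# On empirical data the curves only change at forecast and observation values,
# so a finite set of thresholds around those values is enough here.
thetas = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
curve_a = murphy_curve(forecast_a, observations, alpha, thetas)
curve_b = murphy_curve(forecast_b, observations, alpha, thetas)

# A has an average elementary score no larger than B at every threshold checked.
a_dominates_b = all(a <= b for a, b in zip(curve_a, curve_b))
```

Under the assumed form, integrating the elementary score over *θ* gives back `pinball_loss`, which is one way to see the mixture representation at work: the pinball loss is the mixture with uniform weight on all thresholds.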

A major challenge in many modern superresolution fluorescence microscopy techniques at the nanoscale lies in the correct alignment of long sequences of sparse but spatially and temporally highly resolved images. This is caused by the temporal drift of the protein structure, e.g. due to temporal thermal inhomogeneity of the object of interest or its supporting area during the observation process. We develop a simple semiparametric model for drift correction in single-marker switching microscopy. Then we propose an *M*-estimator for the drift and show its asymptotic normality. This is used to correct the final image and it is shown that this purely statistical method is competitive with state of the art calibration techniques which require the incorporation of fiducial markers in the specimen. Moreover, a simple bootstrap algorithm allows us to quantify the precision of the drift estimate and its effect on the final image estimation. We argue that purely statistical drift correction is even more robust than fiducial tracking, rendering the latter superfluous in many applications. The practicability of our method is demonstrated by a simulation study and by a single-marker switching application. This serves as a prototype for many other typical imaging techniques where sparse observations with high temporal resolution are blurred by motion of the object to be reconstructed.

Variable selection is a challenging issue in statistical applications when the number of predictors *p* far exceeds the number of observations *n*. In this ultrahigh dimensional setting, the sure independence screening procedure was introduced to reduce the dimensionality significantly by preserving the true model with overwhelming probability, before a refined second-stage analysis. However, the aforementioned sure screening property strongly relies on the assumption that the important variables in the model have large marginal correlations with the response, which rarely holds in reality. To overcome this, we propose a novel and simple screening technique called high dimensional ordinary least squares projection which we refer to as ‘HOLP’. We show that HOLP has the sure screening property and gives consistent variable selection without the strong correlation assumption, and it has a low computational complexity. A ridge-type HOLP procedure is also discussed. A simulation study shows that HOLP performs competitively compared with many other marginal correlation-based methods. An application to a mammalian eye disease data set illustrates the attractiveness of HOLP.
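
The projection behind HOLP can be sketched as follows. The sketch assumes that the screening coefficients take the form Xᵀ(XXᵀ)⁻¹y, which is how the method is usually stated but is not spelled out in the abstract; the function name `holp_screen`, the simulated data and the choice to keep *n* predictors are illustrative assumptions.

```python
import numpy as np

def holp_screen(X, y, keep):
    """Rank predictors by |X^T (X X^T)^{-1} y| and return the top `keep` indices.

    Assumes p > n and that X X^T is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solve the n-by-n system (X X^T) z = y instead of forming an inverse.
    z = np.linalg.solve(X @ X.T, y)
    beta = X.T @ z
    # Largest absolute coefficients first; stable sort keeps ties in index order.
    order = np.argsort(-np.abs(beta), kind="stable")
    return order[:keep], beta

# Hypothetical simulated example with p much larger than n.
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = 5.0  # three active predictors
y = X @ beta_true + rng.standard_normal(n)

selected, beta = holp_screen(X, y, keep=n)
```

Only an *n* × *n* linear system is solved, so the cost is driven by forming XXᵀ rather than by anything of size *p* × *p*. The retained predictors in `selected` would then be passed to a refined second-stage method.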

The paper investigates the estimation of a wide class of multivariate volatility models. Instead of estimating an *m*-multivariate volatility model, a much simpler and numerically efficient method consists in estimating *m* univariate generalized auto-regressive conditional heteroscedasticity type models equation by equation in the first step, and a correlation matrix in the second step. Strong consistency and asymptotic normality of the equation-by-equation estimator are established in a very general framework, including dynamic conditional correlation models. The equation-by-equation estimator can be used to test the restrictions imposed by a particular multivariate generalized auto-regressive conditional heteroscedasticity specification. For general constant conditional correlation models, we obtain the consistency and asymptotic normality of the two-step estimator. Comparisons with the global method, in which the model parameters are estimated in one step, are provided. Monte Carlo experiments and applications to financial series illustrate the interest of the approach.

A conventional linear model for functional data involves expressing a response variable *Y* in terms of the explanatory function *X*(*t*), via the model , where *a* is a scalar, *b* is an unknown function and is a compact interval. However, in some problems the support of *b* or *X*, say, is a proper and unknown subset of , and is a quantity of particular practical interest. Motivated by a real data example involving particulate emissions, we develop methods for estimating . We give particular emphasis to the case , where *θ* ∈ (0,*α*], and suggest two methods for estimating *a*, *b* and *θ* jointly; we introduce techniques for selecting tuning parameters; and we explore properties of our methodology by using both simulation and the real data example mentioned above. Additionally, we derive theoretical properties of the methodology and discuss implications of the theory. Our theoretical arguments give particular emphasis to the problem of identifiability.

Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation that is not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, which is generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start impact study, which is a large-scale randomized evaluation of a Federal preschool programme, finding that there is indeed significant unexplained treatment effect variation.

The estimation of average treatment effects based on observational data is extremely important in practice and has been studied by generations of statisticians under different frameworks. Existing globally efficient estimators require non-parametric estimation of a propensity score function, an outcome regression function or both, but their performance can be poor in practical sample sizes. Without explicitly estimating either function, we consider a wide class of calibration weights constructed to attain an exact three-way balance of the moments of observed covariates among the treated, the control and the combined group. The wide class includes exponential tilting, empirical likelihood and generalized regression as important special cases, and extends survey calibration estimators to different statistical problems and with important distinctions. Global semiparametric efficiency for the estimation of average treatment effects is established for this general class of calibration estimators. The results show that efficiency can be achieved by solely balancing the covariate distributions without resorting to direct estimation of the propensity score or outcome regression function. We also propose a consistent estimator for the efficient asymptotic variance, which does not involve additional functional estimation of either the propensity score or the outcome regression functions. The variance estimator proposed outperforms existing estimators that require a direct approximation of the efficient influence function.