In many practical applications of multiple testing, there are natural ways to partition the hypotheses into groups by using their structural, spatial or temporal relatedness; the classical Benjamini–Hochberg procedure for controlling the false discovery rate (FDR) makes no use of this prior knowledge. When one can define (possibly several) such partitions, it may be desirable to control the *group FDR* simultaneously for all partitions. (As special cases, the ‘finest’ partition divides the *n* hypotheses into *n* groups of one hypothesis each, which corresponds to controlling the usual notion of FDR, whereas the ‘coarsest’ partition puts all *n* hypotheses into a single group, which corresponds to testing the global null hypothesis.) We introduce the *p-filter*, which takes as input a list of *n* *p*-values and *M*⩾1 partitions of the hypotheses, and produces as output a list of *n* or fewer discoveries such that the group FDR is provably *simultaneously* controlled for all partitions. Importantly, since the partitions are arbitrary, our procedure can also handle multiple partitions that are non-hierarchical. The *p*-filter generalizes two classical procedures: when *M*=1, choosing the finest partition into *n* singletons exactly recovers the Benjamini–Hochberg procedure, whereas choosing instead the coarsest partition with a single group of size *n* exactly recovers the Simes test for the global null hypothesis. We verify our findings with simulations showing that this technique not only yields the aforementioned multilayer FDR control but can also improve the *precision* of the rejected hypotheses. We present illustrative results from an application to a neuroscience problem with functional magnetic resonance imaging data, where hypotheses are explicitly grouped according to predefined regions of interest in the brain, allowing the scientist to employ field-specific prior knowledge explicitly and flexibly.
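The two special cases recovered by the *p*-filter are simple enough to sketch directly. Below is a minimal Python illustration of the Benjamini–Hochberg step-up procedure and the Simes global-null test (the p-value list is made up for illustration; this is not the p-filter itself, which additionally handles *M*⩾2 arbitrary partitions):

```python
def benjamini_hochberg(pvals, q=0.1):
    """BH step-up: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k*q/n (k = 0 if no rank qualifies)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / n:
            k = rank
    return sorted(order[:k])  # indices of rejected hypotheses

def simes(pvals, alpha=0.1):
    """Simes test of the global null: reject iff min_k n*p_(k)/k <= alpha,
    i.e. iff BH at level alpha makes at least one rejection."""
    n = len(pvals)
    p = sorted(pvals)
    return min(n * p[k] / (k + 1) for k in range(n)) <= alpha

pvals = [0.001, 0.008, 0.039, 0.041, 0.28, 0.9]
print(benjamini_hochberg(pvals, q=0.1))
print(simes(pvals, alpha=0.1))
```

Note that Simes rejects the global null exactly when BH at the same level makes at least one rejection, which is the sense in which both are boundary cases of the same procedure.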

We develop a complete methodology for detecting time varying or non-time-varying parameters in auto-regressive conditional heteroscedasticity (ARCH) processes. For this, we estimate and test various semiparametric versions of time varying ARCH models which include two well-known non-stationary ARCH-type models introduced in the econometrics literature. Using kernel estimation, we show that non-time-varying parameters can be estimated at the usual parametric rate of convergence and, for Gaussian noise, we construct estimates that are asymptotically efficient in a semiparametric sense. Then we introduce two statistical tests which can be used for detecting non-time-varying parameters or for testing the second-order dynamics. An information criterion for selecting the number of lags is also provided. We illustrate our methodology with several real data sets.

A sensible use of an estimation method requires that assessment criteria for the quality of the estimate be available. We present a coverage theory for the least squares estimate. By suitably modifying the empirical costs, one constructs statistics that are guaranteed to cover with known probability the cost associated with a next, still unseen, member of the population. All results of this paper are distribution free and can be applied to least squares problems in use across a variety of fields.

We study parameter estimation in linear Gaussian covariance models, which are *p*-dimensional Gaussian models with linear constraints on the covariance matrix. Maximum likelihood estimation for this class of models leads to a non-convex optimization problem which typically has many local maxima. Using recent results on the asymptotic distribution of extreme eigenvalues of the Wishart distribution, we provide sufficient conditions for any hill climbing method to converge to the global maximum. Although we are primarily interested in the case in which *n*≫*p*, the proofs of our results utilize large sample asymptotic theory under the scheme *n*/*p*→*γ*>1. Remarkably, our numerical simulations indicate that our results remain valid for *p* as small as 2. An important consequence of this analysis is that, for sample sizes *n*≃14*p*, maximum likelihood estimation for linear Gaussian covariance models behaves as if it were a convex optimization problem.

We propose new concordance-assisted learning for estimating optimal individualized treatment regimes. We first introduce a type of concordance function for prescribing treatment and propose a robust rank regression method for estimating the concordance function. We then find treatment regimes, up to a threshold, to maximize the concordance function, named the prescriptive index. Finally, within the class of treatment regimes that maximize the concordance function, we find the optimal threshold to maximize the value function. We establish the rate of convergence and asymptotic normality of the proposed estimator for parameters in the prescriptive index. An induced smoothing method is developed to estimate the asymptotic variance of the estimator. We also establish the *n*^{1/3} consistency of the estimated optimal threshold and its limiting distribution. In addition, a doubly robust estimator of parameters in the prescriptive index is developed under a class of monotonic index models. The practical use and effectiveness of the methodology proposed are demonstrated by simulation studies and an application to an acquired immune deficiency syndrome data set.

We develop robust targeted maximum likelihood estimators (TMLEs) for transporting intervention effects from one population to another. Specifically, we develop TMLEs for three transported estimands: the intent-to-treat average treatment effect (ATE) and complier ATE, which are relevant for encouragement design interventions and instrumental variable analyses, and the ATE of the exposure on the outcome, which is applicable to any randomized or observational study. We demonstrate finite sample performance of these TMLEs by using simulation, including in the presence of practical violations of the positivity assumption. We then apply these methods to the ‘Moving to opportunity’ trial: a multisite, encouragement design intervention in which families in public housing were randomized to receive housing vouchers and logistical support to move to low poverty neighbourhoods. This application sheds light on whether effect differences across sites can be explained by differences in population composition.

Comparative and evolutionary ecologists are interested in the distribution of quantitative traits among related species. The classical framework for these distributions consists of a random process running along the branches of a phylogenetic tree relating the species. We consider shifts in the process parameters, which reveal fast adaptation to changes of ecological niches. We show that models with shifts are not identifiable in general. Constraining the models to be parsimonious in the number of shifts partially alleviates the problem, but several evolutionary scenarios can still provide the same joint distribution for the extant species. We provide a recursive algorithm to enumerate all the equivalent scenarios and to count the number of effectively different scenarios. We introduce an incomplete-data framework and develop a maximum likelihood estimation procedure based on the expectation–maximization algorithm. Finally, we propose a model selection procedure, based on the number of effectively different scenarios, to estimate the number of shifts, and we prove an oracle inequality for it.

Continuous treatments (e.g. doses) arise often in practice, but many available causal effect estimators are limited either by requiring parametric models for the effect curve or by not allowing doubly robust covariate adjustment. We develop a novel kernel smoothing approach that requires only mild smoothness assumptions on the effect curve and still allows for misspecification of either the treatment density or the outcome regression. We derive asymptotic properties and give a procedure for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmissions penalties.

The paper investigates a change point estimation problem in the context of high dimensional Markov random field models. Change points represent a key feature in many dynamically evolving network structures. The change point estimate is obtained by maximizing a profile penalized pseudolikelihood function under a sparsity assumption. We also derive a tight bound on the estimation error of the change point, up to a logarithmic factor, even in settings where the number of possible edges in the network far exceeds the sample size. The performance of the proposed estimator is evaluated on synthetic data sets, and the method is also used to explore voting patterns in the US Senate over the period 1979–2012.

Large-scale multiple testing with correlated test statistics arises frequently in many areas of scientific research. Incorporating correlation information in approximating the false discovery proportion (FDP) has attracted increasing attention in recent years. When the covariance matrix of the test statistics is known, Fan and his colleagues provided an accurate approximation of the FDP under arbitrary dependence structure and some sparsity assumption. However, the covariance matrix is often unknown in many applications, and such dependence information must be estimated before approximating the FDP. The estimation accuracy can greatly affect the FDP approximation. In this paper, we study theoretically the effect of unknown dependence on the testing procedure and establish a general framework such that the FDP can be well approximated. Unknown dependence affects the approximation of the FDP in two major ways: through estimating eigenvalues or eigenvectors and through estimating marginal variances. To address the challenges in these two aspects, we first develop general requirements on estimates of eigenvalues and eigenvectors for a good approximation of the FDP. We then give conditions on the structures of covariance matrices that satisfy such requirements. Such dependence structures include banded or sparse covariance matrices and (conditional) sparse precision matrices. Within this framework, we also consider a special example to illustrate our method where data are sampled from an approximate factor model, which encompasses most practical situations. We provide a good approximation of the FDP by exploiting this specific dependence structure. The results are further generalized to the situation where the multivariate normality assumption is relaxed. Our results are demonstrated by simulation studies and some real data applications.

Consider the following three important problems in statistical inference: constructing confidence intervals for the error of a high dimensional (*p*>*n*) regression estimator, the linear regression noise level and the genetic signal-to-noise ratio of a continuous-valued trait (related to the heritability). All three problems turn out to be closely related to the little-studied problem of performing inference on the ℓ2-norm of the signal in high dimensional linear regression. We derive a novel procedure for this, which is asymptotically correct when the covariates are multivariate Gaussian and produces valid confidence intervals in finite samples as well. The procedure, called *EigenPrism*, is computationally fast and makes no assumptions on coefficient sparsity or knowledge of the noise level. We investigate the width of the EigenPrism confidence intervals, including a comparison with a Bayesian setting in which our interval is just 5% wider than the Bayes credible interval. We are then able to unify the three aforementioned problems by showing that EigenPrism with only minor modifications can make important contributions to all three. We also investigate the robustness of coverage and find that the method applies in practice and in finite samples much more widely than just the case of multivariate Gaussian covariates. Finally, we apply EigenPrism to a genetic data set to estimate the genetic signal-to-noise ratio for a number of continuous phenotypes.

A treatment regime is a deterministic function that dictates personalized treatment based on patients’ individual prognostic information. There is increasing interest in finding optimal treatment regimes, which determine treatment at one or more treatment decision points to maximize expected long-term clinical outcomes, where larger outcomes are preferred. For chronic diseases such as cancer or human immunodeficiency virus infection, survival time is often the outcome of interest, and the goal is to select treatment to maximize survival probability. We propose two non-parametric estimators for the survival function of patients following a given treatment regime involving one or more decisions, i.e. the so-called value. On the basis of data from a clinical or observational study, we estimate an optimal regime by maximizing these estimators for the value over a prespecified class of regimes. Because the value function is very jagged, we introduce kernel smoothing within the estimator to improve performance. Asymptotic properties of the proposed estimators of value functions are established under suitable regularity conditions, and simulation studies evaluate the finite sample performance of the regime estimators. The methods are illustrated by application to data from an acquired immune deficiency syndrome clinical trial.

In many applications involving point pattern data, the Poisson process assumption is unrealistic, with the data exhibiting a more regular spread. Trees, for example, exhibit such repulsion between events because of competition for light and nutrients. Other examples include the locations of biological cells and cities, and the times of neuronal spikes. Given the many applications of repulsive point processes, there is a surprisingly limited literature developing flexible, realistic and interpretable models, as well as efficient inferential methods. We address this gap by developing a modelling framework around the Matérn type III repulsive process. We consider some extensions of the original Matérn type III process for both the homogeneous and the inhomogeneous cases. We also derive the probability density of this generalized Matérn process, allowing us to characterize the conditional distribution of the various latent variables, and leading to a novel and efficient Markov chain Monte Carlo algorithm. We apply our ideas to data sets of spatial locations of trees, nerve fibre cells and Greyhound bus stations.
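For intuition, the original homogeneous Matérn type III construction underlying the framework above can be simulated in a few lines: primary Poisson points receive independent uniform birth times, and a point survives only if no earlier-retained point lies within the hard-core radius. The unit-square window, rate, radius and seed below are arbitrary illustrative choices:

```python
import math, random

def matern_type_iii(rate, R, seed=1):
    """Homogeneous Matérn type III process on the unit square.
    Primary points: Poisson(rate) many, each with a uniform 'birth time'.
    Scanning primaries in birth order, keep a point iff it is farther
    than R from every point already kept."""
    rng = random.Random(seed)
    # draw the Poisson number of primary points (Knuth's method)
    k, p, L = 0, 1.0, math.exp(-rate)
    while True:
        p *= rng.random()
        if p <= L:
            break
        k += 1
    primaries = sorted((rng.random(), rng.random(), rng.random())
                       for _ in range(k))            # sorted by birth time
    kept = []
    for _, x, y in primaries:
        if all((x - a) ** 2 + (y - b) ** 2 > R ** 2 for a, b in kept):
            kept.append((x, y))
    return kept

pts = matern_type_iii(rate=200, R=0.08)
# every pair of retained points respects the hard-core distance
dmin = min(math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:])
print(len(pts), round(dmin, 3))
```

The resulting pattern is more regular than a Poisson pattern of the same intensity, which is exactly the repulsion the abstract describes.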

Statistical node clustering in discrete time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model for its static part with independent Markov chains for the evolution of the node groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edge values). Our approach, motivated by the importance of controlling for label switching issues across the different time steps, focuses on detecting groups characterized by a stable within-group connectivity behaviour. We study identifiability of the model parameters and propose an inference procedure based on a variational expectation–maximization algorithm as well as a model selection criterion to select the number of groups. We carefully discuss our initialization strategy which plays an important role in the method and we compare our procedure with existing procedures on synthetic data sets. We also illustrate our approach on dynamic contact networks: one of encounters between high school students and two others on animal interactions. An implementation of the method is available as an R package called dynsbm.

We propose a heterogeneous simultaneous multiscale change point estimator called ‘H-SMUCE’ for the detection of multiple change points of the signal in a heterogeneous Gaussian regression model. A piecewise constant function is estimated by minimizing the number of change points over the acceptance region of a multiscale test which locally adapts to changes in the variance. The multiscale test is a combination of local likelihood ratio tests which are properly calibrated by scale-dependent critical values to keep a global nominal level *α*, even for finite samples. We show that H-SMUCE controls the error of overestimation and underestimation of the number of change points. For this, new deviation bounds for *F*-type statistics are derived. Moreover, we obtain confidence sets for the whole signal. All results are non-asymptotic and uniform over a large class of heterogeneous change point models. H-SMUCE is fast to compute, achieves the optimal detection rate and estimates the number of change points at almost optimal accuracy for vanishing signals, while still being robust. We compare H-SMUCE with several state of the art methods in simulations and analyse current recordings of a transmembrane protein in the bacterial outer membrane with pronounced heterogeneity for its states. An R-package is available on line.

A semiparametric model is presented that utilizes the dependence between a response and several covariates. We show that this model is optimal when the marginal distributions of the response and the covariates are *known*. This model extends the generalized linear model and the proportional likelihood ratio model when the marginal distributions are *unknown*. New interpretations of known models such as the logistic regression model, density ratio model and selection bias model are obtained in terms of dependence between variables. For estimation of parameters, a simple algorithm is presented which is guaranteed to converge. The algorithm is the same regardless of the choice of distributions for the response and covariates; hence, it can fit a very wide variety of useful models. Asymptotic properties of the estimators of model parameters are derived. Real data examples are discussed to illustrate our approach, and simulation experiments are performed to compare with existing procedures.

We define an isotropic Lévy-driven continuous auto-regressive moving average CARMA(*p*,*q*) random field on ℝ^n as the integral of a radial CARMA kernel with respect to a Lévy sheet. Such fields constitute a parametric family characterized by an auto-regressive polynomial *a* and a moving average polynomial *b* having zeros in both the left and the right complex half-planes. They extend the *well-balanced Ornstein–Uhlenbeck process* of Schnurr and Woerner to a well-balanced CARMA process in one dimension (with a much richer class of autocovariance functions) and to an isotropic CARMA random field on ℝ^n for *n*>1. We derive second-order properties of these random fields and extend the results to a larger class of anisotropic CARMA random fields. If the driving Lévy sheet is compound Poisson it is trivial to simulate the corresponding random field on any bounded subset of ℝ^n. A method for joint estimation of the CARMA kernel parameters and knot locations is proposed for compound-Poisson-driven fields and is illustrated by applications to simulated data and Tokyo land price data.

We consider the linear regression model with observation error in the design. In this setting, we allow the number of covariates to be much larger than the sample size. Several new estimation methods have recently been introduced for this model. Indeed, the standard lasso estimator or Dantzig selector turns out to be unreliable when only noisy regressors are available, which is quite common in practice. In this work, we propose and analyse a new estimator for the errors-in-variables model. Under suitable sparsity assumptions, we show that this estimator attains the minimax efficiency bound. Importantly, this estimator can be written as a second-order cone programming minimization problem, which can be solved numerically in polynomial time. Finally, we show that the procedure introduced by Rosenbaum and Tsybakov, which is almost optimal in a minimax sense, can be efficiently computed by a single linear programming problem despite non-convexities.

Spatial regression is an important predictive tool in many scientific applications and an additive model provides a flexible regression relationship between predictors and a response variable. We develop a regularized variable selection technique for building a spatial additive model. We find that the methods developed for independent data do not work well for spatially dependent data. This motivates us to propose a spatially weighted ℓ2-error norm with a group lasso type of penalty to select additive components in spatial additive models. We establish the selection consistency of the approach proposed where the penalty parameter depends on several factors, such as the order of approximation of additive components, characteristics of the spatial weight and spatial dependence. An extensive simulation study provides a vivid picture of the effects of dependent data structure and choice of a spatial weight on selection results as well as the asymptotic behaviour of the estimators. As an illustrative example, the method is applied to lung cancer mortality data over the period 2000–2005, obtained from the ‘Surveillance, epidemiology, and end results’ programme, National Cancer Institute, USA.

Practitioners are interested in not only the average causal effect of a treatment on the outcome but also the underlying causal mechanism in the presence of an intermediate variable between the treatment and outcome. However, in many cases we cannot randomize the intermediate variable, resulting in sample selection problems even in randomized experiments. Therefore, we view randomized experiments with intermediate variables as semiobservational studies. In parallel with the analysis of observational studies, we provide a theoretical foundation for conducting objective causal inference with an intermediate variable under the principal stratification framework, with principal strata defined as the joint potential values of the intermediate variable. Our strategy constructs weighted samples based on principal scores, defined as the conditional probabilities of the latent principal strata given covariates, without access to any outcome data. This principal stratification analysis yields robust causal inference without relying on any model assumptions on the outcome distributions. We also propose approaches to conducting sensitivity analysis for violations of the ignorability and monotonicity assumptions, which are crucial but untestable identification assumptions in our theory. When the assumptions required by the classical instrumental variable analysis cannot be justified by background knowledge or cannot be made because of scientific questions of interest, our strategy serves as a useful alternative tool to deal with intermediate variables. We illustrate our methodologies by using two real data examples and find scientifically meaningful conclusions.

The non-causal auto-regressive process with heavy-tailed errors has non-linear causal dynamics, which allow for local explosion or asymmetric cycles that are often observed in economic and financial time series. It provides a new model for multiple local explosions in a strictly stationary framework. The causal predictive distribution displays surprising features, such as higher moments than for the marginal distribution, or the presence of a unit root in the Cauchy case. Aggregating such models can yield complex dynamics with local and global explosion as well as variation in the rate of explosion. The asymptotic behaviour of a vector of sample auto-correlations is studied in a semiparametric non-causal AR(1) framework with Pareto-like tails, and diagnostic tests are proposed. Empirical results based on the Nasdaq composite price index are provided.
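A minimal simulation illustrates the local explosion behaviour described above. The sketch below simulates a strictly stationary non-causal AR(1) with standard Cauchy errors by running the recursion backward in time; the model order, coefficient and seed are illustrative assumptions, not the paper's empirical specification:

```python
import math, random

def noncausal_ar1(rho, n, seed=3):
    """Strictly stationary non-causal AR(1): X_t = rho * X_{t+1} + eps_t,
    |rho| < 1, with i.i.d. standard Cauchy errors. The stationary
    solution X_t = sum_{j>=0} rho^j eps_{t+j} depends on *future*
    errors, so we simulate by a backward recursion with a burn-in
    absorbing the terminal boundary condition."""
    rng = random.Random(seed)
    burn = 200
    # standard Cauchy via inverse-CDF of a uniform draw
    eps = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n + burn)]
    x = [0.0] * (n + burn)
    for t in range(n + burn - 2, -1, -1):   # backward in time
        x[t] = rho * x[t + 1] + eps[t]
    return x[:n]

path = noncausal_ar1(rho=0.8, n=500)
# read forward in time, the path builds up and collapses: 'bubbles'
print(max(abs(v) for v in path))
```

Reading the simulated path forward in time shows the asymmetric build-up-and-crash cycles that the causal representation of the process generates.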

We consider causal mediation analysis when exposures and mediators vary over time. We give non-parametric identification results, discuss parametric implementation and also provide a weighting approach to direct and indirect effects based on combining the results of two marginal structural models. We also discuss how our results give rise to a causal interpretation of the effect estimates produced from longitudinal structural equation models. When there are time varying confounders affected by prior exposure and a mediator, natural direct and indirect effects are not identified. However, we define a randomized interventional analogue of natural direct and indirect effects that is identified in this setting. We refer to the formula that identifies these effects as the ‘mediational *g*-formula’. When there is no mediation, the mediational *g*-formula reduces to Robins's regular *g*-formula for longitudinal data. When there are no time varying confounders affected by prior exposure and mediator values, then the mediational *g*-formula reduces to a longitudinal version of Pearl's mediation formula. However, the mediational *g*-formula itself can accommodate both mediation and time varying confounders and constitutes a general approach to mediation analysis with time varying exposures and mediators.

We propose novel optimal designs for longitudinal data for the common situation where the resources for longitudinal data collection are limited, by determining the optimal locations in time where measurements should be taken. As for all optimal designs, some prior information is needed to implement the optimal designs proposed. We demonstrate that this prior information may come from a pilot longitudinal study that has irregularly measured and noisy measurements, where for each subject one has available a small random number of repeated measurements that are randomly located on the domain. A second possibility of interest is that a pilot study consists of densely measured functional data and one intends to take only a few measurements at strategically placed locations in the domain for the future collection of similar data. We construct optimal designs by targeting two criteria: optimal designs to recover the unknown underlying smooth random trajectory for each subject from a few optimally placed measurements such that squared prediction errors are minimized; optimal designs that minimize prediction errors for functional linear regression with functional or longitudinal predictors and scalar responses, again from a few optimally placed measurements. The optimal designs proposed address the need for sparse data collection when planning longitudinal studies, by taking advantage of the close connections between longitudinal and functional data analysis. We demonstrate in simulations that the designs perform considerably better than randomly chosen design points and include a motivating data example from the Baltimore Longitudinal Study of Aging. The designs are shown to have an asymptotic optimality property.

It is common in multiarm randomized trials that the outcome of interest is ‘truncated by death’, meaning that it is only observed or well defined conditional on an intermediate outcome. In this case, in addition to pairwise contrasts, joint inference for all treatment arms is also of interest. Under a monotonicity assumption we present methods for both pairwise and joint causal analyses of ordinal treatments and binary outcomes in the presence of truncation by death. We illustrate via examples the appropriateness of our assumptions in different scientific contexts.

Statistical inference for sample correlation matrices is important in high dimensional data analysis. Motivated by this, the paper establishes a new central limit theorem for a linear spectral statistic of high dimensional sample correlation matrices for the case where the dimension *p* and the sample size *n* are comparable. This result is of independent interest in large dimensional random-matrix theory. We also further investigate the sample correlation matrices of a high dimensional vector whose elements have a special correlated structure, and the corresponding central limit theorem is developed. Meanwhile, we apply the linear spectral statistic to an independence test for *p* random variables, and then to an equivalence test for *p* factor loadings and *n* factors in a factor model. The finite sample performance of the test proposed shows its applicability and effectiveness in practice. An empirical application to test the independence of household incomes from various cities in China is also conducted.

We propose a novel sparse tensor decomposition method, namely the tensor truncated power method, that incorporates variable selection in the estimation of decomposition components. The sparsity is achieved via an efficient truncation step embedded in the tensor power iteration. Our method applies to a broad family of high dimensional latent variable models, including high dimensional Gaussian mixtures and mixtures of sparse regressions. A thorough theoretical investigation is further conducted. In particular, we show that the final decomposition estimator is guaranteed to achieve a local statistical rate, and we further strengthen it to the global statistical rate by introducing a proper initialization procedure. In high dimensional regimes, the statistical rate obtained significantly improves on those of existing non-sparse decomposition methods. The empirical advantages of the tensor truncated power method are confirmed in extensive simulation results and two real applications of click-through rate prediction and high dimensional gene clustering.
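The truncation step embedded in the power iteration can be sketched for a symmetric third-order tensor. This toy example uses a noiseless rank 1 tensor and a naive uniform initialization, rather than the careful initialization procedure the theory requires; the dimensions and sparsity level are arbitrary:

```python
def tensor_truncated_power(T, s, iters=50):
    """Rank-1 sparse decomposition of a symmetric 3-way tensor by
    power iteration with a hard-truncation step: contract the tensor
    against the current vector twice, keep the s largest-magnitude
    entries, zero the rest, renormalize."""
    d = len(T)
    u = [1.0 / d] * d  # crude initialization (assumption for this sketch)
    for _ in range(iters):
        v = [sum(T[i][j][k] * u[j] * u[k] for j in range(d) for k in range(d))
             for i in range(d)]
        keep = set(sorted(range(d), key=lambda i: -abs(v[i]))[:s])
        v = [v[i] if i in keep else 0.0 for i in range(d)]
        nrm = sum(x * x for x in v) ** 0.5 or 1.0
        u = [x / nrm for x in v]
    return u

# toy example: T = lam * a (x) a (x) a with a 2-sparse unit vector
d, s, lam = 6, 2, 5.0
a = [0.8, 0.6, 0.0, 0.0, 0.0, 0.0]
T = [[[lam * a[i] * a[j] * a[k] for k in range(d)] for j in range(d)]
     for i in range(d)]
u = tensor_truncated_power(T, s)
print([round(x, 3) for x in u])  # recovers a (up to sign)
```

In the noiseless rank 1 case a single iteration already lands on the sparse factor; the interesting regime in the paper is of course the noisy, high dimensional one.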

In the Gaussian linear regression model (with unknown mean and variance), we show that the standard confidence set for one or two regression coefficients is admissible in the sense of Joshi. This solves a long-standing open problem in mathematical statistics, and this has important implications on the performance of modern inference procedures post model selection or post shrinkage, particularly in situations where the number of parameters is larger than the sample size. As a technical contribution of independent interest, we introduce a new class of conjugate priors for the Gaussian location–scale model.

A non-parametric extension of control variates is presented. These leverage gradient information on the sampling density to achieve substantial variance reduction. It is not required that the sampling density be normalized. The novel contribution of this work is based on two important insights: a trade-off between random sampling and deterministic approximation and a new gradient-based function space derived from Stein's identity. Unlike classical control variates, our estimators improve rates of convergence, often requiring orders of magnitude fewer simulations to achieve a fixed level of precision. Theoretical and empirical results are presented, the latter focusing on integration problems arising in hierarchical models and models based on non-linear ordinary differential equations.
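The core idea, a zero-mean control variate built from the score of the sampling density via Stein's identity, can be sketched for a one-dimensional Gaussian target. This is a simplified parametric version with a single fitted coefficient, not the paper's non-parametric control functionals; the target, test function and sample size are illustrative:

```python
import random, statistics

def stein_cv_estimate(f, u, du, score, xs):
    """Monte Carlo estimate of E[f(X)] using the zero-mean control
    variate phi(x) = u'(x) + u(x)*score(x) (a Langevin-Stein operator
    applied to u; E[phi] = 0 by Stein's identity), with the coefficient
    beta fitted by least squares on the same sample."""
    phi = [du(x) + u(x) * score(x) for x in xs]
    fx = [f(x) for x in xs]
    mp, mf = statistics.fmean(phi), statistics.fmean(fx)
    beta = (sum((p - mp) * (y - mf) for p, y in zip(phi, fx))
            / (sum((p - mp) ** 2 for p in phi) or 1.0))
    # phi has known mean 0, so no centring term is needed here
    return statistics.fmean([y - beta * p for y, p in zip(fx, phi)])

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
plain = statistics.fmean(x * x for x in xs)           # vanilla Monte Carlo
cv = stein_cv_estimate(lambda x: x * x,               # f(x) = x^2
                       lambda x: x, lambda x: 1.0,    # u(x) = x, u'(x) = 1
                       lambda x: -x, xs)              # score of N(0, 1)
print(plain, cv)
```

For this particular choice of u the control variate is 1 − x², which is perfectly (negatively) correlated with f, so the corrected estimate recovers E[X²] = 1 up to rounding; generic integrands get the partial variance reduction that the fitted coefficient delivers.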

Most time series that are encountered in practice contain non-zero trend, yet textbook approaches to time series analysis are typically focused on zero-mean stationary auto-regressive moving average (ARMA) processes. Trend is often estimated by *ad hoc* methods and subtracted from the time series, and the residuals are then treated as the true ARMA noise for data analysis and inference, including parameter estimation, lag selection and prediction. We propose a theoretically justified two-step method to analyse time series consisting of a smooth trend function and an ARMA error term, which is computationally efficient and easy for practitioners to implement. The trend is estimated by *B*-spline regression, and the maximum likelihood estimator based on residuals is shown to be oracally efficient in the sense that it is asymptotically as efficient as if the true trend function were known and then removed to obtain the ARMA errors. In addition, consistency of the Bayesian information criterion for model selection is established for the detrended residual sequence. Finite sample performance of the procedure is illustrated by simulation studies and real data analysis.
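The two-step logic (detrend, then fit an ARMA model to the residuals) can be sketched on simulated data. For brevity this sketch substitutes a straight-line least-squares trend for the paper's B-spline regression and a Yule–Walker AR(1) fit for full maximum likelihood; all parameter values are illustrative:

```python
import random

def detrend_then_ar1(y, t):
    """Two-step analysis: (i) estimate a smooth trend, here a plain
    least-squares straight line (a simple stand-in for B-spline
    regression); (ii) fit AR(1) to the detrended residuals by
    Yule-Walker / least squares."""
    n = len(y)
    mt, my = sum(t) / n, sum(y) / n
    slope = (sum((a - mt) * (b - my) for a, b in zip(t, y))
             / sum((a - mt) ** 2 for a in t))
    icpt = my - slope * mt
    r = [b - (icpt + slope * a) for a, b in zip(t, y)]   # residuals
    phi = (sum(r[i] * r[i - 1] for i in range(1, n))
           / sum(x * x for x in r[:-1]))                 # lag-1 AR fit
    return slope, phi

rng = random.Random(42)
n, phi_true = 2000, 0.6
e, y = 0.0, []
for i in range(n):
    e = phi_true * e + rng.gauss(0.0, 1.0)   # AR(1) error process
    y.append(1.0 + 0.01 * i + e)             # linear trend + error
slope_hat, phi_hat = detrend_then_ar1(y, list(range(n)))
print(round(slope_hat, 4), round(phi_hat, 3))
```

The point of the oracle-efficiency result is that, with a suitable nonparametric trend estimator, the second-step ARMA estimates behave asymptotically as if the residuals really were the unobserved errors.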

The modifiable areal unit problem and the ecological fallacy are known problems that occur when modelling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By ‘regionalization’ we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error, which we minimize to obtain an optimal regionalization. To define the criterion we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen–Loève expansion. This relationship between the criterion for spatial aggregation error and the multiscale Karhunen–Loève expansion leads to illuminating theoretical developments including connections between spatial aggregation error, squared prediction error, spatial variance and a novel extension of Obled–Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two data sets: one using the American Community Survey and one related to environmental ocean winds.

Using properties of shuffles of copulas and tools from combinatorics we solve the open question about the exact region Ω determined by all possible values of Kendall's *τ* and Spearman's *ρ*. In particular, we prove that the well-known inequality established by Durbin and Stuart in 1951 is not sharp outside a countable set, give a simple analytic characterization of Ω in terms of a continuous, strictly increasing piecewise concave function and show that Ω is compact and simply connected, but not convex. The results also show that for each (*x*,*y*) ∈ Ω there are mutually completely dependent random variables *X* and *Y* whose *τ*- and *ρ*-values coincide with *x* and *y* respectively.

Sparse regression techniques have been popular in recent years because of their ability to handle high dimensional data with built-in variable selection. The lasso is perhaps one of the most well-known examples. Despite intensive work in this direction, how to provide valid inference for sparse regularized methods remains a challenging statistical problem. We take a unique point of view on this problem and propose to make use of stochastic variational inequality techniques in optimization to derive confidence intervals and regions for the lasso. Some theoretical properties of the procedure are obtained. Both simulated and real data examples are used to demonstrate the performance of the method.

The paper considers the computer model calibration problem and provides a general frequentist solution. Under the framework proposed, the data model is semiparametric with a non-parametric discrepancy function which accounts for any discrepancy between physical reality and the computer model. In an attempt to solve a fundamentally important (but often ignored) identifiability issue between the computer model parameters and the discrepancy function, the paper proposes a new and identifiable parameterization of the calibration problem. It also develops a two-step procedure for estimating all the relevant quantities under the new parameterization. This estimation procedure is shown to enjoy excellent rates of convergence and can be straightforwardly implemented with existing software. For uncertainty quantification, bootstrapping is adopted to construct confidence regions for the quantities of interest. The practical performance of the methodology is illustrated through simulation examples and an application to a computational fluid dynamics model.

We propose a multi-resolution scanning approach to identifying two-sample differences. Windows of multiple scales are constructed through nested dyadic partitioning on the sample space and a hypothesis regarding the two-sample difference is defined on each window. Instead of testing the hypotheses on different windows independently, we adopt a joint graphical model, namely a Markov tree, on the null or alternative states of these hypotheses to incorporate spatial correlation across windows. The induced dependence allows borrowing strength across nearby and nested windows, which we show is critical for detecting high resolution local differences. We evaluate the performance of the method through simulation and show that it substantially outperforms other state of the art two-sample tests when the two-sample difference is local, involving only a small subset of the data. We then apply it to a flow cytometry data set from immunology, in which it successfully identifies highly local differences. In addition, we show how to control properly for multiple testing in a decision theoretic approach as well as how to summarize and report the inferred two-sample difference. We also construct hierarchical extensions of the framework to incorporate adaptivity into the construction of the scanning windows to improve inference further.

A formal likelihood ratio hypothesis test for the validity of a parametric regression function is proposed, using a large dimensional, non-parametric *double-cone* alternative. For example, the test against a constant function uses the alternative of increasing or decreasing regression functions, and the test against a linear function uses the convex or concave alternative. The test proposed is exact and unbiased and the critical value is easily computed. The power of the test increases to 1 as the sample size increases, under very mild assumptions—even when the alternative is misspecified, i.e. the power of the test converges to 1 for any true regression function that deviates (in a non-degenerate way) from the parametric null hypothesis. We also formulate tests for the linear *versus* partial linear model and consider the special case of the additive model. Simulations show that our procedure performs consistently well when compared with other methods. Although the alternative fit is non-parametric, no tuning parameters are involved. Supplementary materials with proofs and technical details are available online.

Sampling from various kinds of distribution is an issue of paramount importance in statistics since it is often the key ingredient for constructing estimators, test procedures or confidence intervals. In many situations, exact sampling from a given distribution is impossible or computationally expensive and, therefore, one needs to resort to approximate sampling strategies. However, there is no well-developed theory providing meaningful non-asymptotic guarantees for the approximate sampling procedures, especially in high dimensional problems. The paper makes some progress in this direction by considering the problem of sampling from a distribution having a smooth and log-concave density defined on ℝ^*p*, for some integer *p*>0. We establish non-asymptotic bounds for the error of approximating the target distribution by the distribution obtained by the Langevin Monte Carlo method and its variants. We illustrate the effectiveness of the established guarantees with various experiments. Underlying our analysis are insights from the theory of continuous time diffusion processes, which may be of interest beyond the framework of log-concave densities that are considered in the present work.
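The basic Langevin Monte Carlo method analysed above is the unadjusted Langevin algorithm, whose one-step update is x ← x + h ∇log π(x) + √(2h) ξ with ξ ~ N(0, I). A minimal sketch, with a standard Gaussian target on ℝ⁵ as an assumed example (the function name and step size are illustrative, and the paper's refined variants are not reproduced here):

```python
import numpy as np

def ula_sample(grad_logpi, x0, step, n_iter, rng):
    """Unadjusted Langevin algorithm:
    x <- x + step * grad_logpi(x) + sqrt(2 * step) * N(0, I)."""
    x = np.array(x0, dtype=float)
    out = np.empty((n_iter, x.size))
    for k in range(n_iter):
        x = x + step * grad_logpi(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.size)
        out[k] = x
    return out

# target: standard Gaussian on R^5 (smooth, log-concave), so grad log pi(x) = -x
rng = np.random.default_rng(0)
draws = ula_sample(lambda x: -x, np.zeros(5), step=0.05, n_iter=20000, rng=rng)
```

The discretization step h introduces a bias in the invariant distribution (here the stationary variance is 1/(1 − h/2) rather than 1), which is exactly the kind of non-asymptotic approximation error the bounds above quantify.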

Data subject to heavy-tailed errors are commonly encountered in various scientific fields. To address this problem, procedures based on quantile regression and least absolute deviation regression have been developed in recent years. These methods essentially estimate the conditional median (or quantile) function. They can be very different from the conditional mean functions, especially when distributions are asymmetric and heteroscedastic. How can we efficiently estimate the mean regression functions in ultrahigh dimensional settings when only the second moment exists? To solve this problem, we propose a penalized Huber loss with diverging parameter to reduce biases created by the traditional Huber loss. Such a penalized robust approximate (RA) quadratic loss will be called the RA lasso. In the ultrahigh dimensional setting, where the dimensionality can grow exponentially with the sample size, our results reveal that the RA lasso estimator attains the same rate as the optimal rate under the light tail situation. We further study the computational convergence of the RA lasso and show that the composite gradient descent algorithm indeed produces a solution that admits the same optimal rate after sufficient iterations. As a by-product, we also establish the concentration inequality for estimating the population mean when only the second moment exists. We compare the RA lasso with other regularized robust estimators based on quantile regression and least absolute deviation regression. Extensive simulation studies demonstrate the satisfactory finite sample performance of the RA lasso.
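The composite gradient descent algorithm mentioned above alternates a gradient step on the robust loss with the proximal map of the ℓ₁ penalty (soft thresholding). A minimal sketch under assumed settings (a fixed robustification parameter α = 2 and illustrative tuning values, rather than the paper's diverging-parameter theory):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ra_lasso(X, y, lam, alpha, step=0.5, n_iter=500):
    """Composite gradient descent for the l1-penalized Huber loss: a gradient
    step on the robust loss, then soft thresholding with threshold step*lam."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        grad = -X.T @ np.clip(r, -alpha, alpha) / n   # Huber-loss gradient
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# heavy-tailed (t_2) noise, sparse truth beta = (3, 0, ..., 0)
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + rng.standard_t(df=2, size=n)
beta_hat = ra_lasso(X, y, lam=0.1, alpha=2.0)
```

The clipped residuals keep individual heavy-tailed errors from dominating the gradient, which is why the estimator still targets the conditional mean with only a second moment.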

The cumulative incidence is the probability of failure from the cause of interest over a certain time period in the presence of other risks. A semiparametric regression model proposed by Fine and Gray has become the method of choice for formulating the effects of covariates on the cumulative incidence. Its estimation, however, requires modelling of the censoring distribution and is not statistically efficient. We present a broad class of semiparametric transformation models which extends the Fine and Gray model, and we allow for unknown causes of failure. We derive the non-parametric maximum likelihood estimators and develop simple and fast numerical algorithms using the profile likelihood. We establish the consistency, asymptotic normality and semiparametric efficiency of the non-parametric maximum likelihood estimators. In addition, we construct graphical and numerical procedures to evaluate and select models. Finally, we demonstrate the advantages of the proposed methods over the existing methods through extensive simulation studies and an application to a major study on bone marrow transplantation.

A new class of dependent random measures which we call *compound random measures* is proposed and the use of normalized versions of these random measures as priors in Bayesian non-parametric mixture models is considered. Their tractability allows the properties of both compound random measures and normalized compound random measures to be derived. In particular, we show how compound random measures can be constructed with gamma, *σ*-stable and generalized gamma process marginals. We also derive several forms of the Laplace exponent and characterize dependence through both the Lévy copula and the correlation function. An augmented Pólya urn scheme sampler and a slice sampler are described for posterior inference when a normalized compound random measure is used as the mixing measure in a non-parametric mixture model and a data example is discussed.

We propose new methodology for two-sample testing in high dimensional models. The methodology provides a high dimensional analogue to the classical likelihood ratio test and is applicable to essentially any model class where sparse estimation is feasible. Sparse structure is used in the construction of the test statistic. In the general case, testing then involves non-nested model comparison, and we provide asymptotic results for the high dimensional setting. We put forward computationally efficient procedures based on data splitting, including a variant of the permutation test that exploits sparse structure. We illustrate the general approach in two-sample comparisons of high dimensional regression models (‘differential regression’) and graphical models (‘differential network’), showing results on simulated data as well as data from two recent cancer studies.

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semiparametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood, which prevent efficient implementation of traditional Markov chain Monte Carlo methods such as random-walk Metropolis–Hastings sampling, with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

For the analysis of clustered survival data, two different types of model that take the association into account are commonly used: frailty models and copula models. Frailty models assume that, conditionally on a frailty term for each cluster, the hazard functions of individuals within that cluster are independent. These unknown frailty terms with their imposed distribution are used to express the association between the different individuals in a cluster. Copula models in contrast assume that the joint survival function of the individuals within a cluster is given by a copula function, evaluated in the marginal survival function of each individual. It is the copula function which describes the association between the lifetimes within a cluster. A major disadvantage of existing copula models compared with frailty models is that the cluster sizes must be small and equal for the estimation procedures for the various model parameters to remain manageable. We describe a copula model for clustered survival data where the clusters are allowed to be moderate to large and varying in size by considering the class of Archimedean copulas with completely monotone generator. We develop both one- and two-stage estimators for the copula parameters. Furthermore we show the consistency and asymptotic normality of these estimators. Finally, we perform a simulation study to investigate the finite sample properties of the estimators. We illustrate the method on a data set containing the time to first insemination in cows, with cows clustered in herds.

We propose a semiparametric latent Gaussian copula model for modelling mixed multivariate data, which contain both continuous and binary variables. The model assumes that the observed binary variables are obtained by dichotomizing latent variables that satisfy the Gaussian copula distribution. The goal is to infer the conditional independence relationship between the latent random variables, based on the observed mixed data. Our work has two main contributions: we propose a unified rank-based approach to estimate the correlation matrix of latent variables; we establish the concentration inequality of the proposed rank-based estimator. Consequently, our methods achieve the same rates of convergence for precision matrix estimation and graph recovery, as if the latent variables were observed. The methods proposed are numerically assessed through extensive simulation studies, and real data analysis.

Envelope tests are a popular tool in spatial statistics, where they are used in goodness-of-fit testing. These tests graphically compare an empirical function *T*(*r*) with its simulated counterparts from the null model. However, the type I error probability *α* is conventionally controlled for a fixed distance *r* only, whereas the functions are inspected on an interval of distances *I*. In this study, we propose two approaches related to Barnard's Monte Carlo test for building global envelope tests on *I*: ordering the empirical and simulated functions on the basis of their *r*-wise ranks among each other, and the construction of envelopes for a deviation test. These new tests allow the *a priori* choice of the global *α* and they yield *p*-values. We illustrate these tests by using simulated and real point pattern data.
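The first of the two approaches above, ordering the curves by their *r*-wise ranks, can be sketched as follows. This is a simplified illustration with an assumed two-sided "extreme rank" depth and synthetic curves, not the exact construction of the study: each curve's depth is the minimum, over all distances *r*, of its pointwise rank from either tail among the *s*+1 curves, and the *p*-value compares the observed curve's depth with the simulated ones.

```python
import numpy as np

def extreme_rank_pvalue(t_obs, t_sim):
    """Global envelope p-value via extreme ranks: depth of a curve is the
    minimum over r of its two-sided pointwise rank among all s+1 curves;
    the (conservative) p-value is the fraction of curves at least as
    extreme as the data curve (stacked first)."""
    T = np.vstack([t_obs, t_sim])                        # (s+1, n_r)
    n = T.shape[0]
    lo = np.argsort(np.argsort(T, axis=0), axis=0) + 1   # rank from below, 1..n
    depth = np.minimum(lo, n + 1 - lo).min(axis=1)       # two-sided extreme rank
    return float(np.mean(depth <= depth[0]))

# a strongly shifted data curve against 199 simulated null curves
rng = np.random.default_rng(0)
t_sim = rng.standard_normal((199, 1)) + 0.05 * rng.standard_normal((199, 50))
t_obs = 5.0 + 0.05 * rng.standard_normal(50)
pval = extreme_rank_pvalue(t_obs, t_sim)
```

Because the ordering is global over the whole interval *I*, the resulting *p*-value controls the type I error for the interval rather than for a single fixed *r*.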

Hierarchical models allow for heterogeneous behaviours in a population while simultaneously borrowing estimation strength across all subpopulations. Unfortunately, existing likelihood-based methods for fitting hierarchical models have high computational demands, and these demands have limited their adoption in large-scale prediction and inference problems. The paper proposes a moment-based procedure for estimating the parameters of a hierarchical model which has its roots in a method originally introduced by Cochran in 1937. The method trades statistical efficiency for computational efficiency. It gives consistent parameter estimates, competitive prediction error performance and substantial computational improvements. When applied to a large-scale recommender system application and compared with a standard maximum likelihood procedure, the method delivers competitive prediction performance while reducing the sequential computation time from hours to minutes.
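The flavour of Cochran-style moment matching can be seen in the simplest case, a balanced one-way random-effects model *y*ᵢⱼ = *μ* + *a*ᵢ + *e*ᵢⱼ. The sketch below is a toy version under that assumption, not the paper's general procedure: the within-group variance estimates *σ*²ₑ directly, and the variance of the group means, whose expectation is *σ*²ₐ + *σ*²ₑ/*n*, is matched to recover *σ*²ₐ with no iterative likelihood optimization.

```python
import numpy as np

def moment_fit(Y):
    """Method-of-moments fit of the balanced one-way random-effects model
    y_ij = mu + a_i + e_ij, a_i ~ (0, s2_a), e_ij ~ (0, s2_e)."""
    k, n = Y.shape
    mu_hat = Y.mean()
    s2_e = Y.var(axis=1, ddof=1).mean()        # pooled within-group variance
    between = Y.mean(axis=1).var(ddof=1)       # variance of the k group means
    s2_a = max(between - s2_e / n, 0.0)        # E[between] = s2_a + s2_e / n
    return mu_hat, s2_a, s2_e

# simulate k = 500 groups of size n = 20 with mu = 5, s2_a = 2, s2_e = 1
rng = np.random.default_rng(0)
k, n = 500, 20
a = rng.normal(0.0, np.sqrt(2.0), size=(k, 1))
Y = 5.0 + a + rng.standard_normal((k, n))
mu_hat, s2_a_hat, s2_e_hat = moment_fit(Y)
```

Every quantity is a closed-form summary statistic, which is the source of the computational savings relative to likelihood-based fitting.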

The paper develops a general regression framework for the analysis of manifold-valued response in a Riemannian symmetric space (RSS) and its association with multiple covariates of interest, such as age or gender, in Euclidean space. Such RSS-valued data arise frequently in medical imaging, surface modelling and computer vision, among many other fields. We develop an intrinsic regression model solely based on an intrinsic conditional moment assumption, avoiding specifying any parametric distribution in RSS. We propose various link functions to map from the Euclidean space of multiple covariates to the RSS of responses. We develop a two-stage procedure to calculate the parameter estimates and determine their asymptotic distributions. We construct the Wald and geodesic test statistics to test hypotheses of unknown parameters. We systematically investigate the geometric invariant property of these estimates and test statistics. Simulation studies and a real data analysis are used to evaluate the finite sample properties of our methods.

A common feature in large-scale scientific studies is that signals are sparse and it is desirable to narrow down significantly the focus to a much smaller subset in a sequential manner. We consider two related data screening problems: one is to find the smallest subset such that it virtually contains all signals and another is to find the largest subset such that it essentially contains only signals. These screening problems are closely connected to but distinct from the more conventional signal detection or multiple-testing problems. We develop phase transition diagrams to characterize the fundamental limits in simultaneous inference and derive data-driven screening procedures which control the error rates with near optimality properties. Applications in the context of multistage high throughput studies are discussed.

The analysis of spatial data is based on a set of assumptions, which in practice need to be checked. A commonly used assumption is that the spatial random field is second-order stationary. In the paper, a test for spatial stationarity for irregularly sampled data is proposed. The test is based on a transformation of the data (a type of Fourier transform), where the correlations between the transformed data are close to 0 if the random field is second-order stationary. However, if the random field is second-order non-stationary, this property does not hold. Using this property a test for second-order stationarity is constructed. The test statistic is based on measuring the degree of correlation in the transformed data. The asymptotic sampling properties of the test statistic are derived under both stationarity and non-stationarity of the random field. These results motivate a graphical tool which allows a visual representation of the non-stationary features. The method is illustrated with simulations and a real data example.

We consider the problem of Laplace deconvolution with noisy discrete non-equally spaced observations on a finite time interval. We propose a new method for Laplace deconvolution which is based on expansions of the convolution kernel, the unknown function and the observed signal over a Laguerre functions basis (which acts as a surrogate eigenfunction basis of the Laplace convolution operator) using a regression setting. The expansion results in a small system of linear equations with the matrix of the system being triangular and Toeplitz. Because of this triangular structure, there is a common number *m* of terms in the function expansions to control, which is realized via a complexity penalty. The advantage of this methodology is that it leads to very fast computations, produces no boundary effects due to extension at zero and cut-off at *T* and provides an estimator with the risk within a logarithmic factor of *m* of the oracle risk. We emphasize that we consider the true observational model with possibly non-equispaced observations which are available on a finite interval of length *T*, a model which appears in many different contexts, and we account for the bias associated with this model (which is not present in the case *T* → ∞). The study is motivated by perfusion imaging using a short injection of contrast agent, a procedure which is applied for medical assessment of microcirculation within tissues such as cancerous tumours. The presence of a tuning parameter *a* allows the choice of the most advantageous time units, so that both the kernel and the unknown right-hand side of the equation are well represented for the deconvolution. The methodology is illustrated by an extensive simulation study and a real data example which confirms that the technique proposed is fast, efficient, accurate, usable from a practical point of view and very competitive.

Consider the extreme quantile region induced by the half-space depth function HD of the form *Q* = {*x* ∈ ℝ^*d*: HD(*x*, *P*) ⩽ *β*}, such that *P*(*Q*) = *p* for a given, very small *p*>0. Since this involves extrapolation outside the data cloud, this region can hardly be estimated through a fully non-parametric procedure. Using extreme value theory we construct a natural semiparametric estimator of this quantile region and prove a refined consistency result. A simulation study clearly demonstrates the good performance of our estimator. We use the procedure for risk management by applying it to stock market returns.

Simulation from the truncated multivariate normal distribution in high dimensions is a recurrent problem in statistical computing and is typically only feasible by using approximate Markov chain Monte Carlo sampling. We propose a minimax tilting method for exact independently and identically distributed data simulation from the truncated multivariate normal distribution. The new methodology provides both a method for simulation and an efficient estimator of hitherto intractable Gaussian integrals. We prove that the estimator has a rare vanishing relative error asymptotic property. Numerical experiments suggest that the scheme proposed is accurate in a wide range of set-ups for which competing estimation schemes fail. We give an application to exact independently and identically distributed data simulation from the Bayesian posterior of the probit regression model.

We investigate a longitudinal data model with non-parametric regression functions that may vary across the observed individuals. In a variety of applications, it is natural to impose a group structure on the regression curves. Specifically, we may suppose that the observed individuals can be grouped into a number of classes whose members all share the same regression function. We develop a statistical procedure to estimate the unknown group structure from the data. Moreover, we derive the asymptotic properties of the procedure and investigate its finite sample performance by means of a simulation study and a real data example.

Different dependence scenarios can arise in multivariate extremes, entailing careful selection of an appropriate class of models. In bivariate extremes, the variables are either asymptotically dependent or are asymptotically independent. Most available statistical models suit one or other of these cases, but not both, resulting in a stage in the inference that is unaccounted for but can substantially impact subsequent extrapolation. Existing modelling solutions to this problem are either applicable only on subdomains or appeal to multiple limit theories. We introduce a unified representation for bivariate extremes that encompasses a wide variety of dependence scenarios and applies when at least one variable is large. Our representation motivates a parametric model that encompasses both dependence classes. We implement a simple version of this model and show that it performs well in a range of settings.

We introduce a simple and interpretable model for functional data analysis in situations where the observations at each location are functional rather than scalar. This new approach is based on a tensor product representation of the function-valued process and utilizes eigenfunctions of marginal kernels. The resulting marginal principal components and product principal components are shown to have nice properties. Given a sample of independent realizations of the underlying function-valued stochastic process, we propose straightforward fitting methods to obtain the components of this model and to establish asymptotic consistency and rates of convergence for the estimates proposed. The methods are illustrated by modelling the dynamics of annual fertility profile functions for 17 countries. This analysis demonstrates that the approach proposed leads to insightful interpretations of the model components and interesting conclusions.

The paper develops inferential methodology for detecting a change in the annual pattern of an environmental variable measured at fixed locations in a spatial region. Using a framework built on functional data analysis, we model observations as a collection of function-valued time sequences available at many sites. Each sequence is modelled as an annual mean function, which may change, plus a sequence of error functions, which are spatially correlated. The test statistics extend the cumulative sum paradigm to this more complex setting. Their asymptotic distributions are not parameter free because of the spatial dependence but can be effectively approximated by Monte Carlo simulations. The new methodology is applied to precipitation data. Its finite sample performance is assessed by a simulation study.

We develop new statistical theory for probabilistic principal component analysis models in high dimensions. The focus is the estimation of the noise variance, which is an important and unresolved issue when the number of variables is large in comparison with the sample size. We first unveil the reasons for an observed downward bias of the maximum likelihood estimator of the noise variance when the data dimension is high. We then propose a bias-corrected estimator by using random-matrix theory and establish its asymptotic normality. The superiority of the new and bias-corrected estimator over existing alternatives is checked by Monte Carlo experiments with various combinations of (*p*,*n*) (the dimension and sample size). Next, we construct a new criterion based on the bias-corrected estimator to determine the number of the principal components, and a consistent estimator is obtained. Its good performance is confirmed by a simulation study and real data analysis. The bias-corrected estimator is also used to derive new asymptotics for the related goodness-of-fit statistic under the high dimensional scheme.

What is the difference between a prediction that is made with a causal model and one that is made with a non-causal model? Suppose that we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (e.g. various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments.
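The invariance idea above can be sketched for a linear model with two environments. This is a toy version under assumed settings, with a crude invariance check (Welch *t*-test and *F*-test on pooled-fit residuals) standing in for the paper's tests: a predictor set is accepted if its residual distribution looks the same across environments, and the identifiable causal predictors are the intersection of all accepted sets.

```python
import itertools
import numpy as np
from scipy import stats

def accepted_sets(X, y, env, alpha=0.01):
    """Accept a predictor set S if, after a pooled OLS fit on S, the residuals
    show no mean or variance difference between the two environments."""
    n, p = X.shape
    out = []
    for r in range(p + 1):
        for S in itertools.combinations(range(p), r):
            Z = np.column_stack([np.ones(n)] + [X[:, j] for j in S])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            res = y - Z @ beta
            r0, r1 = res[env == 0], res[env == 1]
            p_t = stats.ttest_ind(r0, r1, equal_var=False).pvalue
            f = np.var(r0, ddof=1) / np.var(r1, ddof=1)
            p_f = 2 * min(stats.f.cdf(f, r0.size - 1, r1.size - 1),
                          stats.f.sf(f, r0.size - 1, r1.size - 1))
            if min(p_t, p_f) > alpha / 2:
                out.append(set(S))
    return out

# two environments: the mean of the cause X1 is intervened on in environment 1
rng = np.random.default_rng(0)
n = 500
env = np.repeat([0, 1], n)
x1 = rng.standard_normal(2 * n) + 2.0 * env      # shifted in environment 1
y = x1 + rng.standard_normal(2 * n)              # Y is caused by X1 only
x2 = y + rng.standard_normal(2 * n)              # X2 is an effect of Y
X = np.column_stack([x1, x2])
acc = accepted_sets(X, y, env)
causal = set.intersection(*acc) if acc else set()
```

Sets containing the effect *X*₂ but not the cause *X*₁ fail the invariance check, because the conditional distribution of *Y* given *X*₂ alone shifts with the intervention, whereas any set containing *X*₁ remains invariant.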

Local smoothing testing based on multivariate non-parametric regression estimation is one of the main model checking methodologies in the literature. However, the relevant tests suffer from the typical curse of dimensionality, resulting in slow rates of convergence to their limits under the null hypothesis and in reduced sensitivity to deviations from the null hypothesis under alternative hypotheses. This problem prevents tests from maintaining the level of significance well and makes tests less sensitive to alternative hypotheses. In the paper, a model adaptation concept in lack-of-fit testing is introduced and a dimension reduction model-adaptive test procedure is proposed for parametric single-index models. The test behaves like a local smoothing test, as if the model were univariate. It is consistent against any global alternative hypothesis and can detect local alternative hypotheses distinct from the null hypothesis at a fast rate that existing local smoothing tests can achieve only when the model is univariate. Simulations are conducted to examine the performance of our methodology. An analysis of real data is shown for illustration. The method can be readily extended to global smoothing methodology and other testing problems.

We propose a non-parametric variable selection method which does not rely on any regression model or predictor distribution. The method is based on a new statistical relationship, called *additive conditional independence*, that has been introduced recently for graphical models. Unlike most existing variable selection methods, which target the mean of the response, the method proposed targets a set of attributes of the response, such as its mean, variance or entire distribution. In addition, the additive nature of this approach offers non-parametric flexibility without employing multi-dimensional kernels. As a result it retains high accuracy for high dimensional predictors. We establish estimation consistency, convergence rate and variable selection consistency of the method proposed. Through simulation comparisons we demonstrate that the method proposed performs better than existing methods when the predictor affects several attributes of the response, and it performs competently in the classical setting where the predictors affect the mean only. We apply the new method to a data set concerning how gene expression levels affect the weight of mice.

We study the problem of estimating a compact set *S* ⊂ ℝ^*d* from a trajectory of a reflected Brownian motion in *S*, with reflections on the boundary of *S*. We establish consistency and rates of convergence for various estimators of *S* and its boundary. This problem has relevant applications in ecology in estimating the home range of an animal on the basis of tracking data. There are a variety of studies on the habitat of animals that employ the notion of home range. The paper offers theoretical foundations for a new methodology that, under fairly unrestrictive shape assumptions, allows us to find flexible regions close to reality. The theoretical findings are illustrated on simulated and real data examples.

We consider testing regression coefficients in high dimensional generalized linear models. By modifying the test statistic of Goeman and his colleagues for large but fixed dimensional settings, we propose a new test, based on an asymptotic analysis, that is applicable for diverging dimensions and is robust enough to accommodate a wide range of link functions. The power properties of the tests are evaluated asymptotically under two families of alternative hypotheses. In addition, a test in the presence of nuisance parameters is also proposed. The tests can provide *p*-values for testing significance of multiple gene sets, whose application is demonstrated in a case-study on lung cancer.

We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered as a special case. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data-generating mechanism. For instance, when the object of interest is low dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.