We consider heteroscedastic regression models where the mean function is a partially linear single-index model and the variance function depends on a generalized partially linear single-index model. We do not insist that the variance function depends only on the mean function, as happens in the classical generalized partially linear single-index model. We develop efficient and practical estimation methods for the variance function and for the mean function. Asymptotic theory for the parametric and non-parametric parts of the model is developed. Simulations illustrate the results. An empirical example involving ozone levels is used to illustrate the results further and is shown to be a case where the variance function does not depend on the mean function.

The paper develops a unified theoretical and computational framework for false discovery control in multiple testing of spatial signals. We consider both pointwise and clusterwise spatial analyses, and derive oracle procedures which optimally control the false discovery rate, false discovery exceedance and false cluster rate. A data-driven finite approximation strategy is developed to mimic the oracle procedures on a continuous spatial domain. Our multiple-testing procedures are asymptotically valid and can be effectively implemented using Bayesian computational algorithms for the analysis of large spatial data sets. Numerical results show that the procedures proposed lead to more accurate error control and better power performance than conventional methods. We demonstrate our methods by analysing the time trends in tropospheric ozone in the eastern USA.

The essentials of our paper of 2002 are briefly summarized and compared with other criteria for model comparison. After some comments on the paper's reception and influence, we consider criticisms and proposals for improvement made by us and others.

Random effects or shared parameter models are commonly advocated for the analysis of combined repeated measurement and event history data, including dropout from longitudinal trials. Their use in practical applications has generally been limited by computational cost and complexity, meaning that only simple special cases can be fitted by using readily available software. We propose a new approach that exploits recent distributional results for the extended skew normal family to allow exact likelihood inference for a flexible class of random-effects models. The method uses a discretization of the timescale for the time-to-event outcome, which is often unavoidable in any case when events correspond to dropout. We place no restriction on the times at which repeated measurements are made. An analysis of repeated lung function measurements in a cystic fibrosis cohort is used to illustrate the method.

We study quantile regression when the response is an event time subject to potentially dependent censoring. We consider the semicompeting risks setting, where the time to censoring remains observable after the occurrence of the event of interest. Although such a scenario frequently arises in biomedical studies, most current quantile regression methods for censored data are not applicable because they generally require the censoring time and the event time to be independent. By imposing quite mild assumptions on the association structure between the time-to-event response and the censoring time variable, we propose quantile regression procedures, which allow us to garner a comprehensive view of the covariate effects on the event time outcome as well as to examine the informativeness of censoring. An efficient and stable algorithm is provided for implementing the new method. We establish the asymptotic properties of the resulting estimators including uniform consistency and weak convergence. The theoretical development may serve as a useful template for addressing estimation settings that involve stochastic integrals. Extensive simulation studies suggest that the method proposed performs well with moderate sample sizes. We illustrate the practical utility of our proposals through an application to a bone marrow transplant trial.

The paper focuses primarily on temperature extremes measured at 24 European stations with at least 90 years of data. Here, the term extremes refers to rare excesses of daily maxima and minima. As mean temperatures in this region have been warming over the last century, this positive shift is naturally also detectable in extremes. After removing this warming trend, we focus on the question of determining whether other changes are still detectable in such extreme events. As we do not want to hypothesize any parametric form of such possible changes, we propose a new non-parametric estimator based on the Kullback–Leibler divergence tailored for extreme events. The properties of our estimator are studied theoretically and tested with a simulation study. Our approach is also applied to seasonal extremes of daily maxima and minima for our 24 selected stations.

Prior specification for non-parametric Bayesian inference involves the difficult task of quantifying prior knowledge about a parameter of high, often infinite, dimension. A statistician is unlikely to have informed opinions about all aspects of such a parameter but will have real information about functionals of the parameter, such as the population mean or variance. The paper proposes a new framework for non-parametric Bayes inference in which the prior distribution for a possibly infinite dimensional parameter is decomposed into two parts: an informative prior on a finite set of functionals, and a non-parametric conditional prior for the parameter given the functionals. Such priors can be easily constructed from standard non-parametric prior distributions in common use and inherit the large support of the standard priors on which they are based. Additionally, posterior approximations under these informative priors can generally be made via minor adjustments to existing Markov chain approximation algorithms for standard non-parametric prior distributions. We illustrate the use of such priors in the context of multivariate density estimation using Dirichlet process mixture models, and in the modelling of high dimensional sparse contingency tables.

Increasingly large data sets of processes in space and time call for statistical models and methods that can cope with such data. We show that the solution of a stochastic advection–diffusion partial differential equation provides a flexible model class for spatiotemporal processes that is computationally feasible even for large data sets. The Gaussian process defined through the stochastic partial differential equation has, in general, a non-separable covariance structure. Its parameters can be physically interpreted as explicitly modelling phenomena such as transport and diffusion that occur in many natural processes in diverse fields ranging from environmental sciences to ecology. To obtain computationally efficient statistical algorithms, we use spectral methods to solve the stochastic partial differential equation. This has the advantage that approximation errors do not accumulate over time, and that in the spectral space the computational cost grows linearly with the dimension, the total computational cost of Bayesian or frequentist inference being dominated by the fast Fourier transform. The model proposed is applied to post-processing of precipitation forecasts from a numerical weather prediction model for northern Switzerland. In contrast with the raw forecasts from the numerical model, the post-processed forecasts are calibrated and quantify prediction uncertainty. Moreover, they outperform the raw forecasts, in the sense that they have a lower mean absolute error.

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets, which are increasingly prevalent, the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the *m* out of *n* bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification of tuning parameters (such as the number of subsampled data points), and they often require knowledge of the estimator's convergence rate, in contrast with the bootstrap. As an alternative, we introduce the ‘bag of little bootstraps’ (BLB), a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the *m* out of *n* bootstrap and subsampling. In addition, we present results from a large-scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.
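The core of the BLB is easy to sketch: each small subsample is inflated back to full size with multinomial weights, so the quality assessment is performed at scale *n* without ever materializing a resample larger than the subsample size. A minimal illustration for the standard error of a sample mean (function and parameter names are ours, and the choice of estimator and quality measure is purely illustrative):

```python
import numpy as np

def blb_stderr(data, b, s=5, r=50, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the mean.

    Each of the s subsamples of size b is inflated to the full size n with
    multinomial weights, so no resample of size larger than b is stored.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    ses = []
    for _ in range(s):
        sub = rng.choice(data, size=b, replace=False)
        stats = []
        for _ in range(r):
            w = rng.multinomial(n, np.ones(b) / b)  # weights summing to n
            stats.append(np.dot(w, sub) / n)        # weighted sample mean
        ses.append(np.std(stats, ddof=1))           # quality within subsample
    return float(np.mean(ses))                      # average over subsamples
```

The outer loop over subsamples is embarrassingly parallel, which is what makes the procedure attractive on distributed architectures.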

In several areas of application ranging from brain imaging to astrophysics and geostatistics, an important statistical problem is to find regions where the process studied exceeds a certain level. Estimating such regions so that the probability for exceeding the level in the entire set is equal to some predefined value is a difficult problem connected to the problem of multiple significance testing. In this work, a method for latent Gaussian models is proposed that solves this problem, as well as the related problem of finding credible regions for contour curves. The method is based on using a parametric family for the excursion sets in combination with a sequential importance sampling method for estimating joint probabilities. The accuracy of the method is investigated by using simulated data and an environmental application is presented.

We introduce a new method for improving the coverage accuracy of confidence intervals for means of lattice distributions. The technique can be applied very generally to enhance existing approaches, although we consider it in greatest detail in the context of estimating a binomial proportion or a Poisson mean, where it is particularly effective. The method is motivated by a simple theoretical result, which shows that, by splitting the original sample of size *n* into two parts, of sizes *n*₁ and *n*₂, and basing the confidence procedure on the average of the means of these two subsamples, the highly oscillatory behaviour of coverage error, as a function of *n*, is largely removed. Perhaps surprisingly, this approach does not increase confidence interval width; usually the width is slightly reduced. Contrary to what might be expected, our new method performs well when it is used to modify confidence intervals based on existing techniques that already perform very well: it typically improves their coverage accuracy significantly. Each application of the split sample method to an existing confidence interval procedure results in a new technique.
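The construction is simple enough to state in a few lines. A sketch for a binomial proportion, with the interval built on the average of the two half-sample means (splitting the sample as evenly as possible; the Wald form of the interval is an illustrative choice, not prescribed by the method):

```python
import math
import numpy as np

def split_sample_interval(x, z=1.959963984540054):
    """95% confidence interval for a binomial proportion, centred on the
    average of the means of two half-samples whose sizes sum to n.
    The Wald interval form is an illustrative choice."""
    x = np.asarray(x)
    n = len(x)
    n1 = n // 2 + n % 2                          # the two split sizes
    p_hat = 0.5 * (x[:n1].mean() + x[n1:].mean())
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half
```

For even *n* the centre coincides with the ordinary sample proportion; the effect of the split shows up in the coverage behaviour as *n* varies.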

The choice of the summary statistics that are used in Bayesian inference, and in particular in approximate Bayesian computation algorithms, has a bearing on the validity of the resulting inference. Those statistics are nonetheless customarily used in approximate Bayesian computation algorithms without consistency checks. We derive necessary and sufficient conditions on summary statistics for the corresponding Bayes factor to be convergent, namely to select the true model asymptotically. Those conditions, which amount to the expectations of the summary statistics differing asymptotically under the two models, are quite natural and can be exploited in approximate Bayesian computation settings to infer whether or not a choice of summary statistics is appropriate, via a Monte Carlo validation.
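The Monte Carlo validation can be sketched directly: simulate the summary statistic repeatedly under each model and compare the gap between its means with its sampling spread. The simulator and statistic arguments below are hypothetical placeholders for whatever models and summaries are under study:

```python
import math
import numpy as np

def summaries_discriminate(sim0, sim1, stat, n_rep=300, n_obs=1000, seed=0):
    """Monte Carlo check of the condition above: the means of the summary
    statistic should differ under the two models, relative to its sampling
    spread.  sim0, sim1 and stat are user-supplied callables."""
    rng = np.random.default_rng(seed)
    s0 = np.array([stat(sim0(rng, n_obs)) for _ in range(n_rep)])
    s1 = np.array([stat(sim1(rng, n_obs)) for _ in range(n_rep)])
    gap = abs(s0.mean() - s1.mean())
    spread = math.hypot(s0.std(ddof=1), s1.std(ddof=1))
    return gap / spread   # large values: the statistic separates the models
```

For instance, the sample mean separates two normal models with different locations, whereas the sample variance does not, and the ratio reflects this.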

Monotonic transformations are widely employed in statistics and data analysis. In computer experiments they are often used to gain accuracy in the estimation of global sensitivity statistics. However, one then faces the question of interpreting results that are obtained on the transformed data back on the original data. The situation is even more complex in computer experiments, because transformations alter the model input–output mapping and distort the estimators. This work demonstrates that the problem can be solved by utilizing statistics which are monotonic transformation invariant. To do so, we investigate which families of metrics, based either on densities or on cumulative distribution functions, are monotonic transformation invariant, and we introduce a new generalized family of metrics. Numerical experiments show that transformations allow numerical convergence in the estimates of global sensitivity statistics, both invariant and not, in cases in which it would otherwise be impossible to obtain convergence. However, one fully exploits the increased numerical accuracy only if the global sensitivity statistic is monotonic transformation invariant. Conversely, estimators of measures that do not have this invariance property might lead to misleading deductions.
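As a concrete instance of invariance, any sensitivity statistic that depends on the output only through the ordering of its values is unchanged by monotonic transformations. A sketch of one such cumulative-distribution-function-based index (our own illustrative construction, not a specific measure from the paper):

```python
import numpy as np

def ks_distance(a, b):
    """Kolmogorov-Smirnov distance between the empirical cdfs of a and b."""
    grid = np.sort(np.concatenate([a, b]))
    fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(fa - fb)))

def cdf_sensitivity(x, y, n_bins=10):
    """Mean KS distance between the conditional (given an x bin) and the
    marginal output distributions.  It depends on y only through ranks,
    hence it is monotonic transformation invariant."""
    qs = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    dists = []
    for lo, hi in zip(qs[:-1], qs[1:]):
        mask = (x >= lo) & (x <= hi)
        if mask.any():
            dists.append(ks_distance(y[mask], y))
    return float(np.mean(dists))
```

Applying a strictly increasing transformation such as `np.exp` to the output leaves the index unchanged, because all comparisons are between ranks.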

We develop a semiparametric approach to geostatistical modelling and inference. In particular, we consider a geostatistical model with additive components, where the form of the covariance function of the spatial random error is not prespecified and thus is flexible. A novel, local Karhunen–Loève expansion is developed and a likelihood-based method is devised for estimating the model parameters and statistical inference. A simulation study demonstrates sound finite sample properties and a real data example is given for illustration. Finally, the theoretical properties of the estimates are explored and, in particular, consistency results are established.

In the classical biased sampling problem, we have *k* densities *f*₁, …, *f_k*, each known up to a normalizing constant, i.e., for *l* = 1, …, *k*, *f_l* = *q_l*/*c_l*, where *q_l* is a known function and *c_l* is an unknown constant. For each *l*, we have an independent and identically distributed sample from *f_l*, and the problem is to estimate the ratios *c_l*/*c_s* for all *l* and all *s*. This problem arises frequently in several situations in both frequentist and Bayesian inference. An estimate of the ratios was developed and studied by Vardi and his co-workers over two decades ago, and there has been much subsequent work on this problem from many perspectives. In spite of this, there are no rigorous results in the literature on how to estimate the standard error of the estimate. We present a class of estimates of the ratios of normalizing constants that are appropriate for the case where the samples from the *f_l*s are not necessarily independent and identically distributed sequences but are Markov chains. We also develop an approach based on regenerative simulation for obtaining standard errors for the estimates of ratios of normalizing constants. These standard error estimates are valid for both the independent and identically distributed samples case and the Markov chain case.
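A one-sample special case illustrates why ratios of normalizing constants are estimable at all: writing the *l*th density as q_l/c_l with q_l known, if X is drawn from that density then E[q_s(X)/q_l(X)] = c_s/c_l. A minimal sketch of this basic identity (Vardi-type estimators pool all *k* samples and are more efficient; this is only the building block):

```python
import numpy as np

def ratio_of_constants(sample_l, q_l, q_s):
    """Estimate c_s / c_l from a sample drawn from the density q_l / c_l,
    using E[q_s(X) / q_l(X)] = c_s / c_l.  The estimator is well behaved
    when q_s / q_l is bounded on the support of the sample."""
    x = np.asarray(sample_l)
    return float(np.mean(q_s(x) / q_l(x)))
```

For example, with q_l the unnormalized N(0, 2) density and q_s the unnormalized N(0, 1) density, the true ratio of normalizing constants is 1/sqrt(2).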

The increasing prevalence and utility of large public databases necessitates the development of appropriate methods for controlling false discovery. Motivated by this challenge, we discuss the generic problem of testing a possibly infinite stream of null hypotheses. In this context, Foster and Stine suggested a novel method named *α*-investing for controlling a false discovery measure known as mFDR. We develop a more general procedure for controlling mFDR, of which *α*-investing is a special case. We show that, in common practical situations, the general procedure can be optimized to produce an expected reward optimal version, which is more powerful than *α*-investing. We then present the concept of quality preserving databases, originally introduced by Aharoni and co-workers, which formalizes efficient public database management to save costs and to control false discovery simultaneously. We show how one variant of generalized *α*-investing can be used to control mFDR in a quality preserving database, leading to a significant reduction in costs compared with naive approaches for controlling the familywise error rate implemented by Aharoni and co-workers.
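The bookkeeping behind *α*-investing is a simple wealth process: each test spends part of the current alpha-wealth, and each rejection earns a payout, so a stream of hypotheses can be tested indefinitely as long as discoveries keep arriving. A sketch in the spirit of Foster and Stine's rule (betting a fixed fraction of the current wealth on each test is an illustrative policy of ours, not part of the original definition):

```python
def alpha_invest(pvalues, wealth=0.05, payout=0.05, spend_frac=0.1):
    """Sketch of the alpha-investing wealth process: test j is run at a
    level alpha_j taken from the current wealth; a rejection earns
    `payout`, a non-rejection costs alpha_j / (1 - alpha_j)."""
    rejections = []
    for p in pvalues:
        if wealth <= 0:              # wealth exhausted: no more rejections
            rejections.append(False)
            continue
        alpha_j = spend_frac * wealth
        reject = p <= alpha_j
        wealth += payout if reject else -alpha_j / (1.0 - alpha_j)
        rejections.append(reject)
    return rejections, wealth
```

Early discoveries replenish the wealth and make later tests easier to reject, which is what gives the procedure its power on streams with clustered signals.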

Researchers often believe that a treatment's effect on a response may be heterogeneous with respect to certain baseline covariates. This is an important premise of personalized medicine. Several methods for estimating heterogeneous treatment effects have been proposed. However, little attention has been given to the problem of choosing between estimators of treatment effects. Models that best estimate the regression function may not be best for estimating the effect of a treatment; therefore, there is a need for model selection methods that are targeted to treatment effect estimation. We demonstrate an application of the focused information criterion in this setting and develop a treatment effect cross-validation aimed at minimizing treatment effect estimation errors. Theoretically, treatment effect cross-validation has a model selection consistency property when the data splitting ratio is properly chosen. Practically, treatment effect cross-validation has the flexibility to compare different types of models. We illustrate the methods by using simulation studies and data from a clinical trial comparing treatments of patients with human immunodeficiency virus.

The emergence of the recent financial crisis, during which markets frequently underwent changes in their statistical structure over a short period of time, illustrates the importance of non-stationary modelling in financial time series. Motivated by this observation, we propose a fast, well performing and theoretically tractable method for detecting multiple change points in the structure of an auto-regressive conditional heteroscedastic model for financial returns with piecewise constant parameter values. Our method, termed BASTA (binary segmentation for transformed auto-regressive conditional heteroscedasticity), proceeds in two stages: process transformation and binary segmentation. The process transformation decorrelates the original process and lightens its tails; the binary segmentation consistently estimates the change points. We propose and justify two particular transformations and use simulation to fine-tune their parameters as well as the threshold parameter for the binary segmentation stage. A comparative simulation study illustrates good performance in comparison with the state of the art, and the analysis of the Financial Times Stock Exchange FTSE 100 index reveals an interesting correspondence between the estimated change points and major events of the recent financial crisis. The method is easy to implement, and ready-made R software is provided.
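The second, binary segmentation stage is generic and easy to sketch: recursively split the (transformed) series wherever a CUSUM statistic exceeds a threshold. The sketch below detects changes in mean, which is the role binary segmentation plays once the first-stage transformation has been applied; the threshold is the tuning parameter mentioned above:

```python
import numpy as np

def cusum(x):
    """CUSUM statistics for every candidate split point b = 1, ..., n-1."""
    n = len(x)
    b = np.arange(1, n)
    s = np.cumsum(x)[:-1]
    mean_l = s / b
    mean_r = (x.sum() - s) / (n - b)
    return np.sqrt(b * (n - b) / n) * np.abs(mean_l - mean_r)

def binary_segmentation(x, threshold, offset=0):
    """Recursively declare a change point wherever max CUSUM > threshold."""
    if len(x) < 2:
        return []
    c = cusum(x)
    b = int(np.argmax(c)) + 1
    if c[b - 1] <= threshold:
        return []
    return (binary_segmentation(x[:b], threshold, offset)
            + [offset + b]
            + binary_segmentation(x[b:], threshold, offset + b))
```

Each detected change point splits the series in two, and the same test is re-run on both halves until no statistic exceeds the threshold.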

High dimensional sparse modelling via regularization provides a powerful tool for analysing large-scale data sets and obtaining meaningful interpretable models. The use of non-convex penalty functions shows advantage in selecting important features in high dimensions, but the global optimality of such methods still demands more understanding. We consider sparse regression with a hard thresholding penalty, which we show to give rise to thresholded regression. This approach is motivated by its close connection with *L*₀-regularization, which can be unrealistic to implement in practice but has appealing sampling properties, and by its computational advantage. Under some mild regularity conditions allowing possibly exponentially growing dimensionality, we establish the oracle inequalities of the resulting regularized estimator, as the global minimizer, under various prediction and variable selection losses, as well as the oracle risk inequalities of the hard thresholded estimator followed by further *L*₂-regularization. The risk properties exhibit interesting shrinkage effects under both estimation and prediction losses. We identify the optimal choice of the ridge parameter, which is shown to have simultaneous advantages to both the *L*₂-loss and the prediction loss. These new results and phenomena are evidenced by simulation and real data examples.

We investigate the estimation efficiency of the central mean subspace in the framework of sufficient dimension reduction. We derive the semiparametric efficient score and study its practical applicability. Despite the difficulty caused by the potentially high dimension of the variance component, we show that locally efficient estimators can be constructed in practice. We conduct simulation studies and a real data analysis to demonstrate the finite sample performance and gain in efficiency of the proposed estimators in comparison with several existing methods.

Discovering patterns from a set of texts or, more generally, categorical data is an important problem in many disciplines such as biomedical research, linguistics, artificial intelligence and sociology. We consider here the well-known ‘market basket’ problem that is often discussed in the data mining community, and is also quite ubiquitous in biomedical research. The data under consideration are a set of ‘baskets’, where each basket contains a list of ‘items’. Our goal is to discover ‘themes’, which are defined as subsets of items that tend to co-occur in a basket. We describe a generative model, i.e. the theme dictionary model, for such data structures and describe two likelihood-based methods to infer themes that are hidden in a collection of baskets. We also propose a novel sequential Monte Carlo method to overcome computational challenges. Using both simulation studies and real applications, we demonstrate that the new approach proposed is significantly more powerful than existing methods, such as association rule mining and topic modelling, in detecting weak and subtle interactions in the data.
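As a baseline for comparison, association rule mining scores item pairs by how much more often they co-occur than independence would predict ('lift'). A minimal sketch of that baseline on basket data (the theme dictionary model goes beyond such pairwise counts):

```python
from collections import Counter
from itertools import combinations

def pair_lift(baskets):
    """Lift of each item pair: observed co-occurrence frequency divided by
    the frequency predicted under independence of the two items."""
    n = len(baskets)
    item = Counter()
    pair = Counter()
    for basket in baskets:
        s = set(basket)
        item.update(s)
        pair.update(combinations(sorted(s), 2))
    return {p: (c / n) / ((item[p[0]] / n) * (item[p[1]] / n))
            for p, c in pair.items()}
```

A lift well above 1 flags a candidate theme of size 2; weak and subtle interactions, which the paper targets, are exactly those this simple score struggles to separate from noise.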

The paper considers in the high dimensional setting a canonical testing problem in multivariate analysis, namely testing the equality of two mean vectors. We introduce a new test statistic that is based on a linear transformation of the data by the precision matrix which incorporates the correlations between the variables. The limiting null distribution of the test statistic and the power of the test are analysed. It is shown that the test is particularly powerful against sparse alternatives and enjoys certain optimality. A simulation study is carried out to examine the numerical performance of the test and to compare it with other tests given in the literature. The results show that the test proposed significantly outperforms those tests in a range of settings.
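The construction of the statistic can be sketched as follows: transform the mean difference by an estimate of the precision matrix and take the maximal standardized component. The sketch uses the inverse of the pooled sample covariance, which is viable only when the dimension is small relative to the sample sizes; in the high dimensional setting of the paper a suitable precision matrix estimator would take its place:

```python
import numpy as np

def precision_transformed_stat(x, y):
    """Max-type two-sample statistic on the precision-transformed mean
    difference, with the precision matrix estimated by the inverse of the
    pooled sample covariance (a low dimensional stand-in)."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * np.cov(x, rowvar=False)
              + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    omega = np.linalg.inv(pooled)
    z = omega @ (x.mean(axis=0) - y.mean(axis=0))
    var = np.diag(omega) * (1.0 / n1 + 1.0 / n2)  # variance of each z entry
    return float(np.max(z**2 / var))
```

The max-type form is what makes the test powerful against sparse alternatives: a shift in a single coordinate produces one large standardized component.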

We consider the problem of estimating multiple related Gaussian graphical models from a high dimensional data set with observations belonging to *K* distinct classes. We propose the *joint graphical lasso*, which borrows strength across the classes to estimate multiple graphical models that share certain characteristics, such as the locations or weights of non-zero edges. Our approach is based on maximizing a penalized log-likelihood. We employ generalized fused lasso or group lasso penalties and implement a fast alternating directions method of multipliers algorithm to solve the corresponding convex optimization problems. The performance of the method proposed is illustrated through simulated and real data examples.

The paper deals with non-parametric estimation of a conditional distribution function. We suggest a method of preadjusting the original observations non-parametrically through location and scale, to reduce the bias of the estimator. We derive the asymptotic properties of the estimator proposed. A simulation study investigating the finite-sample performance of the estimators discussed is provided and reveals the gains that can be achieved. It is also shown how the idea of preadjusting opens the path to improved estimators in other settings such as conditional quantile and density estimation, and conditional survival function estimation in the case of censored data.

Max-stable processes are the natural analogues of the generalized extreme value distribution when modelling extreme events in space and time. Under suitable conditions, these processes are asymptotically justified models for maxima of independent replications of random fields, and they are also suitable for the modelling of extreme measurements over high thresholds. The paper shows how a pairwise censored likelihood can be used for consistent estimation of the extremes of space–time data under mild mixing conditions and illustrates this by fitting an extension of a model due to Schlather to hourly rainfall data. A block bootstrap procedure is used for uncertainty assessment. Estimator efficiency is considered and the choice of pairs to be included in the pairwise likelihood is discussed. The model proposed fits the data better than some natural competitors.

Modern technologies are producing a wealth of data with complex structures. For instance, in two-dimensional digital imaging, flow cytometry and electroencephalography, matrix-type covariates frequently arise when measurements are obtained for each combination of two underlying variables. To address scientific questions arising from those data, new regression methods that take matrices as covariates are needed, and sparsity or other forms of regularization are crucial owing to the ultrahigh dimensionality and complex structure of the matrix data. The popular lasso and related regularization methods hinge on the sparsity of the true signal in terms of the number of its non-zero coefficients. However, for the matrix data, the true signal is often of, or can be well approximated by, a low rank structure. As such, the sparsity is frequently in the form of low rank of the matrix parameters, which may seriously violate the assumption of the classical lasso. We propose a class of regularized matrix regression methods based on spectral regularization. A highly efficient and scalable estimation algorithm is developed, and a degrees-of-freedom formula is derived to facilitate model selection along the regularization path. Superior performance of the method proposed is demonstrated on both synthetic and real examples.
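The workhorse of spectral regularization with a nuclear norm penalty is soft-thresholding of the singular values, the proximal operator applied at each step of proximal-gradient-type estimation algorithms. A minimal sketch:

```python
import numpy as np

def svt(B, tau):
    """Singular value soft-thresholding, the proximal operator of the
    nuclear norm: shrink every singular value of B by tau towards zero."""
    u, s, vt = np.linalg.svd(B, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt
```

Because small singular values are set exactly to zero, the operator yields low rank matrix parameters, which is the form of sparsity appropriate for matrix covariates as discussed above.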