The likelihood ratio (LR) measures the relative weight of forensic data regarding two hypotheses. Several levels of uncertainty arise if frequentist methods are chosen for its assessment: the assumed population model only approximates the true one, and its parameters are estimated from a limited database. Moreover, it may be wise to discard part of the data, especially data only indirectly related to the hypotheses. Different reductions define different LRs. Therefore, it is more sensible to talk about ‘a’ LR instead of ‘the’ LR, and the error involved in the estimation should be quantified. Two frequentist methods are proposed in the light of these points for the ‘rare type match problem’, that is, when a match between the perpetrator's and the suspect's DNA profiles, never observed before in the reference database, is to be evaluated.

We discuss a class of difference-based estimators for the autocovariance in nonparametric regression when the signal is discontinuous and the errors form a stationary *m*-dependent process. These estimators circumvent the particularly challenging task of pre-estimating such an unknown regression function. We provide finite-sample expressions for their mean squared errors under piecewise constant signals and Gaussian errors. Based on these, we derive bias-optimized estimates that do not depend on the unknown autocovariance structure. Notably, for positively correlated errors, the part of the variance of our estimators that depends on the signal is minimal as well. Further, we provide sufficient conditions for √*n*-consistency; this result is extended to piecewise Hölder regression with non-Gaussian errors.

We combine our bias-optimized autocovariance estimates with a projection-based approach and derive covariance matrix estimates, a method that is of independent interest. An R package, several simulations and an application to biophysical measurements complement this paper.
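The difference-based idea can be shown with a minimal sketch: when the signal is (piecewise) constant, squared differences of the observations identify the error autocovariance without pre-estimating the regression function, and a lag beyond the dependence range isolates the error variance. The MA(1) example, the lag choices and the naive averaging below are illustrative assumptions, not the paper's bias-optimized construction.

```python
import numpy as np

def diff_autocov(y, lag, m):
    """Difference-based autocovariance estimate for m-dependent errors.

    For a (piecewise) constant signal, E[(y_{i+h} - y_i)^2] equals
    2 * (gamma(0) - gamma(h)), and gamma(h) = 0 for h > m, so any lag
    beyond the dependence range identifies gamma(0). Signal jumps only
    contaminate the few differences straddling them.
    """
    big = m + 1  # a lag past the dependence range
    g0 = 0.5 * np.mean((y[big:] - y[:-big]) ** 2)
    if lag == 0:
        return g0
    return g0 - 0.5 * np.mean((y[lag:] - y[:-lag]) ** 2)

rng = np.random.default_rng(0)
n, theta = 100_000, 0.5
eps = rng.normal(size=n + 1)
errors = eps[1:] + theta * eps[:-1]                  # MA(1): 1-dependent
signal = np.where(np.arange(n) < n // 2, 0.0, 5.0)   # one jump of size 5
y = signal + errors
g0, g1 = diff_autocov(y, 0, m=1), diff_autocov(y, 1, m=1)
print(g0, g1)  # true values: gamma(0) = 1.25, gamma(1) = 0.5
```

Note how the single jump barely affects the estimates: only the handful of differences that straddle it are contaminated.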

This paper develops a statistically principled approach to kernel density estimation on a network of lines, such as a road network. Existing heuristic techniques are reviewed, and their weaknesses are identified. The correct analogue of the Gaussian kernel is the ‘heat kernel’, the occupation density of Brownian motion on the network. The corresponding kernel estimator satisfies the classical time-dependent heat equation on the network. This ‘diffusion estimator’ has good statistical properties that follow from the heat equation. It is mathematically similar to an existing heuristic technique, in that both can be expressed as sums over paths in the network. However, the diffusion estimate is an infinite sum, which cannot be evaluated using existing algorithms. Instead, the diffusion estimate can be computed rapidly by numerically solving the time-dependent heat equation on the network. This also enables bandwidth selection using cross-validation. The diffusion estimate with automatically selected bandwidth is demonstrated on road accident data.
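The computational idea, evolving point masses under the time-dependent heat equation until the diffusion time matches half the squared bandwidth, can be sketched on a single segment (a network with one edge), where the heat-kernel estimator reduces to a Gaussian kernel estimate with reflection at the endpoints. The grid resolution and the explicit finite-difference scheme below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def diffusion_kde_1d(data, length, bandwidth, n_grid=200):
    """Heat-equation KDE on a single segment [0, length].

    Point masses at the data are evolved under the heat equation for
    time t = bandwidth**2 / 2, which on one segment reproduces a
    Gaussian-kernel estimate with reflection at the endpoints.
    """
    dx = length / n_grid
    x = (np.arange(n_grid) + 0.5) * dx
    f = np.zeros(n_grid)
    for d in data:                          # unit point mass per observation
        f[min(int(d / dx), n_grid - 1)] += 1.0 / dx
    f /= len(data)
    t_end = 0.5 * bandwidth ** 2
    dt = 0.4 * dx ** 2                      # explicit scheme: dt < dx^2 / 2
    steps = max(1, int(np.ceil(t_end / dt)))
    dt = t_end / steps
    for _ in range(steps):
        fp = np.pad(f, 1, mode="edge")      # reflecting (Neumann) boundaries
        f = f + dt * (fp[2:] - 2 * f + fp[:-2]) / dx ** 2
    return x, f

x, f = diffusion_kde_1d([1.0, 2.5, 2.7], length=5.0, bandwidth=0.5)
print((f * (x[1] - x[0])).sum())  # total mass is conserved: 1.0
```

Because the discrete Laplacian with reflecting boundaries telescopes to zero, total probability mass is preserved exactly at every time step, mirroring the mass-conservation property of the diffusion estimator.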

Directional data such as wind directions can be collected extremely easily, so experiments typically yield a huge number of sequentially recorded data points. To deal with such big data, traditional nonparametric techniques rapidly become too time-consuming to compute and are therefore useless in practice if real-time or online forecasts are expected. In this paper, we propose a recursive kernel density estimator for directional data which (i) can be updated extremely easily when a new set of observations is available and (ii) asymptotically keeps the nice features of the traditional kernel density estimator. Our methodology is based on Robbins–Monro stochastic approximation ideas. We show that our estimator outperforms the traditional techniques in terms of computational time while remaining extremely competitive in terms of efficiency with respect to its competitors in the sequential context considered here. We obtain expressions for its asymptotic bias and variance together with an almost sure convergence rate and an asymptotic normality result. Our technique is illustrated on a wind dataset collected in Spain. A Monte Carlo study confirms the nice properties of our recursive estimator with respect to its non-recursive counterpart.
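The recursive structure can be sketched as follows for circular data: each new angle triggers a cheap convex-combination update of the density values on a fixed grid, so no past observations need to be stored. The von Mises kernel, the concentration schedule and the step size 1/n below are illustrative assumptions, not the paper's estimator or rate choices.

```python
import numpy as np
from scipy.special import i0

def vm_kernel(x, theta, kappa):
    """von Mises kernel on the circle with concentration kappa."""
    return np.exp(kappa * np.cos(x - theta)) / (2 * np.pi * i0(kappa))

class RecursiveCircularKDE:
    """Recursive KDE on a fixed evaluation grid.

    Each new angle theta_n triggers an O(grid) update
        f_n = (1 - 1/n) * f_{n-1} + (1/n) * K_{kappa_n}(., theta_n),
    a stochastic-approximation step in the spirit of Robbins-Monro;
    kappa_n grows slowly so the kernel concentrates as n grows.
    """
    def __init__(self, n_grid=360):
        self.grid = np.linspace(0, 2 * np.pi, n_grid, endpoint=False)
        self.f = np.zeros(n_grid)
        self.n = 0

    def update(self, theta):
        self.n += 1
        kappa = 4.0 * self.n ** 0.4          # illustrative bandwidth schedule
        gamma = 1.0 / self.n
        self.f = (1 - gamma) * self.f + gamma * vm_kernel(self.grid, theta, kappa)

rng = np.random.default_rng(0)
kde = RecursiveCircularKDE()
for theta in rng.vonmises(mu=np.pi, kappa=2.0, size=2000) % (2 * np.pi):
    kde.update(theta)
print(kde.grid[np.argmax(kde.f)])  # estimated mode, near pi
```

The update cost per observation does not grow with the sample size, which is the point of the recursive construction.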

Bayesian shrinkage methods have generated a lot of interest in recent years, especially in the context of high-dimensional linear regression. In recent work, a Bayesian shrinkage approach using generalized double Pareto priors has been proposed. Several useful properties of this approach, including the derivation of a tractable three-block Gibbs sampler to sample from the resulting posterior density, have been established. We show that the Markov operator corresponding to this three-block Gibbs sampler is not Hilbert–Schmidt. We propose a simpler two-block Gibbs sampler and show that the corresponding Markov operator is trace class (and hence Hilbert–Schmidt). Establishing the trace class property for the proposed two-block Gibbs sampler has several useful consequences. Firstly, it implies that the corresponding Markov chain is geometrically ergodic, thereby implying the existence of a Markov chain central limit theorem, which in turn enables computation of asymptotic standard errors for Markov chain-based estimates of posterior quantities. Secondly, because the proposed Gibbs sampler uses two blocks, standard recipes in the literature can be used to construct a sandwich Markov chain (by inserting an appropriate extra step) to gain further efficiency and to achieve faster convergence. The trace class property for the two-block sampler implies that the corresponding sandwich Markov chain is also trace class and thereby geometrically ergodic. Finally, it also guarantees that all eigenvalues of the sandwich chain are dominated by the corresponding eigenvalues of the Gibbs sampling chain (with at least one strict domination). Our results demonstrate that a minor change in the structure of a Markov chain can lead to fundamental changes in its theoretical properties. We illustrate the improvement in efficiency resulting from our proposed Markov chains using simulated and real examples.
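The gain discussed here comes from how parameters are grouped into blocks. The following toy sketch shows the two-block structure (alternate exact draws from two conditional distributions) on a simple semi-conjugate normal model; it is *not* the two-block sampler for the generalized double Pareto posterior studied in the paper, whose conditionals are different.

```python
import numpy as np

def two_block_gibbs(y, n_iter=5000, seed=0):
    """Generic two-block Gibbs skeleton on a semi-conjugate normal model
    (y_i ~ N(mu, s2), mu ~ N(0, 100), s2 ~ InvGamma(2, 2)).

    Illustrates the two-block structure only: alternate exact draws from
    mu | s2, y and s2 | mu, y.
    """
    rng = np.random.default_rng(seed)
    n, ybar = len(y), np.mean(y)
    mu, s2 = ybar, np.var(y)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Block 1: mu | s2, y  (conjugate normal update)
        prec = n / s2 + 1 / 100.0
        mu = rng.normal((n * ybar / s2) / prec, np.sqrt(1 / prec))
        # Block 2: s2 | mu, y  (conjugate inverse-gamma update)
        a = 2 + n / 2
        b = 2 + 0.5 * np.sum((y - mu) ** 2)
        s2 = 1.0 / rng.gamma(a, 1.0 / b)
        draws[t] = mu, s2
    return draws

rng = np.random.default_rng(1)
y = rng.normal(3.0, 2.0, size=200)
draws = two_block_gibbs(y)
print(draws[1000:, 0].mean())  # posterior mean of mu, close to the sample mean
```

Fewer blocks generally means stronger dependence is integrated out within each draw, which is what drives the trace class and ergodicity improvements described above.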

The Ising model is one of the simplest and most famous models of interacting systems. It was originally proposed to model ferromagnetic interactions in statistical physics and is now widely used to model spatial processes in many areas such as ecology, sociology, and genetics, usually without testing its goodness of fit. Here, we propose various test statistics and an exact goodness-of-fit test for the finite-lattice Ising model. The theory of Markov bases has been developed in algebraic statistics for exact goodness-of-fit testing using a Monte Carlo approach. However, finding a Markov basis is often computationally intractable. Thus, we develop a Monte Carlo method for exact goodness-of-fit testing for the Ising model that avoids computing a Markov basis and also leads to a better connectivity of the Markov chain and hence to a faster convergence. We show how this method can be applied to analyze the spatial organization of receptors on the cell membrane.
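For readers unfamiliar with the model, the sketch below simulates a finite-lattice Ising model by Gibbs sampling and locates an observed sufficient statistic (the neighbour-agreement count) among simulated replicates. This crude parametric-simulation flavour is only illustrative; the paper's test is an exact conditional test that avoids both a Markov basis and the need to know the true parameter.

```python
import numpy as np

def gibbs_ising(L, beta, n_sweeps, rng):
    """Gibbs sampler for the ferromagnetic Ising model on an L x L lattice
    (free boundary), P(x) proportional to exp(beta * sum_{i~j} x_i x_j)."""
    x = rng.choice([-1, 1], size=(L, L))
    for _ in range(n_sweeps):
        for i in range(L):
            for j in range(L):
                s = 0
                if i > 0: s += x[i - 1, j]
                if i < L - 1: s += x[i + 1, j]
                if j > 0: s += x[i, j - 1]
                if j < L - 1: s += x[i, j + 1]
                p = 1.0 / (1.0 + np.exp(-2 * beta * s))  # P(x_ij = +1 | rest)
                x[i, j] = 1 if rng.random() < p else -1
    return x

def neighbour_agreement(x):
    """Sufficient statistic: sum of x_i * x_j over neighbouring pairs."""
    return int((x[:-1] * x[1:]).sum() + (x[:, :-1] * x[:, 1:]).sum())

# Crude Monte Carlo goodness-of-fit flavour (not the paper's exact test):
# simulate replicates at a fitted beta and locate the observed statistic.
rng = np.random.default_rng(0)
observed = gibbs_ising(L=10, beta=0.3, n_sweeps=50, rng=rng)
t_obs = neighbour_agreement(observed)
t_rep = [neighbour_agreement(gibbs_ising(10, 0.3, 50, rng)) for _ in range(99)]
p_value = (1 + sum(t >= t_obs for t in t_rep)) / 100
print(p_value)
```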

We study minimum contrast estimation for parametric stationary determinantal point processes. These processes form a useful class of models for repulsive (or regular, or inhibitive) point patterns and are applied in numerous statistical applications. Our main focus is on minimum contrast methods based on Ripley's *K*-function or on the pair correlation function. Strong consistency and asymptotic normality of these procedures are proved under general conditions that only concern the existence of the process and its regularity with respect to the parameters. A key ingredient of the proofs is the recently established Brillinger mixing property of stationary determinantal point processes. This work may be viewed as a complement to the study of Y. Guan and M. Sherman, who establish the same kind of asymptotic properties for a large class of Cox processes, which in turn are models for clustering (or aggregation).
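A minimum contrast fit can be sketched as follows: estimate Ripley's *K*-function from the observed pattern and minimize an integrated squared distance between (powers of) the empirical and parametric *K*-functions. The naive edge-correction-free *K* estimate, the hard-core thinned pattern standing in for a truly determinantal sample, and the Gaussian-kernel DPP *K*-function are all illustrative assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import pdist

def k_hat(points, window_area, r):
    """Naive estimate of Ripley's K at radii r (no edge correction)."""
    n = len(points)
    d = pdist(points)
    counts = np.array([(d <= ri).sum() for ri in r]) * 2  # ordered pairs
    return window_area * counts / (n * (n - 1))

def k_gauss_dpp(r, alpha):
    """K-function of the stationary DPP with Gaussian kernel, whose pair
    correlation function is g(r) = 1 - exp(-2 r^2 / alpha^2)."""
    return np.pi * r ** 2 - 0.5 * np.pi * alpha ** 2 * (1 - np.exp(-2 * r ** 2 / alpha ** 2))

# Simulate a crudely repulsive pattern by hard-core thinning on the unit
# square, then fit alpha by minimum contrast on K^(1/2).
rng = np.random.default_rng(0)
kept = []
for p in rng.uniform(0, 1, size=(400, 2)):
    if all(np.hypot(*(p - q)) > 0.03 for q in kept):
        kept.append(p)
points = np.array(kept)
r = np.linspace(0.01, 0.15, 30)
khat = k_hat(points, 1.0, r)
dr = r[1] - r[0]
contrast = lambda a: np.sum((khat ** 0.5 - k_gauss_dpp(r, a) ** 0.5) ** 2) * dr
fit = minimize_scalar(contrast, bounds=(1e-3, 0.2), method="bounded")
print(fit.x)  # fitted interaction-range parameter alpha
```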

We investigate the estimation of specific intrinsic volumes of stationary Boolean models by local digital algorithms; that is, by weighted sums of local configuration counts. We show that asymptotically unbiased estimators for the specific surface area or integrated mean curvature do not exist if the dimension is at least two or at least three, respectively. For three-dimensional stationary isotropic Boolean models, we derive asymptotically unbiased estimators for the specific surface area and integrated mean curvature. For a Boolean model with balls as grains, we even obtain an asymptotically unbiased estimator for the specific Euler characteristic.

Time-varying coefficient models are widely used in longitudinal data analysis. These models allow the effects of predictors on the response to vary over time. In this article, we consider a mixed-effects time-varying coefficient model to account for the within-subject correlation for longitudinal data. We show that when kernel smoothing is used to estimate the smooth functions in time-varying coefficient models for sparse or dense longitudinal data, the asymptotic results of these two situations are essentially different. Therefore, a subjective choice between the sparse and dense cases might lead to erroneous conclusions for statistical inference. In order to solve this problem, we establish a unified self-normalized central limit theorem, based on which a unified inference is proposed without deciding whether the data are sparse or dense. The effectiveness of the proposed unified inference is demonstrated through a simulation study and an analysis of Baltimore MACS data.

We consider a semiparametric single-index model and suppose that endogeneity is present in the explanatory variables. The presence of an instrument, that is, a variable uncorrelated with the error term, is assumed. We propose an estimator of the parametric component of the model, which is the solution of an ill-posed inverse problem. The estimator is shown to be asymptotically normal under certain regularity conditions. A simulation study is conducted to illustrate the finite sample performance of the proposed estimator.

This paper establishes a remarkable result regarding Palm distributions for a log Gaussian Cox process: the reduced Palm distribution for a log Gaussian Cox process is itself a log Gaussian Cox process that only differs from the original log Gaussian Cox process in the intensity function. This new result is used to study functional summaries for log Gaussian Cox processes.

For small area estimation with area-level data, the Fay–Herriot model is extensively used as a model-based method. In the Fay–Herriot model, it is conventionally assumed that the sampling variances are known, whereas estimators of the sampling variances are used in practice. The assumption of known sampling variances is thus unrealistic, and several methods have been proposed to overcome this problem. In this paper, we assume that direct estimators of the sampling variances are available as well as the sample means. Using this information, we propose a Bayesian yet objective method producing shrinkage estimation of both means and variances in the Fay–Herriot model. We consider a hierarchical structure for the sampling variances and set a uniform prior on the model parameters to maintain the objectivity of the proposed model. For validity of the posterior inference, we show under mild conditions that the posterior distribution is proper and has finite variances. We investigate the numerical performance through simulation and empirical studies.
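For orientation, the conventional Fay–Herriot set-up that this paper relaxes can be sketched as follows: with the sampling variances treated as known, the small-area estimate shrinks each direct estimate toward a regression fit. The crude moment-matching estimate of the model variance below is an illustrative assumption, not the paper's hierarchical Bayesian procedure.

```python
import numpy as np

def fay_herriot_eb(y, X, D, n_iter=100):
    """Empirical-Bayes Fay-Herriot estimates with sampling variances D_i
    treated as known (the conventional assumption the paper relaxes).

    Model: y_i = x_i' beta + v_i + e_i, v_i ~ N(0, A), e_i ~ N(0, D_i).
    A is estimated by a crude moment-matching iteration; the small-area
    estimate shrinks y_i toward the regression fit x_i' beta.
    """
    A = np.var(y)
    for _ in range(n_iter):
        w = 1.0 / (A + D)
        # weighted least squares for beta given the current A
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        A = max(1e-8, np.mean((y - X @ beta) ** 2 - D))
    gamma = A / (A + D)
    return gamma * y + (1 - gamma) * (X @ beta), beta, A

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
D = rng.uniform(0.5, 2.0, size=m)                              # "known" variances
theta = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, size=m)  # true A = 1
y = theta + rng.normal(0, np.sqrt(D))
est, beta, A = fay_herriot_eb(y, X, D)
print(A)  # moment estimate of the model variance A (true value 1)
```

Areas with large sampling variance D_i get a small shrinkage factor gamma_i and are pulled strongly toward the regression fit; the paper's contribution is to also shrink the D_i themselves rather than plug in noisy direct estimates.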

Spatio-temporal modelling is an increasingly popular topic in Statistics. Our paper contributes to this line of research by developing the theory, simulation and inference for a spatio-temporal Ornstein–Uhlenbeck process. We conduct detailed simulation studies and demonstrate the practical relevance of these processes in an empirical study of radiation anomaly data. Finally, we describe how predictions can be carried out in the Gaussian setting.

Skew-symmetric families of distributions such as the skew-normal and skew-*t* represent supersets of the normal and *t* distributions, and they exhibit richer classes of extremal behaviour. By defining a non-stationary skew-normal process, which allows the easy handling of positive definite, non-stationary covariance functions, we derive a new family of max-stable processes – the extremal skew-*t* process. This process is a superset of non-stationary processes that include the stationary extremal-*t* processes. We provide the spectral representation and the resulting angular densities of the extremal skew-*t* process and illustrate its practical implementation.

Multivariate extreme value statistical analysis is concerned with observations on several variables which are thought to possess some degree of tail dependence. The main approaches to inference for multivariate extremes consist in approximating either the distribution of block component-wise maxima or the distribution of the exceedances over a high threshold. Although the expressions of the asymptotic density functions of these distributions may be characterized, they cannot be computed in general. In this paper, we study the case where the spectral random vector of the multivariate max-stable distribution has known conditional distributions. The asymptotic density functions of the multivariate extreme value distributions may then be written through univariate integrals that are easily computed or simulated. The asymptotic properties of two likelihood estimators are presented, and the utility of the method is examined via simulation.

Data augmentation is required for the implementation of many Markov chain Monte Carlo (MCMC) algorithms. The inclusion of augmented data can often lead to conditional distributions from well-known probability distributions for some of the parameters in the model. In such cases, collapsing (integrating out parameters) has been shown to improve the performance of MCMC algorithms. We show how integrating out the infection rate parameter in epidemic models leads to efficient MCMC algorithms for two very different epidemic scenarios, final outcome data from a multitype SIR epidemic and longitudinal data from a spatial SI epidemic. The resulting MCMC algorithms give fresh insight into real-life epidemic data sets.

In the analysis of semi-competing risks data, interest lies in estimation and inference with respect to a so-called non-terminal event, the observation of which is subject to a terminal event. Multi-state models are commonly used to analyse such data, with covariate effects on the transition/intensity functions typically specified via the Cox model and dependence between the non-terminal and terminal events specified, in part, by a unit-specific shared frailty term. To ensure identifiability, the frailties are typically assumed to arise from a parametric distribution, specifically a Gamma distribution with mean 1.0 and variance, say, *σ*^{2}. When the frailty distribution is misspecified, however, the resulting estimator is not guaranteed to be consistent, with the extent of asymptotic bias depending on the discrepancy between the assumed and true frailty distributions. In this paper, we propose a novel class of transformation models for semi-competing risks analysis that permit the non-parametric specification of the frailty distribution. To ensure identifiability, the class restricts to parametric specifications of the transformation and the error distribution; the latter are flexible, however, and cover a broad range of possible specifications. We also derive the semi-parametric efficient score under the complete data setting and propose a non-parametric score imputation method to handle right censoring; consistency and asymptotic normality of the resulting estimators are derived, and small-sample operating characteristics are evaluated via simulation. Although the proposed semi-parametric transformation model and non-parametric score imputation method are motivated by the analysis of semi-competing risks data, they are broadly applicable to any analysis of multivariate time-to-event outcomes in which a unit-specific shared frailty is used to account for correlation.
Finally, the proposed model and estimation procedures are applied to a study of hospital readmission among patients diagnosed with pancreatic cancer.

It is the main purpose of this paper to study the asymptotics of certain variants of the empirical process in the context of survey data. Precisely, Functional Central Limit Theorems are established under usual conditions when the sample is drawn from a Poisson or a rejective sampling design. The framework we develop encompasses sampling designs with non-uniform first order inclusion probabilities, which can be chosen so as to optimize estimation accuracy. Applications to Hadamard differentiable functionals are considered.

Motivated by problems in canonical correlation analysis, reduced rank regression and sufficient dimension reduction, we introduce a double dimension reduction model where a single index of the multivariate response is linked to the multivariate covariate through a single index of these covariates, hence the name double single index model. Because nonlinear association between two sets of multivariate variables can be arbitrarily complex and even intractable in general, we aim at seeking a principal one-dimensional association structure where a response index is fully characterized by a single predictor index. The functional relation between the two single-indices is left unspecified, allowing flexible exploration of any potential nonlinear association. We argue that such double single index association is meaningful and easy to interpret, and the rest of the multi-dimensional dependence structure can be treated as nuisance in model estimation. We investigate the estimation and inference of both indices and the regression function, and derive the asymptotic properties of our procedure. We illustrate the numerical performance in finite samples and demonstrate the usefulness of the modelling and estimation procedure in a multi-covariate multi-response problem concerning concrete.

Right-censored and length-biased failure time data arise in many fields including cross-sectional prevalent cohort studies, and their analysis has recently attracted a great deal of attention. It is well-known that for regression analysis of failure time data, two commonly used approaches are hazard-based and quantile-based procedures, and most of the existing methods are the hazard-based ones. In this paper, we consider quantile regression analysis of right-censored and length-biased data and present a semiparametric varying-coefficient partially linear model. For estimation of regression parameters, a three-stage procedure that makes use of the inverse probability weighted technique is developed, and the asymptotic properties of the resulting estimators are established. In addition, the approach allows the dependence of the censoring variable on covariates, while most of the existing methods assume the independence between censoring variables and covariates. A simulation study is conducted and suggests that the proposed approach works well in practical situations. Also, an illustrative example is provided.
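At a fixed quantile level, inverse probability weighted estimation reduces to minimizing a weighted check loss. The sketch below shows weighted quantile regression via its standard linear-programming formulation; the generic nonnegative weights stand in for the paper's inverse-probability weights, which is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_quantile_reg(X, y, tau, w):
    """Weighted quantile regression via the standard LP formulation:
    minimize sum_i w_i * rho_tau(y_i - x_i' beta), where rho_tau is the
    check loss. Variables are beta = beta_plus - beta_minus and the
    positive/negative residual parts u, v.
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(2 * p), tau * w, (1 - tau) * w])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p + 2 * n), method="highs")
    return res.x[:p] - res.x[p:2 * p]

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(0, 2, n)])
y = X @ np.array([1.0, 1.5]) + rng.normal(0, 0.3, n)
beta = weighted_quantile_reg(X, y, tau=0.5, w=np.ones(n))
print(beta)  # close to the true coefficients (1.0, 1.5)
```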

We consider the problem of parameter estimation for inhomogeneous space-time shot-noise Cox point processes. We explore the possibility of using a stepwise estimation method and dimensionality-reducing techniques to estimate different parts of the model separately.

We discuss the estimation method using projection processes and propose a refined method that avoids projection to the temporal domain. This remedies the main flaw of the method using projection processes: clusters that are clearly separated in the original space-time process may overlap in a projection process. This issue is more prominent in the temporal projection process, where the amount of information lost by projection is higher than in the spatial projection process.

For the refined method, we derive consistency and asymptotic normality results under the increasing domain asymptotics and appropriate moment and mixing assumptions. We also present a simulation study that suggests that cluster overlapping is successfully overcome by the refined method.

Compositional tables – a continuous counterpart to contingency tables – carry relative information about the relationship between row and column factors; thus, for their analysis, only ratios between cells of a table are informative. Consequently, the standard Euclidean geometry should be replaced by the Aitchison geometry on the simplex, which enables decomposition of the table into its independent and interactive parts. The aim of the paper is to find an interpretable coordinate representation for independent and interaction tables (in the sense of balances and odds ratios of cells, respectively), in which further statistical processing of compositional tables can be performed. Theoretical results are applied to real-world problems from a health survey and in macroeconomics.
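The decomposition into independent and interactive parts can be sketched directly: the independent table is built from row and column geometric means, and the interactive table is the remainder in the perturbation sense. This is a minimal sketch of the standard Aitchison-geometry decomposition, not the paper's coordinate (balance) construction.

```python
import numpy as np

def decompose_table(x):
    """Orthogonal decomposition of a compositional table in the Aitchison
    geometry: the independent part is built from row and column geometric
    means, and the interactive part is x 'minus' (in the perturbation
    sense) the independent part; both are closed to sum to 1.
    """
    gr = np.exp(np.log(x).mean(axis=1, keepdims=True))  # row geometric means
    gc = np.exp(np.log(x).mean(axis=0, keepdims=True))  # column geometric means
    ind = gr * gc
    ind /= ind.sum()
    inter = x / ind
    inter /= inter.sum()
    return ind, inter

x = np.array([[0.2, 0.1],
              [0.3, 0.4]])
ind, inter = decompose_table(x)
# Every odds ratio of the independent table equals 1, and perturbing the
# independent part by the interactive part recovers the original table.
print(ind[0, 0] * ind[1, 1] / (ind[0, 1] * ind[1, 0]))  # equals 1 up to rounding
```

The independent part has all odds ratios equal to one (no row-column interaction), so all the interaction information sits in the second component, exactly the split the coordinate representations in the paper are designed to express.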

Regression discontinuity designs (RD designs) are used as a method for causal inference from observational data, where the decision to apply an intervention is made according to a ‘decision rule’ that is linked to some continuous variable. Such designs are being increasingly developed in medicine. The local average treatment effect (LATE) has been established as an estimator of the intervention effect in an RD design, particularly where a design's ‘decision rule’ is not adhered to strictly. Estimating the variance of the LATE is not necessarily straightforward. We consider three approaches to the estimation of the LATE: two-stage least squares, likelihood-based and a Bayesian approach. We compare these under a variety of simulated RD designs and a real example concerning the prescription of statins based on cardiovascular disease risk score.
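In its simplest difference-in-means form, the LATE in a fuzzy RD design is a local Wald ratio: the jump in the mean outcome at the cutoff divided by the jump in the treatment probability. The sketch below illustrates this on simulated data with imperfect adherence to the decision rule; the simulated design and the fixed bandwidth are illustrative assumptions, and the paper's two-stage least squares, likelihood and Bayesian estimators are more refined.

```python
import numpy as np

def late_wald(y, treated, score, cutoff, bandwidth):
    """Local Wald estimator of the LATE in a fuzzy RD design: the jump in
    the mean outcome at the cutoff divided by the jump in the treatment
    probability, using only observations within a bandwidth of the cutoff.
    This is the simplest instance of the two-stage least squares estimand
    with the decision rule as instrument.
    """
    near = np.abs(score - cutoff) < bandwidth
    above = score >= cutoff
    dy = y[near & above].mean() - y[near & ~above].mean()
    dt = treated[near & above].mean() - treated[near & ~above].mean()
    return dy / dt

rng = np.random.default_rng(0)
n = 20_000
score = rng.uniform(-1, 1, n)                      # e.g. a risk score
# imperfect adherence: crossing the cutoff raises treatment prob. 0.2 -> 0.8
treated = (rng.uniform(size=n) < np.where(score >= 0, 0.8, 0.2)).astype(float)
y = 1.0 + 2.0 * treated + 0.2 * score + rng.normal(0, 1, n)    # true effect 2
est = late_wald(y, treated, score, cutoff=0.0, bandwidth=0.2)
print(est)  # close to 2 (a small bias from the slope in the score remains)
```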

Linear increments (LI) are used to analyse repeated outcome data with missing values. Previously, two LI methods have been proposed, one allowing non-monotone missingness but not independent measurement error and one allowing independent measurement error but only monotone missingness. In both, it was suggested that the expected increment could depend on current outcome. We show that LI can allow non-monotone missingness and either independent measurement error of unknown variance or dependence of expected increment on current outcome but not both. A popular alternative to LI is a multivariate normal model ignoring the missingness pattern. This gives consistent estimation when data are normally distributed and missing at random (MAR). We clarify the relation between MAR and the assumptions of LI and show that for continuous outcomes multivariate normal estimators are also consistent under (non-MAR and non-normal) assumptions not much stronger than those of LI. Moreover, when missingness is non-monotone, they are typically more efficient.

Influential units occur frequently in surveys, especially in business surveys that collect economic variables whose distributions are highly skewed. A unit is said to be influential when its inclusion or exclusion from the sample has an important impact on the sampling error of estimates. We extend the concept of conditional bias attached to a unit and propose a robust version of the double expansion estimator, which depends on a tuning constant. We determine the tuning constant that minimizes the maximum estimated conditional bias. Our results can be naturally extended to the case of unit nonresponse, the set of respondents often being viewed as a second-phase sample. A robust version of calibration estimators, based on auxiliary information available at both phases, is also constructed.

Linear structural equation models, which relate random variables via linear interdependencies and Gaussian noise, are a popular tool for modelling multivariate joint distributions. The models correspond to mixed graphs that include both directed and bidirected edges representing the linear relationships and correlations between noise terms, respectively. A question of interest for these models is that of parameter identifiability, whether or not it is possible to recover edge coefficients from the joint covariance matrix of the random variables. For the problem of determining generic parameter identifiability, we present an algorithm building upon the half-trek criterion. Underlying our new algorithm is the idea that ancestral subsets of vertices in the graph can be used to extend the applicability of a decomposition technique.

We are concerned with a situation in which we would like to test multiple hypotheses with tests whose *p*-values cannot be computed explicitly but can be approximated using Monte Carlo simulation. This scenario occurs widely in practice. We are interested in obtaining the same rejections and non-rejections as the ones obtained if the *p*-values for all hypotheses had been available. The present article introduces a framework for this scenario by providing a generic algorithm for a general multiple testing procedure. We establish conditions that guarantee that the rejections and non-rejections obtained through Monte Carlo simulations are identical to the ones obtained with the *p*-values. Our framework is applicable to a general class of step-up and step-down procedures, which includes many established multiple testing corrections such as the ones of Bonferroni, Holm, Šidák, Hochberg or Benjamini–Hochberg. Moreover, we show how to use our framework to improve algorithms available in the literature in such a way as to yield theoretical guarantees on their results. These modifications can easily be implemented in practice and lead to a particular way of reporting multiple testing results: as three sets together with an error bound on their correctness, demonstrated on a real biological dataset.
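The three-set reporting idea can be sketched for the simplest case of a single Bonferroni threshold: a confidence interval for each Monte Carlo *p*-value either lies entirely below the threshold (reject), entirely above it (do not reject), or straddles it (undecided). This single-threshold variant is an illustrative assumption; the paper's framework covers general step-up and step-down procedures.

```python
import numpy as np
from scipy.stats import beta

def mc_bonferroni_sets(exceed_counts, B, alpha, m, level=1e-3):
    """Classify hypotheses as rejected / not rejected / undecided when each
    p-value is only known through Monte Carlo: exceed_counts[i] of the B
    simulated null statistics were at least as extreme as the observed one.
    A Clopper-Pearson interval for each p-value is compared with the
    Bonferroni threshold alpha/m, so every firm decision matches the one
    the true p-value would give, up to a simultaneous error of m * level.
    """
    thresh = alpha / m
    rejected, accepted, undecided = [], [], []
    for i, k in enumerate(exceed_counts):
        lo = beta.ppf(level / 2, k, B - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - level / 2, k + 1, B - k)
        if hi < thresh:
            rejected.append(i)
        elif lo > thresh:
            accepted.append(i)
        else:
            undecided.append(i)
    return rejected, accepted, undecided

# three hypotheses, B = 10000 null draws each; p-values near 0.0001, 0.5, 0.0165
rej, acc, und = mc_bonferroni_sets([1, 5000, 165], B=10000, alpha=0.05, m=3)
print(rej, acc, und)
```

Undecided hypotheses can then be resolved by drawing further Monte Carlo samples, which is the adaptive flavour the paper's algorithms exploit.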

Log-normal linear regression models are popular in many fields of research. Bayesian estimation of the conditional mean of the dependent variable is problematic as many choices of the prior for the variance (on the log-scale) lead to posterior distributions with no finite moments. We propose a generalized inverse Gaussian prior for this variance and derive the conditions on the prior parameters that yield posterior distributions of the conditional mean of the dependent variable with finite moments up to a pre-specified order. The conditions depend on one of the three parameters of the suggested prior; the other two have an influence on inferences for small and medium sample sizes. A second goal of this paper is to discuss how to choose these parameters according to different criteria including the optimization of frequentist properties of posterior means.

We find the asymptotic distribution of the multi-dimensional multi-scale and kernel estimators for high-frequency financial data with microstructure. Sampling times are allowed to be asynchronous and endogenous. In the process, we show that the classes of multi-scale and kernel estimators for smoothing noise perturbation are asymptotically equivalent in the sense of having the same asymptotic distribution for corresponding kernel and weight functions. The theory leads to multi-dimensional stable central limit theorems and feasible versions. Hence, it allows one to draw statistical inference for a broad class of multivariate models, which paves the way to tests and confidence intervals in risk measurement for arbitrary portfolios composed of assets observed at high frequency. As an application, we extend the approach to construct a test of the hypothesis that correlated assets are independent conditional on a common factor.

The paper proposes a new test for detecting the umbrella pattern under a general non-parametric scheme. The alternative asserts that the umbrella ordering holds, while the hypothesis is its complement. The main focus is put on controlling the power function of the test outside the alternative. As a result, the asymptotic error of the first kind of the constructed solution is smaller than or equal to the fixed significance level *α* on the whole set where the umbrella ordering does not hold. Under finite sample sizes, this error is also controlled to a satisfactory extent. A simulation study shows, among other things, that the new test improves upon the solution widely recommended in the literature on the subject. A routine, written in R, is attached as the Supporting Information file.

For multivariate survival data, we study the generalized method of moments (GMM) approach to estimation and inference based on the marginal additive hazards model. We propose an efficient iterative algorithm using closed-form solutions, which dramatically reduces the computational burden. Asymptotic normality of the proposed estimators is established, and the corresponding variance–covariance matrix can be consistently estimated. Inference procedures are derived based on the asymptotic chi-squared distribution of the GMM objective function. Simulation studies are conducted to empirically examine the finite sample performance of the proposed method, and a real data example from a dental study is used for illustration.

An extended single-index model is considered when responses are missing at random. A three-step estimation procedure is developed to define an estimator for the single-index parameter vector by a joint estimating equation. The proposed estimator is shown to be asymptotically normal. An algorithm for computing this estimator is proposed. This algorithm only involves one-dimensional nonparametric smoothers, thereby avoiding the data sparsity problem caused by high model dimensionality. Some simulation studies are conducted to investigate the finite sample performances of the proposed estimators.

In geostatistics and also in other applications in science and engineering, it is now common to perform updates on Gaussian process models with many thousands or even millions of components. These large-scale inferences involve modelling, representational and computational challenges. We describe a visualization tool for large-scale Gaussian updates, the ‘medal plot’. The medal plot shows the updated uncertainty at each observation location and also summarizes the sharing of information across observations, as a proxy for the sharing of information across the state vector (or latent process). As such, it reflects characteristics of both the observations and the statistical model. We illustrate with an application to assess mass trends in the Antarctic Ice Sheet, for which there are strong constraints from the observations and the physics.

Uniformly most powerful Bayesian tests (UMPBTs) are a new class of Bayesian tests in which null hypotheses are rejected if their Bayes factor exceeds a specified threshold. The alternative hypotheses in UMPBTs are defined to maximize the probability that the null hypothesis is rejected. Here, we generalize the notion of UMPBTs by restricting the class of alternative hypotheses over which this maximization is performed, resulting in restricted most powerful Bayesian tests (RMPBTs). We then derive RMPBTs for linear models by restricting alternative hypotheses to *g* priors. For linear models, the rejection regions of RMPBTs coincide with those of the usual frequentist *F*-tests.
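The defining maximization can be sketched in the simplest setting, a one-sided test of a normal mean with known variance and a simple alternative: the Bayes factor exceeds the threshold gamma exactly when the sample mean exceeds a cut-off that depends on the alternative, and the UMPBT alternative is the one minimizing that cut-off. The numerical check below verifies the closed-form minimizer against a grid search; this toy model is an assumption of the sketch, not the linear-model RMPBT setting of the paper.

```python
import numpy as np

def bf_threshold(mu1, sigma2, n, gamma):
    """Sample-mean cut-off at which the Bayes factor for the simple
    alternative H1: mu = mu1 against H0: mu = 0 exceeds gamma, in the
    normal model with known variance sigma2: BF10 > gamma iff
    xbar > sigma2 * log(gamma) / (n * mu1) + mu1 / 2.
    """
    return sigma2 * np.log(gamma) / (n * mu1) + mu1 / 2

# The UMPBT alternative minimizes this cut-off, i.e. maximizes the
# probability of rejection whatever the true mean:
# mu1* = sqrt(2 * sigma2 * log(gamma) / n).
sigma2, n, gamma = 1.0, 25, 10.0
mu_star = np.sqrt(2 * sigma2 * np.log(gamma) / n)
grid = np.linspace(0.01, 2.0, 2000)
mu_numeric = grid[np.argmin(bf_threshold(grid, sigma2, n, gamma))]
print(mu_star, mu_numeric)  # the closed form and the grid minimizer agree
```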

This paper presents a goodness-of-fit test for parametric regression models with scalar response and directional predictor, that is, a vector on a sphere of arbitrary dimension. The testing procedure is based on the weighted squared distance between a smooth and a parametric regression estimator, where the smooth regression estimator is obtained by a projected local approach. Asymptotic behaviour of the test statistic under the null hypothesis and local alternatives is provided, jointly with a consistent bootstrap algorithm for application in practice. A simulation study illustrates the performance of the test in finite samples. The procedure is applied to test a linear model in text mining.

In this paper, we reconsider the mixture vector autoregressive model, which was proposed in the literature for modelling non-linear time series. We complete and extend the stationarity conditions, derive a matrix formula in closed form for the autocovariance function of the process and prove a result on stable vector autoregressive moving-average representations of mixture vector autoregressive models. For these results, we apply techniques related to a Markovian representation of vector autoregressive moving-average processes. Furthermore, we analyse maximum likelihood estimation of model parameters by using the expectation–maximization algorithm and propose a new iterative algorithm for getting the maximum likelihood estimates. Finally, we study the model selection problem and testing procedures. Several examples, simulation experiments and an empirical application based on monthly financial returns illustrate the proposed procedures.
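The EM machinery can be sketched in a univariate toy version of the mixture VAR set-up, a two-component mixture AR(1): the E-step computes component responsibilities for each transition, and the M-step reduces to weighted least squares per component. The fixed i.i.d. regime selection and the initial values below are illustrative assumptions, not the paper's algorithm for the full mixture VAR model.

```python
import numpy as np

def em_mixture_ar1(y, n_iter=200):
    """EM for a two-component mixture AR(1) model: conditionally on the
    past, y_t ~ sum_k pi_k N(a_k * y_{t-1}, s2_k).
    E-step: responsibilities r[t, k]; M-step: weighted least squares for
    the AR coefficients and weighted variance updates.
    """
    x, z = y[:-1], y[1:]
    pi_k = np.array([0.5, 0.5])
    a = np.array([0.1, 0.8])
    s2 = np.array([np.var(z), np.var(z)])
    for _ in range(n_iter):
        dens = np.stack(
            [pi_k[k] * np.exp(-0.5 * (z - a[k] * x) ** 2 / s2[k])
             / np.sqrt(2 * np.pi * s2[k]) for k in range(2)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)      # E-step
        pi_k = r.mean(axis=0)                           # M-step
        for k in range(2):
            w = r[:, k]
            a[k] = np.sum(w * x * z) / np.sum(w * x * x)
            s2[k] = np.sum(w * (z - a[k] * x) ** 2) / np.sum(w)
    return pi_k, a, s2

rng = np.random.default_rng(0)
n = 4000
y = np.zeros(n)
for t in range(1, n):   # regime drawn i.i.d. with probability 1/2 each
    a_t = 0.9 if rng.random() < 0.5 else -0.5
    y[t] = a_t * y[t - 1] + rng.normal()
pi_k, a, s2 = em_mixture_ar1(y)
print(np.sort(a))  # close to the true AR coefficients (-0.5, 0.9)
```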

Focusing on the model selection problems in the family of Poisson mixture models (including the Poisson mixture regression model with random effects and zero-inflated Poisson regression model with random effects), the current paper derives two conditional Akaike information criteria. The criteria are the unbiased estimators of the conditional Akaike information based on the conditional log-likelihood and the conditional Akaike information based on the joint log-likelihood, respectively. The derivation is free from the specific parametric assumptions about the conditional mean of the true data-generating model and applies to different types of estimation methods. Additionally, the derivation is not based on the asymptotic argument. Simulations show that the proposed criteria have promising estimation accuracy. In addition, it is found that the criterion based on the conditional log-likelihood demonstrates good model selection performance under different scenarios. Two sets of real data are used to illustrate the proposed method.