We propose a three-state frailty Markov model coupled with likelihood-based inference to analyse tooth level life course data in caries research. This analysis is challenging because of intraoral clustering, interval censoring, multiplicity of caries states and computational complexities. We also develop a Bayesian approach to predict future caries transition probabilities given observed life history data. Numerical experiments demonstrate that the methods proposed perform very well in finite samples with moderate sizes. The practical utility of the model is illustrated by using life course data from a unique longitudinal study of dental caries in young low income urban African-American children. In this analysis, we evaluate whether there is spatial symmetry in the mouth with respect to the life course of dental caries, and whether the same type of tooth has a similar decay process in boys and girls.

Interconnectedness between stocks and firms plays a crucial role in the volatility contagion phenomena that characterize financial crises, and graphs are a natural tool in their analysis. We propose graphical methods for an analysis of volatility interconnections in the Standard & Poor's 100 data set during the period 2000–2013, which contains the 2007–2008 Great Financial Crisis. The challenges are twofold: first, volatilities are not directly observed and must be extracted from time series of stock returns; second, the observed series, with about 100 stocks, is high dimensional, and curse-of-dimensionality problems are to be faced. To overcome this double challenge, we propose a dynamic factor model methodology, decomposing the panel into a factor-driven and an idiosyncratic component modelled as a sparse vector auto-regressive model. The inversion of this auto-regression, along with suitable identification constraints, produces networks in which, for a given horizon *h*, the weight associated with edge (*i*,*j*) represents the *h*-step-ahead forecast error variance of variable *i* accounted for by variable *j*'s innovations. Then, we show how those graphs yield an assessment of how *systemic* each firm is. They also demonstrate the prominent role of financial firms as sources of contagion during the 2007–2008 crisis.
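The network construction described above can be sketched in a few lines. The snippet below computes the *h*-step forecast error variance shares for a toy VAR(1) under the simplifying assumption of orthogonal unit-variance innovations (the paper's identification constraints are more involved); the coefficient matrix `A` and horizon are purely illustrative.

```python
import numpy as np

def fevd_network(A, h):
    """h-step forecast error variance decomposition for a VAR(1)
    x_t = A x_{t-1} + e_t with orthogonal unit-variance innovations.

    Entry (i, j) is the share of variable i's h-step forecast error
    variance accounted for by variable j's innovations; rows sum to 1.
    """
    n = A.shape[0]
    Psi = np.eye(n)                      # MA(0) coefficient
    contrib = np.zeros((n, n))
    for _ in range(h):
        contrib += Psi ** 2              # elementwise squares accumulate variance shares
        Psi = A @ Psi                    # next moving-average coefficient A^k
    return contrib / contrib.sum(axis=1, keepdims=True)

# toy system: variable 2's innovations spill over into variable 1, not vice versa
A = np.array([[0.5, 0.3],
              [0.0, 0.4]])
W = fevd_network(A, h=10)
```

In the resulting weighted adjacency matrix `W`, the asymmetry of `A` shows up directly: variable 1 receives a positive share from variable 2's innovations, while variable 2 receives none from variable 1, which is the sense in which such graphs reveal sources of contagion.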

We propose a hidden Markov mixture model for the analysis of gene expression measurements mapped to chromosome locations. These expression values represent preprocessed light intensities observed in each probe of Affymetrix oligonucleotide arrays. Here, the algorithm BLAT is used to align thousands of probe sequences to each chromosome. The main goal is to identify genome regions associated with high expression values which define clusters composed of consecutive observations. The model proposed assumes a mixture distribution in which one of the components (the one with the highest expected value) is supposed to accommodate the overexpressed clusters. The model takes advantage of the serial structure of the data and uses the distance information between neighbours to infer the existence of a Markov dependence. This dependence is crucially important in the detection of overexpressed regions. We propose and discuss a Markov chain Monte Carlo algorithm to fit the model. Finally, the methodology proposed is used to analyse five data sets representing three types of cancer (breast, ovarian and brain).

Motivated by the evaluation of the causal effect of the General Agreement on Tariffs and Trade on bilateral international trade flows, we investigate the role of network structure in propensity score matching under the assumption of strong ignorability. We study the sensitivity of causal inference with respect to the presence of characteristics of the network in the set of confounders conditionally on which strong ignorability is assumed to hold. We find that estimates of the average causal effect are highly sensitive to the node level network statistics in the set of confounders. Therefore, we argue that estimates may suffer from omitted variable bias when the network information is ignored, at least in our application.

Consumer products and services can often be described as mixtures of ingredients. Examples are the mixture of ingredients in a cocktail and the mixture of different components of travel time (e.g. in-vehicle and out-of-vehicle travel time) in a transportation setting. Choice experiments may help to determine how the respondent's choice of a product or service is affected by the combination of ingredients. In such experiments, individuals are confronted with sets of hypothetical products or services and they are asked to choose the most preferred product or service from each set. However, there are no studies on the optimal design of choice experiments involving mixtures. We propose a method for generating optimal designs for such choice experiments and demonstrate the large increase in statistical efficiency that can be obtained by using an optimal design.

The correct identification of the source of a propagation process is crucial in many research fields. As a specific application, we consider source estimation of delays in public transportation networks. We propose two approaches: an effective distance median and a backtracking method. The former is based on a structurally generic effective distance-based approach for the identification of infectious disease origins, and the latter is specifically designed for delay propagation. We examine the performance of both methods in simulation studies and in an application to the German railway system, and we compare the results with those of a centrality-based approach for source detection.
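A toy version of effective-distance source detection can illustrate the idea behind the first approach. The sketch below uses the Brockmann–Helbing effective distance d(i, j) = 1 − log P(i, j) along shortest paths and scores each candidate source by the median of its effective distances to the affected nodes; the scoring rule, graph and transition probabilities are illustrative stand-ins, not the paper's exact method.

```python
import heapq
import math

def effective_distances(P, source):
    """Single-source shortest paths under effective distance
    d(i, j) = 1 - log P[i][j], computed with Dijkstra's algorithm."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue
        for v, p in P.get(u, {}).items():
            nd = d + 1.0 - math.log(p)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def source_estimate(P, affected):
    """Candidate source = node reaching all affected nodes whose effective
    distances to them have the smallest median (a simple scoring rule)."""
    def median(xs):
        xs = sorted(xs)
        n = len(xs)
        return xs[n // 2] if n % 2 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])
    best, best_score = None, math.inf
    for cand in P:
        dist = effective_distances(P, cand)
        if not all(a in dist for a in affected):
            continue          # cand cannot have caused delays it does not reach
        score = median(dist[a] for a in affected)
        if score < best_score:
            best, best_score = cand, score
    return best

# hypothetical directed propagation network: delays spread from hub 'A'
P = {'A': {'B': 0.6, 'C': 0.4}, 'B': {'D': 1.0}, 'C': {'D': 1.0}, 'D': {}}
```

With delays observed at `['B', 'C', 'D']`, only `'A'` reaches all affected nodes, so it is recovered as the source; on richer graphs the median score does the discriminating.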

Dupuytren disease is a fibroproliferative disorder with unknown aetiology that often progresses and eventually can cause permanent contractures of the fingers affected. We provide a computationally efficient Bayesian framework to discover potential risk factors and investigate which fingers are jointly affected. Our Bayesian approach is based on Gaussian copula graphical models, which provide a way to discover the underlying conditional independence structure of variables in multivariate data of mixed types. In particular, we combine the semiparametric Gaussian copula with extended rank likelihood to analyse multivariate data of mixed types with arbitrary marginal distributions. For structural learning, we construct a computationally efficient search algorithm by using a transdimensional Markov chain Monte Carlo algorithm based on a birth–death process. In addition, to make our statistical method easily accessible to other researchers, we have implemented our method in C++ and provide an interface with R software as an R package BDgraph, which is freely available from http://CRAN.R-project.org/package=BDgraph.

The increasing awareness of treatment effect heterogeneity has motivated flexible designs of confirmatory clinical trials that prospectively allow investigators to test for treatment efficacy for a subpopulation of patients in addition to the entire population. If a target subpopulation is not well characterized in the design stage, it can be developed at the end of a broad eligibility trial under an adaptive signature design. The paper proposes new procedures for subgroup selection and treatment effect estimation (for the selected subgroup) under an adaptive signature design. We first provide a simple and general characterization of the optimal subgroup that maximizes the power for demonstrating treatment efficacy or the expected gain based on a specified utility function. This characterization motivates a procedure for subgroup selection that involves prediction modelling, augmented inverse probability weighting and low dimensional maximization. A cross-validation procedure can be used to remove or reduce any resubstitution bias that may result from subgroup selection, and a bootstrap procedure can be used to make inference about the treatment effect in the subgroup selected. The approach proposed is evaluated in simulation studies and illustrated with real examples.
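The characterization of the optimal subgroup can be made concrete with a small simulation. Under a linear utility (benefit minus a fixed per-patient cost), the subgroup maximizing expected gain is simply the set of patients whose treatment effect exceeds the cost; the sketch below checks that no other subgroup does better. The effects `delta` and `cost` are simulated and illustrative, and this is only the characterization step, not the paper's full procedure (prediction modelling, augmented inverse probability weighting, cross-validation).

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.normal(0.0, 1.0, size=1000)   # (estimated) individual treatment effects
cost = 0.5                                 # per-patient utility cost of treating

def expected_gain(subgroup):
    """Average utility of treating exactly the patients in `subgroup`."""
    return np.mean((delta - cost) * subgroup)

optimal = delta > cost                     # threshold rule: treat iff effect exceeds cost
# competing subgroups for comparison: random treatment sets
others = [rng.random(delta.size) < 0.5 for _ in range(100)]
```

The threshold rule keeps exactly the positive contributions of `delta - cost`, so any other subgroup's expected gain can only be lower; this is why subgroup selection reduces to a low dimensional maximization over a cut-off.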

The estimation of time varying networks for functional magnetic resonance imaging data sets is of increasing importance and interest. We formulate the problem in a high dimensional time series framework and introduce a data-driven method, namely *network change points detection*, which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. Network change points detection is applied to various simulated data and a resting state functional magnetic resonance imaging data set. This new methodology also allows us to identify common functional states within and across subjects. Finally, network change points detection promises to offer a deep insight into the large-scale characterizations and dynamics of the brain.

Symbolic or categorical sequences occur in many contexts and can be characterized, for example, by integer-valued intersymbol distances or binary-valued indicator sequences. The analysis of these numerical sequences often sheds light on the properties of the original symbolic sequences. This work introduces new statistical tools for exploring auto-correlation structure in the indicator sequences, for the specific case of deoxyribonucleic acid (DNA) sequences. It is known that the probability distribution of internucleotide distances of DNA sequences deviates significantly from the distribution obtained by assuming independent random placement (i.e. the geometric distribution) and that the deviations can be used either to discriminate between species or to build phylogenetic trees. To investigate the extent to which auto-correlation structure explains these deviations, the 0–1 indicator sequence of each nucleotide (A, C, G and T) is endowed with a binary auto-regressive (AR) model of optimum order. The corresponding binary AR geometric distribution is derived analytically and compared with the observed internucleotide distance distribution by appropriate goodness-of-fit testing. Results in 34 mitochondrial DNA sequences show that the hypothesis of equal observed/expected frequencies is seldom rejected when a binary AR model is considered instead of independence (76/136 *versus* 125/136 rejections at the 1% level), in spite of goodness-of-fit testing tending to reject for large samples, regardless of how close observed/expected values are. Furthermore, binary AR structure also leads to a median discrepancy reduction of 90% for G, 80% for C, 60% for T and 30% for nucleotide A. Therefore, these models are useful to describe the dependences within a given nucleotide and encourage the development of a model-based framework to compact internucleotide distance information and to understand DNA differences among species further.
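For the order 1 (Markov) case, the binary AR distance distribution can be written down directly: with P(1→1) = p11 and P(0→0) = p00, the distance to the next 1 is d = 1 with probability p11 and d ≥ 2 with probability (1 − p11)·p00^(d−2)·(1 − p00). The sketch below, with illustrative transition probabilities, contrasts this with the geometric benchmark implied by independent placement; both share the mean recurrence time 1/π₁ but differ in shape.

```python
p11, p00 = 0.4, 0.8            # Markov transition probabilities of the 0-1 indicator
pi1 = (1 - p00) / ((1 - p00) + (1 - p11))   # stationary probability of a 1

def markov_distance_pmf(d):
    """P(next 1 occurs exactly d steps after a 1) under the order 1 binary AR chain."""
    if d == 1:
        return p11
    return (1 - p11) * p00 ** (d - 2) * (1 - p00)

def geometric_pmf(d):
    """Independent-placement benchmark: P(D = d) = (1 - pi1)^(d-1) * pi1."""
    return (1 - pi1) ** (d - 1) * pi1

total = sum(markov_distance_pmf(d) for d in range(1, 2000))
```

Both distributions have mean 1/π₁ (renewal theorem), yet `markov_distance_pmf(1)` = 0.4 while `geometric_pmf(1)` = 0.25 here: the dependence reshapes the short-distance probabilities, which is exactly where the observed internucleotide distributions deviate from geometric.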

When experiments are performed on social networks, it is difficult to justify the usual assumption of treatment–unit additivity, owing to the connections between actors in the network. We investigate how connections between experimental units affect the design of experiments on those experimental units. Specifically, where we have unstructured treatments, whose effects propagate according to a linear network effects model which we introduce, we show that optimal designs are no longer necessarily balanced; we further demonstrate how experiments which do not take a network effect into account can lead to much higher variance than necessary and/or a large bias. We demonstrate the use of this methodology in a wide range of settings, including agricultural trials and crossover trials, as well as experiments on connected individuals in a social network.

Complex network data problems are increasingly common in many fields of application. Our motivation is drawn from strategic marketing studies monitoring customer choices of specific products, along with co-subscription networks encoding multiple-purchasing behaviour. Data are available for several agencies within the same insurance company, and our goal is to exploit co-subscription networks efficiently to inform targeted advertising of cross-sell strategies to currently monoproduct customers. We address this goal by developing a Bayesian hierarchical model, which clusters agencies according to common monoproduct customer choices and co-subscription networks. Within each cluster, we efficiently model customer behaviour via a cluster-dependent mixture of latent eigenmodels. This formulation provides key information on monoproduct customer choices and multiple-purchasing behaviour within each cluster, informing targeted cross-sell strategies. We develop simple algorithms for tractable inference and assess performance in simulations and an application to business intelligence.

We introduce a non-stationary spatiotemporal model for gridded data on the sphere. The model specifies a computationally convenient covariance structure that depends on heterogeneous geography. Widely used statistical models on a spherical domain are non-stationary for different latitudes, but stationary at the same latitude (*axial symmetry*). This assumption has been acknowledged to be too restrictive for quantities such as surface temperature, whose statistical behaviour is influenced by large-scale geographical descriptors such as land and ocean. We propose an evolutionary spectrum approach that can account for different regimes across the Earth's geography and results in a more general and flexible class of models that vastly outperforms axially symmetric models and captures longitudinal patterns that would otherwise be assumed constant. The model can be estimated with a multistep conditional likelihood approximation that preserves the non-stationary features while allowing for easily distributed computations: we show how the model can be fitted to more than 20 million data points in less than 1 day on a state of the art workstation. The resulting estimates from the statistical model can be regarded as a synthetic description (i.e. a compression) of the space–time characteristics of an entire initial condition ensemble.

Genetic interactions confer robustness on cells in response to genetic perturbations. This often occurs through molecular buffering mechanisms that can be predicted by using, among other features, the degree of coexpression between genes, which is commonly estimated through marginal measures of association such as Pearson or Spearman correlation coefficients. However, marginal correlations are sensitive to indirect effects and often partial correlations are used instead. Yet, partial correlations convey no information about the (linear) influence of the coexpressed genes on the entire multivariate system, which may be crucial to discriminate functional associations from genetic interactions. To address these two shortcomings, here we propose to use the edge weight derived from the covariance decomposition over the paths of the associated gene network. We call this new quantity the networked partial correlation and use it to analyse genetic interactions in yeast.
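The networked partial correlation itself is the paper's novel quantity, but the contrast that motivates it, marginal correlations picking up indirect effects that partial correlations remove, is standard and easy to demonstrate. In the sketch below (simulated chain X → Y → Z), X and Z are strongly marginally correlated yet have near-zero partial correlation given Y; the partial correlations are read off the inverse covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# chain structure: X -> Y -> Z, so X and Z are only indirectly associated
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(size=5000)
z = 0.8 * y + rng.normal(size=5000)
data = np.column_stack([x, y, z])

corr = np.corrcoef(data, rowvar=False)           # marginal correlations
omega = np.linalg.inv(np.cov(data, rowvar=False))  # precision matrix
d = np.sqrt(np.diag(omega))
pcorr = -omega / np.outer(d, d)                  # partial correlations off the diagonal
np.fill_diagonal(pcorr, 1.0)
```

The directly linked pair (X, Y) keeps a large partial correlation, while the indirect pair (X, Z) collapses towards zero, illustrating why coexpression analyses prefer partial over marginal measures before any path-based decomposition is applied.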

Estimation of the effect of a treatment in the presence of unmeasured confounding is a common objective in observational studies. The two-stage least squares instrumental variables procedure is frequently used but is not applicable to time-to-event data if some observations are censored. We develop a simultaneous equations model to account for unmeasured confounding of the effect of treatment on survival time subject to censoring. The identification of the treatment effect is assisted by instrumental variables (variables related to treatment but, conditional on treatment, not to the outcome) and the assumed bivariate distribution underlying the data-generating process. The methodology is illustrated on data from an observational study of time to death following endovascular or open repair of ruptured abdominal aortic aneurysms. As the instrumental variable and the distributional assumptions cannot be jointly assessed from the observed data, we evaluate the sensitivity of the results to these assumptions.
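The uncensored two-stage least squares procedure that the simultaneous equations model generalizes can be sketched on simulated data. Below, an unmeasured confounder `u` biases the naive regression of outcome on treatment, while the instrument-based (Wald ratio) estimate recovers the true effect; all variable names and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
u = rng.normal(size=n)                 # unmeasured confounder
z = rng.normal(size=n)                 # instrument: affects treatment, not outcome directly
d = 0.7 * z + u + rng.normal(size=n)   # treatment, confounded by u
y = 2.0 * d + u + rng.normal(size=n)   # outcome; true treatment effect = 2

naive = np.cov(d, y)[0, 1] / np.var(d)         # OLS slope, biased upwards by u
stage1 = np.cov(z, d)[0, 1] / np.var(z)        # first stage: regress d on z
d_hat = stage1 * z                             # fitted treatment values
tsls = np.cov(d_hat, y)[0, 1] / np.var(d_hat)  # second stage: regress y on fitted d
```

The naive slope is pulled away from 2 by the shared dependence on `u`, whereas the two-stage estimate uses only the exogenous variation in `d` induced by `z`. With censored survival outcomes this second stage breaks down, which is the gap the paper's bivariate model fills.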

In regression models for categorical data a linear model is typically related to the response variables via a transformation of probabilities called the link function. We introduce an approach based on two link functions for binary data named the *log-mean* and the *log-mean linear* methods. The choice of the link function plays a key role in the interpretation of the model, and our approach is especially appealing in terms of interpretation of the effects of covariates on the association of responses. Similarly to Poisson regression, the log-mean and log-mean linear regression coefficients of single outcomes are log-relative-risks, and we show that the relative risk interpretation is maintained also in the regressions of the association of responses. Furthermore, certain collections of zero log-mean linear regression coefficients imply that the relative risks for joint responses factorize with respect to the corresponding relative risks for marginal responses. This work is motivated by the analysis of a data set obtained from a case–control study aimed at investigating the effect of human immunodeficiency virus infection on *multimorbidity*, i.e. simultaneous presence of two or more non-infectious comorbidities in one patient.
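The relative risk interpretation of a log link can be shown with a two-line calculation: if log P(Y = 1 | x) = b0 + b1·x, then a one-unit increase in x multiplies the risk by exp(b1). The numbers below are illustrative, and this sketches only the single-outcome case, not the paper's log-mean linear parameterization of response associations.

```python
import math

b0, b1 = math.log(0.05), math.log(1.5)   # log-mean regression: log P(Y=1|x) = b0 + b1*x

def risk(x):
    """Outcome probability implied by the log-mean model at covariate value x."""
    return math.exp(b0 + b1 * x)

rr = risk(1) / risk(0)                    # relative risk for a one-unit covariate change
```

Here `rr` equals exp(b1) = 1.5 exactly, so the coefficient b1 is a log-relative-risk, the property that the paper shows carries over to regressions of the association of responses.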

The association between maternal age of onset of dementia and amyloid deposition (measured by *in vivo* positron emission tomography imaging) in cognitively normal older offspring is of interest. In a regression model for amyloid, special methods are required because of the random right censoring of the covariate of maternal age of onset of dementia. Prior literature has proposed methods to address the problem of censoring due to assay limit of detection, but not random censoring. We propose imputation methods and a survival regression method that do not require parametric assumptions about the distribution of the censored covariate. Existing imputation methods address missing covariates, but not right-censored covariates. In simulation studies, we compare these methods with the simple, but inefficient, complete-case analysis, and with thresholding approaches. We apply the methods to the motivating Alzheimer's disease study.

We analyse the delay between diagnosis of illness and claim settlement in critical illness insurance by using generalized linear-type models under a generalized beta of the second kind family of distributions. A Bayesian approach is employed which allows us to incorporate parameter and model uncertainty and also to impute missing data in a natural manner. We propose methodology involving a latent likelihood ratio test to compare missing data models and a version of posterior predictive *p*-values to assess different models. Bayesian variable selection is also performed, supporting a small number of models with small Bayes factors, and therefore we base our predictions on model averaging instead of on a best-fitting model.

A Bayesian model and design are described for a phase I–II trial to optimize jointly the doses of a targeted agent and a chemotherapy agent for solid tumours. A challenge in designing the trial was that both the efficacy and the toxicity outcomes were defined as four-level ordinal variables. To reflect possibly complex joint effects of the two doses on each of the two outcomes, for each marginal distribution a generalized continuation ratio model was assumed, with each agent's dose parametrically standardized in the linear term. A copula was assumed to obtain a bivariate distribution. Elicited outcome probabilities were used to construct a prior, with variances calibrated to obtain small prior effective sample size. Elicited numerical utilities of the 16 elementary outcomes were used to compute posterior mean utilities as criteria for selecting dose pairs, with adaptive randomization to reduce the risk of becoming stuck at a suboptimal pair. A simulation study showed that parametric dose standardization with additive dose effects provides a robust reliable model for dose pair optimization in this setting, and it compares favourably with designs based on alternative models that include dose–dose interaction terms. The model and method proposed are applicable generally to other clinical trial settings with similar dose and outcome structures.

In large multilevel studies effects of interest are often evaluated for a number of more or less related outcomes. For instance, the present work was motivated by the multiplicity issues that arose in the analysis of a cluster-randomized, crossover intervention study evaluating the health benefits of a school meal programme. We propose a novel and versatile framework for simultaneous inference on parameters estimated from linear mixed models that were fitted separately for several outcomes from the same study, but did not necessarily contain the same fixed or random effects. By combining asymptotic representations of parameter estimates from separate model fits we could derive the joint asymptotic normal distribution for all parameter estimates of interest for all outcomes considered. This result enabled the construction of simultaneous confidence intervals and calculation of adjusted *p*-values. For sample sizes of practical relevance we studied simultaneous coverage through simulation, which showed that the approach achieved acceptable coverage probabilities even for small sample sizes (10 clusters) and for 2–16 outcomes. The approach also compared favourably with a joint modelling approach. We also analysed data with 17 outcomes from the motivating study, resulting in adjusted *p*-values that were appreciably less conservative than Bonferroni adjustment.

Group sequential study designs have been proposed as an approach to conserve resources in biomarker validation studies. Typically, group sequential study designs allow both ‘early termination to reject the null hypothesis’ and ‘early termination for futility’ if there is evidence against the alternative hypothesis. In contrast, several researchers have advocated for using group sequential study designs that allow only early termination for futility in biomarker validation studies because of the desire to obtain a precise estimate of marker performance at study completion. This suggests a loss function that heavily weights the precision of the estimate that is obtained at study completion at the expense of an increased sample size when there is strong evidence against the null hypothesis. We propose a formal approach to comparing designs that allow early termination for futility, superiority or both by developing a loss function that incorporates the expected sample size under the null and alternative hypotheses, as well as the mean-squared error of the estimate that is obtained at study completion. We then use our loss function to compare several candidate designs and derive optimal two-stage designs for a recently reported validation study of a novel prostate cancer biomarker.
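A loss function of the form described can be sketched directly: a weighted sum of the expected sample sizes under the null and alternative hypotheses plus a penalty on the mean-squared error of the final estimate. All operating characteristics and weights below are hypothetical placeholders, not values from the paper.

```python
def expected_n(n1, n2, p_stop_stage1):
    """Expected sample size of a two-stage design that stops at n1 with
    probability p_stop_stage1 and otherwise continues to n1 + n2."""
    return n1 + (1 - p_stop_stage1) * n2

def loss(design, w_null=1.0, w_alt=1.0, w_mse=100.0):
    """Weighted loss combining expected sample size under H0 and H1 with
    the MSE of the marker-performance estimate at study completion."""
    return (w_null * expected_n(design['n1'], design['n2'], design['p_stop_h0'])
            + w_alt * expected_n(design['n1'], design['n2'], design['p_stop_h1'])
            + w_mse * design['mse'])

# hypothetical candidate designs: futility stopping only vs. futility + superiority
futility_only = {'n1': 50, 'n2': 50, 'p_stop_h0': 0.6, 'p_stop_h1': 0.05, 'mse': 0.004}
both_stops = {'n1': 50, 'n2': 50, 'p_stop_h0': 0.6, 'p_stop_h1': 0.70, 'mse': 0.009}
```

With the default weights the design allowing superiority stopping wins on expected sample size; raising `w_mse` far enough reverses the ranking in favour of the futility-only design, which formalizes the trade-off that the abstract describes.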

Design conditions for marine structures are typically informed by threshold-based extreme value analyses of oceanographic variables, in which excesses of a high threshold are modelled by a generalized Pareto distribution. Too low a threshold leads to bias from model misspecification, and raising the threshold increases the variance of estimators: a bias–variance trade-off. Many existing threshold selection methods do not address this trade-off directly but rather aim to select the lowest threshold above which the generalized Pareto model is judged to hold approximately. In the paper, Bayesian cross-validation is used to address the trade-off by comparing thresholds based on predictive ability at extreme levels. Extremal inferences can be sensitive to the choice of a single threshold, so we use Bayesian model averaging to combine inferences from many thresholds, thereby reducing this sensitivity. The methodology is applied to significant wave height data sets from the northern North Sea and the Gulf of Mexico.

Mortality forecasts are typically limited in that they pertain only to national death rates, predict only all-cause mortality or do not capture and utilize the correlation between diseases. We present a novel Bayesian hierarchical model that jointly forecasts cause-specific death rates for geographic subunits. We examine its effectiveness by applying it to US vital statistics data for 1979–2011 and produce forecasts to 2024. Not only does the model generate coherent forecasts for mutually exclusive causes of death, but it also has lower out-of-sample error than alternative commonly used models for forecasting mortality.

We consider the problem of designing additional experiments to update statistical models for latent day specific effects. The problem appears in thermal spraying, where particles are sprayed on surfaces to obtain a coating. The relationships between in-flight properties of the particles and the controllable variables are modelled by generalized linear models. However, there are also non-controllable variables, which may vary from day to day and are modelled by day-specific additive effects. Existing generalized linear models for properties of the particles in flight must be updated on a limited number of additional experiments on a different day. We develop robust *D*-optimal designs to collect additional data for an update of the day effects, which are efficient for the estimation of the parameters in all models under consideration. The results are applied to the thermal spraying process and a comparison of the statistical analysis based on a reference design as well as on a selected Bayesian *D*-optimal design is performed.

We explore the sensitivity of time varying confounding adjusted estimates to different dropout mechanisms. We extend the Heckman correction to two time points and explore selection models to investigate situations where the dropout process is driven by unobserved variables and the outcome respectively. The analysis is embedded in a Bayesian framework which provides several advantages. These include fitting a hierarchical structure to processes that repeat over time and avoiding exclusion restrictions in the case of the Heckman correction. We adopt the decision theoretic approach to causal inference which makes explicit the *no-regime-dropout dependence* assumption. We apply our methods to data from the ‘Counterweight programme’ pilot: a UK protocol to address obesity in primary care. A simulation study is also implemented.

Statistical models used to estimate the spatiotemporal pattern in disease risk from areal unit data represent the risk surface for each time period with known covariates and a set of spatially smooth random effects. The latter act as a proxy for unmeasured spatial confounding, whose spatial structure is often characterized by a spatially smooth evolution between some pairs of adjacent areal units whereas other pairs exhibit large step changes. This spatial heterogeneity is not consistent with existing global smoothing models, in which partial correlation exists between all pairs of adjacent spatial random effects. Therefore we propose a novel space–time disease model with an adaptive spatial smoothing specification that can identify step changes. The model is motivated by a new study of respiratory and circulatory disease risk across the set of local authorities in England and is rigorously tested by simulation to assess its efficacy. Results from the England study show that the two diseases have similar spatial patterns in risk and exhibit some common step changes in the unmeasured component of risk between neighbouring local authorities.

Residual life is of great interest to patients with life threatening disease. It is also important for clinicians who estimate prognosis and make treatment decisions. Quantile residual life has emerged as a useful summary measure of the residual life. It has many desirable features, such as robustness and easy interpretation. In many situations, the longitudinally collected biomarkers during patients' follow-up visits carry important prognostic value. In this work, we study quantile regression methods that allow for dynamic predictions of the quantile residual life, by flexibly accommodating the post-baseline biomarker measurements in addition to the baseline covariates. We propose unbiased estimating equations that can be solved via existing L1-minimization algorithms. The resulting estimators have desirable asymptotic properties and satisfactory finite sample performance. We apply our method to a study of chronic myeloid leukaemia to demonstrate its usefulness as a dynamic prediction tool.
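The estimating equations rest on the fact that the τth quantile minimizes the expected check loss, which is what makes L1-minimization algorithms applicable. The one-sample sketch below recovers the median of simulated residual lifetimes by grid search over the check loss; it illustrates the loss only, not the paper's covariate-adjusted estimating equations, and the exponential lifetimes are illustrative.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile regression check function: rho_tau(u) = u * (tau - 1{u < 0})."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

rng = np.random.default_rng(4)
resid_life = rng.exponential(scale=10.0, size=20000)   # illustrative residual lifetimes

# the tau-th quantile minimises expected check loss; a grid-search sketch for tau = 0.5
grid = np.linspace(0.05, 40.0, 800)
losses = [float(check_loss(resid_life - q, 0.5).mean()) for q in grid]
q_hat = grid[int(np.argmin(losses))]                   # ~ median residual life
```

For an exponential distribution with scale 10 the true median is 10·log 2 ≈ 6.93, and the minimizer lands there; in the regression setting the same loss is minimized over coefficients, which existing L1 solvers handle.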

Weather forecasts are typically given in the form of forecast ensembles obtained from multiple runs of numerical weather prediction models with varying initial conditions and physics parameterizations. Such ensemble predictions tend to be biased and underdispersive and thus require statistical post-processing. In the ensemble model output statistics approach, a probabilistic forecast is given by a single parametric distribution with parameters depending on the ensemble members. The paper proposes two semilocal methods for estimating the ensemble model output statistics coefficients where the training data for a specific observation station are augmented with corresponding forecast cases from stations with similar characteristics. Similarities between stations are determined by using either distance functions or clustering based on various features of the climatology, forecast errors and locations of the observation stations. In a case-study on wind speed over Europe with forecasts from the ‘Grand limited area model ensemble prediction system’, the similarity-based semilocal models proposed show significant improvement in predictive performance compared with standard regional and local estimation methods. They further allow for estimating complex models without numerical stability issues and are computationally more efficient than local parameter estimation.

New computer algorithms for finding *D*-optimal designs of stimulus sequence for functional magnetic resonance imaging (MRI) experiments are proposed. Although functional MRI data are commonly analysed by linear models, the construction of a functional MRI design matrix is much more complicated than in conventional experimental design problems. Inspired by the widely used exchange algorithm technique, our proposed approach implements a greedy search strategy over the vast functional MRI design space for a *D*-optimal design. Compared with a recently proposed genetic algorithm, our algorithms are superior in terms of computing time and achieved design efficiency in both single-objective and multiobjective problems. In addition, the algorithms proposed are sufficiently flexible to incorporate a constraint that requires the exact number of appearances of each type of stimulus in a design. This realistic design issue is unfortunately not well handled by existing methods.
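The exchange algorithm idea that inspires the search strategy can be shown on a deliberately tiny problem: a quadratic regression on [−1, 1] rather than a functional MRI design space. The sketch below repeatedly swaps one design point for the candidate that most increases log det(X′X) until no swap helps; the candidate grid and run size are illustrative.

```python
import numpy as np

def log_det_info(X):
    """D-criterion: log det(X'X) of the design's information matrix."""
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

def greedy_exchange(cand, n_runs, seed=0):
    """Exchange-algorithm sketch: swap one run at a time for the candidate
    point that improves the D-criterion, until no swap improves it."""
    rng = np.random.default_rng(seed)
    idx = list(rng.choice(len(cand), size=n_runs, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(n_runs):
            current = log_det_info(cand[idx])
            for j in range(len(cand)):
                trial = idx.copy()
                trial[i] = j
                if log_det_info(cand[trial]) > current + 1e-9:
                    idx = trial
                    current = log_det_info(cand[idx])
                    improved = True
    return cand[idx]

# candidate set: quadratic model 1 + x + x^2 on a grid over [-1, 1]
x = np.linspace(-1.0, 1.0, 21)
cand = np.column_stack([np.ones_like(x), x, x ** 2])
design = greedy_exchange(cand, n_runs=6)
```

For this model the known D-optimal 6-run design replicates the points −1, 0 and 1, with log det(X′X) = log 32 ≈ 3.47; the greedy search climbs well above naive evenly spaced designs (log det ≈ 2.78) while, as the abstract notes, constraints such as fixed stimulus counts can be imposed by restricting the allowed swaps.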

Graphene is an emerging nanomaterial for a wide variety of novel applications. Controlled synthesis of high quality graphene sheets requires analytical understanding of graphene growth kinetics. Graphene growth via chemical vapour deposition starts with randomly nucleated islands that gradually develop into complex shapes, grow in size and eventually connect together to form a graphene sheet. Models proposed for this stochastic process do not, in general, permit assessment of uncertainty. We develop a stochastic framework for the growth process and propose Bayesian inferential models, which account for the data collection mechanism and allow for uncertainty analyses, for learning about the kinetics from experimental data. Furthermore, we link the growth kinetics with controllable experimental factors, thus providing a framework for statistical design and analysis of future experiments.

In light of intense hurricane activity along the US Atlantic coast, attention has turned to understanding both the economic effect and the behaviour of these storms. The compound Poisson–log-normal process has been proposed as a model for aggregate storm damage but does not shed light on regional analysis since storm path data are not used. We propose a fully Bayesian regional prediction model which uses conditional auto-regressive models to account for both storm paths and spatial patterns for storm damage. When fitted to historical data, the analysis from our model both confirms previous findings and reveals new insights on regional storm tendencies. Posterior predictive samples can also be used for pricing regional insurance premiums, which we illustrate by using three different risk measures.
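As a point of reference, the compound Poisson–log-normal baseline and the premium-pricing step can be sketched by simulation. All parameter values are invented, and the three risk measures below are generic examples rather than necessarily the three used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Compound Poisson-log-normal: storm counts ~ Poisson(lam), per-storm damage
# ~ log-normal(mu, sigma); all parameter values invented for illustration
lam, mu, sigma = 3.0, 2.0, 1.0
n_sims = 50_000

counts = rng.poisson(lam, size=n_sims)
total = np.array([rng.lognormal(mu, sigma, size=k).sum() for k in counts])

# Premium pricing from the simulated aggregate loss, using three generic
# risk measures
premium_loading = total.mean() * 1.2        # expected value with 20% loading
var_95 = np.quantile(total, 0.95)           # value at risk
tvar_95 = total[total >= var_95].mean()     # tail value at risk
print(premium_loading, var_95, tvar_95)
```

The paper's contribution is to replace the aggregate `total` above with region-specific posterior predictive samples from the CAR model, to which the same risk measures are then applied.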

We propose novel methods for predictive (sparse) principal component analysis with spatially misaligned data. These methods identify principal component loading vectors that explain as much variability in the observed data as possible, while also ensuring that the corresponding principal component scores can be predicted accurately by means of spatial statistics at locations where air pollution measurements are not available. This will make it possible to identify important mixtures of air pollutants and to quantify their health effects in cohort studies, where currently available methods cannot be used. We demonstrate the utility of predictive (sparse) principal component analysis in simulated data and apply the approach to annual averages of particulate matter speciation data from national Environmental Protection Agency regulatory monitors.

We propose a novel adaptive design for clinical trials with time-to-event outcomes and covariates (which may consist of or include biomarkers). Our method is based on the expected entropy of the posterior distribution of a proportional hazards model. The expected entropy is evaluated as a function of a patient's covariates, and the information gained due to a patient is defined as the decrease in the corresponding entropy. Candidate patients are only recruited onto the trial if they are likely to provide sufficient information. Patients with covariates that are deemed uninformative are filtered out. A special case is where all patients are recruited, and we determine the optimal treatment arm allocation. This adaptive design has the advantage of potentially elucidating the relationship between covariates, treatments and survival probabilities by using fewer patients, albeit at the cost of rejecting some candidates. We assess the performance of our adaptive design by using data from the German Breast Cancer Study Group and numerical simulations of a biomarker validation trial.

Air pollution epidemiology studies are trending towards a multipollutant approach. In these studies, exposures at subject locations are unobserved and must be predicted by using observed exposures at misaligned monitoring locations. This induces measurement error, which can bias the estimated health effects and affect standard error estimates. We characterize this measurement error and develop an analytic bias correction when using penalized regression splines to predict exposure. Our simulations show that bias from multipollutant measurement error can be severe, with the biases for the two pollutants either in opposite directions or both in the same direction. Our analytic bias correction combined with a non-parametric bootstrap yields accurate coverage of 95% confidence intervals. We apply our methodology to analyse the association of systolic blood pressure with PM_{2.5} and NO_{2} levels in the National Institute of Environmental Health Sciences Sister Study. We find that NO_{2} confounds the association of systolic blood pressure with PM_{2.5} levels and vice versa. Elevated systolic blood pressure was significantly associated with increased PM_{2.5} and decreased NO_{2} levels. Correcting for measurement error bias strengthened these associations and widened 95% confidence intervals.

We propose a general framework for non-normal multivariate data analysis called multivariate covariance generalized linear models, designed to handle multivariate response variables, along with a wide range of temporal and spatial correlation structures defined in terms of a covariance link function combined with a matrix linear predictor involving known matrices. The method is motivated by three data examples that are not easily handled by existing methods. The first example concerns multivariate count data, the second involves response variables of mixed types, combined with repeated measures and longitudinal structures, and the third involves a spatiotemporal analysis of rainfall data. The models take non-normality into account in the conventional way by means of a variance function, and the mean structure is modelled by means of a link function and a linear predictor. The models are fitted by using an efficient Newton scoring algorithm based on quasi-likelihood and Pearson estimating functions, using only second-moment assumptions. This provides a unified approach to a wide variety of types of response variables and covariance structures, including multivariate extensions of repeated measures, time series, longitudinal, spatial and spatiotemporal structures.

We present a common framework for Bayesian emulation methodologies for multivariate output simulators, or computer models, that employ either parametric linear models or non-parametric Gaussian processes. Novel diagnostics suitable for multivariate covariance separable emulators are developed and techniques to improve the adequacy of an emulator are discussed and implemented. A variety of emulators are compared for a humanitarian relief simulator, modelling aid missions to Sicily after a volcanic eruption and earthquake, and a sensitivity analysis is conducted to determine the sensitivity of the simulator output to changes in the input variables. The results from parametric and non-parametric emulators are compared in terms of prediction accuracy, uncertainty quantification and scientific interpretability.

The paper develops a parametric variant of the Machado–Mata simulation methodology to examine quantile wage differences between groups of workers, with an application to the wage gap between native and foreign workers in Luxembourg. Relying on conditional-likelihood-based ‘parametric quantile regression’ in place of the standard linear quantile regression is parsimonious and cuts computing time drastically with no loss in the accuracy of marginal quantile simulations in our application. We find that the native worker advantage is a concave function of quantile: the advantage is small (possibly negative) for both low and high quantiles, but it is large for the middle half of the quantile range (between the 20th and 70th native wage percentiles).
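The simulation logic behind a Machado–Mata-style decomposition can be illustrated with a toy parametric model: posit conditional log-wage distributions for each group, then compare marginal quantiles of simulated wages that combine one group's covariates with the other group's coefficients. All coefficients and distributions below are invented, and the paper estimates its conditional model by maximum likelihood rather than fixing it:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy conditional model: log-wage = b0 + b1*x + noise, with coefficients
# differing by group (all values invented for illustration)
n = 5000
x_native = rng.normal(1.0, 0.5, n)
beta_native, beta_foreign = (1.5, 0.40), (1.5, 0.25)
sigma = 0.3

def simulate_logwage(x, beta):
    return beta[0] + beta[1] * x + sigma * rng.normal(size=x.size)

# 'Coefficient effect' at each quantile: same covariates, swapped coefficients
q = np.linspace(0.05, 0.95, 19)
gap = (np.quantile(simulate_logwage(x_native, beta_native), q)
       - np.quantile(simulate_logwage(x_native, beta_foreign), q))
print(np.round(gap, 3))
```

Swapping the covariate distribution instead of the coefficients would give the complementary 'composition effect' in the same way.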

Civil unrest is a complicated, multifaceted social phenomenon that is difficult to forecast. Relevant data for predicting future protests consist of a massive set of heterogeneous sources of data, primarily from social media. Using a modular approach to extract pertinent information from disparate sources of data, we develop a spatiotemporal multiscale framework to fuse predictions from algorithms mining social media. This novel multiscale spatiotemporal model is developed to satisfy four essential requirements: be scalable to handle massive spatiotemporal data sets, incorporate hierarchical predictions, accommodate predictions of differing quality and uncertainty, and be flexible, allowing revisions to existing algorithms and the addition of new algorithms. The paper details the challenges that are posed by these four requirements and outlines the benefits of our novel multiscale spatiotemporal model relative to existing methods. In particular, our multiscale approach coupled with an efficient sequential Monte Carlo framework enables scalable rapid computation of richly specified Bayesian hierarchical models for spatiotemporal data.

Tissue samples from the same tumour are heterogeneous. They consist of different subclones that can be characterized by differences in DNA nucleotide sequences and copy numbers on multiple loci. Inference on tumour heterogeneity thus involves the identification of the subclonal copy number and single-nucleotide mutations at a selected set of loci. We carry out such inference on the basis of a Bayesian feature allocation model. We jointly model subclonal copy numbers and the corresponding allele sequences for the same loci, using three random matrices, **L**, **Z** and **w**, to represent subclonal copy numbers (**L**), the number of subclonal variant alleles (**Z**) and the cellular fractions (**w**) of subclones in one or more tumour samples respectively. The unknown number of subclones implies a random number of columns. More than one subclone indicates tumour heterogeneity. Using simulation studies and a real data analysis with next generation sequencing data, we demonstrate how posterior inference on the subclonal structure is enhanced with the joint modelling of both structure and sequencing variants on subclonal genomes. An R package is available from http://cran.r-project.org/web/packages/BayClone2/index.html.

Dual-frequency identification sonar delivers video-like underwater images which allow the investigation of fish behaviour even in cloudy and muddy water. Generally, images are recorded at a rate of up to 10 frames per second, so that in practice one obtains a video of underwater movements. These videos allow ecologists to observe, count or investigate fish behaviour. We focus on automatic classification of fish based on such sonar videos. After appropriate preprocessing of the videos, we show how we can count and classify fish into different species on the basis of their shape and movement. The procedures developed work in real time, i.e. data processing and classification of a video sequence take less time than the duration of the sequence itself.

The marker-stratified design (MSD) is an important design to assess treatment and marker effects in personalized medicine. The MSD stratifies patients into marker positive and marker negative subgroups on the basis of their biomarker profiles and then randomizes them to the standard treatment or a new treatment within each subgroup. The performance of the MSD can be seriously undermined when the biomarker is measured with error (or misclassified). A recently proposed analytic method corrects the biomarker misclassification in the MSD under the assumptions that the biomarker classification rates are known and no other covariates need to be adjusted. We propose a two-stage MSD to relax these assumptions. We analytically investigate the bias in the estimation of prognostic and predictive marker effects and treatment effects caused by biomarker misclassification in the presence of covariates, and we propose an expectation–maximization algorithm to correct such biases. The design does not require prespecification of the misclassification rates and can incorporate any covariates that potentially confound the prognostic and predictive marker effects and treatment effect. Numerical trial applications show that the method has desirable operating characteristics.

Euromillions is a lotto game played across nine countries. We examine more than 8 years of sales data for individual states to assess whether a jackpot win in a country increases subsequent sales in that country. We propose a novel test for the presence of such a ‘compatriot win effect’ that has as its only assumption that the lottery draw is random. Results suggest elevated sales over 12 draws following a national win. When we model the size of the effect, it proves to be modest in size for average jackpot wins but much larger and longer lasting for the highest pay-outs.
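The draw-randomness assumption naturally suggests a randomization test: under the null of no compatriot win effect, the timing of wins is as if randomly assigned, so permuting the post-win indicator generates the null distribution. A toy sketch on invented data (the paper's actual test handles the full panel of countries and overlapping post-win windows):

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented sales series for one country, with an indicator for the 12 draws
# following each of two national jackpot wins
n_draws = 400
sales = rng.lognormal(0.0, 0.1, n_draws)
post_win = np.zeros(n_draws, dtype=bool)
post_win[100:112] = True
post_win[250:262] = True
sales[post_win] *= 1.05  # build in a 5% 'compatriot win' bump

def mean_diff(ind):
    return sales[ind].mean() - sales[~ind].mean()

# Under the null, win timing is exchangeable, so permuting the indicator
# gives the null distribution of the sales difference
obs = mean_diff(post_win)
null = np.array([mean_diff(rng.permutation(post_win)) for _ in range(5000)])
p_value = (null >= obs).mean()
print(p_value)
```

A refinement would permute the win dates themselves (keeping each 12-draw window intact) to respect serial dependence in sales.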

Clinical attachment level is regarded as the most popular measure to assess periodontal disease (PD). These probed tooth site level measures are usually rounded and recorded as whole numbers (in millimetres) producing clustered (site measures within a mouth) error prone ordinal responses representing some ordering of the underlying PD progression. In addition, it is hypothesized that PD progression can be spatially referenced, i.e. proximal tooth sites share similar PD status in comparison with sites that are distantly located. We develop a Bayesian multivariate probit framework for these ordinal responses where the cut point parameters linking the observed ordinal clinical attachment levels to the latent underlying disease process can be fixed in advance. The latent spatial association characterizing conditional independence under Gaussian graphs is introduced via a non-parametric Bayesian approach motivated by the probit stick breaking process, where the components of the stick breaking weights follow a multivariate Gaussian density with the precision matrix distributed as *G*-Wishart. This yields a computationally simple, yet robust and flexible, framework to capture the latent disease status leading to a natural clustering of tooth sites and subjects with similar PD status (beyond spatial clustering), and improved parameter estimation through sharing of information. Both simulation studies and application to a motivating PD data set reveal the advantages of considering this flexible non-parametric ordinal framework over other alternatives.

Counting by weighing is often more efficient than counting manually, which is time consuming and prone to human errors, especially when the number of items (e.g. plant seeds, printed labels or coins) is large. Papers in the statistical literature have focused on how to count, by weighing, a random number of items that is close to a prespecified number in some sense. The paper considers the new problem, arising from a consultation with a company, of making inference about the number of 1p coins in a bag with known weight for infinitely many bags, by using the estimated distribution of coin weight from one calibration data set only. Specifically, a lower confidence bound has been constructed on the number of 1p coins for each of infinitely many future bags of 1p coins, as required by the company. As the same calibration data set is used repeatedly in the construction of all these lower confidence bounds, the interpretation of coverage frequency of the lower confidence bounds that is proposed is different from that of a usual confidence set.
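A drastically simplified version of the inference, for a single bag with plug-in normal theory and ignoring the repeated-use coverage issue that the paper addresses, might look like the following (the coin-weight distribution is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical calibration data: m individually weighed 1p coins (grams);
# the true weight distribution below is invented for illustration
true_mu, true_sigma, m = 3.56, 0.02, 200
calibration = rng.normal(true_mu, true_sigma, size=m)
mu_hat, s_hat = calibration.mean(), calibration.std(ddof=1)

def lower_bound(bag_weight, z=1.645):
    """Plug-in normal-theory lower confidence bound on the coin count:
    the smallest n for which bag_weight is not implausibly large under
    bag_weight ~ N(n*mu, n*sigma^2)."""
    n = 1
    while bag_weight > n * mu_hat + z * np.sqrt(n) * s_hat:
        n += 1
    return n

# A bag that actually contains 1000 coins
bag = rng.normal(1000 * true_mu, np.sqrt(1000) * true_sigma)
lb = lower_bound(bag)
print(lb)  # close to, and typically no greater than, 1000
```

Because `mu_hat` and `s_hat` are reused across every future bag, the bounds are dependent, which is precisely why the paper's coverage-frequency interpretation differs from that of a usual confidence set.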