We propose a method for assessing variable importance in matched case–control investigations and other highly stratified studies characterized by high dimensional data (*p*>>*n*). In simulated and real data sets, we show that the algorithm proposed performs better than a conventional univariate method (conditional logistic regression) and a popular multivariable algorithm (random forests) that does not take the matching into account. The methods are applicable to wide-ranging, high-impact clinical studies, including metabolomic and proteomic studies and neuroimaging analyses, such as those assessing stroke and Alzheimer's disease. The methods proposed have been implemented in a freely available R library (http://cran.r-project.org/web/packages/RPCLR/index.html).
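For a 1:1 matched design, conditional logistic regression reduces to a logistic model on within-pair covariate differences with no intercept. A minimal sketch of that conditional likelihood on synthetic data (illustrative only; not the RPCLR implementation):

```python
import numpy as np

def cond_loglik(beta, x_case, x_control):
    """Conditional log-likelihood for 1:1 matched pairs: each pair contributes
    the probability that the observed case is the case, which depends only on
    the within-pair covariate difference."""
    eta = (x_case - x_control) @ beta
    return np.sum(eta - np.log1p(np.exp(eta)))

def fit_clr(x_case, x_control, lr=0.1, steps=2000):
    """Fit by plain gradient ascent (a sketch; real software uses Newton steps)."""
    d = x_case - x_control
    beta = np.zeros(d.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(d @ beta)))    # P(observed orientation)
        beta += lr * d.T @ (1.0 - p) / len(d)    # gradient of the log-likelihood
    return beta

# synthetic matched pairs with a known coefficient vector
rng = np.random.default_rng(0)
n, p = 400, 3
true_beta = np.array([1.0, -0.5, 0.0])
xc = rng.normal(size=(n, p))
xo = rng.normal(size=(n, p))
eta = (xc - xo) @ true_beta
swap = rng.random(n) > 1.0 / (1.0 + np.exp(-eta))  # which member becomes the case
x_case = np.where(swap[:, None], xo, xc)
x_control = np.where(swap[:, None], xc, xo)
beta_hat = fit_clr(x_case, x_control)
```

Because only within-pair contrasts inform the coefficients, covariates that are constant within a matched set drop out of the likelihood, which is exactly why ignoring the matching can mislead.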

In a longitudinal metabolomics study, multiple metabolites are measured from several observations at many time points. Interest lies in reducing the dimensionality of such data and in highlighting influential metabolites which change over time. A dynamic probabilistic principal components analysis model is proposed to achieve dimension reduction while appropriately modelling the correlation due to repeated measurements. This is achieved by assuming an auto-regressive model for some of the model parameters. Linear mixed models are subsequently used to identify influential metabolites which change over time. The model proposed is used to analyse data from a longitudinal metabolomics animal study.

We present a novel analysis of a landmark table of dose–response mortality counts from lung cancer in men. The data were originally collected by Doll and Hill. Our inferences are based on Poisson models for which the rates of occurrence are partially ordered according to two covariates. The partial ordering of the mortality rates enforces the well-established knowledge that lung cancer mortality rates are higher for older men and for heavier smokers. The ordered group reference priors that we use in our analyses generalize a class of reference priors that we previously derived for models of count data in which the rates of occurrence in different categories are completely ordered with respect to the values of a single covariate. The reference models for the lung cancer data based on the proposed priors are more flexible than and can be superior, in terms of goodness of fit, to a Bayesian version of several parametric models derived from a mathematical theory of carcinogenesis that have appeared in the literature.
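One simple way to build rates that respect such a partial order is to exponentiate cumulative sums of non-negative log-increments along each covariate direction; this sketch illustrates only the ordering constraint, not the ordered group reference-prior construction:

```python
import numpy as np

def ordered_rates(base, age_inc, dose_inc):
    """Poisson rates on an (age x smoking-level) grid that are non-decreasing
    in both covariates, built from a base log-rate plus cumulative sums of
    non-negative log-increments in each direction."""
    log_rate = (base
                + np.cumsum(age_inc)[:, None]    # monotone in age
                + np.cumsum(dose_inc)[None, :])  # monotone in smoking level
    return np.exp(log_rate)

rates = ordered_rates(-2.0, [0.0, 0.5, 0.3], [0.0, 0.2, 0.4])
```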

Using data collected from the ‘Sequenced treatment alternatives to relieve depression’ study, we use logistic regression to predict whether a patient will respond to treatment on the basis of early symptom change and patient characteristics. Model selection criteria such as the Akaike information criterion AIC and mean-squared-error of prediction MSEP may not be appropriate if the aim is to predict with a high degree of certainty who will respond or not respond to treatment. Towards this aim, we generalize the definition of the positive and negative predictive value curves to the case of multiple predictors. We point out that it is the ordering rather than the precise values of the response probabilities which is important, and we arrive at a unified approach to model selection via two-sample rank tests. To avoid overfitting, we define a cross-validated version of the positive and negative predictive value curves and compare these curves after smoothing for various models. When applied to the study data, we obtain a ranking of models that differs from those based on AIC and MSEP, as well as a tree-based method and regularized logistic regression using a lasso penalty. Our selected model performs consistently well for both 4-week-ahead and 7-week-ahead predictions.
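The point that only the ordering of the response probabilities matters can be seen in a minimal sketch of positive and negative predictive value curves for a multi-predictor risk score (hypothetical function and data; the paper's cross-validation and smoothing are omitted):

```python
import numpy as np

def ppv_npv_curves(score, y, fractions):
    """PPV among the top fraction of subjects ranked by risk score, and NPV
    among the bottom fraction.  Any monotone transform of the score gives the
    same curves, since only the ranking enters."""
    y_sorted = y[np.argsort(-score)]             # highest predicted risk first
    n = len(y)
    ppv, npv = [], []
    for f in fractions:
        k = max(1, int(round(f * n)))
        ppv.append(y_sorted[:k].mean())          # responders among the top
        npv.append(1.0 - y_sorted[-k:].mean())   # non-responders among the bottom
    return np.array(ppv), np.array(npv)

score = np.arange(10.0)
y = (score >= 5).astype(float)                   # perfectly separated toy data
ppv, npv = ppv_npv_curves(score, y, [0.2, 0.5])
```

The rank-invariance is what links this model selection criterion to two-sample rank tests: transforming the score by any increasing function leaves both curves unchanged.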

The analysis of genomic alterations that may occur in nature when segments of chromosomes are copied (known as copy number alterations) has been a focus of research to identify genetic markers of cancer. One high throughput technique that has recently been adopted is the use of molecular inversion probes to measure probe copy number changes. The resulting data consist of high dimensional copy number profiles that can be used to ascertain probe-specific copy number alterations in correlative studies with patient outcomes to guide risk stratification and future treatment. We propose a novel Bayesian variable selection method, the hierarchical structured variable selection method, which accounts for the natural gene and probe-within-gene architecture to identify important genes and probes associated with clinically relevant outcomes. The hierarchical structured variable selection model performs grouped variable selection, where simultaneous selection of both groups and within-group variables is of interest. It utilizes a discrete mixture prior distribution for group selection and group-specific Bayesian lasso hierarchies for variable selection within groups. We also provide methods for accounting for serial correlations within groups, incorporating Bayesian fused lasso methods for within-group selection. Through simulations we establish that our method results in lower model errors than other methods when a natural grouping structure exists. We apply our method to a molecular inversion probe study of breast cancer and show that it identifies genes and probes that are significantly associated with clinically relevant subtypes of breast cancer.

In recent decades there has been an increase in the reported incidence of clinical pertussis in many countries. Estimation of the true circulation of the bacterium *Bordetella pertussis* is most reliably made on the basis of studies that measure antibody concentrations against pertussis toxin. Antibody levels decay over time and provide a fading memory of the infection. We develop a discrete bivariate mixture model for paired antibody levels in a cohort of 1002 Mexican adolescents who were followed over the 2008–2009 school year. This model postulates three groups of children based on past pertussis infection: never, prior and new. On the basis of this model we directly estimate incidence and prevalence, and select a diagnostic cut-off for classifying children as recently infected. We also discuss a relatively simple approach that uses only ‘discordant’ children who test positive on one visit and negative on the other. The discordant approach provides inferences that are very similar to those of the full model when the data follow the assumed full model. Additionally, the discordant method is much more robust to model misspecification than the full model, which has substantial problems with optimization. We estimate the school year incidence of pertussis to be about 3% and the prevalence to be about 8%. A cut-off of 50 was estimated to have about 99.5% specificity and 68% sensitivity.
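The discordant estimator can be sketched as a simple ratio, using the 50-unit cut-off reported in the abstract; the titre values and function name below are hypothetical:

```python
import numpy as np

def discordant_incidence(titre_visit1, titre_visit2, cutoff):
    """Crude incidence from 'discordant' children only: below the cut-off at
    visit 1 and at or above it at visit 2 counts as a new infection during the
    school year; the denominator is everyone at risk (negative at visit 1)."""
    at_risk = titre_visit1 < cutoff
    new_infection = at_risk & (titre_visit2 >= cutoff)
    return new_infection.sum() / at_risk.sum()

# four hypothetical children: one seroconverts, one is already positive
v1 = np.array([10.0, 10.0, 100.0, 10.0])
v2 = np.array([80.0, 10.0, 100.0, 10.0])
incidence = discordant_incidence(v1, v2, cutoff=50.0)
```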

A major issue in non-inferiority trials is the controversial assumption of constancy, namely that the active control has the same effect relative to placebo as in previous studies comparing the active control with placebo. The constancy assumption is often in doubt, which has motivated various methods that ‘discount’ the control effect estimate from historical data as well as methods that adjust for imbalances in observed covariates. We develop a new approach to deal with residual inconstancy, i.e. possible violations of the constancy assumption due to imbalances in unmeasured covariates after adjusting for the measured covariates. We characterize the extent of residual inconstancy under a generalized linear model framework and use the results to obtain fully adjusted estimates of the control effect in the current study based on plausible assumptions about an unmeasured covariate. Because such assumptions may be difficult to justify, we propose a sensitivity analysis approach that covers a range of situations. This approach is developed for indirect comparison with placebo and effect retention, and implemented through additive and multiplicative adjustments. The approach proposed is applied to two examples concerning benign prostate hyperplasia and human immunodeficiency virus infection, and evaluated in simulation studies.
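A multiplicative adjustment can be sketched as discounting the historical control-versus-placebo effect by a grid of plausible bias factors and checking effect retention at each. The sign convention (positive effect = benefit of control over placebo) and the function names are assumptions of this sketch, which also uses point estimates where a real analysis would work with confidence limits:

```python
import numpy as np

def adjusted_effect_grid(hist_effect, bias_factors):
    """Multiplicative sensitivity adjustment: discount the historical
    control-vs-placebo effect by each plausible bias factor representing
    residual inconstancy from an unmeasured covariate."""
    return hist_effect * np.asarray(bias_factors)

def retains_fraction(test_vs_control, adj_effect, fraction=0.5):
    """Effect-retention check: the new treatment may be worse than the control
    by at most (1 - fraction) of the adjusted control effect."""
    return test_vs_control >= -(1.0 - fraction) * adj_effect

adj = adjusted_effect_grid(2.0, [1.0, 0.8, 0.6])   # no bias, 20%, 40% discount
```

Scanning the decision over the grid is the sensitivity analysis: a conclusion that flips under mild discounting is fragile to residual inconstancy.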

Downscaled rainfall projections from climate models are essential for many meteorological and hydrological applications. The technique presented efficiently parameterizes spatiotemporal dynamic models, by means of Bayesian hierarchical modelling, in terms of the close association between mean sea level pressure patterns and rainfall during winter over south-west Western Australia. This approach allows us to understand characteristics of the spatiotemporal variability of the mean sea level pressure patterns and the associated rainfall patterns. An application is presented to show the effectiveness of the technique in reconstructing present day rainfall and predicting future rainfall.

In the quantitative analysis of dynamic contrast-enhanced magnetic resonance imaging, compartment models allow the uptake of contrast medium to be described with biologically meaningful kinetic parameters. As simple models often fail to describe adequately the observed uptake behaviour, more complex compartment models have been proposed. However, the non-linear regression problem arising from more complex compartment models often suffers from parameter redundancy. We incorporate spatial smoothness on the kinetic parameters of a two-tissue compartment model by imposing Gaussian Markov random-field priors on them. We analyse to what extent this spatial regularization helps to avoid parameter redundancy and to obtain stable parameter point estimates per voxel. Choosing a full Bayesian approach, we obtain posteriors and point estimates by running Markov chain Monte Carlo simulations. The approach proposed is evaluated for simulated concentration time curves as well as for *in vivo* data from a breast cancer study.

We develop methods for analysing the spatial pattern of events, classified into several types, that occur on a network of lines. The motivation is the study of small protrusions called ‘spines’ which occur on the dendrite network of a neuron. The spatially varying density of spines is modelled by using relative distributions and regression trees. Spatial correlations are investigated by using counterparts of the *K*-function and pair correlation function, where the main problem is to compensate for the network geometry. This application illustrates the need for careful analysis of spatial variation in the intensity of points, before assessing any evidence of clustering.

Stable isotope sourcing is used to estimate proportional contributions of sources to a mixture, such as in the analysis of animal diets and plant nutrient use. Statistical methods for inference on the diet proportions by using stable isotopes have focused on the linear mixing model. Existing frequentist methods assume that the diet proportion vector can be uniquely solved for in terms of one or two isotope ratios. We develop large sample methods that apply to an arbitrary number of isotope ratios, assuming that the linear mixing model has a unique solution or is overconstrained. We generalize these methods to allow temporal modelling of the population mean diet, assuming that isotope ratio response data are collected over time. The methodology is motivated by a study of the diet of dunlin, a small migratory seabird.
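The linear mixing model can be sketched as a least squares problem with a sum-to-one constraint, imposed here as a heavily weighted extra row so that an over-constrained system (more isotope ratios than needed) is handled in the same call. Non-negativity of the proportions would need a constrained solver and is omitted from this sketch:

```python
import numpy as np

def diet_proportions(source_ratios, mixture_ratios, weight=1e3):
    """Linear mixing model: mixture isotope ratios are a convex combination of
    source ratios.  Rows of `source_ratios` are isotope ratios, columns are
    sources; the weighted extra row enforces that proportions sum to one."""
    A = np.vstack([source_ratios, weight * np.ones(source_ratios.shape[1])])
    b = np.append(mixture_ratios, weight)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# two isotope ratios (e.g. delta-13C, delta-15N rows), three food sources
sources = np.array([[-20.0, -15.0, -5.0],
                    [  5.0,  10.0, 14.0]])
mixture = sources @ np.array([0.2, 0.3, 0.5])    # mixture with known proportions
p_hat = diet_proportions(sources, mixture)
```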

Despite the widespread use of equal randomization in clinical trials, response-adaptive randomization has attracted considerable interest. There is typically a prerun of equal randomization before the implementation of response-adaptive randomization, although it is often not clear how many subjects are needed in this prephase, and in practice the number of patients in the equal randomization stage is often arbitrary. Another concern that is associated with real-time response-adaptive randomization is that trial conduct often requires patients' responses to be immediately available after the treatment, whereas clinical responses may take a relatively long period of time to emerge. To resolve these two issues, we propose a two-stage procedure to achieve a balance between power and response, which is equipped with a likelihood ratio test before skewing the allocation probability towards a better treatment. Furthermore, we develop a non-parametric fractional model and a parametric survival design with an optimal allocation scheme to tackle the common problem caused by delayed response. We evaluate the operating characteristics of the two-stage designs through extensive simulation studies and illustrate them with a human immunodeficiency virus clinical trial. Numerical results show that the methods proposed satisfactorily resolve the arbitrary size of the equal randomization phase and the delayed response problem in response-adaptive randomization.
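The gatekeeping idea, testing for a treatment difference before skewing the allocation, can be sketched with a binomial likelihood ratio test after the equal-randomization stage. The square-root allocation rule below is one common response-adaptive target, not necessarily the paper's optimal scheme:

```python
import math

def lr_stat(x1, n1, x2, n2):
    """Binomial likelihood ratio statistic for H0: p1 = p2, computed from the
    stage-1 successes x and sample sizes n on the two arms."""
    def ll(x, n, p):
        if p <= 0.0 or p >= 1.0:
            return 0.0 if x in (0, n) else float("-inf")
        return x * math.log(p) + (n - x) * math.log(1.0 - p)
    p1, p2, p0 = x1 / n1, x2 / n2, (x1 + x2) / (n1 + n2)
    return 2.0 * (ll(x1, n1, p1) + ll(x2, n2, p2)
                  - ll(x1, n1, p0) - ll(x2, n2, p0))

def stage2_allocation(x1, n1, x2, n2, crit=3.84):
    """Keep equal allocation unless the LR test rejects at roughly the 5%
    level (chi-squared, 1 d.f.); otherwise skew towards the better arm using
    the square-root rule."""
    if lr_stat(x1, n1, x2, n2) < crit:
        return 0.5
    s1, s2 = math.sqrt(x1 / n1), math.sqrt(x2 / n2)
    return s1 / (s1 + s2)
```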

Our application data are produced from a scalable, on-line expert elicitation process that incorporates hundreds of participating raters to score the importance of research goals for the prevention of suicide with the purpose of informing policy making. We develop a Bayesian formulation for analysis of ordinal multirater data motivated by our application. Our model employs a non-parametric mixture distribution over rater-indexed parameters for a latent continuous response under a Poisson–Dirichlet process mixing measure that allows inference about distinct rater behavioural and learning typologies from realized clusters.

Familial searching is the process of searching in a deoxyribonucleic acid (DNA) database for relatives of a certain individual. Typically, this individual is the source of a crime stain that can be reasonably attributed to the offender. If this crime stain has not given rise to a match in a DNA database, then in some jurisdictions a familial search may be carried out to attempt to identify a relative of the offender, thus giving investigators a lead. We discuss two methods to select a subset from the database that contains a relative (if present) with a probability that we can control. The first method is based on a derivation of the likelihood ratio for each database member in favour of being the relative, taking all DNA profiles in the database into account. This method needs prior probabilities and yields posterior probabilities. The second method is based on case-specific false positive and false negative rates. We discuss the relationship between the approaches and assess the methods with familial searching carried out in the Dutch national DNA database. We also give practical recommendations on the usefulness of both methods.

Data obtained by using modern sequencing technologies are often summarized by recording the frequencies of observed sequences. Examples include the analysis of T-cell counts in immunological research and studies of gene expression based on counts of RNA fragments. In both cases the items being counted are sequences, of proteins and base pairs respectively. The resulting sequence abundance distribution is usually characterized by overdispersion. We propose a Bayesian semiparametric approach to implement inference for such data. Besides modelling the overdispersion, the approach also takes into account two related sources of bias that are usually associated with sequence count data: some sequence types may not be recorded during the experiment and the total count may differ from one experiment to another. We illustrate our methodology with two data sets: one regarding the analysis of CD4+ T-cell counts in healthy and diabetic mice and another data set concerning the comparison of messenger RNA fragments recorded in a serial analysis of gene expression experiment with gastrointestinal tissue of healthy and cancer patients.
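The overdispersion being modelled is often summarized by a negative binomial fit; a method-of-moments sketch (a quick diagnostic, not the Bayesian semiparametric model):

```python
import numpy as np

def nb_moments(counts):
    """Method-of-moments negative binomial fit to sequence counts.
    Returns (mean, dispersion) in the parameterization Var = mu + mu^2/size;
    a small `size` means strong overdispersion, infinite `size` means the
    sample shows no overdispersion (Poisson-like)."""
    mu = counts.mean()
    var = counts.var(ddof=1)
    if var <= mu:
        return mu, float("inf")                  # no overdispersion detected
    return mu, mu ** 2 / (var - mu)

mu_hat, size_hat = nb_moments(np.array([0, 1, 2, 3, 4, 10]))
```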

We propose a statistical post-processing method that yields locally calibrated probabilistic forecasts of temperature, based on the output of an ensemble prediction system. It represents the mean of the predictive distributions as a sum of short-term averages of local temperatures and ensemble prediction system driven terms. For the spatial interpolation of temperature averages and local forecast uncertainty parameters we use an intrinsic Gaussian random-field model with a location-dependent nugget effect that accounts for small-scale variability. Applied to the COSMO-DE ensemble, our method yields locally calibrated and sharp probabilistic forecasts and compares favourably with other approaches.
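The structure of the predictive mean, a combination of short-term local temperature averages and an ensemble-driven term, can be sketched with ordinary least squares at a single station; the spatial random-field interpolation and the COSMO-DE specifics are omitted, and the function name is illustrative:

```python
import numpy as np

def fit_postprocessing(obs, local_avg, ens_mean):
    """Least squares fit of the predictive mean as an affine combination of a
    short-term local temperature average and the ensemble mean forecast;
    the residual spread gives a single-station forecast uncertainty."""
    X = np.column_stack([np.ones_like(obs), local_avg, ens_mean])
    coef, *_ = np.linalg.lstsq(X, obs, rcond=None)
    sigma = (obs - X @ coef).std(ddof=X.shape[1])
    return coef, sigma

# noise-free synthetic station data with known coefficients
clim = np.arange(10.0)
ens = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])
obs = 1.0 + 0.3 * clim + 0.6 * ens
coef, sigma = fit_postprocessing(obs, clim, ens)
```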

Probabilistic models for infectious disease dynamics are useful for understanding the mechanism underlying the spread of infection. When the likelihood function for these models is expensive to evaluate, traditional likelihood-based inference may be computationally intractable. Furthermore, traditional inference may lead to poor parameter estimates and the fitted model may not capture important biological characteristics of the observed data. We propose a novel approach for resolving these issues that is inspired by recent work in emulation and calibration for complex computer models. Our motivating example is the gravity time series susceptible–infected–recovered model. Our approach focuses on the characteristics of the process that are of scientific interest. We find a Gaussian process approximation to the gravity model by using key summary statistics obtained from model simulations. We demonstrate via simulated examples that the new approach is computationally expedient, provides accurate parameter inference and results in a good model fit. We apply our method to analyse measles outbreaks in England and Wales in two periods: the prevaccination period from 1944 to 1965 and the vaccination period from 1966 to 1994. On the basis of our results, we can obtain important scientific insights about the transmission of measles. In general, our method is applicable to problems where traditional likelihood-based inference is computationally intractable or produces a poor model fit. It is also an alternative to approximate Bayesian computation when simulations from the model are expensive.
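The emulation idea, replacing an expensive simulator by a Gaussian process fitted to summary statistics of its runs, can be sketched in one dimension; the kernel, length scale and jitter below are illustrative choices rather than those of the paper:

```python
import numpy as np

def rbf_kernel(a, b, length=1.0, var=1.0):
    """Squared-exponential covariance between two 1-d input arrays."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_emulator(theta_train, stat_train, noise=1e-6):
    """Gaussian process approximation to a summary statistic of an expensive
    simulator, as a function of a scalar parameter theta.  Returns a
    predictive-mean function for new parameter values."""
    K = rbf_kernel(theta_train, theta_train) + noise * np.eye(len(theta_train))
    alpha = np.linalg.solve(K, stat_train)
    return lambda theta_new: rbf_kernel(theta_new, theta_train) @ alpha

# stand-in simulator: the "summary statistic" is sin(theta)
theta = np.linspace(0.0, 3.0, 10)
emulate = gp_emulator(theta, np.sin(theta))
```

Parameter inference then evaluates the cheap emulator instead of the simulator, which is what makes the calibration computationally expedient.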

As women approach menopause, the patterns of their menstrual cycle lengths change. To study these changes, we need to model jointly both the mean and the variability of cycle length. Our proposed model incorporates separate mean and variance change points for each woman and a hierarchical model to link them together, along with regression components to include predictors of menopausal onset such as age at menarche and parity. Additional complexity arises from the fact that the calendar data have substantial missingness due to hormone use, surgery and failure to report. We integrate multiple imputation and time-to-event modelling in a Bayesian estimation framework to deal with different forms of the missingness. Posterior predictive model checks are applied to evaluate the model fit. Our method successfully models patterns of women's menstrual cycle trajectories throughout their late reproductive life and identifies change points for mean and variability of segment length, providing insight into the menopausal process. More generally, our model points the way towards increasing use of joint mean–variance models to predict health outcomes and to understand disease processes better.
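A single change point in both mean and variance can be found by scanning the two-segment Gaussian log-likelihood, a simple frequentist analogue of one ingredient of the hierarchical Bayesian model (the cycle-length data below are simulated, not the study's):

```python
import numpy as np

def change_point(y, min_seg=3):
    """Single change point in both mean and variance: choose the split that
    maximizes the sum of the two segments' Gaussian log-likelihoods, each
    evaluated at its own MLE mean and variance."""
    n = len(y)
    def seg_ll(seg):
        v = max(seg.var(), 1e-12)                # guard degenerate segments
        return -0.5 * len(seg) * (np.log(2.0 * np.pi * v) + 1.0)
    best_k, best_ll = None, -np.inf
    for k in range(min_seg, n - min_seg + 1):
        ll = seg_ll(y[:k]) + seg_ll(y[k:])
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k

# simulated cycle lengths: stable early segment, then longer and more
# variable cycles after observation 60
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(28.0, 1.0, 60), rng.normal(40.0, 6.0, 30)])
k_hat = change_point(y)
```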

The paper considers data from an aphid infestation on a sugar cane plantation and illustrates the use of an individual level infectious disease model for making inference on the biological process underlying these data. The data are interval censored, and the practical issues involved with the use of Markov chain Monte Carlo algorithms with models of this sort are explored and developed. As inference for spatial infectious disease models is complex and computationally demanding, emphasis is put on a minimal parsimonious model and speed of code execution. With careful coding we can obtain highly efficient Markov chain Monte Carlo algorithms based on a simple random-walk Metropolis-within-Gibbs routine. An assessment of model fit is provided by comparing the predicted numbers of weekly infections from the data to the trajectories of epidemics simulated from the posterior distributions of model parameters. This assessment shows that the data have periods where the epidemic proceeds more slowly and more quickly than the (temporally homogeneous) model predicts.
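A generic random-walk Metropolis-within-Gibbs routine of the kind described can be sketched as follows, shown here on a toy two-parameter target rather than the epidemic model's posterior:

```python
import numpy as np

def metropolis_within_gibbs(logpost, init, n_iter, step, rng):
    """Random-walk Metropolis-within-Gibbs: update one parameter at a time
    with a Gaussian proposal, accepting with the usual Metropolis ratio.
    `logpost` is the log posterior density up to an additive constant."""
    x = np.array(init, dtype=float)
    samples = np.empty((n_iter, len(x)))
    cur = logpost(x)
    for i in range(n_iter):
        for j in range(len(x)):                  # one coordinate per sub-step
            prop = x.copy()
            prop[j] += step * rng.normal()
            new = logpost(prop)
            if np.log(rng.random()) < new - cur:
                x, cur = prop, new
        samples[i] = x
    return samples

# toy target: two independent standard normals
rng = np.random.default_rng(3)
draws = metropolis_within_gibbs(lambda v: -0.5 * np.sum(v ** 2),
                                [0.0, 0.0], 5000, 1.0, rng)
```

The efficiency emphasized in the abstract comes from keeping `logpost` cheap, for a spatial infectious disease model that means careful incremental computation of the likelihood rather than full re-evaluation at every sub-step.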

In the genetic association analysis of Holstein cattle data, researchers are interested in testing the association of a genetic marker with more than one estimated breeding value phenotype. It is well known that testing each trait individually may lead to problems of controlling the overall type I error rate, and simultaneous testing of the association between a marker and multiple traits is desired. The analysis of Holstein cattle data has additional challenges due to complicated relationships between subjects. Furthermore, phenotypic data in many other genetic studies can be quantitative, binary, ordinal, count data or a combination of different types of data. Motivated by these problems, we propose a novel statistical method that allows simultaneous testing of multiple phenotypes and the flexibility to accommodate data from a broad range of study designs. The empirical results indicate that this new method effectively controls the overall type I error rate at the desired level; it is also generally more powerful than testing each trait individually at a given overall type I error rate. The method is applied to the analysis of Holstein cattle data as well as to data from the Collaborative Study on the Genetics of Alcoholism to demonstrate the flexibility of the approach with different phenotypic data types.

A subjective sampling ratio between the case and the control groups is not always an efficient choice to maximize the power or to minimize the total required sample size in comparative diagnostic trials. We derive explicit expressions for an optimal sampling ratio based on a common variance structure shared by several existing summary statistics of the receiver operating characteristic curve. We propose a two-stage procedure to estimate adaptively the optimal ratio without pilot data. We investigate the properties of the proposed method through theoretical proofs, extensive simulation studies and a real example in cancer diagnostic studies.
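When a statistic has variance of the form v1/n1 + v0/n0, the ratio minimising that variance under a fixed total sample size is the Neyman-type allocation n1/n0 = sqrt(v1/v0); a sketch of that arithmetic (the paper derives the variance components v1, v0 for ROC summary statistics and estimates them adaptively):

```python
import math

def optimal_ratio(var_case, var_control):
    """Case-to-control sampling ratio minimising v1/n1 + v0/n0 for a fixed
    total n1 + n0 (Neyman-type allocation)."""
    return math.sqrt(var_case / var_control)

def allocate(total_n, var_case, var_control):
    """Split a fixed total sample size according to the optimal ratio."""
    r = optimal_ratio(var_case, var_control)
    n_case = round(total_n * r / (1.0 + r))
    return n_case, total_n - n_case
```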