Applications of random-parameter logit models can be found in various disciplines. These models have non-concave simulated likelihood functions and the choice of starting values is therefore crucial to avoid convergence at an inferior optimum. Little guidance exists, however, on how to obtain good starting values. We apply an estimation strategy which makes joint use of heuristic global search routines and gradient-based algorithms. The central idea is to use heuristic routines to locate a starting point which is likely to be close to the global maximum, and then to use gradient-based algorithms to refine this point further. Using four empirical data sets, as well as simulated data, we find that the strategy proposed locates higher maxima than more conventional estimation strategies.
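A minimal sketch of such a two-stage strategy, using a toy multimodal objective standing in for the simulated likelihood (the paper's actual objective is a mixed logit likelihood, not reproduced here): a heuristic global search supplies the starting point, which a gradient-based routine then refines.

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# Toy multimodal objective (Rastrigin) standing in for the simulated
# negative log-likelihood; purely illustrative.
def neg_loglik(x):
    x = np.asarray(x)
    return np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0)

bounds = [(-5.12, 5.12)] * 2

# Stage 1: heuristic global search to locate a promising starting point.
coarse = differential_evolution(neg_loglik, bounds, seed=0)

# Stage 2: gradient-based refinement from that point.
refined = minimize(neg_loglik, coarse.x, method="BFGS")
```

The refinement step can never end at a worse point than the heuristic search delivered, which is the essence of the strategy.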

We propose a model for spatial and temporal interpolation and prediction for a set of monthly minimum temperature measurements from 37 stations in the state of Rio de Janeiro, Brazil, from 1961 to 2010. The model is based on a hierarchical specification where data are modelled with a regression structure for the mean and a non-stationary spatial representation for the spatially structured noise term. The regression term is also allowed to contain spatial dependence through its coefficients and covariates. A novel structure is assumed for the spatial non-stationarity, based on a generalization of a recently proposed convolution of stationary models. It allows more flexibility in the model specification and automatically selects the number of model components. This model provides superior performance when compared with two special cases: one stationary, and one non-stationary, provided by a sum of locally stationary processes. The results show a growth in the spread of temperature, with largely urbanized, warmest areas growing warmer, and largely forested, coolest areas growing cooler.

A biosimilar product refers to a follow-on biologic that is intended to be approved for marketing on the basis of biosimilarity to an existing patented biological product (i.e. the reference product). To develop a biosimilar product, it is essential to demonstrate biosimilarity between the follow-on biologic and the reference product, typically through two-arm randomized trials. We propose a Bayesian adaptive design for trials to evaluate biosimilar products. To take advantage of the abundant historical data on the efficacy of the reference product that are typically available at the time that a biosimilar product is developed, we propose the calibrated power prior, which allows our design to borrow information adaptively from the historical data according to the congruence between the historical data and the new data collected from the current trial. We propose a new measure, the Bayesian biosimilarity index, to measure the similarity between the biosimilar product and the reference product. During the trial, we evaluate the Bayesian biosimilarity index in a group sequential fashion on the basis of the accumulating interim data and stop the trial early once there is enough information to conclude or reject similarity. Extensive simulation studies show that the design proposed has higher power than traditional designs. We apply the design to a biosimilar trial for treating rheumatoid arthritis.

Digit preference is the habit of reporting certain end digits more often than others. If such a misreporting pattern is a concern, then measures to reduce digit preference can be taken and monitoring changes in digit preference becomes important. We propose a two-dimensional penalized composite link model to estimate simultaneously the true distributions unaffected by misreporting, the digit preference pattern and a trend in the preference pattern. A transfer pattern is superimposed on a series of smooth latent distributions and is modulated along a second dimension. Smoothness of the latent distributions is enforced by a roughness penalty. Ridge regression with an L₂-penalty is used to extract the misreporting pattern, and an additional weighted least squares regression estimates the modulating trend vector. Smoothing parameters are selected by the Akaike information criterion. We present a simulation study and apply the model to data on birth weight and on self-reported weight of adults.
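The roughness penalty at the heart of such models can be illustrated in isolation. The following is a sketch of penalized smoothing with a second-order difference penalty on simulated data, one building block of the penalized composite link model rather than the full two-dimensional model itself.

```python
import numpy as np

# Minimize ||y - mu||^2 + lam * ||D2 mu||^2, where D2 takes second
# differences; larger lam forces a smoother fitted curve.
def smooth(y, lam):
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)          # second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=50)  # noisy signal
mu = smooth(y, lam=10.0)
```

The fitted curve `mu` has a much smaller sum of squared second differences (i.e. is smoother) than the raw data `y`.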

Biodiversity is important for the balance and function of a broad variety of ecosystems, and identifying factors that influence biodiversity can assist environmental management and maintenance. However, low abundance taxa are often missing from ecosystem surveys. These rare taxa, which may be critical to ecosystem function, are not accounted for in existing methods for detecting changes in species richness. We introduce a model for total (observed and unobserved) biodiversity that explicitly accounts for these rare taxa. Our method permits rigorous testing for both heterogeneity and biodiversity changes, and simultaneously improves type I and type II error rates compared with existing methods. To estimate the model parameters, we draw on the well-developed meta-analysis literature. The problem of substantial numbers of low abundance taxa missing from samples is especially pronounced in microbiomes, which are the focus of our case-studies.

Meta-analysis combining multiple transcriptomic studies increases statistical power and accuracy in detecting differentially expressed genes. As next-generation sequencing experiments become mature and affordable, increasing numbers of ribonucleic acid sequencing (‘RNA-seq’) data sets are becoming available in the public domain. Count-data-based technology provides better experimental accuracy, reproducibility and ability to detect genes expressed at low levels. A naive approach to combining multiple RNA-seq studies is to apply differential analysis tools such as edgeR and DESeq to each study and then to combine the summary statistics of *p*-values or effect sizes by conventional meta-analysis methods. Such a two-stage approach loses statistical power, especially for genes with short length or low expression abundance. We propose a full Bayesian hierarchical model (namely, BayesMetaSeq) for RNA-seq meta-analysis by modelling count data, integrating information across genes and across studies, and modelling potentially heterogeneous differential signals across studies via latent variables. A Dirichlet process mixture prior is further applied on the latent variables to provide categorization of detected biomarkers according to their differential expression patterns across studies, facilitating improved interpretation and biological hypothesis generation. Simulations and a real application to multiple brain regions of human immunodeficiency virus type 1 transgenic rats demonstrate the improved sensitivity, accuracy and biological findings of the method.

Sometimes in studies of the dependence of survival time on explanatory variables the natural time origin for defining entry into study cannot be observed and a delayed time origin is used instead. For example, diagnosis of disease may in some patients be made only at death. The effect of such delays is investigated both theoretically and in the context of the England and Wales National Cancer Register.

Numerous biological processes, many impacting on human health, rely on collective cell movement. We develop nine candidate models, based on advection–diffusion partial differential equations, to describe various alternative mechanisms that may drive cell movement. The parameters of these models were inferred from one-dimensional projections of laboratory observations of *Dictyostelium discoideum* cells by sampling from the posterior distribution using the delayed rejection adaptive Metropolis algorithm. The best model was selected by using the widely applicable information criterion. We conclude that cell movement in our study system was driven both by a self-generated gradient in an attractant that the cells could deplete locally, and by chemical interactions between the cells.

We investigate the causal relationship between climate and criminal behaviour. Considering the characteristics of integer-valued time series of criminal incidents, we propose a modified Granger causality test based on the generalized auto-regressive conditional heteroscedasticity type of integer-valued time series models to analyse the relationship between the number of crimes and the temperature as an environmental factor. More precisely, we employ the Poisson, negative binomial and log-linear Poisson integer-valued generalized auto-regressive conditional heteroscedasticity models and particularly adopt a Bayesian method for our analysis. The Bayes factors and posterior probability of the null hypothesis help to determine the causality between the variables considered. Moreover, employing an adaptive Markov chain Monte Carlo sampling scheme, we estimate model parameters and initial values. As an illustration, we evaluate our test through a simulation study and, to examine whether or not temperature affects crime activities, we apply our method to data sets categorized as sexual offences, drug offences, theft of motor vehicles, and domestic-violence-related assault in Ballina, New South Wales, Australia. The result reveals that more sexual offences, drug offences and domestic-violence-related assaults occur during the summer than in other seasons of the year. This evidence strongly advocates a causal relationship between crime and temperature.
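As a sketch of the count-data models involved, a Poisson integer-valued GARCH(1,1) process can be simulated as follows. The parameter values are hypothetical, and the Bayesian causality test itself is not reproduced here.

```python
import numpy as np

def simulate_ingarch(omega, alpha, beta, n, rng):
    """Poisson INGARCH(1,1): y_t | past ~ Poisson(lam_t), with
    lam_t = omega + alpha * y_{t-1} + beta * lam_{t-1}."""
    y = np.zeros(n, dtype=int)
    lam = np.zeros(n)
    lam[0] = omega / (1.0 - alpha - beta)   # start at the stationary mean
    y[0] = rng.poisson(lam[0])
    for t in range(1, n):
        lam[t] = omega + alpha * y[t - 1] + beta * lam[t - 1]
        y[t] = rng.poisson(lam[t])
    return y, lam

rng = np.random.default_rng(7)
y, lam = simulate_ingarch(omega=1.0, alpha=0.3, beta=0.4, n=5000, rng=rng)
```

Under stationarity (alpha + beta < 1) the long-run mean of the counts is omega / (1 - alpha - beta), which the simulated series should approximate.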

We analyse the problem of two clinically inseparable, repeatedly measured responses of ordinal type by also incorporating their missingness process. In our application these are the therapeutic effect and extent of side effects of fluvoxamine. In the case of a composite end point, the scientific questions addressed can be answered only when the responses are modelled jointly. As an extension of the methodology, several missingness not at random models were fitted to a set of observed data and shown to yield approximately the same results as their missingness at random counterparts, although precision is affected. In addition, the effect of various identifying restrictions on multiple imputation is investigated. An alternative numerical approximation method is suggested to reduce computational time.

Complex stochastic models are commonplace in epidemiology, but their utility depends on their calibration to empirical data. History matching is a (pre)calibration method that has been applied successfully to complex deterministic models. In this work, we adapt history matching to stochastic models, by emulating the variance in the model outputs, and therefore accounting for its dependence on the model's input values. The method proposed is applied to a real complex epidemiological model of human immunodeficiency virus in Uganda with 22 inputs and 18 outputs, and is found to increase the efficiency of history matching, requiring 70% of the time and 43% fewer simulator evaluations compared with a previous variant of the method. The insight gained into the structure of the human immunodeficiency virus model, and the constraints placed on it, are then discussed.
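History matching rules out implausible regions of input space wave by wave via an implausibility measure. A minimal sketch, assuming a generic emulator mean and variance at each candidate input (the paper's contribution of emulating the output variance itself is not shown):

```python
import numpy as np

def implausibility(emu_mean, emu_var, z_obs, obs_var):
    """Standardized distance between the emulator's prediction and the
    observation; inputs with I > 3 are conventionally ruled out."""
    emu_mean = np.asarray(emu_mean, dtype=float)
    emu_var = np.asarray(emu_var, dtype=float)
    return np.abs(z_obs - emu_mean) / np.sqrt(emu_var + obs_var)

# Three hypothetical candidate inputs with emulator predictions.
I = implausibility(emu_mean=[0.1, 2.5, 9.0], emu_var=[1.0, 1.0, 1.0],
                   z_obs=0.0, obs_var=0.25)
keep = I < 3.0   # non-implausible inputs survive to the next wave
```

Here the first two candidate inputs survive and the third is ruled out; in a real application the surviving region is resampled and emulated again in the next wave.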

Common to both diagnostic tests used in capture–recapture and score tests is the idea that starting from a simple base model it is possible to interrogate data to determine whether more complex parameter structures will be supported. Current recommendations advise that diagnostic tests are performed as a precursor to a model selection step. We show that certain well-known diagnostic tests for examining the fit of capture–recapture models to data are in fact score tests. Because of this direct relationship we investigate a new strategy for model assessment which combines the diagnosis of departure from basic model assumptions with a step-up model selection, all based on score tests. We investigate the power of such an approach to detect common reasons for lack of model fit and compare the performance of this new strategy with the existing recommendations by using simulation. We present motivating examples with real data for which the extra flexibility of score tests results in an improved performance compared with diagnostic tests.

The paper investigates whether accounting for unobserved heterogeneity in farms' size transition processes improves the representation of structural change in agriculture. Considering a mixture of two types of farm, the mover–stayer model is applied for the first time in an agricultural economics context. The maximum likelihood method and the expectation–maximization algorithm are used to estimate the model's parameters. An empirical application to a panel of French farms from 2000 to 2013 shows that the mover–stayer model outperforms the homogeneous Markov chain model in recovering the transition process and predicting the future distribution of farm sizes.

Tremor activity has recently been detected in various tectonic areas worldwide and is spatially segmented and temporally recurrent. We design a type of hidden Markov model to investigate this phenomenon, in which each state represents a distinct segment of tremor sources. A mixture distribution of a Bernoulli variable and a continuous variable is introduced into the hidden Markov model to deal with the fact that tremor clusters are very sparse in time. We apply our model to tremor data from the Tokai region in Japan to identify distinct segments of tremor source regions, and the results reveal a spatiotemporal migration pattern among these segments.

In determining dose limiting toxicities in phase I studies, it is necessary to attribute adverse events to being drug related or not. Such determination is subjective and may introduce bias. We develop methods for removing or at least diminishing the effect of this bias on the estimation of the maximum tolerated dose. The approach that we suggest takes into account the subjectivity in the attribution of adverse events by using model-based dose escalation designs. The results show that gains can be achieved in terms of accuracy by recovering information lost to biases. These biases are a result of ignoring the errors in toxicity attribution.

It has been argued that the use of debit cards may modify cash holding behaviour, as debit card holders may either withdraw cash from automated teller machines or purchase items by using point-of-sale devices at retailers. Within the Rubin causal model, we investigate the causal effects of the use of debit cards on the cash inventories held by households, using data from the Italian Survey of Household Income and Wealth. We adopt the principal stratification approach to account for the share of debit card holders who do not use this payment instrument. We use a regression model with the propensity score as the single predictor to adjust for imbalance in the observed covariates. We further develop a sensitivity analysis approach to assess the sensitivity of the proposed model to violation of the key unconfoundedness assumption. Our empirical results suggest statistically significant negative effects of debit cards on the household cash level in Italy.
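The propensity-score-as-single-predictor adjustment can be sketched on entirely simulated data. The data-generating process below is hypothetical and stands in for the survey data; it is not the paper's specification.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Hypothetical simulated data: treatment uptake depends on covariates,
# and the true treatment effect on the outcome is -2.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))                                  # covariates
d = rng.random(n) < expit(0.8 * x[:, 0] - 0.5 * x[:, 1])     # treatment
y = 1.5 * x[:, 0] + x[:, 1] - 2.0 * d + rng.normal(size=n)   # outcome

# Step 1: estimate the propensity score by maximum-likelihood logistic fit.
X = np.column_stack([np.ones(n), x])
def nll(b):
    eta = X @ b
    return np.sum(np.logaddexp(0.0, eta) - d * eta)
ps = expit(X @ minimize(nll, np.zeros(3), method="BFGS").x)

# Step 2: regress the outcome on treatment and the propensity score only.
Z = np.column_stack([np.ones(n), d, ps])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

The coefficient on the treatment indicator (`coef[1]`) recovers the simulated effect approximately, since conditioning on the propensity score balances the covariates across treatment groups.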

An important problem within the social, behavioural and health sciences is how to partition an exposure effect (e.g. treatment or risk factor) among specific pathway effects and to quantify the importance of each pathway. Mediation analysis based on the potential outcomes framework is an important tool to address this problem and we consider the estimation of mediation effects for the proportional hazards model. We give precise definitions of the total effect, natural indirect effect and natural direct effect in terms of the survival probability, hazard function and restricted mean survival time within the standard two-stage mediation framework. To estimate the mediation effects on different scales, we propose a mediation formula approach in which simple parametric models (fractional polynomials or restricted cubic splines) are utilized to approximate the baseline log-cumulative-hazard function. Simulation study results demonstrate low bias of the mediation effect estimators and close-to-nominal coverage probability of the confidence intervals for a wide range of complex hazard shapes. We apply this method to the Jackson heart study data and conduct a sensitivity analysis to assess the effect on the mediation effects inference when the no unmeasured mediator–outcome confounding assumption is violated.

Motivated by the recent acquired immune deficiency syndrome clinical trial A5175, we propose a semiparametric framework for time-to-event data in which only the dependence of the mean and variance of the event time on the covariates is specified, through a restricted moment model. We use a second-order semiparametric efficient score combined with a non-parametric imputation device for estimation. Compared with an imputed weighted least squares method, the approach proposed improves the efficiency of the parameter estimation whenever the third moment of the error distribution is non-zero. We compare the method with a parametric survival regression method in the analysis of the A5175 study data, where the method proposed shows a better fit to the data with smaller mean-squared residuals. In summary, this work provides a semiparametric framework for the modelling and estimation of survival data, with wide applications in data analysis.

Many psoriatic arthritis patients do not progress to permanent joint damage in any of the 28 hand joints, even under prolonged follow-up. This has led several researchers to fit models that estimate the proportion of stayers (those who do not have the propensity to experience the event of interest) and to characterize the rate of developing damaged joints in the movers (those who have the propensity to experience the event of interest). However, the paper demonstrates that, when fitted to the same data, different choices of model for the movers can lead to widely varying conclusions on a stayer population, thus implying that, if interest lies in a stayer population, a single analysis should not generally be adopted. The aim of the paper is to provide greater understanding regarding estimation of a stayer population by comparing the inferences, performance and features of multiple models fitted to real and simulated data sets. The models for the movers are based on Poisson processes with patient level random effects and/or dynamic covariates, which are used to induce within-patient correlation, and observation level random effects are used to account for time varying unobserved heterogeneity. The gamma, inverse Gaussian and compound Poisson distributions are considered for the random effects.

Motivated by a real life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyse synthetic graphs to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case-study using a version of the Enron e-mail corpus data set demonstrates the application and usefulness of the proposed techniques in solving the challenging problem of maintaining privacy *and* supporting open access to network data to ensure reproducibility of existing studies and discovering new scientific insights that can be obtained by analysing such data. We use a simple yet effective randomized response mechanism to generate synthetic networks under *ε*-edge differential privacy and then use likelihood-based inference for missing data and Markov chain Monte Carlo techniques to fit exponential family random-graph models to the generated synthetic networks.
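A randomized response release of this kind can be sketched as follows: each dyad of the adjacency matrix is flipped independently with probability 1/(1 + e^ε), a standard mechanism satisfying ε-edge differential privacy (the paper's exact variant may differ in detail, and the downstream model fitting is not shown).

```python
import numpy as np

def randomized_response_graph(adj, epsilon, rng):
    """Release a synthetic undirected adjacency matrix by flipping each
    dyad independently with probability 1 / (1 + exp(epsilon))."""
    p_flip = 1.0 / (1.0 + np.exp(epsilon))
    n = adj.shape[0]
    iu = np.triu_indices(n, k=1)                  # each dyad appears once
    flips = rng.random(len(iu[0])) < p_flip
    out = adj.copy()
    out[iu] = np.where(flips, 1 - adj[iu], adj[iu])
    out[iu[1], iu[0]] = out[iu]                   # keep the matrix symmetric
    return out

rng = np.random.default_rng(1)
A = (rng.random((6, 6)) < 0.3).astype(int)
A = np.triu(A, 1)
A = A + A.T                                       # toy undirected network
A_priv = randomized_response_graph(A, epsilon=2.0, rng=rng)
```

Smaller ε means a higher flip probability and hence stronger privacy but noisier synthetic networks, which is why likelihood-based inference treating the flips as a known missing-data mechanism is needed downstream.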

It is common in the analysis of social network data to assume a census of the networked population of interest. Often, however, the observations are only partial, owing to a known sampling mechanism or an unknown missing data mechanism. Most social network analysis nonetheless ignores the problem of missing data by including only actors with complete observations. We address the modelling of networks with missing data, developing previous ideas in missing data, network modelling and network sampling. We use several methods, including the mean value parameterization, to show the quantitative and substantive differences between naive and principled modelling approaches. We also develop goodness-of-fit techniques to understand model fit better. The ideas are motivated by an analysis of a friendship network from the National Longitudinal Study of Adolescent Health.

Interconnectedness between stocks and firms plays a crucial role in the volatility contagion phenomena that characterize financial crises, and graphs are a natural tool in their analysis. We propose graphical methods for an analysis of volatility interconnections in the Standard & Poor's 100 data set during the period 2000–2013, which contains the 2007–2008 Great Financial Crisis. The challenges are twofold: first, volatilities are not directly observed and must be extracted from time series of stock returns; second, the observed series, with about 100 stocks, is high dimensional, and curse-of-dimensionality problems are to be faced. To overcome this double challenge, we propose a dynamic factor model methodology, decomposing the panel into a factor-driven and an idiosyncratic component modelled as a sparse vector auto-regressive model. The inversion of this auto-regression, along with suitable identification constraints, produces networks in which, for a given horizon *h*, the weight associated with edge (*i*,*j*) represents the *h*-step-ahead forecast error variance of variable *i* accounted for by variable *j*'s innovations. Then, we show how those graphs yield an assessment of how *systemic* each firm is. They also demonstrate the prominent role of financial firms as sources of contagion during the 2007–2008 crisis.
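The edge weights described can be computed from a fitted vector auto-regression. A sketch for a VAR(1) with a diagonal (already orthogonalized) innovation covariance, which is a simplification of the paper's factor-plus-sparse-VAR set-up:

```python
import numpy as np

def fevd(A, Sigma, h):
    """h-step-ahead forecast error variance decomposition for a VAR(1)
    y_t = A y_{t-1} + u_t with diagonal Cov(u_t) = Sigma.  Entry (i, j)
    is the share of variable i's h-step forecast error variance
    accounted for by variable j's innovations."""
    k = A.shape[0]
    num = np.zeros((k, k))
    for s in range(h):
        Psi = np.linalg.matrix_power(A, s)    # moving-average coefficient
        num += (Psi ** 2) * np.diag(Sigma)
    return num / num.sum(axis=1, keepdims=True)

# Hypothetical two-variable system: variable 1 loads on variable 2,
# but not vice versa, so contagion flows in one direction only.
A = np.array([[0.5, 0.2],
              [0.0, 0.7]])
W = fevd(A, np.eye(2), h=5)                    # network edge weights
```

Each row of `W` sums to 1, and the zero entry `W[1, 0]` reflects that variable 2 receives no contagion from variable 1 in this toy system.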

Motivated by the evaluation of the causal effect of the General Agreement on Tariffs and Trade on bilateral international trade flows, we investigate the role of network structure in propensity score matching under the assumption of strong ignorability. We study the sensitivity of causal inference with respect to the presence of characteristics of the network in the set of confounders conditionally on which strong ignorability is assumed to hold. We find that estimates of the average causal effect are highly sensitive to the node level network statistics in the set of confounders. Therefore, we argue that estimates may suffer from omitted variable bias when the network information is ignored, at least in our application.

The correct identification of the source of a propagation process is crucial in many research fields. As a specific application, we consider source estimation of delays in public transportation networks. We propose two approaches: an effective distance median and a backtracking method. The former is based on a structurally generic effective distance-based approach for the identification of infectious disease origins, and the latter is specifically designed for delay propagation. We examine the performance of both methods in simulation studies and in an application to the German railway system, and we compare the results with those of a centrality-based approach for source detection.

Dupuytren disease is a fibroproliferative disorder with unknown aetiology that often progresses and eventually can cause permanent contractures of the fingers affected. We provide a computationally efficient Bayesian framework to discover potential risk factors and investigate which fingers are jointly affected. Our Bayesian approach is based on Gaussian copula graphical models, which provide a way to discover the underlying conditional independence structure of variables in multivariate data of mixed types. In particular, we combine the semiparametric Gaussian copula with extended rank likelihood to analyse multivariate data of mixed types with arbitrary marginal distributions. For structural learning, we construct a computationally efficient search algorithm by using a transdimensional Markov chain Monte Carlo algorithm based on a birth–death process. In addition, to make our statistical method easily accessible to other researchers, we have implemented our method in C++ and provide an interface with R software as an R package BDgraph, which is freely available from http://CRAN.R-project.org/package=BDgraph.

The estimation of time varying networks for functional magnetic resonance imaging data sets is of increasing importance and interest. We formulate the problem in a high dimensional time series framework and introduce a data-driven method, namely *network change points detection*, which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. Network change points detection is applied to various simulated data and a resting state functional magnetic resonance imaging data set. This new methodology also allows us to identify common functional states within and across subjects. Finally, network change points detection promises to offer a deep insight into the large-scale characterizations and dynamics of the brain.

When experiments are performed on social networks, it is difficult to justify the usual assumption of treatment–unit additivity, owing to the connections between actors in the network. We investigate how connections between experimental units affect the design of experiments on those units. Specifically, where we have unstructured treatments whose effects propagate according to a linear network effects model, which we introduce, we show that optimal designs are no longer necessarily balanced; we further demonstrate how experiments that do not take a network effect into account can lead to much higher variance than necessary and/or a large bias. We demonstrate the use of this methodology in a wide range of settings, including agricultural trials, crossover trials and experiments on connected individuals in a social network.

Complex network data problems are increasingly common in many fields of application. Our motivation is drawn from strategic marketing studies monitoring customer choices of specific products, along with co-subscription networks encoding multiple-purchasing behaviour. Data are available for several agencies within the same insurance company, and our goal is to exploit co-subscription networks efficiently to inform targeted advertising of cross-sell strategies to currently monoproduct customers. We address this goal by developing a Bayesian hierarchical model, which clusters agencies according to common monoproduct customer choices and co-subscription networks. Within each cluster, we efficiently model customer behaviour via a cluster-dependent mixture of latent eigenmodels. This formulation provides key information on monoproduct customer choices and multiple-purchasing behaviour within each cluster, informing targeted cross-sell strategies. We develop simple algorithms for tractable inference and assess performance in simulations and an application to business intelligence.

Genetic interactions confer robustness on cells in response to genetic perturbations. This often occurs through molecular buffering mechanisms that can be predicted by using, among other features, the degree of coexpression between genes, which is commonly estimated through marginal measures of association such as Pearson or Spearman correlation coefficients. However, marginal correlations are sensitive to indirect effects and often partial correlations are used instead. Yet, partial correlations convey no information about the (linear) influence of the coexpressed genes on the entire multivariate system, which may be crucial to discriminate functional associations from genetic interactions. To address these two shortcomings, here we propose to use the edge weight derived from the covariance decomposition over the paths of the associated gene network. We call this new quantity the networked partial correlation and use it to analyse genetic interactions in yeast.
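Marginal versus partial correlation can be contrasted directly via the precision matrix. The following is a minimal sketch; the networked partial correlation proposed in the paper additionally decomposes the covariance over paths of the gene network, which is not shown here.

```python
import numpy as np

def partial_correlations(S):
    """Partial correlations from a covariance matrix S via its inverse
    (precision matrix) K: rho_ij = -K_ij / sqrt(K_ii * K_jj)."""
    K = np.linalg.inv(S)
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P

# Chain X1 -> X2 -> X3: X1 and X3 are marginally correlated (indirect
# effect through X2), but conditionally independent given X2.
S = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 2.0, 3.0]])
P = partial_correlations(S)
```

The marginal covariance between X1 and X3 is non-zero, yet their partial correlation vanishes, illustrating why partial correlations filter out indirect effects.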

In regression models for categorical data a linear model is typically related to the response variables via a transformation of probabilities called the link function. We introduce an approach based on two link functions for binary data named the *log-mean* and the *log-mean linear* methods. The choice of the link function plays a key role in the interpretation of the model, and our approach is especially appealing in terms of interpretation of the effects of covariates on the association of responses. Similarly to Poisson regression, the log-mean and log-mean linear regression coefficients of single outcomes are log-relative-risks, and we show that the relative risk interpretation is maintained also in the regressions of the association of responses. Furthermore, certain collections of zero log-mean linear regression coefficients imply that the relative risks for joint responses factorize with respect to the corresponding relative risks for marginal responses. This work is motivated by the analysis of a data set obtained from a case–control study aimed at investigating the effect of human immunodeficiency virus infection on *multimorbidity*, i.e. simultaneous presence of two or more non-infectious comorbidities in one patient.

Symbolic or categorical sequences occur in many contexts and can be characterized, for example, by integer-valued intersymbol distances or binary-valued indicator sequences. The analysis of these numerical sequences often sheds light on the properties of the original symbolic sequences. This work introduces new statistical tools for exploring auto-correlation structure in the indicator sequences, for the specific case of deoxyribonucleic acid (DNA) sequences. It is known that the probability distribution of internucleotide distances of DNA sequences deviates significantly from the distribution obtained by assuming independent random placement (i.e. the geometric distribution) and that the deviations can be used either to discriminate between species or to build phylogenetic trees. To investigate the extent to which auto-correlation structure explains these deviations, the 0–1 indicator sequence of each nucleotide (A, C, G and T) is endowed with a binary auto-regressive (AR) model of optimum order. The corresponding binary AR geometric distribution is derived analytically and compared with the observed internucleotide distance distribution by appropriate goodness-of-fit testing. Results in 34 mitochondrial DNA sequences show that the hypothesis of equal observed/expected frequencies is seldom rejected when a binary AR model is considered instead of independence (76/136 *versus* 125/136 rejections at the 1% level), in spite of χ²-testing tending to reject for large samples, regardless of how close observed/expected values are. Furthermore, binary AR structure also leads to a median discrepancy reduction of 90% for G, 80% for C, 60% for T and 30% for nucleotide A. Therefore, these models are useful to describe the dependences within a given nucleotide and encourage the development of a model-based framework to compact internucleotide distance information and to understand DNA differences among species further.
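The baseline comparison described, internucleotide distances against the geometric distribution implied by independent random placement, can be sketched on a simulated sequence (not real mitochondrial DNA, and without the binary AR extension):

```python
import numpy as np

def internucleotide_distances(seq, symbol):
    """Distances between consecutive occurrences of `symbol` in `seq`."""
    pos = np.flatnonzero(np.asarray(list(seq)) == symbol)
    return np.diff(pos)

# Under independent random placement with per-site probability p, these
# distances are geometric with mean 1/p; real DNA deviates from this.
rng = np.random.default_rng(42)
seq = rng.choice(list("ACGT"), size=200_000)   # uniform, so p = 0.25
d = internucleotide_distances(seq, "A")
```

For this independently generated sequence the mean distance is close to 1/p = 4; for real DNA the empirical distance distribution departs from the geometric, which is what the goodness-of-fit tests in the paper quantify.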

We analyse the delay between diagnosis of illness and claim settlement in critical illness insurance by using generalized linear-type models under a generalized beta of the second kind family of distributions. A Bayesian approach is employed which allows us to incorporate parameter and model uncertainty and also to impute missing data in a natural manner. We propose methodology involving a latent likelihood ratio test to compare missing data models and a version of posterior predictive *p*-values to assess different models. Bayesian variable selection is also performed, supporting a small number of models with small Bayes factors, and therefore we base our predictions on model averaging instead of on a best-fitting model.

In large multilevel studies, effects of interest are often evaluated for a number of more or less related outcomes. For instance, the present work was motivated by the multiplicity issues that arose in the analysis of a cluster-randomized, crossover intervention study evaluating the health benefits of a school meal programme. We propose a novel and versatile framework for simultaneous inference on parameters estimated from linear mixed models that were fitted separately for several outcomes from the same study but did not necessarily contain the same fixed or random effects. By combining asymptotic representations of the parameter estimates from the separate model fits, we derive the joint asymptotic normal distribution of all parameter estimates of interest for all outcomes considered. This result enables the construction of simultaneous confidence intervals and the calculation of adjusted *p*-values. For sample sizes of practical relevance, we studied simultaneous coverage through simulation, which showed that the approach achieves acceptable coverage probabilities even for small sample sizes (10 clusters) and for 2–16 outcomes. The approach also compared favourably with a joint modelling approach. We also analysed data with 17 outcomes from the motivating study, obtaining adjusted *p*-values that were appreciably less conservative than Bonferroni adjustment.
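Once a joint normal distribution for the parameter estimates is available, a single-step max-|t|-type adjustment can be sketched as follows (an illustrative simulation-based adjustment assuming the joint covariance matrix of the estimates is known; not the authors' code):

```python
import numpy as np

def maxt_adjusted_pvalues(estimates, cov, n_sim=100_000, seed=0):
    """Single-step max-|t| adjusted p-values from the joint normal
    distribution of parameter estimates (illustrative sketch).

    estimates: vector of parameter estimates (null value 0 for each).
    cov: their joint covariance matrix.
    """
    rng = np.random.default_rng(seed)
    se = np.sqrt(np.diag(cov))
    z = estimates / se                      # observed standardized statistics
    corr = cov / np.outer(se, se)           # correlation of the estimates
    draws = rng.multivariate_normal(np.zeros(len(z)), corr, size=n_sim)
    max_abs = np.abs(draws).max(axis=1)     # null distribution of max |Z|
    return np.array([(max_abs >= abs(zi)).mean() for zi in z])
```

Bonferroni adjustment ignores the correlation between estimates; simulating from their joint distribution is what makes this kind of adjustment less conservative.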

The association between maternal age of onset of dementia and amyloid deposition (measured by *in vivo* positron emission tomography imaging) in cognitively normal older offspring is of interest. In a regression model for amyloid, special methods are required because the covariate, maternal age of onset of dementia, is subject to random right censoring. Prior literature has proposed methods to address censoring due to an assay's limit of detection, but not random censoring. We propose imputation methods and a survival regression method that do not require parametric assumptions about the distribution of the censored covariate. Existing imputation methods address missing covariates, but not right-censored covariates. In simulation studies, we compare these methods with the simple but inefficient complete-case analysis and with thresholding approaches. We apply the methods to the motivating Alzheimer's disease study.

We introduce a non-stationary spatiotemporal model for gridded data on the sphere. The model specifies a computationally convenient covariance structure that depends on heterogeneous geography. Widely used statistical models on a spherical domain are non-stationary across latitudes but stationary within the same latitude (*axial symmetry*). This assumption has been acknowledged to be too restrictive for quantities such as surface temperature, whose statistical behaviour is influenced by large-scale geographical descriptors such as land and ocean. We propose an evolutionary spectrum approach that can account for different regimes across the Earth's geography and results in a more general and flexible class of models that vastly outperforms axially symmetric models and captures longitudinal patterns that would otherwise be assumed constant. The model can be estimated with a multistep conditional likelihood approximation that preserves the non-stationary features while allowing for easily distributed computations: we show how the model can be fitted to more than 20 million data points in less than 1 day on a state-of-the-art workstation. The resulting estimates from the statistical model can be regarded as a synthetic description (i.e. a compression) of the space–time characteristics of an entire initial condition ensemble.

The increasing awareness of treatment effect heterogeneity has motivated flexible designs of confirmatory clinical trials that prospectively allow investigators to test for treatment efficacy for a subpopulation of patients in addition to the entire population. If a target subpopulation is not well characterized in the design stage, it can be developed at the end of a broad eligibility trial under an adaptive signature design. The paper proposes new procedures for subgroup selection and treatment effect estimation (for the selected subgroup) under an adaptive signature design. We first provide a simple and general characterization of the optimal subgroup that maximizes the power for demonstrating treatment efficacy or the expected gain based on a specified utility function. This characterization motivates a procedure for subgroup selection that involves prediction modelling, augmented inverse probability weighting and low dimensional maximization. A cross-validation procedure can be used to remove or reduce any resubstitution bias that may result from subgroup selection, and a bootstrap procedure can be used to make inference about the treatment effect in the subgroup selected. The approach proposed is evaluated in simulation studies and illustrated with real examples.
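The augmented inverse probability weighting ingredient mentioned above can be sketched with a standard AIPW estimator of an average treatment effect (illustrative code with hypothetical inputs: `pi` for propensity scores and `mu1`, `mu0` for outcome-model predictions; the paper's subgroup-selection procedure builds on such estimators but is not reproduced here):

```python
import numpy as np

def aipw_effect(y, a, pi, mu1, mu0):
    """Augmented inverse probability weighted estimate of the average
    treatment effect (sketch).

    y: observed outcomes; a: treatment indicators (0/1);
    pi: P(A = 1 | X); mu1, mu0: outcome-model predictions
    under treatment and control.
    """
    term1 = a * (y - mu1) / pi + mu1              # augmented term, treated
    term0 = (1 - a) * (y - mu0) / (1 - pi) + mu0  # augmented term, control
    return np.mean(term1 - term0)
```

The estimator is doubly robust: it is consistent if either the propensity model or the outcome model is correctly specified, which is why it is attractive when a prediction model is fitted to a broad-eligibility trial.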

Consumer products and services can often be described as mixtures of ingredients. Examples are the mixture of ingredients in a cocktail and the mixture of different components of travel time (e.g. in-vehicle and out-of-vehicle travel time) in a transportation setting. Choice experiments may help to determine how the respondent's choice of a product or service is affected by the combination of ingredients. In such experiments, individuals are confronted with sets of hypothetical products or services and they are asked to choose the most preferred product or service from each set. However, there are no studies on the optimal design of choice experiments involving mixtures. We propose a method for generating optimal designs for such choice experiments and demonstrate the large increase in statistical efficiency that can be obtained by using an optimal design.

We propose a hidden Markov mixture model for the analysis of gene expression measurements mapped to chromosome locations. These expression values represent preprocessed light intensities observed in each probe of Affymetrix oligonucleotide arrays. Here, the algorithm BLAT is used to align thousands of probe sequences to each chromosome. The main goal is to identify genome regions associated with high expression values which define clusters composed of consecutive observations. The model proposed assumes a mixture distribution in which one of the components (the one with the highest expected value) is intended to accommodate the overexpressed clusters. The model takes advantage of the serial structure of the data and uses the distance information between neighbours to infer the existence of a Markov dependence. This dependence is crucially important in detecting overexpressed regions. We propose and discuss a Markov chain Monte Carlo algorithm to fit the model. Finally, the methodology proposed is used to analyse five data sets representing three types of cancer (breast, ovarian and brain).
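The serial (Markov) dependence exploited by such models can be illustrated with a standard scaled forward recursion for hidden Markov likelihoods (a generic sketch, not the authors' Markov chain Monte Carlo algorithm; `emis` is a hypothetical state-dependent density):

```python
import math

def forward_loglik(obs, pi0, A, emis):
    """Log-likelihood of a hidden Markov model via the scaled forward
    algorithm (sketch).

    obs: observation sequence; pi0: initial state probabilities;
    A: transition matrix (list of rows); emis(state, x): emission density.
    """
    n_states = len(pi0)
    alpha = [pi0[s] * emis(s, obs[0]) for s in range(n_states)]
    ll = 0.0
    for x in obs[1:]:
        c = sum(alpha)            # scaling constant for the previous step
        ll += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[s] * A[s][t] for s in range(n_states)) * emis(t, x)
                 for t in range(n_states)]
    ll += math.log(sum(alpha))
    return ll
```

In the setting above, one state would carry the overexpressed mixture component and the transition matrix would encode the tendency of consecutive probes to stay in the same regime.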

We propose a three-state frailty Markov model coupled with likelihood-based inference to analyse tooth level life course data in caries research. This analysis is challenging because of intraoral clustering, interval censoring, multiplicity of caries states and computational complexities. We also develop a Bayesian approach to predict future caries transition probabilities given observed life history data. Numerical experiments demonstrate that the methods proposed perform very well in finite samples with moderate sizes. The practical utility of the model is illustrated by using life course data from a unique longitudinal study of dental caries in young low income urban African-American children. In this analysis, we evaluate whether there is spatial symmetry in the mouth with respect to the life course of dental caries, and whether the same type of tooth shows a similar decay process in boys and girls.
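For a progressive three-state process, transition probabilities over an interval follow from the matrix exponential of the intensity matrix. The sketch below is illustrative only (hypothetical rates `q01` and `q12`, without the frailty terms or interval censoring of the full model) and computes the exponential via a truncated power series:

```python
import numpy as np

def transition_probabilities(q01, q12, t, n_terms=60):
    """P(t) = expm(Q t) for a progressive three-state intensity matrix Q
    (state 0 -> state 1 -> state 2, state 2 absorbing), via a truncated
    power series. A sketch; state labels and rates are illustrative.
    """
    Q = np.array([[-q01, q01, 0.0],
                  [0.0, -q12, q12],
                  [0.0, 0.0, 0.0]])
    P = np.eye(3)
    term = np.eye(3)
    for k in range(1, n_terms):
        term = term @ (Q * t) / k   # (Qt)^k / k!
        P += term
    return P
```

Each row of `P` sums to 1, and the probability of remaining in the initial state is `exp(-q01 * t)`, which provides a quick sanity check on any implementation.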

In population size estimation, many capture–recapture-type data exhibit a preponderance of ‘1’-counts. This excess of 1s can arise as subjects gain information from the initial capture that gives them both the desire and the ability to avoid subsequent captures. Existing population size estimators that purport to deal with heterogeneity can be much too large in the presence of 1-inflation, which is a specific form of heterogeneity. To deal with the phenomenon of excess 1s, we propose the one-inflated positive Poisson model for use as the truncated count distribution in Horvitz–Thompson estimation of the population size.
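A minimal sketch of this setup (hypothetical code, not the authors' implementation): the one-inflated positive Poisson pmf, a crude grid-search maximum likelihood fit, and the resulting Horvitz–Thompson estimate of the population size, where P(Y = 0) is recovered from the fitted Poisson rate.

```python
import math

def oipp_pmf(y, omega, lam):
    """One-inflated positive Poisson pmf for y = 1, 2, ...:
    mix a point mass at 1 (weight omega) with a zero-truncated Poisson."""
    pp = math.exp(-lam) * lam**y / (math.factorial(y) * (1 - math.exp(-lam)))
    return omega * (y == 1) + (1 - omega) * pp

def fit_oipp(counts, grid=50):
    """Crude grid-search MLE over (omega, lambda); a sketch, not
    production code -- a real fit would use a proper optimizer."""
    best = (None, None, -float("inf"))
    for i in range(0, grid):
        omega = i / grid
        for j in range(1, grid * 4):
            lam = j / 10
            ll = sum(math.log(oipp_pmf(y, omega, lam)) for y in counts)
            if ll > best[2]:
                best = (omega, lam, ll)
    return best[:2]

def horvitz_thompson_N(counts, lam):
    """N_hat = n / (1 - P(Y = 0)), with P(0) = exp(-lambda) under the
    fitted Poisson, so the extra 1s inflate omega rather than N_hat."""
    n = len(counts)
    return n / (1 - math.exp(-lam))
```

The point of routing the 1-inflation into `omega` is that the excess 1s no longer distort the estimate of the zero-class probability, which is what drives the Horvitz–Thompson estimate.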