Medical and public health research increasingly involves the collection of complex, high-dimensional data. In particular, functional data—where the unit of observation is a curve or set of curves finely sampled over a grid—is frequently obtained. Moreover, researchers often sample multiple curves per person, resulting in repeated functional measures. A common question is how to analyze the relationship between two functional variables. We propose a general function-on-function regression model for repeatedly sampled functional data on a fine grid, presenting a simple model as well as a more extensive mixed model framework, and introducing various functional Bayesian inferential procedures that account for multiple testing. We examine these models via simulation and a data analysis with data from a study that used event-related potentials to examine how the brain processes various types of images.

In this article we propose a Bayesian hierarchical model for the identification of differentially expressed genes in *Daphnia magna* organisms exposed to chemical compounds, specifically munition pollutants in water. The model we propose constitutes one of the very first attempts at rigorous modeling of the biological effects of water purification. We have data acquired from a purification system that comprises four consecutive purification stages, which we refer to as “ponds,” of progressively more contaminated water. We model the expected expression of a gene in a pond as the sum of the mean of the same gene in the previous pond plus a gene-pond specific difference. We incorporate a variable selection mechanism for the identification of the differential expressions, with a prior distribution on the probability of a change that accounts for the available information on the concentration of chemical compounds present in the water. We carry out posterior inference via MCMC stochastic search techniques. In the application, we reduce the complexity of the data by grouping genes according to their functional characteristics, based on the KEGG pathway database. This also increases the biological interpretability of the results. Our model successfully identifies a number of pathways that show differential expression between consecutive purification stages. We also find that changes in the transcriptional response are more strongly associated with the presence of certain compounds, with the remaining compounds contributing to a lesser extent. We discuss the sensitivity of these results to the model parameters that measure the influence of the prior information on the posterior inference.

Times of disease progression are interval-censored when progression status is only known at a series of assessment times. This situation arises routinely in clinical trials and cohort studies when events of interest are only detectable upon imaging, based on blood tests, or upon careful clinical examination. We consider the problem of selecting important prognostic biomarkers from a large set of candidates when disease progression status is only known at irregularly spaced and individual-specific assessment times. Penalized regression techniques (e.g., LASSO, adaptive LASSO, and SCAD) are adapted to handle interval-censored time of disease progression. An expectation–maximization algorithm is described which is empirically shown to perform well. Application to the motivating study of the development of arthritis mutilans in patients with psoriatic arthritis is given and several important human leukocyte antigen (HLA) variables are identified for further investigation.

Vaccine clinical trials with active surveillance for infection often use the time to infection as the primary endpoint. A common method of analysis for such trials is to compare the times to infection between the vaccine and placebo groups using a Cox regression model. With new technology, we can sometimes additionally record the precise number of virions that cause infection rather than just the indicator that infection occurred. In this article, we develop a unified approach for vaccine trials that couples the time to infection with the number of infecting or founder viruses, N. We assume that the instantaneous risk of a potentially infectious exposure for individuals in the placebo and vaccine groups follows the same proportional intensity model. Following exposure, the number of founder viruses N is assumed to be generated from some distribution on the non-negative integers, which is allowed to be different for the two groups. Exposures that result in N = 0 are unobservable. We denote the placebo and vaccine means of N by μ_P and μ_V, so that 1 − μ_V/μ_P measures the proportion reduction in the mean number of infecting virions due to vaccination per exposure. We develop different semi-parametric methods of estimating the ratio μ_V/μ_P. We allow the distribution of N to be Poisson or unspecified, and discuss how to incorporate covariates that impact the time to exposure and/or N. Interestingly, μ_V/μ_P, which is a ratio of untruncated means, can be reliably estimated using truncated data (N ≥ 1), even if the placebo and vaccine distributions of N are completely unspecified. Simulations of vaccine clinical trials show that the method can reliably recover μ_V/μ_P in realistic settings. We apply our methods to an HIV vaccine trial conducted in injecting drug users.
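
A minimal sketch of the core idea — recovering an untruncated Poisson mean from zero-truncated counts, and hence a ratio of untruncated means — is below. The bisection solver and function names are illustrative assumptions, not the article's semi-parametric estimators:

```python
import math

def untruncated_poisson_mean(counts):
    """Recover the untruncated Poisson mean lam from zero-truncated
    counts (all >= 1) by solving lam / (1 - exp(-lam)) = mean(counts)
    with bisection; the truncated mean is strictly increasing in lam."""
    m = sum(counts) / len(counts)
    lo, hi = 1e-12, m  # the truncated mean always exceeds lam, so lam < m
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mid / (1.0 - math.exp(-mid)) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def mean_ratio(vaccine_counts, placebo_counts):
    """Plug-in estimate of the vaccine/placebo ratio of untruncated means."""
    return untruncated_poisson_mean(vaccine_counts) / untruncated_poisson_mean(placebo_counts)
```

Because the map from the untruncated to the truncated mean is monotone, the truncation is invertible — which is the intuition behind estimating a ratio of untruncated means from data in which zero counts are never seen.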

In many applications, covariates possess a grouping structure that can be incorporated into the analysis to select important groups as well as important members of those groups. One important example arises in genetic association studies, where genes may have several variants capable of contributing to disease. An ideal penalized regression approach would select variables by balancing the direct evidence of a feature's importance with the indirect evidence offered by the grouping structure. This work proposes a new approach we call the group exponential lasso (GEL) which features a decay parameter controlling the degree to which feature selection is coupled together within groups. We demonstrate that the GEL has a number of statistical and computational advantages over previously proposed group penalties such as the group lasso, group bridge, and composite MCP. Finally, we apply these methods to the problem of detecting rare variants in a genetic association study.
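
As a rough illustration of how a decay parameter can couple selection within a group, the sketch below applies an exponential penalty to each group's L1 norm. The exact functional form is an assumption for illustration and is not necessarily the GEL penalty as defined in the article:

```python
import math

def exponential_group_penalty(groups, lam, tau):
    """Illustrative exponential group penalty: each group g contributes
    (lam**2 / tau) * (1 - exp(-tau * ||b_g||_1 / lam)).  As tau -> 0 this
    approaches the plain lasso penalty lam * ||b||_1; larger tau makes the
    marginal penalty decay faster once a group has accumulated signal,
    coupling selection of coefficients within the group."""
    total = 0.0
    for beta_g in groups:
        l1 = sum(abs(b) for b in beta_g)
        total += (lam ** 2 / tau) * (1.0 - math.exp(-tau * l1 / lam))
    return total
```

The key qualitative property is that the penalty is concave in the group norm, so evidence for one member of a group lowers the effective threshold for its groupmates.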

Multi-state models can be viewed as generalizations of both the standard and competing risks models for survival data. Models for multi-state data have been the theme of many recent published works. Motivated by bone marrow transplant data, we propose a Bayesian model using the gap times between two successive events in a path of events experienced by a subject. Path specific frailties are introduced to capture the dependence structure of the gap times in the paths with two or more states. Under improper prior distributions for the parameters, we establish propriety of the posterior distribution. An efficient Gibbs sampling algorithm is developed for drawing samples from the posterior distribution. An extensive simulation study is carried out to examine the empirical performance of the proposed approach. A bone marrow transplant data set is analyzed in detail to further demonstrate the proposed methodology.

The test of independence of row and column variables in a contingency table is a widely used statistical test in many areas of application. For complex survey samples, use of the standard Pearson chi-squared test is inappropriate due to correlation among units within the same cluster. Rao and Scott (1981, *Journal of the American Statistical Association* 76, 221–230) proposed an approach in which the standard Pearson chi-squared statistic is multiplied by a design effect to adjust for the complex survey design. Unfortunately, this test fails to exist when one of the observed cell counts equals zero. Even with the large samples typical of many complex surveys, zero cell counts can occur for rare events, small domains, or contingency tables with a large number of cells. Here, we propose Wald and score test statistics for independence based on weighted least squares estimating equations. In contrast to the Rao–Scott test statistic, the proposed Wald and score test statistics always exist. In simulations, the score test is found to perform best with respect to type I error. The proposed method is motivated by, and applied to, post-surgical complications data from the United States’ Nationwide Inpatient Sample (NIS) complex survey of hospitals in 2008.
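
For intuition, a first-order design-effect correction simply rescales the Pearson statistic. The sketch below takes the design effect as a known, user-supplied constant — an assumption, since Rao and Scott estimate it from the survey design:

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for independence in an r x c table
    given as a list of rows of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

def rao_scott_first_order(table, deff):
    """First-order design-effect correction: divide the Pearson statistic
    by an (assumed known) average design effect."""
    return pearson_chi2(table) / deff
```

Note that the corrected statistic inherits the Pearson statistic's behavior with zero cells, which is precisely the problem the article's Wald and score statistics are designed to avoid.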

Multiple longitudinal responses are often collected as a means to capture relevant features of the true outcome of interest, which is often hidden and not directly measurable. We outline an approach which models these multivariate longitudinal responses as generated from a hidden disease process. We propose a class of models which uses a hidden Markov model with separate but correlated random effects between multiple longitudinal responses. This approach was motivated by a smoking cessation clinical trial, where a bivariate longitudinal response involving both a continuous and a binomial response was collected for each participant to monitor smoking behavior. A Bayesian method using Markov chain Monte Carlo is used. Comparison of separate univariate response models to the bivariate response models was undertaken. Our methods are demonstrated on the smoking cessation clinical trial dataset, and properties of our approach are examined through extensive simulation studies.

Infants born preterm or small for gestational age have elevated rates of morbidity and mortality. Using birth certificate records in Texas from 2002 to 2004 and Environmental Protection Agency air pollution estimates, we relate the quantile functions of birth weight and gestational age to ozone exposure and multiple predictors, including parental age, race, and education level. We introduce a semi-parametric Bayesian quantile approach that models the full quantile function rather than just a few quantile levels. Our multilevel quantile function model establishes relationships between birth weight and the predictors separately for each week of gestational age and between gestational age and the predictors separately across Texas Public Health Regions. We permit these relationships to vary nonlinearly across gestational age, spatial domain and quantile level and we unite them in a hierarchical model via a basis expansion on the regression coefficients that preserves interpretability. Very low birth weight is a primary concern, so we leverage extreme value theory to supplement our model in the tail of the distribution. Gestational ages are recorded in completed weeks of gestation (integer-valued), so we present methodology for modeling quantile functions of discrete response data. In a simulation study we show that pooling information across gestational age and quantile level substantially reduces MSE of predictor effects. We find that ozone is negatively associated with the lower tail of gestational age in south Texas and across the distribution of birth weight for high gestational ages. Our methods are available in the R package **BSquare**.

Competing risks arise in the analysis of failure times when there is a distinction between different causes of failure. In many studies, it is difficult to obtain complete cause of failure information for all individuals. Thus, several authors have proposed strategies for semi-parametric modeling of competing risks when some causes of failure are missing under the missing at random (MAR) assumption. As many authors have stressed, while semi-parametric models are convenient, fully-parametric regression modeling of the cause-specific hazards (CSH) and cumulative incidence functions (CIF) may be of interest for prediction and is likely to contribute towards a fuller understanding of the time-dynamics of the competing risks mechanism. We propose a so-called “direct likelihood” approach for fitting fully-parametric regression models for these two functionals under MAR. Because the MAR assumption is not verifiable from the observed data, we propose an approach for performing sensitivity analyses to assess the robustness of inferences to departures from this assumption. The method relies on so-called “pattern-mixture models” from the missing data literature and was evaluated in a simulation study. This sensitivity analysis approach is applicable to various competing risks regression models (fully-parametric or semi-parametric, for the CSH or the CIF). We illustrate the proposed methods with the analysis of a breast cancer clinical trial, including suggestions for ad hoc graphical goodness-of-fit assessments under MAR.

Time-dependent receiver operating characteristic (ROC) curves and their area under the curve (AUC) are important measures to evaluate the prediction accuracy of biomarkers for time-to-event endpoints (e.g., time to disease progression or death). In this article, we propose a direct method to estimate AUC(*t*) as a function of time *t* using a flexible fractional polynomials model, without the middle step of modeling the time-dependent ROC. We develop a pseudo partial-likelihood procedure for parameter estimation and provide a test procedure to compare the predictive performance between biomarkers. We establish the asymptotic properties of the proposed estimator and test statistics. A major advantage of the proposed method is the ease of making inference and comparing prediction accuracy across biomarkers, rendering our method particularly appealing for studies that require comparing and screening a large number of candidate biomarkers. We evaluate the finite-sample performance of the proposed method through simulation studies and illustrate our method in an application to AIDS Clinical Trials Group 175 data.
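
To fix ideas, the empirical cumulative/dynamic AUC(t) in the absence of censoring is just the concordance between cases (event by t) and controls (event-free at t). This simplified sketch ignores the censoring adjustments and the fractional-polynomial modeling that the article's estimator provides:

```python
def cumulative_dynamic_auc(markers, times, t):
    """Empirical cumulative/dynamic AUC(t) without censoring: the
    probability that a randomly chosen case (event time <= t) has a
    higher marker than a randomly chosen control (event time > t).
    Ties in the marker count as 1/2."""
    cases = [m for m, s in zip(markers, times) if s <= t]
    controls = [m for m, s in zip(markers, times) if s > t]
    if not cases or not controls:
        return float('nan')  # AUC(t) undefined without both groups
    conc = 0.0
    for mc in cases:
        for mk in controls:
            conc += 1.0 if mc > mk else (0.5 if mc == mk else 0.0)
    return conc / (len(cases) * len(controls))
```

Evaluating this quantity over a grid of t values traces out the curve that the article models directly as a smooth function of time.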

Pooled analyses integrate data from multiple studies and achieve a larger sample size for enhanced statistical power. When heterogeneity exists in variables’ effects on the outcome across studies, the simple pooling strategy fails to present a fair and complete picture of the effects of heterogeneous variables. Thus, it is important to investigate the homogeneous and heterogeneous structure of variables in pooled studies. In this article, we consider the pooled cohort studies with time-to-event outcomes and propose a penalized Cox partial likelihood approach with adaptively weighted composite penalties on variables’ homogeneous and heterogeneous effects. We show that our method can characterize the variables as having heterogeneous, homogeneous, or null effects, and estimate non-zero effects. The results are readily extended to high-dimensional applications where the number of parameters is larger than the sample size. The proposed selection and estimation procedure can be implemented using the iterative shooting algorithm. We conduct extensive numerical studies to evaluate the performance of our proposed method and demonstrate it using a pooled analysis of gene expression in patients with ovarian cancer.

Multi-state models are often used for modeling complex event history data. In these models the estimation of the transition probabilities is of particular interest, since they allow for long-term predictions of the process. These quantities have been traditionally estimated by the Aalen–Johansen estimator, which is consistent if the process is Markov. Several non-Markov estimators have been proposed in the recent literature, and their superiority with respect to the Aalen–Johansen estimator has been proved in situations in which the Markov condition is strongly violated. However, the existing estimators have the drawback of requiring that the support of the censoring distribution contains the support of the lifetime distribution, which is not often the case. In this article, we propose two new methods for estimating the transition probabilities in the progressive illness-death model. Some asymptotic results are derived. The proposed estimators are consistent regardless of the Markov condition and do not require the aforementioned assumption about the censoring support. We explore the finite sample behavior of the estimators through simulations. The main conclusion of this piece of research is that the proposed estimators are much more efficient than the existing non-Markov estimators in most cases. An application to a clinical trial on colon cancer is included. Extensions to progressive processes beyond the three-state illness-death model are discussed.
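
The Aalen–Johansen estimator that serves as the benchmark here is a product-integral; in discrete time it reduces to a product of one-step transition matrices. A small sketch (the state labeling 0 = healthy, 1 = ill, 2 = dead is illustrative):

```python
def aalen_johansen(increments, n_states=3):
    """Aalen-Johansen estimator as a product-integral: P(s, t) is the
    product of (I + dA(u)) over event times u in (s, t], where each
    dA(u) is the matrix of Nelson-Aalen cumulative-hazard increments
    (off-diagonal entries >= 0, rows summing to zero)."""
    P = [[1.0 if i == j else 0.0 for j in range(n_states)] for i in range(n_states)]
    for dA in increments:
        step = [[(1.0 if i == j else 0.0) + dA[i][j] for j in range(n_states)]
                for i in range(n_states)]
        # matrix product P <- P @ step
        P = [[sum(P[i][k] * step[k][j] for k in range(n_states))
              for j in range(n_states)] for i in range(n_states)]
    return P
```

Each factor is a valid one-step transition matrix, so the rows of the resulting P(s, t) sum to one; the estimator's consistency, however, hinges on the Markov condition that the article's proposals relax.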

We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large *p* small *n* problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
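
The workhorse behind sparse group lasso fits is the group-wise proximal operator; for a single group it is elementwise soft-thresholding followed by group-level shrinkage. A sketch under that standard formulation (the function names are ours):

```python
import math

def soft(x, t):
    """Scalar soft-thresholding: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def prox_sparse_group(beta, lam1, lam2):
    """Proximal operator of the sparse group lasso penalty
    lam1 * ||b||_1 + lam2 * ||b||_2 for a single group: elementwise
    soft-thresholding, then shrinkage of the whole group's norm.
    The group is zeroed out entirely if its thresholded norm <= lam2."""
    z = [soft(b, lam1) for b in beta]
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam2:
        return [0.0] * len(beta)
    scale = 1.0 - lam2 / norm
    return [scale * v for v in z]
```

This two-stage shrinkage is what lets the method remove whole unimportant groups while also zeroing individual coefficients inside groups it keeps.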

This article presents methods and inference for causal estimation in semiparametric transformation models for prevalent survival data. Through the estimation of the transformation models and covariate distribution, we propose a few analytical procedures to estimate the causal survival function. As the data are observational, the unobserved potential outcome (survival time) may be associated with the treatment assignment, and therefore there may exist a systematic imbalance between the data observed from each treatment arm. Further, due to prevalent sampling, subjects are observed only if they have not experienced the failure event when data collection began, causing the prevalent sampling bias. We propose a unified approach, which simultaneously corrects the bias from the prevalent sampling and balances the systematic differences from the observational data. We illustrate in the simulation study that standard analysis without proper adjustment would result in biased causal inference. Large sample properties of the proposed estimation procedures are established by techniques of empirical processes and examined by simulation studies. The proposed methods are applied to the Surveillance, Epidemiology, and End Results (SEER) and Medicare-linked data for women diagnosed with breast cancer.

This work is motivated by a meta-analysis case study on antipsychotic medications. The Michaelis–Menten curve is employed to model the nonlinear relationship between the dose and receptor occupancy across multiple studies. An intraclass correlation coefficient (ICC) is used to quantify the heterogeneity across studies. To interpret the size of the heterogeneity, an accurate estimate of the ICC and its confidence interval is required. The goal is to extend a recently proposed generic beta-approach for constructing confidence intervals on ICCs from linear mixed effects models to nonlinear mixed effects models, using four estimation methods. These estimation methods are maximum likelihood, second-order generalized estimating equations, and two two-step procedures. The beta-approach is compared with a large sample normal approximation (delta method) and bootstrapping. The confidence intervals based on the delta method and the nonparametric percentile bootstrap with various resampling strategies failed in our settings. The beta-approach demonstrates good coverage with both two-step estimation methods and, consequently, it is recommended for the computation of confidence intervals for ICCs in nonlinear mixed effects models for small studies.

Co-Editors: Yi-Hau Chen (1 January 2014–31 December 2016)

Michael J. Daniels (1 January 2015–31 December 2017)

Jeanine Houwing-Duistermaat (1 January 2013–31 December 2015)

Jeremy M.G. Taylor (retired in 2014)

Executive Editor: Marie Davidian (1 January 2006–31 December 2017)

The purpose of inverse probability of treatment (IPT) weighting in estimation of marginal treatment effects is to construct a pseudo-population without imbalances in measured covariates, thus removing the effects of confounding and informative censoring when performing inference. In this article, we formalize the notion of such a pseudo-population as a data generating mechanism with particular characteristics, and show that this leads to a natural Bayesian interpretation of IPT weighted estimation. Using this interpretation, we are able to propose the first fully Bayesian procedure for estimating parameters of marginal structural models using IPT weighting. Our approach suggests that the weights should be derived from the posterior predictive treatment assignment and censoring probabilities, answering the question of whether and how the uncertainty in the estimation of the weights should be incorporated in Bayesian inference of marginal treatment effects. The proposed approach is compared to existing methods in simulated data, and applied to an analysis of the Canadian Co-infection Cohort.
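
The frequentist construction given a Bayesian reading here starts from stabilized weights. A minimal sketch with the propensity scores taken as given (in practice they come from a fitted treatment model, and in the article's proposal from posterior predictive probabilities):

```python
def stabilized_iptw(treated, propensity):
    """Stabilized inverse-probability-of-treatment weights: the marginal
    treatment probability P(A=1) divided by the individual propensity
    score for the treated, with the analogous ratio for the untreated.
    `treated` is a list of 0/1 indicators; `propensity` the fitted
    P(A=1 | covariates) for each subject."""
    p_marg = sum(treated) / len(treated)
    weights = []
    for a, ps in zip(treated, propensity):
        weights.append(p_marg / ps if a == 1 else (1 - p_marg) / (1 - ps))
    return weights
```

Stabilization keeps the weights near one when covariates carry little information about treatment, which tames the variance of the weighted pseudo-population relative to plain 1/ps weights.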

We introduce an extension of R-vine copula models to allow for spatial dependencies and model based prediction at unobserved locations. The proposed spatial R-vine model combines the flexibility of vine copulas with the classical geostatistical idea of modeling spatial dependencies using the distances between the variable locations. In particular, the model is able to capture non-Gaussian spatial dependencies. To develop and illustrate our approach, we consider daily mean temperature data observed at 54 monitoring stations in Germany. We identify relationships between the vine copula parameters and the station distances and exploit these in order to reduce the huge number of parameters needed to parametrize a 54-dimensional R-vine model fitted to the data. The new distance based model parametrization results in a distinct reduction in the number of parameters and makes parameter estimation and prediction at unobserved locations feasible. The prediction capabilities are validated using adequate scoring techniques, showing a better performance of the spatial R-vine copula model compared to a Gaussian spatial model.

During development of a drug, the choice of dose is typically based on a Phase II dose-finding trial, where selected doses are included with placebo. Two common statistical dose-finding methods to analyze such trials are separate comparisons of each dose to placebo (using a multiple comparison procedure) or a model-based strategy (where a dose–response model is fitted to all data). The first approach works best when patients are concentrated on few doses, but cannot draw conclusions about doses that were not tested. Model-based methods allow for interpolation between doses, but their validity depends on the correctness of the assumed dose–response model. Bretz et al. (2005, *Biometrics* 61, 738–748) suggested a combined approach, which selects one or more suitable models from a set of candidate models using a multiple comparison procedure. The method initially requires a priori estimates of any non-linear parameters of the candidate models, so a degree of model misspecification remains possible and one can evaluate only one or a few special cases of a general model. We propose an alternative multiple testing procedure, which evaluates a candidate set of plausible dose–response models against each other to select one final model. The method does not require any a priori parameter estimates and controls the Type I error rate of selecting an overly complex model.

How is the progression of a virus influenced by properties intrinsic to individual cells? We address this question by studying the susceptibility of cells infected with two strains of the human respiratory syncytial virus (RSV-A and RSV-B) in an *in vitro* experiment. Spatial patterns of infected cells give us insight into how local conditions influence susceptibility to the virus. We observe a complicated attraction and repulsion behavior, a tendency for infected cells to lump together or remain apart. We develop a new spatial point process model to describe this behavior. Inference on spatial point processes is difficult because the likelihood functions of these models contain intractable normalizing constants; we adapt an MCMC algorithm called double Metropolis–Hastings to overcome this computational challenge. Our methods are computationally efficient even for large point patterns consisting of over 10,000 points. We illustrate the application of our model and inferential approach to simulated data examples and fit our model to various RSV experiments. Because our model parameters are easy to interpret, we are able to draw meaningful scientific conclusions from the fitted models.

Joint modeling methods have become popular tools to link important features extracted from longitudinal data to a primary event. While most modeling strategies have focused on the association between the longitudinal mean trajectories and risk of an event, we consider joint models that incorporate information from both long-term trends and short-term variability in a longitudinal submodel. We also consider both shared random effect and latent class (LC) approaches in the primary-outcome model to predict a binary outcome of interest. We develop simulation studies to compare and contrast these two modeling strategies; in particular, we study in detail the effects of the primary-outcome model misspecification. Among other findings, we note that when we analyze data generated from a shared random-effects model using a LC model while the information from the longitudinal data is weak, the LC approach is more sensitive to such a model misspecification. Under this setting, the LC model has superior performance in within-sample prediction that cannot be duplicated when predicting new samples. This is a unique feature of the LC approach that, as far as we know, is new to the existing literature. Finally, we use the proposed models to study how follicle stimulating hormone (FSH) trajectories are related to the risk of developing severe hot flashes for participating women in the Penn Ovarian Aging Study.

Data sources with repeated measurements are an appealing resource to understand the relationship between changes in biological markers and risk of a clinical event. While longitudinal data present opportunities to observe changing risk over time, these analyses can be complicated if the measurement of clinical metrics is sparse and/or irregular, making typical statistical methods unsuitable. In this article, we use electronic health record (EHR) data as an example to present an analytic procedure to both create an analytic sample and analyze the data to detect clinically meaningful markers of acute myocardial infarction (MI). Using an EHR from a large national dialysis organization, we abstracted the records of 64,318 individuals and identified 4769 people who had an MI during the study period. We describe a nested case-control design to sample appropriate controls and an analytic approach using regression splines. Fitting a mixed model with truncated power splines, we perform a series of goodness-of-fit tests to determine whether any of 11 regularly collected laboratory markers are useful clinical predictors. We test the clinical utility of each marker using an independent test set. The results suggest that EHR data can be easily used to detect markers of clinically acute events. Special software or analytic tools are not needed, even with irregular EHR data.
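
The truncated power splines used in the mixed model have a particularly simple basis: polynomial terms plus a hinge term per knot. This sketch builds one basis row (the degree and knot placement are illustrative choices):

```python
def truncated_power_basis(x, knots, degree=3):
    """Truncated power basis row for a spline of the given degree:
    [1, x, ..., x**degree] followed by (x - k)_+**degree for each knot k,
    where (.)_+ denotes the positive part."""
    row = [x ** d for d in range(degree + 1)]
    row += [max(x - k, 0.0) ** degree for k in knots]
    return row
```

Stacking these rows over the observed measurement times gives a design matrix that ordinary mixed-model software can fit, which is one reason irregular EHR sampling needs no special tools.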

The availability of cross-platform, large-scale genomic data has enabled the investigation of complex biological relationships for many cancers. Identification of reliable cancer-related biomarkers requires the characterization of multiple interactions across complex genetic networks. MicroRNAs are small non-coding RNAs that regulate gene expression; however, the direct relationship between a microRNA and its target gene is difficult to measure. We propose a novel Bayesian model to identify microRNAs and their target genes that are associated with survival time by incorporating the microRNA regulatory network through prior distributions. We assume that biomarkers involved in regulatory networks are likely associated with survival time. We employ non-local prior distributions and a stochastic search method for the selection of biomarkers associated with the survival outcome. We use KEGG pathway information to incorporate correlated gene effects within regulatory networks. Using simulation studies, we assess the performance of our method, and apply it to experimental data of kidney renal cell carcinoma (KIRC) obtained from The Cancer Genome Atlas. Our novel method validates previously identified cancer biomarkers and identifies biomarkers specific to KIRC progression that were not previously discovered. Using the KIRC data, we confirm that biomarkers involved in regulatory networks are more likely to be associated with survival time, showing connections in one regulatory network for five out of six such genes we identified.

We propose a classification method for longitudinal data. The Bayes classifier is classically used to determine a classification rule, for which the underlying density in each class needs to be well modeled and estimated. This work is motivated by a real dataset of hormone levels measured at the early stages of pregnancy that can be used to predict *normal* versus *abnormal* pregnancy outcomes. The proposed model, which is a semiparametric linear mixed-effects model (SLMM), is a particular case of the semiparametric nonlinear mixed-effects class of models (SNMM) in which finite dimensional (fixed effects and variance components) and infinite dimensional (an unknown function) parameters have to be estimated. In SNMMs, maximum likelihood estimation is performed iteratively, alternating parametric and nonparametric procedures. However, if one can make the assumption that the random effects and the unknown function interact in a linear way, more efficient estimation methods can be used. Our contribution is the proposal of a unified estimation procedure based on a penalized EM-type algorithm. The Expectation and Maximization steps are explicit. In this latter step, the unknown function is estimated in a nonparametric fashion using a lasso-type procedure. A simulation study and an application to real data are performed.

This manuscript considers regression models for generalized, multilevel functional responses: functions are *generalized* in that they follow an exponential family distribution and *multilevel* in that they are clustered within groups or subjects. This data structure is increasingly common across scientific domains and is exemplified by our motivating example, in which binary curves indicating physical activity or inactivity are observed for nearly 600 subjects over 5 days. We use a generalized linear model to incorporate scalar covariates into the mean structure, and decompose subject-specific and subject-day-specific deviations using multilevel functional principal components analysis. Thus, functional fixed effects are estimated while accounting for within-function and within-subject correlations, and major directions of variability within and between subjects are identified. Fixed effect coefficient functions and principal component basis functions are estimated using penalized splines; model parameters are estimated in a Bayesian framework using Stan, a programming language that implements a Hamiltonian Monte Carlo sampler. Simulations designed to mimic the application have good estimation and inferential properties with reasonable computation times for moderate datasets, in both cross-sectional and multilevel scenarios; code is publicly available. In the application we identify effects of age and BMI on the time-specific change in probability of being active over a 24-hour period; in addition, the principal components analysis identifies the patterns of activity that distinguish subjects and days within subjects.

The structural information in high-dimensional transposable data allows us to write the data recorded for each subject in a matrix such that both the rows and the columns correspond to variables of interest. One important problem is to test the null hypothesis that the mean matrix has a particular structure without ignoring the dependence structure among and/or between the row and column variables. To address this, we develop a generic and computationally inexpensive nonparametric testing procedure to assess the hypothesis that, in each predefined subset of columns (rows), the column (row) mean vector remains constant. In simulation studies, the proposed testing procedure seems to have good performance and, unlike simple practical approaches, it preserves the nominal size and remains powerful even if the row and/or column variables are not independent. Finally, we illustrate the use of the proposed methodology via two empirical examples from gene expression microarrays.

Pharmacogenetics investigates the relationship between heritable genetic variation and the variation in how individuals respond to drug therapies. Often, gene–drug interactions play a primary role in this response, and identifying these effects can aid in the development of individualized treatment regimes. Haplotypes can hold key information in understanding the association between genetic variation and drug response. However, the standard approach for haplotype-based association analysis does not directly address the research questions dictated by individualized medicine. A complementary post-hoc analysis is required, and this post-hoc analysis is usually underpowered after adjusting for multiple comparisons and may lead to seemingly contradictory conclusions. In this work, we propose a penalized likelihood approach that is able to overcome the drawbacks of the standard approach and yield the desired personalized output. We demonstrate the utility of our method by applying it to the Scottish Randomized Trial in Ovarian Cancer. We also conducted simulation studies and showed that the proposed penalized method has power comparable to or greater than that of the standard approach and maintains low Type I error rates for both binary and quantitative drug responses. The largest performance gains are seen when the haplotype frequency is low, the differences in effect sizes are small, or the true relationship among the drugs is more complex.

In clinical trials, an intermediate marker measured after randomization can often provide early information about the treatment effect on the final outcome of interest. We explore the use of recurrence time as an auxiliary variable for estimating the treatment effect on overall survival in phase III randomized trials of colon cancer. A multi-state model with an incorporated cured fraction for recurrence is used to jointly model time to recurrence and time to death. We explore different ways in which the information about recurrence time and the assumptions in the model can lead to improved efficiency. Estimates of overall survival and disease-free survival can be derived directly from the model with efficiency gains obtained as compared to Kaplan–Meier estimates. Alternatively, efficiency gains can be achieved by using the model in a weaker way in a multiple imputation procedure, which imputes death times for censored subjects. By using the joint model, recurrence is used as an auxiliary variable in predicting survival times. We demonstrate the potential use of the proposed methods in shortening the length of a trial and reducing sample sizes.

Hidden Markov models (HMMs) are flexible time series models in which the distribution of the observations depends on unobserved serially correlated states. The state-dependent distributions in HMMs are usually taken from some class of parametrically specified distributions. The choice of this class can be difficult, and an unfortunate choice can have serious consequences, for example, for state estimates and, more generally, for the resulting model complexity and interpretation. We illustrate these practical issues in a real data application concerned with vertical speeds of a diving beaked whale, where we demonstrate that parametric approaches can easily lead to overly complex state processes, impeding meaningful biological inference. In contrast, for the dive data, HMMs with nonparametrically estimated state-dependent distributions are much more parsimonious in terms of the number of states and easier to interpret, while fitting the data equally well. Our nonparametric estimation approach is based on the idea of representing the densities of the state-dependent distributions as linear combinations of a large number of standardized B-spline basis functions, imposing a penalty term on non-smoothness in order to maintain a good balance between goodness-of-fit and smoothness.
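As a rough, self-contained sketch of the estimation idea, assuming a sample on [0, 1] and a cubic B-spline basis (the basis dimension, penalty weight, and toy data below are illustrative choices, not the authors' settings), a state-dependent density can be represented as a weighted sum of standardized B-spline basis functions and fitted by penalized maximum likelihood:

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=500)   # toy stand-in for data assigned to one state

# Cubic B-spline basis on [0, 1]; each basis function is standardized to integrate to 1
K, deg = 10, 3
t = np.concatenate((np.zeros(deg), np.linspace(0.0, 1.0, K - deg + 1), np.ones(deg)))
grid = np.linspace(0.0, 1.0, 1001)

def basis(points):
    B = np.stack([BSpline(t, np.eye(K)[k], deg, extrapolate=False)(points)
                  for k in range(K)])
    return np.nan_to_num(B)

Bg = basis(grid)
norm = trapezoid(Bg, grid, axis=1)           # integral of each raw basis function
Bg, Bx = Bg / norm[:, None], basis(x) / norm[:, None]

def neg_pen_loglik(a, lam=5.0):
    w = np.exp(a - a.max()); w /= w.sum()    # weights on the probability simplex
    pen = np.sum(np.diff(a, n=2) ** 2)       # P-spline roughness penalty on log-weights
    return -np.sum(np.log(w @ Bx + 1e-12)) + lam * pen

res = minimize(neg_pen_loglik, np.zeros(K), method="BFGS")
w = np.exp(res.x - res.x.max()); w /= w.sum()
f_hat = w @ Bg                               # estimated density on the grid
```

In an HMM this construction would be embedded in the full likelihood, with one set of basis weights per state; here a single i.i.d. sample stands in for the observations attributed to one state.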

The study of hard-to-reach populations presents significant challenges. Typically, a sampling frame is not available, and population members are difficult to identify or recruit from broader sampling frames. This is especially true of populations at high risk for HIV/AIDS. Respondent-driven sampling (RDS) is often used in such settings with the primary goal of estimating the prevalence of infection. In such populations, the number of people at risk for infection and the number of people infected are of fundamental importance. This article presents a case-study of the estimation of the size of a hard-to-reach population based on data collected through RDS. We study two populations, of female sex workers and of men who have sex with men, in El Salvador. The approach is Bayesian and we consider different forms of prior information, including using the UNAIDS population size guidelines for this region. We show that the method is able to quantify the amount of information on population size available in RDS samples. As separate validation, we compare our results to those estimated by extrapolating from a capture–recapture study of El Salvadorian cities. The results of our case-study are largely comparable to those of the capture–recapture study even when they differ from the UNAIDS guidelines. Our method is widely applicable to data from RDS studies and we provide a software package to facilitate this.

This article develops a Bayesian semiparametric approach to the extended hazard model, with generalization to high-dimensional spatially grouped data. County-level spatial correlation is accommodated marginally through the normal transformation model of Li and Lin (2006, *Journal of the American Statistical Association* **101**, 591–603), using a correlation structure implied by an intrinsic conditionally autoregressive prior. Efficient Markov chain Monte Carlo algorithms are developed, especially applicable to fitting very large, highly censored areal survival data sets. Per-variable tests for proportional hazards, accelerated failure time, and accelerated hazards are efficiently carried out with and without spatial correlation through Bayes factors. The resulting reduced, interpretable spatial models can fit significantly better than a standard additive Cox model with spatial frailties.

Several statistical methods for meta-analysis of diagnostic accuracy studies have been discussed in the presence of a gold standard. However, in practice, the selected reference test may be imperfect due to measurement error, non-existence, invasive nature, or expensive cost of a gold standard. It has been suggested that treating an imperfect reference test as a gold standard can lead to substantial bias in the estimation of diagnostic test accuracy. Recently, two models have been proposed to account for an imperfect reference test, namely, a multivariate generalized linear mixed model (MGLMM) and a hierarchical summary receiver operating characteristic (HSROC) model. Both models are very flexible in accounting for heterogeneity in accuracies of tests across studies as well as the dependence between tests. In this article, we show that these two models, although with different formulations, are closely related and are equivalent in the absence of study-level covariates. Furthermore, we provide the exact relations between the parameters of these two models and assumptions under which the two models can be reduced to equivalent submodels. On the other hand, we show that some submodels of the MGLMM do not have corresponding equivalent submodels of the HSROC model, and vice versa. With three real examples, we illustrate the cases when fitting the MGLMM and HSROC models leads to equivalent submodels and hence identical inference, and the cases when the inferences from the two models are slightly different. Our results generalize the important relations between the bivariate generalized linear mixed model and HSROC model when the reference test is a gold standard.

In the classic discriminant model of two multivariate normal distributions with equal variance matrices, the linear discriminant function is optimal both in terms of the log likelihood ratio and in terms of maximizing the standardized difference (the *t*-statistic) between the means of the two distributions. In a typical case–control study, normality may be sensible for the control sample but heterogeneity and uncertainty in diagnosis may suggest that a more flexible model is needed for the cases. We generalize the *t*-statistic approach by finding the linear function which maximizes a standardized difference but with data from one of the groups (the cases) filtered by a possibly nonlinear function *U*. We study conditions for consistency of the method and find the function *U* which is optimal in the sense of asymptotic efficiency. Optimality may also extend to other measures of discriminatory efficiency such as the area under the receiver operating characteristic curve. The optimal function *U* depends on a scalar probability density function which can be estimated non-parametrically using a standard numerical algorithm. A lasso-like version for variable selection is implemented by adding ℓ₁-regularization to the generalized *t*-statistic. Two microarray data sets in the study of asthma and various cancers are used as motivating examples.
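A minimal sketch of the special case with identity filtering *U*, in which maximizing the standardized difference recovers the classical pooled-covariance linear discriminant (the two-group simulated data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
mu1 = np.r_[1.0, 0.5, np.zeros(p - 2)]          # cases shifted in two coordinates
X0 = rng.standard_normal((300, p))              # controls
X1 = rng.standard_normal((300, p)) + mu1        # cases

def t_direction(X0, X1):
    """Linear function a maximizing the standardized mean difference (the
    t-statistic) between a'X for cases and controls; with identity filtering U,
    this reduces to the pooled-covariance linear discriminant direction."""
    S = (np.cov(X0, rowvar=False) * (len(X0) - 1)
         + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X0) + len(X1) - 2)
    a = np.linalg.solve(S, X1.mean(axis=0) - X0.mean(axis=0))
    return a / np.linalg.norm(a)

a_hat = t_direction(X0, X1)
```

For a nonlinear *U*, the direction and the filtering function would have to be estimated jointly, for example by alternating updates, since the optimal *U* depends on a density that itself depends on the current direction.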

An implementation study is an important tool for deploying state-of-the-art treatments from clinical efficacy studies into a treatment program, with the dual goals of learning about the effectiveness of the treatments and improving the quality of care for patients enrolled in the program. In this article, we deal with the design of a treatment program of dynamic treatment regimens (DTRs) for patients with depression post-acute coronary syndrome. We introduce a novel adaptive randomization scheme for a sequential multiple assignment randomized trial of DTRs. Our approach adapts the randomization probabilities to favor treatment sequences having comparatively superior *Q*-functions used in *Q*-learning. The proposed approach addresses three main concerns of an implementation study: it allows incorporation of historical data or opinions, it includes randomization for learning purposes, and it aims to improve care via adaptation throughout the program. We demonstrate how to apply our method to design a depression treatment program using data from a previous study. By simulation, we illustrate that the inputs from historical data are important for the program performance measured by the expected outcomes of the enrollees, but also show that the adaptive randomization scheme is able to compensate for poorly specified historical inputs by improving patient outcomes within a reasonable horizon. The simulation results also confirm that the proposed design allows efficient learning of the treatments by alleviating the curse of dimensionality.

Acar and Sun (2013, *Biometrics* **69**, 427–435) presented a generalized Kruskal–Wallis (GKW) test for genetic association studies that incorporated the genotype uncertainty and showed its robust and competitive performance compared to existing methods. We present another interesting way to derive the GKW test via a rank linear model.

We propose a novel statistical framework by supplementing case–control data with summary statistics on the population at risk for a subset of risk factors. Our approach is to first form two unbiased estimating equations, one based on the case–control data and the other on both the case data and the summary statistics, and then optimally combine them to derive another estimating equation to be used for the estimation. The proposed method is computationally simple and more efficient than standard approaches based on case–control data alone. We also establish asymptotic properties of the resulting estimator, and investigate its finite-sample performance through simulation. As a substantive application, we apply the proposed method to investigate risk factors for endometrial cancer, by using data from a recently completed population-based case–control study and summary statistics from the Behavioral Risk Factor Surveillance System, the Population Estimates Program of the US Census Bureau, and the Connecticut Department of Transportation.

In diverse fields of empirical research—including many in the biological sciences—attempts are made to decompose the effect of an exposure on an outcome into its effects via a number of different pathways. For example, we may wish to separate the effect of heavy alcohol consumption on systolic blood pressure (SBP) into effects via body mass index (BMI), via gamma-glutamyl transpeptidase (GGT), and via other pathways. Much progress has been made, mainly due to contributions from the field of causal inference, in understanding the precise nature of statistical estimands that capture such intuitive effects, the assumptions under which they can be identified, and statistical methods for doing so. These contributions have focused almost entirely on settings with a single mediator, or a set of mediators considered *en bloc*; in many applications, however, researchers attempt a much more ambitious decomposition into numerous path-specific effects through many mediators. In this article, we give counterfactual definitions of such path-specific estimands in settings with multiple mediators, when earlier mediators may affect later ones, showing that there are many ways in which decomposition can be done. We discuss the strong assumptions under which the effects are identified, suggesting a sensitivity analysis approach when a particular subset of the assumptions cannot be justified. These ideas are illustrated using data on alcohol consumption, SBP, BMI, and GGT from the Izhevsk Family Study. We aim to bridge the gap from “single mediator theory” to “multiple mediator practice,” highlighting the ambitious nature of this endeavor and giving practical suggestions on how to proceed.

Event-history studies of recurrent events are often conducted in fields such as demography, epidemiology, medicine, and social sciences (Cook and Lawless, 2007, *The Statistical Analysis of Recurrent Events*. New York: Springer-Verlag; Zhao et al., 2011, *Test* **20**, 1–42). For such analysis, two types of data have been extensively investigated: recurrent-event data and panel-count data. However, in practice, one may face a third type of data, mixed recurrent-event and panel-count data or mixed event-history data. Such data occur if some study subjects are monitored or observed continuously and thus provide recurrent-event data, while the others are observed only at discrete times and hence give only panel-count data. A more general situation is that each subject is observed continuously over certain time periods but only at discrete times over other time periods. There exists little literature on the analysis of such mixed data except that published by Zhu et al. (2013, *Statistics in Medicine* **32**, 1954–1963). In this article, we consider the regression analysis of mixed data using the additive rate model and develop some estimating equation-based approaches to estimate the regression parameters of interest. Both finite sample and asymptotic properties of the resulting estimators are established, and the numerical studies suggest that the proposed methodology works well for practical situations. The approach is applied to a Childhood Cancer Survivor Study that motivated this study.

Motivated by modern observational studies, we introduce a class of functional models that expand nested and crossed designs. These models account for the natural inheritance of the correlation structures from sampling designs in studies where the fundamental unit is a function or image. Inference is based on functional quadratics and their relationship with the underlying covariance structure of the latent processes. A computationally fast and scalable estimation procedure is developed for high-dimensional data. Methods are used in applications including high-frequency accelerometer data for daily activity, pitch linguistic data for phonetic analysis, and EEG data for studying electrical brain activity during sleep.

In clinical trials, minimum clinically important difference (MCID) has attracted increasing interest as an important supportive clinical and statistical inference tool. Many estimation methods have been developed based on various intuitions, while little theoretical justification has been established. This article proposes a new estimation framework of the MCID using both diagnostic measurements and patient-reported outcomes (PROs). The framework first formulates the population-based MCID as a large margin classification problem, and then extends to the personalized MCID to allow individualized threshold values for patients whose clinical profiles may affect their PRO responses. More importantly, the proposed estimation framework is shown to be asymptotically consistent, and a finite-sample upper bound is established for its prediction accuracy compared against the ideal MCID. The advantage of our proposed method is also demonstrated in a variety of simulated experiments as well as two phase III clinical trials.

Multistate models are used to characterize individuals’ natural histories through diseases with discrete states. Observational data resources based on electronic medical records pose new opportunities for studying such diseases. However, these data consist of observations of the process at discrete sampling times, which may either be pre-scheduled and non-informative, or symptom-driven and informative about an individual's underlying disease status. We have developed a novel joint observation and disease transition model for this setting. The disease process is modeled according to a latent continuous-time Markov chain; and the observation process, according to a Markov-modulated Poisson process with observation rates that depend on the individual's underlying disease status. The disease process is observed at a combination of informative and non-informative sampling times, with possible misclassification error. We demonstrate that the model is computationally tractable and devise an expectation-maximization algorithm for parameter estimation. Using simulated data, we show how estimates from our joint observation and disease transition model lead to less biased and more precise estimates of the disease rate parameters. We apply the model to a study of secondary breast cancer events, utilizing mammography and biopsy records from a sample of women with a history of primary breast cancer.

The effect of a targeted agent on a cancer patient's clinical outcome putatively is mediated through the agent's effect on one or more early biological events. This is motivated by pre-clinical experiments with cells or animals that identify such events, represented by binary or quantitative biomarkers. When evaluating targeted agents in humans, central questions are whether the distribution of a targeted biomarker changes following treatment, the nature and magnitude of this change, and whether it is associated with clinical outcome. Major difficulties in estimating these effects are that a biomarker's distribution may be complex, vary substantially between patients, and have complicated relationships with clinical outcomes. We present a probabilistically coherent framework for modeling and estimation in this setting, including a hierarchical Bayesian nonparametric mixture model for biomarkers that we use to define a functional profile of pre-versus-post-treatment biomarker distribution change. The functional is similar to the receiver operating characteristic used in diagnostic testing. The hierarchical model yields clusters of individual patient biomarker profile functionals, and we use the profile as a covariate in a regression model for clinical outcome. The methodology is illustrated by analysis of a dataset from a clinical trial in prostate cancer using imatinib to target platelet-derived growth factor, with the clinical aim to improve progression-free survival time.

The N-mixture model is widely used to estimate the abundance of a population in the presence of unknown detection probability from only a set of counts subject to spatial and temporal replication (Royle, 2004, *Biometrics* **60**, 105–115). We explain and exploit the equivalence of N-mixture and multivariate Poisson and negative-binomial models, which provides powerful new approaches for fitting these models. We show that particularly when detection probability and the number of sampling occasions are small, infinite estimates of abundance can arise. We propose a sample covariance as a diagnostic for this event, and demonstrate its good performance in the Poisson case. Infinite estimates may be missed in practice, due to numerical optimization procedures terminating at arbitrarily large values. It is shown that the use of a bound, *K*, for an infinite summation in the N-mixture likelihood can result in underestimation of abundance, so that default values of *K* in computer packages should be avoided. Instead we propose a simple automatic way to choose *K*. The methods are illustrated by analysis of data on Hermann's tortoise *Testudo hermanni*.
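To make the truncation issue concrete, the sketch below evaluates the Poisson N-mixture likelihood with the infinite sum over abundance truncated at *K*, on simulated data, and includes a simple stopping rule that raises *K* until the log-likelihood stabilizes (the stopping rule is an illustration of an automatic choice of *K*, not necessarily the authors' exact rule):

```python
import numpy as np
from scipy.stats import poisson, binom

rng = np.random.default_rng(0)
lam_true, p_true, n_sites, T = 5.0, 0.3, 100, 4
N = rng.poisson(lam_true, n_sites)                    # latent site abundances
Y = rng.binomial(N[:, None], p_true, (n_sites, T))    # replicated counts per site

def nmix_negloglik(log_lam, logit_p, Y, K):
    """Negative N-mixture log-likelihood with the infinite sum truncated at K."""
    lam, p = np.exp(log_lam), 1.0 / (1.0 + np.exp(-logit_p))
    nll = 0.0
    for y in Y:
        Ns = np.arange(y.max(), K + 1)                # abundance values summed over
        terms = poisson.pmf(Ns, lam) * np.prod(binom.pmf(y[:, None], Ns, p), axis=0)
        nll -= np.log(terms.sum() + 1e-300)
    return nll

def choose_K(log_lam, logit_p, Y, tol=1e-6):
    """Increase K until the truncated log-likelihood stabilizes."""
    K = int(Y.max()) + 5
    prev = nmix_negloglik(log_lam, logit_p, Y, K)
    while True:
        K += 10
        cur = nmix_negloglik(log_lam, logit_p, Y, K)
        if abs(prev - cur) < tol:
            return K
        prev = cur

K_auto = choose_K(np.log(5.0), np.log(0.3 / 0.7), Y)
```

In practice the stabilization check would be repeated at the fitted parameter values, since the required *K* grows with the estimated abundance; a *K* barely above the largest observed count visibly distorts the likelihood.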

Thanks to the growing interest in personalized medicine, joint modeling of longitudinal marker and time-to-event data has recently started to be used to derive dynamic individual risk predictions. Individual predictions are called dynamic because they are updated when information on the subject's health profile grows with time. We focus in this work on statistical methods for quantifying and comparing dynamic predictive accuracy of this kind of prognostic models, accounting for right censoring and possibly competing events. Dynamic area under the ROC curve (AUC) and Brier Score (BS) are used to quantify predictive accuracy. Nonparametric inverse probability of censoring weighting is used to estimate dynamic curves of AUC and BS as functions of the time at which predictions are made. Asymptotic results are established and both pointwise confidence intervals and simultaneous confidence bands are derived. Tests are also proposed to compare the dynamic prediction accuracy curves of two prognostic models. The finite sample behavior of the inference procedures is assessed via simulations. We apply the proposed methodology to compare various prediction models using repeated measures of two psychometric tests to predict dementia in the elderly, accounting for the competing risk of death. Models are estimated on the French Paquid cohort and predictive accuracies are evaluated and compared on the French Three-City cohort.
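As an illustration of the IPCW idea for a single prediction time, the following sketch estimates the censoring distribution by Kaplan–Meier and computes a weighted Brier score at a fixed horizon (a simplified version without competing risks or dynamic landmark times, using G(T) where G(T−) is the usual convention, and with rough tie handling):

```python
import numpy as np

def censoring_km(time, event):
    """Kaplan-Meier estimator of the censoring survival function G(t);
    a censoring time (event == 0) plays the role of an 'event' here."""
    order = np.argsort(time)
    t, cens = time[order], (event[order] == 0).astype(float)
    at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(1.0 - cens / at_risk)
    def G(s):
        idx = np.searchsorted(t, np.atleast_1d(s), side="right") - 1
        return np.where(idx < 0, 1.0, surv[np.clip(idx, 0, len(t) - 1)])
    return G

def ipcw_brier(time, event, pi, horizon):
    """IPCW Brier score at a fixed horizon for predicted event probabilities pi.
    Subjects censored before the horizon receive weight zero."""
    G = censoring_km(time, event)
    died = (time <= horizon) & (event == 1)    # event observed by the horizon
    alive = time > horizon                     # known event-free at the horizon
    w = np.zeros(len(time))
    w[died] = 1.0 / G(time[died])
    w[alive] = 1.0 / G(horizon)
    return np.mean(w * (died.astype(float) - pi) ** 2)

rng = np.random.default_rng(0)
time = rng.exponential(1.0, size=300)
event = rng.integers(0, 2, size=300)
pi = rng.uniform(0.0, 1.0, size=300)
bs = ipcw_brier(time, event, pi, horizon=1.0)
```

With no censoring the weights are all one and the score reduces to the ordinary Brier score; the dynamic AUC would be estimated with the same weights applied to concordant case-control pairs.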

Multi-site time series studies have reported evidence of an association between short-term exposure to particulate matter (PM) and adverse health effects, but the effect size varies across the United States. Variability in the effect may partially be due to differing community-level exposure and health characteristics, but also due to the chemical composition of PM, which is known to vary greatly by location and time. The objective of this article is to identify particularly harmful components of this chemical mixture. Because of the large number of highly correlated components, we must incorporate some regularization into a statistical model. We assume that, at each spatial location, the regression coefficients come from a mixture model in the spirit of stochastic search variable selection, and we use a copula to share information about variable inclusion and effect magnitude across locations. The model differs from current spatial variable selection techniques by accommodating both local and global variable selection. The model is used to study the association between fine PM (PM2.5) components, measured at 115 counties nationally over the period 2000–2008, and cardiovascular emergency room admissions among Medicare patients.

Recent advances in genomics and biotechnologies have accelerated the development of molecularly targeted treatments and accompanying markers to predict treatment responsiveness. However, it is common at the initiation of a definitive phase III clinical trial that there is no compelling biological basis or early trial data for a candidate marker regarding its capability in predicting treatment effects. In this case, it is reasonable to include all patients as eligible for randomization but to plan for prospective subgroup analysis based on the marker. One analysis plan in such all-comers designs is the so-called fallback approach that first tests for overall treatment efficacy and then proceeds to test in a biomarker-positive subgroup if the first test is not significant. In this approach, owing to the adaptive nature of the analysis and a correlation between the two tests, a bias will arise in estimating the treatment effect in the biomarker-positive subgroup after a non-significant first overall test. In this article, we formulate the bias function and show a difficulty in obtaining unbiased estimators for a whole range of an associated parameter. To address this issue, we propose bias-corrected estimation methods, including those based on an approximation of the bias function under a bounded range of the parameter using polynomials. We also provide an interval estimation method based on a bivariate doubly truncated normal distribution. Simulation experiments demonstrated success in bias reduction. An application to a phase III trial for lung cancer is provided.

Analytically or computationally intractable likelihood functions can arise in complex statistical inferential problems making them inaccessible to standard Bayesian inferential methods. Approximate Bayesian computation (ABC) methods address such inferential problems by replacing direct likelihood evaluations with repeated sampling from the model. ABC methods have been predominantly applied to parameter estimation problems and less to model choice problems due to the added difficulty of handling multiple model spaces. The ABC algorithm proposed here addresses model choice problems by extending Fearnhead and Prangle (2012, *Journal of the Royal Statistical Society, Series B* **74**, 1–28) where the posterior mean of the model parameters estimated through regression formed the summary statistics used in the discrepancy measure. An additional stepwise multinomial logistic regression is performed on the model indicator variable in the regression step and the estimated model probabilities are incorporated into the set of summary statistics for model choice purposes. A reversible jump Markov chain Monte Carlo step is also included in the algorithm to increase model diversity for thorough exploration of the model space. This algorithm was applied to a validating example to demonstrate the robustness of the algorithm across a wide range of true model probabilities. Its subsequent use in three pathogen transmission examples of varying complexity illustrates the utility of the algorithm in inferring preference of particular transmission models for the pathogens.
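A far simpler rejection-sampling version of ABC model choice conveys the core idea (the Poisson-versus-geometric toy example, prior, summaries, and 1% acceptance rate are all illustrative choices, not part of the proposed algorithm, which additionally uses regression-estimated summaries and a reversible jump step):

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.poisson(3.0, size=200)                       # pretend-observed data
s_obs = np.array([obs.mean(), obs.var()])              # observed summary statistics

def simulate_summaries(model, rng, n=200):
    lam = rng.exponential(5.0)                         # shared prior on the mean
    if model == 0:                                     # M0: Poisson (var == mean)
        y = rng.poisson(lam, n)
    else:                                              # M1: geometric on {0, 1, ...}
        y = rng.geometric(1.0 / (1.0 + lam), n) - 1    #     (var == lam * (1 + lam))
    return np.array([y.mean(), y.var()])

M = 20000
models = rng.integers(0, 2, size=M)                    # uniform prior over models
S = np.array([simulate_summaries(m, rng) for m in models])

center = np.median(S, axis=0)
scale = np.median(np.abs(S - center), axis=0)          # robust per-summary scale
d = np.sqrt((((S - s_obs) / scale) ** 2).sum(axis=1))  # scaled Euclidean distance
keep = d <= np.quantile(d, 0.01)                       # accept the closest 1%
post_m1 = models[keep].mean()                          # ABC posterior P(M1 | data)
```

The accepted proportion of each model indicator approximates its posterior probability; since the observed mean and variance nearly coincide, the Poisson model dominates the accepted draws.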

Instrumental variable (IV) methods are popular in non-experimental studies to estimate the causal effects of medical interventions. These approaches allow for the consistent estimation of treatment effects even if important confounding factors are unobserved. Despite the increasing use of these methods, there have been few extensions of IV methods to censored data problems. In this article, we discuss challenges in applying IV techniques to the proportional hazards model and demonstrate the utility of the additive hazards formulation for IV analyses with censored data. Assuming linear structural equation models for the hazard function, we develop a closed-form, two-stage estimator for the causal effect in the additive hazard model. The methods permit both continuous and discrete exposures, and enable the estimation of causal relative survival measures. The asymptotic properties of the estimators are derived and the resulting inferences are shown to perform well in simulation studies and in an application to a data set on the effectiveness of a novel chemotherapeutic agent for colon cancer.

We consider model selection and estimation in a context where there are competing ordinary differential equation (ODE) models, and all the models are special cases of a “full” model. We propose a computationally inexpensive approach that employs statistical estimation of the full model, followed by a combination of a least squares approximation (LSA) and the adaptive Lasso. We show the resulting method, here called the LSA method, to be an (asymptotically) oracle model selection method. The finite sample performance of the proposed LSA method is investigated with Monte Carlo simulations, in which we examine the percentage of selecting true ODE models, the efficiency of the parameter estimation compared to simply using the full and true models, and coverage probabilities of the estimated confidence intervals for ODE parameters, all of which show satisfactory performance. Our method is also demonstrated by selecting, from among several well-known and biologically interpretable ODE models, the best predator–prey ODE for a lynx–hare population dynamics system.
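A sketch of the LSA plus adaptive-Lasso step, using an ordinary least squares fit as a stand-in for the full-model estimate (in the ODE setting, `beta_hat` and its inverse covariance would come from fitting the full ODE system; the fixed penalty level here is illustrative, whereas in practice it would be tuned, for example by an information criterion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "full model": a linear model fit by least squares; only beta_hat and
# V_inv (its estimated inverse covariance) feed into the LSA step.
n, beta_true = 400, np.array([2.0, 0.0, 0.0, 1.5, 0.0])
X = rng.standard_normal((n, beta_true.size))
y = X @ beta_true + rng.standard_normal(n)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - beta_true.size)
V_inv = X.T @ X / sigma2

# LSA criterion: (b - beta_hat)' V_inv (b - beta_hat) + lam * sum_j |b_j| / |beta_hat_j|.
# Rewrite the quadratic as ||A b - z||^2 with A'A = V_inv, then solve the
# weighted (adaptive) lasso by coordinate descent.
L = np.linalg.cholesky(V_inv)
A, z = L.T, L.T @ beta_hat
w, lam = 1.0 / np.abs(beta_hat), 50.0            # adaptive-lasso weights, fixed penalty
b, col_sq = beta_hat.copy(), (A ** 2).sum(axis=0)
for _ in range(200):
    for j in range(b.size):
        rho = A[:, j] @ (z - A @ b + A[:, j] * b[j])
        b[j] = np.sign(rho) * max(abs(rho) - lam * w[j] / 2.0, 0.0) / col_sq[j]
selected = np.flatnonzero(np.abs(b) > 1e-8)      # indices retained by the LSA step
```

Because coefficients with small full-model estimates receive large adaptive weights, the truly null parameters are shrunk exactly to zero, which is what drives the oracle property.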

We present a simple general method for combining two one-sample confidence procedures to obtain inferences in the two-sample problem. Some applications give striking connections to established methods; for example, combining exact binomial confidence procedures gives new confidence intervals on the difference or ratio of proportions that match inferences using Fisher's exact test, and numeric studies show the associated confidence intervals bound the type I error rate. Combining exact one-sample Poisson confidence procedures recreates standard confidence intervals on the ratio, and introduces new ones for the difference. Combining confidence procedures associated with one-sample *t*-tests recreates the Behrens–Fisher intervals. Other applications provide new confidence intervals with fewer assumptions than previously needed. For example, the method creates new confidence intervals on the difference in medians that do not require shift and continuity assumptions. We create a new confidence interval for the difference between two survival distributions at a fixed time point when there is independent censoring by combining the recently developed beta product confidence procedure for each single sample. The resulting interval is designed to guarantee coverage regardless of sample size or censoring distribution, and produces equivalent inferences to Fisher's exact test when there is no censoring. We show theoretically that when combining intervals asymptotically equivalent to normal intervals, our method has asymptotically accurate coverage. Importantly, all situations studied suggest guaranteed nominal coverage for our new interval whenever the original confidence procedures themselves guarantee coverage.
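A Monte Carlo sketch of combining two exact one-sample binomial procedures into a two-sample interval, representing each Clopper–Pearson procedure by its lower and upper beta confidence distributions (a simplified illustration of the melding idea, not the paper's exact construction):

```python
import numpy as np

def melded_binom_diff_ci(x1, n1, x2, n2, alpha=0.05, B=200_000, seed=0):
    """Monte Carlo melding of two exact binomial confidence procedures into a
    confidence interval for p2 - p1 (boundary cases x = 0 or x = n handled
    crudely here by point masses)."""
    rng = np.random.default_rng(seed)
    # Lower and upper confidence distributions of a proportion, as beta laws
    def lower(x, n):
        return rng.beta(x, n - x + 1, B) if x > 0 else np.zeros(B)
    def upper(x, n):
        return rng.beta(x + 1, n - x, B) if x < n else np.ones(B)
    lo = np.quantile(lower(x2, n2) - upper(x1, n1), alpha / 2)
    hi = np.quantile(upper(x2, n2) - lower(x1, n1), 1 - alpha / 2)
    return lo, hi

lo, hi = melded_binom_diff_ci(5, 20, 15, 20)
```

For 5/20 versus 15/20 successes the resulting 95% interval excludes zero, in line with a significant Fisher's exact test for these counts.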

The increasing cost of drug development has raised the demand for surrogate endpoints when evaluating new drugs in clinical trials. However, over the years, it has become clear that surrogate endpoints need to be statistically evaluated and deemed valid before they can be used as substitutes for “true” endpoints in clinical studies. Nowadays, two paradigms, based on causal inference and meta-analysis, dominate the scene. Nonetheless, although the literature emanating from these paradigms is wide, the relationship between them has until now largely been left unexplored. In the present work, we discuss the conceptual framework underlying both approaches and study the relationship between them using theoretical elements and the analysis of a real case study. Furthermore, we show, on the one hand, that the meta-analytic approach can be embedded within a causal-inference framework and, on the other, that it can be heuristically justified why surrogate endpoints successfully evaluated using this approach will often be appealing from a causal-inference perspective as well. A newly developed and user-friendly R package *Surrogate* is provided to carry out the evaluation exercise.

The Northern Humboldt Current System (NHCS) is the world's most productive ecosystem in terms of fish. In particular, the Peruvian anchovy (*Engraulis ringens*) is the major prey of the main top predators, such as seabirds, fish, humans, and other mammals. In this context, it is important to understand the dynamics of the anchovy distribution, both to preserve it and to exploit its economic capacities. Using data collected by the “Instituto del Mar del Perú” (IMARPE) during a scientific survey in 2005, we present a statistical analysis with three main goals: (i) to adapt to the characteristics of the sampled data, such as spatial dependence, a high proportion of zeros, and large sample size; (ii) to provide important insights into the dynamics of the anchovy population; and (iii) to propose a model for the estimation and prediction of anchovy biomass in the NHCS offshore from Perú. These data were analyzed in a Bayesian framework using the integrated nested Laplace approximation (INLA) method. Further, to select the best model and to study the predictive power of each model, we performed model comparisons and predictive checks, respectively. Finally, we carried out a Bayesian spatial influence diagnostic for the preferred model.

In many scientific and engineering applications, covariates are naturally grouped. When group structures are available among covariates, people are usually interested in identifying both important groups and important variables within the selected groups. Among existing successful group variable selection methods, some fail to conduct within-group selection. Others are able to conduct both group and within-group selection, but the corresponding objective functions are non-convex. Such non-convexity may require extra numerical effort. In this article, we propose a novel Log-Exp-Sum (LES) penalty for group variable selection. The LES penalty is strictly convex. It can identify important groups as well as select important variables within a group. We develop an efficient group-level coordinate descent algorithm to fit the model. We also derive non-asymptotic error bounds and asymptotic group selection consistency for our method in the high-dimensional setting where the number of covariates can be much larger than the sample size. Numerical results demonstrate the good performance of our method in both variable selection and prediction. We applied the proposed method to an American Cancer Society breast cancer survivor dataset. The findings are clinically meaningful and may help design intervention programs to improve the quality of life of breast cancer survivors.
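A minimal sketch of a log-exp-sum-style group penalty (the exact parameterization and tuning constants in the article may differ; the form below, λ Σ_g log Σ_{j∈g} exp(|β_j|), is convex because log-sum-exp is convex and nondecreasing in each argument, composed with the convex maps |β_j|):

```python
import math

def les_penalty(beta, groups, lam=1.0):
    """Assumed log-exp-sum form of a group penalty:
        lam * sum over groups g of log( sum_{j in g} exp(|beta_j|) ).
    `groups` is a list of index lists partitioning the coefficients."""
    return lam * sum(
        math.log(sum(math.exp(abs(beta[j])) for j in g)) for g in groups
    )
```

Because each group term grows roughly like the largest |β_j| in the group, the penalty can zero out whole groups while still discriminating among coefficients within a selected group.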

Motivated by field sampling of DNA fragments, we describe a general model for capture–recapture modeling of samples drawn one at a time in continuous time. Our model is based on Poisson sampling where the sampling time may be unobserved. We show that previously described models correspond to partial likelihoods from our Poisson model and their use may be justified through arguments concerning *S*- and Bayes-ancillarity of discarded information. We demonstrate a further link to continuous-time capture–recapture models and explain observations that have been made about this class of models in terms of partial ancillarity. We illustrate application of our models using data from the European badger (*Meles meles*) in which genotyping of DNA fragments was subject to error.

Research in the field of nonparametric shape-constrained regression has been intensive. However, only a few publications explicitly deal with unimodality, although there is a need for such methods in applications, for example, in dose–response analysis. In this article, we propose unimodal spline regression methods that make use of Bernstein–Schoenberg splines and their shape preservation property. To achieve unimodal and smooth solutions we use penalized splines, and extend the penalized spline approach toward penalizing against general parametric functions, instead of using just difference penalties. For tuning parameter selection under a unimodality constraint, a restricted maximum likelihood and an alternative Bayesian approach for unimodal regression are developed. We compare the proposed methodologies to other common approaches in a simulation study and apply them to a dose–response data set. All results suggest that the unimodality constraint, or its combination with a penalty, can substantially improve estimation of the functional relationship.

We develop a linear mixed regression model in which both the response and the predictor are functions. Model parameters are estimated by maximizing the log-likelihood via the ECME algorithm. The estimated variance parameters or covariance matrices are shown to be positive or positive definite, respectively, at each iteration. In simulation studies, the approach outperforms alternative approaches in terms of the fitting error and the MSE of estimating the “regression coefficients.”

Motivated by objective measurements of physical activity, we take a functional data approach to longitudinal data with simultaneous measurement of a continuous and a binary outcome. The regression structures are specified as smooth curves measured at various time points, with random effects that have a hierarchical correlation structure. The random effect curves for each variable are summarized using a few important principal components, and the association of the two longitudinal variables is modeled through the association of the principal component scores. We use penalized splines to model the mean curves and the principal component curves, and cast the proposed model into a mixed effects model framework for model fitting, prediction, and inference. Via a quasilikelihood-type approximation for the binary component, we develop an algorithm to fit the model. Data-based transformation of the continuous variable and selection of the number of principal components are incorporated into the algorithm. The method is applied to the motivating physical activity data and is evaluated empirically by a simulation study. Extensions to different types of outcomes are also discussed.

Studying the interactions between different brain regions is essential to achieve a more complete understanding of brain function. In this article, we focus on identifying functional co-activation patterns and undirected functional networks in neuroimaging studies. We build a functional brain network, using a sparse covariance matrix, with elements representing associations between region-level peak activations. We adopt a penalized likelihood approach to impose sparsity on the covariance matrix based on an extended multivariate Poisson model. We obtain penalized maximum likelihood estimates via the expectation-maximization (EM) algorithm and optimize an associated tuning parameter by maximizing the predictive log-likelihood. Permutation tests on the brain co-activation patterns provide region-pair and network-level inference. Simulations suggest that the proposed approach has minimal bias and that the coverage rate of the covariance estimates is close to the nominal 95% level. Conducting a meta-analysis of 162 functional neuroimaging studies on emotions, our model identifies a functional network that consists of connected regions within the basal ganglia, limbic system, and other emotion-related brain regions. We characterize this network through statistical inference on region-pair connections as well as by graph measures.

We consider the problem of robust estimation of the regression relationship between a response and a covariate based on a sample in which precise measurements on the covariate are not available but error-prone surrogates for the unobserved covariate are available for each sampled unit. Existing methods often make restrictive and unrealistic assumptions about the density of the covariate and the densities of the regression and the measurement errors, for example, normality and, for the latter two, also homoscedasticity and thus independence from the covariate. In this article we describe Bayesian semiparametric methodology based on mixtures of B-splines and mixtures induced by Dirichlet processes that relaxes these restrictive assumptions. In particular, our models for the aforementioned densities adapt to asymmetry, heavy tails, and multimodality. The models for the densities of regression and measurement errors also accommodate conditional heteroscedasticity. In simulation experiments, our method vastly outperforms existing methods. We apply our method to data from nutritional epidemiology.

Mediation analysis is important for understanding the mechanisms whereby one variable causes changes in another. Measurement error could obscure the ability of the potential mediator to explain such changes. This article focuses on developing correction methods for measurement error in the mediator with failure time outcomes. We consider a broad definition of measurement error, including technical error, and error associated with temporal variation. The underlying model with the “true” mediator is assumed to be of the Cox proportional hazards model form. The induced hazard ratio for the observed mediator no longer has a simple form independent of the baseline hazard function, due to the conditioning event. We propose a mean-variance regression calibration approach and a follow-up time regression calibration approach, to approximate the partial likelihood for the induced hazard function. Both methods demonstrate value in assessing mediation effects in simulation studies. These methods are generalized to multiple biomarkers and to both case-cohort and nested case-control sampling designs. We apply these correction methods to the Women's Health Initiative hormone therapy trials to understand the mediation effect of several serum sex hormone measures on the relationship between postmenopausal hormone therapy and breast cancer risk.

Nested case–control sampling is a popular design for large epidemiological cohort studies due to its cost effectiveness. A number of methods have been developed for the estimation of the proportional hazards model with nested case–control data; however, the evaluation of modeling assumptions has received less attention. In this article, we propose a class of goodness-of-fit test statistics for testing the proportional hazards assumption based on nested case–control data. The test statistics are constructed based on asymptotically mean-zero processes derived from Samuelsen's maximum pseudo-likelihood estimation method. In addition, we develop an innovative resampling scheme to approximate the asymptotic distribution of the test statistics while accounting for the dependent sampling scheme of the nested case–control design. Numerical studies are conducted to evaluate the performance of our proposed approach, and an application to the Wilms' Tumor Study is given to illustrate the methodology.

When estimating the effect of an exposure or treatment on an outcome, it is important to select the proper subset of confounding variables to include in the model. Including too many covariates increases the mean squared error of the effect of interest, while omitting confounding variables biases the exposure effect estimate. We propose a decision-theoretic approach to confounder selection and effect estimation. We first estimate the full standard Bayesian regression model and then post-process the posterior distribution with a loss function that penalizes models omitting important confounders. Our method can be fit easily with existing software and, in many situations, without the use of Markov chain Monte Carlo methods, resulting in computation on the order of the least squares solution. We prove that the proposed estimator has attractive asymptotic properties. In a simulation study we show that our method outperforms existing methods. We demonstrate our method by estimating the effect of fine particulate matter (PM) exposure on birth weight in Mecklenburg County, North Carolina.

Variable screening has emerged as a crucial first step in the analysis of high-throughput data, but existing procedures can be computationally cumbersome, difficult to justify theoretically, or inapplicable to certain types of analyses. Motivated by a high-dimensional censored quantile regression problem in multiple myeloma genomics, this article makes three contributions. First, we establish a score test-based screening framework, which is widely applicable, extremely computationally efficient, and relatively simple to justify. Second, we propose a resampling-based procedure for selecting the number of variables to retain after screening according to the principle of reproducibility. Finally, we propose a new iterative score test screening method which is closely related to sparse regression. In simulations we apply our methods to four different regression models and show that they can outperform existing procedures. We also apply score test screening to an analysis of gene expression data from multiple myeloma patients using a censored quantile regression model to identify high-risk genes.
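A sketch of the score-screening idea in the simplest, linear-model case (the article extends the framework to other models, including censored quantile regression; the statistic below is the standard marginal score test under the null model with no covariates):

```python
import numpy as np

def score_screen(X, y, d):
    """Marginal score-test screening sketch for a linear model.
    For each covariate j, T_j = U_j^2 / V_j with U_j = x_j'(y - ybar)
    and V_j = sigma2_hat * x_j'x_j (columns centered); the top d
    covariates by T_j are retained.  Only one pass over the data."""
    Xc = X - X.mean(axis=0)              # center each covariate
    r = y - y.mean()                     # null-model residuals
    sigma2 = r @ r / len(y)              # null variance estimate
    U = Xc.T @ r                         # score vector
    V = sigma2 * (Xc ** 2).sum(axis=0)   # score variances
    T = U ** 2 / V                       # chi-square-type statistics
    return np.argsort(T)[::-1][:d], T
```

Because no model is refit per covariate, the cost is a single matrix-vector product, which is what makes score-based screening so cheap relative to fitting p marginal regressions.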

Recent advances in biotechnology and its wide applications have led to the generation of many high-dimensional gene expression data sets that can be used to address similar biological questions. Meta-analysis plays an important role in summarizing and synthesizing scientific evidence from multiple studies. When the dimensions of the datasets are high, it is desirable to incorporate variable selection into the meta-analysis to improve model interpretation and prediction. To our knowledge, all existing methods conduct variable selection with meta-analyzed data in an “all-in-or-all-out” fashion; that is, a gene is either selected in all studies or not selected in any study. However, because of the data heterogeneity that commonly exists in meta-analyzed data, arising from choices of biospecimen, study population, and measurement sensitivity, it is possible that a gene is important in some studies while unimportant in others. In this article, we propose a novel method called meta-lasso for variable selection with high-dimensional meta-analyzed data. Through a hierarchical decomposition of the regression coefficients, our method not only borrows strength across multiple data sets to boost the power to identify important genes, but also retains selection flexibility among data sets to account for data heterogeneity. We show that our method possesses gene selection consistency; that is, when the sample size of each data set is large, with high probability our method can identify all important genes and remove all unimportant genes. Simulation studies demonstrate the good performance of our method. We applied our meta-lasso method to a meta-analysis of five cardiovascular studies. The analysis results are clinically meaningful.

Integrative genomics offers a promising approach to more powerful genetic association studies. The hope is that combining outcome and genotype data with other types of genomic information can lead to more powerful SNP detection. We present a new association test based on a statistical model that explicitly assumes that genetic variations affect the outcome through perturbing gene expression levels. It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that our method is relatively robust to misspecification. We also provide a strategy for applying our approach to high-dimensional genomic data. We use this strategy to identify a potentially new association between a SNP and a yeast cell's response to the natural product tomatidine, which standard association analysis did not detect.

Treatment-selection markers predict an individual's response to different therapies, thus allowing for the selection of a therapy with the best predicted outcome. A good marker-based treatment-selection rule can significantly impact public health through the reduction of the disease burden in a cost-effective manner. Our goal in this article is to use data from randomized trials to identify optimal linear and nonlinear biomarker combinations for treatment selection that minimize the total burden to the population caused by either the targeted disease or its treatment. We frame this objective as a general problem of minimizing a weighted sum of 0-1 loss and propose a novel penalized minimization method that is based on the difference of convex functions algorithm (DCA). The corresponding estimator of marker combinations has a kernel property that allows flexible modeling of linear and nonlinear marker combinations. We compare the proposed methods with existing methods for optimizing treatment regimens, such as the logistic regression model and the weighted support vector machine. The performance of different weight functions is also investigated. The application of the proposed method is illustrated using a real example from an HIV vaccine trial: we search for a combination of Fc receptor genes for recommending vaccination in preventing HIV infection.

Semi-parametric regression models for the joint estimation of marginal mean and within-cluster pairwise association parameters are used in a variety of settings for population-averaged modeling of multivariate categorical outcomes. Recently, a formulation of alternating logistic regressions based on orthogonalized, marginal residuals has been introduced for correlated binary data. Unlike the original procedure based on conditional residuals, its covariance estimator is invariant to the ordering of observations within clusters. In this article, the orthogonalized residuals method is extended to model correlated ordinal data with a global odds ratio, and shown in a simulation study to be more efficient and less biased with regard to estimating within-cluster association parameters than an existing extension to ordinal data of alternating logistic regressions based on conditional residuals. Orthogonalized residuals are used to estimate a model for three correlated ordinal outcomes measured repeatedly in a longitudinal clinical trial of an intervention to improve recovery of patients' perception of altered sensation following jaw surgery.

Auxiliary covariates are often encountered in biomedical research settings where the primary exposure variable is measured only for a subgroup of study subjects. This article is concerned with generalized linear mixed models in the presence of auxiliary covariate information for clustered data. We propose a novel semiparametric estimation method based on a pairwise likelihood function and develop an estimating equation-based inference procedure by treating both the error structure and random effects as nuisance parameters. This method is robust against misspecification of either the error structure or the random-effects distribution and allows for dependence between random effects and covariates. We show that the resulting estimators are consistent and asymptotically normal. Extensive simulation studies evaluate the finite sample performance of the proposed estimators and demonstrate their advantage over the validation-set-based method and the existing method. We illustrate the method with two real data examples.

A popular way to model overdispersed count data, such as the number of falls reported during intervention studies, is by means of the negative binomial (NB) distribution. Classical estimation methods are well known to be sensitive to model misspecification, which in such intervention studies may take the form of patients falling much more often than the NB regression model expects. We extend in this article two approaches for building robust *M*-estimators of the regression parameters in the class of generalized linear models to the NB distribution. The first approach achieves robustness in the response by applying a bounded function to the Pearson residuals arising in the maximum likelihood estimating equations, while the second achieves robustness by bounding the unscaled deviance components. For both approaches, we explore different choices of bounding functions. Through a unified notation, we show how close these approaches may actually be, as long as the bounding functions are chosen and tuned appropriately, and provide the asymptotic distributions of the resulting estimators. Moreover, we introduce a robust weighted maximum likelihood estimator for the overdispersion parameter, specific to the NB distribution. Simulations under various settings show that, for both approaches, redescending bounding functions yield estimates with smaller biases under contamination while keeping high efficiency at the assumed model. We present an application to a recent randomized controlled trial measuring the effectiveness of an exercise program at reducing the number of falls among people suffering from Parkinson's disease, to illustrate the diagnostic use of such robust procedures and their value for reliable inference.
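As an illustrative sketch of the first approach (the Huber bounding function and the tuning constant 1.345 below are conventional robust-statistics choices, not necessarily the ones tuned in the article), the NB Pearson residual of an outlying count is clipped before it enters the estimating equation:

```python
import math

def nb_pearson_residual(y, mu, theta):
    """Pearson residual for an NB model with Var(Y) = mu + mu^2 / theta."""
    return (y - mu) / math.sqrt(mu + mu ** 2 / theta)

def huber_psi(r, c=1.345):
    """Huber's bounding function: identity on [-c, c], clipped outside,
    so a single extreme count has bounded influence on the estimate."""
    return max(-c, min(c, r))
```

For example, a patient with 50 falls against a fitted mean of 5 has a Pearson residual above 10, yet contributes only ±c to the robust estimating equation instead of dominating it.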

Complex computer models play a crucial role in air quality research. These models are used to evaluate potential regulatory impacts of emission control strategies and to estimate air quality in areas without monitoring data. For both of these purposes, it is important to calibrate model output with monitoring data to adjust for model biases and improve spatial prediction. In this article, we propose a new spectral method to study and exploit complex relationships between model output and monitoring data. Spectral methods allow us to estimate the relationship between model output and monitoring data separately at different spatial scales, and to use model output for prediction only at the appropriate scales. The proposed method is computationally efficient and can be implemented using standard software. We apply the method to compare Community Multiscale Air Quality (CMAQ) model output with ozone measurements in the United States in July 2005. We find that CMAQ captures large-scale spatial trends, but has low correlation with the monitoring data at small spatial scales.

There has been a great deal of work on fitting Ising models to multivariate binary data in order to understand the conditional dependency relationships between the variables. However, additional covariates are frequently recorded together with the binary data and may influence the dependence relationships. Motivated by such a dataset on genomic instability collected from tumor samples of several types, we propose a sparse covariate-dependent Ising model to study both the conditional dependency within the binary data and its relationship with the additional covariates. This results in subject-specific Ising models, where the subject's covariates influence the strength of association between the genes. As in all exploratory data analysis, interpretability of results is important, and we use penalties to induce sparsity in the fitted graphs and in the number of selected covariates. Two algorithms to fit the model are proposed and compared on a set of simulated data, and asymptotic results are established. The results on the tumor dataset and their biological significance are discussed in detail.

In observational microarray studies, issues of confounding invariably arise. One approach to account for measured confounders is to include them as covariates in a multivariate linear model. For this model, however, the application of permutation-based multiple testing procedures is problematic because exchangeability of responses, in general, does not hold. Nevertheless, it is possible to achieve rotatability of transformed responses at the cost of a distributional assumption. We argue that rotation-based multiple testing, by allowing for adjustments for confounding, represents an important extension of permutation-based multiple testing procedures. The proposed methodology is illustrated with a microarray observational study on breast cancer tumors. Software to perform the procedure described in this article is available in the flip R package.

We investigate a model for abundance estimation in closed-population capture–recapture studies, where animals are identified from natural marks such as DNA profiles or photographs of distinctive individual features. The model extends the classical model to accommodate errors in identification, by specifying that each sample identification is correct with some probability and false otherwise. Information about misidentification is gained from a surplus of capture histories with only one entry, which arise from false identifications. We derive an exact closed-form expression for the likelihood of this model and show that it can be computed efficiently, in contrast to previous studies, which have held the likelihood to be computationally intractable. Our fast computation enables us to conduct a thorough investigation of the statistical properties of the maximum likelihood estimates. We find that this indirect approach to error estimation places high demands on data richness, and good statistical properties in terms of precision and bias require high capture probabilities or many capture occasions. When these requirements are not met, abundance is estimated with very low precision and negative bias, and at the extreme better properties can be obtained by the naive approach of ignoring misidentification error. We recommend that the model be used with caution and that other strategies for handling misidentification error be considered. We illustrate our study with genetic and photographic surveys of the New Zealand population of southern right whales (*Eubalaena australis*).

In the context of state-space modeling, conventional usage of the deviance information criterion (DIC) evaluates the ability of the model to predict an observation at time *t* given the underlying state at time *t*. Motivated by the failure of conventional DIC to clearly choose between competing multivariate nonlinear Bayesian state-space models for coho salmon population dynamics, and by the computational challenge of alternatives, this work proposes a one-step-ahead DIC, in which prediction is conditional on the state at the previous time point. Simulations revealed that the one-step-ahead criterion worked well for choosing between state-space models with different process or observation equations. In contrast, conventional DIC could be grossly misleading, with a strong preference for the wrong model. This can be explained by its failure to account for inflated estimates of process error arising from the model misspecification. The one-step-ahead DIC is not based on a true conditional likelihood, but is shown to have an interpretation as a pseudo-DIC in which the compensatory behavior of the inflated process errors is eliminated. It can be easily calculated using the DIC monitors within popular BUGS software when the process and observation equations are conjugate. The improved performance of the one-step-ahead DIC is demonstrated by application to the multi-stage modeling of coho salmon abundance in Lobster Creek, Oregon.

There has been an increasing interest in the analysis of spatially distributed multivariate binary data, motivated by a wide range of research problems. Two types of correlation are usually involved: the correlation between the multiple outcomes at one location, and the spatial correlation between locations for one particular outcome. The commonly used regression models consider only one type of correlation while ignoring, or inappropriately modeling, the other. To address this limitation, we adopt a Bayesian nonparametric approach to jointly modeling multivariate spatial binary data by integrating both types of correlation. A multivariate probit model is employed to link the binary outcomes to Gaussian latent variables, and Gaussian processes are applied to specify the spatially correlated random effects. We develop an efficient Markov chain Monte Carlo algorithm for the posterior computation. We illustrate the proposed model on simulation studies and a multidrug-resistant tuberculosis case study.

A Bayesian approach to the prediction of occurred-but-not-yet-reported events is developed for application in real-time public health surveillance. The motivation was the prediction of the daily number of hospitalizations for the hemolytic-uremic syndrome during the large May–July 2011 outbreak of Shiga toxin-producing *Escherichia coli* (STEC) O104:H4 in Germany. Our novel Bayesian approach addresses the count data nature of the problem using negative binomial sampling and shows that right-truncation of the reporting delay distribution under an assumption of time-homogeneity can be handled in a conjugate prior-posterior framework using the generalized Dirichlet distribution. Since, in retrospect, the true number of hospitalizations is available, proper scoring rules for count data are used to evaluate and compare the predictive quality of the procedures during the outbreak. The results show that it is important to take the count nature of the time series into account and that changes in the delay distribution occurred due to intervention measures. As a consequence, we extend the Bayesian analysis to a hierarchical model, which combines a discrete time survival regression model for the delay distribution with a penalized spline for the dynamics of the epidemic curve. Altogether, we conclude that in emerging and time-critical outbreaks, nowcasting approaches are a valuable tool to gain information about current trends.
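The core right-truncation correction behind nowcasting can be sketched as follows (a deterministic plug-in version under an assumed, time-homogeneous reporting-delay distribution; the article's negative binomial hierarchical model adds proper uncertainty quantification and time-varying delays):

```python
def simple_nowcast(reported_by_delay, delay_probs):
    """Plug-in nowcast of the eventual event count for one target day.

    reported_by_delay[d] = events from the target day reported with
    delay d (only delays 0..D_obs are observable so far); delay_probs
    is the assumed reporting-delay distribution.  The observed total
    is inflated by the probability of having been reported by now,
    which is the right-truncation correction."""
    d_obs = len(reported_by_delay)
    frac_reported = sum(delay_probs[:d_obs])
    return sum(reported_by_delay) / frac_reported
```

For example, if half of all cases are typically reported on the same day and 10 hospitalizations are already known for today, the plug-in nowcast for today's eventual total is 20.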

In many biomedical studies, patients may experience the same type of recurrent event repeatedly over time, such as bleeding, multiple infections, and disease. In this article, we propose a Bayesian design for a pivotal clinical trial in which lower-risk myelodysplastic syndromes (MDS) patients are treated with MDS disease-modifying therapies. One of the key study objectives is to demonstrate the effect of the investigational product (treatment) on the reduction of platelet transfusion and bleeding events while patients receive MDS therapies. In this context, we propose a new Bayesian approach for the design of superiority clinical trials using recurrent events frailty regression models. Historical recurrent events data from an already completed phase 2 trial are incorporated into the Bayesian design via the partial borrowing power prior of Ibrahim et al. (2012, *Biometrics* **68**, 578–586). An efficient Gibbs sampling algorithm, a predictive data generation algorithm, and a simulation-based algorithm are developed for sampling from the fitted posterior distribution, generating the predictive recurrent events data, and computing various design quantities such as the type I error rate and power, respectively. An extensive simulation study is conducted to compare the proposed method to existing frequentist methods and to investigate various operating characteristics of the proposed design.

We address estimation of intervention effects in experimental designs in which (a) interventions are assigned at the cluster level; (b) clusters are selected to form pairs, matched on observed characteristics; and (c) the intervention is assigned to one cluster at random within each pair. One goal of policy interest is to estimate the average outcome if all clusters in all pairs are assigned to control versus if all clusters in all pairs are assigned to intervention. In such designs, inference that ignores individual-level covariates can be imprecise because cluster-level assignment can leave substantial imbalance in the covariate distribution between experimental arms within each pair. However, most existing methods that adjust for covariates have estimands that are not of policy interest. We propose a methodology that explicitly balances the observed covariates among clusters in a pair and retains the original estimand of interest. We demonstrate our approach through an evaluation of the Guided Care program.
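As a point of reference, the unadjusted estimator that the covariate-balancing methodology improves upon is simply the mean of within-pair differences in cluster-level outcomes. A minimal sketch with a hypothetical data layout (the paper's contribution is precisely to do better than this while keeping the same estimand):

```python
def pair_difference_estimate(pairs):
    """Unadjusted intervention-effect estimate from a paired cluster design.

    pairs : list of (treated_cluster_mean, control_cluster_mean) tuples,
            one per matched pair (hypothetical layout).
    Returns the average of the within-pair treated-minus-control differences.
    """
    diffs = [treated - control for treated, control in pairs]
    return sum(diffs) / len(diffs)
```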

Historical information is always relevant for clinical trial design. Additionally, if incorporated in the analysis of a new trial, historical data allow the number of subjects to be reduced. This decreases costs and trial duration, facilitates recruitment, and may be more ethical. Yet, under prior-data conflict, an overly optimistic use of historical data may be inappropriate. We address this challenge by deriving a Bayesian meta-analytic-predictive prior from historical data, which is then combined with the new data. This prospective approach is equivalent to a meta-analytic-combined analysis of historical and new data if parameters are exchangeable across trials. The prospective Bayesian version requires a good approximation of the meta-analytic-predictive prior, which is not available analytically. We propose two- or three-component mixtures of standard priors, which allow for good approximations and, for the one-parameter exponential family, straightforward posterior calculations. Moreover, since one of the mixture components is usually vague, mixture priors will often be heavy-tailed and therefore robust. Further robustness and a more rapid reaction to prior-data conflicts can be achieved by adding an extra weakly informative mixture component. Use of historical prior information is particularly attractive for adaptive trials, as the randomization ratio can then be changed in case of prior-data conflict. Both frequentist operating characteristics and posterior summaries for various data scenarios show that these designs have desirable properties. We illustrate the methodology for a phase II proof-of-concept trial with historical controls from four studies. Robust meta-analytic-predictive priors alleviate prior-data conflicts; they should encourage better and more frequent use of historical data in clinical trials.
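For a binomial endpoint, the conjugacy that makes mixture priors convenient is easy to sketch: a Beta-mixture prior stays a Beta mixture after updating, with each component's weight re-weighted by its marginal likelihood, so a vague component automatically gains weight under prior-data conflict. A minimal sketch (component values in the example are hypothetical, not fitted meta-analytic-predictive priors):

```python
from math import lgamma, log, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def update_beta_mixture(components, successes, n):
    """Posterior of a Beta-mixture prior after observing binomial data.

    components : list of (weight, a, b) with weights summing to 1.
    Returns the updated list of (weight, a, b). The binomial coefficient
    is common to all components and cancels in the weight update.
    """
    failures = n - successes
    # unnormalized log posterior weights: prior weight * marginal likelihood
    log_w = [log(w) + log_beta(a + successes, b + failures) - log_beta(a, b)
             for w, a, b in components]
    m = max(log_w)                       # stabilize before exponentiating
    w_new = [exp(lw - m) for lw in log_w]
    total = sum(w_new)
    return [(w / total, a + successes, b + failures)
            for w, (_, a, b) in zip(w_new, components)]
```

With an informative Beta(10, 10) component at weight 0.8 and a vague Beta(1, 1) component at weight 0.2, conflicting data such as 9 successes in 10 trials shift substantial weight onto the vague component, which is the robustness mechanism described above.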

Receiver operating characteristic (ROC) analysis is widely used to evaluate the performance of diagnostic tests with continuous or ordinal responses. A popular study design for assessing the accuracy of diagnostic tests involves multiple readers interpreting multiple diagnostic test results, called the multi-reader, multi-test design. Although several different approaches to analyzing data from this design exist, few have addressed sample size and power issues. In this article, we develop a power formula to compare the correlated areas under the ROC curves (AUCs) in a multi-reader, multi-test design. We present a nonparametric approach to estimating and comparing the correlated AUCs by extending the approach of DeLong et al. (1988, *Biometrics* **44**, 837–845). A power formula is derived based on the asymptotic distribution of the nonparametric AUCs. Simulation studies are conducted to demonstrate the performance of the proposed power formula, and an example is provided to illustrate the proposed procedure.
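The nonparametric AUC at the core of DeLong et al.'s approach is the Mann-Whitney statistic, and a normal approximation then turns a standardized AUC difference into power. A sketch in which `se_diff` stands in for the correlated-variance term that the actual formula derives (the reader/test correlation structure is the hard part and is not reproduced here):

```python
from statistics import NormalDist

def auc_nonparametric(diseased, healthy):
    """Mann-Whitney estimate of the area under the ROC curve:
    P(X > Y) + 0.5 * P(X == Y) over all diseased/healthy score pairs."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in diseased for y in healthy)
    return wins / (len(diseased) * len(healthy))

def power_two_sided(delta, se_diff, alpha=0.05):
    """Normal-approximation power to detect an AUC difference `delta`,
    given an externally computed standard error of the difference."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    cdf = NormalDist().cdf
    return cdf(delta / se_diff - z) + cdf(-delta / se_diff - z)
```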

The performance of diagnostic tests for disease classification is often measured by accuracy (e.g., sensitivity or specificity); however, costs of the diagnostic tests are a concern as well. Combinations of multiple diagnostic tests may improve accuracy, but incur additional costs. Here, we consider serial testing approaches that maintain accuracy while controlling the costs of the diagnostic tests. We present a serial risk score classification approach. The basic idea is to test sequentially with additional diagnostic tests just until persons are classified. In this way, it is not necessary to test all persons with all tests. The methods are studied in simulations and compared with logistic regression. We applied the methods to data from HIV cohort studies to identify HIV-infected individuals who were recently infected (1 year) by testing with assays for multiple biomarkers. We find that the serial risk score classification approach can maintain accuracy while achieving a reduction in cost compared to testing all individuals with all assays.
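The stopping logic can be sketched as a running risk score updated one assay at a time, with classification as soon as the score clears a threshold; only ambiguous subjects incur the cost of later assays. Labels, thresholds, and score increments below are hypothetical placeholders, not the fitted risk scores of the actual method.

```python
def serial_risk_classify(score_increments, costs, lower, upper):
    """Classify one subject by testing sequentially.

    score_increments : per-assay contributions to the risk score, in test
                       order (hypothetical; e.g., log-likelihood-ratio-style
                       terms that may be positive or negative)
    costs            : cost of each assay, in the same order
    lower, upper     : stop-and-classify thresholds on the running score
    Returns (label, total_cost). Subjects still between the thresholds
    after the last assay are classified by the sign of the final score.
    """
    score = total_cost = 0.0
    for inc, cost in zip(score_increments, costs):
        total_cost += cost          # pay for this assay only if it is run
        score += inc
        if score >= upper:
            return "recent", total_cost
        if score <= lower:
            return "non-recent", total_cost
    return ("recent" if score >= 0.0 else "non-recent"), total_cost
```

Ordering assays from cheap to expensive, as in this sketch, is what yields the cost savings relative to testing everyone with every assay.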

In some longitudinal studies the initiation time of the process is not clearly defined, yet it is important to draw inferences or make predictions about the longitudinal process. The application of interest in this article is to provide a framework for modeling individualized labor curves (longitudinal cervical dilation measurements) where the start of labor is not clearly defined. This is a well-known problem in obstetrics, where the benchmark reference time is often chosen as the end of the process (individuals are fully dilated at 10 cm) and time is run backwards. This approach results in valid and efficient inference unless subjects are censored before the end of the process, or when the focus is on prediction. Providing dynamic individualized predictions of the longitudinal labor curve prospectively (where backwards time is unknown) is of interest to help obstetricians determine whether a labor is on a suitable trajectory. We propose a random-effects model for longitudinal labor dilation with an unknown time-zero and a random change point. We present a maximum likelihood approach for parameter estimation that uses adaptive Gaussian quadrature for the numerical integration. Further, we propose a Monte Carlo approach for dynamic prediction of the future longitudinal dilation trajectory from past dilation measurements. The methodology is illustrated with longitudinal cervical dilation data from the Consortium on Safe Labor Study.
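One way to picture the mean structure is a broken-stick curve with an unknown time-zero and a change point: flat before labor onset, one slope until the change point, and a steeper slope afterward. The parameterization below is an illustrative assumption (the actual model places random effects on these quantities and integrates over them for prediction):

```python
def dilation_mean(t, t0, tau, slope1, slope2, baseline):
    """Piecewise-linear (broken-stick) mean dilation at clock time t.

    t0       : unknown labor onset (time-zero); no change before it
    tau      : change point (tau > t0), where the slope steepens
    slope1   : dilation rate on (t0, tau]; slope2 : rate after tau
    baseline : dilation level before onset (hypothetical parameterization)
    """
    if t <= t0:
        return baseline
    if t <= tau:
        return baseline + slope1 * (t - t0)
    return baseline + slope1 * (tau - t0) + slope2 * (t - tau)
```

Dynamic prediction then amounts to drawing (t0, tau, slopes) from their distribution given a subject's past dilation measurements and evaluating curves like this one forward in time.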