Group testing, where individuals are tested initially in pools, is widely used to screen large numbers of individuals for rare diseases. Triggered by the recent development of assays that detect multiple infections at once, screening programs now involve testing individuals in pools for multiple infections simultaneously. Tebbs, McMahan, and Bilder (2013, *Biometrics*) recently evaluated the performance of a two-stage hierarchical algorithm used to screen for chlamydia and gonorrhea as part of the Infertility Prevention Project in the United States. In this article, we generalize this work to accommodate a larger number of stages. To derive the operating characteristics of higher-stage hierarchical algorithms with more than one infection, we view the pool decoding process as a time-inhomogeneous, finite-state Markov chain. This conceptualization enables us to derive closed-form expressions for the expected number of tests and classification accuracy rates in terms of transition probability matrices. When applied to chlamydia and gonorrhea testing data from four states (Region X of the United States Department of Health and Human Services), higher-stage hierarchical algorithms provide, on average, an estimated 11% reduction in the number of tests when compared to two-stage algorithms. For applications with rarer infections, we show theoretically that this percentage reduction can be much larger.
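The gain from extra stages can be seen already in a one-infection, perfect-assay sketch (a simplification; the article's transition-matrix machinery handles multiple infections and assay error), comparing expected tests per individual under two- and three-stage hierarchies:

```python
def two_stage_tests_per_person(p, n):
    """Expected tests per individual for two-stage (Dorfman) testing:
    one pool test, plus n individual retests if the pool is positive."""
    q = 1.0 - p
    return 1.0 / n + (1.0 - q ** n)

def three_stage_tests_per_person(p, n1, k):
    """Three-stage hierarchy: master pool of size n1, split into k subpools
    of size n1/k when positive, then individual retests of positive subpools.
    Assumes n1 is divisible by k and a perfect assay."""
    q = 1.0 - p
    n2 = n1 // k
    # master pool test
    # + k subpool tests, run iff the master pool is positive
    # + individual tests, run iff the subject's subpool is positive
    return (1.0 + k * (1.0 - q ** n1) + n1 * (1.0 - q ** n2)) / n1

p = 0.005  # rare infection
t2 = two_stage_tests_per_person(p, 11)
t3 = three_stage_tests_per_person(p, 36, 6)
print(f"two-stage: {t2:.3f}, three-stage: {t3:.3f} tests per person")
```

For p = 0.005 this gives roughly 0.14 tests per person for the two-stage design versus roughly 0.08 for three stages, illustrating why additional stages pay off as infections become rarer.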

In many scientific and engineering fields, advanced experimental and computing technologies are producing data that are not just high dimensional, but also internally structured. For instance, statistical units may have heterogeneous origins from distinct studies or subpopulations, and features may be naturally partitioned based on the experimental platforms generating them, or on information available about their roles in a given phenomenon. In a regression analysis, exploiting this known structure in the predictor dimension reduction stage that precedes modeling can be an effective way to integrate diverse data. To pursue this, we propose a novel sufficient dimension reduction (SDR) approach that we call *structured Ordinary Least Squares* (sOLS). It combines ideas from the existing SDR literature to merge reductions performed within groups of samples and/or predictors. In particular, it leads to a version of OLS for grouped predictors that requires far less computation than recently proposed groupwise SDR procedures, and provides an informal yet effective variable selection tool in these settings. We demonstrate the performance of sOLS by simulation and present a first application to genomic data. The R package “sSDR,” publicly available on CRAN, includes all procedures necessary to implement the sOLS approach.

Bivariate survival data arise frequently in familial association studies of chronic disease onset, as well as in clinical trials and observational studies with multiple time-to-event endpoints. The association between two event times is often scientifically important. In this article, we examine the association via a novel quantile association measure, which describes the dynamic association as a function of the quantile levels. The quantile association measure is free of the marginal distributions, allowing direct evaluation of the underlying association pattern at different locations of the event times. We propose a nonparametric estimator for the quantile association measure, as well as a semiparametric estimator that is superior in smoothness and efficiency. The proposed methods possess desirable asymptotic properties including uniform consistency and root-n convergence. They demonstrate satisfactory numerical performance under a range of dependence structures. An application of our methods suggests interesting association patterns between time to myocardial infarction and time to stroke in an atherosclerosis study.
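One simple margin-free summary of dependence at a pair of quantile levels, based on the empirical copula, can be sketched as follows (an odds-ratio-type illustration of the general idea, not necessarily the exact measure proposed in the article):

```python
import random

def empirical_copula(x, y, u, v):
    """C_n(u, v): proportion of points whose values fall at or below the
    empirical (u, v) quantiles in each margin."""
    n = len(x)
    xs, ys = sorted(x), sorted(y)
    qx, qy = xs[int(u * n) - 1], ys[int(v * n) - 1]
    return sum(1 for a, b in zip(x, y) if a <= qx and b <= qy) / n

def quantile_odds_ratio(x, y, u, v):
    """Odds-ratio-type association at quantile levels (u, v); it equals 1
    under independence and exceeds 1 under positive dependence."""
    c = empirical_copula(x, y, u, v)
    return (c * (1 - u - v + c)) / ((u - c) * (v - c))

random.seed(1)
# positively dependent pair built from a shared latent component
z = [random.gauss(0, 1) for _ in range(4000)]
x = [zi + 0.8 * random.gauss(0, 1) for zi in z]
y = [zi + 0.8 * random.gauss(0, 1) for zi in z]
print(f"odds ratio at the medians: {quantile_odds_ratio(x, y, 0.5, 0.5):.2f}")
```

Because the computation uses only ranks, the summary is invariant to monotone transformations of either margin, which is the sense in which such measures are "free of marginal distributions."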

Model diagnosis, an important issue in statistical modeling, has not yet been addressed adequately for cure models. We focus on mixture cure models in this work and propose residual-based methods to examine the fit of the mixture cure model, particularly the fit of its latency part. The new methods extend classical residual-based methods to the mixture cure model. Numerical work shows that the proposed methods are capable of detecting lack of fit in a mixture cure model, particularly in the latency part, arising from outliers, an improper covariate functional form, or nonproportional hazards when the proportional hazards assumption is employed in the latency part. The methods are illustrated with two real data sets that were previously analyzed with mixture cure models.

Estimating the accuracy of a biomarker index when only imperfect reference-test information is available is usually performed under the assumption of conditional independence between the biomarker and the imperfect reference-test values. We propose to define a latent, normally distributed tolerance variable underlying the observed dichotomous imperfect reference-test results. Subsequently, we construct a Bayesian latent-class model based on the joint multivariate normal distribution of the latent tolerance and biomarker values, conditional on latent true disease status, which allows accounting for conditional dependence. The accuracy of the continuous biomarker index is quantified by the AUC of the optimal linear biomarker combination. Model performance is evaluated using a simulation study and two sets of data on Alzheimer's disease patients (one from the memory-clinic-based Amsterdam Dementia Cohort and one from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database). Simulation results indicate adequate model performance, and show that estimates of the diagnostic accuracy measures are biased when the conditional independence assumption is imposed but does not in fact hold. In the considered case studies, conditional dependence between some of the biomarkers and the imperfect reference test is detected. However, making the conditional independence assumption does not lead to any marked differences in the estimates of diagnostic accuracy.

The sequential multiple assignment randomized trial (SMART) is a powerful design for studying dynamic treatment regimes (DTRs) that allows causal comparisons of DTRs. To handle practical challenges of SMART, we propose a SMART with Enrichment (SMARTER) design, which performs stage-wise enrichment for SMART. SMARTER can improve design efficiency, shorten the recruitment period, and partially reduce trial duration, making SMART more practical with limited time and resources. Specifically, at each subsequent stage of a SMART, we enrich the study sample with new patients who have received the previous stages' treatments in a naturalistic fashion without randomization, and randomize them only among the current-stage treatment options. One extreme case of SMARTER synthesizes separate independent single-stage randomized trials with patients who have received previous-stage treatments. We show that data from SMARTER allow unbiased estimation of DTRs, as SMART does, under certain assumptions. Furthermore, we show analytically that the efficiency gain of the new design over SMART can be significant, especially when the dropout rate is high. Lastly, extensive simulation studies demonstrate the performance of the SMARTER design, and sample size estimation in a scenario informed by real data from a SMART study is presented.

This article proposes a method to address the problem that can arise when covariates in a regression setting are not Gaussian, which may give rise to approximately mixture-distributed errors, or when a true mixture of regressions produced the data. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with the help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot which yields a consistent estimator of the set of active covariates in the model. By simulations, we demonstrate that the new procedure can substantially improve the performance of existing procedures in the context of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice.

Standard false discovery rate (FDR) procedures can provide misleading inference when testing multiple null hypotheses with heterogeneous multinomial data. For example, in the motivating study the goal is to identify species of bacteria near the roots of wheat plants (rhizobacteria) that are moderately or strongly associated with productivity. However, standard procedures discover the most abundant species even when their association is weak and fail to discover many moderate and strong associations when the species are not abundant. This article provides a new FDR-controlling method based on a finite mixture of multinomial distributions and shows that it tends to discover more moderate and strong associations and fewer weak associations when the data are heterogeneous across tests. The new method is applied to the rhizobacteria data and performs favorably over competing methods.

In cancer research, interest frequently centers on factors influencing a latent event that must precede a terminal event. In practice it is often impossible to observe the latent event precisely, making inference about this process difficult. To address this problem, we propose a joint model for the unobserved time to the latent and terminal events, with the two events linked by the baseline hazard. Covariates enter the model parametrically as linear combinations that multiply, respectively, the hazard for the latent event and the hazard for the terminal event conditional on the latent one. We derive the partial likelihood estimators for this problem assuming the latent event is observed, and propose a profile likelihood-based method for estimation when the latent event is unobserved. The baseline hazard in this case is estimated nonparametrically using the EM algorithm, which allows for closed-form Breslow-type estimators at each iteration, bringing improved computational efficiency and stability compared with maximizing the marginal likelihood directly. We present simulation studies to illustrate the finite-sample properties of the method; its use in practice is demonstrated in the analysis of a prostate cancer data set.
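The Breslow-type update at each EM iteration builds on the classical Breslow estimator of the baseline cumulative hazard; a minimal sketch for fully observed data with a fixed coefficient (the data values are illustrative):

```python
import math

def breslow_cumulative_hazard(times, events, x, beta, t):
    """Breslow estimator of the baseline cumulative hazard H0(t):
    sum over event times t_i <= t of d_i / sum_{j at risk} exp(beta * x_j),
    with d_i = 1 for each observed event (no ties here)."""
    H = 0.0
    for ti, di in zip(times, events):
        if di and ti <= t:
            risk = sum(math.exp(beta * x[j]) for j in range(len(times))
                       if times[j] >= ti)
            H += 1.0 / risk
    return H

times  = [2.0, 3.0, 5.0, 7.0, 11.0]   # observed times
events = [1, 1, 0, 1, 1]              # 1 = event, 0 = censored
x      = [0.5, -0.2, 1.0, 0.0, 0.3]   # single covariate
print(f"H0(7) = {breslow_cumulative_hazard(times, events, x, 0.4, 7.0):.3f}")
```

With beta = 0 the risk-set weights are all one and the estimator reduces to the Nelson-Aalen estimator, a convenient sanity check.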

Cancer survival comparisons between cohorts are often assessed by estimates of relative or net survival. These measure the difference in mortality between those diagnosed with the disease and the general population. For such comparisons, methods are needed to standardize the cohort structure (including age at diagnosis) and the all-cause mortality rates in the general population. Standardized nonparametric relative survival measures are evaluated by determining how well they (i) ensure the correct rank ordering, (ii) allow for differences in covariate distributions, and (iii) possess robustness and maximal estimation precision. Two relative survival families that subsume the Ederer-I, Ederer-II, and Pohar Perme statistics are assessed. The aforementioned statistics do not meet our criteria and are not invariant under a change of covariate distribution. Existing methods for standardizing these statistics are either not invariant to changes in the general population mortality or not robust. Standardized statistics and estimators are developed to address these deficiencies. They use a reference distribution for covariates such as age, and a reference population mortality survival distribution that is recommended to approach zero with increasing age as fast as that of the cohort with the worst life expectancy. Estimators are compared using a breast cancer survival example and computer simulation. The proposals are invariant and robust, and outperform current methods for standardizing the Ederer-II and Pohar Perme estimators in simulations, particularly for extended follow-up.

In this article, we present a Bayesian hierarchical model for predicting a latent health state from longitudinal clinical measurements. Model development is motivated by the need to integrate multiple sources of data to improve clinical decisions about whether to remove or irradiate a patient's prostate cancer. Existing modeling approaches are extended to accommodate measurement error in cancer state determinations based on biopsied tissue, clinical measurements possibly not missing at random, and informative partial observation of the true state. The proposed model enables estimation of whether an individual's underlying prostate cancer is aggressive, requiring surgery and/or radiation, or indolent, permitting continued surveillance. These individualized predictions can then be communicated to clinicians and patients to inform decision-making. We demonstrate the model with data from a cohort of low-risk prostate cancer patients at Johns Hopkins University and assess predictive accuracy among a subset for whom true cancer state is observed. Simulation studies confirm model performance and explore the impact of adjusting for informative missingness on true state predictions. R code is provided in an online supplement and at http://github.com/rycoley/prediction-prostate-surveillance.

Motivated by a study conducted to evaluate the associations of 51 inflammatory markers with lung cancer risk, we propose several approaches of varying computational complexity for analyzing multiple correlated markers that are also censored due to lower and/or upper limits of detection, using likelihood-based sufficient dimension reduction (SDR) methods. We extend the theory and the likelihood-based SDR framework in two ways: (i) we accommodate censored predictors directly in the likelihood, and (ii) we incorporate variable selection. We find linear combinations that contain all the information that the correlated markers have on an outcome variable (i.e., that are sufficient for modeling and prediction of the outcome) while accounting for the censoring of the markers. These methods yield efficient estimators and can be applied to any type of outcome, including continuous and categorical. We illustrate and compare all methods using data from the motivating study and in simulations. We find that explicitly accounting for the censoring in the likelihood of the SDR methods can lead to appreciable gains in efficiency and prediction accuracy, and that it outperforms multiple imputation combined with standard SDR.

In biomedical research, a steep rise or decline in longitudinal biomarkers may indicate latent disease progression, which may subsequently cause patients to drop out of the study. Ignoring this informative drop-out can bias estimation of the longitudinal model. In such cases, a fully parametric specification may be insufficient to capture the complicated pattern of the longitudinal biomarkers. For longitudinal data with informative drop-out, we develop a joint partially linear model, with the aim of recovering the trajectory of the longitudinal biomarker. Specifically, an arbitrary function of time along with linear fixed and random covariate effects is proposed in the model for the biomarker, while a flexible semiparametric transformation model is used to describe the drop-out mechanism. Advantages of this semiparametric joint modeling approach are that 1) it provides an easier interpretation, compared to standard nonparametric regression models, and 2) it is a natural way to control for common (observable and unobservable) prognostic factors that may affect both the longitudinal trajectory and the drop-out process. We describe a sieve maximum likelihood estimation procedure using the EM algorithm, where the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are considered to select the number of knots. We show that the proposed estimators achieve desirable asymptotic properties through empirical process theory. The proposed methods are evaluated by simulation studies and applied to prostate cancer data.

Case-cohort (Prentice, 1986) and nested case-control (Thomas, 1977) designs have been widely used as a cost-effective alternative to the full-cohort design. In this article, we propose an efficient likelihood-based estimation method for the accelerated failure time model under case-cohort and nested case-control designs. An EM algorithm is developed to maximize the likelihood function and a kernel smoothing technique is adopted to facilitate the estimation in the M-step of the EM algorithm. We show that the proposed estimators for the regression coefficients are consistent and asymptotically normal. The asymptotic variance of the estimators can be consistently estimated using an EM-aided numerical differentiation method. Simulation studies are conducted to evaluate the finite-sample performance of the estimators and an application to a Wilms tumor data set is also given to illustrate the methodology.

We propose a Bayesian nonparametric (BNP) framework for estimating causal mediation effects, namely the natural direct and indirect effects. The strategy is to do this in two parts. Part 1 is a flexible model (using BNP) for the observed data distribution. Part 2 is a set of uncheckable assumptions with sensitivity parameters that, in conjunction with Part 1, allows identification and estimation of the causal parameters, and allows for uncertainty about these assumptions via priors on the sensitivity parameters. For Part 1, we specify a Dirichlet process mixture of multivariate normals as a prior on the joint distribution of the outcome, mediator, and covariates. This approach allows us to obtain a (simple) closed form for each marginal distribution. For Part 2, we consider two sets of assumptions: (a) the standard sequential ignorability assumption (Imai et al., 2010) and (b) a weakened set of conditional independence-type assumptions introduced in Daniels et al. (2012), and propose sensitivity analyses for both. We use this approach to assess mediation in a physical activity promotion trial.

Cancer population studies based on cancer registry databases are widely conducted to address various research questions. In general, cancer registry databases do not collect information on cause of death. The net survival rate is defined as the survival rate that would be observed if subjects could not die from any cause other than cancer. This counterfactual concept is widely used in the analysis of cancer registry data. Perme, Stare, and Estève (2012) proposed a nonparametric estimator of the net survival rate under the assumption that the censoring time is independent of the survival time and covariates. Kodre and Perme (2013) proposed an inverse weighting estimator for the net survival rate under covariate-dependent censoring. An alternative approach to estimating the net survival rate under covariate-dependent censoring is to apply a regression model for the conditional net survival rate given covariates. In this article, we propose a new estimator for the net survival rate. The proposed estimator is shown to be doubly robust in the sense that it is consistent if at least one of the regression models, for the survival time or for the censoring time, is correctly specified. We examine the theoretical and empirical properties of the proposed estimator through asymptotic theory and simulation studies. We also apply the proposed method to cancer registry data for gastric cancer patients in Osaka, Japan.

The main challenge in cure rate analysis is that one never knows whether censored subjects are cured or uncured, that is, whether they are insusceptible or susceptible to the event of interest. Treating the susceptibility indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a surviving fraction. We develop an iterative algorithm to estimate the conditional probability of being uncured for each subject. Using this estimated probability and Bernoulli imputation, we classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and averaging the resultant estimators, we obtain consistent estimators of the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that quantile regression can be applied to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration.
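The conditional uncured probability that drives a Bernoulli imputation step has a standard mixture-cure form: for a subject censored at time c with uncured probability pi(x) and latency survival S(c), P(uncured | censored) = pi(x)S(c) / {1 - pi(x) + pi(x)S(c)}. A minimal sketch (the exponential latency survival and the parameter values are illustrative assumptions):

```python
import math, random

def prob_uncured_given_censored(pi_x, surv_c):
    """P(uncured | censored at c) = pi(x) S(c) / (1 - pi(x) + pi(x) S(c)),
    where pi(x) is the uncured probability and S(c) the latency survival."""
    return pi_x * surv_c / (1.0 - pi_x + pi_x * surv_c)

def impute_cure_status(pi_x, surv_c, rng):
    """Bernoulli imputation: classify a censored subject as uncured (1)
    with the conditional probability above, else cured (0)."""
    return 1 if rng.random() < prob_uncured_given_censored(pi_x, surv_c) else 0

rng = random.Random(7)
pi_x = 0.6                     # illustrative covariate-dependent uncured probability
surv_c = math.exp(-0.3 * 4.0)  # illustrative exponential latency survival S(c)
p = prob_uncured_given_censored(pi_x, surv_c)
print(f"P(uncured | censored) = {p:.3f}, one imputed draw = "
      f"{impute_cure_status(pi_x, surv_c, rng)}")
```

Note the limiting behavior: a subject censored beyond the support of the event times (S(c) near 0) is almost surely cured, while a subject censored at time 0 (S(c) = 1) retains the marginal uncured probability pi(x).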

Finding rare variants and gene–environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer: there is evidence that variants on chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior, rendering the assumption of G-E independence inappropriate. Motivated by the need to detect GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I), by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, that incorporates uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet well-controlled type I error rates in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.

Highly active antiretroviral therapy (HAART) has proved effective in increasing CD4 counts in many randomized clinical trials. Because randomized trials have some limitations (e.g., short duration, highly selected subjects), it is of interest to assess the effect of treatments using observational studies. This is challenging because treatment is started preferentially in subjects with severe conditions. This general problem has been treated using marginal structural models (MSM) relying on the counterfactual formulation. Another approach to causality is based on dynamical models. We present three discrete-time dynamic models based on linear increments models (LIM): the first based on one difference equation for CD4 counts, the second with an equilibrium point, and the third based on a system of two difference equations, which allows jointly modeling CD4 counts and viral load. We also consider continuous-time models based on ordinary differential equations with nonlinear mixed effects (ODE-NLME). These mechanistic models allow incorporating biological knowledge when available, which leads to increased statistical evidence for detecting treatment effects. Because inference in ODE-NLME is numerically challenging and requires specific methods and software, LIM are a valuable intermediate option in terms of consistency, precision, and complexity. We compare the different approaches in simulation and in an illustration on the ANRS CO3 Aquitaine Cohort and the Swiss HIV Cohort Study.

We consider simple ordinal model-based probability effect measures for comparing distributions of two groups, adjusted for explanatory variables. An “ordinal superiority” measure summarizes the probability that an observation from one distribution falls above an independent observation from the other distribution, adjusted for explanatory variables in a model. The measure applies directly to normal linear models and to a normal latent variable model for ordinal response variables. It equals Φ(β/√2) for the corresponding ordinal model that applies a probit link function to cumulative multinomial probabilities, where Φ is the standard normal cdf and β is the coefficient of the group indicator variable. For the more general latent variable model for ordinal responses that corresponds to a linear model with other possible error distributions and corresponding link functions for cumulative multinomial probabilities, the ordinal superiority measure equals exp(β)/[1 + exp(β)] with the log–log link and equals approximately exp(β/√2)/[1 + exp(β/√2)] with the logit link, where β is the group effect. Another ordinal superiority measure generalizes the difference of proportions from binary to ordinal responses. We also present related measures directly for ordinal models for the observed response that need not assume corresponding latent response models. We present confidence intervals for the measures and illustrate with an example.
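For the normal linear model, the closed form γ = Φ(β/√2) follows because Y₁ − Y₂ ~ N(β, 2) for independent Y₁ ~ N(β, 1) and Y₂ ~ N(0, 1); a quick Monte Carlo sketch verifies it:

```python
import math, random

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordinal_superiority_normal(beta):
    """gamma = Phi(beta / sqrt(2)) = P(Y1 > Y2) for independent
    Y1 ~ N(beta, 1) and Y2 ~ N(0, 1)."""
    return normal_cdf(beta / math.sqrt(2.0))

rng = random.Random(0)
beta = 0.7
mc = sum(rng.gauss(beta, 1) > rng.gauss(0, 1) for _ in range(200_000)) / 200_000
print(f"closed form: {ordinal_superiority_normal(beta):.4f}, "
      f"Monte Carlo: {mc:.4f}")
```

At β = 0 the two distributions coincide and the measure is 1/2, the natural "no effect" benchmark for ordinal superiority.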

Long-term follow-up is common in many medical investigations where the interest lies in predicting patients’ risks for a future adverse outcome using repeatedly measured predictors over time. A key quantity is the likelihood of developing an adverse outcome among individuals who survived up to time *s* given their covariate information up to time *s*. Simple, yet reliable, methodology for updating the predicted risk of disease progression using longitudinal markers remains elusive. Two main approaches have been considered in the literature. One approach, based on joint modeling (JM) of failure time and longitudinal covariate process (Tsiatis and Davidian, 2004), derives such longitudinal predictive probability from the joint probability of a longitudinal marker and an event at a given time. A second approach, the partly conditional (PC) modeling (Zheng and Heagerty, 2005), directly models the predictive probability conditional on survival up to a landmark time and information accrued by that time. In this article, we propose new PC models for longitudinal prediction that are more flexible than joint modeling and improve the prediction accuracy over existing PC models. We provide procedures for making inference regarding future risk for an individual with longitudinal measures up to a given time. In addition, we conduct simulations to evaluate both JM and PC approaches in order to provide practical guidance on modeling choices. We use standard measures of predictive accuracy adapted to our setting to explore the predictiveness of the two approaches. We illustrate the performance of the two approaches on a dataset from the End Stage Renal Disease Study (ESRDS).

We consider the problem of testing for a dose-related effect based on a candidate set of (typically nonlinear) dose-response models using likelihood-ratio tests. For the considered models this reduces to assessing whether the slope parameter in these nonlinear regression models is zero or not. A technical problem is that the null distribution (when the slope is zero) depends on non-identifiable parameters, so that standard asymptotic results on the distribution of the likelihood-ratio test no longer apply. Asymptotic solutions for this problem have been extensively discussed in the literature. The resulting approximations, however, are not of simple form and require simulation to calculate the asymptotic distribution. In addition, their appropriateness may be doubtful for small sample sizes. Direct simulation to approximate the null distribution is numerically unstable due to the non-identifiability of some parameters. In this article, we derive a numerical algorithm to approximate the exact distribution of the likelihood-ratio test under multiple models for normally distributed data. The algorithm uses methods from differential geometry and can be used to evaluate the distribution under the null hypothesis, but also allows for power and sample size calculations. We compare the proposed testing approach to the MCP-Mod methodology and alternative methods for testing for a dose-related trend in a dose-finding example data set and simulations.

We propose frailty models in the tumor dormancy framework to account for possible unobservable dependence mechanisms in cancer studies where a non-negligible proportion of cancer patients relapse years or decades after surgical removal of the primary tumor. Relapses do not seem to follow a memoryless process, since their timing distribution leads to multimodal hazards. From a biomedical perspective, this behavior may be explained by tumor dormancy, i.e., for some patients microscopic tumor foci may remain asymptomatic for a prolonged time interval and, when they escape from dormancy, micrometastatic growth results in clinically apparent disease. The activation of the growth phase at different metastatic states would explain the occurrence of metastatic recurrences and mortality at different times (multimodal hazard). We propose a new frailty model that includes in the risk function a random source of heterogeneity (frailty variable) affecting the components of the hazard function. Thus, the individual hazard rate is the product of a random frailty variable and the sum of basic hazard rates. In tumor dormancy, the basic hazard rates correspond to micrometastatic developments starting from different initial states. The frailty variable represents the heterogeneity among patients with respect to relapse, which might be related to unknown mechanisms that regulate tumor dormancy. We use our model to estimate overall survival in a large breast cancer dataset, showing how this improves the understanding of the underlying biological process.
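When the frailty multiplying the sum of basic hazards is taken to be gamma distributed with mean 1 and variance theta (an illustrative choice, as are the two Weibull components below), the marginal survival has a Laplace-transform closed form, S(t) = (1 + theta H(t))^(-1/theta) with H(t) the summed cumulative hazard:

```python
def weibull_cum_hazard(t, shape, scale):
    """Cumulative hazard of one Weibull component: (t / scale) ** shape."""
    return (t / scale) ** shape

def marginal_survival(t, theta, components):
    """S(t) = E[exp(-Z * H(t))] = (1 + theta * H(t)) ** (-1 / theta)
    for gamma frailty Z (mean 1, variance theta), where H(t) is the sum
    of the basic cumulative hazards over all components."""
    H = sum(weibull_cum_hazard(t, shape, scale) for shape, scale in components)
    return (1.0 + theta * H) ** (-1.0 / theta)

# two illustrative micrometastatic components: an early and a late one
components = [(1.5, 3.0), (4.0, 12.0)]
for t in (0.0, 2.0, 10.0):
    print(f"S({t}) = {marginal_survival(t, 0.5, components):.3f}")
```

As theta shrinks to zero the frailty degenerates to 1 and S(t) approaches exp(-H(t)); larger theta induces stronger unobserved heterogeneity and heavier survival tails.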

This article considers sieve estimation in the Cox model with an unknown regression structure based on right-censored data. We propose a semiparametric pursuit method to simultaneously identify and estimate linear and nonparametric covariate effects based on B-spline expansions through a penalized group selection method with concave penalties. We show that the estimators of the linear effects and the nonparametric component are consistent. Furthermore, we establish the asymptotic normality of the estimator of the linear effects. To compute the proposed estimators, we develop a modified blockwise majorization descent algorithm that is efficient and easy to implement. Simulation studies demonstrate that the proposed method performs well in finite sample situations. We also use the primary biliary cirrhosis data to illustrate its application.

Many diseases arise due to exposure to one of multiple possible pathogens. We consider the situation in which disease counts are available over time from a study region, along with a measure of clinical disease severity, for example, mild or severe. In addition, we suppose a subset of the cases are lab tested in order to determine the pathogen responsible for disease. In such a context, we focus interest on modeling the probabilities of disease incidence given pathogen type. The time course of these probabilities is of great interest, as is the association with time-varying covariates such as meteorological variables. In this setup, a natural Bayesian approach would be based on imputation of the unsampled pathogen information using Markov chain Monte Carlo, but this is computationally challenging. We describe a practical approach to inference that is easy to implement. We use an empirical Bayes procedure in a first step to estimate summary statistics. We then treat these summary statistics as the observed data and develop a Bayesian generalized additive model. We analyze data on hand, foot, and mouth disease (HFMD) in China in which there are two pathogens of primary interest, enterovirus 71 (EV71) and coxsackievirus A16 (CA16). We find that both EV71 and CA16 are associated with temperature, relative humidity, and wind speed, with reasonably similar functional forms for both pathogens. The important issue of confounding by time is modeled using a penalized B-spline model with a random effects representation. The level of smoothing is addressed by a careful choice of the prior on the tuning variance.

In this article, we propose an association model to estimate the penetrance (risk) of successive cancers in the presence of competing risks. The association between the successive events is modeled via a copula and a proportional hazards model is specified for each competing event. This work is motivated by the analysis of successive cancers for people with Lynch Syndrome in the presence of competing risks. The proposed inference procedure is adapted to handle missing genetic covariates and selection bias, induced by the data collection protocol of the data at hand. The performance of the proposed estimation procedure is evaluated by simulations and its use is illustrated with data from the Colon Cancer Family Registry (Colon CFR).

Motivated by a study of molecular differences among breast cancer patients, we develop a Bayesian latent factor zero-inflated Poisson (LZIP) model for the analysis of correlated zero-inflated counts. The responses are modeled as independent zero-inflated Poisson distributions conditional on a set of subject-specific latent factors. For each outcome, we express the LZIP model as a function of two discrete random variables: the first captures the propensity to be in an underlying “at-risk” state, while the second represents the count response conditional on being at risk. The latent factors and loadings are assigned conditionally conjugate gamma priors that accommodate overdispersion and dependence among the outcomes. For posterior computation, we propose an efficient data-augmentation algorithm that relies primarily on easily sampled Gibbs steps. We conduct simulation studies to investigate both the inferential properties of the model and the computational capabilities of the proposed sampling algorithm. We apply the method to an analysis of breast cancer genomics data from The Cancer Genome Atlas.
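As a rough illustration of the two-part structure described in this abstract — not the authors' Bayesian latent factor implementation — the marginal zero-inflated Poisson distribution mixes a point mass at zero (the "not-at-risk" state) with a Poisson count for the at-risk state. The function name and parameter values below are illustrative only:

```python
import math

def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson pmf: with probability pi the subject is not
    at risk (forcing a zero count); with probability 1 - pi the count is
    Poisson(lam). Only y = 0 receives the extra point mass."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

# The pmf sums to 1 and has mean (1 - pi) * lam = 0.7 * 2.0 = 1.4.
total = sum(zip_pmf(y, 0.3, 2.0) for y in range(50))
mean = sum(y * zip_pmf(y, 0.3, 2.0) for y in range(50))
```

In the article's model, the Poisson rate and at-risk probability are further driven by subject-specific latent factors with gamma priors; the sketch above only shows the outcome-level distribution being mixed.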

The analysis of multiple outcomes is becoming increasingly common in modern biomedical studies. It is well-known that joint statistical models for multiple outcomes are more flexible than fitting a separate model for each outcome; they yield more powerful tests of exposure or treatment effects by taking into account the dependence among outcomes and pooling evidence across outcomes. It is, however, unlikely that all outcomes are related to the same subset of covariates. Therefore, there is interest in identifying exposures or treatments associated with particular outcomes, which we term outcome-specific variable selection. In this work, we propose a variable selection approach for multivariate normal responses that incorporates not only information on the mean model, but also information on the variance–covariance structure of the outcomes. The approach effectively leverages evidence from all correlated outcomes to estimate the effect of a particular covariate on a given outcome. To implement this strategy, we develop a Bayesian method that builds a multivariate prior for the variable selection indicators based on the variance–covariance of the outcomes. We show via simulation that the proposed variable selection strategy can boost power to detect subtle effects without increasing the probability of false discoveries. We apply the approach to the Normative Aging Study (NAS) epigenetic data and identify a subset of five genes in the asthma pathway for which gene-specific DNA methylations are associated with exposures to either black carbon, a marker of traffic pollution, or sulfate, a marker of particles generated by power plants.

Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence, the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in an R-package HDtest and are available on CRAN.

In this article, the existing concept of reversed percentile residual life, or percentile inactivity time, is recast to show that it can be used for routine analysis of time-to-event data under right censoring to summarize “life lost,” which offers several advantages over existing methods for survival analysis. An estimating equation approach is adopted to estimate the variance of the quantile estimator while avoiding estimation of the probability density function of the underlying time-to-event distribution. Additionally, a *K*-sample test statistic is proposed to test the ratio of the quantile lost lifespans. Simulation studies are performed to assess finite-sample properties of the proposed *K*-sample statistic in terms of coverage probability and power. The proposed method is illustrated with a real data example from a breast cancer study.

It is traditionally assumed that the random effects in mixed models follow a multivariate normal distribution, making likelihood-based inferences more feasible theoretically and computationally. However, this assumption does not necessarily hold in practice, which may lead to biased and unreliable results. We introduce a novel diagnostic test based on the so-called gradient function proposed by Verbeke and Molenberghs (2013) to assess the random-effects distribution. We establish asymptotic properties of our test and show that, under a correctly specified model, the proposed test statistic converges to a weighted sum of independent chi-squared random variables each with one degree of freedom. The weights, which are eigenvalues of a square matrix, can be easily calculated. We also develop a parametric bootstrap algorithm for small samples. Our strategy can be used to check the adequacy of any distribution for random effects in a wide class of mixed models, including linear mixed models, generalized linear mixed models, and non-linear mixed models, with univariate as well as multivariate random effects. Both asymptotic and bootstrap proposals are evaluated via simulations and a real data analysis of a randomized multicenter study on toenail dermatophyte onychomycosis.
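Once the eigenvalue weights of the limiting null distribution are in hand, a p-value for such a weighted sum of independent χ²₁ variables can be approximated by simple Monte Carlo. This is a generic sketch, not the authors' procedure; the function name and simulation settings are illustrative:

```python
import random

def weighted_chisq_pvalue(stat, weights, n_sim=100_000, seed=1):
    """Monte Carlo upper-tail p-value for a statistic whose null
    distribution is sum_j w_j * chi^2_1, with weights w_j (e.g., the
    eigenvalues described in the abstract). Each chi^2_1 draw is a
    squared standard normal."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        draw = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if draw >= stat:
            exceed += 1
    return exceed / n_sim

# With a single unit weight the reference distribution is chi^2_1,
# whose upper 5% point is approximately 3.84.
p = weighted_chisq_pvalue(3.84, [1.0])
```

For a handful of weights, exact numerical inversion (e.g., Imhof-type methods) would be more precise, but the Monte Carlo version requires nothing beyond a random number generator.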

Various model selection methods can be applied to seek sparse subsets of the covariates to explain the response of interest in bioinformatics. While such methods often offer very good predictive performance, their selections of the covariates may be much less trustworthy. Indeed, when the number of covariates is large, the selections can be highly unstable, even under a slight change of the data. This casts serious doubt on the reproducibility of the identified variables. For a sound scientific understanding of the regression relationship, methods need to be developed to find the most important covariates that have a higher chance of being confirmed in future studies. Such a method, based on variable selection deviation, is proposed and evaluated in this work.

Distortion product otoacoustic emissions (DPOAE) testing is a promising alternative to behavioral hearing tests and auditory brainstem response testing of pediatric cancer patients. The central goal of this study is to assess whether significant changes in the DPOAE frequency/emissions curve (DP-gram) occur in pediatric patients in a test-retest scenario. This is accomplished through the construction of normal reference charts, or credible regions, in which DP-gram differences lie, as well as contour probabilities that measure how abnormal (or, in a certain sense, rare) a test-retest difference is. A challenge is that the data were collected over varying frequencies, at different time points from baseline, and on possibly one or both ears. A hierarchical structural equation Gaussian process model is proposed to handle the different sources of correlation in the emissions measurements, wherein both subject-specific random effects and variance components governing the smoothness and variability of each child's Gaussian process are coupled together.

We use a nonparametric mixture model to estimate the size of a population from multiple lists, allowing both the individual effects and list effects to vary. We propose a lower bound of the population size that admits an analytic expression. The lower bound can be estimated without the necessity of model-fitting. The asymptotic normality of the estimator is established. Both the estimator itself and the estimator of the estimable bound of its variance are adjusted; these adjusted versions are shown to be unbiased in the limit. Simulation experiments are performed to assess the proposed approach and real applications are studied.

In this article, we propose a new statistical method—MutRSeq—for detecting differentially expressed single nucleotide variants (SNVs) based on RNA-seq data. Specifically, we focus on nonsynonymous mutations and employ a hierarchical likelihood approach to jointly model observed mutation events as well as read count measurements from RNA-seq experiments. We then introduce a likelihood ratio-based test statistic, which detects changes not only in overall expression levels, but also in allele-specific expression patterns. In addition, this method can jointly test multiple mutations in one gene/pathway. The simulation studies suggest that the proposed method achieves better power than several competitors under a range of different settings. Finally, we apply this method to a breast cancer data set and identify genes with nonsynonymous mutations differentially expressed between the triple negative breast cancer tumors and other subtypes of breast cancer tumors.

In the competing risks setup, the data for each subject consist of the event time, censoring indicator, and event category. However, sometimes the information about the event category can be missing, as, for example, in a case when the date of death is known but the cause of death is not available. In such situations, treating subjects with missing event category as censored leads to the underestimation of the hazard functions. We suggest nonparametric estimators for the cumulative cause-specific hazards and the cumulative incidence functions which use the Nadaraya–Watson estimator to obtain the contribution of an event with missing category to each of the cause-specific hazards. We derive the properties of the proposed estimators. An optimal bandwidth is determined, which minimizes the mean integrated squared errors of the proposed estimators over time. The methodology is illustrated using data on lung infections in patients from the United States Cystic Fibrosis Foundation Patient Registry.
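The Nadaraya–Watson estimator invoked here is the standard kernel-weighted local average; how the authors apply it to missing event categories is specific to their paper, but the generic estimator itself can be sketched in a few lines (function name, Gaussian kernel choice, and data are illustrative):

```python
import math

def nadaraya_watson(x0, xs, ys, h):
    """Nadaraya-Watson kernel regression estimate at x0: a locally
    weighted average of the ys, with Gaussian kernel weights that
    decay with distance |x0 - x| relative to the bandwidth h."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# A weighted average of a constant response is that constant,
# regardless of the evaluation point or bandwidth.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0, 1.0, 1.0, 1.0, 1.0]
est = nadaraya_watson(0.7, xs, ys, h=0.3)
```

In the abstract's setting, the smoothed quantity is the conditional probability of each event category given covariates such as event time, and the bandwidth h is the tuning parameter whose optimal value the authors derive.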

When functional data come as multiple curves per subject, characterizing the source of variations is not a trivial problem. The complexity of the problem goes deeper when there is phase variation in addition to amplitude variation. We consider the clustering problem for multivariate functional data that have phase variation among the functional variables. We propose a conditional subject-specific warping framework in order to extract relevant features for clustering. Using multivariate growth curves of various parts of the body as a motivating example, we demonstrate the effectiveness of the proposed approach. The resulting clusters contain individuals who show different relative growth patterns among different parts of the body.

Dose–response modeling in areas such as toxicology is often conducted using a parametric approach. While estimation of parameters is usually one of the goals, often the main aim of the study is the estimation of quantities derived from the parameters, such as the ED50 dose. From the view of statistical optimal design theory such an objective corresponds to a *c*-optimal design criterion. Unfortunately, *c*-optimal designs often create practical problems and, furthermore, commonly do not allow actual estimation of the parameters. It is therefore useful to consider alternative designs which show good *c*-performance, while still being applicable in practice and allowing reasonably good general parameter estimation. In effect, using optimal design terminology, this means that reasonable performance regarding the *D*-criterion is expected as well. In this article, we propose several approaches to the task of combining *c*- and *D*-efficient designs, such as using mixed information functions or setting minimum requirements regarding either *c*- or *D*-efficiency, and show how to algorithmically determine optimal designs in each case. We apply all approaches to a standard situation from toxicology, and obtain a much better balance between *c*- and *D*-performance. Next, we investigate how to adapt the designs to different parameter values. Finally, we show that the methodology used here is not just limited to the combination of *c*- and *D*-designs, but can also be used to handle more general constraint situations such as limits on the cost of an experiment.
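To make the two criteria concrete in the simplest possible setting — a linear model rather than the nonlinear dose–response models the article actually treats — the *D*-criterion maximizes the determinant of the information matrix M, while the *c*-criterion minimizes c′M⁻¹c for a given vector c. All names and design points below are illustrative:

```python
def info_matrix(xs):
    """Normalized information matrix M = X'X / n for the simple linear
    model with regressors (1, x) at design points xs."""
    n = len(xs)
    s1 = sum(xs) / n
    s2 = sum(x * x for x in xs) / n
    return [[1.0, s1], [s1, s2]]

def d_criterion(m):
    """D-criterion: determinant of the 2x2 information matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def c_criterion(m, c):
    """c-criterion: c' M^{-1} c, proportional to the variance of the
    estimate of the linear combination c of the parameters."""
    det = d_criterion(m)
    inv = [[m[1][1] / det, -m[0][1] / det],
           [-m[1][0] / det, m[0][0] / det]]
    return sum(c[i] * inv[i][j] * c[j] for i in range(2) for j in range(2))

# On [0, 1], pushing points to the endpoints maximizes the D-criterion;
# a design bunched near the center is markedly worse.
spread = d_criterion(info_matrix([0.0, 0.0, 1.0, 1.0]))
middle = d_criterion(info_matrix([0.4, 0.5, 0.5, 0.6]))
```

A combined design in the article's spirit would trade off these two numbers, for instance via a mixed information function or by optimizing one criterion subject to a minimum efficiency bound on the other.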

A gene may be controlled by distal enhancers and repressors, not merely by regulatory elements in its promoter. Spatial organization of chromosomes is the mechanism that brings genes and their distal regulatory elements into close proximity. Recent molecular techniques, coupled with Next Generation Sequencing (NGS) technology, enable genome-wide detection of physical contacts between distant genomic loci. In particular, Hi-C is an NGS-aided assay for the study of genome-wide spatial interactions. The availability of such data makes it possible to reconstruct the underlying three-dimensional (3D) spatial chromatin structure. In this article, we present the Poisson Random effect Architecture Model (PRAM) for such an inference. The main feature of PRAM that separates it from previous methods is that it addresses the issue of over-dispersion and takes correlations among contact counts into consideration, thereby achieving greater consistency with observed data. PRAM was applied to Hi-C data to illustrate its performance and to compare the predicted distances with those measured by a Fluorescence In Situ Hybridization (FISH) validation experiment. Further, PRAM was compared to other methods in the literature based on both real and simulated data.

Dynamic treatment regimes (DTRs) are sequential decision rules that focus simultaneously on treatment individualization and adaptation over time. To directly identify the optimal DTR in a multi-stage multi-treatment setting, we propose a dynamic statistical learning method, adaptive contrast weighted learning. We develop semiparametric regression-based contrasts with the adaptation of treatment effect ordering for each patient at each stage, and the adaptive contrasts simplify the problem of optimization with multiple treatment comparisons to a weighted classification problem that can be solved by existing machine learning techniques. The algorithm is implemented recursively using backward induction. By combining doubly robust semiparametric regression estimators with machine learning algorithms, the proposed method is robust and efficient for the identification of the optimal DTR, as shown in the simulation studies. We illustrate our method using observational data on esophageal cancer.

Treatments are frequently evaluated in terms of their effect on patient survival. In settings where randomization of treatment is not feasible, observational data are employed, necessitating correction for covariate imbalances. Treatments are usually compared using a hazard ratio. Most existing methods which quantify the treatment effect through the survival function are applicable to treatments assigned at time 0. In the data structure of our interest, subjects typically begin follow-up untreated; time-until-treatment and the pretreatment death hazard are both heavily influenced by longitudinal covariates; and subjects may experience periods of treatment ineligibility. We propose semiparametric methods for estimating the average difference in restricted mean survival time attributable to a time-dependent treatment; that is, the average effect of treatment among the treated under current treatment assignment patterns. The pre- and posttreatment models are partly conditional, in that they use the covariate history up to the time of treatment. The pre-treatment model is estimated through recently developed landmark analysis methods. For each treated patient, fitted pre- and posttreatment survival curves are projected out, then averaged in a manner which accounts for the censoring of treatment times. Asymptotic properties are derived and evaluated through simulation. The proposed methods are applied to liver transplant data in order to estimate the effect of liver transplantation on survival among transplant recipients under current practice patterns.

The prior distribution is a key ingredient in Bayesian inference. Prior information on regression coefficients may come from different sources and may or may not be in conflict with the observed data. Various methods have been proposed to quantify a potential prior-data conflict, such as Box's *p*-value. However, there are no clear recommendations on how to react to possible prior-data conflict in generalized regression models. To address this deficiency, we propose to adaptively weight a prespecified multivariate normal prior distribution on the regression coefficients. To this end, we relate empirical Bayes estimates of prior weight to Box's *p*-value and propose alternative fully Bayesian approaches. Prior weighting can be done for the joint prior distribution of the regression coefficients or—under prior independence—separately for prespecified blocks of regression coefficients. We outline how the proposed methodology can be implemented using integrated nested Laplace approximations (INLA) and illustrate the applicability with a Bayesian logistic regression model for data from a cross-sectional study. We also provide a simulation study that shows excellent performance of our approach in the case of prior misspecification in terms of root mean squared error and coverage. Supplementary Materials give details on software implementation and code and another application to binary longitudinal data from a randomized clinical trial using a Bayesian generalized linear mixed model.

Meta-analysis has become a widely used tool to combine results from independent studies. The collected studies are homogeneous if they share a common underlying true effect size; otherwise, they are heterogeneous. A fixed-effect model is customarily used when the studies are deemed homogeneous, while a random-effects model is used for heterogeneous studies. Assessing heterogeneity in meta-analysis is critical for model selection and decision making. Ideally, if heterogeneity is present, it should permeate the entire collection of studies, instead of being limited to a small number of outlying studies. Outliers can have great impact on conventional measures of heterogeneity and the conclusions of a meta-analysis. However, no widely accepted guidelines exist for handling outliers. This article proposes several new heterogeneity measures. In the presence of outliers, the proposed measures are less affected than the conventional ones. The performance of the proposed and conventional heterogeneity measures is compared theoretically, by studying their asymptotic properties, and empirically, using simulations and case studies.
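The "conventional measures" this abstract refers to are typically Cochran's Q statistic and the derived I² index, both of which are standard and easy to compute. The sketch below shows those conventional quantities, not the authors' new outlier-robust measures; the function name and inputs are illustrative:

```python
def cochran_q_and_i2(effects, variances):
    """Cochran's Q and Higgins' I^2 for k study effect sizes with
    known within-study variances. Q measures the inverse-variance
    weighted dispersion of the effects around their pooled mean;
    I^2 = max(0, (Q - (k - 1)) / Q) rescales Q to [0, 1)."""
    w = [1.0 / v for v in variances]
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))
    k = len(effects)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return q, i2

# Identical study effects are perfectly homogeneous: Q = 0 and I^2 = 0.
q, i2 = cochran_q_and_i2([0.4, 0.4, 0.4], [0.1, 0.2, 0.1])
```

Because each study contributes a squared, inverse-variance weighted deviation to Q, a single outlying study with small variance can dominate the statistic — exactly the sensitivity the article's robust alternatives are designed to mitigate.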

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees—features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stable and reproducible identification of biclusters, on simulated and real microarray data.

Many new experimental treatments benefit only a subset of the population. Identifying the baseline covariate profiles of patients who benefit from such a treatment, rather than determining whether or not the treatment has a population-level effect, can substantially lessen the risk in undertaking a clinical trial and expose fewer patients to treatments that do not benefit them. The standard analyses for identifying patient subgroups that benefit from an experimental treatment either do not account for multiplicity, or focus on testing for the presence of treatment–covariate interactions rather than the resulting individualized treatment effects. We propose a Bayesian *credible subgroups* method to identify two bounding subgroups for the benefiting subgroup: one for which it is likely that all members simultaneously have a treatment effect exceeding a specified threshold, and another for which it is likely that no members do. We examine frequentist properties of the credible subgroups method via simulations and illustrate the approach using data from an Alzheimer's disease treatment trial. We conclude with a discussion of the advantages and limitations of this approach to identifying patients for whom the treatment is beneficial.

Cocaine addiction is chronic and persistent, and has become a major social and health problem in many countries. Existing studies have shown that cocaine addicts often undergo episodic periods of addiction to, moderate dependence on, or swearing off cocaine. Given its reversible feature, cocaine use can be formulated as a stochastic process that transits from one state to another, while the impacts of various factors, such as treatment received and individuals’ psychological problems on cocaine use, may vary across states. This article develops a hidden Markov latent variable model to study multivariate longitudinal data concerning cocaine use from a California Civil Addict Program. The proposed model generalizes conventional latent variable models to allow bidirectional transitions between cocaine-addiction states, and conventional hidden Markov models to incorporate latent variables and their dynamic interrelationships. We develop a maximum-likelihood approach, along with a Monte Carlo expectation conditional maximization (MCECM) algorithm, to conduct parameter estimation. The asymptotic properties of the parameter estimates and statistics for testing the heterogeneity of model parameters are investigated. The finite sample performance of the proposed methodology is demonstrated by simulation studies. The application to the cocaine use study provides insights into the prevention of cocaine use.

Joint modeling is increasingly popular for investigating the relationship between longitudinal and time-to-event data. However, numerical complexity often restricts this approach to linear models for the longitudinal part. Here, we use a novel development of the Stochastic-Approximation Expectation Maximization algorithm that allows joint models defined by nonlinear mixed-effect models. In the context of chemotherapy in metastatic prostate cancer, we show that a variety of patterns for the Prostate Specific Antigen (PSA) kinetics can be captured by using a mechanistic model defined by nonlinear ordinary differential equations. The use of a mechanistic model predicts that biological quantities that cannot be observed, such as treatment-sensitive and treatment-resistant cells, may have a larger impact on survival than the PSA value. This suggests that mechanistic joint models could constitute a relevant approach to evaluate the efficacy of treatment and to improve the prediction of survival in patients.

Understanding how aquatic species grow is fundamental in fisheries because stock assessment often relies on growth dependent statistical models. Length-frequency-based methods become important when more applicable data for growth model estimation are either not available or very expensive. In this article, we develop a new framework for growth estimation from length-frequency data using a generalized von Bertalanffy growth model (VBGM) framework that allows for time-dependent covariates to be incorporated. A finite mixture of normal distributions is used to model the length-frequency cohorts of each month with the means constrained to follow a VBGM. The variances of the finite mixture components are constrained to be a function of mean length, reducing the number of parameters and allowing for an estimate of the variance at any length. To optimize the likelihood, we use a minorization–maximization (MM) algorithm with a Nelder–Mead sub-step. This work was motivated by the decline in catches of the blue swimmer crab (BSC) (*Portunus armatus*) off the east coast of Queensland, Australia. We test the method with a simulation study and then apply it to the BSC fishery data.
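The von Bertalanffy growth model (VBGM) at the core of this framework has a standard closed form: mean length at age t is L(t) = L∞(1 − e^(−k(t − t₀))), with asymptotic length L∞, growth rate k, and theoretical age at length zero t₀. A minimal sketch of the curve (parameter values are illustrative, not estimates from the BSC data):

```python
import math

def vbgm_length(t, l_inf, k, t0):
    """von Bertalanffy growth model: mean length at age t rises from
    zero at t0 toward the asymptotic length l_inf at exponential
    rate k."""
    return l_inf * (1.0 - math.exp(-k * (t - t0)))

# Length is exactly zero at t = t0 and approaches l_inf for large t.
at_t0 = vbgm_length(0.2, l_inf=15.0, k=1.1, t0=0.2)
late = vbgm_length(50.0, l_inf=15.0, k=1.1, t0=0.2)
```

In the article's framework, this curve constrains the means of the normal mixture components fitted to each month's length-frequency data, with time-dependent covariates allowed to enter the growth parameters.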

Our motivating application stems from surveys of natural populations and is characterized by large spatial heterogeneity in the counts, which makes parametric approaches to modeling local animal abundance too restrictive. We adopt a Bayesian nonparametric approach based on mixture models and innovate with respect to popular Dirichlet process mixtures of Poisson kernels by increasing the model flexibility at the level both of the kernel and the nonparametric mixing measure. This allows us to derive accurate and robust estimates of the distribution of local animal abundance and of the corresponding clusters. The application and a simulation study for different scenarios also yield some general methodological implications. Adding flexibility solely at the level of the mixing measure does not improve inferences, since its impact is severely limited by the rigidity of the Poisson kernel, with considerable consequences in terms of bias. However, once a kernel more flexible than the Poisson is chosen, inferences can be robustified by choosing a prior more general than the Dirichlet process. Therefore, to improve the performance of Bayesian nonparametric mixtures for count data one has to enrich the model simultaneously at both levels, the kernel and the mixing measure.

Joint models are used in ageing studies to investigate the association between longitudinal markers and a time-to-event, and have been extended to multiple markers and/or competing risks. The competing risk of death must be considered in the elderly because death and dementia have common risk factors. Moreover, in cohort studies, time-to-dementia is interval-censored since dementia is assessed intermittently. Thus, subjects can develop dementia and die between two visits without being diagnosed. To study predementia cognitive decline, we propose a joint latent class model combining a (possibly multivariate) mixed model and an illness–death model handling both interval censoring (by accounting for a possible unobserved transition to dementia) and semi-competing risks. Parameters are estimated by maximum likelihood, accounting for interval censoring. The correlation between the marker and the times-to-events is captured by latent classes, homogeneous sub-groups with specific risks of death, dementia, and profiles of cognitive decline. We propose Markovian and semi-Markovian versions. Both approaches are compared to a joint latent-class model for competing risks through a simulation study, and applied in a prospective cohort study of cerebral and functional ageing to distinguish different profiles of cognitive decline associated with risks of dementia and death. The comparison highlights that among subjects with dementia, mortality depends more on age than on duration of dementia. This model distinguishes the so-called terminal predeath decline (among healthy subjects) from the predementia decline.

The log-rank test is widely used to compare two survival distributions in a randomized clinical trial, while partial likelihood (Cox, 1975) is the method of choice for making inference about the hazard ratio under the Cox (1972) proportional hazards model. The Wald 95% confidence interval of the hazard ratio may include the null value of 1 when the *p*-value of the log-rank test is less than 0.05. Peto et al. (1977) provided an estimator for the hazard ratio based on the log-rank statistic; the corresponding 95% confidence interval excludes the null value of 1 if and only if the *p*-value of the log-rank test is less than 0.05. However, Peto's estimator is not consistent, and the corresponding confidence interval does not have correct coverage probability. In this article, we construct the confidence interval by inverting the score test under the (possibly stratified) Cox model, and we modify the variance estimator such that the resulting score test for the null hypothesis of no treatment difference is identical to the log-rank test in the possible presence of ties. Like Peto's method, the proposed confidence interval excludes the null value if and only if the log-rank test is significant. Unlike Peto's method, however, this interval has correct coverage probability. An added benefit of the proposed confidence interval is that it tends to be more accurate and narrower than the Wald confidence interval. We demonstrate the advantages of the proposed method through extensive simulation studies and a colon cancer study.
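The log-rank statistic discussed here has a simple observed-minus-expected form: at each distinct event time, the number of events in group 1 is compared with its hypergeometric expectation given the pooled risk set. The sketch below implements that standard two-sample statistic (without the article's score-test inversion or stratification); the function name and toy data are illustrative:

```python
import math

def logrank_z(times1, events1, times2, events2):
    """Two-sample log-rank Z statistic. At each distinct event time,
    accumulate (observed - expected) events in group 1 and the
    hypergeometric variance, given the pooled risk sets."""
    data = [(t, e, 1) for t, e in zip(times1, events1)] + \
           [(t, e, 2) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)
        n2 = sum(1 for ti, _, g in data if ti >= t and g == 2)
        n = n1 + n2
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return o_minus_e / math.sqrt(var)

# Two identical samples are perfectly balanced at every event time,
# so the statistic is exactly zero.
z = logrank_z([2, 4, 6, 8], [1, 1, 0, 1], [2, 4, 6, 8], [1, 1, 0, 1])
```

The article's contribution is to invert the corresponding score test under the Cox model, with a modified variance estimator, so that the resulting confidence interval for the hazard ratio agrees with this log-rank test's accept/reject decision while retaining correct coverage.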

Interval-censored failure time data occur in many fields, such as demography, economics, medical research, and reliability, and many inference procedures for them have been developed (Sun, 2006; Chen, Sun, and Peace, 2012). However, most of the existing approaches assume that the mechanism that yields interval censoring is independent of the failure time of interest, and it is clear that this may not be true in practice (Zhang et al., 2007; Ma, Hu, and Sun, 2015). In this article, we consider regression analysis of case *K* interval-censored failure time data when the censoring mechanism may be related to the failure time of interest. For this problem, a sieve maximum-likelihood estimation approach is proposed for data arising from the proportional hazards frailty model, and a two-step procedure is presented for estimation. In addition, the asymptotic properties of the proposed estimators of regression parameters are established, and an extensive simulation study suggests that the method works well. Finally, we apply the method to a set of real interval-censored data that motivated this study.

Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. This problem becomes even more difficult due to complications in modeling unknown interaction terms among high-dimensional variables. There is currently no variable selection method to overcome these limitations. Hence, in this article we propose a variable selection approach that is developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover the sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches including the linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of the nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as a function of the sparse scale parameters of the kernel machine to recover sparsity of input variables, whose relevance to the response is measured by the scale parameters. We also provide the asymptotic properties of our approach. We show that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed. A resampling procedure for our variable selection methodology is also proposed to improve the power.

The evaluation of cure fractions in oncology research under the well-known cure rate model has attracted considerable attention in the literature, but most of the existing testing procedures have relied on restrictive assumptions. A common assumption has been to restrict the cure fraction to a constant under alternatives to homogeneity, thereby neglecting any information from covariates. This article extends the literature by developing a score-based statistic that incorporates covariate information to detect cure fractions, with the existing testing procedure serving as a special case. A complication of this extension, however, is that the implied hypotheses are not typical and standard regularity conditions to conduct the test may not even hold. Using empirical process arguments, we construct a sup-score test statistic for cure fractions and establish its limiting null distribution as a functional of mixtures of chi-square processes. In practice, we suggest a simple resampling procedure to approximate this limiting distribution. Our simulation results show that the proposed test can greatly improve efficiency over tests that neglect the heterogeneity of the cure fraction under the alternative. The practical utility of the methodology is illustrated using ovarian cancer survival data with long-term follow-up from the Surveillance, Epidemiology, and End Results (SEER) registry.

Recently, massive functional data have been collected over space, across dense grids of points, in various imaging studies. It is of interest to correlate such functional data with clinical variables, such as age and gender, in order to address scientific questions of interest. The aim of this article is to develop a single-index varying coefficient (SIVC) model for establishing a varying association between functional responses (e.g., images) and a set of covariates. The model enjoys several unique features of both varying-coefficient and single-index models. An estimation procedure is developed to estimate the varying coefficient functions, the index function, and the covariance function of the individual functions. The optimal integration of information across different grid points is systematically delineated, and the asymptotic properties (e.g., consistency and convergence rate) of all estimators are examined. Simulation studies are conducted to assess the finite-sample performance of the proposed estimation procedure. Furthermore, a real data analysis of a white matter tract dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study confirms the advantage and accuracy of the SIVC model over the popular varying coefficient model.

We consider the problem of selecting covariates in a spatial regression model when the response is binary. Penalized likelihood-based approaches have proved effective for simultaneous variable selection and estimation. In the context of a spatially dependent binary variable, however, a uniquely interpretable likelihood is not available; a quasi-likelihood is more suitable. We develop a penalized quasi-likelihood with spatial dependence for simultaneous variable selection and parameter estimation, along with an efficient computational algorithm. Theoretical properties, including asymptotic normality and consistency, are studied under the increasing-domain asymptotic framework. An extensive simulation study is conducted to validate the methodology, and real data examples are provided for illustration. Although theoretical justification has not been established for this case, we also investigate the empirical performance of the proposed penalized quasi-likelihood approach for spatial count data, to explore the suitability of the method for a general exponential family of distributions.

For the classical, homoscedastic measurement error model, moment reconstruction (Freedman et al., 2004, 2008) and moment-adjusted imputation (Thomas et al., 2011) are appealing, computationally simple imputation-like methods for general model fitting. As in classical regression calibration, the idea is to replace the unobserved variable subject to measurement error with a proxy that can be used in a variety of analyses. Moment reconstruction and moment-adjusted imputation differ from regression calibration in that they attempt to match multiple moments of the latent variable, as well as some of the latent variable's relationships with the response and additional covariates. In this note, we consider a problem where true exposure is generated by a complex, nonlinear random effects process, and we develop analogues of moment reconstruction and moment-adjusted imputation for this case. This general model covers classical measurement errors, Berkson measurement errors, mixtures of Berkson and classical errors, and problems that are not measurement error problems at all but in which the data-generating process for true exposure is a complex, nonlinear random effects process. The methods are illustrated using the National Institutes of Health–AARP Diet and Health Study, where the latent variable is a dietary pattern score called the Healthy Eating Index-2005. We also show how our general model includes methods used in radiation epidemiology as a special case. Simulations are used to illustrate the methods.
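For the classical homoscedastic special case, the basic moment reconstruction step is short: within each level of the response, the error-prone surrogate is recentered and rescaled so that its first two moments match those of the latent variable. A simulation sketch with an assumed-known error variance (the article's nonlinear random-effects extension is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
sigma_u = 0.8                        # assumed-known measurement error SD
y = rng.integers(0, 2, n)            # binary response defining strata
x = rng.normal(1.0 + 0.5 * y, 1.0)   # latent true exposure, depends on y
w = x + rng.normal(0.0, sigma_u, n)  # classical error-prone surrogate

x_mr = np.empty(n)
for g in (0, 1):                     # moment reconstruction within each stratum of y
    m = y == g
    mu_w, var_w = w[m].mean(), w[m].var()
    var_x = var_w - sigma_u**2       # classical-error identity: var(W|Y) = var(X|Y) + sigma_u^2
    x_mr[m] = mu_w + np.sqrt(var_x / var_w) * (w[m] - mu_w)

# x_mr now matches the first two moments of x within each stratum of y
```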

The peptide microarray immunoassay simultaneously screens sample serum against thousands of peptides, determining the presence of antibodies bound to array probes. Peptide microarrays tiling immunogenic regions of pathogens (e.g., envelope proteins of a virus) are an important high-throughput tool for querying and mapping antibody binding. Because of the assay's many steps, from probe synthesis to incubation, peptide microarray data can be noisy, with extreme outliers. In addition, subjects may produce different antibody profiles in response to an identical vaccine stimulus or infection, due to variability among subjects' immune systems. We present a robust Bayesian hierarchical model for peptide microarray experiments, pepBayes, to estimate the probability of antibody response for each subject/peptide combination. Heavy-tailed error distributions accommodate outliers and extreme responses, and tailored random effect terms automatically account for technical effects prevalent in the assay. We apply our model to two vaccine trial data sets to demonstrate model performance. Our approach enjoys high sensitivity and specificity when detecting vaccine-induced antibody responses. A simulation study shows that an adaptive thresholding classification method provides appropriate false discovery rate control with high sensitivity, and receiver operating characteristics generated from vaccine trial data suggest that pepBayes clearly separates responses from non-responses.

In many classical estimation problems, the parameter space has a boundary. In most cases, the standard asymptotic properties of the estimator do not hold when some of the underlying true parameters lie on the boundary. However, without knowledge of the true parameter values, confidence intervals constructed assuming that the parameters lie in the interior are generally over-conservative. A penalized estimation method is proposed in this article to address this issue. An adaptive lasso procedure is employed to shrink the parameters to the boundary, yielding oracle inference that adapts to whether or not the true parameters lie on the boundary. When the true parameters are on the boundary, the inference is equivalent to that which would be achieved with a priori knowledge of this fact; when they are in the interior, the inference is equivalent to that obtained over the interior of the parameter space. The method is demonstrated under two practical scenarios, namely the frailty survival model and linear regression with order-restricted parameters. Simulation studies and real data analyses show that the method performs well with realistic sample sizes and exhibits certain advantages over standard methods.
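In the simplest one-parameter case, say a nonnegative variance component, this kind of adaptive-lasso shrinkage has a closed form: the unpenalized estimate is soft-thresholded toward the boundary, with an adaptive weight that grows as the estimate approaches it. A hypothetical sketch (the quadratic objective below is an illustrative stand-in, not the paper's exact criterion):

```python
def shrink_to_boundary(theta_hat, lam, gamma=1.0):
    """Adaptive-lasso style shrinkage of a nonnegative estimate toward 0.

    Solves min_{theta >= 0} (theta - theta_hat)^2 + lam * theta / theta_hat**gamma:
    estimates near the boundary get a large weight and are set exactly to 0,
    while estimates well inside the interior are barely perturbed.
    """
    weight = 1.0 / max(theta_hat, 1e-12) ** gamma   # adaptive weight, large near 0
    return max(0.0, theta_hat - lam * weight / 2.0)
```

For example, an estimate of 0.01 is snapped to the boundary while an estimate of 2.0 is left essentially unchanged, mimicking the oracle that knows which case holds.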

Semi-parametric methods are often used for the estimation of intervention effects on correlated outcomes in cluster-randomized trials (CRTs). When outcomes are missing at random (MAR), Inverse Probability Weighted (IPW) methods incorporating baseline covariates can be used to deal with informative missingness. Augmented generalized estimating equations (AUG) correct for imbalance in baseline covariates, but need to be extended for MAR outcomes. However, in the presence of interactions between treatment and baseline covariates, neither method alone produces consistent estimates for the marginal treatment effect if the interaction model is not correctly specified. We propose an AUG–IPW estimator that weights by the inverse of the probability of being a complete case and allows different outcome models in each intervention arm. This estimator is doubly robust (DR): it gives correct estimates whether the missing data process or the outcome model is correctly specified. We also consider the problem of covariate interference, which arises when the outcome of an individual may depend on the covariates of other individuals. When interfering covariates are not modeled, the DR property prevents bias as long as covariate interference is not present simultaneously for the outcome and the missingness. An R package implementing the proposed method has been developed. An extensive simulation study and an application to a CRT of an HIV risk-reduction intervention in South Africa illustrate the method.
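The double robustness property can be illustrated in miniature for a scalar mean with outcomes missing at random: the augmented IPW estimate remains consistent when either the missingness model or the outcome model, but not necessarily both, is correctly specified. A simulation sketch that ignores clustering and the trial structure:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)        # true marginal mean of y is 2.0
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))    # true P(observed | x)
r = rng.binomial(1, pi_true)                  # r = 1: outcome observed

def aipw(pi_hat, m_hat):
    # augmented IPW estimate of E[y]: IPW term plus an augmentation term
    return np.mean(r * y / pi_hat - (r / pi_hat - 1.0) * m_hat)

m_true = 2.0 + 1.5 * x                        # correct outcome regression
mu_a = aipw(pi_true, np.zeros(n))             # outcome model wrong, weights right
mu_b = aipw(np.full(n, 0.6), m_true)          # weights wrong, outcome model right
```

Here `mu_a` uses correct weights with a deliberately misspecified outcome model, and `mu_b` the reverse; both recover the true mean.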

In this work a new metric of surrogacy, the so-called individual causal association (ICA), is introduced using information-theoretic concepts and a causal inference model for a binary surrogate and true endpoint. The ICA has a simple and appealing interpretation in terms of uncertainty reduction and, in some scenarios, it seems to provide a more coherent assessment of the validity of a surrogate than existing measures. Identifiability issues are tackled using a two-step procedure. In the first step, the region of the parametric space of the distribution of the potential outcomes that is compatible with the data at hand is geometrically characterized. In the second step, a Monte Carlo approach is proposed to study the behavior of the ICA over this region. The method is illustrated using data from the Collaborative Initial Glaucoma Treatment Study. A newly developed and user-friendly R package *Surrogate* is provided to carry out the evaluation exercise.

Spatial data have become increasingly common in epidemiology and public health research, thanks to advances in GIS (Geographic Information Systems) technology. In health research, for example, it is common for epidemiologists to incorporate geographically indexed data into their studies. In practice, however, the spatially defined covariates are often measured with error. Naive estimators of regression coefficients are attenuated if measurement error is ignored. Moreover, classical measurement error theory is inapplicable in the context of spatial modeling because of the spatial correlation among the observations. We propose a semiparametric regression approach to obtain bias-corrected estimates of regression parameters and derive their large-sample properties. We evaluate the performance of the proposed method through simulation studies and illustrate it using data on Ischemic Heart Disease (IHD). Both the simulations and the practical application demonstrate that the proposed method can be effective in practice.

We show how a spatial point process, where to each point there is associated a random quantitative mark, can be identified with a spatio-temporal point process specified by a conditional intensity function. For instance, the points can be tree locations, the marks can express the size of trees, and the conditional intensity function can describe the distribution of a tree (i.e., its location and size) conditionally on the larger trees. This enables us to construct parametric statistical models which are easily interpretable and where maximum-likelihood-based inference is tractable.

Capture–recapture methods are used to estimate the size of a population of interest that is only partially observed. In such studies, each member of the population carries a count of the number of times it has been identified during the observational period. In real-life applications, only positive counts are recorded, so the observed distribution is truncated at zero. The truncated count distribution must then be used to estimate the number of unobserved units. We consider ratios of neighboring count probabilities, estimated by ratios of observed frequencies, regardless of whether the distribution is zero-truncated or untruncated. Rocchetti et al. (2011) have shown that, for densities in the Katz family, these ratios can be modeled by a regression approach, and Rocchetti et al. (2014) have specialized the approach to the beta-binomial distribution. Once the regression model has been estimated, the unobserved frequency of zero counts can be derived directly. The guiding principle is that it is often easier to find an appropriate regression model than a proper model for the count distribution. However, a full analysis of the connection between the regression model and the associated count distribution has been missing. In this manuscript, we fill that gap and show that the regression-model approach leads, under general conditions, to a valid count distribution; we also consider a wider class of regression models based on fractional polynomials. The proposed approach is illustrated by analyzing various empirical applications and by means of a simulation study.
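For the Poisson member of the Katz family the ratio idea is transparent: (x + 1)p_{x+1}/p_x equals the Poisson mean for every x, and the ratios for x ≥ 1 are unaffected by zero truncation, so observed frequency ratios recover the mean and the zero cell follows. A simulation sketch of this special case (not the fractional-polynomial regression extension):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
N, lam = 10_000, 1.5                    # true population size and Poisson mean
counts = rng.poisson(lam, N)
observed = counts[counts > 0]           # zero counts are never recorded
freq = Counter(observed.tolist())

# Poisson ratio property: (x + 1) p_{x+1} / p_x = lam for all x, and ratios
# for x >= 1 survive zero truncation; estimate lam from well-populated cells
ratios = [(x + 1) * freq[x + 1] / freq[x]
          for x in range(1, 4) if freq[x] >= 100 and freq[x + 1] > 0]
lam_hat = float(np.mean(ratios))

# recover the unobserved zero cell and hence the population size
n = len(observed)
N_hat = n / (1.0 - np.exp(-lam_hat))
```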

In many scientific fields, it is common practice to collect a sequence of 0-1 binary responses from a subject across time, space, or a collection of covariates. Researchers are interested in finding out how the expected binary outcome is related to covariates, and aim for better prediction of future 0-1 outcomes. Gaussian processes have been widely used to model nonlinear systems, and in particular to model the latent structure in a binary regression model, allowing a nonlinear functional relationship between covariates and the expectation of binary outcomes. A critical issue in modeling binary response data is the appropriate choice of link function. Commonly adopted link functions such as the probit or logit have fixed skewness and lack the flexibility to let the data determine the degree of skewness. To address this limitation, we propose a flexible binary regression model that combines a generalized extreme value link function with a Gaussian process prior on the latent structure. Bayesian computation is employed for model estimation, and posterior consistency of the resulting posterior distribution is demonstrated. The flexibility and gains of the proposed model are illustrated through detailed simulation studies and two real data examples. Empirical results show that the proposed model outperforms a set of alternative models that have only either a Gaussian process prior on the latent regression function or a Dirichlet prior on the link function.
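The generalized extreme value link itself is simple to state: the success probability is one minus the GEV distribution function evaluated at the negative linear predictor, with the shape parameter controlling skewness and the limit at zero recovering the complementary log-log link. A sketch under one common parameterization (sign conventions vary across papers):

```python
import numpy as np

def gev_link(eta, xi):
    # P(Y = 1 | eta) = 1 - F(-eta), with F the GEV distribution function
    # F(z) = exp(-(1 + xi*z)_+^(-1/xi)); xi -> 0 gives the complementary
    # log-log link (assumed parameterization; conventions differ by paper)
    if abs(xi) < 1e-8:
        return 1.0 - np.exp(-np.exp(eta))
    base = np.maximum(1.0 - xi * eta, 0.0)
    return 1.0 - np.exp(-base ** (-1.0 / xi))
```

The shape parameter `xi` is what the probit and logit links lack: it lets the data choose how quickly the probability approaches 0 versus 1.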

Nonparametric estimation of monotone regression functions is a classical problem of practical importance. Robust estimation of monotone regression functions with interval-censored data is a challenging yet unresolved problem. Herein, we propose a nonparametric estimation method based on the principle of isotonic regression. Using empirical process theory, we show that the proposed estimator is asymptotically consistent under a specific metric. We further conduct a simulation study to evaluate the performance of the estimator in finite-sample situations. As an illustration, we use the proposed method to estimate the mean body weight functions in a group of adolescents after they reach the pubertal growth spurt.
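The isotonic-regression backbone of the method is the classical pool-adjacent-violators algorithm, which merges neighboring blocks of the response until the fitted values are monotone. A minimal sketch for complete (uncensored) data; the interval-censored estimator in the article requires additional machinery:

```python
def pava(y, w=None):
    # pool-adjacent-violators algorithm for a nondecreasing least-squares fit
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    level, weight, size = [], [], []
    for yi, wi in zip(y, w):
        level.append(float(yi)); weight.append(float(wi)); size.append(1)
        # merge adjacent blocks while the monotonicity constraint is violated
        while len(level) > 1 and level[-2] > level[-1]:
            tot = weight[-2] + weight[-1]
            level[-2] = (weight[-2] * level[-2] + weight[-1] * level[-1]) / tot
            weight[-2] = tot
            size[-2] += size[-1]
            level.pop(); weight.pop(); size.pop()
    fit = []
    for lv, s in zip(level, size):
        fit.extend([lv] * s)
    return fit
```

For example, `pava([1, 3, 2, 4])` pools the violating middle pair into their average, giving a nondecreasing fit.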

Identifying factors associated with increased medical cost is important for many micro- and macro-institutions, including the national economy and public health, insurers, and the insured. However, assembling comprehensive national databases that include both cost and individual-level predictors can prove challenging. Alternatively, one can use data from smaller studies with the understanding that conclusions drawn from such analyses may be limited to the participant population. At the same time, smaller clinical studies have limited follow-up, and lifetime medical cost may not be fully observed for all study participants. In this context, we develop new model selection methods and inference procedures for secondary analyses of clinical trial data when lifetime medical cost is subject to induced censoring. Our model selection methods extend the theory of penalized estimating functions to a calibration regression estimator tailored for this data type. Next, we develop a novel inference procedure for the unpenalized regression estimator using perturbation and resampling theory. We then extend this resampling plan to accommodate regularized coefficient estimation of censored lifetime medical cost and develop postselection inference procedures for the final model. Our methods are motivated by data from Southwest Oncology Group Protocol 9509, a clinical trial of patients with advanced non-small cell lung cancer, and our models of lifetime medical cost are specific to this population. However, the methods presented in this article are built on rather general techniques and could be applied to larger databases as those data become available.

We consider methods for estimating the treatment effect and/or the covariate by treatment interaction effect in a randomized clinical trial under noncompliance with time-to-event outcome. As in Cuzick et al. (2007), assuming that the patient population consists of three (possibly latent) subgroups based on treatment preference: the *ambivalent* group, the *insisters*, and the *refusers*, we estimate the effects among the *ambivalent* group. The parameters have causal interpretations under standard assumptions. The article contains two main contributions. First, we propose a weighted per-protocol (Wtd PP) estimator through incorporating time-varying weights in a proportional hazards model. In the second part of the article, under the model considered in Cuzick et al. (2007), we propose an EM algorithm to maximize a full likelihood (FL) as well as the pseudo likelihood (PL) considered in Cuzick et al. (2007). The E step of the algorithm involves computing the conditional expectation of a linear function of the latent membership, and the main advantage of the EM algorithm is that the risk parameters can be updated by fitting a weighted Cox model using standard software and the baseline hazard can be updated using closed-form solutions. Simulations show that the EM algorithm is computationally much more efficient than directly maximizing the observed likelihood. The main advantage of the Wtd PP approach is that it is more robust to model misspecifications among the *insisters* and *refusers* since the outcome model does not impose distributional assumptions among these two groups.

We propose a new sparse estimation method for Cox (1972) proportional hazards models by optimizing an approximated information criterion. The main idea involves approximating the ℓ0 norm with a continuous or smooth unit dent function. The proposed method bridges best subset selection and regularization by borrowing strength from both: it mimics best subset selection using a penalized likelihood approach, yet needs no tuning parameter. We further reformulate the problem with a reparameterization step, reducing it to one unconstrained, nonconvex, yet smooth programming problem, which can be solved efficiently as in computing the maximum partial likelihood estimator (MPLE). Furthermore, the reparameterization tactic yields an additional advantage in terms of circumventing postselection inference. The oracle property of the proposed method is established. Both simulated experiments and empirical examples are provided for assessment and illustration.
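The "unit dent" idea can be made concrete: replace the exact model size (the number of nonzero coefficients) inside an information criterion with a smooth bump that is 0 at zero and close to 1 away from zero. A sketch with one hypothetical choice of dent function (the paper's exact form may differ):

```python
import numpy as np

def dent(beta, tau=1e-2):
    # smooth "unit dent": ~0 at beta = 0, ~1 once |beta| >> tau, so the sum
    # approximates the model size ||beta||_0 (hypothetical surrogate choice)
    beta = np.asarray(beta, dtype=float)
    return beta**2 / (beta**2 + tau**2)

def approx_bic(neg2loglik, beta, n, tau=1e-2):
    # BIC with the exact count of nonzero coefficients replaced by its
    # smooth surrogate, making the criterion amenable to smooth optimization
    return neg2loglik + np.log(n) * dent(beta, tau).sum()
```

Because the surrogate is smooth, the criterion can be minimized with the same machinery used for the partial likelihood itself, with no tuning parameter to select.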

In population-based cancer studies, it is often of interest to compare cancer survival between different populations. However, in such studies, the exact causes of death are often unavailable or unreliable. Net survival methods were developed to overcome this difficulty. Net survival is the survival that would be observed if the disease under study were the only possible cause of death. The Pohar-Perme estimator (PPE) is a nonparametric, consistent estimator of net survival. In this article, we present a log-rank-type test for comparing net survival functions (as estimated by the PPE) between several groups. We put the test within the counting process framework to introduce the inverse probability weighting procedure required by the PPE. We built a stratified version to control for categorical covariates that affect the outcome. We performed simulation studies to evaluate the performance of the test and illustrate it with an application to real data.

Semi-competing risks data are often encountered in chronic disease follow-up studies that record both nonterminal events (e.g., disease landmark events) and terminal events (e.g., death). Studying the relationship between the nonterminal event and the terminal event can provide insightful information on disease progression. In this article, we propose a new, sensible dependence measure tailored to this interest. We develop a nonparametric estimator, which is general enough to handle both independent right censoring and left truncation. Our strategy of connecting the new dependence measure with quantile regression enables a natural extension to adjust for covariates, with minor additional assumptions imposed. We establish the asymptotic properties of the proposed estimators and develop inference procedures accordingly. Simulation studies suggest good finite-sample performance of the proposed methods. Our proposals are illustrated via an application to Danish diabetes registry data.

This article considers nonparametric methods for studying recurrent disease and death with competing risks. We first point out that comparisons based on the well-known cumulative incidence function can be confounded by different prevalence rates of the competing events, and that comparisons of the conditional distribution of the survival time given the failure event type are more relevant for investigating the prognosis of different patterns of recurrent disease. We then propose nonparametric estimators for the conditional cumulative incidence function as well as the conditional bivariate cumulative incidence function for the bivariate gap times, that is, the time to disease recurrence and the residual lifetime after recurrence. To quantify the association between the two gap times in the competing risks setting, a modified Kendall's tau statistic is proposed. The proposed estimators for the conditional bivariate cumulative incidence distribution and the association measure account for the induced dependent censoring of the second gap time. Uniform consistency and weak convergence of the proposed estimators are established. Hypothesis testing procedures for two-sample comparisons are discussed. Numerical simulation studies with practical sample sizes are conducted to evaluate the performance of the proposed nonparametric estimators and tests. An application to data from a pancreatic cancer study illustrates the methods developed in this article.
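The starting point for the proposed association measure is ordinary Kendall's tau between the two gap times; the article's modification then accounts for the induced dependent censoring. For reference, the unmodified statistic for complete data:

```python
def kendall_tau(u, v):
    # classical Kendall's tau: (concordant - discordant) / total pairs,
    # assuming complete, uncensored observations with no ties
    n = len(u)
    num = 0
    for i in range(n):
        for j in range(i + 1, n):
            du, dv = u[i] - u[j], v[i] - v[j]
            num += (du * dv > 0) - (du * dv < 0)
    return num / (n * (n - 1) / 2)
```

The censoring-adjusted version in the article cannot use raw pairs this way, because a long first gap time mechanically shortens the observable window for the second.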

Next-generation sequencing data, also called high-throughput sequencing data, are recorded as counts, which are generally far from normally distributed. Under the assumption that the counts follow a Poisson log-normal distribution, this article provides a penalized-likelihood framework and an efficient search algorithm to estimate the structure of sparse directed acyclic graphs (DAGs) for multivariate count data. In searching for the solution, we use iterative optimization procedures to estimate the adjacency matrix and the variance matrix of the latent variables. Simulation results show that our proposed method outperforms both the approach that assumes a multivariate normal distribution and the log-transformation approach; it also outperforms the rank-based PC method under sparse-network or hub-network structures. As a real data example, we demonstrate the efficiency of the proposed method in estimating gene regulatory networks in an ovarian cancer study.

Next-generation sequencing technologies now constitute a method of choice to measure gene expression. The data to analyze are read counts, commonly modeled using negative binomial distributions. A key issue in this probabilistic framework is the reliable estimation of the overdispersion parameter, made harder by the limited number of replicates generally available for each gene. Many strategies have been proposed to estimate this parameter, but when differential analysis is the purpose, they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead, we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show through a wide simulation study that the proposed method improves the sensitivity of detecting differentially expressed genes relative to common procedures: it attains the nominal type-I error rate while maintaining high discriminative power between differentially and non-differentially expressed genes. The method is finally illustrated on prostate cancer RNA-Seq data.

Array-based CGH experiments are designed to detect genomic aberrations or regions of DNA copy-number variation that are associated with an outcome, typically a state of disease. Most existing statistical methods target the detection of DNA copy number variations in a single sample or array. We focus on the detection of group-effect variation through the simultaneous study of multiple samples from multiple groups. Rather than using direct segmentation or smoothing techniques, as is common in existing detection methods, we develop a sequential model selection procedure guided by a modified Bayesian information criterion. This approach improves detection accuracy by cumulatively using information across contiguous clones, and has a computational advantage over existing popular detection methods. Our empirical investigation suggests that the performance of the proposed method is superior to that of existing detection methods, in particular in detecting small segments or separating neighboring segments with differential degrees of copy-number variation.

The twin method refers to the use of data from same-sex identical and fraternal twins to estimate the genetic and environmental contributions to a trait or outcome. The standard twin method is the variance component twin method that estimates heritability, the fraction of variance attributed to additive genetic inheritance. The latent class twin method estimates two quantities that are easier to interpret than heritability: the genetic prevalence, which is the fraction of persons in the genetic susceptibility latent class, and the heritability fraction, which is the fraction of persons in the genetic susceptibility latent class with the trait or outcome. We extend the latent class twin method in three important ways. First, we incorporate an additive genetic model to broaden the sensitivity analysis beyond the original autosomal dominant and recessive genetic models. Second, we specify a separate survival model to simplify computations and improve convergence. Third, we show how to easily adjust for covariates by extending the method of propensity scores from a treatment difference to zygosity. Applying the latent class twin method to data on breast cancer among Nordic twins, we estimated a genetic prevalence of 1%, a result with important implications for breast cancer prevention research.

We introduce in this work the Interval Testing Procedure (ITP), a novel inferential technique for functional data. The procedure can be used to test different functional hypotheses, e.g., distributional equality between two or more functional populations, or equality of the mean function of a functional population to a reference. The ITP involves three steps: (i) the representation of data on a (possibly high-dimensional) functional basis; (ii) the test of each possible set of consecutive basis coefficients; (iii) the computation of the adjusted *p*-values associated with each basis component, by means of a new strategy proposed here. We define a new type of error control, the interval-wise control of the family-wise error rate, particularly suited to functional data, and show that the ITP provides such control. A simulation study comparing the ITP with other testing procedures is reported. The ITP is then applied to the analysis of hemodynamic features involved in cerebral aneurysm pathology. The ITP is implemented in the fdatest R package.

Functional principal component analysis (FPCA) is a popular approach to exploring major sources of variation in a sample of random curves. These major sources of variation are represented by functional principal components (FPCs). The intervals where the values of the FPCs are significant are interpreted as the regions where the sample curves show major variation. However, these intervals are often hard for naïve users to identify, because of the vague definition of “significant values”. In this article, we develop a novel penalty-based method to derive FPCs that are nonzero precisely in the intervals where their values are significant, so that the derived FPCs possess better interpretability than FPCs derived by existing methods. To compute the proposed FPCs, we devise an efficient algorithm based on projection deflation techniques. We show that the proposed interpretable FPCs are strongly consistent and asymptotically normal under mild conditions. Simulation studies confirm that, with a competitive performance in explaining variation in the sample curves, the proposed FPCs are more interpretable than their traditional counterparts. This advantage is demonstrated by analyzing two real datasets, namely electroencephalography data and Canadian weather data.

Dynamic treatment regimens (DTRs) recommend treatments based on evolving subject-level data. The optimal DTR is the one that maximizes expected patient outcome, and as such its identification is of primary interest in personalized medicine. When analyzing data from observational studies using semi-parametric approaches, there are two primary components that can be modeled: the expected level of treatment and the expected outcome for a patient given their other covariates. To offer greater flexibility, so-called doubly robust methods have been developed that provide consistent parameter estimators as long as at least one of these two models is correctly specified. However, in practice it can be difficult to be confident that this is the case. Using G-estimation as our example method, we demonstrate how the property of double robustness itself can be used to provide evidence that a specified model is or is not correct. The approach is illustrated through simulation studies as well as data from the Multicenter AIDS Cohort Study.

A dynamic treatment regimen consists of decision rules that recommend how to individualize treatment to patients based on available treatment and covariate history. In many scientific domains, these decision rules are shared across stages of intervention. As an illustrative example, we discuss STAR*D, a multistage randomized clinical trial for treating major depression. Estimating shared decision rules often amounts to estimating parameters, indexing the decision rules, that are shared across stages. In this article, we propose a novel simultaneous estimation procedure for the shared parameters based on Q-learning. We provide an extensive simulation study illustrating the merit of the proposed method over simple competitors in terms of how closely the procedure's treatment allocations match those of the “oracle” procedure, defined as the one that makes treatment recommendations based on the true parameter values rather than their estimates. We also examine the bias and mean squared error of the individual parameter estimates as secondary metrics. Finally, we analyze the STAR*D data using the proposed method.

To evaluate a new therapy versus a control via a randomized, comparative clinical study or a series of trials, due to heterogeneity of the study patient population, a pre-specified, *predictive* enrichment procedure may be implemented to identify an “enrichable” subpopulation. For patients in this subpopulation, the therapy is expected to have a desirable overall risk-benefit profile. To develop and validate such a “therapy-diagnostic co-development” strategy, a three-step procedure may be conducted with three independent data sets from a series of similar studies or a single trial. At the first step, we create various candidate scoring systems based on the baseline information of the patients via, for example, parametric models fit to the first data set. Each individual score reflects an anticipated average treatment difference for future patients who share similar baseline profiles; a large score indicates that these patients tend to benefit from the new therapy. At the second step, a potentially promising, enrichable subgroup is identified using the totality of evidence from these scoring systems. At the final step, we validate such a selection via two-sample inference procedures, assessing the treatment effectiveness statistically and clinically with the third data set, the so-called holdout sample. When the study size is not large, one may combine the first two steps using a “cross-training-evaluation” process. Comprehensive numerical studies are conducted to investigate the operational characteristics of the proposed method. The entire enrichment procedure is illustrated with the data from a cardiovascular trial to evaluate a beta-blocker versus a placebo for treating chronic heart failure patients.

Motivated by an ongoing study to develop a screening test able to identify patients with undiagnosed Sjögren's Syndrome in a symptomatic population, we propose methodology to combine multiple biomarkers and evaluate their performance in a two-stage group sequential design that proceeds as follows: biomarker data are collected from the first-stage samples; the biomarker panel is built and evaluated; if the panel meets pre-specified performance criteria, the study continues to the second stage and the remaining samples are assayed. The design allows us to conserve valuable specimens in the case of inadequate biomarker panel performance. We propose a nonparametric conditional resampling algorithm that uses all the study data to provide unbiased estimates of the biomarker combination rule and the sensitivity of the panel corresponding to specificity of 1-t on the receiver operating characteristic (ROC) curve. The Copas and Corbett (2002) correction, for bias resulting from using the same data to derive the combination rule and estimate the ROC curve, was also evaluated, and an improved version was incorporated. An extensive simulation study was conducted to evaluate finite sample performance and propose guidelines for designing studies of this type. The methods were implemented in the National Cancer Institute's Early Detection Network Urinary PCA3 Evaluation Trial.
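The target of estimation here, sensitivity at specificity 1-t, can be sketched empirically as follows (a minimal illustration of the performance measure only, not the authors' conditional resampling algorithm, and ignoring the optimism correction; all names are illustrative):

```python
import math

def sensitivity_at_specificity(cases, controls, spec):
    """Empirical sensitivity at a target specificity: set the positivity
    threshold at the spec-quantile of the control scores, then report the
    fraction of case scores exceeding it."""
    ctrl = sorted(controls)
    k = math.ceil(spec * len(ctrl)) - 1  # index of the spec-quantile control
    threshold = ctrl[k]
    return sum(s > threshold for s in cases) / len(cases)

# Hypothetical combined-panel scores for controls and cases
controls = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cases = [5.0, 9.5, 10.5, 11.0]
sens = sensitivity_at_specificity(cases, controls, 0.90)  # specificity 1 - t, t = 0.10
```

In practice this naive plug-in estimate is optimistically biased when the same data build the combination rule, which is what motivates the resampling and correction machinery described above.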

Prediction models for disease risk and prognosis play an important role in biomedical research, and evaluating their predictive accuracy in the presence of censored data is of substantial interest. The standard concordance (*c*) statistic has been extended to provide a summary measure of predictive accuracy for survival models. Motivated by a prostate cancer study, we address several issues associated with evaluating survival prediction models based on the *c* statistic, with a focus on estimators using the technique of inverse probability of censoring weighting (IPCW). Compared to the existing work, we provide complete results on the asymptotic properties of the IPCW estimators under the assumption of coarsening at random (CAR), and propose a sensitivity analysis under the mechanism of noncoarsening at random (NCAR). In addition, we extend the IPCW approach as well as the sensitivity analysis to high-dimensional settings. The predictive accuracy of prediction models for cancer recurrence after prostatectomy is assessed by applying the proposed approaches. We find that the estimated predictive accuracy for the models under consideration is sensitive to the NCAR assumption, and we use this information to identify the best predictive model. Finally, we further evaluate the performance of the proposed methods in both settings of low-dimensional and high-dimensional data under CAR and NCAR through simulations.

In oncology, the international WHO and RECIST criteria have allowed the standardization of tumor response evaluation in order to identify the time of disease progression. These semi-quantitative measurements are often used as endpoints in phase II and phase III trials to study the efficacy of new therapies. However, categorizing the continuous tumor size loses information, and these criteria can be challenged by recently developed methods for modeling biomarkers longitudinally. Thus, it is of interest to compare the predictive ability, with respect to overall survival, of cancer progressions based on categorical criteria versus quantitative measures of tumor size (left-censored due to detection limit problems) and/or the appearance of new lesions. We propose a joint model for a simultaneous analysis of three types of data: a longitudinal marker, recurrent events, and a terminal event. The model allows us to determine, in a randomized clinical trial, on which particular component the treatment acts most. A simulation study is performed and shows that the proposed trivariate model is appropriate for practical use. We propose statistical tools that evaluate predictive accuracy for joint models in order to compare our model to models based on categorical criteria and their components. We apply the model to a randomized phase III clinical trial of metastatic colorectal cancer, conducted by the Fédération Francophone de Cancérologie Digestive (FFCD 2000–05 trial), which assigned 410 patients to two therapeutic strategies with multiple successive chemotherapy regimens.

Predicting binary events, such as a newborn having large birthweight, is important for obstetricians in their attempt to reduce both maternal and fetal morbidity and mortality. Such predictions have been a challenge in obstetric practice, where longitudinal ultrasound measurements taken at multiple gestational times during pregnancy may be useful for predicting various poor pregnancy outcomes. The focus of this article is on developing a flexible class of joint models for the multivariate longitudinal ultrasound measurements that can be used for predicting a binary event at birth. A skewed multivariate random effects model is proposed for the ultrasound measurements, and the skewed generalized *t*-link is assumed for the link function relating the binary event and the underlying longitudinal processes. We consider a shared random effect to link the two processes together. Markov chain Monte Carlo sampling is used to carry out Bayesian posterior computation. Several variations of the proposed model are considered and compared via the deviance information criterion, the logarithm of pseudomarginal likelihood, and with a training-test set prediction paradigm. The proposed methodology is illustrated with data from the NICHD Successive Small-for-Gestational-Age Births study, a large prospective fetal growth cohort conducted in Norway and Sweden.

Clinical trials often collect multiple outcomes on each patient, as the treatment may be expected to affect the patient on many dimensions. For example, a treatment for a neurological disease such as ALS is intended to impact several dimensions of neurological function as well as survival. The assessment of treatment on the basis of multiple outcomes is challenging, both in terms of selecting a test and interpreting the results. Several global tests have been proposed, and we provide a general approach to selecting and executing a global test. The tests require minimal parametric assumptions, are flexible about weighting of the various outcomes, and are appropriate even when some or all of the outcomes are censored. The test we propose is based on a simple scoring mechanism applied to each pair of subjects for each endpoint. The pairwise scores are then reduced to a summary score, and a rank-sum test is applied to the summary scores. This can be seen as a generalization of previously proposed nonparametric global tests (e.g., O'Brien, 1984). We discuss the choice of optimal weighting schemes based on power and relative importance of the outcomes. As the optimal weights are generally unknown in practice, we also propose an adaptive weighting scheme and evaluate its performance in simulations. We apply the methods to analyze the impact of a treatment on neurological function and death in an ALS trial.
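The pairwise-scoring construction described above can be sketched as follows, assuming a Gehan-type pairwise score for censored endpoints and fixed equal weights (illustrative names and data; the optimal and adaptive weighting schemes are not shown):

```python
def gehan_score(ti, di, tj, dj):
    """Pairwise score for one possibly censored endpoint (Gehan-type):
    +1 if subject i is known to outlast j, -1 for the reverse, 0 if the
    ordering is indeterminate due to censoring. d = 1 means event observed."""
    if dj == 1 and (ti > tj or (ti == tj and di == 0)):
        return 1
    if di == 1 and (tj > ti or (tj == ti and dj == 0)):
        return -1
    return 0

def summary_scores(endpoints, weights):
    """endpoints: one list of (time, event) pairs per endpoint.
    Reduces the weighted pairwise scores to one summary score per subject."""
    n = len(endpoints[0])
    scores = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for w, ep in zip(weights, endpoints):
                ti, di = ep[i]
                tj, dj = ep[j]
                scores[i] += w * gehan_score(ti, di, tj, dj)
    return scores

def rank_sum(scores, group):
    """Wilcoxon rank-sum statistic (midranks for ties) over the summary
    scores of subjects with group == 1."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return sum(r for r, g in zip(ranks, group) if g == 1)

# Two endpoints, four subjects, all events observed; group 1 = treatment arm
ep1 = [(1, 1), (2, 1), (3, 1), (4, 1)]
ep2 = [(5, 1), (6, 1), (7, 1), (8, 1)]
group = [0, 0, 1, 1]
scores = summary_scores([ep1, ep2], [1.0, 1.0])
W = rank_sum(scores, group)
```

The resulting statistic W is then referred to the usual rank-sum null distribution; with all endpoints uncensored and equal weights, the procedure reduces to an O'Brien-style nonparametric global test.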

Amyotrophic lateral sclerosis (ALS) is a neurodegenerative condition characterized by the progressive deterioration of motor neurons in the cortex and spinal cord. Using an automated robotic microscope platform that enables the longitudinal tracking of thousands of single neurons, we examine the effects of a large library of compounds on modulating the survival of primary neurons expressing a mutation known to cause ALS. The goal of our analysis is to identify the few potentially beneficial compounds among the many assayed, the vast majority of which do not extend neuronal survival. This resembles the large-scale simultaneous inference scenario familiar from microarray analysis, but transferred to the survival analysis setting due to the novel experimental setup. We apply a three-component mixture model to censored survival times of thousands of individual neurons subjected to hundreds of different compounds. The shrinkage induced by our model significantly improves performance in simulations relative to performing treatment-wise survival analysis with subsequent multiple testing adjustment. Our analysis identified compounds that provide insight into potential novel therapeutic strategies for ALS.

Meta-analysis of trans-ethnic genome-wide association studies (GWAS) has proven to be a practical and profitable approach for identifying loci that contribute to the risk of complex diseases. However, the expected genetic effect heterogeneity cannot easily be accommodated through existing fixed-effects and random-effects methods. In response, we propose a novel random effect model for trans-ethnic meta-analysis with flexible modeling of the expected genetic effect heterogeneity across diverse populations. Specifically, we adopt a modified random effect model from the kernel regression framework, in which genetic effect coefficients are random variables whose correlation structure reflects the genetic distances across ancestry groups. In addition, we use the adaptive variance component test to achieve robust power regardless of the degree of genetic effect heterogeneity. Simulation studies show that our proposed method has well-calibrated type I error rates at very stringent significance levels and can improve power over the traditional meta-analysis methods. We reanalyzed the published type 2 diabetes GWAS meta-analysis (Consortium et al., 2014) and successfully identified one additional SNP that clearly exhibits genetic effect heterogeneity across different ancestry groups. Furthermore, our proposed method is computationally scalable to genome-wide datasets: an analysis of one million SNPs requires less than 3 hours.

We discuss the use of the determinantal point process (DPP) as a prior for latent structure in biomedical applications, where inference often centers on the interpretation of latent features as biologically or clinically meaningful structure. Typical examples include mixture models, when the terms of the mixture are meant to represent clinically meaningful subpopulations (of patients, genes, etc.). Another class of examples are feature allocation models. We propose the DPP prior as a repulsive prior on latent mixture components in the first example, and as a prior on feature-specific parameters in the second case. We argue that the DPP is in general an attractive prior model for latent structure when biologically relevant interpretation of such structure is desired. We illustrate the advantages of the DPP prior in three case studies, including inference in mixture models for magnetic resonance images (MRI) and for protein expression, and a feature allocation model for gene expression using data from The Cancer Genome Atlas. An important part of our argument is the availability of efficient and straightforward posterior simulation methods. We implement a variation of reversible jump Markov chain Monte Carlo simulation for inference under the DPP prior, using a density with respect to the unit rate Poisson process.
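The repulsion property that makes the DPP attractive as a prior on mixture components is visible directly in the L-ensemble form, where the probability of a subset S is proportional to det(L_S): subsets of mutually similar items (large off-diagonal kernel entries) receive low probability. A minimal sketch, assuming a hypothetical 3-item similarity kernel (not the posterior samplers used in the paper):

```python
def det(M):
    """Determinant by Laplace expansion; fine for the small matrices here."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for c in range(n):
        minor = [row[:c] + row[c + 1:] for row in M[1:]]
        total += ((-1) ** c) * M[0][c] * det(minor)
    return total

def dpp_weight(L, S):
    """Unnormalized L-ensemble probability of subset S: det(L_S)."""
    sub = [[L[i][j] for j in S] for i in S]
    return det(sub) if S else 1.0

# Hypothetical kernel over three candidate components:
# items 0 and 1 are nearly redundant, item 2 is distinct
L = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]

w_similar = dpp_weight(L, [0, 1])     # 1 - 0.81 = 0.19
w_dissimilar = dpp_weight(L, [0, 2])  # 1 - 0.01 = 0.99
```

The dissimilar pair is roughly five times more probable a priori than the redundant one, which is exactly the behavior one wants when mixture components are meant to represent distinct subpopulations.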

Potential reductions in laboratory assay costs afforded by pooling equal aliquots of biospecimens have long been recognized in disease surveillance and epidemiological research and, more recently, have motivated design and analytic developments in regression settings. For example, Weinberg and Umbach (1999, *Biometrics* **55**, 718–726) provided methods for fitting set-based logistic regression models to case-control data when a continuous exposure variable (e.g., a biomarker) is assayed on pooled specimens. We focus on improving estimation efficiency by utilizing available subject-specific information at the pool allocation stage. We find that a strategy that we call “(y,**c**)-pooling,” which forms pooling sets of individuals within strata defined jointly by the outcome and other covariates, provides more precise estimation of the risk parameters associated with those covariates than does pooling within strata defined only by the outcome. We review the approach to set-based analysis through offsets developed by Weinberg and Umbach in a recent correction to their original paper. We propose a method for variance estimation under this design and use simulations and a real-data example to illustrate the precision benefits of (y,**c**)-pooling relative to y-pooling. We also note and illustrate that set-based models permit estimation of covariate interactions with exposure.

Vaccine-induced protection may not be homogeneous across individuals. It is possible that a vaccine gives complete protection for a portion of individuals, while the rest acquire only incomplete (leaky) protection of varying magnitude. If vaccine efficacy is estimated under incorrect assumptions about such individual-level heterogeneity, the resulting estimates may be difficult to interpret. For instance, population-level predictions based on such estimates may be biased. We consider the problem of estimating heterogeneous vaccine efficacy against an infection that can be acquired multiple times (susceptible-infected-susceptible model). The estimation is based on a limited number of repeated measurements of the current status of each individual, a situation commonly encountered in practice. We investigate how the placement of consecutive samples affects the estimability and efficiency of vaccine efficacy parameters. The same sampling frequency may not be optimal for efficient estimation of all components of heterogeneous vaccine protection. However, we suggest practical guidelines allowing estimation of all components. For situations in which the estimability of individual components fails, we suggest using summary measures of vaccine efficacy.

Zero-inflated regression models have emerged as a popular tool within the parametric framework to characterize count data with excess zeros. Despite their increasing popularity, much of the literature on real applications of these models has centered around the latent class formulation, where the mean response of the so-called at-risk or susceptible population and the susceptibility probability are both related to covariates. While this formulation in some instances provides an interesting representation of the data, it often fails to produce easily interpretable covariate effects on the overall mean response. In this article, we propose two approaches that circumvent this limitation. The first approach consists of estimating the effect of covariates on the overall mean from the assumed latent class models, while the second approach formulates a model that directly relates the overall mean to covariates. Our results are illustrated by extensive numerical simulations and an application to an oral health study on low-income African-American children, where the overall mean model is used to evaluate the effect of sugar consumption on caries indices.
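The interpretability problem can be sketched for a zero-inflated Poisson model (hypothetical parameter values; the paper's estimation procedures are not shown): the overall mean is E[Y|x] = (1 - p(x)) μ(x), so a covariate's effect on the overall mean mixes its effect on the at-risk mean μ with its effect on the structural-zero probability p.

```python
import math

def zip_overall_mean(x, beta, gamma):
    """Overall mean of a zero-inflated Poisson at covariate vector x:
    E[Y | x] = (1 - p(x)) * mu(x), where p is the structural-zero
    (non-susceptibility) probability with logit p = gamma'x, and mu is
    the at-risk mean with log mu = beta'x."""
    mu = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    p = 1.0 / (1.0 + math.exp(-sum(g * xi for g, xi in zip(gamma, x))))
    return (1.0 - p) * mu

# Hypothetical coefficients: the covariate multiplies the at-risk mean
# by 1.5 but also raises the structural-zero probability
beta = [math.log(2.0), math.log(1.5)]  # intercept, slope on log at-risk mean
gamma = [0.0, 1.0]                     # intercept, slope on logit p
m0 = zip_overall_mean([1, 0], beta, gamma)  # (1 - 0.5) * 2 = 1.0
m1 = zip_overall_mean([1, 1], beta, gamma)
```

Here the at-risk mean ratio for a one-unit covariate change is exp(beta[1]) = 1.5, yet the overall mean actually decreases (m1 < m0) because the structural-zero probability rises; this divergence is what motivates modeling the overall mean directly.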

While there are many validated prognostic classifiers used in practice, often their accuracy is modest and heterogeneity in clinical outcomes exists in one or more risk subgroups. Newly available markers, such as genomic mutations, may be used to improve the accuracy of an existing classifier by reclassifying patients from a heterogeneous group into a higher or lower risk category. The statistical tools typically applied to develop the initial classifiers are not easily adapted toward this reclassification goal. In this article, we develop a new method designed to refine an existing prognostic classifier by incorporating new markers. The two-stage algorithm, called Boomerang, first searches for modifications of the existing classifier that increase the overall predictive accuracy and then merges the resulting groups into a prespecified number of risk groups. Resampling techniques are proposed to assess the improvement in predictive accuracy when an independent validation data set is not available. The performance of the algorithm is assessed under various simulation scenarios where the marker frequency, degree of censoring, and total sample size are varied. The results suggest that the method selects few false positive markers and is able to improve the predictive accuracy of the classifier in many settings. Lastly, the method is illustrated on an acute myeloid leukemia data set, where a new refined classifier incorporates four new mutations into the existing three-category classifier and is validated on an independent data set.

Li, Fine, and Brookhart (2015) presented an extension of the two-stage least squares (2SLS) method for additive hazards models which requires an assumption that the censoring distribution is unrelated to the endogenous exposure variable. We present another extension of 2SLS that can address this limitation.