The assessment of patients’ functional status across the continuum of care requires a common patient assessment tool. However, the assessment tools used in different health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected for patients who choose home health care provided by home health agencies. All three instruments include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary across settings. We frame the equating of different health assessment questionnaires as a missing data problem, and propose a variant of the predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared the proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered and in most of the experimental conditions examined, the proposed approach provides valid inferences, with generally better coverage, smaller bias, and shorter interval estimates. The proposed method is further illustrated using a real data set.
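The IRT-based matching metric is specific to this proposal, but the predictive mean matching backbone is standard. A minimal sketch of generic PMM with a linear predictor standing in for the IRT model (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pmm_impute(y, X, missing, k=5):
    """Basic predictive mean matching: regress y on X among observed
    cases, then impute each missing case with the observed y of a
    randomly chosen donor among the k nearest predicted means."""
    obs = ~missing
    Z = np.column_stack([np.ones(len(y)), X])      # add intercept
    beta, *_ = np.linalg.lstsq(Z[obs], y[obs], rcond=None)
    yhat = Z @ beta
    y_imp = y.copy()
    y_obs, yhat_obs = y[obs], yhat[obs]
    for i in np.where(missing)[0]:
        donors = np.argsort(np.abs(yhat_obs - yhat[i]))[:k]
        y_imp[i] = y_obs[rng.choice(donors)]
    return y_imp

# toy data: linear signal, first 40 responses missing
X = rng.normal(size=200)
y = 2.0 + 1.5 * X + rng.normal(scale=0.5, size=200)
missing = np.zeros(200, dtype=bool)
missing[:40] = True
y_miss = np.where(missing, np.nan, y)
y_imp = pmm_impute(y_miss, X, missing)
print(np.isnan(y_imp).sum())  # → 0: every gap filled with an observed value
```

Because imputed values are drawn from observed donors, PMM preserves the empirical distribution of the item responses, which is the feature the article builds on when matching across questionnaires.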

An important goal of censored quantile regression is to provide reliable predictions of survival quantiles, which are often reported in practice to offer robust and comprehensive biomedical summaries. However, formal methods for evaluating and comparing *working* quantile regression models in terms of their performance in predicting survival quantiles have been lacking, especially when the working models are subject to model mis-specification. In this article, we propose a sensible and rigorous framework to fill this gap. We introduce and justify a predictive performance measure defined based on the check loss function. We derive estimators of the proposed predictive performance measure and study their distributional properties and the corresponding inference procedures. More importantly, we develop model comparison procedures that enable thorough evaluations of model predictive performance among nested or non-nested models. Our proposals properly accommodate random censoring of the survival outcome and the realistic complication of model mis-specification, and thus are generally applicable. Extensive simulations and a real data example demonstrate satisfactory performance of the proposed methods in realistic settings.
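The check loss referred to here is the standard quantile-regression (pinball) loss. A small numerical illustration of the property that motivates its use as a predictive performance measure, ignoring censoring entirely:

```python
import numpy as np

def check_loss(y, q, tau):
    """Check (pinball) loss: the loss whose population minimizer is the
    tau-th conditional quantile."""
    u = y - q
    return np.mean(u * (tau - (u < 0)))

rng = np.random.default_rng(1)
y = rng.normal(size=100_000)
for tau in (0.25, 0.5, 0.9):
    q_star = np.quantile(y, tau)
    # moving the predictor away from the tau-quantile increases the loss
    assert check_loss(y, q_star, tau) <= check_loss(y, q_star + 0.3, tau)
    assert check_loss(y, q_star, tau) <= check_loss(y, q_star - 0.3, tau)
print("check loss is minimized at the empirical quantile")
```

The article's contribution is to estimate this kind of loss and compare it across working models when the outcome is randomly censored, which the plain average above does not handle.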

This article investigates a generalized semiparametric varying-coefficient model for longitudinal data that can flexibly model three types of covariate effects: time-constant effects, time-varying effects, and covariate-varying effects. Different link functions can be selected to provide a rich family of models for longitudinal data. The model assumes that the time-varying effects are unspecified functions of time and the covariate-varying effects are parametric functions of an exposure variable specified up to a finite number of unknown parameters. The estimation procedure is developed using local linear smoothing and profile weighted least squares estimation techniques. Hypothesis testing procedures are developed to test the parametric functions of the covariate-varying effects. The asymptotic distributions of the proposed estimators are established. A working formula for bandwidth selection is discussed and examined through simulations. Our simulation study shows that the proposed methods have satisfactory finite sample performance. The proposed methods are applied to the ACTG 244 clinical trial of HIV-infected patients treated with Zidovudine (ZDV) to examine the effects of antiretroviral treatment switching before and after HIV develops the T215Y/F drug resistance mutation. Our analysis shows benefits of switching to combination therapies, compared with continuing ZDV monotherapy, both before and after development of the 215 mutation.

Prognostic survival models are commonly evaluated in terms of both their calibration and their discrimination. Comparing observed and predicted survival curves can assess calibration, while discrimination is typically summarized through comparison of the properties of “cases” or subjects who experience an event, and the properties of “controls” represented by event-free individuals. For binary data, discrimination is characterized either by using the relative ranks of cases and controls and a receiver operating characteristic (ROC) curve, or by summarizing the magnitude of risk placed on cases and controls through calculation of the discrimination slope (DS). In this article, we propose a risk-based measure of time-varying discrimination that generalizes the discrimination slope to allow use with incident events and hazard models. We refer to the new measure as the hazard discrimination summary (HDS) since it compares the relative risk among incident cases to their associated dynamic risk set controls. We introduce both a model-based estimation procedure that adopts the Cox model, and an alternative approach that locally relaxes the proportional hazards assumption. We illustrate the proposed methods using both a benchmark survival data set, and an oncology study where primary interest is in the time-varying performance of candidate biomarkers.

Prostate cancer patients are closely followed after the initial therapy, and salvage treatment may be prescribed to prevent or delay cancer recurrence. The salvage treatment decision is usually made dynamically based on the patient's evolving history of disease status and other time-dependent clinical covariates. A multi-center prostate cancer observational study has provided us data on longitudinal prostate specific antigen (PSA) measurements, time-varying salvage treatment, and cancer recurrence time. These data enable us to estimate the best dynamic regime of salvage treatment, while accounting for the complicated confounding by time-varying covariates present in the data. A random-forest-based method is used to model the probability of regime adherence, and inverse probability weights are used to account for the complexity of selection bias in regime adherence. The optimal regime is then identified as the one with the largest restricted mean survival time. We conduct simulation studies with different PSA trends to mimic both simple and complex regime adherence mechanisms. The proposed method can efficiently accommodate complex and possibly unknown adherence mechanisms, and it is robust to cases where the proportional hazards assumption is violated. We apply the method to data collected from the observational study and estimate the best salvage treatment regime in managing the risk of prostate cancer recurrence.
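The regime-ranking criterion, restricted mean survival time (RMST), is the area under the survival curve up to a horizon tau. A bare-bones Kaplan–Meier RMST computation (a generic sketch, without the weighting and regime machinery of the article):

```python
import numpy as np

def km_rmst(time, event, tau):
    """Restricted mean survival time up to tau: the area under the
    Kaplan-Meier survival curve on [0, tau]."""
    order = np.argsort(time)
    t, d = time[order], event[order]
    uniq = np.unique(t[d == 1])                 # distinct event times
    S, surv_times, surv_probs = 1.0, [0.0], [1.0]
    for u in uniq:
        at_risk = np.sum(t >= u)
        deaths = np.sum((t == u) & (d == 1))
        S *= 1 - deaths / at_risk               # KM product-limit update
        surv_times.append(u)
        surv_probs.append(S)
    # integrate the step function on [0, tau]
    grid = np.clip(np.array(surv_times + [tau]), 0, tau)
    probs = np.array(surv_probs + [surv_probs[-1]])
    return np.sum(np.diff(grid) * probs[:-1])

rng = np.random.default_rng(2)
t = rng.exponential(scale=2.0, size=500)
e = np.ones(500, dtype=int)
# with no censoring, RMST equals the sample mean of min(T, tau)
print(np.isclose(km_rmst(t, e, 3.0), np.mean(np.minimum(t, 3.0))))  # → True
```

In the article's setting the Kaplan–Meier weights would additionally carry the inverse probability of regime adherence; the integral itself is unchanged.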

It is standard practice for covariates to enter a parametric model through a single distributional parameter of interest, for example, the scale parameter in many standard survival models. Indeed, the well-known proportional hazards model is of this kind. In this article, we discuss a more general approach whereby covariates enter the model through *more than one* distributional parameter simultaneously (e.g., scale *and* shape parameters). We refer to this practice as “multi-parameter regression” (MPR) modeling and explore its use in a survival analysis context. We find that multi-parameter regression leads to more flexible models which can offer greater insight into the underlying data generating process. To illustrate the concept, we consider the two-parameter Weibull model which leads to time-dependent hazard ratios, thus relaxing the typical proportional hazards assumption and motivating a new test of proportionality. A novel variable selection strategy is introduced for such multi-parameter regression models. It accounts for the correlation arising between the estimated regression coefficients in two or more linear predictors—a feature which has not been considered by other authors in similar settings. The methods discussed have been implemented in the mpr package in R.
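To see why covariates in the shape parameter break proportionality: for the two-parameter Weibull, h(t|x) = lam * gam * t^(gam - 1) with lam = exp(x'beta) and gam = exp(x'alpha), so the hazard ratio between two covariate profiles carries a factor t^(gam1 - gam0). A numerical illustration with made-up coefficients:

```python
import numpy as np

def weibull_hazard(t, x, beta, alpha):
    """Weibull hazard under multi-parameter regression: both the scale
    (lam) and the shape (gam) depend on covariates x."""
    lam, gam = np.exp(x @ beta), np.exp(x @ alpha)
    return lam * gam * t ** (gam - 1)

# hypothetical coefficients: treatment shifts both scale and shape
beta = np.array([0.0, -0.5])      # [intercept, treatment] for scale
alpha = np.array([0.2, 0.3])      # [intercept, treatment] for shape
t = np.array([0.5, 1.0, 2.0, 4.0])
hr = (weibull_hazard(t, np.array([1.0, 1.0]), beta, alpha)
      / weibull_hazard(t, np.array([1.0, 0.0]), beta, alpha))
print(np.round(hr, 2))  # hazard ratio changes with t: proportionality relaxed
```

Setting the treatment coefficient in alpha to zero recovers a constant hazard ratio, which is why testing that coefficient yields a test of proportionality.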

Researchers estimating causal effects are increasingly challenged with decisions on how to best control for a potentially high-dimensional set of confounders. Typically, a single propensity score model is chosen and used to adjust for confounding, while the uncertainty surrounding which covariates to include in the propensity score model is often ignored, and failure to include even one important confounder will result in bias. We propose a practical and generalizable approach that overcomes these limitations through the use of model averaging. We develop and evaluate this approach in the context of double robust estimation. More specifically, we introduce the model averaged double robust (MA-DR) estimators, which account for model uncertainty in both the propensity score and outcome model through the use of model averaging. The MA-DR estimators are defined as weighted averages of double robust estimators, where each double robust estimator corresponds to a specific choice of the outcome model and the propensity score model. The MA-DR estimators extend the desirable double robustness property by achieving consistency under the much weaker assumption that either the true propensity score model or the true outcome model be within a specified, possibly large, class of models. Using simulation studies, we also assessed small-sample properties, and found that MA-DR estimators can reduce mean squared error substantially, particularly when the set of potential confounders is large relative to the sample size. We apply the methodology to estimate the average causal effect of temozolomide plus radiotherapy versus radiotherapy alone on one-year survival in a cohort of 1887 Medicare enrollees who were diagnosed with glioblastoma between June 2005 and December 2009.
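Each component of the MA-DR average is an ordinary augmented IPW (double robust) estimator. A minimal simulated sketch of one such component, showing consistency when the propensity score is correct even though the outcome model is deliberately wrong:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))                  # true propensity score
a = rng.binomial(1, ps)
y = a + x + rng.normal(size=n)             # true average causal effect = 1

def aipw(y, a, ps, m1, m0):
    """Augmented IPW (double robust) estimator of the average causal
    effect: consistent if either the propensity score ps or the outcome
    regressions m1, m0 are correctly specified."""
    return (np.mean(a * (y - m1) / ps + m1)
            - np.mean((1 - a) * (y - m0) / (1 - ps) + m0))

# correct propensity score, deliberately wrong outcome model (zero):
est = aipw(y, a, ps, m1=np.zeros(n), m0=np.zeros(n))
print(round(est, 2))  # close to 1
```

MA-DR averages many such estimators, one per (propensity model, outcome model) pair, with data-driven weights; the sketch shows only a single pair.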

We study threshold regression models that allow the relationship between the outcome and a covariate of interest to change across a threshold value in the covariate. In particular, we focus on continuous threshold models, which experience no jump at the threshold. Continuous threshold regression functions can provide a useful summary of the association between outcome and the covariate of interest, because they offer a balance between flexibility and simplicity. Motivated by collaborative works in studying immune response biomarkers of transmission of infectious diseases, we study estimation of continuous threshold models in this article with particular attention to inference under model misspecification. We derive the limiting distribution of the maximum likelihood estimator, and propose both Wald and test-inversion confidence intervals. We evaluate finite sample performance of our methods, compare them with bootstrap confidence intervals, and provide guidelines for practitioners to choose the most appropriate method in real data analysis. We illustrate the application of our methods with examples from the HIV-1 immune correlates studies.

Understanding the complex interplay among protein coding genes and regulatory elements requires rigorous interrogation with analytic tools designed for discerning the relative contributions of overlapping genomic regions. To this end, we offer a novel application of Bayesian variable selection (BVS) for classifying genomic class level associations using existing large meta-analysis summary level resources. This approach is applied using the expectation maximization variable selection (EMVS) algorithm to typed and imputed SNPs across 502 protein coding genes (PCGs) and 220 long intergenic non-coding RNAs (lncRNAs) that overlap 45 known loci for coronary artery disease (CAD), using publicly available Global Lipids Genetics Consortium (GLGC) (Teslovich et al., 2010; Willer et al., 2013) meta-analysis summary statistics for low-density lipoprotein cholesterol (LDL-C). The analysis reveals 33 PCGs and three lncRNAs across 11 loci with posterior probabilities of at least 50% for inclusion in an additive model of association. The findings are consistent with previous reports, while providing some new insight into the architecture of LDL-cholesterol to be investigated further. As genomic taxonomies continue to evolve, additional classes, such as enhancer elements and splicing regions, can easily be layered into the proposed analysis framework. Moreover, application of this approach to alternative publicly available meta-analysis resources, or more generally as a post-analytic strategy to further interrogate regions identified through single point analysis, is straightforward. All coding examples are implemented in R version 3.2.1 and provided as supplemental material.

Quantitative trait locus analysis has been used as an important tool to identify markers where the phenotype or quantitative trait is linked with the genotype. Most existing tests for single locus association with quantitative traits aim at detecting mean differences across genotypic groups. However, recent research has revealed functional genetic loci that affect the variance of traits, known as variability-controlling quantitative trait loci. In addition, it has been suggested that many genotypes have both mean and variance effects, while the mean effects or variance effects alone may not be strong enough to be detected. Existing methods accounting for unequal variances include Levene's test, the Lepage test, and the *D*-test, but these suffer from lack of robustness or lack of power. We propose a semiparametric model and a novel pairwise conditional likelihood ratio test. Specifically, the semiparametric model is designed to identify combined differences in higher moments among genotypic groups. The pairwise likelihood is constructed through a conditioning procedure that eliminates the unknown reference distribution. We show that the proposed pairwise likelihood ratio test has a simple asymptotic chi-square distribution, which does not require permutation or bootstrap procedures. Simulation studies show that the proposed test controls Type I error well and has competitive power in identifying differences across genotypic groups. In addition, the proposed test has certain robustness to model mis-specification. The proposed test is illustrated by an example of identifying both mean and variance effects on body mass index using the Framingham Heart Study data.

Genetic risk prediction is an important component of individualized medicine, but prediction accuracies remain low for many complex diseases. A fundamental limitation is the sample sizes of the studies on which the prediction algorithms are trained. One way to increase the effective sample size is to integrate information from previously existing studies. However, it can be difficult to find existing data that examine the target disease of interest, especially if that disease is rare or poorly studied. Furthermore, individual-level genotype data from these auxiliary studies are typically difficult to obtain. This article proposes a new approach to integrative genetic risk prediction of complex diseases with binary phenotypes. It accommodates possible heterogeneity in the genetic etiologies of the target and auxiliary diseases using a tuning parameter-free non-parametric empirical Bayes procedure, and can be trained using only auxiliary summary statistics. Simulation studies show that the proposed method can provide superior predictive accuracy relative to non-integrative as well as integrative classifiers. The method is applied to a recent study of pediatric autoimmune diseases, where it substantially reduces prediction error for certain target/auxiliary disease combinations. The proposed method is implemented in the R package ssa.

Alternating logistic regressions is an estimating equations procedure used to model marginal means of correlated binary outcomes while simultaneously specifying a within-cluster association model for log odds ratios of outcome pairs. A recent generalization of alternating logistic regressions, known as orthogonalized residuals, is extended to incorporate finite-sample adjustments in the estimation of the log odds ratio model parameters when the number of clusters is moderately small. Bias adjustments are made both in the sandwich variance estimators and in the estimating equations for the association parameters. The proposed methods are demonstrated in a repeated cross-sectional cluster trial to reduce underage drinking in the United States, and in an analysis of dental caries incidence in a cluster randomized trial of 30 aboriginal communities in the Northern Territory of Australia. A simulation study demonstrates improved performance, with respect to bias and coverage, of the resulting estimators relative to those based on the uncorrected orthogonalized residuals procedure.

For a fallback randomized clinical trial design with a marker, Choai and Matsui (2015, *Biometrics* 71, 25–32) estimate the bias of the estimator of the treatment effect in the marker-positive subgroup conditional on the treatment effect not being statistically significant in the overall population. This is used to construct and examine conditionally bias-corrected estimators of the treatment effect for the marker-positive subgroup. We argue that it may not be appropriate to correct for conditional bias in this setting. Instead, we consider the unconditional bias of estimators of the treatment effect for marker-positive patients.

In precision medicine, a patient is treated with targeted therapies that are predicted to be effective based on the patient's baseline characteristics such as biomarker profiles. Oftentimes, patient subgroups are unknown and must be learned through inference using observed data. We present SCUBA, a Subgroup ClUster-based Bayesian Adaptive design that aims to fulfill two simultaneous goals in a clinical trial: 1) to enrich the allocation of patients in each subgroup to their desirable precision treatments, and 2) to report multiple subgroup-treatment pairs (STPs). Using random partitions and semiparametric Bayesian models, SCUBA provides a coherent and probabilistic assessment of potential patient subgroups and their associated targeted therapies. Each STP can then be used for future confirmatory studies for regulatory approval. Following extensive simulation studies, we present an application of SCUBA to an innovative clinical trial in gastroesophageal cancer.

Conventional distance sampling (CDS) methods assume that animals are uniformly distributed in the vicinity of lines or points. But when animals move in response to observers before detection, or when lines or points are not located randomly, this assumption may fail. By formulating distance sampling models as survival models, we show that using time to first detection in addition to perpendicular distance (line transect surveys) or radial distance (point transect surveys) allows estimation of detection probability, and hence density, when animal distribution in the vicinity of lines or points is not uniform and is unknown. We also show that times to detection can provide information about failure of the CDS assumption that detection probability is 1 at distance zero. We obtain a maximum likelihood estimator of line transect survey detection probability and effective strip half-width using times to detection, and we investigate its properties by simulation in situations where animals are nonuniformly distributed and their distribution is unknown. The estimator is found to perform well when detection probability at distance zero is 1. It allows unbiased estimates of density to be obtained in this case from surveys in which there has been responsive movement prior to animals coming within detectable range. When responsive movement continues within detectable range, estimates may be biased but are likely less biased than estimates from methods that assume no responsive movement. We illustrate by estimating primate density from a line transect survey in which animals are known to avoid the transect line, and from a shipboard survey of dolphins that are attracted to the ship.

In randomized studies involving severely ill patients, functional outcomes are often unobserved due to missed clinic visits, premature withdrawal, or death. It is well known that if these unobserved functional outcomes are not handled properly, biased treatment comparisons can be produced. In this article, we propose a procedure for comparing treatments that is based on a composite endpoint that combines information on both the functional outcome and survival. We further propose a missing data imputation scheme and sensitivity analysis strategy to handle the unobserved functional outcomes not due to death. Illustrations of the proposed method are given by analyzing data from a recent non-small cell lung cancer clinical trial and a recent trial of sedation interruption among mechanically ventilated patients.

Survival model construction can be guided by goodness-of-fit techniques as well as measures of predictive strength. Here, we aim to bring these distinct techniques together within a single framework. The goal is to determine how best to characterize and code the effects of the variables, in particular time dependencies, when taken either singly or in combination with other related covariates. Simple graphical techniques can provide an immediate visual indication of goodness-of-fit but, in cases of departure from model assumptions, will point in the direction of a more involved and richer alternative model. These techniques are intuitive, and this intuition is backed up by formal theorems that underlie the process of building richer models from simpler ones. Measures of predictive strength are used in conjunction with these goodness-of-fit techniques, and, again, formal theorems show that these measures can help identify models closest to the unknown non-proportional hazards mechanism that we can suppose generates the observations. Illustrations from studies in breast cancer show how these tools can help guide the practical problem of efficient model construction for survival data.

We propose a subgroup identification approach for inferring optimal and interpretable personalized treatment rules with high-dimensional covariates. Our approach is based on a two-step greedy tree algorithm to pursue signals in a high-dimensional space. In the first step, we transform the treatment selection problem into a weighted classification problem that can utilize tree-based methods. In the second step, we adopt a newly proposed tree-based method, known as reinforcement learning trees, to detect features involved in the optimal treatment rules and to construct binary splitting rules. The method is further extended to right censored survival data by using the accelerated failure time model and introducing double weighting to the classification trees. The performance of the proposed method is demonstrated via simulation studies, as well as analyses of the Cancer Cell Line Encyclopedia (CCLE) data and the Tamoxifen breast cancer data.
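The first step, recasting treatment selection as weighted classification, can be sketched with a generic outcome-weighted classifier; here a shallow CART tree stands in for the article's reinforcement learning trees, and the simulated data are purely illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=(n, 2))
a = rng.binomial(1, 0.5, size=n)          # randomized treatment assignment
# true optimal rule: treat iff x[:, 0] > 0; reward is higher when followed
y = 1.0 + 0.8 * (a == (x[:, 0] > 0)) + 0.1 * rng.normal(size=n)

# outcome-weighted classification: label = received treatment,
# weight = outcome / propensity (propensity is 0.5 by randomization)
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(x, a, sample_weight=y / 0.5)
agree = np.mean(clf.predict(x) == (x[:, 0] > 0))
print(round(agree, 2))  # high agreement with the true rule
```

The intuition: patients whose observed treatment produced a large outcome get large weights, so the classifier is pushed toward reproducing the beneficial assignments; the article replaces the plain tree with reinforcement learning trees and adds double weighting for censored outcomes.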

In many biomedical studies that involve correlated data, an outcome is often repeatedly measured for each individual subject, and the number of these measurements is also treated as an observed outcome. This type of data has been referred to as multivariate random length data by Barnhart and Sampson (1995). A common approach to handling such data is to jointly model the multiple measurements and the random length. In previous literature, a key assumption is multivariate normality of the multiple measurements. Motivated by a reproductive study, we propose a new copula-based joint model that relaxes the normality assumption. Specifically, we adopt the Clayton–Oakes model for the multiple measurements, with flexible marginal distributions specified as semiparametric transformation models. The random length is modeled via a generalized linear model. We develop an approximate EM algorithm to derive parameter estimators, with standard errors obtained through bootstrapping; the finite-sample performance of the proposed method is investigated using simulation studies. We apply our method to the Mount Sinai Study of Women Office Workers (MSSWOW), where women were prospectively followed for 1 year in a study of fertility.

In a sensitivity analysis in an observational study with a binary outcome, is it better to use all of the data or to focus on subgroups that are expected to experience the largest treatment effects? The answer depends on features of the data that may be difficult to anticipate: a trade-off between unknown effect sizes and known sample sizes. We propose a sensitivity analysis for an adaptive test similar to the Mantel–Haenszel test. The adaptive test performs two highly correlated analyses, one focused analysis using a subgroup and one combined analysis using all of the data, correcting for multiple testing using the joint distribution of the two test statistics. Because the two component tests are highly correlated, this correction for multiple testing is small compared with, for instance, the Bonferroni inequality. The test has the maximum design sensitivity of its two component tests. A simulation evaluates the power of a sensitivity analysis using the adaptive test. Two examples are presented. An R package, sensitivity2x2xk, implements the procedure.
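The size of the correction can be computed directly: for two standardized statistics that are bivariate normal with correlation rho, the critical value for max(Z1, Z2) lies between the single-test value and the Bonferroni value, shrinking toward the former as rho grows. A sketch of that calculation (illustrative, not the sensitivity-analysis machinery itself):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import brentq

def adaptive_critical_value(rho, alpha=0.05):
    """One-sided critical value c with P(max(Z1, Z2) > c) = alpha when
    (Z1, Z2) are standard bivariate normal with correlation rho."""
    mvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    return brentq(lambda c: 1 - mvn.cdf([c, c]) - alpha, 0.0, 5.0)

# rho = 0 is close to Bonferroni; rho = 0.9 is close to a single test
for rho in (0.0, 0.9):
    print(rho, round(adaptive_critical_value(rho), 3))
```

For comparison, the single-test value is about 1.645 and the one-sided Bonferroni value about 1.96; with highly correlated component tests the adaptive critical value sits near the lower end of that range.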

Integration of genomic data from multiple platforms has the capability to increase precision, accuracy, and statistical power in the identification of prognostic biomarkers. A fundamental problem faced in many multi-platform studies is unbalanced sample sizes due to the inability to obtain measurements from all the platforms for all the patients in the study. We have developed a novel Bayesian approach that integrates multi-regression models to identify a small set of biomarkers that can accurately predict time-to-event outcomes. This method fully exploits the amount of available information across platforms and does not exclude any of the subjects from the analysis. Through simulations, we demonstrate the utility of our method and compare its performance to that of methods that do not borrow information across regression models. Motivated by The Cancer Genome Atlas kidney renal cell carcinoma dataset, our methodology provides novel insights missed by non-integrative models.

Personalized cancer therapy requires clinical trials with smaller sample sizes than trials involving unselected populations that have not been divided into biomarker subgroups. The use of exponential survival modeling for survival endpoints has the potential to gain 35% efficiency or save 28% of the required sample size (Miller, 1983), making personalized therapy trials more feasible. However, exponential survival modeling has not been fully accepted in cancer research practice due to uncertainty about whether or not the exponential assumption holds. We propose a test for identifying violations of the exponential assumption using a reduced piecewise exponential approach. Compared with an alternative goodness-of-fit test, which suffers from inflation of the type I error rate under various censoring mechanisms, the proposed test maintains the correct type I error rate. We conduct power analyses using simulated data based on different types of cancer survival distributions in the SEER registry database, and demonstrate the implementation of this approach in existing cancer clinical trials.
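A generic way to check the exponential assumption is to nest it in a piecewise exponential model and apply a likelihood ratio test; the following sketch illustrates that construction (not the article's reduced piecewise exponential test, and it assumes a pre-chosen cut point):

```python
import numpy as np
from scipy.stats import chi2

def exp_vs_piecewise_lrt(time, event, cut):
    """Likelihood ratio test of a constant hazard against a two-piece
    exponential hazard changing at `cut`."""
    def profile_loglik(d, e):  # e events over total exposure d
        return e * np.log(e / d) - e if e > 0 else 0.0
    d1 = np.sum(np.minimum(time, cut))            # exposure below the cut
    e1 = np.sum(event[time <= cut])
    d2 = np.sum(np.maximum(time - cut, 0.0))      # exposure above the cut
    e2 = np.sum(event[time > cut])
    lr = 2 * (profile_loglik(d1, e1) + profile_loglik(d2, e2)
              - profile_loglik(d1 + d2, e1 + e2))
    return lr, chi2.sf(lr, df=1)

rng = np.random.default_rng(4)
t0 = rng.exponential(1.0, size=1000)              # constant hazard: null holds
lr0, p0 = exp_vs_piecewise_lrt(t0, np.ones(1000, dtype=int), cut=1.0)
t1 = rng.weibull(2.0, size=1000)                  # increasing hazard
lr1, p1 = exp_vs_piecewise_lrt(t1, np.ones(1000, dtype=int), cut=0.8)
print(p1 < 0.001)  # → True: departure from the exponential detected
```

The exposure-and-events form of the exponential likelihood used here is standard; the article's reduced piecewise exponential approach addresses, among other things, the choice of cut points and censoring mechanisms that this toy version takes for granted.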

Group testing, where individuals are tested initially in pools, is widely used to screen a large number of individuals for rare diseases. Triggered by the recent development of assays that detect multiple infections at once, screening programs now involve testing individuals in pools for multiple infections simultaneously. Tebbs, McMahan, and Bilder (2013, *Biometrics*) recently evaluated the performance of a two-stage hierarchical algorithm used to screen for chlamydia and gonorrhea as part of the Infertility Prevention Project in the United States. In this article, we generalize this work to accommodate a larger number of stages. To derive the operating characteristics of higher-stage hierarchical algorithms with more than one infection, we view the pool decoding process as a time-inhomogeneous, finite-state Markov chain. This conceptualization enables us to derive closed-form expressions for the expected number of tests and classification accuracy rates in terms of transition probability matrices. When applied to chlamydia and gonorrhea testing data from four states (Region X of the United States Department of Health and Human Services), higher-stage hierarchical algorithms provide, on average, an estimated 11% reduction in the number of tests when compared to two-stage algorithms. For applications with rarer infections, we show theoretically that this percentage reduction can be much larger.
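For the baseline single-infection, two-stage (Dorfman) case with a perfect assay, the expected number of tests has a simple closed form; the article's Markov chain formulation generalizes this to multiple infections and more stages:

```python
def dorfman_expected_tests(p, n):
    """Expected number of tests per individual under two-stage (Dorfman)
    group testing with pool size n, prevalence p, and a perfect assay:
    one pooled test, plus n individual retests when the pool is positive."""
    return 1.0 / n + (1.0 - (1.0 - p) ** n)

# at 1% prevalence, pools of 10 cut testing to roughly a fifth of
# one-test-per-person screening
print(round(dorfman_expected_tests(0.01, 10), 3))  # → 0.196
```

Adding stages, multiple infections, and imperfect assays makes the bookkeeping combinatorial, which is exactly what the transition-probability-matrix expressions in the article organize.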

In many scientific and engineering fields, advanced experimental and computing technologies are producing data that are not just high dimensional, but also internally structured. For instance, statistical units may have heterogeneous origins from distinct studies or subpopulations, and features may be naturally partitioned based on experimental platforms generating them, or on information available about their roles in a given phenomenon. In a regression analysis, exploiting this known structure in the predictor dimension reduction stage that precedes modeling can be an effective way to integrate diverse data. To pursue this, we propose a novel Sufficient Dimension Reduction (SDR) approach that we call *structured Ordinary Least Squares* (sOLS). This combines ideas from existing SDR literature to merge reductions performed within groups of samples and/or predictors. In particular, it leads to a version of OLS for grouped predictors that requires far less computation than recently proposed groupwise SDR procedures, and provides an informal yet effective variable selection tool in these settings. We demonstrate the performance of sOLS by simulation and present a first application to genomic data. The R package “sSDR,” publicly available on CRAN, includes all procedures necessary to implement the sOLS approach.

Bivariate survival data arise frequently in familial association studies of chronic disease onset, as well as in clinical trials and observational studies with multiple time to event endpoints. The association between two event times is often scientifically important. In this article, we examine the association via a novel quantile association measure, which describes the dynamic association as a function of the quantile levels. The quantile association measure is free of marginal distributions, allowing direct evaluation of the underlying association pattern at different locations of the event times. We propose a nonparametric estimator for quantile association, as well as a semiparametric estimator that is superior in smoothness and efficiency. The proposed methods possess desirable asymptotic properties including uniform consistency and root-n convergence. They demonstrate satisfactory numerical performance under a range of dependence structures. An application of our methods suggests interesting association patterns between time to myocardial infarction and time to stroke in an atherosclerosis study.

Model diagnosis, an important issue in statistical modeling, has not yet been addressed adequately for cure models. We focus on mixture cure models in this work and propose some residual-based methods to examine the fit of the mixture cure model, particularly the fit of the latency part of the mixture cure model. The new methods extend the classical residual-based methods to the mixture cure model. Numerical work shows that the proposed methods are capable of detecting lack-of-fit of a mixture cure model, particularly in the latency part, such as outliers, improper covariate functional form, or nonproportionality in hazards if the proportional hazards assumption is employed in the latency part. The methods are illustrated with two real data sets that were previously analyzed with mixture cure models.

Estimating biomarker-index accuracy when only imperfect reference-test information is available is usually performed under the assumption of conditional independence between the biomarker and imperfect reference-test values. We propose to define a latent normally distributed tolerance variable underlying the observed dichotomous imperfect reference-test results. Subsequently, we construct a Bayesian latent-class model based on the joint multivariate normal distribution of the latent tolerance and biomarker values, conditional on latent true disease status, which allows accounting for conditional dependence. The accuracy of the continuous biomarker-index is quantified by the AUC of the optimal linear biomarker combination. Model performance is evaluated using a simulation study and two data sets of Alzheimer's disease patients (one from the memory-clinic-based Amsterdam Dementia Cohort and one from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database). Simulation results indicate adequate performance of the proposed model, and show that estimates of the diagnostic-accuracy measures are biased when conditional independence is assumed but does not in fact hold. In the considered case studies, conditional dependence between some of the biomarkers and the imperfect reference-test is detected. However, making the conditional independence assumption does not lead to any marked differences in the estimates of diagnostic accuracy.

The sequential multiple assignment randomized trial (SMART) is a powerful design for studying Dynamic Treatment Regimes (DTRs) that allows causal comparisons of DTRs. To handle practical challenges of SMART, we propose a SMART with Enrichment (SMARTER) design, which performs stage-wise enrichment for SMART. SMARTER can improve design efficiency, shorten the recruitment period, and partially reduce trial duration, making SMART more practical with limited time and resources. Specifically, at each subsequent stage of a SMART, we enrich the study sample with new patients who have received previous stages’ treatments in a naturalistic fashion without randomization, and randomize them only among the current-stage treatment options. One extreme case of SMARTER is to synthesize separate independent single-stage randomized trials with patients who have received previous-stage treatments. We show that, under certain assumptions, data from a SMARTER allow the same unbiased estimation of DTRs as a SMART. Furthermore, we show analytically that the efficiency gain of the new design over SMART can be significant, especially when the dropout rate is high. Lastly, extensive simulation studies demonstrate the performance of the SMARTER design, and sample size estimation in a scenario informed by real data from a SMART study is presented.

This article proposes a method to address problems that arise when covariates in a regression setting are not Gaussian, which may give rise to approximately mixture-distributed errors, or when a true mixture of regressions produced the data. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with the help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot which results in a consistent estimator of the set of active covariates in the model. By simulations, we demonstrate that the new procedure can substantially improve on the performance of existing procedures in the context of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice.

Standard false discovery rate (FDR) procedures can provide misleading inference when testing multiple null hypotheses with heterogeneous multinomial data. For example, in the motivating study the goal is to identify species of bacteria near the roots of wheat plants (rhizobacteria) that are moderately or strongly associated with productivity. However, standard procedures discover the most abundant species even when their association is weak and fail to discover many moderate and strong associations when the species are not abundant. This article provides a new FDR-controlling method based on a finite mixture of multinomial distributions and shows that it tends to discover more moderate and strong associations and fewer weak associations when the data are heterogeneous across tests. The new method is applied to the rhizobacteria data and performs favorably over competing methods.
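For context, the standard FDR baseline that such methods are compared against is the Benjamini–Hochberg step-up procedure, which can be sketched as follows (a generic implementation, not the article's mixture-based method):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    # Step-up procedure: find the largest k with p_(k) <= (k/m) * alpha and
    # reject the hypotheses with the k smallest p-values; this controls the
    # FDR at level alpha for independent (or positively dependent) tests.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])  # indices of rejected hypotheses
```

The article's point is that applying such a procedure uniformly across heterogeneous multinomial tests lets highly abundant species dominate the discoveries; its mixture-based method reweights evidence across tests instead.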

In cancer research, interest frequently centers on factors influencing a latent event that must precede a terminal event. In practice it is often impossible to observe the latent event precisely, making inference about this process difficult. To address this problem, we propose a joint model for the unobserved time to the latent and terminal events, with the two events linked by the baseline hazard. Covariates enter the model parametrically as linear combinations that multiply, respectively, the hazard for the latent event and the hazard for the terminal event conditional on the latent one. We derive the partial likelihood estimators for this problem assuming the latent event is observed, and propose a profile likelihood-based method for estimation when the latent event is unobserved. The baseline hazard in this case is estimated nonparametrically using the EM algorithm, which allows for closed-form Breslow-type estimators at each iteration, bringing improved computational efficiency and stability compared with maximizing the marginal likelihood directly. We present simulation studies to illustrate the finite-sample properties of the method; its use in practice is demonstrated in the analysis of a prostate cancer data set.

Cancer survival comparisons between cohorts are often assessed by estimates of relative or net survival. These measure the difference in mortality between those diagnosed with the disease and the general population. For such comparisons, methods are needed to standardize cohort structure (including age at diagnosis) and all-cause mortality rates in the general population. Standardized non-parametric relative survival measures are evaluated by determining how well they (i) ensure the correct rank ordering, (ii) allow for differences in covariate distributions, and (iii) possess robustness and maximal estimation precision. Two relative survival families that subsume the Ederer-I, Ederer-II, and Pohar-Perme statistics are assessed. The aforementioned statistics do not meet our criteria, and are not invariant under a change of covariate distribution. Existing methods for standardization of these statistics are either not invariant to changes in the general population mortality or are not robust. Standardized statistics and estimators are developed to address the deficiencies. They use a reference distribution for covariates such as age, and a reference population mortality survival distribution that is recommended to approach zero with increasing age as fast as the cohort with the worst life expectancy. Estimators are compared using a breast-cancer survival example and computer simulation. The proposals are invariant and robust, and outperform current methods to standardize the Ederer-II and Pohar-Perme estimators in simulations, particularly for extended follow-up.

In this article, we present a Bayesian hierarchical model for predicting a latent health state from longitudinal clinical measurements. Model development is motivated by the need to integrate multiple sources of data to improve clinical decisions about whether to remove or irradiate a patient's prostate cancer. Existing modeling approaches are extended to accommodate measurement error in cancer state determinations based on biopsied tissue, clinical measurements possibly not missing at random, and informative partial observation of the true state. The proposed model enables estimation of whether an individual's underlying prostate cancer is aggressive, requiring surgery and/or radiation, or indolent, permitting continued surveillance. These individualized predictions can then be communicated to clinicians and patients to inform decision-making. We demonstrate the model with data from a cohort of low-risk prostate cancer patients at Johns Hopkins University and assess predictive accuracy among a subset for whom true cancer state is observed. Simulation studies confirm model performance and explore the impact of adjusting for informative missingness on true state predictions. R code is provided in an online supplement and at http://github.com/rycoley/prediction-prostate-surveillance.

Motivated by a study conducted to evaluate the associations of 51 inflammatory markers and lung cancer risk, we propose several approaches of varying computational complexity for analyzing multiple correlated markers that are also censored due to lower and/or upper limits of detection, using likelihood-based sufficient dimension reduction (SDR) methods. We extend the theory and the likelihood-based SDR framework in two ways: (i) we accommodate censored predictors directly in the likelihood, and (ii) we incorporate variable selection. We find linear combinations that contain all the information that the correlated markers have on an outcome variable (i.e., are sufficient for modeling and prediction of the outcome) while accounting for censoring of the markers. These methods yield efficient estimators and can be applied to any type of outcome, including continuous and categorical. We illustrate and compare all methods using data from the motivating study and in simulations. We find that explicitly accounting for the censoring in the likelihood of the SDR methods can lead to appreciable gains in efficiency and prediction accuracy, and outperforms multiple imputation combined with standard SDR.

In biomedical research, a steep rise or decline in longitudinal biomarkers may indicate latent disease progression, which may subsequently cause patients to drop out of the study. Ignoring such informative drop-out can bias estimation of the longitudinal model. In such cases, a fully parametric specification may be insufficient to capture the complicated pattern of the longitudinal biomarkers. For longitudinal data with informative drop-out, we develop a joint partially linear model, with the aim of recovering the trajectory of the longitudinal biomarker. Specifically, an arbitrary function of time along with linear fixed and random covariate effects is proposed in the model for the biomarker, while a flexible semiparametric transformation model is used to describe the drop-out mechanism. Advantages of this semiparametric joint modeling approach are the following: (1) it provides an easier interpretation, compared to standard nonparametric regression models, and (2) it is a natural way to control for common (observable and unobservable) prognostic factors that may affect both the longitudinal trajectory and the drop-out process. We describe a sieve maximum likelihood estimation procedure using the EM algorithm, where the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are considered to select the number of knots. We show that the proposed estimators achieve desirable asymptotic properties through empirical process theory. The proposed methods are evaluated by simulation studies and applied to prostate cancer data.

Case-cohort (Prentice, 1986) and nested case-control (Thomas, 1977) designs have been widely used as a cost-effective alternative to the full-cohort design. In this article, we propose an efficient likelihood-based estimation method for the accelerated failure time model under case-cohort and nested case-control designs. An EM algorithm is developed to maximize the likelihood function and a kernel smoothing technique is adopted to facilitate the estimation in the M-step of the EM algorithm. We show that the proposed estimators for the regression coefficients are consistent and asymptotically normal. The asymptotic variance of the estimators can be consistently estimated using an EM-aided numerical differentiation method. Simulation studies are conducted to evaluate the finite-sample performance of the estimators and an application to a Wilms tumor data set is also given to illustrate the methodology.

We propose a Bayesian non-parametric (BNP) framework for estimating causal mediation effects: the natural direct and indirect effects. The strategy is to do this in two parts. Part 1 is a flexible model (using BNP) for the observed data distribution. Part 2 is a set of uncheckable assumptions with sensitivity parameters that, in conjunction with Part 1, allows identification and estimation of the causal parameters, with uncertainty about these assumptions expressed via priors on the sensitivity parameters. For Part 1, we specify a Dirichlet process mixture of multivariate normals as a prior on the joint distribution of the outcome, mediator, and covariates. This approach allows us to obtain a (simple) closed form of each marginal distribution. For Part 2, we consider two sets of assumptions: (a) the standard sequential ignorability (Imai et al., 2010) and (b) a weakened set of conditional-independence-type assumptions introduced in Daniels et al. (2012), and propose sensitivity analyses for both. We use this approach to assess mediation in a physical activity promotion trial.

Cancer population studies based on cancer registry databases are widely conducted to address various research questions. In general, cancer registry databases do not collect information on cause of death. The net survival rate is defined as the survival rate if a subject would not die from any cause other than cancer. This counterfactual concept is widely used for the analyses of cancer registry data. Perme, Stare, and Estève (2012) proposed a nonparametric estimator of the net survival rate under the assumption that the censoring time is independent of the survival time and covariates. Kodre and Perme (2013) proposed an inverse weighting estimator for the net survival rate under covariate-dependent censoring. An alternative approach to estimating the net survival rate under covariate-dependent censoring is to apply a regression model for the conditional net survival rate given covariates. In this article, we propose a new estimator for the net survival rate. The proposed estimator is shown to be doubly robust in the sense that it is consistent if at least one of the two regression models (for the survival time or for the censoring time) is correctly specified. We examine the theoretical and empirical properties of our proposed estimator by asymptotic theory and simulation studies. We also apply the proposed method to cancer registry data for gastric cancer patients in Osaka, Japan.

The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, or whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration.
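The imputation loop described above can be sketched with a toy estimand (a plain median over subjects classified as uncured, standing in for the locally weighted quantile regression fit; the uncured probabilities are taken as given rather than estimated iteratively):

```python
import random

def impute_and_estimate(obs, M=200, seed=1):
    # obs: (time, delta, p_uncured) triples; delta = 1 means the event was
    # observed, so the subject is uncured for sure.  Each imputation draws a
    # Bernoulli(p_uncured) cure status for every censored subject, computes a
    # toy estimate (median time among subjects classified as uncured), and the
    # M resulting estimates are averaged.
    rng = random.Random(seed)
    ests = []
    for _ in range(M):
        uncured = sorted(t for t, d, p in obs if d == 1 or rng.random() < p)
        ests.append(uncured[len(uncured) // 2])
    return sum(ests) / M

demo = [(t, 1, 1.0) for t in (1.0, 2.0, 3.0)] + [(8.0, 0, 0.3), (9.0, 0, 0.2)]
mi_estimate = impute_and_estimate(demo)
```

Averaging over repeated Bernoulli imputations is what removes the noise introduced by any single cured/uncured classification, which is the core of the proposed multiple imputation scheme.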

Finding rare variants and gene–environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer—there is evidence that variants on Chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior rendering the assumption of G-E independence inappropriate. With the motivation of detecting GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I) by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, to incorporate uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet has well-controlled type I errors in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.

Highly active antiretroviral therapy (HAART) has proved efficient in increasing CD4 counts in many randomized clinical trials. Because randomized trials have some limitations (e.g., short duration, highly selected subjects), it is interesting to assess the effect of treatments using observational studies. This is challenging because treatment is started preferentially in subjects with severe conditions. This general problem has been addressed using Marginal Structural Models (MSM) relying on the counterfactual formulation. Another approach to causality is based on dynamical models. We present three discrete-time dynamic models based on linear increments models (LIM): the first based on one difference equation for CD4 counts, the second with an equilibrium point, and the third based on a system of two difference equations, which allows jointly modeling CD4 counts and viral load. We also consider continuous-time models based on ordinary differential equations with non-linear mixed effects (ODE-NLME). These mechanistic models allow incorporating biological knowledge when available, which leads to increased statistical evidence for detecting treatment effect. Because inference in ODE-NLME models is numerically challenging and requires specific methods and software, LIM are a valuable intermediate option in terms of consistency, precision, and complexity. We compare the different approaches in simulation and illustrate them on the ANRS CO3 Aquitaine Cohort and the Swiss HIV Cohort Study.
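A minimal sketch of the second LIM variant, a single difference equation with an equilibrium point, might look like the following (the coefficients and the deterministic recursion are illustrative assumptions; the actual models include noise and covariates):

```python
def lim_trajectory(cd4_0, alpha, mu, beta, treated, steps):
    # Toy linear-increments recursion with an equilibrium point:
    #   CD4[t+1] = CD4[t] + alpha * (mu - CD4[t]) + beta * treated
    # For 0 < alpha < 1 the sequence converges to mu + beta * treated / alpha.
    traj = [cd4_0]
    for _ in range(steps):
        traj.append(traj[-1] + alpha * (mu - traj[-1]) + beta * treated)
    return traj

# Treated trajectory approaches the equilibrium 400 + 10 / 0.2 = 450
traj = lim_trajectory(200.0, 0.2, 400.0, 10.0, 1, 300)
```

The treatment effect appears as a shift in the equilibrium CD4 level, which illustrates why such difference-equation models can encode mechanistic structure while remaining far simpler to fit than ODE-NLME models.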

We consider simple ordinal model-based probability effect measures for comparing distributions of two groups, adjusted for explanatory variables. An “ordinal superiority” measure summarizes the probability that an observation from one distribution falls above an independent observation from the other distribution, adjusted for explanatory variables in a model. The measure applies directly to normal linear models and to a normal latent variable model for ordinal response variables. It equals Φ(β/√2) for the corresponding ordinal model that applies a probit link function to cumulative multinomial probabilities, for standard normal cdf Φ and effect β that is the coefficient of the group indicator variable. For the more general latent variable model for ordinal responses that corresponds to a linear model with other possible error distributions and corresponding link functions for cumulative multinomial probabilities, the ordinal superiority measure equals exp(β)/[1 + exp(β)] with the log–log link and equals approximately exp(β/√2)/[1 + exp(β/√2)] with the logit link, where β is the group effect. Another ordinal superiority measure generalizes the difference of proportions from binary to ordinal responses. We also present related measures directly for ordinal models for the observed response that need not assume corresponding latent response models. We present confidence intervals for the measures and illustrate with an example.
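The closed forms for the latent variable formulation are easy to compute directly: with independent latents Y1* ~ N(β, 1) and Y2* ~ N(0, 1) the difference is N(β, 2), so the superiority probability is Φ(β/√2); with extreme-value latent errors (log–log link) the difference of latents is standard logistic, giving exp(β)/[1 + exp(β)] exactly.

```python
import math

def norm_cdf(x):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def superiority_probit(beta):
    # Independent latents Y1* ~ N(beta, 1), Y2* ~ N(0, 1): their difference is
    # N(beta, 2), so P(Y1* > Y2*) = Phi(beta / sqrt(2)).
    return norm_cdf(beta / math.sqrt(2.0))

def superiority_loglog(beta):
    # Extreme-value latent errors: the difference of two independent Gumbel
    # variables is standard logistic, so P(Y1* > Y2*) = exp(beta)/(1 + exp(beta)).
    return math.exp(beta) / (1.0 + math.exp(beta))
```

Both measures equal 0.5 at β = 0 (no group effect) and increase monotonically in β, which is what makes them convenient probability-scale summaries of a cumulative link model's group coefficient.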

Long-term follow-up is common in many medical investigations where the interest lies in predicting patients’ risks for a future adverse outcome using repeatedly measured predictors over time. A key quantity is the likelihood of developing an adverse outcome among individuals who survived up to time *s* given their covariate information up to time *s*. Simple, yet reliable, methodology for updating the predicted risk of disease progression using longitudinal markers remains elusive. Two main approaches have been considered in the literature. One approach, based on joint modeling (JM) of failure time and longitudinal covariate process (Tsiatis and Davidian, 2004), derives such longitudinal predictive probability from the joint probability of a longitudinal marker and an event at a given time. A second approach, the partly conditional (PC) modeling (Zheng and Heagerty, 2005), directly models the predictive probability conditional on survival up to a landmark time and information accrued by that time. In this article, we propose new PC models for longitudinal prediction that are more flexible than joint modeling and improve the prediction accuracy over existing PC models. We provide procedures for making inference regarding future risk for an individual with longitudinal measures up to a given time. In addition, we conduct simulations to evaluate both JM and PC approaches in order to provide practical guidance on modeling choices. We use standard measures of predictive accuracy adapted to our setting to explore the predictiveness of the two approaches. We illustrate the performance of the two approaches on a dataset from the End Stage Renal Disease Study (ESRDS).

We consider the problem of testing for a dose-related effect based on a candidate set of (typically nonlinear) dose-response models using likelihood-ratio tests. For the considered models this reduces to assessing whether the slope parameter in these nonlinear regression models is zero or not. A technical problem is that the null distribution (when the slope is zero) depends on non-identifiable parameters, so that standard asymptotic results on the distribution of the likelihood-ratio test no longer apply. Asymptotic solutions for this problem have been extensively discussed in the literature. The resulting approximations, however, are not of simple form and require simulation to calculate the asymptotic distribution. In addition, their appropriateness might be doubtful for small sample sizes. Direct simulation to approximate the null distribution is numerically unstable due to the non-identifiability of some parameters. In this article, we derive a numerical algorithm to approximate the exact distribution of the likelihood-ratio test under multiple models for normally distributed data. The algorithm uses methods from differential geometry and can be used to evaluate the distribution under the null hypothesis, but also allows for power and sample size calculations. We compare the proposed testing approach to the MCP-Mod methodology and alternative methods for testing for a dose-related trend in a dose-finding example data set and simulations.

Frailty models are here proposed in the tumor dormancy framework, in order to account for possible unobservable dependence mechanisms in cancer studies where a non-negligible proportion of cancer patients relapses years or decades after surgical removal of the primary tumor. Relapses do not seem to follow a memory-less process, since their timing distribution leads to multimodal hazards. From a biomedical perspective, this behavior may be explained by tumor dormancy, i.e., for some patients microscopic tumor foci may remain asymptomatic for a prolonged time interval and, when they escape from dormancy, micrometastatic growth results in a clinical disease appearance. The activation of the growth phase at different metastatic states would explain the occurrence of metastatic recurrences and mortality at different times (multimodal hazard). We propose a new frailty model which includes in the risk function a random source of heterogeneity (frailty variable) affecting the components of the hazard function. Thus, the individual hazard rate results as the product of a random frailty variable and the sum of basic hazard rates. In tumor dormancy, the basic hazard rates correspond to micrometastatic developments starting from different initial states. The frailty variable represents the heterogeneity among patients with respect to relapse, which might be related to unknown mechanisms that regulate tumor dormancy. We use our model to estimate the overall survival in a large breast cancer dataset, showing how this improves the understanding of the underlying biological process.

This article considers sieve estimation in the Cox model with an unknown regression structure based on right-censored data. We propose a semiparametric pursuit method to simultaneously identify and estimate linear and nonparametric covariate effects based on B-spline expansions through a penalized group selection method with concave penalties. We show that the estimators of the linear effects and the nonparametric component are consistent. Furthermore, we establish the asymptotic normality of the estimator of the linear effects. To compute the proposed estimators, we develop a modified blockwise majorization descent algorithm that is efficient and easy to implement. Simulation studies demonstrate that the proposed method performs well in finite sample situations. We also use the primary biliary cirrhosis data to illustrate its application.

Many diseases arise due to exposure to one of multiple possible pathogens. We consider the situation in which disease counts are available over time from a study region, along with a measure of clinical disease severity, for example, mild or severe. In addition, we suppose a subset of the cases are lab tested in order to determine the pathogen responsible for disease. In such a context, we focus interest on modeling the probabilities of disease incidence given pathogen type. The time course of these probabilities is of great interest, as is the association with time-varying covariates such as meteorological variables. In this setup, a natural Bayesian approach would be based on imputation of the unsampled pathogen information using Markov chain Monte Carlo, but this is computationally challenging. We describe a practical approach to inference that is easy to implement. We use an empirical Bayes procedure in a first step to estimate summary statistics. We then treat these summary statistics as the observed data and develop a Bayesian generalized additive model. We analyze data on hand, foot, and mouth disease (HFMD) in China, in which there are two pathogens of primary interest, enterovirus 71 (EV71) and Coxsackievirus A16 (CA16). We find that both EV71 and CA16 are associated with temperature, relative humidity, and wind speed, with reasonably similar functional forms for both pathogens. The important issue of confounding by time is modeled using a penalized B-spline model with a random effects representation. The level of smoothing is addressed by a careful choice of the prior on the tuning variance.

In this article, we propose an association model to estimate the penetrance (risk) of successive cancers in the presence of competing risks. The association between the successive events is modeled via a copula and a proportional hazards model is specified for each competing event. This work is motivated by the analysis of successive cancers for people with Lynch Syndrome in the presence of competing risks. The proposed inference procedure is adapted to handle missing genetic covariates and selection bias, induced by the data collection protocol of the data at hand. The performance of the proposed estimation procedure is evaluated by simulations and its use is illustrated with data from the Colon Cancer Family Registry (Colon CFR).
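As a concrete example of linking successive events through a copula (the specific family here, a Clayton copula, is our assumption for illustration; the article does not commit to it), the joint survival function and the implied Kendall's tau are:

```python
def clayton_joint_survival(s1, s2, theta):
    # Clayton copula applied to the marginal survival functions:
    #   S(t1, t2) = (S1^(-theta) + S2^(-theta) - 1)^(-1/theta),  theta > 0,
    # with the independence product S1 * S2 recovered as theta -> 0.
    if theta == 0:
        return s1 * s2
    return (s1 ** (-theta) + s2 ** (-theta) - 1.0) ** (-1.0 / theta)

def kendalls_tau(theta):
    # Kendall's tau implied by the Clayton copula.
    return theta / (theta + 2.0)

# Positive dependence lifts the joint survival above the independence product
s12 = clayton_joint_survival(0.7, 0.6, 2.0)  # > 0.7 * 0.6 = 0.42
```

Pairing such a copula with a proportional hazards model for each competing event gives a joint specification in which the marginal regressions and the association parameter can be interpreted separately, which is the structure the abstract describes.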

Motivated by a study of molecular differences among breast cancer patients, we develop a Bayesian latent factor zero-inflated Poisson (LZIP) model for the analysis of correlated zero-inflated counts. The responses are modeled as independent zero-inflated Poisson distributions conditional on a set of subject-specific latent factors. For each outcome, we express the LZIP model as a function of two discrete random variables: the first captures the propensity to be in an underlying “at-risk” state, while the second represents the count response conditional on being at risk. The latent factors and loadings are assigned conditionally conjugate gamma priors that accommodate overdispersion and dependence among the outcomes. For posterior computation, we propose an efficient data-augmentation algorithm that relies primarily on easily sampled Gibbs steps. We conduct simulation studies to investigate both the inferential properties of the model and the computational capabilities of the proposed sampling algorithm. We apply the method to an analysis of breast cancer genomics data from The Cancer Genome Atlas.
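The two-part representation of each LZIP outcome, an “at-risk” Bernoulli indicator times a Poisson count, can be sketched directly (pi and lambda are illustrative values, not estimates from the article's model, and the latent-factor dependence across outcomes is omitted):

```python
import math, random

def zip_pmf(k, pi, lam):
    # P(Y = k) for a zero-inflated Poisson: a structural zero with probability
    # pi ("not at risk"), otherwise a Poisson(lam) count.
    base = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi + (1.0 - pi) * base if k == 0 else (1.0 - pi) * base

def zip_sample(pi, lam, rng):
    # Y = (1 - B) * N with B ~ Bernoulli(pi) and N ~ Poisson(lam) drawn by inversion.
    if rng.random() < pi:
        return 0
    k, p, u = 0, math.exp(-lam), rng.random()
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k
```

The mixture structure explains why a ZIP model has excess zeros relative to a plain Poisson: P(Y = 0) = pi + (1 - pi) e^(-lambda), and the mean is deflated to (1 - pi) lambda.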

The analysis of multiple outcomes is becoming increasingly common in modern biomedical studies. It is well-known that joint statistical models for multiple outcomes are more flexible and more powerful than fitting a separate model for each outcome; they yield more powerful tests of exposure or treatment effects by taking into account the dependence among outcomes and pooling evidence across outcomes. It is, however, unlikely that all outcomes are related to the same subset of covariates. Therefore, there is interest in identifying exposures or treatments associated with particular outcomes, which we term outcome-specific variable selection. In this work, we propose a variable selection approach for multivariate normal responses that incorporates not only information on the mean model, but also information on the variance–covariance structure of the outcomes. The approach effectively leverages evidence from all correlated outcomes to estimate the effect of a particular covariate on a given outcome. To implement this strategy, we develop a Bayesian method that builds a multivariate prior for the variable selection indicators based on the variance–covariance of the outcomes. We show via simulation that the proposed variable selection strategy can boost power to detect subtle effects without increasing the probability of false discoveries. We apply the approach to the Normative Aging Study (NAS) epigenetic data and identify a subset of five genes in the asthma pathway for which gene-specific DNA methylations are associated with exposures to either black carbon, a marker of traffic pollution, or sulfate, a marker of particles generated by power plants.

Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence, the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in an R-package HDtest and are available on CRAN.
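A structure-free comparison of two covariance matrices can be sketched with a max-entry statistic; here we calibrate it by permutation rather than by the paper's asymptotic theory (a simplification for illustration, feasible only for small dimensions, whereas HDtest targets the high-dimensional regime):

```python
import random

def cov(rows):
    # Sample covariance matrix of a list of equal-length observation tuples.
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def max_cov_diff(x, y):
    # Maximum absolute entrywise difference between the two sample covariances.
    cx, cy = cov(x), cov(y)
    p = len(cx)
    return max(abs(cx[i][j] - cy[i][j]) for i in range(p) for j in range(p))

def perm_test(x, y, B=200, seed=0):
    # Calibrate the max statistic by permuting the group labels.
    rng = random.Random(seed)
    obs = max_cov_diff(x, y)
    pooled = x + y
    exceed = 0
    for _ in range(B):
        rng.shuffle(pooled)
        if max_cov_diff(pooled[:len(x)], pooled[len(x):]) >= obs:
            exceed += 1
    return (exceed + 1) / (B + 1)

random.seed(2)
n = 100
x = [tuple(random.gauss(0, 1) for _ in range(3)) for _ in range(n)]
y_null = [tuple(random.gauss(0, 1) for _ in range(3)) for _ in range(n)]
y_alt = [tuple(random.gauss(0, 3) for _ in range(3)) for _ in range(n)]
p_null = perm_test(x, y_null)  # no real difference: typically a large p-value
p_alt = perm_test(x, y_alt)    # variances differ ninefold: a small p-value
```

A max-type statistic is sensitive to sparse, localized differences between the matrices, which is the kind of alternative that motivates entrywise comparisons in genomics.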

In this article, the existing concept of reversed percentile residual life, or percentile inactivity time, is recast to show that it can be used for routine analysis of time-to-event data under right censoring to summarize "life lost," which offers several advantages over existing methods for survival analysis. An estimating equation approach is adopted to estimate the variance of the quantile estimator while avoiding estimation of the probability density function of the underlying time-to-event distribution. Additionally, a *K*-sample test statistic is proposed to test the ratio of the quantile lost lifespans. Simulation studies are performed to assess the finite-sample properties of the proposed *K*-sample statistic in terms of coverage probability and power. The proposed method is illustrated with a real data example from a breast cancer study.
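The quantile of the "lost lifespan" has a simple empirical form when there is no censoring: for a fixed horizon τ, the lost lifespan of a subject failing at time T ≤ τ is τ − T, and the *K*-sample ratio compares its quantiles across groups. Below is a minimal uncensored sketch; the function names and the type-1 quantile convention are illustrative, not the paper's estimators, which handle right censoring.

```python
import math

def quantile(xs, p):
    """Empirical p-th quantile (type-1 / inverse-cdf convention)."""
    s = sorted(xs)
    return s[max(0, math.ceil(p * len(s)) - 1)]

def lost_lifespan_quantile_ratio(times_a, times_b, tau, p=0.5):
    """Ratio of the p-th quantile 'lost lifespans' (tau - T) of two
    uncensored samples, restricted to subjects failing by horizon tau."""
    lost_a = [tau - t for t in times_a if t <= tau]
    lost_b = [tau - t for t in times_b if t <= tau]
    return quantile(lost_a, p) / quantile(lost_b, p)
```

A ratio near one suggests similar lost-lifespan quantiles in the two groups; the *K*-sample statistic formalizes the comparison under censoring.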

It is traditionally assumed that the random effects in mixed models follow a multivariate normal distribution, making likelihood-based inferences more feasible theoretically and computationally. However, this assumption does not necessarily hold in practice, which may lead to biased and unreliable results. We introduce a novel diagnostic test based on the so-called gradient function proposed by Verbeke and Molenberghs (2013) to assess the random-effects distribution. We establish asymptotic properties of our test and show that, under a correctly specified model, the proposed test statistic converges to a weighted sum of independent chi-squared random variables, each with one degree of freedom. The weights, which are eigenvalues of a square matrix, can be easily calculated. We also develop a parametric bootstrap algorithm for small samples. Our strategy can be used to check the adequacy of any distribution for random effects in a wide class of mixed models, including linear mixed models, generalized linear mixed models, and non-linear mixed models, with univariate as well as multivariate random effects. Both the asymptotic and bootstrap proposals are evaluated via simulations and a real data analysis of a randomized multicenter study on toenail dermatophyte onychomycosis.
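Given the eigenvalue weights, a p-value for a statistic whose null distribution is a weighted sum of independent one-degree-of-freedom chi-squared variables can be obtained by direct Monte Carlo simulation. A generic sketch, not the paper's implementation (the weights in the usage comment are made up):

```python
import random

def weighted_chisq_pvalue(stat, weights, n_sim=100_000, seed=1):
    """Monte Carlo p-value for a statistic whose null law is
    sum_k w_k * chi^2_1; each chi^2_1 draw is a squared standard normal."""
    rng = random.Random(seed)
    exceed = sum(
        1 for _ in range(n_sim)
        if sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights) >= stat
    )
    return exceed / n_sim

# usage: the weights would be the eigenvalues of the matrix in the paper,
# e.g. weighted_chisq_pvalue(7.3, [2.0, 0.5])  (numbers here are invented)
```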

Various model selection methods can be applied to seek sparse subsets of the covariates to explain the response of interest in bioinformatics. While such methods often offer very helpful predictive performance, their selections of the covariates may be much less trustworthy. Indeed, when the number of covariates is large, the selections can be highly unstable, even under a slight change of the data. This casts serious doubt on the reproducibility of the identified variables. For a sound scientific understanding of the regression relationship, methods need to be developed to find the most important covariates, those with a higher chance of being confirmed in future studies. Such a method, based on variable selection deviation, is proposed and evaluated in this work.

Distortion product otoacoustic emissions (DPOAE) testing is a promising alternative to behavioral hearing tests and auditory brainstem response testing of pediatric cancer patients. The central goal of this study is to assess whether significant changes in the DPOAE frequency/emissions curve (DP-gram) occur in pediatric patients in a test-retest scenario. This is accomplished through the construction of normal reference charts, or credible regions, within which DP-gram differences are expected to lie, as well as contour probabilities that measure how abnormal (or, in a certain sense, rare) a test-retest difference is. A challenge is that the data were collected over varying frequencies, at different time points from baseline, and on possibly one or both ears. A hierarchical structural equation Gaussian process model is proposed to handle the different sources of correlation in the emissions measurements, wherein both subject-specific random effects and variance components governing the smoothness and variability of each child's Gaussian process are coupled together.

We use a nonparametric mixture model to estimate the size of a population from multiple lists in which both the individual effects and the list effects are allowed to vary. We propose a lower bound of the population size that admits an analytic expression and can be estimated without any model fitting. The asymptotic normality of the estimator is established. Both the estimator itself and the estimator of the estimable bound of its variance are adjusted, and these adjusted versions are shown to be unbiased in the limit. Simulation experiments are performed to assess the proposed approach, and real applications are studied.
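The abstract does not reproduce its analytic lower bound. In the same spirit, the classical Chao (1987) moment lower bound is computable from the singleton and doubleton frequencies alone, with no model fitting; the sketch below illustrates that style of estimator, not the paper's own bound.

```python
def chao_lower_bound(counts):
    """Chao-type moment lower bound for the population size, computed from
    the observed positive capture counts (one count per observed unit)."""
    n = len(counts)
    f1 = sum(1 for c in counts if c == 1)   # units seen exactly once
    f2 = sum(1 for c in counts if c == 2)   # units seen exactly twice
    if f2 == 0:
        return n + f1 * (f1 - 1) / 2.0      # bias-corrected variant when f2 = 0
    return n + f1 * f1 / (2.0 * f2)
```

For example, 35 observed units with f1 = 20 and f2 = 10 give 35 + 20²/(2·10) = 55 as the lower bound.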

In this work a new metric of surrogacy, the so-called individual causal association (ICA), is introduced using information-theoretic concepts and a causal inference model for a binary surrogate and true endpoint. The ICA has a simple and appealing interpretation in terms of uncertainty reduction and, in some scenarios, it seems to provide a more coherent assessment of the validity of a surrogate than existing measures. The identifiability issues are tackled using a two-step procedure. In the first step, the region of the parametric space of the distribution of the potential outcomes, compatible with the data at hand, is geometrically characterized. Further, in a second step, a Monte Carlo approach is proposed to study the behavior of the ICA on the previous region. The method is illustrated using data from the Collaborative Initial Glaucoma Treatment Study. A newly developed and user-friendly R package *Surrogate* is provided to carry out the evaluation exercise.

Spatial data have become increasingly common in epidemiology and public health research thanks to advances in GIS (Geographic Information Systems) technology. In health research, for example, it is common for epidemiologists to incorporate geographically indexed data into their studies. In practice, however, the spatially defined covariates are often measured with error. Naive estimators of regression coefficients are attenuated if measurement error is ignored. Moreover, the classical measurement error theory is inapplicable in the context of spatial modeling because of the presence of spatial correlation among the observations. We propose a semiparametric regression approach to obtain bias-corrected estimates of regression parameters and derive their large sample properties. We evaluate the performance of the proposed method through simulation studies and illustrate using data on Ischemic Heart Disease (IHD). Both simulation and practical application demonstrate that the proposed method can be effective in practice.
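The attenuation that motivates bias correction is easiest to see in the classical non-spatial setting: with independent additive error, the naive slope shrinks by the reliability ratio var(x)/(var(x)+var(u)), and dividing by that ratio undoes the shrinkage. A simulation sketch under these simplifying assumptions; it deliberately ignores the spatial correlation that the proposed semiparametric method is designed to handle.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
beta, var_x, var_u = 2.0, 1.0, 1.0                      # true slope; signal / error variances
x = [rng.gauss(0, var_x ** 0.5) for _ in range(20000)]  # true covariate
w = [xi + rng.gauss(0, var_u ** 0.5) for xi in x]       # error-prone version actually observed
y = [beta * xi + rng.gauss(0, 0.5) for xi in x]

lam = var_x / (var_x + var_u)   # reliability ratio = 0.5
naive = slope(w, y)             # attenuated: close to beta * lam = 1.0
corrected = naive / lam         # close to the true beta = 2.0
```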

We show how a spatial point process, where to each point there is associated a random quantitative mark, can be identified with a spatio-temporal point process specified by a conditional intensity function. For instance, the points can be tree locations, the marks can express the sizes of trees, and the conditional intensity function can describe the distribution of a tree (i.e., its location and size) conditionally on the larger trees. This enables us to construct parametric statistical models that are easily interpretable and for which maximum-likelihood-based inference is tractable.

Capture–recapture methods are used to estimate the size of a population of interest that is only partially observed. In such studies, each member of the population carries a count of the number of times it has been identified during the observational period. In real-life applications, only positive counts are recorded, so the observed count distribution is truncated at zero, and we need to use this truncated distribution to estimate the number of unobserved units. We consider ratios of neighboring count probabilities, estimated by ratios of observed frequencies, regardless of whether we have a zero-truncated or an untruncated distribution. Rocchetti et al. (2011) have shown that, for densities in the Katz family, these ratios can be modeled by a regression approach, and Rocchetti et al. (2014) have specialized the approach to the beta-binomial distribution. Once the regression model has been estimated, the unobserved frequency of zero counts can be simply derived. The guiding principle is that it is often easier to find an appropriate regression model than a proper model for the count distribution. However, a full analysis of the connection between the regression model and the associated count distribution has been missing. In this manuscript, we fill this gap and show that the regression model approach leads, under general conditions, to a valid count distribution; we also consider a wider class of regression models based on fractional polynomials. The proposed approach is illustrated by analyzing various empirical applications and by means of a simulation study.
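In the simplest member of the Katz family, the Poisson, the neighboring-probability ratios are constant: (x+1)·p(x+1)/p(x) = λ. Estimating that constant from the observed frequency ratios immediately yields the unobserved zero mass and hence the population size. A toy sketch of this guiding principle, in which a crude unweighted average of the ratios stands in for the paper's regression fit:

```python
from collections import Counter
import math

def estimate_total_poisson(counts):
    """Population-size estimate from zero-truncated counts under a Poisson
    model, where the Katz-family ratio (x+1) p_{x+1} / p_x equals lambda."""
    f = Counter(counts)                                   # observed frequencies f_1, f_2, ...
    ratios = [(x + 1) * f[x + 1] / f[x] for x in sorted(f) if f[x + 1] > 0]
    lam = sum(ratios) / len(ratios)                       # crude average of the ratios
    n = len(counts)
    return n / (1.0 - math.exp(-lam))                     # N_hat = n / P(count > 0)
```

With frequencies f1 = 60, f2 = 30, f3 = 10 every ratio equals 1, so λ̂ = 1 and the 100 observed units inflate to 100/(1 − e⁻¹) ≈ 158.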

In many scientific fields, it is common practice to collect a sequence of 0-1 binary responses from a subject across time, space, or a collection of covariates. Researchers are interested in finding out how the expected binary outcome is related to covariates, and aim at better prediction of future 0-1 outcomes. Gaussian processes have been widely used to model nonlinear systems; in particular, to model the latent structure in a binary regression model, allowing a nonlinear functional relationship between covariates and the expectation of binary outcomes. A critical issue in modeling binary response data is the appropriate choice of link function. Commonly adopted link functions such as the probit or logit have fixed skewness and lack the flexibility to allow the data to determine the degree of skewness. To address this limitation, we propose a flexible binary regression model that combines a generalized extreme value link function with a Gaussian process prior on the latent structure. Bayesian computation is employed in model estimation. Posterior consistency of the resulting posterior distribution is demonstrated. The flexibility and gains of the proposed model are illustrated through detailed simulation studies and two real data examples. Empirical results show that the proposed model outperforms a set of alternative models, which have only either a Gaussian process prior on the latent regression function or a Dirichlet prior on the link function.
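For concreteness, one common form of generalized extreme value (GEV) link, assumed here (the paper's exact parameterization may differ), sets P(Y=1 | η) = 1 − G(−η; ξ), where G is the GEV cdf and the shape ξ controls skewness; ξ → 0 recovers the complementary log-log link.

```python
import math

def gev_inv_link(eta, xi):
    """P(Y=1 | eta) = 1 - G(-eta; xi), with G the GEV cdf and xi the shape
    parameter governing skewness; xi -> 0 gives the complementary log-log."""
    if abs(xi) < 1e-8:
        return 1.0 - math.exp(-math.exp(eta))
    t = 1.0 - xi * eta
    if t <= 0.0:                      # -eta falls outside the GEV support
        return 1.0 if xi > 0 else 0.0
    return 1.0 - math.exp(-t ** (-1.0 / xi))
```

Unlike the logit, the response curve is asymmetric whenever ξ ≠ 0, so the data can inform the degree of skewness through ξ.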

Nonparametric estimation of monotone regression functions is a classical problem of practical importance. Robust estimation of monotone regression functions in situations involving interval-censored data is a challenging yet unresolved problem. Herein, we propose a nonparametric estimation method based on the principle of isotonic regression. Using empirical process theory, we show that the proposed estimator is asymptotically consistent under a specific metric. We further conduct a simulation study to evaluate the performance of the estimator in finite sample situations. As an illustration, we use the proposed method to estimate the mean body weight functions in a group of adolescents after they reach pubertal growth spurt.
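The isotonic regression principle underlying the proposal can be illustrated with the classical pool-adjacent-violators algorithm (PAVA), which computes the least-squares monotone fit for complete data; the interval-censored extension developed in the paper is considerably more involved.

```python
def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y."""
    blocks = []                                  # each block: [sum, count]
    for v in y:
        blocks.append([float(v), 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)                  # each block is fit by its mean
    return fit
```

For example, the violating pair (3, 2) in the input [1, 3, 2, 4] is pooled to its mean, giving [1, 2.5, 2.5, 4].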

Identifying factors associated with increased medical cost is important for many micro- and macro-institutions, including the national economy and public health, insurers and the insured. However, assembling comprehensive national databases that include both the cost and individual-level predictors can prove challenging. Alternatively, one can use data from smaller studies with the understanding that conclusions drawn from such analyses may be limited to the participant population. At the same time, smaller clinical studies have limited follow-up, and lifetime medical cost may not be fully observed for all study participants. In this context, we develop new model selection methods and inference procedures for secondary analyses of clinical trial data when lifetime medical cost is subject to induced censoring. Our model selection methods extend the theory of penalized estimating functions to a calibration regression estimator tailored for this data type. Next, we develop a novel inference procedure for the unpenalized regression estimator using perturbation and resampling theory. Then, we extend this resampling plan to accommodate regularized coefficient estimation of censored lifetime medical cost and develop postselection inference procedures for the final model. Our methods are motivated by data from Southwest Oncology Group Protocol 9509, a clinical trial of patients with advanced nonsmall cell lung cancer, and our models of lifetime medical cost are specific to this population. But the methods presented in this article are built on rather general techniques and could be applied to larger databases as those data become available.

We consider methods for estimating the treatment effect and/or the covariate-by-treatment interaction effect in a randomized clinical trial under noncompliance with a time-to-event outcome. As in Cuzick et al. (2007), we assume that the patient population consists of three (possibly latent) subgroups based on treatment preference, the *ambivalent* group, the *insisters*, and the *refusers*, and we estimate the effects among the *ambivalent* group. The parameters have causal interpretations under standard assumptions. The article contains two main contributions. First, we propose a weighted per-protocol (Wtd PP) estimator that incorporates time-varying weights in a proportional hazards model. Second, under the model considered in Cuzick et al. (2007), we propose an EM algorithm to maximize a full likelihood (FL) as well as the pseudo likelihood (PL) considered in Cuzick et al. (2007). The E step of the algorithm involves computing the conditional expectation of a linear function of the latent membership, and the main advantage of the EM algorithm is that the risk parameters can be updated by fitting a weighted Cox model using standard software and the baseline hazard can be updated using closed-form solutions. Simulations show that the EM algorithm is computationally much more efficient than directly maximizing the observed likelihood. The main advantage of the Wtd PP approach is that it is more robust to model misspecification among the *insisters* and *refusers*, since the outcome model does not impose distributional assumptions on these two groups.

We propose a new sparse estimation method for Cox (1972) proportional hazards models by optimizing an approximated information criterion. The main idea involves approximating the ℓ0 norm with a continuous or smooth unit dent function. The proposed method bridges the best subset selection and regularization by borrowing strength from both. It mimics the best subset selection using a penalized likelihood approach, yet with no need for a tuning parameter. We further reformulate the problem with a reparameterization step, so that it reduces to a single unconstrained, nonconvex, yet smooth programming problem, which can be solved efficiently as in computing the maximum partial likelihood estimator (MPLE). Furthermore, the reparameterization tactic yields an additional advantage in terms of circumventing postselection inference. The oracle property of the proposed method is established. Both simulated experiments and empirical examples are provided for assessment and illustration.
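A "unit dent" surrogate can be made concrete as follows; the particular function b²/(b² + ε) is one standard smooth approximation to the indicator of a nonzero coefficient, an assumption here rather than necessarily the paper's choice. Each coefficient then contributes roughly 0 near zero and roughly 1 away from zero, so the penalized criterion mimics BIC with the exact model size smoothed out:

```python
import math

def approx_l0(betas, eps=1e-4):
    """Smooth surrogate for the number of nonzero coefficients: each
    coordinate contributes b^2 / (b^2 + eps), which is ~0 near zero
    and ~1 away from zero (a 'unit dent')."""
    return sum(b * b / (b * b + eps) for b in betas)

def approx_bic(neg2_loglik, betas, n, eps=1e-4):
    """BIC with the exact model size replaced by its smooth surrogate,
    so the criterion can be optimized by smooth (nonconvex) programming."""
    return neg2_loglik + math.log(n) * approx_l0(betas, eps)
```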

In population-based cancer studies, it is often of interest to compare cancer survival between different populations. However, in such studies, the exact causes of death are often unavailable or unreliable. Net survival methods were developed to overcome this difficulty. Net survival is the survival that would be observed if the disease under study were the only possible cause of death. The Pohar-Perme estimator (PPE) is a nonparametric consistent estimator of net survival. In this article, we present a log-rank-type test for comparing net survival functions (as estimated by the PPE) between several groups. We cast the test within the counting process framework to introduce the inverse probability weighting procedure required by the PPE. We also build a stratified version to control for categorical covariates that affect the outcome. We perform simulation studies to evaluate the performance of this test and illustrate it with an application to real data.

Semi-competing risks data are often encountered in chronic disease follow-up studies that record both nonterminal events (e.g., disease landmark events) and terminal events (e.g., death). Studying the relationship between the nonterminal event and the terminal event can provide insightful information on disease progression. In this article, we propose a new sensible dependence measure tailored to addressing such an interest. We develop a nonparametric estimator, which is general enough to handle both independent right censoring and left truncation. Our strategy of connecting the new dependence measure with quantile regression enables a natural extension to adjust for covariates with minor additional assumptions imposed. We establish the asymptotic properties of the proposed estimators and develop inferences accordingly. Simulation studies suggest good finite-sample performance of the proposed methods. Our proposals are illustrated via an application to Denmark diabetes registry data.

This article considers nonparametric methods for studying recurrent disease and death with competing risks. We first point out that comparisons based on the well-known cumulative incidence function can be confounded by different prevalence rates of the competing events, and that comparisons of the conditional distribution of the survival time given the failure event type are more relevant for investigating the prognosis of different patterns of recurrent disease. We then propose nonparametric estimators for the conditional cumulative incidence function as well as the conditional bivariate cumulative incidence function for the bivariate gap times, that is, the time to disease recurrence and the residual lifetime after recurrence. To quantify the association between the two gap times in the competing risks setting, a modified Kendall's tau statistic is proposed. The proposed estimators for the conditional bivariate cumulative incidence distribution and the association measure account for the induced dependent censoring of the second gap time. Uniform consistency and weak convergence of the proposed estimators are established. Hypothesis testing procedures for two-sample comparisons are discussed. Numerical simulation studies with practical sample sizes are conducted to evaluate the performance of the proposed nonparametric estimators and tests. An application to data from a pancreatic cancer study is presented to illustrate the methods developed in this article.
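For intuition, the complete-data version of Kendall's tau on the two gap times (x = time to recurrence, y = residual lifetime after recurrence) looks as follows; the paper's modified statistic additionally corrects for the induced dependent censoring of the second gap time.

```python
def kendalls_tau(x, y):
    """Kendall's tau for paired observations (no censoring adjustment):
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```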

Next-generation sequencing data, also called high-throughput sequencing data, are recorded as count data, which are generally far from normally distributed. Under the assumption that the count data follow a Poisson log-normal distribution, this article provides an ℓ1-penalized likelihood framework and an efficient search algorithm to estimate the structure of sparse directed acyclic graphs (DAGs) for multivariate count data. In searching for the solution, we use iterative optimization procedures to estimate the adjacency matrix and the variance matrix of the latent variables. Simulation results show that our proposed method outperforms the approach that assumes multivariate normal distributions, as well as the log-transformation approach; it also outperforms the rank-based PC method under sparse network or hub network structures. As a real data example, we demonstrate the efficiency of the proposed method in estimating the gene regulatory networks of an ovarian cancer study.

Next-generation sequencing technologies now constitute a method of choice to measure gene expression. The data to analyze are read counts, commonly modeled using negative binomial distributions. A relevant issue associated with this probabilistic framework is the reliable estimation of the overdispersion parameter, made harder by the limited number of replicates generally observable for each gene. Many strategies have been proposed to estimate this parameter, but when differential analysis is the purpose, they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead, we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show through a wide simulation study that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to common procedures, since it reaches the nominal value for the type-I error while maintaining high discriminative power between differentially and non-differentially expressed genes. The method is finally illustrated on prostate cancer RNA-Seq data.

Array-based CGH experiments are designed to detect genomic aberrations or regions of DNA copy-number variation that are associated with an outcome, typically a state of disease. Most existing statistical methods focus on detecting DNA copy-number variations in a single sample or array. We focus on the detection of group effect variation through the simultaneous study of multiple samples from multiple groups. Rather than using direct segmentation or smoothing techniques, as commonly seen in existing detection methods, we develop a sequential model selection procedure that is guided by a modified Bayesian information criterion. This approach improves detection accuracy by cumulatively utilizing information across contiguous clones, and has a computational advantage over the existing popular detection methods. Our empirical investigation suggests that the performance of the proposed method is superior to that of the existing detection methods, in particular in detecting small segments or separating neighboring segments with differential degrees of copy-number variation.

The twin method refers to the use of data from same-sex identical and fraternal twins to estimate the genetic and environmental contributions to a trait or outcome. The standard twin method is the variance component twin method that estimates heritability, the fraction of variance attributed to additive genetic inheritance. The latent class twin method estimates two quantities that are easier to interpret than heritability: the genetic prevalence, which is the fraction of persons in the genetic susceptibility latent class, and the heritability fraction, which is the fraction of persons in the genetic susceptibility latent class with the trait or outcome. We extend the latent class twin method in three important ways. First, we incorporate an additive genetic model to broaden the sensitivity analysis beyond the original autosomal dominant and recessive genetic models. Second, we specify a separate survival model to simplify computations and improve convergence. Third, we show how to easily adjust for covariates by extending the method of propensity scores from a treatment difference to zygosity. Applying the latent class twin method to data on breast cancer among Nordic twins, we estimated a genetic prevalence of 1%, a result with important implications for breast cancer prevention research.

We introduce in this work the Interval Testing Procedure (ITP), a novel inferential technique for functional data. The procedure can be used to test different functional hypotheses, e.g., distributional equality between two or more functional populations, or equality of the mean function of a functional population to a reference. The ITP involves three steps: (i) the representation of data on a (possibly high-dimensional) functional basis; (ii) the test of each possible set of consecutive basis coefficients; (iii) the computation of the adjusted *p*-values associated with each basis component, by means of a new strategy proposed here. We define a new type of error control, the interval-wise control of the family-wise error rate, which is particularly suited for functional data, and show that the ITP provides such control. A simulation study comparing the ITP with other testing procedures is reported. The ITP is then applied to the analysis of hemodynamic features involved in cerebral aneurysm pathology. The ITP is implemented in the fdatest R package.

Functional principal component analysis (FPCA) is a popular approach to explore major sources of variation in a sample of random curves. These major sources of variation are represented by functional principal components (FPCs). The intervals where the values of FPCs are significant are interpreted as where sample curves have major variations. However, these intervals are often hard for naïve users to identify, because of the vague definition of “significant values”. In this article, we develop a novel penalty-based method to derive FPCs that are only nonzero precisely in the intervals where the values of FPCs are significant, whence the derived FPCs possess better interpretability than the FPCs derived from existing methods. To compute the proposed FPCs, we devise an efficient algorithm based on projection deflation techniques. We show that the proposed interpretable FPCs are strongly consistent and asymptotically normal under mild conditions. Simulation studies confirm that with a competitive performance in explaining variations of sample curves, the proposed FPCs are more interpretable than the traditional counterparts. This advantage is demonstrated by analyzing two real datasets, namely, electroencephalography data and Canadian weather data.

Dynamic treatment regimens (DTRs) recommend treatments based on evolving subject-level data. The optimal DTR is the one that maximizes expected patient outcome, and, as such, its identification is of primary interest in the personalized medicine setting. When analyzing data from observational studies using semi-parametric approaches, there are two primary components that can be modeled: the expected level of treatment and the expected outcome for a patient given their other covariates. In an effort to offer greater flexibility, so-called doubly robust methods have been developed, which offer consistent parameter estimators as long as at least one of these two models is correctly specified. In practice, however, it can be difficult to be confident that this is the case. Using G-estimation as our example method, we demonstrate how the property of double robustness itself can be used to provide evidence that a specified model is or is not correct. This approach is illustrated through simulation studies as well as data from the Multicenter AIDS Cohort Study.

A dynamic treatment regimen consists of decision rules that recommend how to individualize treatment to patients based on available treatment and covariate history. In many scientific domains, these decision rules are shared across stages of intervention. As an illustrative example, we discuss STAR*D, a multistage randomized clinical trial for treating major depression. Estimating these shared decision rules often amounts to estimating parameters indexing the decision rules that are shared across stages. In this article, we propose a novel simultaneous estimation procedure for the shared parameters based on Q-learning. We provide an extensive simulation study to illustrate the merit of the proposed method over simple competitors, in terms of the treatment allocation matching of the procedure with the “oracle” procedure, defined as the one that makes treatment recommendations based on the true parameter values as opposed to their estimates. We also look at bias and mean squared error of the individual parameter-estimates as secondary metrics. Finally, we analyze the STAR*D data using the proposed method.
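The backward-induction logic of Q-learning can be sketched in a tabular toy version with discrete states and treatments; the paper instead posits regression models with parameters shared across stages, so everything below (names, the absence of an intermediate reward) is illustrative only.

```python
from collections import defaultdict

def fit_q(transitions):
    """Backward-induction Q-learning for a two-stage regimen with discrete
    states/treatments. transitions: list of (s1, a1, s2, a2, outcome)."""
    # Stage 2: Q2(s2, a2) = mean final outcome for that state/treatment pair
    q2, cnt2 = defaultdict(float), defaultdict(int)
    for s1, a1, s2, a2, y in transitions:
        q2[(s2, a2)] += y
        cnt2[(s2, a2)] += 1
    q2 = {k: v / cnt2[k] for k, v in q2.items()}
    # Stage 1: Q1(s1, a1) = mean of the best achievable stage-2 value
    q1, cnt1 = defaultdict(float), defaultdict(int)
    for s1, a1, s2, a2, y in transitions:
        best2 = max(v for (s, _), v in q2.items() if s == s2)
        q1[(s1, a1)] += best2
        cnt1[(s1, a1)] += 1
    q1 = {k: v / cnt1[k] for k, v in q1.items()}
    return q1, q2
```

The estimated regimen then recommends, at each stage, the treatment maximizing the fitted Q-function for the current state.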

To evaluate a new therapy versus a control via a randomized, comparative clinical study or a series of trials, due to heterogeneity of the study patient population, a pre-specified, *predictive* enrichment procedure may be implemented to identify an “enrichable” subpopulation. For patients in this subpopulation, the therapy is expected to have a desirable overall risk-benefit profile. To develop and validate such a “therapy-diagnostic co-development” strategy, a three-step procedure may be conducted with three independent data sets from a series of similar studies or a single trial. At the first stage, we create various candidate scoring systems based on the baseline information of the patients via, for example, parametric models using the first data set. Each individual score reflects an anticipated average treatment difference for future patients who share similar baseline profiles. A large score indicates that these patients tend to benefit from the new therapy. At the second step, a potentially promising, enrichable subgroup is identified using the totality of evidence from these scoring systems. At the final stage, we validate such a selection via two-sample inference procedures for assessing the treatment effectiveness statistically and clinically with the third data set, the so-called holdout sample. When the study size is not large, one may combine the first two steps using a “cross-training-evaluation” process. Comprehensive numerical studies are conducted to investigate the operational characteristics of the proposed method. The entire enrichment procedure is illustrated with the data from a cardiovascular trial to evaluate a beta-blocker versus a placebo for treating chronic heart failure patients.

Motivated by an ongoing study to develop a screening test able to identify patients with undiagnosed Sjögren's Syndrome in a symptomatic population, we propose methodology to combine multiple biomarkers and evaluate their performance in a two-stage group sequential design that proceeds as follows: biomarker data are collected from first-stage samples; the biomarker panel is built and evaluated; if the panel meets pre-specified performance criteria, the study continues to the second stage and the remaining samples are assayed. The design allows us to conserve valuable specimens in the case of inadequate biomarker panel performance. We propose a nonparametric conditional resampling algorithm that uses all the study data to provide unbiased estimates of the biomarker combination rule and the sensitivity of the panel corresponding to a specificity of 1-t on the receiver operating characteristic (ROC) curve. The Copas and Corbett (2002) correction for the bias that results from using the same data to derive the combination rule and estimate the ROC was also evaluated, and an improved version was incorporated. An extensive simulation study was conducted to evaluate finite-sample performance and propose guidelines for designing studies of this type. The methods were implemented in the National Cancer Institute's Early Detection Research Network Urinary PCA3 Evaluation Trial.
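The ROC quantity being estimated, sensitivity at specificity 1-t, has a simple empirical form: threshold at the appropriate quantile of the control scores and count the cases above it. A naive sketch, without the conditional resampling or bias correction that the paper develops:

```python
import math

def sensitivity_at_specificity(cases, controls, spec=0.95):
    """Empirical ROC point: take the spec-quantile of the control scores as
    the threshold, then report the fraction of case scores above it."""
    s = sorted(controls)
    k = min(len(s) - 1, max(0, math.ceil(spec * len(s)) - 1))
    return sum(1 for c in cases if c > s[k]) / len(cases)
```

Because the same data would normally build the combination rule and evaluate it, this plug-in estimate is optimistically biased, which is exactly what the resampling algorithm and the Copas-Corbett-type correction address.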

Prediction models for disease risk and prognosis play an important role in biomedical research, and evaluating their predictive accuracy in the presence of censored data is of substantial interest. The standard concordance (*c*) statistic has been extended to provide a summary measure of predictive accuracy for survival models. Motivated by a prostate cancer study, we address several issues associated with evaluating survival prediction models based on the *c* statistic, with a focus on estimators using the technique of inverse probability of censoring weighting (IPCW). Compared to the existing work, we provide complete results on the asymptotic properties of the IPCW estimators under the assumption of coarsening at random (CAR), and propose a sensitivity analysis under the mechanism of noncoarsening at random (NCAR). In addition, we extend the IPCW approach as well as the sensitivity analysis to high-dimensional settings. The predictive accuracy of prediction models for cancer recurrence after prostatectomy is assessed by applying the proposed approaches. We find that the estimated predictive accuracy for the models in consideration is sensitive to the NCAR assumption, and we thus identify the best predictive model. Finally, we further evaluate the performance of the proposed methods in both low-dimensional and high-dimensional settings under CAR and NCAR through simulations.
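As a baseline for what IPCW corrects, Harrell's concordance for censored data counts, among usable pairs (those where the earlier time is an observed event), how often the earlier failure carries the higher risk score; the IPCW estimators instead reweight comparable pairs by inverse censoring-survival probabilities. A sketch of the baseline version only:

```python
def harrell_c(time, event, risk):
    """Harrell's c for right-censored data. A pair (i, j) is usable when
    subject i has an observed event (event[i] == 1) strictly before time[j];
    concordance means the earlier failure has the higher risk score."""
    conc = ties = usable = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if i != j and event[i] == 1 and time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    conc += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (conc + 0.5 * ties) / usable
```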

In oncology, the international WHO and RECIST criteria have allowed the standardization of tumor response evaluation in order to identify the time of disease progression. These semi-quantitative measurements are often used as endpoints in phase II and phase III trials to study the efficacy of new therapies. However, categorizing the continuous tumor size loses information, and these criteria are challenged by recently developed methods for modeling biomarkers longitudinally. It is therefore of interest to compare, with respect to overall survival, the predictive ability of cancer progression defined by the categorical criteria with that of quantitative measures of tumor size (left-censored due to detection limits) and/or of the appearance of new lesions. We propose a joint model for a simultaneous analysis of three types of data: a longitudinal marker, recurrent events, and a terminal event. In a randomized clinical trial, the model makes it possible to determine on which particular component the treatment acts most. A simulation study is performed and shows that the proposed trivariate model is appropriate for practical use. We propose statistical tools that evaluate the predictive accuracy of joint models, in order to compare our model with models based on the categorical criteria and their components. We apply the model to a randomized phase III clinical trial of metastatic colorectal cancer, conducted by the Fédération Francophone de Cancérologie Digestive (FFCD 2000–05 trial), which assigned 410 patients to two therapeutic strategies with multiple successive chemotherapy regimens.

Predicting binary events such as newborns with large birthweight is important for obstetricians in their attempt to reduce both maternal and fetal morbidity and mortality. Such predictions have been a challenge in obstetric practice, where longitudinal ultrasound measurements taken at multiple gestational times during pregnancy may be useful for predicting various poor pregnancy outcomes. The focus of this article is on developing a flexible class of joint models for the multivariate longitudinal ultrasound measurements that can be used for predicting a binary event at birth. A skewed multivariate random effects model is proposed for the ultrasound measurements, and the skewed generalized *t*-link is assumed for the link function relating the binary event and the underlying longitudinal processes. We consider a shared random effect to link the two processes together. Markov chain Monte Carlo sampling is used to carry out Bayesian posterior computation. Several variations of the proposed model are considered and compared via the deviance information criterion, the logarithm of pseudomarginal likelihood, and with a training-test set prediction paradigm. The proposed methodology is illustrated with data from the NICHD Successive Small-for-Gestational-Age Births study, a large prospective fetal growth cohort conducted in Norway and Sweden.

Clinical trials often collect multiple outcomes on each patient, as the treatment may be expected to affect the patient on many dimensions. For example, a treatment for a neurological disease such as ALS is intended to impact several dimensions of neurological function as well as survival. The assessment of treatment on the basis of multiple outcomes is challenging, both in terms of selecting a test and interpreting the results. Several global tests have been proposed, and we provide a general approach to selecting and executing a global test. The tests require minimal parametric assumptions, are flexible about weighting of the various outcomes, and are appropriate even when some or all of the outcomes are censored. The test we propose is based on a simple scoring mechanism applied to each pair of subjects for each endpoint. The pairwise scores are then reduced to a summary score, and a rank-sum test is applied to the summary scores. This can be seen as a generalization of previously proposed nonparametric global tests (e.g., O'Brien, 1984). We discuss the choice of optimal weighting schemes based on power and relative importance of the outcomes. As the optimal weights are generally unknown in practice, we also propose an adaptive weighting scheme and evaluate its performance in simulations. We apply the methods to analyze the impact of a treatment on neurological function and death in an ALS trial.
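The scoring mechanism described above can be sketched as follows, assuming a simple "larger observed value is better" convention for each possibly censored endpoint. The names, the convention, and the normal approximation to the rank-sum test are illustrative choices, not the article's exact procedure.

```python
import math
import numpy as np

def pairwise_score(x_i, x_j, d_i, d_j):
    """Score one pair on one endpoint (d = 1 means the value was observed
    rather than censored): +1 if subject i is known to do better than j,
    -1 if known to do worse, 0 if censoring leaves the order undetermined."""
    if x_i > x_j and d_j == 1:
        return 1
    if x_i < x_j and d_i == 1:
        return -1
    return 0

def midranks(u):
    """Ranks 1..n with ties replaced by their average rank."""
    order = np.argsort(u, kind="mergesort")
    r = np.empty(len(u))
    i = 0
    while i < len(u):
        j = i
        while j < len(u) and u[order[j]] == u[order[i]]:
            j += 1
        r[order[i:j]] = (i + j + 1) / 2.0   # average of ranks i+1 .. j
        i = j
    return r

def global_rank_test(X, D, arm, weights):
    """X, D: (n, K) endpoint values / event indicators; arm: (n,) 0-1 labels.
    Each subject's summary score is the weighted sum, over endpoints and all
    opponents, of pairwise scores; a normal-approximation rank-sum test is
    then applied to the summary scores."""
    n, K = X.shape
    U = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                U[i] += sum(weights[k] *
                            pairwise_score(X[i, k], X[j, k], D[i, k], D[j, k])
                            for k in range(K))
    r = midranks(U)
    n1 = int(arm.sum())
    W = r[arm == 1].sum()
    z = (W - n1 * (n + 1) / 2.0) / math.sqrt((n - n1) * n1 * (n + 1) / 12.0)
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return z, p
```

The weights argument is where the fixed or adaptive weighting schemes discussed in the abstract would enter.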

Amyotrophic lateral sclerosis (ALS) is a neurodegenerative condition characterized by the progressive deterioration of motor neurons in the cortex and spinal cord. Using an automated robotic microscope platform that enables the longitudinal tracking of thousands of single neurons, we examine the effects of a large library of compounds on the survival of primary neurons expressing a mutation known to cause ALS. The goal of our analysis is to identify the few potentially beneficial compounds among the many assayed, the vast majority of which do not extend neuronal survival. This resembles the large-scale simultaneous inference scenario familiar from microarray analysis, but transferred to the survival analysis setting due to the novel experimental setup. We apply a three-component mixture model to censored survival times of thousands of individual neurons subjected to hundreds of different compounds. The shrinkage induced by our model significantly improves performance in simulations relative to performing treatment-wise survival analysis and subsequent multiple testing adjustment. Our analysis identified compounds that provide insight into potential novel therapeutic strategies for ALS.

Meta-analysis of trans-ethnic genome-wide association studies (GWAS) has proven to be a practical and profitable approach for identifying loci that contribute to the risk of complex diseases. However, the expected genetic effect heterogeneity cannot easily be accommodated through existing fixed-effects and random-effects methods. In response, we propose a novel random effect model for trans-ethnic meta-analysis with flexible modeling of the expected genetic effect heterogeneity across diverse populations. Specifically, we adopt a modified random effect model from the kernel regression framework, in which the genetic effect coefficients are random variables whose correlation structure reflects the genetic distances across ancestry groups. In addition, we use the adaptive variance component test to achieve robust power regardless of the degree of genetic effect heterogeneity. Simulation studies show that our proposed method has well-calibrated type I error rates at very stringent significance levels and can improve power over traditional meta-analysis methods. We reanalyzed the published type 2 diabetes GWAS meta-analysis (Consortium et al., 2014) and successfully identified one additional SNP that clearly exhibits genetic effect heterogeneity across ancestry groups. Furthermore, our proposed method provides scalable computing time for genome-wide datasets: an analysis of one million SNPs would require less than 3 hours.
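A minimal sketch of the kernel idea, under assumed names and an assumed Gaussian kernel: the covariance of the per-ancestry random effect coefficients is a kernel of the genetic distance matrix, so closer ancestry groups share more of the genetic effect. The parameterization here is an illustrative assumption, not the article's specification.

```python
import numpy as np

def effect_covariance(D, tau2, h):
    """Covariance matrix of per-ancestry random genetic effect coefficients:
    a Gaussian kernel of the pairwise genetic distance matrix D with
    bandwidth h, scaled by the variance component tau2. Distance zero gives
    covariance tau2; covariance decays as ancestry groups grow apart."""
    D = np.asarray(D, dtype=float)
    return tau2 * np.exp(-(D ** 2) / h)
```

In the variance component test, tau2 is the parameter whose null value of zero corresponds to no genetic effect.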

We discuss the use of the determinantal point process (DPP) as a prior for latent structure in biomedical applications, where inference often centers on the interpretation of latent features as biologically or clinically meaningful structure. Typical examples include mixture models in which the terms of the mixture are meant to represent clinically meaningful subpopulations (of patients, genes, etc.). Another class of examples comprises feature allocation models. We propose the DPP as a repulsive prior on latent mixture components in the first example, and as a prior on feature-specific parameters in the second case. We argue that the DPP is in general an attractive prior model for latent structure when a biologically relevant interpretation of such structure is desired. We illustrate the advantages of the DPP prior in three case studies, including inference in mixture models for magnetic resonance images (MRI) and for protein expression, and a feature allocation model for gene expression using data from The Cancer Genome Atlas. An important part of our argument is the availability of efficient and straightforward posterior simulation methods: we implement a variation of reversible jump Markov chain Monte Carlo simulation for inference under the DPP prior, using a density with respect to the unit rate Poisson process.
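The repulsion that motivates the DPP prior is visible directly in the L-ensemble form, where a configuration's probability is proportional to a principal minor of the kernel matrix. The sketch below (hypothetical names) shows that a pair of near-duplicate points receives far less prior weight than a well-separated pair:

```python
import numpy as np

def dpp_unnormalized_prob(L, subset):
    """L-ensemble DPP: the probability of a configuration (a subset of the
    candidate points) is proportional to det(L_A), the principal minor of
    the kernel L indexed by the subset. Highly similar points make the
    minor nearly singular, hence nearly zero prior probability: repulsion."""
    idx = np.ix_(subset, subset)
    return float(np.linalg.det(L[idx]))
```

Used as a prior on mixture component locations, this determinant penalizes redundant, overlapping components.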

Potential reductions in laboratory assay costs afforded by pooling equal aliquots of biospecimens have long been recognized in disease surveillance and epidemiological research and, more recently, have motivated design and analytic developments in regression settings. For example, Weinberg and Umbach (1999, *Biometrics* **55**, 718–726) provided methods for fitting set-based logistic regression models to case-control data when a continuous exposure variable (e.g., a biomarker) is assayed on pooled specimens. We focus on improving estimation efficiency by utilizing available subject-specific information at the pool allocation stage. We find that a strategy that we call “(y,**c**)-pooling,” which forms pooling sets of individuals within strata defined jointly by the outcome and other covariates, provides more precise estimation of the risk parameters associated with those covariates than does pooling within strata defined only by the outcome. We review the approach to set-based analysis through offsets developed by Weinberg and Umbach in a recent correction to their original paper. We propose a method for variance estimation under this design and use simulations and a real-data example to illustrate the precision benefits of (y,**c**)-pooling relative to y-pooling. We also note and illustrate that set-based models permit estimation of covariate interactions with exposure.
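A minimal sketch of the allocation step, with hypothetical names: pooling sets of size g are formed within strata defined jointly by the outcome and a covariate, and the pooled biomarker value of equal aliquots is taken as the mean of the members' values. This illustrates the design only, not the offset-based estimation or the proposed variance estimator.

```python
import numpy as np
from collections import defaultdict

def form_pools(y, c, x, g):
    """(y, c)-pooling: group subject indices by the joint (outcome,
    covariate) stratum, then form pools of size g within each stratum.
    The pooled biomarker value of equal aliquots is the member mean.
    Leftover subjects (fewer than g remaining in a stratum) are skipped
    in this sketch."""
    strata = defaultdict(list)
    for i, key in enumerate(zip(y, c)):
        strata[key].append(i)
    pools = []
    for key, idx in sorted(strata.items()):
        for s in range(0, len(idx) - g + 1, g):
            members = idx[s:s + g]
            pools.append({"stratum": key,
                          "members": members,
                          "pooled_x": float(np.mean([x[m] for m in members]))})
    return pools
```

Pooling within (y, c) strata rather than y strata alone is what preserves precision for the risk parameters associated with c.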

Vaccine-induced protection may not be homogeneous across individuals. It is possible that a vaccine gives complete protection to a portion of individuals, while the rest acquire only incomplete (leaky) protection of varying magnitude. If vaccine efficacy is estimated under wrong assumptions about such individual-level heterogeneity, the resulting estimates may be difficult to interpret. For instance, population-level predictions based on such estimates may be biased. We consider the problem of estimating heterogeneous vaccine efficacy against an infection that can be acquired multiple times (susceptible-infected-susceptible model). The estimation is based on a limited number of repeated measurements of the current status of each individual, a situation commonly encountered in practice. We investigate how the placement of consecutive samples affects the estimability and efficiency of vaccine efficacy parameters. The same sampling frequency may not be optimal for efficient estimation of all components of heterogeneous vaccine protection. However, we suggest practical guidelines allowing estimation of all components. For situations in which the estimability of individual components fails, we suggest using summary measures of vaccine efficacy.

Zero-inflated regression models have emerged as a popular tool within the parametric framework to characterize count data with excess zeros. Despite their increasing popularity, much of the literature on real applications of these models has centered around the latent class formulation, where the mean response of the so-called at-risk or susceptible population and the susceptibility probability are both related to covariates. While this formulation in some instances provides an interesting representation of the data, it often fails to produce easily interpretable covariate effects on the overall mean response. In this article, we propose two approaches that circumvent this limitation. The first approach consists of estimating the effect of covariates on the overall mean from the assumed latent class models, while the second approach formulates a model that directly relates the overall mean to covariates. Our results are illustrated by extensive numerical simulations and an application to an oral health study on low-income African-American children, where the overall mean model is used to evaluate the effect of sugar consumption on caries indices.
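The interpretability issue can be made concrete with the zero-inflated Poisson: the overall mean is the product of the at-risk probability and the at-risk mean, so a covariate's effect on the overall mean mixes both linear predictors and is no longer read off a single coefficient. The sketch below assumes conventional log and logit links; names and parameterization are illustrative, not the article's estimators.

```python
import numpy as np

def zip_overall_mean(x, beta, gamma):
    """Overall (marginal) mean of a zero-inflated Poisson under the latent
    class formulation: log link for the at-risk mean mu, logit link for the
    structural-zero probability pi, so E[Y] = (1 - pi) * mu."""
    mu = np.exp(np.dot(x, beta))                         # at-risk mean
    p_at_risk = 1.0 / (1.0 + np.exp(np.dot(x, gamma)))   # 1 - pi
    return p_at_risk * mu
```

Because both factors depend on x, the ratio of overall means at two covariate values is not simply exp(beta_k), which is precisely the limitation the two proposed approaches address.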

While there are many validated prognostic classifiers used in practice, often their accuracy is modest and heterogeneity in clinical outcomes exists in one or more risk subgroups. Newly available markers, such as genomic mutations, may be used to improve the accuracy of an existing classifier by reclassifying patients from a heterogeneous group into a higher or lower risk category. The statistical tools typically applied to develop the initial classifiers are not easily adapted toward this reclassification goal. In this article, we develop a new method designed to refine an existing prognostic classifier by incorporating new markers. The two-stage algorithm, called Boomerang, first searches for modifications of the existing classifier that increase the overall predictive accuracy, and then merges the resulting groups down to a prespecified number of risk groups. Resampling techniques are proposed to assess the improvement in predictive accuracy when an independent validation data set is not available. The performance of the algorithm is assessed under various simulation scenarios where the marker frequency, degree of censoring, and total sample size are varied. The results suggest that the method selects few false positive markers and is able to improve the predictive accuracy of the classifier in many settings. Lastly, the method is illustrated on an acute myeloid leukemia data set, where a new refined classifier incorporates four new mutations into the existing three-category classifier and is validated on an independent data set.

Li, Fine, and Brookhart (2015) presented an extension of the two-stage least squares (2SLS) method for additive hazards models which requires an assumption that the censoring distribution is unrelated to the endogenous exposure variable. We present another extension of 2SLS that can address this limitation.