We discuss the use of the determinantal point process (DPP) as a prior for latent structure in biomedical applications, where inference often centers on the interpretation of latent features as biologically or clinically meaningful structure. Typical examples include mixture models in which the mixture components are meant to represent clinically meaningful subpopulations (of patients, genes, etc.). Another class of examples comprises feature allocation models. We propose the DPP as a repulsive prior on latent mixture components in the first example, and as a prior on feature-specific parameters in the second. We argue that the DPP is in general an attractive prior model for latent structure when a biologically relevant interpretation of such structure is desired. We illustrate the advantages of the DPP prior in three case studies, including inference in mixture models for magnetic resonance images (MRI) and for protein expression, and a feature allocation model for gene expression using data from The Cancer Genome Atlas. An important part of our argument is the availability of efficient and straightforward posterior simulation methods. We implement a variation of reversible jump Markov chain Monte Carlo simulation for inference under the DPP prior, using a density with respect to the unit rate Poisson process.

We propose a new sparse estimation method for Cox (1972) proportional hazards models by optimizing an approximated information criterion. The main idea involves approximation of the ℓ0 norm with a continuous or smooth unit dent function. The proposed method bridges the best subset selection and regularization by borrowing strength from both. It mimics the best subset selection using a penalized likelihood approach, yet with no need for a tuning parameter. We further reformulate the problem with a reparameterization step so that it reduces to one unconstrained, nonconvex, yet smooth programming problem, which can be solved efficiently as in computing the maximum partial likelihood estimator (MPLE). Furthermore, the reparameterization tactic yields an additional advantage in terms of circumventing postselection inference. The oracle property of the proposed method is established. Both simulated experiments and empirical examples are provided for assessment and illustration.

Capture–recapture methods are used to estimate the size of a population of interest which is only partially observed. In such studies, each member of the population carries a count of the number of times it has been identified during the observational period. In real-life applications, only positive counts are recorded, yielding a zero-truncated observed distribution. We need to use the truncated count distribution to estimate the number of unobserved units. We consider ratios of neighboring count probabilities, estimated by ratios of observed frequencies, regardless of whether we have a zero-truncated or an untruncated distribution. Rocchetti et al. (2011) have shown that, for densities in the Katz family, these ratios can be modeled by a regression approach, and Rocchetti et al. (2014) have specialized the approach to the beta-binomial distribution. Once the regression model has been estimated, the unobserved frequency of zero counts can be simply derived. The guiding principle is that it is often easier to find an appropriate regression model than a proper model for the count distribution. However, a full analysis of the connection between the regression model and the associated count distribution has been missing. In this manuscript, we fill the gap and show that the regression model approach leads, under general conditions, to a valid count distribution; we also consider a wider class of regression models, based on fractional polynomials. The proposed approach is illustrated by analyzing various empirical applications, and by means of a simulation study.
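To make the guiding principle concrete, here is a minimal sketch of the ratio-regression idea in the simplest (Poisson) case, with made-up parameter values; the fractional-polynomial models in the manuscript generalize the linear fit used below. For a Poisson distribution the ratios r_x = (x + 1) p_{x+1} / p_x are constant in x, so a linear regression of the observed frequency ratios can be extrapolated to x = 0 to recover the unobserved zero frequency:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=100_000)       # hypothetical full population
obs = counts[counts > 0]                      # zero-truncated sample we observe

freq = np.bincount(obs)                       # freq[x] = observed frequency f_x
xs = np.arange(1, len(freq) - 1)
xs = xs[(freq[xs] > 50) & (freq[xs + 1] > 50)]  # keep well-populated cells
ratios = (xs + 1) * freq[xs + 1] / freq[xs]   # r_x = (x+1) f_{x+1} / f_x

slope, intercept = np.polyfit(xs, ratios, 1)  # linear ratio regression in x
r0 = intercept                                # extrapolated ratio at x = 0
f0_hat = freq[1] / r0                         # implied unobserved zero count
```

The extrapolated ratio at zero gives f0 ≈ f1 / r0, i.e., an estimate of the number of unobserved units; for the Poisson the fitted slope should be near zero and the intercept near the rate parameter.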

In this work, a new metric of surrogacy, the so-called individual causal association (ICA), is introduced using information-theoretic concepts and a causal inference model for a binary surrogate and true endpoint. The ICA has a simple and appealing interpretation in terms of uncertainty reduction and, in some scenarios, it seems to provide a more coherent assessment of the validity of a surrogate than existing measures. Identifiability issues are tackled using a two-step procedure. In the first step, the region of the parameter space of the distribution of the potential outcomes that is compatible with the data at hand is characterized geometrically. In the second step, a Monte Carlo approach is proposed to study the behavior of the ICA on this region. The method is illustrated using data from the Collaborative Initial Glaucoma Treatment Study. A newly developed and user-friendly R package *Surrogate* is provided to carry out the evaluation exercise.

Next-generation (high-throughput) sequencing data are recorded as count data, which are generally far from normally distributed. Under the assumption that the count data follow a Poisson log-normal distribution, this article provides an ℓ1-penalized likelihood framework and an efficient search algorithm to estimate the structure of sparse directed acyclic graphs (DAGs) for multivariate count data. In searching for the solution, we use iterative optimization procedures to estimate the adjacency matrix and the variance matrix of the latent variables. Simulation results show that our proposed method outperforms both the approach that assumes multivariate normal distributions and the log-transformation approach. They also show that the proposed method outperforms the rank-based PC method under sparse or hub network structures. As a real data example, we demonstrate the efficiency of the proposed method in estimating gene regulatory networks in an ovarian cancer study.
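A toy sketch of the assumed data-generating mechanism may help fix ideas (a hypothetical three-node DAG with invented coefficients; this illustrates the model being estimated, not the search algorithm): latent Gaussian variables follow the DAG structure, and the observed counts are Poisson given the exponentiated latent values.

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical DAG on 3 nodes: 0 -> 1 -> 2; B[j] holds the parent weights of node j
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.6, 0.0]])
n = 1000
Z = np.zeros((n, 3))
for j in range(3):                         # ancestral sampling of the latent layer
    Z[:, j] = Z @ B[j] + rng.normal(0.0, 0.5, n)
Y = rng.poisson(np.exp(Z))                 # observed counts: Poisson with log-normal rate
```

Because counts are only a noisy readout of the latent layer, dependence between nodes is attenuated at the observed level, which is part of what makes DAG estimation for counts harder than in the Gaussian case.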

Motivated by an ongoing study to develop a screening test able to identify patients with undiagnosed Sjögren's Syndrome in a symptomatic population, we propose methodology to combine multiple biomarkers and evaluate their performance in a two-stage group sequential design that proceeds as follows: biomarker data are collected from first-stage samples; the biomarker panel is built and evaluated; if the panel meets pre-specified performance criteria, the study continues to the second stage and the remaining samples are assayed. The design allows us to conserve valuable specimens in the case of inadequate biomarker panel performance. We propose a nonparametric conditional resampling algorithm that uses all the study data to provide unbiased estimates of the biomarker combination rule and the sensitivity of the panel corresponding to a specificity of 1 − t on the receiver operating characteristic (ROC) curve. The Copas and Corbett (2002) correction for bias resulting from using the same data to derive the combination rule and estimate the ROC curve was also evaluated, and an improved version was incorporated. An extensive simulation study was conducted to evaluate finite-sample performance and propose guidelines for designing studies of this type. The methods were implemented in the National Cancer Institute's Early Detection Research Network Urinary PCA3 Evaluation Trial.

Amyotrophic lateral sclerosis (ALS) is a neurodegenerative condition characterized by the progressive deterioration of motor neurons in the cortex and spinal cord. Using an automated robotic microscope platform that enables the longitudinal tracking of thousands of single neurons, we examine the effects of a large library of compounds on the survival of primary neurons expressing a mutation known to cause ALS. The goal of our analysis is to identify the few potentially beneficial compounds among the many assayed, the vast majority of which do not extend neuronal survival. This resembles the large-scale simultaneous inference scenario familiar from microarray analysis, but transferred to the survival analysis setting due to the novel experimental setup. We apply a three-component mixture model to censored survival times of thousands of individual neurons subjected to hundreds of different compounds. The shrinkage induced by our model significantly improves performance in simulations relative to performing treatment-wise survival analysis and subsequent multiple testing adjustment. Our analysis identified compounds that provide insight into potential novel therapeutic strategies for ALS.

In population-based cancer studies, it is often of interest to compare cancer survival between different populations. However, in such studies, the exact causes of death are often unavailable or unreliable. Net survival methods were developed to overcome this difficulty. Net survival is the survival that would be observed if the disease under study were the only possible cause of death. The Pohar-Perme estimator (PPE) is a nonparametric consistent estimator of net survival. In this article, we present a log-rank-type test for comparing net survival functions (as estimated by the PPE) between several groups. We cast the test within the counting process framework to introduce the inverse probability weighting procedure required by the PPE. We built a stratified version to control for categorical covariates that affect the outcome. We performed simulation studies to evaluate the performance of this test and applied it to real data.

We introduce in this work the Interval Testing Procedure (ITP), a novel inferential technique for functional data. The procedure can be used to test different functional hypotheses, e.g., distributional equality between two or more functional populations, or equality of the mean function of a functional population to a reference. ITP involves three steps: (i) the representation of data on a (possibly high-dimensional) functional basis; (ii) the test of each possible set of consecutive basis coefficients; (iii) the computation of the adjusted *p*-values associated with each basis component, by means of a new strategy proposed here. We define a new type of error control, the interval-wise control of the family-wise error rate, particularly suited for functional data. We show that ITP provides such control. A simulation study comparing ITP with other testing procedures is reported. ITP is then applied to the analysis of hemodynamic features associated with cerebral aneurysm pathology. ITP is implemented in the fdatest R package.
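The adjustment in step (iii) can be sketched as follows, assuming the *p*-value of every interval of consecutive basis components has already been computed (the interval indices and values in the usage example are invented):

```python
def interval_adjust(p_interval, n_comp):
    """Interval-wise adjustment: the adjusted p-value of component k is the
    maximum of the p-values of all intervals of consecutive components
    (i, j), i <= k <= j, that contain k."""
    adj = [0.0] * n_comp
    for (i, j), p in p_interval.items():   # interval covers components i..j
        for k in range(i, j + 1):
            adj[k] = max(adj[k], p)
    return adj
```

For instance, `interval_adjust({(0, 0): 0.01, (1, 1): 0.5, (0, 1): 0.2}, 2)` returns `[0.2, 0.5]`: each component inherits the largest *p*-value among the intervals covering it, which is what yields the interval-wise error control.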

Clinical trials often collect multiple outcomes on each patient, as the treatment may be expected to affect the patient on many dimensions. For example, a treatment for a neurological disease such as ALS is intended to impact several dimensions of neurological function as well as survival. The assessment of treatment on the basis of multiple outcomes is challenging, both in terms of selecting a test and interpreting the results. Several global tests have been proposed, and we provide a general approach to selecting and executing a global test. The tests require minimal parametric assumptions, are flexible about weighting of the various outcomes, and are appropriate even when some or all of the outcomes are censored. The test we propose is based on a simple scoring mechanism applied to each pair of subjects for each endpoint. The pairwise scores are then reduced to a summary score, and a rank-sum test is applied to the summary scores. This can be seen as a generalization of previously proposed nonparametric global tests (e.g., O'Brien, 1984). We discuss the choice of optimal weighting schemes based on power and relative importance of the outcomes. As the optimal weights are generally unknown in practice, we also propose an adaptive weighting scheme and evaluate its performance in simulations. We apply the methods to analyze the impact of a treatment on neurological function and death in an ALS trial.
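A rough sketch of such a pairwise-scoring global test for the uncensored case follows (the generalized scoring that accommodates censored endpoints is not shown, and the sign-based score below is just one simple choice):

```python
import numpy as np

def global_rank_test(y, group, weights=None):
    """O'Brien-style global test: pairwise scores per endpoint, a weighted
    summary score per subject, then a rank-sum comparison between arms.
    y: (n, k) outcomes with higher = better; group: (n,) array in {0, 1}.
    Ties and censoring are ignored in this sketch."""
    n, k = y.shape
    w = np.ones(k) if weights is None else np.asarray(weights, float)
    # pairwise score: +1 if subject i beats subject j on an endpoint, -1 if worse
    pair = np.sign(y[:, None, :] - y[None, :, :])       # shape (n, n, k)
    summary = (pair * w).sum(axis=(1, 2))               # summary score per subject
    ranks = summary.argsort().argsort() + 1.0           # ranks of summary scores
    n1 = (group == 1).sum()
    W = ranks[group == 1].sum()                         # rank sum in treated arm
    mu = n1 * (n + 1) / 2.0
    sd = np.sqrt(n1 * (n - n1) * (n + 1) / 12.0)
    return (W - mu) / sd                                # approximate Z statistic
```

Under a clear treatment benefit on the endpoints, the treated arm accumulates larger summary scores and the statistic is strongly positive.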

This article presents a new approach to modeling group animal movement in continuous time. The movement of a group of animals is modeled as a multivariate Ornstein–Uhlenbeck diffusion process in a high-dimensional space. Each individual of the group is attracted to a leading point, which is generally unobserved, and the movement of the leading point is itself an Ornstein–Uhlenbeck process attracted to an unknown attractor. The Ornstein–Uhlenbeck bridge is applied to reconstruct the location of the leading point. All movement parameters are estimated using Markov chain Monte Carlo sampling, specifically a Metropolis–Hastings algorithm. We apply the method to a small group of simultaneously tracked reindeer, *Rangifer tarandus tarandus*, showing that the method detects dependency in movement between individuals.
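A minimal simulation sketch of the hierarchy described above, with invented parameter values: each individual follows an Ornstein–Uhlenbeck pull toward the (here simulated, in the article unobserved) leading point, which itself follows an Ornstein–Uhlenbeck pull toward a fixed attractor.

```python
import numpy as np

def ou_step(x, mu, theta, sigma, dt, rng):
    """Exact OU transition over time dt: mean reversion to mu at rate theta."""
    a = np.exp(-theta * dt)
    sd = sigma * np.sqrt((1 - a**2) / (2 * theta))
    return mu + (x - mu) * a + sd * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
attractor = np.array([0.0, 0.0])             # unknown attractor (here fixed)
leader = np.array([5.0, 5.0])                # latent leading point
herd = rng.normal(5.0, 1.0, size=(4, 2))     # four tracked individuals in 2-D
path = []
for _ in range(500):
    leader = ou_step(leader, attractor, theta=0.5, sigma=0.3, dt=0.1, rng=rng)
    herd = ou_step(herd, leader, theta=1.0, sigma=0.5, dt=0.1, rng=rng)
    path.append(herd.copy())
```

Over time the group drifts toward the attractor while staying clustered around the leading point, which is the dependency structure the estimation procedure exploits.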

We consider methods for estimating the treatment effect and/or the covariate-by-treatment interaction effect in a randomized clinical trial under noncompliance with a time-to-event outcome. Following Cuzick et al. (2007), we assume that the patient population consists of three (possibly latent) subgroups based on treatment preference, the *ambivalent* group, the *insisters*, and the *refusers*, and we estimate the effects among the *ambivalent* group. The parameters have causal interpretations under standard assumptions. The article contains two main contributions. First, we propose a weighted per-protocol (Wtd PP) estimator that incorporates time-varying weights in a proportional hazards model. Second, under the model considered in Cuzick et al. (2007), we propose an EM algorithm to maximize a full likelihood (FL) as well as the pseudo-likelihood (PL) considered in Cuzick et al. (2007). The E step of the algorithm involves computing the conditional expectation of a linear function of the latent membership, and the main advantage of the EM algorithm is that the risk parameters can be updated by fitting a weighted Cox model using standard software, while the baseline hazard can be updated using closed-form solutions. Simulations show that the EM algorithm is computationally much more efficient than directly maximizing the observed likelihood. The main advantage of the Wtd PP approach is that it is more robust to model misspecification, since the outcome model imposes no distributional assumptions on the *insisters* and *refusers*.

Vaccine-induced protection may not be homogeneous across individuals. It is possible that a vaccine gives complete protection for a portion of individuals, while the rest acquire only incomplete (leaky) protection of varying magnitude. If vaccine efficacy is estimated under wrong assumptions about such individual-level heterogeneity, the resulting estimates may be difficult to interpret. For instance, population-level predictions based on such estimates may be biased. We consider the problem of estimating heterogeneous vaccine efficacy against an infection that can be acquired multiple times (susceptible-infected-susceptible model). The estimation is based on a limited number of repeated measurements of the current status of each individual, a situation commonly encountered in practice. We investigate how the placement of consecutive samples affects the estimability and efficiency of vaccine efficacy parameters. The same sampling frequency may not be optimal for efficient estimation of all components of heterogeneous vaccine protection. However, we suggest practical guidelines allowing estimation of all components. For situations in which the estimability of individual components fails, we suggest using summary measures of vaccine efficacy.

Spatial data have become increasingly common in epidemiology and public health research thanks to advances in GIS (Geographic Information Systems) technology. In health research, for example, it is common for epidemiologists to incorporate geographically indexed data into their studies. In practice, however, spatially defined covariates are often measured with error. Naive estimators of regression coefficients are attenuated if measurement error is ignored. Moreover, classical measurement error theory is inapplicable in the context of spatial modeling because of the presence of spatial correlation among the observations. We propose a semiparametric regression approach to obtain bias-corrected estimates of the regression parameters and derive their large-sample properties. We evaluate the performance of the proposed method through simulation studies and illustrate it using data on ischemic heart disease (IHD). Both the simulations and the application demonstrate that the proposed method can be effective in practice.

Dynamic treatment regimens (DTRs) recommend treatments based on evolving subject-level data. The optimal DTR is the one that maximizes the expected patient outcome, and its identification is therefore of primary interest in personalized medicine. When analyzing data from observational studies using semiparametric approaches, there are two primary components that can be modeled: the expected level of treatment and the expected outcome for a patient given their other covariates. To offer greater flexibility, so-called doubly robust methods have been developed, which yield consistent parameter estimators as long as at least one of these two models is correctly specified. In practice, however, it can be difficult to be confident that this is the case. Using G-estimation as our example method, we demonstrate how the property of double robustness itself can be used to provide evidence that a specified model is or is not correct. This approach is illustrated through simulation studies as well as data from the Multicenter AIDS Cohort Study.

Prediction models for disease risk and prognosis play an important role in biomedical research, and evaluating their predictive accuracy in the presence of censored data is of substantial interest. The standard concordance (*c*) statistic has been extended to provide a summary measure of predictive accuracy for survival models. Motivated by a prostate cancer study, we address several issues associated with evaluating survival prediction models based on the *c*-statistic, with a focus on estimators using the technique of inverse probability of censoring weighting (IPCW). Compared to existing work, we provide complete results on the asymptotic properties of the IPCW estimators under the assumption of coarsening at random (CAR), and propose a sensitivity analysis under noncoarsening at random (NCAR). In addition, we extend the IPCW approach as well as the sensitivity analysis to high-dimensional settings. The predictive accuracy of prediction models for cancer recurrence after prostatectomy is assessed by applying the proposed approaches. We find that the estimated predictive accuracy of the models under consideration is sensitive to the NCAR assumption, and we use this to identify the best predictive model. Finally, we further evaluate the performance of the proposed methods in both low-dimensional and high-dimensional settings, under CAR and NCAR, through simulations.
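A bare-bones sketch of an IPCW-weighted concordance estimator follows (one common weighting variant, not necessarily the exact estimator studied in the article; the censoring survival function G is assumed to be supplied, e.g., from a Kaplan–Meier fit to the censoring times):

```python
def ipcw_cstat(time, event, risk, cens_surv):
    """IPCW estimate of the concordance statistic with right-censored data.
    time, event, risk: per-subject follow-up time, event indicator, risk score.
    cens_surv(t): estimated censoring survival probability G(t) at time t."""
    n = len(time)
    num = den = 0.0
    for i in range(n):
        if not event[i]:
            continue                          # only observed events anchor pairs
        w = 1.0 / cens_surv(time[i]) ** 2     # inverse-probability weight
        for j in range(n):
            if time[j] > time[i]:             # comparable pair: j outlives i
                den += w
                num += w * (risk[i] > risk[j])
    return num / den
```

With no censoring (G ≡ 1) and a perfectly ranking score, the estimator returns 1; the weights correct the pair contributions for the probability of remaining uncensored.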

Nonparametric estimation of monotone regression functions is a classical problem of practical importance. Robust estimation of monotone regression functions with interval-censored data is a challenging and as yet unresolved problem. Herein, we propose a nonparametric estimation method based on the principle of isotonic regression. Using empirical process theory, we show that the proposed estimator is asymptotically consistent under a specific metric. We further conduct a simulation study to evaluate the performance of the estimator in finite-sample situations. As an illustration, we use the proposed method to estimate the mean body weight functions in a group of adolescents after they reach the pubertal growth spurt.
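Isotonic regression for the uncensored case reduces to the classical pool-adjacent-violators algorithm (PAVA); the sketch below illustrates that building block only, not the interval-censored extension proposed in the article:

```python
import numpy as np

def pava(y, w=None):
    """Pool Adjacent Violators Algorithm: weighted least-squares fit of a
    nondecreasing sequence to y."""
    y = np.asarray(y, float)
    w = np.ones_like(y) if w is None else np.asarray(w, float)
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); sizes.append(1)
        # merge the last two blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wtot
            wts[-2] = wtot
            sizes[-2] += sizes[-1]
            vals.pop(); wts.pop(); sizes.pop()
    return np.repeat(vals, sizes)
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into their average, returning `[1.0, 2.5, 2.5, 4.0]`.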

While there are many validated prognostic classifiers used in practice, their accuracy is often modest, and heterogeneity in clinical outcomes exists within one or more risk subgroups. Newly available markers, such as genomic mutations, may be used to improve the accuracy of an existing classifier by reclassifying patients from a heterogeneous group into a higher or lower risk category. The statistical tools typically applied to develop the initial classifiers are not easily adapted to this reclassification goal. In this article, we develop a new method designed to refine an existing prognostic classifier by incorporating new markers. The two-stage algorithm, called Boomerang, first searches for modifications of the existing classifier that increase the overall predictive accuracy and then merges the resulting groups into a prespecified number of risk groups. Resampling techniques are proposed to assess the improvement in predictive accuracy when an independent validation data set is not available. The performance of the algorithm is assessed under various simulation scenarios in which the marker frequency, degree of censoring, and total sample size are varied. The results suggest that the method selects few false positive markers and is able to improve the predictive accuracy of the classifier in many settings. Lastly, the method is illustrated on an acute myeloid leukemia data set, where a new refined classifier incorporating four new mutations into the existing three-category classifier is validated on an independent data set.

Li, Fine, and Brookhart (2015) presented an extension of the two-stage least squares (2SLS) method for additive hazards models which requires an assumption that the censoring distribution is unrelated to the endogenous exposure variable. We present another extension of 2SLS that can address this limitation.

Predicting binary events such as newborns with large birthweight is important for obstetricians in their attempt to reduce both maternal and fetal morbidity and mortality. Such predictions have been a challenge in obstetric practice, where longitudinal ultrasound measurements taken at multiple gestational times during pregnancy may be useful for predicting various poor pregnancy outcomes. The focus of this article is on developing a flexible class of joint models for the multivariate longitudinal ultrasound measurements that can be used for predicting a binary event at birth. A skewed multivariate random effects model is proposed for the ultrasound measurements, and the skewed generalized *t*-link is assumed for the link function relating the binary event and the underlying longitudinal processes. We consider a shared random effect to link the two processes together. Markov chain Monte Carlo sampling is used to carry out Bayesian posterior computation. Several variations of the proposed model are considered and compared via the deviance information criterion, the logarithm of pseudomarginal likelihood, and with a training-test set prediction paradigm. The proposed methodology is illustrated with data from the NICHD Successive Small-for-Gestational-Age Births study, a large prospective fetal growth cohort conducted in Norway and Sweden.

The twin method refers to the use of data from same-sex identical and fraternal twins to estimate the genetic and environmental contributions to a trait or outcome. The standard twin method is the variance component twin method that estimates heritability, the fraction of variance attributed to additive genetic inheritance. The latent class twin method estimates two quantities that are easier to interpret than heritability: the genetic prevalence, which is the fraction of persons in the genetic susceptibility latent class, and the heritability fraction, which is the fraction of persons in the genetic susceptibility latent class with the trait or outcome. We extend the latent class twin method in three important ways. First, we incorporate an additive genetic model to broaden the sensitivity analysis beyond the original autosomal dominant and recessive genetic models. Second, we specify a separate survival model to simplify computations and improve convergence. Third, we show how to easily adjust for covariates by extending the method of propensity scores from a treatment difference to zygosity. Applying the latent class twin method to data on breast cancer among Nordic twins, we estimated a genetic prevalence of 1%, a result with important implications for breast cancer prevention research.

Large assembled cohorts with banked biospecimens offer valuable opportunities to identify novel markers for risk prediction. When the outcome of interest is rare, an effective strategy to conserve limited biological resources while maintaining reasonable statistical power is the case-cohort (CCH) sampling design, in which expensive markers are measured on a subset of cases and controls. However, the CCH design introduces significant analytical complexity due to outcome-dependent, finite-population sampling. Current methods for analyzing CCH studies focus primarily on the estimation of simple survival models with linear effects; testing and estimation procedures that can efficiently capture complex nonlinear marker effects for CCH data remain elusive. In this article, we propose inverse probability weighted (IPW) variance-component-type tests for identifying important marker sets through a Cox proportional hazards kernel machine regression framework previously considered for full-cohort studies (Cai et al., 2011). The optimal choice of kernel, while vitally important to attain high power, is typically unknown for a given dataset. Thus, we also develop robust testing procedures that adaptively combine information from multiple kernels. The proposed IPW test statistics have complex null distributions that cannot easily be approximated explicitly. Furthermore, due to the correlation induced by CCH sampling, standard resampling methods such as the bootstrap fail to approximate the distribution correctly. We therefore propose a novel perturbation resampling scheme that can effectively recover the induced correlation structure. Results from extensive simulation studies suggest that the proposed IPW testing procedures work well in finite samples. The proposed methods are further illustrated by application to a Danish CCH study of the association between Apolipoprotein C-III markers and the risk of coronary heart disease.

To evaluate a new therapy versus a control via a randomized comparative clinical study or a series of trials, a pre-specified *predictive* enrichment procedure may be implemented, owing to the heterogeneity of the study patient population, to identify an "enrichable" subpopulation. For patients in this subpopulation, the therapy is expected to have a desirable overall risk-benefit profile. To develop and validate such a "therapy-diagnostic co-development" strategy, a three-step procedure may be conducted with three independent data sets from a series of similar studies or a single trial. In the first step, we create various candidate scoring systems based on the baseline information of the patients via, for example, parametric models fit to the first data set. Each individual score reflects an anticipated average treatment difference for future patients who share similar baseline profiles; a large score indicates that these patients tend to benefit from the new therapy. In the second step, a potentially promising, enrichable subgroup is identified using the totality of evidence from these scoring systems. In the final step, we validate the selection via two-sample inference procedures, assessing the treatment effect statistically and clinically with the third data set, the so-called holdout sample. When the study size is not large, one may combine the first two steps using a "cross-training-evaluation" process. Comprehensive numerical studies are conducted to investigate the operating characteristics of the proposed method. The entire enrichment procedure is illustrated with data from a cardiovascular trial evaluating a beta-blocker versus a placebo for treating chronic heart failure.

We show how a spatial point process in which each point has an associated random quantitative mark can be identified with a spatio-temporal point process specified by a conditional intensity function. For instance, the points can be tree locations, the marks can express the sizes of trees, and the conditional intensity function can describe the distribution of a tree (i.e., its location and size) conditional on the larger trees. This enables us to construct parametric statistical models that are easily interpretable and for which maximum-likelihood-based inference is tractable.

Identifying factors associated with increased medical cost is important for many stakeholders, from insurers and the insured to public health agencies and the national economy. However, assembling comprehensive national databases that include both cost and individual-level predictors can prove challenging. Alternatively, one can use data from smaller studies, with the understanding that conclusions drawn from such analyses may be limited to the participant population. At the same time, smaller clinical studies have limited follow-up, and lifetime medical cost may not be fully observed for all study participants. In this context, we develop new model selection methods and inference procedures for secondary analyses of clinical trial data when lifetime medical cost is subject to induced censoring. Our model selection methods extend the theory of penalized estimating functions to a calibration regression estimator tailored for this data type. Next, we develop a novel inference procedure for the unpenalized regression estimator using perturbation and resampling theory. We then extend this resampling plan to accommodate regularized coefficient estimation of censored lifetime medical cost and develop postselection inference procedures for the final model. Our methods are motivated by data from Southwest Oncology Group Protocol 9509, a clinical trial of patients with advanced non-small cell lung cancer, and our models of lifetime medical cost are specific to this population. The methods presented in this article, however, are built on rather general techniques and could be applied to larger databases as those data become available.

In many scientific fields, it is common practice to collect a sequence of 0-1 binary responses from a subject across time, space, or a collection of covariates. Researchers are interested in how the expected binary outcome is related to covariates, and aim at better prediction of future 0-1 outcomes. Gaussian processes have been widely used to model nonlinear systems; in particular, to model the latent structure in a binary regression model, allowing a nonlinear functional relationship between the covariates and the expectation of the binary outcomes. A critical issue in modeling binary response data is the appropriate choice of link function. Commonly adopted link functions, such as the probit or logit links, have fixed skewness and lack the flexibility to let the data determine the degree of skewness. To address this limitation, we propose a flexible binary regression model that combines a generalized extreme value link function with a Gaussian process prior on the latent structure. Bayesian computation is employed for model estimation. Consistency of the resulting posterior distribution is demonstrated. The flexibility and gains of the proposed model are illustrated through detailed simulation studies and two real data examples. Empirical results show that the proposed model outperforms a set of alternative models that have either only a Gaussian process prior on the latent regression function or only a Dirichlet prior on the link function.
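As a sketch of the link flexibility, a generalized extreme value (GEV) response function can be written as follows (the parameterization below is one common form of the GEV distribution function, and the sign convention is an assumption; the shape parameter ξ controls the skewness that probit and logit links keep fixed):

```python
import numpy as np

def gev_link(eta, xi):
    """GEV response function: success probability as the GEV cdf at eta.
    xi -> 0 recovers the Gumbel (asymmetric, log-log type) limit."""
    if abs(xi) < 1e-8:
        return np.exp(-np.exp(-eta))            # Gumbel limiting case
    z = np.maximum(1.0 + xi * eta, 1e-12)       # support constraint
    return np.exp(-z ** (-1.0 / xi))
```

Varying ξ shifts probability mass asymmetrically around the linear predictor, so the data, rather than the analyst, can determine the degree of skewness of the response curve.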

Next-generation sequencing technologies now constitute a method of choice to measure gene expression. The data to analyze are read counts, commonly modeled using negative binomial distributions. A relevant issue in this probabilistic framework is the reliable estimation of the overdispersion parameter, made harder by the limited number of replicates generally available for each gene. Many strategies have been proposed to estimate this parameter, but when differential analysis is the purpose, they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead, we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show through a wide simulation study that the proposed method improves the sensitivity of detecting differentially expressed genes relative to common procedures, as it attains the nominal type-I error level while maintaining high power to discriminate between differentially and nondifferentially expressed genes. The method is illustrated on prostate cancer RNA-Seq data.

Functional principal component analysis (FPCA) is a popular approach to explore major sources of variation in a sample of random curves. These major sources of variation are represented by functional principal components (FPCs). The intervals where the values of FPCs are significant are interpreted as where sample curves have major variations. However, these intervals are often hard for naïve users to identify, because of the vague definition of “significant values”. In this article, we develop a novel penalty-based method to derive FPCs that are only nonzero precisely in the intervals where the values of FPCs are significant, whence the derived FPCs possess better interpretability than the FPCs derived from existing methods. To compute the proposed FPCs, we devise an efficient algorithm based on projection deflation techniques. We show that the proposed interpretable FPCs are strongly consistent and asymptotically normal under mild conditions. Simulation studies confirm that with a competitive performance in explaining variations of sample curves, the proposed FPCs are more interpretable than the traditional counterparts. This advantage is demonstrated by analyzing two real datasets, namely, electroencephalography data and Canadian weather data.

Ignorance of the mechanisms responsible for the availability of information presents an unusual problem for analysts. It is often the case that the availability of information depends on the outcome. In the analysis of clustered data, we say that *informative cluster size* (ICS) is present when the inference drawn from analysis of hypothetical balanced data differs from the inference drawn from the observed data. Much work has been done to address the analysis of clustered data with informative cluster size; examples include Inverse Probability Weighting (IPW), Cluster Weighted Generalized Estimating Equations (CWGEE), and Doubly Weighted Generalized Estimating Equations (DWGEE). When cluster size changes with time, i.e., the data set possesses temporally varying cluster sizes (TVCS), these methods may produce biased inference for the underlying marginal distribution of interest. We propose a new marginalization that may be appropriate for addressing clustered longitudinal data with TVCS. The principal motivation for our present work is to analyze the periodontal data collected by Beck et al. (1997, Journal of Periodontal Research 6, 497–505). Longitudinal periodontal data often exhibit both ICS and TVCS, as the number of teeth possessed by participants at the onset of the study is not constant, and teeth as well as individuals may be displaced throughout the study.

It is agreed among biostatisticians that prediction models for binary outcomes should satisfy two essential criteria: first, a prediction model should have a high discriminatory power, implying that it is able to clearly separate cases from controls. Second, the model should be well calibrated, meaning that the predicted risks should closely agree with the relative frequencies observed in the data. The focus of this work is on the predictiveness curve, which has been proposed by Huang et al. (Biometrics 63, 2007) as a graphical tool to assess the aforementioned criteria. By conducting a detailed analysis of its properties, we review the role of the predictiveness curve in the performance assessment of biomedical prediction models. In particular, we demonstrate that marker comparisons should not be based solely on the predictiveness curve, as it is not possible to consistently visualize the added predictive value of a new marker by comparing the predictiveness curves obtained from competing models. Based on our analysis, we propose the “residual-based predictiveness curve” (RBP curve), which addresses the aforementioned issue and which extends the original method to settings where the evaluation of a prediction model on independent test data is of particular interest. Similar to the predictiveness curve, the RBP curve reflects both the calibration and the discriminatory power of a prediction model. In addition, the curve can be conveniently used to conduct valid performance checks and marker comparisons.

The discriminatory ability of a marker for censored survival data is routinely assessed by the time-dependent ROC curve and the *c*-index. The time-dependent ROC curve evaluates the ability of a biomarker to predict whether a patient lives past a particular time *t*. The *c*-index measures the global concordance of the marker and the survival time regardless of the time point. We propose a Bayesian semiparametric approach to estimate these two measures. The proposed estimators are based on the conditional distribution of the survival time given the biomarker and the empirical biomarker distribution. The conditional distribution is estimated by a linear-dependent Dirichlet process mixture model. The resulting ROC curve is smooth as it is estimated by a mixture of parametric functions. The proposed *c*-index estimator is shown to be more efficient than the commonly used Harrell's *c*-index since it uses all pairs of data rather than only informative pairs. The proposed estimators are evaluated through simulations and illustrated using a lung cancer dataset.
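
As background for the efficiency claim above, Harrell's c-index can be sketched as follows (an illustrative implementation, not the Bayesian estimator proposed in the paper): under right censoring it averages concordance only over "informative" pairs, i.e., pairs whose earlier time is an observed event, which is the source of the efficiency loss the proposed estimator avoids.

```python
def harrell_c(times, events, risk):
    """Harrell's c-index for right-censored data.

    A pair is usable (informative) only if its smaller time is an observed
    event; concordance means the higher risk score has the shorter time.
    """
    conc, ties, usable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue                      # tied times: skip for simplicity
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if events[a] == 0:
                continue                      # earlier time censored: pair unusable
            usable += 1
            if risk[a] > risk[b]:
                conc += 1
            elif risk[a] == risk[b]:
                ties += 1
    return (conc + 0.5 * ties) / usable
```

Censored observations with the smaller time in a pair are discarded entirely, whereas a model-based estimator of the conditional survival distribution can use all pairs.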

Causal mediation modeling has become a popular approach for studying the effect of an exposure on an outcome through mediators. Currently, the literature on mediation analyses with survival outcomes has largely focused on settings with a single mediator, quantifying the mediation effects on the hazard, log hazard, and log survival time (Lange and Hansen, 2011; VanderWeele, 2011). In this article, we propose a multi-mediator model for survival data by employing a flexible semiparametric probit model. We characterize path-specific effects (PSEs) of the exposure on the outcome mediated through specific mediators. We derive closed-form expressions for PSEs on a transformed survival time and the survival probabilities. Statistical inference on the PSEs is developed using a nonparametric maximum likelihood estimator under the semiparametric probit model and the functional Delta method. Results from simulation studies suggest that our proposed methods perform well in finite samples. We illustrate the utility of our method in a genomic study of glioblastoma multiforme survival.

In this article, we develop a piecewise Poisson regression method to analyze survival data from complex sample surveys involving clustering, differential selection probabilities, and longitudinal responses, to conveniently draw inference on absolute risks in time intervals that are prespecified by investigators. Extensive simulations evaluate the developed methods, with extensions to multiple covariates, under various complex sample designs, including stratified sampling, sampling with selection probability proportional to a measure of size (PPS), and multi-stage cluster sampling. We applied our methods to a study of mortality in men diagnosed with prostate cancer in the Prostate, Lung, Colorectal, and Ovarian (PLCO) cancer screening trial to investigate whether a biomarker available from biospecimens collected near the time of diagnosis stratifies subsequent risk of death. Poisson regression coefficients and absolute risks of mortality (and the corresponding 95% confidence intervals) for prespecified age intervals by biomarker levels are estimated. We conclude with a brief discussion of the motivation, methods, and findings of the study.
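
The link between piecewise Poisson regression and absolute risk can be made explicit: with a hazard that is constant within each prespecified interval, the absolute risk by time t is 1 − exp(−H(t)), where H(t) accumulates hazard-times-exposure over the intervals. A minimal sketch (interval layout and names are illustrative, not the paper's survey-weighted estimator):

```python
import math

def absolute_risk(hazards, cuts, t):
    """Absolute risk 1 - exp(-H(t)) under a piecewise-constant hazard.

    hazards[j] applies on [cuts[j], cuts[j+1]); the last interval is open-ended.
    cuts[0] must be 0.
    """
    H = 0.0
    for j, lam in enumerate(hazards):
        lo = cuts[j]
        hi = cuts[j + 1] if j + 1 < len(cuts) else float("inf")
        if t <= lo:
            break
        H += lam * (min(t, hi) - lo)   # exposure time in this interval
    return 1.0 - math.exp(-H)
```

For example, a hazard of 0.1 per year on [0, 5) and 0.2 thereafter gives a 7-year risk of 1 − exp(−0.9).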

We consider multi-state capture–recapture–recovery data where observed individuals are recorded in a set of possible discrete states. Traditionally, the Arnason–Schwarz model has been fitted to such data where the state process is modeled as a first-order Markov chain, though second-order models have also been proposed and fitted to data. However, low-order Markov models may not accurately represent the underlying biology. For example, specifying a (time-independent) first-order Markov process involves the assumption that the dwell time in each state (i.e., the duration of a stay in a given state) has a geometric distribution, and hence that the modal dwell time is one. Specifying time-dependent or higher-order processes provides additional flexibility, but at the expense of a potentially significant number of additional model parameters. We extend the Arnason–Schwarz model by specifying a semi-Markov model for the state process, where the dwell-time distribution is specified more generally, using, for example, a shifted Poisson or negative binomial distribution. A state expansion technique is applied in order to represent the resulting semi-Markov Arnason–Schwarz model in terms of a simpler and computationally tractable hidden Markov model. Semi-Markov Arnason–Schwarz models come with only a very modest increase in the number of parameters, yet permit a significantly more flexible state process. Model selection can be performed using standard procedures, and in particular via the use of information criteria. The semi-Markov approach allows for important biological inference to be drawn on the underlying state process, for example, on the times spent in the different states. The feasibility of the approach is demonstrated in a simulation study, before being applied to real data corresponding to house finches where the states correspond to the presence or absence of conjunctivitis.
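
The state expansion technique mentioned above rests on the discrete dwell-time hazard h(r) = d(r)/(1 − F(r − 1)): each sub-state r of the expanded chain exits the state with probability h(r), so the expanded Markov chain reproduces the target dwell-time distribution exactly over the expanded range. A small sketch of this identity (illustrative only; the full Arnason–Schwarz construction also involves the embedded between-state transition probabilities):

```python
def dwell_hazard(pmf):
    """Discrete hazard h(r) = d(r) / (1 - F(r-1)) for a dwell-time pmf d."""
    haz, surv = [], 1.0
    for p in pmf:
        haz.append(p / surv if surv > 1e-12 else 1.0)
        surv -= p
    return haz

def pmf_from_hazard(haz):
    """Reconstruct the pmf: d(r) = h(r) * prod_{k < r} (1 - h(k))."""
    pmf, surv = [], 1.0
    for h in haz:
        pmf.append(h * surv)
        surv *= 1.0 - h
    return pmf
```

A geometric dwell time has constant hazard, which is why a first-order Markov chain (one sub-state) can only represent geometric dwell times; a shifted Poisson or negative binomial pmf yields a non-constant hazard sequence, hence the need for several sub-states.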

In this article we present a new method for performing Bayesian parameter inference and model choice for low-count time series models with intractable likelihoods. The method involves incorporating an alive particle filter within a sequential Monte Carlo (SMC) algorithm to create a novel exact-approximate algorithm, which we refer to as alive SMC. The advantages of this approach over competing methods are that it is naturally adaptive, it does not involve the between-model proposals required in reversible jump Markov chain Monte Carlo, and it does not rely on potentially rough approximations. The algorithm is demonstrated on Markov process and integer autoregressive moving average models applied to real biological datasets of hospital-acquired pathogen incidence, animal health time series, and the cumulative number of prion disease cases in mule deer.

The various thresholding quantities grouped under the “Basic Reproductive Number” umbrella are often confused, but represent distinct approaches to estimating epidemic spread potential, and address different modeling needs. Here, we contrast several common reproduction measures applied to stochastic compartmental models, and introduce a new quantity dubbed the “empirically adjusted reproductive number” with several advantages. These include: more complete use of the underlying compartmental dynamics than common alternatives, use as a potential diagnostic tool to detect the presence and causes of intensity process underfitting, and the ability to provide timely feedback on disease spread. Conceptual connections between traditional reproduction measures and our approach are explored, and the behavior of our method is examined under simulation. Two illustrative examples are developed: First, the single location applications of our method are established using data from the 1995 Ebola outbreak in the Democratic Republic of the Congo and a traditional stochastic SEIR model. Second, a spatial formulation of this technique is explored in the context of the ongoing Ebola outbreak in West Africa with particular emphasis on potential use in model selection, diagnosis, and the resulting applications to estimation and prediction. Both analyses are placed in the context of a newly developed spatial analogue of the traditional SEIR modeling approach.

Often the object of inference in biomedical applications is a range that brackets a given fraction of individual observations in a population. A classical estimate of this range for univariate measurements is a “tolerance interval.” This article develops its natural extension for functional measurements, a “tolerance band,” and proposes a methodology for constructing its pointwise and simultaneous versions that incorporates both sparse and dense functional data. Assuming that the measurements are observed with noise, the methodology uses functional principal component analysis in a mixed model framework to represent the measurements and employs bootstrapping to approximate the tolerance factors needed for the bands. The proposed bands also account for uncertainty in the principal components decomposition. Simulations show that the methodology has, generally, acceptable performance unless the data are quite sparse and unbalanced, in which case the bands may be somewhat liberal. The methodology is illustrated using two real datasets, a sparse dataset involving CD4 cell counts and a dense dataset involving core body temperatures.

Community water fluoridation is an important public health measure to prevent dental caries, but it continues to be somewhat controversial. The Iowa Fluoride Study (IFS) is a longitudinal study on a cohort of Iowa children that began in 1991. The main purposes of this study (http://www.dentistry.uiowa.edu/preventive-fluoride-study) were to quantify fluoride exposures from both dietary and nondietary sources and to associate longitudinal fluoride exposures with dental fluorosis (spots on teeth) and dental caries (cavities). We analyze a subset of the IFS data by a marginal regression model with a zero-inflated Conway–Maxwell–Poisson (ZICMP) distribution for count data exhibiting excessive zeros and a wide range of dispersion patterns. We introduce two estimation methods for fitting a ZICMP marginal regression model. Finite sample behaviors of the estimators and the resulting confidence intervals are studied using extensive simulation studies. We apply our methodologies to the dental caries data. Our novel modeling incorporating zero inflation, clustering, and overdispersion sheds some new light on the effect of community water fluoridation and other factors. We also include a second application of our methodology to a genomic (next-generation sequencing) dataset that exhibits underdispersion.
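
For reference, the ZICMP probability mass function combines an extra point mass at zero with the CMP pmf P(Y = y) ∝ λ^y / (y!)^ν, whose intractable normalizing constant is typically truncated in practice; ν < 1 gives overdispersion and ν > 1 underdispersion relative to Poisson (ν = 1). A minimal numerical sketch (truncation level and names are our choices, not the paper's fitting procedure):

```python
import math

def cmp_pmf(y, lam, nu, max_terms=150):
    """Conway-Maxwell-Poisson pmf; normalizing constant truncated at
    max_terms and computed in log space for numerical stability."""
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(max_terms)]
    m = max(logs)
    logz = m + math.log(sum(math.exp(l - m) for l in logs))
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1) - logz)

def zicmp_pmf(y, pi0, lam, nu):
    """Zero-inflated CMP: extra point mass pi0 at zero."""
    base = cmp_pmf(y, lam, nu)
    return pi0 + (1.0 - pi0) * base if y == 0 else (1.0 - pi0) * base
```

With ν = 1 the CMP pmf reduces exactly to the Poisson pmf, which is a convenient sanity check on any implementation.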

Clinical biomarkers play an important role in precision medicine and are now extensively used in clinical trials, particularly in cancer. A response adaptive trial design enables researchers to use treatment results from earlier patients to aid in treatment decisions for later patients. Optimal adaptive trial designs have been developed without consideration of biomarkers. In this article, we describe the mathematical steps for computing optimal biomarker-integrated adaptive trial designs. These designs maximize the expected trial utility given any pre-specified utility function, though we focus here on maximizing patient responses within a given patient horizon. We describe the performance of the optimal design in different scenarios. We compare it to Bayesian Adaptive Randomization (BAR), which is emerging as a practical approach to develop adaptive trials. The difference in expected utility between BAR and optimal designs is smallest when the biomarker subgroups are highly imbalanced. We also compare BAR, a frequentist play-the-winner rule with integrated biomarkers, and a marker-stratified balanced randomization design (BR). We show that, in contrasting two treatments, BR achieves a nearly optimal expected utility when the patient horizon is relatively large. Our work provides a novel theoretical solution, as well as an absolute benchmark, for the evaluation of trial designs in personalized medicine.

We consider quantile regression for partially linear models where an outcome of interest is related to covariates and a marker set (e.g., gene or pathway). The covariate effects are modeled parametrically and the marker set effect of multiple loci is modeled using a kernel machine. We propose an efficient algorithm to solve the corresponding optimization problem for estimating the effects of covariates, and also introduce a powerful test for detecting the overall effect of the marker set. Our test is motivated by the traditional score test and borrows the idea of the permutation test. Our estimation and testing procedures are evaluated numerically and applied to assess genetic association of change in fasting homocysteine level using the Vitamin Intervention for Stroke Prevention Trial data.

Infection is one of the most common complications after hematopoietic cell transplantation. Many patients experience infectious complications repeatedly after transplant. Existing statistical methods for recurrent gap time data typically assume that patients are enrolled due to the occurrence of an event of interest, and subsequently experience recurrent events of the same type; moreover, for one-sample estimation, the gap times between consecutive events are usually assumed to be identically distributed. Applying these methods to analyze the post-transplant infection data will inevitably lead to incorrect inferential results because the time from transplant to the first infection has a different biological meaning than the gap times between consecutive recurrent infections. Some unbiased yet inefficient methods include univariate survival analysis methods based on data from the first infection or bivariate serial event data methods based on the first and second infections. In this article, we propose a nonparametric estimator of the joint distribution of time from transplant to the first infection and the gap times between consecutive infections. The proposed estimator takes into account the potentially different distributions of the two types of gap times and better uses the recurrent infection data. Asymptotic properties of the proposed estimators are established.

Matched case-control studies are popular designs used in epidemiology for assessing the effects of exposures on binary traits. Modern studies increasingly enjoy the ability to examine a large number of exposures in a comprehensive manner. However, several risk factors often tend to be related in a nontrivial way, undermining efforts to identify the risk factors using standard analytic methods due to inflated type-I errors and possible masking of effects. Epidemiologists often use data reduction techniques by grouping the prognostic factors using a thematic approach, with themes deriving from biological considerations. We propose shrinkage-type estimators based on Bayesian penalization methods to estimate the effects of the risk factors using these themes. The properties of the estimators are examined using extensive simulations. The methodology is illustrated using data from a matched case-control study of polychlorinated biphenyls in relation to the etiology of non-Hodgkin's lymphoma.

The Wilcoxon rank-sum test is a popular nonparametric test for comparing two independent populations (groups). In recent years, there have been renewed attempts in extending the Wilcoxon rank-sum test for clustered data, one of which (Datta and Satten, 2005, *Journal of the American Statistical Association* **100**, 908–915) addresses the issue of informative cluster size, i.e., when the outcomes and the cluster size are correlated. We are faced with a situation where the group specific marginal distribution in a cluster depends on the number of observations in that group (i.e., the intra-cluster group size). We develop a novel extension of the rank-sum test for handling this situation. We compare the performance of our test with the Datta–Satten test, as well as the naive Wilcoxon rank-sum test. Using a naturally occurring simulation model of informative intra-cluster group size, we show that only our test maintains the correct size. We also compare our test with a classical signed rank test based on averages of the outcome values in each group paired by the cluster membership. While this test maintains the size, it has lower power than our test. Extensions to multiple group comparisons and the case of clusters not having samples from all groups are also discussed. We apply our test to determine whether there are differences in the attachment loss between the upper and lower teeth and between mesial and buccal sites of periodontal patients.

Efforts to personalize medicine in oncology have been limited by reductive characterizations of the intrinsically complex underlying biological phenomena. Future advances in personalized medicine will rely on molecular signatures that derive from synthesis of multifarious interdependent molecular quantities requiring robust quantitative methods. However, highly parameterized statistical models when applied in these settings often require a prohibitively large database and are sensitive to proper characterizations of the treatment-by-covariate interactions, which in practice are difficult to specify and may be limited by generalized linear models. In this article, we present a Bayesian predictive framework that enables the integration of a high-dimensional set of genomic features with clinical responses and treatment histories of historical patients, providing a probabilistic basis for using the clinical and molecular information to personalize therapy for future patients. Our work represents one of the first attempts to define personalized treatment assignment rules based on large-scale genomic data. We use actual gene expression data acquired from The Cancer Genome Atlas in the settings of leukemia and glioma to explore the statistical properties of our proposed Bayesian approach for personalizing treatment selection. The method is shown to yield considerable improvements in predictive accuracy when compared to penalized regression approaches.

We present a general method for estimating the effect of a treatment on an ordinal outcome in randomized trials. The method is robust in that it does not rely on the proportional odds assumption. Our estimator leverages information in prognostic baseline variables, and has all of the following properties: (i) it is consistent; (ii) it is locally efficient; (iii) it is guaranteed to have equal or better asymptotic precision than both the inverse probability-weighted and the unadjusted estimators. To the best of our knowledge, this is the first estimator of the causal relation between a treatment and an ordinal outcome to satisfy these properties. We demonstrate the estimator in simulations based on resampling from a completed randomized clinical trial of a new treatment for stroke; we show potential gains of up to 39% in relative efficiency compared to the unadjusted estimator. The proposed estimator could be a useful tool for analyzing randomized trials with ordinal outcomes, since existing methods either rely on model assumptions that are untenable in many practical applications, or lack the efficiency properties of the proposed estimator. We provide R code implementing the estimator.

Observational studies are often subject to unmeasured confounding, and instrumental variable analysis is a method for controlling for it. As yet, theory on instrumental variable analysis of censored time-to-event data is scarce. We propose a pseudo-observation approach to instrumental variable analysis of the survival function, the restricted mean, and the cumulative incidence function in competing risks with right-censored data, using generalized method of moments estimation. For the purpose of illustrating our proposed method, we study antidepressant exposure in pregnancy and risk of autism spectrum disorder in offspring, and the performance of the method is assessed through simulation studies.
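
The pseudo-observation device referenced above replaces each (possibly censored) outcome with a jackknife-type quantity θ_i = n·Ŝ(t) − (n − 1)·Ŝ⁽⁻ⁱ⁾(t), where Ŝ is the Kaplan–Meier estimator and Ŝ⁽⁻ⁱ⁾ its leave-one-out version; these θ_i can then enter standard estimating equations. A minimal sketch (an illustration of the generic construction, not the paper's GMM estimator):

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) from right-censored data
    (events[i] = 1 for an observed event, 0 for censoring)."""
    data = sorted(zip(times, events))
    r, s, i = len(data), 1.0, 0
    while i < len(data) and data[i][0] <= t:
        ti, d, j = data[i][0], 0, i
        while j < len(data) and data[j][0] == ti:
            d += data[j][1]          # events at this time
            j += 1
        s *= 1.0 - d / r
        r -= j - i                   # all subjects at ti leave the risk set
        i = j
    return s

def pseudo_observations(times, events, t):
    """Jackknife pseudo-observations for S(t)."""
    n = len(times)
    s_full = km_survival(times, events, t)
    return [n * s_full
            - (n - 1) * km_survival(times[:i] + times[i + 1:],
                                    events[:i] + events[i + 1:], t)
            for i in range(n)]
```

With fully uncensored data the pseudo-observations reduce to the indicators I(T_i > t), which is a useful check.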

In many practical cases of multiple hypothesis problems, it can be expected that the alternatives are not symmetrically distributed. If it is known a priori that the distributions of the alternatives are skewed, we show that this information yields higher-powered procedures than those based on symmetric alternatives when testing multiple hypotheses. We propose a Bayesian decision theoretic rule for multiple directional hypothesis testing, when the alternatives follow skewed distributions, under a constraint on a mixed directional false discovery rate. We compare the proposed rule with the frequentist rule of Benjamini and Yekutieli (2005) using simulations. We apply our method to a well-studied HIV dataset.

Times between successive events (i.e., gap times) are of great importance in survival analysis. Although many methods exist for estimating covariate effects on gap times, very few existing methods allow for comparisons between gap times themselves. Motivated by the comparison of primary and repeat transplantation, our interest is specifically in contrasting the gap time survival functions and their integration (restricted mean gap time). Two major challenges in gap time analysis are non-identifiability of the marginal distributions and the existence of dependent censoring (for all but the first gap time). We use Cox regression to estimate the (conditional) survival distributions of each gap time (given the previous gap times). Combining fitted survival functions based on those models, along with multiple imputation applied to censored gap times, we then contrast the first and second gap times with respect to average survival and restricted mean lifetime. Large-sample properties are derived, with simulation studies carried out to evaluate finite-sample performance. We apply the proposed methods to kidney transplant data obtained from a national organ transplant registry. Mean 10-year graft survival of the primary transplant is significantly greater than that of the repeat transplant, by 3.9 months, a result that may lack clinical importance.
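
The restricted mean used above is the area under the survival curve up to a horizon τ, RMST(τ) = ∫₀^τ S(t) dt; for a step survival curve (e.g., from a fitted Cox or Kaplan–Meier curve) this is a finite sum of rectangle areas. A minimal sketch, assuming a right-continuous step curve starting at S(0) = 1:

```python
def rmst(jump_times, surv_after, tau):
    """Restricted mean survival time: area under a step survival curve
    up to tau. surv_after[k] is S(t) for t in [jump_times[k], jump_times[k+1])."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(jump_times, surv_after):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to this jump
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)     # final rectangle up to tau
    return area
```

Differences in RMST between two fitted curves (here, first versus second gap time) are in the same time units as the data, which is what yields the "3.9 months" contrast reported above.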

We propose a novel Bayesian hierarchical model for brain imaging data that unifies voxel-level (the most localized unit of measure) and region-level brain connectivity analyses, and yields population-level inferences. Functional connectivity generally refers to associations in brain activity between distinct locations. The first level of our model summarizes brain connectivity for cross-region voxel pairs using a two-component mixture model consisting of connected and nonconnected voxels. We use the proportion of connected voxel pairs to define a new measure of connectivity strength, which reflects the breadth of between-region connectivity. Furthermore, we evaluate the impact of clinical covariates on connectivity between region-pairs at a population level. We perform parameter estimation using Markov chain Monte Carlo (MCMC) techniques, which can be executed quickly relative to the number of model parameters. We apply our method to resting-state functional magnetic resonance imaging (fMRI) data from 32 subjects with major depression and simulated data to demonstrate the properties of our method.

With the internet, a massive amount of information on species abundance can be collected by citizen science programs. However, these data are often difficult to use directly in statistical inference, as their collection is generally opportunistic, and the distribution of the sampling effort is often not known. In this article, we develop a general statistical framework to combine such “opportunistic data” with data collected using schemes characterized by a known sampling effort. Under some structural assumptions regarding the sampling effort and detectability, our approach makes it possible to estimate the relative abundance of several species in different sites. It can be implemented through a simple generalized linear model. We illustrate the framework with typical bird datasets from the Aquitaine region in south-western France. We show that, under some assumptions, our approach provides estimates that are more precise than the ones obtained from the dataset with a known sampling effort alone. When the opportunistic data are abundant, the gain in precision may be considerable, especially for rare species. We also show that estimates can be obtained even for species recorded only in the opportunistic scheme. Opportunistic data combined with a relatively small amount of data collected with a known effort may thus provide access to accurate and precise estimates of quantitative changes in relative abundance over space and/or time.

In genome-wide gene–environment interaction (GxE) studies, a common strategy to improve power is to first conduct a filtering test and retain only the SNPs that pass the filtering in the subsequent GxE analyses. Inspired by two-stage tests and gene-based tests in GxE analysis, we consider the general problem of jointly testing a set of parameters when only a few are truly from the alternative hypothesis and when filtering information is available. We propose a unified set-based test that simultaneously considers filtering on individual parameters and testing on the set. We derive the exact distribution and approximate the power function of the proposed unified statistic in simplified settings, and use them to adaptively calculate the optimal filtering threshold for each set. In the context of gene-based GxE analysis, we show that although the empirical power function may be affected by many factors, the optimal filtering threshold corresponding to the peak of the power curve primarily depends on the size of the gene. We further propose a resampling algorithm to calculate *P*-values for each gene given the estimated optimal filtering threshold. The performance of the method is evaluated in simulation studies and illustrated via a genome-wide gene–gender interaction analysis using pancreatic cancer genome-wide association data.

Large-scale homogeneous discrete *p*-values are encountered frequently in high-throughput genomics studies, and the related multiple testing problems become challenging because most existing methods for the false discovery rate (FDR) assume continuous *p*-values. In this article, we study the estimation of the null proportion and FDR for discrete *p*-values with common support. In the finite sample setting, we propose a novel class of conservative FDR estimators. Furthermore, we show that a broad class of FDR estimators is simultaneously conservative over all support points under some weak dependence condition in the asymptotic setting. We further demonstrate the significant improvement of a newly proposed method over existing methods through simulation studies and a case study.
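
For orientation, the standard continuous-p-value machinery the paper generalizes can be sketched as a Storey-type plug-in: estimate the null proportion π₀ from the right tail of the p-value distribution, then estimate FDR at a threshold t as π₀·m·t divided by the number of rejections. For discrete p-values, λ and t would be restricted to the common support, and the paper's estimators are constructed to remain conservative there; the sketch below is the continuous baseline only, with tuning values chosen for illustration.

```python
def storey_pi0(pvals, lam):
    """Storey-type null proportion estimate: #{p > lam} / ((1 - lam) * m)."""
    m = len(pvals)
    return min(1.0, sum(1 for p in pvals if p > lam) / ((1.0 - lam) * m))

def fdr_estimate(pvals, t, lam=0.5):
    """Plug-in FDR estimate at threshold t: pi0 * m * t / #{p <= t}."""
    m = len(pvals)
    r = max(sum(1 for p in pvals if p <= t), 1)
    return min(1.0, storey_pi0(pvals, lam) * m * t / r)
```

Under discreteness, the uniform null assumption behind the numerator π₀·m·t fails between support points, which is precisely the gap the paper's conservative estimators address.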

Subject-specific and marginal models have been developed for the analysis of longitudinal ordinal data. Subject-specific models often lack a population-average interpretation of the model parameters due to the conditional formulation of random intercepts and slopes. Marginal models frequently lack an underlying distribution for ordinal data, in particular when generalized estimating equations (GEE) are applied. To overcome these issues, latent variable models underneath the ordinal outcomes with a multivariate logistic distribution can be applied. In this article, we extend the work of O'Brien and Dunson (2004), who studied the multivariate *t*-distribution with marginal logistic distributions. We use maximum likelihood instead of a Bayesian approach, and incorporate covariates in the correlation structure, in addition to the mean model. We compare our method with GEE and demonstrate that it performs better than GEE with respect to fixed-effect parameter estimation when the latent variables have an approximately elliptical distribution, and at least as well as GEE for other types of latent variable distributions.

We develop alternative strategies for building and fitting parametric capture–recapture models for closed populations, which can be used to gain a better understanding of behavioral patterns. From the perspective of transition models, we first rely on a conditional probability parameterization. A large subset of standard capture–recapture models can be regarded as a suitable partitioning in equivalence classes of the full set of conditional probability parameters. We exploit a regression approach combined with the use of new suitable summaries of the conditioning binary partial capture histories as a device for enlarging the scope of behavioral models and also exploring the range of all possible partitions. We show how one can easily find unconditional maximum likelihood estimates of such models within a generalized linear model framework. We illustrate the potential of our approach with the analysis of some known datasets and a simulation study.

Motivated by a longitudinal oral health study, we propose a flexible modeling approach for clustered time-to-event data, when the response of interest can only be determined to lie in an interval obtained from a sequence of examination times (interval-censored data) and, on top of that, the determination of the occurrence of the event is subject to misclassification. The clustered time-to-event data are modeled using an accelerated failure time model with random effects and by assuming a penalized Gaussian mixture model for the random effects terms to avoid restrictive distributional assumptions concerning the event times. A general misclassification model is discussed in detail, considering the possibility that different examiners were involved in the assessment of the occurrence of the events for a given subject across time. A Bayesian implementation of the proposed model is described in detail. We additionally provide empirical evidence showing that the model can be used to estimate the underlying time-to-event distribution and the misclassification parameters without any external information about the latter parameters. We also provide results of a simulation study to evaluate the effect of neglecting the presence of misclassification in the analysis of clustered time-to-event data.

DNA methylation studies have been revolutionized by the recent development of high throughput array-based platforms. Most of the existing methods analyze microarray methylation data on a probe-by-probe basis, ignoring probe-specific effects and correlations among methylation levels at neighboring genomic locations. These methods can potentially miss functionally relevant findings associated with genomic regions. In this article, we propose a statistical model that allows us to pool information on the same probe across multiple samples to estimate the probe affinity effect, and to borrow strength from the neighboring probe sites to better estimate the methylation values. Using a simulation study, we demonstrate that our method can provide accurate model-based estimates. We further use the proposed method to develop a new procedure for detecting differentially methylated regions, and compare it with a state-of-the-art approach via a data application.
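As a hypothetical sketch of the borrowing-strength idea (not the paper's actual model), one can shrink each probe's methylation value toward a distance-weighted average of its neighbors; the function name, exponential weights, bandwidth, and shrinkage fraction below are all illustrative assumptions:

```python
import math

def smooth_methylation(values, positions, bandwidth=100.0, shrink=0.5):
    """Shrink each probe's methylation value toward a distance-weighted
    average of its neighbors; weights decay exponentially with genomic
    distance.  A generic borrowing-strength sketch only."""
    smoothed = []
    for i, (v, p) in enumerate(zip(values, positions)):
        wsum = wv = 0.0
        for j, (v2, p2) in enumerate(zip(values, positions)):
            if i == j:
                continue
            w = math.exp(-abs(p - p2) / bandwidth)  # closer probes weigh more
            wsum += w
            wv += w * v2
        neighbor_mean = wv / wsum if wsum > 0 else v
        smoothed.append((1 - shrink) * v + shrink * neighbor_mean)
    return smoothed
```

With three probes at 0, 50, and 100 bp and values 0.2, 0.8, 0.2, the middle probe is pulled halfway toward its neighbors' common value, illustrating how an outlying measurement is stabilized by its genomic context.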

There is an overwhelmingly large literature, and many algorithms already available, on “large-scale inference problems” based on different modeling techniques and cultures. Our primary goal in this article is *not to add one more new methodology* to the existing toolbox but instead (i) to clarify how these different simultaneous inference methods are *connected*, (ii) to provide an alternative, more intuitive derivation of the formulas that leads to *simpler* expressions, and (iii) to develop a *unified* algorithm for practitioners. A detailed discussion of representation, estimation, inference, and model selection is given. Applications to a variety of real and simulated datasets show promise. We end with several future research directions.

Causal mediation modeling has become a popular approach for studying the effect of an exposure on an outcome through a mediator. However, current methods are not applicable to the setting with a large number of mediators. We propose a testing procedure for mediation effects of high-dimensional continuous mediators. We characterize the marginal mediation effect, the multivariate component-wise mediation effects, and the norm of the component-wise effects, and develop a Monte-Carlo procedure for evaluating their statistical significance. To accommodate the setting with a large number of mediators and a small sample size, we further propose a transformation model using the spectral decomposition. Under the transformation model, mediation effects can be estimated using a series of regression models with a univariate transformed mediator, and examined by our proposed testing procedure. Extensive simulation studies are conducted to assess the performance of our methods for continuous and dichotomous outcomes. We apply the methods to analyze genomic data investigating the effect of microRNA miR-223 on a dichotomous survival status of patients with glioblastoma multiforme (GBM). We identify nine gene ontology sets with expression values that significantly mediate the effect of miR-223 on GBM survival.
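The Monte-Carlo significance idea can be illustrated, in a deliberately simplified single-mediator form, where the indirect effect is the product of two regression coefficients: sampling each coefficient from its estimated normal sampling distribution yields an interval for the product. The function name, defaults, and single-mediator reduction below are illustrative assumptions, not the paper's procedure:

```python
import random

def mc_mediation_ci(a_hat, se_a, b_hat, se_b, n_draws=20000, alpha=0.05, seed=1):
    """Monte-Carlo interval for an indirect effect a*b: draw the two
    coefficients from their estimated normal sampling distributions and
    take quantiles of the sampled products.  The effect is deemed
    significant at level alpha when the interval excludes zero."""
    rng = random.Random(seed)
    prods = sorted(rng.gauss(a_hat, se_a) * rng.gauss(b_hat, se_b)
                   for _ in range(n_draws))
    lo = prods[int(alpha / 2 * n_draws)]
    hi = prods[int((1 - alpha / 2) * n_draws) - 1]
    return lo, hi
```

A strong effect (both coefficients near 1 with small standard errors) yields an interval bounded away from zero, while a null first-stage coefficient yields an interval straddling zero.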

Cigarette smoking is a prototypical example of a recurrent event. The pattern of recurrent smoking events may depend on time-varying covariates, including mood and environmental variables. Fixed effects and frailty models for recurrent events data assume that smokers have a common association with time-varying covariates. We develop a mixed effects version of a recurrent events model that may be used to describe variation among smokers in how they respond to those covariates, potentially leading to the development of individual-based smoking cessation therapies. Our method extends the modified EM algorithm of Steele (1996) for generalized mixed models to recurrent events data with partially observed time-varying covariates. It is offered as an alternative to the method of Rizopoulos, Verbeke, and Lesaffre (2009), who extended Steele's (1996) algorithm to a joint model for the recurrent events data and time-varying covariates. Our approach does not require a model for the time-varying covariates, but instead assumes that the time-varying covariates are sampled according to a Poisson point process with known intensity. Our methods are well suited to data collected using Ecological Momentary Assessment (EMA), a method of data collection widely used in the behavioral sciences to collect data on emotional state and recurrent events in the everyday environments of study subjects using electronic devices such as personal digital assistants (PDAs) or smart phones.

A new objective methodology is proposed to select the parsimonious set of important covariates that are associated with a censored outcome variable *Y*; the method simplifies to accommodate uncensored outcomes. Covariate selection proceeds in an iterated forward manner and is controlled by a pre-chosen upper bound on the number of covariates to be selected and by the global false selection rate and level. A sequence of working regression models for the event given a covariate set is fit among subjects not censored before *y*, and the corresponding process (through *y*) of conditional prediction error is estimated; the direction and magnitude of covariate effects can change arbitrarily with *y*. The newly proposed adequacy measure for the covariate set is the slope coefficient from a regression (with no intercept) between the baseline prediction error process for the intercept-only model and the process corresponding to the covariate set. Under quite general conditions on the censoring variable, the methods are shown to asymptotically control the false selection rate at the nominal level while consistently ranking covariate sets, which permits recruitment of all important covariates from those available with probability tending to 1. A simulation study confirms these analytical results and compares the proposed methods to recent competitors. Two real data illustrations are provided.
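The adequacy measure is a no-intercept least-squares slope between two prediction-error processes, for which slope = Σ(b·c)/Σ(b·b). A minimal sketch, with the direction of the regression (candidate-set process against the baseline process) chosen here purely for illustration:

```python
def adequacy_slope(baseline_err, candidate_err):
    """No-intercept least-squares slope relating two prediction-error
    processes evaluated on a common grid of time points.  Sketched here as
    regressing the candidate covariate-set process on the baseline
    (intercept-only) process: a slope well below 1 suggests the covariate
    set predicts better than the intercept-only model."""
    sxy = sum(b * c for b, c in zip(baseline_err, candidate_err))
    sxx = sum(b * b for b in baseline_err)
    return sxy / sxx
```

If the candidate process is uniformly half the baseline error, the slope is 0.5; regressing the baseline process on itself returns 1, the "no improvement" benchmark.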

Researchers often seek robust inference for a parameter through semiparametric estimation. Efficient semiparametric estimation currently requires theoretical derivation of the efficient influence function (EIF), which can be a challenging and time-consuming task. If this task can be computerized, it can save substantial human effort, which can be transferred, for example, to the design of new studies. Although the EIF is, in principle, a derivative, simple numerical differentiation to calculate the EIF by a computer masks the EIF's functional dependence on the parameter of interest. For this reason, the standard approach to obtaining the EIF relies on the theoretical construction of the space of scores under all possible parametric submodels. This process currently depends on the correctness of conjectures about these spaces, and on the correct verification of such conjectures. Guessing such conjectures correctly, though successful in some problems, is a nondeductive process that is not guaranteed to succeed (and hence is not computerizable), and the verification of conjectures is generally susceptible to mistakes. We propose a method that can deductively produce semiparametric locally efficient estimators. The proposed method is computerizable, meaning that it needs neither conjecturing nor otherwise theoretically deriving the functional form of the EIF, and is guaranteed to produce the desired estimates even for complex parameters. The method is demonstrated through an example.

The amount and complexity of patient-level data being collected in randomized-controlled trials offer both opportunities and challenges for developing personalized rules for assigning treatment for a given disease or ailment. For example, trials examining treatments for major depressive disorder are not only collecting typical baseline data such as age, gender, or scores on various tests, but also data that measure the structure and function of the brain such as images from magnetic resonance imaging (MRI), functional MRI (fMRI), or electroencephalography (EEG). These latter types of data have an inherent structure and may be considered as functional data. We propose an approach that uses baseline covariates, both scalars and functions, to aid in the selection of an optimal treatment. In addition to providing information on which treatment should be selected for a new patient, the estimated regime has the potential to provide insight into the relationship between treatment response and the set of baseline covariates. Our approach can be viewed as an extension of “advantage learning” to include both scalar and functional covariates. We describe our method and how to implement it using existing software. Empirical performance of our method is evaluated with simulated data in a variety of settings and also applied to data arising from a study of patients with major depressive disorder from whom baseline scalar covariates as well as functional data from EEG are available.

A treatment regime formalizes personalized medicine as a function from individual patient characteristics to a recommended treatment. A high-quality treatment regime can improve patient outcomes while reducing cost, resource consumption, and treatment burden. Thus, there is tremendous interest in estimating treatment regimes from observational and randomized studies. However, the development of treatment regimes for application in clinical practice requires the long-term, joint effort of statisticians and clinical scientists. In this collaborative process, the statistician must integrate clinical science into the statistical models underlying a treatment regime and the clinician must scrutinize the estimated treatment regime for scientific validity. To facilitate meaningful information exchange, it is important that estimated treatment regimes be interpretable in a subject-matter context. We propose a simple, yet flexible class of treatment regimes whose members are representable as a short list of if–then statements. Regimes in this class are immediately interpretable and are therefore an appealing choice for broad application in practice. We derive a robust estimator of the optimal regime within this class and demonstrate its finite sample performance using simulation experiments. The proposed method is illustrated with data from two clinical trials.

Technological advances have led to a proliferation of structured big data that have matrix-valued covariates. We are specifically motivated to build predictive models for multi-subject neuroimaging data based on each subject's brain imaging scans. This is an ultra-high-dimensional problem that consists of a matrix of covariates (brain locations by time points) for each subject; few methods currently exist to fit supervised models directly to this tensor data. We propose a novel modeling and algorithmic strategy to apply generalized linear models (GLMs) to this massive tensor data in which one set of variables is associated with locations. Our method begins by fitting GLMs to each location separately, and then builds an ensemble by blending information across locations through regularization with what we term an aggregating penalty. Our so-called Local-Aggregate Model can be fit in a completely distributed manner over the locations using an Alternating Direction Method of Multipliers (ADMM) strategy, and thus greatly reduces the computational burden. Furthermore, we propose to select the appropriate model through a novel sequence of faster algorithmic solutions that is similar to regularization paths. We demonstrate both the computational and predictive modeling advantages of our methods via simulations and an EEG classification problem.

Predicting disease risk and progression is one of the main goals in many clinical research studies. Cohort studies on the natural history and etiology of chronic diseases span years, and data are collected at multiple visits. Although kernel-based statistical learning methods have proven powerful for a wide range of disease prediction problems, they are well studied only for independent data, not for longitudinal data. It is thus important to develop time-sensitive prediction rules that make use of the longitudinal nature of the data. In this paper, we develop a novel statistical learning method for longitudinal data by introducing subject-specific short-term and long-term latent effects through a designed kernel to account for within-subject correlation of longitudinal measurements. Since the presence of multiple sources of data is increasingly common, we embed our method in a multiple kernel learning framework and propose regularized multiple kernel statistical learning with random effects to construct effective nonparametric prediction rules. Our method allows easy integration of various heterogeneous data sources and takes advantage of correlation among longitudinal measures to increase prediction power. We use different kernels for each data source, taking advantage of the distinctive features of each data modality, and then optimally combine data across modalities. We apply the developed methods to two large epidemiological studies, one on Huntington's disease and the other on Alzheimer's disease (Alzheimer's Disease Neuroimaging Initiative, ADNI), in which we explore a unique opportunity to combine imaging and genetic data to predict mild cognitive impairment, and show a substantial gain in performance while accounting for the longitudinal aspect of the data.

Merging multiple datasets collected from studies with identical or similar scientific objectives is often undertaken in practice to increase statistical power. This article concerns the development of an effective statistical method for merging multiple longitudinal datasets subject to various heterogeneous characteristics, such as different follow-up schedules and study-specific missing covariates (e.g., covariates observed in some studies but missing in others). The presence of study-specific missing covariates presents a great methodological challenge in data merging and analysis. We propose a joint estimating function approach to address this challenge, in which a novel nonparametric estimating function constructed via spline-based sieve approximation is utilized to bridge estimating equations from studies with missing covariates to those with fully observed covariates. Under mild regularity conditions, we show that the proposed estimator is consistent and asymptotically normal. We evaluate the finite-sample performance of the proposed method through simulation studies. In comparison to the conventional multiple imputation approach, our method exhibits smaller estimation bias. We provide an illustrative data analysis using longitudinal cohorts collected in Mexico City to assess the effect of lead exposure on children's somatic growth.

Misclassified causes of failures are a common phenomenon in competing risks survival data such as cancer mortality. We propose new estimating equations for a semiparametric proportional hazards (PH) model with misattributed causes of failures. Unlike other methods, the estimator does not require any parametric assumptions on baseline cause-specific hazard rates. It is shown that the estimators for regression coefficients are consistent and asymptotically normal. Simulation results support the theoretical analysis in finite samples. The methods are applied to analyze prostate cancer survival.

Recurrent events often serve as the outcome in epidemiologic studies. In some observational studies, the goal is to estimate the effect of a new or “experimental” (i.e., less established) treatment of interest on the recurrent event rate. The incentive for accepting the new treatment may be that it is more available than the standard treatment. Given that the patient can choose between the experimental treatment and conventional therapy, it is of clinical importance to compare the treatment of interest versus the setting where the experimental treatment did not exist, in which case patients could only receive no treatment or the standard treatment. Many methods exist for the analysis of recurrent events and for the evaluation of treatment effects. However, methodology for the intersection of these two areas is sparse. Moreover, care must be taken in setting up the comparison groups in our setting; use of existing methods featuring time-dependent treatment indicators will generally lead to a biased treatment effect since the comparison group construction will not properly account for the timing of treatment initiation. We propose a sequential stratification method featuring time-dependent prognostic score matching to estimate the effect of a time-dependent treatment on the recurrent event rate. The performance of the method in moderate-sized samples is assessed through simulation. The proposed methods are applied to a prospective clinical study in order to evaluate the effect of living donor liver transplantation on hospitalization rates; in this setting, conventional therapy involves remaining on the wait list or receiving a deceased donor transplant.

The case–cohort (CCH) design is a cost-effective design for assessing genetic susceptibility with time-to-event data, especially when the event rate is low. In this work, we propose a powerful pseudo-score test for assessing the association between a single nucleotide polymorphism (SNP) and the event time under the CCH design. The pseudo-score is derived from a pseudo-likelihood, an estimated retrospective likelihood that treats the SNP genotype as the dependent variable and the time-to-event outcome and other covariates as independent variables. It exploits the fact that the genetic variable is often distributed independently of covariates, or related only to a low-dimensional subset. Estimates of hazard ratio parameters for association can be obtained by maximizing the pseudo-likelihood. A unique advantage of our method is that it allows the censoring distribution to depend on covariates that are only measured for the CCH sample, while not requiring follow-up or covariate information on subjects not selected into the CCH sample. In addition to these flexibilities, the proposed method has high relative efficiency compared with commonly used alternative approaches. We study the large sample properties of this method and assess its finite sample performance using both simulated and real data examples.

The Gittins index provides a well established, computationally attractive, optimal solution to a class of resource allocation problems known collectively as the multi-arm bandit problem. Its development was originally motivated by the problem of optimal patient allocation in multi-arm clinical trials. However, it has never been used in practice, possibly for the following reasons: (1) it is fully sequential, i.e., the endpoint must be observable soon after treating a patient, reducing the medical settings to which it is applicable; (2) it is completely deterministic and thus removes randomization from the trial, which would naturally protect against various sources of bias. We propose a novel implementation of the Gittins index rule that overcomes these difficulties, trading off a small deviation from optimality for a fully randomized, adaptive group allocation procedure which offers substantial improvements in terms of patient benefit, especially relevant for small populations. We report the operating characteristics of our approach compared to existing methods of adaptive randomization using a recently published trial as motivation.

We provide an asymptotic test to analyze randomized clinical trials that may be subject to selection bias. For normally distributed responses, and under permuted block randomization, we derive a likelihood ratio test of the treatment effect under a selection bias model. A likelihood ratio test of the presence of selection bias arises from the same formulation. We prove that the test is asymptotically chi-square on one degree of freedom. These results agree with the likelihood ratio test of Ivanova et al. (2005, *Statistics in Medicine* **24**, 1537–1546) for binary responses, for which they established by simulation that the asymptotic distribution is chi-square. Simulations also show that the test is robust to departures from normality and under another randomization procedure. We illustrate the test by reanalyzing a clinical trial on retinal detachment.
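For any likelihood ratio statistic that is asymptotically chi-square on one degree of freedom, the p-value has a closed form, since the chi-square(1) survival function at x is erfc(√(x/2)). A small self-contained sketch (the function name is illustrative):

```python
import math

def lrt_pvalue_df1(loglik_alt, loglik_null):
    """Likelihood ratio statistic 2*(l1 - l0) and its asymptotic p-value
    under a chi-square(1) reference distribution.  If Z ~ N(0,1) then
    P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x/2)), so no special
    distribution library is needed for one degree of freedom."""
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

A log-likelihood gain of 1.92 gives a statistic of 3.84, just below the familiar 5% critical value 3.841, so the p-value lands at roughly 0.05.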

For comparison of proportions, there are three commonly used measurements: the difference, the relative risk, and the odds ratio. Significant effort has been spent on exact confidence intervals for the difference. In this article, we focus on the relative risk and the odds ratio when data are collected from a matched-pairs design or a two-arm independent binomial experiment. Exact one-sided and two-sided confidence intervals are proposed for each configuration of two measurements and two types of data. The one-sided intervals are constructed using an inductive order; they are the smallest under this order and are admissible under the set inclusion criterion. The two-sided intervals are the intersection of two one-sided intervals. R code is developed to implement the intervals. Supplementary materials for this article are available online.

We investigate likelihood ratio contrast tests for dose response signal detection under model uncertainty, when several competing regression models are available to describe the dose response relationship. The proposed approach uses the complete structure of the regression models, but does not require knowledge of the parameters of the competing models. Standard likelihood ratio test theory is applicable in linear models as well as in nonlinear regression models with identifiable parameters. However, for many commonly used nonlinear dose response models the regression parameters are not identifiable under the null hypothesis of no dose response and standard arguments cannot be used to obtain critical values. We thus derive the asymptotic distribution of likelihood ratio contrast tests in regression models with a lack of identifiability and use this result to simulate the quantiles based on Gaussian processes. The new method is illustrated with a real data example and compared to existing procedures using theoretical investigations as well as simulations.

Continuous-time birth–death-shift (BDS) processes are frequently used in stochastic modeling, with many applications in ecology and epidemiology. In particular, such processes can model evolutionary dynamics of transposable elements—important genetic markers in molecular epidemiology. Estimation of the effects of individual covariates on the birth, death, and shift rates of the process can be accomplished by analyzing patient data, but inferring these rates in a discretely and unevenly observed setting presents computational challenges. We propose a multi-type branching process approximation to BDS processes and develop a corresponding expectation maximization algorithm, where we use spectral techniques to reduce calculation of expected sufficient statistics to low-dimensional integration. These techniques yield an efficient and robust optimization routine for inferring the rates of the BDS process, and apply broadly to multi-type branching processes whose rates can depend on many covariates. After rigorously testing our methodology in simulation studies, we apply our method to study the intrapatient time evolution of the IS*6110* transposable element, a genetic marker frequently used in the estimation of epidemiological clusters of *Mycobacterium tuberculosis* infections.

We introduce a new multivariate product-shot-noise Cox process which is useful for modeling multi-species spatial point patterns with clustering intra-specific interactions and neutral, negative, or positive inter-specific interactions. The auto- and cross-pair correlation functions of the process can be obtained in closed analytical forms and approximate simulation of the process is straightforward. We use the proposed process to model interactions within and among five tree species in the Barro Colorado Island plot.

Non-parametric estimation of the transition probabilities in multi-state models is considered for non-Markov processes. First, a generalization of the estimator of Pepe et al. (1991, *Statistics in Medicine*) is given for a class of progressive multi-state models, based on the difference between Kaplan–Meier estimators. Second, a general estimator for progressive or non-progressive models is proposed, based upon constructed univariate survival or competing risks processes which retain the Markov property. The properties of the estimators and their associated standard errors are investigated through simulation. The estimators are demonstrated on datasets relating to survival and recurrence in patients with colon cancer and prothrombin levels in liver cirrhosis patients.

We wish to estimate the total number of classes in a population based on sample counts, especially in the presence of high latent diversity. Drawing on probability theory that characterizes distributions on the integers by ratios of consecutive probabilities, we construct a nonlinear regression model for the ratios of consecutive frequency counts. This allows us to predict the unobserved count and hence estimate the total diversity. We believe that this is the first approach to depart from the classical mixed Poisson model in this problem. Our method is geometrically intuitive and yields good fits to data with reasonable standard errors. It is especially well-suited to analyzing high diversity datasets derived from next-generation sequencing in microbial ecology. We demonstrate the method's performance in this context and via simulation, and we present a dataset for which our method outperforms all competitors.
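A toy version of the consecutive-ratio idea, using a straight-line fit in place of the paper's rational-function nonlinear regression (the function name, the linear form, and the input convention are illustrative assumptions):

```python
def estimate_unseen_count(freq):
    """Sketch of the consecutive-ratio idea.  freq[j] is the number of
    classes observed exactly j+1 times.  Regress the ratios
    r_j = f_{j+1}/f_j on j with a straight line, extrapolate to j = 0,
    and predict the unseen count as f_0 = f_1 / r_0."""
    pts = [(j + 1.0, freq[j + 1] / freq[j]) for j in range(len(freq) - 1)]
    n = len(pts)
    xm = sum(x for x, _ in pts) / n
    ym = sum(y for _, y in pts) / n
    denom = sum((x - xm) ** 2 for x, _ in pts)
    slope = sum((x - xm) * (y - ym) for x, y in pts) / denom if denom else 0.0
    r0 = ym - slope * xm  # fitted ratio extrapolated to j = 0
    return freq[0] / r0 if r0 > 0 else float("inf")
```

For geometric-looking counts (100 singletons, 50 doubletons, 25 tripletons, ...) every ratio is 0.5, so the extrapolated r_0 is 0.5 and the predicted unseen count is 100/0.5 = 200.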

Creel surveys are used in recreational fisheries to estimate angling effort, catch, and harvest. Aerial-access creel surveys rely on two components: (1) a ground component in which fishing parties returning from their trips are interviewed at some access-points of the fishery; (2) an aerial component in which the number of fishing parties is counted. A common practice is to sample fewer aerial survey days than ground survey days. This is thought by practitioners to reduce the cost of the survey, but there is a lack of sound statistical methodology for this case. In this article, we propose various estimation methods to handle this situation and evaluate their asymptotic properties from a design-based perspective. We also propose formulas for the optimal allocation of the effort between the ground and the aerial portion of the survey, for given costs and budget. A simulation study investigates the performance of the estimators. Finally, we apply our methods to data from an annual Kootenay Lake survey (Canada).

We develop maximum likelihood methods for line transect surveys in which animals go undetected at distance zero, either because they are stochastically unavailable while within view or because they are missed when they are available. These incorporate a Markov-modulated Poisson process model for animal availability, allowing more clustered availability events than is possible with Poisson availability models. They include a mark-recapture component arising from the independent-observer survey, leading to more accurate estimation of detection probability given availability. We develop models for situations in which (a) multiple detections of the same individual are possible and (b) some or all of the availability process parameters are estimated from the line transect survey itself, rather than from independent data. We investigate estimator performance by simulation, and compare the multiple-detection estimators with estimators that use only initial detections of individuals, and with a single-observer estimator. Simultaneous estimation of detection function parameters and availability model parameters is shown to be feasible from the line transect survey alone with multiple detections and double-observer data but not with single-observer data. Recording multiple detections of individuals improves estimator precision substantially when estimating the availability model parameters from survey data, and we recommend that these data be gathered. We apply the methods to estimate detection probability from a double-observer survey of North Atlantic minke whales, and find that double-observer data greatly improve estimator precision here too.
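A minimal simulation of a two-state Markov-modulated Poisson availability process, in which surfacing events occur only during exponentially distributed "available" periods, illustrates why such models yield more clustered events than a plain Poisson process (the function and parameter names are assumptions, not the paper's notation):

```python
import random

def simulate_mmpp(T, q01, q10, lam_avail, seed=0):
    """Simulate availability events on [0, T) from a two-state
    Markov-modulated Poisson process: the animal alternates between an
    'unavailable' state (0) and an 'available' state (1) with exponential
    holding times (rates q01, q10), and events occur at rate lam_avail
    only while available, producing bursts of clustered events."""
    rng = random.Random(seed)
    t, state, events = 0.0, 0, []
    while t < T:
        if state == 0:
            t += rng.expovariate(q01)   # wait to become available
            state = 1
        else:
            hold = rng.expovariate(q10)  # length of the available period
            s = t + rng.expovariate(lam_avail)
            while s < min(t + hold, T):  # Poisson events within the period
                events.append(s)
                s += rng.expovariate(lam_avail)
            t += hold
            state = 0
    return events
```

With q01 = q10 the animal is available about half the time, so the mean event rate is roughly lam_avail/2, but the events arrive in bursts rather than uniformly.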

Link et al. (2010, *Biometrics* **66**, 178–185) define a general framework for analyzing capture–recapture data with potential misidentifications. In this framework, the observed vector of counts, **y**, is considered as a linear function of a vector of latent counts, **x**, such that **y** = **Ax**, with **x** assumed to follow a multinomial distribution conditional on the model parameters, θ. Bayesian methods are then applied by sampling from the joint posterior distribution of both **x** and θ. In particular, Link et al. (2010) propose a Metropolis–Hastings algorithm to sample from the full conditional distribution of **x**, where new proposals are generated by sequentially adding elements from a basis of the null space (kernel) of **A**. We consider this algorithm and show that using elements from a simple basis for the kernel of **A** may not produce an irreducible Markov chain. Instead, we require a Markov basis, as defined by Diaconis and Sturmfels (1998, *The Annals of Statistics* **26**, 363–397). We illustrate the importance of Markov bases with three capture–recapture examples. We prove that a specific lattice basis is a Markov basis for a class of models including the original model considered by Link et al. (2010) and confirm that the specific basis used in their example with two sampling occasions is a Markov basis. The constructive nature of our proof provides an immediate method to obtain a Markov basis for any model in this class.
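The proposal mechanism at issue can be sketched as follows: a move adds or subtracts a randomly chosen basis element of the kernel of the design matrix, and is rejected if it leaves the set of nonnegative integer vectors. This sketch omits the posterior acceptance ratio of the full Metropolis–Hastings step; whether the resulting chain is irreducible depends on which basis is supplied, which is exactly the article's point. The function name is illustrative:

```python
import random

def kernel_move(x, basis, rng):
    """One proposal step for a latent count vector: add or subtract a
    randomly chosen kernel basis element, keeping the result in the lattice
    of nonnegative integer vectors (otherwise stay put).  Because every
    basis element lies in the kernel of the design matrix, the observed
    counts are preserved by construction."""
    d = rng.choice(basis)
    sign = rng.choice((-1, 1))
    proposal = [xi + sign * di for xi, di in zip(x, d)]
    return proposal if min(proposal) >= 0 else x
```

With the single basis element (1, -1, 0), every accepted move transfers one unit between the first two coordinates, so their sum and the third coordinate are invariant along the chain.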

An expanded family of mixtures of multivariate power exponential distributions is introduced. While fitting heavy-tails and skewness have received much attention in the model-based clustering literature recently, we investigate the use of a distribution that can deal with both varying tail-weight and peakedness of data. A family of parsimonious models is proposed using an eigen-decomposition of the scale matrix. A generalized expectation–maximization algorithm is presented that combines convex optimization via a minorization–maximization approach and optimization based on accelerated line search algorithms on the Stiefel manifold. Lastly, the utility of this family of models is illustrated using both toy and benchmark data.

Differential brain response to sensory stimuli is very small (a few microvolts) compared to the overall magnitude of spontaneous electroencephalogram (EEG), yielding a low signal-to-noise ratio (SNR) in studies of event-related potentials (ERP). To cope with this phenomenon, stimuli are applied repeatedly and the ERP signals arising from the individual trials are averaged at the subject level. This results in loss of information about potentially important changes in the magnitude and form of ERP signals over the course of the experiment. In this article, we develop a meta-preprocessing step utilizing a moving average of ERP across sliding trial windows, to capture such longitudinal trends. We embed this procedure in a weighted linear mixed-effects model to describe longitudinal trends in features such as ERP peak amplitude and latency across trials while adjusting for the inherent heteroskedasticity created at the meta-preprocessing step. The proposed unified framework, including the meta-preprocessing and weighted linear mixed-effects modeling steps, is referred to as MAP-ERP (moving-averaged-processed ERP). We perform simulation studies to assess the performance of MAP-ERP in reconstructing existing longitudinal trends and apply MAP-ERP to data from young children with autism spectrum disorder (ASD) and their typically developing counterparts to examine differences in patterns of implicit learning, providing novel insights about the mechanisms underlying social and/or cognitive deficits in this disorder.

The global emergence of *Batrachochytrium dendrobatidis* (Bd) has caused the extinction of hundreds of amphibian species worldwide. It has become increasingly important to be able to precisely predict time to Bd arrival in a population. The data analyzed herein present a unique challenge in terms of modeling because there is a strong spatial component to Bd arrival time and the traditional proportional hazards assumption is grossly violated. To address these concerns, we develop a novel marginal Bayesian nonparametric survival model for spatially correlated right-censored data. This class of models assumes that the logarithms of the survival times marginally follow a mixture of normal densities with a linear-dependent Dirichlet process prior as the random mixing measure, and that their joint distribution is induced by a Gaussian copula model with a spatial correlation structure. To invert high-dimensional spatial correlation matrices, we adopt a full-scale approximation that can capture both large- and small-scale spatial dependence. An efficient Markov chain Monte Carlo algorithm with delayed rejection is proposed for posterior computation, and the R package spBayesSurv is provided to fit the model. This approach is first evaluated through simulations, then applied to threatened frog populations in Sequoia-Kings Canyon National Park.

Recent developments of high-throughput genomic technologies offer an unprecedentedly detailed view of the genetic variation in various human populations, and promise to lead to significant progress in understanding the genetic basis of complex diseases. Despite this tremendous advance in data generation, it remains very challenging to analyze and interpret these data due to their sparse and high-dimensional nature. Here, we propose novel applications and new developments of empirical Bayes scan statistics to identify genomic regions significantly enriched with disease risk variants. We show that the proposed empirical Bayes methodology can be substantially more powerful than existing scan statistics methods, especially in the presence of many non-disease risk variants and in situations where there is a mixture of risk and protective variants. Furthermore, the empirical Bayes approach has greater flexibility to accommodate covariates such as functional prediction scores and additional biomarkers. As a proof of concept, we apply the proposed methods to a whole-exome sequencing study for autism spectrum disorders and identify several promising candidate genes.

Understanding HIV incidence, the rate at which new infections occur in populations, is critical for tracking and surveillance of the epidemic. In this article, we derive methods for determining sample sizes for cross-sectional surveys to estimate incidence with sufficient precision. We further show how to specify sample sizes for two successive cross-sectional surveys to detect changes in incidence with adequate power. In these surveys biomarkers such as CD4 cell count, viral load, and recently developed serological assays are used to determine which individuals are in an early disease stage of infection. The total number of individuals in this stage, divided by the number of people who are uninfected, is used to approximate the incidence rate. Our methods account for uncertainty in the durations of time spent in the biomarker defined early disease stage. We find that failure to account for this uncertainty when designing surveys can lead to imprecise estimates of incidence and underpowered studies. We evaluated our sample size methods in simulations and found that they performed well in a variety of underlying epidemics. Code for implementing our methods in R is available with this article at the *Biometrics* website on Wiley Online Library.
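As a rough illustration of the kind of estimator described above, a standard biomarker-based snapshot approximation divides the early-stage-to-uninfected ratio by the assumed mean duration of the early stage, and uncertainty in that duration can be propagated with the delta method. All counts, the mean duration, and its standard error below are hypothetical, and the variance formula is a simplified two-term version rather than the article's exact development.

```python
import math

# Hypothetical survey results (illustrative only, not from the article)
n_early = 50         # individuals in the biomarker-defined early stage
n_uninfected = 4000  # uninfected individuals in the cross-sectional survey
mu = 0.5             # assumed mean duration (years) of the early stage
se_mu = 0.05         # standard error reflecting uncertainty in that duration

# Snapshot estimator: (early-stage count / uninfected count) / mean duration
inc = n_early / (n_uninfected * mu)

# Delta-method variance with a Poisson-type term for the count and a
# term for the uncertainty in the mean duration mu
var_count = inc**2 * (1.0 / n_early)
var_mu = inc**2 * (se_mu / mu) ** 2
se_inc = math.sqrt(var_count + var_mu)
```

Ignoring the `var_mu` term is precisely the kind of omission the abstract warns about: it understates the standard error and so overstates the precision a given sample size delivers.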

Random-effects models are often used in family-based genetic association studies to properly capture within-family relationships. In such models, the regression parameters have an interpretation conditional on the random effects and measure, e.g., genetic effects for each family. Estimating parameters that can be used to make inferences at the population level is often more relevant than the family-specific effects, but is not straightforward, mainly for two reasons. First, the analysis of family data often requires high-dimensional random-effects vectors to properly model the familial relationships, for instance when members with different degrees of relationship are considered, such as trios or mixes of monozygotic and dizygotic twins. Second, biased sampling designs, such as the multiple-cases family design, are often employed to enrich the sample with genetic information. For these reasons, deriving parameters with the desired marginal interpretation can be challenging. In this work we consider marginalized mixed-effects models, discuss challenges in applying them to ascertained family data, and propose penalized maximum likelihood methodology that stabilizes parameter estimation by using external information on disease prevalence or heritability. The performance of our methodology is evaluated via simulation and illustrated on data from rheumatoid arthritis patients, where we estimate the marginal effects of HLA-DRB1 and shared-epitope alleles across three different study designs and combine them using meta-analysis.

Accurate risk prediction models are needed to identify different risk groups for individualized prevention and treatment strategies. In the Nurses' Health Study, to examine the effects of several biomarkers and genetic markers on the risk of rheumatoid arthritis (RA), a three-phase nested case-control (NCC) design was conducted, in which two sequential NCC subcohorts were formed with one nested within the other, and one set of new markers measured on each of the subcohorts. Because of the potential cost associated with assaying biomarkers, one objective of the study is to evaluate the clinical value of novel biomarkers in improving upon existing risk models. In this paper, we develop robust statistical procedures for constructing risk prediction models for RA and estimating the incremental value (IncV) of new markers based on three-phase NCC studies. Our method also takes into account possible time-varying effects of biomarkers in risk modeling, which allows us to more robustly assess the biomarker utility and address the question of whether a marker is better suited for short-term or long-term risk prediction. The proposed procedures are shown to perform well in finite samples via simulation studies.

Analysis of matched case-control studies is often complicated by missing data on covariates. Analysis can be restricted to individuals with complete data, but this is inefficient and may be biased. Multiple imputation (MI) is an efficient and flexible alternative. We describe two MI approaches. The first uses a model for the data on an individual and includes matching variables; the second uses a model for the data on a whole matched set and avoids the need to model the matching variables. Within each approach, we consider three methods: full-conditional specification (FCS), joint model MI using a normal model, and joint model MI using a latent normal model. We show that FCS MI is asymptotically equivalent to joint model MI using a restricted general location model that is compatible with the conditional logistic regression analysis model. The normal and latent normal imputation models are not compatible with this analysis model. All methods allow for multiple partially-observed covariates, non-monotone missingness, and multiple controls per case. They can be easily applied in standard statistical software and valid variance estimates obtained using Rubin's Rules. We compare the methods in a simulation study. The approach of including the matching variables is most efficient. Within each approach, the FCS MI method generally yields the least-biased odds ratio estimates, but normal or latent normal joint model MI is sometimes more efficient. All methods have good confidence interval coverage. Data on colorectal cancer and fibre intake from the EPIC-Norfolk study are used to illustrate the methods, in particular showing how efficiency is gained relative to just using individuals with complete data.
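For readers unfamiliar with Rubin's Rules invoked above, the pooling step is short enough to state directly. This sketch (with made-up estimates and variances for m = 3 imputations) combines per-imputation point estimates and variances into a single estimate and total variance:

```python
def rubins_rules(estimates, variances):
    """Pool point estimates and variances from m multiply imputed analyses."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled estimate
    ubar = sum(variances) / m                              # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = ubar + (1 + 1 / m) * b                         # total variance
    return qbar, total

# Hypothetical log-odds-ratio estimates and variances from three imputations
est, var = rubins_rules([0.9, 1.1, 1.0], [0.04, 0.05, 0.045])
```

The between-imputation term `b` is what makes the resulting confidence intervals honest about the information lost to missingness, which is why all the imputation methods compared in the article report variances this way.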

It is not uncommon in follow-up studies to make multiple attempts to collect a measurement after baseline. Recording whether these attempts are successful or not provides useful information for the purposes of assessing the missing at random (MAR) assumption and facilitating missing not at random (MNAR) modeling. This is because measurements from subjects who provide this data after multiple failed attempts may differ from those who provide the measurement after fewer attempts. This type of “continuum of resistance” to providing a measurement has hitherto been modeled in a selection model framework, where the outcome data is modeled jointly with the success or failure of the attempts given these outcomes. Here, we present a pattern mixture approach to model this type of data. We re-analyze the repeated attempt data from a trial that was previously analyzed using a selection model approach. Our pattern mixture model is more flexible and is more transparent in terms of parameter identifiability than the models that have previously been used to model repeated attempt data and allows for sensitivity analysis. We conclude that our approach to modeling this type of data provides a fully viable alternative to the more established selection model.

An important objective in biomedical and environmental risk assessment is estimation of minimum exposure levels that induce a pre-specified adverse response in a target population. The exposure points in such settings are typically referred to as benchmark doses (BMDs). Parametric Bayesian estimation for finding BMDs has grown in popularity, and a large variety of candidate dose-response models is available for applying these methods. Each model can possess potentially different parametric interpretation(s), however. We present reparameterized dose-response models that allow for explicit use of prior information on the target parameter of interest, the BMD. We also enhance our Bayesian estimation technique for BMD analysis by applying Bayesian model averaging to produce point estimates and (lower) credible bounds, overcoming associated questions of model adequacy when multimodel uncertainty is present. An example from carcinogenicity testing illustrates the calculations.
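The model-averaged point estimate mentioned above is, at its simplest, a posterior-probability-weighted combination of the per-model estimates. A toy sketch with entirely hypothetical BMD estimates and model probabilities (the article's credible-bound construction is more involved than this):

```python
# Hypothetical per-model posterior BMD estimates (same dose units)
bmds = [1.8, 2.3, 2.0]
# Hypothetical posterior model probabilities (must sum to 1)
weights = [0.5, 0.3, 0.2]

# Bayesian-model-averaged point estimate of the BMD
bmd_bma = sum(w * b for w, b in zip(weights, bmds))
```

Averaging over models in this way is what lets the analysis sidestep committing to a single dose-response form when several fit comparably well.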

The inverse problem of parameter estimation from noisy observations is a major challenge in statistical inference for dynamical systems. Parameter estimation is usually carried out by optimizing some criterion function over the parameter space. Unless the optimization process starts with a good initial guess, the estimation may take an unreasonable amount of time, and may converge to local solutions, if at all. In this article, we introduce a novel technique for generating good initial guesses that can be used by any estimation method. We focus on the fairly general and often applied class of systems linear in the parameters. The new methodology bypasses numerical integration and can handle partially observed systems. We illustrate the performance of the method using simulations and apply it to real data.
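The abstract does not spell out the authors' exact construction, but one standard way to obtain an integration-free starting value for a system linear in the parameters is gradient matching: approximate the state derivative numerically from the data, then solve a least-squares problem in the parameters. An illustrative sketch for the toy system dx/dt = −θx with simulated noisy observations (all settings hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated noisy observations of dx/dt = -theta * x with theta = 0.8
theta_true = 0.8
t = np.linspace(0.0, 5.0, 101)
x = np.exp(-theta_true * t) + rng.normal(0.0, 0.002, t.size)

# Step 1: approximate dx/dt by finite differences -- no ODE integration
dxdt = np.gradient(x, t)

# Step 2: the system is linear in theta, so an initial guess is the
# least-squares solution of  dxdt ~ theta * (-x)
theta0 = np.linalg.lstsq((-x)[:, None], dxdt, rcond=None)[0][0]
```

A value like `theta0` is then handed to the actual estimation routine as its starting point, which is the role the proposed methodology plays for more general criterion-based estimators.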

We describe a simple, computationally efficient, permutation-based procedure for selecting the penalty parameter in LASSO-penalized regression. The procedure, permutation selection, is intended for applications where variable selection is the primary focus, and can be applied in a variety of structural settings, including that of generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of real biomedical data sets in which permutation selection is compared with selection based on the following: cross-validation (CV), the Bayesian information criterion (BIC), scaled sparse linear regression, and a selection method based on recently developed testing procedures for the LASSO.
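A minimal sketch of the permutation idea in the linear-regression setting (the article's exact rule may differ): permuting the response breaks its association with the predictors, so the smallest penalty at which the LASSO selects nothing under each permutation gives a null reference, and a high quantile of those values is taken as the selected penalty. All data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: 100 observations, 20 standardized predictors, 2 true signals
n, p = 100, 20
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
beta = np.zeros(p)
beta[:2] = 2.0
y = X @ beta + rng.normal(size=n)

def lambda_max(X, y):
    # Smallest penalty at which the LASSO solution is entirely zero
    yc = y - y.mean()
    return np.max(np.abs(X.T @ yc)) / len(y)

# Permutation selection: permute y to destroy its link with X, record the
# null all-zero penalty each time, then take a high quantile
lams = [lambda_max(X, rng.permutation(y)) for _ in range(200)]
lam_perm = np.quantile(lams, 0.95)
```

Because `lam_perm` is calibrated against permuted (null) data, it tends to admit only predictors whose association with the response exceeds what chance alone produces, which is why the procedure targets variable selection rather than prediction error.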

**EDITOR: TAESUNG PARK**

**Adaptive Designs for Sequential Treatment Allocation**

(Alessandro B. Antognini and Alessandra Giovagnoli)

K. C. Carriere

**Modeling to Inform Infectious Disease Control**

(Niels G. Becker)

Jin Kyung Park

**Data Analysis with Competing Risks and Intermediate States**

(Ronald B. Geskus)

Jinheum Kim

**Analysis of Categorical Data with R**

(Christopher R. Bilder and Thomas M. Loughin)

Taesung Park