We consider simple ordinal model-based probability effect measures for comparing distributions of two groups, adjusted for explanatory variables. An “ordinal superiority” measure summarizes the probability that an observation from one distribution falls above an independent observation from the other distribution, adjusted for explanatory variables in a model. The measure applies directly to normal linear models and to a normal latent variable model for ordinal response variables. It equals Φ(β/√2) for the corresponding ordinal model that applies a probit link function to cumulative multinomial probabilities, for standard normal cdf Φ and effect β that is the coefficient of the group indicator variable. For the more general latent variable model for ordinal responses that corresponds to a linear model with other possible error distributions and corresponding link functions for cumulative multinomial probabilities, the ordinal superiority measure equals exp(β)/[1 + exp(β)] with the log–log link and equals approximately exp(β/√2)/[1 + exp(β/√2)] with the logit link, where β is the group effect. Another ordinal superiority measure generalizes the difference of proportions from binary to ordinal responses. We also present related measures directly for ordinal models for the observed response that need not assume corresponding latent response models. We present confidence intervals for the measures and illustrate with an example.
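As a quick numerical check, the closed forms for the ordinal superiority measure can be evaluated directly. The sketch below (function names are illustrative, not from the article) assumes the probit-link form Φ(β/√2), the exact log–log form exp(β)/[1 + exp(β)], and the logit-link approximation exp(β/√2)/[1 + exp(β/√2)]:

```python
from math import erf, exp, sqrt

def norm_cdf(x):
    # Standard normal cdf expressed via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def superiority_probit(beta):
    # Probit-link ordinal model: measure equals Phi(beta / sqrt(2))
    return norm_cdf(beta / sqrt(2.0))

def superiority_loglog(beta):
    # Log-log link: measure equals exp(beta) / (1 + exp(beta))
    return exp(beta) / (1.0 + exp(beta))

def superiority_logit_approx(beta):
    # Logit link: approximately exp(beta/sqrt(2)) / (1 + exp(beta/sqrt(2)))
    z = beta / sqrt(2.0)
    return exp(z) / (1.0 + exp(z))

# A null group effect (beta = 0) gives probability 1/2 under every link
values_at_null = (superiority_probit(0.0), superiority_loglog(0.0),
                  superiority_logit_approx(0.0))
```

All three functions are increasing in β, so a positive group effect translates into a superiority probability above 1/2 under every link.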

Long-term follow-up is common in many medical investigations where the interest lies in predicting patients’ risks for a future adverse outcome using repeatedly measured predictors over time. A key quantity is the likelihood of developing an adverse outcome among individuals who survived up to time *s*, given their covariate information up to time *s*. Simple, yet reliable, methodology for updating the predicted risk of disease progression using longitudinal markers remains elusive. Two main approaches have been considered in the literature. One approach, based on joint modeling (JM) of the failure time and the longitudinal covariate process (Tsiatis and Davidian, 2004), derives the longitudinal predictive probability from the joint probability of a longitudinal marker and an event at a given time. A second approach, partly conditional (PC) modeling (Zheng and Heagerty, 2005), directly models the predictive probability conditional on survival up to a landmark time and the information accrued by that time. In this article, we propose new PC models for longitudinal prediction that are more flexible than joint modeling and improve prediction accuracy over existing PC models. We provide procedures for making inference about the future risk of an individual with longitudinal measures up to a given time. In addition, we conduct simulations to evaluate both the JM and PC approaches in order to provide practical guidance on modeling choices. We use standard measures of predictive accuracy adapted to our setting to explore the predictiveness of the two approaches. We illustrate the performance of the two approaches on a dataset from the End Stage Renal Disease Study (ESRDS).

We consider the problem of testing for a dose-related effect based on a candidate set of (typically nonlinear) dose-response models using likelihood-ratio tests. For the considered models this reduces to assessing whether the slope parameter in these nonlinear regression models is zero or not. A technical problem is that the null distribution (when the slope is zero) depends on non-identifiable parameters, so that standard asymptotic results on the distribution of the likelihood-ratio test no longer apply. Asymptotic solutions for this problem have been extensively discussed in the literature. The resulting approximations, however, are not of simple form and require simulation to calculate the asymptotic distribution. In addition, their appropriateness might be doubtful for small sample sizes. Direct simulation to approximate the null distribution is numerically unstable due to the non-identifiability of some parameters. In this article, we derive a numerical algorithm to approximate the exact distribution of the likelihood-ratio test under multiple models for normally distributed data. The algorithm uses methods from differential geometry and can be used to evaluate the distribution under the null hypothesis, but also allows for power and sample size calculations. We compare the proposed testing approach to the MCP-Mod methodology and alternative methods for testing for a dose-related trend in a dose-finding example data set and in simulations.

Frailty models are proposed here in the tumor dormancy framework, in order to account for possible unobservable dependence mechanisms in cancer studies where a non-negligible proportion of patients relapse years or decades after surgical removal of the primary tumor. Relapses do not seem to follow a memoryless process, since their timing distribution leads to multimodal hazards. From a biomedical perspective, this behavior may be explained by tumor dormancy: for some patients, microscopic tumor foci may remain asymptomatic for a prolonged time interval and, when they escape from dormancy, micrometastatic growth results in the appearance of clinical disease. The activation of the growth phase at different metastatic states would explain the occurrence of metastatic recurrences and mortality at different times (multimodal hazard). We propose a new frailty model that includes in the risk function a random source of heterogeneity (the frailty variable) affecting the components of the hazard function. Thus, the individual hazard rate is the product of a random frailty variable and the sum of basic hazard rates. In tumor dormancy, the basic hazard rates correspond to micrometastatic developments starting from different initial states. The frailty variable represents the heterogeneity among patients with respect to relapse, which might be related to unknown mechanisms that regulate tumor dormancy. We use our model to estimate overall survival in a large breast cancer dataset, showing how it improves understanding of the underlying biological process.

This article considers sieve estimation in the Cox model with an unknown regression structure based on right-censored data. We propose a semiparametric pursuit method to simultaneously identify and estimate linear and nonparametric covariate effects based on B-spline expansions through a penalized group selection method with concave penalties. We show that the estimators of the linear effects and the nonparametric component are consistent. Furthermore, we establish the asymptotic normality of the estimator of the linear effects. To compute the proposed estimators, we develop a modified blockwise majorization descent algorithm that is efficient and easy to implement. Simulation studies demonstrate that the proposed method performs well in finite sample situations. We also use the primary biliary cirrhosis data to illustrate its application.

Many diseases arise due to exposure to one of multiple possible pathogens. We consider the situation in which disease counts are available over time from a study region, along with a measure of clinical disease severity, for example, mild or severe. In addition, we suppose a subset of the cases are lab tested in order to determine the pathogen responsible for disease. In such a context, we focus interest on modeling the probabilities of disease incidence given pathogen type. The time course of these probabilities is of great interest, as is the association with time-varying covariates such as meteorological variables. In this setup, a natural Bayesian approach would be based on imputation of the unsampled pathogen information using Markov chain Monte Carlo, but this is computationally challenging. We describe a practical approach to inference that is easy to implement. We use an empirical Bayes procedure in a first step to estimate summary statistics. We then treat these summary statistics as the observed data and develop a Bayesian generalized additive model. We analyze data on hand, foot, and mouth disease (HFMD) in China in which there are two pathogens of primary interest, enterovirus 71 (EV71) and Coxsackievirus A16 (CA16). We find that both EV71 and CA16 are associated with temperature, relative humidity, and wind speed, with reasonably similar functional forms for both pathogens. The important issue of confounding by time is modeled using a penalized B-spline model with a random effects representation. The level of smoothing is addressed by a careful choice of the prior on the tuning variance.

In this article, we propose an association model to estimate the penetrance (risk) of successive cancers in the presence of competing risks. The association between the successive events is modeled via a copula, and a proportional hazards model is specified for each competing event. This work is motivated by the analysis of successive cancers in people with Lynch Syndrome in the presence of competing risks. The proposed inference procedure is adapted to handle missing genetic covariates and the selection bias induced by the data collection protocol. The performance of the proposed estimation procedure is evaluated by simulations, and its use is illustrated with data from the Colon Cancer Family Registry (Colon CFR).

Motivated by a study of molecular differences among breast cancer patients, we develop a Bayesian latent factor zero-inflated Poisson (LZIP) model for the analysis of correlated zero-inflated counts. The responses are modeled as independent zero-inflated Poisson distributions conditional on a set of subject-specific latent factors. For each outcome, we express the LZIP model as a function of two discrete random variables: the first captures the propensity to be in an underlying “at-risk” state, while the second represents the count response conditional on being at risk. The latent factors and loadings are assigned conditionally conjugate gamma priors that accommodate overdispersion and dependence among the outcomes. For posterior computation, we propose an efficient data-augmentation algorithm that relies primarily on easily sampled Gibbs steps. We conduct simulation studies to investigate both the inferential properties of the model and the computational capabilities of the proposed sampling algorithm. We apply the method to an analysis of breast cancer genomics data from The Cancer Genome Atlas.
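The two-part latent representation described above, an “at-risk” indicator followed by a count given the at-risk state, can be sketched as a sampler. This is a generic zero-inflated Poisson illustration under assumed parameter names, not the authors' LZIP implementation:

```python
import math
import random

def sample_zip(pi_at_risk, lam, rng):
    # Latent two-part representation of a zero-inflated Poisson:
    # first draw the "at-risk" indicator, then a Poisson count if at risk.
    at_risk = rng.random() < pi_at_risk
    if not at_risk:
        return 0  # structural zero: subject not in the at-risk state
    # Knuth's method for a Poisson draw (adequate for moderate lambda)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= L:
            return k - 1

rng = random.Random(0)
draws = [sample_zip(0.6, 3.0, rng) for _ in range(20000)]
```

The marginal mean is π·λ (here 1.8), and the zero frequency exceeds the Poisson zero probability, which is the overdispersion-at-zero the model targets.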

The analysis of multiple outcomes is becoming increasingly common in modern biomedical studies. It is well-known that joint statistical models for multiple outcomes are more flexible and more powerful than fitting a separate model for each outcome; they yield more powerful tests of exposure or treatment effects by taking into account the dependence among outcomes and pooling evidence across outcomes. It is, however, unlikely that all outcomes are related to the same subset of covariates. Therefore, there is interest in identifying exposures or treatments associated with particular outcomes, which we term outcome-specific variable selection. In this work, we propose a variable selection approach for multivariate normal responses that incorporates not only information on the mean model, but also information on the variance–covariance structure of the outcomes. The approach effectively leverages evidence from all correlated outcomes to estimate the effect of a particular covariate on a given outcome. To implement this strategy, we develop a Bayesian method that builds a multivariate prior for the variable selection indicators based on the variance–covariance of the outcomes. We show via simulation that the proposed variable selection strategy can boost power to detect subtle effects without increasing the probability of false discoveries. We apply the approach to the Normative Aging Study (NAS) epigenetic data and identify a subset of five genes in the asthma pathway for which gene-specific DNA methylations are associated with exposures to either black carbon, a marker of traffic pollution, or sulfate, a marker of particles generated by power plants.

Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence, the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in an R-package HDtest and are available on CRAN.

In this article, the existing concept of reversed percentile residual life, or percentile inactivity time, is recast to show that it can be used for routine analysis of time-to-event data under right censoring to summarize “life lost,” which offers several advantages over existing methods for survival analysis. An estimating equation approach is adopted to estimate the variance of the quantile estimator while avoiding estimation of the probability density function of the underlying time-to-event distribution. Additionally, a *K*-sample test statistic is proposed to test the ratio of quantile lost lifespans. Simulation studies are performed to assess finite-sample properties of the proposed *K*-sample statistic in terms of coverage probability and power. The proposed method is illustrated with a real data example from a breast cancer study.

It is traditionally assumed that the random effects in mixed models follow a multivariate normal distribution, making likelihood-based inference more feasible theoretically and computationally. However, this assumption does not necessarily hold in practice, which may lead to biased and unreliable results. We introduce a novel diagnostic test, based on the so-called gradient function proposed by Verbeke and Molenberghs (2013), to assess the random-effects distribution. We establish asymptotic properties of our test and show that, under a correctly specified model, the proposed test statistic converges to a weighted sum of independent chi-squared random variables, each with one degree of freedom. The weights, which are eigenvalues of a square matrix, can be easily calculated. We also develop a parametric bootstrap algorithm for small samples. Our strategy can be used to check the adequacy of any distribution for random effects in a wide class of mixed models, including linear mixed models, generalized linear mixed models, and nonlinear mixed models, with univariate as well as multivariate random effects. Both the asymptotic and bootstrap proposals are evaluated via simulations and a real data analysis of a randomized multicenter study on toenail dermatophyte onychomycosis.
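The limiting weighted sum of chi-squared variables has no simple closed-form tail probability, but once the eigenvalue weights are in hand a Monte Carlo p-value is straightforward. The following generic sketch (an assumed helper, not the authors' procedure) draws each chi-squared-with-one-degree-of-freedom variable as a squared standard normal:

```python
import random

def weighted_chisq_pvalue(stat, weights, n_sim=20000, seed=1):
    # Monte Carlo tail probability P(sum_j lambda_j * chi^2_1 >= stat)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        draw = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if draw >= stat:
            exceed += 1
    return exceed / n_sim
```

With a single unit weight the statistic is an ordinary chi-squared with one degree of freedom, so the 3.841 critical value recovers a p-value near 0.05, a useful sanity check before plugging in estimated eigenvalues.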

Various model selection methods can be applied to seek sparse subsets of the covariates to explain the response of interest in bioinformatics. While such methods often offer very helpful predictive performance, their selections of the covariates may be much less trustworthy. Indeed, when the number of covariates is large, the selections can be highly unstable, even under a slight change of the data. This casts serious doubt on the reproducibility of the identified variables. For a sound scientific understanding of the regression relationship, methods need to be developed to find the most important covariates, those with a higher chance of being confirmed in future studies. Such a method, based on variable selection deviation, is proposed and evaluated in this work.

Distortion product otoacoustic emissions (DPOAE) testing is a promising alternative to behavioral hearing tests and auditory brainstem response testing of pediatric cancer patients. The central goal of this study is to assess whether significant changes in the DPOAE frequency/emissions curve (DP-gram) occur in pediatric patients in a test-retest scenario. This is accomplished through the construction of normal reference charts, or credible regions, in which DP-gram differences lie, as well as contour probabilities that measure how abnormal (or, in a certain sense, rare) a test-retest difference is. A challenge is that the data were collected over varying frequencies, at different time points from baseline, and on one or both ears. A hierarchical structural equation Gaussian process model is proposed to handle the different sources of correlation in the emissions measurements, wherein both subject-specific random effects and the variance components governing the smoothness and variability of each child's Gaussian process are coupled together.

We use a nonparametric mixture model to estimate the size of a population from multiple lists in which both the individual effects and the list effects are allowed to vary. We propose a lower bound on the population size that admits an analytic expression and can be estimated without model fitting. The asymptotic normality of the estimator is established. Both the estimator itself and the estimator of the estimable bound of its variance are adjusted, and the adjusted versions are shown to be asymptotically unbiased. Simulation experiments are performed to assess the proposed approach, and real applications are studied.

In this article, we propose a new statistical method—MutRSeq—for detecting differentially expressed single nucleotide variants (SNVs) based on RNA-seq data. Specifically, we focus on nonsynonymous mutations and employ a hierarchical likelihood approach to jointly model observed mutation events as well as read count measurements from RNA-seq experiments. We then introduce a likelihood ratio-based test statistic, which detects changes not only in overall expression levels, but also in allele-specific expression patterns. In addition, the method can jointly test multiple mutations in one gene/pathway. Simulation studies suggest that the proposed method achieves better power than competing methods under a range of settings. Finally, we apply the method to a breast cancer data set and identify genes with nonsynonymous mutations that are differentially expressed between triple negative breast cancer tumors and other subtypes of breast cancer tumors.

In the competing risks setup, the data for each subject consist of the event time, censoring indicator, and event category. However, sometimes the information about the event category can be missing, as, for example, when the date of death is known but the cause of death is not available. In such situations, treating subjects with a missing event category as censored leads to underestimation of the hazard functions. We suggest nonparametric estimators for the cumulative cause-specific hazards and the cumulative incidence functions that use the Nadaraya–Watson estimator to obtain the contribution of an event with missing category to each of the cause-specific hazards. We derive the properties of the proposed estimators. An optimal bandwidth is determined that minimizes the mean integrated squared errors of the proposed estimators over time. The methodology is illustrated using data on lung infections in patients from the United States Cystic Fibrosis Foundation Patient Registry.

When functional data come as multiple curves per subject, characterizing the source of variation is not a trivial problem. The problem becomes even more complex when there is phase variation in addition to amplitude variation. We consider the clustering problem for multivariate functional data with phase variation among the functional variables. We propose a conditional subject-specific warping framework in order to extract relevant features for clustering. Using multivariate growth curves of various parts of the body as a motivating example, we demonstrate the effectiveness of the proposed approach. The resulting clusters contain individuals who show different relative growth patterns among different parts of the body.

Dose–response modeling in areas such as toxicology is often conducted using a parametric approach. While estimation of parameters is usually one of the goals, often the main aim of the study is the estimation of quantities derived from the parameters, such as the ED50 dose. From the view of statistical optimal design theory such an objective corresponds to a *c*-optimal design criterion. Unfortunately, *c*-optimal designs often create practical problems, and furthermore commonly do not allow actual estimation of the parameters. It is therefore useful to consider alternative designs which show good *c*-performance, while still being applicable in practice and allowing reasonably good general parameter estimation. In effect, using optimal design terminology this means that a reasonable performance regarding the *D*-criterion is expected as well. In this article, we propose several approaches to the task of combining *c*- and *D*-efficient designs, such as using mixed information functions or setting minimum requirements regarding either *c*- or *D*-efficiency, and show how to algorithmically determine optimal designs in each case. We apply all approaches to a standard situation from toxicology, and obtain a much better balance between *c*- and *D*-performance. Next, we investigate how to adapt the designs to different parameter values. Finally, we show that the methodology used here is not just limited to the combination of *c*- and *D*-designs, but can also be used to handle more general constraint situations such as limits on the cost of an experiment.
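For intuition, the *c*- and *D*-criteria can be computed for a toy two-parameter linear model with regression function f(x) = (1, x)ᵀ; this sketch is illustrative only and does not reproduce the article's nonlinear dose–response models or its combined criteria:

```python
import numpy as np

def info_matrix(weights, xs):
    # Information matrix M(xi) = sum_i w_i f(x_i) f(x_i)^T for f(x) = (1, x)^T
    M = np.zeros((2, 2))
    for w, x in zip(weights, xs):
        f = np.array([1.0, x])
        M += w * np.outer(f, f)
    return M

def d_criterion(M):
    # D-criterion: det(M)^(1/p), with p the number of parameters (larger is better)
    return np.linalg.det(M) ** (1.0 / M.shape[0])

def c_criterion(M, c):
    # c-criterion: variance c^T M^{-1} c of the estimate of c^T theta
    # (smaller is better)
    return float(c @ np.linalg.solve(M, c))

# Equal weight on the endpoints of [0, 1] versus a uniform three-point design
M2 = info_matrix([0.5, 0.5], [0.0, 1.0])
M3 = info_matrix([1 / 3] * 3, [0.0, 0.5, 1.0])
```

For simple linear regression on [0, 1] the two-point endpoint design is D-optimal, so its D-value exceeds that of the uniform three-point design; trading such D-performance against the c-criterion for a target contrast is exactly the balance the combined approaches negotiate.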

A gene may be controlled by distal enhancers and repressors, not merely by regulatory elements in its promoter. Spatial organization of chromosomes is the mechanism that brings genes and their distal regulatory elements into close proximity. Recent molecular techniques, coupled with Next Generation Sequencing (NGS) technology, enable genome-wide detection of physical contacts between distant genomic loci. In particular, Hi-C is an NGS-aided assay for the study of genome-wide spatial interactions. The availability of such data makes it possible to reconstruct the underlying three-dimensional (3D) spatial chromatin structure. In this article, we present the Poisson Random effect Architecture Model (PRAM) for such an inference. The main feature of PRAM that separates it from previous methods is that it addresses the issue of over-dispersion and takes correlations among contact counts into consideration, thereby achieving greater consistency with observed data. PRAM was applied to Hi-C data to illustrate its performance and to compare the predicted distances with those measured by a Fluorescence In Situ Hybridization (FISH) validation experiment. Further, PRAM was compared to other methods in the literature based on both real and simulated data.

Dynamic treatment regimes (DTRs) are sequential decision rules that focus simultaneously on treatment individualization and adaptation over time. To directly identify the optimal DTR in a multi-stage multi-treatment setting, we propose a dynamic statistical learning method, adaptive contrast weighted learning. We develop semiparametric regression-based contrasts with the adaptation of treatment effect ordering for each patient at each stage, and the adaptive contrasts simplify the problem of optimization with multiple treatment comparisons to a weighted classification problem that can be solved by existing machine learning techniques. The algorithm is implemented recursively using backward induction. By combining doubly robust semiparametric regression estimators with machine learning algorithms, the proposed method is robust and efficient for the identification of the optimal DTR, as shown in the simulation studies. We illustrate our method using observational data on esophageal cancer.

Treatments are frequently evaluated in terms of their effect on patient survival. In settings where randomization of treatment is not feasible, observational data are employed, necessitating correction for covariate imbalances. Treatments are usually compared using a hazard ratio. Most existing methods that quantify the treatment effect through the survival function are applicable to treatments assigned at time 0. In the data structure of our interest, subjects typically begin follow-up untreated; time-until-treatment and the pretreatment death hazard are both heavily influenced by longitudinal covariates; and subjects may experience periods of treatment ineligibility. We propose semiparametric methods for estimating the average difference in restricted mean survival time attributable to a time-dependent treatment, i.e., the average effect of treatment among the treated, under current treatment assignment patterns. The pre- and posttreatment models are partly conditional, in that they use the covariate history up to the time of treatment. The pretreatment model is estimated through recently developed landmark analysis methods. For each treated patient, fitted pre- and posttreatment survival curves are projected out, then averaged in a manner which accounts for the censoring of treatment times. Asymptotic properties are derived and evaluated through simulation. The proposed methods are applied to liver transplant data in order to estimate the effect of liver transplantation on survival among transplant recipients under current practice patterns.
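Restricted mean survival time is the area under the survival curve up to a horizon τ. A minimal sketch for a right-continuous step survival curve (an illustrative helper, not the authors' estimator) is:

```python
def rmst(times, surv, tau):
    """Area under a right-continuous step survival curve up to tau.

    times: increasing event times at which the curve drops;
    surv:  survival probability just after each time;
    the curve is 1 before the first event time.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)    # final rectangle up to the horizon
    return area
```

Applying this to fitted pre- and posttreatment curves for the same patient and differencing the two areas gives the per-patient building block that the proposed methods average over the treated.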

The prior distribution is a key ingredient in Bayesian inference. Prior information on regression coefficients may come from different sources and may or may not be in conflict with the observed data. Various methods have been proposed to quantify a potential prior-data conflict, such as Box's *p*-value. However, there are no clear recommendations on how to react to possible prior-data conflict in generalized regression models. To address this deficiency, we propose to adaptively weight a prespecified multivariate normal prior distribution on the regression coefficients. To this end, we relate empirical Bayes estimates of the prior weight to Box's *p*-value and propose alternative fully Bayesian approaches. Prior weighting can be done for the joint prior distribution of the regression coefficients or—under prior independence—separately for prespecified blocks of regression coefficients. We outline how the proposed methodology can be implemented using integrated nested Laplace approximations (INLA) and illustrate the applicability with a Bayesian logistic regression model for data from a cross-sectional study. We also provide a simulation study that shows excellent performance of our approach in the case of prior misspecification in terms of root mean squared error and coverage. Supplementary Materials provide details on the software implementation and code, as well as another application to binary longitudinal data from a randomized clinical trial using a Bayesian generalized linear mixed model.

Meta-analysis has become a widely used tool to combine results from independent studies. The collected studies are homogeneous if they share a common underlying true effect size; otherwise, they are heterogeneous. A fixed-effect model is customarily used when the studies are deemed homogeneous, while a random-effects model is used for heterogeneous studies. Assessing heterogeneity in meta-analysis is critical for model selection and decision making. Ideally, if heterogeneity is present, it should permeate the entire collection of studies, instead of being limited to a small number of outlying studies. Outliers can have great impact on conventional measures of heterogeneity and the conclusions of a meta-analysis. However, no widely accepted guidelines exist for handling outliers. This article proposes several new heterogeneity measures. In the presence of outliers, the proposed measures are less affected than the conventional ones. The performance of the proposed and conventional heterogeneity measures is compared theoretically, by studying their asymptotic properties, and empirically, using simulations and case studies.
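Conventional heterogeneity measures include Cochran's Q and Higgins' I²; a minimal sketch of both (standard formulas, not the article's proposed measures) shows how a single outlying study can inflate them:

```python
def cochran_Q(effects, variances):
    # Cochran's Q: inverse-variance-weighted squared deviations
    # from the fixed-effect estimate
    w = [1.0 / v for v in variances]
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))

def i_squared(effects, variances):
    # Higgins' I^2: share of total variation attributable to heterogeneity
    Q, df = cochran_Q(effects, variances), len(effects) - 1
    return max(0.0, (Q - df) / Q) if Q > 0 else 0.0

# Five studies with identical effects are perfectly homogeneous;
# replacing one effect with an outlier drives I^2 close to 0.9
homogeneous = i_squared([0.1] * 5, [0.04] * 5)
with_outlier = i_squared([0.1, 0.1, 0.1, 0.1, 1.5], [0.04] * 5)
```

Four studies agreeing exactly plus one outlier yields a very large I², even though heterogeneity does not permeate the collection, which is precisely the sensitivity the proposed measures aim to reduce.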

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees—features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which includes stably and reproducibly identifying biclusterings, on simulated and real microarray data.

Many new experimental treatments benefit only a subset of the population. Identifying the baseline covariate profiles of patients who benefit from such a treatment, rather than determining whether or not the treatment has a population-level effect, can substantially lessen the risk in undertaking a clinical trial and expose fewer patients to treatments that do not benefit them. The standard analyses for identifying patient subgroups that benefit from an experimental treatment either do not account for multiplicity, or focus on testing for the presence of treatment–covariate interactions rather than the resulting individualized treatment effects. We propose a Bayesian *credible subgroups* method to identify two bounding subgroups for the benefiting subgroup: one for which it is likely that all members simultaneously have a treatment effect exceeding a specified threshold, and another for which it is likely that no members do. We examine frequentist properties of the credible subgroups method via simulations and illustrate the approach using data from an Alzheimer's disease treatment trial. We conclude with a discussion of the advantages and limitations of this approach to identifying patients for whom the treatment is beneficial.

Cocaine addiction is chronic and persistent, and has become a major social and health problem in many countries. Existing studies have shown that cocaine addicts often undergo episodic periods of addiction to, moderate dependence on, or swearing off cocaine. Given its reversible feature, cocaine use can be formulated as a stochastic process that transitions from one state to another, while the impacts of various factors, such as treatment received and individuals’ psychological problems, on cocaine use may vary across states. This article develops a hidden Markov latent variable model to study multivariate longitudinal data concerning cocaine use from a California Civil Addict Program. The proposed model generalizes conventional latent variable models to allow bidirectional transition between cocaine-addiction states, and conventional hidden Markov models to allow latent variables and their dynamic interrelationship. We develop a maximum-likelihood approach, along with a Monte Carlo expectation conditional maximization (MCECM) algorithm, to conduct parameter estimation. The asymptotic properties of the parameter estimates and statistics for testing the heterogeneity of model parameters are investigated. The finite sample performance of the proposed methodology is demonstrated by simulation studies. The application to the cocaine use study provides insights into the prevention of cocaine use.
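Likelihood evaluation in hidden Markov models typically relies on the forward algorithm. The following generic sketch (a textbook routine, not the authors' MCECM procedure) computes the log-likelihood from an initial distribution, a transition matrix, and per-time emission log-likelihoods, with scaling for numerical stability:

```python
import numpy as np

def forward_loglik(pi0, A, emit_ll):
    """Log-likelihood of an observation sequence under a hidden Markov chain.

    pi0:     initial state distribution, shape (S,)
    A:       row-stochastic transition matrix, shape (S, S)
    emit_ll: per-time emission log-likelihoods, shape (T, S)
    """
    alpha = pi0 * np.exp(emit_ll[0])
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()           # rescale to avoid underflow
    for t in range(1, emit_ll.shape[0]):
        alpha = (alpha @ A) * np.exp(emit_ll[t])
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik
```

In an EM-style fit, this forward recursion (paired with a backward pass) supplies the state posteriors needed in the E-step; here it serves only to illustrate how transitions between states enter the likelihood.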

Joint modeling is increasingly popular for investigating the relationship between longitudinal and time-to-event data. However, numerical complexity often restricts this approach to linear models for the longitudinal part. Here, we use a novel development of the Stochastic-Approximation Expectation Maximization algorithm that allows joint models defined by nonlinear mixed-effect models. In the context of chemotherapy in metastatic prostate cancer, we show that a variety of patterns for the Prostate Specific Antigen (PSA) kinetics can be captured by using a mechanistic model defined by nonlinear ordinary differential equations. The mechanistic model predicts that unobservable biological quantities, such as the numbers of treatment-sensitive and treatment-resistant cells, may have a larger impact on survival than the PSA value itself. This suggests that mechanistic joint models could constitute a relevant approach to evaluate the efficacy of treatment and to improve the prediction of survival in patients.

Understanding how aquatic species grow is fundamental in fisheries because stock assessment often relies on growth dependent statistical models. Length-frequency-based methods become important when more applicable data for growth model estimation are either not available or very expensive. In this article, we develop a new framework for growth estimation from length-frequency data using a generalized von Bertalanffy growth model (VBGM) framework that allows for time-dependent covariates to be incorporated. A finite mixture of normal distributions is used to model the length-frequency cohorts of each month with the means constrained to follow a VBGM. The variances of the finite mixture components are constrained to be a function of mean length, reducing the number of parameters and allowing for an estimate of the variance at any length. To optimize the likelihood, we use a minorization–maximization (MM) algorithm with a Nelder–Mead sub-step. This work was motivated by the decline in catches of the blue swimmer crab (BSC) (*Portunus armatus*) off the east coast of Queensland, Australia. We test the method with a simulation study and then apply it to the BSC fishery data.
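The core of the estimation is a normal-mixture likelihood whose component means follow the VBGM and whose standard deviations are tied to mean length. A hedged sketch (the linear standard-deviation function and all parameter values below are hypothetical):

```python
import numpy as np

def vbgm_mean(age, L_inf, k, t0):
    """von Bertalanffy growth: expected length at a given age."""
    return L_inf * (1.0 - np.exp(-k * (age - t0)))

def mixture_loglik(lengths, ages, props, L_inf, k, t0, c0, c1):
    """Log-likelihood of observed lengths under a finite normal mixture whose
    component means follow the VBGM and whose standard deviations are a
    linear function of mean length (one hypothetical choice of variance
    function, reducing the parameter count as in the abstract)."""
    mu = vbgm_mean(ages, L_inf, k, t0)     # one mean per cohort
    sd = c0 + c1 * mu                      # spread tied to mean length
    dens = props * np.exp(-0.5 * ((lengths[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

ages = np.array([0.5, 1.5, 2.5])               # cohort ages in years (illustrative)
props = np.array([0.5, 0.3, 0.2])              # mixing proportions
lengths = np.array([60.0, 95.0, 120.0, 130.0]) # observed lengths, mm
ll = mixture_loglik(lengths, ages, props, 150.0, 0.8, 0.0, 2.0, 0.05)
```

Optimizing this likelihood over the growth and variance parameters (via MM with a Nelder–Mead sub-step in the article) is what the sketch omits.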

Our motivating application stems from surveys of natural populations and is characterized by large spatial heterogeneity in the counts, which makes parametric approaches to modeling local animal abundance too restrictive. We adopt a Bayesian nonparametric approach based on mixture models and innovate with respect to the popular Dirichlet process mixture of Poisson kernels by increasing the model flexibility at the level of both the kernel and the nonparametric mixing measure. This allows us to derive accurate and robust estimates of the distribution of local animal abundance and of the corresponding clusters. The application and a simulation study for different scenarios also yield some general methodological implications. Adding flexibility solely at the level of the mixing measure does not improve inferences, since its impact is severely limited by the rigidity of the Poisson kernel, with considerable consequences in terms of bias. However, once a kernel more flexible than the Poisson is chosen, inferences can be robustified by choosing a prior more general than the Dirichlet process. Therefore, to improve the performance of Bayesian nonparametric mixtures for count data, one has to enrich the model simultaneously at both levels, the kernel and the mixing measure.

Joint models are used in ageing studies to investigate the association between longitudinal markers and a time-to-event, and have been extended to multiple markers and/or competing risks. The competing risk of death must be considered in the elderly because death and dementia have common risk factors. Moreover, in cohort studies, time-to-dementia is interval-censored since dementia is assessed intermittently, so subjects can develop dementia and die between two visits without ever being diagnosed. To study predementia cognitive decline, we propose a joint latent class model combining a (possibly multivariate) mixed model and an illness–death model handling both interval censoring (by accounting for a possible unobserved transition to dementia) and semi-competing risks. Parameters are estimated by maximum likelihood, accounting for the interval censoring. The correlation between the marker and the times-to-events is captured by latent classes, homogeneous sub-groups with specific risks of death and dementia and specific profiles of cognitive decline. We propose Markovian and semi-Markovian versions. Both approaches are compared to a joint latent-class model for competing risks through a simulation study, and applied to a prospective cohort study of cerebral and functional ageing to distinguish different profiles of cognitive decline associated with risks of dementia and death. The comparison highlights that, among subjects with dementia, mortality depends more on age than on duration of dementia. The model distinguishes the so-called terminal predeath decline (among healthy subjects) from the predementia decline.

The log-rank test is widely used to compare two survival distributions in a randomized clinical trial, while partial likelihood (Cox, 1975) is the method of choice for making inference about the hazard ratio under the Cox (1972) proportional hazards model. The Wald 95% confidence interval of the hazard ratio may include the null value of 1 when the *p*-value of the log-rank test is less than 0.05. Peto et al. (1977) provided an estimator for the hazard ratio based on the log-rank statistic; the corresponding 95% confidence interval excludes the null value of 1 if and only if the *p*-value of the log-rank test is less than 0.05. However, Peto's estimator is not consistent, and the corresponding confidence interval does not have correct coverage probability. In this article, we construct the confidence interval by inverting the score test under the (possibly stratified) Cox model, and we modify the variance estimator such that the resulting score test for the null hypothesis of no treatment difference is identical to the log-rank test in the possible presence of ties. Like Peto's method, the proposed confidence interval excludes the null value if and only if the log-rank test is significant. Unlike Peto's method, however, this interval has correct coverage probability. An added benefit of the proposed confidence interval is that it tends to be more accurate and narrower than the Wald confidence interval. We demonstrate the advantages of the proposed method through extensive simulation studies and a colon cancer study.
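For reference, the log-rank statistic around which the interval inversion is built can be computed directly. The sketch below uses the hypergeometric variance so that tied event times are handled; all data are made up:

```python
import numpy as np

def logrank(time, event, group):
    """Two-sample log-rank statistic (chi-squared form, 1 df).
    `group` is 0/1; tied event times are handled with the
    hypergeometric variance."""
    time, event, group = map(np.asarray, (time, event, group))
    O_minus_E, V = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()          # deaths at t (all arms)
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O_minus_E += d1 - d * n1 / n                    # observed minus expected
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return O_minus_E ** 2 / V

# Small hypothetical dataset: (time, event indicator, treatment arm)
time = [2, 3, 3, 5, 6, 7, 8, 9]
event = [1, 1, 1, 1, 0, 1, 1, 0]
group = [0, 0, 1, 0, 1, 1, 1, 0]
chi2 = logrank(time, event, group)
```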

Interval-censored failure time data occur in many fields, such as demography, economics, medical research, and reliability, and many inference procedures for them have been developed (Sun, 2006; Chen, Sun, and Peace, 2012). However, most existing approaches assume that the mechanism that yields interval censoring is independent of the failure time of interest, which may not hold in practice (Zhang et al., 2007; Ma, Hu, and Sun, 2015). In this article, we consider regression analysis of case *K* interval-censored failure time data when the censoring mechanism may be related to the failure time of interest. For this problem, an estimated sieve maximum-likelihood approach is proposed for data arising from the proportional hazards frailty model, and a two-step procedure is presented for estimation. In addition, the asymptotic properties of the proposed estimators of the regression parameters are established, and an extensive simulation study suggests that the method works well. Finally, we apply the method to a set of real interval-censored data that motivated this study.

Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. This problem becomes even more difficult due to complications in modeling unknown interaction terms among high-dimensional variables. There is currently no variable selection method to overcome these limitations. Hence, in this article we propose a variable selection approach that is developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches, including the linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of the nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as a function of the sparse scale parameters of the kernel machine, so as to recover the sparsity of input variables whose relevance to the response is measured by the scale parameters. We also provide the asymptotic properties of our approach. We show that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed. A resampling procedure for our variable selection methodology is also proposed to improve power.

The evaluation of cure fractions in oncology research under the well known cure rate model has attracted considerable attention in the literature, but most of the existing testing procedures have relied on restrictive assumptions. A common assumption has been to restrict the cure fraction to a constant under alternatives to homogeneity, thereby neglecting any information from covariates. This article extends the literature by developing a score-based statistic that incorporates covariate information to detect cure fractions, with the existing testing procedure serving as a special case. A complication of this extension, however, is that the implied hypotheses are not typical and standard regularity conditions to conduct the test may not even hold. Using empirical processes arguments, we construct a sup-score test statistic for cure fractions and establish its limiting null distribution as a functional of mixtures of chi-square processes. In practice, we suggest a simple resampling procedure to approximate this limiting distribution. Our simulation results show that the proposed test can greatly improve efficiency over tests that neglect the heterogeneity of the cure fraction under the alternative. The practical utility of the methodology is illustrated using ovarian cancer survival data with long-term follow-up from the surveillance, epidemiology, and end results registry.

Recently, massive functional data have been widely collected over space across sets of grid points in various imaging studies. It is of interest to correlate functional data with various clinical variables, such as age and gender, in order to address scientific questions of interest. The aim of this article is to develop a single-index varying coefficient (SIVC) model for establishing a varying association between functional responses (e.g., images) and a set of covariates. The model combines features of both varying-coefficient and single-index models. An estimation procedure is developed to estimate the varying coefficient functions, the index function, and the covariance function of individual functions. The optimal integration of information across different grid points is systematically delineated, and the asymptotic properties (e.g., consistency and convergence rate) of all estimators are examined. Simulation studies are conducted to assess the finite-sample performance of the proposed estimation procedure. Furthermore, our analysis of a real white matter tract dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study confirms the advantage and accuracy of the SIVC model over the popular varying coefficient model.

We consider the problem of selecting covariates in a spatial regression model when the response is binary. Penalized likelihood-based approaches have proved effective for simultaneous variable selection and estimation. In the context of a spatially dependent binary variable, a uniquely interpretable likelihood is not available; rather, a quasi-likelihood may be more suitable. We develop a penalized quasi-likelihood with spatial dependence for simultaneous variable selection and parameter estimation, along with an efficient computational algorithm. The theoretical properties, including asymptotic normality and consistency, are studied under an increasing-domain asymptotics framework. An extensive simulation study is conducted to validate the methodology. Real data examples are provided for illustration and applicability. Although we do not provide theoretical justification, we also investigate the empirical performance of the proposed penalized quasi-likelihood approach for spatial count data, to explore the suitability of this method for a general exponential family of distributions.

For the classical, homoscedastic measurement error model, moment reconstruction (Freedman et al., 2004, 2008) and moment-adjusted imputation (Thomas et al., 2011) are appealing, computationally simple imputation-like methods for general model fitting. Like classical regression calibration, the idea is to replace the unobserved variable subject to measurement error with a proxy that can be used in a variety of analyses. Moment reconstruction and moment-adjusted imputation differ from regression calibration in that they attempt to match multiple features of the latent variable, and also to match some of the latent variable's relationships with the response and additional covariates. In this note, we consider a problem where true exposure is generated by a complex, nonlinear random effects modeling process, and develop analogues of moment reconstruction and moment-adjusted imputation for this case. This general model includes classical measurement errors, Berkson measurement errors, mixtures of Berkson and classical errors, and problems that are not measurement error problems at all. The methods are illustrated using the National Institutes of Health–AARP Diet and Health Study, where the latent variable is a dietary pattern score called the Healthy Eating Index-2005. We also show how our general model includes methods used in radiation epidemiology as a special case. Simulations are used to illustrate the methods.
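For the classical, homoscedastic special case, moment reconstruction has a simple closed form: within each level of the response, the mismeasured W is rescaled so that the proxy matches the first two conditional moments of the true exposure. A sketch under an assumed known error variance (simulated data, binary response):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate classical measurement error: W = X + U, with a binary response Y
n = 2000
x = rng.normal(0.0, 1.0, n)
y = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
sigma_u = 0.8                               # assumed-known error standard deviation
w = x + rng.normal(0.0, sigma_u, n)

# Moment reconstruction (Freedman et al., 2004): within each level of Y,
# shrink W toward its conditional mean so the proxy matches the first two
# conditional moments of X given Y.
x_mr = np.empty(n)
for level in (0, 1):
    idx = y == level
    m, v_w = w[idx].mean(), w[idx].var()
    v_x = max(v_w - sigma_u ** 2, 1e-8)     # Var(X|Y) = Var(W|Y) - Var(U)
    x_mr[idx] = m + np.sqrt(v_x / v_w) * (w[idx] - m)
```

The rescaling preserves the conditional mean exactly and, by construction, deflates the conditional variance by exactly the error variance.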

The peptide microarray immunoassay simultaneously screens sample serum against thousands of peptides, determining the presence of antibodies bound to array probes. Peptide microarrays tiling immunogenic regions of pathogens (e.g., envelope proteins of a virus) are an important high throughput tool for querying and mapping antibody binding. Because of the assay's many steps, from probe synthesis to incubation, peptide microarray data can be noisy with extreme outliers. In addition, subjects may produce different antibody profiles in response to an identical vaccine stimulus or infection, due to variability among subjects’ immune systems. We present a robust Bayesian hierarchical model for peptide microarray experiments, pepBayes, to estimate the probability of antibody response for each subject/peptide combination. Heavy-tailed error distributions accommodate outliers and extreme responses, and tailored random effect terms automatically incorporate technical effects prevalent in the assay. We apply our model to two vaccine trial data sets to demonstrate model performance. Our approach enjoys high sensitivity and specificity when detecting vaccine induced antibody responses. A simulation study shows an adaptive thresholding classification method has appropriate false discovery rate control with high sensitivity, and receiver operating characteristics generated on vaccine trial data suggest that pepBayes clearly separates responses from non-responses.

In many classical estimation problems, the parameter space has a boundary. In most cases, the standard asymptotic properties of the estimator do not hold when some of the underlying true parameters lie on the boundary. However, without knowledge of the true parameter values, confidence intervals constructed assuming that the parameters lie in the interior are generally over-conservative. A penalized estimation method is proposed in this article to address this issue. An adaptive lasso procedure is employed to shrink the parameters to the boundary, yielding oracle inference that adapts to whether or not the true parameters are on the boundary. When the true parameters are on the boundary, the inference is equivalent to that which would be achieved with a priori knowledge of the boundary; when they lie in the interior, the inference is equivalent to standard inference in the interior of the parameter space. The method is demonstrated under two practical scenarios, namely the frailty survival model and linear regression with order-restricted parameters. Simulation studies and real data analyses show that the method performs well with realistic sample sizes and exhibits certain advantages over standard methods.

Semi-parametric methods are often used for the estimation of intervention effects on correlated outcomes in cluster-randomized trials (CRTs). When outcomes are missing at random (MAR), Inverse Probability Weighted (IPW) methods incorporating baseline covariates can be used to deal with informative missingness. Also, augmented generalized estimating equations (AUG) correct for imbalance in baseline covariates but need to be extended for MAR outcomes. However, in the presence of interactions between treatment and baseline covariates, neither method alone produces consistent estimates for the marginal treatment effect if the model for the interaction is not correctly specified. We propose an AUG–IPW estimator that weights by the inverse of the probability of being a complete case and allows different outcome models in each intervention arm. This estimator is doubly robust (DR): it gives correct estimates whether the missing data process or the outcome model is correctly specified. We consider the problem of covariate interference, which arises when the outcome of an individual may depend on covariates of other individuals. When interfering covariates are not modeled, the DR property prevents bias as long as covariate interference is not present simultaneously for the outcome and the missingness. An R package is developed implementing the proposed method. An extensive simulation study and an application to a CRT of an HIV risk-reduction intervention in South Africa illustrate the method.
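The doubly robust structure can be seen in a minimal sketch for a marginal mean with MAR outcomes: with a correctly specified missingness probability, the estimate stays consistent even when the outcome model is deliberately wrong. The simulation below is illustrative only and omits the clustered, two-arm machinery of the article:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)          # true marginal mean E[Y] = 2.0
pi = 1 / (1 + np.exp(-(0.5 + x)))               # missingness depends on x (MAR)
r = (rng.random(n) < pi).astype(float)          # r = 1 if the outcome is observed

def dr_mean(y, r, pi, m):
    """AUG-IPW (doubly robust) estimate of E[Y]: the IPW term plus an
    augmentation built from the outcome-model predictions m."""
    return np.mean(r * y / pi - (r / pi - 1) * m)

m_wrong = np.zeros(n)                           # deliberately misspecified outcome model
m_right = 2.0 + 1.5 * x                         # correctly specified outcome model
mu_cc = y[r == 1].mean()                        # naive complete-case mean (biased)
mu_dr = dr_mean(y, r, pi, m_wrong)
```

Because the missingness probability `pi` is correct here, `mu_dr` is close to 2.0 with either outcome model, while the complete-case mean is biased upward (observed subjects have larger x, hence larger y).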

The ready availability of public-use data from various large national complex surveys has immense potential for the assessment of population characteristics using regression models, and complex surveys can be used to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or resampling are often not valid with survey data because of complex survey design features, namely stratification, multistage sampling, and weighting. In this article, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a double-transform-both-sides (DTBS)-based estimating equations approach to estimate the median regression parameters of the highly skewed response; the DTBS approach applies the same Box–Cox type transformation twice to both the outcome and the regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudo-likelihood based on minimizing absolute deviations (MAD). Furthermore, the approach is relatively robust to the true underlying distribution and has much smaller mean square error than a MAD approach. The method is motivated by an analysis of laboratory data on urinary iodine (UI) concentration from the National Health and Nutrition Examination Survey.
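The transform-both-sides idea underlying DTBS can be sketched directly: the same Box–Cox family is applied twice to both the outcome and the regression function before forming residuals for the estimating equations. The parameter values and data below are illustrative, and the double transform is only well defined when the first-stage values remain positive:

```python
import numpy as np

def boxcox(u, lam):
    """Box-Cox family: (u**lam - 1)/lam, with the log limit at lam = 0."""
    u = np.asarray(u, dtype=float)
    return np.log(u) if abs(lam) < 1e-12 else (u ** lam - 1.0) / lam

def dtbs(u, lam1, lam2):
    """Double transform: the same Box-Cox family applied twice, as in the
    DTBS idea; first-stage values must stay positive for this illustration."""
    return boxcox(boxcox(u, lam1), lam2)

# Highly skewed outcome and a hypothetical median-regression fit
y = np.array([2.0, 3.5, 8.0, 40.0, 150.0])
mu = np.array([3.0, 4.0, 9.0, 30.0, 120.0])
resid = dtbs(y, 0.5, 0.5) - dtbs(mu, 0.5, 0.5)
```

Because both stages are monotone increasing, each residual keeps the sign of `y - mu`, which is what a median-regression estimating equation needs.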

Construction of confidence sets for the optimal factor levels is an important topic in response surface methodology. Wan et al. (2015) provided an exact confidence set for a maximum or minimum point (i.e., an optimal factor level) of a univariate polynomial function in a given interval. In this article, we extend that method to construct an exact confidence set for the optimal factor levels of response surfaces. The construction method is readily applied to many parametric and semiparametric regression models involving a quadratic function. A conservative confidence set is provided as an intermediate step in the construction of the exact confidence set. Two examples are given to illustrate the application of the confidence sets. A comparison between confidence sets indicates that our exact confidence set is better than the only other confidence set available in the statistical literature that guarantees the confidence level.

The focus of this article is on the nature of the likelihood associated with *N*-mixture models for repeated count data. It is shown that the infinite sum embedded in the likelihood associated with the Poisson mixing distribution can be expressed in terms of a hypergeometric function and, thence, in closed form. The resultant expression for the likelihood can be readily computed to a high degree of accuracy and is algebraically tractable. Specifically, the likelihood equations can be simplified to some advantage, the likelihood concentrated in the probability of detection, and problematic cases identified. The results are illustrated by means of a simulation study and a real-world example. The study is extended to *N*-mixture models with a negative binomial mixing distribution, and results similar to those for the Poisson case are obtained. *N*-mixture models with mixing distributions which accommodate excess zeros and, separately, with a beta-binomial distribution rather than a binomial used to model the intra-site counts are also investigated. However, the results for these settings, while computationally attractive, do not provide insight into the nature of the maximum likelihood estimates.
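The likelihood in question can be written down in a few lines. The sketch below evaluates the Poisson-mixing *N*-mixture likelihood for one site by truncating the infinite sum; the article's contribution is that this same sum admits a closed form via a hypergeometric function, against which a truncated version like this can be checked:

```python
from math import lgamma, log, exp

def log_pois(n, lam):
    """Log Poisson pmf at n with mean lam."""
    return -lam + n * log(lam) - lgamma(n + 1)

def log_binom(y, n, p):
    """Log Binomial pmf: y detections out of N = n animals, detection prob p."""
    return (lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
            + y * log(p) + (n - y) * log(1 - p))

def site_likelihood(counts, lam, p, n_max=200):
    """N-mixture likelihood for one site's repeated counts: the sum over the
    latent abundance N of Poisson(N; lam) * prod_t Binomial(y_t; N, p),
    truncated at n_max.  Log-space arithmetic avoids overflow in the
    factorials for large N."""
    total = 0.0
    for n in range(max(counts), n_max + 1):
        lp = log_pois(n, lam) + sum(log_binom(y, n, p) for y in counts)
        total += exp(lp)
    return total

lik = site_likelihood([3, 2, 4], lam=5.0, p=0.5)
```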

Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumptions about both data sets and suggests that correlations estimated at the level of the model's natural parameters are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a larger number of negatively correlated pairs than identified by the standard approaches.

An intermediate response measure that accurately predicts efficacy in a new setting can reduce trial cost and time to product licensure. In this article, we define a *trial level general surrogate*, which is an intermediate response that can be used to accurately predict efficacy in a new setting. Methods for evaluating general surrogates have been developed previously, many of which use trial level intermediate responses for prediction. However, existing methods focus on surrogate evaluation and prediction in new settings, rather than comparison of candidate general surrogates, and few formalize the use of cross validation to quantify the expected prediction error. Our proposed method uses Bayesian nonparametric modeling and cross-validation to estimate the absolute prediction error for use in evaluating and comparing candidate trial level general surrogates. Simulations show that our method performs well across a variety of scenarios. We use our method to evaluate and to compare candidate trial level general surrogates in several multi-national trials of a pentavalent rotavirus vaccine. We identify at least one immune measure that has potential value as a trial level general surrogate and use it to predict efficacy in a new trial where the clinical outcome was not measured.

Identification of novel biomarkers for risk prediction is important for disease prevention and optimal treatment selection. However, studies aiming to discover which biomarkers are useful for risk prediction often require the use of stored biological samples from large assembled cohorts, and thus the depletion of a finite and precious resource. To make efficient use of such stored samples, two-phase sampling designs are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare. Existing methods for analyzing data from two-phase studies focus primarily on single-marker analysis or on fitting the Cox regression model to combine information from multiple markers. However, the Cox model may not fit the data well, and under model misspecification the composite score derived from it may not perform well in predicting the outcome. Under a general two-phase stratified cohort sampling design, we present a novel approach to combining multiple markers to optimize prediction by fitting a flexible nonparametric transformation model. Using inverse probability weighting to account for the outcome-dependent sampling, we propose to estimate the model parameters by maximizing an objective function which can be interpreted as a weighted C-statistic for survival outcomes. Regardless of model adequacy, the proposed procedure yields a sensible composite risk score for prediction. A major obstacle to inference in two-phase studies is the correlation induced by finite-population sampling, which prevents standard inference procedures such as the bootstrap from being used for variance estimation. We propose a resampling procedure to derive valid confidence intervals for the model parameters and the C-statistic accuracy measure. We illustrate the new methods with simulation studies and an analysis of a two-phase study of high-density lipoprotein cholesterol (HDL-C) subtypes for predicting the risk of coronary heart disease.
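The objective function has a transparent form in the uncensored case: a C-statistic over usable pairs, with each pair weighted by the product of inverse sampling probabilities. A hedged sketch (censoring and the stratified design are omitted; all numbers are made up):

```python
import numpy as np

def ipw_cstat(score, time, weight):
    """Inverse-probability-weighted C-statistic for uncensored survival
    times: the weighted fraction of usable pairs in which the subject who
    fails earlier has the higher risk score."""
    score, time, weight = map(np.asarray, (score, time, weight))
    num = den = 0.0
    n = len(score)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j]:                 # i fails first: usable pair
                w = weight[i] * weight[j]         # product of inverse sampling probs
                den += w
                num += w * (score[i] > score[j])  # concordant pair
    return num / den

score = np.array([2.1, 0.3, 1.5, 0.9])            # hypothetical composite risk scores
time = np.array([1.0, 5.0, 2.0, 4.0])
weight = np.array([1.0, 2.0, 1.0, 1.5])           # inverse sampling probabilities
c = ipw_cstat(score, time, weight)
```

In this toy data the scores are perfectly concordant with the failure order, so the weighted C-statistic is 1; negating the scores makes every pair discordant.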

At a time of climate change and major loss of biodiversity, it is important to have efficient tools for monitoring populations. In this context, animal abundance indices play an important role. In producing indices for invertebrates, it is important to account for variation in counts within seasons. Two new methods for describing seasonal variation in invertebrate counts have recently been proposed; one is nonparametric, using generalized additive models, and the other is parametric, based on stopover models. We present a novel generalized abundance index which encompasses both parametric and nonparametric approaches. The index is extremely efficient to compute, owing to the use of concentrated likelihood techniques. This has particular relevance for the analysis of data from long-term extensive monitoring schemes with records for many species and sites, for which existing modeling techniques can be prohibitively time consuming. Performance of the index is demonstrated by several applications to UK Butterfly Monitoring Scheme data. We demonstrate the potential for new insights into both phenology and spatial variation in seasonal patterns from parametric modeling and the incorporation of covariate dependence, which is relevant for both monitoring and conservation. Associated R code is available on the journal website.
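The concentration step can be sketched for the simplest version of such an index: if counts at site i and visit j are Poisson with mean N_i f_j (site total times seasonal flight curve), the site totals have closed-form MLEs and drop out of the likelihood, leaving only the seasonal parameters to optimize. All numbers below are hypothetical:

```python
import numpy as np

def concentrated_loglik(counts, f):
    """Poisson log-likelihood for counts[i, j] ~ Poisson(N_i * f[j]), with
    the site totals N_i concentrated out at their closed-form MLEs
    N_i = sum_j counts[i, j] / sum_j f[j] (additive constants dropped)."""
    counts = np.asarray(counts, dtype=float)
    N_hat = counts.sum(axis=1) / f.sum()     # closed-form profile MLEs
    mu = N_hat[:, None] * f[None, :]
    mask = counts > 0                        # 0 * log(mu) contributes nothing
    ll = -mu.sum() + (counts[mask] * np.log(mu[mask])).sum()
    return ll, N_hat

# Hypothetical seasonal flight curve over 6 monitoring visits (sums to 1)
f = np.array([0.05, 0.20, 0.35, 0.25, 0.10, 0.05])
counts = np.array([[1, 5, 9, 6, 2, 1],       # site 1
                   [0, 2, 4, 3, 1, 0]])      # site 2
ll, N_hat = concentrated_loglik(counts, f)
```

Because every site parameter is profiled out in closed form, only the low-dimensional seasonal parameters need numerical optimization, which is what makes the index fast for many species and sites.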

It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroups. Although trial-and-error and one-size-fits-all approaches to treatment selection remain common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less attention has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are nonlinear. Furthermore, the interaction-based test is scale-dependent and may fail to capture markers useful for predicting individualized treatment differences. In this article, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment difference. Simulation studies show that our proposed KM-based score test is more powerful than the Wald test when there are nonlinear effects among the predictors and when the outcome is binary with a nonlinear link function. Furthermore, when there is high correlation among predictors and the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials.
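The score statistic at the heart of such a test is a quadratic form in null-model residuals, with a kernel matrix computed on the markers. A simplified sketch (the null model here is just an overall mean, not the full treatment-plus-main-effects null of the article, and all data are simulated):

```python
import numpy as np

def gaussian_kernel(X, rho):
    """Gaussian (RBF) kernel matrix: K_ij = exp(-||x_i - x_j||^2 / rho)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / rho)

def km_score_stat(y, X, rho):
    """Kernel machine score statistic Q = r' K r, with r the residuals from
    the null model (here simply the overall mean)."""
    r = y - y.mean()
    return r @ gaussian_kernel(X, rho) @ r

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 3))                          # three markers
y_null = rng.normal(size=n)                          # no marker effect
y_alt = np.sin(X[:, 0]) * 2.0 + rng.normal(size=n)   # nonlinear marker effect
q_null = km_score_stat(y_null, X, rho=2.0)
q_alt = km_score_stat(y_alt, X, rho=2.0)
```

Because the Gaussian kernel picks up smooth structure of any shape, the statistic grows sharply under the nonlinear alternative without specifying its functional form; its null distribution is a mixture of chi-squares in the full treatment.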

This article considers nonparametric methods for studying recurrent disease and death with competing risks. We first point out that comparisons based on the well-known cumulative incidence function can be confounded by different prevalence rates of the competing events, and that comparisons of the conditional distribution of the survival time given the failure event type are more relevant for investigating the prognosis of different patterns of recurrent disease. We then propose nonparametric estimators for the conditional cumulative incidence function as well as the conditional bivariate cumulative incidence function for the bivariate gap times, that is, the time to disease recurrence and the residual lifetime after recurrence. To quantify the association between the two gap times in the competing risks setting, a modified Kendall's tau statistic is proposed. The proposed estimators for the conditional bivariate cumulative incidence distribution and the association measure account for the induced dependent censoring for the second gap time. Uniform consistency and weak convergence of the proposed estimators are established. Hypothesis testing procedures for two-sample comparisons are discussed. Numerical simulation studies with practical sample sizes are conducted to evaluate the performance of the proposed nonparametric estimators and tests. An application to data from a pancreatic cancer study is presented to illustrate the methods developed in this article.
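The prevalence adjustment is simple to state: dividing the cause-specific cumulative incidence by the overall prevalence of that cause gives the conditional distribution of the failure time given the event type. A sketch for complete (uncensored) data, where both quantities reduce to empirical proportions (the article's estimators additionally handle censoring):

```python
import numpy as np

def conditional_cif(time, cause, k, t):
    """With complete data, the cumulative incidence of cause k at t is the
    empirical proportion failing from cause k by t; dividing by the overall
    prevalence of cause k gives P(T <= t | cause = k), which removes the
    confounding by different prevalence rates of the competing events."""
    time, cause = np.asarray(time), np.asarray(cause)
    p_k = np.mean(cause == k)                    # prevalence of cause k
    cif = np.mean((cause == k) & (time <= t))    # cumulative incidence at t
    return cif / p_k

# Hypothetical competing-risks data: 1 = disease recurrence, 2 = death
time = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 6.0])
cause = np.array([1, 2, 1, 1, 2, 1])
cc = conditional_cif(time, cause, 1, 3.0)
```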

The availability of data in longitudinal studies is often driven by features of the characteristics being studied. For example, clinical databases are increasingly being used for research to address longitudinal questions. Because visit times in such data are often driven by patient characteristics that may be related to the outcome being studied, the danger is that this will result in biased estimation compared to designed, prospective studies. We study longitudinal data that follow a generalized linear mixed model and use a log link to relate an informative visit process to random effects in the mixed model. This device allows us to elucidate which parameters are biased under the informative visit process and to what degree. We show that the informative visit process can badly bias estimators of parameters of covariates associated with the random effects, while allowing consistent estimation of other parameters.

Alzheimer's disease (AD) is usually diagnosed by clinicians through cognitive and functional performance tests, with a potential risk of misdiagnosis. Since the progression of AD is known to cause structural changes in the corpus callosum (CC), CC thickness can be used as a functional covariate in the AD classification problem. However, misclassified class labels negatively impact classification performance. Motivated by AD–CC association studies, we propose a logistic regression for functional data classification that is robust to misdiagnosis or label noise. Specifically, our model is constructed by adding individual intercepts to the functional logistic regression model. This approach enables us to identify observations that are possibly mislabeled and also leads to a robust and efficient classifier. An effective estimation procedure based on the MM algorithm provides simple closed-form update formulas. We test our method using synthetic datasets to demonstrate its superiority over an existing method, and apply it to differentiating patients with AD from healthy controls based on CC thickness measured from MRI.

In this article, we develop new methods for estimating average treatment effects in observational studies, in settings with more than two treatment levels, assuming unconfoundedness given pretreatment variables. We emphasize propensity score subclassification and matching methods which have been among the most popular methods in the binary treatment literature. Whereas the literature has suggested that these particular propensity-based methods do not naturally extend to the multi-level treatment case, we show, using the concept of weak unconfoundedness and the notion of the generalized propensity score, that adjusting for a scalar function of the pretreatment variables removes all biases associated with observed pretreatment variables. We apply the proposed methods to an analysis of the effect of treatments for fibromyalgia. We also carry out a simulation study to assess the finite sample performance of the methods relative to previously proposed methods.
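
A toy sketch of subclassification on the generalized propensity score for a multi-level treatment (assuming the level-specific scores `gps` have already been estimated, e.g., by multinomial logistic regression; quintile subclassing is one conventional choice, not necessarily the authors'):

```python
import numpy as np

def gps_subclass_means(y, t, gps, n_sub=5):
    """Estimate E[Y(t)] for each treatment level by subclassification on the
    generalized propensity score. gps is an (n, n_levels) array with
    gps[i, j] = estimated P(T = level_j | X_i). Assumes every level is
    observed within at least one subclass."""
    levels = np.unique(t)
    out = {}
    for j, lev in enumerate(levels):
        e = gps[:, j]
        # subclass boundaries: quantiles of the level-specific score
        edges = np.quantile(e, np.linspace(0, 1, n_sub + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        est, tot = 0.0, 0
        for k in range(n_sub):
            in_sub = (e > edges[k]) & (e <= edges[k + 1])
            treated = in_sub & (t == lev)
            if treated.any():
                # weight the treated-unit mean by subclass size
                est += in_sub.sum() * y[treated].mean()
                tot += in_sub.sum()
        out[lev] = est / tot
    return out
```

The key point from the abstract is that adjusting for this scalar score per level, rather than the full covariate vector, suffices under weak unconfoundedness.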

A clinical trial with a factorial design involves randomization of subjects to treatment *A* or its control and, within each group, further randomization to treatment *B* or its control. Under this design, one can assess the effects of treatments *A* and *B* on a clinical endpoint using all patients. One may additionally compare treatment *A*, treatment *B*, or the combination therapy to the control condition. With multiple comparisons, however, it may be desirable to control the overall type I error, especially for regulatory purposes. Because the subjects overlap in the comparisons, the test statistics are generally correlated. By accounting for the correlations, one can achieve higher statistical power compared to the conventional Bonferroni correction. Herein, we derive the correlation between any two (stratified or unstratified) log-rank statistics for a factorial design with a survival time endpoint, such that the overall type I error for multiple treatment comparisons can be properly controlled. In addition, we allow for adjustment of prognostic factors in the treatment comparisons and conduct simultaneous inference on the effect sizes. We use simulation studies to show that the proposed methods perform well in realistic situations. We then provide an application to a recently completed randomized controlled clinical trial on alcohol dependence. Finally, we discuss extensions of our approach to other factorial designs and multiple endpoints.
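
The gain over Bonferroni can be sketched numerically: once the correlation matrix of the (asymptotically normal) log-rank statistics is available, a simulated critical value for max_j |Z_j| controls the familywise error while sitting below the Bonferroni cutoff. This is an illustration of the general principle, not the authors' derivation of the correlations themselves:

```python
import numpy as np

def adjusted_critical_value(corr, alpha=0.05, n_sim=100000, seed=0):
    """Critical value c with P(max_j |Z_j| > c) = alpha, where (Z_1,...,Z_k)
    is multivariate normal with the given correlation matrix (by simulation)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)                 # corr must be positive definite
    Z = rng.standard_normal((n_sim, corr.shape[0])) @ L.T
    return np.quantile(np.abs(Z).max(axis=1), 1 - alpha)
```

For three comparisons with pairwise correlation 0.5, the simulated cutoff falls below the Bonferroni value z_{1-0.05/6} ≈ 2.394, so each comparison is tested at a less stringent level without inflating the overall type I error.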

Applications of circular regression models appear in many different fields such as evolutionary psychology, motor behavior, biology, and, in particular, in the analysis of gene expressions in oscillatory systems. Specifically, for the gene expression problem, a researcher may be interested in modeling the relationship among the phases of cell-cycle genes in two species with differing periods. This challenging problem reduces to the problem of constructing a piecewise circular regression model and, with this objective in mind, we propose a flexible circular regression model which allows different parameter values depending on sectors along the circle. We give a detailed interpretation of the parameters in the model and provide maximum likelihood estimators. We also provide a model selection procedure based on the concept of generalized degrees of freedom. The model is then applied to the analysis of two different cell-cycle data sets and through these examples we highlight the power of our new methodology.

Motivated by an ongoing pediatric mental health care (PMHC) study, this article presents weakly structured methods for analyzing doubly censored recurrent event data where only coarsened information on censoring is available. The study extracted administrative records of emergency department visits from provincial health administrative databases. The available information on each individual subject is limited to a subject-specific time window determined up to concealed data. To evaluate time-dependent effects of exposures, we adapt local linear estimation with right-censored survival times under the Cox regression model with time-varying coefficients (cf. Cai and Sun, *Scandinavian Journal of Statistics* 2003, **30**, 93–111). We establish the pointwise consistency and asymptotic normality of the regression parameter estimator, and examine its performance by simulation. The PMHC study illustrates the proposed approach throughout the article.

Potential reductions in laboratory assay costs afforded by pooling equal aliquots of biospecimens have long been recognized in disease surveillance and epidemiological research and, more recently, have motivated design and analytic developments in regression settings. For example, Weinberg and Umbach (1999, *Biometrics* **55**, 718–726) provided methods for fitting set-based logistic regression models to case-control data when a continuous exposure variable (e.g., a biomarker) is assayed on pooled specimens. We focus on improving estimation efficiency by utilizing available subject-specific information at the pool allocation stage. We find that a strategy that we call “(y,**c**)-pooling,” which forms pooling sets of individuals within strata defined jointly by the outcome and other covariates, provides more precise estimation of the risk parameters associated with those covariates than does pooling within strata defined only by the outcome. We review the approach to set-based analysis through offsets developed by Weinberg and Umbach in a recent correction to their original paper. We propose a method for variance estimation under this design and use simulations and a real-data example to illustrate the precision benefits of (y,**c**)-pooling relative to y-pooling. We also note and illustrate that set-based models permit estimation of covariate interactions with exposure.

In applying scan statistics for public health research, it would be valuable to develop a detection method for multiple clusters that accommodates spatial correlation and covariate effects in an integrated model. In this article, we connect the concepts of the likelihood ratio (LR) scan statistic and the quasi-likelihood (QL) scan statistic to provide a series of detection procedures sufficiently flexible to apply to clusters of arbitrary shape. First, we use an independent scan model for detection of clusters and then a variogram tool to examine the existence of spatial correlation and regional variation based on residuals of the independent scan model. When the estimate of regional variation is significantly different from zero, a mixed QL estimating equation is developed to estimate coefficients of geographic clusters and covariates. We use the Benjamini–Hochberg procedure (1995) to find a threshold for *p*-values to address the multiple testing problem. A quasi-deviance criterion is used to regroup the estimated clusters to find geographic clusters with arbitrary shapes. We conduct simulations to compare the performance of the proposed method with other scan statistics. For illustration, the method is applied to enterovirus data from Taiwan.
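
The Benjamini–Hochberg step referenced above is standard; a minimal implementation of the threshold it applies to a vector of *p*-values (generic sketch, independent of the scan-statistic machinery) is:

```python
import numpy as np

def bh_threshold(pvals, q=0.05):
    """Benjamini-Hochberg: find the largest sorted p_(k) with
    p_(k) <= k*q/m; return that threshold (0.0 if no rejection)
    and the boolean rejection mask."""
    p = np.asarray(pvals)
    m = len(p)
    ps = p[np.argsort(p)]
    below = ps <= (np.arange(1, m + 1) * q / m)
    if not below.any():
        return 0.0, np.zeros(m, dtype=bool)
    k = np.max(np.nonzero(below)[0])   # index of the largest qualifying p
    return ps[k], p <= ps[k]
```

Every *p*-value at or below the returned threshold is rejected, which controls the false discovery rate at level `q` under independence or positive dependence.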

Measuring the similarity between genes is often the starting point for building gene regulatory networks. Most similarity measures used in practice consider only pairwise information, with a few also considering network structure. Although the theoretical properties of pairwise measures are well understood in the statistics literature, little is known about the statistical properties of similarity measures based on network structure. In this article, we consider a new whole-genome network-based similarity measure, called *CCor*, that makes use of information on all the genes in the network. We derive a concentration inequality for *CCor* and compare it with the commonly used Pearson correlation coefficient for inferring network modules. Both theoretical analysis and a real-data example demonstrate the advantages of *CCor* over existing measures for inferring gene modules.

We introduce a new Bayesian nonparametric method for estimating the size of a closed population from multiple-recapture data. Our method, based on Dirichlet process mixtures, can accommodate complex patterns of heterogeneity of capture, and can transparently modulate its complexity without a separate model selection step. Additionally, it can handle the massively sparse contingency tables generated by a large number of recaptures with moderate sample sizes. We develop an efficient and scalable MCMC algorithm for estimation. We apply our method to simulated data and to two examples from the literature on estimation of casualties in armed conflicts.

This article presents a new approach to modeling group animal movement in continuous time. The movement of a group of animals is modeled as a multivariate Ornstein–Uhlenbeck diffusion process in a high-dimensional space. Each individual of the group is attracted to a leading point which is generally unobserved, and the movement of the leading point is also an Ornstein–Uhlenbeck process attracted to an unknown attractor. The Ornstein–Uhlenbeck bridge is applied to reconstruct the location of the leading point. All movement parameters are estimated using Markov chain Monte Carlo sampling, specifically a Metropolis–Hastings algorithm. We apply the method to a small group of simultaneously tracked reindeer, *Rangifer tarandus tarandus*, showing that the method detects dependency in movement between individuals.
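
The hierarchical structure described above can be simulated directly with exact Ornstein–Uhlenbeck transitions (a forward-simulation sketch under assumed parameter values, not the authors' inference procedure; `theta_l`, `theta_i`, and `sigma` are illustrative):

```python
import numpy as np

def simulate_group_ou(n_ind=3, n_steps=500, dt=0.1,
                      theta_l=0.5, theta_i=2.0, sigma=0.3,
                      attractor=(0.0, 0.0), seed=0):
    """Hierarchical OU sketch: an unobserved leading point follows an OU
    process pulled toward a fixed attractor; each individual follows an OU
    process pulled toward the current leader position."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(attractor, float)

    def step(x, center, theta):
        # exact OU transition over time dt
        a = np.exp(-theta * dt)
        sd = sigma * np.sqrt((1 - a**2) / (2 * theta))
        return center + (x - center) * a + sd * rng.standard_normal(x.shape)

    leader = np.zeros((n_steps, 2)); leader[0] = mu + 5.0
    inds = np.zeros((n_steps, n_ind, 2)); inds[0] = mu + 5.0
    for t in range(1, n_steps):
        leader[t] = step(leader[t - 1], mu, theta_l)
        inds[t] = step(inds[t - 1], leader[t], theta_i)
    return leader, inds
```

Because the individuals share a common (latent) center of attraction, their trajectories are dependent, which is exactly the feature the estimation method is designed to detect.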

There is an overwhelmingly large literature, and many algorithms already available, on “large-scale inference problems” based on different modeling techniques and cultures. Our primary goal in this article is *not to add one more new methodology* to the existing toolbox but instead (i) to clarify how these different simultaneous inference methods are *connected*, (ii) to provide an alternative, more intuitive derivation of the formulas that leads to *simpler* expressions, and (iii) to develop a *unified* algorithm for practitioners. A detailed discussion on representation, estimation, inference, and model selection is given. Applications to a variety of real and simulated datasets show promise. We end with several future research directions.

The various thresholding quantities grouped under the “Basic Reproductive Number” umbrella are often confused, but represent distinct approaches to estimating epidemic spread potential, and address different modeling needs. Here, we contrast several common reproduction measures applied to stochastic compartmental models, and introduce a new quantity dubbed the “empirically adjusted reproductive number” with several advantages. These include: more complete use of the underlying compartmental dynamics than common alternatives, use as a potential diagnostic tool to detect the presence and causes of intensity process underfitting, and the ability to provide timely feedback on disease spread. Conceptual connections between traditional reproduction measures and our approach are explored, and the behavior of our method is examined under simulation. Two illustrative examples are developed: First, the single location applications of our method are established using data from the 1995 Ebola outbreak in the Democratic Republic of the Congo and a traditional stochastic SEIR model. Second, a spatial formulation of this technique is explored in the context of the ongoing Ebola outbreak in West Africa with particular emphasis on potential use in model selection, diagnosis, and the resulting applications to estimation and prediction. Both analyses are placed in the context of a newly developed spatial analogue of the traditional SEIR modeling approach.
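
For context, the underlying compartmental dynamics can be sketched with a generic discrete-time stochastic SEIR (chain-binomial) simulation; this is a textbook construction, not the authors' spatial model or their empirically adjusted measure, and the rates are illustrative (basic R0 = beta/gamma):

```python
import numpy as np

def seir_chain_binomial(n=10000, i0=10, beta=0.6, kappa=0.25, gamma=0.2,
                        n_steps=200, seed=0):
    """Discrete-time stochastic SEIR. beta: transmission rate,
    kappa: E->I progression rate, gamma: I->R recovery rate."""
    rng = np.random.default_rng(seed)
    S, E, I, R = n - i0, 0, i0, 0
    traj = []
    for _ in range(n_steps):
        p_inf = 1 - np.exp(-beta * I / n)          # per-susceptible infection prob.
        new_E = rng.binomial(S, p_inf)
        new_I = rng.binomial(E, 1 - np.exp(-kappa))
        new_R = rng.binomial(I, 1 - np.exp(-gamma))
        S, E, I, R = S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R
        traj.append((S, E, I, R))
    return np.array(traj)
```

Reproduction measures differ precisely in how much of this transition structure they use: threshold quantities like beta/gamma ignore the realized compartment counts, whereas empirically adjusted quantities track the simulated (or fitted) intensity process over time.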

In this article we present a new method for performing Bayesian parameter inference and model choice for low-count time series models with intractable likelihoods. The method involves incorporating an alive particle filter within a sequential Monte Carlo (SMC) algorithm to create a novel exact-approximate algorithm, which we refer to as alive SMC. The advantages of this approach over competing methods are that it is naturally adaptive, it does not involve between-model proposals required in reversible jump Markov chain Monte Carlo, and does not rely on potentially rough approximations. The algorithm is demonstrated on Markov process and integer autoregressive moving average models applied to real biological datasets of hospital-acquired pathogen incidence, animal health time series, and the cumulative number of prion disease cases in mule deer.

DNA methylation studies have been revolutionized by the recent development of high throughput array-based platforms. Most of the existing methods analyze microarray methylation data on a probe-by-probe basis, ignoring probe-specific effects and correlations among methylation levels at neighboring genomic locations. These methods can potentially miss functionally relevant findings associated with genomic regions. In this article, we propose a statistical model that allows us to pool information on the same probe across multiple samples to estimate the probe affinity effect, and to borrow strength from the neighboring probe sites to better estimate the methylation values. Using a simulation study, we demonstrate that our method can provide accurate model-based estimates. We further use the proposed method to develop a new procedure for detecting differentially methylated regions, and compare it with a state-of-the-art approach via a data application.

We consider quantile regression for partially linear models where an outcome of interest is related to covariates and a marker set (e.g., gene or pathway). The covariate effects are modeled parametrically and the marker set effect of multiple loci is modeled using a kernel machine. We propose an efficient algorithm to solve the corresponding optimization problem for estimating the effects of covariates and also introduce a powerful test for detecting the overall effect of the marker set. Our test is motivated by the traditional score test and borrows the idea of the permutation test. Our estimation and testing procedures are evaluated numerically and applied to assess genetic association of change in fasting homocysteine level using the Vitamin Intervention for Stroke Prevention Trial data.

Large assembled cohorts with banked biospecimens offer valuable opportunities to identify novel markers for risk prediction. When the outcome of interest is rare, an effective strategy to conserve limited biological resources while maintaining reasonable statistical power is the case cohort (CCH) sampling design, in which expensive markers are measured on a subset of cases and controls. However, the CCH design introduces significant analytical complexity due to outcome-dependent, finite-population sampling. Current methods for analyzing CCH studies focus primarily on the estimation of simple survival models with linear effects; testing and estimation procedures that can efficiently capture complex non-linear marker effects for CCH data remain elusive. In this article, we propose inverse probability weighted (IPW) variance component type tests for identifying important marker sets through a Cox proportional hazards kernel machine regression framework previously considered for full cohort studies (Cai et al., 2011). The optimal choice of kernel, while vitally important to attain high power, is typically unknown for a given dataset. Thus, we also develop robust testing procedures that adaptively combine information from multiple kernels. The proposed IPW test statistics have complex null distributions that cannot easily be approximated explicitly. Furthermore, due to the correlation induced by CCH sampling, standard resampling methods such as the bootstrap fail to approximate the distribution correctly. We, therefore, propose a novel perturbation resampling scheme that can effectively recover the induced correlation structure. Results from extensive simulation studies suggest that the proposed IPW testing procedures work well in finite samples. The proposed methods are further illustrated by application to a Danish CCH study of Apolipoprotein C-III markers on the risk of coronary heart disease.

We present a technique for using calibrated weights to incorporate whole-cohort information in the analysis of a countermatched sample. Following Samuelsen's approach for matched case-control sampling, we derive expressions for the marginal sampling probabilities, so that the data can be treated as an unequally-sampled case-cohort design. Pseudolikelihood estimating equations are used to find the estimates. The sampling weights can be calibrated, allowing all whole-cohort variables to be used in estimation; in contrast, the partial likelihood analysis makes use only of a single discrete surrogate for exposure. Using a survey-sampling approach rather than a martingale approach simplifies the theory; in particular, the sampling weights need not be a predictable process. Our simulation results show that pseudolikelihood estimation gives lower efficiency than partial likelihood estimation, but that the gain from calibration of weights can more than compensate for this loss. If there is a good surrogate for exposure, countermatched sampling still outperforms case-cohort and two-phase case-control sampling even when calibrated weights are used. Findings are illustrated with data from the National Wilms’ Tumour Study and the Welsh nickel refinery workers study.

It is agreed among biostatisticians that prediction models for binary outcomes should satisfy two essential criteria: first, a prediction model should have a high discriminatory power, implying that it is able to clearly separate cases from controls. Second, the model should be well calibrated, meaning that the predicted risks should closely agree with the relative frequencies observed in the data. The focus of this work is on the predictiveness curve, which has been proposed by Huang et al. (Biometrics 63, 2007) as a graphical tool to assess the aforementioned criteria. By conducting a detailed analysis of its properties, we review the role of the predictiveness curve in the performance assessment of biomedical prediction models. In particular, we demonstrate that marker comparisons should not be based solely on the predictiveness curve, as it is not possible to consistently visualize the added predictive value of a new marker by comparing the predictiveness curves obtained from competing models. Based on our analysis, we propose the “residual-based predictiveness curve” (RBP curve), which addresses the aforementioned issue and which extends the original method to settings where the evaluation of a prediction model on independent test data is of particular interest. Similar to the predictiveness curve, the RBP curve reflects both the calibration and the discriminatory power of a prediction model. In addition, the curve can be conveniently used to conduct valid performance checks and marker comparisons.

Causal mediation modeling has become a popular approach for studying the effect of an exposure on an outcome through a mediator. However, current methods are not applicable to the setting with a large number of mediators. We propose a testing procedure for mediation effects of high-dimensional continuous mediators. We characterize the marginal mediation effect, the multivariate component-wise mediation effects, and the norm of the component-wise effects, and develop a Monte-Carlo procedure for evaluating their statistical significance. To accommodate the setting with a large number of mediators and a small sample size, we further propose a transformation model using the spectral decomposition. Under the transformation model, mediation effects can be estimated using a series of regression models with a univariate transformed mediator, and examined by our proposed testing procedure. Extensive simulation studies are conducted to assess the performance of our methods for continuous and dichotomous outcomes. We apply the methods to analyze genomic data investigating the effect of microRNA miR-223 on a dichotomous survival status of patients with glioblastoma multiforme (GBM). We identify nine gene ontology sets with expression values that significantly mediate the effect of miR-223 on GBM survival.
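
As background for the Monte-Carlo significance idea, here is the classical single-mediator version (a deliberately simplified sketch: one continuous mediator and normal-approximation draws for the product of coefficients, not the authors' high-dimensional procedure):

```python
import numpy as np

def mediation_mc_ci(x, m, y, n_draw=20000, seed=0):
    """Monte-Carlo 95% CI for the indirect effect a*b (single mediator):
    stage 1: m = a0 + a*x + e1;  stage 2: y = c0 + c*x + b*m + e2."""
    rng = np.random.default_rng(seed)

    def ols(X, z):
        XtX_inv = np.linalg.inv(X.T @ X)
        coef = XtX_inv @ (X.T @ z)
        resid = z - X @ coef
        se = np.sqrt(np.diag(XtX_inv) * (resid @ resid) / (len(z) - X.shape[1]))
        return coef, se

    one = np.ones_like(x)
    coef1, se1 = ols(np.column_stack([one, x]), m)       # a = coef1[1]
    coef2, se2 = ols(np.column_stack([one, x, m]), y)    # b = coef2[2]
    # draw (a, b) from their asymptotic normals; CI from quantiles of a*b
    draws = (rng.normal(coef1[1], se1[1], n_draw) *
             rng.normal(coef2[2], se2[2], n_draw))
    return np.quantile(draws, [0.025, 0.975])
```

The high-dimensional extension in the abstract replaces the single mediator with transformed (spectrally decomposed) mediators, each examined by a regression of this univariate form.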

Clinical biomarkers play an important role in precision medicine and are now extensively used in clinical trials, particularly in cancer. A response adaptive trial design enables researchers to use treatment results about earlier patients to aid in treatment decisions of later patients. Optimal adaptive trial designs have been developed without consideration of biomarkers. In this article, we describe the mathematical steps for computing optimal biomarker-integrated adaptive trial designs. These designs maximize the expected trial utility given any pre-specified utility function, though we focus here on maximizing patient responses within a given patient horizon. We describe the performance of the optimal design in different scenarios. We compare it to Bayesian Adaptive Randomization (BAR), which is emerging as a practical approach to develop adaptive trials. The difference in expected utility between BAR and optimal designs is smallest when the biomarker subgroups are highly imbalanced. We also compare BAR, a frequentist play-the-winner rule with integrated biomarkers, and a marker-stratified balanced randomization design (BR). We show that, in contrasting two treatments, BR achieves a nearly optimal expected utility when the patient horizon is relatively large. Our work provides a novel theoretical solution, as well as an absolute benchmark, for the evaluation of trial designs in personalized medicine.

We present a general method for estimating the effect of a treatment on an ordinal outcome in randomized trials. The method is robust in that it does not rely on the proportional odds assumption. Our estimator leverages information in prognostic baseline variables, and has all of the following properties: (i) it is consistent; (ii) it is locally efficient; (iii) it is guaranteed to have equal or better asymptotic precision than both the inverse probability-weighted and the unadjusted estimators. To the best of our knowledge, this is the first estimator of the causal relation between a treatment and an ordinal outcome to satisfy these properties. We demonstrate the estimator in simulations based on resampling from a completed randomized clinical trial of a new treatment for stroke; we show potential gains of up to 39% in relative efficiency compared to the unadjusted estimator. The proposed estimator could be a useful tool for analyzing randomized trials with ordinal outcomes, since existing methods either rely on model assumptions that are untenable in many practical applications, or lack the efficiency properties of the proposed estimator. We provide R code implementing the estimator.

The Wilcoxon rank-sum test is a popular nonparametric test for comparing two independent populations (groups). In recent years, there have been renewed attempts in extending the Wilcoxon rank-sum test for clustered data, one of which (Datta and Satten, 2005, *Journal of the American Statistical Association* **100**, 908–915) addresses the issue of informative cluster size, i.e., when the outcomes and the cluster size are correlated. We are faced with a situation where the group-specific marginal distribution in a cluster depends on the number of observations in that group (i.e., the intra-cluster group size). We develop a novel extension of the rank-sum test for handling this situation. We compare the performance of our test with the Datta–Satten test, as well as the naive Wilcoxon rank-sum test. Using a naturally occurring simulation model of informative intra-cluster group size, we show that only our test maintains the correct size. We also compare our test with a classical signed rank test based on averages of the outcome values in each group paired by the cluster membership. While this test maintains the size, it has lower power than our test. Extensions to multiple group comparisons and the case of clusters not having samples from all groups are also discussed. We apply our test to determine whether there are differences in the attachment loss between the upper and lower teeth and between mesial and buccal sites of periodontal patients.
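
One simple reweighting in the spirit of informative-cluster-size corrections (a sketch of the general idea, not the authors' test or the Datta–Satten statistic) is to estimate the Mann–Whitney parameter while giving each observation weight inversely proportional to its cluster size, so every cluster contributes equally:

```python
import numpy as np

def weighted_mann_whitney(y, cluster, grp):
    """Cluster-weighted estimate of P(Y1 > Y0) + 0.5*P(Y1 = Y0), with each
    observation weighted by 1/(its cluster size)."""
    y, cluster, grp = map(np.asarray, (y, cluster, grp))
    sizes = {c: np.sum(cluster == c) for c in np.unique(cluster)}
    w = np.array([1.0 / sizes[c] for c in cluster])
    i0, i1 = np.where(grp == 0)[0], np.where(grp == 1)[0]
    num = den = 0.0
    for i in i1:
        for j in i0:
            wij = w[i] * w[j]
            num += wij * ((y[i] > y[j]) + 0.5 * (y[i] == y[j]))
            den += wij
    return num / den
```

Without such weighting, large clusters dominate the statistic, which is exactly what biases the naive rank-sum test when cluster (or intra-cluster group) size is informative.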

Ignorance of the mechanisms responsible for the availability of information presents an unusual problem for analysts. It is often the case that the availability of information is dependent on the outcome. In the analysis of clustered data, we say that *informative cluster size* (ICS) is present when inference drawn from analysis of hypothetical balanced data differs from inference drawn from the observed data. Much work has been done to address the analysis of clustered data with informative cluster size; examples include Inverse Probability Weighting (IPW), Cluster Weighted Generalized Estimating Equations (CWGEE), and Doubly Weighted Generalized Estimating Equations (DWGEE). When cluster size changes with time, i.e., the data set possesses temporally varying cluster sizes (TVCS), these methods may produce biased inference for the underlying marginal distribution of interest. We propose a new marginalization that may be appropriate for addressing clustered longitudinal data with TVCS. The principal motivation for our present work is to analyze the periodontal data collected by Beck et al. (1997, Journal of Periodontal Research 6, 497–505). Longitudinal periodontal data often exhibit both ICS and TVCS, as the number of teeth possessed by participants at the onset of the study is not constant, and teeth as well as individuals may be displaced throughout the study.

A new objective methodology is proposed to select the parsimonious set of important covariates that are associated with a censored outcome variable *Y*; the method simplifies to accommodate uncensored outcomes. Covariate selection proceeds in an iterated forward manner and is controlled by the pre-chosen upper bound for the number of covariates to be selected and the global false selection rate and level. A sequence of working regression models for the event given a covariate set is fit among subjects not censored before *y*, and the corresponding process (through *y*) of conditional prediction error is estimated; the direction and magnitude of covariate effects can arbitrarily change with *y*. The newly proposed adequacy measure for the covariate set is the slope coefficient resulting from a regression (with no intercept) between the baseline prediction error process for the intercept-only model and that process corresponding to the covariate set. Under quite general conditions on the censoring variable, the methods are shown to asymptotically control the false selection rate at the nominal level while consistently ranking covariate sets, which permits recruitment of all important covariates from those available with probability tending to 1. A simulation study confirms these analytical results and compares the proposed methods to recent competitors. Two real data illustrations are provided.

Observational studies are often in peril of unmeasured confounding. Instrumental variable analysis is a method for controlling for unmeasured confounding. As yet, theory on instrumental variable analysis of censored time-to-event data is scarce. We propose a pseudo-observation approach to instrumental variable analysis of the survival function, the restricted mean, and the cumulative incidence function in competing risks with right-censored data using generalized method of moments estimation. For the purpose of illustrating our proposed method, we study antidepressant exposure in pregnancy and risk of autism spectrum disorder in offspring, and the performance of the method is assessed through simulation studies.
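
The pseudo-observation construction at the heart of this approach is standard: jackknife pseudo-values of the Kaplan–Meier estimator, which can then be fed into estimating equations such as GMM. A minimal sketch for the survival function at a single time point:

```python
import numpy as np

def km_surv(time, event, t):
    """Kaplan-Meier estimate of S(t)."""
    s = 1.0
    for u in np.sort(np.unique(time[event == 1])):
        if u > t:
            break
        at_risk = np.sum(time >= u)
        d = np.sum((time == u) & (event == 1))
        s *= 1 - d / at_risk
    return s

def pseudo_observations(time, event, t):
    """Jackknife pseudo-values for S(t):
    theta_i = n*S_hat(t) - (n-1)*S_hat^(-i)(t)."""
    n = len(time)
    full = km_surv(time, event, t)
    mask = np.ones(n, dtype=bool)
    po = np.empty(n)
    for i in range(n):
        mask[i] = False
        po[i] = n * full - (n - 1) * km_surv(time[mask], event[mask], t)
        mask[i] = True
    return po
```

With no censoring, the pseudo-value for subject *i* reduces to the indicator 1{T_i > t}; under censoring it serves as an unbiased-in-expectation replacement for that unobservable indicator, making moment-based (e.g., instrumental variable) estimation tractable.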

Motivated by a longitudinal oral health study, we propose a flexible modeling approach for clustered time-to-event data, when the response of interest can only be determined to lie in an interval obtained from a sequence of examination times (interval-censored data) and on top of that, the determination of the occurrence of the event is subject to misclassification. The clustered time-to-event data are modeled using an accelerated failure time model with random effects and by assuming a penalized Gaussian mixture model for the random effects terms to avoid restrictive distributional assumptions concerning the event times. A general misclassification model is discussed in detail, considering the possibility that different examiners were involved in the assessment of the occurrence of the events for a given subject across time. A Bayesian implementation of the proposed model is described in a detailed manner. We additionally provide empirical evidence showing that the model can be used to estimate the underlying time-to-event distribution and the misclassification parameters without any external information about the latter parameters. We also provide results of a simulation study to evaluate the effect of neglecting the presence of misclassification in the analysis of clustered time-to-event data.

Graph-constrained estimation methods encourage similarities among neighboring covariates presented as nodes of a graph, and can result in more accurate estimates, especially in high-dimensional settings. Variable selection approaches can then be utilized to select a subset of variables that are associated with the response. However, existing procedures do not provide measures of uncertainty of estimates. Further, the vast majority of existing approaches assume that the available graph accurately captures the association among covariates; violations of this assumption could severely hurt the reliability of the resulting estimates. In this article, we present a new inference framework, called the Grace test, which produces coefficient estimates and corresponding *p*-values by incorporating the external graph information. We show, both theoretically and via numerical studies, that the proposed method asymptotically controls the type-I error rate regardless of the choice of the graph. We also show that when the underlying graph is informative, the Grace test is asymptotically more powerful than similar tests that ignore the external information. We study the power properties of the proposed test when the graph is not fully informative and develop a more powerful Grace-ridge test for such settings. Our numerical studies show that as long as the graph is reasonably informative, the proposed inference procedures deliver improved statistical power over existing methods that ignore external information.

In many practical multiple hypothesis testing problems, the alternatives cannot be expected to be symmetrically distributed. We show that when it is known a priori that the distributions of the alternatives are skewed, exploiting this information yields procedures with higher power than those based on symmetric alternatives. We propose a Bayesian decision-theoretic rule for multiple directional hypothesis testing with skewed alternatives, under a constraint on the mixed directional false discovery rate. We compare the proposed rule with the frequentist rule of Benjamini and Yekutieli (2005) using simulations, and we apply our method to a well-studied HIV dataset.

Often the object of inference in biomedical applications is a range that brackets a given fraction of individual observations in a population. A classical estimate of this range for univariate measurements is a “tolerance interval.” This article develops its natural extension for functional measurements, a “tolerance band,” and proposes a methodology for constructing its pointwise and simultaneous versions that accommodates both sparse and dense functional data. Assuming that the measurements are observed with noise, the methodology uses functional principal component analysis in a mixed model framework to represent the measurements and employs bootstrapping to approximate the tolerance factors needed for the bands. The proposed bands also account for uncertainty in the principal components decomposition. Simulations show that the methodology has, generally, acceptable performance unless the data are quite sparse and unbalanced, in which case the bands may be somewhat liberal. The methodology is illustrated using two real datasets, a sparse dataset involving CD4 cell counts and a dense dataset involving core body temperatures.

We introduce statistical methods for predicting the types of human activity at sub-second resolution using triaxial accelerometry data. The major innovation is that we use labeled activity data from some subjects to predict the activity labels of other subjects. To achieve this, we normalize the data across subjects by matching the standing up and lying down portions of triaxial accelerometry data. This is necessary to account for between-subject differences in the position of the device relative to gravity, which are induced by body shape and size as well as by the ambiguous definition of device placement. We also normalize the data at the device level to ensure that the magnitude of the signal at rest is similar across devices. After normalization we use overlapping movelets (segments of triaxial accelerometry time series) extracted from some of the subjects to predict the movement type of the other subjects. The problem was motivated by and is applied to a laboratory study of 20 older participants who performed different activities while wearing accelerometers at the hip. Prediction results based on other people's labeled dictionaries of activity performed almost as well as those obtained using their own labeled dictionaries. These findings indicate that prediction of activity types for data collected during natural activities of daily living may actually be possible.
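As a toy illustration of the movelet idea (a sketch under simplifying assumptions, not the authors' code; function names are hypothetical), one can extract overlapping labeled windows from a training subject's signal and classify a new window by its nearest dictionary entry:

```python
import numpy as np

def movelet_dictionary(signal, labels, win):
    """Extract overlapping windows (movelets) with their activity labels.
    signal: array of shape (T, channels); labels: length-T label array."""
    movelets, movelet_labels = [], []
    for start in range(len(signal) - win + 1):
        movelets.append(signal[start:start + win].ravel())
        movelet_labels.append(labels[start])  # label at window start
    return np.array(movelets), np.array(movelet_labels)

def predict_activity(window, movelets, movelet_labels):
    """Assign the label of the closest dictionary movelet (Euclidean distance)."""
    d = np.linalg.norm(movelets - window.ravel(), axis=1)
    return movelet_labels[np.argmin(d)]
```

In practice the dictionary would be built from normalized triaxial signals of other subjects, which is the point of the cross-subject normalization described above.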

Times between successive events (i.e., gap times) are of great importance in survival analysis. Although many methods exist for estimating covariate effects on gap times, very few existing methods allow for comparisons between gap times themselves. Motivated by the comparison of primary and repeat transplantation, our interest is specifically in contrasting the gap time survival functions and their integration (restricted mean gap time). Two major challenges in gap time analysis are non-identifiability of the marginal distributions and the existence of dependent censoring (for all but the first gap time). We use Cox regression to estimate the (conditional) survival distributions of each gap time (given the previous gap times). Combining fitted survival functions based on those models, along with multiple imputation applied to censored gap times, we then contrast the first and second gap times with respect to average survival and restricted mean lifetime. Large-sample properties are derived, with simulation studies carried out to evaluate finite-sample performance. We apply the proposed methods to kidney transplant data obtained from a national organ transplant registry. Mean 10-year graft survival of the primary transplant is significantly greater than that of the repeat transplant, by 3.9 months, a result that may lack clinical importance.
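The restricted mean lifetime compared here is simply the area under the survival curve up to a fixed horizon. A minimal sketch for a step-function survival estimate (illustrative only; the function name is hypothetical and the article's estimator additionally handles conditioning and imputation):

```python
def restricted_mean(times, surv, tau):
    """Restricted mean survival time: area under a right-continuous
    step-function survival curve S(t) up to horizon tau.
    times: sorted event times; surv: S(t) just after each event time."""
    rmst, prev_t, prev_s = 0.0, 0.0, 1.0  # S(0) = 1
    for t, s in zip(times, surv):
        if t >= tau:
            break
        rmst += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    rmst += prev_s * (tau - prev_t)    # final piece up to the horizon
    return rmst
```

For example, with drops to 0.5 at t = 1 and 0.25 at t = 2, the 3-year restricted mean is 1 + 0.5 + 0.25 = 1.75.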

Infection is one of the most common complications after hematopoietic cell transplantation. Many patients experience infectious complications repeatedly after transplant. Existing statistical methods for recurrent gap time data typically assume that patients are enrolled due to the occurrence of an event of interest, and subsequently experience recurrent events of the same type; moreover, for one-sample estimation, the gap times between consecutive events are usually assumed to be identically distributed. Applying these methods to analyze the post-transplant infection data will inevitably lead to incorrect inferential results because the time from transplant to the first infection has a different biological meaning than the gap times between consecutive recurrent infections. Some unbiased yet inefficient alternatives include univariate survival analysis methods based on data from the first infection or bivariate serial event data methods based on the first and second infections. In this article, we propose a nonparametric estimator of the joint distribution of time from transplant to the first infection and the gap times between consecutive infections. The proposed estimator takes into account the potentially different distributions of the two types of gap times and makes better use of the recurrent infection data. Asymptotic properties of the proposed estimators are established.

In this article, we develop a piecewise Poisson regression method to analyze survival data from complex sample surveys involving cluster correlation, differential selection probabilities, and longitudinal responses, to conveniently draw inference on absolute risks in time intervals that are prespecified by investigators. Extensive simulations evaluate the developed methods, with extensions to multiple covariates, under various complex sample designs, including stratified sampling, sampling with selection probability proportional to a measure of size (PPS), and multi-stage cluster sampling. We applied our methods to a study of mortality in men diagnosed with prostate cancer in the Prostate, Lung, Colorectal, and Ovarian (PLCO) cancer screening trial to investigate whether a biomarker available from biospecimens collected near the time of diagnosis stratifies subsequent risk of death. Poisson regression coefficients and absolute risks of mortality (and the corresponding 95% confidence intervals) for prespecified age intervals by biomarker levels are estimated. We conclude with a brief discussion of the motivation, methods, and findings of the study.
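Under a piecewise-constant hazard, the absolute risk accumulated through each prespecified interval follows directly from the cumulative hazard. A minimal sketch of that conversion (illustrative only; it ignores the survey weighting and covariates the article addresses):

```python
import math

def absolute_risk(rates, widths):
    """Absolute risk by the end of each prespecified interval under
    piecewise-constant hazards: 1 - exp(-cumulative hazard).
    rates: hazard in each interval; widths: interval lengths."""
    cumhaz, risks = 0.0, []
    for lam, w in zip(rates, widths):
        cumhaz += lam * w
        risks.append(1.0 - math.exp(-cumhaz))
    return risks
```

In the survey setting, the interval-specific rates would be estimated from weighted Poisson regression on person-time, with variances reflecting the complex design.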

The discriminatory ability of a marker for censored survival data is routinely assessed by the time-dependent ROC curve and the *c*-index. The time-dependent ROC curve evaluates the ability of a biomarker to predict whether a patient lives past a particular time *t*. The *c*-index measures the global concordance of the marker and the survival time regardless of the time point. We propose a Bayesian semiparametric approach to estimate these two measures. The proposed estimators are based on the conditional distribution of the survival time given the biomarker and the empirical biomarker distribution. The conditional distribution is estimated by a linear-dependent Dirichlet process mixture model. The resulting ROC curve is smooth as it is estimated by a mixture of parametric functions. The proposed *c*-index estimator is shown to be more efficient than the commonly used Harrell's *c*-index since it uses all pairs of data rather than only informative pairs. The proposed estimators are evaluated through simulations and illustrated using a lung cancer dataset.
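For reference, the commonly used Harrell's c-index against which the proposed Bayesian estimator is compared counts only the informative (usable) pairs, which is the source of the efficiency loss noted above. A minimal sketch:

```python
def harrell_c(time, event, marker):
    """Harrell's c-index: among usable pairs (subject i has an observed
    event strictly before subject j's time), the fraction where the
    earlier-failing subject has the higher marker; ties count 0.5."""
    n = len(time)
    concordant, usable = 0.0, 0
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:
                usable += 1
                if marker[i] > marker[j]:
                    concordant += 1.0
                elif marker[i] == marker[j]:
                    concordant += 0.5
    return concordant / usable
```

Pairs in which the shorter time is censored are discarded as uninformative; the model-based estimator described in the abstract avoids this by using the estimated conditional distribution of survival given the biomarker.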

Causal mediation modeling has become a popular approach for studying the effect of an exposure on an outcome through mediators. The literature on mediation analyses with survival outcomes has largely focused on settings with a single mediator, quantifying the mediation effects on the hazard, the log hazard, and the log survival time (Lange and Hansen, 2011; VanderWeele, 2011). In this article, we propose a multi-mediator model for survival data by employing a flexible semiparametric probit model. We characterize path-specific effects (PSEs) of the exposure on the outcome mediated through specific mediators. We derive closed-form expressions for PSEs on a transformed survival time and on the survival probabilities. Statistical inference on the PSEs is developed using a nonparametric maximum likelihood estimator under the semiparametric probit model and the functional Delta method. Results from simulation studies suggest that our proposed methods perform well in finite samples. We illustrate the utility of our method in a genomic study of glioblastoma multiforme survival.

Efforts to personalize medicine in oncology have been limited by reductive characterizations of the intrinsically complex underlying biological phenomena. Future advances in personalized medicine will rely on molecular signatures that derive from synthesis of multifarious interdependent molecular quantities, requiring robust quantitative methods. However, when applied in these settings, highly parameterized statistical models often require a prohibitively large database and are sensitive to proper characterization of the treatment-by-covariate interactions, which in practice are difficult to specify and may be limited by generalized linear models. In this article, we present a Bayesian predictive framework that enables the integration of a high-dimensional set of genomic features with clinical responses and treatment histories of historical patients, providing a probabilistic basis for using the clinical and molecular information to personalize therapy for future patients. Our work represents one of the first attempts to define personalized treatment assignment rules based on large-scale genomic data. We use actual gene expression data acquired from The Cancer Genome Atlas in the settings of leukemia and glioma to explore the statistical properties of our proposed Bayesian approach for personalizing treatment selection. The method is shown to yield considerable improvements in predictive accuracy when compared to penalized regression approaches.

Matched case-control studies are popular designs used in epidemiology for assessing the effects of exposures on binary traits. Modern studies increasingly enjoy the ability to examine a large number of exposures in a comprehensive manner. However, several risk factors often tend to be related in a nontrivial way, undermining efforts to identify the risk factors using standard analytic methods due to inflated type-I errors and possible masking of effects. Epidemiologists often apply data reduction techniques that group the prognostic factors thematically, with themes derived from biological considerations. We propose shrinkage-type estimators based on Bayesian penalization methods to estimate the effects of the risk factors using these themes. The properties of the estimators are examined using extensive simulations. The methodology is illustrated using data from a matched case-control study of polychlorinated biphenyls in relation to the etiology of non-Hodgkin's lymphoma.

We propose a novel Bayesian hierarchical model for brain imaging data that unifies voxel-level (the most localized unit of measure) and region-level brain connectivity analyses, and yields population-level inferences. Functional connectivity generally refers to associations in brain activity between distinct locations. The first level of our model summarizes brain connectivity for cross-region voxel pairs using a two-component mixture model consisting of connected and nonconnected voxels. We use the proportion of connected voxel pairs to define a new measure of connectivity strength, which reflects the breadth of between-region connectivity. Furthermore, we evaluate the impact of clinical covariates on connectivity between region-pairs at a population level. We perform parameter estimation using Markov chain Monte Carlo (MCMC) techniques, which can be executed quickly relative to the number of model parameters. We apply our method to resting-state functional magnetic resonance imaging (fMRI) data from 32 subjects with major depression and simulated data to demonstrate the properties of our method.

Community water fluoridation is an important public health measure to prevent dental caries, but it continues to be somewhat controversial. The Iowa Fluoride Study (IFS) is a longitudinal study on a cohort of Iowa children that began in 1991. The main purposes of this study (http://www.dentistry.uiowa.edu/preventive-fluoride-study) were to quantify fluoride exposures from both dietary and nondietary sources and to associate longitudinal fluoride exposures with dental fluorosis (spots on teeth) and dental caries (cavities). We analyze a subset of the IFS data by a marginal regression model with a zero-inflated version of the Conway–Maxwell–Poisson (ZICMP) distribution for count data exhibiting excessive zeros and a wide range of dispersion patterns. We introduce two general estimation methods for fitting a ZICMP marginal regression model. Finite sample behaviors of the estimators and the resulting confidence intervals are studied using extensive simulation studies. We apply our methodologies to the dental caries data. Our novel modeling incorporating zero inflation, clustering, and overdispersion sheds some new light on the effect of community water fluoridation and other factors. We also include a second application of our methodology to a genomic (next-generation sequencing) dataset that exhibits underdispersion.
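For intuition, the zero-inflated CMP distribution places extra mass π at zero on top of a CMP pmf whose dispersion is governed by ν (ν < 1 overdispersed, ν > 1 underdispersed, ν = 1 recovers the Poisson). A minimal sketch that truncates the normalizing series (illustrative only; function names are hypothetical and the article's marginal regression adds covariates and clustering):

```python
import math

def cmp_pmf(y, lam, nu, max_terms=100):
    """Conway–Maxwell–Poisson pmf:
    P(Y = y) = lam^y / ((y!)^nu * Z), Z = sum_j lam^j / (j!)^nu."""
    Z = sum(lam**j / math.factorial(j)**nu for j in range(max_terms))
    return lam**y / (math.factorial(y)**nu * Z)

def zicmp_pmf(y, pi, lam, nu):
    """Zero-inflated CMP: extra point mass pi at zero."""
    p = (1 - pi) * cmp_pmf(y, lam, nu)
    return p + pi if y == 0 else p
```

The series truncation is adequate for moderate λ; robust software evaluates the normalizing constant in log space.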

We consider multi-state capture–recapture–recovery data where observed individuals are recorded in a set of possible discrete states. Traditionally, the Arnason–Schwarz model has been fitted to such data where the state process is modeled as a first-order Markov chain, though second-order models have also been proposed and fitted to data. However, low-order Markov models may not accurately represent the underlying biology. For example, specifying a (time-independent) first-order Markov process involves the assumption that the dwell time in each state (i.e., the duration of a stay in a given state) has a geometric distribution, and hence that the modal dwell time is one. Specifying time-dependent or higher-order processes provides additional flexibility, but at the expense of a potentially significant number of additional model parameters. We extend the Arnason–Schwarz model by specifying a semi-Markov model for the state process, where the dwell-time distribution is specified more generally, using, for example, a shifted Poisson or negative binomial distribution. A state expansion technique is applied in order to represent the resulting semi-Markov Arnason–Schwarz model in terms of a simpler and computationally tractable hidden Markov model. Semi-Markov Arnason–Schwarz models come with only a very modest increase in the number of parameters, yet permit a significantly more flexible state process. Model selection can be performed using standard procedures, and in particular via the use of information criteria. The semi-Markov approach allows for important biological inference to be drawn on the underlying state process, for example, on the times spent in the different states. The feasibility of the approach is demonstrated in a simulation study, before being applied to real data corresponding to house finches where the states correspond to the presence or absence of conjunctivitis.
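The state expansion technique can be illustrated for a single semi-Markov state with a dwell-time pmf supported on {1, …, m}: the state is replaced by m sub-states, each of which exits with the corresponding dwell-time hazard and otherwise advances to the next sub-state, so the expanded first-order chain reproduces the target dwell-time distribution exactly. A sketch under those assumptions (hypothetical function name, not the authors' implementation):

```python
import numpy as np

def expand_dwell_state(dwell_pmf):
    """Expanded sub-state structure for one semi-Markov state.
    dwell_pmf: probabilities of dwell times 1..m (summing to 1).
    Returns (stay, exit_probs): sub-state i exits with the dwell-time
    hazard c_i = p(i) / P(dwell >= i), else moves to sub-state i+1."""
    m = len(dwell_pmf)
    surv = 1.0                       # P(dwell >= i+1), running
    exit_probs = np.zeros(m)
    stay = np.zeros((m, m))          # within-state sub-transitions
    for i, p in enumerate(dwell_pmf):
        c = p / surv                 # hazard of leaving after i+1 steps
        exit_probs[i] = c
        if i < m - 1:
            stay[i, i + 1] = 1.0 - c
        surv -= p
    return stay, exit_probs
```

Because only the sub-state bookkeeping grows, the semi-Markov Arnason–Schwarz model stays a computationally tractable hidden Markov model while gaining flexible dwell-time distributions.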

In genome-wide gene–environment interaction (GxE) studies, a common strategy to improve power is to first conduct a filtering test and retain only the SNPs that pass the filtering in the subsequent GxE analyses. Inspired by two-stage tests and gene-based tests in GxE analysis, we consider the general problem of jointly testing a set of parameters when only a few are truly from the alternative hypothesis and when filtering information is available. We propose a unified set-based test that simultaneously considers filtering on individual parameters and testing on the set. We derive the exact distribution and approximate the power function of the proposed unified statistic in simplified settings, and use them to adaptively calculate the optimal filtering threshold for each set. In the context of gene-based GxE analysis, we show that although the empirical power function may be affected by many factors, the optimal filtering threshold corresponding to the peak of the power curve primarily depends on the size of the gene. We further propose a resampling algorithm to calculate *P*-values for each gene given the estimated optimal filtering threshold. The performance of the method is evaluated in simulation studies and illustrated via a genome-wide gene–gender interaction analysis using pancreatic cancer genome-wide association data.

Large-scale homogeneous discrete *p*-values are encountered frequently in high-throughput genomics studies, and the related multiple testing problems become challenging because most existing methods for the false discovery rate (FDR) assume continuous *p*-values. In this article, we study the estimation of the null proportion and FDR for discrete *p*-values with common support. In the finite sample setting, we propose a novel class of conservative FDR estimators. Furthermore, we show that a broad class of FDR estimators is simultaneously conservative over all support points under some weak dependence condition in the asymptotic setting. We further demonstrate the significant improvement of a newly proposed method over existing methods through simulation studies and a case study.
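The continuous-p-value baseline being generalized is the Storey-type plug-in: estimate the null proportion from p-values above a threshold λ, then plug it into the FDR estimate. A minimal sketch of that baseline (function names are hypothetical; for discrete p-values with common support, λ would need to be a support point and 1 − λ replaced by the null probability of exceeding λ, adjustments of the kind the article formalizes):

```python
def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the null proportion pi0: p-values above
    lam are predominantly nulls, which are uniform on (0, 1)."""
    n = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / ((1 - lam) * n))

def fdr_estimate(pvals, t, lam=0.5):
    """Estimated FDR of the rule 'reject when p <= t'."""
    n = len(pvals)
    rejected = max(1, sum(p <= t for p in pvals))
    return min(1.0, storey_pi0(pvals, lam) * n * t / rejected)
```

Applied naively to discrete p-values, this estimator can be anti-conservative, which motivates the conservative estimators studied in the article.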

Through the internet, citizen science programs can collect massive amounts of information on species abundance. However, these data are often difficult to use directly in statistical inference, as their collection is generally opportunistic, and the distribution of the sampling effort is often not known. In this article, we develop a general statistical framework to combine such “opportunistic data” with data collected using schemes characterized by a known sampling effort. Under some structural assumptions regarding the sampling effort and detectability, our approach makes it possible to estimate the relative abundance of several species in different sites. It can be implemented through a simple generalized linear model. We illustrate the framework with typical bird datasets from the Aquitaine region in south-western France. We show that, under some assumptions, our approach provides estimates that are more precise than the ones obtained from the dataset with a known sampling effort alone. When the opportunistic data are abundant, the gain in precision may be considerable, especially for rare species. We also show that estimates can be obtained even for species recorded only in the opportunistic scheme. Opportunistic data combined with a relatively small amount of data collected with a known effort may thus provide access to accurate and precise estimates of quantitative changes in relative abundance over space and/or time.