When functional data come as multiple curves per subject, characterizing the sources of variation is not a trivial problem. The problem becomes even more complex when phase variation is present in addition to amplitude variation. We consider the clustering problem for multivariate functional data that exhibit phase variation among the functional variables. We propose a conditional subject-specific warping framework in order to extract relevant features for clustering. Using multivariate growth curves of various parts of the body as a motivating example, we demonstrate the effectiveness of the proposed approach. The resulting clusters comprise individuals who show different relative growth patterns among different parts of the body.

Dose–response modeling in areas such as toxicology is often conducted using a parametric approach. While estimation of parameters is usually one of the goals, often the main aim of the study is the estimation of quantities derived from the parameters, such as the ED50 dose. From the viewpoint of statistical optimal design theory, such an objective corresponds to a *c*-optimal design criterion. Unfortunately, *c*-optimal designs often create practical problems, and furthermore commonly do not allow actual estimation of the parameters. It is therefore useful to consider alternative designs which show good *c*-performance, while still being applicable in practice and allowing reasonably good general parameter estimation. In optimal design terminology, this means that a reasonable performance regarding the *D*-criterion is expected as well. In this article, we propose several approaches to the task of combining *c*- and *D*-efficient designs, such as using mixed information functions or setting minimum requirements regarding either *c*- or *D*-efficiency, and show how to algorithmically determine optimal designs in each case. We apply all approaches to a standard situation from toxicology and obtain a much better balance between *c*- and *D*-performance. Next, we investigate how to adapt the designs to different parameter values. Finally, we show that the methodology used here is not limited to the combination of *c*- and *D*-designs, but can also be used to handle more general constrained situations, such as limits on the cost of an experiment.

A gene may be controlled by distal enhancers and repressors, not merely by regulatory elements in its promoter. Spatial organization of chromosomes is the mechanism that brings genes and their distal regulatory elements into close proximity. Recent molecular techniques, coupled with Next Generation Sequencing (NGS) technology, enable genome-wide detection of physical contacts between distant genomic loci. In particular, Hi-C is an NGS-aided assay for the study of genome-wide spatial interactions. The availability of such data makes it possible to reconstruct the underlying three-dimensional (3D) spatial chromatin structure. In this article, we present the Poisson Random effect Architecture Model (PRAM) for such an inference. The main feature of PRAM that separates it from previous methods is that it addresses the issue of over-dispersion and takes correlations among contact counts into consideration, thereby achieving greater consistency with observed data. PRAM was applied to Hi-C data to illustrate its performance and to compare the predicted distances with those measured by a Fluorescence In Situ Hybridization (FISH) validation experiment. Further, PRAM was compared to other methods in the literature based on both real and simulated data.

Dynamic treatment regimes (DTRs) are sequential decision rules that focus simultaneously on treatment individualization and adaptation over time. To directly identify the optimal DTR in a multi-stage multi-treatment setting, we propose a dynamic statistical learning method, adaptive contrast weighted learning. We develop semiparametric regression-based contrasts that adapt to the ordering of treatment effects for each patient at each stage; these adaptive contrasts reduce the optimization problem with multiple treatment comparisons to a weighted classification problem that can be solved by existing machine learning techniques. The algorithm is implemented recursively using backward induction. By combining doubly robust semiparametric regression estimators with machine learning algorithms, the proposed method is robust and efficient for identifying the optimal DTR, as shown in simulation studies. We illustrate the method using observational data on esophageal cancer.

Treatments are frequently evaluated in terms of their effect on patient survival. In settings where randomization of treatment is not feasible, observational data are employed, necessitating correction for covariate imbalances. Treatments are usually compared using a hazard ratio, and most existing methods that quantify the treatment effect through the survival function are applicable only to treatments assigned at time 0. In the data structure of interest here, subjects typically begin follow-up untreated; time-until-treatment and the pretreatment death hazard are both heavily influenced by longitudinal covariates; and subjects may experience periods of treatment ineligibility. We propose semiparametric methods for estimating the average difference in restricted mean survival time attributable to a time-dependent treatment, that is, the average effect of treatment among the treated under current treatment assignment patterns. The pretreatment and posttreatment models are partly conditional, in that they use the covariate history up to the time of treatment. The pretreatment model is estimated through recently developed landmark analysis methods. For each treated patient, fitted pretreatment and posttreatment survival curves are projected out and then averaged in a manner that accounts for the censoring of treatment times. Asymptotic properties are derived and evaluated through simulation. The proposed methods are applied to liver transplant data in order to estimate the effect of liver transplantation on survival among transplant recipients under current practice patterns.
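The restricted mean survival time (RMST) underlying this contrast is simply the area under the survival curve up to a horizon tau. As a minimal illustration of that quantity only (not the authors' estimator, which averages projected pre- and posttreatment curves over the treated), the following sketch computes the RMST of a step survival function:

```python
def rmst(times, surv, tau):
    """Restricted mean survival time: the area under a right-continuous
    step survival curve S(t) up to the horizon tau.

    `times` are the (increasing) event times at which S(t) drops, and
    `surv[i]` is the value of S(t) just after times[i]; S(0) = 1.
    """
    area, s_prev, t_prev = 0.0, 1.0, 0.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += s_prev * (t - t_prev)  # rectangle up to the next drop
        s_prev, t_prev = s, t
    area += s_prev * (tau - t_prev)    # remaining piece up to tau
    return area
```

A treatment effect on the RMST scale is then the difference of two such areas, one per fitted curve.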

The prior distribution is a key ingredient in Bayesian inference. Prior information on regression coefficients may come from different sources and may or may not conflict with the observed data. Various methods have been proposed to quantify a potential prior-data conflict, such as Box's *p*-value. However, there are no clear recommendations on how to react to possible prior-data conflict in generalized regression models. To address this deficiency, we propose to adaptively weight a prespecified multivariate normal prior distribution on the regression coefficients. To this end, we relate empirical Bayes estimates of the prior weight to Box's *p*-value and propose alternative fully Bayesian approaches. Prior weighting can be done for the joint prior distribution of the regression coefficients or—under prior independence—separately for prespecified blocks of regression coefficients. We outline how the proposed methodology can be implemented using integrated nested Laplace approximations (INLA) and illustrate its applicability with a Bayesian logistic regression model for data from a cross-sectional study. We also provide a simulation study that shows excellent performance of our approach, in terms of root mean squared error and coverage, in the case of prior misspecification. The Supplementary Materials give details on the software implementation and code, as well as another application to binary longitudinal data from a randomized clinical trial using a Bayesian generalized linear mixed model.

Meta-analysis has become a widely used tool for combining results from independent studies. The collected studies are homogeneous if they share a common underlying true effect size; otherwise, they are heterogeneous. A fixed-effect model is customarily used when the studies are deemed homogeneous, while a random-effects model is used for heterogeneous studies. Assessing heterogeneity in meta-analysis is therefore critical for model selection and decision making. Ideally, if heterogeneity is present, it should permeate the entire collection of studies rather than being limited to a small number of outlying studies. Outliers can have a great impact on conventional measures of heterogeneity and on the conclusions of a meta-analysis, yet no widely accepted guidelines exist for handling them. This article proposes several new heterogeneity measures that are less affected by outliers than the conventional ones. The performance of the proposed and conventional heterogeneity measures is compared theoretically, by studying their asymptotic properties, and empirically, using simulations and case studies.
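To make the outlier sensitivity concrete: the conventional measures in question include Cochran's Q and Higgins' I². The sketch below (the conventional measures, not the robust ones proposed in the article; the study values are invented for illustration) shows how a single outlying study inflates both:

```python
def q_and_i2(effects, variances):
    """Cochran's Q statistic and Higgins' I^2 heterogeneity index
    under inverse-variance (fixed-effect) weighting."""
    w = [1.0 / v for v in variances]
    theta = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)  # pooled estimate
    q = sum(wi * (ei - theta) ** 2 for wi, ei in zip(w, effects))
    k = len(effects)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0           # truncated at 0
    return q, i2

# One outlying study (0.90) inflates both conventional measures dramatically.
q_hom, i2_hom = q_and_i2([0.10, 0.12, 0.09, 0.11], [0.01] * 4)
q_out, i2_out = q_and_i2([0.10, 0.12, 0.09, 0.90], [0.01] * 4)
```

Here the homogeneous collection gives Q below its degrees of freedom (so I² is truncated to zero), while the single outlier pushes I² above 0.9.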

In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, this work is motivated by the problem of identifying structure in high-dimensional genomic data. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees—features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stably and reproducibly identifying biclusters, on simulated and real microarray data.

Many new experimental treatments benefit only a subset of the population. Identifying the baseline covariate profiles of patients who benefit from such a treatment, rather than determining whether or not the treatment has a population-level effect, can substantially lessen the risk in undertaking a clinical trial and expose fewer patients to treatments that do not benefit them. The standard analyses for identifying patient subgroups that benefit from an experimental treatment either do not account for multiplicity, or focus on testing for the presence of treatment–covariate interactions rather than the resulting individualized treatment effects. We propose a Bayesian *credible subgroups* method to identify two bounding subgroups for the benefiting subgroup: one for which it is likely that all members simultaneously have a treatment effect exceeding a specified threshold, and another for which it is likely that no members do. We examine frequentist properties of the credible subgroups method via simulations and illustrate the approach using data from an Alzheimer's disease treatment trial. We conclude with a discussion of the advantages and limitations of this approach to identifying patients for whom the treatment is beneficial.

Cocaine addiction is chronic and persistent, and has become a major social and health problem in many countries. Existing studies have shown that cocaine addicts often undergo episodic periods of addiction to, moderate dependence on, or swearing off cocaine. Given this reversibility, cocaine use can be formulated as a stochastic process that transits from one state to another, while the impacts of various factors on cocaine use, such as treatment received and individuals' psychological problems, may vary across states. This article develops a hidden Markov latent variable model to study multivariate longitudinal data on cocaine use from a California Civil Addict Program. The proposed model generalizes conventional latent variable models by allowing bidirectional transitions between cocaine-addiction states, and conventional hidden Markov models by incorporating latent variables and their dynamic interrelationships. We develop a maximum-likelihood approach, along with a Monte Carlo expectation conditional maximization (MCECM) algorithm, for parameter estimation. The asymptotic properties of the parameter estimates and statistics for testing the heterogeneity of model parameters are investigated. The finite-sample performance of the proposed methodology is demonstrated in simulation studies. The application to the cocaine use study provides insights into the prevention of cocaine use.

Joint modeling is increasingly popular for investigating the relationship between longitudinal and time-to-event data. However, numerical complexity often restricts this approach to linear models for the longitudinal part. Here, we use a novel development of the Stochastic-Approximation Expectation Maximization algorithm that allows joint models defined by nonlinear mixed-effects models. In the context of chemotherapy for metastatic prostate cancer, we show that a variety of patterns of Prostate Specific Antigen (PSA) kinetics can be captured by a mechanistic model defined by nonlinear ordinary differential equations. The mechanistic model predicts that biological quantities that cannot be observed, such as treatment-sensitive and treatment-resistant cells, may have a larger impact on survival than the PSA value. This suggests that mechanistic joint models could constitute a relevant approach for evaluating treatment efficacy and improving the prediction of survival in patients.

Understanding how aquatic species grow is fundamental in fisheries, because stock assessment often relies on growth-dependent statistical models. Length-frequency-based methods become important when data more suitable for growth model estimation are either unavailable or very expensive to collect. In this article, we develop a new framework for growth estimation from length-frequency data using a generalized von Bertalanffy growth model (VBGM) that allows time-dependent covariates to be incorporated. A finite mixture of normal distributions is used to model the length-frequency cohorts of each month, with the means constrained to follow a VBGM. The variances of the finite mixture components are constrained to be a function of mean length, reducing the number of parameters and allowing the variance to be estimated at any length. To optimize the likelihood, we use a minorization–maximization (MM) algorithm with a Nelder–Mead sub-step. This work was motivated by the decline in catches of the blue swimmer crab (BSC) (*Portunus armatus*) off the east coast of Queensland, Australia. We test the method in a simulation study and then apply it to the BSC fishery data.
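The building blocks described here (VBGM means, normal mixture components, and a variance that depends on mean length) can be sketched directly; the parameter values and the linear standard-deviation form below are illustrative choices, not estimates from the BSC fishery:

```python
import math

def vbgm_mean(age, l_inf, k, t0):
    """von Bertalanffy mean length-at-age: L(a) = L_inf * (1 - exp(-K * (a - t0)))."""
    return l_inf * (1.0 - math.exp(-k * (age - t0)))

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def length_frequency_density(x, ages, props, l_inf, k, t0, c0, c1):
    """Finite normal mixture for a month's length-frequency data: component
    means follow the VBGM, and component standard deviations are constrained
    to be a function (here linear, c0 + c1 * mean) of mean length-at-age."""
    total = 0.0
    for age, p in zip(ages, props):
        mu = vbgm_mean(age, l_inf, k, t0)
        total += p * normal_pdf(x, mu, c0 + c1 * mu)
    return total
```

Constraining the component standard deviations through the mean in this way is what reduces the parameter count: each cohort contributes a mixing proportion, but no free variance parameter.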

Our motivating application stems from surveys of natural populations and is characterized by large spatial heterogeneity in the counts, which makes parametric approaches to modeling local animal abundance too restrictive. We adopt a Bayesian nonparametric approach based on mixture models and innovate with respect to the popular Dirichlet process mixture of Poisson kernels by increasing model flexibility at the level of both the kernel and the nonparametric mixing measure. This allows us to derive accurate and robust estimates of the distribution of local animal abundance and of the corresponding clusters. The application and a simulation study of different scenarios also yield some general methodological implications. Adding flexibility solely at the level of the mixing measure does not improve inferences, since its impact is severely limited by the rigidity of the Poisson kernel, with considerable consequences in terms of bias. However, once a kernel more flexible than the Poisson is chosen, inferences can be robustified by choosing a prior more general than the Dirichlet process. Therefore, to improve the performance of Bayesian nonparametric mixtures for count data, one has to enrich the model simultaneously at both levels: the kernel and the mixing measure.

Joint models are used in ageing studies to investigate the association between longitudinal markers and a time-to-event, and have been extended to multiple markers and/or competing risks. The competing risk of death must be considered in the elderly because death and dementia share common risk factors. Moreover, in cohort studies, time-to-dementia is interval-censored, since dementia is assessed intermittently; subjects can thus develop dementia and die between two visits without being diagnosed. To study predementia cognitive decline, we propose a joint latent class model combining a (possibly multivariate) mixed model and an illness–death model that handles both interval censoring (by accounting for a possible unobserved transition to dementia) and semi-competing risks. Parameters are estimated by maximum likelihood, handling interval censoring. The correlation between the marker and the times-to-events is captured by latent classes: homogeneous sub-groups with specific risks of death and dementia and specific profiles of cognitive decline. We propose Markovian and semi-Markovian versions. Both approaches are compared to a joint latent-class model for competing risks through a simulation study, and applied in a prospective cohort study of cerebral and functional ageing to distinguish different profiles of cognitive decline associated with risks of dementia and death. The comparison highlights that, among subjects with dementia, mortality depends more on age than on duration of dementia. This model distinguishes the so-called terminal predeath decline (among healthy subjects) from the predementia decline.

The log-rank test is widely used to compare two survival distributions in a randomized clinical trial, while partial likelihood (Cox, 1975) is the method of choice for making inference about the hazard ratio under the Cox (1972) proportional hazards model. The Wald 95% confidence interval of the hazard ratio may include the null value of 1 when the *p*-value of the log-rank test is less than 0.05. Peto et al. (1977) provided an estimator for the hazard ratio based on the log-rank statistic; the corresponding 95% confidence interval excludes the null value of 1 if and only if the *p*-value of the log-rank test is less than 0.05. However, Peto's estimator is not consistent, and the corresponding confidence interval does not have correct coverage probability. In this article, we construct the confidence interval by inverting the score test under the (possibly stratified) Cox model, and we modify the variance estimator such that the resulting score test for the null hypothesis of no treatment difference is identical to the log-rank test in the possible presence of ties. Like Peto's method, the proposed confidence interval excludes the null value if and only if the log-rank test is significant. Unlike Peto's method, however, this interval has correct coverage probability. An added benefit of the proposed confidence interval is that it tends to be more accurate and narrower than the Wald confidence interval. We demonstrate the advantages of the proposed method through extensive simulation studies and a colon cancer study.

Interval-censored failure time data occur in many fields such as demography, economics, medical research, and reliability, and many inference procedures for them have been developed (Sun, 2006; Chen, Sun, and Peace, 2012). However, most existing approaches assume that the mechanism that yields interval censoring is independent of the failure time of interest, which may not be true in practice (Zhang et al., 2007; Ma, Hu, and Sun, 2015). In this article, we consider regression analysis of case *K* interval-censored failure time data when the censoring mechanism may be related to the failure time of interest. For this problem, we propose an estimated sieve maximum-likelihood approach for data arising from the proportional hazards frailty model and present a two-step procedure for estimation. In addition, the asymptotic properties of the proposed estimators of the regression parameters are established, and an extensive simulation study suggests that the method works well. Finally, we apply the method to the set of real interval-censored data that motivated this study.

Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. The problem becomes even more difficult because of the complications of modeling unknown interaction terms among high-dimensional variables. There is currently no variable selection method that overcomes these limitations. Hence, in this article we propose a variable selection approach developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches, including the linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of the nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as a function of the sparse scale parameters of the kernel machine, so as to recover the sparsity of the input variables, whose relevance to the response is measured by the scale parameters. We also provide the asymptotic properties of our approach, showing that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed, and a resampling procedure for the variable selection methodology is proposed to improve power.

The evaluation of cure fractions in oncology research under the well-known cure rate model has attracted considerable attention in the literature, but most existing testing procedures have relied on restrictive assumptions. A common assumption has been to restrict the cure fraction to a constant under alternatives to homogeneity, thereby neglecting any information from covariates. This article extends the literature by developing a score-based statistic that incorporates covariate information to detect cure fractions, with the existing testing procedure serving as a special case. A complication of this extension, however, is that the implied hypotheses are not typical, and the standard regularity conditions for conducting the test may not hold. Using empirical process arguments, we construct a sup-score test statistic for cure fractions and establish its limiting null distribution as a functional of mixtures of chi-square processes. In practice, we suggest a simple resampling procedure to approximate this limiting distribution. Our simulation results show that the proposed test can greatly improve efficiency over tests that neglect the heterogeneity of the cure fraction under the alternative. The practical utility of the methodology is illustrated using ovarian cancer survival data with long-term follow-up from the Surveillance, Epidemiology, and End Results (SEER) registry.

Recently, massive functional data have been widely collected over space, across sets of grid points, in various imaging studies. It is of scientific interest to correlate functional data with clinical variables, such as age and gender, in order to address questions of interest. The aim of this article is to develop a single-index varying coefficient (SIVC) model for establishing a varying association between functional responses (e.g., images) and a set of covariates; the model enjoys several unique features of both varying-coefficient and single-index models. An estimation procedure is developed to estimate the varying coefficient functions, the index function, and the covariance function of the individual functions. The optimal integration of information across different grid points is systematically delineated, and the asymptotic properties (e.g., consistency and convergence rate) of all estimators are examined. Simulation studies are conducted to assess the finite-sample performance of the proposed estimation procedure. Furthermore, a real data analysis of a white matter tract dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study confirms the advantage and accuracy of the SIVC model over the popular varying coefficient model.

We consider the problem of selecting covariates in a spatial regression model when the response is binary. Penalized likelihood-based approaches have proved effective for simultaneous variable selection and estimation. In the context of a spatially dependent binary variable, a uniquely interpretable likelihood is not available; rather, a quasi-likelihood may be more suitable. We develop a penalized quasi-likelihood with spatial dependence for simultaneous variable selection and parameter estimation, along with an efficient computational algorithm. The theoretical properties, including asymptotic normality and consistency, are studied under the increasing-domain asymptotics framework. An extensive simulation study is conducted to validate the methodology, and real data examples are provided for illustration. Although theoretical justification is not provided, we also investigate the empirical performance of the proposed penalized quasi-likelihood approach for spatial count data, to explore the suitability of the method for a general exponential family of distributions.

For the classical, homoscedastic measurement error model, moment reconstruction (Freedman et al., 2004, 2008) and moment-adjusted imputation (Thomas et al., 2011) are appealing, computationally simple imputation-like methods for general model fitting. As in classical regression calibration, the idea is to replace the unobserved variable subject to measurement error with a proxy that can be used in a variety of analyses. Moment reconstruction and moment-adjusted imputation differ from regression calibration in that they attempt to match multiple features of the latent variable, as well as some of the latent variable's relationships with the response and additional covariates. In this note, we consider a problem where true exposure is generated by a complex, nonlinear random effects modeling process, and develop analogues of moment reconstruction and moment-adjusted imputation for this case. The general model accommodates classical measurement errors, Berkson measurement errors, mixtures of Berkson and classical errors, and even problems that are not measurement error problems at all. The methods are illustrated using the National Institutes of Health–AARP Diet and Health Study, where the latent variable is a dietary pattern score called the Healthy Eating Index-2005. We also show how our general model includes methods used in radiation epidemiology as a special case. Simulations are used to illustrate the methods.

The peptide microarray immunoassay simultaneously screens sample serum against thousands of peptides, determining the presence of antibodies bound to array probes. Peptide microarrays tiling immunogenic regions of pathogens (e.g., envelope proteins of a virus) are an important high throughput tool for querying and mapping antibody binding. Because of the assay's many steps, from probe synthesis to incubation, peptide microarray data can be noisy with extreme outliers. In addition, subjects may produce different antibody profiles in response to an identical vaccine stimulus or infection, due to variability among subjects’ immune systems. We present a robust Bayesian hierarchical model for peptide microarray experiments, pepBayes, to estimate the probability of antibody response for each subject/peptide combination. Heavy-tailed error distributions accommodate outliers and extreme responses, and tailored random effect terms automatically incorporate technical effects prevalent in the assay. We apply our model to two vaccine trial data sets to demonstrate model performance. Our approach enjoys high sensitivity and specificity when detecting vaccine induced antibody responses. A simulation study shows an adaptive thresholding classification method has appropriate false discovery rate control with high sensitivity, and receiver operating characteristics generated on vaccine trial data suggest that pepBayes clearly separates responses from non-responses.

In many classical estimation problems, the parameter space has a boundary, and in most cases the standard asymptotic properties of the estimator do not hold when some of the underlying true parameters lie on that boundary. However, without knowledge of the true parameter values, confidence intervals constructed assuming that the parameters lie in the interior are generally over-conservative. A penalized estimation method is proposed in this article to address this issue. An adaptive lasso procedure is employed to shrink the parameters to the boundary, yielding oracle inference that adapts to whether or not the true parameters lie on the boundary. When the true parameters are on the boundary, the inference is equivalent to that which would be achieved with a priori knowledge of the boundary; when they are not, it is equivalent to that obtained in the interior of the parameter space. The method is demonstrated in two practical scenarios: the frailty survival model and linear regression with order-restricted parameters. Simulation studies and real data analyses show that the method performs well at realistic sample sizes and exhibits certain advantages over standard methods.

Semiparametric methods are often used for the estimation of intervention effects on correlated outcomes in cluster-randomized trials (CRTs). When outcomes are missing at random (MAR), inverse probability weighted (IPW) methods incorporating baseline covariates can be used to deal with informative missingness, while augmented generalized estimating equations (AUG) correct for imbalance in baseline covariates but need to be extended for MAR outcomes. However, in the presence of interactions between treatment and baseline covariates, neither method alone produces consistent estimates of the marginal treatment effect if the interaction model is not correctly specified. We propose an AUG–IPW estimator that weights by the inverse of the probability of being a complete case and allows different outcome models in each intervention arm. This estimator is doubly robust (DR): it gives correct estimates whenever either the missing data process or the outcome model is correctly specified. We also consider the problem of covariate interference, which arises when the outcome of an individual may depend on the covariates of other individuals. When interfering covariates are not modeled, the DR property prevents bias as long as covariate interference is not present simultaneously for the outcome and the missingness. An R package implementing the proposed method has been developed. An extensive simulation study and an application to a CRT of an HIV risk-reduction intervention in South Africa illustrate the method.
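The double robustness invoked here can be illustrated with a scalar analogue: a generic augmented IPW mean estimator for outcomes missing at random (a textbook sketch, not the paper's cluster-level AUG–IPW estimator):

```python
def dr_mean(y, r, pi_hat, m_hat):
    """Augmented IPW (doubly robust) estimate of E[Y] when some outcomes
    are missing at random: r[i] = 1 if y[i] is observed, pi_hat[i] is the
    estimated probability of being a complete case, and m_hat[i] is the
    outcome-model prediction for subject i.  The estimate is consistent
    if *either* pi_hat or m_hat is correctly specified."""
    n = len(r)
    total = 0.0
    for i in range(n):
        ipw = r[i] * y[i] / pi_hat[i] if r[i] else 0.0   # weighted complete case
        aug = (r[i] - pi_hat[i]) / pi_hat[i] * m_hat[i]  # augmentation term
        total += ipw - aug
    return total / n
```

Two algebraic sanity checks: with no missingness and `pi_hat = 1` the estimator collapses to the plain sample mean whatever `m_hat` says, and a missing subject contributes only through the outcome model, via `-aug`.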

The ready availability of public-use data from various large national complex surveys has immense potential for the assessment of population characteristics using regression models; for instance, complex surveys can be used to identify risk factors for important diseases such as cancer. However, existing statistical methods based on estimating equations and/or resampling are often not valid for survey data because of complex survey design features, namely stratification, multistage sampling, and weighting. In this article, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a double-transform-both-sides (DTBS)-based estimating equations approach to estimate the median regression parameters of the highly skewed response; the DTBS approach applies the same Box–Cox type transformation twice to both the outcome and the regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudo-likelihood approach based on minimizing absolute deviations (MAD). Furthermore, our approach is relatively robust to the true underlying distribution and has much smaller mean squared error than the MAD approach. The method is motivated by an analysis of laboratory data on urinary iodine (UI) concentration from the National Health and Nutrition Examination Survey.
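The key property of a transform-both-sides scheme is that applying the same monotone map to the outcome and the regression function preserves the median relationship. A schematic sketch of a doubly composed Box–Cox map follows; the `1.0 +` shift that keeps the inner value positive is an illustrative device, not the paper's exact transform:

```python
import math

def box_cox(u, lam):
    """Box-Cox transform of u > 0; reduces to log(u) in the limit lam -> 0."""
    return math.log(u) if abs(lam) < 1e-8 else (u ** lam - 1.0) / lam

def dtbs_residual(y, mu, lam1, lam2):
    """Double-transform-both-sides residual: the same composed map is applied
    to the outcome y and to the regression function mu (both assumed positive
    and bounded away from zero here), so the residual is zero exactly at
    y == mu and its sign tracks whether y lies above or below mu."""
    def g(u):
        return box_cox(1.0 + box_cox(u, lam1), lam2)  # illustrative composition
    return g(y) - g(mu)
```

Because the composed map is strictly increasing, median regression on the transformed scale corresponds to median regression on the original scale, which is what lets the sandwich variance be used without resampling.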

Construction of confidence sets for the optimal factor levels is an important topic in response surface methodology. Wan et al. (2015) provided an exact confidence set for a maximum or minimum point (i.e., an optimal factor level) of a univariate polynomial function in a given interval. In this article, we extend that method to construct an exact confidence set for the optimal factor levels of response surfaces. The construction method is readily applied to many parametric and semiparametric regression models involving a quadratic function. A conservative confidence set is provided as an intermediate step in the construction of the exact confidence set. Two examples are given to illustrate the application of the confidence sets. The comparison between confidence sets indicates that our exact confidence set is better than the only other confidence set available in the statistical literature that guarantees the confidence level.

The focus of this article is on the nature of the likelihood associated with *N*-mixture models for repeated count data. It is shown that the infinite sum embedded in the likelihood associated with the Poisson mixing distribution can be expressed in terms of a hypergeometric function and, thence, in closed form. The resultant expression for the likelihood can be readily computed to a high degree of accuracy and is algebraically tractable. Specifically, the likelihood equations can be simplified to some advantage, the concentrated likelihood in the probability of detection can be formulated, and problematic cases can be identified. The results are illustrated by means of a simulation study and a real-world example. The study is extended to *N*-mixture models with a negative binomial mixing distribution, and results similar to those for the Poisson case are obtained. *N*-mixture models with mixing distributions which accommodate excess zeros and, separately, with a beta-binomial distribution rather than a binomial used to model the intra-site counts are also investigated. However, the results for these settings, while computationally attractive, do not provide insight into the nature of the maximum likelihood estimates.
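For concreteness, the infinite sum in question can be written down for a single site and truncated naively; the closed-form hypergeometric evaluation derived in the article is not reproduced here, and λ, p, and the truncation point K below are illustrative:

```python
import math

def pois_pmf(n, lam):
    return math.exp(-lam) * lam ** n / math.factorial(n)

def binom_pmf(y, n, p):
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def nmix_site_likelihood(counts, lam, p, K=100):
    """Poisson N-mixture likelihood contribution of one site: the sum
    over the latent abundance N of Pois(N; lam) times the product of
    Binom(y_t; N, p) over the repeated counts y_t.  Truncating at K is
    the naive alternative to the article's closed-form expression."""
    return sum(
        pois_pmf(N, lam) * math.prod(binom_pmf(y, N, p) for y in counts)
        for N in range(max(counts), K + 1)
    )
```

For a single visit the sum collapses analytically (e.g., to exp(-λp) when the count is zero), which gives a quick correctness check on the truncation.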

Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negative correlation pairs when compared to the standard approaches.
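The attenuation at low counts is easy to reproduce in a small simulation; the sample size, seed, and rate scale below are arbitrary, and PCAN itself (which estimates correlation at the natural-parameter level) is not implemented here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# strongly correlated natural parameters (log-rates) for two features
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
rates = np.exp(z - 3.0)            # low means, mimicking sparse sequencing counts
counts = rng.poisson(rates)

corr_latent = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
corr_counts = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]
# the sample Pearson correlation of the raw counts is pulled toward zero
print(f"latent: {corr_latent:.2f}, counts: {corr_counts:.2f}")
```

The Poisson sampling noise dominates the small latent signal at these count levels, so the count-scale correlation lands far below the latent 0.9.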

An intermediate response measure that accurately predicts efficacy in a new setting can reduce trial cost and time to product licensure. In this article, we define a *trial level general surrogate*, which is an intermediate response that can be used to accurately predict efficacy in a new setting. Methods for evaluating general surrogates have been developed previously, and many methods in the literature use trial level intermediate responses for prediction. However, existing methods focus on surrogate evaluation and prediction in new settings rather than on comparison of candidate general surrogates, and few formalize the use of cross-validation to quantify the expected prediction error. Our proposed method uses Bayesian non-parametric modeling and cross-validation to estimate the absolute prediction error for use in evaluating and comparing candidate trial level general surrogates. Simulations show that our method performs well across a variety of scenarios. We use our method to evaluate and to compare candidate trial level general surrogates in several multi-national trials of a pentavalent rotavirus vaccine. We identify at least one immune measure that has potential value as a trial level general surrogate and use it to predict efficacy in a new trial where the clinical outcome was not measured.

Identification of novel biomarkers for risk prediction is important for disease prevention and optimal treatment selection. However, studies aiming to discover which biomarkers are useful for risk prediction often require the use of stored biological samples from large assembled cohorts, and thus the depletion of a finite and precious resource. To make efficient use of such stored samples, two-phase sampling designs are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare. Existing methods for analyzing data from two-phase studies focus primarily on single marker analysis or fitting the Cox regression model to combine information from multiple markers. However, the Cox model may not fit the data well. Under model misspecification, the composite score derived from the Cox model may not perform well in predicting the outcome. Under a general two-phase stratified cohort sampling design, we present a novel approach to combining multiple markers to optimize prediction by fitting a flexible nonparametric transformation model. Using inverse probability weighting to account for the outcome-dependent sampling, we propose to estimate the model parameters by maximizing an objective function which can be interpreted as a weighted C-statistic for survival outcomes. Regardless of model adequacy, the proposed procedure yields a sensible composite risk score for prediction. A major obstacle to making inference in two-phase studies is the correlation induced by finite-population sampling, which prevents standard inference procedures such as the bootstrap from being used for variance estimation. We propose a resampling procedure to derive valid confidence intervals for the model parameters and the C-statistic accuracy measure.
We illustrate the new methods with simulation studies and an analysis of a two-phase study of high-density lipoprotein cholesterol (HDL-C) subtypes for predicting the risk of coronary heart disease.
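A stripped-down illustration of the weighted C-statistic objective, assuming fully observed event times; the article's version additionally handles censoring and the two-phase sampling correlation:

```python
import numpy as np

def ipw_c_statistic(score, time, weight):
    """IPW-weighted concordance: among pairs where subject i fails
    before subject j, the weighted fraction in which i also carries
    the higher risk score.  Each pair is weighted by the product of
    the two subjects' inverse-probability sampling weights."""
    score = np.asarray(score, float)
    time = np.asarray(time, float)
    weight = np.asarray(weight, float)
    num = den = 0.0
    for i in range(len(time)):
        for j in range(len(time)):
            if time[i] < time[j]:             # comparable pair: i fails first
                w = weight[i] * weight[j]
                den += w
                num += w * float(score[i] > score[j])
    return num / den
```

With all weights equal this reduces to the usual C-statistic, which is why maximizing it still yields a sensible risk score under model misspecification.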

At a time of climate change and major loss of biodiversity, it is important to have efficient tools for monitoring populations. In this context, animal abundance indices play an important role. In producing indices for invertebrates, it is important to account for variation in counts within seasons. Two new methods for describing seasonal variation in invertebrate counts have recently been proposed; one is nonparametric, using generalized additive models, and the other is parametric, based on stopover models. We present a novel generalized abundance index which encompasses both parametric and nonparametric approaches. The index is extremely efficient to compute, due to the use of concentrated likelihood techniques. This has particular relevance for the analysis of data from long-term extensive monitoring schemes with records for many species and sites, for which existing modeling techniques can be prohibitively time consuming. Performance of the index is demonstrated by several applications to UK Butterfly Monitoring Scheme data. We demonstrate the potential for new insights into both phenology and spatial variation in seasonal patterns from parametric modeling and the incorporation of covariate dependence, which is relevant for both monitoring and conservation. Associated R code is available on the journal website.

It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroups. Although trial-and-error and one-size-fits-all approaches to treatment selection remain a common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less focus has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are nonlinear. Furthermore, the interaction-based test is scale-dependent and may fail to capture markers useful for predicting individualized treatment differences. In this article, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment difference. Simulation studies show that our proposed KM-based score test is more powerful than the Wald test when there are nonlinear effects among the predictors and when the outcome is binary with nonlinear link functions. Furthermore, when there is high correlation among predictors and the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials.
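The flavor of a kernel machine score test can be sketched as a quadratic form in the null-model residuals. The null model, kernel, and data below are illustrative placeholders, and the calibration of the statistic's null distribution (a mixture of chi-squares in the usual KM setup) is omitted:

```python
import numpy as np

def km_score_statistic(y, X, Z, kernel=lambda A, B: (1.0 + A @ B.T) ** 2):
    """Kernel-machine score statistic Q = r' K r: residuals r from a
    null linear model in the clinical covariates X, and a kernel matrix
    K over the candidate markers Z.  The quadratic kernel here is an
    arbitrary illustrative choice that captures nonlinear marker
    effects without specifying their functional form."""
    y = np.asarray(y, float)
    Xd = np.column_stack([np.ones(len(y)), np.asarray(X, float)])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta                         # null-model residuals
    K = kernel(np.asarray(Z, float), np.asarray(Z, float))
    return float(r @ K @ r)
```

Because the marker effect enters only through the kernel, the statistic is invariant to monotone rescalings that would change a Wald interaction test, which is one motivation for the KM formulation.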

This article considers nonparametric methods for studying recurrent disease and death with competing risks. We first point out that comparisons based on the well-known cumulative incidence function can be confounded by different prevalence rates of the competing events, and that comparisons of the conditional distribution of the survival time given the failure event type are more relevant for investigating the prognosis of different patterns of recurrent disease. We then propose nonparametric estimators for the conditional cumulative incidence function as well as the conditional bivariate cumulative incidence function for the bivariate gap times, that is, the time to disease recurrence and the residual lifetime after recurrence. To quantify the association between the two gap times in the competing risks setting, a modified Kendall's tau statistic is proposed. The proposed estimators for the conditional bivariate cumulative incidence distribution and the association measure account for the induced dependent censoring for the second gap time. Uniform consistency and weak convergence of the proposed estimators are established. Hypothesis testing procedures for two-sample comparisons are discussed. Numerical simulation studies with practical sample sizes are conducted to evaluate the performance of the proposed nonparametric estimators and tests. An application to data from a pancreatic cancer study is presented to illustrate the methods developed in this article.

The availability of data in longitudinal studies is often driven by features of the characteristics being studied. For example, clinical databases are increasingly being used for research to address longitudinal questions. Because visit times in such data are often driven by patient characteristics that may be related to the outcome being studied, the danger is that this will result in biased estimation compared to designed, prospective studies. We study longitudinal data that follow a generalized linear mixed model and use a log link to relate an informative visit process to random effects in the mixed model. This device allows us to elucidate which parameters are biased under the informative visit process and to what degree. We show that the informative visit process can badly bias estimators of parameters of covariates associated with the random effects, while allowing consistent estimation of other parameters.

Alzheimer's disease (AD) is usually diagnosed by clinicians through cognitive and functional performance tests, with a potential risk of misdiagnosis. Since the progression of AD is known to cause structural changes in the corpus callosum (CC), CC thickness can be used as a functional covariate in the AD classification problem to aid diagnosis. However, misclassified class labels negatively impact classification performance. Motivated by AD–CC association studies, we propose a logistic regression for functional data classification that is robust to misdiagnosis or label noise. Specifically, our model is constructed by adding individual intercepts to the functional logistic regression model. This approach makes it possible to flag observations that are possibly mislabeled and also leads to a robust and efficient classifier. An efficient MM algorithm provides simple closed-form update formulas. We test our method on synthetic datasets to demonstrate its superiority over an existing method, and apply it to differentiating patients with AD from healthy controls based on CC thickness measured from MRI.

In this article, we develop new methods for estimating average treatment effects in observational studies, in settings with more than two treatment levels, assuming unconfoundedness given pretreatment variables. We emphasize propensity score subclassification and matching methods which have been among the most popular methods in the binary treatment literature. Whereas the literature has suggested that these particular propensity-based methods do not naturally extend to the multi-level treatment case, we show, using the concept of weak unconfoundedness and the notion of the generalized propensity score, that adjusting for a scalar function of the pretreatment variables removes all biases associated with observed pretreatment variables. We apply the proposed methods to an analysis of the effect of treatments for fibromyalgia. We also carry out a simulation study to assess the finite sample performance of the methods relative to previously proposed methods.

A clinical trial with a factorial design involves randomization of subjects to treatment *A* or its control and, within each group, further randomization to treatment *B* or its control. Under this design, one can assess the effects of treatments *A* and *B* on a clinical endpoint using all patients. One may additionally compare treatment *A*, treatment *B*, or the combination therapy to control. With multiple comparisons, however, it may be desirable to control the overall type I error, especially for regulatory purposes. Because the subjects overlap in the comparisons, the test statistics are generally correlated. By accounting for the correlations, one can achieve higher statistical power compared to the conventional Bonferroni correction. Herein, we derive the correlation between any two (stratified or unstratified) log-rank statistics for a factorial design with a survival time endpoint, such that the overall type I error for multiple treatment comparisons can be properly controlled. In addition, we allow for adjustment of prognostic factors in the treatment comparisons and conduct simultaneous inference on the effect sizes. We use simulation studies to show that the proposed methods perform well in realistic situations. We then provide an application to a recently completed randomized controlled clinical trial on alcohol dependence. Finally, we discuss extensions of our approach to other factorial designs and multiple endpoints.

Applications of circular regression models appear in many different fields such as evolutionary psychology, motor behavior, biology, and, in particular, in the analysis of gene expressions in oscillatory systems. Specifically, for the gene expression problem, a researcher may be interested in modeling the relationship among the phases of cell-cycle genes in two species with differing periods. This challenging problem reduces to the problem of constructing a piecewise circular regression model and, with this objective in mind, we propose a flexible circular regression model which allows different parameter values depending on sectors along the circle. We give a detailed interpretation of the parameters in the model and provide maximum likelihood estimators. We also provide a model selection procedure based on the concept of generalized degrees of freedom. The model is then applied to the analysis of two different cell-cycle data sets and through these examples we highlight the power of our new methodology.

Motivated by an ongoing pediatric mental health care (PMHC) study, this article presents weakly structured methods for analyzing doubly censored recurrent event data where only coarsened information on censoring is available. The study extracted administrative records of emergency department visits from provincial health administrative databases. The available information of each individual subject is limited to a subject-specific time window determined up to concealed data. To evaluate time-dependent effects of exposures, we adapt local linear estimation with right-censored survival times under the Cox regression model with time-varying coefficients (cf. Cai and Sun, *Scandinavian Journal of Statistics* 2003, **30**, 93–111). We establish the pointwise consistency and asymptotic normality of the regression parameter estimator, and examine its performance by simulation. The PMHC study illustrates the proposed approach throughout the article.

Potential reductions in laboratory assay costs afforded by pooling equal aliquots of biospecimens have long been recognized in disease surveillance and epidemiological research and, more recently, have motivated design and analytic developments in regression settings. For example, Weinberg and Umbach (1999, *Biometrics* **55**, 718–726) provided methods for fitting set-based logistic regression models to case-control data when a continuous exposure variable (e.g., a biomarker) is assayed on pooled specimens. We focus on improving estimation efficiency by utilizing available subject-specific information at the pool allocation stage. We find that a strategy that we call “(y,**c**)-pooling,” which forms pooling sets of individuals within strata defined jointly by the outcome and other covariates, provides more precise estimation of the risk parameters associated with those covariates than does pooling within strata defined only by the outcome. We review the approach to set-based analysis through offsets developed by Weinberg and Umbach in a recent correction to their original paper. We propose a method for variance estimation under this design and use simulations and a real-data example to illustrate the precision benefits of (y,**c**)-pooling relative to y-pooling. We also note and illustrate that set-based models permit estimation of covariate interactions with exposure.

In applying scan statistics for public health research, it would be valuable to develop a detection method for multiple clusters that accommodates spatial correlation and covariate effects in an integrated model. In this article, we connect the concepts of the likelihood ratio (LR) scan statistic and the quasi-likelihood (QL) scan statistic to provide a series of detection procedures sufficiently flexible to apply to clusters of arbitrary shape. First, we use an independent scan model for detection of clusters and then a variogram tool to examine the existence of spatial correlation and regional variation based on residuals of the independent scan model. When the estimate of regional variation is significantly different from zero, a mixed QL estimating equation is developed to estimate coefficients of geographic clusters and covariates. We use the Benjamini–Hochberg (1995) procedure to find a threshold for *p*-values to address the multiple testing problem. A quasi-deviance criterion is used to regroup the estimated clusters to find geographic clusters with arbitrary shapes. We conduct simulations to compare the performance of the proposed method with other scan statistics. For illustration, the method is applied to enterovirus data from Taiwan.
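The Benjamini–Hochberg step is standard; a minimal sketch of the step-up threshold as it would be applied to the cluster *p*-values:

```python
import numpy as np

def bh_threshold(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: with p-values sorted as
    p(1) <= ... <= p(m), find the largest k with p(k) <= k*q/m and
    reject every hypothesis with a p-value at or below p(k).
    Returns 0.0 when nothing is rejected."""
    p = np.sort(np.asarray(pvals, float))
    m = len(p)
    ok = p <= q * np.arange(1, m + 1) / m
    return float(p[ok].max()) if ok.any() else 0.0
```

Note the step-up character: a p-value may be rejected even though it exceeds its own rank-wise bound, provided some larger p-value meets its bound.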

Measuring the similarity between genes is often the starting point for building gene regulatory networks. Most similarity measures used in practice consider only pairwise information, with a few also considering network structure. Although the theoretical properties of pairwise measures are well understood in the statistics literature, little is known about the statistical properties of similarity measures based on network structure. In this article, we consider a new whole-genome network-based similarity measure, called *CCor*, that makes use of information on all the genes in the network. We derive a concentration inequality for *CCor* and compare it with the commonly used Pearson correlation coefficient for inferring network modules. Both theoretical analysis and a real data example demonstrate the advantages of *CCor* over existing measures for inferring gene modules.

We introduce a new Bayesian nonparametric method for estimating the size of a closed population from multiple-recapture data. Our method, based on Dirichlet process mixtures, can accommodate complex patterns of heterogeneity of capture, and can transparently modulate its complexity without a separate model selection step. Additionally, it can handle the massively sparse contingency tables generated by a large number of recaptures with moderate sample sizes. We develop an efficient and scalable MCMC algorithm for estimation. We apply our method to simulated data, and to two examples from the literature on estimating casualties in armed conflicts.

Longitudinal covariates in survival models are generally analyzed using random effects models. By framing the estimation of these survival models as a functional measurement error problem, semiparametric approaches such as the conditional score or the corrected score can be applied to find consistent estimators for survival model parameters without distributional assumptions on the random effects. However, in order to satisfy the standard assumptions of a survival model, the semiparametric methods in the literature only use covariate data before each event time. This suggests that these methods may make inefficient use of the longitudinal data. We propose an extension of these approaches that follows a generalization of the Rao–Blackwell theorem. A Monte Carlo error augmentation procedure is developed to utilize the entirety of the longitudinal information available. The efficiency improvement of the proposed semiparametric approach is confirmed theoretically and demonstrated in a simulation study. A real data set is analyzed as an illustration of a practical application.

Array-based CGH experiments are designed to detect genomic aberrations or regions of DNA copy-number variation that are associated with an outcome, typically a state of disease. Most existing statistical methods target the detection of DNA copy-number variations in a single sample or array. We focus on the detection of group effect variation, through simultaneous study of multiple samples from multiple groups. Rather than using direct segmentation or smoothing techniques, as commonly seen in existing detection methods, we develop a sequential model selection procedure that is guided by a modified Bayesian information criterion. This approach improves detection accuracy by cumulatively utilizing information across contiguous clones, and has a computational advantage over existing popular detection methods. Our empirical investigation suggests that the performance of the proposed method is superior to that of the existing detection methods, in particular, in detecting small segments or separating neighboring segments with differential degrees of copy-number variation.

Meta-analysis of trans-ethnic genome-wide association studies (GWAS) has proven to be a practical and profitable approach for identifying loci that contribute to the risk of complex diseases. However, the expected genetic effect heterogeneity cannot easily be accommodated through existing fixed-effects and random-effects methods. In response, we propose a novel random effect model for trans-ethnic meta-analysis with flexible modeling of the expected genetic effect heterogeneity across diverse populations. Specifically, we adopt a modified random effect model from the kernel regression framework, in which genetic effect coefficients are random variables whose correlation structure reflects the genetic distances across ancestry groups. In addition, we use the adaptive variance component test to achieve robust power regardless of the degree of genetic effect heterogeneity. Simulation studies show that our proposed method has well-calibrated type I error rates at very stringent significance levels and can improve power over the traditional meta-analysis methods. We reanalyzed the published type 2 diabetes GWAS meta-analysis (Consortium et al., 2014) and successfully identified one additional SNP that clearly exhibits genetic effect heterogeneity across different ancestry groups. Furthermore, our proposed method scales to genome-wide datasets: an analysis of one million SNPs requires less than 3 hours.

Semi-competing risks data are often encountered in chronic disease follow-up studies that record both nonterminal events (e.g., disease landmark events) and terminal events (e.g., death). Studying the relationship between the nonterminal event and the terminal event can provide insightful information on disease progression. In this article, we propose a new sensible dependence measure tailored to addressing such an interest. We develop a nonparametric estimator, which is general enough to handle both independent right censoring and left truncation. Our strategy of connecting the new dependence measure with quantile regression enables a natural extension to adjust for covariates with minor additional assumptions imposed. We establish the asymptotic properties of the proposed estimators and develop inferences accordingly. Simulation studies suggest good finite-sample performance of the proposed methods. Our proposals are illustrated via an application to Denmark diabetes registry data.

Individual covariates are commonly used in capture–recapture models as they can provide important information for population size estimation. However, in practice, one or more covariates may be missing at random for some individuals, which can lead to unreliable inference if records with missing data are treated as missing completely at random. We show that, in general, such a naive complete-case analysis in closed capture–recapture models with some covariates missing at random underestimates the population size. We develop methods for estimating regression parameters and population size using regression calibration, inverse probability weighting, and multiple imputation without any distributional assumptions about the covariates. We show that the inverse probability weighting and multiple imputation approaches are asymptotically equivalent. We present a simulation study to investigate the effects of missing covariates and to evaluate the performance of the proposed methods. We also illustrate an analysis using data on the bird species yellow-bellied prinia collected in Hong Kong.

Motivated by ultrahigh-dimensional biomarker screening studies, we propose a model-free screening approach tailored to censored lifetime outcomes. Our proposal is built upon the introduction of a new measure, the survival impact index (SII). By its design, SII sensibly captures the overall influence of a covariate on the outcome distribution, and can be estimated with familiar nonparametric procedures that do not require smoothing and are readily adaptable to handle lifetime outcomes under various censoring and truncation mechanisms. We provide large sample distributional results that facilitate the inference on SII in classical multivariate settings. More importantly, we investigate SII as an effective screener for ultrahigh-dimensional data, not relying on rigid regression model assumptions for real applications. We establish the sure screening property of the proposed SII-based screener. Extensive numerical studies are carried out to assess the performance of our method compared with other existing screening methods. A lung cancer microarray data set is analyzed to demonstrate the practical utility of our proposals.

The usefulness of meta-analysis has been recognized in the evaluation of drug safety, as a single trial usually yields few adverse events and offers limited information. For rare events, conventional meta-analysis methods may yield an invalid inference, as they often rely on large sample theories and require empirical corrections for zero events. These problems motivate research in developing *exact* methods, including Tian et al.'s method of combining confidence intervals (2009, *Biostatistics*, **10**, 275–281) and Liu et al.'s method of combining *p*-value functions (2014, *JASA*, **109**, 1450–1465). This article shows that these two exact methods can be unified under the framework of combining confidence distributions (CDs). Furthermore, we show that the CD method generalizes Tian et al.'s method in several aspects. Given that the CD framework also subsumes the Mantel–Haenszel and Peto methods, we conclude that the CD method offers a general framework for meta-analysis of rare events. We illustrate the CD framework using two real data sets collected for the safety analysis of diabetes drugs.
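To fix ideas, one member of the CD-combination framework combines study-level confidence distributions through normal quantiles (a Stouffer-type recipe). This is only an illustrative combiner; the rare-event methods discussed here rest on exact study-level CDs rather than the normal recipe sketched below.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def combine_cds(cd_values, weights):
    """Combine study-level confidence distribution values H_i(theta),
    all evaluated at the same theta, into a single CD value:
    H(theta) = Phi( sum_i w_i Phi^{-1}(H_i(theta)) / sqrt(sum_i w_i^2) )."""
    num = sum(w * _N.inv_cdf(h) for h, w in zip(cd_values, weights))
    den = math.sqrt(sum(w * w for w in weights))
    return _N.cdf(num / den)
```

Different choices of combining function and weights recover different classical procedures, which is the sense in which the CD framework unifies them.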

Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify *multiple regulators* or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.

Combining multiple studies is frequently undertaken in biomedical research to increase sample sizes for statistical power improvement. We consider the marginal model for the regression analysis of repeated measurements collected in several similar studies with potentially different variances and correlation structures. It is of great importance to examine whether there exist common parameters across study-specific marginal models so that simpler models, sensible interpretations, and meaningful efficiency gain can be obtained. Combining multiple studies via the classical means of hypothesis testing involves a large number of simultaneous tests for all possible subsets of common regression parameters, which results in unduly large degrees of freedom and low statistical power. We develop a new method of *fused lasso with the adaptation of parameter ordering* (FLAPO) to scrutinize only adjacent-pair parameter differences, leading to a substantial reduction in the number of involved constraints. Our method enjoys the oracle properties as does the full fused lasso based on all pairwise parameter differences. We show that FLAPO gives estimators with smaller error bounds and better finite sample performance than the full fused lasso. We also establish a regularized inference procedure based on bias-corrected FLAPO. We illustrate our method through both simulation studies and an analysis of HIV surveillance data collected over five geographic regions in China, in which the presence or absence of common covariate effects is reflective of the relative effectiveness of regional policies on HIV control and prevention.
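The ordering device can be made concrete: rather than penalizing all pairwise differences, order the coefficients by a preliminary estimate and penalize only adjacent gaps. A sketch of that penalty term (the full FLAPO estimator embeds it in a penalized regression fit, which is not shown):

```python
import numpy as np

def flapo_penalty(beta, beta_init, lam):
    """Fused-lasso penalty with adaptive parameter ordering: order the
    coefficients by a preliminary estimate beta_init, then sum the
    absolute differences of adjacent pairs in that fixed order --
    O(p) terms instead of the O(p^2) all-pairs fused penalty."""
    order = np.argsort(np.asarray(beta_init, float))
    b = np.asarray(beta, float)[order]
    return float(lam * np.abs(np.diff(b)).sum())
```

Because the ordering is fixed by the preliminary estimate, the penalty still shrinks adjacent coefficients toward equality (fusing genuinely common parameters) while dropping the vast majority of the pairwise constraints.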

A dynamic treatment regimen consists of decision rules that recommend how to individualize treatment to patients based on available treatment and covariate history. In many scientific domains, these decision rules are shared across stages of intervention. As an illustrative example, we discuss STAR*D, a multistage randomized clinical trial for treating major depression. Estimating these shared decision rules often amounts to estimating parameters indexing the decision rules that are shared across stages. In this article, we propose a novel simultaneous estimation procedure for the shared parameters based on Q-learning. We provide an extensive simulation study to illustrate the merit of the proposed method over simple competitors, in terms of the treatment allocation matching of the procedure with the “oracle” procedure, defined as the one that makes treatment recommendations based on the true parameter values as opposed to their estimates. We also look at bias and mean squared error of the individual parameter estimates as secondary metrics. Finally, we analyze the STAR*D data using the proposed method.

Zero-inflated regression models have emerged as a popular tool within the parametric framework to characterize count data with excess zeros. Despite their increasing popularity, much of the literature on real applications of these models has centered around the latent class formulation where the mean response of the so-called at-risk or susceptible population and the susceptibility probability are both related to covariates. While this formulation in some instances provides an interesting representation of the data, it often fails to produce easily interpretable covariate effects on the overall mean response. In this article, we propose two approaches that circumvent this limitation. The first approach consists of estimating the effect of covariates on the overall mean from the assumed latent class models, while the second approach formulates a model that directly relates the overall mean to covariates. Our results are illustrated by extensive numerical simulations and an application to an oral health study on low income African-American children, where the overall mean model is used to evaluate the effect of sugar consumption on caries indices.
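
The first approach's target, the overall mean, is available from any latent-class fit as the product of the susceptibility probability and the at-risk mean. The sketch below, with made-up zero-inflated-Poisson coefficients (not from the study), shows how a covariate effect on the overall mean mixes the two latent-class effects.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical latent-class ZIP fit: logit(susceptibility) = g0 + g1*x,
# log(at-risk mean) = b0 + b1*x.  All coefficients are illustrative.
g0, g1 = -0.5, 0.8
b0, b1 = 1.0, 0.3

def overall_mean(x):
    """Overall mean E[Y|x] = P(at risk | x) * E[Y | at risk, x]."""
    return expit(g0 + g1 * x) * np.exp(b0 + b1 * x)

# Numerical covariate effect on the overall mean at x = 1: it combines
# the susceptibility slope g1 and the at-risk-mean slope b1.
x, eps = 1.0, 1e-6
slope = (overall_mean(x + eps) - overall_mean(x - eps)) / (2 * eps)
```

Analytically, d log E[Y|x] / dx = g1 (1 - expit(g0 + g1 x)) + b1, which is what the numerical slope recovers; neither g1 nor b1 alone is the overall-mean effect.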

In oncology, the international WHO and RECIST criteria have allowed the standardization of tumor response evaluation in order to identify the time of disease progression. These semi-quantitative measurements are often used as endpoints in phase II and phase III trials to study the efficacy of new therapies. However, through categorization of the continuous tumor size, information can be lost, and these criteria have been challenged by recently developed methods for modeling biomarkers longitudinally. Thus, it is of interest to compare the predictive ability of cancer progressions based on categorical criteria and quantitative measures of tumor size (left-censored due to detection limit problems) and/or appearance of new lesions on overall survival. We propose a joint model for a simultaneous analysis of three types of data: a longitudinal marker, recurrent events, and a terminal event. The model allows one to determine, in a randomized clinical trial, on which particular component the treatment acts most strongly. A simulation study is performed and shows that the proposed trivariate model is appropriate for practical use. We propose statistical tools that evaluate predictive accuracy for joint models to compare our model to models based on categorical criteria and their components. We apply the model to a randomized phase III clinical trial of metastatic colorectal cancer, conducted by the Fédération Francophone de Cancérologie Digestive (FFCD 2000–05 trial), which assigned 410 patients to two therapeutic strategies with multiple successive chemotherapy regimens.

We discuss the use of the determinantal point process (DPP) as a prior for latent structure in biomedical applications, where inference often centers on the interpretation of latent features as biologically or clinically meaningful structure. Typical examples include mixture models, when the terms of the mixture are meant to represent clinically meaningful subpopulations (of patients, genes, etc.). Another class of examples are feature allocation models. We propose the DPP prior as a repulsive prior on latent mixture components in the first example, and as a prior on feature-specific parameters in the second case. We argue that the DPP is in general an attractive prior model for latent structure when biologically relevant interpretation of such structure is desired. We illustrate the advantages of the DPP prior in three case studies, including inference in mixture models for magnetic resonance images (MRI) and for protein expression, and a feature allocation model for gene expression using data from The Cancer Genome Atlas. An important part of our argument is the availability of efficient and straightforward posterior simulation methods. We implement a variation of reversible jump Markov chain Monte Carlo simulation for inference under the DPP prior, using a density with respect to the unit rate Poisson process.
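
The repulsion that makes the DPP attractive here is visible in its defining determinant: the probability of a subset A of components is proportional to det(L_A), so near-duplicate components are jointly down-weighted. A minimal numeric illustration (the Gaussian similarity kernel and the three candidate locations are made up, not from the case studies):

```python
import numpy as np

# Three candidate component locations: items 0 and 1 are nearly
# identical, item 2 is well separated.
pts = np.array([0.0, 0.1, 3.0])
L = np.exp(-(pts[:, None] - pts[None, :]) ** 2)  # similarity kernel

def dpp_unnorm(subset):
    """Unnormalized DPP probability det(L_A) of selecting subset A."""
    idx = np.ix_(subset, subset)
    return np.linalg.det(L[idx])

similar_pair = dpp_unnorm([0, 1])    # 1 - exp(-0.02), tiny
distinct_pair = dpp_unnorm([0, 2])   # 1 - exp(-18), near 1
```

Under this prior, a pair of nearly coincident mixture components is orders of magnitude less likely a priori than a well-separated pair, which is exactly the repulsive behavior exploited for interpretable latent structure.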

We propose a new sparse estimation method for Cox (1972) proportional hazards models by optimizing an approximated information criterion. The main idea involves approximation of the ℓ₀ norm with a continuous or smooth unit dent function. The proposed method bridges the best subset selection and regularization by borrowing strength from both. It mimics the best subset selection using a penalized likelihood approach yet with no need for a tuning parameter. We further reformulate the problem with a reparameterization step so that it reduces to one unconstrained nonconvex yet smooth programming problem, which can be solved efficiently as in computing the maximum partial likelihood estimator (MPLE). Furthermore, the reparameterization tactic yields an additional advantage in terms of circumventing postselection inference. The oracle property of the proposed method is established. Both simulated experiments and empirical examples are provided for assessment and illustration.
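
One common smooth "unit dent" surrogate for the ℓ0 norm (this particular function is an assumption for illustration, not necessarily the one the article uses) is q(β) = β²/(β² + γ): each term rises from 0 at β = 0 toward 1 for |β| much larger than √γ, so the sum approaches the number of nonzero coefficients as γ shrinks.

```python
import numpy as np

def smooth_l0(beta, gamma):
    """Smooth 'unit dent' surrogate for the l0 norm: each term is
    near 0 when beta_j = 0 and near 1 when |beta_j| >> sqrt(gamma)."""
    beta = np.asarray(beta, dtype=float)
    return np.sum(beta**2 / (beta**2 + gamma))

beta = np.array([0.0, 0.01, 2.0, -3.0])
exact_l0 = np.count_nonzero(beta)                       # 3
approx = [smooth_l0(beta, g) for g in (1.0, 1e-2, 1e-6)]
```

As γ decreases, the smooth surrogate approaches the exact ℓ0 count while remaining differentiable, which is what makes an unconstrained smooth program (rather than a combinatorial subset search) possible.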

Capture–recapture methods are used to estimate the size of a population of interest which is only partially observed. In such studies, each member of the population carries a count of the number of times it has been identified during the observational period. In real-life applications, only positive counts are recorded, so the observed distribution is truncated at zero. We need to use the truncated count distribution to estimate the number of unobserved units. We consider ratios of neighboring count probabilities, estimated by ratios of observed frequencies, regardless of whether we have a zero-truncated or an untruncated distribution. Rocchetti et al. (2011) have shown that, for densities in the Katz family, these ratios can be modeled by a regression approach, and Rocchetti et al. (2014) have specialized the approach to the beta-binomial distribution. Once the regression model has been estimated, the unobserved frequency of zero counts can be simply derived. The guiding principle is that it is often easier to find an appropriate regression model than a proper model for the count distribution. However, a full analysis of the connection between the regression model and the associated count distribution has been missing. In this manuscript, we fill the gap and show that the regression model approach leads, under general conditions, to a valid count distribution; we also consider a wider class of regression models, based on fractional polynomials. The proposed approach is illustrated by analyzing various empirical applications, and by means of a simulation study.
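
The ratio-regression recipe can be sketched in its simplest case. For a Poisson (the simplest Katz member) the frequency ratios r_x = (x+1) f_{x+1}/f_x are constant and equal to λ, so the "regression" is intercept-only, and the unobserved zero frequency follows from f_0 = f_1 / r_0. The simulation below is illustrative only (not from the paper); ratios from higher, sparsely populated counts are simply dropped.

```python
import numpy as np

rng = np.random.default_rng(7)
lam, N = 1.5, 1000
counts = rng.poisson(lam, size=N)
observed = counts[counts > 0]          # only positive counts are recorded
n = observed.size

# Frequency ratios r_x = (x+1) f_{x+1} / f_x are unaffected by the
# missing zero class, so they can be computed from the truncated data.
f = np.bincount(observed, minlength=7)
ratios = [(x + 1) * f[x + 1] / f[x] for x in range(1, 5)]

# Poisson => constant ratios: fit the intercept-only 'regression'.
# Richer models (Katz, fractional polynomials) would add terms in x.
lam_hat = float(np.mean(ratios))

# Recover the unobserved zero frequency: f_0 = 1 * f_1 / r_0.
f0_hat = f[1] / lam_hat
N_hat = n + f0_hat                     # estimated population size
```

With λ = 1.5, about 22% of the population is never observed; the ratio plug-in recovers a population-size estimate close to the true N = 1000 from the truncated counts alone.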

In this work, a new metric of surrogacy, the so-called individual causal association (ICA), is introduced using information-theoretic concepts and a causal inference model for a binary surrogate and true endpoint. The ICA has a simple and appealing interpretation in terms of uncertainty reduction and, in some scenarios, it seems to provide a more coherent assessment of the validity of a surrogate than existing measures. The identifiability issues are tackled using a two-step procedure. In the first step, the region of the parametric space of the distribution of the potential outcomes, compatible with the data at hand, is geometrically characterized. In the second step, a Monte Carlo approach is proposed to study the behavior of the ICA over this region. The method is illustrated using data from the Collaborative Initial Glaucoma Treatment Study. A newly developed and user-friendly R package *Surrogate* is provided to carry out the evaluation exercise.

Next-generation (high-throughput) sequencing data are recorded as counts, which are generally far from normally distributed. Under the assumption that the count data follow the Poisson log-normal distribution, this article provides an ℓ₁-penalized likelihood framework and an efficient search algorithm to estimate the structure of sparse directed acyclic graphs (DAGs) for multivariate count data. In searching for the solution, we use iterative optimization procedures to estimate the adjacency matrix and the variance matrix of the latent variables. The simulation results show that our proposed method outperforms the approach that assumes multivariate normal distributions, as well as the log-transformation approach. They also show that the proposed method outperforms the rank-based PC method under sparse network or hub network structures. As a real data example, we demonstrate the efficiency of the proposed method in estimating the gene regulatory networks of an ovarian cancer study.

Treatment policies, also known as dynamic treatment regimes, are sequences of decision rules that link the observed patient history with treatment recommendations. Multiple, plausible, treatment policies are frequently constructed by researchers using expert opinion, theories, and reviews of the literature. Often these different policies represent competing approaches to managing an illness. Here, we develop an “assisted estimator” that can be used to compare the mean outcome of competing treatment policies. The term “assisted” refers to the fact that estimators from the Structural Nested Mean Model, a parametric model for the causal effect of treatment at each time point, are used in the process of estimating the mean outcome. This work is motivated by the comparison of the mean outcomes of two competing treatment policies using data from the ExTENd study in alcohol dependence.

In comparative effectiveness research, it is often of interest to calibrate treatment effect estimates from a clinical trial to a target population that differs from the study population. One important application is an indirect comparison of a new treatment with a placebo control on the basis of two separate randomized clinical trials: a non-inferiority trial comparing the new treatment with an active control and a historical trial comparing the active control with placebo. The available methods for treatment effect calibration include an outcome regression (OR) method based on a regression model for the outcome and a weighting method based on a propensity score (PS) model. This article proposes new methods for treatment effect calibration: one based on a conditional effect (CE) model and two doubly robust (DR) methods. The first DR method involves a PS model and an OR model, is asymptotically valid if either model is correct, and attains the semiparametric information bound if both models are correct. The second DR method involves a PS model, a CE model, and possibly an OR model, is asymptotically valid under the union of the PS and CE models, and attains the semiparametric information bound if all three models are correct. The various methods are compared in a simulation study and applied to recent clinical trials for treating human immunodeficiency virus infection.
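
The double robustness of the first DR method (valid if either the PS or the OR model is correct) can be illustrated in miniature with the standard augmented-IPW form for a mean potential outcome. The simulation below is entirely hypothetical: for brevity the propensity score is taken as known (a correctly specified PS model), while the outcome model is deliberately misspecified as a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-0.5 * X))          # true propensity score P(Z=1|X)
Z = rng.binomial(1, e)
Y = 1.0 + 2.0 * Z + 1.5 * X + rng.normal(size=n)

# Deliberately misspecified outcome regression for the treated arm:
m1_bad = np.full(n, Y[Z == 1].mean())

# AIPW estimate of E[Y(1)]: the augmentation term has mean zero when
# the PS is correct, so the estimator stays valid despite m1_bad.
mu1_aipw = np.mean(Z * Y / e - (Z - e) / e * m1_bad)

naive = Y[Z == 1].mean()                # biased: treated have higher X
```

The true value is E[Y(1)] = 3 (since E[X] = 0); the naive treated-arm mean overshoots it because treatment assignment depends on X, while the AIPW estimate recovers it even with the wrong outcome model.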

Under suitable assumptions and by exploiting the independence between inherited genetic susceptibility and treatment assignment, the case-only design yields efficient estimates for subgroup treatment effects and gene-treatment interaction in a Cox model. However, it cannot provide estimates of the genetic main effect and baseline hazards, which are necessary to compute the absolute disease risk. For two-arm, placebo-controlled trials with rare failure time endpoints, we consider augmenting the case-only design with random samples of controls from both arms, as in the classical case-cohort sampling scheme, or with a random sample of controls from the active treatment arm only. The latter design is motivated by vaccine trials for cost-effective use of resources and specimens so that host genetics and vaccine-induced immune responses can be studied simultaneously in a larger set of participants. We show that these designs can identify all parameters in a Cox model and that the efficient case-only estimator can be incorporated in a two-step plug-in procedure. Results in simulations and a data example suggest that incorporating case-only estimators in the classical case-cohort design improves the precision of all estimated parameters; sampling controls only in the active treatment arm attains a similar level of efficiency.

Many robust tests have been proposed in the literature to compare two hazard rate functions; however, very few of them can be used in cases when there are multiple hazard rate functions to be compared. In this article, we propose an approach for detecting the difference among multiple hazard rate functions. Through a simulation study and a real-data application, we show that the new method is robust and powerful in many situations, compared with some commonly used tests.

Cigarette smoking is a prototypical example of a recurrent event. The pattern of recurrent smoking events may depend on time-varying covariates including mood and environmental variables. Fixed effects and frailty models for recurrent events data assume that smokers have a common association with time-varying covariates. We develop a mixed effects version of a recurrent events model that may be used to describe variation among smokers in how they respond to those covariates, potentially leading to the development of individual-based smoking cessation therapies. Our method extends the modified EM algorithm of Steele (1996) for generalized mixed models to recurrent events data with partially observed time-varying covariates. It is offered as an alternative to the method of Rizopoulos, Verbeke, and Lesaffre (2009), who extended Steele's (1996) algorithm to a joint model for the recurrent events data and time-varying covariates. Our approach does not require a model for the time-varying covariates, but instead assumes that the time-varying covariates are sampled according to a Poisson point process with known intensity. Our methods are well suited to data collected using Ecological Momentary Assessment (EMA), a method of data collection widely used in the behavioral sciences to collect data on emotional state and recurrent events in the everyday environments of study subjects using electronic devices such as Personal Digital Assistants (PDAs) or smart phones.

Infectious diseases that can be spread directly or indirectly from one person to another are caused by pathogenic microorganisms such as bacteria, viruses, parasites, or fungi. Infectious diseases remain one of the greatest threats to human health, and the analysis of infectious disease data is among the most important applications of statistics. In this article, we develop Bayesian methodology using a parametric bivariate accelerated lifetime model to study the dependency between the colonization and infection times for Acinetobacter baumannii, a bacterium that is a leading cause of hospital-acquired infection. We also study their associations with covariates such as age, gender, APACHE score, antibiotic use in the 3 months before admission, and invasive mechanical ventilation use. To account for singularity, we use a singular bivariate extreme value distribution to model the residuals in the bivariate accelerated lifetime model under a fully Bayesian framework. We apply our methodology to censored data on colonization and infection collected in five major hospitals in Turkey. The data analysis in this article is for illustration of our proposed method, which can be applied in any setting for which our model is appropriate.

In many observational longitudinal studies, the outcome of interest presents a skewed distribution, is subject to censoring due to detection limit or other reasons, and is observed at irregular times that may follow an outcome-dependent pattern. In this work, we consider quantile regression modeling of such longitudinal data, because quantile regression is generally robust in handling skewed and censored outcomes and is flexible enough to accommodate dynamic covariate–outcome relationships. Specifically, we study a longitudinal quantile regression model that specifies covariate effects on the marginal quantiles of the longitudinal outcome. Such a model is easy to interpret and can accommodate dynamic outcome profile changes over time. We propose estimation and inference procedures that can appropriately account for censoring and irregular outcome-dependent follow-up. Our proposals can be readily implemented based on existing software for quantile regression. We establish the asymptotic properties of the proposed estimator, including uniform consistency and weak convergence. Extensive simulations suggest good finite-sample performance of the new method. We also present an analysis of data from a long-term study of a population exposed to polybrominated biphenyls (PBB), which uncovers an inhomogeneous PBB elimination pattern that would not be detected by traditional longitudinal data analysis.

Estimating the conditional quantiles of outcome variables of interest is a frequent goal in many research areas, and quantile regression is foremost among the utilized methods. The coefficients of a quantile regression model depend on the order of the quantile being estimated. For example, the coefficients for the median are generally different from those of the 10th centile. In this article, we describe an approach to modeling the regression coefficients as parametric functions of the order of the quantile. This approach may have advantages in terms of parsimony and efficiency, and may expand the potential of statistical modeling. Goodness-of-fit measures and testing procedures are discussed, and the results of a simulation study are presented. We apply the method to analyze the data that motivated this work. The described method is implemented in the *qrcm* R package.
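
The idea of coefficients as parametric functions of the quantile order p can be sketched in the simplest, covariate-free case. For a normal sample, the quantile function is q(p) = μ + σ z_p, i.e. a two-parameter curve in p. The two-step fit below (empirical quantiles on a grid, then least squares in z_p) is a crude stand-in for the integrated-loss estimation implemented in qrcm, used here only to show the parsimony of a parametric coefficient path; all numbers are simulated.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=5000)

# Empirical quantiles on a grid of quantile orders p.
ps = np.linspace(0.05, 0.95, 19)
q_emp = np.quantile(y, ps)

# Parametric model for the quantile function: q(p) = a + b * z_p,
# with z_p the standard normal quantile of order p.  Two parameters
# describe the entire family of quantiles at once.
z = np.array([NormalDist().inv_cdf(p) for p in ps])
A = np.column_stack([np.ones_like(z), z])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, q_emp, rcond=None)
```

The fit recovers (μ, σ) ≈ (2.0, 1.5), and any quantile of interest follows from the same two parameters instead of a separate regression per quantile order; with covariates, each regression coefficient would get its own parametric curve in p.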

Finding an efficient and computationally feasible approach to deal with the curse of high-dimensionality is a daunting challenge faced by modern biological science. The problem becomes even more severe when the interactions are the research focus. To improve the performance of statistical analyses, we propose a sparse and low-rank (SLR) screening based on the combination of a low-rank interaction model and the Lasso screening. SLR models the interaction effects using a low-rank matrix to achieve parsimonious parametrization. The low-rank model increases the efficiency of statistical inference and, hence, SLR screening is able to more accurately detect gene–gene interactions than conventional methods. Incorporation of SLR screening into the Screen-and-Clean approach (Wasserman and Roeder, 2009; Wu et al., 2010) is also discussed, which incurs a smaller penalty from the Bonferroni correction and is able to assign p-values to the identified variables in a high-dimensional model. We apply the proposed screening procedure to the Warfarin dosage study and the CoLaus study. The results suggest that the new procedure can identify main and interaction effects that would have been omitted by conventional screening methods.

Despite spectacular advances in molecular genomic technologies in the past two decades, resources available for genomic studies are still finite and limited, especially for family-based studies. Hence, it is important to consider an optimum study design to maximally utilize limited resources to increase statistical power in family-based studies. A particular question of interest is whether it is more profitable to genotype siblings of probands or to recruit more independent families. Numerous studies have attempted to address this study design issue for simultaneous detection of imprinting and maternal effects, two important epigenetic factors for studying complex diseases. The question is far from settled, however, mainly due to the fact that results and recommendations in the literature are based on anecdotal evidence from limited simulation studies rather than based on rigorous statistical analysis. In this article, we propose a systematic approach to study various designs based on a partial likelihood formulation. We derive the asymptotic properties and obtain formulas for computing the information contents of study designs being considered. Our results show that, for a common disease, recruiting additional siblings is beneficial because both affected and unaffected individuals will be included. However, if a disease is rare, then any additional siblings recruited are most likely to be unaffected, thus contributing little additional information; in such cases, additional families will be a better choice with a fixed amount of resources. Our work thus offers a practical strategy for investigators to select the optimum study design within a case-control family scheme before data collection.

Semicontinuous data in the form of a mixture of a large portion of zero values and continuously distributed positive values frequently arise in many areas of biostatistics. This article is motivated by the analysis of relationships between disease outcomes and intakes of episodically consumed dietary components. An important aspect of studies in nutritional epidemiology is that true diet is unobservable and commonly evaluated by food frequency questionnaires with substantial measurement error. Following the regression calibration approach for measurement error correction, unknown individual intakes in the risk model are replaced by their conditional expectations given mismeasured intakes and other model covariates. Those regression calibration predictors are estimated using short-term unbiased reference measurements in a calibration substudy. Since dietary intakes are often “energy-adjusted,” e.g., by using ratios of the intake of interest to total energy intake, the correct estimation of the regression calibration predictor for each energy-adjusted episodically consumed dietary component requires modeling short-term reference measurements of the component (a semicontinuous variable), and energy (a continuous variable) simultaneously in a bivariate model. In this article, we develop such a bivariate model, together with its application to regression calibration. We illustrate the new methodology using data from the NIH-AARP Diet and Health Study (Schatzkin et al., 2001, *American Journal of Epidemiology* **154**, 1119–1125), and also evaluate its performance in a simulation study.

We propose an *M~tb~* model for population size estimation in capture–recapture studies. The *tb* part is based on equality constraints for the conditional capture probabilities, leading to an extremely rich model class. Observed and unobserved heterogeneity are dealt with by means of a logistic parameterization. In order to explore the model class, we introduce a penalized version of the likelihood. The conditional likelihood and penalized conditional likelihood are maximized by means of efficient EM algorithms. Simulations and two real data examples illustrate the approach.

We develop alternative strategies for building and fitting parametric capture–recapture models for closed populations, which can be used to gain a better understanding of behavioral patterns. In the perspective of transition models, we first rely on a conditional probability parameterization. A large subset of standard capture–recapture models can be regarded as a suitable partitioning in equivalence classes of the full set of conditional probability parameters. We exploit a regression approach combined with the use of new suitable summaries of the conditioning binary partial capture histories as a device for enlarging the scope of behavioral models and also exploring the range of all possible partitions. We show how one can easily find unconditional MLEs of such models within a generalized linear model framework. We illustrate the potential of our approach with the analysis of some known datasets and a simulation study.

The problem of estimating discovery probabilities originated in the context of statistical ecology, and in recent years it has become popular due to its frequent appearance in challenging applications arising in genetics, bioinformatics, linguistics, design of experiments, machine learning, etc. A full range of statistical approaches, parametric and nonparametric as well as frequentist and Bayesian, has been proposed for estimating discovery probabilities. In this article, we investigate the relationships between the celebrated Good–Turing approach, a frequentist nonparametric approach developed in the 1940s, and a Bayesian nonparametric approach recently introduced in the literature. Specifically, under the assumption of a two-parameter Poisson–Dirichlet prior, we show that Bayesian nonparametric estimators of discovery probabilities are asymptotically equivalent, for a large sample size, to suitably smoothed Good–Turing estimators. As a by-product of this result, we introduce and investigate a methodology for deriving exact and asymptotic credible intervals to be associated with the Bayesian nonparametric estimators of discovery probabilities. The proposed methodology is illustrated through a comprehensive simulation study and the analysis of Expressed Sequence Tags data generated by sequencing a benchmark complementary DNA library.
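
The classical Good–Turing estimator of the probability that the next observation is a new, previously unseen species is simply m1/n, where m1 counts the species observed exactly once. A self-contained sketch (the power-law population below is made up for illustration) compares it with the true missing mass:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
K = 1000                                   # species in the population
w = 1.0 / np.arange(1, K + 1) ** 1.5       # power-law abundances
w /= w.sum()
n = 2000
sample = rng.choice(K, size=n, p=w)

freq = Counter(sample.tolist())
m1 = sum(1 for c in freq.values() if c == 1)   # singletons
gt_new = m1 / n       # Good-Turing: P(next draw is a new species)

# True missing mass: total abundance of species never observed.
seen = set(freq)
true_new = float(sum(w[k] for k in range(K) if k not in seen))
```

The singleton count alone, with no model for the abundances, tracks the unobserved mass closely; the Bayesian nonparametric estimators discussed in the article can be viewed asymptotically as smoothed versions of this frequency-of-frequencies plug-in.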

Estimation of the skeleton of a directed acyclic graph (DAG) is of great importance for understanding the underlying DAG, and causal effects can be assessed from the skeleton when the DAG is not identifiable. We propose a novel method named PenPC to estimate the skeleton of a high-dimensional DAG by a two-step approach. We first estimate the nonzero entries of a concentration matrix using penalized regression, and then fix the difference between the concentration matrix and the skeleton by evaluating a set of conditional independence hypotheses. For high-dimensional problems where the number of vertices *p* is of polynomial or exponential order in the sample size *n*, we study the asymptotic property of PenPC on two types of graphs: traditional random graphs, where all the vertices have the same expected number of neighbors, and scale-free graphs, where a few vertices may have a large number of neighbors. As illustrated by extensive simulations and applications on gene expression data of cancer patients, PenPC has higher sensitivity and specificity than the state-of-the-art method, the PC-stable algorithm.

We consider in this article testing rare variants by environment interactions in sequencing association studies. Current methods for studying the association of rare variants with traits cannot be readily applied for testing for rare variants by environment interactions, as these methods do not effectively control for the main effects of rare variants, leading to unstable results and/or inflated Type 1 error rates. We will first analytically study the bias of the use of conventional burden-based tests for rare variants by environment interactions, and show the tests can often be invalid and result in inflated Type 1 error rates. To overcome these difficulties, we develop the interaction sequence kernel association test (iSKAT) for assessing rare variants by environment interactions. The proposed test iSKAT is optimal in a class of variance component tests and is powerful and robust to the proportion of variants in a gene that interact with environment and the signs of the effects. This test properly controls for the main effects of the rare variants using weighted ridge regression while adjusting for covariates. We demonstrate the performance of iSKAT using simulation studies and illustrate its application by analysis of a candidate gene sequencing study of plasma adiponectin levels.

A variety of pathway/gene-set approaches have been proposed to provide evidence of higher-level biological phenomena in the association of expression with experimental condition or clinical outcome. Among these approaches, it has been repeatedly shown that resampling methods are far preferable to approaches that implicitly assume independence of genes. However, few approaches have been optimized for the specific characteristics of RNA-Seq transcription data, in which mapped tags produce discrete counts with varying library sizes, and with potential outliers or skewness patterns that violate parametric assumptions. We describe transformations to RNA-Seq data to improve power for linear associations with outcome and flexibly handle normalization factors. Using these transformations or alternate transformations, we apply recently developed null approximations to quadratic form statistics for both self-contained and competitive pathway testing. The approach provides a convenient integrated platform for RNA-Seq pathway testing. We demonstrate that the approach provides appropriate type I error control without actual permutation and is powerful under many settings in comparison to competing approaches. Pathway analysis of data from a study of F344 vs. HIV1Tg rats, and of sex differences in lymphoblastoid cell lines from humans, strongly supports the biological interpretability of the findings.

A common practice with ordered doses of treatment and ordered responses, perhaps recorded in a contingency table with ordered rows and columns, is to cut or remove a cross from the table, leaving the outer corners—that is, the high-versus-low dose, high-versus-low response corners—and from these corners to compute a risk or odds ratio. This little remarked but common practice seems to be motivated by the oldest and most familiar method of sensitivity analysis in observational studies, proposed by Cornfield et al. (1959), which says that to explain a population risk ratio purely as bias from an unobserved binary covariate, the prevalence ratio of the covariate must exceed the risk ratio. Quite often, the largest risk ratio, hence the one least sensitive to bias by this standard, is derived from the corners of the ordered table with the central cross removed. Obviously, the corners use only a portion of the data, so a focus on the corners has consequences for the standard error as well as for bias, but sampling variability was not a consideration in this early and familiar form of sensitivity analysis, where point estimates replaced population parameters. Here, this cross-cut analysis is examined with the aid of design sensitivity and the power of a sensitivity analysis.
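
The cross-cut practice and the Cornfield bound can be made concrete with a small hypothetical ordered table (the counts below are invented for illustration):

```python
import numpy as np

# Hypothetical 3x3 table: ordered doses (rows) by ordered responses (cols).
table = np.array([[40, 25, 10],
                  [30, 30, 20],
                  [12, 28, 45]])

# 'Cutting the cross': drop the middle row and column, keep the corners.
a, b = table[0, 0], table[0, 2]   # low dose: low / high response
c, d = table[2, 0], table[2, 2]   # high dose: low / high response
corner_or = (a * d) / (b * c)     # odds ratio from the corners only

# Risk ratio of high response, high vs. low dose, using full row margins.
rr = (table[2, 2] / table[2].sum()) / (table[0, 2] / table[0].sum())
# Cornfield's criterion: to explain this risk ratio purely as bias from
# an unobserved binary covariate, the covariate's prevalence ratio
# across dose groups would have to exceed rr.
```

Here the corners yield a far larger association (odds ratio 15) than the margin-based risk ratio (about 4), illustrating why the corner estimate looks least sensitive to bias by Cornfield's standard while discarding much of the data, with the consequences for standard errors that the article examines via design sensitivity.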

It is common in biomedical research to run case-control studies involving high-dimensional predictors, with the main goal being detection of the sparse subset of predictors having a significant association with disease. Usual analyses rely on independent screening, considering each predictor one at a time, or in some cases on logistic regression assuming no interactions. We propose a fundamentally different approach based on a nonparametric Bayesian low rank tensor factorization model for the retrospective likelihood. Our model allows a very flexible structure in characterizing the distribution of multivariate variables as unknown and without any linear assumptions as in logistic regression. Predictors are excluded only if they have no impact on disease risk, either directly or through interactions with other predictors. Hence, we obtain an omnibus approach for screening for important predictors. Computation relies on an efficient Gibbs sampler. The methods are shown to have high power and low false discovery rates in simulation studies, and we consider an application to an epidemiology study of birth defects.

Menstrual cycle length (MCL) has been shown to play an important role in couple fecundity, which is the biologic capacity for reproduction irrespective of pregnancy intentions. However, a comprehensive assessment of its role requires a fecundity model that accounts for male and female attributes and the couple's intercourse pattern relative to the ovulation day. To this end, we employ a Bayesian joint model for MCL and pregnancy. MCLs follow a scale multiplied (accelerated) mixture model with Gaussian and Gumbel components; the pregnancy model includes MCL as a covariate and computes the cycle-specific probability of pregnancy in a menstrual cycle conditional on the pattern of intercourse and no previous fertilization. Day-specific fertilization probability is modeled using natural cubic splines. We analyze data from the Longitudinal Investigation of Fertility and the Environment Study (the LIFE Study), a couple-based prospective pregnancy study, and find a statistically significant quadratic relation between fecundity and menstrual cycle length, after adjustment for intercourse pattern and other attributes, including male semen quality, both partners' ages, and active smoking status (determined by a baseline cotinine level above 100 ng/mL). We compare results to those produced by a more basic model and show the advantages of a more comprehensive approach.

Recurrent event data arise frequently in longitudinal medical studies. In many situations, there is a large proportion of subjects without any recurrent events, manifesting the “zero-inflated” nature of the data. Some of the zero events may be “structural zeros,” as these patients are not susceptible to recurrent events, while others are “random zeros” due to censoring before any recurrent events. On the other hand, there often exists a terminal event which may be correlated with the recurrent events. In this article, we propose two joint frailty models for zero-inflated recurrent events in the presence of a terminal event, combining a logistic model for “structural zero” status (Yes/No) with a joint frailty proportional hazards model for recurrent and terminal event times. The models can be fitted conveniently in SAS Proc NLMIXED. We apply the methods to model recurrent opportunistic diseases in the presence of death in an AIDS study, and tumor recurrences and a terminal event in a sarcoma study.

For a study with an event time as the endpoint, its survival function contains all the information regarding the temporal, stochastic profile of this outcome variable. The survival probability at a specific time point, say *t*, however, does not transparently capture the temporal profile of this endpoint up to *t*. An alternative is to use the restricted mean survival time (RMST) at time *t* to summarize the profile. The RMST is the mean survival time of all subjects in the study population followed up to *t*, and is simply the area under the survival curve up to *t*. The advantages of using such a quantification over the survival rate have been discussed in the setting of a fixed-time analysis. In this article, we generalize this approach by considering a curve based on the RMST over time as an alternative summary to the survival function. Inference procedures, for instance simultaneous confidence bands for a single RMST curve and for the difference between two RMST curves, are proposed. The latter is informative for evaluating two groups under an equivalence or noninferiority setting, and quantifies the difference between two groups on a time scale. The proposal is illustrated with data from two clinical trials, one from oncology and the other from cardiology.
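Since the RMST at *t* is just the area under the survival curve up to *t*, it can be read off a Kaplan–Meier step function by summing rectangle areas. The sketch below is a minimal illustration with made-up numbers (the function name and toy curve are assumptions, not from the article):

```python
import numpy as np

def rmst(times, surv, tau):
    """Restricted mean survival time: area under a right-continuous step
    survival curve S(t) up to the truncation time tau.

    times : increasing event times at which S drops
    surv  : value of S just after each event time (same length as times)
    """
    t = np.concatenate(([0.0], np.asarray(times, dtype=float)))
    s = np.concatenate(([1.0], np.asarray(surv, dtype=float)))
    keep = t < tau                         # only the curve before tau matters
    t, s = t[keep], s[keep]
    widths = np.diff(np.concatenate((t, [tau])))
    return float(np.sum(widths * s))       # sum of rectangle areas

# Hypothetical Kaplan-Meier estimate: S drops to 0.8 at t=1 and 0.5 at t=3
print(rmst([1, 3], [0.8, 0.5], tau=5))  # 1*1.0 + 2*0.8 + 2*0.5 = 3.6
```

Evaluating this quantity over a grid of `tau` values traces out the RMST curve discussed above.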

The proportional hazards model (PH) is currently the most popular regression model for analyzing time-to-event data. Despite its popularity, the analysis of interval-censored data under the PH model can be challenging using many available techniques. This article presents a new method for analyzing interval-censored data under the PH model. The proposed approach uses a monotone spline representation to approximate the unknown nondecreasing cumulative baseline hazard function. Formulating the PH model in this fashion results in a finite number of parameters to estimate while maintaining substantial modeling flexibility. A novel expectation-maximization (EM) algorithm is developed for finding the maximum likelihood estimates of the parameters. The derivation of the EM algorithm relies on a two-stage data augmentation involving latent Poisson random variables. The resulting algorithm is easy to implement, robust to initialization, enjoys quick convergence, and provides closed-form variance estimates. The performance of the proposed regression methodology is evaluated through a simulation study, and is further illustrated using data from a large population-based randomized trial designed and sponsored by the United States National Cancer Institute.

Confidence intervals for the ratio of two median residual lifetimes are developed for left-truncated and right-censored data. The approach of Su and Wei (1993) is first extended by replacing the Kaplan–Meier survival estimator with the estimator of the conditional survival function (Lynden-Bell, 1971). This procedure does not involve nonparametric estimation of the probability density function of the failure time. However, the Su and Wei type confidence intervals are very conservative even for larger sample sizes. Therefore, this article proposes an alternative confidence interval for the ratio of two median residual lifetimes, which not only avoids nonparametric estimation of the density function of the failure times but is also computationally simpler than the Su and Wei type confidence interval. A simulation study is conducted to examine the accuracy of these confidence intervals, and their application to two real data sets is illustrated.
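The point estimate underlying these intervals is easy to illustrate: the median residual lifetime at *t₀* is the smallest *s* with S(t₀ + s) ≤ 0.5·S(t₀). The sketch below reads it off a step survival curve; the toy curve and function name are illustrative assumptions, not from the article:

```python
import numpy as np

def median_residual_life(times, surv, t0):
    """Median residual lifetime at t0: smallest s with
    S(t0 + s) <= 0.5 * S(t0), from a right-continuous step survival
    curve (times increasing, surv the value just after each drop)."""
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    # S(t0): survival just after the last drop at or before t0
    if t0 < times[0]:
        s0 = 1.0
    else:
        s0 = surv[np.searchsorted(times, t0, side="right") - 1]
    target = 0.5 * s0
    below = np.nonzero(surv <= target)[0]
    if len(below) == 0:
        return np.inf                      # median residual life not reached
    return times[below[0]] - t0

# Hypothetical survival curve
t = [1, 2, 4, 7, 10]
s = [0.9, 0.7, 0.5, 0.3, 0.1]
print(median_residual_life(t, s, t0=2))  # S(2)=0.7, first drop <=0.35 is at t=7 -> 5.0
```

The ratio of two such estimates, one per group, is the quantity for which the confidence intervals above are constructed.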

Multiple imputation (MI) is a well-established method to handle item-nonresponse in sample surveys. Survey data obtained from complex sampling designs often involve features that include unequal probability of selection. MI requires imputation to be congenial, that is, for the imputations to come from a Bayesian predictive distribution and for the observed and complete data estimator to equal the posterior mean given the observed or complete data, and similarly for the observed and complete variance estimator to equal the posterior variance given the observed or complete data; more colloquially, the analyst and imputer make similar modeling assumptions. Yet multiply imputed data sets from complex sample designs with unequal sampling weights are typically imputed under simple random sampling assumptions and then analyzed using methods that account for the sampling weights. This is a setting in which the analyst assumes more than the imputer, which can lead to biased estimates and anti-conservative inference. Less commonly used alternatives, such as including case weights as predictors in the imputation model, typically require interaction terms for more complex estimators such as regression coefficients, and can be vulnerable to model misspecification and difficult to implement. We develop a simple two-step MI framework that accounts for sampling weights using a weighted finite population Bayesian bootstrap method to validly impute the whole population (including item nonresponse) from the observed data. In the second step, having generated posterior predictive distributions of the entire population, we use standard IID imputation to handle the item nonresponse. Simulation results show that the proposed method has good frequentist properties and is robust to model misspecification compared to alternative approaches.
We apply the proposed method to accommodate missing data in the Behavioral Risk Factor Surveillance System when estimating means and parameters of regression models.
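The two-step logic can be sketched in a deliberately simplified form. Below, step 1 expands the weighted sample into a synthetic population by resampling with probability proportional to the weights; this is a crude stand-in for the weighted finite population Bayesian bootstrap (which instead uses a Pólya urn scheme to propagate sampling uncertainty), and all names and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_population(y, w, rng):
    """Step 1 (simplified): expand a weighted sample into a synthetic
    population of size sum(w) by weighted resampling with replacement.
    NOTE: a stand-in for the weighted finite population Bayesian
    bootstrap, not the authors' exact algorithm."""
    N = int(round(w.sum()))
    idx = rng.choice(len(y), size=N, replace=True, p=w / w.sum())
    return y[idx]

# Hypothetical weighted sample with item nonresponse coded as np.nan
y = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
w = np.array([10.0, 20.0, 10.0, 30.0, 20.0, 10.0])

pop = synthetic_population(y, w, rng)

# Step 2: standard IID imputation within the synthetic population,
# here a simple hot-deck draw from the observed values.
obs = pop[~np.isnan(pop)]
miss = np.isnan(pop)
pop[miss] = rng.choice(obs, size=miss.sum(), replace=True)
print(pop.mean())
```

Repeating both steps yields multiple completed populations, which are then analyzed with standard combining rules.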

Subject-specific and marginal models have been developed for the analysis of longitudinal ordinal data. Subject-specific models often lack a population-average interpretation of the model parameters due to the conditional formulation of random intercepts and slopes. Marginal models frequently lack an underlying distribution for ordinal data, in particular when generalized estimating equations are applied. To overcome these issues, latent variable models underneath the ordinal outcomes with a multivariate logistic distribution can be applied. In this article, we extend the work of O'Brien and Dunson (2004), who studied the multivariate *t*-distribution with marginal logistic distributions. We use maximum likelihood instead of a Bayesian approach, and incorporate covariates in the correlation structure in addition to the mean model. We compare our method with GEE and demonstrate that it performs better than GEE with respect to fixed effect parameter estimation when the latent variables have an approximately elliptical distribution, and at least as well as GEE for other types of latent variable distributions.

We present a novel formulation of a mark–recapture–resight model that allows estimation of population size, stopover duration, and arrival and departure schedules at migration areas. Estimation is based on encounter histories of uniquely marked individuals and relative counts of marked and unmarked animals. We use a Bayesian analysis of a state–space formulation of the Jolly–Seber mark–recapture model, integrated with a binomial model for counts of unmarked animals, to derive estimates of population size and arrival and departure probabilities. We also provide a novel estimator for stopover duration that is derived from the latent state variable representing the interim between arrival and departure in the state–space model. We conduct a simulation study of field sampling protocols to understand the impact of superpopulation size, proportion marked, and number of animals sampled on bias and precision of estimates. Simulation results indicate that relative bias of estimates of the proportion of the population with marks was low for all sampling scenarios and never exceeded 2%. Our approach does not require enumeration of all unmarked animals detected or direct knowledge of the number of marked animals in the population at the time of the study. This provides flexibility and potential application in a variety of sampling situations (e.g., migratory birds, breeding seabirds, sea turtles, fish, pinnipeds, etc.). Application of the methods is demonstrated with data from a study of migratory sandpipers.

In recent years, increasing attention has been devoted to the problem of the stability of multivariable regression models, understood as the resistance of the model to small changes in the data on which it has been fitted. Resampling techniques, mainly based on the bootstrap, have been developed to address this issue. In particular, the approaches based on the idea of “inclusion frequency” consider the repeated implementation of a variable selection procedure, for example backward elimination, on several bootstrap samples. The analysis of the variables selected in each iteration provides useful information on the model stability and on the variables’ importance. Recent findings, nevertheless, show possible pitfalls in the use of the bootstrap, and alternatives such as subsampling have begun to be taken into consideration in the literature. Using model selection frequencies and variable inclusion frequencies, we empirically compare these two different resampling techniques, investigating the effect of their use in selected classical model selection procedures for multivariable regression. We conduct our investigations by analyzing two real data examples and by performing a simulation study. Our results reveal some advantages in using a subsampling technique rather than the bootstrap in this context.

Climate change is expected to have many impacts on the environment, including changes in ozone concentrations at the surface level. A key public health concern is the potential increase in ozone-related summertime mortality if surface ozone concentrations rise in response to climate change. Although ozone formation depends partly on summertime weather, which exhibits considerable inter-annual variability, previous health impact studies have not incorporated the variability of ozone into their prediction models. A major source of uncertainty in the health impacts is the variability of the modeled ozone concentrations. We propose a Bayesian model and Monte Carlo estimation method for quantifying health effects of future ozone. An advantage of this approach is that we include the uncertainty in both the health effect association and the modeled ozone concentrations. Using our proposed approach, we quantify the expected change in ozone-related summertime mortality in the contiguous United States between 2000 and 2050 under a changing climate. The mortality estimates show regional patterns in the expected degree of impact. We also illustrate the results when using a common technique in previous work that averages ozone to reduce the size of the data, and contrast these findings with our own. Our analysis yields more realistic inferences, providing clearer interpretation for decision making regarding the impacts of climate change.

Spatial generalized linear mixed models (SGLMMs) are popular models for spatial data with a non-Gaussian response. Binomial SGLMMs with logit or probit link functions are often used to model spatially dependent binomial random variables. It is known that for independent binomial data, the robit regression model provides a more robust (against extreme observations) alternative to the more popular logistic and probit models. In this article, we introduce a Bayesian spatial robit model for spatially dependent binomial data. Since constructing a meaningful prior on the link function parameter as well as the spatial correlation parameters in SGLMMs is difficult, we propose an empirical Bayes (EB) approach for the estimation of these parameters as well as for the prediction of the random effects. The EB methodology is implemented by efficient importance sampling methods based on Markov chain Monte Carlo (MCMC) algorithms. Our simulation study shows that the robit model is robust against model misspecification, and our EB method results in estimates with less bias than full Bayesian (FB) analysis. The methodology is applied to a *Celastrus orbiculatus* data set and a *Rhizoctonia* root disease data set. For the former, which is known to contain outlying observations, the robit model is shown to do better for predicting the spatial distribution of an invasive species. For the latter, our approach does as well as the classical models for predicting the severity of a root disease, as the probit link is shown to be appropriate.

Although this article is written for binomial SGLMMs for brevity, the EB methodology is more general and can be applied to other types of SGLMMs. The accompanying R package geoBayes provides implementations for other SGLMMs, such as Poisson and Gamma SGLMMs.

In the context of group testing screening, McMahan, Tebbs, and Bilder (2012, *Biometrics* **68**, 287–296) proposed a two-stage procedure in a heterogeneous population in the presence of misclassification. In earlier work published in *Biometrics*, Kim, Hudgens, Dreyfuss, Westreich, and Pilcher (2007, *Biometrics* **63**, 1152–1162) also proposed group testing algorithms in a homogeneous population with misclassification. In both cases, the authors evaluated performance of the algorithms based on the expected number of tests per person, with the optimal design being defined by minimizing this quantity. The purpose of this article is to show that although the expected number of tests per person is an appropriate evaluation criterion for group testing when there is no misclassification, it may be problematic when there is misclassification. Specifically, a valid criterion needs to take into account the amount of correct classification and not just the number of tests. We propose a more suitable objective function that accounts for not only the expected number of tests, but also the expected number of correct classifications. We then show how using this objective function that accounts for correct classification is important for design when considering group testing under misclassification. We also present novel analytical results which characterize the optimal Dorfman (1943) design under misclassification.
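The two ingredients of such an objective are easy to compute for the classic Dorfman two-stage design. The sketch below gives the standard textbook calculations under the usual assumption that test errors are independent across the pool and individual stages; the function name and parameter values are illustrative, not the article's notation:

```python
def dorfman_operating_characteristics(p, k, se, sp):
    """Expected tests and expected correct classifications per person for
    Dorfman two-stage group testing with pool size k, prevalence p,
    sensitivity se, and specificity sp (errors assumed independent
    across the pool and individual retesting stages)."""
    q = 1.0 - p
    # A pool tests positive if it is truly positive (prob 1 - q^k) and the
    # assay detects it, or it is truly negative and a false positive occurs.
    p_pool_pos = se * (1.0 - q**k) + (1.0 - sp) * q**k
    e_tests = 1.0 / k + p_pool_pos                  # per person
    # Positive individual: pool detected (se), then individual test positive (se).
    correct_pos = se * se
    # Negative individual: whether the pool is truly positive depends on
    # the other k-1 members.
    p_pool_pos_neg = (1.0 - sp) * q**(k - 1) + se * (1.0 - q**(k - 1))
    correct_neg = (1.0 - p_pool_pos_neg) + p_pool_pos_neg * sp
    e_correct = p * correct_pos + q * correct_neg   # per person
    return e_tests, e_correct

tests, correct = dorfman_operating_characteristics(p=0.05, k=5, se=0.95, sp=0.98)
print(tests, correct)
```

With a perfect assay (`se = sp = 1`) the expected tests per person reduce to the familiar Dorfman formula 1/k + 1 − (1 − p)^k and every subject is correctly classified, which is a useful sanity check; an objective function combining both outputs can then be minimized over k.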