We address estimation of intervention effects in experimental designs in which (a) interventions are assigned at the cluster level; (b) clusters are selected to form pairs, matched on observed characteristics; and (c) intervention is assigned to one cluster at random within each pair. One goal of policy interest is to estimate the average outcome if all clusters in all pairs are assigned to control versus if all clusters in all pairs are assigned to intervention. In such designs, inference that ignores individual-level covariates can be imprecise because cluster-level assignment can leave substantial imbalance in the covariate distribution between experimental arms within each pair. However, most existing methods that adjust for covariates have estimands that are not of policy interest. We propose a methodology that explicitly balances the observed covariates among clusters in a pair, and retains the original estimand of interest. We demonstrate our approach through the evaluation of the Guided Care program.

The performance of diagnostic tests for disease classification is often measured by accuracy (e.g., sensitivity or specificity); however, the costs of the diagnostic tests are a concern as well. Combinations of multiple diagnostic tests may improve accuracy, but incur additional costs. Here, we consider serial testing approaches that maintain accuracy while controlling the costs of the diagnostic tests. We present a serial risk score classification approach. The basic idea is to apply additional diagnostic tests sequentially, stopping as soon as a person is classified; in this way, it is not necessary to test all persons with all tests. The methods are studied in simulations and compared with logistic regression. We applied the methods to data from HIV cohort studies to identify HIV-infected individuals who were recently infected (within 1 year) by testing with assays for multiple biomarkers. We find that the serial risk score classification approach can maintain accuracy while achieving a reduction in cost compared to testing all individuals with all assays.
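The stop-early logic behind serial testing can be sketched as follows. The score contributions, assay names, and decision thresholds here are hypothetical stand-ins, not the article's fitted risk scores; the point is only that testing halts as soon as the running score crosses a classification boundary.

```python
# Hedged sketch of serial risk score classification: apply assays one at a
# time and stop as soon as the cumulative score crosses a decision
# threshold, so not every person receives every (costly) test.
# Score updates and thresholds below are hypothetical, not the article's.

def serial_classify(assay_results, lower=-2.0, upper=2.0):
    """assay_results: per-person list of (assay_name, score_contribution).
    Returns (label, number_of_assays_used)."""
    score = 0.0
    for n, (_, contrib) in enumerate(assay_results, start=1):
        score += contrib
        if score >= upper:
            return "recent", n          # classified: stop testing here
        if score <= lower:
            return "long-standing", n
    # fall back to a final cutoff if never conclusively classified
    return ("recent" if score > 0 else "long-standing"), len(assay_results)

person = [("assay_A", 1.4), ("assay_B", 1.1), ("assay_C", -0.3)]
print(serial_classify(person))  # classified after 2 of 3 assays
```

Because the third assay is never run for this person, its cost is saved; averaged over a cohort, this is the cost reduction the abstract describes.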

In some longitudinal studies, the initiation time of the process is not clearly defined, yet it is important to make inferences or predictions about the longitudinal process. The application of interest in this article is to provide a framework for modeling individualized labor curves (longitudinal cervical dilation measurements) where the start of labor is not clearly defined. This is a well-known problem in obstetrics, where the benchmark reference time is often chosen as the end of the process (individuals are fully dilated at 10 cm) and time is run backwards. This approach yields valid and efficient inference unless subjects are censored before the end of the process or the focus is on prediction. Providing dynamic individualized predictions of the longitudinal labor curve prospectively (where backwards time is unknown) is of interest to help obstetricians determine whether a labor is on a suitable trajectory. We propose a model for longitudinal labor dilation that uses a random-effects model with unknown time-zero and a random change point. We present a maximum likelihood approach for parameter estimation that uses adaptive Gaussian quadrature for the numerical integration. Further, we propose a Monte Carlo approach for dynamic prediction of the future longitudinal dilation trajectory from past dilation measurements. The methodology is illustrated with longitudinal cervical dilation data from the Consortium on Safe Labor Study.

Pocock et al. (2012, *European Heart Journal* **33**, 176–182) proposed a win ratio approach to analyzing composite endpoints composed of outcomes with different clinical priorities. In this article, we establish a statistical framework for this approach. We derive the null hypothesis and propose a closed-form variance estimator for the win ratio statistic in the all-pairwise-matching situation. Our simulation study shows that the proposed variance estimator performs well regardless of the magnitude of the treatment effect and the type of joint distribution of the outcomes.
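The all-pairwise win ratio can be sketched directly: each treated patient is compared with each control patient, first on the higher-priority outcome and, if tied, on the lower-priority one. The outcome names, priority order, and data below are hypothetical, and censoring, which makes some comparisons indeterminate in practice, is ignored.

```python
# Sketch of the win ratio statistic with all pairwise comparisons.
# Priority order and data are hypothetical; censoring is ignored.

def compare_pair(trt, ctl):
    for outcome in ("death_time", "hosp_time"):   # priority order: death first
        if trt[outcome] > ctl[outcome]:           # longer event-free time wins
            return "win"
        if trt[outcome] < ctl[outcome]:
            return "loss"
    return "tie"                                  # tied on both priority levels

def win_ratio(treated, control):
    results = [compare_pair(t, c) for t in treated for c in control]
    return results.count("win") / results.count("loss")  # null value is 1

treated = [{"death_time": 30, "hosp_time": 12}, {"death_time": 26, "hosp_time": 10}]
control = [{"death_time": 30, "hosp_time": 8}, {"death_time": 25, "hosp_time": 20}]
print(win_ratio(treated, control))  # 3 wins vs 1 loss -> 3.0
```

A win ratio above 1 favors treatment; the article's contribution is the null hypothesis and a closed-form variance estimator for this statistic, which the sketch does not attempt.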

A popular way to model overdispersed count data, such as the number of falls reported during intervention studies, is by means of the negative binomial (NB) distribution. Classical estimation methods are well known to be sensitive to model misspecification, which in such intervention studies can take the form of patients falling far more often than the NB regression model predicts. In this article, we extend two approaches for building robust *M*-estimators of the regression parameters in the class of generalized linear models to the NB distribution. The first approach achieves robustness in the response by applying a bounded function to the Pearson residuals arising in the maximum likelihood estimating equations, while the second achieves robustness by bounding the unscaled deviance components. For both approaches, we explore different choices of bounding functions. Through a unified notation, we show how close these approaches can actually be when the bounding functions are chosen and tuned appropriately, and we provide the asymptotic distributions of the resulting estimators. Moreover, we introduce a robust weighted maximum likelihood estimator of the overdispersion parameter, specific to the NB distribution. Simulations under various settings show that, for both approaches, redescending bounding functions yield estimates with smaller biases under contamination while retaining high efficiency at the assumed model. We present an application to a recent randomized controlled trial measuring the effectiveness of an exercise program at reducing the number of falls among people suffering from Parkinson's disease, illustrating the diagnostic use of such robust procedures and their necessity for reliable inference.
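The distinction between monotone and redescending bounding functions can be illustrated with two conventional choices applied to a Pearson residual: a Huber-type function caps the influence of a large residual, while a Tukey-biweight function drives it to zero. The tuning constants below are conventional textbook values, not the article's.

```python
# Sketch of two bounding-function families for Pearson residuals:
# a monotone Huber-type psi and a redescending Tukey-biweight psi.
# Tuning constants c are conventional choices, not the article's.

def huber_psi(r, c=1.345):
    """Monotone: caps the residual's contribution at +/- c."""
    return max(-c, min(c, r))

def tukey_psi(r, c=4.685):
    """Redescending: extreme residuals contribute nothing at all."""
    if abs(r) >= c:
        return 0.0
    u = r / c
    return r * (1 - u * u) ** 2

# A gross outlier (residual of 5): capped at 1.345 vs. zeroed out entirely.
print(huber_psi(5.0), tukey_psi(5.0))
```

Zeroing out extreme residuals is why, in the simulations described above, redescending functions give smaller biases under contamination: a heavily falling outlier patient ends up with no influence on the fitted NB regression.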

Studying the interactions between different brain regions is essential to achieve a more complete understanding of brain function. In this article, we focus on identifying functional co-activation patterns and undirected functional networks in neuroimaging studies. We build a functional brain network, using a sparse covariance matrix, with elements representing associations between region-level peak activations. We adopt a penalized likelihood approach to impose sparsity on the covariance matrix based on an extended multivariate Poisson model. We obtain penalized maximum likelihood estimates via the expectation-maximization (EM) algorithm and optimize an associated tuning parameter by maximizing the predictive log-likelihood. Permutation tests on the brain co-activation patterns provide region-pair and network-level inference. Simulations suggest that the proposed approach has minimal bias and achieves coverage rates close to the nominal 95% level for the covariance estimates. Conducting a meta-analysis of 162 functional neuroimaging studies on emotions, our model identifies a functional network that consists of connected regions within the basal ganglia, limbic system, and other emotion-related brain regions. We characterize this network through statistical inference on region-pair connections as well as by graph measures.

Mediation analysis is important for understanding the mechanisms whereby one variable causes changes in another. Measurement error could obscure the ability of the potential mediator to explain such changes. This article focuses on developing correction methods for measurement error in the mediator with failure time outcomes. We consider a broad definition of measurement error, including technical error, and error associated with temporal variation. The underlying model with the “true” mediator is assumed to be of the Cox proportional hazards model form. The induced hazard ratio for the observed mediator no longer has a simple form independent of the baseline hazard function, due to the conditioning event. We propose a mean-variance regression calibration approach and a follow-up time regression calibration approach, to approximate the partial likelihood for the induced hazard function. Both methods demonstrate value in assessing mediation effects in simulation studies. These methods are generalized to multiple biomarkers and to both case-cohort and nested case-control sampling designs. We apply these correction methods to the Women's Health Initiative hormone therapy trials to understand the mediation effect of several serum sex hormone measures on the relationship between postmenopausal hormone therapy and breast cancer risk.

Semi-parametric regression models for the joint estimation of marginal mean and within-cluster pairwise association parameters are used in a variety of settings for population-averaged modeling of multivariate categorical outcomes. Recently, a formulation of alternating logistic regressions based on orthogonalized, marginal residuals has been introduced for correlated binary data. Unlike the original procedure based on conditional residuals, its covariance estimator is invariant to the ordering of observations within clusters. In this article, the orthogonalized residuals method is extended to model correlated ordinal data with a global odds ratio, and shown in a simulation study to be more efficient and less biased with regard to estimating within-cluster association parameters than an existing extension to ordinal data of alternating logistic regressions based on conditional residuals. Orthogonalized residuals are used to estimate a model for three correlated ordinal outcomes measured repeatedly in a longitudinal clinical trial of an intervention to improve recovery of patients’ perception of altered sensation following jaw surgery.

Motivated by objective measurements of physical activity, we take a functional data approach to longitudinal data with simultaneous measurement of a continuous and a binary outcome. The regression structures are specified as smooth curves measured at various time-points with random effects that have a hierarchical correlation structure. The random effect curves for each variable are summarized using a few important principal components, and the association of the two longitudinal variables is modeled through the association of the principal component scores. We use penalized splines to model the mean curves and the principal component curves, and cast the proposed model into a mixed effects model framework for model fitting, prediction, and inference. Via a quasi-likelihood type approximation for the binary component, we develop an algorithm to fit the model. Data-based transformation of the continuous variable and selection of the number of principal components are incorporated into the algorithm. The method is applied to the motivating physical activity data and is evaluated empirically by a simulation study. Extensions for different types of outcomes are also discussed.

Variable screening has emerged as a crucial first step in the analysis of high-throughput data, but existing procedures can be computationally cumbersome, difficult to justify theoretically, or inapplicable to certain types of analyses. Motivated by a high-dimensional censored quantile regression problem in multiple myeloma genomics, this article makes three contributions. First, we establish a score test-based screening framework, which is widely applicable, extremely computationally efficient, and relatively simple to justify. Second, we propose a resampling-based procedure for selecting the number of variables to retain after screening according to the principle of reproducibility. Finally, we propose a new iterative score test screening method which is closely related to sparse regression. In simulations we apply our methods to four different regression models and show that they can outperform existing procedures. We also apply score test screening to an analysis of gene expression data from multiple myeloma patients using a censored quantile regression model to identify high-risk genes.
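A minimal illustration of why score-test screening is computationally cheap is the ordinary linear model case: under the global null, the score statistic for each feature depends only on the null residuals, so all features are screened in a single matrix pass with no per-feature model fitting. The data and the simple variance plug-in below are illustrative, not the article's censored quantile regression setting.

```python
# Sketch of score-test screening in a linear model: rank features by a
# score statistic computed from residuals under the global null
# (no per-feature model refitting). Data are simulated for illustration.

import numpy as np

def score_screen(X, y, keep):
    r = y - y.mean()                        # residuals under the global null
    Xc = X - X.mean(axis=0)
    u = Xc.T @ r                            # score contribution per feature
    v = (Xc ** 2).sum(axis=0) * r.var()     # simple plug-in score variance
    stats = u ** 2 / v                      # standardized score statistics
    return np.argsort(stats)[::-1][:keep]   # indices of top `keep` features

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(size=200)
print(sorted(int(j) for j in score_screen(X, y, keep=2)))
```

The two truly active features (indices 3 and 17) dominate the ranking; the whole screen is one matrix-vector product, which is what makes the approach scale to high-throughput data.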

Treatment-selection markers predict an individual's response to different therapies, thus allowing for the selection of a therapy with the best predicted outcome. A good marker-based treatment-selection rule can significantly impact public health through the reduction of the disease burden in a cost-effective manner. Our goal in this article is to use data from randomized trials to identify optimal linear and nonlinear biomarker combinations for treatment selection that minimize the total burden to the population caused by either the targeted disease or its treatment. We frame this objective as a general problem of minimizing a weighted sum of 0-1 loss and propose a novel penalized minimization method that is based on the difference of convex functions algorithm (DCA). The corresponding estimator of marker combinations has a kernel property that allows flexible modeling of linear and nonlinear marker combinations. We compare the proposed methods with existing methods for optimizing treatment regimens such as the logistic regression model and the weighted support vector machine. Performances of different weight functions are also investigated. The application of the proposed method is illustrated using a real example from an HIV vaccine trial: we search for a combination of Fc receptor genes for recommending vaccination in preventing HIV infection.

When estimating the effect of an exposure or treatment on an outcome, it is important to select the proper subset of confounding variables to include in the model. Including too many covariates inflates the mean squared error of the effect estimate, while omitting confounding variables biases it. We propose a decision-theoretic approach to confounder selection and effect estimation. We first estimate the full standard Bayesian regression model and then post-process the posterior distribution with a loss function that penalizes models omitting important confounders. Our method can be fit easily with existing software, and in many situations without the use of Markov chain Monte Carlo methods, resulting in computation on the order of the least squares solution. We prove that the proposed estimator has attractive asymptotic properties. In a simulation study we show that our method outperforms existing methods. We demonstrate our method by estimating the effect of fine particulate matter (PM) exposure on birth weight in Mecklenburg County, North Carolina.

There has been considerable work on fitting Ising models to multivariate binary data in order to understand the conditional dependency relationships between the variables. However, additional covariates are frequently recorded together with the binary data, and may influence the dependence relationships. Motivated by such a dataset on genomic instability collected from tumor samples of several types, we propose a sparse covariate dependent Ising model to study both the conditional dependency within the binary data and its relationship with the additional covariates. This results in subject-specific Ising models, where the subject's covariates influence the strength of association between the genes. As in all exploratory data analysis, interpretability of results is important, and we use penalties to induce sparsity in the fitted graphs and in the number of selected covariates. Two algorithms to fit the model are proposed and compared on a set of simulated data, and asymptotic results are established. The results on the tumor dataset and their biological significance are discussed in detail.

In many biomedical studies, patients may experience the same type of recurrent event repeatedly over time, such as bleeding, multiple infections, and disease. In this article, we propose a Bayesian design for a pivotal clinical trial in which lower-risk myelodysplastic syndromes (MDS) patients are treated with MDS disease-modifying therapies. One of the key study objectives is to demonstrate the effect of the investigational product (treatment) on the reduction of platelet transfusion and bleeding events while patients receive MDS therapies. In this context, we propose a new Bayesian approach for the design of superiority clinical trials using recurrent events frailty regression models. Historical recurrent events data from an already completed phase 2 trial are incorporated into the Bayesian design via the partial borrowing power prior of Ibrahim et al. (2012, *Biometrics* **68**, 578–586). An efficient Gibbs sampling algorithm, a predictive data generation algorithm, and a simulation-based algorithm are developed for sampling from the fitted posterior distribution, generating the predictive recurrent events data, and computing various design quantities such as the type I error rate and power, respectively. An extensive simulation study is conducted to compare the proposed method to existing frequentist methods and to investigate various operating characteristics of the proposed design.

There has been increasing interest in the analysis of spatially distributed multivariate binary data, motivated by a wide range of research problems. Two types of correlation are usually involved: the correlation between the multiple outcomes at one location, and the spatial correlation between locations for one particular outcome. Commonly used regression models consider only one type of correlation while ignoring, or inappropriately modeling, the other. To address this limitation, we adopt a Bayesian nonparametric approach to jointly model multivariate spatial binary data by integrating both types of correlation. A multivariate probit model is employed to link the binary outcomes to Gaussian latent variables, and Gaussian processes are applied to specify the spatially correlated random effects. We develop an efficient Markov chain Monte Carlo algorithm for the posterior computation. We illustrate the proposed model on simulation studies and a multidrug-resistant tuberculosis case study.

Research in the field of nonparametric shape-constrained regression has been intensive, yet only a few publications explicitly deal with unimodality, although such methods are needed in applications, for example, in dose–response analysis. In this article, we propose unimodal spline regression methods that make use of Bernstein–Schoenberg splines and their shape preservation property. To achieve unimodal and smooth solutions, we use penalized splines and extend the penalized spline approach toward penalizing against general parametric functions, instead of using just difference penalties. For tuning-parameter selection under a unimodality constraint, we develop a restricted maximum likelihood approach and an alternative Bayesian approach for unimodal regression. We compare the proposed methodologies to other common approaches in a simulation study and apply them to a dose–response data set. All results suggest that the unimodality constraint, or the combination of unimodality and a penalty, can substantially improve estimation of the functional relationship.
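The standard difference penalty that the article generalizes can be sketched with ordinary penalized least squares. The Gaussian-bump basis below is a self-contained stand-in for a Bernstein–Schoenberg spline basis, and the smoothing parameter is fixed by hand rather than selected by REML as in the article.

```python
# Sketch of a penalized-spline fit with a second-order difference penalty
# (the classical P-spline penalty the article extends). The basis and the
# fixed smoothing parameter are illustrative choices, not the article's.

import numpy as np

def bump_basis(x, k=12):
    """Gaussian-bump basis: a simple stand-in for a spline basis."""
    knots = np.linspace(x.min(), x.max(), k)
    width = knots[1] - knots[0]
    return np.exp(-0.5 * ((x[:, None] - knots[None, :]) / width) ** 2)

def pspline_fit(x, y, lam=0.5, k=12):
    B = bump_basis(x, k)
    D = np.diff(np.eye(k), n=2, axis=0)        # second-order difference matrix
    # penalized least squares: (B'B + lam * D'D) beta = B'y
    beta = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    return B @ beta

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
fit = pspline_fit(x, y)
print(float(np.mean((fit - np.sin(2 * np.pi * x)) ** 2)))
```

The penalty term `lam * D'D` shrinks second differences of adjacent basis coefficients toward zero; replacing this difference penalty by a penalty against a general parametric function, and adding a unimodality constraint, is the extension the abstract describes.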

Integrative genomics offers a promising approach to more powerful genetic association studies. The hope is that combining outcome and genotype data with other types of genomic information can lead to more powerful SNP detection. We present a new association test based on a statistical model that explicitly assumes that genetic variations affect the outcome through perturbing gene expression levels. It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that our method is relatively robust to misspecification. We also provide a strategy for applying our approach to high-dimensional genomic data. We use this strategy to identify a potentially new association between a SNP and a yeast cell's response to the natural product tomatidine, which standard association analysis did not detect.

We develop a linear mixed regression model in which both the response and the predictor are functions. Model parameters are estimated by maximizing the log-likelihood via the ECME algorithm. The estimated variance parameters and covariance matrices are shown to be positive and positive definite, respectively, at each iteration. In simulation studies, the approach outperforms competing methods in terms of fitting error and the MSE of the estimated “regression coefficients.”

We consider the problem of robust estimation of the regression relationship between a response and a covariate based on a sample in which precise measurements on the covariate are not available but error-prone surrogates for the unobserved covariate are available for each sampled unit. Existing methods often make restrictive and unrealistic assumptions about the density of the covariate and the densities of the regression and the measurement errors, for example, normality and, for the latter two, also homoscedasticity and thus independence from the covariate. In this article we describe Bayesian semiparametric methodology based on mixtures of B-splines and mixtures induced by Dirichlet processes that relaxes these restrictive assumptions. In particular, our models for the aforementioned densities adapt to asymmetry, heavy tails and multimodality. The models for the densities of regression and measurement errors also accommodate conditional heteroscedasticity. In simulation experiments, our method vastly outperforms existing methods. We apply our method to data from nutritional epidemiology.

Complex computer models play a crucial role in air quality research. These models are used to evaluate potential regulatory impacts of emission control strategies and to estimate air quality in areas without monitoring data. For both of these purposes, it is important to calibrate model output with monitoring data to adjust for model biases and improve spatial prediction. In this article, we propose a new spectral method to study and exploit complex relationships between model output and monitoring data. Spectral methods allow us to estimate the relationship between model output and monitoring data separately at different spatial scales, and to use model output for prediction only at the appropriate scales. The proposed method is computationally efficient and can be implemented using standard software. We apply the method to compare Community Multiscale Air Quality (CMAQ) model output with ozone measurements in the United States in July 2005. We find that CMAQ captures large-scale spatial trends, but has low correlation with the monitoring data at small spatial scales.

It is difficult to accurately estimate species richness if there are many almost undetectable species in a hyper-diverse community. Practically, an accurate lower bound for species richness is preferable to an inaccurate point estimator. The traditional nonparametric lower bound developed by Chao (1984, *Scandinavian Journal of Statistics* **11**, 265–270) for individual-based abundance data uses only the information on the rarest species (the numbers of singletons and doubletons) to estimate the number of undetected species in samples. Applying a modified Good–Turing frequency formula, we derive an approximate formula for the first-order bias of this traditional lower bound. The approximate bias is estimated by using additional information (namely, the numbers of tripletons and quadrupletons). This approximate bias can be corrected, and an improved lower bound is thus obtained. The proposed lower bound is nonparametric in the sense that it is universally valid for any species abundance distribution. A similar type of improved lower bound can be derived for incidence data. We test our proposed lower bounds on simulated data sets generated from various species abundance models. Simulation results show that the proposed lower bounds always reduce bias over the traditional lower bounds and improve accuracy (as measured by mean squared error) when the heterogeneity of species abundances is relatively high. We also apply the proposed new lower bounds to real data for illustration and for comparisons with previously developed estimators.
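The traditional lower bound being improved here is simple to compute from the frequency counts of rare species. The sketch below implements the bias-corrected form of the classical Chao1 estimator, which uses only singletons (f1) and doubletons (f2); the improved bound described in the article additionally uses tripletons and quadrupletons and is not attempted here. The abundance vector is hypothetical.

```python
# Sketch of the classical Chao1 lower bound for species richness
# (Chao, 1984), using only singleton (f1) and doubleton (f2) counts.
# The bias-corrected form below also avoids division by zero when f2 = 0.

from collections import Counter

def chao1_lower_bound(abundances):
    """abundances: list of per-species sample counts (zeros ignored)."""
    counts = [a for a in abundances if a > 0]
    s_obs = len(counts)                      # observed species richness
    freq = Counter(counts)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)  # singletons, doubletons
    # bias-corrected Chao1: S_obs + f1*(f1 - 1) / (2*(f2 + 1))
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

abund = [1, 1, 1, 2, 2, 5, 9, 14]  # hypothetical abundance vector
print(chao1_lower_bound(abund))    # 8 observed + estimated undetected -> 9.0
```

With three singletons and two doubletons, the estimated number of undetected species is 3·2/(2·3) = 1, giving a lower bound of 9 species. The article's contribution is a Good–Turing-based estimate of this bound's first-order bias, yielding a sharper bound.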

Follow-up is increasingly used in medicine and doping control to identify abnormal results in an individual. Currently, follow-up is mostly carried out variable by variable using “reference intervals” that contain the values observable in a specified percentage of healthy/undoped individuals. Observing the evolution of the variables over time in a sample of *N* healthy/undoped individuals allows these reference intervals to be individualized, taking into account the possible effect of covariables and previous observations of these variables obtained when the individual was healthy/undoped. For each variable, these individualized intervals should contain a specified percentage of observable values compatible with the values previously observed in this individual. General methods to build such intervals are available, but they allow only a variable-by-variable follow-up, whatever the possible correlations over time between the variables. In this article, we propose a general method to jointly follow up several correlated variables over time. This methodology relies on a multivariate linear mixed effects model. We first provide a method to estimate the model's parameters. In an asymptotic framework (*N* large enough), we then derive an individualized prediction region. Sometimes, the sample size *N* is not large enough for the asymptotic framework to yield a reasonable prediction region; for this reason, we propose and compare three different prediction regions that should behave better for small *N*. Finally, the whole methodology is illustrated by the follow-up of kidney insufficiency in cats.

Spatial-clustered data refer to high-dimensional correlated measurements collected from units or subjects that are spatially clustered. Such data arise frequently from studies in the social and health sciences. We propose a unified modeling framework, termed GeoCopula, to characterize both large-scale and small-scale variation for various data types, including continuous data, binary data, and count data as special cases. To overcome challenges in estimation and inference for the model parameters, we propose an efficient composite likelihood approach in which estimation efficiency results from the construction of over-identified joint composite estimating equations. The statistical theory for the proposed estimation is developed by extending the classical theory of the generalized method of moments. A clear advantage of the proposed estimation method is its computational feasibility. We conduct several simulation studies to assess the performance of the proposed models and estimation methods for both Gaussian and binary spatial-clustered data. Results show a clear improvement in estimation efficiency over the conventional composite likelihood method. An illustrative data example is included to motivate and demonstrate the proposed method.

We investigate a model for abundance estimation in closed-population capture–recapture studies, where animals are identified from natural marks such as DNA profiles or photographs of distinctive individual features. The model extends the classical model to accommodate errors in identification, by specifying that each sample identification is correct with some probability and false otherwise. Information about misidentification is gained from a surplus of capture histories with only one entry, which arise from false identifications. We derive an exact closed-form expression for the likelihood under this model and show that it can be computed efficiently, in contrast to previous studies which have held the likelihood to be computationally intractable. Our fast computation enables us to conduct a thorough investigation of the statistical properties of the maximum likelihood estimates. We find that the indirect approach to error estimation places high demands on data richness, and good statistical properties in terms of precision and bias require high capture probabilities or many capture occasions. When these requirements are not met, abundance is estimated with very low precision and negative bias, and at the extreme better properties can be obtained by the naive approach of ignoring misidentification error. We recommend that the model be used with caution and that other strategies for handling misidentification error be considered. We illustrate our study with genetic and photographic surveys of the New Zealand population of southern right whales (*Eubalaena australis*).

A Bayesian approach to the prediction of occurred-but-not-yet-reported events is developed for application in real-time public health surveillance. The motivation was the prediction of the daily number of hospitalizations for the hemolytic-uremic syndrome during the large May–July 2011 outbreak of Shiga toxin-producing *Escherichia coli* (STEC) O104:H4 in Germany. Our novel Bayesian approach addresses the count data nature of the problem using negative binomial sampling and shows that right-truncation of the reporting delay distribution under an assumption of time-homogeneity can be handled in a conjugate prior-posterior framework using the generalized Dirichlet distribution. Since, in retrospect, the true number of hospitalizations is available, proper scoring rules for count data are used to evaluate and compare the predictive quality of the procedures during the outbreak. The results show that it is important to take the count nature of the time series into account and that changes in the delay distribution occurred due to intervention measures. As a consequence, we extend the Bayesian analysis to a hierarchical model, which combines a discrete time survival regression model for the delay distribution with a penalized spline for the dynamics of the epidemic curve. Altogether, we conclude that in emerging and time-critical outbreaks, nowcasting approaches are a valuable tool to gain information about current trends.
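The core idea of nowcasting can be conveyed with a much simpler frequentist analogue: inflate the counts reported so far by the estimated probability that a case with onset on day t has been reported by "now." The article's Bayesian model replaces this plug-in division with negative binomial sampling and a prior on the (right-truncated) delay distribution; the delay probabilities and counts below are hypothetical.

```python
# Simplified plug-in analogue of nowcasting occurred-but-not-yet-reported
# cases: divide the observed count for each onset day by the probability
# that such a case has been reported by today. Not the article's Bayesian
# procedure; delay distribution and counts are hypothetical.

def nowcast(reported_so_far, delay_cdf, today):
    """reported_so_far[t]: cases with onset on day t reported by `today`.
    delay_cdf[d]: P(reporting delay <= d days)."""
    est = []
    for t, n in enumerate(reported_so_far):
        d = today - t                              # time available to report
        p = delay_cdf[min(d, len(delay_cdf) - 1)]  # reporting probability
        est.append(n / p if p > 0 else float("nan"))
    return est

delay_cdf = [0.2, 0.5, 0.8, 1.0]  # hypothetical: all cases reported by day 3
reported = [10, 8, 4, 1]          # onsets on days 0..3, as known on day 3
print(nowcast(reported, delay_cdf, today=3))
```

Recent onset days are inflated the most (day 3's single reported case becomes an estimate of 5), which is exactly the part of the epidemic curve that naive reporting understates during an outbreak.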

Markers that predict treatment effect have the potential to improve patient outcomes. For example, one established breast cancer gene-expression score has some ability to predict the benefit of adjuvant chemotherapy over and above hormone therapy for the treatment of estrogen-receptor-positive breast cancer, facilitating the provision of chemotherapy to women most likely to benefit from it. Given that the score was originally developed for predicting outcome given hormone therapy alone, it is of interest to develop alternative combinations of the genes comprising the score that are optimized for treatment selection. However, most existing methodology for combining markers is useful when predicting outcome under a single treatment. We propose a method for combining markers for treatment selection that requires modeling the treatment effect as a function of markers. Multiple models of treatment effect are fit iteratively by upweighting, or “boosting,” subjects potentially misclassified according to treatment benefit at the previous stage. The boosting approach is compared to existing methods in a simulation study based on the change in expected outcome under marker-based treatment. The approach improves upon existing methods in some settings and has comparable performance in others. Our simulation study also provides insights into the relative merits of the existing methods. Application of the boosting approach to the breast cancer data, using scaled versions of the original markers, produces marker combinations that may have improved performance for treatment selection.

Kang, Janes and Huang propose an interesting boosting method to combine biomarkers for treatment selection. The method requires modeling the treatment effects using markers. We discuss an alternative method, outcome weighted learning. This method sidesteps the need for modeling the outcomes, and thus can be more robust to model misspecification.

The log-rank test has been widely used to test treatment effects under the Cox model for censored time-to-event outcomes, though it may lose power substantially when the model's proportional hazards assumption does not hold. In this article, we consider an extended Cox model that uses B-splines or smoothing splines to model a time-varying treatment effect and propose score test statistics for the treatment effect. Our proposed new tests combine statistical evidence from both the magnitude and the shape of the time-varying hazard ratio function, and thus are omnibus and powerful against various types of alternatives. In addition, the new testing framework is applicable to any choice of spline basis functions, including B-splines and smoothing splines. Simulation studies confirm that the proposed tests performed well in finite samples and were frequently more powerful than conventional tests in many settings. The new methods were applied to the HIVNET 012 Study, a randomized clinical trial to assess the efficacy of single-dose Nevirapine against mother-to-child HIV transmission, conducted by the HIV Prevention Trials Network.

Interference occurs when the treatment of one person affects the outcome of another. For example, in infectious diseases, whether one individual is vaccinated may affect whether another individual becomes infected or develops disease. Quantifying such indirect (or spillover) effects of vaccination could have important public health or policy implications. In this article we use recently developed inverse-probability weighted (IPW) estimators of treatment effects in the presence of interference to analyze an individually-randomized, placebo-controlled trial of cholera vaccination that targeted 121,982 individuals in Matlab, Bangladesh. Because these IPW estimators have not been employed previously, a simulation study was also conducted to assess the empirical behavior of the estimators in settings similar to the cholera vaccine trial. Simulation study results demonstrate the IPW estimators can yield unbiased estimates of the direct, indirect, total, and overall effects of vaccination when there is interference provided the untestable no unmeasured confounders assumption holds and the group-level propensity score model is correctly specified. Application of the IPW estimators to the cholera vaccine trial indicates the presence of interference. For example, the IPW estimates suggest on average 5.29 fewer cases of cholera per 1000 person-years (95% confidence interval 2.61, 7.96) will occur among unvaccinated individuals within neighborhoods with 60% vaccine coverage compared to neighborhoods with 32% coverage. Our analysis also demonstrates how not accounting for interference can render misleading conclusions about the public health utility of vaccination.

Estimating the effectiveness of a new intervention is usually the primary objective for HIV prevention trials. The Cox proportional hazards model is mainly used to estimate effectiveness, under the assumptions that participants with the same covariates share the same risk and that this risk is always non-zero. In fact, the risk is non-zero only when an exposure event occurs, and participants can have varying risk due to varying patterns of exposure events. Therefore, we propose a novel estimate of effectiveness adjusted for the heterogeneity in the magnitude of exposure among the study population, using a latent Poisson process model for the exposure path of each participant. Moreover, our model considers the scenario in which a proportion of participants never experience an exposure event and adopts a zero-inflated distribution for the rate of the exposure process. We employ a Bayesian estimation approach to estimate the exposure-adjusted effectiveness, eliciting priors from historical information. Simulation studies are carried out to validate the approach and explore the properties of the estimates. An application example is presented from an HIV prevention trial.

Motivated by the problem of constructing gene co-expression networks, we propose a statistical framework for estimating a high-dimensional partial correlation matrix by a three-step approach. We first obtain a penalized estimate of the partial correlation matrix using a ridge penalty. Next we select the non-zero entries of the partial correlation matrix by hypothesis testing. Finally we re-estimate the partial correlation coefficients at these non-zero entries. In the second step, the null distribution of the test statistics derived from penalized partial correlation estimates has not been established. We address this challenge by estimating the null distribution from the empirical distribution of the test statistics of all the penalized partial correlation estimates. Extensive simulation studies demonstrate the good performance of our method. Application to a yeast cell cycle gene expression data set shows that our method delivers better predictions of the protein–protein interactions than the graphical lasso.
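The first two steps of the three-step approach can be sketched as follows. This is an illustrative stand-in, not the article's estimator: the ridge-regularized precision matrix replaces the article's penalized estimate, and a crude quantile threshold stands in for the empirical-null hypothesis tests (step 3, re-estimation at the selected entries, is only noted in a comment):

```python
import numpy as np

def ridge_partial_correlations(X, lam=1.0):
    """Step 1 (sketch): partial correlations from the inverse of a
    ridge-regularized sample covariance matrix.  X is n x p."""
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    omega = np.linalg.inv(S + lam * np.eye(p))   # regularized precision matrix
    d = np.sqrt(np.diag(omega))
    pc = -omega / np.outer(d, d)                 # rescale to correlations
    np.fill_diagonal(pc, 1.0)
    return pc

def threshold_by_empirical_null(pc, alpha=0.05):
    """Step 2 (simplified): keep entries whose magnitude exceeds an upper
    quantile of all off-diagonal estimates, treating the bulk of the
    estimates as draws from the null.  The article's step 3 would then
    re-estimate the surviving entries without penalization."""
    p = pc.shape[0]
    off = np.abs(pc[np.triu_indices(p, k=1)])
    thresh = np.quantile(off, 1 - alpha)
    sparse = np.where(np.abs(pc) >= thresh, pc, 0.0)
    np.fill_diagonal(sparse, 1.0)
    return sparse
```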

A potential avenue for improving healthcare efficiency is to effectively tailor individualized treatment strategies by incorporating patient-level predictor information such as environmental exposure, biological, and genetic marker measurements. Many useful statistical methods for deriving individualized treatment rules (ITR) have become available in recent years. Prior to adopting any ITR in clinical practice, it is crucial to evaluate its value in improving patient outcomes. Existing methods for quantifying such values mainly consider either a single marker or semi-parametric methods that are subject to bias under model misspecification. In this article, we consider a general setting with multiple markers and propose a two-step robust method to derive ITRs and evaluate their values. We also propose procedures for comparing different ITRs, which can be used to quantify the incremental value of new markers in improving treatment selection. While working models are used in step I to approximate optimal ITRs, we add a layer of calibration to guard against model misspecification and further assess the value of the ITR non-parametrically, which ensures the validity of the inference. To account for the sampling variability of the estimated rules and their corresponding values, we propose a resampling procedure to provide valid confidence intervals for the value functions as well as for the incremental value of new markers for treatment selection. Our proposals are examined through extensive simulation studies and illustrated with the data from a clinical trial that studies the effects of two drug combinations on HIV-1 infected patients.

In high-dimensional data analysis, it is of primary interest to reduce the data dimensionality without loss of information. Sufficient dimension reduction (SDR) arises in this context, and many successful SDR methods have been developed since the introduction of sliced inverse regression (SIR) [Li (1991) *Journal of the American Statistical Association* **86**, 316–327]. Despite this fast progress, most existing methods target regression problems with a continuous response. For binary classification problems, SIR suffers from the limitation of estimating at most one direction since only two slices are available. In this article, we develop a new and flexible probability-enhanced SDR method for binary classification problems by using the weighted support vector machine (WSVM). The key idea is to slice the data based on conditional class probabilities of observations rather than their binary responses. We first show that the central subspace based on the conditional class probability is the same as that based on the binary response. This important result justifies the proposed slicing scheme from a theoretical perspective and assures no information loss. In practice, the true conditional class probability is generally not available, and the problem of probability estimation can be challenging for data with high-dimensional inputs. We observe that, in order to implement the new slicing scheme, one does not need exact probability values and the only required information is the relative order of probability values. Motivated by this fact, our new SDR procedure bypasses the probability estimation step and employs the WSVM to directly estimate the order of probability values, based on which the slicing is performed. The performance of the proposed probability-enhanced SDR scheme is evaluated by both simulated and real data examples.

The identification of causal peer effects (also known as social contagion or induction) from observational data in social networks is challenged by two distinct sources of bias: latent homophily and unobserved confounding. In this paper, we investigate how causal peer effects of traits and behaviors can be identified using genes (or other structurally isomorphic variables) as instrumental variables (IV) in a large set of data generating models with homophily and confounding. We use directed acyclic graphs to represent these models, employ multiple IV strategies, and report three main identification results. First, using a single fixed gene (or allele) as an IV will generally fail to identify peer effects if the gene affects past values of the treatment. Second, multiple fixed genes/alleles, or, more promisingly, time-varying gene expression, can identify peer effects if we instrument exclusion violations as well as the focal treatment. Third, we show that IV identification of peer effects remains possible even under multiple complications often regarded as lethal for IV identification of intra-individual effects, such as pleiotropy on observables and unobservables, homophily on past phenotype, past and ongoing homophily on genotype, inter-phenotype peer effects, population stratification, gene expression that is endogenous to past phenotype and past gene expression, and others. We apply our identification results to estimating peer effects of body mass index (BMI) among friends and spouses in the Framingham Heart Study. Results suggest a positive causal peer effect of BMI between friends.

This article presents an Analysis of Variance model for functional data that explicitly incorporates phase variability through a time-warping component, allowing for a unified approach to estimation and inference in the presence of amplitude and time variability. The focus is on single-random-factor models but the approach can be easily generalized to more complex ANOVA models. The behavior of the estimators is studied by simulation, and an application to the analysis of growth curves of flour beetles is presented. Although the model assumes a smooth latent process behind the observed trajectories, smoothness of the observed data is not required; the method can be applied to irregular time grids, which are common in longitudinal studies.

In cancer research, profiling studies have been extensively conducted, searching for genes/SNPs associated with prognosis. Cancer is diverse. Examining the similarity and difference in the genetic basis of multiple subtypes of the same cancer can lead to a better understanding of their connections and distinctions. Classic meta-analysis methods analyze each subtype separately and then compare analysis results across subtypes. Integrative analysis methods, in contrast, analyze the raw data on multiple subtypes simultaneously and can outperform meta-analysis methods. In this study, prognosis data on multiple subtypes of the same cancer are analyzed. An AFT (accelerated failure time) model is adopted to describe survival. The genetic basis of multiple subtypes is described using the heterogeneity model, which allows a gene/SNP to be associated with prognosis of some subtypes but not others. A compound penalization method is developed to identify genes that contain important SNPs associated with prognosis. The proposed method has an intuitive formulation and is realized using an iterative algorithm. Asymptotic properties are rigorously established. Simulation shows that the proposed method has satisfactory performance and outperforms a penalization-based meta-analysis method and a regularized thresholding method. An NHL (non-Hodgkin lymphoma) prognosis study with SNP measurements is analyzed. Genes associated with the three major subtypes, namely DLBCL, FL, and CLL/SLL, are identified. The proposed method identifies genes that are different from alternatives and have important implications and satisfactory prediction performance.

Many papers have introduced adaptive clinical trial methods that allow modifications to the sample size based on interim estimates of treatment effect. There has been extensive commentary on type I error control and efficiency considerations, but little research on estimation after an adaptive hypothesis test. We evaluate the reliability and precision of different inferential procedures in the presence of an adaptive design with pre-specified rules for modifying the sampling plan. We extend group sequential orderings of the outcome space based on the stage at stopping, likelihood ratio statistic, and sample mean to the adaptive setting in order to compute median-unbiased point estimates, exact confidence intervals, and *P*-values uniformly distributed under the null hypothesis. The likelihood ratio ordering is found to average shorter confidence intervals and produce higher probabilities of *P*-values below important thresholds than alternative approaches. The bias adjusted mean demonstrates the lowest mean squared error among candidate point estimates. A conditional error-based approach in the literature has the benefit of being the only method that accommodates unplanned adaptations. We compare the performance of this and other methods in order to quantify the cost of failing to plan ahead in settings where adaptations could realistically be pre-specified at the design stage. We find the cost to be meaningful for all designs and treatment effects considered, and to be substantial for designs frequently proposed in the literature.

Among patients on dialysis, cardiovascular disease and infection are leading causes of hospitalization and death. Although recent studies have found that the risk of cardiovascular events is higher after an infection-related hospitalization, studies have not fully elucidated how the risk of cardiovascular events changes over time for patients on dialysis. In this work, we characterize the dynamics of cardiovascular event risk trajectories for patients on dialysis while conditioning on survival status via multiple time indices: (1) time since the start of dialysis, (2) time since the pivotal initial infection-related hospitalization, and (3) the patient's age at the start of dialysis. This is achieved by using a new class of generalized multiple-index varying coefficient (GM-IVC) models. The proposed GM-IVC models utilize a multiplicative structure and one-dimensional varying coefficient functions along each time and age index to capture the cardiovascular risk dynamics before and after the initial infection-related hospitalization among the dynamic cohort of survivors. We develop a two-step estimation procedure for the GM-IVC models based on local maximum likelihood. We report new insights on the dynamics of cardiovascular event risk using the United States Renal Data System database, which collects data on nearly all patients with end-stage renal disease in the United States. Finally, simulation studies assess the performance of the proposed estimation procedures.

Here, we consider time-to-event data where individuals can experience two or more types of events that are not distinguishable from one another without further confirmation, perhaps by laboratory test. The event type of primary interest can occur only once. The other types of events can recur. If the type of a portion of the events is identified, this forms a validation set. However, even if a random sample of events is tested, confirmations can be missing nonmonotonically, creating uncertainty about whether an individual is still at risk for the event of interest. For example, in a study to estimate efficacy of an influenza vaccine, an individual may experience a sequence of symptomatic respiratory illnesses caused by various pathogens over the season. Often only a limited number of these episodes are confirmed in the laboratory to be influenza-related or not. We propose two novel methods to estimate covariate effects in this survival setting, and subsequently vaccine efficacy. The first is a pathway expectation-maximization (EM) algorithm that takes into account all pathways of event types in an individual compatible with that individual's test outcomes. The pathway EM iteratively estimates baseline hazards that are used to weight possible event types. The second method is a non-iterative pathway piecewise validation method that does not estimate the baseline hazards. These methods are compared with a previous simpler method. Simulation studies suggest mean squared error is lower in the efficacy estimates when the baseline hazards are estimated, especially at higher hazard rates. We use the pathway EM algorithm to reevaluate the efficacy of a trivalent live-attenuated influenza vaccine during the 2003–2004 influenza season in Temple-Belton, Texas, and compare our results with a previously published analysis.

Spatially referenced datasets arising from multiple sources are routinely combined to assess relationships among various outcomes and covariates. The geographical units associated with the data, such as the geographical coordinates or areal-level administrative units, are often spatially misaligned, that is, observed at different locations or aggregated over different geographical units. As a result, the covariate is often predicted at the locations where the response is observed. The method used to align disparate datasets must be accounted for when subsequently modeling the aligned data. Here we consider the case where kriging is used to align datasets in point-to-point and point-to-areal misalignment problems when the response variable is non-normally distributed. If the relationship is modeled using generalized linear models, the additional uncertainty induced from using the kriging mean as a covariate introduces a Berkson error structure. In this article, we develop a pseudo-penalized quasi-likelihood algorithm to account for the additional uncertainty when estimating regression parameters and associated measures of uncertainty. The method is applied to a point-to-point example assessing the relationship between low-birth weights and PM_{2.5} levels after the onset of the largest wildfire in Florida history, the Bugaboo scrub fire. A point-to-areal misalignment problem is presented where the relationship between asthma events in Florida's counties and PM_{2.5} levels after the onset of the fire is assessed. Finally, the method is evaluated using a simulation study. Our results indicate the method performs well in terms of coverage for 95% confidence intervals and naive methods that ignore the additional uncertainty tend to underestimate the variability associated with parameter estimates. The underestimation is most profound in Poisson regression models.

Semicompeting risk outcome data (e.g., time to disease progression and time to death) are commonly collected in clinical trials. However, analysis of these data is often hampered by a scarcity of available statistical tools. As such, we propose a novel semiparametric transformation model that improves the existing models in the following two ways. First, it estimates regression coefficients and association parameters simultaneously. Second, the measure of surrogacy, for example, the proportion of the treatment effect that is mediated by the surrogate and the ratio of the overall treatment effect on the true endpoint over that on the surrogate endpoint, can be directly obtained. We propose an estimation procedure for inference and show that the proposed estimator is consistent and asymptotically normal. Extensive simulations demonstrate the valid usage of our method. We apply the method to a multiple myeloma trial to study the impact of several biomarkers on patients’ semicompeting outcomes—namely, time to progression and time to death.

Survival data are subject to length-biased sampling when the survival times are left-truncated and the underlying truncation time random variable is uniformly distributed. Substantial efficiency gains can be achieved by incorporating the information about the truncation time distribution in the estimation procedure [Wang (1989) *Journal of the American Statistical Association* **84**, 742–748; Wang (1996) *Biometrika* **83**, 343–354]. Under the semiparametric transformation models, the maximum likelihood method is expected to be fully efficient, yet it is difficult to implement because the full likelihood depends on the nonparametric component in a complicated way. Moreover, its asymptotic properties have not been established. In this article, we extend the martingale estimating equation approach [Chen et al. (2002) *Biometrika* **89**, 659–668; Kim et al. (2013) *Journal of the American Statistical Association* **108**, 217–227] and the pseudo-partial likelihood approach [Severini and Wong (1992) *The Annals of Statistics* **20**, 1768–1802; Zucker (2005) *Journal of the American Statistical Association* **100**, 1264–1277] for semiparametric transformation models with right-censored data to handle left-truncated and right-censored data. In the same spirit of the composite likelihood method [Huang and Qin (2012) *Journal of the American Statistical Association* **107**, 946–957], we further construct another set of unbiased estimating equations by exploiting the special probability structure of length-biased sampling. Thus the number of estimating equations exceeds the number of parameters, and efficiency gains can be achieved by solving a simple combination of these estimating equations. The proposed methods are easy to implement as they do not require additional programming efforts. Moreover, they are shown to be consistent and asymptotically normally distributed. A data analysis of a dementia study illustrates the methods.

Phylogeography investigates the historical process that is responsible for the contemporary geographic distributions of populations in a species. The inference is made on the basis of molecular sequence data sampled from modern-day populations. The estimates, however, may fluctuate depending on the relevant genomic regions, because the evolution mechanism of each genome is unique, even within the same individual. In this article, we propose a genome-differentiated population tree model that allows the existence of separate population trees for each homologous genome. Each population tree accounts for the unique evolutionary characteristics of its genome, along with the homologous relationships; therefore, the approach can distinguish the evolutionary history of one genome from that of another. In addition to the separate divergence times, the new model can estimate separate effective population sizes, gene genealogies and other mutation parameters. For Bayesian inference, we developed a Markov chain Monte Carlo (MCMC) methodology with a novel MCMC algorithm which can mix over a complicated state space. The stability of the new estimator is demonstrated through comparison with the Monte Carlo samples and other methods, as well as MCMC convergence diagnostics. The analysis of African gorilla data from two homologous loci reveals discordant divergence times between loci, and this discrepancy is explained by male-mediated gene flow until the end of the last ice age.

In the analysis of competing risks data, the cumulative incidence function is a useful quantity to characterize the crude risk of failure from a specific event type. In this article, we consider an efficient semiparametric analysis of mixture component models on cumulative incidence functions. Under the proposed mixture model, latency survival regressions given the event type are performed through a class of semiparametric models that encompasses the proportional hazards model and the proportional odds model, allowing for time-dependent covariates. The marginal proportions of the occurrences of cause-specific events are assessed by a multinomial logistic model. Our mixture modeling approach is advantageous in that it jointly estimates the model parameters associated with all competing risks under consideration, satisfying the constraint that the cumulative probability of failing from any cause adds up to one given any covariates. We develop a novel maximum likelihood scheme based on semiparametric regression analysis that facilitates efficient and reliable estimation. Statistical inferences can be conveniently made from the inverse of the observed information matrix. We establish the consistency and asymptotic normality of the proposed estimators. We validate small sample properties with simulations and demonstrate the methodology with a data set from a study of follicular lymphoma.

Substantial progress has been made in identifying single genetic variants predisposing to common complex diseases. Nonetheless, the genetic etiology of human diseases remains largely unknown. Human complex diseases are likely influenced by the joint effect of a large number of genetic variants instead of a single variant. The joint analysis of multiple genetic variants considering linkage disequilibrium (LD) and potential interactions can further enhance the discovery process, leading to the identification of new disease-susceptibility genetic variants. Motivated by developments in spatial statistics, we propose a new statistical model based on the random field theory, referred to as a genetic random field model (GenRF), for joint association analysis with the consideration of possible gene–gene interactions and LD. Using a pseudo-likelihood approach, a GenRF test for the joint association of multiple genetic variants is developed, which has the following advantages: (1) it accommodates complex interactions for improved performance; (2) it achieves natural dimension reduction; (3) it boosts power in the presence of LD; and (4) it is computationally efficient. Simulation studies are conducted under various scenarios. The development has been focused on quantitative traits and robustness of the GenRF test to other traits, for example, binary traits, is also discussed. Compared with a commonly adopted kernel machine approach, SKAT, as well as other more standard methods, GenRF shows overall comparable performance and better performance in the presence of complex interactions. The method is further illustrated by an application to the Dallas Heart Study.

Many techniques of functional data analysis require choosing a measure of distance between functions, with the most common choice being the *L*² distance. In this article we show that using a weighted *L*² distance, with a judiciously chosen weight function, can improve the performance of various statistical methods for functional data, including *k*-medoids clustering, nonparametric classification, and permutation testing. Assuming a quadratically penalized (e.g., spline) basis representation for the functional data, we consider three nontrivial weight functions: design density weights, inverse-variance weights, and a new weight function that minimizes the coefficient of variation of the resulting squared distance by means of an efficient iterative procedure. The benefits of weighting, in particular with the proposed weight function, are demonstrated both in simulation studies and in applications to the Berkeley growth data and a functional magnetic resonance imaging data set.
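For discretized curves on a common grid, a weighted L2 distance and one of the weight choices discussed (pointwise inverse-variance weights) can be sketched as follows. The function names and the simple Riemann-sum quadrature are illustrative assumptions; the article's CV-minimizing weight requires an iterative procedure not shown here:

```python
import numpy as np

def weighted_l2_distance(f, g, w, t):
    """Weighted L2 distance d_w(f, g) = sqrt( integral of w(s)(f(s)-g(s))^2 ds ),
    approximated on a uniform grid t by a Riemann sum."""
    dt = t[1] - t[0]
    return np.sqrt(np.sum(w * (f - g) ** 2) * dt)

def inverse_variance_weights(curves, t):
    """Pointwise inverse-variance weights from a sample of curves (rows),
    normalized so that the weight function integrates to one."""
    dt = t[1] - t[0]
    w = 1.0 / np.maximum(curves.var(axis=0), 1e-12)
    return w / (np.sum(w) * dt)
```

With the unit weight function this reduces to the ordinary L2 distance, so the weighted versions of *k*-medoids clustering or permutation testing are obtained simply by swapping the distance.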

Structural mean models (SMMs) have been proposed for estimating causal parameters in the presence of non-ignorable non-compliance in clinical trials. To obtain a valid causal estimate, we must impose several assumptions. One of these is the correct specification of the structural model. Building on Pan's work (2001, *Biometrics* **57**, 120–125) on developing a model selection criterion for generalized estimating equations, we propose a new approach for model selection of SMMs based on a quasi-likelihood. We provide a formal model selection criterion that is an extension of Akaike's information criterion. Using subset selection of baseline covariates, our method allows us to understand whether the treatment effect varies across the available baseline covariate levels, and/or to quantify the treatment effect at specific covariate levels, targeting the individuals most likely to benefit from treatment. We present simulation results in which our method performs reasonably well compared to other testing methods in terms of both the probability of selecting the correct model and the predictive performances of the individual treatment effects. We use a large randomized clinical trial of pravastatin as a motivation.

A common goal of epidemiologic research is to study how two exposures interact in causing a binary outcome. Causal interaction is defined as the presence of subjects for which the causal effect of one exposure depends on the level of the other exposure. For binary exposures, it has previously been shown that the presence of causal interaction is testable through additive statistical interaction. However, it has also been shown that the magnitude of causal interaction, defined as the proportion of subjects for which there is causal interaction, is generally not identifiable. In this article, we derive bounds on causal interactions, which are applicable to binary outcomes and categorical exposures with arbitrarily many levels. These bounds can be used to assess the magnitude of causal interaction, and serve as an important complement to the statistical test that is frequently employed. The bounds are derived both without and with an assumption about monotone exposure effects. We present an application of the bounds to a study of gene–gene interaction in rheumatoid arthritis.

Set classification problems arise when classification tasks are based on sets of observations as opposed to individual observations. In set classification, a classification rule is trained with *N* sets of observations, where each set is labeled with class information, and prediction of a class label is likewise made from a set of observations. Data sets for set classification appear, for example, in diagnostics of disease based on multiple cell nucleus images from a single tissue. Relevant statistical models for set classification are introduced, which motivate a set classification framework based on context-free feature extraction. By understanding a set of observations as an empirical distribution, we employ a data-driven method to choose those features which contain information on location and major variation. In particular, the method of principal component analysis is used to extract the features of major variation. Multidimensional scaling is used to represent features as vector-valued points on which conventional classifiers can be applied. The proposed set classification approaches achieve better classification results than competing methods in a number of simulated data examples. The benefits of our method are demonstrated in an analysis of histopathology images of cell nuclei related to liver cancer.
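The context-free feature extraction step can be illustrated with a minimal sketch: each set is summarized by its mean vector (location) and the leading singular values of the centered set (major variation), after which any conventional classifier can be applied to the feature vectors. This is a simplified stand-in for the article's PCA-plus-multidimensional-scaling construction, with an illustrative function name:

```python
import numpy as np

def set_features(sets, n_pc=2):
    """Summarize each set of observations by location and major variation.

    sets: list of (n_obs, dim) arrays, one array per set.
    Returns an array with one feature vector per set: the mean vector
    followed by the n_pc leading singular values of the centered set,
    scaled by 1/sqrt(n_obs) so they estimate spread per observation.
    """
    feats = []
    for X in sets:
        mu = X.mean(axis=0)
        s = np.linalg.svd(X - mu, compute_uv=False)[:n_pc]
        feats.append(np.concatenate([mu, s / np.sqrt(len(X))]))
    return np.array(feats)
```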

The *genotype main effects and genotype-by-environment interaction effects* (GGE) model and the *additive main effects and multiplicative interaction* (AMMI) model are two common models for analysis of genotype-by-environment data. These models are frequently used by agronomists, plant breeders, geneticists and statisticians for analysis of multi-environment trials. In such trials, a set of genotypes, for example, crop cultivars, are compared across a range of environments, for example, locations. The GGE and AMMI models use singular value decomposition to partition genotype-by-environment interaction into an ordered sum of multiplicative terms. This article deals with the problem of testing the significance of these multiplicative terms in order to decide how many terms to retain in the final model. We propose parametric bootstrap methods for this problem. Models with fixed main effects, fixed multiplicative terms and random normally distributed errors are considered. Two methods are derived: a *full* and a *simple* parametric bootstrap method. These are compared with the alternatives of using approximate *F*-tests and cross-validation. In a simulation study based on four multi-environment trials, both bootstrap methods performed well with regard to Type I error rate and power. The simple parametric bootstrap method is particularly easy to use, since it only involves repeated sampling of standard normally distributed values. This method is recommended for selecting the number of multiplicative terms in GGE and AMMI models. The proposed methods can also be used for testing components in principal component analysis.
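The appeal of the simple parametric bootstrap, that it only involves repeated sampling of standard normally distributed values, can be conveyed with a hedged sketch. The test statistic here is the share of the interaction sum of squares carried by the first multiplicative term; the centering and degrees-of-freedom details of the actual method are omitted, and the function names are illustrative:

```python
import numpy as np

def first_term_share(E):
    """Fraction of the total sum of squares carried by the first
    multiplicative (singular-value) term of the matrix E."""
    s = np.linalg.svd(E, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

def simple_bootstrap_pvalue(residuals, n_boot=1000, seed=0):
    """Compare the observed share of the first term with shares from
    matrices of i.i.d. standard normal entries of the same dimension;
    a small p-value suggests the term should be retained."""
    rng = np.random.default_rng(seed)
    obs = first_term_share(residuals)
    r, c = residuals.shape
    null = np.array([first_term_share(rng.standard_normal((r, c)))
                     for _ in range(n_boot)])
    return float(np.mean(null >= obs))
```

Subsequent terms would be tested analogously after removing the terms already retained.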

In this article we propose an accelerated intensity frailty (AIF) model for recurrent events data and derive a test for the variance of frailty. In addition, we develop a kernel-smoothing-based EM algorithm for estimating regression coefficients and the baseline intensity function. The variance of the resulting estimator for regression parameters is obtained by a numerical differentiation method. Simulation studies are conducted to evaluate the finite sample performance of the proposed estimator under practical settings and to demonstrate the efficiency gain over the Gehan rank estimator based on the AFT model for counting processes (Lin et al., 1998). Our method is further illustrated with an application to bladder tumor recurrence data.

The year 2012 marks the 50th anniversary of the death of Sir Ronald A. Fisher, one of the two Fathers of Statistics and a Founder of the International Biometric Society (the “Society”). To celebrate the extraordinary genius of Fisher and the far-sighted vision of Fisher and Chester Bliss in organizing and promoting the formation of the Society, this article looks at the origins and growth of the Society, some of the key players and events, and especially the roles played by Fisher himself as the First President. A fresh look at Fisher, the man rather than the scientific genius is also presented.

R. A. Fisher spent much of his final 3 years of life in Adelaide. It was a congenial place to live and work, and he was much in demand as a speaker, in Australia and overseas. It was, however, a difficult time for him because of the sustained criticism of fiducial inference from the early 1950s onwards. The article discusses some of Fisher's work on inference from an Adelaide perspective. It also considers some of the successes arising from this time, in the statistics of field experimentation and in evolutionary genetics. A few personal recollections of Fisher as houseguest are provided. This article is the text of a paper presented on August 31, 2012 at the 26th International Biometric Conference, Kobe, Japan.

Suppose we are interested in estimating the average causal effect from an observational study. A doubly robust estimator, which is a hybrid of the outcome regression and propensity score weighting, is more robust than estimators obtained by either of them in the sense that, if at least one of the two models holds, the doubly robust estimator is consistent. However, a doubly robust estimator may still suffer from model misspecification since it is not consistent if neither of them is correctly specified. In this article, we propose an alternative estimator, called the stratified doubly robust estimator, by further combining propensity score stratification with outcome regression and propensity score weighting. This estimator allows two candidate models for the propensity score and is more robust than existing doubly robust estimators in the sense that it is consistent either if the outcome regression holds or if one of the two models for the propensity score holds. Asymptotic properties are examined and finite sample performance of the proposed estimator is investigated by simulation studies. Our proposed method is illustrated with the Tone study, which is a community survey conducted in Japan.
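For context, here is a minimal sketch of the standard (non-stratified) doubly robust estimator, often called AIPW, that the proposal extends. The simulated data and working models are illustrative, and both models are correctly specified here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated observational data: confounder X, treatment A, outcome Y;
# the true average causal effect is 2.
X = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-X))          # true propensity score
A = rng.binomial(1, e_true)
Y = 2.0 * A + X + rng.normal(size=n)

# Working models (both correct here; in practice they are fitted by,
# e.g., logistic and linear regression).
e_hat = e_true                              # propensity score model
m1_hat, m0_hat = 2.0 + X, X                 # outcome models E[Y | A=a, X]

# AIPW doubly robust estimator of E[Y(1)] - E[Y(0)]: consistent if the
# propensity model OR the outcome model is correctly specified.
mu1 = np.mean(A * Y / e_hat - (A - e_hat) / e_hat * m1_hat)
mu0 = np.mean((1 - A) * Y / (1 - e_hat) + (A - e_hat) / (1 - e_hat) * m0_hat)
print("doubly robust estimate:", round(mu1 - mu0, 3))
```

The stratified estimator of the article adds propensity score stratification and a second candidate propensity model on top of this construction.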

We consider a new approach to identify the causal effects of a binary treatment when the outcome is missing on a subset of units and dependence of nonresponse on the outcome cannot be ruled out even after conditioning on observed covariates. We provide sufficient conditions under which the availability of a binary instrument for nonresponse allows us to derive tighter identification intervals for causal effects in the whole population and to partially identify causal effects in some latent subgroups of units, named principal strata, defined by the nonresponse behavior in all possible combinations of treatment and instrument. A simulation study is used to assess the benefits of the presence versus the absence of an instrument for nonresponse. The simulation design is based on real health data, coming from a randomized trial on breast self-examination (BSE) affected by a large proportion of missing outcome data. An instrument for nonresponse is simulated under alternative scenarios to discuss the key role of the instrument for nonresponse in identifying average causal effects in the presence of nonignorable missing outcomes. We also investigate the potential inferential gains from using an instrument for nonresponse, adopting a Bayesian approach for inference. In virtue of our theoretical and empirical results, we provide some recommendations on study designs for causal inference.

In statistical inference, one has to make sure that the underlying regression model is correctly specified; otherwise, the resulting estimation may be biased. Model checking is an important method to detect any departure of the regression model from the true one. Missing data are a ubiquitous problem in social and medical studies. If the underlying regression model is correctly specified, recent research shows great popularity of the doubly robust (DR) estimation method for handling missing data because of its robustness to misspecification of either the missing data model or the conditional mean model, that is, the model for the conditional expectation of the true regression model given the observed quantities. However, little work has been devoted to goodness-of-fit tests for the DR estimation method. In this article, we propose a testing method to assess the reliability of the estimator derived from the DR estimating equation with possibly missing response and always observed auxiliary variables. Numerical studies demonstrate that the proposed test can control type I error rates well. Furthermore, the proposed method can powerfully detect departures from model assumptions in the marginal mean model of interest. A real dementia data set is used to illustrate the method for the diagnosis of model misspecification in the problem of missing response with an always observed auxiliary variable for cross-sectional data.

Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this article, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism, where the missingness mechanism is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to non-identifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset.

Missing data rates could depend on the targeted values in many settings, including mass spectrometry-based proteomic profiling studies. Here, we consider mean and covariance estimation under a multivariate Gaussian distribution with non-ignorable missingness, including scenarios in which the dimension (*p*) of the response vector is equal to or greater than the number (*n*) of independent observations. A parameter estimation procedure is developed by maximizing a class of penalized likelihood functions that entails explicit modeling of missing data probabilities. The performance of the resulting “penalized EM algorithm incorporating missing data mechanism (PEMM)” estimation procedure is evaluated in simulation studies and in a proteomic data illustration.

In this article, we extend the superpopulation capture–recapture model to multiple states (locations or populations) for two age groups. Wen et al. (2011, 2013) developed a new approach combining capture–recapture data with population assignment information to estimate the relative contributions of in situ births and immigrants to the growth of a single study population. Here, we first generalize the Wen et al. (2011, 2013) approach to a system composed of multiple study populations (multi-state) with two age groups, where an imputation approach is employed to account for the uncertainty inherent in the population assignment information. Then we develop a different, individual-level mixture model approach to integrate the individual-level population assignment information with the capture–recapture data. Our simulation and real data analyses show that the fusion of population assignment information with capture–recapture data allows us to estimate the origination-specific recruitment of new animals to the system and the dispersal process between populations within the system. Compared to a standard capture–recapture model, our new models improve the estimation of demographic parameters, including survival probability, origination-specific entry probability, and especially the probability of movement between populations, yielding higher accuracy and precision.

This work, motivated by an osteoporosis survey study, considers regression analysis with incompletely observed current status data. Here the current status data, including an examination time and an indicator for whether or not the event of interest has occurred by the examination time, is not observed for all subjects. Instead, a surrogate outcome subject to misclassification of the current status is available for all subjects. We focus on semiparametric regression under transformation models, including the proportional hazards and proportional odds models as special cases. Under the missing at random mechanism where the missingness of the current status outcome can depend only on the observed surrogate outcome and covariates, we propose an approach of validation likelihood based on the likelihood from the validation subsample where the data are fully observed, with adjustments of the probability of observing the current status outcome, as well as the distribution of the surrogate outcome in the validation subsample. We propose an efficient computation algorithm for implementation, and derive consistency and asymptotic normality for inference with the proposed estimator. The application to the osteoporosis survey data and simulations reveal that the validation likelihood performs well; it removes the bias from the “complete case” analysis discarding subjects with missing data, and achieves higher efficiency than the inverse probability weighting analysis.

Many processes in nature can be viewed as arising from subjects progressing through sequential stages and may be described by multistage models. Examples include disease development and the physiological development of plants and animals. We develop a multistage model for sampling designs where a small set of subjects is followed and the number of subjects in each stage is assessed repeatedly for a sequence of time points, but for which the subjects cannot be identified. The motivating problem is the laboratory study of developing arthropods through stage frequency data. Our model assumes that the same individuals are censused at each time, introducing among-sample dependencies. This type of data often occurs in laboratory studies of small arthropods, but its detailed analysis has received little attention. The likelihood of the model is derived from a stochastic model of the development and mortality of the individuals in the cohort. We present an MCMC scheme targeting the posterior distribution of the times of development and times of death of individuals. This is a novel type of MCMC that uses customized proposals to explore a posterior with disconnected support, arising from the fact that individual identities are unknown. The MCMC algorithm may be used for inference about parameters governing stage duration distributions and mortality rates. The method is demonstrated by fitting the development model to stage frequency data of a mealybug cohort placed on a grape vine.

Statistical challenges arise from modern biomedical studies that produce time course genomic data with ultrahigh dimensions. In a renal cancer study that motivated this paper, the pharmacokinetic measures of a tumor suppressor (CCI-779) and expression levels of 12,625 genes were measured for each of 33 patients at 8 and 16 weeks after the start of treatments, with the goal of identifying predictive gene transcripts and the interactions with time in peripheral blood mononuclear cells for pharmacokinetics over the time course. The resulting data set defies analysis even with regularized regression. Although some remedies have been proposed for both linear and generalized linear models, there are virtually no solutions in the time course setting. As such, a novel GEE-based screening procedure is proposed, which only pertains to the specifications of the first two marginal moments and a working correlation structure. Different from existing methods that either fit separate marginal models or compute pairwise correlation measures, the new procedure merely involves making a single evaluation of estimating functions and thus is extremely computationally efficient. The new method is robust against the mis-specification of correlation structures and enjoys theoretical readiness, which is further verified via Monte Carlo simulations. The procedure is applied to analyze the aforementioned renal cancer study and identify gene transcripts and possible time-interactions that are relevant to CCI-779 metabolism in peripheral blood.
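The computational appeal — a single evaluation of estimating functions rather than a model fit per gene — can be sketched as follows under a working-independence correlation structure; the data-generating setup is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: n subjects, t repeated measures, p gene
# expressions; the outcome depends on genes 0 and 1 only.
n, t, p = 100, 2, 1000
G = rng.normal(size=(n, p))
X = np.repeat(G, t, axis=0)                      # gene values per visit
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n * t)

# Screening utility: one evaluation of the working-independence
# estimating function at the null, giving one component per gene --
# no per-gene model fitting and no pairwise correlation measures.
u = np.abs(X.T @ (y - y.mean())) / (n * t)
top10 = np.argsort(u)[::-1][:10]
print("top-ranked genes include:", sorted(int(j) for j in top10[:5]))
```

The single matrix product makes the screen essentially free even for tens of thousands of transcripts; the article's procedure additionally accommodates a working correlation structure across the repeated measures.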

To evaluate the utility of automated deformable image registration (DIR) algorithms, it is necessary to evaluate both the registration accuracy of the DIR algorithm itself, as well as the registration accuracy of the human readers from whom the “gold standard” is obtained. We propose a Bayesian hierarchical model to evaluate the spatial accuracy of human readers and automatic DIR methods based on multiple image registration data generated by human readers and automatic DIR methods. To fully account for the locations of landmarks in all images, we treat the true locations of landmarks as latent variables and impose a hierarchical structure on the magnitude of registration errors observed across image pairs. DIR registration errors are modeled using Gaussian processes with reference prior densities on prior parameters that determine the associated covariance matrices. We develop a Gibbs sampling algorithm to efficiently fit our models to high-dimensional data, and apply the proposed method to analyze an image dataset obtained from a 4D thoracic CT study.

We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in fixed-sample randomized trials with sequential allocation. Subjects arrive iteratively and are either randomized *or* paired via a matching criterion to a previously randomized subject and administered the alternate treatment. We develop estimators for the average treatment effect that combine information from both the matched pairs and unmatched subjects, as well as an exact test. We illustrate the method's higher efficiency and power over several competing allocation procedures in both simulations and data from a clinical trial.
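A hypothetical sketch of the allocation logic (not the authors' exact matching criterion): each arriving subject is matched to the nearest previously randomized, as-yet-unmatched subject if within a caliper, receiving the alternate treatment; otherwise the subject is randomized and enters the reservoir of match candidates.

```python
import numpy as np

rng = np.random.default_rng(3)

def sequential_allocate(covariates, caliper=0.2):
    """Sequentially allocate subjects: match to the closest unmatched,
    previously randomized subject when within the caliper (assigning
    the alternate treatment); otherwise randomize."""
    arms, pairs = [], []
    reservoir = []                              # (index, covariate, arm)
    for i, x in enumerate(covariates):
        if reservoir:
            k = min(range(len(reservoir)), key=lambda j: abs(reservoir[j][1] - x))
            idx, xk, ak = reservoir[k]
            if abs(xk - x) <= caliper:
                arms.append(1 - ak)             # alternate treatment
                pairs.append((idx, i))
                reservoir.pop(k)
                continue
        a = int(rng.integers(2))                # randomize
        arms.append(a)
        reservoir.append((i, x, a))
    return np.array(arms), pairs

x = rng.normal(size=200)
arms, pairs = sequential_allocate(x)
print(len(pairs), "matched pairs,", 200 - 2 * len(pairs), "unmatched subjects")
```

The estimator in the article then combines the matched-pair differences with the unmatched (reservoir) subjects.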

While a general goal of early phase clinical studies is to identify an acceptable dose for further investigation, modern dose finding studies and designs are highly specific to individual clinical settings. In addition, as outcome-adaptive dose finding methods often involve complex algorithms, it is crucial to have diagnostic tools to evaluate the plausibility of a method's simulated performance and the adequacy of the algorithm. In this article, we propose a simple technique that provides an upper limit, or a benchmark, of accuracy for dose finding methods for a given design objective. The proposed benchmark is nonparametric optimal in the sense of O'Quigley et al. (2002, *Biostatistics* **3,** 51–56), and is demonstrated by examples to be a practical accuracy upper bound for model-based dose finding methods. We illustrate the implementation of the technique in the context of phase I trials that consider multiple toxicities and phase I/II trials where dosing decisions are based on both toxicity and efficacy, and apply the benchmark to several clinical examples considered in the literature. By comparing the operating characteristics of a dose finding method to that of the benchmark, we can form quick initial assessments of whether the method is adequately calibrated and evaluate its sensitivity to the dose–outcome relationships.
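The benchmark idea can be sketched as follows: each simulated patient receives a complete latent toxicity profile (a uniform tolerance that determines the outcome at every dose), so the "design" sees information no real trial observes; its selection accuracy is then an upper bound for any actual method. The dose–toxicity curve below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

true_tox = np.array([0.05, 0.12, 0.25, 0.42, 0.60])  # hypothetical curve
target, mtd = 0.25, 2                                 # dose 3 is the true MTD
n, n_sim = 30, 2000

correct = 0
for _ in range(n_sim):
    # Complete latent profiles: patient i has toxicity at dose d iff
    # u_i <= true_tox[d], so every outcome is known at every dose.
    u = rng.uniform(size=n)
    tox_hat = (u[:, None] <= true_tox).mean(axis=0)
    pick = int(np.argmin(np.abs(tox_hat - target)))   # dose closest to target
    correct += int(pick == mtd)

print("benchmark accuracy with n =", n, ":", correct / n_sim)
```

A dose finding method whose simulated accuracy approaches this number is well calibrated for the scenario; one far below it may have an inadequate algorithm.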

Linear regressions are commonly used to calibrate the signal measurements in proteomic analysis by mass spectrometry. However, with or without a monotone (e.g., log) transformation, data from such functional proteomic experiments are not necessarily linear or even monotone functions of protein (or peptide) concentration except over a very restricted range. A computationally efficient spline procedure improves upon linear regression. However, mass spectrometry data are not necessarily homoscedastic; more often the variation of measured concentrations increases disproportionately near the boundaries of the instrument's measurement capability (dynamic range), that is, the upper and lower limits of quantitation. These calibration difficulties exist with other applications of mass spectrometry as well as with other broad-scale calibrations. Therefore the method proposed here uses a functional data approach to define the calibration curve and also the limits of quantitation under two assumptions: (i) that the variance is a bounded, convex function of concentration; and (ii) that the calibration curve itself is monotone at least between the limits of quantitation, but not necessarily outside these limits. Within this paradigm, the limit of detection, where the signal is definitely present but not measurable with any accuracy, is also defined. An iterative approach draws on existing smoothing methods to account simultaneously for both restrictions and is shown to achieve the global optimal convergence rate under weak conditions. This approach can also be implemented when convexity is replaced by other (bounded) restrictions. Examples from Addona et al. (2009, *Nature Biotechnology* 27, 633–641) both motivate and illustrate the effectiveness of this functional data methodology when compared with the simpler linear regressions and spline techniques.
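One building block for such a monotone calibration curve is isotonic regression via the pool-adjacent-violators algorithm, sketched below on simulated signal-versus-concentration data. This illustrates monotone fitting only, not the authors' full procedure with convex variance restrictions and estimated limits of quantitation.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares monotone nondecreasing
    fit, a building block for monotone calibration curves."""
    vals = [float(v) for v in y]
    wts = [1.0] * len(vals)
    blocks = [[i] for i in range(len(vals))]
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:               # adjacent violation: pool
            w = wts[i] + wts[i + 1]
            v = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / w
            vals[i:i + 2], wts[i:i + 2] = [v], [w]
            blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]
            i = max(i - 1, 0)                   # re-check to the left
        else:
            i += 1
    fit = np.empty(len(y))
    for v, b in zip(vals, blocks):
        fit[b] = v
    return fit

rng = np.random.default_rng(7)
conc = np.linspace(0.0, 10.0, 60)                    # known concentrations
signal = np.log1p(conc) + rng.normal(0, 0.15, 60)    # noisy instrument signal
cal = pava(signal)
print("fit is monotone:", bool(np.all(np.diff(cal) >= 0)))
```

The monotone fit preserves the overall mean of the signal while enforcing the ordering, which is why it pairs naturally with a separate smoothing step for the variance function.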

We propose a method to test the correlation of two random fields when they are both spatially autocorrelated. In this scenario, the assumption of independence for the pairs of observations in the standard test does not hold, and as a result we reject in many cases where there is no effect (the precision of the null distribution is overestimated). Our method recovers the null distribution taking the autocorrelation into account. It uses Monte Carlo methods, and focuses on permuting, and then smoothing and scaling, one of the variables to destroy the correlation with the other while maintaining the initial autocorrelation. With this simulation model, any test based on the independence of two (or more) random fields can be constructed. This research was motivated by a project in biodiversity and conservation in the Biology Department at Stanford University.
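A minimal one-dimensional sketch of the idea, with a moving-average smoother standing in for the step that restores the autocorrelation of the permuted variable; all settings here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def smooth(z, w=10):
    """Moving-average smoother: a crude stand-in for matching the
    permuted field's autocorrelation to the original's."""
    return np.convolve(z, np.ones(w) / w, mode="same")

def standardize(z):
    return (z - z.mean()) / z.std()

# Two autocorrelated but mutually independent 1-D "fields"
n = 500
x = standardize(smooth(rng.normal(size=n)))
y = standardize(smooth(rng.normal(size=n)))
r_obs = np.corrcoef(x, y)[0, 1]

# Monte Carlo null: permute y (destroying any link with x), then
# smooth and rescale so each surrogate retains y-like autocorrelation.
B = 999
r_null = np.array([
    np.corrcoef(x, standardize(smooth(rng.permutation(y))))[0, 1]
    for _ in range(B)
])
p = (1 + np.sum(np.abs(r_null) >= abs(r_obs))) / (B + 1)
print(f"observed r = {r_obs:.3f}, autocorrelation-corrected p = {p:.3f}")
```

The surrogate correlations spread much wider than the naive null for independent observations would suggest, which is exactly the over-rejection the method corrects.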

Estimation of the long-term health effects of air pollution is a challenging task, especially when modeling spatial small-area disease incidence data in an ecological study design. The challenge comes from the unobserved underlying spatial autocorrelation structure in these data, which is accounted for using random effects modeled by a globally smooth conditional autoregressive model. These smooth random effects confound the effects of air pollution, which are also globally smooth. To avoid this collinearity a Bayesian localized conditional autoregressive model is developed for the random effects. This localized model is flexible spatially, in the sense that it is not only able to model areas of spatial smoothness, but also it is able to capture step changes in the random effects surface. This methodological development allows us to improve the estimation performance of the covariate effects, compared to using traditional conditional auto-regressive models. These results are established using a simulation study, and are then illustrated with our motivating study on air pollution and respiratory ill health in Greater Glasgow, Scotland in 2011. The model shows substantial health effects of particulate matter air pollution and nitrogen dioxide, whose effects have been consistently attenuated by the currently available globally smooth models.

The photoactivatable ribonucleoside enhanced cross-linking immunoprecipitation (PAR-CLIP) has been increasingly used for the global mapping of RNA–protein interaction sites. There are two key features of PAR-CLIP experiments: the sequence read tags are likely to form an enriched peak around each RNA–protein interaction site; and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify the RNA–protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP are still lacking. In this article, we propose an integrative model to establish a joint distribution of observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we develop a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and data application showed that our method outperforms the ad hoc methods, and provides reliable inferences for the RNA–protein binding sites from PAR-CLIP data.

In HIV-1 clinical trials the interest is often to compare how well treatments suppress the HIV-1 RNA viral load. The current practice in statistical analysis of such trials is to define a single ad hoc composite event which combines information about both the viral load suppression and the subsequent viral rebound, and then analyze the data using standard univariate survival analysis techniques. The main weakness of this approach is that the results of the analysis can be easily influenced by minor details in the definition of the composite event. We propose a straightforward alternative endpoint based on the probability of being suppressed over time, and suggest that treatment differences be summarized using the restricted mean time a patient spends in the state of viral suppression. A nonparametric analysis is based on methods for multiple endpoint studies. We demonstrate the utility of our analytic strategy using a recent therapeutic trial, in which the protocol specified a primary analysis using a composite endpoint approach.
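The proposed summary can be sketched directly: estimate P(suppressed at t) and integrate it up to a horizon τ. The per-patient suppression and rebound times below are hypothetical, and censoring is ignored to keep the sketch minimal.

```python
import numpy as np

# Hypothetical per-patient times (months) of achieving suppression and
# of subsequent rebound; np.inf means the event was never observed.
t_supp = np.array([2.0, 3.0, 1.5, np.inf, 2.5, 4.0])
t_reb = np.array([10.0, 8.0, np.inf, np.inf, 12.0, 9.0])

def restricted_mean_suppressed(t_supp, t_reb, tau, grid=2000):
    """Area under P(suppressed at t) for t in [0, tau]: the restricted
    mean time spent in the state of viral suppression."""
    t = np.linspace(0.0, tau, grid)
    # P(suppressed at t) = proportion with t_supp <= t < t_reb
    p = np.mean((t_supp[:, None] <= t) & (t < t_reb[:, None]), axis=0)
    return p.mean() * tau                       # rectangle-rule integral

rm = restricted_mean_suppressed(t_supp, t_reb, tau=12.0)
print("restricted mean months suppressed by month 12:", round(rm, 2))
```

Unlike a composite event, this summary does not depend on arbitrary choices in how suppression and rebound are combined into a single failure time.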

Clustered data commonly arise in epidemiology. We assume each cluster member has an outcome *Y* and covariates *X*. When there are missing data in *Y*, the distribution of *Y* given *X* in all cluster members (“complete clusters”) may be different from the distribution just in members with observed *Y* (“observed clusters”). Often the former is of interest, but when data are missing because in a fundamental sense *Y* does not exist (e.g., quality of life for a person who has died), the latter may be more meaningful (quality of life conditional on being alive). Weighted and doubly weighted generalized estimating equations and shared random-effects models have been proposed for observed-cluster inference when cluster size is informative, that is, the distribution of *Y* given *X* in observed clusters depends on observed cluster size. We show these methods can be seen as actually giving inference for complete clusters and may not also give observed-cluster inference. This is true even if observed clusters are complete in themselves rather than being the observed part of larger complete clusters: here methods may describe imaginary complete clusters rather than the observed clusters. We show under which conditions shared random-effects models proposed for observed-cluster inference do actually describe members with observed *Y*. A psoriatic arthritis dataset is used to illustrate the danger of misinterpreting estimates from shared random-effects models.

We consider inference for the reaction rates in discretely observed networks such as those found in models for systems biology, population ecology, and epidemics. Most such networks are neither slow enough nor small enough for inference via the true state-dependent Markov jump process to be feasible. Typically, inference is conducted by approximating the dynamics through an ordinary differential equation (ODE) or a stochastic differential equation (SDE). The former ignores the stochasticity in the true model and can lead to inaccurate inferences. The latter is more accurate but is harder to implement as the transition density of the SDE model is generally unknown. The linear noise approximation (LNA) arises from a first-order Taylor expansion of the approximating SDE about a deterministic solution and can be viewed as a compromise between the ODE and SDE models. It is a stochastic model, but discrete time transition probabilities for the LNA are available through the solution of a series of ordinary differential equations. We describe how a restarting LNA can be efficiently used to perform inference for a general class of reaction networks; evaluate the accuracy of such an approach; and show how and when this approach is either statistically or computationally more efficient than ODE or SDE methods. We apply the LNA to analyze Google Flu Trends data from the North and South Islands of New Zealand, and are able to obtain more accurate short-term forecasts of new flu cases than another recently proposed method, although at a greater computational cost.
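For a one-species illustration, in an immigration–death network the LNA reduces to one ODE for the macroscopic mean and one for the variance of the Gaussian fluctuations about it; the rates below are hypothetical and forward Euler integration is used for brevity.

```python
# LNA sketch for an immigration-death network: immigration at rate lam,
# per-capita death at rate mu (hypothetical values). The drift is
# lam - mu*x, its Jacobian is J = -mu, and the diffusion is lam + mu*x.
lam, mu = 10.0, 0.5
m, V = 0.0, 0.0                   # initial mean and LNA variance
dt, T = 0.001, 20.0
for _ in range(int(T / dt)):
    m += dt * (lam - mu * m)                  # dm/dt = lam - mu*m
    V += dt * (-2.0 * mu * V + lam + mu * m)  # dV/dt = 2*J*V + diffusion
print(f"LNA stationary mean ~ {m:.2f}, variance ~ {V:.2f}")
```

At stationarity both the mean and the LNA variance equal lam/mu = 20, matching the Poisson stationary distribution of this linear network, which is a useful sanity check on the approximation.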