Spatially referenced datasets arising from multiple sources are routinely combined to assess relationships among various outcomes and covariates. The geographical units associated with the data, such as geographical coordinates or areal-level administrative units, are often spatially misaligned, that is, observed at different locations or aggregated over different geographical units. As a result, the covariate is often predicted at the locations where the response is observed. The method used to align disparate datasets must be accounted for when subsequently modeling the aligned data. Here we consider the case where kriging is used to align datasets in point-to-point and point-to-areal misalignment problems when the response variable is non-normally distributed. If the relationship is modeled using generalized linear models, the additional uncertainty induced from using the kriging mean as a covariate introduces a Berkson error structure. In this article, we develop a pseudo-penalized quasi-likelihood algorithm to account for this additional uncertainty when estimating regression parameters and associated measures of uncertainty. The method is applied to a point-to-point example assessing the relationship between low birth weights and PM_{2.5} levels after the onset of the largest wildfire in Florida history, the Bugaboo scrub fire. A point-to-areal misalignment problem is presented where the relationship between asthma events in Florida's counties and PM_{2.5} levels after the onset of the fire is assessed. Finally, the method is evaluated using a simulation study. Our results indicate that the method performs well in terms of coverage of 95% confidence intervals, and that naive methods which ignore the additional uncertainty tend to underestimate the variability associated with parameter estimates. The underestimation is most profound in Poisson regression models.

Semicompeting risk outcome data (e.g., time to disease progression and time to death) are commonly collected in clinical trials. However, analysis of these data is often hampered by a scarcity of available statistical tools. As such, we propose a novel semiparametric transformation model that improves on existing models in two ways. First, it estimates regression coefficients and association parameters simultaneously. Second, measures of surrogacy, for example, the proportion of the treatment effect that is mediated by the surrogate and the ratio of the overall treatment effect on the true endpoint over that on the surrogate endpoint, can be directly obtained. We propose an estimation procedure for inference and show that the proposed estimator is consistent and asymptotically normal. Extensive simulations demonstrate the validity of our method. We apply the method to a multiple myeloma trial to study the impact of several biomarkers on patients' semicompeting outcomes, namely, time to progression and time to death.

Survival data are subject to length-biased sampling when the survival times are left-truncated and the underlying truncation time random variable is uniformly distributed. Substantial efficiency gains can be achieved by incorporating the information about the truncation time distribution in the estimation procedure [Wang (1989) *Journal of the American Statistical Association* **84**, 742–748; Wang (1996) *Biometrika* **83**, 343–354]. Under the semiparametric transformation models, the maximum likelihood method is expected to be fully efficient, yet it is difficult to implement because the full likelihood depends on the nonparametric component in a complicated way. Moreover, its asymptotic properties have not been established. In this article, we extend the martingale estimating equation approach [Chen et al. (2002) *Biometrika* **89**, 659–668; Kim et al. (2013) *Journal of the American Statistical Association* **108**, 217–227] and the pseudo-partial likelihood approach [Severini and Wong (1992) *The Annals of Statistics* **20**, 1768–1802; Zucker (2005) *Journal of the American Statistical Association* **100**, 1264–1277] for semiparametric transformation models with right-censored data to handle left-truncated and right-censored data. In the same spirit as the composite likelihood method [Huang and Qin (2012) *Journal of the American Statistical Association* **107**, 946–957], we further construct another set of unbiased estimating equations by exploiting the special probability structure of length-biased sampling. Thus the number of estimating equations exceeds the number of parameters, and efficiency gains can be achieved by solving a simple combination of these estimating equations. The proposed methods are easy to implement as they do not require additional programming effort. Moreover, they are shown to be consistent and asymptotically normally distributed. A data analysis of a dementia study illustrates the methods.

Phylogeography investigates the historical processes responsible for the contemporary geographic distributions of populations within a species. Inference is made on the basis of molecular sequence data sampled from modern-day populations. The estimates, however, may fluctuate depending on the genomic regions used, because the evolutionary mechanism of each genome is unique, even within the same individual. In this article, we propose a genome-differentiated population tree model that allows a separate population tree for each homologous genome. Each population tree accounts for the unique evolutionary characteristics of its genome, along with the homologous relationships among genomes; the approach can therefore distinguish the evolutionary history of one genome from that of another. In addition to separate divergence times, the new model can estimate separate effective population sizes, gene genealogies and other mutation parameters. For Bayesian inference, we developed a Markov chain Monte Carlo (MCMC) methodology with a novel MCMC algorithm that can mix over a complicated state space. The stability of the new estimator is demonstrated through comparison with Monte Carlo samples and other methods, as well as MCMC convergence diagnostics. An analysis of African gorilla data from two homologous loci reveals discordant divergence times between loci, and this discrepancy is explained by male-mediated gene flow until the end of the last ice age.

In the analysis of competing risks data, the cumulative incidence function is a useful quantity to characterize the crude risk of failure from a specific event type. In this article, we consider an efficient semiparametric analysis of mixture component models on cumulative incidence functions. Under the proposed mixture model, latency survival regressions given the event type are performed through a class of semiparametric models that encompasses the proportional hazards model and the proportional odds model, allowing for time-dependent covariates. The marginal proportions of the occurrences of cause-specific events are assessed by a multinomial logistic model. Our mixture modeling approach is advantageous in that it provides joint estimation of the model parameters associated with all competing risks under consideration, satisfying the constraint that the cumulative probabilities of failing from the various causes sum to one given any covariates. We develop a novel maximum likelihood scheme based on semiparametric regression analysis that facilitates efficient and reliable estimation. Statistical inferences can be conveniently made from the inverse of the observed information matrix. We establish the consistency and asymptotic normality of the proposed estimators. We validate small sample properties with simulations and demonstrate the methodology with a data set from a study of follicular lymphoma.

Substantial progress has been made in identifying single genetic variants predisposing to common complex diseases. Nonetheless, the genetic etiology of human diseases remains largely unknown. Human complex diseases are likely influenced by the joint effect of a large number of genetic variants instead of a single variant. The joint analysis of multiple genetic variants considering linkage disequilibrium (LD) and potential interactions can further enhance the discovery process, leading to the identification of new disease-susceptibility genetic variants. Motivated by developments in spatial statistics, we propose a new statistical model based on random field theory, referred to as a genetic random field model (GenRF), for joint association analysis with the consideration of possible gene–gene interactions and LD. Using a pseudo-likelihood approach, a GenRF test for the joint association of multiple genetic variants is developed, which has the following advantages: (1) it accommodates complex interactions for improved performance; (2) it achieves natural dimension reduction; (3) it boosts power in the presence of LD; and (4) it is computationally efficient. Simulation studies are conducted under various scenarios. The development focuses on quantitative traits; robustness of the GenRF test to other traits, for example, binary traits, is also discussed. Compared with a commonly adopted kernel machine approach, SKAT, as well as other more standard methods, GenRF shows overall comparable performance and better performance in the presence of complex interactions. The method is further illustrated by an application to the Dallas Heart Study.

Many techniques of functional data analysis require choosing a measure of distance between functions, with the most common choice being the L_2 distance. In this article we show that using a weighted L_2 distance, with a judiciously chosen weight function, can improve the performance of various statistical methods for functional data, including *k*-medoids clustering, nonparametric classification, and permutation testing. Assuming a quadratically penalized (e.g., spline) basis representation for the functional data, we consider three nontrivial weight functions: design density weights, inverse-variance weights, and a new weight function that minimizes the coefficient of variation of the resulting squared distance by means of an efficient iterative procedure. The benefits of weighting, in particular with the proposed weight function, are demonstrated both in simulation studies and in applications to the Berkeley growth data and a functional magnetic resonance imaging data set.
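For readers unfamiliar with the basic construction, a weighted L_2 distance on a common evaluation grid is straightforward to compute. The following is a minimal sketch only; the step-function weight here is an arbitrary illustration and is not one of the three weight functions studied in the article (design density, inverse-variance, or the proposed minimum-coefficient-of-variation weight).

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal quadrature of samples y over grid x."""
    return np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))

def weighted_l2_distance(f, g, t, w=None):
    """Weighted L_2 distance between two functions sampled on grid t.
    A uniform weight recovers the ordinary L_2 distance."""
    if w is None:
        w = np.ones_like(t)
    return np.sqrt(trapezoid(w * (f - g) ** 2, t))

t = np.linspace(0.0, 1.0, 1001)
f = np.sin(2 * np.pi * t)
g = np.zeros_like(t)

# Unweighted: the L_2 norm of sin(2*pi*t) on [0, 1] is sqrt(1/2)
d_unif = weighted_l2_distance(f, g, t)

# Illustrative weights that downweight a (hypothetically noisier)
# region t > 0.5, shrinking that region's contribution to the distance
w = np.where(t > 0.5, 0.25, 1.0)
d_wtd = weighted_l2_distance(f, g, t, w)
```

Any distance-based procedure (e.g., *k*-medoids on a pairwise distance matrix) can then be run with `weighted_l2_distance` in place of the ordinary L_2 distance.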

Structural mean models (SMMs) have been proposed for estimating causal parameters in the presence of non-ignorable non-compliance in clinical trials. To obtain a valid causal estimate, we must impose several assumptions. One of these is the correct specification of the structural model. Building on Pan's work (2001, *Biometrics* **57**, 120–125) on developing a model selection criterion for generalized estimating equations, we propose a new approach for model selection of SMMs based on a quasi-likelihood. We provide a formal model selection criterion that is an extension of Akaike's information criterion. Using subset selection of baseline covariates, our method allows us to understand whether the treatment effect varies across the available baseline covariate levels, and/or to quantify the treatment effect at a specific covariate level to target specific individuals to maximize treatment benefit. We present simulation results in which our method performs reasonably well compared to other testing methods in terms of both the probability of selecting the correct model and the predictive performance of the individual treatment effects. We use a large randomized clinical trial of pravastatin as a motivating example.

A common goal of epidemiologic research is to study how two exposures interact in causing a binary outcome. Causal interaction is defined as the presence of subjects for which the causal effect of one exposure depends on the level of the other exposure. For binary exposures, it has previously been shown that the presence of causal interaction is testable through additive statistical interaction. However, it has also been shown that the magnitude of causal interaction, defined as the proportion of subjects for which there is causal interaction, is generally not identifiable. In this article, we derive bounds on causal interactions, which are applicable to binary outcomes and categorical exposures with arbitrarily many levels. These bounds can be used to assess the magnitude of causal interaction, and serve as an important complement to the statistical test that is frequently employed. The bounds are derived both without and with an assumption about monotone exposure effects. We present an application of the bounds to a study of gene–gene interaction in rheumatoid arthritis.

Set classification problems arise when classification tasks are based on sets of observations as opposed to individual observations. In set classification, a classification rule is trained with *N* sets of observations, where each set is labeled with class information, and prediction of a class label is also made for a set of observations. Data sets for set classification appear, for example, in diagnostics of disease based on multiple cell nucleus images from a single tissue. Relevant statistical models for set classification are introduced, which motivate a set classification framework based on context-free feature extraction. By understanding a set of observations as an empirical distribution, we employ a data-driven method to choose those features which contain information on location and major variation. In particular, the method of principal component analysis is used to extract the features of major variation. Multidimensional scaling is used to represent features as vector-valued points on which conventional classifiers can be applied. The proposed set classification approaches achieve better classification results than competing methods in a number of simulated data examples. The benefits of our method are demonstrated in an analysis of histopathology images of cell nuclei related to liver cancer.

The *genotype main effects and genotype-by-environment interaction effects* (GGE) model and the *additive main effects and multiplicative interaction* (AMMI) model are two common models for analysis of genotype-by-environment data. These models are frequently used by agronomists, plant breeders, geneticists and statisticians for analysis of multi-environment trials. In such trials, a set of genotypes, for example, crop cultivars, are compared across a range of environments, for example, locations. The GGE and AMMI models use singular value decomposition to partition genotype-by-environment interaction into an ordered sum of multiplicative terms. This article deals with the problem of testing the significance of these multiplicative terms in order to decide how many terms to retain in the final model. We propose parametric bootstrap methods for this problem. Models with fixed main effects, fixed multiplicative terms and random normally distributed errors are considered. Two methods are derived: a *full* and a *simple* parametric bootstrap method. These are compared with the alternatives of using approximate *F*-tests and cross-validation. In a simulation study based on four multi-environment trials, both bootstrap methods performed well with regard to Type I error rate and power. The simple parametric bootstrap method is particularly easy to use, since it only involves repeated sampling of standard normally distributed values. This method is recommended for selecting the number of multiplicative terms in GGE and AMMI models. The proposed methods can also be used for testing components in principal component analysis.
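The core idea of the simple parametric bootstrap can be sketched in a few lines: compare an SVD-based statistic for the observed interaction matrix against its null distribution obtained by repeatedly sampling standard normal matrices. This is a stylized sketch only; the statistic shown (share of the interaction sum of squares captured by the first singular value) and the hypothetical matrices `E_signal` and `E_noise` are illustrative assumptions, and the article's actual procedure additionally accounts for degrees of freedom lost to main-effect estimation and tests the multiplicative terms sequentially.

```python
import numpy as np

rng = np.random.default_rng(7)

def first_term_stat(Z):
    """Share of the total sum of squares captured by the first
    multiplicative (singular value) term of the matrix Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

def simple_bootstrap_pvalue(E, n_boot=500):
    """Compare the observed statistic with its null distribution,
    obtained by repeated sampling of standard normal matrices of the
    same shape (the defining feature of the 'simple' method)."""
    obs = first_term_stat(E)
    null = np.array([first_term_stat(rng.standard_normal(E.shape))
                     for _ in range(n_boot)])
    return np.mean(null >= obs)

g, e = 10, 8  # genotypes x environments (illustrative sizes)
u = np.ones(g) / np.sqrt(g)
v = np.ones(e) / np.sqrt(e)

# Interaction matrix with a strong first multiplicative term vs. pure noise
E_signal = 5.0 * np.outer(u, v) + 0.3 * rng.standard_normal((g, e))
E_noise = rng.standard_normal((g, e))

p_signal = simple_bootstrap_pvalue(E_signal)  # should be small
p_noise = simple_bootstrap_pvalue(E_noise)    # should not be small
```

A small p-value suggests retaining the multiplicative term; the test would then be repeated on the residual after removing that term.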

In this article we propose an accelerated intensity frailty (AIF) model for recurrent events data and derive a test for the variance of the frailty. In addition, we develop a kernel-smoothing-based EM algorithm for estimating regression coefficients and the baseline intensity function. The variance of the resulting estimator for the regression parameters is obtained by a numerical differentiation method. Simulation studies are conducted to evaluate the finite-sample performance of the proposed estimator under practical settings and demonstrate the efficiency gain over the Gehan rank estimator based on the AFT model for counting processes (Lin et al., 1998). Our method is further illustrated with an application to bladder tumor recurrence data.

To evaluate the utility of automated deformable image registration (DIR) algorithms, it is necessary to evaluate both the registration accuracy of the DIR algorithm itself and the registration accuracy of the human readers from whom the "gold standard" is obtained. We propose a Bayesian hierarchical model to evaluate the spatial accuracy of human readers and automatic DIR methods based on multiple image registration datasets generated by human readers and by automatic DIR methods. To fully account for the locations of landmarks in all images, we treat the true locations of landmarks as latent variables and impose a hierarchical structure on the magnitude of registration errors observed across image pairs. DIR registration errors are modeled using Gaussian processes with reference prior densities on the prior parameters that determine the associated covariance matrices. We develop a Gibbs sampling algorithm to efficiently fit our models to high-dimensional data, and apply the proposed method to analyze an image dataset obtained from a 4D thoracic CT study.

Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this article, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism, where the missingness mechanism is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to non-identifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset.

Suppose we are interested in estimating the average causal effect from an observational study. A doubly robust estimator, which is a hybrid of the outcome regression and propensity score weighting, is more robust than estimators obtained by either of them in the sense that, if at least one of the two models holds, the doubly robust estimator is consistent. However, a doubly robust estimator may still suffer from model misspecification since it is not consistent if neither of them is correctly specified. In this article, we propose an alternative estimator, called the stratified doubly robust estimator, by further combining propensity score stratification with outcome regression and propensity score weighting. This estimator allows two candidate models for the propensity score and is more robust than existing doubly robust estimators in the sense that it is consistent either if the outcome regression holds or if one of the two models for the propensity score holds. Asymptotic properties are examined and finite sample performance of the proposed estimator is investigated by simulation studies. Our proposed method is illustrated with the Tone study, which is a community survey conducted in Japan.
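For context, the standard (non-stratified) doubly robust estimator that the article builds on is the augmented inverse-probability-weighted (AIPW) estimator: outcome-regression predictions corrected by propensity-weighted residuals. The sketch below illustrates its defining property with a simulated example of my own construction (not the Tone study data): the outcome model is correctly specified while the propensity model is deliberately misspecified as a constant, yet the estimate remains close to the true average causal effect of 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_predict(Xd, y, Xd_all):
    """Least-squares fit of y on design Xd, predictions on Xd_all."""
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return Xd_all @ beta

n = 20000
x = rng.standard_normal(n)
e_true = 1.0 / (1.0 + np.exp(-x))          # true propensity score
a = rng.binomial(1, e_true)                # treatment assignment
y = 2.0 * a + x + rng.standard_normal(n)   # true average causal effect = 2

Xd = np.column_stack([np.ones(n), x])

# Correctly specified outcome regressions, fit separately by arm
m1 = ols_predict(Xd[a == 1], y[a == 1], Xd)
m0 = ols_predict(Xd[a == 0], y[a == 0], Xd)

# Deliberately misspecified propensity model: a constant
e_bad = np.full(n, a.mean())

# AIPW estimate: still consistent, because the outcome model is correct
ace_dr = np.mean(m1 - m0
                 + a * (y - m1) / e_bad
                 - (1 - a) * (y - m0) / (1 - e_bad))
```

Swapping in the true propensity `e_true` with a wrong outcome model would likewise recover the effect; the stratified estimator proposed in the article extends this protection to a second candidate propensity model.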

In this article, we extend the superpopulation capture–recapture model to multiple states (locations or populations) with two age groups. Wen et al. (2011, 2013) developed an approach combining capture–recapture data with population assignment information to estimate the relative contributions of in situ births and immigrants to the growth of a single study population. Here, we first generalize the Wen et al. (2011, 2013) approach to a system composed of multiple study populations (multi-state) with two age groups, where an imputation approach is employed to account for the uncertainty inherent in the population assignment information. We then develop a different, individual-level mixture model approach to integrate the individual-level population assignment information with the capture–recapture data. Our simulation and real data analyses show that the fusion of population assignment information with capture–recapture data allows us to estimate the origination-specific recruitment of new animals to the system and the dispersal process between populations within the system. Compared to a standard capture–recapture model, our new models improve the estimation of demographic parameters, including survival probability, origination-specific entry probability, and especially the probability of movement between populations, yielding higher accuracy and precision.

While a general goal of early phase clinical studies is to identify an acceptable dose for further investigation, modern dose finding studies and designs are highly specific to individual clinical settings. In addition, as outcome-adaptive dose finding methods often involve complex algorithms, it is crucial to have diagnostic tools to evaluate the plausibility of a method's simulated performance and the adequacy of the algorithm. In this article, we propose a simple technique that provides an upper limit, or a benchmark, of accuracy for dose finding methods for a given design objective. The proposed benchmark is nonparametric optimal in the sense of O'Quigley et al. (2002, *Biostatistics* **3,** 51–56), and is demonstrated by examples to be a practical accuracy upper bound for model-based dose finding methods. We illustrate the implementation of the technique in the context of phase I trials that consider multiple toxicities and phase I/II trials where dosing decisions are based on both toxicity and efficacy, and apply the benchmark to several clinical examples considered in the literature. By comparing the operating characteristics of a dose finding method to that of the benchmark, we can form quick initial assessments of whether the method is adequately calibrated and evaluate its sensitivity to the dose–outcome relationships.

In statistical inference, one has to make sure that the underlying regression model is correctly specified; otherwise, the resulting estimates may be biased. Model checking is an important tool for detecting departures of the assumed regression model from the true one. Missing data are a ubiquitous problem in social and medical studies. When the underlying regression model is correctly specified, recent research shows that the doubly robust (DR) estimation method is popular for handling missing data because of its robustness to misspecification of either the missing data model or the conditional mean model, that is, the model for the conditional expectation of the true regression model given the observed quantities. However, little work has been devoted to goodness-of-fit testing for the DR estimation method. In this article, we propose a testing method to assess the reliability of the estimator derived from the DR estimating equation with possibly missing response and always observed auxiliary variables. Numerical studies demonstrate that the proposed test controls type I errors well. Furthermore, the proposed method is powerful in detecting departures from model assumptions in the marginal mean model of interest. A real dementia data set is used to illustrate the method for the diagnosis of model misspecification in the problem of missing response with an always observed auxiliary variable for cross-sectional data.

Photoactivatable ribonucleoside-enhanced cross-linking and immunoprecipitation (PAR-CLIP) has been increasingly used for the global mapping of RNA–protein interaction sites. There are two key features of PAR-CLIP experiments: the sequence read tags are likely to form an enriched peak around each RNA–protein interaction site, and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify the RNA–protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP data are still lacking. In this article, we propose an integrative model to establish a joint distribution of observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we develop a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and a data application show that our method outperforms the ad hoc methods and provides reliable inferences for the RNA–protein binding sites from PAR-CLIP data.

Estimation of the long-term health effects of air pollution is a challenging task, especially when modeling spatial small-area disease incidence data in an ecological study design. The challenge comes from the unobserved underlying spatial autocorrelation structure in these data, which is accounted for using random effects modeled by a globally smooth conditional autoregressive model. These smooth random effects confound the effects of air pollution, which are also globally smooth. To avoid this collinearity a Bayesian localized conditional autoregressive model is developed for the random effects. This localized model is flexible spatially, in the sense that it can not only model areas of spatial smoothness, but can also capture step changes in the random effects surface. This methodological development allows us to improve the estimation performance of the covariate effects, compared to using traditional conditional autoregressive models. These results are established using a simulation study, and are then illustrated with our motivating study on air pollution and respiratory ill health in Greater Glasgow, Scotland in 2011. The model shows substantial health effects of particulate matter air pollution and nitrogen dioxide, whose effects have been consistently attenuated by the currently available globally smooth models.

Parametric estimation of the cumulative incidence function (CIF) is considered for competing risks data subject to interval censoring. Existing parametric models of the CIF for right censored competing risks data are adapted to the general case of interval censoring. Maximum likelihood estimators for the CIF are considered under the assumed models, extending earlier work on nonparametric estimation. A simple naive likelihood estimator is also considered that utilizes only part of the observed data. The naive estimator enables separate estimation of models for each cause, unlike full maximum likelihood in which all models are fit simultaneously. The naive likelihood is shown to be valid under mixed case interval censoring, but not under an independent inspection process model, in contrast with full maximum likelihood which is valid under both interval censoring models. In simulations, the naive estimator is shown to perform well and yield comparable efficiency to the full likelihood estimator in some settings. The methods are applied to data from a large, recent randomized clinical trial for the prevention of mother-to-child transmission of HIV.

This article targets the estimation of a time-dependent association measure for bivariate failure times, the conditional cause-specific hazards ratio (CCSHR), which is a generalization of the conditional hazards ratio (CHR) to accommodate competing risks data. We model the CCSHR as a parametric regression function of time and event causes and leave all other aspects of the joint distribution of the failure times unspecified. We develop a pseudo-likelihood estimation procedure for model fitting and inference and establish the asymptotic properties of the estimators. We assess the finite-sample properties of the proposed estimators against the estimators obtained from a moment-based estimating equation approach. Data from the Cache County study on dementia are used to illustrate the proposed methodology.

We take a semiparametric approach in fitting a linear transformation model to right-censored data when predictive variables are subject to measurement errors. We construct consistent estimating equations when repeated measurements of a surrogate of the unobserved true predictor are available. The proposed approach applies under minimal assumptions on the distributions of the true covariate or the measurement errors. We derive the asymptotic properties of the estimator and examine its finite-sample performance via simulation studies. We apply the method to analyze an AIDS clinical trial data set that motivated the work.

Estimation of the covariance structure for irregular sparse longitudinal data has been studied by many authors in recent years but typically using fully parametric specifications. In addition, when data are collected from several groups over time, it is known that assuming the same or completely different covariance matrices over groups can lead to loss of efficiency and/or bias. Nonparametric approaches have been proposed for estimating the covariance matrix for regular univariate longitudinal data by sharing information across the groups under study. For the irregular case, with longitudinal measurements that are bivariate or multivariate, modeling becomes more difficult. In this article, to model bivariate sparse longitudinal data from several groups, we propose a flexible covariance structure via a novel matrix stick-breaking process for the residual covariance structure and a Dirichlet process mixture of normals for the random effects. Simulation studies are performed to investigate the effectiveness of the proposed approach over more traditional approaches. We also analyze a subset of Framingham Heart Study data to examine how the blood pressure trajectories and covariance structures differ for the patients from different BMI groups (high, medium, and low) at baseline.

Investigators commonly gather longitudinal data to assess changes in responses over time and to relate these changes to within-subject changes in predictors. With rare or expensive outcomes such as uncommon diseases and costly radiologic measurements, outcome-dependent, and more generally outcome-related, sampling plans can improve estimation efficiency and reduce cost. Longitudinal follow-up of subjects gathered in an initial outcome-related sample can then be used to study the trajectories of responses over time and to assess the association of changes in predictors within subjects with change in response. In this article, we develop two likelihood-based approaches for fitting generalized linear mixed models (GLMMs) to longitudinal data from a wide variety of outcome-related sampling designs. The first is an extension of the semi-parametric maximum likelihood approach developed in Neuhaus, Scott and Wild (2002, *Biometrika* **89**, 23–37) and Neuhaus, Scott and Wild (2006, *Biometrics* **62**, 488–494) and applies quite generally. The second approach is an adaptation of standard conditional likelihood methods and is limited to random intercept models with a canonical link. Data from a study of attention deficit hyperactivity disorder in children motivate the work and illustrate the findings.

Dynamic treatment regimes (DTRs) operationalize the clinical decision process as a sequence of functions, one for each clinical decision, where each function maps up-to-date patient information to a single recommended treatment. Current methods for estimating optimal DTRs, for example *Q*-learning, require the specification of a single outcome by which the “goodness” of competing dynamic treatment regimes is measured. However, this is an over-simplification of the goal of clinical decision making, which aims to balance several potentially competing outcomes, for example, symptom relief and side-effect burden. When there are competing outcomes and patients do not know or cannot communicate their preferences, formation of a single composite outcome that correctly balances the competing outcomes is not possible. This problem also occurs when patient preferences evolve over time. We propose a method for constructing DTRs that accommodates competing outcomes by recommending sets of treatments at each decision point. Formally, we construct a sequence of set-valued functions that take as input up-to-date patient information and give as output a recommended subset of the possible treatments. For a given patient history, the recommended set of treatments contains all treatments that produce non-inferior outcome vectors. Constructing these set-valued functions requires solving a non-trivial enumeration problem. We offer an exact enumeration algorithm by recasting the problem as a linear mixed integer program. The proposed methods are illustrated using data from the CATIE schizophrenia study.

In order to make a missing at random (MAR) or ignorability assumption realistic, auxiliary covariates are often required. However, the auxiliary covariates are not desired in the model for inference. Typical multiple imputation approaches do not assume that the imputation model marginalizes to the inference model. This has been termed “uncongenial” [Meng (1994, *Statistical Science* **9**, 538–558)]. In order to make the two models congenial (or compatible), we would rather not assume a parametric model for the marginal distribution of the auxiliary covariates, but we typically do not have enough data to estimate the joint distribution well non-parametrically. In addition, when the imputation model uses a non-linear link function (e.g., the logistic link for a binary response), the marginalization over the auxiliary covariates to derive the inference model typically results in a difficult-to-interpret form for the effect of covariates. In this article, we propose a fully Bayesian approach for incomplete longitudinal data that ensures the models are compatible by embedding an interpretable inference model within an imputation model, and that also addresses the two complications described above. We evaluate the approach via simulations and implement it on a recent clinical trial.

Motivated by examples from genetic association studies, this article considers the model selection problem in a general complex linear model system and in a Bayesian framework. We discuss formulating model selection problems and incorporating context-dependent *a priori* information through different levels of prior specifications. We also derive analytic Bayes factors and their approximations to facilitate model selection and discuss their theoretical and computational properties. We demonstrate our Bayesian approach based on an implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real data application of mapping tissue-specific eQTLs. Our novel results on Bayes factors provide a general framework to perform efficient model comparisons in complex linear model systems.

Functional enrichment analysis is conducted on high-throughput data to provide functional interpretation for a list of genes or proteins that share a common property, such as being differentially expressed (DE). The hypergeometric *P*-value has been widely used to investigate whether genes from pre-defined functional terms, for example, Gene Ontology (GO), are enriched in the DE genes. The hypergeometric *P*-value has three limitations: (1) it is computed independently for each term, thus neglecting biological dependence; (2) it is subject to a size constraint that leads to a tendency to select less-specific terms; (3) it repeatedly uses information because annotations overlap under the true-path rule. We propose a Bayesian approach based on the non-central hypergeometric model. The GO dependence structure is incorporated through a prior on non-centrality parameters. The likelihood function does not include overlapping information. The inference about enrichment is based on posterior probabilities that do not have a size constraint. This method can detect moderate but consistent enrichment signals and identify sets of closely related and biologically meaningful functional terms rather than isolated terms. We also describe the underlying assumptions and implementations of the different methods to provide theoretical insight, which is demonstrated via a simulation study. A real application is presented.
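For context, the classical hypergeometric *P*-value that this abstract critiques can be computed directly. A minimal sketch (the function name and interface are illustrative, not from the article):

```python
from math import comb

def hypergeom_enrichment_pvalue(N, K, n, k):
    """Upper-tail hypergeometric P-value P(X >= k):
    N genes in the background, K of them annotated to the GO term,
    n differentially expressed, k DE genes annotated to the term."""
    upper = min(K, n)  # cannot draw more annotated genes than exist
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

Because each term's *P*-value is computed in isolation, exactly as above, the dependence among overlapping GO terms is ignored — which is the first limitation the proposed Bayesian approach addresses.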

We develop a Bayesian nonparametric mixture modeling framework for quantal bioassay settings. The approach is built upon modeling dose-dependent response distributions. We adopt a structured nonparametric prior mixture model, which induces a monotonicity restriction for the dose–response curve. Particular emphasis is placed on the key risk assessment goal of calibration for the dose level that corresponds to a specified response. The proposed methodology yields flexible inference for the dose–response relationship as well as for other inferential objectives, as illustrated with two data sets from the literature.

This article proposes a new multiple-testing approach for estimation of the minimum effective dose allowing for non-monotone dose–response shapes. The presented approach combines the advantages of two commonly used methods. It is shown that the new approach controls the error rate of underestimating the true minimum effective dose. Monte Carlo simulations indicate that the proposed method outperforms alternative methods in many cases and is only marginally worse in the remaining situations.

We propose a new variable selection criterion designed for use with forward selection algorithms: the score information criterion (SIC). The proposed criterion is based on score statistics which incorporate correlated response data. The main advantage of the SIC is that it is much faster to compute than existing model selection criteria when the number of predictor variables added to a model is large, because the SIC can be computed for all candidate models without actually fitting them. A second advantage is that it incorporates the correlation between variables into its quasi-likelihood, leading to more desirable properties than competing selection criteria. Consistency and prediction properties are shown for the SIC. We conduct simulation studies to evaluate the selection and prediction performance, and compare these, as well as computational times, with some well-known variable selection criteria. We apply the SIC to a real data set collected on arthropods by considering variable selection on a large number of interaction terms consisting of species traits and environmental covariates.
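The key computational point — scoring every candidate variable without refitting any candidate model — can be illustrated in the simplest linear-model setting, where a score-type statistic for adding a column reduces to the squared correlation between that column and the current residuals. This is a hedged sketch of the general idea, not the SIC itself (which is built on quasi-likelihood score statistics for correlated responses):

```python
import numpy as np

def forward_select_by_score(X, y, num_steps):
    """Greedy forward selection. At each step every remaining column is
    scored against the current residuals without fitting the candidate
    model; only the chosen model is refit to update the residuals."""
    n, p = X.shape
    selected = []
    resid = y - y.mean()
    for _ in range(num_steps):
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            xj = X[:, j] - X[:, j].mean()
            denom = xj @ xj
            # score-statistic analogue: no model fit required
            score = (xj @ resid) ** 2 / denom if denom > 0 else 0.0
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        # refit only the selected model before the next step
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
    return selected
```

With p candidates and k steps this costs one model fit per step rather than one per candidate, which is the source of the speed advantage the abstract describes.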

Reconstructing neural activities using non-invasive sensor arrays outside the brain is an ill-posed inverse problem since the observed sensor measurements could result from an infinite number of possible neuronal sources. The sensor covariance-based beamformer mapping represents a popular and simple solution to the above problem. In this article, we propose a family of beamformers by using covariance thresholding. A general theory is developed on how their spatial and temporal dimensions determine their performance. Conditions are provided for the convergence rate of the associated beamformer estimation. The implications of the theory are illustrated by simulations and a real data analysis.
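One simple form that covariance thresholding can take is eigenvalue truncation of the sensor covariance before it enters the beamformer weights. The sketch below is a hedged illustration of that idea only; the family of beamformers in the article may be defined differently:

```python
import numpy as np

def eigen_threshold_covariance(C, tau):
    """Reconstruct a covariance matrix from only those eigencomponents
    whose eigenvalues exceed tau, discarding low-variance directions
    that are dominated by noise."""
    vals, vecs = np.linalg.eigh(np.asarray(C, dtype=float))
    keep = vals > tau
    # scale retained eigenvectors by their eigenvalues and recompose
    return (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T
```

The threshold trades off spatial resolution against noise sensitivity, which is the kind of dimension-dependent performance the article's theory characterizes.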

In this article, we present a new variational Bayes approach for solving the neuroelectromagnetic inverse problem arising in studies involving electroencephalography (EEG) and magnetoencephalography (MEG). This high-dimensional spatiotemporal estimation problem involves the recovery of time-varying neural activity at a large number of locations within the brain, from electromagnetic signals recorded at a relatively small number of external locations on or near the scalp. Framing this problem within the context of spatial variable selection for an underdetermined functional linear model, we propose a spatial mixture formulation where the profile of electrical activity within the brain is represented through location-specific spike-and-slab priors based on a spatial logistic specification. The prior specification accommodates spatial clustering in brain activation, while also allowing for the inclusion of auxiliary information derived from alternative imaging modalities, such as functional magnetic resonance imaging (fMRI). We develop a variational Bayes approach for computing estimates of neural source activity, and incorporate a nonparametric bootstrap for interval estimation. The proposed methodology is compared with several alternative approaches through simulation studies, and is applied to the analysis of a multimodal neuroimaging study examining the neural response to face perception using EEG, MEG, and fMRI.

Despite modern effective HIV treatment, hepatitis C virus (HCV) co-infection is associated with a high risk of progression to end-stage liver disease (ESLD), which has emerged as the primary cause of death in this population. Clinical interest lies in determining the impact of clearance of HCV on risk for ESLD. In this case study, we examine whether HCV clearance affects risk of ESLD using data from the multicenter Canadian Co-infection Cohort Study. Complications in this survival analysis arise from the time-dependent nature of the data, the presence of baseline confounders, loss to follow-up, and confounders that change over time, all of which can obscure the causal effect of interest. Additional challenges included missingness in non-censoring variables and event sparsity.

In order to efficiently estimate the ESLD-free survival probabilities under a specific history of HCV clearance, we demonstrate the doubly robust and semiparametric efficient method of Targeted Maximum Likelihood Estimation (TMLE). Marginal structural models (MSM) can be used to model the effect of viral clearance (expressed as a hazard ratio) on ESLD-free survival, and we demonstrate a way to estimate the parameters of a logistic model for the hazard function with TMLE. We show the theoretical derivation of the efficient influence curves for the parameters of two different MSMs and how they can be used to produce variance approximations for parameter estimates. Finally, the data analysis evaluating the impact of HCV on ESLD was undertaken using multiple imputation to account for the non-monotone missing data.

Matched case-control designs are commonly used in epidemiologic studies for increased efficiency. These designs have recently been introduced to the setting of modern imaging and genomic studies, which are characterized by high-dimensional covariates. However, appropriate statistical analyses that adjust for the matching have not been widely adopted. A matched case-control study of 430 acute ischemic stroke patients was conducted at Massachusetts General Hospital (MGH) in order to identify specific brain regions of acute infarction that are associated with hospital acquired pneumonia (HAP) in these patients. There are 138 brain regions in which infarction was measured, which introduce nearly 10,000 two-way interactions, and challenge the statistical analysis. We investigate penalized conditional and unconditional logistic regression approaches to this variable selection problem that properly differentiate between selection of main effects and of interactions, and that acknowledge the matching. This neuroimaging study was nested within a larger prospective study of HAP in 1915 stroke patients at MGH, which recorded clinical variables, but did not include neuroimaging. We demonstrate how the larger study, in conjunction with the nested, matched study, affords us the capability to derive a score for prediction of HAP in future stroke patients based on imaging and clinical features. We evaluate the proposed methods in simulation studies and we apply them to the MGH HAP study.

Traditional studies of short-term air pollution health effects use time series data, while cohort studies generally focus on long-term effects. There is increasing interest in exploiting individual level cohort data to assess short-term health effects in order to understand the mechanisms and time scales of action. We extend semiparametric regression methods used to adjust for unmeasured confounding in time series studies to the cohort setting. Time series methods are not directly applicable since cohort data are typically collected over a prespecified time period and include exposure measurements on days without health observations. Therefore, long-time asymptotics are not appropriate, and it is possible to improve efficiency by exploiting the additional exposure data. We show that flexibility of the semiparametric adjustment model should match the complexity of the trend in the health outcome, in contrast to the time series setting where it suffices to match temporal structure in the exposure. We also demonstrate that pre-adjusting exposures concurrent with the health endpoints using trends in the complete exposure time series results in unbiased health effect estimation and can improve efficiency without additional confounding adjustment. A recently published article found evidence of an association between short-term exposure to ambient fine particulate matter (PM_{2.5}) and retinal arteriolar diameter as measured by retinal photography in the Multi-Ethnic Study of Atherosclerosis (MESA). We reanalyze the data from this article in order to compare the methods described here, and we evaluate our methods in a simulation study based on the MESA data.

Motivated by actual study designs, this article considers efficient logistic regression designs where the population is identified with a binary test that is subject to diagnostic error. We consider the case where the imperfect test is obtained on all participants, while the gold standard test is measured on a small chosen subsample. Under maximum-likelihood estimation, we evaluate the optimal design in terms of sample selection as well as verification. We show that there may be substantial efficiency gains by choosing a small percentage of individuals who test negative on the imperfect test for inclusion in the sample (e.g., verifying 90% of test-positive cases). We also show that a two-stage design may be a good practical alternative to a fixed design in some situations. Under optimal and nearly optimal designs, we compare maximum-likelihood and semi-parametric efficient estimators under correct and misspecified models with simulations. The methodology is illustrated with an analysis from a diabetes behavioral intervention trial.

Trial investigators often have a primary interest in the estimation of the survival curve in a population for which there exists acceptable historical information from which to borrow strength. However, borrowing strength from a historical trial that is non-exchangeable with the current trial can result in biased conclusions. In this article we propose a fully Bayesian semiparametric method for the purpose of attenuating bias and increasing efficiency when jointly modeling time-to-event data from two possibly non-exchangeable sources of information. We illustrate the mechanics of our methods by applying them to a pair of post-market surveillance datasets regarding adverse events in persons on dialysis that had either a bare metal or drug-eluting stent implanted during a cardiac revascularization surgery. We finish with a discussion of the advantages and limitations of this approach to evidence synthesis, as well as directions for future work in this area. The article's Supplementary Materials offer simulations to show our procedure's bias, mean squared error, and coverage probability properties in a variety of settings.

A transformed Bernstein polynomial that is centered at standard parametric families, such as Weibull or log-logistic, is proposed for use in the accelerated hazards model. This class provides a convenient way towards creating a Bayesian nonparametric prior for smooth densities, blending the merits of parametric and nonparametric methods, that is amenable to standard estimation approaches. For example, optimization methods in SAS or R can yield the posterior mode and asymptotic covariance matrix. This novel nonparametric prior is employed in the accelerated hazards model, which is further generalized to time-dependent covariates. The proposed approach fares considerably better than previous approaches in simulations; data on the effectiveness of biodegradable carmustine polymers on recurrent malignant brain gliomas are investigated.

Epidemiological studies involving biomarkers are often hindered by prohibitively expensive laboratory tests. Strategically pooling specimens prior to performing these lab assays has been shown to effectively reduce cost with minimal information loss in a logistic regression setting. When the goal is to perform regression with a continuous biomarker as the outcome, regression analysis of pooled specimens may not be straightforward, particularly if the outcome is right-skewed. In such cases, we demonstrate that a slight modification of a standard multiple linear regression model for poolwise data can provide valid and precise coefficient estimates when pools are formed by combining biospecimens from subjects with identical covariate values. When these homogeneous pools cannot be formed, we propose a Monte Carlo expectation maximization (MCEM) algorithm to compute maximum likelihood estimates (MLEs). Simulation studies demonstrate that these analytical methods provide essentially unbiased estimates of coefficient parameters as well as their standard errors when appropriate assumptions are met. Furthermore, we show how one can utilize the fully observed covariate data to inform the pooling strategy, yielding a high level of statistical efficiency at a fraction of the total lab cost.
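The homogeneous-pool case has a simple structure worth making explicit: if all members of a pool share identical covariates, the pooled mean outcome follows the same linear model as the individual outcomes, with error variance scaled by 1/(pool size), so weighted least squares with pool sizes as weights recovers the coefficients. A minimal sketch under those assumptions (the interface is illustrative, not the article's software):

```python
import numpy as np

def poolwise_wls(pool_mean_y, pool_x, pool_sizes):
    """Weighted least squares on pool-level data: each pool contributes
    its mean outcome, its shared covariate value, and a weight equal to
    its size (since Var(mean of m outcomes) = sigma^2 / m)."""
    y = np.asarray(pool_mean_y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(pool_x, dtype=float)])
    sw = np.sqrt(np.asarray(pool_sizes, dtype=float))
    # weighting by sqrt(m) on both sides implements WLS via OLS
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta  # [intercept, slope]
```

When covariates differ within pools, this closed form no longer applies, which is what motivates the MCEM algorithm in the article.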

After establishing the utility of a continuous diagnostic marker, investigators will typically address the question of determining a cut-off point which will be used for diagnostic purposes in clinical decision making. The most commonly used optimality criterion for cut-off point selection in the context of ROC curve analysis is the maximum of the Youden index. The pair of sensitivity and specificity proportions that correspond to the Youden index-based cut-off point characterize the performance of the diagnostic marker. Confidence intervals for sensitivity and specificity are routinely estimated based on the assumption that sensitivity and specificity are independent binomial proportions, as they arise from the independent populations of diseased and healthy subjects, respectively. However, the Youden index-based cut-off point is estimated from the data, and as such the resulting sensitivity and specificity proportions are in fact correlated. This correlation needs to be taken into account in order to calculate confidence intervals that result in the anticipated coverage. In this article, we study parametric and non-parametric approaches for the construction of confidence intervals for the pair of sensitivity and specificity proportions that correspond to the Youden index-based optimal cut-off point. These approaches result in the anticipated coverage under different scenarios for the distributions of the healthy and diseased subjects. We find that a parametric approach based on a Box–Cox transformation to normality often works well. For biomarkers following more complex distributions, a non-parametric procedure using logspline density estimation can be used.
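The Youden index criterion itself is straightforward: scan candidate cut-offs and maximize J = sensitivity + specificity − 1. A minimal sketch, assuming larger marker values indicate disease (the function name is illustrative):

```python
def youden_optimal_cutoff(healthy, diseased):
    """Return the cut-off maximizing the Youden index
    J = sensitivity + specificity - 1, scanning observed values."""
    candidates = sorted(set(healthy) | set(diseased))
    best_c, best_j = None, -1.0
    for c in candidates:
        sens = sum(x > c for x in diseased) / len(diseased)
        spec = sum(x <= c for x in healthy) / len(healthy)
        j = sens + spec - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j
```

Because the returned cut-off is itself a function of both samples, the sensitivity and specificity evaluated at it are correlated — precisely the complication that the article's confidence-interval constructions address.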

We examine differences between independent component analyses (ICAs) arising from different assumptions, measures of dependence, and starting points of the algorithms. ICA is a popular method with diverse applications including artifact removal in electrophysiology data, feature extraction in microarray data, and identifying brain networks in functional magnetic resonance imaging (fMRI). ICA can be viewed as a generalization of principal component analysis (PCA) that takes into account higher-order cross-correlations. Whereas the PCA solution is unique, there are many ICA methods, whose solutions may differ. Infomax, FastICA, and JADE are commonly applied to fMRI studies, with FastICA being arguably the most popular. Hastie and Tibshirani (2003) demonstrated that ProDenICA outperformed FastICA in simulations with two components. We introduce the application of ProDenICA to simulations with more components and to fMRI data. ProDenICA was more accurate in simulations, and we identified differences between biologically meaningful ICs from ProDenICA versus other methods in the fMRI analysis. ICA methods require nonconvex optimization, yet current practices do not recognize the importance of, nor adequately address sensitivity to, initial values. We found that local optima led to dramatically different estimates in both simulations and group ICA of fMRI, and we provide evidence that the global optimum from ProDenICA is the best estimate. We applied a modification of the Hungarian (Kuhn-Munkres) algorithm to match ICs from multiple estimates, thereby gaining novel insights into how brain networks vary in their sensitivity to initial values and ICA method.
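Matching ICs across runs is an assignment problem: given the matrix of absolute correlations between the components of two estimates, find the pairing that maximizes the total matched correlation. The article uses a modified Hungarian algorithm; for a handful of components a brute-force search over permutations gives the same assignment and makes the idea concrete. A hedged sketch:

```python
from itertools import permutations

def match_components(corr):
    """Pair component i of run A with component perm[i] of run B so
    that the total absolute correlation is maximized. Brute force
    stands in for the Hungarian algorithm; fine for small k."""
    k = len(corr)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(k)):
        total = sum(abs(corr[i][perm[i]]) for i in range(k))
        if total > best_total:
            best_perm, best_total = list(perm), total
    return best_perm, best_total
```

Absolute values are taken because ICA components are identified only up to sign. In practice, `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time for larger k.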

High-throughput screening (HTS) of environmental chemicals is used to identify chemicals with high potential for adverse human health and environmental effects from among the thousands of untested chemicals. Predicting physiologically relevant activity with HTS data requires estimating the response of a large number of chemicals across a battery of screening assays based on sparse dose–response data for each chemical-assay combination. Many standard dose–response methods are inadequate because they treat each curve separately and under-perform when there are as few as 6–10 observations per curve. We propose a semiparametric Bayesian model that borrows strength across chemicals and assays. Our method directly parametrizes the efficacy and potency of the chemicals as well as the probability of response. We use the ToxCast data from the U.S. Environmental Protection Agency (EPA) as motivation. We demonstrate that our hierarchical method provides more accurate estimates of the probability of response, efficacy, and potency than separate curve estimation in a simulation study. We use our semiparametric method to compare the efficacy of chemicals in the ToxCast data to well-characterized reference chemicals on estrogen receptor (ER) and peroxisome proliferator-activated receptor (PPAR) assays, then estimate the probability that other chemicals are active at lower concentrations than the reference chemicals.

A unified modeling framework based on a set of nonlinear mixed models is proposed for flexible modeling of gene expression in real-time PCR experiments. Focus is on estimating marginal or population-based derived parameters, such as cycle thresholds, while retaining the conditional mixed model structure to adequately reflect the experimental design. Additionally, the calculation of model-averaged estimates allows incorporation of model selection uncertainty. The methodology is applied to estimate the differential expression of a phosphate transporter gene OsPT6 in rice, in comparison to a reference gene, at several states after phosphate resupply. In a small simulation study the performance of the proposed method is evaluated and compared to a standard method.

**EDITOR: TAESUNG PARK**

**Survival Analysis in Medicine and Genetics**

(Jialiang Li and Shuangge Ma)

*Seungyeoun Lee*

**Bayesian Methods in Health Economics**

(Gianluca Baio)

*Man-Suk Oh*

**Methods of Statistical Model Estimation**

(Joseph M. Hilbe and Andrew P. Robinson)

*Jae-kwang Kim*