The processing of auditory information in neurons is an important area in neuroscience. We consider statistical analysis for an electrophysiological experiment related to this area. The recorded synaptic current responses from the experiment are observed as clusters, where the number of clusters is related to an important characteristic of the auditory system. This number is difficult to estimate visually because the clusters are blurred by biological variability. Using singular value decomposition and a Gaussian mixture model, we develop an estimator for the number of clusters. Additionally, we provide a method for hypothesis testing and sample size determination in the two-sample problem. We illustrate our approach with both simulated and experimental data.

We generalize Levene's test for variance (scale) heterogeneity between *k* groups for more complex data, when there are sample correlation and group membership uncertainty. Following a two-stage regression framework, we show that least absolute deviation regression must be used in the stage 1 analysis to ensure a correct asymptotic distribution of the generalized scale (*gS*) test statistic. We then show that the proposed *gS* test is independent of the generalized location test, under the joint null hypothesis of no mean and no variance heterogeneity. Consequently, we generalize the recently proposed joint location-scale (*gJLS*) test, valuable in settings where there is an interaction effect but one interacting variable is not available. We evaluate the proposed method via an extensive simulation study and two genetic association application studies.
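For intuition, the classical *k*-group special case being generalized is Levene's test with median centering (the Brown–Forsythe variant): stage 1 computes absolute deviations from each group's median — the least-absolute-deviation fit in this simple setting — and stage 2 runs a one-way ANOVA on those deviations. A minimal numpy sketch, with a hypothetical function name:

```python
import numpy as np

def brown_forsythe(groups):
    """Levene's scale test with median centering: returns the ANOVA F statistic."""
    # Stage 1: absolute deviations from each group's median
    # (the least-absolute-deviation fit in the simple k-group case).
    z = [np.abs(np.asarray(g) - np.median(g)) for g in groups]
    k = len(z)
    n = sum(len(zi) for zi in z)
    zbar = np.concatenate(z).mean()
    # Stage 2: one-way ANOVA F statistic on the deviations.
    ssb = sum(len(zi) * (zi.mean() - zbar) ** 2 for zi in z)
    ssw = sum(((zi - zi.mean()) ** 2).sum() for zi in z)
    return (ssb / (k - 1)) / (ssw / (n - k))

rng = np.random.default_rng(0)
f_equal_var = brown_forsythe([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
f_unequal_var = brown_forsythe([rng.normal(0, 1, 100), rng.normal(0, 3, 100)])
```

Note that the mean shift in the first comparison does not inflate the statistic — only the variance difference in the second does. The *gS* test extends this logic to correlated samples and uncertain group membership.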

Population attributable fraction (PAF) is widely used to quantify the disease burden associated with a modifiable exposure in a population. It has been extended to a time-varying measure that provides additional information on when and how the exposure's impact varies over time for cohort studies. However, there is no estimation procedure for PAF using data that are collected from population-based case-control studies, which, because of time and cost efficiency, are commonly used for studying genetic and environmental risk factors of disease incidence. In this article, we show that time-varying PAF is identifiable from a case-control study and develop a novel estimator of PAF. Our estimator combines odds ratio estimates from logistic regression models and density estimates of the risk factor distribution conditional on failure times in cases from a kernel smoother. The proposed estimator is shown to be consistent and asymptotically normal with asymptotic variance that can be estimated empirically from the data. Simulation studies demonstrate that the proposed estimator performs well in finite samples. Finally, the method is illustrated by a population-based case-control study of colorectal cancer.

In this article, we first propose a Bayesian neighborhood selection method to estimate Gaussian Graphical Models (GGMs). We show the graph selection consistency of this method in the sense that the posterior probability of the true model converges to one. When there are multiple groups of data available, instead of estimating the networks independently for each group, joint estimation of the networks may utilize the shared information among groups and lead to improved estimation for each individual network. Our method is extended to jointly estimate GGMs in multiple groups of data with complex structures, including spatial data, temporal data, and data with both spatial and temporal structures. Markov random field (MRF) models are used to efficiently incorporate the complex data structures. We develop and implement an efficient algorithm for statistical inference that enables parallel computing. Simulation studies suggest that our approach achieves better accuracy in network estimation compared with methods not incorporating spatial and temporal dependencies when there are shared structures among the networks, and that it performs comparably well otherwise. Finally, we illustrate our method using the human brain gene expression microarray dataset, where the expression levels of genes are measured in different brain regions across multiple time periods.

We consider conditional estimation in two-stage sample size adjustable designs and the consequent bias. More specifically, we consider a design which permits raising the sample size when interim results look rather promising, and which retains the originally planned sample size when results look very promising. The estimation procedures reported comprise the unconditional maximum likelihood, the conditionally unbiased Rao–Blackwell estimator, the conditional median unbiased estimator, and the conditional maximum likelihood with and without bias correction. We compare these estimators based on analytical results and a simulation study. We show how they can be applied in a real clinical trial setting.

Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high-dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this article, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets.

Traditionally, phylogeny and sequence alignment are estimated separately: first estimate a multiple sequence alignment and then infer a phylogeny based on the sequence alignment estimated in the previous step. However, uncertainty in the alignment is ignored, resulting, possibly, in overstated certainty in phylogeny estimates. We develop a joint model for co-estimating phylogeny and sequence alignment which improves estimates from the traditional approach by accounting for uncertainty in the alignment in phylogenetic inferences. Our insertion and deletion (indel) model allows arbitrary-length overlapping indel events and a general distribution for indel fragment size. We employ a Bayesian approach using MCMC to estimate the joint posterior distribution of a phylogenetic tree and a multiple sequence alignment. Our approach has a tree and a complete history of indel events mapped onto the tree as the state space of the Markov chain, whereas previous approaches have used a tree and an alignment. A large state space containing a complete history of indel events makes our MCMC approach more challenging, but it enables us to infer more information about the indel process. The performance of this joint method and of traditional sequential methods is compared using simulated data as well as real data. Software named BayesCAT (Bayesian Co-estimation of Alignment and Tree) is available at https://github.com/heejungshim/BayesCAT.

In many regression problems, it has become increasingly important to deal with high-dimensional data with many potentially influential covariates. A possible solution is to apply estimation methods that detect the relevant effect structure through penalization. In this article, the effect structure in the Cox frailty model, which is the most widely used model that accounts for heterogeneity in survival data, is investigated. Since in survival models one has to account for possible variation of the effect strength over time, the selection of the relevant features has to distinguish between several cases: covariates can have time-varying effects, time-constant effects, or be irrelevant. A penalization approach is proposed that is able to distinguish between these types of effects to obtain a sparse representation that includes the relevant effects in a proper form. It is shown in simulations that the method works well. The method is applied to model the time until pregnancy, illustrating that the complexity of the influence structure can be strongly reduced by using the proposed penalty approach.

High-throughput genetic and epigenetic data are often screened for associations with an observed phenotype. For example, one may wish to test hundreds of thousands of genetic variants, or DNA methylation sites, for an association with disease status. These genomic variables can naturally be grouped by the gene they encode, among other criteria. However, standard practice in such applications is independent screening with a universal correction for multiplicity. We propose a Bayesian approach in which the prior probability of an association for a given genomic variable depends on its gene, and the gene-specific probabilities are modeled nonparametrically. This hierarchical model allows for appropriate gene and genome-wide multiplicity adjustments, and can be incorporated into a variety of Bayesian association screening methodologies with negligible increase in computational complexity. We describe an application to screening for differences in DNA methylation between lower grade glioma and glioblastoma multiforme tumor samples from The Cancer Genome Atlas. Software is available via the package BayesianScreening for R: github.com/lockEF/BayesianScreening.

To assess the compliance of air quality regulations, the Environmental Protection Agency (EPA) must know if a site exceeds a pre-specified level. In the case of ozone, the level for compliance is fixed at 75 parts per billion, which is high, but not extreme at all locations. We present a new space-time model for threshold exceedances based on the skew-*t* process. Our method incorporates a random partition to permit long-distance asymptotic independence while allowing for sites that are near one another to be asymptotically dependent, and we incorporate thresholding to allow the tails of the data to speak for themselves. We also introduce a transformed AR(1) time-series to allow for temporal dependence. Finally, our model allows for high-dimensional Bayesian inference that is comparable in computation time to traditional geostatistical methods for large data sets. We apply our method to an ozone analysis for July 2005, and find that our model improves over both Gaussian and max-stable methods in terms of predicting exceedances of a high level.

The problem of inferring the relatedness distribution between two individuals from biallelic marker data is considered. This problem can be cast as an estimation task in a mixture model: at each marker the latent variable is the relatedness state, and the observed variable is the genotype of the two individuals. In this model, only the prior proportions are unknown, and can be obtained via ML estimation using the EM algorithm. When the markers are biallelic and the data unphased, the identifiability of the model is known not to be guaranteed. In this article, model identifiability is investigated in the case of phased data generated from a crossing design, a classical situation in plant genetics. It is shown that identifiability can be guaranteed under some conditions on the crossing design. The adapted ML estimator is implemented in an R package called **Relatedness**. The performance of the ML estimator is evaluated and compared to that of the benchmark moment estimator, both on simulated and real data. Compared to its competitor, the ML estimator is shown to be more robust and to provide more realistic estimates.
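The estimation step described — EM when only the prior proportions are unknown and the component distributions are fully specified — is simple to sketch. The example below uses illustrative Bernoulli components rather than the crossing-design genotype likelihoods of the **Relatedness** package; all names and values are hypothetical.

```python
import numpy as np

def em_proportions(L, n_iter=500):
    """EM for mixture weights when component densities are known.
    L[i, s] = likelihood of observation i under latent state s."""
    n, k = L.shape
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        r = pi * L                       # E-step: unnormalized posteriors
        r /= r.sum(axis=1, keepdims=True)
        pi = r.mean(axis=0)              # M-step: update the proportions
    return pi

# Illustrative data: two latent states emitting 0/1 with known probabilities
rng = np.random.default_rng(0)
p, true_pi, n = np.array([0.2, 0.8]), np.array([0.3, 0.7]), 4000
state = rng.choice(2, size=n, p=true_pi)
x = rng.binomial(1, p[state])
L = np.where(x[:, None] == 1, p, 1 - p)  # likelihood of each x under each state
pi_hat = em_proportions(L)
```

The identifiability question the abstract studies is exactly whether distinct proportion vectors can produce the same marginal distribution of the observed genotypes; in this toy binary example the proportions are identifiable because the two emission probabilities differ.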

Partial differential equations (PDEs) are used to model complex dynamical systems in multiple dimensions, and their parameters often have important scientific interpretations. In some applications, PDE parameters are not constant but can change depending on the values of covariates, a feature that we call *varying coefficients*. We propose a parameter cascading method to estimate varying coefficients in PDE models from noisy data. Our estimates of the varying coefficients are shown to be consistent and asymptotically normally distributed. The performance of our method is evaluated by a simulation study and by an empirical study estimating three varying coefficients in a PDE model arising from LIDAR data.

The electroencephalography (EEG) data created in event-related potential (ERP) experiments have a complex high-dimensional structure. Each stimulus presentation, or trial, generates an ERP waveform which is an instance of functional data. The experiments are made up of sequences of multiple trials, resulting in longitudinal functional data and moreover, responses are recorded at multiple electrodes on the scalp, adding an electrode dimension. Traditional EEG analyses involve multiple simplifications of this structure to increase the signal-to-noise ratio, effectively collapsing the functional and longitudinal components by identifying key features of the ERPs and averaging them across trials. Motivated by an implicit learning paradigm used in autism research in which the functional, longitudinal, and electrode components all have critical interpretations, we propose a multidimensional functional principal components analysis (MD-FPCA) technique which does not collapse any of the dimensions of the ERP data. The proposed decomposition is based on separation of the total variation into subject and subunit level variation which are further decomposed in a two-stage functional principal components analysis. The proposed methodology is shown to be useful for modeling longitudinal trends in the ERP functions, leading to novel insights into the learning patterns of children with Autism Spectrum Disorder (ASD) and their typically developing peers as well as comparisons between the two groups. Finite sample properties of MD-FPCA are further studied via extensive simulations.

Dysphagia is a primary cause of death among patients diagnosed with amyotrophic lateral sclerosis (ALS), and percutaneous endoscopic gastrostomy (PEG) is a procedure to insert a tube into the stomach to assist or replace oral feeding. It is believed that PEG is beneficial and, generally, earlier insertion is preferable to later. However, gathering clinical evidence to support these beliefs on the use and timing of PEG is challenging because controlled clinical trials are not feasible and clinical endpoints are confounded with PEG in observational data. Moreover, the confounders are time-varying and time to PEG insertion may be only partially observed. We show how one can view this problem as an adaptive treatment length policy and propose a new estimator via *g*-computation. We show that our estimator is consistent and asymptotically normal for the causal estimand and explore its finite sample properties in simulation studies. Finally, using more than 10 years of data from the Emory ALS clinic registry, we found no evidence to suggest that earlier PEG reduced 4-year mortality; thus, our results do not support the hypothesis and belief that initiating palliative care earlier extends life, on average. At the same time, we cannot be certain that all important confounding variables are collected and observed to ensure our modeling assumptions are correct, so more work is needed to address these important end-of-life questions for ALS patients.

Brain connectivity analysis is now at the foreground of neuroscience research. A connectivity network is characterized by a graph, where nodes represent neural elements such as neurons and brain regions, and links represent statistical dependence that is often encoded in terms of partial correlation. Such a graph is inferred from matrix-valued neuroimaging data such as electroencephalography and functional magnetic resonance imaging. There have been many successful proposals for sparse precision matrix estimation under normal or matrix normal distribution; however, this family of solutions does not offer a direct statistical significance quantification for the estimated links. In this article, we adopt a matrix normal distribution framework and formulate the brain connectivity analysis as a precision matrix hypothesis testing problem. Based on the separable spatial-temporal dependence structure, we develop oracle and data-driven procedures to test both the global hypothesis that all spatial locations are conditionally independent, and simultaneous tests for identifying conditionally dependent spatial locations with false discovery rate control. Our theoretical results show that the data-driven procedures perform asymptotically as well as the oracle procedures and enjoy certain optimality properties. The empirical finite-sample performance of the proposed tests is studied via intensive simulations, and the new tests are applied to a real electroencephalography dataset.

The assessment of patients’ functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set.
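The predictive-mean-matching step, stripped of the IRT layer used in the abstract, can be sketched as follows: fit a model on complete cases, then impute each missing response with the observed response of the donor whose predicted mean is closest. A hypothetical single-donor version in numpy:

```python
import numpy as np

def pmm_impute(x, y, missing):
    """Predictive mean matching with a single nearest donor.
    Fits y ~ x on observed cases, then imputes each missing y with the
    observed y whose predicted mean is closest."""
    obs = ~missing
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta                      # predicted means for all cases
    y_imp = y.copy()
    for i in np.where(missing)[0]:
        donor = np.argmin(np.abs(pred[obs] - pred[i]))
        y_imp[i] = y[obs][donor]         # borrow the donor's observed value
    return y_imp

# Illustrative data with 30% of responses missing
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 2 * x + rng.normal(0, 0.5, 300)
missing = rng.random(300) < 0.3
y_mis = y.copy()
y_mis[missing] = np.nan
y_imp = pmm_impute(x, y_mis, missing)
err = np.abs(y_imp[missing] - 2 * x[missing]).mean()
```

Because imputed values are always real observed responses from donors, this approach preserves the discrete rating-scale structure of questionnaire items — the property that makes it attractive for equating assessment instruments.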

An important goal of censored quantile regression is to provide reliable predictions of survival quantiles, which are often reported in practice to offer robust and comprehensive biomedical summaries. However, formal methods for evaluating and comparing *working* quantile regression models in terms of their performance in predicting survival quantiles have been lacking, especially when the working models are subject to model mis-specification. In this article, we propose a sensible and rigorous framework to fill this gap. We introduce and justify a predictive performance measure defined based on the check loss function. We derive estimators of the proposed predictive performance measure and study their distributional properties and the corresponding inference procedures. More importantly, we develop model comparison procedures that enable thorough evaluations of model predictive performance among nested or non-nested models. Our proposals properly accommodate random censoring of the survival outcome and the realistic complication of model mis-specification, and thus are generally applicable. Extensive simulations and a real data example demonstrate satisfactory performance of the proposed methods in real-life settings.
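For concreteness, the check loss referred to above is rho_tau(u) = u * (tau - 1{u < 0}), and its defining property — that average check loss is minimized at the tau-th quantile — can be verified numerically. The snippet below illustrates the loss function only, not the censored-data estimators of the article.

```python
import numpy as np

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - 1{u < 0}), the quantile "check" function
    return u * (tau - (u < 0))

# The tau-th sample quantile minimizes average check loss over constants:
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
grid = np.linspace(-2, 2, 401)
risk = [check_loss(y - c, 0.25).mean() for c in grid]
best = grid[np.argmin(risk)]   # should sit near the 0.25 sample quantile
```

The asymmetry of the loss (cost 0.25 per unit of over-prediction versus 0.75 per unit of under-prediction here) is what ties the measure to a specific survival quantile rather than the mean.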

This article investigates a generalized semiparametric varying-coefficient model for longitudinal data that can flexibly model three types of covariate effects: time-constant effects, time-varying effects, and covariate-varying effects. Different link functions can be selected to provide a rich family of models for longitudinal data. The model assumes that the time-varying effects are unspecified functions of time and the covariate-varying effects are parametric functions of an exposure variable specified up to a finite number of unknown parameters. The estimation procedure is developed using local linear smoothing and profile weighted least squares estimation techniques. Hypothesis testing procedures are developed to test the parametric functions of the covariate-varying effects. The asymptotic distributions of the proposed estimators are established. A working formula for bandwidth selection is discussed and examined through simulations. Our simulation study shows that the proposed methods have satisfactory finite sample performance. The proposed methods are applied to the ACTG 244 clinical trial of HIV infected patients being treated with Zidovudine to examine the effects of antiretroviral treatment switching before and after HIV develops the T215Y/F drug resistance mutation. Our analysis shows benefits of treatment switching to the combination therapies as compared to continuing with ZDV monotherapy before and after developing the 215-mutation.

Prognostic survival models are commonly evaluated in terms of both their calibration and their discrimination. Comparing observed and predicted survival curves can assess calibration, while discrimination is typically summarized through comparison of the properties of “cases” or subjects who experience an event, and the properties of “controls” represented by event-free individuals. For binary data, discrimination is characterized either by using the relative ranks of cases and controls and a receiver operating characteristic (ROC) curve, or by summarizing the magnitude of risk placed on cases and controls through calculation of the discrimination slope (DS). In this article, we propose a risk-based measure of time-varying discrimination that generalizes the discrimination slope to allow use with incident events and hazard models. We refer to the new measure as the hazard discrimination summary (HDS) since it compares the relative risk among incident cases to their associated dynamic risk set controls. We introduce both a model-based estimation procedure that adopts the Cox model, and an alternative approach that locally relaxes the proportional hazards assumption. We illustrate the proposed methods using both a benchmark survival data set, and an oncology study where primary interest is in the time-varying performance of candidate biomarkers.
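For reference, the binary-outcome discrimination slope that HDS generalizes is simply the mean predicted risk among cases minus the mean predicted risk among controls; a two-line sketch:

```python
import numpy as np

def discrimination_slope(risk, event):
    """DS = mean predicted risk in cases minus mean predicted risk in controls."""
    risk, event = np.asarray(risk, float), np.asarray(event, bool)
    return risk[event].mean() - risk[~event].mean()

# Toy example: well-separated risks for two cases and two controls
ds = discrimination_slope([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

The HDS measure replaces this static case/control split with incident cases compared against their dynamic risk-set controls under a hazard model.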

Prostate cancer patients are closely followed after the initial therapy and salvage treatment may be prescribed to prevent or delay cancer recurrence. The salvage treatment decision is usually made dynamically based on the patient's evolving history of disease status and other time-dependent clinical covariates. A multi-center prostate cancer observational study has provided us data on longitudinal prostate specific antigen (PSA) measurements, time-varying salvage treatment, and cancer recurrence time. These data enable us to estimate the best dynamic regime of salvage treatment, while accounting for the complicated confounding of time-varying covariates present in the data. A random-forest-based method is used to model the probability of regime adherence and inverse probability weights are used to account for the complexity of selection bias in regime adherence. The optimal regime is then identified by the largest restricted mean survival time. We conduct simulation studies with different PSA trends to mimic both simple and complex regime adherence mechanisms. The proposed method can efficiently accommodate complex and possibly unknown adherence mechanisms, and it is robust to cases where the proportional hazards assumption is violated. We apply the method to data collected from the observational study and estimate the best salvage treatment regime in managing the risk of prostate cancer recurrence.
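The regime-ranking criterion, restricted mean survival time, is the area under the survival curve up to a horizon tau. A minimal Kaplan–Meier-based sketch, assuming untied event times (`km_rmst` is a hypothetical name, and the abstract's method additionally applies inverse probability weights):

```python
import numpy as np

def km_rmst(time, event, tau):
    """Kaplan-Meier survival curve, then area under it up to tau (the RMST)."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(event, float)[order]
    n = len(t)
    at_risk = n - np.arange(n)
    surv = np.cumprod(1.0 - d / at_risk)   # S(t) just after each time point
    t_prev, s_prev, area = 0.0, 1.0, 0.0
    for ti, si in zip(t, surv):
        if ti >= tau:
            break
        area += s_prev * (ti - t_prev)     # step-function integration
        t_prev, s_prev = ti, si
    return area + s_prev * (tau - t_prev)

# Sanity check against the closed form for exponential survival:
# RMST(tau) = (1 - exp(-lambda * tau)) / lambda, here lambda = 1, tau = 2.
rng = np.random.default_rng(0)
t = rng.exponential(1.0, 5000)
d = np.ones_like(t)                        # no censoring in this illustration
r = km_rmst(t, d, tau=2.0)
```

Unlike the hazard ratio, RMST remains interpretable when proportional hazards fails, which is why it is a natural criterion for comparing regimes here.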

It is standard practice for covariates to enter a parametric model through a single distributional parameter of interest, for example, the scale parameter in many standard survival models. Indeed, the well-known proportional hazards model is of this kind. In this article, we discuss a more general approach whereby covariates enter the model through *more than one* distributional parameter simultaneously (e.g., scale *and* shape parameters). We refer to this practice as “multi-parameter regression” (MPR) modeling and explore its use in a survival analysis context. We find that multi-parameter regression leads to more flexible models which can offer greater insight into the underlying data generating process. To illustrate the concept, we consider the two-parameter Weibull model which leads to time-dependent hazard ratios, thus relaxing the typical proportional hazards assumption and motivating a new test of proportionality. A novel variable selection strategy is introduced for such multi-parameter regression models. It accounts for the correlation arising between the estimated regression coefficients in two or more linear predictors—a feature which has not been considered by other authors in similar settings. The methods discussed have been implemented in the mpr package in R.
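To make the time-dependent hazard ratio concrete: if a binary covariate x enters both the Weibull scale and shape parameters through log links, the hazard ratio between x = 1 and x = 0 varies over time whenever the shape coefficient is nonzero. A sketch with hypothetical parameter values (the mpr package's actual parameterization may differ):

```python
import numpy as np

def weibull_hazard(t, x, a0, a1, b0, b1):
    """Weibull hazard lam * gam * t**(gam - 1), with the covariate x entering
    both the scale (lam) and shape (gam) parameters via log links."""
    lam = np.exp(a0 + a1 * x)
    gam = np.exp(b0 + b1 * x)
    return lam * gam * t ** (gam - 1)

t = np.linspace(0.1, 5, 50)
# MPR: x shifts scale AND shape, so the hazard ratio changes over time
hr_mpr = weibull_hazard(t, 1, 0.0, 0.5, 0.0, 0.3) / weibull_hazard(t, 0, 0.0, 0.5, 0.0, 0.3)
# Scale-only regression: the hazard ratio is constant (proportional hazards)
hr_ph = weibull_hazard(t, 1, 0.0, 0.5, 0.0, 0.0) / weibull_hazard(t, 0, 0.0, 0.5, 0.0, 0.0)
```

Setting the shape coefficient b1 to zero recovers proportionality, which is what motivates the proportionality test mentioned above: the null hypothesis is simply b1 = 0.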

Researchers estimating causal effects are increasingly challenged with decisions on how to best control for a potentially high-dimensional set of confounders. Typically, a single propensity score model is chosen and used to adjust for confounding, while the uncertainty surrounding which covariates to include in the propensity score model is often ignored, and failure to include even one important confounder will result in bias. We propose a practical and generalizable approach that overcomes the limitations described above through the use of model averaging. We develop and evaluate this approach in the context of double robust estimation. More specifically, we introduce the model averaged double robust (MA-DR) estimators, which account for model uncertainty in both the propensity score and outcome model through the use of model averaging. The MA-DR estimators are defined as weighted averages of double robust estimators, where each double robust estimator corresponds to a specific choice of the outcome model and the propensity score model. The MA-DR estimators extend the desirable double robustness property by achieving consistency under the much weaker assumption that either the true propensity score model or the true outcome model be within a specified, possibly large, class of models. Using simulation studies, we also assess small-sample properties, and find that MA-DR estimators can reduce mean squared error substantially, particularly when the set of potential confounders is large relative to the sample size. We apply the methodology to estimate the average causal effect of temozolomide plus radiotherapy versus radiotherapy alone on one-year survival in a cohort of 1887 Medicare enrollees who were diagnosed with glioblastoma between June 2005 and December 2009.
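A single double robust (AIPW) component of such an average can be sketched as follows; the simulation, the equal weights, and the candidate models below are illustrative, not the MA-DR weighting scheme of the article.

```python
import numpy as np

def aipw_ate(y, a, ps, m1, m0):
    """Augmented IPW (double robust) ATE estimate, given propensity scores ps
    and outcome-model predictions m1, m0 under treatment and control."""
    t1 = m1 + a * (y - m1) / ps
    t0 = m0 + (1 - a) * (y - m0) / (1 - ps)
    return float((t1 - t0).mean())

# Simulated example with known treatment effect 2 and true propensity score
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
ps = 1.0 / (1.0 + np.exp(-0.5 * x))        # true propensity score
a = rng.binomial(1, ps)
y = x + 2.0 * a + rng.normal(size=n)
est_correct = aipw_ate(y, a, ps, x + 2.0, x)
# Double robustness: a crude outcome model (group means, ignoring x) with the
# correct propensity score is still consistent.
est_wrong_om = aipw_ate(y, a, ps,
                        np.full(n, y[a == 1].mean()),
                        np.full(n, y[a == 0].mean()))
# A model-averaged estimator is a weighted mean over candidate model pairs;
# the equal weights here are placeholders for the data-driven MA-DR weights.
ma_est = np.average([est_correct, est_wrong_om], weights=[0.5, 0.5])
```

The averaging protects against committing to a single, possibly misspecified, propensity score or outcome model, at the cost of having to choose (or estimate) the weights.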

We study threshold regression models that allow the relationship between the outcome and a covariate of interest to change across a threshold value in the covariate. In particular, we focus on continuous threshold models, which experience no jump at the threshold. Continuous threshold regression functions can provide a useful summary of the association between outcome and the covariate of interest, because they offer a balance between flexibility and simplicity. Motivated by collaborative work studying immune-response biomarkers of infectious disease transmission, we study estimation of continuous threshold models in this article, with particular attention to inference under model misspecification. We derive the limiting distribution of the maximum likelihood estimator, and propose both Wald and test-inversion confidence intervals. We evaluate finite sample performance of our methods, compare them with bootstrap confidence intervals, and provide guidelines for practitioners to choose the most appropriate method in real data analysis. We illustrate the application of our methods with examples from the HIV-1 immune correlates studies.
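A continuous threshold (broken-stick) model y = b0 + b1*x + b2*(x - tau)+ is linear in its coefficients once tau is fixed, so it can be fit by profiling tau over a grid. A least-squares sketch (the article works with maximum likelihood; names and simulated values are hypothetical):

```python
import numpy as np

def fit_threshold(x, y, grid):
    """Profile the threshold tau over a grid; at each tau the model is linear
    in (1, x, (x - tau)_+), so OLS gives the profiled fit."""
    best = None
    for tau in grid:
        X = np.column_stack([np.ones_like(x), x, np.clip(x - tau, 0, None)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(((y - X @ beta) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, tau, beta)
    return best[1], best[2]   # estimated threshold and coefficients

# Simulated broken-stick data: slope 0.5 below the threshold at 4, 2.5 above
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 1 + 0.5 * x + 2.0 * np.clip(x - 4, 0, None) + rng.normal(0, 0.3, 500)
tau_hat, beta_hat = fit_threshold(x, y, np.linspace(1, 9, 161))
```

Continuity at the threshold (no jump) is built in by construction, since the hinge term (x - tau)+ is zero at tau; only the slope changes there.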

Understanding the complex interplay among protein coding genes and regulatory elements requires rigorous interrogation with analytic tools designed for discerning the relative contributions of overlapping genomic regions. To this end, we offer a novel application of Bayesian variable selection (BVS) for classifying genomic class level associations using existing large meta-analysis summary level resources. This approach is applied using the expectation maximization variable selection (EMVS) algorithm to typed and imputed SNPs across 502 protein coding genes (PCGs) and 220 long intergenic non-coding RNAs (lncRNAs) that overlap 45 known loci for coronary artery disease (CAD) using publicly available Global Lipids Genetics Consortium (GLGC) (Teslovich et al., 2010; Willer et al., 2013) meta-analysis summary statistics for low-density lipoprotein cholesterol (LDL-C). The analysis reveals 33 PCGs and three lncRNAs across 11 loci with 50% posterior probabilities for inclusion in an additive model of association. The findings are consistent with previous reports, while providing some new insight into the architecture of LDL-cholesterol to be investigated further. As genomic taxonomies continue to evolve, additional classes, such as enhancer elements and splicing regions, can easily be layered into the proposed analysis framework. Moreover, application of this approach to alternative publicly available meta-analysis resources, or more generally as a post-analytic strategy to further interrogate regions that are identified through single point analysis, is straightforward. All coding examples are implemented in R version 3.2.1 and provided as supplemental material.

Quantitative trait locus analysis has been used as an important tool to identify markers where the phenotype or quantitative trait is linked with the genotype. Most existing tests for single locus association with quantitative traits aim at the detection of the mean differences across genotypic groups. However, recent research has revealed functional genetic loci that affect the variance of traits, known as variability-controlling quantitative trait loci. In addition, it has been suggested that many genotypes have both mean and variance effects, while the mean effects or variance effects alone may not be strong enough to be detected. Existing methods that account for unequal variances include Levene's test, the Lepage test, and the *D*-test, but they suffer from a lack of robustness or a lack of power. We propose a semiparametric model and a novel pairwise conditional likelihood ratio test. Specifically, the semiparametric model is designed to identify the combined differences in higher moments among genotypic groups. The pairwise likelihood is constructed based on a conditioning procedure, where the unknown reference distribution is eliminated. We show that the proposed pairwise likelihood ratio test has a simple asymptotic chi-square distribution, which does not require permutation or bootstrap procedures. Simulation studies show that the proposed test performs well in controlling the Type I error and has competitive power in identifying differences across genotypic groups. In addition, the proposed test has certain robustness to model mis-specifications. The proposed test is illustrated by an example of identifying both mean and variance effects on body mass index using the Framingham Heart Study data.

Genetic risk prediction is an important component of individualized medicine, but prediction accuracies remain low for many complex diseases. A fundamental limitation is the sample sizes of the studies on which the prediction algorithms are trained. One way to increase the effective sample size is to integrate information from previously existing studies. However, it can be difficult to find existing data that examine the target disease of interest, especially if that disease is rare or poorly studied. Furthermore, individual-level genotype data from these auxiliary studies are typically difficult to obtain. This article proposes a new approach to integrative genetic risk prediction of complex diseases with binary phenotypes. It accommodates possible heterogeneity in the genetic etiologies of the target and auxiliary diseases using a tuning parameter-free non-parametric empirical Bayes procedure, and can be trained using only auxiliary summary statistics. Simulation studies show that the proposed method can provide superior predictive accuracy relative to non-integrative as well as integrative classifiers. The method is applied to a recent study of pediatric autoimmune diseases, where it substantially reduces prediction error for certain target/auxiliary disease combinations. The proposed method is implemented in the R package ssa.

Alternating logistic regressions is an estimating equations procedure used to model marginal means of correlated binary outcomes while simultaneously specifying a within-cluster association model for log odds ratios of outcome pairs. A recent generalization of alternating logistic regressions, known as orthogonalized residuals, is extended to incorporate finite-sample adjustments in the estimation of the log odds ratio model parameters when the number of clusters is moderately small. Bias adjustments are made both in the sandwich variance estimators and in the estimating equations for the association parameters. The proposed methods are demonstrated in a repeated cross-sectional cluster trial to reduce underage drinking in the United States, and in an analysis of dental caries incidence in a cluster randomized trial of 30 aboriginal communities in the Northern Territory of Australia. A simulation study demonstrates improved performance, with respect to bias and coverage, of the proposed estimators relative to those based on the uncorrected orthogonalized residuals procedure.

For a fallback randomized clinical trial design with a marker, Choai and Matsui (2015, *Biometrics* 71, 25–32) estimate the bias of the estimator of the treatment effect in the marker-positive subgroup conditional on the treatment effect not being statistically significant in the overall population. This is used to construct and examine conditionally bias-corrected estimators of the treatment effect for the marker-positive subgroup. We argue that it may not be appropriate to correct for conditional bias in this setting. Instead, we consider the unconditional bias of estimators of the treatment effect for marker-positive patients.

In precision medicine, a patient is treated with targeted therapies that are predicted to be effective based on the patient's baseline characteristics, such as biomarker profiles. Oftentimes, patient subgroups are unknown and must be learned through inference using observed data. We present SCUBA, a Subgroup ClUster-based Bayesian Adaptive design that aims to fulfill two simultaneous goals in a clinical trial: 1) to enrich the allocation of patients in each subgroup to their desirable precision treatments, and 2) to report multiple subgroup-treatment pairs (STPs). Using random partitions and semiparametric Bayesian models, SCUBA provides a coherent and probabilistic assessment of potential patient subgroups and their associated targeted therapies. Each STP can then be used in future confirmatory studies for regulatory approval. Along with extensive simulation studies, we present an application of SCUBA to an innovative clinical trial in gastroesophageal cancer.

Conventional distance sampling (CDS) methods assume that animals are uniformly distributed in the vicinity of lines or points. However, when animals move in response to observers before detection, or when lines or points are not located randomly, this assumption may fail. By formulating distance sampling models as survival models, we show that using time to first detection in addition to perpendicular distance (line transect surveys) or radial distance (point transect surveys) allows estimation of detection probability, and hence density, when the animal distribution in the vicinity of lines or points is non-uniform and unknown. We also show that times to detection can provide information about failure of the CDS assumption that detection probability is 1 at distance zero. We obtain a maximum likelihood estimator of line transect survey detection probability and effective strip half-width using times to detection, and we investigate its properties by simulation in situations where animals are nonuniformly distributed and their distribution is unknown. The estimator is found to perform well when detection probability at distance zero is 1. It allows unbiased estimates of density to be obtained in this case from surveys in which there has been responsive movement prior to animals coming within detectable range. When responsive movement continues within detectable range, estimates may be biased, but likely less so than estimates from methods that assume no responsive movement. We illustrate the methods by estimating primate density from a line transect survey in which animals are known to avoid the transect line, and dolphin density from a shipboard survey in which the animals are attracted to the vessel.

In randomized studies involving severely ill patients, functional outcomes are often unobserved due to missed clinic visits, premature withdrawal, or death. It is well known that if these unobserved functional outcomes are not handled properly, biased treatment comparisons can be produced. In this article, we propose a procedure for comparing treatments that is based on a composite endpoint that combines information on both the functional outcome and survival. We further propose a missing data imputation scheme and sensitivity analysis strategy to handle the unobserved functional outcomes not due to death. Illustrations of the proposed method are given by analyzing data from a recent non-small cell lung cancer clinical trial and a recent trial of sedation interruption among mechanically ventilated patients.

Survival model construction can be guided by goodness-of-fit techniques as well as measures of predictive strength. Here, we aim to bring these distinct techniques together within a single framework. The goal is to determine how best to characterize and code the effects of the variables, in particular time dependencies, when taken either singly or in combination with other related covariates. Simple graphical techniques can provide an immediate visual indication of goodness-of-fit and, in cases of departure from model assumptions, will point in the direction of a more involved and richer alternative model. These techniques are intuitive, and this intuition is backed up by formal theorems that underlie the process of building richer models from simpler ones. Measures of predictive strength are used in conjunction with these goodness-of-fit techniques, and, again, formal theorems show that these measures can help identify models closest to the unknown non-proportional hazards mechanism that we can suppose generates the observations. Illustrations from studies in breast cancer show how these tools can help guide the practical problem of efficient model construction for survival data.

We propose a subgroup identification approach for inferring optimal and interpretable personalized treatment rules with high-dimensional covariates. Our approach is based on a two-step greedy tree algorithm to pursue signals in a high-dimensional space. In the first step, we transform the treatment selection problem into a weighted classification problem that can utilize tree-based methods. In the second step, we adopt a newly proposed tree-based method, known as reinforcement learning trees, to detect features involved in the optimal treatment rules and to construct binary splitting rules. The method is further extended to right censored survival data by using the accelerated failure time model and introducing double weighting to the classification trees. The performance of the proposed method is demonstrated via simulation studies, as well as analyses of the Cancer Cell Line Encyclopedia (CCLE) data and the Tamoxifen breast cancer data.

In many biomedical studies involving correlated data, an outcome is repeatedly measured for each individual subject, and the number of these measurements is also treated as an observed outcome. This type of data has been referred to as multivariate random length data by Barnhart and Sampson (1995). A common approach to handling such data is to jointly model the multiple measurements and the random length. In previous literature, a key assumption is multivariate normality of the multiple measurements. Motivated by a reproductive study, we propose a new copula-based joint model that relaxes the normality assumption. Specifically, we adopt the Clayton–Oakes model for the multiple measurements, with flexible marginal distributions specified as semiparametric transformation models. The random length is modeled via a generalized linear model. We develop an approximate EM algorithm to derive parameter estimators; standard errors of the estimators are obtained through bootstrapping, and the finite-sample performance of the proposed method is investigated in simulation studies. We apply our method to the Mount Sinai Study of Women Office Workers (MSSWOW), in which women were prospectively followed for 1 year to study fertility.

In a sensitivity analysis in an observational study with a binary outcome, is it better to use all of the data or to focus on subgroups that are expected to experience the largest treatment effects? The answer depends on features of the data that may be difficult to anticipate, a trade-off between unknown effect sizes and known sample sizes. We propose a sensitivity analysis for an adaptive test similar to the Mantel–Haenszel test. The adaptive test performs two highly correlated analyses, one a focused analysis using a subgroup, the other a combined analysis using all of the data, correcting for multiple testing using the joint distribution of the two test statistics. Because the two component tests are highly correlated, this correction for multiple testing is small compared with, for instance, the Bonferroni inequality. The test attains the maximum design sensitivity of its two component tests. A simulation evaluates the power of a sensitivity analysis using the adaptive test. Two examples are presented. An R package, sensitivity2x2xk, implements the procedure.
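The size of the multiplicity correction can be read off the joint distribution of the two correlated test statistics. A Monte Carlo sketch (illustrative only, not the implementation in sensitivity2x2xk) of the critical value for the larger of two correlated standard normal statistics:

```python
import math, random

def max_test_critical(rho, alpha=0.05, draws=200_000, seed=7):
    """Critical value for max(Z1, Z2), where (Z1, Z2) are standard normal
    with correlation rho, estimated by Monte Carlo from the joint law."""
    rng = random.Random(seed)
    stats = []
    for _ in range(draws):
        u = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        z1 = u
        z2 = rho * u + math.sqrt(1 - rho ** 2) * v  # correlated with z1
        stats.append(max(z1, z2))
    stats.sort()
    return stats[int((1 - alpha) * draws)]         # empirical 1-alpha quantile

c = max_test_critical(0.9)
```

For rho = 0.9 the cutoff falls strictly between the single-test value 1.645 and the independent-tests value of about 1.955, illustrating why the correction for two highly correlated analyses is much milder than Bonferroni.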

Integration of genomic data from multiple platforms has the capability to increase precision, accuracy, and statistical power in the identification of prognostic biomarkers. A fundamental problem faced in many multi-platform studies is unbalanced sample sizes due to the inability to obtain measurements from all the platforms for all the patients in the study. We have developed a novel Bayesian approach that integrates multi-regression models to identify a small set of biomarkers that can accurately predict time-to-event outcomes. This method fully exploits the amount of available information across platforms and does not exclude any of the subjects from the analysis. Through simulations, we demonstrate the utility of our method and compare its performance to that of methods that do not borrow information across regression models. Motivated by The Cancer Genome Atlas kidney renal cell carcinoma dataset, our methodology provides novel insights missed by non-integrative models.

Personalized cancer therapy requires clinical trials with smaller sample sizes than trials involving unselected populations, because patients are divided into biomarker subgroups. For survival endpoints, exponential survival modeling has the potential to gain 35% efficiency or save 28% of the required sample size (Miller, 1983), making personalized therapy trials more feasible. However, exponential survival modeling has not been fully accepted in cancer research practice due to uncertainty about whether the exponential assumption holds. We propose a test for identifying violations of the exponential assumption using a reduced piecewise exponential approach. Compared with an alternative goodness-of-fit test, which suffers from inflated type I error rates under various censoring mechanisms, the proposed test maintains the correct type I error rate. We conduct power analyses using simulated data based on different types of cancer survival distributions in the SEER registry database, and demonstrate the implementation of this approach in existing cancer clinical trials.
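The flavor of such a check can be conveyed by a generic likelihood-ratio comparison of a single exponential fit against a piecewise exponential fit. This is a sketch under simplifying assumptions, not the authors' reduced piecewise exponential statistic; `status` is 1 for an event and 0 for censoring, and `cuts` are hypothetical interval boundaries:

```python
import math

def exp_loglik(rate, events, exposure):
    """Censored exponential log-likelihood (up to a constant)."""
    return events * math.log(rate) - rate * exposure

def piecewise_vs_exponential(times, status, cuts):
    """2*(piecewise exponential loglik - single exponential loglik);
    compare to chi-square with len(cuts) degrees of freedom."""
    bounds = [0.0] + list(cuts) + [float("inf")]
    ll_pw, d_tot, e_tot = 0.0, 0, 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        d = sum(1 for t, s in zip(times, status) if s == 1 and lo < t <= hi)
        e = sum(min(t, hi) - lo for t in times if t > lo)  # time at risk in (lo, hi]
        d_tot, e_tot = d_tot + d, e_tot + e
        if d > 0:                       # empty intervals contribute 0 at the MLE
            ll_pw += exp_loglik(d / e, d, e)
    return 2 * (ll_pw - exp_loglik(d_tot / e_tot, d_tot, e_tot))

# a constant-hazard sample yields a statistic of (essentially) zero
flat = piecewise_vs_exponential([1, 1, 1, 1, 5, 5, 5, 5], [1] * 8, [2.0])
# a sample whose hazard differs across intervals yields a positive statistic
nonconstant = piecewise_vs_exponential([0.5] * 4 + [4.0] * 4, [1] * 8, [1.0])
```

The interval-specific MLE of the hazard is simply events divided by time at risk, which is what makes the piecewise model convenient as an omnibus alternative.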

Group testing, where individuals are tested initially in pools, is widely used to screen a large number of individuals for rare diseases. Triggered by the recent development of assays that detect multiple infections at once, screening programs now involve testing individuals in pools for multiple infections simultaneously. Tebbs, McMahan, and Bilder (2013, *Biometrics*) recently evaluated the performance of a two-stage hierarchical algorithm used to screen for chlamydia and gonorrhea as part of the Infertility Prevention Project in the United States. In this article, we generalize this work to accommodate a larger number of stages. To derive the operating characteristics of higher-stage hierarchical algorithms with more than one infection, we view the pool decoding process as a time-inhomogeneous, finite-state Markov chain. This conceptualization enables us to derive closed-form expressions for the expected number of tests and classification accuracy rates in terms of transition probability matrices. When applied to chlamydia and gonorrhea testing data from four states (Region X of the United States Department of Health and Human Services), higher-stage hierarchical algorithms provide, on average, an estimated 11% reduction in the number of tests when compared to two-stage algorithms. For applications with rarer infections, we show theoretically that this percentage reduction can be much larger.
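The economics behind hierarchical algorithms are easiest to see in the classical one-infection, perfect-assay setting, a simplification of the article's multi-infection Markov chain formulation. Here `p` is prevalence, and the pool sizes `n` and `m` are illustrative choices:

```python
def dorfman_tests_per_person(p, n):
    """Expected tests per individual under two-stage (Dorfman) testing:
    one pooled test per n people, plus n retests if the pool is positive.
    Assumes a single infection with prevalence p and a perfect assay."""
    pool_pos = 1 - (1 - p) ** n
    return 1 / n + pool_pos

def three_stage_tests_per_person(p, n, m):
    """Three-stage version: a master pool of n, subpools of m, then
    individual retests within positive subpools."""
    assert n % m == 0
    pool_pos = 1 - (1 - p) ** n
    sub_pos = 1 - (1 - p) ** m   # a positive subpool implies a positive pool
    return (1 + (n // m) * pool_pos + (n // m) * m * sub_pos) / n

# at 5% prevalence, pools of 5 cut testing roughly in half vs individual tests
two_stage = dorfman_tests_per_person(0.05, 5)
# at 1% prevalence, a 25:5:1 three-stage scheme beats a two-stage pool of 25
three_stage = three_stage_tests_per_person(0.01, 25, 5)
```

The rarer the infection, the larger the pools can be and the more an extra stage saves, which mirrors the article's theoretical result for rarer infections.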

In many scientific and engineering fields, advanced experimental and computing technologies are producing data that are not just high dimensional, but also internally structured. For instance, statistical units may have heterogeneous origins from distinct studies or subpopulations, and features may be naturally partitioned based on experimental platforms generating them, or on information available about their roles in a given phenomenon. In a regression analysis, exploiting this known structure in the predictor dimension reduction stage that precedes modeling can be an effective way to integrate diverse data. To pursue this, we propose a novel Sufficient Dimension Reduction (SDR) approach that we call *structured Ordinary Least Squares* (sOLS). This combines ideas from existing SDR literature to merge reductions performed within groups of samples and/or predictors. In particular, it leads to a version of OLS for grouped predictors that requires far less computation than recently proposed groupwise SDR procedures, and provides an informal yet effective variable selection tool in these settings. We demonstrate the performance of sOLS by simulation and present a first application to genomic data. The R package “sSDR,” publicly available on CRAN, includes all procedures necessary to implement the sOLS approach.

Bivariate survival data arise frequently in familial association studies of chronic disease onset, as well as in clinical trials and observational studies with multiple time to event endpoints. The association between two event times is often scientifically important. In this article, we examine the association via a novel quantile association measure, which describes the dynamic association as a function of the quantile levels. The quantile association measure is free of marginal distributions, allowing direct evaluation of the underlying association pattern at different locations of the event times. We propose a nonparametric estimator for quantile association, as well as a semiparametric estimator that is superior in smoothness and efficiency. The proposed methods possess desirable asymptotic properties including uniform consistency and root-n convergence. They demonstrate satisfactory numerical performances under a range of dependence structures. An application of our methods suggests interesting association patterns between time to myocardial infarction and time to stroke in an atherosclerosis study.

Model diagnosis, an important issue in statistical modeling, has not yet been addressed adequately for cure models. We focus on mixture cure models in this work and propose some residual-based methods to examine the fit of the mixture cure model, particularly the fit of the latency part of the mixture cure model. The new methods extend the classical residual-based methods to the mixture cure model. Numerical work shows that the proposed methods are capable of detecting lack-of-fit of a mixture cure model, particularly in the latency part, such as outliers, improper covariate functional form, or nonproportionality in hazards if the proportional hazards assumption is employed in the latency part. The methods are illustrated with two real data sets that were previously analyzed with mixture cure models.

Estimating the accuracy of a biomarker index when only imperfect reference-test information is available is usually performed under the assumption of conditional independence between the biomarker and the imperfect reference-test values. We propose to define a latent, normally distributed tolerance variable underlying the observed dichotomous imperfect reference-test results. Subsequently, we construct a Bayesian latent-class model based on the joint multivariate normal distribution of the latent tolerance and biomarker values, conditional on latent true disease status, which allows accounting for conditional dependence. The accuracy of the continuous biomarker index is quantified by the AUC of the optimal linear biomarker combination. Model performance is evaluated using a simulation study and two sets of data on Alzheimer's disease patients (one from the memory-clinic-based Amsterdam Dementia Cohort and one from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database). Simulation results indicate adequate model performance, and bias in the estimates of diagnostic accuracy when conditional independence is assumed but does not in fact hold. In the considered case studies, conditional dependence between some of the biomarkers and the imperfect reference test is detected. However, making the conditional independence assumption does not lead to any marked differences in the estimates of diagnostic accuracy.
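For intuition on the accuracy measure: under a two-group multivariate normal model with common covariance Σ and mean difference Δ, the optimal linear biomarker combination has the closed-form AUC Φ(√(Δᵀ Σ⁻¹ Δ)), the classical Su–Liu result. A two-biomarker sketch of that formula (illustrative only, not the article's Bayesian latent-class estimator):

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def best_linear_auc(delta, sigma):
    """AUC of the optimal linear combination of two normal biomarkers with
    common 2x2 covariance sigma and between-group mean difference delta:
    AUC = Phi(sqrt(delta' sigma^{-1} delta))."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    q = sum(delta[i] * inv[i][j] * delta[j] for i in range(2) for j in range(2))
    return Phi(math.sqrt(q))

# one informative, one uninformative biomarker, identity covariance: AUC = Phi(1)
auc = best_linear_auc((1.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
```

Conditional dependence between markers changes Σ, and thereby the AUC, which is why ignoring it can bias the accuracy estimate.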

The sequential multiple assignment randomized trial (SMART) is a powerful design for studying dynamic treatment regimes (DTRs) that allows causal comparisons of DTRs. To handle practical challenges of SMART, we propose a SMART with Enrichment (SMARTER) design, which performs stage-wise enrichment for SMART. SMARTER can improve design efficiency, shorten the recruitment period, and partially reduce trial duration, making SMART more practical with limited time and resources. Specifically, at each subsequent stage of a SMART, we enrich the study sample with new patients who have received previous stages' treatments in a naturalistic fashion without randomization, and randomize them only among the current-stage treatment options. One extreme case of SMARTER is to synthesize separate independent single-stage randomized trials with patients who have received previous-stage treatments. We show that, under certain assumptions, data from SMARTER allow for unbiased estimation of DTRs just as SMART does. Furthermore, we show analytically that the efficiency gain of the new design over SMART can be significant, especially when the dropout rate is high. Lastly, extensive simulation studies demonstrate the performance of the SMARTER design, and sample size estimation is presented in a scenario informed by real data from a SMART study.

This article proposes a method to address the problem that arises when covariates in a regression setting are not Gaussian, which may give rise to approximately mixture-distributed errors, or when a true mixture of regressions produced the data. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with the help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot that yields a consistent estimator of the set of active covariates in the model. Through simulations, we demonstrate that the new procedure can substantially improve on existing procedures in the context of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice.

Standard false discovery rate (FDR) procedures can provide misleading inference when testing multiple null hypotheses with heterogeneous multinomial data. For example, in the motivating study the goal is to identify species of bacteria near the roots of wheat plants (rhizobacteria) that are moderately or strongly associated with productivity. However, standard procedures discover the most abundant species even when their association is weak and fail to discover many moderate and strong associations when the species are not abundant. This article provides a new FDR-controlling method based on a finite mixture of multinomial distributions and shows that it tends to discover more moderate and strong associations and fewer weak associations when the data are heterogeneous across tests. The new method is applied to the rhizobacteria data and performs favorably over competing methods.
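The "standard FDR procedures" referred to above are typified by the Benjamini–Hochberg step-up rule, the usual benchmark for mixture-based alternatives. A minimal sketch with made-up p-values:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values,
    where k is the largest rank i with p_(i) <= i*q/m.
    Returns the indices of the rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank                 # keep the largest qualifying rank
    return sorted(order[:k])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
```

Under independence (or positive regression dependence) across tests the rule controls the FDR at level q; the heterogeneity across multinomial tests described above is precisely where such a one-size-fits-all threshold starts to misbehave.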

In cancer research, interest frequently centers on factors influencing a latent event that must precede a terminal event. In practice it is often impossible to observe the latent event precisely, making inference about this process difficult. To address this problem, we propose a joint model for the unobserved time to the latent and terminal events, with the two events linked by the baseline hazard. Covariates enter the model parametrically as linear combinations that multiply, respectively, the hazard for the latent event and the hazard for the terminal event conditional on the latent one. We derive the partial likelihood estimators for this problem assuming the latent event is observed, and propose a profile likelihood-based method for estimation when the latent event is unobserved. The baseline hazard in this case is estimated nonparametrically using the EM algorithm, which allows for closed-form Breslow-type estimators at each iteration, bringing improved computational efficiency and stability compared with maximizing the marginal likelihood directly. We present simulation studies to illustrate the finite-sample properties of the method; its use in practice is demonstrated in the analysis of a prostate cancer data set.

Cancer survival comparisons between cohorts are often assessed by estimates of relative or net survival. These measure the difference in mortality between those diagnosed with the disease and the general population. For such comparisons, methods are needed to standardize cohort structure (including age at diagnosis) and all-cause mortality rates in the general population. Standardized non-parametric relative survival measures are evaluated by determining how well they (i) ensure the correct rank ordering, (ii) allow for differences in covariate distributions, and (iii) possess robustness and maximal estimation precision. Two relative survival families that subsume the Ederer-I, Ederer-II, and Pohar-Perme statistics are assessed. The aforementioned statistics do not meet our criteria and are not invariant under a change of covariate distribution. Existing methods for standardization of these statistics are either not invariant to changes in the general population mortality or are not robust. Standardized statistics and estimators are developed to address these deficiencies. They use a reference distribution for covariates such as age, and a reference population mortality survival distribution that is recommended to approach zero with increasing age as fast as that of the cohort with the worst life expectancy. Estimators are compared using a breast-cancer survival example and computer simulation. The proposals are invariant and robust, and outperform current methods for standardizing the Ederer-II and Pohar-Perme estimators in simulations, particularly for extended follow-up.

In this article, we present a Bayesian hierarchical model for predicting a latent health state from longitudinal clinical measurements. Model development is motivated by the need to integrate multiple sources of data to improve clinical decisions about whether to remove or irradiate a patient's prostate cancer. Existing modeling approaches are extended to accommodate measurement error in cancer state determinations based on biopsied tissue, clinical measurements possibly not missing at random, and informative partial observation of the true state. The proposed model enables estimation of whether an individual's underlying prostate cancer is aggressive, requiring surgery and/or radiation, or indolent, permitting continued surveillance. These individualized predictions can then be communicated to clinicians and patients to inform decision-making. We demonstrate the model with data from a cohort of low-risk prostate cancer patients at Johns Hopkins University and assess predictive accuracy among a subset for whom true cancer state is observed. Simulation studies confirm model performance and explore the impact of adjusting for informative missingness on true state predictions. R code is provided in an online supplement and at http://github.com/rycoley/prediction-prostate-surveillance.

Motivated by a study conducted to evaluate the associations of 51 inflammatory markers and lung cancer risk, we propose several approaches of varying computational complexity for analyzing multiple correlated markers that are also censored due to lower and/or upper limits of detection, using likelihood-based sufficient dimension reduction (SDR) methods. We extend the theory and the likelihood-based SDR framework in two ways: (i) we accommodate censored predictors directly in the likelihood, and (ii) we incorporate variable selection. We find linear combinations that contain all the information that the correlated markers have on an outcome variable (i.e., are sufficient for modeling and prediction of the outcome) while accounting for censoring of the markers. These methods yield efficient estimators and can be applied to any type of outcome, including continuous and categorical. We illustrate and compare all methods using data from the motivating study and in simulations. We find that explicitly accounting for the censoring in the likelihood of the SDR methods can lead to appreciable gains in efficiency and prediction accuracy; these methods also outperform multiple imputation combined with standard SDR.

In biomedical research, a steep rise or decline in longitudinal biomarkers may indicate latent disease progression, which may subsequently cause patients to drop out of the study. Ignoring such informative drop-out can bias estimation of the longitudinal model, and a fully parametric specification may be insufficient to capture the complicated pattern of the longitudinal biomarkers. For longitudinal data with informative drop-out, we therefore develop a joint partially linear model, with the aim of recovering the trajectory of the longitudinal biomarker. Specifically, an arbitrary function of time, along with linear fixed and random covariate effects, is proposed in the model for the biomarker, while a flexible semiparametric transformation model is used to describe the drop-out mechanism. Advantages of this semiparametric joint modeling approach are the following: 1) it provides an easier interpretation, compared to standard nonparametric regression models, and 2) it is a natural way to control for common (observable and unobservable) prognostic factors that may affect both the longitudinal trajectory and the drop-out process. We describe a sieve maximum likelihood estimation procedure using the EM algorithm, where the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are considered for selecting the number of knots. We show that the proposed estimators achieve desirable asymptotic properties through empirical process theory. The proposed methods are evaluated by simulation studies and applied to prostate cancer data.

Case-cohort (Prentice, 1986) and nested case-control (Thomas, 1977) designs have been widely used as a cost-effective alternative to the full-cohort design. In this article, we propose an efficient likelihood-based estimation method for the accelerated failure time model under case-cohort and nested case-control designs. An EM algorithm is developed to maximize the likelihood function and a kernel smoothing technique is adopted to facilitate the estimation in the M-step of the EM algorithm. We show that the proposed estimators for the regression coefficients are consistent and asymptotically normal. The asymptotic variance of the estimators can be consistently estimated using an EM-aided numerical differentiation method. Simulation studies are conducted to evaluate the finite-sample performance of the estimators and an application to a Wilms tumor data set is also given to illustrate the methodology.
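The kernel smoothing used to facilitate the M-step can be illustrated, in spirit, by a basic Gaussian-kernel Nadaraya–Watson smoother (a generic sketch, not the authors' exact estimator; `h` is the bandwidth):

```python
import math

def nw_smooth(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0: a Gaussian-kernel weighted
    average of the observed ys, with weights decaying in |x0 - x|/h."""
    w = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0]   # a noiseless linear trend for illustration
```

Smoothing replaces non-differentiable empirical quantities with smooth surrogates, which is what makes the M-step of such EM algorithms tractable.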

We propose a Bayesian non-parametric (BNP) framework for estimating the causal effects of mediation: the natural direct and indirect effects. The strategy is to do this in two parts. Part 1 is a flexible model (using BNP) for the observed data distribution. Part 2 is a set of uncheckable assumptions with sensitivity parameters that, in conjunction with Part 1, allow identification and estimation of the causal parameters, with uncertainty about these assumptions expressed via priors on the sensitivity parameters. For Part 1, we specify a Dirichlet process mixture of multivariate normals as a prior on the joint distribution of the outcome, mediator, and covariates. This approach allows us to obtain a (simple) closed form for each marginal distribution. For Part 2, we consider two sets of assumptions: (a) the standard sequential ignorability (Imai et al., 2010) and (b) a weakened set of conditional independence type assumptions introduced in Daniels et al. (2012), and propose sensitivity analyses for both. We use this approach to assess mediation in a physical activity promotion trial.

Cancer population studies based on cancer registry databases are widely conducted to address various research questions. In general, cancer registry databases do not collect information on cause of death. The net survival rate is defined as the survival rate that would be observed if subjects could not die from any cause other than the cancer under study. This counterfactual concept is widely used in analyses of cancer registry data. Perme, Stare, and Estève (2012) proposed a nonparametric estimator of the net survival rate under the assumption that the censoring time is independent of the survival time and covariates. Kodre and Perme (2013) proposed an inverse weighting estimator of the net survival rate under covariate-dependent censoring. An alternative approach to estimating the net survival rate under covariate-dependent censoring is to apply a regression model for the conditional net survival rate given covariates. In this article, we propose a new estimator of the net survival rate. The proposed estimator is doubly robust in the sense that it is consistent if at least one of the regression models, for the survival time or for the censoring time, is correctly specified. We examine the theoretical and empirical properties of the proposed estimator by asymptotic theory and simulation studies. We also apply the proposed method to cancer registry data for gastric cancer patients in Osaka, Japan.

The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, or whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration.

Finding rare variants and gene–environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer—there is evidence that variants on Chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior, rendering the assumption of G-E independence inappropriate. With the motivation of detecting GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I) by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, to incorporate uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet has well-controlled type I errors in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.

Highly active antiretroviral therapy (HAART) has proved effective in increasing CD4 counts in many randomized clinical trials. Because randomized trials have some limitations (e.g., short duration, highly selected subjects), it is interesting to assess the effect of treatments using observational studies. This is challenging because treatment is started preferentially in subjects with severe conditions. This general problem has been addressed using Marginal Structural Models (MSM) relying on the counterfactual formulation. Another approach to causality is based on dynamical models. We present three discrete-time dynamic models based on linear increments models (LIM): the first based on one difference equation for CD4 counts, the second with an equilibrium point, and the third based on a system of two difference equations, which allows jointly modeling CD4 counts and viral load. We also consider continuous-time models based on ordinary differential equations with non-linear mixed effects (ODE-NLME). These mechanistic models allow incorporating biological knowledge when available, which leads to increased statistical evidence for detecting a treatment effect. Because inference in ODE-NLME is numerically challenging and requires specific methods and software, LIMs are a valuable intermediate option in terms of consistency, precision, and complexity. We compare the different approaches in simulation and in illustration on the ANRS CO3 Aquitaine Cohort and the Swiss HIV Cohort Study.
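The linear increments idea can be sketched with a toy difference equation; the sketch below uses the second (equilibrium-point) variant with entirely hypothetical parameter values, where treatment starts mid follow-up:

```python
import numpy as np

def simulate_lim(n_subjects=200, n_visits=10, r=0.3, c_eq=600.0,
                 b_treat=40.0, sd=25.0, seed=1):
    """Simulate CD4 trajectories from a toy linear increments model
    with an equilibrium point: each increment pulls the count toward
    c_eq at rate r, plus a treatment bump once HAART begins."""
    rng = np.random.default_rng(seed)
    cd4 = np.empty((n_subjects, n_visits))
    cd4[:, 0] = rng.normal(350.0, 50.0, n_subjects)
    treated = (np.arange(n_visits) >= n_visits // 2).astype(float)
    for t in range(n_visits - 1):
        incr = r * (c_eq - cd4[:, t]) + b_treat * treated[t]
        cd4[:, t + 1] = cd4[:, t] + incr + rng.normal(0.0, sd, n_subjects)
    return cd4

cd4 = simulate_lim()
```

Fitting such a model amounts to regressing the observed increments on the current state and treatment indicator, which is why LIMs sit between MSMs and the heavier ODE-NLME machinery.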

It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroups. Although trial-and-error and one-size-fits-all approaches to treatment selection remain a common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less focus has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are nonlinear. Furthermore, the interaction-based test is scale-dependent and may fail to capture markers useful for predicting individualized treatment differences. In this article, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment difference. Simulation studies show that our proposed KM-based score test is more powerful than the Wald test when there is a nonlinear effect among the predictors and when the outcome is binary with a nonlinear link function. Furthermore, when there is high correlation among predictors and the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials.
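A minimal sketch of a kernel-machine score statistic of this flavor (a simplified stand-in for the proposed test, using a permutation p-value rather than the test's asymptotic null distribution; all choices below are illustrative):

```python
import numpy as np

def gaussian_kernel(X, rho=1.0):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho)

def km_score_test(y, X, n_perm=500, seed=2):
    """Toy kernel-machine score statistic Q = r'Kr, where r are
    residuals under the null of no marker effect (intercept only)."""
    rng = np.random.default_rng(seed)
    r = y - y.mean()
    K = gaussian_kernel(X)
    q_obs = r @ K @ r
    q_perm = []
    for _ in range(n_perm):
        rp = rng.permutation(r)
        q_perm.append(rp @ K @ rp)
    p_val = (1 + sum(qp >= q_obs for qp in q_perm)) / (1 + n_perm)
    return q_obs, p_val

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)   # nonlinear signal
q, p = km_score_test(y, X)
```

Because the kernel matrix encodes similarity between marker profiles, the quadratic form picks up nonlinear marker effects that a linear interaction Wald test can miss.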

Many new experimental treatments benefit only a subset of the population. Identifying the baseline covariate profiles of patients who benefit from such a treatment, rather than determining whether or not the treatment has a population-level effect, can substantially lessen the risk in undertaking a clinical trial and expose fewer patients to treatments that do not benefit them. The standard analyses for identifying patient subgroups that benefit from an experimental treatment either do not account for multiplicity, or focus on testing for the presence of treatment–covariate interactions rather than the resulting individualized treatment effects. We propose a Bayesian *credible subgroups* method to identify two bounding subgroups for the benefiting subgroup: one for which it is likely that all members simultaneously have a treatment effect exceeding a specified threshold, and another for which it is likely that no members do. We examine frequentist properties of the credible subgroups method via simulations and illustrate the approach using data from an Alzheimer's disease treatment trial. We conclude with a discussion of the advantages and limitations of this approach to identifying patients for whom the treatment is beneficial.
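The bounding-subgroup idea can be caricatured with posterior draws of individualized effects: greedily grow the subgroup while the simultaneous posterior probability that all included profiles benefit stays above a chosen level. This is only a sketch of the logic, not the authors' algorithm:

```python
import numpy as np

def credible_subgroup(draws, delta=0.0, level=0.8):
    """Greedy sketch of a 'bounding' benefiting subgroup: add covariate
    profiles in order of posterior mean effect as long as the posterior
    probability that ALL included profiles exceed delta stays >= level.
    draws: (n_mcmc, n_profiles) posterior draws of treatment effects."""
    order = np.argsort(-draws.mean(axis=0))     # most promising first
    included = []
    for j in order:
        cand = included + [j]
        simultaneous = (draws[:, cand] > delta).all(axis=1).mean()
        if simultaneous >= level:
            included = cand
        else:
            break
    return sorted(included)

rng = np.random.default_rng(3)
true_eff = np.array([2.0, 1.5, 0.0, -1.0])      # hypothetical profiles
draws = true_eff + 0.3 * rng.normal(size=(2000, 4))
D = credible_subgroup(draws)
```

The simultaneous (rather than pointwise) probability requirement is what distinguishes this construction from simply thresholding each profile's marginal posterior probability of benefit.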

Identification of novel biomarkers for risk prediction is important for disease prevention and optimal treatment selection. However, studies aiming to discover which biomarkers are useful for risk prediction often require the use of stored biological samples from large assembled cohorts, and thus the depletion of a finite and precious resource. To make efficient use of such stored samples, two-phase sampling designs are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare. Existing methods for analyzing data from two-phase studies focus primarily on single marker analysis or fitting the Cox regression model to combine information from multiple markers. However, the Cox model may not fit the data well. Under model misspecification, the composite score derived from the Cox model may not perform well in predicting the outcome. Under a general two-phase stratified cohort sampling design, we present a novel approach to combining multiple markers to optimize prediction by fitting a flexible nonparametric transformation model. Using inverse probability weighting to account for the outcome-dependent sampling, we propose to estimate the model parameters by maximizing an objective function which can be interpreted as a weighted C-statistic for survival outcomes. Regardless of model adequacy, the proposed procedure yields a sensible composite risk score for prediction. A major obstacle to making inference in two-phase studies is the correlation induced by finite population sampling, which prevents standard inference procedures, such as the bootstrap, from being used for variance estimation. We propose a resampling procedure to derive valid confidence intervals for the model parameters and the C-statistic accuracy measure. We illustrate the new methods with simulation studies and an analysis of a two-phase study of high-density lipoprotein cholesterol (HDL-C) subtypes for predicting the risk of coronary heart disease.
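A bare-bones version of a weighted C-statistic objective (unit weights here for a perfectly ranked toy example; in the two-phase design the weights would be inverse sampling probabilities):

```python
import numpy as np

def ipw_c_statistic(time, event, score, weight):
    """Weighted concordance for survival outcomes: over comparable
    pairs (i an observed event with time[i] < time[j]), the weighted
    fraction in which the higher risk score fails first."""
    num = den = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # censored subjects cannot index a pair
        for j in range(n):
            if time[i] < time[j]:
                w = weight[i] * weight[j]
                den += w
                if score[i] > score[j]:
                    num += w
    return num / den

time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 0])
score = np.array([4.0, 3.0, 2.0, 1.0])    # perfectly anti-ranked with time
w = np.ones(4)
c = ipw_c_statistic(time, event, score, w)
```

In the article's setting the composite score would be a parametric combination of markers, and this objective (smoothed for differentiability) is what the parameter estimates maximize.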

An intermediate response measure that accurately predicts efficacy in a new setting can reduce trial cost and time to product licensure. In this article, we define a *trial level general surrogate*, which is an intermediate response that can be used to accurately predict efficacy in a new setting. Methods for evaluating general surrogates have been developed previously. Many methods in the literature use trial level intermediate responses for prediction. However, all existing methods focus on surrogate evaluation and prediction in new settings, rather than comparison of candidate general surrogates, and few formalize the use of cross validation to quantify the expected prediction error. Our proposed method uses Bayesian non-parametric modeling and cross-validation to estimate the absolute prediction error for use in evaluating and comparing candidate trial level general surrogates. Simulations show that our method performs well across a variety of scenarios. We use our method to evaluate and to compare candidate trial level general surrogates in several multi-national trials of a pentavalent rotavirus vaccine. We identify at least one immune measure that has potential value as a trial level general surrogate and use it to predict efficacy in a new trial where the clinical outcome was not measured.

In this article, we develop new methods for estimating average treatment effects in observational studies, in settings with more than two treatment levels, assuming unconfoundedness given pretreatment variables. We emphasize propensity score subclassification and matching methods which have been among the most popular methods in the binary treatment literature. Whereas the literature has suggested that these particular propensity-based methods do not naturally extend to the multi-level treatment case, we show, using the concept of weak unconfoundedness and the notion of the generalized propensity score, that adjusting for a scalar function of the pretreatment variables removes all biases associated with observed pretreatment variables. We apply the proposed methods to an analysis of the effect of treatments for fibromyalgia. We also carry out a simulation study to assess the finite sample performance of the methods relative to previously proposed methods.

Semi-parametric methods are often used for the estimation of intervention effects on correlated outcomes in cluster-randomized trials (CRTs). When outcomes are missing at random (MAR), Inverse Probability Weighted (IPW) methods incorporating baseline covariates can be used to deal with informative missingness. Also, augmented generalized estimating equations (AUG) correct for imbalance in baseline covariates but need to be extended for MAR outcomes. However, in the presence of interactions between treatment and baseline covariates, neither method alone produces consistent estimates for the marginal treatment effect if the model for interaction is not correctly specified. We propose an AUG–IPW estimator that weights by the inverse of the probability of being a complete case and allows different outcome models in each intervention arm. This estimator is doubly robust (DR); it gives correct estimates whether the missing data process or the outcome model is correctly specified. We consider the problem of covariate interference which arises when the outcome of an individual may depend on covariates of other individuals. When interfering covariates are not modeled, the DR property prevents bias as long as covariate interference is not present simultaneously for the outcome and the missingness. An R package is developed implementing the proposed method. An extensive simulation study and an application to a CRT of HIV risk-reduction intervention in South Africa illustrate the method.
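The doubly robust structure can be sketched in a toy (non-clustered) simulation: with a deliberately misspecified outcome model but correct (here, known) randomization and missingness probabilities, the AUG-IPW-style estimate still recovers the marginal treatment effect. All modeling choices below are illustrative assumptions, not the R package's implementation:

```python
import numpy as np

def dr_estimate(y, a, r, arm, m_hat, pi_a, p_obs):
    """Doubly robust (AUG-IPW style) marginal mean for one arm:
    outcome-model prediction plus an inverse-probability-weighted
    residual correction among observed (complete-case) outcomes."""
    ind = (a == arm) & (r == 1)
    correction = np.where(ind, (y - m_hat) / (pi_a * p_obs), 0.0)
    return np.mean(m_hat + correction)

rng = np.random.default_rng(4)
n = 20000
x = rng.normal(size=n)
a = rng.integers(0, 2, size=n)                 # randomized, P(A=1) = 0.5
y_full = 2.0 + x + a * (1.0 + x) + rng.normal(size=n)   # true ATE = 1
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))       # true MAR missingness probs
r = (rng.random(n) < p_obs).astype(int)
y = np.where(r == 1, y_full, np.nan)           # outcomes missing when r = 0

m_wrong = np.zeros(n)                           # misspecified outcome model
ate = (dr_estimate(y, a, r, 1, m_wrong, 0.5, p_obs)
       - dr_estimate(y, a, r, 0, m_wrong, 0.5, p_obs))
```

Swapping the roles (correct outcome model, wrong missingness weights) would also leave the estimate consistent, which is the DR property the abstract describes.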

A clinical trial with a factorial design involves randomization of subjects to treatment *A* or its control and, within each group, further randomization to treatment *B* or its control. Under this design, one can assess the effects of treatments *A* and *B* on a clinical endpoint using all patients. One may additionally compare treatment *A*, treatment *B*, or the combination therapy to control. With multiple comparisons, however, it may be desirable to control the overall type I error, especially for regulatory purposes. Because the subjects overlap in the comparisons, the test statistics are generally correlated. By accounting for the correlations, one can achieve higher statistical power compared to the conventional Bonferroni correction. Herein, we derive the correlation between any two (stratified or unstratified) log-rank statistics for a factorial design with a survival time endpoint, such that the overall type I error for multiple treatment comparisons can be properly controlled. In addition, we allow for adjustment of prognostic factors in the treatment comparisons and conduct simultaneous inference on the effect sizes. We use simulation studies to show that the proposed methods perform well in realistic situations. We then provide an application to a recently completed randomized controlled clinical trial on alcohol dependence. Finally, we discuss extensions of our approach to other factorial designs and multiple endpoints.

This article considers sieve estimation in the Cox model with an unknown regression structure based on right-censored data. We propose a semiparametric pursuit method to simultaneously identify and estimate linear and nonparametric covariate effects based on B-spline expansions through a penalized group selection method with concave penalties. We show that the estimators of the linear effects and the nonparametric component are consistent. Furthermore, we establish the asymptotic normality of the estimator of the linear effects. To compute the proposed estimators, we develop a modified blockwise majorization descent algorithm that is efficient and easy to implement. Simulation studies demonstrate that the proposed method performs well in finite sample situations. We also use the primary biliary cirrhosis data to illustrate its application.

The log-rank test is widely used to compare two survival distributions in a randomized clinical trial, while partial likelihood (Cox, 1975) is the method of choice for making inference about the hazard ratio under the Cox (1972) proportional hazards model. The Wald 95% confidence interval of the hazard ratio may include the null value of 1 when the *p*-value of the log-rank test is less than 0.05. Peto et al. (1977) provided an estimator for the hazard ratio based on the log-rank statistic; the corresponding 95% confidence interval excludes the null value of 1 if and only if the *p*-value of the log-rank test is less than 0.05. However, Peto's estimator is not consistent, and the corresponding confidence interval does not have correct coverage probability. In this article, we construct the confidence interval by inverting the score test under the (possibly stratified) Cox model, and we modify the variance estimator such that the resulting score test for the null hypothesis of no treatment difference is identical to the log-rank test in the possible presence of ties. Like Peto's method, the proposed confidence interval excludes the null value if and only if the log-rank test is significant. Unlike Peto's method, however, this interval has correct coverage probability. An added benefit of the proposed confidence interval is that it tends to be more accurate and narrower than the Wald confidence interval. We demonstrate the advantages of the proposed method through extensive simulation studies and a colon cancer study.
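Peto's construction, which the article takes as its starting point, is easy to state in code: the one-step log hazard ratio estimate (O − E)/V from the log-rank statistic, with an interval that excludes 1 exactly when the log-rank z-score exceeds the critical value. A minimal sketch with made-up inputs:

```python
import numpy as np

def peto_hr_ci(o_minus_e, v, z=1.96):
    """Peto's one-step estimate of the log hazard ratio from the
    log-rank statistic: beta = (O - E)/V, with a 95% CI whose
    endpoints are (O - E +/- z*sqrt(V))/V, then exponentiated."""
    beta = o_minus_e / v
    half = z * np.sqrt(v) / v
    return np.exp(beta), (np.exp(beta - half), np.exp(beta + half))

# Interval excludes HR = 1 exactly when |O - E|/sqrt(V) > z,
# mirroring the consonance property the article builds on.
hr_sig, (lo_sig, hi_sig) = peto_hr_ci(o_minus_e=10.0, v=20.0)   # z-score ~2.24
hr_ns, (lo_ns, hi_ns) = peto_hr_ci(o_minus_e=5.0, v=20.0)       # z-score ~1.12
```

The article's contribution is to keep this consonance with the log-rank test while replacing Peto's inconsistent estimator by inversion of the Cox model score test, which restores correct coverage.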

Interval-censored failure time data occur in many fields such as demography, economics, medical research, and reliability, and many inference procedures on them have been developed (Sun, 2006; Chen, Sun, and Peace, 2012). However, most of the existing approaches assume that the mechanism that yields interval censoring is independent of the failure time of interest, and it is clear that this may not be true in practice (Zhang et al., 2007; Ma, Hu, and Sun, 2015). In this article, we consider regression analysis of case *K* interval-censored failure time data when the censoring mechanism may be related to the failure time of interest. For the problem, an estimated sieve maximum-likelihood approach is proposed for data arising from the proportional hazards frailty model, and for estimation, a two-step procedure is presented. In addition, the asymptotic properties of the proposed estimators of regression parameters are established, and an extensive simulation study suggests that the method works well. Finally, we apply the method to a set of real interval-censored data that motivated this study.

Motivated by an ongoing pediatric mental health care (PMHC) study, this article presents weakly structured methods for analyzing doubly censored recurrent event data where only coarsened information on censoring is available. The study extracted administrative records of emergency department visits from provincial health administrative databases. The available information for each individual subject is limited to a subject-specific time window that is itself only partially observed. To evaluate time-dependent effects of exposures, we adapt local linear estimation with right-censored survival times under the Cox regression model with time-varying coefficients (cf. Cai and Sun, *Scandinavian Journal of Statistics* 2003, **30**, 93–111). We establish the pointwise consistency and asymptotic normality of the regression parameter estimator, and examine its performance by simulation. The PMHC study illustrates the proposed approach throughout the article.

Joint models are used in ageing studies to investigate the association between longitudinal markers and a time-to-event, and have been extended to multiple markers and/or competing risks. The competing risk of death must be considered in the elderly because death and dementia have common risk factors. Moreover, in cohort studies, time-to-dementia is interval-censored since dementia is assessed intermittently. So subjects can develop dementia and die between two visits without being diagnosed. To study predementia cognitive decline, we propose a joint latent class model combining a (possibly multivariate) mixed model and an illness–death model handling both interval censoring (by accounting for a possible unobserved transition to dementia) and semi-competing risks. Parameters are estimated by maximum-likelihood handling interval censoring. The correlation between the marker and the times-to-events is captured by latent classes, homogeneous sub-groups with specific risks of death, dementia, and profiles of cognitive decline. We propose Markovian and semi-Markovian versions. Both approaches are compared to a joint latent-class model for competing risks through a simulation study, and applied in a prospective cohort study of cerebral and functional ageing to distinguish different profiles of cognitive decline associated with risks of dementia and death. The comparison highlights that among subjects with dementia, mortality depends more on age than on duration of dementia. This model distinguishes the so-called terminal predeath decline (among healthy subjects) from the predementia decline.

Longitudinal covariates in survival models are generally analyzed using random effects models. By framing the estimation of these survival models as a functional measurement error problem, semiparametric approaches such as the conditional score or the corrected score can be applied to find consistent estimators for survival model parameters without distributional assumptions on the random effects. However, in order to satisfy the standard assumptions of a survival model, the semiparametric methods in the literature only use covariate data before each event time. This suggests that these methods may make inefficient use of the longitudinal data. We propose an extension of these approaches that follows a generalization of the Rao–Blackwell theorem. A Monte Carlo error augmentation procedure is developed to utilize the entirety of the longitudinal information available. The efficiency improvement of the proposed semiparametric approach is confirmed theoretically and demonstrated in a simulation study. A real data set is analyzed as an illustration of a practical application.

Motivated by ultrahigh-dimensional biomarker screening studies, we propose a model-free screening approach tailored to censored lifetime outcomes. Our proposal is built upon the introduction of a new measure, the survival impact index (SII). By its design, SII sensibly captures the overall influence of a covariate on the outcome distribution, and can be estimated with familiar nonparametric procedures that do not require smoothing and are readily adaptable to handle lifetime outcomes under various censoring and truncation mechanisms. We provide large sample distributional results that facilitate the inference on SII in classical multivariate settings. More importantly, we investigate SII as an effective screener for ultrahigh-dimensional data, not relying on rigid regression model assumptions for real applications. We establish the sure screening property of the proposed SII-based screener. Extensive numerical studies are carried out to assess the performance of our method compared with other existing screening methods. A lung cancer microarray dataset is analyzed to demonstrate the practical utility of our proposals.

Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. This problem becomes even more difficult due to complications in modeling unknown interaction terms among high-dimensional variables. There is currently no variable selection method to overcome these limitations. Hence, in this article we propose a variable selection approach that is developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover the sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches including the linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of the nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as a function of the sparse scale parameters of the kernel machine to recover the sparsity of input variables whose relevance to the response is measured by the scale parameters. We also provide the asymptotic properties of our approach. We show that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed. A resampling procedure for our variable selection methodology is also proposed to improve the power.

We consider the problem of selecting covariates in a spatial regression model when the response is binary. A penalized likelihood-based approach has proved effective for simultaneous variable selection and estimation. In the context of a spatially dependent binary variable, a uniquely interpretable likelihood is not available; rather, a quasi-likelihood may be more suitable. We develop a penalized quasi-likelihood with spatial dependence for simultaneous variable selection and parameter estimation, along with an efficient computational algorithm. The theoretical properties, including asymptotic normality and consistency, are studied under an increasing domain asymptotics framework. An extensive simulation study is conducted to validate the methodology. Real data examples are provided for illustration and applicability. Although theoretical justification has not been established, we also investigate the empirical performance of the proposed penalized quasi-likelihood approach for spatial count data, to explore the suitability of this method for a general exponential family of distributions.

In many classical estimation problems, the parameter space has a boundary. In most cases, the standard asymptotic properties of the estimator do not hold when some of the underlying true parameters lie on the boundary. However, without knowledge of the true parameter values, confidence intervals constructed assuming that the parameters lie in the interior are generally over-conservative. A penalized estimation method is proposed in this article to address this issue. An adaptive lasso procedure is employed to shrink the parameters to the boundary, yielding oracle inference which adapts to whether or not the true parameters are on the boundary. When the true parameters are on the boundary, the inference is equivalent to that which would be achieved with a priori knowledge of the boundary, whereas when they lie in the interior, the inference is equivalent to standard inference in the interior of the parameter space. The method is demonstrated under two practical scenarios, namely the frailty survival model and linear regression with order-restricted parameters. Simulation studies and real data analyses show that the method performs well with realistic sample sizes and exhibits certain advantages over standard methods.

Combining multiple studies is frequently undertaken in biomedical research to increase sample sizes for statistical power improvement. We consider the marginal model for the regression analysis of repeated measurements collected in several similar studies with potentially different variances and correlation structures. It is of great importance to examine whether there exist common parameters across study-specific marginal models so that simpler models, sensible interpretations, and meaningful efficiency gain can be obtained. Combining multiple studies via the classical means of hypothesis testing involves a large number of simultaneous tests for all possible subsets of common regression parameters, which results in unduly large degrees of freedom and low statistical power. We develop a new method of *fused lasso with the adaptation of parameter ordering* (FLAPO) to scrutinize only adjacent-pair parameter differences, leading to a substantial reduction in the number of constraints involved. Our method enjoys the oracle properties as does the full fused lasso based on all pairwise parameter differences. We show that FLAPO gives estimators with smaller error bounds and better finite sample performance than the full fused lasso. We also establish a regularized inference procedure based on bias-corrected FLAPO. We illustrate our method through both simulation studies and an analysis of HIV surveillance data collected over five geographic regions in China, in which the presence or absence of common covariate effects is reflective of the relative effectiveness of regional policies on HIV control and prevention.

Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify *multiple regulators* or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.

The peptide microarray immunoassay simultaneously screens sample serum against thousands of peptides, determining the presence of antibodies bound to array probes. Peptide microarrays tiling immunogenic regions of pathogens (e.g., envelope proteins of a virus) are an important high throughput tool for querying and mapping antibody binding. Because of the assay's many steps, from probe synthesis to incubation, peptide microarray data can be noisy with extreme outliers. In addition, subjects may produce different antibody profiles in response to an identical vaccine stimulus or infection, due to variability among subjects’ immune systems. We present a robust Bayesian hierarchical model for peptide microarray experiments, pepBayes, to estimate the probability of antibody response for each subject/peptide combination. Heavy-tailed error distributions accommodate outliers and extreme responses, and tailored random effect terms automatically incorporate technical effects prevalent in the assay. We apply our model to two vaccine trial data sets to demonstrate model performance. Our approach enjoys high sensitivity and specificity when detecting vaccine induced antibody responses. A simulation study shows an adaptive thresholding classification method has appropriate false discovery rate control with high sensitivity, and receiver operating characteristics generated on vaccine trial data suggest that pepBayes clearly separates responses from non-responses.

Measuring the similarity between genes is often the starting point for building gene regulatory networks. Most similarity measures used in practice only consider pairwise information, with a few also considering network structure. Although the theoretical properties of pairwise measures are well understood in the statistics literature, little is known about the statistical properties of similarity measures based on network structure. In this article, we consider a new whole genome network-based similarity measure, called *CCor*, that makes use of information of all the genes in the network. We derive a concentration inequality for *CCor* and compare it with the commonly used Pearson correlation coefficient for inferring network modules. Both theoretical analysis and a real data example demonstrate the advantages of *CCor* over existing measures for inferring gene modules.

In applying scan statistics for public health research, it would be valuable to develop a detection method for multiple clusters that accommodates spatial correlation and covariate effects in an integrated model. In this article, we connect the concepts of the likelihood ratio (LR) scan statistic and the quasi-likelihood (QL) scan statistic to provide a series of detection procedures sufficiently flexible to apply to clusters of arbitrary shape. First, we use an independent scan model for detection of clusters and then a variogram tool to examine the existence of spatial correlation and regional variation based on residuals of the independent scan model. When the estimate of regional variation is significantly different from zero, a mixed QL estimating equation is developed to estimate coefficients of geographic clusters and covariates. We use the Benjamini–Hochberg procedure (1995) to find a threshold for *p*-values to address the multiple testing problem. A quasi-deviance criterion is used to regroup the estimated clusters to find geographic clusters with arbitrary shapes. We conduct simulations to compare the performance of the proposed method with other scan statistics. For illustration, the method is applied to enterovirus data from Taiwan.
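The Benjamini–Hochberg step-up procedure used for the thresholding can be sketched directly; the p-values below are illustrative:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest i with p_(i) <= i*q/m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = (np.arange(1, m + 1) * q) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
rej = benjamini_hochberg(pvals, q=0.05)
```

Here only the two smallest p-values clear their step-up thresholds (0.05/6 and 0.10/6), so two candidate clusters would be retained.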

The focus of this article is on the nature of the likelihood associated with *N*-mixture models for repeated count data. It is shown that the infinite sum embedded in the likelihood associated with the Poisson mixing distribution can be expressed in terms of a hypergeometric function and, thence, in closed form. The resultant expression for the likelihood can be readily computed to a high degree of accuracy and is algebraically tractable. Specifically, the likelihood equations can be simplified to some advantage, the concentrated likelihood in the probability of detection can be formulated, and problematic cases can be identified. The results are illustrated by means of a simulation study and a real-world example. The study is extended to *N*-mixture models with a negative binomial mixing distribution, and results similar to those for the Poisson case are obtained. *N*-mixture models with mixing distributions which accommodate excess zeros and, separately, with a beta-binomial distribution rather than a binomial used to model the intra-site counts are also investigated. However, the results for these settings, while computationally attractive, do not provide insight into the nature of the maximum likelihood estimates.

We introduce a new Bayesian nonparametric method for estimating the size of a closed population from multiple-recapture data. Our method, based on Dirichlet process mixtures, can accommodate complex patterns of heterogeneity of capture, and can transparently modulate its complexity without a separate model selection step. Additionally, it can handle the massively sparse contingency tables generated by a large number of recaptures with moderate sample sizes. We develop an efficient and scalable MCMC algorithm for estimation. We apply our method to simulated data, and to two examples from the literature of estimation of casualties in armed conflicts.

Understanding how aquatic species grow is fundamental in fisheries because stock assessment often relies on growth dependent statistical models. Length-frequency-based methods become important when more applicable data for growth model estimation are either not available or very expensive. In this article, we develop a new framework for growth estimation from length-frequency data using a generalized von Bertalanffy growth model (VBGM) framework that allows for time-dependent covariates to be incorporated. A finite mixture of normal distributions is used to model the length-frequency cohorts of each month with the means constrained to follow a VBGM. The variances of the finite mixture components are constrained to be a function of mean length, reducing the number of parameters and allowing for an estimate of the variance at any length. To optimize the likelihood, we use a minorization–maximization (MM) algorithm with a Nelder–Mead sub-step. This work was motivated by the decline in catches of the blue swimmer crab (BSC) (*Portunus armatus*) off the east coast of Queensland, Australia. We test the method with a simulation study and then apply it to the BSC fishery data.
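The building blocks of the model above are standard and can be sketched directly: the VBGM mean length at age, and a normal mixture over age cohorts whose component standard deviations are constrained to be a function of mean length. A minimal sketch (parameter names and the sd = cv × mean parameterization are ours, chosen only to illustrate the variance constraint, not the paper's exact form):

```python
import math

def vbgm_mean_length(t, L_inf, k, t0):
    """von Bertalanffy growth model: mean length at age t."""
    return L_inf * (1.0 - math.exp(-k * (t - t0)))

def lf_mixture_density(x, ages, weights, L_inf, k, t0, cv):
    """Length-frequency density as a normal mixture over age cohorts,
    with each component sd tied to its mean length (sd = cv * mean)."""
    dens = 0.0
    for a, w in zip(ages, weights):
        mu = vbgm_mean_length(a, L_inf, k, t0)
        sd = cv * mu
        dens += w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return dens
```

Tying the component variances to mean length is what reduces the parameter count: one cv parameter serves every cohort, and the variance at any length comes for free.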

Applications of circular regression models appear in many different fields such as evolutionary psychology, motor behavior, biology, and, in particular, in the analysis of gene expressions in oscillatory systems. Specifically, for the gene expression problem, a researcher may be interested in modeling the relationship among the phases of cell-cycle genes in two species with differing periods. This challenging problem reduces to the problem of constructing a piecewise circular regression model and, with this objective in mind, we propose a flexible circular regression model which allows different parameter values depending on sectors along the circle. We give a detailed interpretation of the parameters in the model and provide maximum likelihood estimators. We also provide a model selection procedure based on the concept of generalized degrees of freedom. The model is then applied to the analysis of two different cell-cycle data sets and through these examples we highlight the power of our new methodology.

Recently, massive functional data have been widely collected over space, across sets of grid points, in various imaging studies. It is of interest to correlate such functional data with clinical variables, such as age and gender, in order to address scientific questions of interest. The aim of this article is to develop a single-index varying coefficient (SIVC) model for establishing a varying association between functional responses (e.g., image) and a set of covariates. It enjoys several unique features of both varying-coefficient and single-index models. An estimation procedure is developed to estimate varying coefficient functions, the index function, and the covariance function of individual functions. The optimal integration of information across different grid points is systematically delineated and the asymptotic properties (e.g., consistency and convergence rate) of all estimators are examined. Simulation studies are conducted to assess the finite-sample performance of the proposed estimation procedure. Furthermore, our real data analysis of a white matter tract dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study confirms the advantage and accuracy of the SIVC model over the popular varying coefficient model.

Construction of confidence sets for the optimal factor levels is an important topic in response surface methodology. Wan et al. (2015) provided an exact confidence set for a maximum or minimum point (i.e., an optimal factor level) of a univariate polynomial function in a given interval. In this article, that method is extended to construct an exact confidence set for the optimal factor levels of response surfaces. The construction method is readily applied to many parametric and semiparametric regression models involving a quadratic function. A conservative confidence set is provided as an intermediate step in the construction of the exact confidence set. Two examples are given to illustrate the application of the confidence sets. The comparison between confidence sets indicates that our exact confidence set is better than the only other confidence set available in the statistical literature that guarantees the confidence level.
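To see what is at stake, consider the simplest case: the optimum of a fitted quadratic y = a + bx + cx² is x* = −b/(2c), and the naive approximate interval comes from the delta method. The sketch below (ours, not the paper's exact construction) is exactly the kind of interval whose coverage can fail, since x* is a ratio of normal estimates, which is what motivates exact confidence sets:

```python
import math

def vertex_ci(b, c, cov, z=1.96):
    """Delta-method interval for the stationary point x* = -b/(2c)
    of y = a + b*x + c*x^2; cov is the 2x2 covariance of (b_hat, c_hat).
    Approximate only: coverage is not guaranteed."""
    x_star = -b / (2.0 * c)
    # gradient of x* with respect to (b, c)
    g = (-1.0 / (2.0 * c), b / (2.0 * c**2))
    var = (g[0] * g[0] * cov[0][0]
           + 2.0 * g[0] * g[1] * cov[0][1]
           + g[1] * g[1] * cov[1][1])
    se = math.sqrt(var)
    return x_star - z * se, x_star + z * se
```

When ĉ is imprecise or near zero the linearization breaks down badly, and an exact set (which may be an interval, a union of intervals, or the whole line) is the safe alternative.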

Individual covariates are commonly used in capture–recapture models as they can provide important information for population size estimation. However, in practice, one or more covariates may be missing at random for some individuals, which can lead to unreliable inference if records with missing data are treated as missing completely at random. We show that, in general, such a naive complete-case analysis in closed capture–recapture models with some covariates missing at random underestimates the population size. We develop methods for estimating regression parameters and population size using regression calibration, inverse probability weighting, and multiple imputation without any distributional assumptions about the covariates. We show that the inverse probability weighting and multiple imputation approaches are asymptotically equivalent. We present a simulation study to investigate the effects of missing covariates and to evaluate the performance of the proposed methods. We also illustrate an analysis using data on the bird species yellow-bellied prinia collected in Hong Kong.
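The inverse probability weighting idea can be sketched in one line: each complete-case individual's usual Horvitz–Thompson contribution, one over its estimated capture probability, is further inflated by one over the probability that its covariate record is complete. A toy version with both probabilities taken as known (the actual method estimates them from the data):

```python
def ipw_population_size(capture_probs, complete_probs):
    """Horvitz-Thompson-type estimate of population size from complete
    cases only: each observed individual is inversely weighted by its
    capture probability times its probability of complete covariates."""
    return sum(1.0 / (p * pi)
               for p, pi in zip(capture_probs, complete_probs))
```

Dropping the completeness weights (setting pi = 1 while still analyzing only complete cases) recovers the naive complete-case analysis, which, as the abstract notes, underestimates the population size.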

At a time of climate change and major loss of biodiversity, it is important to have efficient tools for monitoring populations. In this context, animal abundance indices play an important rôle. In producing indices for invertebrates, it is important to account for variation in counts within seasons. Two new methods for describing seasonal variation in invertebrate counts have recently been proposed; one is nonparametric, using generalized additive models, and the other is parametric, based on stopover models. We present a novel generalized abundance index which encompasses both parametric and nonparametric approaches. The index is extremely efficient to compute, due to the use of concentrated likelihood techniques. This has particular relevance for the analysis of data from long-term extensive monitoring schemes with records for many species and sites, for which existing modeling techniques can be prohibitively time consuming. Performance of the index is demonstrated by several applications to UK Butterfly Monitoring Scheme data. We demonstrate the potential for new insights into both phenology and spatial variation in seasonal patterns from parametric modeling and the incorporation of covariate dependence, which is relevant for both monitoring and conservation. Associated R code is available on the journal website.
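The computational gain from concentration is easy to illustrate in the simplest Poisson setting: with counts y_it ~ Poisson(N_i a_t) for site i and visit t, the site parameters have closed-form maximizers N̂_i = Σ_t y_it / Σ_t a_t, so only the shared seasonal profile a_t needs numerical optimization. A minimal sketch of this generic device (not the paper's full index):

```python
import math

def concentrated_loglik(counts, profile):
    """Poisson log-likelihood for counts[i][t] ~ Poisson(N_i * a_t),
    with each site's N_i replaced by its closed-form maximizer
    N_hat_i = sum_t counts[i][t] / sum_t a_t (additive constants dropped).
    Assumes positive profile values and at least one positive count per site."""
    A = sum(profile)
    ll = 0.0
    for site in counts:
        n_hat = sum(site) / A
        for y, a in zip(site, profile):
            mu = n_hat * a
            ll += y * math.log(mu) - mu  # log(y!) omitted as constant
    return ll
```

An outer optimizer then searches only over the profile parameters, which is why the approach scales to schemes with many species and sites.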

The availability of data in longitudinal studies is often driven by features of the characteristics being studied. For example, clinical databases are increasingly being used for research to address longitudinal questions. Because visit times in such data are often driven by patient characteristics that may be related to the outcome being studied, the danger is that this will result in biased estimation compared to designed, prospective studies. We study longitudinal data that follow a generalized linear mixed model and use a log link to relate an informative visit process to random effects in the mixed model. This device allows us to elucidate which parameters are biased under the informative visit process and to what degree. We show that the informative visit process can badly bias estimators of parameters of covariates associated with the random effects, while allowing consistent estimation of other parameters.

Alzheimer's disease (AD) is usually diagnosed by clinicians through cognitive and functional performance tests, with a potential risk of misdiagnosis. Since the progression of AD is known to cause structural changes in the corpus callosum (CC), CC thickness can be used as a functional covariate in the AD classification problem. However, misclassified class labels negatively impact classification performance. Motivated by AD–CC association studies, we propose a logistic regression model for functional data classification that is robust to misdiagnosis, or label noise. Specifically, our model is constructed by adding individual intercepts to the functional logistic regression model. This approach makes it possible to indicate which observations are possibly mislabeled and also leads to a robust and efficient classifier. An effective MM algorithm provides simple closed-form update formulas. We test our method on synthetic datasets to demonstrate its superiority over an existing method, and apply it to differentiating patients with AD from healthy controls based on CC thickness measured from MRI.

The ready availability of public-use data from various large national complex surveys has immense potential for the assessment of population characteristics using regression models. Complex surveys can be used to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or utilizing resampling methods are often not valid with survey data due to complex survey design features, that is, stratification, multistage sampling, and weighting. In this article, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a double-transform-both-sides (DTBS)-based estimating equations approach to estimate the median regression parameters of the highly skewed response; the DTBS approach applies the same Box–Cox type transformation twice to both the outcome and the regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudo-likelihood approach based on minimizing absolute deviations (MAD). Furthermore, the proposed approach is relatively robust to the true underlying distribution and has much smaller mean squared error than a MAD approach. The method is motivated by an analysis of laboratory data on urinary iodine (UI) concentration from the National Health and Nutrition Examination Survey.
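The transformation at the heart of DTBS is of Box–Cox type, applied identically to the outcome and the regression function. A schematic of the transform-both-sides residual (our simplified single-λ version, ignoring the domain care the full method requires):

```python
import math

def boxcox(y, lam):
    """Box-Cox transform; lam = 0 is the log."""
    return math.log(y) if lam == 0 else (y**lam - 1.0) / lam

def dtbs_residual(y, mu, lam):
    """Double-transform-both-sides residual: the same transform is
    applied twice to both the outcome y and the regression function mu
    (assumes intermediate values stay in the transform's domain)."""
    return boxcox(boxcox(y, lam), lam) - boxcox(boxcox(mu, lam), lam)
```

Estimating equations are then built from these residuals; at λ = 1 each transform is just a shift, so the residual reduces to y − μ.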

The evaluation of cure fractions in oncology research under the well known cure rate model has attracted considerable attention in the literature, but most of the existing testing procedures have relied on restrictive assumptions. A common assumption has been to restrict the cure fraction to a constant under alternatives to homogeneity, thereby neglecting any information from covariates. This article extends the literature by developing a score-based statistic that incorporates covariate information to detect cure fractions, with the existing testing procedure serving as a special case. A complication of this extension, however, is that the implied hypotheses are not typical and standard regularity conditions to conduct the test may not even hold. Using empirical processes arguments, we construct a sup-score test statistic for cure fractions and establish its limiting null distribution as a functional of mixtures of chi-square processes. In practice, we suggest a simple resampling procedure to approximate this limiting distribution. Our simulation results show that the proposed test can greatly improve efficiency over tests that neglect the heterogeneity of the cure fraction under the alternative. The practical utility of the methodology is illustrated using ovarian cancer survival data with long-term follow-up from the surveillance, epidemiology, and end results registry.

Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued, with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumption about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negative correlation pairs when compared to the standard approaches.
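The attenuation described above is easy to reproduce: simulate strongly correlated log-rates, draw low-count Poisson observations from them, and compare the two sample correlations. This illustrates the motivation only; it is not PCAN itself, and all settings below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# correlated natural parameters (log-rates), true correlation 0.9
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
lam = np.exp(z - 2.0)                 # shift into the low-count regime
x = rng.poisson(lam[:, 0])
y = rng.poisson(lam[:, 1])
r_latent = np.corrcoef(z[:, 0], z[:, 1])[0, 1]  # close to 0.9
r_counts = np.corrcoef(x, y)[0, 1]              # badly attenuated
```

With counts this sparse the Pearson correlation of the observations sits far below the 0.9 correlation of the natural parameters, which is precisely why PCAN estimates correlation at the natural-parameter level.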

For the classical, homoscedastic measurement error model, moment reconstruction (Freedman et al., 2004, 2008) and moment-adjusted imputation (Thomas et al., 2011) are appealing, computationally simple imputation-like methods for general model fitting. Like classical regression calibration, the idea is to replace the unobserved variable subject to measurement error with a proxy that can be used in a variety of analyses. Moment reconstruction and moment-adjusted imputation differ from regression calibration in that they attempt to match multiple features of the latent variable, and also to match some of the latent variable's relationships with the response and additional covariates. In this note, we consider a problem where true exposure is generated by a complex, nonlinear random effects modeling process, and develop analogues of moment reconstruction and moment-adjusted imputation for this case. This general model covers classical measurement errors, Berkson measurement errors, mixtures of Berkson and classical errors, and problems that are not measurement error problems at all but share the same complex data-generating process for true exposure. The methods are illustrated using the National Institutes of Health–AARP Diet and Health Study, where the latent variable is a dietary pattern score called the Healthy Eating Index-2005. We also show how our general model includes methods used in radiation epidemiology as a special case. Simulations are used to illustrate the methods.
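For reference, the classical moment reconstruction map of Freedman et al. (2004) replaces the error-prone W with a proxy matching the first two conditional moments of the true X given the response Y. A minimal sketch, with the conditional moments passed in as known (in practice they are estimated):

```python
import math

def moment_reconstruct(w, mean_x_y, var_x_y, mean_w_y, var_w_y):
    """X_MR = E(X|Y) + sqrt{Var(X|Y)/Var(W|Y)} * (W - E(W|Y)):
    a linear map of W whose conditional mean and variance given Y
    match those of the unobserved true X."""
    return mean_x_y + math.sqrt(var_x_y / var_w_y) * (w - mean_w_y)
```

Because the map is linear in W within each level of Y, it preserves the data while correcting both location and spread, which is what lets the proxy be reused across many downstream analyses.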

The usefulness of meta-analysis has been recognized in the evaluation of drug safety, as a single trial usually yields few adverse events and offers limited information. For rare events, conventional meta-analysis methods may yield an invalid inference, as they often rely on large sample theories and require empirical corrections for zero events. These problems motivate research in developing *exact* methods, including Tian et al.'s method of combining confidence intervals (2009, *Biostatistics*, **10**, 275–281) and Liu et al.'s method of combining *p*-value functions (2014, *JASA*, **109**, 1450–1465). This article shows that these two exact methods can be unified under the framework of combining confidence distributions (CDs). Furthermore, we show that the CD method generalizes Tian et al.'s method in several aspects. Given that the CD framework also subsumes the Mantel–Haenszel and Peto methods, we conclude that the CD method offers a general framework for meta-analysis of rare events. We illustrate the CD framework using two real data sets collected for the safety analysis of diabetes drugs.
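The CD recipe can be sketched concretely for normal-based inputs: each study contributes H_i(θ) = Φ((θ − θ̂_i)/s_i), and the combined CD is Φ applied to a weighted sum of the Φ⁻¹(H_i(θ)). With weights 1/s_i this reduces to inverse-variance pooling. The sketch below is a generic illustration of the combining framework, not the rare-event machinery of the article:

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def combined_cd(theta, estimates, ses):
    """Combine normal confidence distributions H_i(t) = Phi((t - est_i)/se_i)
    via H_c(t) = Phi( sum_i w_i * Phi^{-1}(H_i(t)) / sqrt(sum_i w_i^2) )
    with weights w_i = 1/se_i."""
    num = sum((1.0 / s) * STD_NORMAL.inv_cdf(STD_NORMAL.cdf((theta - e) / s))
              for e, s in zip(estimates, ses))
    den = math.sqrt(sum((1.0 / s) ** 2 for s in ses))
    return STD_NORMAL.cdf(num / den)
```

Since Φ⁻¹ ∘ Φ is the identity here, the combined CD's median is the inverse-variance weighted mean; the point of the framework is that the inputs H_i need not be normal, which is what lets it subsume exact rare-event methods alongside Mantel–Haenszel and Peto.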