The instrumental variable (IV) design is a well-known approach for unbiased evaluation of causal effects in the presence of unobserved confounding. In this article, we study the IV approach to account for selection bias in regression analysis with outcome missing not at random. In such a setting, a valid IV is a variable which (i) predicts the nonresponse process, and (ii) is independent of the outcome in the underlying population. We show that under the additional assumption (iii) that the IV is independent of the magnitude of selection bias due to nonresponse, the population regression in view is nonparametrically identified. For point estimation under (i)–(iii), we propose a simple complete-case analysis which modifies the regression of primary interest by carefully incorporating the IV to account for selection bias. The approach is developed for the identity, log and logit link functions. For inferences about the marginal mean of a binary outcome assuming (i) and (ii) only, we describe novel and approximately sharp bounds which, unlike Robins–Manski bounds, are smooth in model parameters, therefore allowing for a straightforward approach to account for uncertainty due to sampling variability. These bounds provide a more honest account of uncertainty and allow one to assess the extent to which a violation of the key identifying condition (iii) might affect inferences. For illustration, the methods are used to account for selection bias induced by HIV testing nonparticipation in the evaluation of HIV prevalence in the Zambian Demographic and Health Surveys.

Many statistical methods have recently been developed for identifying subgroups of patients who may benefit from different available treatments. Compared with the traditional outcome-modeling approaches, these methods focus on modeling interactions between the treatments and covariates while bypassing or minimizing modeling of the main effects of covariates, because the subgroup identification only depends on the *sign* of the interaction. However, these methods are scattered and often narrow in scope. In this article, we propose a general framework, by weighting and A-learning, for subgroup identification in both randomized clinical trials and observational studies. Our framework involves minimum modeling for the relationship between the outcome and covariates pertinent to the subgroup identification. Under the proposed framework, we may also estimate the *magnitude* of the interaction, which leads to the construction of a scoring system measuring the individualized treatment effect. The proposed methods are quite flexible and include many recently proposed estimators as special cases. As a result, some estimators originally proposed for randomized clinical trials can be extended to observational studies, and procedures based on the weighting method can be converted to an A-learning method and vice versa. Our approaches also allow straightforward incorporation of regularization methods for high-dimensional data, as well as possible efficiency augmentation and generalization to multiple treatments. We examine the empirical performance of several procedures belonging to the proposed framework through extensive numerical studies.

We develop a general class of response-adaptive Bayesian designs using hierarchical models, and provide open source software to implement them. Our work is motivated by recent master protocols in oncology, where several treatments are investigated simultaneously in one or multiple disease types, and treatment efficacy is expected to vary across biomarker-defined subpopulations. Adaptive trials such as I-SPY-2 (Barker et al., 2009) and BATTLE (Zhou et al., 2008) are special cases within our framework. We discuss the application of our adaptive scheme to two distinct research goals. The first is to identify a biomarker subpopulation for which a therapy shows evidence of treatment efficacy, and to exclude other subpopulations for which such evidence does not exist. This leads to a subpopulation-finding design. The second is to identify, within biomarker-defined subpopulations, a set of cancer types for which an experimental therapy is superior to the standard-of-care. This goal leads to a subpopulation-stratified design. Using simulations constructed to faithfully represent ongoing cancer sequencing projects, we quantify the potential gains of our proposed designs relative to conventional non-adaptive designs.

Interval-censored competing risks data arise when each study subject may experience an event or failure from one of several causes and the failure time is not observed directly but rather is known to lie in an interval between two examinations. We formulate the effects of possibly time-varying (external) covariates on the cumulative incidence or sub-distribution function of competing risks (i.e., the marginal probability of failure from a specific cause) through a broad class of semiparametric regression models that captures both proportional and non-proportional hazards structures for the sub-distribution. We allow each subject to have an arbitrary number of examinations and accommodate missing information on the cause of failure. We consider nonparametric maximum likelihood estimation and devise a fast and stable EM-type algorithm for its computation. We then establish the consistency, asymptotic normality, and semiparametric efficiency of the resulting estimators for the regression parameters by appealing to modern empirical process theory. In addition, we show through extensive simulation studies that the proposed methods perform well in realistic situations. Finally, we provide an application to a study on HIV-1 infection with different viral subtypes.

Replacing missing data using the baseline observation carried forward (BOCF) technique is known to be fraught with problems. Despite recommendations to the contrary, BOCF and the related last observation carried forward (LOCF) continue to be used in some fields of research. We first show that the use of BOCF in testing for a change in a single sample is essentially equivalent to what results from a completer analysis. Next, we derive a simple method based only on summary statistics for adjusting inference from a completer analysis in situations where the estimand of the completer analysis is expected to be close to that of the full analysis set. For those with missing follow-up data, the method assumes a mean change of zero from baseline, with variance equivalent to that estimated from the patients who complete; in other words, it is essentially a BOCF analysis with plausible variance inflation. The extension to two samples is considered and a similar adjustment of the completer analysis is presented. The method presented here can be used to reanalyze results from publications that treat missing data inadequately, to adjust for missing data in power calculations, and as a quick way of checking the plausibility of an appropriate treatment of missing data. These practical tools are meant to complement, rather than compete with, established methods for dealing with missing data. Numerical simulations are used to elucidate some properties of the proposed method and, finally, we apply the method to data from a recent trial which evaluated weight loss programs.
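
The zero-mean-change adjustment above can be sketched from summary statistics alone. Below is a minimal Python illustration (the function name and the worked numbers are ours, for illustration only): combining the completers with dropouts assumed to have zero change and the completers' variance yields a mixture mean and a variance inflated by the between-group spread, i.e., essentially BOCF with plausible variance inflation.

```python
import numpy as np
from math import sqrt

def bocf_adjusted_summary(n_comp, mean_comp, sd_comp, n_miss):
    """Adjust a completer analysis by assuming a zero mean change (with
    completer-level variance) for subjects with missing follow-up.
    Returns the adjusted mean change, SD, and size of the full analysis
    set.  Illustrative sketch only, not the paper's exact formulas."""
    n = n_comp + n_miss
    w = n_comp / n
    mean_adj = w * mean_comp                     # dropouts contribute zero change
    # mixture variance: common within-group variance + between-group spread
    var_adj = sd_comp**2 + w * (1 - w) * mean_comp**2
    return mean_adj, sqrt(var_adj), n

# hypothetical trial: 80 completers (mean change -4, SD 6), 20 dropouts
mean_adj, sd_adj, n = bocf_adjusted_summary(n_comp=80, mean_comp=-4.0,
                                            sd_comp=6.0, n_miss=20)
t_stat = mean_adj / (sd_adj / sqrt(n))           # one-sample test of change
```

The adjusted SD exceeds the completer SD, reflecting the intended variance inflation.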

We consider the optimization of smoothing parameters and variance components in models with a regular log likelihood subject to quadratic penalization of the model coefficients, via a generalization of the method of Fellner (1986) and Schall (1991). In particular: (i) we generalize the original method to the case of penalties that are linear in several smoothing parameters, thereby covering the important cases of tensor product and adaptive smoothers; (ii) we show why the method's steps increase the restricted marginal likelihood of the model, show that it tends to converge faster than the EM algorithm or obvious accelerations thereof, and investigate its relation to Newton optimization; (iii) we generalize the method to any Fisher regular likelihood. The method represents a considerable simplification over existing methods of estimating smoothing parameters in the context of regular likelihoods, without sacrificing generality: for example, it is only necessary to compute with the same first and second derivatives of the log-likelihood required for coefficient estimation, and not with the third or fourth order derivatives required by alternative approaches. Examples are provided which would have been impossible or impractical with pre-existing Fellner–Schall methods, along with an example of a Tweedie location, scale and shape model which would be a challenge for alternative methods, and a sparse additive modeling example where the method facilitates computational efficiency gains of several orders of magnitude.
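
The flavor of the update can be conveyed in the simplest special case. Below is a rough Python sketch (our own construction, not the authors' code) of a Fellner–Schall-type update for a single ridge penalty S = I in a Gaussian model with unit error variance; the method in the text covers multiple penalties, tensor product and adaptive smoothers, and general Fisher-regular likelihoods.

```python
import numpy as np

def fellner_schall_ridge(X, y, lam=1.0, n_iter=50, tol=1e-8):
    """Fixed-point iteration alternating a penalized fit with a
    smoothing-parameter update.  Special case: one penalty S = I,
    Gaussian errors with unit variance."""
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        A = np.linalg.inv(XtX + lam * np.eye(p))
        beta = A @ Xty                       # penalized coefficient estimate
        # for S = I: lam * tr(S_lam^- S) = p, minus tr((X'X + lam S)^-1 S)
        lam_new = (p - lam * np.trace(A)) / (beta @ beta)
        if abs(lam_new - lam) < tol * lam:
            lam = lam_new
            break
        lam = lam_new
    return lam, beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta_true = rng.standard_normal(10) * 0.5
y = X @ beta_true + rng.standard_normal(200)
lam_hat, beta_hat = fellner_schall_ridge(X, y)
```

Each update is guaranteed positive here because tr((X'X + λI)^{-1}) < p/λ, matching the property that the method's steps are well defined.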

Forward stagewise estimation is a revived slow-brewing approach for model building that is particularly attractive in dealing with complex data structures for both its computational efficiency and its intrinsic connections with penalized estimation. Under the framework of generalized estimating equations, we study general stagewise estimation approaches that can handle clustered data and non-Gaussian/non-linear models in the presence of prior variable grouping structure. As the grouping structure is often not ideal in that even the important groups may contain irrelevant variables, the key is to simultaneously conduct group selection and within-group variable selection, that is, bi-level selection. We propose two approaches to address the challenge. The first is a bi-level stagewise estimating equations (BiSEE) approach, which is shown to correspond to the sparse group lasso penalized regression. The second is a hierarchical stagewise estimating equations (HiSEE) approach to handle more general hierarchical grouping structure, in which each stagewise estimation step itself is executed as a hierarchical selection process based on the grouping structure. Simulation studies show that BiSEE and HiSEE yield competitive model selection and predictive performance compared to existing approaches. We apply the proposed approaches to study the association between the suicide-related hospitalization rates of the 15–19 age group and the characteristics of the school districts in the State of Connecticut.

The LASSO method estimates coefficients by minimizing the residual sum of squares plus a penalty term. The regularization parameter λ in LASSO controls the trade-off between data fitting and sparsity. We derive the relationship between λ and the false discovery proportion (FDP) of the LASSO estimator and show how to select λ so as to achieve a desired FDP. Our estimation is based on the asymptotic distribution of the LASSO estimator in the limit of both sample size and dimension going to infinity with fixed ratio. We use a factor analysis model to describe the dependence structure of the design matrix. An efficient majorization–minimization based algorithm is developed to estimate the FDP at a fixed value of λ. The analytic results are compared with those of numerical simulations on finite-size systems and are confirmed to be correct. An application to the high-throughput genomic riboflavin data set also demonstrates the usefulness of our method.
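
To make the target quantity concrete, here is a toy Python sketch (ours, not the authors' algorithm) that fits the LASSO by plain coordinate descent on simulated data with a known support and computes the realized FDP at one value of λ; the article's contribution is estimating this λ-to-FDP relationship analytically, without oracle knowledge of the support.

```python
import numpy as np

def soft(z, g):
    """Soft-thresholding operator used by coordinate-descent LASSO."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_sweep=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic
    coordinate descent.  Minimal stand-in for the LASSO estimator."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    r = y.copy()                         # current residual y - Xb
    for _ in range(n_sweep):
        for j in range(p):
            r += X[:, j] * b[j]          # partial residual excluding j
            b[j] = soft(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(1)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0                            # true support is {0, ..., 4}
y = X @ beta + rng.standard_normal(n)

b_hat = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(b_hat != 0)
false_disc = int(np.sum(selected >= s))   # selections outside the support
fdp = false_disc / max(len(selected), 1)
```

Sweeping λ traces out the FDP curve that the article characterizes asymptotically.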

Quantile regression coefficient functions describe how the coefficients of a quantile regression model depend on the order of the quantile. A method for parametric modeling of quantile regression coefficient functions was discussed in a recent article. The aim of the present work is to extend the existing framework to censored and truncated data. We propose an estimator and derive its asymptotic properties. We discuss goodness-of-fit measures, present simulation results, and analyze the data that motivated this article. The described estimator has been implemented in the R package qrcm.

We consider design issues for cluster randomized trials (CRTs) with a binary outcome where both the unit costs and the intraclass correlation coefficients (ICCs) in the two arms may be unequal. We first propose a design that maximizes cost efficiency (CE), defined as the ratio of the precision of the efficacy measure to the study cost. Because such designs can be highly sensitive to the unknown ICCs and the anticipated success rates in the two arms, a local strategy based on a single set of best guesses for the ICCs and success rates can be risky. To mitigate this issue, we propose a maximin optimal design that permits ranges of values to be specified for the success rate and the ICC in each arm. We derive maximin optimal designs for three common measures of intervention efficacy, the risk difference, relative risk, and odds ratio, and study their properties. Using a real cancer control and prevention trial example, we ascertain the efficiency of the widely used balanced design relative to the maximin optimal design and show that the former can be quite inefficient and less robust to mis-specifications of the ICCs and the success rates in the two arms.
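
The objects involved can be sketched numerically. The Python fragment below (a simplified construction of ours) computes the risk-difference variance with arm-specific ICCs and grid-searches a maximin budget split over assumed ranges for the success rates and ICCs; the article derives such designs analytically for the risk difference, relative risk, and odds ratio.

```python
import numpy as np

def var_rd(K1, K2, m, p1, p2, rho1, rho2):
    """Variance of the risk-difference estimate in a CRT with K_i
    clusters of size m in arm i, success rate p_i and ICC rho_i;
    1 + (m - 1) * rho_i is the usual design effect."""
    de1 = 1 + (m - 1) * rho1
    de2 = 1 + (m - 1) * rho2
    return p1*(1-p1)*de1/(K1*m) + p2*(1-p2)*de2/(K2*m)

def maximin_allocation(budget, c1, c2, m, p_range, rho_range, n_frac=99):
    """Choose the fraction of budget spent on arm 1 (cluster cost c1)
    minimizing the worst-case variance over ranges for the success
    rates and ICCs.  Cluster counts are treated as continuous."""
    fracs = np.linspace(0.05, 0.95, n_frac)
    p_grid = np.linspace(p_range[0], p_range[1], 5)
    r_grid = np.linspace(rho_range[0], rho_range[1], 5)
    best_frac, best_worst = None, np.inf
    for f in fracs:
        K1, K2 = f * budget / c1, (1 - f) * budget / c2
        worst = max(var_rd(K1, K2, m, p1, p2, r1, r2)
                    for p1 in p_grid for p2 in p_grid
                    for r1 in r_grid for r2 in r_grid)
        if worst < best_worst:
            best_frac, best_worst = f, worst
    return best_frac, best_worst

# hypothetical budget and costs: arm 1 clusters cost twice arm 2 clusters
frac, worst_var = maximin_allocation(budget=1000.0, c1=10.0, c2=5.0, m=20,
                                     p_range=(0.2, 0.4), rho_range=(0.01, 0.1))
```

With symmetric parameter ranges the maximin split tilts toward the costlier arm in proportion to the square roots of the cluster costs, illustrating why the balanced design can be inefficient.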

We propose a method for visualizing genetic assignment data by characterizing the distribution of genetic profiles for each candidate source population. This method enhances the assignment method of Rannala and Mountain (1997) by calculating appropriate graph positions for individuals for which some genetic data are missing. An individual with missing data is positioned in the distributions of genetic profiles for a population according to its estimated quantile based on its available data. The quantiles of the genetic profile distribution for each population are calculated by approximating the cumulative distribution function (CDF) using the saddlepoint method, and then inverting the CDF to get the quantile function. The saddlepoint method also provides a way to visualize assignment results calculated using the leave-one-out procedure. This new method offers an advance upon assignment software such as geneclass2, which provides no visualization method, and is biologically more interpretable than the bar charts provided by the software structure. We show results from simulated data and apply the methods to microsatellite genotype data from ship rats (*Rattus rattus*) captured on the Great Barrier Island archipelago, New Zealand. The visualization method makes it straightforward to detect features of population structure and to judge the discriminative power of the genetic data for assigning individuals to source populations.

Precise modeling of disease progression in neurodegenerative disorders may enable early intervention before clinical manifestation of a disease, which is crucial since early intervention at the premanifest stage is expected to be more effective. Neuroimaging biomarkers are indicative of the underlying disease pathology and may be used to predict future disease occurrence at the premanifest stage. As observed in many pivotal studies, longitudinal measurements of clinical outcomes, such as motor or cognitive symptoms, often present nonlinear sigmoid shapes over time, where the inflection points of the trajectories mark a meaningful time in disease progression. Therefore, to identify neuroimaging biomarkers predicting disease progression, we propose a nonlinear mixed effects model based on a sigmoid function to predict longitudinal clinical outcomes, and associate a linear combination of neuroimaging biomarkers with subject-specific inflection points. Based on an expectation-maximization (EM) algorithm, we propose a method that can fit a nonlinear model with many potentially correlated biomarkers for random inflection points while achieving computational stability. Variable selection is introduced in the algorithm in order to identify important biomarkers of disease progression and to reduce prediction variability. We apply the proposed method to the data from the Predictors of Huntington's Disease study to select brain subcortical regional volumes predictive of the inflection points of the motor and cognitive function trajectories. Our results reveal that brain atrophy in the striatum and expansion of the ventricular system are highly predictive of the inflection points. Furthermore, these inflection points may precede clinically defined disease onset by as early as a decade and thus may be useful biomarkers as early signs of Huntington's Disease onset.
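
The building block of the model is a sigmoid trajectory whose inflection point is subject-specific. A minimal single-subject Python sketch (ours; the article embeds this curve in a nonlinear mixed effects model, with the inflection point modeled as a linear combination of neuroimaging biomarkers plus a random effect):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, lower, upper, rate, inflection):
    """Four-parameter logistic trajectory; the inflection point marks a
    meaningful time in disease progression."""
    return lower + (upper - lower) / (1 + np.exp(-rate * (t - inflection)))

# simulated longitudinal outcome for one subject, true inflection at t = 12
rng = np.random.default_rng(2)
t = np.linspace(0, 20, 30)
y = sigmoid(t, 10.0, 50.0, 0.8, 12.0) + rng.normal(0, 1.0, t.size)

popt, _ = curve_fit(sigmoid, t, y, p0=[y.min(), y.max(), 0.5, 10.0])
inflection_hat = popt[3]
```

In the article's model the subject-specific inflection points are not fit one subject at a time but regressed on brain subcortical volumes within an EM algorithm with variable selection.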

Pearson's chi-square test has been widely used to test for association between two categorical responses. The Spearman rank correlation and Kendall's tau are often used for measuring and testing association between two continuous or ordered categorical responses. However, the established statistical properties of these tests are valid only when the pairs of responses are independent, that is, when each sampling unit contributes a single pair of responses. When each sampling unit consists of a cluster of paired responses, the assumption of independent pairs is violated. In this article, we apply the within-cluster resampling technique to *U*-statistics to form new tests and rank-based correlation estimators for possibly tied clustered data. We develop large-sample properties of the proposed tests and estimators and evaluate their performance by simulations. The proposed methods are applied to a data set collected from a PET/CT imaging study for illustration.
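
The within-cluster resampling idea is simple to sketch: repeatedly draw one pair per cluster, compute the usual rank statistic on the resulting independent pairs, and average. A rough Python illustration (our own simplified version; the article develops the estimators and tests formally via *U*-statistics):

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau-a for small samples; tied pairs contribute zero."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((x[i] - x[j]) * (y[i] - y[j]))
    return 2.0 * s / (n * (n - 1))

def wcr_tau(clusters, n_resample=200, seed=3):
    """Within-cluster resampling: draw one (x, y) pair per cluster,
    compute tau on the resulting independent pairs, and average."""
    rng = np.random.default_rng(seed)
    taus = []
    for _ in range(n_resample):
        xs, ys = zip(*(c[rng.integers(len(c))] for c in clusters))
        taus.append(kendall_tau(np.array(xs), np.array(ys)))
    return float(np.mean(taus))

# 30 clusters of 2-4 pairs each, with positive (x, y) association
rng = np.random.default_rng(4)
clusters = []
for _ in range(30):
    k = int(rng.integers(2, 5))
    x = rng.standard_normal(k)
    y = x + rng.standard_normal(k)
    clusters.append(list(zip(x, y)))
tau_hat = wcr_tau(clusters)
```

Because each resampled data set contains one pair per cluster, the classical independence assumption holds within each replicate.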

Missing values appear very often in many applications, but the problem of missing values has not received much attention in testing order-restricted alternatives. Under the missing at random (MAR) assumption, we impute the missing values nonparametrically using kernel regression. For data with imputation, the classical likelihood ratio test designed for testing order-restricted means is no longer applicable since the likelihood does not exist. This article proposes a novel method for constructing test statistics for assessing means with an increasing or decreasing order based on the jackknife empirical likelihood (JEL) ratio. It is shown that the JEL ratio statistic evaluated under the null hypothesis converges to a chi-bar-square distribution, whose weights depend on the missing probabilities and the nonparametric imputation. A simulation study shows that the proposed test performs well under various missing scenarios and is robust for normally and nonnormally distributed data. The proposed method is applied to an Alzheimer's Disease Neuroimaging Initiative data set to find a biomarker for the diagnosis of Alzheimer's disease.

Distributed lag non-linear models (DLNMs) are a modelling tool for describing potentially non-linear and delayed dependencies. Here, we illustrate an extension of the DLNM framework through the use of penalized splines within generalized additive models (GAM). This extension offers built-in model selection procedures and the possibility of accommodating assumptions on the shape of the lag structure through specific penalties. In addition, this framework includes, as special cases, simpler models previously proposed for linear relationships (DLMs). Alternative versions of penalized DLNMs are compared with each other and with the standard unpenalized version in a simulation study. Results show that this penalized extension to the DLNM class provides greater flexibility and improved inferential properties. The framework exploits recent theoretical developments of GAMs and is implemented using efficient routines within freely available software. Real-data applications are illustrated through two reproducible examples in time series and survival analysis.

We propose to model a spatio-temporal random field that has nonstationary covariance structure in both space and time domains by applying the concept of the dimension expansion method in Bornn et al. (2012). Simulations are conducted for both separable and nonseparable space-time covariance models, and the model is also illustrated with a streamflow dataset. Both simulation and data analyses show that modeling nonstationarity in both space and time can improve the predictive performance over stationary covariance models or models that are nonstationary in space but stationary in time.

In this article, we present a new method for optimizing designs of experiments for non-linear mixed effects models, where a categorical factor with covariate information is a design variable combined with another design factor. The work is motivated by the need to efficiently design preclinical experiments in enzyme kinetics for a set of Human Liver Microsomes. However, the results are general and can be applied to other experimental situations where the variation in the response due to a categorical factor can be partially accounted for by a covariate. The covariate included in the model explains some systematic variability in a random model parameter. This approach allows better understanding of the population variation as well as estimation of the model parameters with higher precision.

The case-cohort study design is an effective way to reduce the cost of assembling and measuring expensive covariates in large cohort studies. Recently, several weighted estimators were proposed for the case-cohort design when multiple diseases are of interest. However, these existing weighted estimators do not make effective use of the covariate information available in the whole cohort. Furthermore, the auxiliary information for the expensive covariates, which may be available in the studies, cannot be incorporated directly. In this article, we propose a class of updated-estimators. We show that, by making effective use of the whole cohort information, the proposed updated-estimators are guaranteed to be asymptotically more efficient than the existing weighted estimators. Furthermore, they can flexibly incorporate the auxiliary information whenever it is available. The advantages of the proposed updated-estimators are demonstrated in simulation studies and a real data analysis.

Understanding the factors that alter the composition of the human microbiota may inform personalized healthcare strategies and suggest therapeutic drug targets. In many sequencing studies, microbial communities are characterized by a list of taxa, their counts, and their evolutionary relationships represented by a phylogenetic tree. In this article, we consider an extension of the Dirichlet multinomial distribution, called the Dirichlet-tree multinomial distribution, for multivariate, over-dispersed, and tree-structured count data. To address the relationships between these counts and a set of covariates, we propose the Dirichlet-tree multinomial regression model, for which we develop a penalized likelihood method for estimating parameters and selecting covariates. For efficient optimization, we adopt the accelerated proximal gradient approach. Simulation studies are presented to demonstrate the good performance of the proposed procedure. An analysis of a data set relating dietary nutrients with bacterial counts is used to show that the incorporation of the tree structure into the model helps increase the prediction power.
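
The single-node building block of the model is the Dirichlet-multinomial distribution; the tree version composes one such distribution at each internal node of the phylogeny. A small Python sketch of the log-pmf (ours, for illustration):

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x, alpha):
    """Log-pmf of the Dirichlet-multinomial for counts x with total n
    and concentration parameters alpha: the multinomial compounded with
    a Dirichlet, which captures over-dispersion in the counts."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + gammaln(a0) - gammaln(n + a0)
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

# sanity check: for K = 2 categories this is the beta-binomial,
# so the probabilities over k = 0, ..., n should sum to one
n = 5
alpha = np.array([2.0, 3.0])
probs = [np.exp(dirichlet_multinomial_logpmf([k, n - k], alpha))
         for k in range(n + 1)]
total = sum(probs)
```

In the tree-structured model, each internal node's child counts get their own such distribution, with node-level parameters linked to covariates through the regression.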

The processing of auditory information by neurons is an important area of study in neuroscience. We consider statistical analysis for an electrophysiological experiment related to this area. The recorded synaptic current responses from the experiment are observed as clusters, where the number of clusters is related to an important characteristic of the auditory system. This number is difficult to estimate visually because the clusters are blurred by biological variability. Using singular value decomposition and a Gaussian mixture model, we develop an estimator for the number of clusters. Additionally, we provide a method for hypothesis testing and sample size determination in the two-sample problem. We illustrate our approach with both simulated and experimental data.

We generalize Levene's test for variance (scale) heterogeneity between *k* groups for more complex data, when there are sample correlation and group membership uncertainty. Following a two-stage regression framework, we show that least absolute deviation regression must be used in the stage 1 analysis to ensure a correct asymptotic distribution of the generalized scale (*gS*) test statistic. We then show that the proposed *gS* test is independent of the generalized location test, under the joint null hypothesis of no mean and no variance heterogeneity. Consequently, we generalize the recently proposed joint location-scale (*gJLS*) test, valuable in settings where there is an interaction effect but one interacting variable is not available. We evaluate the proposed method via an extensive simulation study and two genetic association application studies.
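
The classical, uncorrelated special case of the two-stage construction is the Levene/Brown–Forsythe test, where the stage 1 LAD fit with only group intercepts reduces to the group medians. A Python sketch of that special case (ours; the *gS* test in the article generalizes it to correlated samples and uncertain group membership):

```python
import numpy as np

def scale_test_stat(samples):
    """Levene/Brown-Forsythe scale statistic: stage 1 removes location
    by the group median (the intercept-only LAD fit), stage 2 runs a
    one-way ANOVA F test on the absolute residuals."""
    d = [np.abs(s - np.median(s)) for s in samples]   # stage 1 residuals
    k = len(d)
    n = sum(len(g) for g in d)
    grand = np.concatenate(d).mean()
    ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in d)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in d)
    return (ssb / (k - 1)) / (ssw / (n - k))          # F(k-1, n-k) under H0

rng = np.random.default_rng(5)
same = [rng.normal(0, 1.0, 100) for _ in range(3)]          # equal scales
diff = [rng.normal(0, s, 100) for s in (1.0, 2.0, 3.0)]     # unequal scales
f_same, f_diff = scale_test_stat(same), scale_test_stat(diff)
```

Using the median rather than the mean in stage 1 mirrors the article's finding that least absolute deviation regression is needed for a correct asymptotic null distribution.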

Population attributable fraction (PAF) is widely used to quantify the disease burden associated with a modifiable exposure in a population. It has been extended to a time-varying measure that provides additional information on when and how the exposure's impact varies over time for cohort studies. However, there is no estimation procedure for PAF using data that are collected from population-based case-control studies, which, because of time and cost efficiency, are commonly used for studying genetic and environmental risk factors of disease incidences. In this article, we show that time-varying PAF is identifiable from a case-control study and develop a novel estimator of PAF. Our estimator combines odds ratio estimates from logistic regression models and density estimates of the risk factor distribution conditional on failure times in cases from a kernel smoother. The proposed estimator is shown to be consistent and asymptotically normal with asymptotic variance that can be estimated empirically from the data. Simulation studies demonstrate that the proposed estimator performs well in finite sample sizes. Finally, the method is illustrated by a population-based case-control study of colorectal cancer.
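
For orientation, the static, rare-disease textbook version of the PAF from case-control data is Miettinen's case-based formula; the article's estimator instead makes the PAF a function of time by combining logistic-regression odds ratios with kernel density estimates of the risk factor distribution given failure times in cases. A minimal Python sketch of the static formula (ours):

```python
def paf_case_based(p_exposed_cases, odds_ratio):
    """Miettinen's case-based formula PAF = Pr(E | case) * (OR - 1) / OR,
    valid for a binary exposure when the disease is rare enough that the
    odds ratio approximates the relative risk."""
    return p_exposed_cases * (odds_ratio - 1.0) / odds_ratio

# hypothetical numbers: 40% of cases exposed, odds ratio 2.5
paf = paf_case_based(p_exposed_cases=0.4, odds_ratio=2.5)
```

The time-varying extension in the article replaces both inputs with estimates that depend on failure time, which is what allows it to describe when the exposure's impact is greatest.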

In this article, we first propose a Bayesian neighborhood selection method to estimate Gaussian Graphical Models (GGMs). We show the graph selection consistency of this method in the sense that the posterior probability of the true model converges to one. When there are multiple groups of data available, instead of estimating the networks independently for each group, joint estimation of the networks may utilize the shared information among groups and lead to improved estimation for each individual network. Our method is extended to jointly estimate GGMs in multiple groups of data with complex structures, including spatial data, temporal data, and data with both spatial and temporal structures. Markov random field (MRF) models are used to efficiently incorporate the complex data structures. We develop and implement an efficient algorithm for statistical inference that enables parallel computing. Simulation studies suggest that our approach achieves better accuracy in network estimation compared with methods not incorporating spatial and temporal dependencies when there are shared structures among the networks, and that it performs comparably well otherwise. Finally, we illustrate our method using the human brain gene expression microarray dataset, where the expression levels of genes are measured in different brain regions across multiple time periods.

We consider conditional estimation in two-stage sample size adjustable designs and the consequent bias. More specifically, we consider a design which permits raising the sample size when interim results look rather promising, and which retains the originally planned sample size when results look very promising. The estimation procedures reported comprise the unconditional maximum likelihood, the conditionally unbiased Rao–Blackwell estimator, the conditional median unbiased estimator, and the conditional maximum likelihood with and without bias correction. We compare these estimators based on analytical results and a simulation study. We show how they can be applied in a real clinical trial setting.

Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high-dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this article, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets.
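
One Monte Carlo building block for testing a single split can be sketched as follows (a simplified construction of ours, not the authors' procedure): compare an observed two-cluster tightness index against its distribution under a single Gaussian fitted to the data; the article assembles such per-split tests into a sequential procedure controlling the family-wise error rate over the dendrogram.

```python
import numpy as np

def split_index(z):
    """Best two-group within-cluster sum of squares of a 1-d sample,
    relative to the total sum of squares (smaller = stronger split)."""
    z = np.sort(z)
    best = np.inf
    for cut in range(1, z.size):
        a, b = z[:cut], z[cut:]
        ss = ((a - a.mean())**2).sum() + ((b - b.mean())**2).sum()
        best = min(best, ss)
    return best / ((z - z.mean())**2).sum()

def monte_carlo_split_pvalue(X, n_null=200, seed=6):
    """Monte Carlo p-value for one split: the observed split index on
    the first principal score versus its null distribution under a
    single Gaussian with the sample covariance."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    obs = split_index(Xc @ vt[0])
    cov = np.cov(Xc, rowvar=False)
    null = []
    for _ in range(n_null):
        Z = rng.multivariate_normal(np.zeros(X.shape[1]), cov, X.shape[0])
        Z -= Z.mean(axis=0)
        _, _, vtz = np.linalg.svd(Z, full_matrices=False)
        null.append(split_index(Z @ vtz[0]))
    return (1 + sum(s <= obs for s in null)) / (1 + n_null)

# two well-separated Gaussian clusters in 5 dimensions
rng = np.random.default_rng(7)
two = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(4, 1, (40, 5))])
p_two = monte_carlo_split_pvalue(two)
```

A genuine two-cluster structure yields a split index far below anything the single-Gaussian null produces, giving a small Monte Carlo p-value.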

Traditionally, phylogeny and sequence alignment are estimated separately: first estimate a multiple sequence alignment and then infer a phylogeny based on the sequence alignment estimated in the previous step. However, uncertainty in the alignment is ignored, resulting, possibly, in overstated certainty in phylogeny estimates. We develop a joint model for co-estimating phylogeny and sequence alignment which improves estimates from the traditional approach by accounting for uncertainty in the alignment in phylogenetic inferences. Our insertion and deletion (indel) model allows arbitrary-length overlapping indel events and a general distribution for indel fragment size. We employ a Bayesian approach using MCMC to estimate the joint posterior distribution of a phylogenetic tree and a multiple sequence alignment. Our approach has a tree and a complete history of indel events mapped onto the tree as the state space of the Markov chain, while previous approaches have a tree and an alignment. A large state space containing a complete history of indel events makes our MCMC approach more challenging, but it enables us to infer more information about the indel process. The performance of this joint method and of traditional sequential methods is compared using both simulated and real data. Software named BayesCAT (Bayesian Co-estimation of Alignment and Tree) is available at https://github.com/heejungshim/BayesCAT.

Across many types of regression problems, it has become increasingly important to deal with high-dimensional data with many potentially influential covariates. A possible solution is to apply estimation methods that aim at detecting the relevant effect structure by using penalization methods. In this article, the effect structure in the Cox frailty model, which is the most widely used model that accounts for heterogeneity in survival data, is investigated. Since in survival models one must account for possible variation of the effect strength over time, the selection of relevant features has to distinguish between several cases: covariates can have time-varying effects, time-constant effects, or be irrelevant. A penalization approach is proposed that is able to distinguish between these types of effects to obtain a sparse representation that includes the relevant effects in a proper form. It is shown in simulations that the method works well. The method is applied to model the time until pregnancy, illustrating that the complexity of the influence structure can be strongly reduced by using the proposed penalty approach.

High-throughput genetic and epigenetic data are often screened for associations with an observed phenotype. For example, one may wish to test hundreds of thousands of genetic variants, or DNA methylation sites, for an association with disease status. These genomic variables can naturally be grouped by the gene they encode, among other criteria. However, standard practice in such applications is independent screening with a universal correction for multiplicity. We propose a Bayesian approach in which the prior probability of an association for a given genomic variable depends on its gene, and the gene-specific probabilities are modeled nonparametrically. This hierarchical model allows for appropriate gene and genome-wide multiplicity adjustments, and can be incorporated into a variety of Bayesian association screening methodologies with negligible increase in computational complexity. We describe an application to screening for differences in DNA methylation between lower grade glioma and glioblastoma multiforme tumor samples from The Cancer Genome Atlas. Software is available via the package BayesianScreening for R: github.com/lockEF/BayesianScreening.

To assess the compliance of air quality regulations, the Environmental Protection Agency (EPA) must know if a site exceeds a pre-specified level. In the case of ozone, the level for compliance is fixed at 75 parts per billion, which is high, but not extreme at all locations. We present a new space-time model for threshold exceedances based on the skew-*t* process. Our method incorporates a random partition to permit long-distance asymptotic independence while allowing for sites that are near one another to be asymptotically dependent, and we incorporate thresholding to allow the tails of the data to speak for themselves. We also introduce a transformed AR(1) time-series to allow for temporal dependence. Finally, our model allows for high-dimensional Bayesian inference that is comparable in computation time to traditional geostatistical methods for large data sets. We apply our method to an ozone analysis for July 2005, and find that our model improves over both Gaussian and max-stable methods in terms of predicting exceedances of a high level.

The problem of inferring the relatedness distribution between two individuals from biallelic marker data is considered. This problem can be cast as an estimation task in a mixture model: at each marker the latent variable is the relatedness state, and the observed variable is the genotype of the two individuals. In this model, only the prior proportions are unknown, and can be obtained via ML estimation using the EM algorithm. When the markers are biallelic and the data unphased, the identifiability of the model is known not to be guaranteed. In this article, model identifiability is investigated in the case of phased data generated from a crossing design, a classical situation in plant genetics. It is shown that identifiability can be guaranteed under some conditions on the crossing design. The adapted ML estimator is implemented in an R package called **Relatedness**. The performance of the ML estimator is evaluated and compared to that of the benchmark moment estimator, both on simulated and real data. Compared to its competitor, the ML estimator is shown to be more robust and to provide more realistic estimates.
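The EM step for a mixture model in which only the prior proportions are unknown can be sketched as follows. This is an illustrative toy version with a made-up emission matrix, not the crossing-design likelihoods used by the Relatedness package: each marker contributes a row of likelihoods P(observed genotype pair | relatedness state), and EM alternates between posterior state probabilities and updated mixing proportions.

```python
import numpy as np

# Toy EM for mixing proportions pi_k of K latent relatedness states.
# The emission probabilities are illustrative only; in the paper they
# are determined by the crossing design.
rng = np.random.default_rng(0)
K, M = 3, 2000                       # states, markers
true_pi = np.array([0.5, 0.3, 0.2])
emis = np.array([[0.7, 0.2, 0.1],    # P(observed class | state),
                 [0.2, 0.5, 0.3],    # one row per latent state
                 [0.1, 0.3, 0.6]])
states = rng.choice(K, size=M, p=true_pi)
obs = np.array([rng.choice(3, p=emis[s]) for s in states])
lik = emis[:, obs].T                 # M x K matrix of P(obs_m | state k)

pi = np.full(K, 1.0 / K)             # uniform start
for _ in range(200):                 # EM iterations
    w = lik * pi                     # E-step: unnormalized posteriors
    w /= w.sum(axis=1, keepdims=True)
    pi = w.mean(axis=0)              # M-step: update proportions

print(np.round(pi, 2))               # close to true_pi for large M
```

Because only the proportions are free, each EM iteration is a closed-form update, which is why identifiability of the emission structure (here an invertible matrix) is the crucial condition.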

Partial differential equations (PDEs) are used to model complex dynamical systems in multiple dimensions, and their parameters often have important scientific interpretations. In some applications, PDE parameters are not constant but can change depending on the values of covariates, a feature that we call *varying coefficients*. We propose a parameter cascading method to estimate varying coefficients in PDE models from noisy data. Our estimates of the varying coefficients are shown to be consistent and asymptotically normally distributed. The performance of our method is evaluated by a simulation study and by an empirical study estimating three varying coefficients in a PDE model arising from LIDAR data.

The electroencephalography (EEG) data created in event-related potential (ERP) experiments have a complex high-dimensional structure. Each stimulus presentation, or trial, generates an ERP waveform which is an instance of functional data. The experiments are made up of sequences of multiple trials, resulting in longitudinal functional data; moreover, responses are recorded at multiple electrodes on the scalp, adding an electrode dimension. Traditional EEG analyses involve multiple simplifications of this structure to increase the signal-to-noise ratio, effectively collapsing the functional and longitudinal components by identifying key features of the ERPs and averaging them across trials. Motivated by an implicit learning paradigm used in autism research in which the functional, longitudinal, and electrode components all have critical interpretations, we propose a multidimensional functional principal components analysis (MD-FPCA) technique which does not collapse any of the dimensions of the ERP data. The proposed decomposition is based on separation of the total variation into subject and subunit level variation which are further decomposed in a two-stage functional principal components analysis. The proposed methodology is shown to be useful for modeling longitudinal trends in the ERP functions, leading to novel insights into the learning patterns of children with Autism Spectrum Disorder (ASD) and their typically developing peers as well as comparisons between the two groups. Finite sample properties of MD-FPCA are further studied via extensive simulations.

Dysphagia is a primary cause of death among patients diagnosed with amyotrophic lateral sclerosis (ALS), and percutaneous endoscopic gastrostomy (PEG) is a procedure to insert a tube into the stomach to assist or replace oral feeding. It is believed that PEG is beneficial and, generally, earlier insertion is preferable to later. However, gathering clinical evidence to support these beliefs on the use and timing of PEG is challenging because controlled clinical trials are not feasible and clinical endpoints are confounded with PEG in observational data. Moreover, the confounders are time-varying and time to PEG insertion may be only partially observed. We show how one can view this problem as an adaptive treatment length policy and propose a new estimator via *g*-computation. We show that our estimator is consistent and asymptotically normal for the causal estimand and explore its finite sample properties in simulation studies. Finally, using more than 10 years of data from the Emory ALS clinic registry, we found no evidence to suggest that earlier PEG reduced 4-year mortality; thus, our results do not support the hypothesis and belief that initiating palliative care earlier extends life, on average. At the same time, we cannot be certain that all important confounding variables are collected and observed to ensure our modeling assumptions are correct, so more work is needed to address these important end-of-life questions for ALS patients.
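The core standardization step behind *g*-computation can be illustrated in the simplest point-treatment case. This sketch is a deliberate simplification of the paper's estimator, which additionally handles time-varying confounding and adaptive treatment lengths; all data and model choices below are simulated assumptions.

```python
import numpy as np

# Point-treatment g-computation sketch: fit an outcome model
# E[Y | A, L], then average predictions with treatment A set to each
# value (the g-formula). Simulated data, true effect = 1.
rng = np.random.default_rng(1)
n = 5000
L = rng.normal(size=n)                                    # confounder
A = (rng.random(n) < 1 / (1 + np.exp(-L))).astype(float)  # confounded treatment
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)

X = np.column_stack([np.ones(n), A, L])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]               # outcome model

X1 = X.copy(); X1[:, 1] = 1.0        # everyone treated
X0 = X.copy(); X0[:, 1] = 0.0        # no one treated
ate = (X1 @ beta - X0 @ beta).mean() # standardized effect estimate

naive = Y[A == 1].mean() - Y[A == 0].mean()  # confounded contrast
print(round(ate, 2), round(naive, 2))
```

The naive treated-versus-untreated contrast absorbs the confounding by L, while the g-formula estimate recovers the true effect under a correctly specified outcome model.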

Brain connectivity analysis is now at the foreground of neuroscience research. A connectivity network is characterized by a graph, where nodes represent neural elements such as neurons and brain regions, and links represent statistical dependence that is often encoded in terms of partial correlation. Such a graph is inferred from the matrix-valued neuroimaging data such as electroencephalography and functional magnetic resonance imaging. There have been a good number of successful proposals for sparse precision matrix estimation under normal or matrix normal distribution; however, this family of solutions does not offer a direct statistical significance quantification for the estimated links. In this article, we adopt a matrix normal distribution framework and formulate the brain connectivity analysis as a precision matrix hypothesis testing problem. Based on the separable spatial-temporal dependence structure, we develop oracle and data-driven procedures to test both the global hypothesis that all spatial locations are conditionally independent, and simultaneous tests for identifying conditional dependent spatial locations with false discovery rate control. Our theoretical results show that the data-driven procedures perform asymptotically as well as the oracle procedures and enjoy certain optimality properties. The empirical finite-sample performance of the proposed tests is studied via intensive simulations, and the new tests are applied to a real electroencephalography data analysis.

The assessment of patients’ functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of the predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverage, smaller bias, and shorter interval estimates. The proposed method is further illustrated using a real data set.

An important goal of censored quantile regression is to provide reliable predictions of survival quantiles, which are often reported in practice to offer robust and comprehensive biomedical summaries. However, formal methods for evaluating and comparing *working* quantile regression models in terms of their performance in predicting survival quantiles have been lacking, especially when the working models are subject to model mis-specification. In this article, we propose a sensible and rigorous framework to fill this gap. We introduce and justify a predictive performance measure defined based on the check loss function. We derive estimators of the proposed predictive performance measure and study their distributional properties and the corresponding inference procedures. More importantly, we develop model comparison procedures that enable thorough evaluations of model predictive performance among nested or non-nested models. Our proposals properly accommodate random censoring to the survival outcome and the realistic complication of model mis-specification, and thus are generally applicable. Extensive simulations and a real data example demonstrate satisfactory performance of the proposed methods in real-life settings.
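The check loss referenced above has a simple closed form, rho_tau(u) = u(tau - 1{u < 0}), and is minimized in expectation by the true tau-th quantile. The sketch below illustrates only this uncensored building block; the paper's censoring adjustments are omitted, and the exponential example is an assumption for illustration.

```python
import numpy as np

# Check (pinball) loss: lower average loss indicates a better
# prediction of the tau-th quantile.
def check_loss(y, pred, tau):
    u = y - pred
    return np.mean(u * (tau - (u < 0)))

rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=100000)
tau = 0.5
true_median = 2.0 * np.log(2)        # median of an Exp(scale=2) variable

good = check_loss(y, np.full_like(y, true_median), tau)
bad = check_loss(y, np.full_like(y, 5.0), tau)   # mis-specified predictor
print(good < bad)
```

Because the loss is a proper scoring rule for quantiles, comparing average check loss between two working models gives a direct predictive-performance comparison, which is the idea the paper formalizes under censoring.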

This article investigates a generalized semiparametric varying-coefficient model for longitudinal data that can flexibly model three types of covariate effects: time-constant effects, time-varying effects, and covariate-varying effects. Different link functions can be selected to provide a rich family of models for longitudinal data. The model assumes that the time-varying effects are unspecified functions of time and the covariate-varying effects are parametric functions of an exposure variable specified up to a finite number of unknown parameters. The estimation procedure is developed using local linear smoothing and profile weighted least squares estimation techniques. Hypothesis testing procedures are developed to test the parametric functions of the covariate-varying effects. The asymptotic distributions of the proposed estimators are established. A working formula for bandwidth selection is discussed and examined through simulations. Our simulation study shows that the proposed methods have satisfactory finite sample performance. The proposed methods are applied to the ACTG 244 clinical trial of HIV infected patients being treated with Zidovudine to examine the effects of antiretroviral treatment switching before and after HIV develops the T215Y/F drug resistance mutation. Our analysis shows benefits of treatment switching to the combination therapies as compared to continuing with ZDV monotherapy before and after developing the 215-mutation.

Prognostic survival models are commonly evaluated in terms of both their calibration and their discrimination. Comparing observed and predicted survival curves can assess calibration, while discrimination is typically summarized through comparison of the properties of “cases” or subjects who experience an event, and the properties of “controls” represented by event-free individuals. For binary data, discrimination is characterized either by using the relative ranks of cases and controls and a receiver operating characteristic (ROC) curve, or by summarizing the magnitude of risk placed on cases and controls through calculation of the discrimination slope (DS). In this article, we propose a risk-based measure of time-varying discrimination that generalizes the discrimination slope to allow use with incident events and hazard models. We refer to the new measure as the hazard discrimination summary (HDS) since it compares the relative risk among incident cases to their associated dynamic risk set controls. We introduce both a model-based estimation procedure that adopts the Cox model, and an alternative approach that locally relaxes the proportional hazards assumption. We illustrate the proposed methods using both a benchmark survival data set, and an oncology study where primary interest is in the time-varying performance of candidate biomarkers.
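For binary outcomes, the discrimination slope that HDS generalizes is just the difference in mean predicted risk between cases and controls. The toy calculation below uses simulated, well-calibrated risks purely for illustration.

```python
import numpy as np

# Discrimination slope (DS): mean predicted risk among cases minus
# mean predicted risk among controls.
def discrimination_slope(risk, event):
    risk, event = np.asarray(risk), np.asarray(event, bool)
    return risk[event].mean() - risk[~event].mean()

rng = np.random.default_rng(3)
risk = rng.random(10000)             # uniform predicted risks
event = rng.random(10000) < risk     # outcomes generated from the risks
ds = discrimination_slope(risk, event)
print(round(ds, 2))
```

A DS of zero means the score places the same average risk on cases and controls (no discrimination); the HDS measure in the article replaces the static case/control split with incident cases and their dynamic risk-set controls.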

Prostate cancer patients are closely followed after the initial therapy and salvage treatment may be prescribed to prevent or delay cancer recurrence. The salvage treatment decision is usually made dynamically based on the patient's evolving history of disease status and other time-dependent clinical covariates. A multi-center prostate cancer observational study has provided us data on longitudinal prostate specific antigen (PSA) measurements, time-varying salvage treatment, and cancer recurrence time. These data enable us to estimate the best dynamic regime of salvage treatment, while accounting for the complicated confounding of time-varying covariates present in the data. A Random Forest based method is used to model the probability of regime adherence and inverse probability weights are used to account for the complexity of selection bias in regime adherence. The optimal regime is then identified by the largest restricted mean survival time. We conduct simulation studies with different PSA trends to mimic both simple and complex regime adherence mechanisms. The proposed method can efficiently accommodate complex and possibly unknown adherence mechanisms, and it is robust to cases where the proportional hazards assumption is violated. We apply the method to data collected from the observational study and estimate the best salvage treatment regime in managing the risk of prostate cancer recurrence.

It is standard practice for covariates to enter a parametric model through a single distributional parameter of interest, for example, the scale parameter in many standard survival models. Indeed, the well-known proportional hazards model is of this kind. In this article, we discuss a more general approach whereby covariates enter the model through *more than one* distributional parameter simultaneously (e.g., scale *and* shape parameters). We refer to this practice as “multi-parameter regression” (MPR) modeling and explore its use in a survival analysis context. We find that multi-parameter regression leads to more flexible models which can offer greater insight into the underlying data generating process. To illustrate the concept, we consider the two-parameter Weibull model which leads to time-dependent hazard ratios, thus relaxing the typical proportional hazards assumption and motivating a new test of proportionality. A novel variable selection strategy is introduced for such multi-parameter regression models. It accounts for the correlation arising between the estimated regression coefficients in two or more linear predictors—a feature which has not been considered by other authors in similar settings. The methods discussed have been implemented in the mpr package in R.
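To see why covariates entering the shape parameter break proportionality, recall the Weibull hazard h(t) = lambda * gamma * t**(gamma - 1): if two groups share gamma, their hazard ratio is constant in t, but if their shapes differ the ratio varies with time. The parameter values below are illustrative, not taken from the paper.

```python
import numpy as np

# Weibull hazard with scale lambda and shape gamma.
def weibull_hazard(t, lam, gam):
    return lam * gam * t ** (gam - 1)

t = np.linspace(0.1, 5, 50)
# Group 1 differs from group 0 in both scale and shape (MPR idea):
hr = weibull_hazard(t, lam=1.2, gam=1.5) / weibull_hazard(t, lam=1.0, gam=1.0)

# With equal shapes the ratio would be the constant 1.2; here it
# grows like t**0.5, i.e., the hazard ratio is time-dependent.
print(round(hr[0], 3), round(hr[-1], 3))
```

This is the mechanism behind the proportionality test mentioned in the abstract: testing whether the shape regression coefficients are zero tests whether the hazard ratio is constant.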

Researchers estimating causal effects are increasingly challenged with decisions on how to best control for a potentially high-dimensional set of confounders. Typically, a single propensity score model is chosen and used to adjust for confounding, while the uncertainty surrounding which covariates to include into the propensity score model is often ignored, and failure to include even one important confounder will result in bias. We propose a practical and generalizable approach that overcomes the limitations described above through the use of model averaging. We develop and evaluate this approach in the context of double robust estimation. More specifically, we introduce the model averaged double robust (MA-DR) estimators, which account for model uncertainty in both the propensity score and outcome model through the use of model averaging. The MA-DR estimators are defined as weighted averages of double robust estimators, where each double robust estimator corresponds to a specific choice of the outcome model and the propensity score model. The MA-DR estimators extend the desirable double robustness property by achieving consistency under the much weaker assumption that either the true propensity score model or the true outcome model be within a specified, possibly large, class of models. Using simulation studies, we also assessed small sample properties, and found that MA-DR estimators can reduce mean squared error substantially, particularly when the set of potential confounders is large relative to the sample size. We apply the methodology to estimate the average causal effect of temozolomide plus radiotherapy versus radiotherapy alone on one-year survival in a cohort of 1887 Medicare enrollees who were diagnosed with glioblastoma between June 2005 and December 2009.
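The building block of each MA-DR component is a standard double robust (AIPW) estimator, and the model average is a weighted combination of such estimators. The sketch below uses equal weights purely for illustration; the paper derives data-driven weights, and all simulated values are assumptions.

```python
import numpy as np

# AIPW double robust estimator of E[Y(1)] - E[Y(0)] given a propensity
# score ps and outcome-model predictions mu1, mu0.
def aipw(y, a, ps, mu1, mu0):
    t1 = a * (y - mu1) / ps + mu1
    t0 = (1 - a) * (y - mu0) / (1 - ps) + mu0
    return np.mean(t1 - t0)

rng = np.random.default_rng(4)
n = 20000
l = rng.normal(size=n)
ps_true = 1 / (1 + np.exp(-l))
a = (rng.random(n) < ps_true).astype(float)
y = 2.0 * a + l + rng.normal(size=n)             # true effect = 2

mu1, mu0 = 2.0 + l, l                            # correct outcome model
est_good_ps = aipw(y, a, ps_true, mu1, mu0)
est_bad_ps = aipw(y, a, np.full(n, 0.5), mu1, mu0)  # mis-specified PS
ma_dr = 0.5 * est_good_ps + 0.5 * est_bad_ps        # toy model average
print(round(ma_dr, 2))
```

Both components are consistent here because the outcome model is correct even when the propensity score is wrong, which is exactly the robustness the averaged estimator inherits across a class of candidate models.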

We study threshold regression models that allow the relationship between the outcome and a covariate of interest to change across a threshold value in the covariate. In particular, we focus on continuous threshold models, which experience no jump at the threshold. Continuous threshold regression functions can provide a useful summary of the association between outcome and the covariate of interest, because they offer a balance between flexibility and simplicity. Motivated by collaborative works in studying immune response biomarkers of transmission of infectious diseases, we study estimation of continuous threshold models in this article with particular attention to inference under model misspecification. We derive the limiting distribution of the maximum likelihood estimator, and propose both Wald and test-inversion confidence intervals. We evaluate finite sample performance of our methods, compare them with bootstrap confidence intervals, and provide guidelines for practitioners to choose the most appropriate method in real data analysis. We illustrate the application of our methods with examples from the HIV-1 immune correlates studies.
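A continuous threshold regression function can be written as E[Y|x] = b0 + b1*x + b2*(x - e)_+, which is piecewise linear but has no jump at the threshold e. One simple way to fit it, sketched below on simulated data (not the paper's MLE machinery), is to profile out the linear coefficients and grid-search the threshold.

```python
import numpy as np

# Profile least squares for a continuous (broken-stick) threshold model.
rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(0, 10, n)
# True model: slope 0.5 below the threshold 6, slope 2.5 above it.
y = 1.0 + 0.5 * x + 2.0 * np.maximum(x - 6.0, 0) + rng.normal(0, 0.5, n)

def profile_rss(e):
    # For fixed threshold e, the model is linear in its coefficients.
    X = np.column_stack([np.ones(n), x, np.maximum(x - e, 0)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sum(resid ** 2)

grid = np.linspace(1, 9, 161)
e_hat = grid[np.argmin([profile_rss(e) for e in grid])]
print(round(e_hat, 2))
```

The non-standard limiting behavior of such estimators under misspecification is what motivates the Wald and test-inversion intervals studied in the article.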

Understanding the complex interplay among protein coding genes and regulatory elements requires rigorous interrogation with analytic tools designed for discerning the relative contributions of overlapping genomic regions. To this aim, we offer a novel application of Bayesian variable selection (BVS) for classifying genomic class level associations using existing large meta-analysis summary level resources. This approach is applied using the expectation maximization variable selection (EMVS) algorithm to typed and imputed SNPs across 502 protein coding genes (PCGs) and 220 long intergenic non-coding RNAs (lncRNAs) that overlap 45 known loci for coronary artery disease (CAD) using publicly available Global Lipids Genetics Consortium (GLGC) (Teslovich et al., 2010; Willer et al., 2013) meta-analysis summary statistics for low-density lipoprotein cholesterol (LDL-C). The analysis reveals 33 PCGs and three lncRNAs across 11 loci with 50% posterior probabilities for inclusion in an additive model of association. The findings are consistent with previous reports, while providing some new insight into the architecture of LDL-cholesterol to be investigated further. As genomic taxonomies continue to evolve, additional classes such as enhancer elements and splicing regions, can easily be layered into the proposed analysis framework. Moreover, application of this approach to alternative publicly available meta-analysis resources, or more generally as a post-analytic strategy to further interrogate regions that are identified through single point analysis, is straightforward. All coding examples are implemented in R version 3.2.1 and provided as supplemental material.

Quantitative trait locus analysis has been used as an important tool to identify markers where the phenotype or quantitative trait is linked with the genotype. Most existing tests for single locus association with quantitative traits aim at the detection of the mean differences across genotypic groups. However, recent research has revealed functional genetic loci that affect the variance of traits, known as variability-controlling quantitative trait loci. In addition, it has been suggested that many genotypes have both mean and variance effects, while the mean effects or variance effects alone may not be strong enough to be detected. Existing methods accounting for unequal variances include Levene's test, the Lepage test, and the *D*-test, but these suffer from a lack of robustness or a lack of power. We propose a semiparametric model and a novel pairwise conditional likelihood ratio test. Specifically, the semiparametric model is designed to identify the combined differences in higher moments among genotypic groups. The pairwise likelihood is constructed based on a conditioning procedure, where the unknown reference distribution is eliminated. We show that the proposed pairwise likelihood ratio test has a simple asymptotic chi-square distribution, which does not require permutation or bootstrap procedures. Simulation studies show that the proposed test controls Type I error well and has competitive power in identifying the differences across genotypic groups. In addition, the proposed test has certain robustness to model mis-specifications. The proposed test is illustrated by an example of identifying both mean and variance effects in body mass index using the Framingham Heart Study data.
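The benchmark variance-heterogeneity test mentioned above, Levene's test, is readily available in scipy and can be sketched on simulated genotype groups whose means are equal but whose variances differ. The group sizes and variances below are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Levene's test for unequal variances across three genotype groups
# with a pure variance effect (equal means).
rng = np.random.default_rng(6)
aa = rng.normal(0.0, 1.0, 500)   # genotype AA
ab = rng.normal(0.0, 1.5, 500)   # genotype Aa: larger variance
bb = rng.normal(0.0, 2.0, 500)   # genotype aa: largest variance

stat, p = stats.levene(aa, ab, bb)
print(p < 0.05)                  # variance effect detected
```

A mean-only test (e.g., one-way ANOVA) would have essentially no power here, which is the gap the proposed pairwise conditional likelihood ratio test targets by capturing differences in higher moments jointly.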

Genetic risk prediction is an important component of individualized medicine, but prediction accuracies remain low for many complex diseases. A fundamental limitation is the sample sizes of the studies on which the prediction algorithms are trained. One way to increase the effective sample size is to integrate information from previously existing studies. However, it can be difficult to find existing data that examine the target disease of interest, especially if that disease is rare or poorly studied. Furthermore, individual-level genotype data from these auxiliary studies are typically difficult to obtain. This article proposes a new approach to integrative genetic risk prediction of complex diseases with binary phenotypes. It accommodates possible heterogeneity in the genetic etiologies of the target and auxiliary diseases using a tuning parameter-free non-parametric empirical Bayes procedure, and can be trained using only auxiliary summary statistics. Simulation studies show that the proposed method can provide superior predictive accuracy relative to non-integrative as well as integrative classifiers. The method is applied to a recent study of pediatric autoimmune diseases, where it substantially reduces prediction error for certain target/auxiliary disease combinations. The proposed method is implemented in the R package ssa.

Alternating logistic regressions is an estimating equations procedure used to model marginal means of correlated binary outcomes while simultaneously specifying a within-cluster association model for log odds ratios of outcome pairs. A recent generalization of alternating logistic regressions, known as orthogonalized residuals, is extended to incorporate finite sample adjustments in the estimation of the log odds ratio model parameters when there is a moderately small number of clusters. Bias adjustments are made both in the sandwich variance estimators and in the estimating equations for the association parameters. The proposed methods are demonstrated in a repeated cross-sectional cluster trial to reduce underage drinking in the United States, and in an analysis of dental caries incidence in a cluster randomized trial of 30 aboriginal communities in the Northern Territory of Australia. A simulation study demonstrates improved performance, with respect to bias and coverage, of the proposed estimators relative to those based on the uncorrected orthogonalized residuals procedure.

For a fallback randomized clinical trial design with a marker, Choai and Matsui (2015, *Biometrics* 71, 25–32) estimate the bias of the estimator of the treatment effect in the marker-positive subgroup conditional on the treatment effect not being statistically significant in the overall population. This is used to construct and examine conditionally bias-corrected estimators of the treatment effect for the marker-positive subgroup. We argue that it may not be appropriate to correct for conditional bias in this setting. Instead, we consider the unconditional bias of estimators of the treatment effect for marker-positive patients.

In precision medicine, a patient is treated with targeted therapies that are predicted to be effective based on the patient's baseline characteristics such as biomarker profiles. Oftentimes, patient subgroups are unknown and must be learned through inference using observed data. We present SCUBA, a Subgroup ClUster-based Bayesian Adaptive design aiming to fulfill two simultaneous goals in a clinical trial: 1) to enrich the allocation of patients in each subgroup to their desirable precision treatments, and 2) to report multiple subgroup-treatment pairs (STPs). Using random partitions and semiparametric Bayesian models, SCUBA provides coherent and probabilistic assessment of potential patient subgroups and their associated targeted therapies. Each STP can then be used for future confirmatory studies for regulatory approval. We evaluate SCUBA through extensive simulation studies and present an application to an innovative clinical trial in gastroesophageal cancer.

Conventional distance sampling (CDS) methods assume that animals are uniformly distributed in the vicinity of lines or points. But when animals move in response to observers before detection, or when lines or points are not located randomly, this assumption may fail. By formulating distance sampling models as survival models, we show that using time to first detection in addition to perpendicular distance (line transect surveys) or radial distance (point transect surveys) allows estimation of detection probability, and hence density, when animal distribution in the vicinity of lines or points is not uniform and is unknown. We also show that times to detection can provide information about failure of the CDS assumption that detection probability is 1 at distance zero. We obtain a maximum likelihood estimator of line transect survey detection probability and effective strip half-width using times to detection, and we investigate its properties by simulation in situations where animals are nonuniformly distributed and their distribution is unknown. The estimator is found to perform well when detection probability at distance zero is 1. It allows unbiased estimates of density to be obtained in this case from surveys in which there has been responsive movement prior to animals coming within detectable range. When responsive movement continues within detectable range, estimates may be biased but are likely less biased than estimates from methods that assume no responsive movement. We illustrate by estimating primate density from a line transect survey in which animals are known to avoid the transect line, and a shipboard survey of dolphins that are attracted to it.
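The conventional line transect baseline that the paper extends can be sketched with the standard half-normal detection function g(x) = exp(-x²/(2s²)), whose integral gives the effective strip half-width. The simulation below assumes uniform animal distribution (the CDS assumption) and a half-normal detection scale chosen for illustration; the time-to-detection extension is not shown.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Half-normal line transect: MLE of the detection scale s from observed
# perpendicular distances, and the implied effective strip half-width
# mu = s * sqrt(pi / 2).
rng = np.random.default_rng(7)
s_true = 20.0
x = rng.uniform(0, 100, 50000)                       # uniform distances
keep = rng.random(50000) < np.exp(-x**2 / (2 * s_true**2))
d = x[keep]                                          # detected distances

def negloglik(s):
    # density of detected distances: f(x) = g(x) / mu on [0, inf)
    mu = s * np.sqrt(np.pi / 2)
    return -np.sum(-d**2 / (2 * s**2) - np.log(mu))

s_hat = minimize_scalar(negloglik, bounds=(1, 100), method="bounded").x
esw = s_hat * np.sqrt(np.pi / 2)                     # effective half-width
print(round(s_hat, 1), round(esw, 1))
```

When the uniformity assumption fails, distances alone no longer identify s; the survival-model formulation in the paper recovers identifiability by adding times to first detection.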

In randomized studies involving severely ill patients, functional outcomes are often unobserved due to missed clinic visits, premature withdrawal, or death. It is well known that if these unobserved functional outcomes are not handled properly, biased treatment comparisons can be produced. In this article, we propose a procedure for comparing treatments that is based on a composite endpoint that combines information on both the functional outcome and survival. We further propose a missing data imputation scheme and sensitivity analysis strategy to handle the unobserved functional outcomes not due to death. Illustrations of the proposed method are given by analyzing data from a recent non-small cell lung cancer clinical trial and a recent trial of sedation interruption among mechanically ventilated patients.

Survival model construction can be guided by goodness-of-fit techniques as well as measures of predictive strength. Here, we aim to bring together these distinct techniques within the context of a single framework. The goal is to best characterize and code the effects of the variables, in particular time dependencies, when taken either singly or in combination with other related covariates. Simple graphical techniques can provide an immediate visual indication as to the goodness-of-fit but, in cases of departure from model assumptions, will point in the direction of a more involved and richer alternative model. These techniques appear to be intuitive. This intuition is backed up by formal theorems that underlie the process of building richer models from simpler ones. Measures of predictive strength are used in conjunction with these goodness-of-fit techniques and, again, formal theorems show that these measures can be used to help identify models closest to the unknown non-proportional hazards mechanism that we can suppose generates the observations. Illustrations from studies in breast cancer show how these tools can be of help in guiding the practical problem of efficient model construction for survival data.

We propose a subgroup identification approach for inferring optimal and interpretable personalized treatment rules with high-dimensional covariates. Our approach is based on a two-step greedy tree algorithm to pursue signals in a high-dimensional space. In the first step, we transform the treatment selection problem into a weighted classification problem that can utilize tree-based methods. In the second step, we adopt a newly proposed tree-based method, known as reinforcement learning trees, to detect features involved in the optimal treatment rules and to construct binary splitting rules. The method is further extended to right censored survival data by using the accelerated failure time model and introducing double weighting to the classification trees. The performance of the proposed method is demonstrated via simulation studies, as well as analyses of the Cancer Cell Line Encyclopedia (CCLE) data and the Tamoxifen breast cancer data.

In many biomedical studies that involve correlated data, an outcome is often repeatedly measured for each individual subject along with the number of these measurements, which is also treated as an observed outcome. This type of data has been referred to as multivariate random length data by Barnhart and Sampson (1995). A common approach to handling such data is to jointly model the multiple measurements and the random length. In the previous literature, a key assumption is multivariate normality of the multiple measurements. Motivated by a reproductive study, we propose a new copula-based joint model that relaxes the normality assumption. Specifically, we adopt the Clayton–Oakes model for the multiple measurements, with flexible marginal distributions specified as semi-parametric transformation models. The random length is modeled via a generalized linear model. We develop an approximate EM algorithm to derive parameter estimates; standard errors are obtained through bootstrap procedures, and the finite-sample performance of the proposed method is investigated in simulation studies. We apply our method to the Mount Sinai Study of Women Office Workers (MSSWOW), where women were prospectively followed for 1 year for studying fertility.
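For intuition about the Clayton–Oakes dependence structure adopted above, here is a minimal sketch (not the authors' estimation procedure): sampling from a bivariate Clayton copula by conditional inversion and checking the known identity Kendall's τ = θ/(θ + 2) for this family. Uniform marginals are used for simplicity in place of the semi-parametric transformation marginals.

```python
import numpy as np
from scipy.stats import kendalltau

def rclayton(n, theta, rng):
    """Sample from a bivariate Clayton copula by conditional inversion:
    given U = u and W ~ U(0,1), solve the conditional cdf of V for W."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v

rng = np.random.default_rng(1)
theta = 2.0
u, v = rclayton(50_000, theta, rng)
tau_hat, _ = kendalltau(u, v)
tau_true = theta / (theta + 2.0)   # Kendall's tau for the Clayton family
```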

In a sensitivity analysis in an observational study with a binary outcome, is it better to use all of the data or to focus on subgroups that are expected to experience the largest treatment effects? The answer depends on features of the data that may be difficult to anticipate, a trade-off between unknown effect sizes and known sample sizes. We propose a sensitivity analysis for an adaptive test similar to the Mantel–Haenszel test. The adaptive test performs two highly correlated analyses, one a focused analysis using a subgroup, the other a combined analysis using all of the data, correcting for multiple testing using the joint distribution of the two test statistics. Because the two component tests are highly correlated, this correction for multiple testing is small compared with, for instance, the Bonferroni inequality. The test has the maximum design sensitivity of its two component tests. A simulation evaluates the power of a sensitivity analysis using the adaptive test. Two examples are presented. An R package, sensitivity2x2xk, implements the procedure.
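The point that a joint-distribution correction beats Bonferroni for highly correlated tests can be illustrated numerically. The sketch below (assuming bivariate normality of the two statistics and an illustrative correlation of 0.9, not the article's actual joint distribution) finds the critical value for max(Z1, Z2) at overall level 0.05 and compares it with the Bonferroni value.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

rho = 0.9   # assumed correlation between the focused and combined statistics

def size(c, rho):
    """P(max(Z1, Z2) > c) for a standard bivariate normal with correlation rho."""
    joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return 1.0 - joint.cdf([c, c])

# critical value from the joint distribution vs. the Bonferroni critical value
c_joint = brentq(lambda c: size(c, rho) - 0.05, 1.0, 3.0)
c_bonf = norm.ppf(1 - 0.05 / 2)
```

Because ρ is close to 1, c_joint sits only slightly above the unadjusted one-sided 0.05 critical value of 1.645, well below the Bonferroni value of 1.96.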

Integration of genomic data from multiple platforms has the capability to increase precision, accuracy, and statistical power in the identification of prognostic biomarkers. A fundamental problem faced in many multi-platform studies is unbalanced sample sizes due to the inability to obtain measurements from all the platforms for all the patients in the study. We have developed a novel Bayesian approach that integrates multi-regression models to identify a small set of biomarkers that can accurately predict time-to-event outcomes. This method fully exploits the amount of available information across platforms and does not exclude any of the subjects from the analysis. Through simulations, we demonstrate the utility of our method and compare its performance to that of methods that do not borrow information across regression models. Motivated by The Cancer Genome Atlas kidney renal cell carcinoma dataset, our methodology provides novel insights missed by non-integrative models.

Personalized cancer therapy requires clinical trials with smaller sample sizes than trials involving unselected populations that have not been divided into biomarker subgroups. The use of exponential survival modeling for survival endpoints can gain up to 35% efficiency or save 28% of the required sample size (Miller, 1983), making personalized therapy trials more feasible. However, exponential survival modeling has not been fully accepted in cancer research practice because of uncertainty about whether the exponential assumption holds. We propose a test for identifying violations of the exponential assumption using a reduced piecewise exponential approach. Compared with an alternative goodness-of-fit test, which suffers from inflation of the type I error rate under various censoring mechanisms, the proposed test maintains the correct type I error rate. We conduct power analyses using simulated data based on different types of cancer survival distributions in the SEER registry database, and demonstrate the implementation of this approach in existing cancer clinical trials.
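The basic idea of checking exponentiality against a piecewise exponential alternative can be sketched as a likelihood ratio test on censored data: fit a single rate and interval-specific rates and compare. This is a simplified illustration, not the authors' exact reduced procedure; the cut points and simulated data are arbitrary.

```python
import numpy as np
from scipy.stats import chi2

def piecewise_exposure(times, events, cuts):
    """Event counts d_k and total exposure T_k in the intervals defined by cuts."""
    edges = np.concatenate(([0.0], cuts, [np.inf]))
    d = np.zeros(len(edges) - 1)
    T = np.zeros(len(edges) - 1)
    for t, e in zip(times, events):
        for k in range(len(edges) - 1):
            lo, hi = edges[k], edges[k + 1]
            if t > lo:
                T[k] += min(t, hi) - lo          # exposure accrued in interval k
                if e and lo < t <= hi:
                    d[k] += 1                    # event occurred in interval k
    return d, T

def exp_vs_piecewise_lrt(times, events, cuts):
    """LRT of a single exponential rate against interval-specific rates."""
    d, T = piecewise_exposure(times, events, cuts)
    D, Ttot = d.sum(), T.sum()
    ll_exp = D * np.log(D / Ttot) - D            # profile log-likelihood, one rate
    keep = d > 0
    ll_pw = np.sum(d[keep] * np.log(d[keep] / T[keep])) - D
    stat = 2.0 * (ll_pw - ll_exp)
    return stat, chi2.sf(stat, df=len(d) - 1)

rng = np.random.default_rng(2)
t_exp = rng.exponential(1.0, 500)
cens = rng.uniform(0.0, 4.0, 500)
times, events = np.minimum(t_exp, cens), (t_exp <= cens).astype(int)
stat, p = exp_vs_piecewise_lrt(times, events, cuts=[0.5, 1.0, 2.0])
```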

It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroups. Although trial-and-error and one-size-fits-all approaches to treatment selection remain common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less focus has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are nonlinear. Furthermore, the interaction-based test is scale dependent and may fail to capture markers useful for predicting individualized treatment differences. In this article, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment differences. Simulation studies show that the proposed KM-based score test is more powerful than the Wald test when there are nonlinear effects among the predictors and when the outcome is binary with a nonlinear link function. Furthermore, when there is high correlation among the predictors and the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials.
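A kernel machine score test of "no marker effect" can be sketched as follows, assuming a Gaussian kernel, an intercept-only null model, and a Satterthwaite chi-square approximation; this is a common construction for such tests, and the authors' test for treatment–marker interactions will differ in its details.

```python
import numpy as np
from scipy.stats import chi2

def gaussian_kernel(X, gamma):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def km_score_test(y, X, gamma=0.5):
    """Score test of H0: h = 0 in y = mu + h(X) + eps, with h in the RKHS of a
    Gaussian kernel; p-value via a moment-matched a * chi2_d approximation."""
    n = len(y)
    K = gaussian_kernel(X, gamma)
    r = y - y.mean()
    sigma2 = r @ r / (n - 1)
    Q = r @ K @ r / sigma2                     # score statistic
    P = np.eye(n) - np.ones((n, n)) / n        # projection removing the intercept
    PKP = P @ K @ P
    m = np.trace(PKP)                          # null mean of Q
    v = 2.0 * np.trace(PKP @ PKP)              # null variance of Q
    a, d = v / (2.0 * m), 2.0 * m ** 2 / v     # Satterthwaite scale and df
    return chi2.sf(Q / a, df=d)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
p_null = km_score_test(rng.normal(size=100), X)                        # no effect
p_alt = km_score_test(X[:, 0] ** 2 + 0.3 * rng.normal(size=100), X)    # nonlinear effect
```

The nonlinear quadratic signal is invisible to a linear Wald test on main effects but is picked up strongly by the kernel statistic.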

Many new experimental treatments benefit only a subset of the population. Identifying the baseline covariate profiles of patients who benefit from such a treatment, rather than determining whether or not the treatment has a population-level effect, can substantially lessen the risk in undertaking a clinical trial and expose fewer patients to treatments that do not benefit them. The standard analyses for identifying patient subgroups that benefit from an experimental treatment either do not account for multiplicity, or focus on testing for the presence of treatment–covariate interactions rather than the resulting individualized treatment effects. We propose a Bayesian *credible subgroups* method to identify two bounding subgroups for the benefiting subgroup: one for which it is likely that all members simultaneously have a treatment effect exceeding a specified threshold, and another for which it is likely that no members do. We examine frequentist properties of the credible subgroups method via simulations and illustrate the approach using data from an Alzheimer's disease treatment trial. We conclude with a discussion of the advantages and limitations of this approach to identifying patients for whom the treatment is beneficial.

Identification of novel biomarkers for risk prediction is important for disease prevention and optimal treatment selection. However, studies aiming to discover which biomarkers are useful for risk prediction often require the use of stored biological samples from large assembled cohorts, and thus the depletion of a finite and precious resource. To make efficient use of such stored samples, two-phase sampling designs are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare. Existing methods for analyzing data from two-phase studies focus primarily on single-marker analysis or on fitting the Cox regression model to combine information from multiple markers. However, the Cox model may not fit the data well, and under model misspecification, the composite score derived from the Cox model may not perform well in predicting the outcome. Under a general two-phase stratified cohort sampling design, we present a novel approach to combining multiple markers to optimize prediction by fitting a flexible nonparametric transformation model. Using inverse probability weighting to account for the outcome-dependent sampling, we propose to estimate the model parameters by maximizing an objective function which can be interpreted as a weighted C-statistic for survival outcomes. Regardless of model adequacy, the proposed procedure yields a sensible composite risk score for prediction. A major obstacle to making inference in two-phase studies is the correlation induced by the finite population sampling, which prevents standard inference procedures such as the bootstrap from being used for variance estimation. We propose a resampling procedure to derive valid confidence intervals for the model parameters and the C-statistic accuracy measure.
We illustrate the new methods with simulation studies and an analysis of a two-phase study of high-density lipoprotein cholesterol (HDL-C) subtypes for predicting the risk of coronary heart disease.
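The weighted C-statistic objective can be illustrated on a toy two-phase design: each pair is weighted by the product of inverse sampling probabilities of its members. The sketch below uses an uncensored continuous outcome for simplicity (the article's version handles survival outcomes and a stratified design); all data are simulated.

```python
import numpy as np

def weighted_cstat(score, y, w):
    """IPW concordance P(score_i > score_j | y_i > y_j), pair weight w_i * w_j."""
    ygt = y[:, None] > y[None, :]
    sgt = score[:, None] > score[None, :]
    W = np.outer(w, w)
    return (W * ygt * sgt).sum() / (W * ygt).sum()

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)                 # the marker / composite score
y = x + rng.normal(size=n)             # the outcome
# outcome-dependent phase-two sampling: always keep large outcomes,
# subsample the rest at 20%
p = np.where(y > 1.0, 1.0, 0.2)
sel = rng.uniform(size=n) < p

c_full = weighted_cstat(x, y, np.ones(n))          # full-cohort benchmark
c_ipw = weighted_cstat(x[sel], y[sel], 1.0 / p[sel])  # phase-two IPW estimate
```

Because selection is independent across subjects, weighting each concordant/comparable pair by 1/(p_i p_j) makes both the numerator and denominator unbiased for their full-cohort counterparts.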

An intermediate response measure that accurately predicts efficacy in a new setting can reduce trial cost and time to product licensure. In this article, we define a *trial level general surrogate*, which is an intermediate response that can be used to accurately predict efficacy in a new setting. Methods for evaluating general surrogates have been developed previously. Many methods in the literature use trial level intermediate responses for prediction. However, all existing methods focus on surrogate evaluation and prediction in new settings, rather than comparison of candidate general surrogates, and few formalize the use of cross-validation to quantify the expected prediction error. Our proposed method uses Bayesian non-parametric modeling and cross-validation to estimate the absolute prediction error for use in evaluating and comparing candidate trial level general surrogates. Simulations show that our method performs well across a variety of scenarios. We use our method to evaluate and to compare candidate trial level general surrogates in several multi-national trials of a pentavalent rotavirus vaccine. We identify at least one immune measure that has potential value as a trial level general surrogate and use it to predict efficacy in a new trial where the clinical outcome was not measured.

In this article, we develop new methods for estimating average treatment effects in observational studies, in settings with more than two treatment levels, assuming unconfoundedness given pretreatment variables. We emphasize propensity score subclassification and matching methods which have been among the most popular methods in the binary treatment literature. Whereas the literature has suggested that these particular propensity-based methods do not naturally extend to the multi-level treatment case, we show, using the concept of weak unconfoundedness and the notion of the generalized propensity score, that adjusting for a scalar function of the pretreatment variables removes all biases associated with observed pretreatment variables. We apply the proposed methods to an analysis of the effect of treatments for fibromyalgia. We also carry out a simulation study to assess the finite sample performance of the methods relative to previously proposed methods.
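Subclassification on the generalized propensity score can be sketched for three treatment levels. In the toy example below the GPS r(t, X) is known by construction (in practice it would be estimated), E[Y(t)] is estimated within quintiles of r(t, X), and all numbers are illustrative; the contrast with the naive group-mean difference shows the bias removal.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
# three treatment levels; the generalized propensity score r(t, X)
logits = np.stack([np.zeros(n), 0.8 * x, -0.8 * x])
gps = np.exp(logits) / np.exp(logits).sum(axis=0)
u = rng.uniform(size=n)
w = (u > gps[0]).astype(int) + (u > gps[0] + gps[1]).astype(int)
y = 1.0 * w + x + rng.normal(size=n)          # true E[Y(t)] = t

def subclass_mean(y, w, r_t, t, n_strata=5):
    """Estimate E[Y(t)] by subclassification on the scalar GPS r(t, X)."""
    edges = np.quantile(r_t, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, r_t, side="right") - 1, 0, n_strata - 1)
    est = 0.0
    for k in range(n_strata):
        in_k = strata == k
        # within-stratum mean among units at level t, weighted by stratum share
        est += in_k.mean() * y[in_k & (w == t)].mean()
    return est

ate_10 = subclass_mean(y, w, gps[1], 1) - subclass_mean(y, w, gps[0], 0)
naive_10 = y[w == 1].mean() - y[w == 0].mean()   # confounded comparison
```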

Semi-parametric methods are often used for the estimation of intervention effects on correlated outcomes in cluster-randomized trials (CRTs). When outcomes are missing at random (MAR), Inverse Probability Weighted (IPW) methods incorporating baseline covariates can be used to deal with informative missingness. Also, augmented generalized estimating equations (AUG) correct for imbalance in baseline covariates but need to be extended for MAR outcomes. However, in the presence of interactions between treatment and baseline covariates, neither method alone produces consistent estimates for the marginal treatment effect if the model for the interaction is not correctly specified. We propose an AUG–IPW estimator that weights by the inverse of the probability of being a complete case and allows different outcome models in each intervention arm. This estimator is doubly robust (DR); it gives correct estimates if either the missing data process or the outcome model is correctly specified. We consider the problem of covariate interference, which arises when the outcome of an individual may depend on covariates of other individuals. When interfering covariates are not modeled, the DR property prevents bias as long as covariate interference is not present simultaneously for the outcome and the missingness. An R package is developed implementing the proposed method. An extensive simulation study and an application to a CRT of an HIV risk-reduction intervention in South Africa illustrate the method.
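The double-robustness property invoked above is easy to demonstrate for the simpler problem of estimating a mean with a MAR outcome: the augmented IPW estimator stays consistent when either the missingness model or the outcome model (but not necessarily both) is correct. This is a toy sketch of the DR mechanism, not the proposed cluster-level estimator.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)               # true E[Y] = 2
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))          # MAR: response depends on x only
r = (rng.uniform(size=n) < pi).astype(float)   # r = 1 means Y is observed

def aipw_mean(y, r, pi_model, m_model):
    """Augmented IPW (doubly robust) estimate of E[Y] under MAR."""
    return np.mean(r * y / pi_model - (r - pi_model) / pi_model * m_model)

m_true = 2.0 + x                   # correctly specified outcome regression E[Y|X]
m_bad = np.zeros(n)                # badly misspecified outcome model
pi_bad = np.full(n, r.mean())      # misspecified (constant) response model

cc_mean = y[r == 1].mean()                     # complete-case mean: biased
dr_pi_ok = aipw_mean(y, r, pi, m_bad)          # response model correct
dr_m_ok = aipw_mean(y, r, pi_bad, m_true)      # outcome model correct
```

The complete-case mean is pulled upward because subjects with large x are both more likely to respond and have larger outcomes; either correctly specified component restores consistency.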

A clinical trial with a factorial design involves randomization of subjects to treatment *A* or its control and, within each group, further randomization to treatment *B* or its control. Under this design, one can assess the effects of treatments *A* and *B* on a clinical endpoint using all patients. One may additionally compare treatment *A*, treatment *B*, or the combination therapy to the control condition. With multiple comparisons, however, it may be desirable to control the overall type I error, especially for regulatory purposes. Because the subjects overlap in the comparisons, the test statistics are generally correlated. By accounting for the correlations, one can achieve higher statistical power compared to the conventional Bonferroni correction. Herein, we derive the correlation between any two (stratified or unstratified) log-rank statistics for a factorial design with a survival time endpoint, such that the overall type I error for multiple treatment comparisons can be properly controlled. In addition, we allow for adjustment of prognostic factors in the treatment comparisons and conduct simultaneous inference on the effect sizes. We use simulation studies to show that the proposed methods perform well in realistic situations. We then provide an application to a recently completed randomized controlled clinical trial on alcohol dependence. Finally, we discuss extensions of our approach to other factorial designs and multiple endpoints.

This article considers sieve estimation in the Cox model with an unknown regression structure based on right-censored data. We propose a semiparametric pursuit method to simultaneously identify and estimate linear and nonparametric covariate effects based on B-spline expansions through a penalized group selection method with concave penalties. We show that the estimators of the linear effects and the nonparametric component are consistent. Furthermore, we establish the asymptotic normality of the estimator of the linear effects. To compute the proposed estimators, we develop a modified blockwise majorization descent algorithm that is efficient and easy to implement. Simulation studies demonstrate that the proposed method performs well in finite sample situations. We also use the primary biliary cirrhosis data to illustrate its application.

The log-rank test is widely used to compare two survival distributions in a randomized clinical trial, while partial likelihood (Cox, 1975) is the method of choice for making inference about the hazard ratio under the Cox (1972) proportional hazards model. The Wald 95% confidence interval of the hazard ratio may include the null value of 1 when the *p*-value of the log-rank test is less than 0.05. Peto et al. (1977) provided an estimator for the hazard ratio based on the log-rank statistic; the corresponding 95% confidence interval excludes the null value of 1 if and only if the *p*-value of the log-rank test is less than 0.05. However, Peto's estimator is not consistent, and the corresponding confidence interval does not have correct coverage probability. In this article, we construct the confidence interval by inverting the score test under the (possibly stratified) Cox model, and we modify the variance estimator such that the resulting score test for the null hypothesis of no treatment difference is identical to the log-rank test in the possible presence of ties. Like Peto's method, the proposed confidence interval excludes the null value if and only if the log-rank test is significant. Unlike Peto's method, however, this interval has correct coverage probability. An added benefit of the proposed confidence interval is that it tends to be more accurate and narrower than the Wald confidence interval. We demonstrate the advantages of the proposed method through extensive simulation studies and a colon cancer study.
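Peto's log-rank-based estimator and its confidence interval can be sketched directly from the O − E and V quantities of the two-sample log-rank test; by construction, this interval excludes 1 exactly when the log-rank chi-square exceeds 1.96². (The article's proposal instead inverts the Cox score test; the sketch below only illustrates the Peto construction it improves upon, on simulated data.)

```python
import numpy as np

def logrank_peto(time, event, group):
    """Two-sample log-rank O - E and variance V, with Peto's one-step
    hazard-ratio estimate exp((O-E)/V) and its 95% CI on the log scale."""
    O_E, V = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n_t = at_risk.sum()
        n1_t = (at_risk & (group == 1)).sum()
        d_t = ((time == t) & (event == 1)).sum()
        d1_t = ((time == t) & (event == 1) & (group == 1)).sum()
        O_E += d1_t - d_t * n1_t / n_t                       # observed - expected
        if n_t > 1:                                          # hypergeometric variance
            V += d_t * (n1_t / n_t) * (1 - n1_t / n_t) * (n_t - d_t) / (n_t - 1)
    hr = np.exp(O_E / V)
    ci = (np.exp(O_E / V - 1.96 / np.sqrt(V)), np.exp(O_E / V + 1.96 / np.sqrt(V)))
    return hr, ci, O_E ** 2 / V                              # last entry: chi-square

rng = np.random.default_rng(7)
group = np.repeat([0, 1], 100)
t_lat = rng.exponential(np.where(group == 1, 0.5, 1.0))      # true HR = 2 in group 1
cens = rng.uniform(0.0, 2.0, 200)
time, event = np.minimum(t_lat, cens), (t_lat <= cens).astype(int)
hr, (lo, hi), chisq = logrank_peto(time, event, group)
excludes_one = lo > 1.0 or hi < 1.0
```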

Interval-censored failure time data occur in many fields such as demography, economics, medical research, and reliability, and many inference procedures for such data have been developed (Sun, 2006; Chen, Sun, and Peace, 2012). However, most of the existing approaches assume that the mechanism that yields interval censoring is independent of the failure time of interest, and it is clear that this may not be true in practice (Zhang et al., 2007; Ma, Hu, and Sun, 2015). In this article, we consider regression analysis of case *K* interval-censored failure time data when the censoring mechanism may be related to the failure time of interest. For this problem, an estimated sieve maximum-likelihood approach is proposed for data arising from the proportional hazards frailty model, and for estimation, a two-step procedure is presented. In addition, the asymptotic properties of the proposed estimators of the regression parameters are established, and an extensive simulation study suggests that the method works well. Finally, we apply the method to a set of real interval-censored data that motivated this study.

Motivated by an ongoing pediatric mental health care (PMHC) study, this article presents weakly structured methods for analyzing doubly censored recurrent event data where only coarsened information on censoring is available. The study extracted administrative records of emergency department visits from provincial health administrative databases. The available information for each individual subject is limited to a subject-specific time window that is itself determined only up to the coarsened censoring information. To evaluate time-dependent effects of exposures, we adapt local linear estimation with right-censored survival times under the Cox regression model with time-varying coefficients (cf. Cai and Sun, *Scandinavian Journal of Statistics* 2003, **30**, 93–111). We establish the pointwise consistency and asymptotic normality of the regression parameter estimator, and examine its performance by simulation. The PMHC study illustrates the proposed approach throughout the article.

Joint models are used in ageing studies to investigate the association between longitudinal markers and a time-to-event, and have been extended to multiple markers and/or competing risks. The competing risk of death must be considered in the elderly because death and dementia have common risk factors. Moreover, in cohort studies, time-to-dementia is interval-censored since dementia is assessed intermittently; thus, subjects can develop dementia and die between two visits without ever being diagnosed. To study predementia cognitive decline, we propose a joint latent class model combining a (possibly multivariate) mixed model and an illness–death model handling both interval censoring (by accounting for a possible unobserved transition to dementia) and semi-competing risks. Parameters are estimated by maximum likelihood handling interval censoring. The correlation between the marker and the times-to-events is captured by latent classes, homogeneous sub-groups with specific risks of death, dementia, and profiles of cognitive decline. We propose Markovian and semi-Markovian versions. Both approaches are compared to a joint latent-class model for competing risks through a simulation study, and applied in a prospective cohort study of cerebral and functional ageing to distinguish different profiles of cognitive decline associated with risks of dementia and death. The comparison highlights that among subjects with dementia, mortality depends more on age than on duration of dementia. This model distinguishes the so-called terminal predeath decline (among healthy subjects) from the predementia decline.

Longitudinal covariates in survival models are generally analyzed using random effects models. By framing the estimation of these survival models as a functional measurement error problem, semiparametric approaches such as the conditional score or the corrected score can be applied to find consistent estimators for survival model parameters without distributional assumptions on the random effects. However, in order to satisfy the standard assumptions of a survival model, the semiparametric methods in the literature only use covariate data before each event time. This suggests that these methods may make inefficient use of the longitudinal data. We propose an extension of these approaches that builds on a generalization of the Rao–Blackwell theorem. A Monte Carlo error augmentation procedure is developed to utilize the entirety of the longitudinal information available. The efficiency improvement of the proposed semiparametric approach is confirmed theoretically and demonstrated in a simulation study. A real data set is analyzed as an illustration of a practical application.

Motivated by ultrahigh-dimensional biomarker screening studies, we propose a model-free screening approach tailored to censored lifetime outcomes. Our proposal is built upon the introduction of a new measure, the survival impact index (SII). By its design, SII sensibly captures the overall influence of a covariate on the outcome distribution, and can be estimated with familiar nonparametric procedures that do not require smoothing and are readily adaptable to handle lifetime outcomes under various censoring and truncation mechanisms. We provide large sample distributional results that facilitate the inference on SII in classical multivariate settings. More importantly, we investigate SII as an effective screener for ultrahigh-dimensional data, without relying on rigid regression model assumptions in real applications. We establish the sure screening property of the proposed SII-based screener. Extensive numerical studies are carried out to assess the performance of our method compared with other existing screening methods. A lung cancer microarray dataset is analyzed to demonstrate the practical utility of our proposals.

Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. This problem becomes even more difficult due to complications in modeling unknown interaction terms among high-dimensional variables. There is currently no variable selection method to overcome these limitations. Hence, in this article we propose a variable selection approach that is developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover the sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches including the linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of the nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as a function of the sparse scale parameters of the kernel machine to recover the sparsity of input variables whose relevance to the response is measured by the scale parameters. We also provide the asymptotic properties of our approach. We show that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed. A resampling procedure for our variable selection methodology is also proposed to improve power.
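The linear nonnegative garrote that the proposed method generalizes can be sketched in a few lines: each OLS coefficient is rescaled by a nonnegative factor c_j, and the penalty λ Σ c_j drives the factors of irrelevant variables to zero. The solver, data, and tuning value below are illustrative, not part of the article's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def nonneg_garrote(X, y, lam):
    """Linear nonnegative garrote: minimize 0.5 * ||y - Z c||^2 + lam * sum(c)
    over c >= 0, where Z_j = X_j * beta_ols_j; return the shrunken coefficients."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    Z = X * beta_ols
    obj = lambda c: 0.5 * np.sum((y - Z @ c) ** 2) + lam * c.sum()
    grad = lambda c: -Z.T @ (y - Z @ c) + lam
    res = minimize(obj, np.ones(X.shape[1]), jac=grad,
                   bounds=[(0, None)] * X.shape[1], method="L-BFGS-B")
    return res.x * beta_ols

rng = np.random.default_rng(8)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=n)  # only 2 relevant
beta = nonneg_garrote(X, y, lam=20.0)
```

The kernel-machine version of the article replaces the per-coefficient scaling factors with sparse scale parameters inside the kernel, but the selection mechanism is the same nonnegative shrinkage.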

We consider the problem of selecting covariates in a spatial regression model when the response is binary. Penalized likelihood-based approaches have proved effective for simultaneous variable selection and estimation. In the context of a spatially dependent binary variable, a uniquely interpretable likelihood is not available; rather, a quasi-likelihood may be more suitable. We develop a penalized quasi-likelihood with spatial dependence for simultaneous variable selection and parameter estimation, along with an efficient computational algorithm. The theoretical properties, including asymptotic normality and consistency, are studied under the increasing-domain asymptotics framework. An extensive simulation study is conducted to validate the methodology, and real data examples are provided for illustration and applicability. Although we do not provide theoretical justification for this case, we also investigate the empirical performance of the proposed penalized quasi-likelihood approach for spatial count data, to explore the suitability of the method for a general exponential family of distributions.

In many classical estimation problems, the parameter space has a boundary. In most cases, the standard asymptotic properties of the estimator do not hold when some of the underlying true parameters lie on the boundary. However, without knowledge of the true parameter values, confidence intervals constructed assuming that the parameters lie in the interior are generally over-conservative. A penalized estimation method is proposed in this article to address this issue. An adaptive lasso procedure is employed to shrink the parameters to the boundary, yielding oracle inference that adapts to whether or not the true parameters are on the boundary. When the true parameters are on the boundary, the inference is equivalent to that which would be achieved with a priori knowledge of the boundary; when the true parameters lie in the interior, the inference is equivalent to that obtained in the interior of the parameter space. The method is demonstrated under two practical scenarios, namely the frailty survival model and linear regression with order-restricted parameters. Simulation studies and real data analyses show that the method performs well with realistic sample sizes and exhibits certain advantages over standard methods.

Combining multiple studies is frequently undertaken in biomedical research to increase sample sizes for statistical power improvement. We consider the marginal model for the regression analysis of repeated measurements collected in several similar studies with potentially different variances and correlation structures. It is of great importance to examine whether there exist common parameters across study-specific marginal models so that simpler models, sensible interpretations, and meaningful efficiency gains can be obtained. Combining multiple studies via the classical means of hypothesis testing involves a large number of simultaneous tests for all possible subsets of common regression parameters, which results in unduly large degrees of freedom and low statistical power. We develop a new method of *fused lasso with the adaptation of parameter ordering* (FLAPO) to scrutinize only adjacent-pair parameter differences, leading to a substantial reduction in the number of involved constraints. Our method enjoys the oracle properties as does the full fused lasso based on all pairwise parameter differences. We show that FLAPO gives estimators with smaller error bounds and better finite sample performance than the full fused lasso. We also establish a regularized inference procedure based on bias-corrected FLAPO. We illustrate our method through both simulation studies and an analysis of HIV surveillance data collected over five geographic regions in China, in which the presence or absence of common covariate effects reflects the relative effectiveness of regional policies on HIV control and prevention.

Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify *multiple regulators* or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.

The peptide microarray immunoassay simultaneously screens sample serum against thousands of peptides, determining the presence of antibodies bound to array probes. Peptide microarrays tiling immunogenic regions of pathogens (e.g., envelope proteins of a virus) are an important high throughput tool for querying and mapping antibody binding. Because of the assay's many steps, from probe synthesis to incubation, peptide microarray data can be noisy with extreme outliers. In addition, subjects may produce different antibody profiles in response to an identical vaccine stimulus or infection, due to variability among subjects’ immune systems. We present a robust Bayesian hierarchical model for peptide microarray experiments, pepBayes, to estimate the probability of antibody response for each subject/peptide combination. Heavy-tailed error distributions accommodate outliers and extreme responses, and tailored random effect terms automatically incorporate technical effects prevalent in the assay. We apply our model to two vaccine trial data sets to demonstrate model performance. Our approach enjoys high sensitivity and specificity when detecting vaccine induced antibody responses. A simulation study shows an adaptive thresholding classification method has appropriate false discovery rate control with high sensitivity, and receiver operating characteristics generated on vaccine trial data suggest that pepBayes clearly separates responses from non-responses.

Measuring the similarity between genes is often the starting point for building gene regulatory networks. Most similarity measures used in practice consider only pairwise information, with a few also considering network structure. Although the theoretical properties of pairwise measures are well understood in the statistics literature, little is known about the statistical properties of similarity measures based on network structure. In this article, we consider a new whole-genome network-based similarity measure, called *CCor*, that makes use of information from all the genes in the network. We derive a concentration inequality for *CCor* and compare it with the commonly used Pearson correlation coefficient for inferring network modules. Both theoretical analysis and a real data example demonstrate the advantages of *CCor* over existing measures for inferring gene modules.

In applying scan statistics for public health research, it would be valuable to develop a detection method for multiple clusters that accommodates spatial correlation and covariate effects in an integrated model. In this article, we connect the concepts of the likelihood ratio (LR) scan statistic and the quasi-likelihood (QL) scan statistic to provide a series of detection procedures sufficiently flexible to apply to clusters of arbitrary shape. First, we use an independent scan model for detection of clusters and then a variogram tool to examine the existence of spatial correlation and regional variation based on residuals of the independent scan model. When the estimate of regional variation is significantly different from zero, a mixed QL estimating equation is developed to estimate coefficients of geographic clusters and covariates. We use the Benjamini–Hochberg procedure (1995) to find a threshold for *p*-values to address the multiple testing problem. A quasi-deviance criterion is used to regroup the estimated clusters to find geographic clusters with arbitrary shapes. We conduct simulations to compare the performance of the proposed method with other scan statistics. For illustration, the method is applied to enterovirus data from Taiwan.
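The Benjamini–Hochberg step in the pipeline above is the standard step-up rule; a minimal generic sketch (not the authors' implementation, and with made-up *p*-values) is:

```python
# Benjamini-Hochberg step-up procedure: given a list of p-values,
# return the indices declared significant at FDR level q.
def benjamini_hochberg(pvals, q=0.05):
    m = len(pvals)
    # sort p-values, keeping track of original indices
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # largest rank k with p_(k) <= k * q / m
        if pvals[i] <= rank * q / m:
            k_max = rank
    # reject the hypotheses with the k_max smallest p-values
    return sorted(order[:k_max])

# example: p-values for eight candidate clusters
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1]
```

Only the two smallest *p*-values fall under their step-up thresholds here; the 0.039–0.042 cluster of *p*-values is not rejected because the comparison is against rank-dependent cutoffs, not a fixed level.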

The focus of this article is on the nature of the likelihood associated with *N*-mixture models for repeated count data. It is shown that the infinite sum embedded in the likelihood associated with the Poisson mixing distribution can be expressed in terms of a hypergeometric function and, thence, in closed form. The resultant expression for the likelihood can be readily computed to a high degree of accuracy and is algebraically tractable. Specifically, the likelihood equations can be simplified to some advantage, the concentrated likelihood in the probability of detection formulated, and problematic cases identified. The results are illustrated by means of a simulation study and a real-world example. The study is extended to *N*-mixture models with a negative binomial mixing distribution, and results similar to those for the Poisson case are obtained. *N*-mixture models with mixing distributions which accommodate excess zeros and, separately, with a beta-binomial distribution rather than a binomial used to model the intra-site counts are also investigated. However, the results for these settings, while computationally attractive, do not provide insight into the nature of the maximum likelihood estimates.

We introduce a new Bayesian nonparametric method for estimating the size of a closed population from multiple-recapture data. Our method, based on Dirichlet process mixtures, can accommodate complex patterns of heterogeneity of capture, and can transparently modulate its complexity without a separate model selection step. Additionally, it can handle the massively sparse contingency tables generated by a large number of recaptures with moderate sample sizes. We develop an efficient and scalable MCMC algorithm for estimation. We apply our method to simulated data and to two examples from the literature on the estimation of casualties in armed conflicts.

Understanding how aquatic species grow is fundamental in fisheries because stock assessment often relies on growth-dependent statistical models. Length-frequency-based methods become important when more applicable data for growth model estimation are either not available or very expensive. In this article, we develop a new framework for growth estimation from length-frequency data using a generalized von Bertalanffy growth model (VBGM) that allows time-dependent covariates to be incorporated. A finite mixture of normal distributions is used to model the length-frequency cohorts of each month, with the means constrained to follow a VBGM. The variances of the finite mixture components are constrained to be a function of mean length, reducing the number of parameters and allowing for an estimate of the variance at any length. To optimize the likelihood, we use a minorization–maximization (MM) algorithm with a Nelder–Mead sub-step. This work was motivated by the decline in catches of the blue swimmer crab (BSC) (*Portunus armatus*) off the east coast of Queensland, Australia. We test the method with a simulation study and then apply it to the BSC fishery data.
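For reference, the constraint on the mixture-component means uses the standard von Bertalanffy form; a minimal sketch with made-up parameter values (not estimates from the BSC data) is:

```python
import math

# Standard von Bertalanffy growth model for mean length at age t:
#   L(t) = L_inf * (1 - exp(-K * (t - t0)))
# The article generalizes this to incorporate time-dependent
# covariates; the parameter values below are illustrative only.
def vbgm_mean_length(t, L_inf=18.0, K=1.2, t0=-0.05):
    return L_inf * (1.0 - math.exp(-K * (t - t0)))

# mean length grows quickly and levels off toward the asymptote L_inf
for age in (0.25, 0.5, 1.0, 2.0):
    print(round(vbgm_mean_length(age), 2))
```

Constraining the monthly cohort means to such a curve is what ties the finite mixture components together across months into a single growth model.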

Applications of circular regression models appear in many different fields such as evolutionary psychology, motor behavior, biology, and, in particular, in the analysis of gene expressions in oscillatory systems. Specifically, for the gene expression problem, a researcher may be interested in modeling the relationship among the phases of cell-cycle genes in two species with differing periods. This challenging problem reduces to the problem of constructing a piecewise circular regression model and, with this objective in mind, we propose a flexible circular regression model which allows different parameter values depending on sectors along the circle. We give a detailed interpretation of the parameters in the model and provide maximum likelihood estimators. We also provide a model selection procedure based on the concept of generalized degrees of freedom. The model is then applied to the analysis of two different cell-cycle data sets and through these examples we highlight the power of our new methodology.

Massive functional data have recently been widely collected over a set of grid points in various imaging studies. It is of interest to correlate functional data with various clinical variables, such as age and gender, in order to address scientific questions of interest. The aim of this article is to develop a single-index varying coefficient (SIVC) model for establishing a varying association between functional responses (e.g., images) and a set of covariates. It enjoys several unique features of both varying-coefficient and single-index models. An estimation procedure is developed to estimate the varying coefficient functions, the index function, and the covariance function of individual functions. The optimal integration of information across different grid points is systematically delineated, and the asymptotic properties (e.g., consistency and convergence rate) of all estimators are examined. Simulation studies are conducted to assess the finite-sample performance of the proposed estimation procedure. Furthermore, our real data analysis of a white matter tract dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study confirms the advantage and accuracy of the SIVC model over the popular varying coefficient model.

Construction of confidence sets for the optimal factor levels is an important topic in response surface methodology. Wan et al. (2015) provided an exact confidence set for a maximum or minimum point (i.e., an optimal factor level) of a univariate polynomial function in a given interval. In this article, that method is extended to construct an exact confidence set for the optimal factor levels of response surfaces. The construction method is readily applied to many parametric and semiparametric regression models involving a quadratic function. A conservative confidence set is provided as an intermediate step in the construction of the exact confidence set. Two examples are given to illustrate the application of the confidence sets. A comparison between the confidence sets indicates that our exact confidence set is better than the only other confidence set available in the statistical literature that guarantees the confidence level.

Individual covariates are commonly used in capture–recapture models as they can provide important information for population size estimation. However, in practice, one or more covariates may be missing at random for some individuals, which can lead to unreliable inference if records with missing data are treated as missing completely at random. We show that, in general, such a naive complete-case analysis in closed capture–recapture models with some covariates missing at random underestimates the population size. We develop methods for estimating regression parameters and population size using regression calibration, inverse probability weighting, and multiple imputation without any distributional assumptions about the covariates. We show that the inverse probability weighting and multiple imputation approaches are asymptotically equivalent. We present a simulation study to investigate the effects of missing covariates and to evaluate the performance of the proposed methods. We also illustrate an analysis using data on the bird species yellow-bellied prinia collected in Hong Kong.
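The inverse-probability-weighting idea above can be sketched in miniature: complete cases are reweighted by the inverse of their estimated probability of being complete, so that individuals resembling those with missing covariates count for more. Here the probabilities are simply passed in (in practice they would be estimated, e.g., by logistic regression on the fully observed variables), and all numbers are illustrative:

```python
# Horvitz-Thompson-style weighted mean over complete cases: each value
# is weighted by 1 / P(covariate observed). Estimating these
# observation probabilities is part of the method proper; they are
# taken as given in this sketch.
def ipw_mean(values, obs_prob):
    num = sum(v / p for v, p in zip(values, obs_prob))
    den = sum(1.0 / p for p in obs_prob)
    return num / den

# with equal observation probabilities, IPW reduces to the plain mean
print(ipw_mean([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))  # → 2.0
```

The naive complete-case analysis corresponds to setting all weights to one, which is exactly what biases the population size downward when missingness depends on observed data.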

At a time of climate change and major loss of biodiversity, it is important to have efficient tools for monitoring populations. In this context, animal abundance indices play an important role. In producing indices for invertebrates, it is important to account for variation in counts within seasons. Two new methods for describing seasonal variation in invertebrate counts have recently been proposed; one is nonparametric, using generalized additive models, and the other is parametric, based on stopover models. We present a novel generalized abundance index which encompasses both parametric and nonparametric approaches. The index is extremely efficient to compute due to the use of concentrated likelihood techniques. This has particular relevance for the analysis of data from long-term extensive monitoring schemes with records for many species and sites, for which existing modeling techniques can be prohibitively time consuming. Performance of the index is demonstrated by several applications to UK Butterfly Monitoring Scheme data. We demonstrate the potential for new insights into both phenology and spatial variation in seasonal patterns from parametric modeling and the incorporation of covariate dependence, which is relevant for both monitoring and conservation. Associated R code is available on the journal website.

The availability of data in longitudinal studies is often driven by features of the characteristics being studied. For example, clinical databases are increasingly being used for research to address longitudinal questions. Because visit times in such data are often driven by patient characteristics that may be related to the outcome being studied, the danger is that this will result in biased estimation compared to designed, prospective studies. We study longitudinal data that follow a generalized linear mixed model and use a log link to relate an informative visit process to random effects in the mixed model. This device allows us to elucidate which parameters are biased under the informative visit process and to what degree. We show that the informative visit process can badly bias estimators of parameters of covariates associated with the random effects, while allowing consistent estimation of other parameters.

Alzheimer's disease (AD) is usually diagnosed by clinicians through cognitive and functional performance tests, with a potential risk of misdiagnosis. Since the progression of AD is known to cause structural changes in the corpus callosum (CC), CC thickness can be used as a functional covariate in the AD classification problem. However, misclassified class labels negatively impact classification performance. Motivated by AD–CC association studies, we propose a logistic regression for functional data classification that is robust to misdiagnosis or label noise. Specifically, our model is constructed by adding individual intercepts to the functional logistic regression model. This approach makes it possible to indicate which observations are possibly mislabeled and also leads to a robust and efficient classifier. An effective MM algorithm provides simple closed-form update formulas. We test our method using synthetic datasets to demonstrate its superiority over an existing method, and apply it to differentiating patients with AD from healthy controls based on CC thickness obtained from MRI.

The ready availability of public-use data from various large national complex surveys has immense potential for the assessment of population characteristics using regression models. Complex surveys can be used to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or utilizing resampling methods are often not valid with survey data due to complex survey design features, namely stratification, multistage sampling, and weighting. In this article, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a double-transform-both-sides (DTBS)-based estimating equations approach to estimate the median regression parameters of the highly skewed response; the DTBS approach applies the same Box–Cox type transformation twice to both the outcome and the regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudo-likelihood based on minimizing absolute deviations (MAD). Furthermore, the approach is relatively robust to the true underlying distribution, and has much smaller mean square error than a MAD approach. The method is motivated by an analysis of laboratory data on urinary iodine (UI) concentration from the National Health and Nutrition Examination Survey.
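The Box–Cox transform is the building block of any transform-both-sides scheme: the same transform is applied to the outcome and to the regression function, and estimating equations are built from residuals on the transformed scale. The sketch below shows a single application with made-up numbers; the DTBS approach of the article applies such a transform twice, which is not reproduced here:

```python
import math

# Box-Cox power transform: h(y; lam) = (y^lam - 1) / lam, with the
# log transform as the lam -> 0 limit.
def box_cox(y, lam):
    return math.log(y) if lam == 0 else (y**lam - 1.0) / lam

# Transform-both-sides residual: the same transform is applied to the
# outcome y and to the regression function value g(x, beta), so the
# residual lives on the (less skewed) transformed scale.
def tbs_residual(y, g_xbeta, lam):
    return box_cox(y, lam) - box_cox(g_xbeta, lam)

# a skewed outcome near its predicted median gives a small residual
# on the log (lam = 0) scale
print(round(tbs_residual(120.0, 100.0, lam=0.0), 4))  # → 0.1823
```

Because both sides are transformed identically, the regression parameters keep their interpretation on the original (median) scale, which is what makes the sandwich variance estimate applicable without resampling.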

The evaluation of cure fractions in oncology research under the well-known cure rate model has attracted considerable attention in the literature, but most of the existing testing procedures have relied on restrictive assumptions. A common assumption has been to restrict the cure fraction to a constant under alternatives to homogeneity, thereby neglecting any information from covariates. This article extends the literature by developing a score-based statistic that incorporates covariate information to detect cure fractions, with the existing testing procedure serving as a special case. A complication of this extension, however, is that the implied hypotheses are not typical and standard regularity conditions to conduct the test may not even hold. Using empirical processes arguments, we construct a sup-score test statistic for cure fractions and establish its limiting null distribution as a functional of mixtures of chi-square processes. In practice, we suggest a simple resampling procedure to approximate this limiting distribution. Our simulation results show that the proposed test can greatly improve efficiency over tests that neglect the heterogeneity of the cure fraction under the alternative. The practical utility of the methodology is illustrated using ovarian cancer survival data with long-term follow-up from the Surveillance, Epidemiology, and End Results registry.

Most cancer research now involves one or more assays profiling various biological molecules, e.g., messenger RNA and micro RNA, in samples collected on the same individuals. The main interest with these genomic data sets lies in the identification of a subset of features that are active in explaining the dependence between platforms. To quantify the strength of the dependency between two variables, correlation is often preferred. However, expression data obtained from next-generation sequencing platforms are integer-valued with very low counts for some important features. In this case, the sample Pearson correlation is not a valid estimate of the true correlation matrix, because the sample correlation estimate between two features/variables with low counts will often be close to zero, even when the natural parameters of the Poisson distribution are, in actuality, highly correlated. We propose a model-based approach to correlation estimation between two non-normal data sets, via a method we call Probabilistic Correlations ANalysis, or PCAN. PCAN takes into consideration the distributional assumptions about both data sets and suggests that correlations estimated at the model natural parameter level are more appropriate than correlations estimated directly on the observed data. We demonstrate through a simulation study that PCAN outperforms other standard approaches in estimating the true correlation between the natural parameters. We then apply PCAN to the joint analysis of a microRNA (miRNA) and a messenger RNA (mRNA) expression data set from a squamous cell lung cancer study, finding a large number of negative correlation pairs when compared to the standard approaches.

For the classical, homoscedastic measurement error model, moment reconstruction (Freedman et al., 2004, 2008) and moment-adjusted imputation (Thomas et al., 2011) are appealing, computationally simple imputation-like methods for general model fitting. Like classical regression calibration, the idea is to replace the unobserved variable subject to measurement error with a proxy that can be used in a variety of analyses. Moment reconstruction and moment-adjusted imputation differ from regression calibration in that they attempt to match multiple features of the latent variable, and also to match some of the latent variable's relationships with the response and additional covariates. In this note, we consider a problem where true exposure is generated by a complex, nonlinear random effects modeling process, and develop analogues of moment reconstruction and moment-adjusted imputation for this case. This general model includes classical measurement errors, Berkson measurement errors, mixtures of Berkson and classical errors, and problems that are not measurement error problems at all. The methods are illustrated using the National Institutes of Health–AARP Diet and Health Study, where the latent variable is a dietary pattern score called the Healthy Eating Index-2005. We also show how our general model includes methods used in radiation epidemiology as a special case. Simulations are used to illustrate the methods.

The usefulness of meta-analysis has been recognized in the evaluation of drug safety, as a single trial usually yields few adverse events and offers limited information. For rare events, conventional meta-analysis methods may yield an invalid inference, as they often rely on large sample theories and require empirical corrections for zero events. These problems motivate research in developing *exact* methods, including Tian et al.'s method of combining confidence intervals (2009, *Biostatistics*, **10**, 275–281) and Liu et al.'s method of combining *p*-value functions (2014, *JASA*, **109**, 1450–1465). This article shows that these two exact methods can be unified under the framework of combining confidence distributions (CDs). Furthermore, we show that the CD method generalizes Tian et al.'s method in several aspects. Given that the CD framework also subsumes the Mantel–Haenszel and Peto methods, we conclude that the CD method offers a general framework for meta-analysis of rare events. We illustrate the CD framework using two real data sets collected for the safety analysis of diabetes drugs.