Survival bias is difficult to detect and adjust for in case–control genetic association studies but can invalidate findings when only surviving cases are studied and survival is associated with the genetic variants under study. Here, we propose a design where one genotypes genetically informative family members (such as offspring, parents, and spouses) of deceased cases and incorporates that surrogate genetic information into a retrospective maximum likelihood analysis. We show that inclusion of genotype data from first-degree relatives permits unbiased estimation of genotype association parameters. We derive closed-form maximum likelihood estimates for association parameters under the widely used log-additive and dominant association models. Our proposed design not only permits a valid analysis but also enhances statistical power by augmenting the sample with indirectly studied individuals. Gene variants associated with poor prognosis can also be identified under this design. We provide simulation results to assess performance of the methods. Copyright © 2016 John Wiley & Sons, Ltd.

We study Bayesian linear regression models with skew-symmetric scale mixtures of normal error distributions. These kinds of models can be used to capture departures from the usual assumption of normality of the errors in terms of heavy tails and asymmetry. We propose a general noninformative prior structure for these regression models and show that the corresponding posterior distribution is proper under mild conditions. We extend these propriety results to cases where the response variables are censored. The latter scenario is of interest in the context of accelerated failure time models, which are relevant in survival analysis. We present a simulation study that demonstrates good frequentist properties of the posterior credible intervals associated with the proposed priors. This study also sheds some light on the trade-off between increased model flexibility and the risk of over-fitting. We illustrate the performance of the proposed models with real data. Although we focus on models with univariate response variables, we also present some extensions to the multivariate case in the Supporting Information. Copyright © 2016 John Wiley & Sons, Ltd.

A relatively recent development in the design of Phase I dose-finding studies is the inclusion of expansion cohort(s), that is, the inclusion of several more patients at a level considered to be the maximum tolerated dose established at the conclusion of the ‘pure’ Phase I part. Little attention has been given to the additional statistical analysis, including design considerations, that we might wish to consider for this more involved design. For instance, how can we best make use of new information that may confirm, or may tend to contradict, the estimate of the maximum tolerated dose based on the dose escalation phase? Patients included during the dose expansion phase may be subject to different eligibility criteria. During the expansion phase, we will also wish to keep an eye on any evidence of efficacy, an aspect that clearly distinguishes such studies from the classical Phase I study. Here, we present a methodology that enables us to continue the monitoring of safety in the dose expansion cohort while simultaneously trying to assess efficacy and, in particular, which disease types may be the most promising to take forward for further study. The most elementary problem is that in which we only wish to take account of further toxicity information obtained from the dose expansion cohort, and in which the initial design was model based or the standard 3+3. More complex set-ups also involve efficacy and the presence of subgroups. Copyright © 2016 John Wiley & Sons, Ltd.

Bayesian additive regression trees (BART) provide a framework for flexible nonparametric modeling of relationships of covariates to outcomes. Recently, BART models have been shown to provide excellent predictive performance for both continuous and binary outcomes, exceeding that of their competitors. Software is also readily available for such outcomes. In this article, we introduce modeling that extends the usefulness of BART in medical applications by addressing needs arising in survival analysis. Simulation studies of one-sample and two-sample scenarios, in comparison with long-standing traditional methods, establish face validity of the new approach. We then demonstrate the model's ability to accommodate data from complex regression models with a simulation study of a nonproportional hazards scenario with crossing survival functions and survival function estimation in a scenario where hazards are multiplicatively modified by a highly nonlinear function of the covariates. Using data from a recently published study of patients undergoing hematopoietic stem cell transplantation, we illustrate the use and some advantages of the proposed method in medical investigations. Copyright © 2016 John Wiley & Sons, Ltd.

Epidemiologic studies suggest that maternal ambient air pollution exposure during critical periods of pregnancy is associated with adverse effects on fetal development. In this work, we introduce new methodology for identifying critical periods of development during post-conception gestational weeks 2–8 where elevated exposure to particulate matter less than 2.5 µm (PM_{2.5}) adversely impacts development of the heart. Past studies have focused on highly aggregated temporal levels of exposure during the pregnancy and have failed to account for anatomical similarities between the considered congenital heart defects. We introduce a multinomial probit model in the Bayesian setting that allows for joint identification of susceptible daily periods during pregnancy for 12 types of congenital heart defects with respect to maternal PM_{2.5} exposure. We apply the model to a dataset of mothers from the National Birth Defect Prevention Study where daily PM_{2.5} exposures from post-conception gestational weeks 2–8 are assigned using predictions from the downscaler pollution model. This approach is compared with two aggregated exposure models that define exposure as the average value over post-conception gestational weeks 2–8 and the average over individual weeks, respectively. Results suggest an association between increased PM_{2.5} exposure on post-conception gestational day 53 with the development of pulmonary valve stenosis and exposures during days 50 and 51 with tetralogy of Fallot. Significant associations are masked when using the aggregated exposure models. Simulation study results suggest that the findings are robust to multiple sources of error. The general form of the model allows for different exposures and health outcomes to be considered in future applications. Copyright © 2016 John Wiley & Sons, Ltd.

Zero-inflated count outcomes arise quite often in research and practice. Parametric models such as the zero-inflated Poisson and zero-inflated negative binomial are widely used to model such responses. Like most parametric models, they are quite sensitive to departures from assumed distributions. Recently, new approaches have been proposed to provide distribution-free, or semi-parametric, alternatives. These methods extend the generalized estimating equations to provide robust inference for population mixtures defined by zero-inflated count outcomes. In this paper, we propose methods to extend smoothly clipped absolute deviation (SCAD)-based variable selection methods to these new models. Variable selection has been gaining popularity in modern clinical research studies, as determining differential treatment effects of interventions for different subgroups has become the norm, rather than the exception, in the era of patient-centered outcome research. Such moderation analysis in general creates many explanatory variables in regression analysis, and the advantages of SCAD-based methods over their traditional counterparts render them a great choice for addressing this important and timely issue in clinical research. We illustrate the proposed approach with both simulated and real study data. Copyright © 2016 John Wiley & Sons, Ltd.
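
Although the abstract above does not reproduce the estimating equations, the penalty at the core of SCAD-based selection has a simple closed form. The sketch below is the standard Fan–Li SCAD penalty with the conventional tuning constant a = 3.7; the values of `lam` and `theta` are purely illustrative and this is not the authors' implementation.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan & Li: linear near zero (like the lasso),
    then a quadratic taper, then constant -- so large coefficients
    are not shrunk, reducing estimation bias relative to the lasso."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        # quadratic spline joining the linear and flat regions continuously
        return -(t ** 2 - 2 * a * lam * t + lam ** 2) / (2 * (a - 1))
    return (a + 1) * lam ** 2 / 2

# illustrative evaluation at a hypothetical tuning value
lam = 0.5
values = [scad_penalty(t / 10, lam) for t in range(0, 40)]
```

One can verify the three pieces meet continuously at theta = lam and theta = a*lam, which is the defining feature of the penalty.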

A widely used method in classic random-effects meta-analysis is the DerSimonian–Laird method. An alternative meta-analytical approach is the Hartung–Knapp method. This article reports results of an empirical comparison and a simulation study of these two methods and presents corresponding analytical results. For the empirical evaluation, we took 157 meta-analyses with binary outcomes, analysed each one using both methods and performed a comparison of the results based on treatment estimates, standard errors and associated *P*-values. In several simulation scenarios, we systematically evaluated coverage probabilities and confidence interval lengths. Generally, results are more conservative with the Hartung–Knapp method, giving wider confidence intervals and larger *P*-values for the overall treatment effect. However, in some meta-analyses with very homogeneous individual treatment results, the Hartung–Knapp method yields narrower confidence intervals and smaller *P*-values than the classic random-effects method, which, in this situation, actually reduces to a fixed-effect meta-analysis. Therefore, it is recommended to conduct a sensitivity analysis based on the fixed-effect model instead of solely relying on the result of the Hartung–Knapp random-effects meta-analysis. Copyright © 2016 John Wiley & Sons, Ltd.
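
To make the contrast concrete, here is a minimal sketch of the two variance formulas being compared, assuming standard inverse-variance weighting and a hypothetical set of log odds ratios (the 157 real meta-analyses are not reproduced here).

```python
def dl_tau2(y, v):
    """DerSimonian–Laird moment estimate of the between-study variance tau^2."""
    w = [1.0 / vi for vi in v]
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(y) - 1)) / denom)             # truncated at zero

def random_effects(y, v):
    """Pooled random-effects estimate with DL weights, returning both the
    classic (Wald) variance and the Hartung–Knapp variance; the latter is
    combined with a t quantile on k-1 degrees of freedom."""
    tau2 = dl_tau2(y, v)
    w = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    k = len(y)
    var_classic = 1.0 / sum(w)
    # HK variance: weighted residual mean square divided by the weight sum
    var_hk = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / ((k - 1) * sum(w))
    return mu, var_classic, var_hk

# hypothetical log odds ratios and within-study variances
y = [0.2, 0.5, -0.1, 0.4, 0.3]
v = [0.05, 0.08, 0.06, 0.10, 0.07]
mu, var_classic, var_hk = random_effects(y, v)
```

With these (fairly homogeneous) toy studies, tau² is truncated to zero and the Hartung–Knapp variance comes out smaller than the classic one — exactly the situation flagged above where HK intervals can be narrower than the fixed-effect-like classic intervals.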

This paper introduces a new simple divergence measure between two survival distributions. For two groups of patients, the divergence measure between their associated survival distributions is based on the integral of the absolute difference in probabilities that a patient from one group dies at time t and a patient from the other group survives beyond time t and vice versa. In the case of non-crossing hazard functions, the divergence measure is closely linked to the Harrell concordance index, C, the Mann–Whitney test statistic and the area under a receiver operating characteristic curve. The measure can be used in a dynamic way where the divergence between two survival distributions from time zero up to time t is calculated enabling real-time monitoring of treatment differences. The divergence can be found for theoretical survival distributions or can be estimated non-parametrically from survival data using Kaplan–Meier estimates of the survivor functions. The estimator of the divergence is shown to be generally unbiased and approximately normally distributed. For the case of proportional hazards, the constituent parts of the divergence measure can be used to assess the proportional hazards assumption. The use of the divergence measure is illustrated on the survival of pancreatic cancer patients. Copyright © 2016 John Wiley & Sons, Ltd.
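
One literal numerical reading of the definition above, with exponential survival curves standing in for the two groups (the paper's nonparametric Kaplan–Meier-based estimator is not reproduced, and the integrand is my reading of the verbal description), is:

```python
import math

def divergence(f1, S1, f2, S2, t_max, n=50000):
    """Midpoint-rule approximation of the integral over [0, t_max] of
    |f1(t)*S2(t) - f2(t)*S1(t)| -- the absolute difference between
    'group 1 dies at t while group 2 survives beyond t' and vice versa."""
    h = t_max / n
    return sum(abs(f1((i + 0.5) * h) * S2((i + 0.5) * h)
                   - f2((i + 0.5) * h) * S1((i + 0.5) * h)) * h
               for i in range(n))

# exponential survival curves as hypothetical stand-ins for two groups
lam1, lam2 = 1.0, 2.0
f1 = lambda t: lam1 * math.exp(-lam1 * t)   # density, group 1
S1 = lambda t: math.exp(-lam1 * t)          # survivor function, group 1
f2 = lambda t: lam2 * math.exp(-lam2 * t)   # density, group 2
S2 = lambda t: math.exp(-lam2 * t)          # survivor function, group 2

d = divergence(f1, S1, f2, S2, t_max=20.0)
```

For these two exponentials the integrand reduces to |lam1 - lam2| * exp(-(lam1 + lam2) * t), so the integral is |lam1 - lam2| / (lam1 + lam2) = 1/3; this equals 2C - 1 for the concordance C = 2/3 here, illustrating the link to the C index mentioned above.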

Generic drugs have been commercialized in numerous countries. Most of these countries approve the commercialization of a generic drug when there is evidence of bioequivalence between the generic drug and the reference drug. Generally, the pharmaceutical industry is responsible for the bioequivalence test under the supervision of a regulatory agency. This procedure is concluded after a statistical data analysis. Several agencies adopt a standard statistical analysis based on procedures that were previously established. In practice, we face situations in which this standard model does not fit some sets of bioequivalence data. In this study, we propose an evaluation of bioequivalence using univariate and bivariate models based on an extended generalized gamma distribution and a skew-t distribution, under a Bayesian perspective. We introduce a study of the empirical power of hypothesis tests for univariate models, showing advantages in the use of an extended generalized gamma distribution. Three sets of bioequivalence data were analyzed under these new procedures and compared with the standard model proposed by the majority of regulatory agencies. To assess whether the asymmetric distributions provide a better fit to the data than the standard model, model discrimination methods were used, such as the Deviance Information Criterion (DIC) and quantile–quantile plots. The research concluded that, in general, the use of the extended generalized gamma distribution may be more appropriate for modeling bioequivalence data on the original scale. Copyright © 2016 John Wiley & Sons, Ltd.

Meta-analytic methods for combining data from multiple intervention trials are commonly used to estimate the effectiveness of an intervention. They can also be extended to study comparative effectiveness, testing which of several alternative interventions is expected to have the strongest effect. This often requires network meta-analysis (NMA), which combines trials involving direct comparison of two interventions within the same trial and indirect comparisons across trials. In this paper, we extend existing network methods for main effects to examining moderator effects, allowing for tests of whether intervention effects vary for different populations or when employed in different contexts. In addition, we study how the use of individual participant data may increase the sensitivity of NMA for detecting moderator effects, as compared with aggregate data NMA that employs study-level effect sizes in a meta-regression framework. A new NMA diagram is proposed. We also develop a generalized multilevel model for NMA that takes into account within-trial and between-trial heterogeneity and can include participant-level covariates. Within this framework, we present definitions of homogeneity and consistency across trials. A simulation study based on this model is used to assess effects on power to detect both main and moderator effects. Results show that power to detect moderation is substantially greater when applied to individual participant data as compared with study-level effects. We illustrate the use of this method by applying it to data from a classroom-based randomized study that involved two sub-trials, each comparing interventions that were contrasted with separate control groups. Copyright © 2016 John Wiley & Sons, Ltd.

In recent years, developing pharmaceutical products via multiregional clinical trials (MRCTs) has become standard. Traditionally, an MRCT would assume that a treatment effect is uniform across regions. However, heterogeneity among regions may have an impact on the evaluation of a medicine's effect. In this study, we consider a random effects model using a discrete distribution (DREM) to account for heterogeneous treatment effects across regions in the design and evaluation of MRCTs. We derive a power function for a treatment that is beneficial under DREM and illustrate determination of the overall sample size in an MRCT. We use the concept of consistency based on Method 2 of the Japanese Ministry of Health, Labour, and Welfare's guidance to evaluate the probability of treatment benefit and consistency under DREM. We further derive an optimal sample size allocation over regions to maximize the power for consistency. Moreover, we provide three algorithms for deriving the sample size at the desired level of power for benefit and consistency. In practice, regional treatment effects are unknown. Thus, we provide some guidelines on the design of MRCTs with consistency when the regional treatment effects are assumed to fall into a specified interval. Numerical examples are given to illustrate applications of the proposed approach. Copyright © 2016 John Wiley & Sons, Ltd.

In clinical trials with a survival endpoint, it is common to observe an overlap between the two Kaplan–Meier curves of the treatment and control groups during the early stage of the trial, indicating a potential delayed treatment effect. Formulas have been derived for the asymptotic power of the log-rank test in the presence of a delayed treatment effect and for its accompanying sample size calculation. In this paper, we first reformulate the alternative hypothesis with the delayed treatment effect in a rescaled time domain, which yields a simplified sample size formula for the log-rank test in this context. We further propose an intersection-union test to examine the efficacy of a treatment with delayed effect and show it to be more powerful than the log-rank test. Simulation studies are conducted to demonstrate the proposed methods. Copyright © 2016 John Wiley & Sons, Ltd.

Converging evidence suggests that common complex diseases with the same or similar clinical manifestations could have different underlying genetic etiologies. While current research interests have shifted toward uncovering rare variants and structural variations predisposing to human diseases, the impact of heterogeneity in genetic studies of complex diseases has been largely overlooked. Most of the existing statistical methods assume the disease under investigation has a homogeneous genetic effect and could, therefore, have low power if the disease undergoes heterogeneous pathophysiological and etiological processes. In this paper, we propose a heterogeneity-weighted U (HWU) method for association analyses considering genetic heterogeneity. HWU can be applied to various types of phenotypes (e.g., binary and continuous) and is computationally efficient for high-dimensional genetic data. Through simulations, we showed the advantage of HWU when the underlying genetic etiology of a disease was heterogeneous, as well as the robustness of HWU against different model assumptions (e.g., phenotype distributions). Using HWU, we conducted a genome-wide analysis of nicotine dependence from the Study of Addiction: Genetics and Environments dataset. The genome-wide analysis of nearly one million genetic markers took 7 hours, identifying heterogeneous effects of two new genes (i.e., *CYP3A5* and *IKBKB*) on nicotine dependence. Copyright © 2016 John Wiley & Sons, Ltd.

This article focuses on the implementation of propensity score matching for clustered data. Different approaches to reduce bias due to cluster-level confounders are considered and compared using Monte Carlo simulations. We investigated methods that exploit the clustered structure of the data in two ways: in the estimation of the propensity score model (through the inclusion of fixed or random effects) or in the implementation of the matching algorithm. In addition to pure within-cluster matching, we also assessed the performance of a new approach, ‘preferential’ within-cluster matching. This approach first searches for control units to be matched to treated units within the same cluster. If matching is not possible within the cluster, the algorithm then searches in other clusters. All considered approaches successfully reduced the bias due to the omission of a cluster-level confounder. The preferential within-cluster matching approach, combining the advantages of within-cluster and between-cluster matching, showed relatively good performance with both large and small clusters, and it was often the best method. An important advantage of this approach is that it reduces the number of unmatched units compared with pure within-cluster matching. We applied these methods to the estimation of the effect of caesarean section on the Apgar score using birth register data. Copyright © 2016 John Wiley & Sons, Ltd.
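
A greedy sketch of the ‘preferential’ idea — match within the cluster first, then fall back to other clusters — might look as follows. The unit tuples, caliper, and nearest-neighbour rule are illustrative assumptions, not the authors' implementation.

```python
def preferential_within_cluster_match(units, caliper=0.1):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.
    Each unit is (id, cluster, treated_flag, pscore).  Treated units are
    first matched to controls in the SAME cluster; any treated unit left
    unmatched then searches the remaining controls in other clusters."""
    treated = [u for u in units if u[2]]
    controls = {u[0]: u for u in units if not u[2]}
    pairs = []

    def best_control(t, pool):
        # nearest control by propensity-score distance, within the caliper
        cands = [(abs(t[3] - c[3]), c[0]) for c in pool if abs(t[3] - c[3]) <= caliper]
        return min(cands)[1] if cands else None

    leftovers = []
    for t in treated:                                  # pass 1: within cluster
        pool = [c for c in controls.values() if c[1] == t[1]]
        cid = best_control(t, pool)
        if cid is None:
            leftovers.append(t)
        else:
            pairs.append((t[0], cid))
            del controls[cid]
    for t in leftovers:                                # pass 2: other clusters
        cid = best_control(t, list(controls.values()))
        if cid is not None:
            pairs.append((t[0], cid))
            del controls[cid]
    return pairs

# tiny hypothetical example: two clusters, hand-picked propensity scores
units = [
    (1, "A", True, 0.50), (2, "A", False, 0.52), (3, "A", True, 0.60),
    (4, "B", False, 0.61), (5, "B", True, 0.30), (6, "B", False, 0.33),
]
pairs = preferential_within_cluster_match(units)
```

In this toy example, unit 3 finds no remaining control in cluster A and is rescued in pass 2 by a control in cluster B — the mechanism by which the preferential approach reduces the number of unmatched units.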

We propose a dose-finding design for Phase I oncology trials where each new patient is assigned to the dose most likely to be the target dose given the observed data. The main model assumption is that the dose–toxicity curve is non-decreasing. This method is beneficial when it is desirable to assign a patient to a dose as soon as the patient is enrolled into a study. To prevent assignments to doses with limited toxicity information in fast accruing trials, we propose a conservative rule that assigns temporary fractional toxicities to patients still in follow-up. We also recommend always using a safety rule in any fast accruing dose-finding trial. Copyright © 2016 John Wiley & Sons, Ltd.

Discrete choice experiments (DCEs) are increasingly used for studying and quantifying subjects' preferences in a wide variety of healthcare applications. They provide a rich source of data to assess real-life decision-making processes, which involve trade-offs between desirable characteristics pertaining to health and healthcare and identification of key attributes affecting healthcare. The choice of the design for a DCE is critical because it determines which attributes' effects and which of their interactions are identifiable. We apply blocked fractional factorial designs to construct DCEs and address some identification issues by utilizing the known structure of blocked fractional factorial designs. Our design techniques can be applied to several situations, including DCEs where attributes have different numbers of levels. We demonstrate our design methodology using two healthcare studies that evaluate (i) asthma patients' preferences for symptom-based outcome measures and (ii) patient preferences for breast screening services. Copyright © 2016 John Wiley & Sons, Ltd.

Medical expenditure data analysis has recently become an important problem in biostatistics. These data typically have a number of features making their analysis rather difficult. Commonly, they are heavily right-skewed, contain a large percentage of zeros, and often exhibit large numbers of missing observations because of death and/or lack of follow-up. They are also commonly obtained from records that are linked to large longitudinal data surveys. In this manuscript, we suggest a novel approach to modeling these data through a generalized method of moments estimation procedure combined with appropriate weights that account for both dropout due to death and the probability of being sampled from among the National Long Term Care Survey (NLTCS) subjects. This approach seems particularly appropriate because of the large number of subjects relative to the length of the observation period (in months). We also use a simulation study to compare our proposed approach with and without the use of weights. The proposed model is applied to medical expenditure data obtained from the 2004–2005 NLTCS-linked Medicare database. The results suggest that the amount of medical expenditures incurred is strongly associated with a higher number of activities of daily living (ADL) disabilities and self-reports of unmet need for help with ADL disabilities. Copyright © 2016 John Wiley & Sons, Ltd.

Two main methodologies for assessing equivalence in method-comparison studies are presented separately in the literature. The first one is the well-known and widely applied Bland–Altman approach with its agreement intervals, where two methods are considered interchangeable if their differences are not clinically significant. The second approach is based on errors-in-variables regression in a classical (X,Y) plot and focuses on confidence intervals, whereby two methods are considered equivalent when providing similar measures notwithstanding the random measurement errors. This paper reconciles these two methodologies and shows their similarities and differences using both real data and simulations. A new consistent correlated-errors-in-variables regression is introduced, as the errors are shown to be correlated in the Bland–Altman plot. Indeed, the coverage probabilities collapse and the biases soar when this correlation is ignored. Novel tolerance intervals are compared with agreement intervals with or without replicated data, and novel predictive intervals are introduced to predict a single measure in an (X,Y) plot or in a Bland–Altman plot with excellent coverage probabilities. We conclude that the (correlated-)errors-in-variables regressions should not be avoided in method comparison studies, although the Bland–Altman approach is usually applied to avert their complexity. We argue that tolerance or predictive intervals are better alternatives than agreement intervals, and we provide guidelines for practitioners regarding method comparison studies. Copyright © 2016 John Wiley & Sons, Ltd.
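
For reference, the classic Bland–Altman agreement interval that the paper takes as its starting point can be computed in a few lines; the paired measurements below are hypothetical.

```python
import math

def bland_altman_limits(x, y):
    """Classic Bland–Altman 95% limits of agreement:
    mean difference +/- 1.96 * sd of the paired differences."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / (n - 1))
    return mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d

# hypothetical paired measurements from two devices
x = [10.1, 12.3, 9.8, 11.5, 10.9, 12.0, 9.5, 11.1]
y = [10.4, 12.0, 10.1, 11.2, 11.3, 11.8, 9.9, 11.0]
lo, hi = bland_altman_limits(x, y)
```

These are the agreement intervals the paper contrasts with tolerance and predictive intervals; the latter constructions are not reproduced here.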

The method of generalized estimating equations (GEE) is popular in the biostatistics literature for analyzing longitudinal binary and count data. It assumes a generalized linear model for the outcome variable and a working correlation among repeated measurements. In this paper, we introduce a viable competitor: the weighted scores method for generalized linear model margins. We weight the univariate score equations using a working discretized multivariate normal model that is a proper multivariate model. Because the weighted scores method is a parametric method based on likelihood, we propose composite likelihood information criteria as an intermediate step for model selection. The same criteria can be used for both correlation structure and variable selection. Simulation studies and the application example show that our method outperforms other existing model selection methods in GEE. From the example, it can be seen that our methods not only improve on GEE in terms of interpretability and efficiency but also can change the inferential conclusions with respect to GEE. Copyright © 2016 John Wiley & Sons, Ltd.

In observational studies, estimation of the average causal treatment effect on a patient's response should adjust for confounders that are associated with both treatment exposure and response. In addition, the response, such as medical cost, may have incomplete follow-up. In this article, a double robust estimator is proposed for the average causal treatment effect with right censored medical cost data. The estimator is double robust in the sense that it remains consistent when either the model for the treatment assignment or the regression model for the response is correctly specified. Double robust estimators increase the likelihood that the results will represent a valid inference. Asymptotic normality is obtained for the proposed estimator, and an estimator for the asymptotic variance is also derived. Simulation studies show good finite sample performance of the proposed estimator, and a real data analysis using the proposed method is provided as illustration. Copyright © 2016 John Wiley & Sons, Ltd.

Mortality counts are usually aggregated over age groups assuming similar effects of both time and region, yet the spatio-temporal evolution of cancer mortality rates may depend on changing age structures. In this paper, mortality rates are analyzed by region, time period and age group, and models including space–time, space–age, and age–time interactions are considered. The integrated nested Laplace approximation method, known as INLA, is adopted for model fitting and inference in order to reduce computing time in comparison with Markov chain Monte Carlo (MCMC) methods. The methodology provides full posterior distributions of the quantities of interest while avoiding complex simulation techniques. The proposed models are used to analyze prostate cancer mortality data in 50 Spanish provinces over the period 1986–2010. The results reveal a decline in mortality since the late 1990s, particularly in the age group [65,70), probably because of the introduction of the PSA (prostate-specific antigen) test and better treatment of early-stage disease. The decline is not clearly observed in the oldest age groups. Copyright © 2016 John Wiley & Sons, Ltd.

Population attributable risk measures the public health impact of the removal of a risk factor. To apply this concept to epidemiological data, the calculation of a confidence interval to quantify the uncertainty in the estimate is desirable. However, perhaps because of the confusion surrounding the attributable risk measures, there is no standard confidence interval or variance formula given in the literature. In this paper, we implement a fully Bayesian approach to confidence interval construction of the population attributable risk for cross-sectional studies. We show that, in comparison with a number of standard frequentist methods for constructing confidence intervals (i.e., delta, jackknife and bootstrap methods), the Bayesian approach is superior in terms of percent coverage in all except a few cases. This paper also explores the effect of the chosen prior on the coverage and provides alternatives for particular situations. Copyright © 2016 John Wiley & Sons, Ltd.
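
A minimal sketch of a fully Bayesian interval of this kind, assuming a flat Dirichlet prior over the four cells of a cross-sectional exposure-by-disease table (the paper also studies other priors, which matter for coverage), is:

```python
import random

def bayes_par_interval(n11, n10, n01, n00, draws=20000, seed=1):
    """Posterior 95% credible interval for the population attributable risk
    PAR = (P(D) - P(D|unexposed)) / P(D) from a cross-sectional 2x2 table,
    using a Dirichlet(1,1,1,1) posterior over the cell probabilities,
    sampled via independent gamma draws."""
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        g = [rng.gammavariate(n + 1, 1.0) for n in (n11, n10, n01, n00)]
        s = sum(g)
        p11, p10, p01, p00 = (x / s for x in g)
        p_d = p11 + p01                        # P(disease)
        p_d_unexp = p01 / (p01 + p00)          # P(disease | unexposed)
        sims.append((p_d - p_d_unexp) / p_d)
    sims.sort()
    return sims[int(0.025 * draws)], sims[int(0.975 * draws)]

# hypothetical counts: n11 exposed+diseased, n10 exposed+healthy,
# n01 unexposed+diseased, n00 unexposed+healthy
lo, hi = bayes_par_interval(40, 60, 20, 180)
```

With these counts the point estimate is (0.2 - 0.1)/0.2 = 0.5, and the posterior interval brackets it; the delta, jackknife, and bootstrap comparators from the paper are not reproduced here.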

Spatiotemporal calibration of output from deterministic models is an increasingly popular tool to more accurately and efficiently estimate the true distribution of spatial and temporal processes. Current calibration techniques have focused on a single source of data on observed measurements of the process of interest that are both temporally and spatially dense. Additionally, these methods often calibrate deterministic models available in grid-cell format with pixel sizes small enough that the centroid of the pixel closely approximates the measurement for other points within the pixel. We develop a modeling strategy that allows us to simultaneously incorporate information from two sources of data on observed measurements of the process (that differ in their spatial and temporal resolutions) to calibrate estimates from a deterministic model available on a regular grid. This method not only improves estimates of the pollutant at the grid centroids but also refines the spatial resolution of the grid data. The modeling strategy is illustrated by calibrating and spatially refining daily estimates of ambient nitrogen dioxide concentration over Connecticut for 1994 from the Community Multiscale Air Quality model (temporally dense grid-cell estimates on a large pixel size) using observations from an epidemiologic study (spatially dense and temporally sparse) and Environmental Protection Agency monitoring stations (temporally dense and spatially sparse). Copyright © 2016 John Wiley & Sons, Ltd.

The receiver operating characteristic (ROC) curve is a popular technique with applications, for example, in assessing the accuracy of a biomarker in discriminating between disease and non-disease groups. A common measure of the accuracy of a given diagnostic marker is the area under the ROC curve (AUC).

In contrast with the AUC, the partial area under the ROC curve (pAUC) restricts attention to a range of specificities (i.e., true negative rates), and it can often be clinically more relevant than examining the entire ROC curve. The pAUC is commonly estimated based on a U-statistic with a plug-in sample quantile, making the estimator a non-traditional U-statistic. In this article, we propose an accurate and easy method to obtain the variance of the nonparametric pAUC estimator. The proposed method is easy to implement both for a single biomarker test and for the comparison of two correlated biomarkers because it simply adapts the existing variance estimator of U-statistics. We show the accuracy and other advantages of the proposed variance estimation method by broadly comparing it with previously existing methods. Further, we develop an empirical likelihood inference method based on the proposed variance estimator through a simple implementation. In an application, we demonstrate that, depending on whether inference is based on the AUC or the pAUC, we can reach different decisions about the prognostic ability of the same set of biomarkers. Copyright © 2016 John Wiley & Sons, Ltd.
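
As a reference point for the estimator discussed above, a plain empirical pAUC — the trapezoidal area under the ROC curve restricted to a range of false positive rates — can be computed directly; the paper's variance machinery is not reproduced, and the biomarker scores below are hypothetical.

```python
def roc_points(pos, neg):
    """Empirical ROC points (FPR, TPR) over all score thresholds."""
    thr = sorted(set(pos) | set(neg), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thr:
        tpr = sum(x >= t for x in pos) / len(pos)
        fpr = sum(x >= t for x in neg) / len(neg)
        pts.append((fpr, tpr))
    return pts

def pauc(pos, neg, fpr_max):
    """Trapezoidal partial AUC over FPR in [0, fpr_max], i.e. over
    specificities above 1 - fpr_max."""
    area = 0.0
    pts = roc_points(pos, neg)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        lo, hi = x0, min(x1, fpr_max)
        if hi <= lo:
            continue  # vertical segment or entirely beyond the cutoff
        # linear interpolation of TPR on the (possibly clipped) segment
        y_hi = y0 + (y1 - y0) * (hi - x0) / (x1 - x0)
        area += 0.5 * (y0 + y_hi) * (hi - lo)
    return area

# hypothetical biomarker scores
pos = [3.1, 2.8, 2.5, 2.2, 1.9]   # diseased
neg = [1.0, 1.3, 1.6, 2.0, 2.3]   # healthy
```

Setting fpr_max = 1.0 recovers the full AUC (here 0.88, matching the Mann–Whitney count of concordant pairs), while a restricted fpr_max isolates the clinically relevant high-specificity region.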

Minimization, a dynamic allocation method, is gaining popularity especially in cancer clinical trials. Aiming to achieve balance on all important prognostic factors simultaneously, this procedure can lead to a substantial reduction in covariate imbalance compared with conventional randomization in small clinical trials. While minimization has generated enthusiasm, some controversy exists over the proper analysis of such a trial. Critics argue that standard testing methods that do not account for the dynamic allocation algorithm can lead to invalid statistical inference. Acknowledging this limitation, the International Conference on Harmonization E9 guideline suggests that ‘the complexity of the logistics and potential impact on analyses be carefully evaluated when considering dynamic allocation’. In this article, we investigate the proper analysis approaches to inference in a minimization design for both continuous and time-to-event endpoints and evaluate the validity and power of these approaches under a variety of scenarios both theoretically and empirically. Published 2016. This article is a U.S. Government work and is in the public domain in the USA
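
For orientation, a Pocock–Simon-style minimization step — the kind of dynamic allocation discussed above — can be sketched as follows; the two-arm setup, factors, and biased-coin probability are illustrative assumptions, not any specific trial's algorithm.

```python
import random

def minimization_assign(patient, history, factors, rng, p_best=0.8):
    """Assign the new patient to the arm that minimises total marginal
    imbalance over the prognostic factors, using a biased coin with
    probability p_best for the minimising arm (pure randomisation on ties)."""
    arms = ("A", "B")
    scores = []
    for arm in arms:
        imbalance = 0
        for f in factors:
            level = patient[f]
            # per-arm counts of patients sharing this factor level,
            # as they would stand if the new patient joined `arm`
            counts = {a: sum(1 for (pa, pp) in history
                             if pa == a and pp[f] == level) for a in arms}
            counts[arm] += 1
            imbalance += abs(counts["A"] - counts["B"])
        scores.append((imbalance, arm))
    scores.sort()
    if scores[0][0] == scores[1][0]:
        return rng.choice(arms)
    return scores[0][1] if rng.random() < p_best else scores[1][1]

# simulate 40 hypothetical enrolments over two prognostic factors
rng = random.Random(7)
factors = ("sex", "stage")
history = []
for pt in [{"sex": "M", "stage": 1}, {"sex": "M", "stage": 2},
           {"sex": "F", "stage": 1}, {"sex": "F", "stage": 2}] * 10:
    arm = minimization_assign(pt, history, factors, rng)
    history.append((arm, pt))
n_a = sum(1 for (a, _) in history if a == "A")
```

The biased coin keeps the procedure non-deterministic, which is one of the safeguards usually recommended when minimization is used in practice.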

The well-known McNemar test assesses the difference between two correlated proportions in binary matched-pairs data. To improve the power of the McNemar test and extend it to related problems, we reinterpret the test in a Bayesian framework. Replacing the prior density with a more realistic one yields a more powerful test. We numerically investigate different choices of the prior density, which strongly affect the performance of the derived test. Furthermore, we compare the maximum actual levels of the proposed test with those of existing tests. The proposed test is advantageous for its wide extendibility. We combine the evidence from multiple strata by an approach that differs substantially from existing methods: the test statistic is the product of the posterior probabilities of the alternative models in the multiple strata. The proposed test is validated in practical examples. Copyright © 2016 John Wiley & Sons, Ltd.
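For reference, the classical frequentist McNemar statistic that the paper reinterprets can be computed in a few lines. This sketch shows only the standard asymptotic test, not the Bayesian variant proposed in the abstract.

```python
import math

def mcnemar(b, c):
    """Classical (asymptotic) McNemar test for paired binary data.
    b and c are the two discordant cell counts of the 2x2
    matched-pairs table (b + c must be positive)."""
    stat = (b - c) ** 2 / (b + c)
    # chi-square(1 df) upper-tail probability via the complementary
    # error function: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

Only the discordant pairs enter the statistic; the concordant cells are irrelevant to the comparison of the two correlated proportions.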

Understanding the dynamic disease process is vital in early detection, diagnosis, and measuring progression. Continuous-time Markov chain (CTMC) methods have been used to estimate state-change intensities, but challenges arise when stages are potentially misclassified. We present an analytical likelihood approach where the hidden state is modeled as a three-state CTMC, allowing for some observed states to be misclassified. Covariate effects on the hidden process and misclassification probabilities of the hidden state are estimated without information from a ‘gold standard’ as comparison. Parameter estimates are obtained using a modified expectation-maximization (EM) algorithm, and identifiability of CTMC estimation is addressed. Simulation studies and an application studying Alzheimer's disease caregiver stress levels are presented. The method was highly sensitive to detecting true misclassification and did not falsely identify error in the absence of misclassification. In conclusion, we have developed a robust longitudinal method for analyzing categorical outcome data when classification of disease severity stage is uncertain and the purpose is to study the process's transition behavior without a gold standard. Copyright © 2016 John Wiley & Sons, Ltd.

This paper introduces a method of surveillance using deviations from probabilistic forecasts. Realised observations are compared with probabilistic forecasts, and the “deviation” metric is based on low-probability events. If an alert is declared, the algorithm continues to monitor until an all-clear is announced. Specifically, this article addresses the problem of syndromic surveillance for influenza (flu) with the intention of detecting outbreaks, due to new strains of viruses, over and above the normal seasonal pattern. The syndrome is hospital admissions for flu-like illness, and hence the data are low counts. In accordance with the count properties of the observations, an integer-valued autoregressive process is used to model flu occurrences. Monte Carlo evidence suggests the method works well in stylised but somewhat realistic situations. An application to real flu data indicates that the ideas may have promise: the model estimated on a short run of training data did not declare false alarms when used with new observations deemed in control, ex post, and it easily detected the 2009 H1N1 outbreak. Copyright © 2016 John Wiley & Sons, Ltd.
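An integer-valued autoregressive process of the kind mentioned can be simulated with binomial thinning. The sketch below is the generic textbook INAR(1) formulation with Poisson innovations, not necessarily the exact specification fitted in the paper; all names are ours.

```python
import math
import random

def simulate_inar1(alpha, lam, n, x0=0, seed=1):
    """Simulate an INAR(1) count series: X_t = alpha o X_{t-1} + eps_t,
    where 'o' is binomial thinning (each of the X_{t-1} counts survives
    independently with probability alpha) and eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)

    def poisson(mu):
        # Knuth's multiplication algorithm; fine for small mu
        L, k, p = math.exp(-mu), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    x, series = x0, []
    for _ in range(n):
        survivors = sum(rng.random() < alpha for _ in range(x))  # thinning
        x = survivors + poisson(lam)
        series.append(x)
    return series
```

The stationary mean of this process is lam / (1 - alpha), which makes it easy to sanity-check simulated runs.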

We propose a prediction model for the cumulative incidence functions of competing risks, based on a logit link. Because of a concern that censoring may depend on time-varying covariates in our motivating human immunodeficiency virus (HIV) application, we describe an approach for estimating the parameters in the prediction models using inverse probability of censoring weighting under a missing at random assumption. We then illustrate the application of this methodology to identify predictors of the competing outcomes of virologic failure, an efficacy outcome, and treatment-limiting adverse event, a safety outcome, among HIV-infected patients first starting antiretroviral treatment. Copyright © 2016 John Wiley & Sons, Ltd.

Unmeasured confounding is a major threat to the validity of pharmacoepidemiological studies of medication safety and effectiveness. We propose a new method for detecting and reducing the impact of unobserved confounding in large observational database studies. The method uses assumptions similar to the prescribing preference-based instrumental variable (IV) approach. Our method relies on the new ‘missing cause’ principle, according to which the impact of unmeasured confounding by (contra-)indication may be detected by assessing discrepancies between the following: (i) treatment actually received by individual patients and (ii) treatment that they would be expected to receive based on the observed data. Specifically, we use the treatment-by-discrepancy interaction to test for the presence of unmeasured confounding and correct the treatment effect estimate for the resulting bias. Under standard IV assumptions, we first proved that unmeasured confounding induces a spurious treatment-by-discrepancy interaction in risk difference models for binary outcomes and then simulated large pharmacoepidemiological studies with unmeasured confounding. In simulations, our estimates had four to six times smaller bias than conventional treatment effect estimates, adjusted only for measured confounders, and much smaller variance inflation than unbiased but very unstable IV estimates, resulting in uniformly lowest root mean square errors. The much lower variance of our estimates, relative to IV estimates, was also observed in an application comparing gastrointestinal safety of two classes of anti-inflammatory drugs. In conclusion, our missing cause-based method may complement other methods and enhance accuracy of analyses of large pharmacoepidemiological studies. Copyright © 2016 John Wiley & Sons, Ltd.

Integration of data of disparate types has become increasingly important to enhancing the power for new discoveries by combining complementary strengths of multiple types of data. One application is to uncover tumor subtypes in human cancer research in which multiple types of genomic data are integrated, including gene expression, DNA copy number, and DNA methylation data. In spite of their successes, existing approaches based on joint latent variable models require stringent distributional assumptions and may suffer from unbalanced scales (or units) of different types of data and non-scalability of the corresponding algorithms. In this paper, we propose an alternative based on integrative and regularized principal component analysis, which is distribution-free, computationally efficient, and robust against unbalanced scales. The new method performs dimension reduction simultaneously on multiple types of data, seeking data-adaptive sparsity and scaling. As a result, in addition to feature selection for each type of data, integrative clustering is achieved. Numerically, the proposed method compares favorably against its competitors in terms of accuracy (in identifying hidden clusters), computational efficiency, and robustness against unbalanced scales. In particular, compared with a popular method, the new method was competitive in identifying tumor subtypes associated with distinct patient survival patterns when applied to a combined analysis of DNA copy number, mRNA expression, and DNA methylation data in a glioblastoma multiforme study. Copyright © 2016 John Wiley & Sons, Ltd.

In two-armed trials with clustered observations, the arms may differ in terms of (i) the intraclass correlation, (ii) the outcome variance, (iii) the average cluster size, and (iv) the number of clusters. For a linear mixed model analysis of the treatment effect, this paper examines the expected efficiency loss due to varying cluster sizes, based upon the asymptotic relative efficiency of varying versus constant cluster sizes. Simple, but nearly cost-optimal, correction factors are derived for the numbers of clusters to repair this efficiency loss. In an extensive Monte Carlo simulation, the accuracy of the asymptotic relative efficiency and its Taylor approximation are examined for small sample sizes. Practical guidelines are derived to correct the numbers of clusters calculated under constant cluster sizes (within each treatment) when planning a study. Because of the variety of simulation conditions, these guidelines can be considered conservative but safe in many realistic situations. Copyright © 2016 John Wiley & Sons, Ltd.

A fully independent drug development programme to demonstrate efficacy may not be ethical and/or feasible in small populations such as paediatric populations or orphan indications. Different levels of extrapolation from a larger population to smaller target populations are widely used to support decisions in this situation. Regulatory guidance documents mention a weakening of the statistical rigour for trials in the target population as one option for dealing with this problem. To this end, we propose clinical trial designs that make use of prior knowledge on efficacy for inference. We formulate a framework based on prior beliefs in order to investigate when the significance level for the test of the primary endpoint in confirmatory trials can be relaxed (and thus the sample size reduced) in the target population while controlling a certain posterior belief in effectiveness after rejection of the null hypothesis in the corresponding confirmatory statistical test. We show that point priors may be used in the argumentation because, under certain constraints, they have favourable limiting properties among other types of priors. The crucial quantity to be elicited is the prior belief in the possibility of extrapolation from a larger population to the target population. We illustrate an existing decision tree for extrapolation to paediatric populations within our framework. © 2016 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.

Identification of the latency period and age-related susceptibility, if any, is an important aspect of assessing risks of environmental, nutritional, and occupational exposures. We consider estimation and inference for latency and age-related susceptibility in relative risk and excess risk models. We focus on likelihood-based methods for point and interval estimation of the latency period and age-related windows of susceptibility coupled with several commonly considered exposure metrics. The method is illustrated in a study of the timing of the effects of constituents of air pollution on mortality in the Nurses' Health Study. Copyright © 2016 John Wiley & Sons, Ltd.

Q-learning is a regression-based approach that uses longitudinal data to construct dynamic treatment regimes, which are sequences of decision rules that use patient information to inform future treatment decisions. An optimal dynamic treatment regime is composed of a sequence of decision rules that indicate how to individualize treatment using the patients' baseline and time-varying characteristics so as to optimize the final outcome. Constructing optimal dynamic regimes using Q-learning depends heavily on the assumption that the regression models at each decision point are correctly specified; yet model checking in the context of Q-learning has been largely overlooked in the current literature. In this article, we show that residual plots obtained from standard Q-learning models may fail to adequately check the quality of the model fit. We present a modified Q-learning procedure that accommodates residual analyses using standard tools, along with simulation studies showing the advantage of the proposed modification over standard Q-learning. We illustrate this new Q-learning approach using data collected from a sequential multiple assignment randomized trial of patients with schizophrenia. Copyright © 2016 John Wiley & Sons, Ltd.

This paper considers the analysis of a repeat event outcome in clinical trials of chronic diseases in the context of dependent censoring (e.g. mortality). It has particular application to recurrent heart failure hospitalisations in trials of heart failure. Semi-parametric joint frailty models (JFMs) simultaneously analyse recurrent heart failure hospitalisations and time to cardiovascular death, estimating distinct hazard ratios whilst individual-specific latent variables induce associations between the two processes. A simulation study was carried out to assess the suitability of the JFM versus marginal analyses of recurrent events and cardiovascular death using standard methods. Hazard ratios were consistently overestimated when marginal models were used, whilst the JFM produced accurate, well-estimated results. An application to the Candesartan in Heart failure: Assessment of Reduction in Mortality and morbidity programme was considered; the JFM gave unbiased estimates of treatment effects in the presence of dependent censoring. We advocate the use of the JFM for future trials that consider recurrent events as the primary outcome. © 2016 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.

In stepped cluster designs, the intervention is introduced into some (or all) clusters at different times and persists until the end of the study. Instances include traditional parallel cluster designs and the more recent stepped-wedge designs. We consider the precision offered by such designs under mixed-effects models with fixed time and random subject and cluster effects (including interactions with time), and we explore the optimal choice of uptake times. The results apply both to cross-sectional studies, where new subjects are observed at each time point, and to longitudinal studies, with repeat observations on the same subjects.

The efficiency of the design is expressed in terms of a ‘cluster-mean correlation’, which carries information about the dependency structure of the data, and two design coefficients, which reflect the pattern of uptake times. In cross-sectional studies the cluster-mean correlation combines information about the cluster size and the intra-cluster correlation coefficient. A formula is given for the ‘design effect’ in both cross-sectional and longitudinal studies.

An algorithm for optimising the choice of uptake times is described and specific results obtained for the best balanced stepped designs. In large studies we show that the best design is a hybrid mixture of parallel and stepped-wedge components, with the proportion of stepped wedge clusters equal to the cluster-mean correlation. The impact of prior uncertainty in the cluster-mean correlation is considered by simulation. Some specific hybrid designs are proposed for consideration when the cluster-mean correlation cannot be reliably estimated, using a minimax principle to ensure acceptable performance across the whole range of unknown values. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
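The precision calculations underlying such comparisons have a well-known closed form in the simplest cross-sectional case. The sketch below implements the Hussey–Hughes variance formula for a cross-sectional stepped cluster design with fixed time effects and random cluster intercepts; this is a related but simpler model than the one in the paper (no random subject effects or cluster-by-time interactions), and all names are ours.

```python
def hh_variance(X, sigma2, tau2):
    """Closed-form variance of the treatment-effect estimate for a
    cross-sectional stepped cluster design under the Hussey-Hughes
    mixed model (fixed period effects, random cluster intercepts).
    X[i][t] = 1 if cluster i has taken up the intervention in period t;
    sigma2 is the variance of a cluster-period mean and tau2 the
    between-cluster variance."""
    I = len(X)       # number of clusters
    T = len(X[0])    # number of periods
    U = sum(sum(row) for row in X)
    W = sum(sum(X[i][t] for i in range(I)) ** 2 for t in range(T))
    V = sum(sum(row) ** 2 for row in X)
    num = I * sigma2 * (sigma2 + T * tau2)
    den = (I * U - W) * sigma2 + (U * U + I * T * U - T * W - I * V) * tau2
    return num / den
```

Evaluating this variance over candidate uptake-time patterns is one concrete way to compare parallel, stepped-wedge, and hybrid layouts of the kind discussed above.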

Dynamic prediction uses longitudinal biomarkers for real-time prediction of an individual patient's prognosis. This is critical for patients with an incurable disease such as cancer. Biomarker trajectories are usually not linear, nor even monotone, and vary greatly across individuals. Therefore, it is difficult to fit them with parametric models. With this consideration, we propose an approach for dynamic prediction that does not need to model the biomarker trajectories. Instead, as a trade-off, we assume that the biomarker effects on the risk of disease recurrence are smooth functions over time. This approach turns out to be computationally easier. Simulation studies show that the proposed approach achieves stable estimation of biomarker effects over time, has good predictive performance, and is robust against model misspecification. It is a good compromise between two major approaches, namely, (i) joint modeling of longitudinal and survival data and (ii) landmark analysis. The proposed method is applied to patients with chronic myeloid leukemia. At any time following their treatment with tyrosine kinase inhibitors, longitudinally measured *BCR-ABL* gene expression levels are used to predict the risk of disease progression. Copyright © 2016 John Wiley & Sons, Ltd.

The integrated discrimination improvement (IDI) is commonly used to compare two risk prediction models; it summarizes the extent to which a new model increases predicted risk in events and decreases predicted risk in non-events. The IDI averages risks across events and non-events and is therefore susceptible to Simpson's paradox. In some settings, adding a predictive covariate to a well-calibrated model results in an overall negative (positive) IDI; however, if stratified by that same covariate, the strata-specific IDIs are positive (negative). Meanwhile, the calibration (observed-to-expected ratio and Hosmer–Lemeshow goodness-of-fit test), area under the receiver operating characteristic curve, and Brier score improve both overall and by stratum. We ran extensive simulations to investigate the impact of an imbalanced covariate on these metrics (IDI, area under the receiver operating characteristic curve, Brier score, and *R*^{2}), provide an analytic explanation for the paradox in the IDI, and use an investigative metric, a weighted IDI, to better understand the paradox. In simulations, all instances of the paradox occurred under stratum-specific mis-calibration, yet there were mis-calibrated settings in which the paradox did not occur. The paradox is illustrated on Cancer Genomics Network data by calculating predictions based on two versions of BRCAPRO, a Mendelian risk prediction model for breast and ovarian cancer. In both the simulations and the Cancer Genomics Network data, overall model calibration did not guarantee stratum-level calibration. We conclude that the IDI should assess model performance among a clinically relevant subset only when stratum-level calibration is strictly met, and we recommend calculating additional metrics to confirm the direction and conclusions of the IDI. Copyright © 2016 John Wiley & Sons, Ltd.
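The IDI has a simple closed form, and it is precisely this averaging over events and non-events that exposes it to Simpson's paradox. A sketch with hypothetical predicted risks (names are ours):

```python
def idi(p_old, p_new, y):
    """Integrated discrimination improvement: the rise in mean predicted
    risk among events plus the drop in mean predicted risk among
    non-events, comparing a new model with an old one.
    p_old, p_new: predicted risks; y: binary event indicators."""
    ev_new = [pn for pn, yi in zip(p_new, y) if yi == 1]
    ev_old = [po for po, yi in zip(p_old, y) if yi == 1]
    ne_new = [pn for pn, yi in zip(p_new, y) if yi == 0]
    ne_old = [po for po, yi in zip(p_old, y) if yi == 0]
    mean = lambda v: sum(v) / len(v)
    return (mean(ev_new) - mean(ev_old)) + (mean(ne_old) - mean(ne_new))
```

Computing the same quantity within covariate strata (a "stratified IDI") is what reveals the sign reversals described above.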

Vitamin D measurements are influenced by seasonal variation and by the specific assay used. Motivated by multicenter studies of associations of vitamin D with cancer, we formulated an analytic framework for matched case–control data that accounts for seasonal variation and calibrates to a reference assay. Calibration data were obtained from controls sampled within decile strata of the uncalibrated vitamin D values, and seasonal sine–cosine series were fit to the control data. Practical findings included the following: (1) failure to adjust for season and calibrate increased variance, bias, and mean square error; (2) analysis of continuous vitamin D requires a variance adjustment for variation in the calibration estimate, and an advantage of the continuous linear risk model is that results are independent of the reference date for seasonal adjustment; and (3) for categorical risk models, procedures based on categorizing the seasonally adjusted and calibrated vitamin D have near-nominal operating characteristics, but estimates of log odds ratios are not robust to the choice of seasonal reference date. Thus, public health recommendations based on categories of vitamin D should also define the time of year to which they refer. This work supports the use of simple methods for calibration and seasonal adjustment and is informing analytic approaches for the multicenter Vitamin D Pooling Project for Breast and Colorectal Cancer. Published 2016. This article has been contributed to by US Government employees and their work is in the public domain in the USA.

Cystic fibrosis (CF) is a hereditary lung disease characterized by loss of lung function over time. Lung function in CF is believed to decline at a higher rate during adolescence. It has also been hypothesized that there is a subgroup of individuals for whom lung disease remains relatively stable, with only a slight decline over their lifetime. Using data from the University of Colorado CF Children's Registry, we investigate four change point models for the decline of lung function in children and adolescents: (i) a two-component mixture random change point model, (ii) a two-component mixture fixed change point model, (iii) a random change point model, and (iv) a fixed change point model. The models are investigated through posterior predictive simulation at the individual and population levels and through a simulation study examining the effects of model misspecification. The data support the mixture random change point model as the preferred model, with roughly 30% of adolescents experiencing a steady decline of 0.5 %FEV_{1} per year and 70% experiencing an increased decline of 4.4 %FEV_{1} per year beginning, on average, at 14.6 years of age. Copyright © 2016 John Wiley & Sons, Ltd.

When conducting a meta-analysis of standardized mean differences (SMDs), it is common to use Cohen's *d*, or its variants, which require equal variances in the two arms of each study. While interpretation of these SMDs is simple, this alone should not be used as a justification for assuming equal variances. Until now, researchers have either used an *F*-test for each individual study or conveniently ignored such tools altogether. In this paper, we propose a meta-analysis of ratios of sample variances to assess whether the equal-variances assumption is justified prior to a meta-analysis of SMDs. Quantile–quantile plots, an omnibus test for equal variances, or an overall meta-estimate of the ratio of variances can all be used to formally justify the use of less common methods when evidence of unequal variances is found. The methods in this paper are simple to implement, and the validity of the approaches is reinforced by simulation studies and an application to a real data set. Copyright © 2016 John Wiley & Sons, Ltd.
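One way such an overall meta-estimate of the variance ratio could be formed is by inverse-variance pooling of log sample-variance ratios, using the large-sample approximation Var(ln S²) ≈ 2/(n − 1). This is a rough sketch of the idea only; the paper's exact estimator may differ, and the names are ours.

```python
import math

def pooled_log_variance_ratio(studies):
    """Inverse-variance pooled estimate of the log ratio of arm variances.
    Each study is a tuple (s1_sq, n1, s2_sq, n2): sample variances and
    sample sizes of the two arms. Returns (estimate, standard error)."""
    num = den = 0.0
    for s1_sq, n1, s2_sq, n2 in studies:
        y = math.log(s1_sq / s2_sq)
        v = 2.0 / (n1 - 1) + 2.0 / (n2 - 1)  # approx. variance of y
        num += y / v
        den += 1.0 / v
    est = num / den
    se = math.sqrt(1.0 / den)
    return est, se
```

A pooled estimate near zero (relative to its standard error) would support proceeding with equal-variance SMDs; a clearly non-zero estimate would justify the less common unequal-variance methods mentioned above.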

Meta-analysis of clinical trials is a methodology to summarize information from a collection of trials about an intervention, in order to make informed inferences about that intervention. Random effects allow the target population outcomes to vary among trials. Since meta-analysis is often an important element in helping shape public health policy, society depends on biostatisticians to help ensure that the methodology is sound. Yet when meta-analysis involves randomized binomial trials with low event rates, the overwhelming majority of publications use methods currently not intended for such data. This statistical practice issue must be addressed. Proper methods exist, but they are rarely applied. This tutorial is devoted to estimating a well-defined overall relative risk, via a patient-weighted random-effects method. We show what goes wrong with methods based on ‘inverse-variance’ weights, which are almost universally used. To illustrate similarities and differences, we contrast our methods, inverse-variance methods, and the published results (usually inverse-variance) for 18 meta-analyses from 13 *Journal of the American Medical Association* articles. We also consider the 2007 case of rosiglitazone (Avandia), where important public health issues were at stake, involving patient cardiovascular risk. The most widely used method would have reached a different conclusion. © 2016 The Authors. *Statistics in Medicine* published by John Wiley & Sons Ltd.
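The tutorial's own patient-weighted estimator is not reproduced here, but the flavour of weighting by patient numbers rather than by estimated inverse variances can be illustrated with the familiar Mantel–Haenszel pooled relative risk, which remains well defined even for trials with zero events in an arm:

```python
def mantel_haenszel_rr(tables):
    """Mantel-Haenszel pooled relative risk across strata (trials).
    Each table is (a, n1, c, n0): event count and arm size in the
    treated arm, then in the control arm. The weights depend only on
    the cell counts and sample sizes, not on estimated variances, so
    zero-event trials still contribute to the pooled sums."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in tables)
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in tables)
    return num / den
```

By contrast, inverse-variance weighting must either exclude zero-event trials or apply continuity corrections, which is one root of the problems the tutorial describes for low-event-rate meta-analyses.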

Prediction of an outcome for a given unit based on prediction models built on a training sample plays a major role in many research areas. In current practice, the uncertainty of the prediction is predominantly characterized by subject sampling variation, whereby prediction models built on hypothetically re-sampled units yield variable predictions for the same unit of interest. It is almost always true that the predictors used to build prediction models are simply a subset of the entirety of factors related to the outcome. Following the frequentist principle, we can also account for the variation due to hypothetically re-sampled predictors used to build the prediction models. This is particularly important in medicine, where the prediction has important and sometimes life-or-death consequences for a patient's health status. In this article, we discuss some rationale along this line in the context of medicine. We propose a simple approach to estimate the standard error of the prediction that accounts for the variation due to sampling both subjects and predictors under logistic and Cox regression models. A simulation study is presented to support our argument and demonstrate the performance of our method. The concept and method are applied to a real data set. Copyright © 2015 John Wiley & Sons, Ltd.

The use and development of mobile interventions are experiencing rapid growth. In “just-in-time” mobile interventions, treatments are provided via a mobile device and are intended to help an individual make healthy decisions ‘in the moment,’ and thus have a proximal, near-future impact. Currently, the development of mobile interventions is proceeding at a much faster pace than that of associated data science methods. A first step toward developing data-based methods is to provide an experimental design for testing the proximal effects of these just-in-time treatments. In this paper, we propose a ‘micro-randomized’ trial design for this purpose. In a micro-randomized trial, treatments are sequentially randomized throughout the conduct of the study, with the result that each participant may be randomized on the hundreds or thousands of occasions at which a treatment might be provided. Further, we develop a test statistic for assessing the proximal effect of a treatment as well as an associated sample size calculator. We conduct simulation evaluations of the sample size calculator in various settings and discuss rules of thumb that might be used in designing a micro-randomized trial. This work is motivated by our collaboration on the HeartSteps mobile application designed to increase physical activity. Copyright © 2015 John Wiley & Sons, Ltd.

The 2010 US Food and Drug Administration and European Medicines Agency regulatory approaches to establish bioequivalence in highly variable drugs are both based on linearly scaling the bioequivalence limits; that is, both take a ‘scaled average bioequivalence’ approach. The present paper corroborates previous work suggesting that neither of them adequately controls the type I error, or consumer's risk, so both result in invalid test procedures in the neighbourhood of a within-subject coefficient of variation of 30% for the reference (*R*) formulation. The problem is particularly serious in the US Food and Drug Administration regulation, but it is also appreciable in the European Medicines Agency one. For the partially replicated TRR/RTR/RRT and the replicated TRTR/RTRT crossover designs, we quantify these type I error problems by means of a simulation study, discuss their possible causes, and propose straightforward improvements on both regulatory procedures that improve their type I error control while maintaining adequate power. Copyright © 2015 John Wiley & Sons, Ltd.

The net survival of a patient diagnosed with a given disease is a quantity often interpreted as the hypothetical survival probability in the absence of causes of death other than the disease. In a relative survival framework, net survival summarises the excess mortality that patients experience compared with their relevant reference population. Based on follow-up data from the Finnish Cancer Registry, we derived simulation scenarios describing the survival of patients in eight cancer sites, reflecting different excess mortality patterns, in order to compare the performance of the classical Ederer II estimator and the new estimator proposed by Pohar Perme *et al.* At 5 years, the age-standardised Ederer II estimator performed as well as the Pohar Perme estimator, with the exception of melanoma, for which the Pohar Perme estimator had a smaller mean squared error (MSE). At 10 and 15 years, the age-standardised Ederer II estimator most often performed better than the Pohar Perme estimator. The unstandardised Ederer II estimator had the largest MSE at 5 years; however, its MSE was often smaller than those of the other estimators at 10 and 15 years, especially in sparse data. Both the Pohar Perme and the age-standardised Ederer II estimators are valid for 5-year net survival of cancer patients. For longer-term net survival, our simulation results support the use of the age-standardised Ederer II estimator. Copyright © 2015 John Wiley & Sons, Ltd.

Consider a parallel group trial for the comparison of an experimental treatment to a control, where the second-stage sample size may depend on the blinded primary endpoint data as well as on additional blinded data from a secondary endpoint. For the setting of normally distributed endpoints, we demonstrate that this may lead to an inflation of the type I error rate if the null hypothesis holds for the primary but not the secondary endpoint.

We derive upper bounds for the inflation of the type I error rate, both for trials that employ random allocation and for those that use block randomization. We illustrate the worst-case sample size reassessment rule in a case study. For both randomization strategies, the maximum type I error rate increases with the effect size in the secondary endpoint and the correlation between endpoints. The maximum inflation increases with smaller block sizes if information on the block size is used in the reassessment rule. Based on our findings, we do not question the well-established use of blinded sample size reassessment methods with nuisance parameter estimates computed from the blinded interim data of the primary endpoint. However, we demonstrate that the type I error rate control of these methods relies on the application of specific, binding, pre-planned and fully algorithmic sample size reassessment rules and does not extend to general or unplanned sample size adjustments based on blinded data. © 2015 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.

There has been a series of occasional papers in this journal about semiparametric methods for robust covariate control in the analysis of clinical trials. These methods are fairly easy to apply on currently available computers, but standard software packages do not yet support them with easy option selections. Moreover, these methods can be difficult to explain to practitioners who have only a basic statistical education. There is also a somewhat neglected history demonstrating that ordinary least squares (OLS) is very robust to the types of outcome distribution features that have motivated the newer methods for robust covariate control. We review these two strands of literature and report on new simulations that demonstrate the robustness of OLS to more extreme normality violations than previously explored. The new simulations involve two strongly leptokurtic outcomes: near-zero binary outcomes and zero-inflated gamma outcomes. Potential examples of such outcomes include, respectively, 5-year survival rates for stage IV cancer and healthcare claim amounts for rare conditions. We find that traditional OLS methods work very well down to very small sample sizes for such outcomes. Under some circumstances, OLS with robust standard errors works well with even smaller sample sizes. Given this literature review and our new simulations, we think that most researchers may comfortably continue using standard OLS software, preferably with robust standard errors. Copyright © 2015 John Wiley & Sons, Ltd.
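For a single covariate, the heteroscedasticity-robust (HC0 "sandwich") standard error recommended above reduces to a short formula; a minimal pure-Python sketch (names are ours):

```python
import math

def ols_robust(x, y):
    """Simple-regression OLS slope with the HC0 'sandwich' robust
    standard error, which does not assume homoscedastic errors."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # HC0 'meat': squared residuals weighted by squared centred x
    meat = sum(((xi - xbar) ** 2) * (ei ** 2) for xi, ei in zip(x, resid))
    se = math.sqrt(meat) / sxx
    return slope, se
```

The slope estimate is the ordinary OLS one; only the standard error changes, replacing the pooled residual variance with observation-specific squared residuals.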

Receiver operating characteristic (ROC) curves and their summary statistics (e.g., the area under the curve (AUC)) are commonly used to evaluate diagnostic accuracy for disease processes with binary classification. The ROC curve has been extended to an ROC surface for scenarios with three ordinal classes and to a hyper-surface for scenarios with more than three classes. For classifiers under tree or umbrella ordering, in which the marker measurement for one class is lower or higher than those for the other classes, the commonly adopted diagnostic measures are the naive AUC (*NAUC*), based on pooling all the unordered classes, and the umbrella volume (*UV*), based on the concept of volume under the surface. However, both *NAUC* and *UV* have limitations. For example, *NAUC* depends on the sampling weights for all the classes in the population, and *UV* has only been introduced for three-class settings. In this article, we introduce a new ROC framework for tree or umbrella ordering (denoted *TROC*) and propose the area under the *TROC* curve (denoted *TAUC*) as an appropriate diagnostic measure. The proposed *TROC* and *TAUC* share many nice features with the traditional ROC and AUC. Both parametric and nonparametric approaches are explored to construct confidence interval estimates of *TAUC*. The performances of these methods are compared in simulation studies under a variety of settings. Finally, the proposed methods are applied to a published microarray data set. Copyright © 2015 John Wiley & Sons, Ltd.

The estimation of treatment effects on medical costs is complicated by the need to account for informative censoring, skewness, and the effects of confounders. Because medical costs are often collected from observational claims data, we investigate propensity score (PS) methods such as covariate adjustment, stratification, and inverse probability weighting, taking into account informative censoring of the cost outcome. We compare these more commonly used methods with doubly robust (DR) estimation. We then use a machine learning approach called super learner (SL) to choose among conventional cost models to estimate regression parameters in the DR approach and to choose among various model specifications for PS estimation. Our simulation studies show that when the PS model is correctly specified, weighting and DR perform well. When the PS model is misspecified, the combined approach of DR with SL can still provide unbiased estimates. SL is especially useful when the underlying cost distribution comes from a mixture of different distributions or when the true PS model is unknown. We apply these approaches to a cost analysis of two bladder cancer treatments, cystectomy versus bladder preservation therapy, using SEER-Medicare data. Copyright © 2015 John Wiley & Sons, Ltd.
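Setting aside censoring and the doubly robust augmentation, the inverse probability weighting step can be sketched as follows. The Newton-Raphson logistic fit and the simulated data are illustrative assumptions, not the paper's models.

```python
import numpy as np

def ipw_effect(X, treat, cost, steps=25):
    """Hajek-type IPW estimate of the mean cost difference between treated
    and untreated, with the propensity score estimated by a small
    Newton-Raphson logistic regression."""
    Z = np.column_stack([np.ones(len(treat)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(steps):                        # Newton-Raphson for the logistic MLE
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        H = (Z * (p * (1 - p))[:, None]).T @ Z    # observed information
        beta += np.linalg.solve(H, Z.T @ (treat - p))
    ps = 1.0 / (1.0 + np.exp(-Z @ beta))
    w1, w0 = treat / ps, (1 - treat) / (1 - ps)   # inverse probability weights
    return (w1 @ cost) / w1.sum() - (w0 @ cost) / w0.sum()

# Confounded example: x raises both treatment probability and cost.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
treat = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
cost = 5.0 + 2.0 * treat + 1.5 * x + rng.exponential(1.0, size=n)
effect = ipw_effect(x, treat, cost)  # true effect is 2; the naive difference is larger
```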

The issuance of a report in 2010 by the National Research Council (NRC) of the National Academy of Sciences entitled ‘The Prevention and Treatment of Missing Data in Clinical Trials,’ commissioned by the US Food and Drug Administration, had an immediate impact on the way that statisticians and clinical researchers in both industry and regulatory agencies think about the missing data problem. We believe that there is currently great potential to improve study quality and interpretability—by reducing the amount of missing data through changes in trial design and conduct and by planning and conducting analyses that better account for the missing information. Here, we describe our view on some of the recommendations in the report and suggest ways in which these recommendations can be incorporated into new or ongoing clinical trials in order to improve their chance of success. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.

The National Research Council Panel on Handling Missing Data in Clinical Trials recommended that protocols for clinical trials ‘explicitly define... causal estimands of primary interest’. In discussions with sponsors of clinical trials since the publication of the National Research Council report, the expression *causal estimands* has been the subject of confusion. It may not be entirely clear what the National Research Council panel meant, and in any case, it has not been clear how this recommendation might be put in practice. This paper's purpose is to say how the working group understands it and how we think it should be put in practice. We classify possible choices of estimand according to their usefulness for regulatory purposes in various clinical settings. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.

Recently, multiple imputation has been proposed as a tool for individual patient data meta-analysis with sporadically missing observations, and it has been suggested that within-study imputation is usually preferable. However, such within-study imputation cannot handle variables that are completely missing within studies. Further, if some of the contributing studies are relatively small, it may be appropriate to share information across studies when imputing. In this paper, we develop and evaluate a joint modelling approach to multiple imputation of individual patient data in meta-analysis, with an across-study probability distribution for the study-specific covariance matrices. This retains the flexibility to allow for between-study heterogeneity when imputing while allowing (i) sharing information on the covariance matrix across studies when this is appropriate, and (ii) imputing variables that are wholly missing from studies. Simulation results show both equivalent performance to the within-study imputation approach where this is valid, and good results in more general, practically relevant, scenarios with studies of very different sizes, non-negligible between-study heterogeneity and wholly missing variables. We illustrate our approach using data from an individual patient data meta-analysis of hypertension trials. © 2015 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.

Where treatments are administered to groups of patients or delivered by therapists, outcomes for patients in the same group or treated by the same therapist may be more similar, leading to clustering. Trials of such treatments should take account of this effect. Where such a treatment is compared with an un-clustered treatment, the trial has a partially nested design. This paper compares statistical methods for this design where the outcome is binary.

Investigation of consistency reveals that a random coefficient model with a random effect for group or therapist is not consistent with other methods for a null treatment effect, and so this model is not recommended for this design. Small sample performance of a cluster-adjusted test of proportions, a summary measures test, and logistic generalised estimating equation and random intercept models is investigated through simulation. The expected treatment effect is biased for the logistic models. Empirical test size of two-sided tests is raised only slightly, but there are substantial biases for one-sided tests. Three formulae are proposed for calculating sample size and power based on (i) the difference of proportions, (ii) the log-odds ratio or (iii) the arc-sine transformation of proportions. Calculated power from these formulae is compared with empirical power from a simulation study.

Logistic models appeared to perform better than those based on proportions with the likelihood ratio test performing best in the range of scenarios considered. For these analyses, the log-odds ratio method of calculation of power gave an approximate lower limit for empirical power. © 2015 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.
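Of the three power formulae, the arc-sine version has a simple closed form; a minimal sketch for the unclustered, equal-allocation case (the paper's clustering adjustment is deliberately not reproduced here):

```python
from math import asin, sqrt
from statistics import NormalDist

def n_per_group_arcsine(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided comparison of two proportions
    on the arc-sine scale, using the large-sample approximation
    Var(asin(sqrt(p_hat))) ~ 1 / (4n)."""
    z = NormalDist().inv_cdf
    delta = asin(sqrt(p1)) - asin(sqrt(p2))
    return (z(1 - alpha / 2) + z(power)) ** 2 / (2 * delta ** 2)

n = n_per_group_arcsine(0.6, 0.4)  # about 97 per group
```

A variance-stabilising transform like this makes the required sample size depend on the proportions only through the arc-sine difference.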

Mendelian randomization is the use of genetic instrumental variables to obtain causal inferences from observational data. Two recent developments for combining information on multiple uncorrelated instrumental variables (IVs) into a single causal estimate are as follows: (i) allele scores, in which individual-level data on the IVs are aggregated into a univariate score, which is used as a single IV, and (ii) a summary statistic method, in which causal estimates calculated from each IV using summarized data are combined in an inverse-variance weighted meta-analysis. To avoid bias from weak instruments, unweighted and externally weighted allele scores have been recommended. Here, we propose equivalent approaches using summarized data and also provide extensions of the methods for use with correlated IVs. We investigate the impact of different choices of weights on the bias and precision of estimates in simulation studies. We show that allele score estimates can be reproduced using summarized data on genetic associations with the risk factor and the outcome. Estimates from the summary statistic method using external weights are biased towards the null when the weights are imprecisely estimated; in contrast, allele score estimates are unbiased. With equal or external weights, both methods provide appropriate tests of the null hypothesis of no causal effect even with large numbers of potentially weak instruments. We illustrate these methods using summarized data on the causal effect of low-density lipoprotein cholesterol on coronary heart disease risk. It is shown that a more precise causal estimate can be obtained using multiple genetic variants from a single gene region, even if the variants are correlated. © 2015 The Authors. *Statistics in Medicine* published by John Wiley & Sons Ltd.
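The inverse-variance weighted combination of per-variant ratio estimates from summarized data can be sketched as follows for uncorrelated variants; the correlated-variant extension needs the genetic correlation matrix and is omitted. Names are illustrative.

```python
import numpy as np

def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted causal estimate from summarized data on
    uncorrelated variants: per-variant ratio estimates beta_y / beta_x
    combined with first-order weights beta_x**2 / se_y**2."""
    beta_x, beta_y, se_y = map(np.asarray, (beta_x, beta_y, se_y))
    ratio = beta_y / beta_x                 # per-variant causal estimates
    w = beta_x ** 2 / se_y ** 2             # inverse-variance weights
    est = np.sum(w * ratio) / np.sum(w)
    se = 1.0 / np.sqrt(np.sum(w))
    return est, se

# All ratio estimates equal 0.5, so the combined estimate is 0.5.
est, se = ivw_estimate([0.2, 0.3, 0.25], [0.10, 0.15, 0.125], [0.01, 0.01, 0.01])
```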

Observational cohort studies often feature longitudinal data subject to irregular observation. Moreover, the timings of observations may be associated with the underlying disease process and must thus be accounted for when analysing the data. This paper suggests that multiple outputation, which consists of repeatedly discarding excess observations, may be a helpful way of approaching the problem. Multiple outputation was designed for clustered data where observations within a cluster are exchangeable; an adaptation for longitudinal data subject to irregular observation is proposed. We show how multiple outputation can be used to expand the range of models that can be fitted to irregular longitudinal data. Copyright © 2015 John Wiley & Sons, Ltd.
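The core of multiple outputation is easy to sketch: repeatedly keep one randomly chosen observation per subject, estimate on each thinned data set, and average the estimates. A minimal sketch with the mean as the within-outputation estimator (names illustrative; the paper's adaptation for outcome-dependent observation times is not reproduced here):

```python
import numpy as np

def multiple_outputation(subject_ids, values, n_outputations=200, seed=0):
    """Average of within-outputation estimates, where each outputation keeps
    a single randomly chosen observation per subject.  The within-outputation
    estimator here is simply the mean of the retained observations."""
    rng = np.random.default_rng(seed)
    subject_ids = np.asarray(subject_ids)
    values = np.asarray(values, dtype=float)
    subjects = np.unique(subject_ids)
    estimates = []
    for _ in range(n_outputations):
        kept = [rng.choice(np.flatnonzero(subject_ids == s)) for s in subjects]
        estimates.append(values[kept].mean())
    return float(np.mean(estimates))

# Subjects are weighted equally regardless of how often they were observed:
# the per-visit mean here is 1.5, but the outputation estimate is 3.0.
m = multiple_outputation([1, 1, 1, 2], [0.0, 0.0, 0.0, 6.0])  # 3.0
```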

Recent studies found that infection-related hospitalization was associated with increased risk of cardiovascular (CV) events, such as myocardial infarction and stroke, in the dialysis population. In this work, we develop time-varying effects modeling tools in order to examine the CV outcome risk trajectories during the time periods before and after an initial infection-related hospitalization. For this, we propose partly conditional and fully conditional partially linear generalized varying coefficient models (PL-GVCMs) for modeling time-varying effects in longitudinal data with substantial follow-up truncation by death. Unconditional models that implicitly target an immortal population are not relevant targets of inference in applications involving a population with high mortality, like the dialysis population. A partly conditional model characterizes the outcome trajectory for the dynamic cohort of survivors, where each point in the longitudinal trajectory represents a snapshot of the population relationships among subjects who are alive at that time point. In contrast, a fully conditional approach models the time-varying effects of the population stratified by the actual time of death, where the mean response characterizes individual trends in each cohort stratum. We compare and contrast partly and fully conditional PL-GVCMs in our aforementioned application using hospitalization data from the United States Renal Data System. For inference, we develop generalized likelihood ratio tests. Simulation studies examine the efficacy of estimation and inference procedures. Copyright © 2015 John Wiley & Sons, Ltd.

Fifty years after Bradford Hill published his extremely influential criteria to offer some guides for separating causation from association, we have accumulated millions of papers and extensive data on observational research that depends on epidemiologic methods and principles. This allows us to re-examine the accumulated empirical evidence for the nine criteria, and to re-approach epidemiology through the lens of exposure-wide approaches. The lecture discusses the evolution of these exposure-wide approaches and tries to use the evidence from meta-epidemiologic assessments to reassess each of the nine criteria and whether they work well as guides for causation. I argue that of the nine criteria, experiment remains important and consistency (replication) is also essential. Temporality also makes sense, but it is often difficult to document. Of the other six criteria, strength mostly does not work and may even have to be inverted: small and even tiny effects are more plausible than large effects; when large effects are seen, they are mostly transient and almost always represent biases and errors. There is little evidence for specificity in causation in nature. It is often unclear how a biological gradient should be modeled, and thus it is difficult to prove. It usually remains unclear how coherence should be operationalized. Finally, plausibility as well as analogy do not work well in most fields of investigation, and their invocation has been mostly detrimental, although exceptions may exist. Copyright © 2015 John Wiley & Sons, Ltd.

No abstract is available for this article.

No abstract is available for this article.

No abstract is available for this article.

*Background:* The use of standard statistical methods in the medical literature has been studied extensively; however, the adoption of new methods has received less attention. We sought to understand (i) whether there is a perception that new methods are underused, (ii) what the barriers to use of new methods are, (iii) what dissemination activities are used, and (iv) user preferences for learning about new methods.

*Methods:* We conducted a cross-sectional survey of members of the Statistical Society of Canada (SSC) and of principal investigators (knowledge-users) funded by the Canadian Institutes of Health Research (CIHR).

*Results:* There were 157 CIHR respondents (14% response rate), and 39 respondents were statisticians from the SSC. Seventy percent of CIHR respondents and 82% of statisticians felt that new developments were underused. Barriers to use of new methods included lack of access to the necessary expertise (selected by over 90% of respondents), lack of suitable software (selected by 81% of statisticians), and lack of time to implement new methods (selected by 78% of statisticians). Greater access to statistical colleagues with an interest in collaboration and availability of software to implement new methods were the top-rated preferences among knowledge-users.

*Conclusions:* There was a clear perception among all respondents that new statistical methods are underused. Encouraging statistical methodologists to develop a knowledge translation plan for improved dissemination and uptake, placing greater value on the role of the statistical collaborator in research, and providing software alongside new methods may improve the use of newly developed statistical methods. Copyright © 2015 John Wiley & Sons, Ltd.

Network meta-analysis is becoming more popular as a way to compare multiple treatments simultaneously. Here, we develop a new estimation method for fitting models for network meta-analysis with random inconsistency effects. This method is an extension of the procedure originally proposed by DerSimonian and Laird. Our methodology allows for inconsistency within the network. The proposed procedure is semi-parametric, non-iterative, fast and highly accessible to applied researchers. The methodology is found to perform satisfactorily in a simulation study provided that the sample size is large enough and the extent of the inconsistency is not very severe. We apply our approach to two real examples. © 2015 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.
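For reference, the original DerSimonian and Laird moment estimator that the proposed procedure extends can be sketched for a standard pairwise random-effects meta-analysis. This is the classical method only, not the network extension with random inconsistency effects.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate.  The moment
    estimate of the between-study variance tau^2 is truncated at zero."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    mu_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fe) ** 2)              # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)       # moment estimate, truncated
    w_re = 1.0 / (v + tau2)                       # random-effects weights
    return np.sum(w_re * y) / np.sum(w_re), tau2

# Three heterogeneous studies: Q = 32, c = 200, so tau^2 = (32 - 2) / 200.
est, tau2 = dersimonian_laird([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
```

Like the network version described above, the procedure is non-iterative and fast, which is much of its appeal to applied researchers.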

An adaptive treatment strategy (ATS) is an outcome-guided algorithm that allows personalized treatment of complex diseases based on patients' disease status and treatment history. Conditions such as AIDS, depression, and cancer usually require several stages of treatment because of the chronic, multifactorial nature of illness progression and management. Sequential multiple assignment randomized (SMAR) designs permit simultaneous inference about multiple ATSs, where patients are sequentially randomized to treatments at different stages depending upon response status. The purpose of the article is to develop a sample size formula to ensure adequate power for comparing two or more ATSs. Based on a Wald-type statistic for comparing multiple ATSs with a continuous endpoint, we develop a sample size formula and show through simulation studies that it maintains the nominal power. The proposed sample size formula is not applicable to designs with time-to-event endpoints, but it will be useful for practitioners designing SMAR trials to compare adaptive treatment strategies. Copyright © 2015 John Wiley & Sons, Ltd.
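Wald-type sample size formulae of this kind build on the familiar two-group continuous-endpoint calculation, which can be sketched as follows; the SMAR-specific variance inflation from sequential randomization is deliberately not reproduced here.

```python
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided Wald test of a mean
    difference `delta` between two groups with common standard
    deviation `sigma`."""
    z = NormalDist().inv_cdf
    return 2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2

n = n_per_arm(0.5, 1.0)  # about 63 per arm for a half-SD difference
```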

The area under the receiver operating characteristic (ROC) curve (AUC) is used as a performance metric for quantitative tests. Although multiple biomarkers may be available for diagnostic or screening purposes, diagnostic accuracy is often assessed individually rather than in combination. In this paper, we consider the problem of combining multiple biomarkers for use in a single diagnostic criterion with the goal of improving the diagnostic accuracy above that of an individual biomarker. The diagnostic criterion created from multiple biomarkers is based on the predictive probability of disease, conditional on the observed multiple biomarker outcomes. If the computed predictive probability exceeds a specified cutoff, the corresponding subject is allocated as ‘diseased’. This defines a standard diagnostic criterion that has its own ROC curve, namely, the combined ROC (*cROC*). The AUC metric for *cROC*, namely, the combined AUC (*cAUC*), is used to compare the predictive criterion based on multiple biomarkers to one based on fewer biomarkers. A multivariate random-effects model is proposed for modeling multiple normally distributed *dependent* scores. Bayesian methods for estimating ROC curves and corresponding (marginal) AUCs are developed when a perfect reference standard is not available. In addition, *cAUC*s are computed to compare the accuracy of different combinations of biomarkers for diagnosis. The methods are evaluated using simulations and are applied to data for Johne's disease (paratuberculosis) in cattle. Copyright © 2015 John Wiley & Sons, Ltd.

Papers evaluating measures of explained variation, or similar indices, almost invariably use independence from censoring as the most important criterion, and they always end up suggesting that some measures meet this criterion and some do not, most of the time leading to the conclusion that the former are better than the latter. As a consequence, users are offered measures that cannot be used with time-dependent covariates and effects, not to mention extensions to repeated events or multi-state models. We explain in this paper that the aforementioned criterion is of no use in studying such measures, because it simply favors those that make an implicit assumption of a model being valid everywhere. Measures not making such an assumption are disqualified, even though they are better in every other respect. We show that if these, allegedly inferior, measures are allowed to make the same assumption, they are easily corrected to satisfy the ‘independent-from-censoring’ criterion. Even better, it is enough to make such an assumption only for the times greater than the last observed failure time *τ*, which, in contrast with the ‘preferred’ measures, makes it possible to use all the modeling flexibility up to *τ* and assume whatever one wants after *τ*. As a consequence, we claim that some of the measures preferred as better in the existing reviews are in fact inferior. Copyright © 2015 John Wiley & Sons, Ltd.

We propose a flexible model for correlated medical cost data with several appealing features. First, the mean function is partially linear. Second, the distributional form for the response is not specified. Third, the covariance structure of correlated medical costs has a semiparametric form. We use extended generalized estimating equations to simultaneously estimate all parameters of interest. B-splines are used to estimate unknown functions, and a modification of the Akaike information criterion is proposed for selecting knots in spline bases. We apply the model to correlated medical costs in the Medical Expenditure Panel Survey dataset. Simulation studies are conducted to assess the performance of our method. Copyright © 2015 John Wiley & Sons, Ltd.

This paper introduces two general models for computing centiles when the response variable *Y* can take values between 0 and 1, inclusive of 0 or 1. The models developed are more flexible alternatives to the beta inflated distribution. The first proposed model employs a flexible four-parameter logit skew Student *t* (*logitSST*) distribution to model the response variable

Rare variant studies are now being used to characterize the genetic diversity between individuals and may help to identify substantial amounts of the genetic variation of complex diseases and quantitative phenotypes. Family data have been shown to be powerful for interrogating rare variants. Consequently, several rare-variant association tests have recently been developed for family-based designs, but typically, these assume the normality of the quantitative phenotypes. In this paper, we present a family-based test for rare-variant association in the presence of non-normal quantitative phenotypes. The proposed model relaxes the normality assumption and does not specify any parametric distribution for the marginal distribution of the phenotype. The dependence between relatives is modeled via a Gaussian copula. A score-type test is derived, and several strategies to approximate its distribution under the null hypothesis are proposed and investigated. The performance of the proposed test is assessed and compared with existing methods by simulations. The methodology is illustrated with an association study involving the adiponectin trait from the UK10K project. Copyright © 2015 John Wiley & Sons, Ltd.

There has been increasing interest in trials that allow for design adaptations like sample size reassessment or treatment selection at an interim analysis. Ignoring the adaptive and multiplicity issues in such designs leads to an inflation of the type 1 error rate, and treatment effect estimates based on the maximum likelihood principle become biased. Whereas the methodological issues concerning hypothesis testing are well understood, it is not clear how to deal with parameter estimation in designs where adaptation rules are not fixed in advance, so that, in practice, the maximum likelihood estimate (MLE) is used. It is therefore important to understand the behavior of the MLE in such designs. The investigation of bias and mean squared error (MSE) is complicated by the fact that the adaptation rules need not be fully specified in advance and, hence, are usually unknown. To investigate bias and MSE under such circumstances, we search for the sample size reassessment and selection rules that lead to the maximum bias or maximum MSE. Generally, this leads to an overestimation of bias and MSE, which can be reduced by imposing realistic constraints on the rules like, for example, a maximum sample size. We consider designs that start with *k* treatment groups and a common control and where selection of a single treatment and control is performed at the interim analysis with the possibility to reassess each of the sample sizes. We consider the case of unlimited sample size reassessments as well as several realistically restricted sample size reassessment rules. © 2015 The Authors. *Statistics in Medicine* Published by John Wiley & Sons Ltd.

In this paper, we present a class of graphical tests of the proportional hazards hypothesis for two-sample censored survival data. The proposed tests are improvements over some existing tests based on asymptotic confidence bands of certain functions of the estimated cumulative hazard functions. The new methods are based on the comparison of unrestricted estimates of the said functions and their restricted versions under the hypothesis. They combine the rigour of analytical tests with the descriptive value of plots. Monte Carlo simulations suggest that the proposed asymptotic procedures have reasonable small sample properties. The power is much higher than existing graphical tests and comparable with existing analytical tests. The method is then illustrated through the analysis of a data set on bone marrow transplantation for leukemia patients. Copyright © 2015 John Wiley & Sons, Ltd.

Based on the physical randomization of completely randomized experiments, in a recent article in *Statistics in Medicine*, Rigdon and Hudgens propose two approaches to obtaining exact confidence intervals for the average causal effect on a binary outcome. They construct the first confidence interval by combining, with the Bonferroni adjustment, the prediction sets for treatment effects among treatment and control groups, and the second one by inverting a series of randomization tests. With sample size *n*, their second approach requires performing *O*(*n*^{4}) randomization tests. We demonstrate that the physical randomization also justifies other ways of constructing exact confidence intervals that are more computationally efficient. By exploiting recent advances in hypergeometric confidence intervals and the stochastic order information of randomization tests, we propose approaches that either do not need to invoke Monte Carlo or require performing at most *O*(*n*^{2}) randomization tests. We provide technical details and R code in the Supporting Information. Copyright © 2016 John Wiley & Sons, Ltd.