In randomized clinical trials where the times to event of two treatment groups are compared under a proportional hazards assumption, it has been established that omitting prognostic factors from the model entails an underestimation of the hazard ratio. Heterogeneity due to unobserved covariates in cancer patient populations is a concern, since genomic investigations have revealed molecular and clinical heterogeneity in these populations. In HIV prevention trials, heterogeneity is unavoidable and has been shown to decrease the treatment effect over time. This article assesses the influence of trial duration on the bias of the estimated hazard ratio that results from omitting covariates from the Cox analysis. The true model is defined by including in the individual hazard an unobserved random frailty term that reflects the omitted covariate. Three frailty distributions are investigated: gamma, log-normal, and binary, and the asymptotic bias of the hazard ratio estimator is calculated. We show that the attenuation of the treatment effect resulting from unobserved heterogeneity increases strongly with trial duration, especially for the continuous frailty distributions that are most likely to reflect omitted covariates encountered in practice. The possibility of interpreting a long-term decrease in treatment effect as a bias induced by heterogeneity and trial duration is illustrated by an oncology trial that investigated adjuvant chemotherapy in stage 1B NSCLC.
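The attenuation described above can be sketched numerically for the gamma case, where the population-averaged hazard ratio has a closed form: with conditional hazard Z·λ·exp(βx) and a gamma frailty Z of mean 1 and variance θ, the observed hazard ratio at time t is exp(β)·(1 + θΛ₀(t)) / (1 + θ·exp(β)·Λ₀(t)). The baseline rate, frailty variance, and treatment effect below are illustrative assumptions, not values from the trial discussed in the abstract.

```python
import math

def observed_hr(t, beta=math.log(0.5), lam=0.1, theta=1.0):
    """Population-averaged hazard ratio at time t when a gamma frailty
    (mean 1, variance theta) is omitted from the Cox model.
    Conditional model: hazard(t | Z, x) = Z * lam * exp(beta * x)."""
    L0 = lam * t                      # cumulative baseline hazard
    hr = math.exp(beta)               # true conditional hazard ratio
    return hr * (1 + theta * L0) / (1 + theta * hr * L0)

# The attenuation toward 1 grows with follow-up time:
for t in (1, 5, 20, 100):
    print(t, round(observed_hr(t), 3))
```

With these illustrative values the observed hazard ratio drifts from the true 0.5 toward 1 as follow-up lengthens, mirroring the duration effect the abstract describes.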

Evaluation of diagnostic performance is typically based on the receiver operating characteristic (ROC) curve and the area under the curve (AUC) as its summary index. The partial area under the curve (pAUC) is an alternative index focusing on the range of practical/clinical relevance. One of the problems preventing more frequent use of the pAUC is the perceived loss of efficiency in cases of noncrossing ROC curves. In this paper, we investigated statistical properties of comparisons of two correlated pAUCs. We demonstrated that outside of the classic model there are practically reasonable ROC types for which comparisons of noncrossing concave curves are more powerful when based on a part of the curve rather than the entire curve. We argue that this phenomenon stems in part from the exclusion of noninformative parts of the ROC curves that resemble straight lines. We conducted extensive simulation studies in families of binormal, straight-line, and bigamma ROC curves. We demonstrated that comparison of pAUCs is statistically more powerful than comparison of full AUCs when ROC curves are close to a “straight line”. For less flat binormal ROC curves, an increase in the integration range often leads to a disproportional increase in the pAUCs' difference, thereby contributing to an increase in statistical power. Thus, the efficiency of differences in pAUCs of noncrossing ROC curves depends on the shape of the curves, and for families of ROC curves that are nearly straight-line shaped, such as bigamma ROC curves, there are multiple practical scenarios in which comparisons of pAUCs are preferable.
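As a minimal sketch of the quantities being compared, the binormal ROC curve ROC(t) = Φ(a + b·Φ⁻¹(t)) and its partial area can be evaluated with the standard library; the parameter values and integration range are illustrative, not taken from the paper's simulation study.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def binormal_roc(t, a, b):
    """True-positive rate at false-positive rate t for a binormal ROC."""
    return N.cdf(a + b * N.inv_cdf(t))

def pauc(a, b, t0, steps=20_000):
    """Partial AUC over the FPR range (0, t0], midpoint rule."""
    h = t0 / steps
    return h * sum(binormal_roc((i + 0.5) * h, a, b) for i in range(steps))

# For a = b = 1 the full AUC has the closed form Phi(a / sqrt(1 + b^2)),
# which the numeric pAUC over (0, 1] should reproduce.
full_auc = N.cdf(1.0 / (1 + 1.0 ** 2) ** 0.5)
print(round(pauc(1.0, 1.0, 1.0), 4), round(full_auc, 4))
print(round(pauc(1.0, 1.0, 0.2), 4))  # restricted, clinically relevant range
```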

Discrimination statistics describe the ability of a survival model to assign higher risks to individuals who experience earlier events: examples are Harrell's C-index and Royston and Sauerbrei's D, which we call the D-index. Prognostic covariates whose distributions are controlled by the study design (e.g. age and sex) influence discrimination and can make it difficult to compare model discrimination between studies. Although covariate adjustment is a standard procedure for quantifying disease-risk factor associations, there are no covariate adjustment methods for discrimination statistics in censored survival data.
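For concreteness, the (unadjusted) Harrell's C-index counts concordant pairs among pairs that are usable under censoring; a minimal sketch with made-up data:

```python
def harrell_c(time, event, risk):
    """Harrell's C-index: among usable pairs, the fraction where the
    subject with the earlier observed event time has the higher risk
    score. A pair is usable only if the smaller time is an event time
    (tied times are skipped in this simple sketch)."""
    conc = ties = usable = 0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = (i, j) if time[i] < time[j] else (j, i)  # a = earlier
            if not event[a] or time[a] == time[b]:
                continue  # earlier subject censored: pair not usable
            usable += 1
            if risk[a] > risk[b]:
                conc += 1
            elif risk[a] == risk[b]:
                ties += 1
    return (conc + 0.5 * ties) / usable

time  = [2, 4, 5, 7, 9]
event = [1, 1, 0, 1, 0]
risk  = [0.9, 0.7, 0.6, 0.4, 0.1]
print(harrell_c(time, event, risk))  # perfectly concordant -> 1.0
```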

To develop extensions of the C-index and D-index that describe the prognostic ability of a model adjusted for one or more covariate(s).

We define a covariate-adjusted C-index and D-index for censored survival data, propose several estimators, and investigate their performance in simulation studies and in data from a large individual participant data meta-analysis, the Emerging Risk Factors Collaboration.

The proposed methods perform well in simulations. In the Emerging Risk Factors Collaboration data, the age-adjusted C-index and D-index were substantially smaller than unadjusted values. The study-specific standard deviation of baseline age was strongly associated with the unadjusted C-index and D-index but not significantly associated with the age-adjusted indices.

The proposed estimators improve meta-analysis comparisons, are easy to implement and give a more meaningful clinical interpretation.

Recurrent event data arise in longitudinal follow-up studies, where each subject may experience the same type of event repeatedly. The work in this article is motivated by data from a study of repeated peritonitis for patients on peritoneal dialysis. For medical and cost reasons, the peritonitis cases were classified into two types: Gram-positive and non-Gram-positive peritonitis. Further, since death and hemodialysis therapy preclude the occurrence of further recurrent events, we face multivariate recurrent event data with a dependent terminal event. We propose a flexible marginal model with three characteristics: first, we assume marginal proportional hazards and proportional rates models for the terminal event time and the recurrent event processes, respectively; second, the inter-recurrence dependence and the correlation between the multivariate recurrent event processes and the terminal event time are modeled through three multiplicative frailties corresponding to the specified marginal models; third, the rate model with frailties for recurrent events is specified only on the time before the terminal event. We propose a two-stage estimation procedure for estimating the unknown parameters and establish the consistency of the two-stage estimator. Simulation studies show that the proposed approach is appropriate for practical use. The methodology is applied to the peritonitis cohort data that motivated this study.

This paper presents a collection of dissimilarity measures to describe and then classify spatial point patterns when multiple replicates of different types are available for analysis. In particular, we consider a range of distances including the spike-time distance and its variants, as well as cluster-based distances and dissimilarity measures based on classical statistical summaries of point patterns. We review and explore, in the form of a tutorial, their uses and their pros and cons. These distances are then used to summarize and describe collections of repeated realizations of point patterns via prototypes and multidimensional scaling. We also present a simulation study evaluating the performance of multidimensional scaling with two types of selected distances. Finally, a multivariate spatial point pattern of a natural plant community is analyzed through several of these measures of dissimilarity.

In many areas of science where empirical data are analyzed, a common task is to identify important variables that influence an outcome. Most often this is done with a variable selection strategy in the context of a multivariable regression model. Using a study on ozone effects in children (*n* = 496, 24 covariates), we discuss aspects relevant for deriving a suitable model. With an emphasis on model stability, we explore and illustrate differences between predictive models and explanatory models, the key role of stopping criteria, and the value of bootstrap resampling (with and without replacement). Bootstrap resampling is used to assess variable selection stability, to derive a predictor that incorporates model uncertainty, to check for influential points, and to visualize the variable selection process. For the latter two tasks we adapt and extend recent approaches, such as stability paths, to serve our purposes. Based on earlier experiences and on results from the example, we argue for simpler models and note that predictions are usually very similar irrespective of the selection method used. Important differences exist for the corresponding variances, however, and the model uncertainty concept helps to protect against serious underestimation of the variance of a predictor derived data-dependently. Results of stability investigations illustrate severe difficulties in the task of deriving a suitable explanatory model. It seems possible to identify a small number of variables with an important, and probably true, influence on the outcome, but too often several variables are included whose selection may be a result of chance or may depend on a small number of observations.

In recent years, the evaluation of healthcare provider performance has become standard for governments, insurance companies, and other stakeholders. Often, performance is compared across providers using indicators in a single time period, for example a year. However, it is often important to assess changes in the performance of individual providers over time. Such analyses can be used to determine whether any providers show significant improvements, deteriorations, unusual patterns, or systematic changes in performance. Studies that monitor healthcare provider performance in this way have to date typically been limited to comparing performance in the most recent period with performance in a previous period. It is also important to take a longer-term view of performance and assess changes over more than two periods. In this paper, we develop test statistics that account for variable numbers of prior performance indicators, and show that these are particularly useful for assessing consecutive improvements or deteriorations in performance. We apply the tests to coronary artery bypass graft mortality rates in New York State hospitals, and to mortality data from Australian and New Zealand intensive care units. Although our applications are to medical data, the new tests have broad application in other areas.

We develop an asymptotic likelihood ratio test for multivariate lognormal data with a point mass at zero in each dimension. The test generalizes Wilks' lambda and the Hotelling *T*-test to the case of semicontinuous data. Simulations show that the resulting test statistic attains the nominal Type I error rate and has good power for reasonable alternatives. We conclude with an application to the exploration of ecological niches of trees in South Africa.

Marginal structural models (MSMs) have been proposed for estimating a treatment's effect in the presence of time-dependent confounding. We aimed to evaluate the performance of the Cox MSM in the presence of missing data and to explore methods to adjust for missingness. We simulated data with a continuous time-dependent confounder and a binary treatment. We explored two classes of missing data: (i) missed visits, which resemble clinical cohort studies; (ii) missing confounder values, which correspond to interval cohort studies. Missing data were generated under various mechanisms. In the first class, the source of the bias was the extreme treatment weights. Truncation or normalization improved estimation; therefore, particular attention must be paid to the distribution of the weights, and truncation or normalization should be applied if extreme weights are noticed. In the second class, bias was due to misspecification of the treatment model. Last observation carried forward (LOCF), multiple imputation (MI), and inverse probability of missingness weighting (IPMW) were used to correct for the missingness. We found that the alternatives, especially the IPMW method, perform better than the classic LOCF method. Nevertheless, in situations with high marker variance and rarely recorded measurements, none of the examined methods adequately corrected the bias.
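The two weight fixes mentioned above, truncation and normalization, can be sketched as follows; the percentile cutoffs and the simulated heavy-tailed weights are illustrative choices, not those used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncate_weights(w, lower=1, upper=99):
    """Truncate inverse-probability weights at the given percentiles,
    a common fix when a few extreme weights dominate the MSM fit."""
    lo, hi = np.percentile(w, [lower, upper])
    return np.clip(w, lo, hi)

def normalize_weights(w):
    """Rescale weights to mean 1, stabilizing the total weight mass."""
    w = np.asarray(w, dtype=float)
    return w / w.mean()

# Heavy-tailed weights, as produced by near-zero treatment probabilities:
w = np.exp(rng.normal(0.0, 2.0, size=1000))
wt = truncate_weights(w)
print(w.max(), wt.max())  # the extreme tail is capped
```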

In this paper, we introduce a new model for recurrent event data characterized by a fully parametric baseline rate function based on the exponential-Poisson distribution. The model arises from a latent competing risks scenario, in the sense that there is no information about which cause was responsible for the event occurrence; the time of each recurrence is then given by the minimum lifetime value among all latent causes. The classical homogeneous Poisson process is a particular case of the new model. The properties of the proposed model are discussed, including its hazard rate function, survival function, and ordinary moments. The inferential procedure is based on the maximum likelihood approach. We consider the important issue of model selection between the proposed model and its particular case via the likelihood ratio test and the score test. Goodness of fit of the recurrent event models is assessed using Cox-Snell residuals. A simulation study evaluates the performance of the estimation procedure for small and moderate sample sizes. Applications to two real data sets are provided to illustrate the proposed methodology. One of them, analyzed for the first time by our team of researchers, concerns the recurrence of malaria, an infectious disease caused by a protozoan parasite that infects red blood cells.

This paper presents an extension of the joint modeling strategy to the case of multiple longitudinal outcomes and repeated infections of different types over time, motivated by post-kidney-transplantation data. Our model comprises two parts linked by shared latent terms. The first part is a multivariate mixed linear model with random effects, into which a low-rank thin-plate spline function is incorporated to capture the nonlinear behavior of the different profiles over time. The second part is an infection-specific Cox model, where the dependence between the different types of infections and the related times of infection is modeled through a random effect associated with each infection type, to capture the *within*-type dependence, and a shared frailty parameter, to capture the dependence *between* infection types. We implemented the parameterization used in joint models, in which the fitted longitudinal measurements enter as time-dependent covariates in a relative risk model. Our proposed model was implemented in OpenBUGS using an MCMC approach.

Methods to examine whether genetic and/or environmental sources can account for the residual variation in ordinal family data usually assume proportional odds. However, standard software to fit the non-proportional odds model to ordinal family data is limited because the correlation structure of family data is more complex than for other types of clustered data. To perform these analyses we propose the non-proportional odds multivariate logistic regression model and take a simulation-based approach to model fitting using Markov chain Monte Carlo methods, such as partially collapsed Gibbs sampling and the Metropolis algorithm. We applied the proposed methodology to male pattern baldness data from the Victorian Family Heart Study.

The problem of variable selection in generalized linear mixed models (GLMMs) is pervasive in statistical practice. Many methodologies for determining the best subset of explanatory variables currently exist, differing with model complexity and application. In this paper, we develop a “higher posterior probability model with bootstrap” (HPMB) approach that selects explanatory variables without fitting all possible GLMMs involving a small or moderate number of explanatory variables. Furthermore, to save computational load, we propose an efficient approximation approach based on Laplace's method and Taylor expansion to approximate intractable integrals in GLMMs. Simulation studies and an application to HapMap data provide evidence that this selection approach is computationally feasible and reliable for exploring true candidate genes and gene–gene associations, after adjusting for complex structures among clusters.

New markers may improve prediction of diagnostic and prognostic outcomes. We aimed to review options for graphical display and summary measures to assess the predictive value of markers over standard, readily available predictors. We illustrated various approaches using previously published data on 3264 participants from the Framingham Heart Study, where 183 developed coronary heart disease (10-year risk 5.6%). We considered performance measures for the incremental value of adding HDL cholesterol to a prediction model. An initial assessment may consider statistical significance (HR = 0.65, 95% confidence interval 0.53 to 0.80; likelihood ratio *p* < 0.001), and distributions of predicted risks (densities or box plots) with various summary measures. A range of decision thresholds is considered in predictiveness and receiver operating characteristic curves, where the area under the curve (AUC) increased from 0.762 to 0.774 by adding HDL. We can furthermore focus on reclassification of participants with and without an event in a reclassification graph, with the continuous net reclassification improvement (NRI) as a summary measure. When we focus on one particular decision threshold, the changes in sensitivity and specificity are central. We propose a net reclassification risk graph, which allows us to focus on the number of reclassified persons and their event rates. Summary measures include the binary AUC, the two-category NRI, and decision analytic variants such as the net benefit (NB). Various graphs and summary measures can be used to assess the incremental predictive value of a marker. Important insights for impact on decision making are provided by a simple graph for the net reclassification risk.

A method for simultaneously assessing noninferiority with respect to efficacy and superiority with respect to another endpoint in two-arm noninferiority trials is presented. The procedure controls both the average type I error rate for the intersection-union test problem and the frequentist type I error rate for the noninferiority test by α while allowing an increased level for the superiority test. For normally distributed outcomes, two methods are presented to deal with the uncertainty about the correlation between the endpoints which defines the adjusted levels. The operating characteristics of these procedures are investigated. Furthermore, the sample size required when applying the proposed method is compared with that of alternative procedures. Application of the method in the situation of binary endpoints and mixed normal and binary endpoints, respectively, is sketched. An illustrative example is provided demonstrating implementation of the proposed approach in a clinical trial.

Survey data often contain measurements for variables that are semicontinuous in nature, i.e. they either take a single fixed value (we assume this is zero) or they have a continuous, often skewed, distribution on the positive real line. Standard methods for small area estimation (SAE) based on linear mixed models can be inefficient for such variables. We discuss SAE techniques for semicontinuous variables under a two-part random effects model that allows for the presence of excess zeros as well as the skewed nature of the nonzero values of the response variable. In particular, we first model the excess zeros via a generalized linear mixed model fitted to the probability of a nonzero, i.e. strictly positive, value being observed, and then model the response, given that it is strictly positive, using a linear mixed model fitted on the logarithmic scale. Empirical results suggest that the proposed method leads to efficient small area estimates for semicontinuous data of this type. We also propose a parametric bootstrap method to estimate the MSE of the proposed small area estimator. These bootstrap estimates of the MSE are compared to the true MSE in a simulation study.
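As a small sketch of the back-transformation implied by such a two-part model: the mean of the semicontinuous variable combines the fitted probability of a positive value with the lognormal mean of the positive part, exp(μ + σ²/2). The numbers below are illustrative, not from the paper.

```python
import math

def two_part_mean(p_positive, mu_log, sigma_log):
    """Mean of a semicontinuous variable under a two-part model:
    a point mass at zero with probability 1 - p_positive, and a
    lognormal(mu_log, sigma_log^2) law for the positive part."""
    return p_positive * math.exp(mu_log + 0.5 * sigma_log ** 2)

# e.g. 60% nonzero, log-scale mean 2.0 and log-scale SD 0.5
print(round(two_part_mean(0.6, 2.0, 0.5), 3))
```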

In recent months one of the most controversially discussed topics among regulatory agencies, the pharmaceutical industry, journal editors, and academia has been the sharing of patient-level clinical trial data. Several projects have been started, such as the European Medicines Agency's (EMA) “proactive publication of clinical trial data”, the *BMJ* open data campaign, and the AllTrials initiative. The executive director of the EMA, Dr. Guido Rasi, has recently announced that clinical trial data will be published at the patient level from 2014 onwards (although this has since been delayed). The EMA draft policy on proactive access to clinical trial data was published at the end of June 2013 and was open for public consultation until the end of September 2013. These initiatives will change the landscape of drug development and the publication of medical research. They provide unprecedented opportunities for research and research synthesis, but pose new challenges for regulatory authorities, sponsors, scientific journals, and the public. Besides these general aspects, data sharing also entails intricate biostatistical questions, such as problems of multiplicity. An important issue in this respect is the interpretation of multiple statistical analyses, both prospective and retrospective. Expertise in biostatistics is needed to assess the interpretation of such multiple analyses, for example, in the context of regulatory decision-making, by optimizing procedural guidance and sophisticated analysis methods.

Risk assessment studies where human, animal or ecological data are used to set safe low dose levels of a toxic agent are challenging as study information is limited to high dose levels of the agent. Simultaneous hyperbolic confidence bands for low-dose risk estimation with quantal data have been proposed in the literature. In this paper, a new method using three-segment confidence bands to construct simultaneous upper confidence limits on extra risks and simultaneous lower bounds on the benchmark dose for quantal data is proposed. The proposed method is illustrated with a real data application and simulation studies.

Given a sample of independent observations from an unknown continuous distribution function *F*, the problem of constructing a confidence band for *F* is considered, which is a fundamental problem in statistical inference. A confidence band at a given confidence level provides simultaneous inferences on all quantiles and on all cumulative probabilities of the distribution at that level, so such bands are among the most important inference procedures that address the issue of multiplicity. A fully nonparametric approach is taken where no assumptions are made about the distribution function *F*. Historical approaches to this problem, such as Kolmogorov's famous procedure, represent some of the earliest inference methodologies that address multiplicity. In this paper it is shown how recursive methodologies can be employed to construct both one-sided and two-sided confidence bands of various types. The first approach operates by putting bounds on the cumulative probabilities at the data points, and a recursive integration approach is described. The second approach operates by providing bounds on certain specified quantiles of the distribution, and its implementation using recursive summations of multinomial probabilities is described. These recursive methodologies are illustrated with examples, and R code is available for their implementation.
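A minimal illustration of the simplest such band (the Dvoretzky–Kiefer–Wolfowitz band, not the recursive constructions of the paper): the empirical distribution function plus or minus a single margin gives a two-sided (1 − α) confidence band for *F*.

```python
import math

def dkw_band(sample, alpha=0.05):
    """Two-sided (1 - alpha) confidence band for F via the
    Dvoretzky-Kiefer-Wolfowitz inequality: ECDF +/- eps, with
    eps = sqrt(log(2 / alpha) / (2 n)), clipped to [0, 1]."""
    x = sorted(sample)
    n = len(x)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    band = []
    for i, xi in enumerate(x, start=1):
        ecdf = i / n
        band.append((xi, max(ecdf - eps, 0.0), min(ecdf + eps, 1.0)))
    return band

band = dkw_band([0.3, 1.2, 2.7, 3.1, 4.8, 5.0, 6.6, 7.2, 8.9, 9.4])
for xi, lo, hi in band[:3]:
    print(xi, round(lo, 3), round(hi, 3))
```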

Normal probability plots are widely used as a statistical tool for assessing whether an observed simple random sample is drawn from a normally distributed population. The user, however, has to judge subjectively, if no objective rule is provided, whether the plotted points fall close to a straight line. In this paper, we focus on how a normal probability plot can be augmented by intervals for all the points so that, if the population distribution is normal, then all the points fall into the corresponding intervals simultaneously with a prespecified probability. These simultaneous probability intervals therefore provide an objective means of judging whether the plotted points fall close to the straight line: they do if and only if all the points fall into the corresponding intervals. The powers of several normal probability plot based (graphical) tests and of the most popular nongraphical Anderson-Darling and Shapiro-Wilk tests are compared by simulation. Based on this comparison, recommendations are given in Section 3 on which graphical tests should be used in which circumstances. An example is provided to illustrate the methods.
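One generic way to obtain such simultaneous intervals, shown here as a sketch rather than as the paper's construction, is Monte Carlo calibration: shrink a pointwise level until the intervals for all order statistics of a standard normal sample cover jointly with the desired probability.

```python
import numpy as np

rng = np.random.default_rng(1)

def simultaneous_intervals(n, alpha=0.05, B=20_000):
    """Monte Carlo calibration of simultaneous intervals for the order
    statistics of a standard normal sample of size n: bisect on the
    pointwise level beta until all n points fall inside their intervals
    with probability about 1 - alpha."""
    order = np.sort(rng.standard_normal((B, n)), axis=1)

    def coverage(beta):
        lo = np.quantile(order, beta / 2, axis=0)
        hi = np.quantile(order, 1 - beta / 2, axis=0)
        inside = ((order >= lo) & (order <= hi)).all(axis=1)
        return inside.mean(), lo, hi

    # Smaller beta -> wider intervals -> higher simultaneous coverage.
    lo_b, hi_b = 1e-5, alpha
    for _ in range(40):
        mid = 0.5 * (lo_b + hi_b)
        cov, _, _ = coverage(mid)
        if cov < 1 - alpha:
            hi_b = mid
        else:
            lo_b = mid
    return coverage(lo_b)  # coverage at (or just above) the target

cov, lo, hi = simultaneous_intervals(20)
print(round(cov, 3))  # close to 0.95
```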

This paper focuses on the concept of optimizing a multiple testing procedure (MTP) with respect to a predefined utility function. The class of Bonferroni-based closed testing procedures, which includes, for example, (weighted) Holm, fallback, gatekeeping, and recycling/graphical procedures, is used in this context. Numerical algorithms for calculating the expected utility for some MTPs in this class are given. The resulting optimal procedures, as well as the gain from performing an optimization, are then examined in a few informative examples.
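As a concrete member of the Bonferroni-based closed testing class mentioned above, the Holm procedure is the shortcut obtained when every intersection hypothesis is tested with an (unweighted) Bonferroni test:

```python
def holm(pvals, alpha=0.05):
    """Holm step-down procedure: the closed-testing shortcut in which
    each intersection hypothesis gets a Bonferroni test. Returns a
    rejection flag for each original hypothesis."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

print(holm([0.001, 0.02, 0.04, 0.30]))  # [True, False, False, False]
```

Only the smallest *p*-value clears its threshold 0.05/4 here; the second fails 0.05/3, so the step-down procedure stops.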

If the response to treatment depends on genetic biomarkers, it is important to identify predictive biomarkers that define (sub-)populations where the treatment has a positive benefit-risk balance. One approach to determining relevant subpopulations is subgroup analysis, where the treatment effect is estimated in biomarker-positive and biomarker-negative groups. Subgroup analyses are challenging because several types of risk are associated with inference on subgroups. On the one hand, by disregarding a relevant subpopulation, a treatment option may be missed owing to a dilution of the treatment effect in the full population. Furthermore, even if the diluted treatment effect can be demonstrated in an overall population, it is not ethical to treat patients who do not benefit from the treatment when they can be identified in advance. On the other hand, selecting a spurious subpopulation increases the risk of restricting an efficacious treatment to too narrow a fraction of the potentially benefiting population. We propose to quantify these risks with utility functions and investigate nonadaptive study designs that allow for inference on subgroups using multiple testing procedures, as well as adaptive designs, where subgroups may be selected in an interim analysis. The characteristics of such adaptive and nonadaptive designs are compared for a range of scenarios.

Graphical approaches have been proposed in the literature for testing hypotheses on multiple endpoints by recycling significance levels from rejected hypotheses to unrejected ones. Recently, they have been extended to group sequential procedures (GSPs). Our focus in this paper is on the allocation of recycled significance levels from rejected hypotheses to the stages of the GSPs for unrejected hypotheses. We propose a delayed recycling method that allocates the recycled significance level from Stage *r* onward, where *r* is prespecified. We show that *r* cannot be chosen adaptively to coincide with the random stage at which the hypothesis from which the significance level is recycled is rejected; such an adaptive GSP does not always control the familywise error rate (FWER). One can choose *r* to minimize the expected sample size for a given power requirement, and we illustrate how a simulation approach can be used for this purpose. Several examples, including a clinical trial example, are given to illustrate the proposed procedure.

In the field of multiple comparison procedures, adjusted *p*-values are an important tool to evaluate the significance of a test statistic while taking the multiplicity into account. In this paper, we introduce adjusted *p*-values for the recently proposed Sequential Goodness-of-Fit (SGoF) multiple test procedure by letting the level of the test vary on the unit interval. This extends previous research on the SGoF method, which is a method of high interest when one aims to increase the statistical power in a multiple testing scenario. The adjusted *p*-value is the smallest level at which the SGoF procedure would still reject the given null hypothesis, while controlling for the multiplicity of tests. The main properties of the adjusted *p*-values are investigated. In particular, we show that they are a subset of the original *p*-values, being equal to 1 for *p*-values above a certain threshold. These are very useful properties from a numerical viewpoint, since they allow for a simplified method to compute the adjusted *p*-values. We introduce a modification of the SGoF method, termed majorant version, which rejects the null hypotheses with adjusted *p*-values below the level. This modification rejects more null hypotheses as the level increases, something which is not in general the case for the original SGoF. Adjusted *p*-values for the conservative version of the SGoF procedure, which estimates the variance without assuming that all the null hypotheses are true, are also included. The situation with ties among the *p*-values is discussed too. Several real data applications are investigated to illustrate the practical usage of adjusted *p*-values, ranging from a small to a large number of tests.

We present a novel multiple testing method for testing null hypotheses that are structured in a directed acyclic graph (DAG). The method is a top-down method that strongly controls the familywise error rate and can be seen as a generalization of Meinshausen's procedure for tree-structured hypotheses. Like Meinshausen's procedure, our proposed method can be used to test for variable importance, but the corresponding variable clusters can be chosen more freely, because the method allows for multiple parent nodes and partially overlapping hypotheses. An important application of our method is gene set analysis, in which one often wants to test multiple gene sets as well as individual genes for their association with a clinical outcome. By considering the genes and gene sets as nodes in a DAG, our method enables us to test for significant gene sets as well as for significant individual genes within the same multiple testing procedure. The method is illustrated by testing Gene Ontology terms for evidence of differential expression in a survival setting and is implemented in the R package cherry.

In many applications, researchers are interested in making *q* pairwise comparisons among *k* test groups on the basis of *m* outcome variables, where *m* is often very large. For example, such situations arise in gene expression microarray studies involving several experimental groups. Researchers are often not only interested in identifying differentially expressed genes between a given pair of experimental groups, but also in making directional inferences, such as whether a gene is up- or downregulated in one treatment group relative to another. In such situations, in addition to the usual errors such as false positives (Type I errors) and false negatives (Type II errors), one may commit directional errors (Type III errors). For example, in a dose-response microarray study, a gene may be declared to be upregulated in the high dose group compared to the low dose group when it is not. In this paper, we introduce a mixed directional false discovery rate (mdFDR) controlling procedure that uses weighted *p*-values to select positives in different directions. The weights are defined as the inverse of two times the proportion of either positive or negative discoveries. The proposed procedure is proved mathematically to control the mdFDR at level α and to have greater power (defined as the expected proportion of rejected nontrue null hypotheses) than the GSP10 procedure proposed by Guo et al. (2010). Simulation studies and a real data analysis are also conducted to show that the proposed procedure outperforms the GSP10 procedure.

The higher criticism (HC) statistic, which can be seen as a normalized version of the famous Kolmogorov–Smirnov statistic, has a long history, dating back to the mid-seventies. Originally, HC statistics were used in connection with goodness of fit (GOF) tests, but they have recently gained attention in the context of testing the global null hypothesis in high-dimensional data. The continuing interest in HC seems to be inspired by a series of nice asymptotic properties related to this statistic. For example, unlike Kolmogorov–Smirnov tests, GOF tests based on the HC statistic are known to be asymptotically sensitive in the moderate tails, so HC is favorably applied for detecting the presence of signals in sparse mixture models. However, some questions around the asymptotic behavior of the HC statistic are still open. We focus on two of them, namely, why a specific intermediate range is crucial for GOF tests based on the HC statistic and why the convergence of the HC distribution to the limiting one is extremely slow. Moreover, the discrepancy between the asymptotic and the finite-sample behavior of the HC statistic prompts us to provide a new HC test that has better finite-sample properties than the original HC test while showing the same asymptotics. This test is motivated by the asymptotic behavior of the so-called local levels related to the original HC test. By means of numerical calculations and simulations, we show that the new HC test is typically more powerful than the original HC test in normal mixture models.
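A minimal sketch of the HC statistic itself: the standardized excess of small *p*-values, maximized over an intermediate range of the sorted *p*-values (the range fraction below is an illustrative choice, not the paper's).

```python
import math

def higher_criticism(pvals, frac=0.5):
    """Higher criticism statistic: for sorted p-values p_(1) <= ... <= p_(n),
    maximize sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))) over
    i <= frac * n (restricting the range avoids edge instability)."""
    p = sorted(pvals)
    n = len(p)
    hc = float("-inf")
    for i in range(1, int(frac * n) + 1):
        pi = p[i - 1]
        if pi <= 0.0 or pi >= 1.0:
            continue  # skip degenerate p-values
        z = math.sqrt(n) * (i / n - pi) / math.sqrt(pi * (1 - pi))
        hc = max(hc, z)
    return hc

null_p = [i / 100 for i in range(1, 101)]  # perfectly uniform grid
print(higher_criticism(null_p))            # 0.0: no excess of small p-values
spiked = [1e-4, 2e-4] + null_p[2:]         # two sparse signals injected
print(round(higher_criticism(spiked), 1))
```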
