## 1 Introduction

Many diseases put individuals at elevated risk for a multitude of adverse clinical events, and researchers routinely design randomized clinical trials to evaluate the effectiveness of experimental interventions for the prevention of these events. Trials in cardiology, for example, record times of events such as non-fatal myocardial infarction, non-fatal cardiac arrest, and cardiovascular death [1]. In cerebrovascular disease, patients with carotid stenosis can be treated with medical therapy or surgery, and trials evaluating their relative effectiveness may record endpoints such as strokes ipsilateral to the surgical site, contralateral strokes, and death [2]. In oncology, researchers often design trials to study treatment effects on disease progression and death [3], but palliative trials of patients with skeletal metastases may be directed at preventing skeletal complications including vertebral and non-vertebral fractures, bone pain, and the need for surgery to repair bone [4]. In these and many other settings, although interest lies in preventing each of the respective events, it is generally infeasible to conduct studies to answer questions about each component.

When one type of event is of greater clinical importance than others, it can be chosen as the basis of the primary treatment comparison, and effects on other types of events can then be assessed through secondary analyses. When two or more events are of comparable importance, co-primary endpoints can be specified, but tests of hypotheses must typically control the experiment-wise type I error rate through multiple comparison procedures [5-7], which complicate decision making. A seemingly simple alternative strategy is to adopt a so-called composite event [8, 9] that is said to have occurred if any one of a set of component events occurs. The time of the composite event is therefore the minimum of the times of all component events.
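The construction of the composite time under right censoring at the end of follow-up can be sketched as follows; this is a minimal illustration, and the function name and tuple convention are ours rather than standard notation:

```python
def composite_time(component_times, censor_time):
    """Return (observed time, event indicator) for the composite endpoint:
    the minimum of the component event times, right-censored at the
    end of follow-up."""
    t = min(component_times)
    return (min(t, censor_time), t <= censor_time)

# A subject whose earliest component event occurs at 3.2 years:
composite_time([5.0, 3.2, 7.1], censor_time=10.0)   # -> (3.2, True)
# A subject with no component event before the end of follow-up:
composite_time([12.0, 15.0], censor_time=10.0)      # -> (10.0, False)
```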

There are several additional reasons investigators may consider using composite endpoints in clinical trials. In studies involving a time-to-event analysis, a composite endpoint means that more events will be observed than would be observed for any particular component. If the same clinically important effect is specified for the composite endpoint as for one of its components, this increased event rate translates into greater power for tests of treatment effects; at the design stage, it permits a reduction in the required number of subjects or the duration of follow-up [9-11]. In practice, however, composite endpoints are routinely broadened through the inclusion of one or more less serious events, which presumably warrants revising the clinically important effect of interest. Moreover, we show later that for models featuring a high degree of structure, model assumptions may not even be compatible for the composite endpoint and one of its components.
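The sample size implication can be sketched with Schoenfeld's formula for the number of events needed to detect a given hazard ratio; the hazard ratio and event probabilities below are illustrative assumptions, not values from any of the cited trials:

```python
from math import log, ceil
from statistics import NormalDist

def required_subjects(hr, event_prob, alpha=0.05, power=0.8):
    """Schoenfeld's formula for the number of events needed to detect a
    log-hazard ratio (1:1 allocation), converted to a number of subjects
    given the probability that a subject experiences an event."""
    z = NormalDist()
    events = 4 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power))**2 / log(hr)**2
    return ceil(events / event_prob)

# Same hypothesized hazard ratio, but the composite endpoint occurs more often,
# so fewer subjects are required:
n_component = required_subjects(hr=0.75, event_prob=0.20)
n_composite = required_subjects(hr=0.75, event_prob=0.45)
```

The gain disappears, of course, if adding less serious components also changes the plausible size of the treatment effect on the composite.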

In time-to-event analyses, interest may lie in the effect of an experimental treatment versus standard care on the risk of a non-fatal event. This is a common framework in trials of patients with advanced diseases where interest lies in improving quality of life through the prevention of complications. In such settings, individuals are at considerable risk of death and a competing risks problem arises. Investigators often deal with this by adopting a composite endpoint based on the time to the minimum of the non-fatal event of interest and death [12, 13]. This strategy leads to an ‘event-free survival’ analysis that is particularly common in cancer, where progression-free survival is routinely adopted as a primary endpoint [14]. In palliative trials, however, a treatment may not be expected to have an effect on survival, and if a non-negligible proportion of individuals die before experiencing the clinical event of interest, this analysis can lead to a serious underestimation of the effect of the treatment [10, 15].
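The attenuation can be made explicit in a simple setting with independent exponential event times, where the minimum of the two times is again exponential with the summed rates; the rates below are assumed purely for illustration:

```python
# Assumed hazard rates (illustrative only): treatment halves the hazard of
# the non-fatal event but leaves the hazard of death unchanged.
lam_event_ctrl, lam_event_trt = 0.10, 0.05
lam_death = 0.10

# Hazard ratio for the non-fatal event alone:
hr_component = lam_event_trt / lam_event_ctrl                              # 0.50
# Hazard ratio for the composite 'event-free survival' endpoint
# (the minimum of independent exponentials has the summed rate):
hr_composite = (lam_event_trt + lam_death) / (lam_event_ctrl + lam_death)  # 0.75
```

The composite hazard ratio of 0.75 substantially understates the halving of the non-fatal event hazard, and the attenuation grows with the death rate.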

Recommendations are available in the literature on how to design trials, analyze resultant data, and report findings when composite endpoints are to be used [10-12, 16]. The main recommendations include that (i) individual components should have similar frequency of occurrence, (ii) the treatment should have a similar effect on all components, (iii) individual components should have similar importance to patients, (iv) data on all components should be collected until the end of the trial, and (v) individual components should be analyzed and reported separately as secondary endpoints. The first three recommendations have face validity and seem geared towards helping ensure that conclusions regarding treatment effects on the composite endpoint have some relation to treatment effects on the component endpoints, thus aiding the interpretation of results. The collection of data on the occurrence of the component endpoints until the end of the trial facilitates separate assessment of treatment effects on each component, so that the consistency of findings across components can be empirically assessed.

The aforementioned issues have been actively debated in the medical literature [11, 16-19], but there has been relatively little formal statistical investigation of these points. In this paper, we discuss statistical considerations related to composite endpoint analyses and use the recommendations to guide the investigation. Because the Cox regression model is routinely adopted for the analysis of composite endpoints in clinical trials [12], we consider it here and point out important issues regarding model specification and interpretation. We formulate multivariate failure time models with proportional hazards for the marginal distributions that may be used to reflect the settings where composite endpoints are most reasonable according to the current guidelines. We study the asymptotic and empirical properties of estimators arising from a composite endpoint analysis. We also explore the utility of marginal methods based on multivariate failure time data [20]. We argue that the belief that composite endpoints provide an overall measure of the effect of treatment is overly simplistic, and a thoughtful interpretation of intervention effects based on composite endpoints alone is difficult. Their use as a primary basis for treatment comparison in clinical trials therefore warrants careful consideration.

The remainder of this paper is organized as follows. In Section 2, we construct bivariate failure time distributions for which the marginal distributions have proportional hazards between two treatment groups. We then derive the distribution of the time to the first event and show that it does not typically feature proportional hazards across the two treatment groups. We use large sample theory for misspecified models to derive the limiting value of the log hazard ratio from a naive Cox model, and empirical studies demonstrate finite sample properties that are in close agreement with the theory. An alternative approach to synthesizing data over component events is to conduct a global analysis based on the marginal methods of Wei *et al.* [20]; we explore this in Section 3. An application to a recently completed asthma management study illustrates the various methods in Section 4, and concluding remarks are given in Section 5.