Disclosure: David B. Allison has received book royalties, grants, consulting fees, and donations from multiple profit and nonprofit entities with interests in obesity, including pharmaceutical companies which compete with the manufacturers of sibutramine and from the manufacturers of sibutramine. The remaining authors declare no conflict of interest.
Funding agencies: This research was sponsored by NIH grants T32HL079888, T32HL072757, P30DK056336, and R01DK076771.
We demonstrate the utility of parametric survival analysis. The analysis of longevity as a function of risk factors such as body mass index (BMI; kg/m2), activity levels, and dietary factors is a mainstay of obesity research. Modeling survival through hazard functions, relative risks, or odds of dying with methods such as Cox proportional hazards or logistic regression are the most common approaches and have many advantages. However, they also have disadvantages in terms of the ease of interpretability, especially for non-statisticians; the need for additional data to convert parameter estimates to estimates of years of life lost (YLL); debates about the appropriate time scale in the model; and an inability to estimate median survival time when the censoring rate is too high.
Design and Methods:
We will conduct parametric survival analyses with multiple distributions, including distributions that are known to be poor fits (Gaussian), as well as a newly discovered “Compressed Gaussian”'' distribution.
Parametric survival analysis models were able to accurately estimate median survival times in a population-based data set of 15,703 individuals, even for distributions that were not good fits and the censoring rate was high, due to the central limit theorem.
Parametric survival models are able to provide more direct answers, and in our analysis of an obesity-related data set, gave consistent YLL estimates regardless of the distribution used. We recommend increased consideration of parametric survival models in chronic disease and risk factor epidemiology.
The associations or effects of chronic disease risk factors such as body mass index (BMI, kg/m2), serum cholesterol, or blood pressure on health and lifespan are of great interest and importance. Interested parties include litigators trying wrongful death cases and determining appropriate settlements, demographers estimating population trends and planning accordingly, insurers setting premium rates, public health officials advising the public, policy makers determining priorities, clinicians advising their patients, and the general public. The analysis of such data is made challenging by several factors, perhaps most notably that not all individuals will be observed until their time of death, leading to censoring in the survival times.
Cox proportional hazards regression is the most common way to accommodate censoring (1). The many advantages of this approach are well-documented and described elsewhere (2). However, there are at least three major disadvantages. The first involves the inability to estimate medians or other quantiles of survival time when the censoring rate exceeds the quantile of interest. The second concerns interpretability. The primary output of such an analysis is a hazard ratio, which requires understanding of calculus to interpret, is not easily understood by many nonstatisticians, and is not expressed in units such as years of life lost (YLL) that are part of everyday parlance and well understood. Third, the proportionality assumption may not hold.
In practice, large epidemiological data sets such as the National Health and Nutrition Examination Surveys (NHANES), the National Health Interview Surveys (NHIS), Atherosclerosis Risk in Communities (ARIC), and the Framingham Heart Study are often analyzed at follow-up times which have censoring rates well above 50%; hence, the median survival times for subjects may be estimable only for subjects at greatly increased risk.
Moreover, additional data beyond those necessary for the initial analysis are required in some approaches (3) to convert hazard ratio estimates into expected survival times or YLL. YLL is defined here as the difference in conditional expectations of survival time between individuals who differ only in their level of risk factors.
Fontaine et al. (3) developed a method for converting hazard ratio estimates obtained with Cox proportional hazards regression to estimates of YLL that can be used by clinicians, the general public, and those interested in understanding the effects of factors such as high BMI on relevant aspects of lifespan. Such an approach, while useful, is cumbersome to implement and required three different data sets (one to estimate the hazard ratios, one to estimate distributions of the risk factor in the general population, and one to estimate overall survival distributions in the general population). Furthermore, there was no readily accessible solution for obtaining a confidence interval for YLL estimates. Hence, a method which more directly yielded estimates expressed in terms of years of survival time would be more desirable.
Fully parametric models offer an alternative to Cox regression that can provide direct estimates of YLL even in the presence of high censoring rates. However, new problems emerge: namely, which distribution should be used? Human longevity is characterized by (a) bimodality, including peaks at infancy and old age; (b) strong left skew. If investigators seek a good fit over all ages, then they may need to address the bimodality challenge by considering complex mixture models such as the five-parameter Siler model or the eight-parameter Heligman–Pollard and Mode–Busby models. However, models with fewer parameters are better for interpretability and reproducibility (4). In contrast, when the outcome of interest is YLL, life expectancy, or median survival times, then the fit of the tails may not matter greatly.
For instance, in a typical epidemiological study that estimates the effects of obesity or other metabolic risk factors on morbidity or mortality, the patients are adults, obviating the need for the distribution to accurately estimate mortality among the very young or very old. The central limit theorem guarantees that a normal parametric model (given a sufficient sample size) will accurately estimate the mean, even if the outcome is not normally distributed.
Unfortunately, the normal distribution, while easy to interpret, does not very effectively address the strong left skew challenge. The mean may differ significantly from the median, and other quantiles will be inaccurately predicted. Closer approximations can be obtained through extreme value distributions such as Weibull or Gompertz. By convention, the Gompertz distribution is typically used to model all-cause mortality, whereas the Weibull distribution is used for specific causes of death (4), but a more effective solution would be desirable.
Recently, Robertson and Allison (11) introduced the compressed normal distribution, which was especially designed to accommodate features of the observed distribution of human lifespan after the period of high mortality rate in early childhood. This distribution expanded upon the findings of Kannisto (12), who observed that the distribution of longevity conditioned on survival to the modal age closely resembled the behavior of a normal distribution. Kannisto also noted that the standard deviation of remaining lifespan conditional upon having survived until age x seemed to decrease more rapidly as a function of x than would occur were total lifespan were normally distributed.
The normal distribution is characterized by the location-scale transformation:
Robertson and Allison (11) derived a distribution where a compression of the standard deviation occurs with advancing age, by modifying the location-scale transformation:
where λ is an upper bound of longevity and (1 – x/λ) is the unspent portion of longevity at age x. The denominator decreases as x increases. In effect, it conditions the scale on attained age, and models the increasing homogeneity of survivors as they age. The compressed normal distribution was found to model life table data more accurately than other three-parameter distributions, including the Makeham–Gompertz, generalized extreme value, generalized gamma, and the Azzalini skew-normal distributions.
To demonstrate the advantages of parametric survival analysis, we fit models of different distributions to a large epidemiologic data set with a high censoring rate. We also demonstrate the uses of multiparameter optimization, which is not currently a common practice in survival analysis.
Materials and Methods
When age is the outcome of interest, and a study enrolls participant i who was alive as of age ai, the model should incorporate left truncation to reflect the conditional probability of surviving to age ai at the beginning of the study (6). Doing so makes the proper adjustments for older participants who have higher life expectancies (Figure 1). The likelihood equation is then:
where ei is the age of participant i upon exiting the study, whether alive or dead; f is the density function; θ is the vector of distribution parameters; and Π is the product of a sequence.
In the process of writing this article, we identified a shortage of available software that is able to fit parametric survival models for left-truncated, right-censored data. As of this writing, parametric survival analysis in SAS is performed via PROC LIFEREG, but does not allow specification of age at entry (7). SPSS is not able to fit parametric survival models (8). In R, procedure phreg in package “eha” is theoretically able to do the above, but did not return plausible results (9). In STATA, procedure streg is able to do the above, but is not able to fit the normal or logistic distributions (10). Additionally, there appeared to be inconsistencies in the way log-likelihood scores are tabulated: some software dropped constant terms from the equations (such as for the normal distribution), whereas others kept them.
As equation (3) is a straightforward optimization problem, we decided to write our own software to maximize the likelihood and solve for the parameters. The software was written in R, and made use of procedure optim. We specified the conjugate gradients method with gradient functions. The programs were short (a few dozen lines per distribution), and the model calculations only took a few seconds on modern desktop computers. All constant terms were preserved. We fit the Gompertz, Weibull, logistic, normal, and compressed normal distributions. We verified the consistency of estimated parameters with STATA software for the Gompertz and Weibull distributions.
For the purposes of illustrating parametric models and gauging their real-world utility, we selected a recent population-based study with a simple data structure that did not involve complex sampling, as with NHANES (13) or the National Health Interview Survey (14). A large data set including measured BMI values, smoking status, and age at follow-up was obtained from the Atherosclerosis Risk in Communities (ARIC) study (15), begun in 1987. These data are characterized by a high censoring rate (85.0%), such that the Cox model could not estimate median survival times. All ARIC participants were African-American or European-American, male, or female.
The variables fitted were smoking and BMI. Smoking was coded with indicator variables for current and former smokers. BMI was fitted as a cubic polynomial in keeping with conventions (3,16). For ease of interpretation, BMI variables were also centered and scaled as (BMI-25)/10, such that the “intercept” terms corresponded to a BMI of 25. We also tested for interactions between BMI and smoking, sex, and race. We checked that the interaction terms yielded results consistent with stratified analyses by race, sex, or smoking status. Also, we validated our findings with Cox models.
A total of 15,703 participants had known values for BMI, smoking status, and age at follow-up (Table 1). Fifteen percent of the participants had deaths observed over the course of the study, whereas 85% of the observations were censored. The participants in the study were from a relatively narrow age range of 45-61 years at baseline; the subjects were no older than 81 years at the end of the study. ARIC exemplifies the characteristics of many population-based studies, which have limited age ranges and high censoring rates. Nevertheless, the large sample sizes yielded estimates that were consistent with the previous population-based studies.
Table 1. Descriptive statistics for ARIC study
Age at exit
Age at exit
Age at exit
Age at exit
Age at exit
All five distributions yielded similar log-likelihood scores and gave similar estimates of longevity (Figure 2). This phenomenon occurred owing to the limited age range of the patients in the study, which limited the information on the tails of the distribution. All models yielded similar results for the effects of predictor variables; smoking was associated with reduced longevity while BMI exhibited a J-curve pattern. The compressed normal distribution (in the solid black line) gave slightly lower estimates owing to its thicker left tail, which was found elsewhere to follow the distribution of life table longevities more closely than other distributions11. The J-curve was more pronounced for smokers, consistent with some previous studies (Figure 3) (17,18).
We found that YLL estimates were similar whether we defined them in terms of means or medians (Figure 4); again, this was consistent with the limited age range of patients in the study. We also verified our findings by fitting the Cox model (Figure 5); the estimated hazards are a mirror image of Figure 4.
As some past research has found (3), African-Americans had higher optimal BMIs than did whites and the difference was statistically significant (Table 2). White American nonsmokers had an optimal BMI of approximately 20, whereas African-American nonsmokers had an optimal BMI of 25. Smokers fared relatively better in the overweight (25-30) range of BMI. The shifting of the peak may reflect the greater prevalence of chronic diseases among smokers and African-Americans, such that a lower BMI was more likely to be a result of disease rather than good lifestyle; we will explore this topic further in future articles.
Finally, we illustrate one more benefit of parametric survival analysis (Figure 6). Life expectancy changes conditioned on attained age, as E(Y) ≠ E(Y|Y>y). By making use of conditional expectations, one can compute remaining life expectancy for a patient at a given age. As baseline mortality rises with advancing age, the effect of risk factors on life expectancy decreases; this is apparent in the converging lines. This is a natural consequence of mathematics: among young people whose baseline mortality is low, a small change in the hazard rate causes a large increase or decrease in life expectancy. But among older people whose baseline mortality is high, the same change makes little difference in life expectancy. This phenomenon is consistent with repeated observations in the literature where BMI and smoking have a less deleterious effect on life expectancy among older patients.
Advantages of parametric models
Parametric modeling yielded results that were sensible and consistent in shape with those observed when modeling hazard ratios3. We contrast the life expectancy estimates in Figure 4 to the Cox estimates in Figure 5. The Cox hazard ratios appear upside down to the life expectancy curves in Figure 4, with no direct formula for converting from hazard ratios to life expectancy. In particular, the normal distribution offers greater ease of interpretability when the data set consists of subjects near the modal age of longevity and the outcome is YLL, median longevity, or life expectancy. These are marked advantages in the context of communicating with demographers, litigators, public health officials, clinicians, and the general public and we therefore advocate that in chronic disease and risk factor epidemiology parametric models be considered as a primary analytic approach.
When greater precision is desired or the investigator is interested in quantiles other than the middle, we advocate the use of left truncation and possibly optimization of multiple parameters. There appears to be a gap in the capabilities of commonly available software, and we have written software that addresses this issue.
Additionally, extending this method to accommodate complex sampling, as is used in the NHANES series would be valuable. A Bayesian extension would also be useful for data sets with high-dimensional predictors, such as genomic data, where penalized regressions are needed.
We thank Sir David Cox, Raymond Carroll, and Kevin Fontaine for their editorial assistance. The opinions expressed are those of the authors and not necessarily those of the NIH or any other organization with which the authors are affiliated. This manuscript was prepared using ARIC research materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the ARIC of NHLBI.