Modeling exposure–lag–response associations with distributed lag non-linear models

In biomedical research, a health effect is frequently associated with protracted exposures of varying intensity sustained in the past. The main complexity of modeling and interpreting such phenomena lies in the additional temporal dimension needed to express the association, as the risk depends on both intensity and timing of past exposures. This type of dependency is defined here as exposure–lag–response association. In this contribution, I illustrate a general statistical framework for such associations, established through the extension of distributed lag non-linear models, originally developed in time series analysis. This modeling class is based on the definition of a cross-basis, obtained by the combination of two functions to flexibly model linear or nonlinear exposure-responses and the lag structure of the relationship, respectively. The methodology is illustrated with an example application to cohort data and validated through a simulation study. This modeling framework generalizes to various study designs and regression models, and can be applied to study the health effects of protracted exposures to environmental factors, drugs or carcinogenic agents, among others. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.

• Data preparation. For time-to-event or other longitudinal data, N subject-period observations need to be derived from the data on the n subjects, as detailed in Section 4.2 of the manuscript and in Section B.1 below. This step is not required for other types of data used in different study designs where each record relates to a single observation sampled in a specific time, such as case-control, time series or cross-sectional data. • Modelling choices. The user must define the lag period (minimum and maximum lag) and the model candidates, namely the different options for the exposure-response function f (x) and lag function w( ) composing the lagbasis or cross-basis functions. In particular, this step identifies the type of function (linear, constant, B-splines, piecewise constant and threshold amongst others) and optional constraints. These choices are dictated by specific assumptions on the exposure-lag-associations. Multiple predictors can be modelled through lag-basis and/or crossbasis functions. • Generate exposure histories. The matrix of exposure histories Q from Eq. 3 in the manuscript represents the exposure events related to each observation within the selected lag period. This can be reconstructed if not available, as in the example for the Colorado Plateau uranium miners cohort. • Compute the cross-basis matrix. For each model, the lag-basis or cross-basis matrix W in Eq. 3 and 7 in the manuscript is computed with the function crossbasis() of the package dlnm. This function accepts as

Statistics in Medicine
A. Gasparrini arguments the matrix Q, the selected lag period and the options for defining the functions f (x) and w( ), as selected above. • Fit the model. The lag-basis or cross-basis matrix W, containing the transformed variables, is included in the regression formulae of standard regression functions, such as coxph(), clogit(), lm(), glm(), gee(), lme() or gam(), for estimating the parameters η. • Model comparison. The fit of the models with different specifications of f (x) and w( ) can be compared through AIC and BIC, identifying the best-fitting model and assessing the evidence for a non-linear exposure-response and non-constant risk along lags. • Predict the exposure-lag-response surface. Predictions from DLM and DLNMs are obtained from the function crosspred(), which accepts as arguments the cross-basis and model objects (or alternatively the selected coefficients and (co)variance matrix). This function returns the grid of risk contributions β xp, p from Eq. 8-9, composing the risk surface for specific user-defined exposure and lag values, together with the overall cumulative effects. • Graphical representation. The risk surface of the exposure-lag-response association, or specific exposure-response or lag-response curves, can be represented graphically with method functions plot(), lines() and points(), reproducing all the figures included in the manuscript. The comparison of different models can help the interpretation of the results. • Predict the risk for specific exposure histories. The overall cumulative effectsβ c in Eq. 10 for specific exposure histories can be also derived from crosspred(). The trend of overall cumulative risk over time can be similarly predicted, generating a matrix with time-varying exposure histories.

B. Additional information on modelling
B.1. Dealing with time-varying exposures in time-to-event data As described in Section 3.2 of the manuscript, the analysis of the Colorado Plateau uranium miners cohort is performed using a Cox proportional hazard model. The specification of models for time-to-event data with continuous time-varying predictors, assumed by definition when modelling exposure-lag-response associations, requires specific analytical and computational strategies [1]. In particular, the analysis needs to account for the fact that the exposure of a given subject changes along time. A suggested method is to divide the follow-up period for each subject into short equally-spaced intervals, and to compute the exposure values for each of them. However this approach, previously used for exposurelag-response analyses [2], may considerably expand the dataset in the case of extended follow-up periods. An alternative and less approximate approach takes advantage of the specific likelihood form of the Cox model, based on the sum of contributions of risk sets defined at each failure time [3]. The risk sets can be created by splitting the follow-up period of each subject at each failure time, generating subject/period observations. Each observation belongs to a single risk set, and its exit time corresponds to the exact moment the subject contributes to the risk set. In practice, for the likelihood to be defined, the computation of the time-varying exposure level can be limited to the exit time of each subject/period observation. In addition, in exposure-lag-response relationships the relevant time-varying exposure is represented by the whole exposure history in a defined lag period. This is computed for each subject/period observation, given the exposure profile of each subject and the specific exit time. In the example included in the manuscript, the chosen time axis is age, and the exposure histories are reconstructed by following the method described in Section B.2. However, the split of the dataset generates 343,236 subject/period observations, introducing substantial computational problems. Actually, the function coxph still performs quite well with this number of observations, but the computation of the cross-basis, following Eq. 7 in the manuscript, is not manageable in terms of memory allocation. Although this step can be optimized regarding memory and computing time in future releases of the package dlnm, for this specific example I chose to reduce the volume of the data by sampling from the risk set.
This approach is thoroughly illustrated by Langholz and Goldstein [4]. In practice, each risk set is composed by one or few cases and usually a high number of subjects acting as controls. The number of subject/period observations can be reduced by randomly selecting a sub-sample of controls from each risk set. This approach resembles a matched casecontrol design nested within the cohort, where the inclusion of all the controls in each risk set produces the same likelihood of the standard cohort analysis. The number of sampled controls is proportional to the precision of the estimates compared to the full analysis. In the example illustrated in the manuscript, I selected all the controls in each risk set up to a maximum of 100, a number which guarantees a feasible computation and a negligible approximation. With this split dataset of 25,096 subject/period observations, the regression can be performed with both coxph or clogit functions, which return exactly the same results. For each subject, the histories of radon exposures and smoking are reported as cumulative measures in five-year age periods. First, the approximated exposure values at each age in years (from 0.5 to 99.5) are computed, accounting for sub-periods defined by age at start and end of mining/smoking. Then, the exposure history for lag 2-40 is retrieved by linking such age-specific values with age at exit of each of the 25,906 subject/period observations, generating the matrix Q in Eq. 2 of the manuscript. The approximation is to yearly exposures, where lag 0 refers to the cumulative exposure sustained in the last year, lag 1 between 1 and 2 years ago, and so on.

C.1. The Colorado Plateau uranium miners cohort
The data for the Colorado Plateau uranium miners cohort includes information for 3,347 male subjects. Eligibility criteria were for the miner to have worked in the mines of the four-state Colorado Plateau area for at least one month, with at least one examination from public health service physicians between 1950 and 1960, and to have completed a questionnaire providing social and occupational information. The dataset used here refers to the follow-up at 31 st December 1982, with 1,258 deaths (37.6%), including 258 lung cancer cases (7.7%) [5]. A more recent follow-up is also available [6]. Exposure data available in the dataset include cumulative measures of radon and smoking in five-year age intervals. At the time of writing, a detailed description of the cohort was available at hydra.usc.edu/timefactors/.
The data is available as a comma-separated values (.csv) file. The labels and descriptions of the 52 variables included in the dataset are summarized in Table S1.

C.2. The R package dlnm and the R code
The package dlnm contains functions to run the analyses illustrated in Section 3 of the manuscript. The analysis in the manuscript and the illustration below refer to version 1.6.6. Specifically, the function crossbasis generates the lagbasis and cross-basis matrices, explained by Eq. 1-7 in the manuscript, to be included in the regression function coxph. The function crosspred is used to predict the risk summaries in Eq. 8-10 for specific exposure values or exposure histories. The method functions plot and lines are called to graphically represent such risk summaries, producing the figures included in the manuscript.
The R scripts provided as supplementary material reproduce all the results illustrated in the manuscript, figures included. The R packages dlnm, xtable and PermAlgo, available in the R CRAN, need to be installed. The scripts are meant to be run consecutively. In particular, the script example.R provides a short and easy-to-follow illustration of the main steps of the analysis, if compared to the more convoluted programming used in the other scripts.
The package dlnm is under constant development, and the usage of existing functions may change, although portability of the existing code in future versions will be preserved whenever possible. An updated version of the code can be found at my personal web-page www.ag-myresearch.com.

D.1. Specifying right-constrained models
As discussed in Section 2.4 of the manuscript, a right constraint on the lag-response curve can be specified by excluding specific variables of the B-spline basis for the space of . Figure S1 shows the four basis variables of a quadratic B-spline without intercept, defined in the support interval 0-40, with internal knots at 13.3 and 26.6. Basically, the constraint is applied by excluding the last two variables (green dash-dotted and blue long-dashed curves in Figure S1). This can be achieved by specifying a user-defined modification of the B-spline function, as shown in the R code included in the Supplementary Web Appendix.
An example is provided in Figure S2, where a left and right-constrained model is compared to the left-constrained Model 8 described in the manuscript. As expected, the two estimated exposure-lag-response associations are very similar, given that Model 8 already estimates a null risk at the end of the lag period. The double-constrained model shows a lower AIC of 2148.0.

Statistics in Medicine
A. Gasparrini

D.2. Analysis on a subset of subjects
The analysis is repeated using to the subset of subjects with a maximum yearly exposure to radon less than 300 WLM/year (81.6% of the total). The results are shown in Figure S3, showing a comparison of the estimated exposure-response curves and lag-response curves from Model 8 with the complete data, and those from Model 8 and Model 4 with the subset. The estimates for Model 8 are almost identical, indicating that the results from this more flexible option are not biased by few high exposure events. Although the fit of Model 4 and Model 8 in this subset of data is now more similar, the AIC still largely favours the latter, with values 1685.2 and 1659.2 respectively, suggesting that non-linearity is not limited to very high exposures.

D.3. Modelling the exposure-response with the log function
The analysis is also carried out by replacing the spline in Model 8 with the log function to model the exposure-response relationship, namely a DLNM with f (x) = log(x + 1). Again, this can be achieved by specifying a user-defined function to be used in the cross-basis definition, as shown in the R code included in the Supplementary Web Appendix. The AIC value of the new model and Model 8 are 2148.6 and 2153.2 respectively. However, Figure S4 indicates that the estimates are very similar.

D.4. Sensitivity analysis on knot location
An additional sensitivity analysis assesses the impact of knot location for the lag-exposure function w( ) in Model 8. The best fitting model presented in the main text has a single knot at 13.3 lag. Models with alternative knot placements at 20, 26.6 and 13.3-26.6 lags (total df equal to 9, 9 and 12) display an higher AIC of 2154.7, 2157.9 and 2158.0 respectively. The comparison of the fit of alternative models is shown in Figure S5.

E.1. Simulating the exposure profiles
A n s × 100 matrix of exposure profiles is produced for each of the three simulation settings, with number of subjects n s equal to 200, 400 or 800. Each row of the matrix represents a series of exposure events x t potentially experienced by a specific subject at times t = 1, . . . , 100. The series is generated as 9-14 random exposure periods of length 1-10, with intensity between 0 and 10. The distribution of the exposure events across the n s subjects and examples of exposure profiles from 3 random subjects are reported in Figure S6.

E.2. Simulating scenarios of exposure-lag-response associations
The scenarios of exposure-lag-response associations are defined through bi-dimensional functions f s (x) · w s ( ), generating risk contributions for each value of exposure x and lag . Three different shapes are generated for f s (x) and w s ( ) through simple mathematical functions. Specifically, the exposure-response function f s (x), defined for x ∈ [0, 10], is specified as linear, plateau and exponential, with: The lag function w s ( ), defined for ∈ [0, 40], is specified as constant, decay and peak, with: These shapes are illustrated in Figure S7. Nine simulated scenarios of exposure-lag-response associations are then specified with f s (x) · w s ( ) including all possible combinations of f s (x) and w s ( ) as above. The nine bi-dimensional dependencies are illustrated in Figure S8. These associations generate different distributions of overall cumulative risk β c,i for random exposure histories, with a median of HR ranging approximately from 2.5 to 4.5.

E.3. Sampling time-to-event data
Time-to-event data are simulated in each of the m = 500 data sets as exit time and type of event (censored or uncensored) for the n s subjects, using a permutational algorithm developed for simulating time-to-event data in the presence of timevarying exposures [7]. Event times are generated conditional on the cumulative contribution of the exposure history at each time t, computed as s s (x, t) = f s (x t− ) · w s ( ) over lag = 0, . . . , 40, with alternative functions f s and w( ) selected in different scenarios.
The algorithm is implemented in the R package PermAlgo, and involves three steps. First, the individual timevarying cumulative contribution is computed from s s (x, t) for each time t = 1, . . . , 100 for each of the n s subjects, given the associated time-varying exposure history. In the second step, event and censoring times are randomly sampled from uniform distributions with limits 0-100 and 0-200, respectively, thus producing approximately 25% of censored observations. Finally, event and censoring times are matched with the time-varying cumulative contributions, proceeding from the earliest to the latest times, to define the related risk sets. In censoring times, a censored subject is randomly selected with equal probability across the risk sets. In event times, the case is randomly sampled with probabilities conditional on the time-varying cumulative contributions, simply assigning to this covariate a coefficient equal to 1. Additional details are provided elsewhere [7].

E.4. Additional simulation results
Results for the simulations study reported in tables and graphs in the manuscript refer to the data sets with n s = 400 subjects. The versions of Tables 4 and 5 Medicine   Table S3. Version of Table 4 in the manuscript with n s = 800 subjects.  Table S4. Version of Table 5 in the manuscript with n s = 200 subjects.

Average df
Empirical rejection rate  Table S5. Version of Table 5 in the manuscript with n s = 800 subjects.
Average df Empirical rejection rate           Figure S10. Version of Figure 5 in the manuscript for ns = 800.