Meta‐analysis of non‐linear exposure‐outcome relationships using individual participant data: A comparison of two methods

Non‐linear exposure‐outcome relationships such as between body mass index (BMI) and mortality are common. They are best explored as continuous functions using individual participant data from multiple studies. We explore two two‐stage methods for meta‐analysis of such relationships, where the confounder‐adjusted relationship is first estimated in a non‐linear regression model in each study, then combined across studies. The “metacurve” approach combines the estimated curves using multiple meta‐analyses of the relative effect between a given exposure level and a reference level. The “mvmeta” approach combines the estimated model parameters in a single multivariate meta‐analysis. Both methods allow the exposure‐outcome relationship to differ across studies. Using theoretical arguments, we show that the methods differ most when covariate distributions differ across studies; using simulated data, we show that mvmeta gains precision but metacurve is more robust to model mis‐specification. We then compare the two methods using data from the Emerging Risk Factors Collaboration on BMI, coronary heart disease events, and all‐cause mortality (>80 cohorts, >18 000 events). For each outcome, we model BMI using fractional polynomials of degree 2 in each study, with adjustment for confounders. For metacurve, the powers defining the fractional polynomials may be study‐specific or common across studies. For coronary heart disease, metacurve with common powers and mvmeta correctly identify a small increase in risk in the lowest levels of BMI, but metacurve with study‐specific powers does not. For all‐cause mortality, all methods identify a steep U‐shape. The metacurve and mvmeta methods perform well in combining complex exposure‐disease relationships across studies.


INTRODUCTION
It is often of primary interest in epidemiological studies to assess the nature of the relationship between continuous exposure(s) or covariate(s) and disease outcome(s), with or without adjustment for confounders. Moreover, with increasing adoption of collaborative research (or consortia approaches) in many disease fields to help standardize methodology and improve statistical power by meta-analysis, 1,2 there is increasing interest in making such assessments, especially in the context of meta-analysis of individual participant data (IPD) from multiple studies. Outcome data are often time to event in prospective studies or case-control status in case-control studies.
Methods for fitting non-linear exposure-outcome relationships are well established for the analysis of data from a single study. A commonly used approach is to categorize the exposure in model fitting, but this is sensitive to the choice of categories and risks losing power and information and introducing bias. 3,4 A better alternative, widely used and our main choice here, is fractional polynomials (FPs). 5 Alternatives to FPs, which we discuss in Section 7.4, include ordinary polynomials, 6 restricted cubic splines, 6,7 and other splines. 8,9 Two extensions to the meta-analysis context, such as application to IPD meta-analysis involving hundreds of studies and millions of participants, have been proposed: the metacurve approach 10 uses pointwise averaging of best-fitting study-specific functions, while the mvmeta approach 11,12 uses multivariate meta-analysis of study-specific regression coefficients. These have not been broadly assessed for validity of results or to reveal any computational challenges.
Our broad aim is to compare the approaches proposed for the meta-analysis of the shape of the relationship between a continuous exposure and an outcome, adjusted for one or more confounders using IPD from multiple studies. We specifically consider the Cox proportional hazards regression framework and two-stage approaches, where the exposure-outcome relationship is first estimated in each study and then combined across studies. The key question is how to combine relationships across studies. The aims of this paper are to evaluate and compare the metacurve and mvmeta approaches and hence to gain insight into when both methods, one method or neither method may be appropriate.
In Section 2, we describe the motivating data, a large IPD dataset from the Emerging Risk Factors Collaboration (ERFC), assessing the exposure-outcome relationship of body mass index (BMI) with risk of coronary heart disease (CHD) and all-cause mortality (ACM). In Section 3, we provide detailed descriptions of the two methods compared. We then evaluate the methods using mathematical comparison (Section 4), analysis of hypothetical data (Section 5), and analysis of the ERFC data (Section 6). In Section 7, we discuss the results, outline some advantages and disadvantages of each method, discuss some issues that need further investigation, and draw tentative conclusions.

DATA
The ERFC 13 has requested and harmonized data from >120 prospective cohort studies in Western populations in order to facilitate IPD meta-analysis of the associations of risk factors with incident cardiovascular disease outcomes and cause-specific mortality. For the current analysis we use two ERFC datasets. Dataset 1 comprises 121 prospective cohorts and a total of 1 234 354 participants with data on BMI (exposure variable); age, sex and smoking status (confounders); and incident outcomes (55 780 CHD events; 225 356 ACM). The cohorts were based in Western Europe (82 cohorts, 734 779 participants), North America (25 cohorts, 417 726 participants), and other regions (Australia, Caribbean, Israel, Japan, and Turkey: 14 cohorts, 81 849 participants). In further modeling that might be used to explore the mechanism of association, we also adjust for systolic blood pressure (SBP), diabetes status, total cholesterol, and high-density lipoprotein cholesterol (HDL-C) as potential mediators using complete-case data from 454 415 participants in 82 cohorts (dataset 2). For more details see other works. [13][14][15][16] To reduce the influence of extreme observations in modeling of continuous variables, we truncate the distribution of each continuous variable except age at its 1 st and 99 th centiles across cohorts, replacing values more extreme than these centiles by the values of these centiles.

Notation
Denoting studies as i = 1, … , I, individuals as j = 1, … , J i , the continuous exposure of interest (BMI at baseline) as x ij , and confounders as a vector c ij (possibly including non-linear terms or interactions), the Cox proportional hazards model 17 in study i may be written as: where h 0i (t) is a baseline hazard, f i (.) is defined below and i and i are vectors of study-specific coefficients. The parameters i and i are estimated by maximising the partial likelihood for the data (time to event or censoring and event indicator).

Fractional polynomials in a single data set
FPs model the effect of an exposure x as (x; ) = ∑ m k=1 k x p k where m is the degree of the FP, the powers p k are chosen from the set −2, − 1, − 0.5, 0, 0.5, 1, 2, 3, and = ( 1 , 2 , … , m ). In FP models, x 0 is defined as log(x), and modifications allow two powers to be equal: for example, an FP model with degree 2 and powers (p, p) is defined as 1 x p + 2 x p log (x). 18 The degree is constrained to be less than or equal to a limit called the dimension. Usually the dimension is set to 2 so that the degree is 1 or 2, denoted FP1 or FP2. Models of all degrees considered and all possible powers are fitted by maximising the likelihood or partial likelihood. The degree may be selected by a closed testing procedure 19 or on subject-matter grounds. The best-fitting powers of the model with the selected degree are used.

Overall modeling strategy with fractional polynomials
The overall modeling strategy comprises five steps: 1. Select the confounders using subject matter knowledge and the data. 2. Select the degree of the FP function. This selection should not be performed within studies, which causes bias towards simple functional forms (here, towards linearity) because single studies tend to lack power to select complex functional forms. 5,10 Selection may be performed across studies by summing maximized likelihoods for each degree across studies or on subject-matter grounds. 3. Select the powers in the FP function. This could be done either using study-specific powers, giving where i = ( i1 , i2 , … , im ), or using common powers, giving (with the modifications noted above if powers equal each other or zero). 4. Fit the selected model to each study to get estimateŝi = (̂i 1 ,̂i 2 , … ,̂) with variance-covariance matrix S i . 5. Combine estimates or fitted functions across studies using the metacurve method (Section 3.4) or the mvmeta method (Section 3.5). Both methods involve meta-analysis. The meta-analysis may assume no heterogeneity: this is usually called the fixed-effect model, but we follow Higgins et al 20 in calling it the common-effect (CE) model. Alternatively the meta-analysis may allow for heterogeneity using the random-effects (RE) model. With the RE model, the empirical Bayes estimates of the study-specific curves may help in understanding the nature of the heterogeneity. 21

Pooling method 1: Metacurve
The metacurve approach (implemented in the Stata package metacurve) conducts pointwise averaging of estimated study-specific curves, and proceeds as follows: 1. Choose reference level x 0 (eg, the overall mean of x).

2.
For each x in the data (pointwise): • For each study i, estimatêi (x, x 0 ) = i (x;̂i) − i (x 0 ;̂i) and derive its standard error from S i .
• Perform meta-analysis (CE or RE) on thêi (x, x 0 ) to get a pooled estimatê(x, x 0 ) and its variance.
The metacurve approach is a direct way to pool estimated curves. However, different choices of reference level may give different answers. More important is the variation of study weights which reflect at each point the information in each of the studies, relevant if the distribution of x varies across studies. The next proposal avoids these issues.

Pooling method 2: Mvmeta
The mvmeta (multivariate meta-analysis) approach uses meta-analysis of the set of regression coefficients. This only works when common powers have been used across studies, as in Equation (3). The approach proceeds as follows.
1. Combine thêi and S i using multivariate meta-analysis (CE or RE) to give an overall mean̂and its variance S.
where the reference level x 0 is only used to graph the estimated curves. Add pointwise confidence bands.

Categorized exposure
An alternative to FPs, included here because it is frequently used in the epidemiological literature, is to categorize the exposure into K groups and fit K − 1 dummy variables. For example, let I k (x) be an indicator of x falling in the kth group, and take the rth group as the reference group. This strategy fits into the framework presented in Section 3.1 above with . The first stage of estimation involves fitting model (1) to each study separately, yielding estimateŝi with variance-covariance matrix S i . Whether we choose the CE or RE model, we must also choose to perform the meta-analysis separately for eacĥ(univariate meta-analysis), or jointly for the whole vector̂i (multivariate meta-analysis).

Approach to method comparison
We evaluate the metacurve and mvmeta methods in three ways. First, we use mathematics to identify when they are likely to agree and when they are likely to differ, in the common powers case (Section 4). Second, we create hypothetical data sets to explore the extent of differences in rather extreme cases with common powers (Section 5). Third, we analyze the ERFC data using metacurve and mvmeta, and investigate any differences (Section 6). In this comparison, we include metacurve with study-specific powers as well as with common powers, because the study-specific powers option is a potential advantage of metacurve, affording greater flexibility. We also perform a sensitivity analysis of the ERFC data using metacurve in order to assess the importance of choice of reference level.

Summary of FP pooling methods
eTable 1 summarizes the modeling approaches we have described and (in bold) options within those approaches: metacurve with study-specific or common powers, mvmeta with common powers, and categorization. All approaches have the option of CE or RE meta-analysis. Metacurve approaches further require a choice of reference level, while categorization approaches require a choice of number of categories and whether to perform univariate or multivariate meta-analysis.

MATHEMATICAL EXPLORATION
We perform a mathematical comparison of metacurve and mvmeta, using the within-studies model defined by Equations (1) and (3). For simplicity, we assume a correctly specified FP2 model with common powers, although this is a critical assumption in practice. Let the study-specific parameters be i = ( i1 , i2 ) and the overall mean parameters be = ( 1 , 2 ). The between-studies model for CE meta-analysis is i = , and for RE meta-analysis is i ∼N( , ). For given reference level x 0 , our aim is to estimate the average log hazard ratio at exposure x compared to reference: Our data yield estimateŝi ∼ N( i , S i ) where S i is the within-study variance-covariance matrix. We can write the marginal model aŝi ∼ N ( , V i ) where V i = S i for the CE meta-analysis and V i = S i +̂for the RE meta-analysis, and the heterogeneity variance matrix̂is estimated from the data. The mvmeta approach uses the multivariate pooled estimatê= The metacurve approach on the other hand pools univariate estimates a(x) T̂i 2 for the RE meta-analysis, and̂(x) 2 is estimated from the data. Then, the pooled estimate iŝ( . This is always true under a CE meta-analysis, but typically not true under a RE meta-analysis because univariate and multivariate models yield different heterogeneity variance estimates. 22 It occurs under a RE meta-analysis if all S i are much larger than , so that we are close to the CE model, or if heterogeneity variance estimates are stable between univariate and multivariate models, which would occur if there were little borrowing of strength. We call this second case "stable heterogeneity." In the case of proportional variances, it follows that T̂i which is a weighted average of study-specific x − x 0 contrasts. If addition, we have stable heterogeneity, then w i (x) = w * i for all x, the weights are equal across the two methods and mv (x, x 0 ) =̂(x, x 0 ).
If on the other hand proportional variances does not hold, then the mvmeta estimate is not a combination of study-specific x − x 0 contrasts: it therefore draws strength from other aspects of the fitted curve. We argue that mvmeta therefore borrows strength and gains precision if the model is correct but may lose robustness (that is, have greater error) if the model is mis-specified. These results motivate the hypothetical data below.

HYPOTHETICAL DATA
We compare the metacurve and mvmeta methods using four hypothetical data sets. using studies with very different exposure distributions to explore the implications of non-proportional variances. For simplicity, we use only two studies and consider only quadratic polynomials: that is, we use FP2 with common powers (1, 2). Our aims are: to compare the precisions of the two methods when both studies follow the same quadratic polynomial (a correctly specified model, example 1); to compare the methods' ability to approximate the true exposure-outcome relationship when both studies follow the same curve that is not a quadratic polynomial (a mis-specified model, examples 2 and 3); and to explore possible artefacts due to weights varying with exposure (example 4).
In examples 1-3, the studies have adjacent but non-overlapping ranges of the exposure x (BMI): uniformly distributed between 20 and 25 in Study 1 and between 25 and 30 in Study 2. The two studies are both large (size 500 000) to minimize noise. A time-to-event outcome is simulated from a proportional hazards model. We fit quadratic models followed by CE meta-analyses, with reference point x 0 = 25 for metacurve. In Example 1, the true model is quadratic with minimum at x = 22 (Figure 1, top left panel). Both mvmeta and metacurve correctly reproduce this curve (Figure 1, middle left panel), but metacurve has larger standard error than mvmeta (Figure 1, bottom left panel). Note that, by definition, standard errors are zero at the reference point x = 25. The difference in standard errors arises because at x < 25 metacurve yields standard errors approximately equal to those for Study 1 (leading the two curves almost to coincide), indicating that it draws very little information from Study 2 in this range, and similarly at x > 25 metacurve draws information primarily from Study 2. The mvmeta method, on the other hand, can draw information from both curves.
In Example 2, the true model is flat at x < 25 and a quadratic at x > 25 (Figure 1 second column). This is mis-specified because it is not a quadratic model. The fitted curves are wrong for both mvmeta and metacurve, but metacurve is much closer to the truth. Differences in standard errors are similar to those in Example 1. In Example 3, the true model follows different quadratics at x < 25 and x > 25 ( Figure 1, third column). In this case, metacurve again approximates the true model well, but mvmeta (which is restricted to quadratics) misses all the curvature.
In Example 4, the studies have overlapping exposure ranges: BMI is uniformly distributed between 20 and 30 in study 1, and between 22 and 28 in study 2 ( Figure 1, last column). The sample sizes are 25 000 and 75 000 respectively. Further, the two studies follow very different linear models. When they are pooled using metacurve, the impact of the weights varying with BMI becomes clear, as study 1 dominates for BMI around 25 (giving a positive gradient) while study 2 dominates for BMI below 22 and above 28 (giving a negative gradient). This demonstrates that weights varying with exposure can yield implausible conclusions: however, this is an extreme example where it is questionable whether the curves should be combined. Table 1 summarizes the EFRC data used for modeling. Dataset 1 is used for the confounder-adjusted analyses in Sections 6.2 to 6.4; dataset 2 is used for the analyses additionally adjusted for mediators in Section 6.5. Between-cohort variation in risk factors was generally smaller than within-cohort variation, except for age for which the within-and between-cohort SDs were similar. Cohort-specific distributions of age and BMI are shown in the supplementary materials (eFigure 1). These findings suggest that exposure distributions vary between studies and therefore the results in Section 4 suggest that metacurve and mvmeta approaches are likely to differ. We set the reference level x 0 at 25 kg/m 2 (close to the overall mean of just over 26 kg/m 2 ) because this is the WHO common cutoff for overweight. 16,23 We choose to use FP2 functions on subject-matter grounds, because previous studies suggest non-monotonic relationships. 15,16 An FP1 function would be preferred if a monotonic function were expected.

Main meta-analysis results: BMI and CHD adjusted for confounders in dataset 1
The metacurve method with study-specific FP powers showed an increasing association with a leveling off at low values of BMI (Figure 2, left-hand panels). For the metacurve and mvmeta methods with common FP powers, the powers (selected across studies) were (−2, −2), implying the model f (x; ) = 1 x −2 + 2 log (x)x −2 . Using these common FP powers, metacurve and mvmeta gave similar results (Figure 2, middle and right-hand panels): both methods identified a J-shape with nadir at 20.5 kg/m 2 , and unlike the results with study-specific powers, a small increase in risk at the lowest levels of BMI. With the CE approach, confidence intervals were narrow; with RE they were a little wider.  To explore whether the data truly support an upturn at low BMI, we conducted a sensitivity analysis by restricting data to the BMI range of 18 to 20 kg/m 2 , assuming the BMI-CHD associations are linear in this short range, and performing a univariate meta-analysis. This found a negative association ( Figure 3) consistent with the upturn. We also used the exposure categorization method with categorization of BMI into 10 or 20 quantile groups across studies (Figure 4). The reference group in each case is the quantile group containing 25 kg/m 2 . With 10 groups, a plateau is apparent at the lowest end of the BMI distribution, but using 20 groups instead suggests a J-shaped association. Inferences from univariate and multivariate meta-analysis were broadly similar in these results. The analysis with 10 groups may not be sensitive enough to explore the upturn at low BMI, which is found by the analysis with 20 groups. Thus the upturn at low BMI appears real and the analyses with common FP powers appear to have performed better here than the analysis with study-specific FP powers.

FIGURE 4
Random-effects meta-analysis of BMI-CHD association using categorization approach. Dataset 1, adjusted for confounders.
Overlapping data points are slightly offset in the x-axis for better clarity. BMI, body mass index; CHD, coronary heart disease [Colour figure can be viewed at wileyonlinelibrary.com] . The middle and right-hand panels graph the meta-analysis percentage weight for each study against BMI level by meta-analysis method (CE or RE). In the large studies, fitted study-specific functions were consistent with the J-shaped association evident in the overall meta-analysis result. Study weights varied from 2% to 12% depending on BMI value and meta-analysis method, with RE meta-analysis shrinking study weights to be similar at the extremes of BMI levels (suggesting greater estimated between-study heterogeneity). For the small studies, the fitted functions were noisy, and each study contributed <1% weight in any meta-analysis. The empirical Bayes estimates revealed a greater extent of shrinkage in the functions of the smallest studies (eFigure 2).

Meta-analysis of BMI and ACM adjusted for confounders in dataset 1
Meta-analysis of the BMI-ACM association using metacurve and mvmeta identified steep U-shaped associations, irrespective of the method of FP power selection (eFigure 3), with nadirs for CE and RE meta-analysis at 23.6 and 24.3 kg/m 2 respectively. Using the categorization approach demonstrated clear U-shaped associations whether using 10 groups or 20 groups (eFigure 4).

Meta-analyses adjusted for confounders and mediators in dataset 2
We repeated the above analyses with further adjustment for mediators, thus estimating the direct effect of BMI on outcomes. Estimated overall curves were again very similar between FP methods for CHD (eFigure 5) and ACM (eFigure 6).

Choice of reference category
To explore whether metacurve results were sensitive to the choice of reference value x 0 , we varied this value to 20, 30, and 35 kg/m 2 . The fitted curves were similar in shape for BMI and CHD ( Figure 6) and for BMI and ACM (eFigure 7).

Which method?
We have explored relationships between BMI and two outcomes using metacurve of FPs and mvmeta of FPs. Table 2 summarizes the advantages and disadvantages of the two methods. We now highlight the key points. Mathematical exploration showed that the two approaches tend to differ when exposure distributions vary considerably between studies, implying that mvmeta is more efficient when models are correctly specified and suggesting that metacurve is more robust to model mis-specification. Hypothetical data simulated with extreme variation in exposure distributions between studies and rather extreme effects supported this view and showed that implausible pooled associations can in principle occur. In real analyses, exposure distributions always differ to some extent (see eFigure 1 for an example) and the different weighting schemes should result in some differences between averaged curves. While mvmeta may have advantages because it borrows strength, metacurve may have advantages because the weights better reflect the information in the data. We found similar results for metacurve and mvmeta in the ERFC data, except that metacurve with study-specific powers did not show an upturn in CHD risk at low levels of BMI. Closer exploration showed that these data did support the upturn, thus supporting the use of metacurve and mvmeta with common FP powers. In principle, the fit to the data should be better (or at least not worse) with study-specific powers: metacurve with study-specific powers may have performed worse because power selection in some studies was driven by other features of the data. Overall, these results suggest that the theoretical differences between methods may not be important in practice, and both methods may be recommended.
The results of metacurve are in principle sensitive to the choice of reference value. However, we did not find any important sensitivity over a wide range of reference values. In practice, subject-matter knowledge often provides arguments for a suitable interval for the reference value, and then the specific choice made is highly unlikely to influence the result. Reference values should not be chosen outside the range of the data.
Both methods extrapolate study results beyond the range of their data. Metacurve extrapolates explicitly, since it averages weighted estimates from each study at each covariate value. Weights reflect the information at each point, implying that a study gets a relatively low weight in areas with a small number of participants (or events). Including studies only within their own covariate ranges is a possible alternative, but it risks artefacts in the curve shape where studies switch from included to excluded. Mvmeta extrapolates implicitly, by assuming the study-specific curves are correct over the whole covariate range.
Exposure categorization may be useful as a first step to modeling when data are sufficient to accommodate fine categorization without much loss of precision, as in the current context of large-scale IPD meta-analysis. However, the results in Figure 4 highlight a major limitation of this approach, that the choice of categories affects the results and in particular the extent to which nonlinear associations may be apparent.

Outstanding issues
Outstanding issues in the metacurve and mvmeta methods include how to assess and compare model fit, which is complicated by the difficulty of pooling likelihood or discrimination over studies after random-effects meta-analysis; how to assess influential points; and how to display and interpret heterogeneity. Further, the impact of truncation (Section 2) is not fully understood. The methods proposed use IPD and are unlikely to be useful with aggregate data.

Confounders
Because of the considerable richness and standardization of the ERFC data, we were easily able to handle the common and difficult problem of adjusting for confounders. In much smaller meta-analyses (say with about 5 to 10 studies), with larger differences between measured confounders, how to conduct a meta-analysis adjusting for confounders is less straightforward. For a more detailed discussion, see Section 5 in Sauerbrei and Royston. 10

Extensions
We have illustrated the methods using FPs as the method to derive functions in each study, but they apply for any model with a small number of parameters. For example, with restricted cubic splines, 24 we would initially choose the number of knots (say K) and their location (eg, by model comparisons across all studies), then construct the appropriate spline basis (of size K − 1), and fit the model to each study separately. Metacurve would then pool the estimated relative effects exactly as with FPs, while mvmeta would pool the K − 1 coefficients. Results would depend on the choice of the knots. For models with many parameters, eg, smoothing splines or penalized splines, metacurve would still apply as described. Here we propose implementing mvmeta by applying multivariate meta-analysis to a set of relative effects estimated over a grid of exposure values. The size of the grid would need to balance accuracy with the computational difficulty of high-dimensional multivariate meta-analysis. This method is also appropriate to apply the mvmeta approach when study-specific FPs have been selected.
In a randomized trial, it may be of interest to explore interactions between randomized treatment z i and a continuous effect modifier x ij . 25 The multivariable fractional polynomial interaction (MFPI) approach has been proposed to tackle this flexibly. 26 To explore interactions across multiple trials, the linear predictor in Equation (1) is changed to f i (x ij ; i ) + f i (x ij ; i )z ij , and the methods presented in this paper could be used to combine the treatment-effect functions i (x ;̂i). 27 The methods have been described for Cox models, but they can be applied to any model where the exposure is modeled through a linear predictor. Unlike the Cox model, such models usually include an intercept. It would therefore be possible to combine absolute risk curves (including the intercept) across studies; however, there is a danger of confounding by study. The methods presented for combining relative risk curves (ignoring the intercept) would be preferable since they only combine within-study information.

Conclusions
The metacurve and mvmeta methods appear to perform well in combining complex exposure-disease relationships across studies. Compared with the popular approach based on categorization, they use the full information from continuous variables, and results do not depend on the number and choice of cutpoints. Both methods rely on having good evidence from the study-specific analyses, as was achieved here using a large consortium approach to standardize methodology and improve statistical power. Software is available in Stata for wider practical application.