Keywords:

  • Glomerular filtration rate;
  • graft function;
  • kidney transplantation;
  • randomized controlled trials;
  • systematic review

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Methods
  5. Results
  6. Discussion
  7. Acknowledgments
  8. Disclosure
  9. References

Kidney function endpoints are commonly used in randomized controlled trials (RCTs) in kidney transplantation (KTx). We conducted this study to estimate the proportion of ongoing RCTs with kidney function endpoints in KTx where the proposed sample size is large enough to detect meaningful differences in glomerular filtration rate (GFR) with adequate statistical power. RCTs were retrieved using the key word “kidney transplantation” from the National Institutes of Health online clinical trial registry. Included trials had at least one measure of kidney function tracked for at least 1 month after transplant. We determined the proportion of two-arm parallel trials that had sufficient sample sizes to detect a minimum 5, 7.5 and 10 mL/min difference in GFR between arms. Fifty RCTs met inclusion criteria. Only 7% of the trials had a sample size above 562, the number needed to detect a minimum 5 mL/min difference between the groups should one exist (assumptions: α = 0.05; power = 80%, 10% loss to follow-up, common standard deviation of 20 mL/min). The result increased modestly to 36% of trials when a minimum 10 mL/min difference was considered. Only a minority of ongoing trials have adequate statistical power to detect between-group differences in kidney function using conventional sample size estimation parameters. For this reason, some potentially effective interventions that could ultimately benefit patients may be abandoned before further assessment.


Abbreviations
AUC: area under the curve
FDA: Food and Drug Administration
GFR: glomerular filtration rate
KTx: kidney transplantation
NIH: National Institutes of Health
RCT: randomized controlled trial
ROC: receiver operating characteristic
SCr: serum creatinine concentration
SD: standard deviation

Introduction

Kidney transplantation (KTx) is considered the treatment of choice for end-stage renal disease. Transplantation prolongs patient survival [1], improves health-related quality of life [2, 3] and is considerably cheaper than dialysis [2]. Unfortunately, kidney transplants continue to be lost prematurely due to potentially modifiable causes such as allograft nephropathy, recurrent disease and drug toxicity [4]. Interventions supported by well-designed randomized controlled trials (RCTs) with appropriate and meaningful endpoints are required to further improve allograft survival.

The selection of outcome measures in KTx is an ongoing source of debate [5-10]. Ideally, the outcomes should include the definitive endpoints of patient and graft survival. This, however, requires large numbers of patients with long follow-up times, making such trials highly impractical and costly [8]. As a result, patient and allograft survival are rarely used as primary outcomes and are instead replaced by a variety of surrogate endpoints. Treatment effects on surrogate endpoints suggest anticipated effects on definitive endpoints as long as changes in the surrogate are predictive of changes in the definitive endpoint [11]. Examples of commonly used, yet unvalidated, surrogate endpoints in KTx are biopsy-proven acute rejection, markers of kidney function and proteinuria [8, 10, 12-14]. Biopsy-proven acute rejection has traditionally served as the primary efficacy endpoint for Food and Drug Administration (FDA) approval of immunosuppressive medications [15], although this practice has recently been questioned [9].

In a previous systematic review, we reported that nearly 80% of RCTs enrolling KTx recipients included a kidney function endpoint (mostly estimates of glomerular filtration rate [GFR] based on serum creatinine concentrations [SCr]) [14]. Of these, nearly one-third had a marker of kidney function as the primary endpoint [14]. The trials in the review demonstrated a general lack of rigor in design, with poor documentation of study power and justification of sample sizes [14]. Furthermore, sample sizes were generally small, raising the question of whether the trials were adequately powered to detect minimal clinically important differences between treatment groups should they in truth exist. Accordingly, the purpose of this study was to estimate the proportion of registered KTx trials with renal function endpoints that are powered to detect meaningful differences.

Methods

Search strategy

RCTs were retrieved from the National Institutes of Health (NIH) online clinical trial registry (clinicaltrials.gov) using the following four key words in separate searches: kidney transplant; kidney transplant and acute rejection; kidney transplant and graft failure; and kidney transplant and death. The search was limited to studies that were open/recruiting at the time of retrieval (May 14, 2011).

Study selection

Two investigators (C.A.W. and A.I.) independently reviewed all search results for potentially eligible trials. Any disagreement in trial eligibility was resolved by consensus. Eligible studies met the following criteria: (1) participants were kidney transplant recipients; (2) participants were 18 years or older; (3) the study had at least one kidney function measure as a primary or secondary endpoint; (4) the kidney function endpoint was measured at least 1 month after the transplant surgery and (5) the study was an RCT.

Data abstraction

Data were abstracted independently by two investigators (C.A.W. and A.I.) using a standardized form. Any initial differences for data abstraction were resolved by consensus. The following data were abstracted: sample size, marker of kidney transplant function (SCr, timed urinary creatinine clearance [uCrCl], GFR estimation equation [eGFR], serum cystatin C, GFR measurement [mGFR] using any exogenous marker or unspecified kidney function). If GFR was estimated, the specific equation was recorded. If GFR was measured, the tracer and collection method (urinary or plasma) were noted. All endpoints were categorized as primary or secondary, as well as continuous, categorical or unspecified. The time posttransplant of each endpoint was also recorded. Endpoints were deemed continuous if they represented a dynamic range of values (such as GFR values, measured and/or estimated, as well as serum creatinine values). Categorical endpoints contained discrete thresholds which an endpoint either met or did not meet (e.g. GFR increased or decreased by a specific increment).

Sample size calculations

The vast majority of trials utilized a continuous outcome that compared kidney function between groups at a specified time point in follow-up (e.g. eGFR at 1-year posttransplant). This is consistent with our previous systematic review, where most studies compared GFR outcomes at a single time point and did not perform slope-based analysis comparing changes in kidney function over time [14]. Thus, sample size calculations were based on a two-sample t-test performed to detect various between-group differences in GFR at a single follow-up time point. For these estimates, we set α = 0.05, β = 0.2 and assumed a 10% loss to follow-up. Based on data from the literature, the minimal clinically important between-group difference in GFR at a given point in time in follow-up was set at 5.0, 7.5 and 10 mL/min [16-18]. The standard deviation (SD) on the GFR outcome was set at 15, 20 and 25 mL/min based on published data [5, 19-23]. For example, the variance in the four-variable Modification of Diet in Renal Disease equation at 6 months in the Long-Term Efficacy and Safety Surveillance database of 1334 kidney transplant recipients (mean eGFR of 56.8 mL/min/1.73 m²) was 22.5 mL/min/1.73 m² [5]. In the BENEFIT study, the SDs of the 1-year mean mGFR in the three study arms were 30, 27.7 and 18.7 mL/min/1.73 m² [23]. In the ALERT trial, the SDs of the mean mGFRs at 18 months were 20.5 and 19.4 mL/min/1.73 m² in the two study arms [19]. In the ELITE-Symphony study, the SDs of the mean 1-year eGFR and mGFRs ranged between 25.1 and 28.5 mL/min/1.73 m² [20].

The percent of trials with a continuous GFR outcome measure and two study arms (n = 44) that had adequate sample sizes was then calculated for each combination of between-group differences and SDs.
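The calculation described in this subsection can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' actual procedure: the function names are hypothetical, the per-group size uses the normal approximation plus a small correction term (z²/4) that approximates the t-distribution adjustment, and rounding the inflated total to the nearest integer is an assumption. Under these assumptions, the sketch returns totals that agree with the values reported in the Results (for example 318 for a 5 mL/min difference with an SD of 15 mL/min).

```python
import math
from statistics import NormalDist


def per_group_n(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample t-test.

    delta: between-group difference in GFR to detect (mL/min)
    sd:    common standard deviation of GFR (mL/min)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = delta / sd  # standardized effect size
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    # Assumed correction: z_alpha^2 / 4 approximates the extra participants
    # needed because the test uses the t rather than the normal distribution.
    return math.ceil(n_normal + z_alpha ** 2 / 4)


def total_n(delta, sd, loss=0.10, alpha=0.05, power=0.80):
    """Total enrolment for two arms, inflated for loss to follow-up."""
    # Assumed rounding convention: nearest integer after inflation.
    return round(2 * per_group_n(delta, sd, alpha, power) / (1 - loss))


if __name__ == "__main__":
    for sd in (15, 20, 25):
        print(sd, [total_n(delta, sd) for delta in (5, 7.5, 10)])
```

Because the required size scales with (SD/difference)², halving the detectable difference roughly quadruples the number of patients needed.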

Results

The search yielded 306 studies (Figure 1). Fifty trials met all inclusion criteria. The most common reason for exclusion (n = 240) was that the trial was not restricted to kidney transplant recipients.

Figure 1. Search results and selection of studies for analysis.

Study characteristics

The 50 studies included 9467 patients (Table 1). The mean (SD) study sample size was 189 (287), whereas the median (25th, 75th percentile) was 119 (40, 254). Continuous measures of kidney function were described in 47 (94%) trials, with two (4%) trials describing it as a categorical endpoint. The description of the endpoint in one trial was too vague to permit labeling it as continuous or categorical. Twenty (40%) trials included more than one kidney function endpoint. The time of measurement after transplant was variable, ranging from 1 month to 5 years.

Table 1. Included studies
[Endpoint columns of the original table (Cr eGFR, mGFR, CysC eGFR, SCr, CrCl, Unknown): per-trial marks not preserved]
Clinical Trial ID  Year  Principal investigator  Sample size  Treatment groups
NCT00307125        2006  M. H. Sayegh            134          2
NCT00476164        2007  A. Dorling              120          2
NCT00502242        2007  Pfizer                  440          2
NCT00565331        2007  L. Hilbrands            280          2
NCT00731874        2008  S. Kapur                120          2
NCT00752479        2008  G. Remuzzi              6            2
NCT00771875        2008  E. S. Woodle            30           3
NCT00811915        2008  C. Freguin              200          2
NCT00859131        2009  K. D. Chavin            200          2
NCT00861536        2009  J. Steiger              40           2
NCT00866879        2009  L. Gallon               200          2
NCT00895206        2009  R. Vanholder            128          2
NCT00895583        2009  Pfizer                  256          2
NCT00903188        2009  J. L. Bosmans           152          2
NCT00906204        2009  R. B. Stevens           165          2
NCT00922129        2009  A. Kapoor               30           2
NCT00928811        2009  M. A. S. Kumar          40           2
NCT00931255        2009  A. Haririan             80           2
NCT00933231        2009  Astellas                280          4
NCT00956293        2009  Novartis                250          2
NCT00965094        2009  Novartis                40           2
NCT00983645        2009  P. Baron                50           2
NCT00999258        2009  S. Constantinescu       50           2
NCT01002339        2009  A. Torres               210          3
NCT01023815        2009  Novartis                500          3
NCT01025817        2009  Novartis                590          2
NCT01028092        2009  Y. Le Meur              306          3
NCT01053221        2010  W. Burlingham           30           2
NCT01056835        2010  J. Ha                   40           2
NCT01066689        2010  Y. Lebrancheu           64           2
NCT01095172        2010  N. Mamode               316          2
NCT01114529        2010  Novartis                676          2
NCT01117662        2010  H. Haller               118          2
NCT01120028        2010  P. Friend               800          4
NCT01147302        2010  M. E. Uknis             20           2
NCT01154387        2010  S. Flechner             85           3
NCT01159080        2010  S. K. Park              350          2
NCT01166724        2010  T. Srinivas             25           2
NCT01169701        2010  Novartis                80           2
NCT01170910        2010  L. Badet                300          2
NCT01224860        2010  P. Cravedi              20           2
NCT01239472        2010  A. B. Pereira           30           2
NCT01244659        2010  F. Pinho                60           2
NCT01265537        2010  J. Gill                 30           2
NCT01285375        2011  J. Stenman              20           2
NCT01289301        2011  A. Schwarz              124          2
NCT01292525        2011  M. Giral                106          2
NCT01304836        2011  Astellas                1166         2
NCT01312064        2011  M. K. Tsai              90           2
NCT01327573        2011  S. Kulkarni             20           2

Primary endpoints

Nineteen trials (38%) had at least one marker of kidney function as a primary endpoint (Table 2). Two of these had more than one kidney function primary endpoint. The most common endpoint was an SCr-based eGFR (n = 8), followed by mGFR (n = 5) and SCr (n = 3). Only two trials used a categorical primary outcome. The overall median (25th, 75th percentile) sample size of trials with a continuous primary endpoint was 134 (64, 306) (Table 3). The trials with a continuous eGFR primary outcome had larger median sample sizes than those with an mGFR primary outcome (250 vs. 134 participants) (Table 3).

Table 2. Primary and secondary endpoints
                              Primary endpoint (19 trials)  Secondary endpoint (40 trials)
eGFR, n (%)                   8 (42)                        22 (54)
  Four-variable MDRD Study    1                             3
  Cockcroft-Gault             1                             9
  Le Bricon Cystatin C (1)    1                             0
  Seven-variable MDRD         0                             1
  Nankivell (2)               1                             3
  MDRD unspecified            3                             5
  Unknown/other               2                             8
SCr, n (%)                    3 (16)                        13 (32)
mGFR, n (%)                   5 (26)                        5 (12)
  Iohexol                     0                             1
  Iothalamate                 0                             1
  51Cr-EDTA                   2                             1
  Gd-DTPA                     1                             0
  Unspecified                 2                             2
Urinary CrCl, n (%)           1 (5)                         2 (5)
Unspecified renal function    3 (16)                        12 (29)

(1) Le Bricon GFR = [78 × (1/cystatin C)] + 4, with cystatin C in mg/L.
(2) Nankivell GFR = (6700/Cr) + (weight/4) - (urea/2) - [100/(height)²] + 35 (if male) or 25 (if female), with Cr in µmol/L, weight in kg, urea in mmol/L and height in meters.
MDRD = Modification of Diet in Renal Disease.
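
The two estimating equations given in the footnotes to Table 2 can be written out directly. The following is a minimal sketch that transcribes those footnote formulas; the function and parameter names are hypothetical, and the input values in the usage lines are made-up numbers chosen only to make the arithmetic easy to follow.

```python
def le_bricon_gfr(cystatin_c_mg_l):
    """Le Bricon estimate: 78 x (1 / cystatin C) + 4, cystatin C in mg/L."""
    return 78.0 * (1.0 / cystatin_c_mg_l) + 4.0


def nankivell_gfr(cr_umol_l, weight_kg, urea_mmol_l, height_m, male):
    """Nankivell estimate as given in the Table 2 footnote.

    Cr in umol/L, weight in kg, urea in mmol/L, height in meters.
    """
    sex_term = 35.0 if male else 25.0
    return (
        6700.0 / cr_umol_l
        + weight_kg / 4.0
        - urea_mmol_l / 2.0
        - 100.0 / height_m ** 2
        + sex_term
    )


if __name__ == "__main__":
    # Hypothetical inputs, not patient data from any trial.
    print(le_bricon_gfr(1.0))
    print(nankivell_gfr(134.0, 80.0, 10.0, 2.0, male=True))
```

Because the two equations rely on different serum markers and covariates, they can give different values for the same patient.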
Table 3. Median sample sizes (IQR) of trials with continuous primary and secondary endpoints
             Continuous primary endpoints    Continuous secondary endpoints
             Studies  Sample size (IQR)      Studies  Sample size (IQR)
All trials   17       134 (242)              40       98 (220)
eGFR         8        250 (310)              22       98 (170)
SCr          2        60 (80)                13       80 (220)
mGFR         4        134 (206)              5        40 (108)
CrCl         1        40                     2        315 (70)
Unspecified  3        118 (780)              11       45 (280)

Secondary endpoints

Forty studies (80%) had at least one kidney function secondary endpoint, with 13 trials having more than one such endpoint (Table 2). The most common endpoint was an SCr-based eGFR (n = 22), followed by SCr (n = 13) and mGFR (n = 5). All secondary outcomes were continuous measurements. The overall median (25th, 75th percentile) sample size of the trials with a secondary kidney outcome measure was 98 (30, 230) (Table 3).

Estimated GFR

In all, 28 studies (56%) used creatinine-based eGFR as a study outcome. The most common estimates were the Cockcroft-Gault equation (n = 10) and one of the MDRD Study equations (n = 13). One study used the Le Bricon eGFR equation based on serum cystatin C level. In 11 studies, the equation used to calculate the eGFR was not specified.

Measured GFR

Nine trials used an mGFR endpoint (Table 2). The most commonly used GFR measurement tracer was 51Cr-EDTA (n = 2). Iohexol, iothalamate and gadolinium-DTPA were each used in one trial. Four trials had an unspecified tracer. Two studies specified using plasma clearance with the remaining seven studies providing no information on clearance strategy.

Sample size calculations

Required sample sizes (α = 0.05; power = 80%, 10% loss to follow-up) to detect a minimum between-group difference in GFR of 5, 7.5 and 10 mL/min are shown in Table 4. For a conservative SD of 15 mL/min, sample sizes of 318, 142 and 82 patients are needed to detect minimum between-group differences in GFR of 5, 7.5 and 10 mL/min, respectively. Only 11%, 36% and 55% of trials were powered to detect these differences, respectively (Table 5). With an SD of 20 mL/min, the numbers of patients required to detect the same differences were 562 (7% of trials), 251 (23% of trials) and 142 (36% of trials) (Tables 4 and 5).

Table 4. Total sample size required (α = 0.05; power = 80%)
                             Difference in GFR between two independent groups (mL/min)
Standard deviation (mL/min)  5     7.5   10
15                           318   142   82
20                           562   251   142
25                           876   391   222
Table 5. Percent of trials with two treatment arms and continuous outcomes with adequate sample size
                             Difference in GFR between two independent groups (mL/min)
Standard deviation (mL/min)  5     7.5   10
15                           11    36    55
20                           7     23    36
25                           2     9     25

Discussion

This study reveals the current popular use of estimates of GFR as study endpoints over all other markers of kidney function in kidney transplant trials and ongoing issues in RCT design with respect to study power. A minority of registered trials were adequately powered to detect reasonable differences in GFR between study arms. Only 11% of studies had sufficient sample size to detect a GFR difference of 5 mL/min assuming a conservative common SD of 15 mL/min. This falls to 7% with a more realistic SD of 20 mL/min [5, 19, 23]. An important implication of this study is that potentially beneficial treatments may be abandoned from further testing if underpowered trials using markers of kidney function report negative findings (type II error). Poor adherence to the randomly allocated therapy exacerbates the problem of underpowering as, in such circumstances, the overall potential treatment effect to be detected is diluted.

The power of a statistical test is the probability that the test will find a statistically significant difference if one truly exists (that is, that it will appropriately reject a false null hypothesis). Power depends on three components: the sample size (which is affected by loss to follow-up), the desired level of significance (alpha) and the standardized effect size. Power is improved with larger sample sizes, a larger significance criterion and a larger standardized effect size. The standardized effect size takes into account not only the anticipated effect size (here, the difference in kidney function between groups) but also the variability in the kidney function measure (its SD). In most RCTs, 80% is considered an adequate standard for power. The desired level of significance (alpha) is often set at 0.05. Power analyses are then used to calculate the minimal sample size required to achieve the desired power for a given statistical test. Trial designers need to decide on the anticipated magnitude of the effect size and take into account the variability in the measure. Although the desired effect size does not directly change the type I error rate (alpha), seeking a small difference between treatment arms may lead to a large sample size and a statistically significant result that is not clinically relevant.
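
The dependence of power on sample size, alpha and standardized effect size can be illustrated with a short Python sketch. It uses the normal approximation to a two-sided, two-sample comparison of means and ignores the negligible probability of rejecting in the wrong direction; the function name is hypothetical, and the inputs in the usage lines (a total of 562 or 119 patients, a 5 mL/min difference, an SD of 20 mL/min, 10% loss to follow-up) are taken from figures quoted elsewhere in this article purely for illustration.

```python
import math
from statistics import NormalDist


def approx_power(total_n, delta, sd, alpha=0.05, loss=0.10):
    """Approximate power of a two-arm trial with equal allocation.

    total_n: total patients enrolled across both arms
    delta:   true between-group difference in GFR (mL/min)
    sd:      common standard deviation of GFR (mL/min)
    """
    n_per_group = total_n * (1 - loss) / 2  # evaluable patients per arm
    d = delta / sd                          # standardized effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Normal approximation; the t-distribution gives slightly lower power.
    return NormalDist().cdf(d * math.sqrt(n_per_group / 2) - z_alpha)


if __name__ == "__main__":
    print(round(approx_power(562, 5, 20), 2))  # sample size from Table 4
    print(round(approx_power(119, 5, 20), 2))  # median registered trial size
```

Under these assumptions, a trial enrolling 562 patients has power of about 80%, whereas a trial at the median registered size of 119 patients has only roughly a one-in-four chance of detecting a true 5 mL/min difference.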

The variability in the GFR is dependent on the population being studied. In the majority of transplant trials, patients are randomized at the time of transplantation. After a kidney transplant, graft function varies widely between individuals. Thus, in the setting of an RCT, variability in GFR at any given time posttransplant is quite high (15–30 mL/min), leading to the requirement for large sample sizes [19, 20, 24, 25]. It is possible that the use of mGFR instead of the less accurate eGFR would lower the variability in the measure and allow for smaller sample sizes. The limited evidence available does not, however, suggest this to be the case [19, 20, 24, 25]. For example, in the ELITE-Symphony study the SDs were very similar for the eGFR (25–27 mL/min) and mGFR (25-25-28 mL/min) [20]. In the ALERT trial, the SD was slightly lower for the 36-month eGFR (19.1 and 18.8 mL/min/1.73 m² in the two treatment arms) as compared to mGFR (21.7 and 21.7 mL/min/1.73 m²) [19].

An unresolved yet seminal issue is what constitutes a minimal meaningful short-term difference in GFR between groups. Establishing this will require that short-term graft function and long-term patient and graft survival be measured in trials and that corresponding changes in both be observed, putatively due to the therapeutic intervention [12]. Such prospective data are lacking. The recent BENEFIT study does report significantly improved short-term GFR in the belatacept arms as compared to the cyclosporine arm but no difference in patient and graft survival at 3 years [26]. Longer follow-up in such studies would assist in establishing GFR as a valid surrogate endpoint and in defining minimally meaningful differences in this measure.

In a recent editorial, Vincenti comments on the protracted FDA approval process for belatacept [9] and argues that GFR should become the primary efficacy endpoint (in lieu of acute rejection) since the former is “the most durable marker of efficacy”. In contrast, others have argued that kidney function should not be used as a surrogate endpoint in KTx given that, while associated with graft survival [5, 27], it is not robustly predictive of graft survival [5, 7, 28]. Accurate predictive ability is required to establish legitimacy of surrogates [11, 12, 29] and can be determined by performing receiver operating characteristic (ROC) curve analysis and determining the area under the curve (AUC) [7, 12]. An AUC of greater than 0.8 is considered good or excellent predictive ability [30]. In one study using a 6-month posttransplant eGFR, He et al. found an AUC of only 0.6 for 5-year graft failure [5]. In other words, if two individuals are selected at random, one from each of two groups (those who developed graft failure and those who did not), the individual who developed graft failure will have had the lower eGFR only 60% of the time. This is only slightly better than the result that would be obtained by guessing at random. Other groups have reported very similar findings [7, 28]. Presumably, there are multiple other important independent risks for graft failure/death. Furthermore, in the setting of an RCT, interventions designed to improve short-term graft function may in fact increase the risk of graft loss or death if they increase infections, malignancies or cardiovascular disease. Pathways by which changes in surrogate outcomes can fail to reflect changes in hard outcomes in KTx have been reviewed by Schold et al. [10]. The authors also caution against the reliance on short-term kidney function surrogates that have yet to be shown to predict long-term outcomes.
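
The pairwise interpretation of the AUC given above can be made concrete with a small Python sketch. It computes the AUC directly as the proportion of (failure, no failure) pairs in which the patient with graft failure had the lower eGFR, counting ties as half. The function name and the eGFR values are hypothetical and serve only to illustrate the definition; they are not data from any of the cited studies.

```python
def auc_lower_egfr_predicts_failure(egfr_failed, egfr_survived):
    """AUC as a concordance probability.

    Returns the fraction of all (failed, survived) pairs in which the
    patient whose graft failed had the lower eGFR; ties count as 0.5.
    """
    concordant = 0.0
    for f in egfr_failed:
        for s in egfr_survived:
            if f < s:
                concordant += 1.0
            elif f == s:
                concordant += 0.5
    return concordant / (len(egfr_failed) * len(egfr_survived))


if __name__ == "__main__":
    # Hypothetical 6-month eGFR values (mL/min/1.73 m2).
    failed = [30, 55]
    survived = [50, 60]
    # Pairs: (30,50) yes, (30,60) yes, (55,50) no, (55,60) yes -> 3 of 4
    print(auc_lower_egfr_predicts_failure(failed, survived))
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation of the two groups.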

Complicating matters is the well-described inaccuracy of estimates of GFR based on serum creatinine [31]. The poor performance of the estimates has led various authors to argue that measured GFR should be used instead of estimated GFR in RCTs [12, 32, 33]. In a previous systematic review in KTx, half the trials that reported both mGFR and eGFR had discrepant results between the two measures [14]. In the current study, only nine (18%) trials utilized a measured GFR, likely due to issues of cost and impracticality. Studies looking at the ability of short-term measured GFR posttransplant to predict long-term graft failure/death have not been conducted. If eGFR is the chosen outcome, then a single equation should be utilized. The choice of equation is of lesser importance, as it is the difference in eGFR between groups that is of interest rather than the actual point values. Investigators should avoid post hoc analysis of multiple different equations as this can introduce issues of multiplicity (and type I error) into the analysis.

Strengths of our study include the use of the clinical trial registry, which allows assessment of contemporary trials and thus more accurately represents the current state of evidence generation. Weaknesses include the lack of detailed descriptions of the study endpoints in the database. It is certainly possible that the study endpoints are not exactly as described in the database and that the lack of descriptors may have led to misclassification of endpoints as continuous or categorical. Also, the registry does not include the rationale for the study sample size or the anticipated effect size and effect variability. In addition, the trial registry we used only includes studies registered at the NIH and did not include other known trial registries. The analysis did not consider “hard” clinical outcomes such as graft failure or death, but given the rare occurrence of these events in the short term, we anticipate that similar inadequacies in power would be found. Our analysis also did not consider more sophisticated approaches, such as rank-based procedures that account for the competing event of death when assessing a continuous outcome such as GFR at a given time in follow-up [34]. However, such analyses are not expected to materially affect the sample size requirements that we presented using simpler analytic approaches.

In conclusion, although kidney function endpoints are commonly used in RCTs in KTx, most trials are underpowered to detect “reasonable” differences in graft function between treatment groups. The results suggest that, in many cases, sample sizes are determined not by statistical considerations but rather by practical considerations such as time, effort and financial constraints. This may lead to the inappropriate discarding of beneficial therapeutic strategies which is highly problematic given the deleterious implications of graft loss. However, it is imperative to establish a minimally important difference in GFR in the short term that predicts long-term outcomes of death and graft failure in order for kidney function to be considered an acceptable surrogate at all. This will require adequately powered trials with long-term follow-up. Once established, these short-term surrogates could be used in phase 1 and 2 trials to identify those interventions with a high priority for testing in major trials with definitive endpoints.

Acknowledgments

We thank Ms. Heather Thiessen-Philbrook for her assistance with the power calculations presented in this paper.

Disclosure

The authors of this manuscript have no conflicts of interest to disclose as described by the American Journal of Transplantation.

References
