### Abstract

Kidney function endpoints are commonly used in randomized controlled trials (RCTs) in kidney transplantation (KTx). We conducted this study to estimate the proportion of ongoing RCTs with kidney function endpoints in KTx where the proposed sample size is large enough to detect meaningful differences in glomerular filtration rate (GFR) with adequate statistical power. RCTs were retrieved using the keyword “kidney transplantation” from the National Institutes of Health online clinical trial registry. Included trials had at least one measure of kidney function tracked for at least 1 month after transplant. We determined the proportion of two-arm parallel trials that had sufficient sample sizes to detect a minimum 5, 7.5 and 10 mL/min difference in GFR between arms. Fifty RCTs met inclusion criteria. Only 7% of the trials had a sample size above 562, the number needed to detect a minimum 5 mL/min difference between the groups should one exist (assumptions: α = 0.05, power = 80%, 10% loss to follow-up, common standard deviation of 20 mL/min). The result increased modestly to 36% of trials when a minimum 10 mL/min difference was considered. Only a minority of ongoing trials have adequate statistical power to detect between-group differences in kidney function using conventional sample size estimating parameters. For this reason, some potentially effective interventions that could ultimately benefit patients may be abandoned without further assessment.

### Introduction

Kidney transplantation (KTx) is considered the treatment of choice for end-stage renal disease. Transplantation prolongs patient survival [1], improves health-related quality of life [2, 3] and is considerably cheaper than dialysis [2]. Unfortunately, kidney transplants continue to be lost prematurely due to potentially modifiable causes such as allograft nephropathy, recurrent disease and drug toxicity [4]. Interventions supported by well-designed randomized controlled trials (RCTs) with appropriate and meaningful endpoints are required to further improve allograft survival.

The selection of outcome measures in KTx is an ongoing source of debate [5-10]. Ideally, the outcomes should include the definitive endpoints of patient and graft survival. This, however, requires large numbers of patients with long follow-up times, making such trials highly impractical and costly [8]. As a result, patient and allograft survival are rarely used as primary outcomes and are instead replaced by a variety of surrogate endpoints. Treatment effects on surrogate endpoints suggest anticipated effects on definitive endpoints as long as changes in the surrogate are predictive of changes in the definitive endpoint [11]. Examples of commonly used, yet unvalidated, surrogate endpoints in KTx are biopsy-proven acute rejection, markers of kidney function and proteinuria [8, 10, 12-14]. Biopsy-proven acute rejection has traditionally served as the primary efficacy endpoint for Food and Drug Administration (FDA) approval of immunosuppressive medications [15], although this practice has recently been questioned [9].

In a previous systematic review, we reported that nearly 80% of RCTs enrolling KTx recipients included a kidney function endpoint (mostly estimates of glomerular filtration rate [GFR] based on serum creatinine concentrations [sCr]) [14]. Of these, nearly one-third had a marker of kidney function as the primary endpoint [14]. The trials in the review demonstrated a general lack of rigor in design, with poor documentation of study power and justification of sample sizes [14]. Furthermore, sample sizes were generally small, raising the question of whether the trials were adequately powered to detect minimal clinically important differences between treatment groups should such differences truly exist. Accordingly, the purpose of this study was to estimate the proportion of registered KTx trials with renal function endpoints that are powered to detect meaningful differences.

### Discussion

This study shows that estimates of GFR are currently favored over all other markers of kidney function as endpoints in kidney transplant trials, and that problems with study power persist in RCT design. A minority of registered trials were adequately powered to detect reasonable differences in GFR between study arms. Only 11% of studies had sufficient sample size to detect a GFR difference of 5 mL/min assuming a conservative common SD of 15 mL/min. This falls to 7% with a more realistic SD of 20 mL/min [5, 19, 23]. An important implication of this study is that potentially beneficial treatments may be abandoned without further testing if underpowered trials using markers of kidney function report negative findings (type II error). Poor adherence to the randomly allocated therapy exacerbates the problem of underpowering because, in such circumstances, the overall treatment effect to be detected is diluted.
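To illustrate how quickly power erodes in small trials, the following minimal sketch approximates the power of a two-arm comparison of mean GFR for a given number of patients per arm. It is not the calculation used in this study: it relies on a normal approximation, and the `adherence` parameter is a hypothetical, deliberately simple dilution model in which the observed difference shrinks in proportion to the fraction of patients who adhere to the allocated therapy. The function name and example sample sizes are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def power_two_arm(delta, sd, n_per_arm, alpha=0.05, adherence=1.0):
    """Approximate power of a two-sided, two-sample comparison of means.

    delta      true between-arm difference in GFR (mL/min)
    sd         common standard deviation of GFR (mL/min)
    n_per_arm  patients analyzed in each arm
    adherence  ASSUMPTION: fraction of the true effect that is actually
               observed (simple dilution model, not from the source)
    """
    z = NormalDist()
    observed_delta = delta * adherence       # diluted treatment effect
    se = sd * sqrt(2.0 / n_per_arm)          # SE of the difference in means
    # Normal approximation; the negligible chance of rejecting in the
    # wrong direction is ignored.
    return z.cdf(observed_delta / se - z.inv_cdf(1 - alpha / 2))


# Hypothetical trial sizes, 5 mL/min difference, SD 20 mL/min
for n in (50, 100, 252):
    full = power_two_arm(5, 20, n)
    diluted = power_two_arm(5, 20, n, adherence=0.8)
    print(f"n per arm = {n}: power = {full:.2f}, with 80% adherence = {diluted:.2f}")
```

Under these assumptions, a trial needs on the order of 250 analyzed patients per arm to approach 80% power for a 5 mL/min difference, and any dilution of the effect pushes power lower still.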

The power of a statistical test is the probability that the test will find a statistically significant difference if one truly exists (that is, that it will appropriately reject a false null hypothesis). Power depends on three components: the sample size (which is reduced by loss to follow-up), the desired level of significance (alpha) and the standardized effect size. Power increases with a larger sample size, a larger (less stringent) significance criterion and a larger standardized effect size. The standardized effect size takes into account not only the anticipated effect size (here, the difference in kidney function between groups) but also the variability in the kidney function measure (the SD of the kidney function measure). In most RCTs, 80% is considered an adequate standard for power. The desired level of significance (alpha) is often set at 0.05. A power analysis is then used to calculate the minimum sample size required to detect the anticipated effect at the chosen power and significance level. Trial designers need to decide on the anticipated magnitude of the effect size and take into account the variability in the measure. Although the desired effect size does not directly change the type I error rate (alpha), seeking a small difference between treatment arms may lead to a large sample size and a statistically significant result that is not clinically relevant.
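The relationship between these components can be sketched with the standard normal-approximation formula for comparing two means, n per arm = 2(z₁₋α/₂ + z_power)² × SD² / Δ², inflated for loss to follow-up. This is an illustrative sketch rather than the exact procedure used in this study; the function name is an assumption. With α = 0.05, power = 80%, SD = 20 mL/min, Δ = 5 mL/min and 10% loss to follow-up, it returns a total of roughly 560, close to the 562 reported here; the small gap plausibly reflects a t-distribution correction or a different rounding convention.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(delta, sd, alpha=0.05, power=0.80, loss=0.10):
    """Total sample size (both arms) for a two-sided comparison of means.

    Normal approximation; an exact t-based calculation gives a slightly
    larger answer. `loss` is the expected fraction lost to follow-up.
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_per_arm = 2 * (z_sum * sd / delta) ** 2   # analyzable patients per arm
    # Inflate for loss to follow-up, then round each arm up.
    return 2 * ceil(n_per_arm / (1 - loss))


# Differences and SDs considered in this study
for sd in (15, 20):
    for delta in (5, 7.5, 10):
        print(f"SD = {sd}, difference = {delta} mL/min: "
              f"total n = {total_sample_size(delta, sd)}")
```

Because the required sample size scales with (SD/Δ)², halving the detectable difference from 10 to 5 mL/min roughly quadruples the number of patients needed.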

The variability in the GFR is dependent on the population being studied. In the majority of transplant trials, patients are randomized at time of transplantation. After a kidney transplant, interindividual graft function is highly variable. Thus, in the setting of an RCT, at any given time posttransplant, variability in GFR is quite high (SD of 15–30 mL/min), leading to the requirement for large sample sizes [19, 20, 24, 25]. It is possible that the use of measured GFR (mGFR) instead of the less accurate estimated GFR (eGFR) would lower the variability in the measure and allow for smaller sample sizes. The limited evidence available does not, however, suggest this to be the case [19, 20, 24, 25]. For example, in the ELITE-Symphony study the SDs were very similar for the eGFR (25–27 mL/min) and mGFR (25-25-28 mL/min) [20]. In the ALERT trial, the SD was slightly lower for the 36-month eGFR (19.1 and 18.8 mL/min/1.73 m² in the two treatment arms) as compared to mGFR (21.7 and 21.7 mL/min/1.73 m²) [19].

An unresolved yet seminal issue is what constitutes a minimal meaningful short-term difference in GFR between groups. Establishing this will require that short-term graft function and long-term patient and graft survival be measured in trials and that corresponding changes in both be observed, putatively due to the therapeutic intervention [12]. Such prospective data are lacking. The recent BENEFIT study does report significantly improved short-term GFR in the belatacept arms as compared to the cyclosporine arm but no difference in patient and graft survival at 3 years [26]. Longer follow-up in such studies would assist in the establishment of GFR as a valid surrogate endpoint and of minimally meaningful differences in this measure.

In a recent editorial, Vincenti comments on the protracted FDA approval process for belatacept [9] and argues that GFR should become the primary efficacy endpoint (in lieu of acute rejection) since GFR is “the most durable marker of efficacy”. In contrast, others have argued that kidney function should not be used as a surrogate endpoint in KTx given that, while associated with graft survival [5, 27], it is not robustly predictive of graft survival [5, 7, 28]. Accurate predictive ability is required to establish legitimacy of surrogates [11, 12, 29] and can be determined by performing receiver operating characteristic (ROC) curve analysis and determining the area under the curve (AUC) [7, 12]. An AUC greater than 0.8 is considered to indicate good or excellent predictive ability [30]. In one study using a 6-month posttransplant eGFR, He et al. found an AUC of only 0.6 for 5-year graft failure [5]. In other words, if two individuals are randomly selected, one from each of two groups (those who developed graft failure and those who did not), the individual who developed graft failure will have had the lower eGFR only 60% of the time. This is only slightly better than the 50% expected from random guessing. Other groups have reported very similar findings [7, 28]. Presumably, there are multiple other important independent risks for graft failure/death. Furthermore, in the setting of an RCT, interventions designed to improve short-term graft function may in fact lead to an increased risk of graft loss or death if the intervention increases infections, malignancies or cardiovascular disease. Pathways by which changes in surrogate outcomes can fail to reflect changes in hard outcomes in KTx have been reviewed by Schold et al. [10]. The authors also caution against reliance on short-term kidney function surrogates that have yet to be shown to predict long-term outcomes.
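The pairwise interpretation of the AUC described above can be made concrete with a short sketch. It compares every patient who had graft failure with every patient who did not and counts how often the patient with graft failure had the lower eGFR, with ties counted as one half. The function name and the eGFR values are hypothetical and purely illustrative; they are not data from any of the cited studies.

```python
from itertools import product


def concordance_auc(egfr_failed, egfr_survived):
    """AUC as the probability that a randomly chosen patient with graft
    failure had a lower eGFR than a randomly chosen patient without it."""
    pairs = list(product(egfr_failed, egfr_survived))
    score = 0.0
    for failed, survived in pairs:
        if failed < survived:
            score += 1.0      # concordant pair: lower eGFR preceded failure
        elif failed == survived:
            score += 0.5      # tie counts as half
    return score / len(pairs)


# HYPOTHETICAL 6-month eGFR values (mL/min), for illustration only
failed = [30, 45]
survived = [40, 50, 60]
print(f"AUC = {concordance_auc(failed, survived):.2f}")
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation of the two groups, which puts the reported AUC of 0.6 in context.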

Complicating matters is the well-described inaccuracy of estimates of GFR based on serum creatinine [31]. The poor performance of the estimates has led various authors to argue that measured GFR should be used instead of estimated GFR in RCTs [12, 32, 33]. In a previous systematic review in KTx, half the trials that reported both mGFR and eGFR had discrepant results between the two measures [14]. In the current study, only nine (18%) trials utilized a measured GFR, likely due to issues of cost and impracticality. Studies looking at the ability of short-term measured GFR posttransplant to predict long-term graft failure/death have not been conducted. If eGFR is the chosen outcome, then a single equation should be utilized. Which equation is chosen is of lesser importance, as it is the difference in eGFR between groups that is of interest rather than the actual point values. Investigators should avoid *post hoc* analysis of multiple different equations as this can introduce issues of multiplicity (and type I error) into the analysis.

Strengths of our study include the use of the clinical trial registry, which allows assessment of contemporary trials and thus more accurately represents the current state of evidence generation. Weaknesses include the lack of detailed descriptions of the study endpoints in the database. It is certainly possible that the study endpoints are not exactly as described in the database and that the lack of descriptors may have led to misclassification of endpoints as continuous/categorical. Also, the registry does not include the rationale for study sample size or the anticipated effect size and effect variability. In addition, the trial registry we used only includes studies registered at the NIH and did not include other known trial registries. The analysis did not consider “hard” clinical outcomes such as graft failure or death, but given the rare occurrence of these events in the short term, we anticipate that similar inadequacies in power would be found. Our analysis also did not consider more sophisticated approaches to the analysis, such as rank-based procedures which account for the competing event of death when assessing a continuous outcome such as GFR at a given time in follow-up [34]. However, such types of analyses are not expected to materially impact the sample size requirements that we presented using simpler analytic approaches.

In conclusion, although kidney function endpoints are commonly used in RCTs in KTx, most trials are underpowered to detect “reasonable” differences in graft function between treatment groups. The results suggest that, in many cases, sample sizes are determined not by statistical considerations but rather by practical considerations such as time, effort and financial constraints. This may lead to the inappropriate discarding of beneficial therapeutic strategies, which is highly problematic given the deleterious implications of graft loss. However, it is imperative to establish a minimally important difference in GFR in the short term that predicts long-term outcomes of death and graft failure in order for kidney function to be considered an acceptable surrogate at all. This will require adequately powered trials with long-term follow-up. Once established, these short-term surrogates could be used in phase 1 and 2 trials to identify those interventions with a high priority for testing in major trials with definitive endpoints.