In this paper, we provide simple methods by which nonstatisticians can evaluate sample size calculations for most large simple trials as an important part of the peer review process, whether for a grant, an Institutional Review Board review, an internal scientific review committee, or a journal. Through these methods, readers can not only determine whether there is a major disparity but also readily determine the correct sample size. It will be of comfort to find that in most cases the sample size computation is correct, but the implications can be major for the minority where serious errors occur. We provide three real examples: one where the sample size need was seriously overestimated, one (HIP PRO, a test of a device to prevent hip fractures) where the sample size need was dramatically underestimated, and one where the sample size was correct. The HIP PRO case is especially troubling, as it went through an NIH study section and two peer-reviewed journal reports without anyone catching a sample size error of more than five-fold.

This paper provides readers with a “back of the envelope” way to conduct a quality review of sample size calculations for randomized two-treatment clinical trials that one might see in a grant application, protocol review, or manuscript. While a peer-reviewed grant may have undergone a biostatistical review, or a multicenter pharmaceutical company protocol may have undergone multiple levels of review, this extra look can still provide very important information. We shall examine three real situations[1-5] to demonstrate the methods. One example shows a clear overestimate of the sample size needs, one a severe underestimate, and the third was spot on. The final sections are devoted to a discussion and conclusion. Formulas are deferred to an Appendix.

Methods

Quality assurance of sample size calculation

For large sample randomized two treatment trials (studies comparing proportions, events per person year, or means via the gold standard, two-sided tests), one can conduct a quality assurance as follows:

Step 1: Calculate the value of either (1) the Z statistic or t-statistic (estimate of effect divided by its standard error), or (2) chi-square (one degree of freedom), assuming everything goes exactly as expected in the planning parameters.

Step 2: Compare this to the values in Table 1, whose entries are Z_{α/2} + Z_{β} for t or Z and (Z_{α/2} + Z_{β})^{2} for chi-square, where Z_{α/2} and Z_{β} are the upper 100α/2 and 100β percentiles of the standard normal distribution, corresponding to a two-sided Type I error of 100α% and a Type II error of 100β% (power 100(1−β)%). If all goes as planned, the value in Step 1 should be highly statistically significant. If the value computed in Step 1 is substantially lower (higher) than the value in Step 2, the study power is below (above) the required value.

Table 1. Comparison values for Step 2 of sample size quality assurance

| Type I error (α) | Power (1 − β) | Z or t value | Chi-square value | 2-tail p-value |
|---|---|---|---|---|
| 0.05 | 0.80 | 2.80 | 7.84 | 0.0051 |
| 0.05 | 0.85 | 3.00 | 9.00 | 0.0026 |
| 0.05 | 0.90 | 3.24 | 10.50 | 0.0012 |
| 0.01 | 0.80 | 3.42 | 11.70 | 0.00063 |
| 0.01 | 0.90 | 3.86 | 14.90 | 0.00011 |
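The entries of Table 1 can be regenerated for other combinations of Type I error and power. The following is a minimal sketch, assuming a two-sided test and using only the Python standard library; the function name `comparison_values` is ours, introduced for illustration.

```python
from statistics import NormalDist


def comparison_values(alpha, power):
    """Return (Z sum, chi-square, two-tail p) for a two-sided test.

    alpha is the two-sided Type I error and power is 1 - beta.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # upper 100*alpha/2 percentile
    z_beta = std_normal.inv_cdf(power)           # upper 100*beta percentile
    z_sum = z_alpha + z_beta
    two_tail_p = 2 * (1 - std_normal.cdf(z_sum))
    return z_sum, z_sum ** 2, two_tail_p


# Print the five rows covered by Table 1.
for alpha, power in [(0.05, 0.80), (0.05, 0.85), (0.05, 0.90),
                     (0.01, 0.80), (0.01, 0.90)]:
    z, chi_square, p = comparison_values(alpha, power)
    print(f"{alpha:.2f}  {power:.2f}  {z:.2f}  {chi_square:.2f}  {p:.5f}")
```

Small differences from the tabled chi-square values are expected, since the table squares Z values that were already rounded.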

Sample size recalculation (nonsequential trials): For a corrected sample size, simply multiply the original sample size by the ratio of the chi-square entry in Table 1 to either (a) the square of the Z- or t-statistic or (b) the chi-square statistic you obtained in Step 1. Trivial differences should be viewed as confirming the sample size calculations, as different statistical sample size methods use different approximations that can easily differ from each other by up to 5%.
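The recalculation rule can be written as a short helper. This is a sketch only; the function name `corrected_sample_size` and its arguments are our own illustration, not part of any package.

```python
def corrected_sample_size(original_n, observed_stat, target_stat,
                          is_chi_square=False):
    """Rescale a planned sample size by the rule described above.

    observed_stat: the idealized statistic from Step 1.
    target_stat:   the matching Table 1 entry (the Z entry for a Z or t
                   statistic, the chi-square entry for a chi-square).
    """
    if observed_stat <= 0 or target_stat <= 0:
        raise ValueError("statistics must be positive")
    if is_chi_square:
        ratio = target_stat / observed_stat
    else:
        # (target Z)^2 is the chi-square entry; (observed Z)^2 is its square.
        ratio = (target_stat / observed_stat) ** 2
    return original_n * ratio


# Figures quoted in the first example: 1,150 per group claimed,
# idealized t of 4.71, Table 1 target of 3.24.
print(round(corrected_sample_size(1150, 4.71, 3.24)))
```

With the figures quoted in the first example, this returns roughly 544 per group.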

Table 2. HIP PRO planning parameters for 50% compliance time in the experimental group (per HIP PRO investigators’ presumptions)

Calculations presume independence of outcomes for the paired experimental and control hips in the same subject. This was presumed by the HIP PRO investigators.[2]

Details of the first example are confidential, but the crux is that there were to be 40 centers for the trial and, according to the pharmaceutical company, the sample size requirement for a two-sided type I error of 5% and power of 90% was 1,150 per group.

Our Clinical and Translational Science Institute (CTSI) reviewed the study because it utilized CTSI resources. The University of Florida was the only center that found the sample size error. The planning numbers, extracted from the protocol, were as follows for the proposed two-sample, two-sided t-test: Group 1: planning mean (SD) 1.061 (1.030); Group 2: planning mean (SD) 0.872 (0.934). The drug company claimed that 1,150 subjects per group were needed.

We computed the t-value for the idealized result using (A3) in the Appendix as t = 4.71.

The comparative value in Table 1 is 3.24, well below 4.71, indicating that the study has far more power than claimed. To obtain the intended power, the sample size would be 1,150(3.24/4.71)^{2}, or 544 per group (1,088 total). While excessive power may seem to be a good thing, capitation costs for pharmaceutical companies may be several thousand dollars per subject. Hence, upper management would certainly be reluctant to spend capitation costs on over 1,200 subjects who are not needed to meet the power goals. We suspect that the company biostatisticians may have misread the output of a computer program, taking 1,150 total (the actual meaning of the output) as 1,150 per group. The values 575 (1,150/2) and 544 are compatible with the typical differences among methods of determining sample size.

Example 2: HIP PRO Application

This was an individually matched study in which, for each subject, one hip was assigned the HIP PRO device, worn in the underwear, and the other hip was left unprotected as a matched control. The simplest way to see that there was a major error is to examine Table 1 of the HIP PRO design paper.[2] It states that the expected number of falls resulting in hip fractures would be 46 for control hips versus 34.5 for hips assigned to HIP PRO, based on 50% compliance (explained later) and independence of the side-to-side outcomes (meaning very few bilateral fractures). Rounding these idealized results in favor of a significant difference, let us assume there were 80 total fractures (46 + 34), of which 46 were on the control side and 34 on the side assigned to the HIP PRO. We can calculate the statistic on the basis of a null hypothesis of a 50–50 chance that a given fracture was on the unprotected side. Using the results in the Appendix, the estimated effect size under these idealized results would be (46/80 − 0.50) = 0.0750. The standard error from the Appendix, (A1) and (A2), is √{(46/80)(34/80)/80} = 0.0553. Thus the idealized Z = 0.0750/0.0553 = 1.36, p = 0.17. The study was supposedly designed to have 90% power at p = 0.05, so from our Table 1, had the sample size been correct, Z should have been 3.24 (p = 0.0012). The sample size, in terms of years of follow-up, should not have been 1,632 person years, as the design article[2] suggests, but 1,632(3.24/1.36)^{2} = 9,263 person years (about 5.7 times as many person years as proposed). Hence, the study was severely underpowered.

A second way to see this, more in line with how the designers planned to analyze the study, relies further on Table 2. Historically, there were about 5.6 hip fractures per 100 patient years and hence 2.8 fractures per 100 hip years. The planned improvement was to cut the rate in half, to 1.4 fractures per 100 hip years, but the investigators presumed that the HIP PRO would actually be worn in only half the hip years on the hips assigned to it. They assumed independence, and hence, from the lower panel of Table 2, there would be an expectation of 1.0 bilateral fractures, which would be uninformative for the comparison. After rounding, the expectation among the unilateral fractures would be 44.7 on the control side and 33.3 on the side randomized to protection. Again rounding in favor of significance, the ideal split would be 45 versus 33. Using the same method as above, Z = 0.0769/√{(45/78)(33/78)/78} = 1.375 (p = 0.17). The corrected sample size, which uses McNemar's test as prescribed in the protocol, is 1,632(3.24/1.375)^{2} = 9,066 person years.
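Both HIP PRO checks above follow the same arithmetic, which can be sketched as below. The helper name `one_sample_proportion_z` is ours, and the standard error uses (A1) with (A2) evaluated at the idealized (alternate) proportion, as in the text. Because the code carries Z at full precision while the text rounds it before squaring, the corrected person years printed here differ slightly from the figures quoted above.

```python
from math import sqrt


def one_sample_proportion_z(control_events, total_events, null_p=0.5):
    """Idealized Z for the share of fractures on the control side."""
    p_hat = control_events / total_events
    effect = p_hat - null_p
    # (A1) with (A2): sigma = sqrt(P(1 - P)) at the alternate value.
    standard_error = sqrt(p_hat * (1 - p_hat) / total_events)
    return abs(effect) / standard_error


TARGET_Z = 3.24              # Table 1: two-sided 5% Type I error, 90% power
PLANNED_PERSON_YEARS = 1632  # from the design article

# (control-side fractures, total informative fractures) for the two checks.
for control, total in [(46, 80), (45, 78)]:
    z = one_sample_proportion_z(control, total)
    needed = PLANNED_PERSON_YEARS * (TARGET_Z / z) ** 2
    print(f"split {control}/{total - control}: Z = {z:.3f}, "
          f"person years needed = {needed:,.0f}")
```

Either split gives a Z far below 3.24 and a requirement more than five times the planned follow-up.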

In the third example,[4] the investigators planned a randomized study of an experimental agent versus an active control with respect to an objectively defined primary failure endpoint as follows: failure rates of 3.9% versus 5.1%, two-sided type I error of 5%, and power of 85%. They calculated that a total of 10,900 subjects were needed, 5,450 per arm. The idealized outcome, where everything goes exactly as planned, is given in Table 3. From this table, chi-square = sum of (observed − expected)^{2}/expected = 9.01. From Table 1, the value of chi-square for 85% power at a 5% two-sided type I error is 9.00. Here we confirm that the power analysis was correct. Alternatively, using the Z-statistic as indicated in the Appendix, with (A2) and (A3), Z = 0.012/√{(0.039)(0.961)/5,450 + (0.051)(0.949)/5,450} = 3.02.

Table 3. Ingredients to compute chi-square and/or Z from Bhatt et al.[4]

| | Failures (expected) | Successes (expected) | Total |
|---|---|---|---|
| Arm A (3.9%) | 213 (245.5) | 5,237 (5,204.5) | 5,450 |
| Arm B (5.1%) | 278 (245.5) | 5,172 (5,204.5) | 5,450 |
| Total | 491 | 10,409 | 10,900 |

This Z value matches the value of 3.00 in Table 1 extremely well, again indicating a correct sample size computation by the authors.
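The check for this example can be reproduced from the idealized counts in Table 3. The sketch below is ours and assumes the standard error is computed under the alternate hypothesis, per (A2) and (A3); the function names are illustrative.

```python
from math import sqrt


def chi_square_from_counts(table):
    """Pearson chi-square for a 2x2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected
    return chi_square


def two_sample_proportion_z(p1, n1, p2, n2):
    """Idealized Z using (A3), with sigma^2 = P(1 - P) from (A2)."""
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) / standard_error


# Idealized outcome: rows are arms, columns are failures and successes.
idealized = [[213, 5237],
             [278, 5172]]
print(f"chi-square = {chi_square_from_counts(idealized):.2f}")
print(f"Z = {two_sample_proportion_z(0.039, 5450, 0.051, 5450):.2f}")
```

Both results sit within rounding of the Table 1 targets of 9.00 and 3.00 for 85% power at a two-sided 5% Type I error.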

Discussion

We first note that this exercise can catch major disparities between what the investigators claim for sample size needs and what is truly needed. In the first example, potentially millions of dollars were at stake for the pharmaceutical company. In the HIP PRO study, had the sample size error been discovered at the time of the study section review, it seems doubtful that the study would have been deemed feasible.

These examples also show that no matter how many times a protocol has apparently been reviewed, the local institution has a responsibility to ensure that the study is both safe and feasible for adoption at that institution. While the overwhelming majority of trials that have passed peer review by other groups will indeed be properly designed, the consequences for the minority that are not can be extremely serious.

This paper can therefore have implications for shared Institutional Review Boards (IRBs), which will reduce the redundancy of peer review at the cost of a lower likelihood of catching major problems. Finally, we have had several instances where we caught sample size errors or correctable design inefficiencies in multicenter NIH-supported trials. Extra sets of eyes, used in constructive ways, can be extremely valuable, regardless of how many times a project has been reviewed.

Caution: When looking at censored survival data using the Cox proportional hazards model, the fixed-term survival rates do not quite fit into this quality review process. However, if the intent is to follow every subject for, say, 2 years, and you conduct the quality assurance as if the outcome of interest were binary two-year survival, you will overestimate the sample size needs. Your quality assurance will therefore work in one direction only. Be suspicious if your estimate exceeds the claimed sample size by 50% or more. If your estimate falls below the claimed sample size (i.e., you think the study is overpowered), there is a problem, as the Cox model should be more efficient than the binary outcome model.

Conclusion

Peer review is a collective responsibility of the research community at large. Increasing the pool of reviewers and the scope of the tools that reviewers have access to will enhance the process. It is hoped that this paper stimulates experienced reviewers, who might bring forward their “tricks of the trade” in this journal, to improve the process and expand the pool of volunteers.

Acknowledgments

This work was partially supported by grant 1UL1TR000064 from the National Center for Advancing Translational Sciences, National Institutes of Health. Thanks go to Mr. Michael Mahoney of the University of Florida IRB for bringing the HIP PRO study to my attention.

Financial Disclosure

None.

Appendix

In this appendix, we provide definitions and formulas for determination of ingredients to verify and recalculate sample size needs.

E = Effect size: Difference in means or proportions between the groups (two-sample) or difference between the alternate and null hypothesized mean or proportion (one sample).

SE = Standard error of the effect size estimate under the alternate hypothesis. This is calculated under ideal conditions:

One sample:

SE = σ/√N    (A1)

where N = sample size and σ = population standard deviation.

Note that

σ = √{P(1−P)}    (A2)

for proportions 0 < P < 1, using the alternate hypothesized value.

Two sample:

SE = √{(σ_{1}^{2}/N_{1}) + (σ_{2}^{2}/N_{2})}    (A3)

where N_{1}(N_{2}) and σ_{1} (σ_{2}) are the sample sizes and standard deviations of population 1 (population 2), respectively. Use (A2) above for proportion data.

Two sample (equal sample sizes, N each, and equal standard deviations, σ, for the t-test):

SE = σ√{2/N}    (A4)

Once the effect size E and standard error SE are calculated, we use (A5) to complete Step 1 of the methods section.

Z = |E|/SE,  chi-square = Z^{2}    (A5)
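For readers who prefer to script the check, formulas (A1) through (A5) translate directly into a few functions. This is a minimal sketch; the function names are ours and the numbers in the usage lines are hypothetical planning values, not taken from any of the examples.

```python
from math import sqrt


def sigma_proportion(p):
    """(A2): standard deviation for a proportion 0 < P < 1."""
    return sqrt(p * (1 - p))


def se_one_sample(sigma, n):
    """(A1): standard error for a one-sample comparison."""
    return sigma / sqrt(n)


def se_two_sample(sigma1, n1, sigma2, n2):
    """(A3): standard error for a two-sample comparison."""
    return sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)


def se_two_sample_equal(sigma, n_each):
    """(A4): two samples with equal sizes and equal standard deviations."""
    return sigma * sqrt(2 / n_each)


def z_and_chi_square(effect, standard_error):
    """(A5): Z = |E|/SE and chi-square = Z^2."""
    z = abs(effect) / standard_error
    return z, z ** 2


# Hypothetical planning values: means differ by 0.5, SD 2.0, 200 per group.
standard_error = se_two_sample_equal(2.0, 200)
z, chi_square = z_and_chi_square(0.5, standard_error)
print(f"Z = {z:.2f}, chi-square = {chi_square:.2f}")
```

The resulting Z or chi-square is then compared with the Table 1 entry for the planned Type I error and power.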

Readers interested in the theory behind this method for quality review should consult sections A130 and A131 of Shuster.[7]