How is more negative evidence being used to support claims of benefit: The curious case of the third international stroke trial (IST-3)


  • Jerome R Hoffman, MA, MD, Professor Emeritus of Medicine/Emergency Medicine; Richelle J Cooper, MD, MSHS, Associate Professor of Medicine/Emergency Medicine.

Still a man hears what he wants to hear, and disregards the rest. ‘The Boxer’ – Paul Simon and Art Garfunkel, 1969.

Just before the release of the results of the third international stroke trial (IST-3),[1] the largest trial of thrombolysis in acute ischaemic stroke (AIS), the journal Stroke published a remarkable pre-emptive strike – a commentary in which the author identifies a legion of concerns regarding the study's methodology, only to reassure us about the study's value.[2] In one astonishing section he lucidly catalogues a host of important biases likely to skew the study's results in ways that increase the chance of finding a spurious benefit from the use of tissue plasminogen activator (tPA) – but then proceeds to trivialise the very concerns he has elucidated. He starts by noting that (among many other problems) the study determined ultimate patient outcome using a score that is highly unreliable even when calculated by a neurologist performing an in-person neurologic evaluation. He then acknowledges that this problem was enormously exacerbated in IST-3 because the scoring was done by a layperson; to make matters even worse, this was a layperson who, like the patient himself, was unblinded to treatment group. After all this, he nevertheless concludes with the soothing statement that ‘reassuringly, all images in IST-3 have been read by a blinded central observer, so outcomes based on imaging will have meaning free from any recall bias.’ This seems to be another way of saying ‘sure, there are many ways in which our measurements are very distorted … but don't worry about that, because there is one other way in which we got it right.’

The author of this commentary – and by extension the editors of Stroke who approved it – ultimately concludes that despite its many flaws, there is much to learn from IST-3.[2] We agree … although given that the actual results of IST-3 uniformly failed to show benefit, even in the face of severe bias, we believe the lessons are precisely the opposite of those being trumpeted by the study's own authors.[1]

The third international stroke trial is the latest addition to the long-running controversy over the use of thrombolysis in AIS, but the claims being made about ‘benefit’ in this trial seem to go beyond the common situation, where well meaning people can look at the same information and come to wildly disparate conclusions.[1] In this commentary we focus on IST-3, while also briefly revisiting the other two randomised controlled trials (RCTs)[3, 4] that are commonly cited as providing support for the use of tPA in AIS. We will not review the many RCTs that found either no benefit, or clear harm, even though we believe the limited attention given to these trials, simply because they are negative, greatly distorts the discussion of thrombolysis in AIS.[5-7] When the European Cooperative Acute Stroke Study III (ECASS III) was published in one of the world's most prominent journals, its claims of efficacy received enormous publicity;[4] meanwhile, the other contemporaneous RCT of thrombolytic treatment of AIS, which reported negative findings, appeared in a less prominent specialty journal and received essentially no notice after its publication – it was negative, so it is somehow considered irrelevant.[8] Nor will we address the many non-randomised ‘effectiveness’ studies other than to note that they too are mostly very negative, and that the few that purport to show utility suffer from the very same types of biases and misinterpretations that plague the ‘positive’ RCTs we will be addressing.

However, first we would like to touch on the role of chance when multiple studies evaluate a treatment that is in reality neutral. In such a case one should expect that most of the studies would find no overall effect, but by chance alone some would find benefit, whereas a similar few would find harm. At the time the National Institute of Neurological Disorders and Stroke (NINDS) study was published, it was the only one of five contemporaneous RCTs[9-12] that found benefit (whereas at least two found substantial harm). Although there is now a second study (ECASS III) with slightly positive results,[4] and a third (IST-3)[1] for which similar claims are being made despite its actual negative findings, this pattern remains consistent with what one would expect of a treatment that is non-beneficial (or somewhat harmful), because of chance alone. There are now more than a dozen RCTs of thrombolysis in AIS, most of which fail to find benefit, and several of which show harm.[5-7]
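The arithmetic behind this point can be sketched in a few lines of Python. The sketch assumes, purely for illustration, 12 independent trials of a truly neutral treatment and a conventional two-sided significance threshold of 0.05, so that each trial has a 2.5% chance of a spuriously ‘significant’ benefit and the same chance of spurious harm. The function names and both constants are illustrative assumptions, not figures taken from any of the trials discussed here.

```python
# Illustrative sketch only: how often chance alone yields a "positive" trial
# when a treatment is truly neutral. All constants below are assumptions.

N_TRIALS = 12            # assumed number of independent trials ("more than a dozen")
P_FALSE_BENEFIT = 0.025  # assumed: half of a two-sided 0.05 threshold


def prob_at_least_one(n_trials: int, p_single: float) -> float:
    """Probability that at least one of n independent trials crosses the threshold."""
    if n_trials < 0:
        raise ValueError("n_trials must be non-negative")
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    return 1.0 - (1.0 - p_single) ** n_trials


def expected_count(n_trials: int, p_single: float) -> float:
    """Expected number of trials crossing the threshold by chance."""
    return n_trials * p_single


if __name__ == "__main__":
    p_any = prob_at_least_one(N_TRIALS, P_FALSE_BENEFIT)
    print(f"P(at least one spuriously positive trial) = {p_any:.3f}")
    print(f"Expected spuriously positive trials = "
          f"{expected_count(N_TRIALS, P_FALSE_BENEFIT):.2f}")
```

Under these assumptions the chance of at least one falsely ‘positive’ trial is roughly one in four, and the chance of at least one falsely ‘harmful’ trial is the same. The calculation ignores real-world complications such as selective publication and differences in trial design.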

We have previously written extensively about NINDS, by far the more important of the two ‘positive’ RCTs[3, 4] that preceded IST-3. A key issue regarding NINDS is that most, if not all, of the ‘benefit’ from tPA could well be explained by the fact that on average patients randomised to placebo happened to have much more severe stroke at baseline. The differences favouring tPA at 90 days were less than the differences favouring tPA patients even before treatment. In addition, NINDS hypothesised that tPA would produce early improvement, as measured by a beneficial change at 24 h in patients' National Institutes of Health Stroke Scale (NIHSS) score (Δ-NIHSS). When part I of NINDS failed to find any such early benefit, which seems to contradict the mechanism by which tPA purportedly works, the authors not only moved the goalposts by evaluating outcome at 90 days, using different metrics, but also failed to publish the negative results in terms of Δ-NIHSS. To make matters even worse, they later vociferously challenged the appropriateness of Δ-NIHSS,[13, 14] once we subsequently showed that it did not favour tPA.[15] It is hard to imagine that they would have been similarly dismissive, had Δ-NIHSS proved favourable in NINDS. Indeed, despite their scathing criticisms, advocates of tPA not only originally intended to use this metric in NINDS, but have also continued to use it in other studies,[16-18] although we can only find it in studies where its results are not clearly negative.

Of a number of similar problems associated with ECASS III, we will mention only that, just as in NINDS, more of the patients receiving placebo had a severe stroke at baseline, and the magnitude of this difference again seems to explain most or even all of the (small) difference seen at follow up. We are unable to analyse this in detail, however, because (unlike with NINDS, where we were ultimately able to obtain the study's raw data) the patient-level data from this proprietary study are not available to the public.

Although it has always been true that many published studies are biased (that is, they use a methodology that invites the non-random introduction of error into the results, or interpret their own findings in ways that are not reflective of the actual results), we believe that the uncritical publication of IST-3 by the Lancet as a ‘positive’ study, in direct contradiction of every one of its own acknowledged results, is unprecedented.

First of all, IST-3 is a very large trial, involving more than 3000 subjects. Nortin Hadler, in his book Rethinking Aging, wisely suggests that whenever a very large trial is required to show statistical benefit, it means that the purported benefit cannot be clinically important.[19] This is because when benefit is large enough to be meaningful, it will be evident in a relatively small study population. Conversely, when statistical benefit is only evident following the recruitment of large numbers of patients, such benefit has to be extremely small and thus, in most cases, clinically meaningless. Almost none of the differences reported in IST-3, for a multitude of binary outcomes, even attained statistical significance. This suggests not only that they are probably due to chance alone (both because such a large study is substantially overpowered to find differences, and because the few ‘significant’ findings are an expected product of multiple subgroup slicing and dicing),[20-22] but also that they almost certainly cannot be meaningful. It is also worth noting that IST-3 was originally designed to be a double-blind trial, with an enrolment of 6000 subjects, but these elements were revised because of problems with recruiting and funding. We will not further address the enormous implications of making such changes mid-stream, particularly on subgroup analyses, other than to note that even the ‘trends’ towards benefit reported for the overall study are absent in the initial group of patients, who were indeed treated and evaluated in a blinded fashion.
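The relationship between enrolment and the size of a detectable difference can be illustrated with a standard normal-approximation formula for comparing two proportions. This is a minimal sketch, not a reconstruction of the IST-3 power calculation: the baseline outcome rate of 0.35, the 80% power target, the equal arm sizes and the function name are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: smallest absolute difference in a binary outcome
# that a two-arm trial can be expected to detect, by normal approximation.
# The baseline rate and power below are assumptions, not IST-3 values.
from math import sqrt
from statistics import NormalDist


def min_detectable_difference(n_per_arm: int,
                              baseline_rate: float,
                              alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """Approximate minimal detectable absolute difference in proportions.

    Assumes equal arm sizes and evaluates the variance at the baseline
    rate in both arms (a common simplification).
    """
    if n_per_arm <= 0:
        raise ValueError("n_per_arm must be positive")
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = std_normal.inv_cdf(power)
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_beta) * standard_error


if __name__ == "__main__":
    ASSUMED_BASELINE = 0.35  # hypothetical proportion with a good outcome
    for n in (150, 500, 1500, 3000):
        diff = min_detectable_difference(n, ASSUMED_BASELINE)
        print(f"{n:>5} per arm -> detectable difference of about {diff:.3f}")
```

Because the detectable difference scales with one over the square root of the number of patients per arm, quadrupling enrolment halves it. The assumed baseline rate changes the scale of the output but not this relationship.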

Second, the minor trends that are present in IST-3, on which the authors rely to make their claims of efficacy, have to do mostly with differences at the ‘bad outcome’ portion of the Oxford Handicap Score (OHS) that was used in the study. In NINDS (and many other trials), the 90-day outcomes were determined using metrics that included the modified Rankin Scale (mRS). This was judged by a neurologist who evaluated study subjects face to face, and who was blinded to the treatment each subject had received. Even under those optimal circumstances – face-to-face evaluation, by a neurologist, with blinding about which treatment had been given – the reliability of mRS is poor (i.e. different observers often fail to agree on the exact score).[23] Although we are not aware of any studies addressing reliability with the OHS, it is almost certainly similar to that of mRS.[2] If that is indeed the case, it becomes hard to credit small differences in the percentage of patients who achieved an outcome that was ‘terrible’, compared with those in whom the outcome was only ‘very bad’! Now try to imagine the threat to such an observation when it was made not by an examining neurologist, but rather by a carer or a family member (as in IST-3). Then add that whichever layperson did score this metric was not blinded to treatment group, and ask whether it is possible (or even inevitable) that claims about such subtle differences might be affected when the person doing the scoring was aware that their loved one ‘didn't even get any active treatment, and now ends up so terrible,’ as compared with ‘he may not be doing well, but imagine how much worse it might be if he hadn't gotten that clot buster!’

The primary outcome of IST-3, as stated in the trial registration and clearly noted in the paper itself, is a difference in the percentage of patients who achieved an OHS score of 0–2. As the paper also clearly reports, no such difference was found. We could understand if supporters acknowledged the negative results of IST-3, but claimed that it does not invalidate the benefit they believe was found in NINDS. However, we are dumbfounded that anyone could claim that a study in which there was no difference between groups in the primary outcome – despite having recruited more than 3000 subjects (so that even small and clinically meaningless differences would appear to be ‘significant’), and despite its many and substantial methodological problems that biased it in favour of finding a positive result – can be construed as showing benefit! Or that advocates can add IST-3 to a prior database of almost uniformly negative results, and then tell us that we should now extend the window of use to 6 h!

Those who believe that tPA is beneficial in AIS have published numerous articles lauding this treatment, and dismissing those who question their conclusions as naysayers and cynics (or far worse). Critics have far less access to journals and guideline committees and other venues with the potential to influence large numbers of physicians. In addition, they are placed in the almost untenable position of having repeatedly to critique yet one more false claim of benefit. We also need to deny once again that we are simply stubborn and unwilling to accept the truth, which is a bit like having to say on multiple occasions that ‘no, we didn't beat our wife’, a denial that is typically seen as tantamount to an admission of guilt! Nevertheless, we are confident that anyone who takes the time to look at IST-3 will understand that it is not merely a poorly executed study that does not actually support the use of thrombolysis, but rather a clearly negative study of major importance. The largest trial to date of thrombolysis in AIS clearly shows that, despite its many design flaws that artificially favoured the treatment group, there was no difference at 6 months in either overall mortality or neurological outcome. That IST-3 is being cited not merely as a ‘positive’ study, but also as a basis for expanding the use of tPA in AIS, should be seen as shocking, even if somehow we are not actually surprised.

Disability from stroke is a major public health concern, and we would of course welcome any intervention that could benefit our patients. Nevertheless, we cannot agree that wishing it were so, in the face of both IST-3 and the totality of available evidence, is an adequate basis for putting patients at risk from a ‘treatment’ that is almost certainly non-beneficial and, in routine clinical practice even more than under the ‘ideal’ circumstance of an RCT, very likely to cause harm.

Competing interests

JRH has consulted on lawsuits involving allegations of negligence related to non-use of thrombolysis in stroke. He donates all fees from such consulting to charity, and takes no personal reimbursement for any such work. He has no other potential conflict to declare.