How should treatment effect on spinal radiographic progression in patients with ankylosing spondylitis be measured?


  • Désirée van der Heijde,

    Corresponding author
    1. University Hospital Maastricht and CAPHRI Research Institute, Maastricht, The Netherlands
    • University Hospital Maastricht, PO Box 5800, 6202 AZ Maastricht, The Netherlands
    • Drs. van der Heijde and Landewé serve as consultants for Abbott, Amgen, Centocor, Schering-Plough, and Wyeth. The authors have full ownership of the OASIS database and have licensed pharmaceutical companies to use the OASIS database for purposes described in this report.

  • Robert Landewé,

    1. University Hospital Maastricht and CAPHRI Research Institute, Maastricht, The Netherlands

  • Sjef van der Linden

    1. University Hospital Maastricht and CAPHRI Research Institute, Maastricht, The Netherlands


Ankylosing spondylitis (AS) is a chronic inflammatory disease characterized by pain, stiffness, and often, impaired mobility of the spine. The latter is due to spinal inflammation as well as the formation of erosions and vertebral syndesmophytes, and may result in complete ankylosis (“bamboo spine”). AS has a burden of illness comparable with that of rheumatoid arthritis (RA) (1–3). Recent research from our group has now convincingly shown that radiographic damage in AS interferes with long-term functioning, independent of actual disease activity (4), and therefore, radiographic damage is an important target for therapeutic intervention.

Previous experience in the field of RA may create the impression that the causal chain of inflammation leading to radiographic damage may hold for AS, as has been proven for RA. However, firm evidence that inflammation of the spine in AS actually precedes radiographic damage in a causal manner is currently lacking. To date, it has also not been possible to predict at an early stage which patients will develop significant radiographic damage of the spine over time and which patients will not. In the Outcome in Ankylosing Spondylitis International Study (OASIS) (5) (see below), the only variable with some prognostic ability regarding progression of damage appeared to be the presence of radiographic damage at baseline (6), which is obviously not a true predictor.

Radiographic progression in AS also differs from that in RA with respect to the rate at which it can be observed. Progression in AS is a slow process, often occurring over several years (5), whereas in some RA patients with a high level of disease activity, progression can be detected reliably as early as 3 months after the first radiograph (7). In AS, it takes ∼2 years, using currently available radiographic scoring methods, to reliably distinguish true progression from background noise and measurement error (8).

The current status of therapy in AS

Until recently, AS patients were treated mainly with nonsteroidal antiinflammatory drugs (NSAIDs), which have been proven to be efficacious in alleviating signs and symptoms of the disease, and possibly in retarding radiographic progression (9). Other treatment modalities with proven efficacy with regard to signs and symptoms are physiotherapy, group exercise (10), and spa therapy (11).

A number of placebo-controlled studies have now shown that tumor necrosis factor (TNF)–blocking drugs are highly effective in improving signs and symptoms in patients with longstanding AS (12, 13). There is also evidence that these improvements coincide with improvements in the activity score as measured by magnetic resonance imaging (MRI) of the spine (14). It is not known, however, whether TNF-blocking drugs may also retard structural damage of the spine, as measured by conventional radiography. Information about radiographic progression in AS is relevant because of its association with function independent of disease activity (4). Therefore, new therapies in AS should be assessed for their potential to influence radiographic progression, as was recognized at a consensus meeting to develop recommendations for clinical trials (15).

Randomized clinical trials

Analogous to the situation in RA, the gold standard for investigating the effects of drugs on radiographic progression in AS should be the randomized clinical trial (RCT), in which patients are allocated randomly to an active intervention group (e.g., a TNF-blocking drug) or a control group. The principal argument for the use of randomization in such studies is that all factors—either known or unknown—that may influence radiographic progression would be expected to be similarly divided between the 2 groups (the “prognostic similarity” principle) (16).

The active intervention group will then be treated with the drug under investigation, and the control group preferably with the drug that represents the current standard of treatment. In an RCT with radiographic progression as the primary outcome parameter, sample sizes will be chosen such that if there is a true difference in radiographic progression between the active treatment and control groups, the experiment should confirm this difference (statistical power).

There are, however, a number of specific difficulties related to AS clinical trials, which make a proper RCT with radiographic progression as the primary end point almost, if not completely, impossible. These features are 1) the low frequency and slow rate of radiographic progression in AS, resulting in the need for a minimum followup of 2 years, 2) the lack of an appropriate comparator drug that has efficacy with regard to the signs and symptoms of AS similar to that of the TNF-blocking drugs, and 3) the lack of appropriate prognostic factors with regard to radiographic progression. These difficulties will be discussed briefly below.

Low frequency and slow rate of radiographic progression in AS

There are 3 available methods to assess and follow up radiographic damage in AS: the Bath Ankylosing Spondylitis Radiology Index (17), the Stoke Ankylosing Spondylitis Spine Score (SASSS) (18), and a modification of the SASSS (mSASSS) (19). Recently, we compared the validity of these 3 methods for use in clinical trials (20) and evaluated their use in a clinical trial of NSAID treatment in AS (9). The mSASSS was found to be the preferred method, because with this method, the most progression in a 2-year period was identified and interreader variability of change scores was the smallest; in addition, this method assesses changes in both the cervical and the lumbar spine, thus adding to the face validity. Using the mSASSS, we investigated the sample sizes that would be needed in an RCT to provide sufficient statistical power to detect true differences in radiographic progression between patients treated with active drug and controls after 1 year and 2 years of followup (see below). Because of the low number of patients with actual progression, the number of patients needed for a 1-year trial was unacceptably high. A followup duration of ∼2 years was necessary in order for a study to have a sufficiently large number of patients showing measurable progression.

Such a long duration of followup creates serious methodologic drawbacks. The most important one is probably bias due to crossover of treatment, i.e., patients in the placebo arm will start the active treatment. A second drawback is concomitant treatment with other drugs that may influence radiographic progression, such as NSAIDs. Crossover will be the inevitable consequence of a placebo-controlled trial of TNF-blocking drugs if these drugs are registered and reimbursed for clinical use, which is already the case in several countries and will be the case soon in many others. It is unethical to expect that patients with active disease will continue to take placebo for 2 years when effective drugs are available. Patients with the most active disease will be the most likely to stop taking placebo, so the patient group that completes the 2-year followup with placebo will not be representative of the original control group. Moreover, patients in the placebo group will withdraw from the trial after a short period of followup because the effectiveness of TNF-blocking drugs is apparent within only 6 weeks in the majority of patients.

Lack of an acceptable comparator drug in AS trials

One of the issues complicating long-term comparative trials in AS is that, unlike RA, in which methotrexate is considered an acceptable anchor drug for control groups in RCTs, there is currently no acceptable alternative drug for TNF blockers. Conventional disease-modifying antirheumatic drugs (DMARDs) such as sulfasalazine (21) and methotrexate (22), as well as corticosteroids, are not effective in treating axial disease. This means that any RCT with radiographic progression as an end point should be placebo controlled. However, as noted above, 2-year placebo-controlled trials in AS would be neither feasible nor considered ethical.

Lack of appropriate prognostic factors in AS

RCTs in RA usually include patients with a high level of disease activity, in order to create a trial population that is “prone to change” and has a high propensity for an unfavorable outcome (e.g., radiographic progression). In AS, however, in designing RCTs with radiographic progression as the primary end point, it is not known which patients to select. In a recent study of prediction of radiographic progression in the OASIS cohort, a prospectively followed cohort of consecutive patients with AS from 3 countries, we found that of >25 variables potentially related to radiographic progression, none had any predictive value except the baseline presence of damage itself (odds ratio 1.05 [95% confidence interval 1.03–1.07] per mSASSS unit of baseline damage) (6). The inability to define patients at risk for further radiographic progression deflates the sensitivity to change in an RCT, which additionally complicates the design of the trial.
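
To give a feel for the size of this effect, the per-unit odds ratio can be rescaled to a larger difference in baseline damage. The sketch below assumes that the odds ratio comes from a model that is linear on the log-odds scale (e.g., logistic regression), so that the odds ratio for k units is the per-unit odds ratio raised to the power k; the function name and the choice of a 10-unit difference are illustrative assumptions, not part of the cited analysis.

```python
def scale_odds_ratio(or_per_unit: float, units: float) -> float:
    """Odds ratio for a difference of `units`, assuming log-linearity.

    Assumption: the per-unit odds ratio is exp(beta) from a model that is
    linear in the log-odds, so the k-unit odds ratio is exp(beta * k).
    """
    if or_per_unit <= 0:
        raise ValueError("an odds ratio must be positive")
    return or_per_unit ** units


# Values reported in the text: OR 1.05 (95% CI 1.03-1.07) per mSASSS unit.
# The 10-unit difference is a hypothetical example.
point, lower, upper = (scale_odds_ratio(v, 10) for v in (1.05, 1.03, 1.07))
print(f"OR per 10 mSASSS units: {point:.2f} ({lower:.2f}-{upper:.2f})")
```

Under this assumption, a patient with 10 more mSASSS units of baseline damage would have roughly 1.6 times the odds of progression, which illustrates why baseline damage stands out even though the per-unit odds ratio looks small.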

Potential design of trials to assess radiographic progression in AS

Taking into account the above considerations, two possible designs for RCTs may appear attractive at first glance. The first would be a 2-year RCT comparing TNF-blocking drugs with placebo in patients with low levels of disease activity. Such a design is methodologically feasible but ethically disputable because of administration of potentially harmful drugs to patients who, according to international expert opinion (23), do not need them. Thus, such a trial would not be advocated by rheumatologists or by the pharmaceutical industry.

The second design would be a 2-year RCT comparing TNF-blocking drugs with NSAIDs in patients whose disease is already responding to NSAIDs. Such a design creates the same ethical concerns as the first because effective treatment (NSAIDs) would be replaced by TNF-blocking drugs in patients who do not a priori need the latter treatment. In addition, this design may increase the potential for Type II error because, as we have recently shown (9), NSAIDs alone can retard radiographic progression in AS. The consequence is that such a trial would have to include such a large number of patients that we consider it unfeasible.

In summary, both the low frequency/slow rate of measurable radiographic progression and the inability to predict further radiographic progression in AS make it necessary to design RCTs of at least 2 years' duration. Because of the lack of acceptable (effective) comparator drugs, such trials can only be placebo controlled. The availability of highly effective TNF-blocking drugs for short-term relief of signs and symptoms means that such RCTs cannot be thoroughly conducted for a period of 2 years. Crossover and unbalanced concomitant interventions during followup (with patients in the placebo group more likely to take NSAIDs than those in the active treatment group) will compromise prognostic comparability with respect to radiographic progression, and devalue the trial results.

Because of the above issues, alternative solutions must be explored in order to obtain the most scientifically sound data regarding the ability of TNF-blocking agents to inhibit radiographic progression in AS. It should be noted that MRI of the spine is currently not a good alternative to conventional plain radiography, because scoring of structural damage observed by MRI is still entirely experimental (24), and we are only beginning to learn about the performance of existing scoring methods in comparison with radiography in assessing change over time. Data on sensitivity to change over time are not available. Another alternative to conventional radiography that could be considered is computed tomography, a modality known for good imaging of bone structures. However, its usefulness in the assessment of AS is completely unknown. Moreover, high costs, high radiation exposure, and lack of a scoring system limit the applicability of this method.

We present here an alternative design, using a historical control group consisting of patients who participated in a followup study on the natural course of AS with conventional treatment: the OASIS cohort. We propose using baseline and 2-year radiographs from the OASIS cohort (186 patients), mixing these sets of radiographs with sets of radiographs from patients in trials of TNF-blocking drugs, and offering these sets for scoring according to the mSASSS, by readers who are blinded with regard to the cohort from which the radiographs were derived, and the time sequence. As shown below, the OASIS is a representative cohort with respect to radiographic progression in AS patients; selection of only patients with high levels of disease activity is not necessary. Furthermore, the degree of change observed in the OASIS cohort is sufficient to create an experimental environment with adequate statistical power to demonstrate true inhibition of radiographic progression.

The OASIS cohort

The OASIS cohort consists of Dutch, French, and Belgian patients with AS who have been followed up since 1996 (5). There are no specific inclusion criteria, other than consecutive enrollment of patients who have AS according to the modified New York criteria (25) and are receiving care from a rheumatologist. As such, OASIS is a cross-sectional representation of AS patients seen in rheumatology practice (Table 1).

Table 1. Characteristics of the members of the Outcome in Ankylosing Spondylitis International Study cohort, assessed at baseline (n = 186)*

Age, years                           44 ± 13/43 (34–53)
Sex, % male                          68
Disease duration, years              11.4 ± 8.6/9.6 (4.5–15.2)
Age at onset of symptoms, years      32.4 ± 10.6/31.1 (24.2–40.5)
HLA–B27 positive, %                  82
Hip involvement, %                   24
History or presence of uveitis, %    15
History or presence of IBD, %        7
History or presence of psoriasis, %  5
Presence of peripheral arthritis, %  20

* Except where indicated otherwise, values are the mean ± SD/median (interquartile range).
IBD = inflammatory bowel disease (Crohn's disease or ulcerative colitis).

Since the OASIS cohort is an unselected cross-sectional representation of AS patients, disease activity at baseline could be expected to be lower compared with AS patients in clinical trials (Table 2). The mean baseline score on the Bath Ankylosing Spondylitis Disease Activity Index (26) in the OASIS cohort was <4, and thus below the level for inclusion in some RCTs of TNF-blocking drugs in AS. As expected, the mean erythrocyte sedimentation rate was also quite low.

Table 2. Patient-reported disease activity and levels of acute-phase reactants in the Outcome in Ankylosing Spondylitis International Study cohort, assessed at baseline

Parameter*                             Mean ± SD/median (interquartile range)
BASDAI                                 3.5 ± 2.1/3.3 (1.8–5.1)
Morning stiffness severity             3.6 ± 2.9/2.7 (0.9–6.0)
Morning stiffness duration             3.7 ± 3.1/3.0 (0.9–6.0)
Global assessment of disease activity  3.8 ± 2.8/3.6 (1.2–5.6)
Back pain                              4.4 ± 2.7/4.5 (2.2–6.6)
BASFI                                  3.5 ± 2.6/3.3 (1.1–5.3)
C-reactive protein, mg/dl              1.6 ± 2.1/0.7 (0.6–1.6)
ESR, mm/hour                           13.3 ± 14.3/10.0 (4.0–17.0)

* BASDAI = Bath Ankylosing Spondylitis Disease Activity Index; BASFI = Bath Ankylosing Spondylitis Functional Index; ESR = erythrocyte sedimentation rate.
Assessed by the patient on a 10-cm visual analog scale.

Treatment in the OASIS was according to “pre-TNF” standards (Table 3). The majority of patients received NSAIDs, and only a few were treated with DMARDs (sulfasalazine). None of the patients was treated with TNF-blocking drugs or with other biologic agents during the first 4 years of followup. One of the aims of the OASIS was to explore radiographic progression of the axial disease, and for this purpose, radiographs of the cervical and lumbar spine, as well as of the pelvis, were obtained in every patient at baseline, at 1 year, at 2 years, and at 2-year intervals thereafter.

Table 3. Treatment with nonsteroidal antiinflammatory drugs (NSAIDs), disease-modifying antirheumatic drugs (DMARDs), and other agents in the Outcome in Ankylosing Spondylitis International Study cohort, assessed at baseline

Treatment          % of patients
Simple analgesics  12

* Sulfasalazine (18 patients), methotrexate (2 patients), or intramuscular gold (1 patient).

Radiographic progression in the OASIS cohort

Radiographic progression was assessed independently, using the mSASSS (scale of 0–72), by 2 readers who were not aware of the time sequence. The mean ± SD 2-year progression scores were 1.38 ± 2.86 mSASSS units for reader 1 and 1.05 ± 2.89 for reader 2.

In order to test whether the OASIS is a representative cohort with respect to radiographic progression, we assessed radiographic progression under the same reading conditions in a different RCT comparing continuous NSAID use versus NSAID use on demand (9). Only reader 2 scored the radiographs in this trial, and the mean ± SD progression as scored by this reader was 0.91 ± 2.19 mSASSS units, which is close to the mean progression scored by the same reader in the OASIS cohort. These results indicate that the progression measured in the OASIS cohort is a representative, reproducible estimate of 2-year radiographic progression in AS.

We tested the influence of disease activity on radiographic progression in the OASIS cohort by comparing radiographic progression in patients who would have fulfilled the inclusion criteria in one of the RCTs with TNF-blocking drugs (the most strict inclusion criteria) versus that in patients not fulfilling these criteria due to low disease activity. The results of this exercise are depicted as probability plots (Figure 1), as recently recommended by us (27). Briefly, a probability plot depicts the actual scores of all patients against their cumulative probability (scores are ordered from low to high, and are assigned a cumulative probability). The probability plots in Figure 1 show the following: 1) ∼35% of all patients have some degree of radiographic progression (change in mSASSS >0) over a 2-year period; 2) “negative progression scores” occur in ∼10% of the patients as a consequence of measurement error with blinded reading order; and 3) the 2 curves almost overlie one another, which means that selection of patients for high disease activity does not influence radiographic progression.

Figure 1.

Probability plot of patients in the Outcome in Ankylosing Spondylitis International Study cohort, stratified by eligibility for a randomized controlled trial of tumor necrosis factor (TNF)–blocking drugs based on inclusion and exclusion criteria. mSASSS = modified Stoke Ankylosing Spondylitis Spine Score.
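
The construction of a probability plot, as described above, can be sketched in a few lines. This is a minimal illustration, not the procedure used for Figure 1: the change scores are invented, the function names are hypothetical, and the cumulative probability is assumed to be rank divided by the number of patients (other plotting positions are in use).

```python
def probability_plot_points(change_scores):
    """Return (cumulative probability, score) pairs for a probability plot.

    Scores are ordered from low to high; the i-th ordered score (1-based)
    is assigned cumulative probability i / n. This plotting position is an
    assumption made for illustration.
    """
    ordered = sorted(change_scores)
    n = len(ordered)
    return [((i + 1) / n, score) for i, score in enumerate(ordered)]


def fraction_where(change_scores, condition):
    """Fraction of patients whose change score satisfies `condition`."""
    return sum(1 for s in change_scores if condition(s)) / len(change_scores)


# Hypothetical 2-year mSASSS change scores for 10 patients (not OASIS data).
hypothetical_changes = [0, 0, -1, 2, 0, 5, 0, 1, 0, 0]

points = probability_plot_points(hypothetical_changes)
progressed = fraction_where(hypothetical_changes, lambda s: s > 0)
negative = fraction_where(hypothetical_changes, lambda s: s < 0)

for probability, score in points:
    print(f"{probability:.1f}\t{score}")
print(f"progression: {progressed:.0%}, negative scores: {negative:.0%}")
```

Plotting the score against the cumulative probability for each subgroup on the same axes gives curves that can be compared directly; curves that nearly coincide indicate similar distributions of change scores in the 2 subgroups.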

Radiographic progression in the OASIS cohort as a benchmark

The question is now whether the level of radiographic progression as measured by the mSASSS over a 2-year period suffices to serve as a benchmark for the effects of new drugs, e.g., TNF blockers. The null hypothesis in such an experiment is that radiographic progression in both groups (active intervention and historical control) is similar, or that the effect of the new drug (TNF blocker) on radiographic progression is negligible. If the null hypothesis can be rejected by statistical testing, the new drug is considered to be better at inhibiting radiographic progression than the “usual treatment” in the historical control group, and it may be claimed that such a drug has the potential to influence structural damage. A useful means to explore the feasibility of such an approach, which does not differ fundamentally from sample size calculations in the design phase of an RCT, is performance of power calculations. Figure 2 shows the results of such power calculations for the OASIS cohort, presented as sample sizes (on the y-axis) against minimum detectable differences, depicted as treatment contrast (on the x-axis).

Figure 2.

Sample size calculations (“power curves”) that describe the relationship between treatment contrast and standard deviation, and the number of patients required, in a trial aimed at detecting a contrast in radiographic progression between an “untreated” control group (e.g., the Outcome in Ankylosing Spondylitis International Study [OASIS] cohort) and an active intervention group (e.g., patients treated with tumor necrosis factor [TNF]–blocking drugs). Treatment contrasts and SDs are chosen based on observations in the OASIS cohort. Statistical power is fixed at 0.80. The probability of a Type I (alpha) error is set at 0.05.

A noteworthy assumption is that rereading of the radiographs of patients in the OASIS cohort under blinded conditions, as described above, will result in a mean progression score of 1.05 mSASSS units (assay sensitivity). This number serves as a benchmark against which different putative progression scores derived from the trial with the TNF-blocking drug (but scored in the same reading session) are compared. The difference between the benchmark (1.05 mSASSS units) and the putative score from the trial with the new drug is called the contrast, or, as in RCTs, the treatment effect. A second assumption is that true negative progression scores do not exist in AS, so the treatment effect will be 1.05 mSASSS units at most. Repeated t-test–based sample size calculations were performed to construct the power curves in Figure 2, which reflect the relationship between required sample sizes and minimum detectable difference, for different levels of the standard deviation of the mean progression score. The curves show that studies with 130–150 patients per treatment group provide sufficient statistical power to detect differences of 0.8–1.0 mSASSS units in radiographic progression between treatment groups. This number of patients is feasible in the context of the OASIS cohort, which includes 2-year radiographic data from 186 AS patients. It should be noted that the t-test–based sample size calculation may have caused an overestimation of the actual required number, because of the skewness of the radiographic data. Usually, nonparametric statistical testing or parametric statistical testing after a data normalization procedure increases statistical power.
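
The kind of calculation behind such power curves can be illustrated with the standard normal-approximation formula for comparing 2 means. This is a rough sketch rather than the calculation used for Figure 2: the two-sided test, the equal group sizes, the common SD in both groups, and the function name are assumptions made for illustration. The SD of 2.89 mSASSS units is the value reported above for reader 2.

```python
import math
from statistics import NormalDist


def patients_per_group(contrast, sd, alpha=0.05, power=0.80):
    """Approximate number of patients per group to detect `contrast`.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) * sd / contrast) ** 2,
    the normal approximation to the two-sample t-test with equal group
    sizes and a common SD (both assumptions). An exact t-based
    calculation gives slightly larger numbers.
    """
    if contrast <= 0 or sd <= 0:
        raise ValueError("contrast and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / contrast) ** 2)


# Contrasts in mSASSS units; the values below 1.0 are illustrative choices.
for contrast in (0.6, 0.8, 1.0):
    n = patients_per_group(contrast, sd=2.89)
    print(f"contrast {contrast:.1f}: {n} patients per group")
```

Under these assumptions, a contrast of 1.0 mSASSS units with an SD of 2.89 requires about 130 patients per group. The required number rises quickly as the contrast shrinks or the SD grows, which is why the curves are drawn for several SD levels.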

In summary, we have shown here that the OASIS cohort is a representative cohort of AS patients in terms of radiographic progression, and that the level of radiographic progression in patients in the OASIS cohort is not dependent on the level of disease activity. Therefore, radiographic findings in this group can appropriately be used as a background against which the effects of (TNF-blocking) drugs can be tested.


The advantages of adopting a study design with a historical control group should be weighed against the disadvantage of inability to perform an RCT due to feasibility and ethical considerations. An advantage of a historical control group design is that it eliminates the need for 2-year placebo-controlled trials of TNF-blocking drugs, and the effect of TNF-blocking drugs on radiographic progression can be assessed without delay, instead of after 2 years. The major disadvantage of this design is potential prognostic dissimilarity between the OASIS cohort and the efficacy trial. The question thus becomes whether this type of prognostic dissimilarity is so relevant that it outweighs the advantages.

First, prognostic similarity, which is expected “by definition” with the use of a randomization procedure, is important in that differences at baseline determine to some extent the outcome of interest. As noted above, we could not identify variables that had prognostic impact on radiographic progression in the OASIS, the only exception being the presence of radiographic damage itself. The absence of known predictors of radiographic progression does not preclude the existence of yet-unknown predictors (e.g., geographically determined), but unknown prognostically relevant predictors would be important only if they are unequally distributed among the treatment groups. Furthermore, all phase 3 trials of TNF-blocking drugs in AS that are currently being performed include patients from both the US and Europe. Unexpected geographic differences with respect to radiographic progression can easily be traced (and adjusted for) in the analysis.

The second and more important argument favoring the use of a historical control group over an RCT design in this specific context is the above-mentioned bias that will occur in a 2-year placebo-controlled RCT, due to treatment crossover and concomitant interventions. This kind of bias, which has the same consequences regarding comparability of groups as does prognostic dissimilarity at baseline (16), can be avoided entirely by use of a historical control group design. It is not reasonable to assume that the potential prognostic dissimilarity at baseline in the historical control group design outweighs the longitudinal bias that will compromise the RCT design.

A third argument moderating the importance of the RCT design in this setting is that the actual acquisition of the data, i.e., the reading of radiographs, will occur concurrently and under strictly blinded conditions in both groups. Looking at treatment effects in terms of biases and error (noise), one can distinguish noise generated by prognostic dissimilarity (baseline and/or longitudinal) and noise generated by measurement error (radiographic scoring). We do not know how these sources of error quantitatively relate to one another, but we do know that the signal of interest (radiographic progression) is rather low, and that reading variation should be kept as low as possible in order to pick up the signal. Presumably, optimization of the quality of the reading procedure greatly outweighs optimization of the trial design, and we therefore recommend thorough training of the readers in order to improve reliability, and use of readers who are familiar with the types of abnormalities often seen in AS.

Taken together, these arguments provide evidence in favor of the use of historical control patients rather than designing a 2-year placebo-controlled RCT in order to investigate whether TNF-blocking drugs inhibit radiographic progression in AS. The OASIS cohort can be considered an appropriate example of a feasible historical cohort that can be used as a control group.