Do Patient Preferences for Core Outcome Domains for Chronic Gout Studies Support the Validity of Composite Response Criteria?

Abstract

Objective

To determine patient-derived weights or prioritization for core outcome domains in chronic gout clinical studies.

Methods

Three patient groups participated in a conjoint decision-making exercise using 1000Minds software, which asked participants to make repeated judgments regarding which of 2 hypothetical patients with gout represented the best response to treatment. Each scenario compared 2 patients on the basis of change in 2 of 5 core outcome domains at a time. Agreement of 80% of the group was required to answer each scenario. Re-voting was performed once after discussion in instances of disagreement.

Results

The relative importance accorded to each outcome domain was different across the 3 groups of patients. There was some consistency that tophi was the least or second to least important outcome domain for every group and pain between attacks was ranked in the bottom third of priority for all groups. Gout attacks were ranked as the second or third most important domain in each group. However, the relative importance of serum urate (SUA) and activity limitations was quite different among the 3 groups, with 1 group ranking SUA as the most important outcome and 1 group ranking it as the second to least important outcome.

Conclusion

Despite some consistency in the relative value of some outcome domains for chronic gout studies, there is sufficient disagreement in the relative importance of other domains of outcome to challenge the validity of constructing a composite index of response that would be applicable to most gout patients.

INTRODUCTION

Composite response measures that incorporate and summarize information from a number of individual response indicators can be useful outcome measures for clinical trials ([1]). With the introduction in recent years of new agents, including febuxostat, pegloticase, and novel uricosuric agents, there is a need for standardization of outcome measurement in trials of treatment for chronic gout ([2]). In addition, consensus work through the Outcome Measures in Rheumatology (OMERACT) initiative has identified several core domains of measurement that should be considered in studies of chronic gout ([3]).

Composite measures have been useful in clinical trials for rheumatoid arthritis (RA) ([4]), psoriatic arthritis ([5, 6]), osteoarthritis ([7]), and ankylosing spondylitis ([8, 9]). Principal advantages of composite indices are to ease interpretation of trial results, to allow transformation of response rates into number needed to treat, to reduce Type I error rate inflation by performing fewer statistical tests, and where the composite index or response occurs more commonly than individual response components, to reduce the sample size needed to detect a response. However, important assumptions are implicit in the use of composite indices and evaluation of these is rarely reported. In particular, the individual response components should vary appropriately together in response to the intervention, and the different components should be valued similarly by patients, clinicians, or researchers. An important example of a misleading composite index is in cardiovascular studies that combine death and hospital admission events as a single index ([10, 11]). Such outcomes clearly have different relative values, and it is also possible that interventions could have similar effects on a composite index that mask potentially important differences in the underlying individual response components such as death rates.

Composite response measures are examples of multiple criteria decision making (MCDM), a class of activities characterized as “procedures by which concerns about multiple conflicting criteria can be formally incorporated into the management planning process” ([12]). In the case of response criteria for gout clinical trials, multiple indicators of response are incorporated into a decision as to whether the patient has responded or not to the intervention. Other examples of MCDM include classification criteria ([13]), governmental decision support for transport projects ([14]), access to publicly funded elective health services ([15]), and marketing research. This list is not exhaustive. The particular approach to MCDM we introduce here is Potentially All Pairwise RanKings of all possible Alternatives (PAPRIKA) as implemented in 1000Minds software.

This study aims to determine, from the patient perspective, the relative importance of the chronic gout core outcome domains endorsed by OMERACT for measurement in clinical trial settings. In particular, we test the notion that patients rank the importance of each domain similarly and that each domain is valued equally.

Box 1. Significance & Innovations

  • The assumptions underlying composite response criteria are rarely examined.
  • A novel approach using conjoint decision analysis with gout patients showed important variability in the rank ordering of outcome components and in the relative weighting of outcome components.
  • Composite response criteria may not be valid for chronic gout trials, and alternative approaches to multiple outcome assessment should be considered.

PATIENTS AND METHODS

Patients

Three groups of patients with gout were recruited by mailed invitations to patients in clinical registries from rheumatology departments in Auckland or Wellington, New Zealand. There were no selection criteria except for availability to attend a single group meeting. Participants were consecutive responders to the mailed invitations until a group of 8 to 10 people was identified. Each group had a meeting during which the 1000Minds activity, described below, took place and which concluded with refreshments. Before the meeting, information about the study was sent to each participant, informed consent was obtained, and participants completed a questionnaire about themselves and their gout history. The New Zealand Ministry of Health Multiregional Ethics Committee approved the study.

At the meeting, an explanation was given about outcome measurement in clinical trials and how the outcome domains for chronic gout studies were chosen. The participants then responded to a decision survey, in which a series of hypothetical scenarios was presented to the group; each scenario compared 2 patients on the basis of changes in 2 outcome domains. An example is shown in Figure 1. The group was asked to choose which of the 2 patients had responded best to treatment. This instruction was also formulated as “to imagine which patient you would rather be.” Each member of the group voted for their preference without any discussion. Where there were 6 or fewer participants in the group, no more than 1 person could disagree for the consensus decision to be made; otherwise, the decision was left unanswered and the group went on to the next decision. For groups of 7 or more, no more than 2 persons could disagree for the consensus decision to be made. In addition to expressing a preference for either hypothetical patient, the groups could alternatively choose that the scenario was impossible or that the 2 patient scenarios were equally preferred.

Figure 1. Example of a decision scenario used in the 1000Minds exercise.

In instances of disagreement, participants were encouraged to discuss the reasons for their choice and voting was repeated once to give participants the opportunity to change their mind. The decision survey was conducted over three 45-minute periods with breaks between periods.
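The consensus rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text; the function name, the vote labels, and the use of the most frequent vote as the candidate decision are our assumptions, not part of the 1000Minds software.

```python
from collections import Counter

def consensus_decision(votes):
    """Return the agreed option, or None if the scenario must be skipped.

    `votes` holds one label per participant, e.g. "A", "B",
    "impossible", or "equal" (hypothetical labels).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    # Candidate decision: the option chosen by the most participants.
    option, count = Counter(votes).most_common(1)[0]
    dissenters = len(votes) - count
    # Groups of 6 or fewer tolerate 1 dissenter; groups of 7 or more tolerate 2.
    allowed_dissent = 1 if len(votes) <= 6 else 2
    return option if dissenters <= allowed_dissent else None

# Group of 6 with one dissenter: consensus on "A".
print(consensus_decision(["A"] * 5 + ["B"]))
# Group of 8 with three dissenters: no consensus, scenario is skipped.
print(consensus_decision(["A"] * 5 + ["B"] * 3))
```

A skipped scenario (a `None` result here) corresponds to the case where the group moved on to the next decision after discussion and one re-vote.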

PAPRIKA and 1000Minds software

The use of a preference-based approach enables the direct weighting of multiple potential indicators of response based on the direct views of participants. PAPRIKA is a mathematical algorithm for constructing the relative weights for each response indicator. This is based upon the results of a series of pairwise comparisons of undominated pairs of all possible alternatives. Each pairwise scenario compares different levels of 2 indicators at a time, in order for the respondent to determine which combination of indicators has “responded” the most ([16]). An “undominated pair” is characterized by a higher ranking category for at least one indicator and a lower ranking category for the other indicator. Pairs that are implicitly ranked as corollaries of explicitly ranked pairs are also identified by the algorithm, which leads to efficiency of the algorithm and the requirement for only a portion of all possible combinations to be evaluated by the decision maker. Sufficient pairwise comparisons are made until the algorithm identifies a series of weights that is consistent with all the decisions that have been made.

PAPRIKA is implemented in 1000Minds software. Two important limitations of this approach are, first, that relatively few indicators and levels of each indicator can be used without requiring an overwhelming number of pairwise comparisons to be performed and, second, that the algorithm generates an additive, linear structure for the response criteria without incorporation of interactions between indicators.
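The additive structure mentioned above can be written compactly. The notation is ours rather than the software's: $x_d$ denotes the level observed on indicator $d$, $w_d(\cdot)$ the weight attached to that level, and $D$ the number of indicators ($D = 5$ in this study).

```latex
S(x) = \sum_{d=1}^{D} w_d(x_d),
\qquad w_d(\text{worse}) = 0,
\qquad \sum_{d=1}^{D} w_d(\text{a lot better}) = 100
```

The two constraints reflect how the weights are reported in Table 3. The absence of any term involving two indicators at once, such as $w_{de}(x_d, x_e)$, is what is meant by the lack of interactions between indicators.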

The core set of domains for chronic gout studies identified and endorsed at OMERACT 9 ([3]) comprises pain between gout flares, gout attacks, activity limitations, tophi, serum urate (SUA), patient global assessment, and health-related quality of life (HRQOL). To keep the exercise feasible, it was necessary to restrict the number of response indicators (and therefore the number of decisions to be made), so 2 core set indicators, patient global assessment and HRQOL, were not included. Patient global assessment was excluded because its meaning and interpretation are known to vary ([17]). HRQOL was excluded because it is itself multicomponent and the 1000Minds approach is limited by the complexity of the indicators. Each of the other indicators was categorized as “worse,” “the same,” “a bit better,” or “a lot better.” For the 5 indicators and 4 levels of each indicator, the 1000Minds algorithm predicted that as many as 360 pairwise decisions would need to be made. However, in the implementation a single decision can often resolve multiple pairwise comparisons, so many fewer actual decisions than this were used in the study.
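The figure of 360 is consistent with a count of the undominated pairs that can be defined on 2 indicators at a time. The sketch below enumerates them under that assumption; the source does not state how the software arrived at the number, and the function name is hypothetical.

```python
from itertools import combinations

def undominated_pairs(n_indicators, n_levels):
    """List pairs of hypothetical patients that differ on exactly 2 indicators,
    where each patient is better on one indicator and worse on the other."""
    pairs = []
    for ind_a, ind_b in combinations(range(n_indicators), 2):
        # combinations() yields (low, high) with low < high.
        for a_low, a_high in combinations(range(n_levels), 2):
            for b_low, b_high in combinations(range(n_levels), 2):
                patient_1 = {ind_a: a_high, ind_b: b_low}  # better on A, worse on B
                patient_2 = {ind_a: a_low, ind_b: b_high}  # worse on A, better on B
                pairs.append((patient_1, patient_2))
    return pairs

# 5 indicators, 4 levels each: 10 indicator pairs x 6 x 6 level choices.
print(len(undominated_pairs(5, 4)))  # 360
```

Each additional indicator or level increases this count quickly, which illustrates why only a limited number of indicators and levels is practical.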

RESULTS

Two groups were formed in Wellington (n = 6 and n = 8) and 1 in Auckland (n = 7). Each group was facilitated and led by the same investigators (WJT and MB). The characteristics of the groups are shown in Table 1. Participants were mainly men and had a wide range of disease duration and severity.

Table 1. Demographic and disease characteristics of participants

| Characteristic | Wellington 1 (n = 6) | Auckland (n = 7) | Wellington 2 (n = 8) |
| --- | --- | --- | --- |
| Age, mean (range) years | 59.7 (46–69) | 57.0 (38–77) | 62.3 (51–75) |
| Sex, M:F | 6:0 | 7:0 | 7:1 |
| Ethnicity, no.ᵃ | | | |
| New Zealand European | 5 | 6 | 3 |
| Samoan | 1 | 1 | 1 |
| Maori | | | 2 |
| Asian | | 1 | 2 |
| Time since first attack, mean (range) years | 17.8 (5–32) | 16.1 (7–25) | 20.9 (10–35) |
| Attacks in previous 12 months, mean (range) no. | 2.5 (1–5) | 7 (0–20) | 11 (0–80) |
| Time since most recent attack, mean (range) weeks | 16 (1–52) | 27 (1–156) | 163 (1–728) |
| Participants with tophi, no. | 4 | 1 | 5 |

ᵃ Participants could nominate more than 1 ethnicity.

The number of actual decisions, completeness of the exercise, the number of scenarios that could not be decided, and the average time taken for each decision are shown in Table 2. None of the groups were able to entirely complete the decision survey; 307 of 360 (85%) to 334 of 360 (93%) of the potential decisions were resolved in the available time. Each group evaluated a similar number of scenarios (53–64), which represented 15% of the potential number of scenarios, given the number of indicators and levels of each indicator. There was a similar extent of disagreement within each group: 20 (26%) to 29 (31%) of the scenarios presented to a group (answered, impossible, or skipped) could not be agreed upon despite discussion and re-voting. Where decisions were made, these took an average of 1 minute, and this was similar among the 3 groups.

Table 2. Process variables for each decision survey group

| Process variable | Wellington 1 (total answered = 53) | Auckland (total answered = 64) | Wellington 2 (total answered = 61) |
| --- | --- | --- | --- |
| Consensus, no. (%) | | | |
| Full (no disagreement) | 35 (66) | 9 (14) | 19 (31) |
| Partial (1 person disagreed) | 18 (34) | 21 (32) | 20 (32) |
| Partial (2 people disagreed) | N/A | 34 (53) | 22 (37) |
| Impossible | 4 (5) | 0 | 0 |
| Skip (no consensus) | 20 (26) | 29 (31) | 27 (31) |
| Algorithm completed, no. of total (%) | 315 of 344ᵃ (92) | 334 of 360 (93) | 307 of 360 (85) |
| Time per decision for skipped or impossible decisions, minutes | 2.03 | 2.01 | 1.75 |
| Time per decision for agreed decisions, minutes | 1.01 | 0.96 | 1.14 |

ᵃ Total of 344 potential decisions because 16 scenarios were deemed impossible.
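The percentages for skipped scenarios and algorithm completion in Table 2 can be reproduced from the counts. The sketch below assumes, as the arithmetic suggests, that the "Skip" percentages use all scenarios presented to a group (answered, impossible, and skipped) as the denominator; the dictionary keys are our own labels.

```python
# Counts taken from Table 2; key names are hypothetical labels.
TABLE_2 = {
    "Wellington 1": {"answered": 53, "impossible": 4, "skipped": 20,
                     "resolved": 315, "potential": 344},
    "Auckland": {"answered": 64, "impossible": 0, "skipped": 29,
                 "resolved": 334, "potential": 360},
    "Wellington 2": {"answered": 61, "impossible": 0, "skipped": 27,
                     "resolved": 307, "potential": 360},
}

def summarize(row):
    """Recompute the skip and completion percentages for one group."""
    presented = row["answered"] + row["impossible"] + row["skipped"]
    return {
        "presented": presented,
        "skipped_pct": round(100 * row["skipped"] / presented),
        "completed_pct": round(100 * row["resolved"] / row["potential"]),
    }

for group, row in TABLE_2.items():
    print(group, summarize(row))
```

Under this assumption the recomputed values match the reported 26%, 31%, and 31% skipped and 92%, 93%, and 85% completed.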

The numerical weights for each level of each outcome domain calculated by PAPRIKA are shown in Table 3. These are scaled so that the weights for the highest level of each domain sum to 100. The rank order and relative weighting for the domains are shown in Figure 2. The priorities differed among the groups. For the Auckland group, the rank order from most to least important was SUA, gout attacks, activity limitations, and then tophi and pain between attacks (equally ranked as least important). For the first Wellington group, the rank order from most to least important was activity limitations, SUA, gout attacks, pain between attacks, and tophi. For the second Wellington group, the rank order from most to least important was activity limitations, gout attacks, pain between attacks, SUA, and tophi.

Table 3. Numerical weights for each level of each outcome domain

| Outcome domain and level | Wellington 1 | Auckland | Wellington 2 |
| --- | --- | --- | --- |
| Number of gout attacks has … | | | |
| Worsened | 0 | 0 | 0 |
| Not changed | 8.5 | 14.2 | 13.0 |
| Improved a bit | 12.8 | 20.0 | 22.2 |
| Improved a lot | 15.9 | 20.6 | 27.8 |
| Level of pain between gout attacks has … | | | |
| Worsened | 0 | 0 | 0 |
| Not changed | 7.9 | 9.0 | 13.9 |
| Improved a bit | 12.8 | 12.9 | 20.4 |
| Improved a lot | 15.2 | 13.5 | 24.1 |
| Serum urate level has … | | | |
| Worsened | 0 | 0 | 0 |
| Not changed | 12.8 | 20.6 | 2.8 |
| Improved a bit | 20.1 | 28.4 | 6.5 |
| Improved a lot | 25.0 | 38.1 | 9.3 |
| Size or number of tophi have … | | | |
| Worsened | 0 | 0 | 0 |
| Not changed | 5.5 | 6.5 | 2.8 |
| Improved a bit | 6.1 | 9.7 | 3.7 |
| Improved a lot | 6.7 | 13.5 | 5.6 |
| Ability to do usual activities has … | | | |
| Worsened | 0 | 0 | 0 |
| Not changed | 25.0 | 8.4 | 21.3 |
| Improved a bit | 28.7 | 9.0 | 28.7 |
| Improved a lot | 37.2 | 14.2 | 33.3 |

Figure 2. Stacked bar graph showing the relative order (least to most) in importance and the relative size of the importance accorded to each outcome domain. SUA = serum urate; Well = Wellington; Akl = Auckland.

Since the decision exercise was incomplete, the underlying mathematical structure developed for each group was directly inspected by a 1000Minds developer. This confirmed that the precise order of importance between “pain between attacks” and “gout attacks” was not fully resolved in the first Wellington group. Similarly, in the Auckland group, the precise order of importance between “pain between attacks” and “tophi” was not fully resolved. This can be readily appreciated from the equal or nearly equal weights allocated to these pairs of outcome domains (Table 3). The relationships between domains in the second Wellington group were all fully resolved despite incompleteness of the decision survey.

Tophi were consistently ranked as the least or second to least important domain in all 3 groups. Pain between attacks was ranked in the bottom third of priority for all groups. Gout attacks were ranked as the second or third most important domain in each group. However, the relative importance of SUA and activity limitations differed among the 3 groups.
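The rank orders summarized above can be re-derived from the "improved a lot" weights in Table 3, which also makes it easy to check that these weights sum to approximately 100 in each group. This is a minimal sketch; the variable and function names are ours.

```python
# "Improved a lot" weights from Table 3 for each group.
TOP_WEIGHTS = {
    "Wellington 1": {"gout attacks": 15.9, "pain between attacks": 15.2,
                     "serum urate": 25.0, "tophi": 6.7,
                     "activity limitations": 37.2},
    "Auckland": {"gout attacks": 20.6, "pain between attacks": 13.5,
                 "serum urate": 38.1, "tophi": 13.5,
                 "activity limitations": 14.2},
    "Wellington 2": {"gout attacks": 27.8, "pain between attacks": 24.1,
                     "serum urate": 9.3, "tophi": 5.6,
                     "activity limitations": 33.3},
}

def rank_order(weights):
    """Domains from most to least important; tied domains keep input order."""
    return sorted(weights, key=lambda domain: weights[domain], reverse=True)

for group, weights in TOP_WEIGHTS.items():
    total = sum(weights.values())
    print(f"{group} (sum = {total:.1f}): {', '.join(rank_order(weights))}")
```

Serum urate comes first for Auckland, second for Wellington 1, and fourth for Wellington 2, whereas tophi is last or tied for last in every group.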

DISCUSSION

We found that the patients with gout did not give each component of a potential composite index an equal weight. We also found lack of agreement for the rank ordering of importance of individual components of a potential composite index of response. Thus, any composite index using these components will fail to reflect the relative importance of each component of the index for some patients with gout. This is problematic for the construction of valid composite response criteria for chronic gout studies.
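To make this concrete, the sketch below scores two hypothetical patients with each group's Table 3 weights, using a simple additive sum. The two patients are invented for illustration, and the scoring function is our reading of the additive structure rather than output from the 1000Minds software.

```python
LEVELS = ["worsened", "not changed", "improved a bit", "improved a lot"]

# Table 3 weights, listed in LEVELS order for each domain.
WEIGHTS = {
    "Wellington 1": {
        "gout attacks": [0, 8.5, 12.8, 15.9],
        "pain between attacks": [0, 7.9, 12.8, 15.2],
        "serum urate": [0, 12.8, 20.1, 25.0],
        "tophi": [0, 5.5, 6.1, 6.7],
        "activity limitations": [0, 25.0, 28.7, 37.2],
    },
    "Auckland": {
        "gout attacks": [0, 14.2, 20.0, 20.6],
        "pain between attacks": [0, 9.0, 12.9, 13.5],
        "serum urate": [0, 20.6, 28.4, 38.1],
        "tophi": [0, 6.5, 9.7, 13.5],
        "activity limitations": [0, 8.4, 9.0, 14.2],
    },
    "Wellington 2": {
        "gout attacks": [0, 13.0, 22.2, 27.8],
        "pain between attacks": [0, 13.9, 20.4, 24.1],
        "serum urate": [0, 2.8, 6.5, 9.3],
        "tophi": [0, 2.8, 3.7, 5.6],
        "activity limitations": [0, 21.3, 28.7, 33.3],
    },
}

def composite_score(group, patient):
    """Additive score: sum of the weights for the patient's level on each domain."""
    weights = WEIGHTS[group]
    return sum(weights[domain][LEVELS.index(level)]
               for domain, level in patient.items())

# Hypothetical patients: identical except for which single domain improved a lot.
unchanged = {domain: "not changed" for domain in WEIGHTS["Auckland"]}
patient_x = {**unchanged, "serum urate": "improved a lot"}
patient_y = {**unchanged, "gout attacks": "improved a lot"}

preferred = {}
for group in WEIGHTS:
    x = composite_score(group, patient_x)
    y = composite_score(group, patient_y)
    preferred[group] = "X" if x > y else "Y" if y > x else "tie"
    print(f"{group}: X = {x:.1f}, Y = {y:.1f}, better response = {preferred[group]}")
```

With these inputs, the Wellington 1 and Auckland weights rate patient X as the better responder, whereas the Wellington 2 weights rate patient Y as the better responder. A single fixed set of weights would therefore contradict at least one group's stated preferences.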

It is useful to consider the rationale for the development of composite response criteria in RA and whether the same issues are present for chronic gout. The American College of Rheumatology 20% improvement (ACR20) response criteria for RA are a success story for composite criteria in rheumatology. They were developed partly in response to the use of multiple measures of outcome in RA clinical trials. Interestingly, one of the early calls for changes in the way RA clinical trials were designed and analyzed did not recommend composite measures as a way to resolve the problem of multiplicity because of the issue of appropriate weights for constituent components for a composite index ([18]). Nevertheless, the ACR20 definition of response that incorporated changes in 7 core domains of outcome was developed and published in 1995 ([4]).

We suggest that multiplicity, interpretability, and statistical power issues are not as relevant to studies of chronic gout. The pathologic mechanisms that underlie gout (hyperuricemia and crystal deposition) are well understood. It is biologically implausible that therapies that do not cause a sustained reduction in SUA will cause meaningful long-term clinical benefit. Thus, a crucial outcome for urate-lowering therapy is SUA. We do not suggest that other outcome domains should not be measured. Patient-reported outcomes and other OMERACT core domains are also very important and should be reported. However, these outcomes do not necessarily need to be incorporated into a composite index. Separation of HRQOL from composite indicators of response is actually necessary to show that therapies make a difference in people's lives, and not just to the disease. Identification of one primary outcome measure is the most straightforward means of resolving the issue of multiple inferential testing. All subsequent inferential tests are then contingent upon the primary outcome demonstrating a statistically significant and clinically meaningful difference. This can be done in a prespecified hierarchical manner. There are other statistical techniques for dealing with multiple outcomes that do not increase the chances of a Type I error, such as global statistical tests that take correlations among outcomes into consideration ([19]) or comparison of summed ranks of multiple end points ([20]).
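One common way to implement a prespecified hierarchy is fixed-sequence testing, in which each outcome is tested at the full significance level only if every outcome before it in the sequence was significant. The sketch below assumes that procedure (the text does not name a specific one), and the outcome order and p-values are hypothetical.

```python
def fixed_sequence_test(ordered_results, alpha=0.05):
    """Return the outcomes declared significant under fixed-sequence testing.

    `ordered_results` is a list of (outcome_name, p_value) pairs in the
    prespecified order, with the primary outcome first.
    """
    significant = []
    for outcome, p_value in ordered_results:
        if p_value > alpha:
            break  # Testing stops at the first non-significant outcome.
        significant.append(outcome)
    return significant

# Hypothetical trial of a urate-lowering therapy.
results = [
    ("serum urate", 0.001),        # primary outcome
    ("gout attacks", 0.03),
    ("activity limitations", 0.20),
    ("tophi", 0.01),               # not tested: an earlier outcome failed
]
print(fixed_sequence_test(results))
```

Because testing stops at the first failure, the overall Type I error rate stays at `alpha` without any adjustment of the individual tests, at the cost of making later outcomes depend entirely on earlier ones.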

Clinical interpretation of the results of clinical trials for chronic gout requires measurements of the individual core outcome domains. This is because not all domains are applicable to all patients (e.g., pain between attacks or tophi). In addition, some domains may change independently of the disease process (e.g., HRQOL or activity limitations). Interpretation of composite response criteria is likely to require inspection of how individual components changed, making the composite criteria more or less redundant.

Statistical power in a clinical trial depends on the strength of the intervention in relation to a primary outcome measure, the clinically meaningful difference, and the variability of the primary outcome variable. This means that the closer the outcome measure is mechanistically related to the treatment under study, the greater the statistical power. In studies of urate-lowering therapy, the mechanism of action is very clearly understood and amenable to direct observation. In studies of interventions that influence gout disease activity through a different mechanism (e.g., antiinflammatory treatment), a different primary outcome is necessary (e.g., frequency of gout attacks). However, choice of primary outcome is also dependent on whether clinically important differences in the outcome would change practice ([15]). This requires additional consideration.
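The dependence of power on the size of the difference and the variability of the outcome can be illustrated with the textbook normal-approximation formula for comparing two group means, $n = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2 / \delta^2$ per group. This formula and the numbers below are illustrative assumptions, not calculations from the study.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of means.

    delta: clinically meaningful difference between groups
    sd:    standard deviation of the outcome (assumed equal in both groups)
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical outcomes measured in standard deviation units.
print(n_per_group(delta=0.5, sd=1.0))  # smaller effect relative to variability
print(n_per_group(delta=1.0, sd=1.0))  # larger effect relative to variability
```

An outcome on which the intervention produces twice the standardized difference needs roughly a quarter of the participants, which is the sense in which a mechanistically close outcome increases power.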

There are some limitations to this study. The lack of agreement among the 3 groups may reflect small numbers of patients and heterogeneity with respect to duration of gout, presence of tophi, number of weeks since the last attack, and probably many other factors. The near absence of women (1 of 21 participants), while reflective of the demographic profile of gout, may have biased the study toward less disagreement; including more women would therefore be unlikely to alter the main conclusion. Our study was not designed to determine why disagreement occurred with respect to outcome domain prioritization, only to show that such disagreement exists. It would be interesting to explore the underlying basis for the disagreement in future research.

We suggest that the critical outcome measure design decision in clinical trials for chronic gout is choice of the primary outcome measure. For chronic gout clinical trials, we propose that more work needs to be done to identify the preferred primary outcome domain for specific kinds of interventions. We need to know which outcome domain is most likely to change clinical practice for particular kinds of interventions, e.g., urate-lowering therapy, antiinflammatory therapy, educational interventions, and others. There is some suggestion from the current study that gout attacks are generally important to patients with gout, but not necessarily the most important outcome. Further work should be undertaken to identify which outcome domain in specific clinical trial contexts would be most likely to influence physician practice.

AUTHOR CONTRIBUTIONS

All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Taylor had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Taylor, Dalbeth.

Acquisition of data. Taylor, Brown, Aati, Dalbeth.

Analysis and interpretation of data. Taylor, Weatherall.

ACKNOWLEDGMENTS

The assistance of Franz Ombler from 1000Minds, Cat Bjazavich in Wellington, and Meaghan House, Maria Lobo, and Christopher Franklin in Auckland for helping with the groups is gratefully acknowledged. We also thank all the participants who contributed to the group discussions.
