Psychometric characteristics of outcome measures in juvenile idiopathic arthritis: A systematic review


  • Statements herein should not be construed as endorsements by the Agency for Healthcare Research and Quality or the US Department of Health and Human Services.



To review the performance characteristics of the instruments most commonly used to measure clinical outcomes in juvenile idiopathic arthritis (JIA), including global assessments, articular indices, functional/disability assessments, and quality of life measures.


As part of an Agency for Healthcare Research and Quality comparative effectiveness review of antirheumatic drugs, we explored the characteristics of commonly used outcome measures for JIA. English-language studies of children with JIA were identified from Medline and Embase. Two independent reviewers screened titles and abstracts, with subsequent full-text review of studies selected based on predetermined criteria.


We included 35 publications describing 34 unique studies and involving 14,831 patients. The Childhood Health Assessment Questionnaire (C-HAQ) was the most extensively studied instrument and had high reliability, but only moderate correlations with other indices of disease activity and poor responsiveness to change in disease status. The physician global assessment of disease activity (PGA) and articular indices had the strongest association with disease activity and were the most responsive to change. Measures of psychosocial function and quality of life were moderately associated with measures of disease activity, but were less responsive to changes in disease status.


In children with JIA, no single instrument was superior in reliability or validity or in describing the impact of JIA. Although the C-HAQ has been extensively evaluated, the PGA and articular indices appear to have the highest responsiveness to change and, therefore, the highest potential for detecting important differences in treatment response.


Juvenile idiopathic arthritis (JIA) is the most common rheumatic disease of childhood, affecting close to 300,000 children in the US. JIA has a broad impact on a child's physical and mental health. The heterogeneity of disease severity, the broad age range of affected individuals, and fluctuations in disease course complicate measuring disease activity and treatment effects in children with JIA. Developing instruments that accurately assess the effect of JIA on health and well-being is critical to assessing the overall impact of the disease and quantifying the impact of treatments.

Many instruments are used to assess severity of disease, disability, and quality of life in JIA. To inform clinicians, patients, and families about the current evidence regarding the management of JIA with disease-modifying antirheumatic drugs, and to help researchers identify critical gaps in knowledge, the Agency for Healthcare Research and Quality commissioned the Duke Evidence-based Practice Center to conduct a comparative effectiveness review (CER) (1), which included a key question regarding the psychometric properties of the most commonly utilized outcome measures in JIA. In the present report, we describe our findings for that key question, which was stated as follows: “What is the validity, reliability, responsiveness, and feasibility of the clinical outcome measures for childhood JIA that are commonly used in clinical trials or within the clinical practice setting?”


We developed and followed a standard protocol for all steps of this review. The key question and the methods used were developed with input from a technical expert panel. Details are available in the full CER (1).

Search strategy and identification of relevant studies.

We conducted a comprehensive search of Medline (1966 to December 2010) and Embase (1947 to December 2010) using medical subject heading terms and key words for JIA and its older designations (i.e., juvenile rheumatoid arthritis [JRA]) and the names of the common instruments used to assess outcomes of treatment. We limited our search to English-language articles of studies in humans and identified prospective clinical studies and cross-sectional studies relevant to our question. We also manually reviewed the references from review articles and articles meeting our selection criteria for additional pertinent studies.

Study selection.

Two independent reviewers reviewed all abstracts, with subsequent full-text review and study selection based on predetermined selection criteria. Differences were resolved by consensus. We included peer-reviewed, English-language articles of studies that had a sample population of individuals ages 18 years or younger with JIA according to the current American College of Rheumatology (ACR) definition (2), as well as past designations such as JRA and juvenile chronic arthritis. After compiling a list of outcome measures used in the studies identified in our initial search, we prioritized and selected the measures for detailed review, using input from the technical expert panel and our assessment of measures already commonly used and those of growing relevance. We chose to focus on studies in which the instrument's psychometric characteristics were examined specifically for children with JIA (Table 1). In addition, we described several composite measures and definitions of disease state of growing importance, although they lacked full psychometric evaluations, such as the ACR 30% improvement criteria for JRA (ACR Pedi 30) (3). While the recently developed composite measures aim to more broadly reflect overall health status and disability, we did not identify studies of psychometric characteristics for the composite measures. We therefore included those measures that are key components of the composite measures, including physician global, parent/patient global, joint counts, functional ability, and quality of life measures.

Table 1. Outcome measures*
| Measure/instrument | No. of items | Domain descriptions | Response categories | Scoring range | Mode of administration | Feasibility | Comments |
|---|---|---|---|---|---|---|---|
| Measures of disease activity | | | | | | | |
| Active joint count | Full 71-joint examination | Active arthritis | Active, inactive | 0–71 | Health professional | Joint count summed | Reduced joint count measures exist |
| Physician global assessment | 1 item | Active disease | Most commonly 100-mm VAS | 0–100 | Health professional | Measure distance from 0 anchor | |
| Parent/patient global assessment | 1 item | VAS or categorical, overall well-being | Most commonly 100-mm VAS | 0–100 | Self-administered | Value of VAS, no calculation | Assesses disease activity, functional status, and quality of life |
| Measures of functional status | | | | | | | |
| C-HAQ | C-HAQ DI: 30 items; VAS: pain, overall well-being | Physical function (covering 8 domains), pain, overall well-being | 0–3 and NA, 0 = no difficulty, 3 = inability to perform | Physical function: 0–3; VAS: 0–100 mm | Self-administered, parent or patient | 5 minutes to complete, highest score in each domain = score for domain, 2 minutes to score | Adapted from Stanford Health Assessment Questionnaire |
| Measures of health-related quality of life | | | | | | | |
| CHQ | Parent form: 50 or 28 items; child form: 87 items | Physical health, pain, mental health, school, social, family | 0–100, 0 = poor well-being, 100 = excellent well-being | 0–100 | Self-administered, children self-administer after age 10 years | Apply scoring formula as per manual | |
| PedsQL | 23 items | Physical, emotional, social, school functioning | 5-point Likert scale (never to always) | 0–100§ | Self-administered | Together (generic and RM) takes 10–15 minutes | |
| PedsQL-RM | 22 items | Pain and hurt, daily activities, treatment, worry, communication | 5-point Likert scale (never to always) | 0–100§ | Self-administered | | |

  * VAS = visual analog scale; C-HAQ = Childhood Health Assessment Questionnaire; DI = disability index; NA = not applicable; CHQ = Child Health Questionnaire; PedsQL = Pediatric Quality of Life Inventory; RM = Rheumatology Module.
  • Higher score equals higher disease activity/functional impairment.
  • Higher score indicates better quality of life. Mean ± SD score in US is 50 ± 10.
  § Higher score indicates better quality of life.

Psychometric properties evaluated.

Reliability addresses the consistency of the instrument in measuring the construct of interest. We examined 3 areas of reliability: reproducibility, interrater reliability, and internal consistency. Instruments with greater reproducibility and interrater reliability may be more feasible to use in clinical trials and require smaller sample sizes to detect clinically important differences between treatment groups. Internal consistency assesses whether the items purported to measure the same general construct actually produce similar results. Cronbach's alpha is usually interpreted as a measure of internal consistency, with a range from 0 (i.e., no internal consistency) to 1 (i.e., completely internally consistent). Recent research has challenged this interpretation (4). However, Cronbach's alpha is the most commonly reported method for measuring internal consistency for the measures of interest. Validity refers to how well an instrument measures what it claims to measure. Since many of the constructs assessed by the clinical outcome measures have no reference standard, we evaluated construct validity based on how well the measures correlated with other indicators of disease, including global assessments, articular counts, and scores from other validated instruments.
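As an illustration of the internal consistency statistic described above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score). The function name and the example scores are hypothetical and are not taken from any of the reviewed studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the summed score across respondents
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 4 respondents answering 3 items scored 0-3.
# Every item ranks the respondents identically, so alpha is at its maximum.
example = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(example)
```

In this contrived case the items agree perfectly and alpha equals 1; real questionnaire data would typically give a lower value.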

Responsiveness is determined by 2 properties: reproducibility and the ability to register changes in scores when a patient's symptom status shows clinically important improvement or deterioration. The effect size is a measure of responsiveness that uses the mean change score in the numerator and a measure of variability in the denominator. Responsiveness is often reported on a continuous scale, even when the scales in question are ordinal. Although this is common practice, some researchers have argued that effect sizes (5) and minimum clinically important differences (MCIDs) (6) may be inflated when calculated on ordinal scales, and this caveat should be kept in mind when reviewing the results.

Data extraction and quality assessment.

Data extraction was completed by pairs of reviewers. One reviewer performed the initial abstraction, while the second read over each article to ensure accuracy and completeness. We extracted data regarding inter- and intrarater reliability, test–retest reliability, responsiveness (standardized response mean [SRM] and responsiveness index), time needed to administer, and construct validity for our selected outcome measures.

To assess study quality, we adapted pertinent criteria from the Quality Assessment of Diagnostic Accuracy Studies tool, a validated measure designed to assess the quality of diagnostic test studies (7). We evaluated the selection of study participants, independent and blind comparison of the study instrument to other outcome measures, and the appropriateness of the analytical approach.


Literature identified.

Our initial search included broad search terms for studies pertaining to the treatment of JIA as well as studies of the outcome measures used to assess treatment response. We identified a total of 4,815 potentially relevant citations, of which 35 were subsequently determined to meet eligibility criteria for the key question considered in this report. Figure 1 shows the flow of literature through the selection process. The 35 publications identified described 34 unique studies involving 14,831 patients that investigated the psychometrics of the selected outcome measures or developing definitions of treatment response. Among these were 14 studies that evaluated reliability (8–21), 21 studies that evaluated validity (8, 11–13, 15–18, 21–33), and 9 that evaluated responsiveness (9, 10, 17, 26, 32, 34–37) of the selected outcome measures. Of these measures, the Childhood Health Assessment Questionnaire (C-HAQ) was the most extensively studied measure, with 23 studies (8–16, 20, 22–24, 26, 28, 30–37). The overall quality of the studies was fair, with few studies commenting on blinding, and only 1 (9) reporting sample size calculations. Results were reported as median values as well as the ranges of values from the different studies.
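The median-and-range summary described above can be sketched as follows. This is only an illustration of the reporting format; the function name and the correlation values are hypothetical and do not come from the included studies.

```python
from statistics import median


def format_summary(values):
    """Format study-level estimates as 'median (min–max), n studies'."""
    return f"{median(values):.2f} ({min(values):.2f}–{max(values):.2f}), {len(values)} studies"


# Hypothetical correlations reported by 5 separate studies
hypothetical_correlations = [0.2, 0.35, 0.5, 0.55, 0.7]
summary = format_summary(hypothetical_correlations)
```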

Figure 1. Literature flow diagram.


We identified 10 studies examining various aspects of reliability for the C-HAQ (8–16, 20); 2 studies each for the physician global assessment of disease activity (PGA), parent/patient global assessment of well-being (PGW) (19, 21), and Pediatric Quality of Life Inventory (PedsQL) (18, 20); and 1 for the Child Health Questionnaire (CHQ) (17). Reproducibility, also called test–retest reliability, was assessed for the C-HAQ in 5 studies, all of which demonstrated high correlation between administrations (correlation coefficient range 0.79–0.96) (8, 11–14). The test–retest reliability of the PedsQL, CHQ, joint counts, PGA, and PGW has not been studied specifically in JIA populations.

Interrater reliability was most commonly explored to determine the correlation between parent and patient scores. Interrater reliability was measured for the C-HAQ, CHQ, and PedsQL, all of which demonstrated a moderate to strong correlation between parent and child when assessing functional status or disability (C-HAQ: 0.54–0.84 [9, 10, 13, 20], CHQ physical score [PhS]: 0.69–0.87 [17], PedsQL: 0.46–0.8, and PedsQL Rheumatology Module [PedsQL-RM]: 0.3–0.90 [18, 20]). The correlation between parent and child was lower for the psychosocial domain, including the PedsQL-RM worry domain (correlation coefficient 0.3) (18) and the CHQ psychosocial score (PsS; correlation coefficient range 0.38–0.53) (17).

Internal consistency, assessed most commonly using Cronbach's alpha, was evaluated in 4 studies for the C-HAQ, with all showing high internal consistency (Cronbach's alpha 0.88–0.94 for all domains except the “arising” domain [0.69]) (12, 13, 15, 16). In addition, shorter versions of the C-HAQ disability index (DI) were found to have high internal consistency, with Cronbach's alpha of 0.93 for both the 29-item and the 18-item instruments (15).

The C-HAQ was also evaluated for unidimensionality in 4 studies. Item response theory was used to examine unidimensionality in 2 studies, 1 of which showed a misfit rate of 13% with only 1 problematic area (hygiene), while the other study found 4 items that did not fit the model (16, 31). Factor analysis was used to confirm the unidimensional character of the C-HAQ and C-HAQ DI in 2 additional studies, with the full C-HAQ reported as having a very high fit of the model to a single dimension (functional disability) with P < 0.0001 (12). The analysis of the C-HAQ DI demonstrated 2 principal components: an upper extremity functional component and a lower extremity functional component.


Of the 21 articles that met our inclusion criteria, 17 explored validation of the C-HAQ (8, 11–13, 15, 16, 21–24, 26, 28–33), 4 explored validation of the CHQ (17, 25, 26, 29), and 2 explored validation of the PGA and PGW (21, 27). In addition, 1 study focused on the correlation of the PedsQL and PedsQL-RM with pain assessments (18).

Results are summarized in Table 2. The C-HAQ was most strongly correlated with the PGW, with a median correlation of 0.54 (range 0.44–0.7, 6 studies [12, 21, 24, 26, 30, 32]). Of the articular measures of disease, both the active joint count (AJC) and the joints with limited range of motion (LROM) demonstrated moderate correlations with the C-HAQ, with a median correlation of 0.45 (range 0.14–0.67, 9 studies [12, 13, 16, 21, 23, 26, 30–32]) and 0.49 (range 0.3–0.76, 7 studies [8, 22, 24, 26, 29–31]), respectively. There was considerable variability in these correlations, with the most significant variations among children categorized by disease duration. Palmisani et al reported that the C-HAQ correlated less well with AJC for children early in the course of disease than for children later in the course of disease (0.14 and 0.61, respectively) (30). Those with late disease had a strong correlation with LROM (0.76), but lower correlations with PGA (0.51) (30). Modified forms of the C-HAQ, including reduced-item and digital versions, have been validated as well, although the correlation with articular measures was slightly less than for the original C-HAQ (values of 0.34–0.59) (8, 15, 28).

Table 2. Validity: correlations of instruments with measures of diseases and other instruments*
| Instrument (ref.) | PGA | PGW | AJC | LROM | Swollen joint count | Other instruments |
|---|---|---|---|---|---|---|
| C-HAQ (8, 11–13, 15, 16, 21–24, 26, 28–33) | 0.45 (0.2–0.67), 9 studies | 0.54 (0.44–0.7), 6 studies | 0.45 (0.14–0.67), 9 studies | 0.49 (0.33–0.76), 6 studies | 0.40 (0.22–0.65), 4 studies | PedsQL: −0.62; PedsQL-RM: −0.63; CHQ PhS: −0.63 and 0.58, 2 studies; CHQ PsS: −0.25; Steinbrocker functional class: 0.77; Disease Activity Index: 0.60; ACR functional class: 0.64; digital C-HAQ: 0.97 |
| CHQ (17, 26, 29) | CHQ PhS: −0.54 (−0.52 to −0.56), 2 studies; CHQ PsS: −0.048 | CHQ PhS: −0.64 (−0.63 to −0.65), 2 studies; CHQ PsS: −0.315 | CHQ PhS: −0.39 (−0.36 to −0.42), 2 studies; CHQ PsS: −0.024 | | | C-HAQ PhS: −0.54 (−0.50 to −0.57), 2 studies; C-HAQ PsS: −0.25 (−0.22 to −0.28), 2 studies |
| PGA (21, 27) | | 0.54 | 0.62 (0.47–0.77), 2 studies | 0.49 (0.4–0.58), 2 studies | 0.64 (0.51–0.76), 2 studies | 0.39; CHQ PhS: −0.53; CHQ PsS: −0.13 |
| PGW (21, 27) | 0.54 | | 0.45 (0.40–0.49), 2 studies | 0.43 (0.38–0.48), 2 studies | 0.43 (0.42–0.43), 2 studies | 0.53; CHQ PhS: −0.7; CHQ PsS: −0.29 |

  * Values are the median (range). If no range, only 1 study reported that measure. PGA = physician global assessment of disease activity; PGW = parent/patient global assessment of well-being; AJC = active joint count; LROM = limited range of motion; C-HAQ = Childhood Health Assessment Questionnaire; PedsQL = Pediatric Quality of Life Inventory; RM = Rheumatology Module; CHQ = Child Health Questionnaire; PhS = physical score; PsS = psychosocial score; ACR = American College of Rheumatology.

While there were no strong correlations between indicators of disease activity and the C-HAQ, there were moderate correlations with measures of quality of life, including the PedsQL (−0.62) and the PedsQL-RM (−0.63) (24). Of interest, while there were moderate correlations between the C-HAQ and the CHQ PhS (−0.58), there was poor correlation with the CHQ PsS (−0.25) (17). The 2 studies reporting on validity of the CHQ found consistently higher correlations for the physical component across all measures, from PGA and PGW to articular indices and functional status (17, 29). While the CHQ was found to differentiate healthy children from those with JIA, we did not find any results indicating discriminant validity to accurately classify children with JIA by the extent of their disease (25). While the PedsQL and PedsQL-RM have been studied in general pediatric rheumatology populations, only 1 study focused on children with JIA. That study found that child-reported pain assessments correlated with all subscales of the PedsQL and PedsQL-RM, and that parent pain assessments correlated with 3 of 4 subscales for both instruments (18).


The SRM (38) calculates an effect size that incorporates information about the response variance into the denominator. According to Cohen (39), an effect size of 0.2–0.3 is considered a small effect, ∼0.5 (0.4–0.7) a medium effect size, and ≥0.8 a large effect size.
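To make the distinction concrete, the sketch below computes both statistics from paired scores: the effect size divides the mean change by the SD of baseline scores, whereas the SRM divides it by the SD of the change scores. The function names and the example data are hypothetical. The label cutoffs follow the ranges quoted above, and assigning values that fall in the gaps (0.3–0.4 and 0.7–0.8) to the lower category is an assumption of this sketch.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def cohen_label(value):
    """Label the magnitude of an effect; gap handling is an assumption."""
    magnitude = abs(value)  # sign only reflects the direction of change
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.4:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"


# Hypothetical disability scores (0-3 scale) for 4 patients; lower is better
baseline = [1.0, 1.5, 2.0, 2.5]
follow_up = [0.5, 1.25, 1.5, 2.25]
es = effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
```

Because the hypothetical changes are very uniform across patients, the SRM here is much larger in magnitude than the effect size, which shows why the two statistics are not interchangeable.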

Responsiveness was assessed in 9 studies (9, 10, 17, 26, 32, 34–37). Results are summarized in Table 3. The responsiveness of the C-HAQ was assessed in 6 studies (9, 26, 32, 34–36). The results of the 6 studies were quite variable, with effect sizes ranging from 0 to 0.5. The 2 studies evaluating responsiveness in oligoarticular populations (35, 36) found that the C-HAQ was less responsive in patients with oligoarticular disease, with SRMs ranging from 0 to 0.25, compared to studies of polyarticular disease, where SRMs ranged from 0.48 to 0.6 (9, 26, 32, 34). This difference in responsiveness by disease category was seen even when the same definition of improvement was used (32, 36).

Table 3. Responsiveness*
| Instrument (ref.) | Standardized response means | Effect sizes | Area under ROC curves |
|---|---|---|---|
| C-HAQ (9, 26, 32, 34–36) | Responders, median 0.60 (range 0.39–0.8); non-responders, median 0.08 (range 0.01–0.15) | Median 0.24 (range 0–0.5) | 0.56 (95% CI 0.41–0.71) |
| PGA and PGW (34–36) | PGA, median 0.9 (range 0.82–2.07); PGW, median 0.5 (range 0.3–0.8); PGA, mean ± SD change 5.4 ± 2.6; PGW, mean ± SD change 1.5 ± 2.0 | PGA, median 1.59 (95% CI 1.0–2.32); PGW, median 0.5 (range 0.33–0.97) | PGA, 0.86 (95% CI 0.72–0.95); PGW, 0.63 (95% CI 0.46–0.78) |
| Joint counts (36) | No. swollen joints 0.7; no. active joints 1.3 | No. swollen joints 1.3; no. active joints 0.7 | |
| CHQ (17, 35) | CHQ PhS 0.19; CHQ PsS 0.28; CHQ overall 0.23 | CHQ PhS 0.18; CHQ PsS 0.23 | CHQ PhS, 0.67 (95% CI 0.5–0.81); CHQ PsS, 0.71 (95% CI 0.54–0.85) |

  * ROC = receiver operating characteristic; C-HAQ = Childhood Health Assessment Questionnaire; 95% CI = 95% confidence interval; PGA = physician global assessment of disease activity; PGW = parent/patient global assessment of well-being; CHQ = Child Health Questionnaire; PhS = physical score; PsS = psychosocial score.

Three studies reported on the responsiveness of the global assessment measures and joint count indices. The most responsive measure was the PGA, with a large effect size of 1.59 (95% confidence interval [95% CI] 1.0–2.32) (34–36). However, in 2 of these studies, the patients' initial designation of improved or not improved was based on the physician's assessment, either as a categorical assessment on a 5-point scale for the first study (35), or by a definition of flare based on the addition or escalation of therapy in the second (34). Swollen joint count and AJC were also found to have moderate to high responsiveness (effect sizes 1.3 and 0.7, respectively) and may be appropriate alternative measures (36).

The responsiveness of the CHQ was evaluated in 2 studies, both of which demonstrated poor overall responsiveness, with an SRM of 0.23 and an effect size of 0.18–0.23 (17, 35). However, in the study that reported responsiveness separately based on disease state, the responsiveness was high in those designated as improved, at 0.96, indicating that the CHQ is sensitive to improvement, but the SRM was lower (−0.60) in those with worsening disease (17).

The MCID was evaluated for the C-HAQ in 2 studies. The MCID helps clinicians interpret study results by estimating the amount of change on an instrument that is associated with a clinically meaningful change in the patient's status. The first study explored the question of minimal clinically important change using a theoretical scenario and found a mean MCID for improvement of −0.13 in the C-HAQ, and 0.75 for worsening (10). The second study evaluated MCID in a JIA population and found that results differed by which external standard of disease was used (patient, parent, or physician assessment of disease). The mean MCID for improvement was −0.188 to 0 compared to child ratings, and 0 for parent and physician ratings (37). The authors concluded that changes in a patient's condition did not correlate well with the C-HAQ, and therefore that the C-HAQ is unlikely to be a useful tool when making short-term medical decisions.

The ability of the various outcome measures to differentiate those who improved from those who did not was assessed using receiver operating characteristic (ROC) curves. The most discriminating of the instruments we examined was the PGA, with an area under the ROC curve of 0.86 (95% CI 0.72–0.95), compared to the PGW value of 0.63 (95% CI 0.46–0.78) and the C-HAQ value of 0.56 (95% CI 0.41–0.71) (35). A summary of the evidence for the measures assessed is provided in Table 4.
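For readers less familiar with the statistic, the area under the ROC curve equals the probability that a randomly chosen improved patient shows a larger change score than a randomly chosen non-improved patient, with ties counted as one-half. The sketch below uses this rank-based equivalence; it is not necessarily the procedure used in the cited study, and the function name and change scores are hypothetical.

```python
def roc_auc(improved, not_improved):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    if not improved or not not_improved:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for x in improved:
        for y in not_improved:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(improved) * len(not_improved))


# Hypothetical change scores on a 0-100 mm VAS (larger = more improvement)
improved_changes = [30, 25, 10, 40]
not_improved_changes = [5, 10, 20, 0]
auc = roc_auc(improved_changes, not_improved_changes)
```

A value of 0.5 corresponds to no discrimination between the groups and 1.0 to perfect separation, which is the scale on which the values of 0.86, 0.63, and 0.56 above should be read.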

Table 4. Evidence summary table*
| Measure | No. studies (no. subjects) | Evidence summary |
|---|---|---|
| Active joint count | 12 (8,064) | Shows high responsiveness and moderate correlation with other measures of disease activity and functional status, but poor correlation with psychosocial aspects of quality of life; lack of interrater reliability data |
| PGA | 12 (8,668) | Moderate correlations with measures of disease activity, C-HAQ, and quality of life measures; responsiveness difficult to measure, as often compared to other physician measures of disease activity; no data on interrater reliability between providers |
| PGW | 8 (8,182) | Moderate correlations with other measures of disease activity, C-HAQ, and physical aspects of the quality of life measures, but poor correlation with CHQ psychosocial aspects; moderate responsiveness and discriminative ability |
| C-HAQ | 23 (13,374) | Most commonly reported outcome measure with strong reliability, including moderate to strong interrater reliability between parent and child; moderate correlations with other measures of disease activity, but poor responsiveness, which varies depending on how extensive the arthritis is at baseline (ceiling effect) |
| CHQ | 5 (4,687) | Limited data for JIA population; moderate to strong parent to child interrater reliability for physical components, but lower for psychosocial aspects; similarly, moderate correlations with measures of disease activity, and C-HAQ for physical component of CHQ, but poor for the psychosocial domains; poor responsiveness |
| PedsQL/PedsQL-RM | 2 (173) | Insufficient data in JIA populations to evaluate fully; moderate to strong parent to child interrater reliability for physical components, but lower for psychosocial aspects |

  * PGA = physician global assessment of disease activity; C-HAQ = Childhood Health Assessment Questionnaire; PGW = parent/patient global assessment of well-being; CHQ = Child Health Questionnaire; JIA = juvenile idiopathic arthritis; PedsQL = Pediatric Quality of Life Inventory; PedsQL-RM = Pediatric Quality of Life Inventory Rheumatology Module.

Composite definitions of disease status or response to therapy.

Because JIA is a complex disorder, several composite definitions have been developed to categorize disease status or response to therapy. While these definitions are in various stages of validation and lack the full psychometric evaluations needed to be included above, we recognize their growing importance as a means of capturing overall health status. The most commonly used of these definitions is the ACR Pedi 30. Developed to assess response to therapy in clinical trials, it is composed of a core set of 6 variables, 5 of which were individually evaluated in our review, including PGA, PGW, C-HAQ, number of active joints, and joints with LROM (3). A recently developed composite score, the Juvenile Arthritis Disease Activity Score, was designed to better characterize the absolute level of disease activity in JIA patients and consists of 4 measures: PGA, PGW, number of joints with active arthritis, and erythrocyte sedimentation rate. Initial validation studies have been performed (40).

Several definitions deserve mention as well, including the consensus-based definition of remission (including inactive disease, remission on medications, and remission off medications) (41, 42). While these definitions have been applied retrospectively to JIA populations, further validation studies are reportedly underway. A preliminary definition of flare has also been described. This definition was derived from a cohort of patients with polyarticular JIA using the ACR Pedi 30 (43). The success of recent innovative therapies for treating JIA has shifted the goal of treatment from minimizing disease activity to achieving remission. The various definitions of remission, as well as flare, serve to clarify and standardize the terminology used and improve our ability to determine treatment responses and comparative effectiveness.


Our results indicate that no single instrument or outcome measure appears superior in describing the various aspects of JIA with high reliability, validity, and responsiveness. While composite measures are commonly used in JIA trials today, the lack of psychometric evaluations of these composite measures prevented their inclusion. We therefore examined the individual measures most commonly used in JIA trials, including 5 of the 6 measures that make up the ACR Pediatric 30.

The C-HAQ was the most extensively evaluated instrument of the priority measures we considered. While it demonstrated high reproducibility and internal consistency, it had only moderate correlations with indices of disease activity and quality of life, and poor to moderate responsiveness. The C-HAQ is sensitive to the degree of disability at baseline, with higher responsiveness for patients with initially worse functional impairment, and therefore may have different utility in the various categories of JIA. Furthermore, the C-HAQ is a measure of disability and not disease activity. For JIA, the C-HAQ may fail to capture the full spectrum of physical function impairments, and some of the limitations detected in the C-HAQ reflect joint damage rather than disease activity. Both of these factors likely contribute to the poor responsiveness noted for the C-HAQ and limit its usefulness in clinical trials. Therefore, although the C-HAQ is a familiar and validated measure, our findings suggest the need for a better functional outcome measure that is responsive to change across the full spectrum of disease severity.

In general, across the measures studied, reliability was moderate to high for measures of physical function but poor to moderate for psychosocial domains. Similar findings were noted in validity and responsiveness, where measures of psychosocial function and quality of life showed less correlation with disease activity indices and less responsiveness compared to the physical aspects of JIA. The reasons for this discrepancy are likely multifactorial, although having to live with a chronic disease and taking medications that may result in nausea or may require painful injections likely negatively impact measures of psychosocial function and quality of life regardless of improvements in disease activity. These findings are important to consider when discussing risks and benefits of altering treatments since patients may have different tradeoffs based on the psychosocial aspects of the disease, which can impact treatment choices.

The psychometric methods shown in this study reflect those reported in the individual studies. It is important to note there may be limitations based on the methods chosen. In particular, more recently developed psychometric methods, such as Rasch analysis, were reported infrequently. These approaches have the potential to add to classic methods in the assessment of scale properties and ultimately in the interpretation of clinical trial results. Furthermore, the common practice of reporting responsiveness on a continuous scale, even when the scales in question are ordinal, may potentially inflate the interpretation of effect sizes and MCIDs (5, 6). Assessing the full impact of JIA is complicated by the heterogeneity in disease severity, the broad effects on both physical and psychosocial health, and the potential for both chronic and acute limitations. Efforts to develop a standardized composite measure that incorporates articular indices, severity, and a broader assessment of functional limitations and psychosocial impact would be useful to better discriminate levels of disease activity, overall impact of disease, and responsiveness to therapy. Consistent use of such outcome measures would facilitate comparative effectiveness research. However, developing one instrument that can serve as an accurate measure of all facets of disease, encompass the variable disease manifestations of the different categories of JIA, and be responsive enough to detect meaningful changes in disease status seems unlikely. The ACR Pedi 30 definition of improvement attempts to incorporate many of the clinically meaningful indices, but as our systematic review highlighted, the responsiveness of several of these measures, including functional status and PGW, is poor to moderate, and may not adequately reflect changes in disease state. 
Given that the amount of clinical change can vary depending on both the effectiveness of the intervention (e.g., nonsteroidal antiinflammatory drug versus biologic agent) and the disease severity (e.g., number of joints involved and severity of joint symptoms), responsiveness will need to be assessed across a broad spectrum of JIA severity and for treatments of varying effectiveness.

Knowing the performance characteristics of the outcome measures and standardizing the measures used in clinical trials is especially important for evaluating the comparative effectiveness of various treatments for JIA. While the selection of the most appropriate outcome measure may differ depending on the specific question being investigated (improvement in disease activity versus changes in quality of life), efforts should be made to standardize the measure chosen to evaluate a specific domain. To best assess treatment effects, the responsiveness of the instrument used is crucial. Therefore, focusing on the more responsive measures improves our ability to assess treatment effects and enhances our ability to detect promising new treatments. Reporting functional status and quality of life is also important, especially given that many of the current treatments require infusions or injections and have varying side effects that can negatively impact a child's quality of life. While these measures may be less responsive to changes than disease activity, they still provide valuable information. Examining the more responsive articular measures separately from the less responsive functional status and quality of life measures may actually improve our understanding of both the efficacy and effectiveness of treatment regimens, providing further insight into the complexities of living with JIA and its treatments.


All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Dr. Van Mater had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Van Mater, Williams, Coeytaux, Sanders, Kemper.

Acquisition of data. Van Mater, Williams, Coeytaux, Sanders, Kemper.

Analysis and interpretation of data. Van Mater, Williams, Coeytaux, Kemper.