Measurement properties of multidimensional patient‐reported outcome measures in neurodisability: a systematic review of evaluation studies

Aim To identify and appraise the quality of studies that primarily assessed the measurement properties of English language versions of multidimensional patient‐reported outcome measures (PROMs) when evaluated with children with neurodisability, and to summarize this evidence. Method MEDLINE, Embase, PsycINFO, CINAHL, AMED, and the National Health Service Economic Evaluation Database were searched. The methodological quality of the papers was assessed using the COnsensus‐based Standards for selection of health Measurement INstruments checklist. Evidence of content validity, construct validity, internal consistency, test–retest reliability, proxy reliability, responsiveness, and precision was extracted and judged against standardized reference criteria. Results We identified 48 studies of mostly fair to good methodological quality: 37 papers for seven generic PROMs (CHIP, CHQ, CQoL, KIDSCREEN, PedsQL, SLSS, and YQOL), seven papers for two chronic–generic PROMs (DISABKIDS and Neuro‐QOL), and four papers for three preference‐based measures (HUI, EQ‐5D‐Y, and CHSCS‐PS). Interpretation On the basis of this appraisal, the DISABKIDS appears to have more supportive evidence in samples of children with neurodisability. The overall lack of evidence for responsiveness and measurement error is a concern when using these instruments to measure change, or to interpret the findings of studies in which these PROMs have been used to assess change.

Patient-reported outcome measures (PROMs) assess a patient's health at a single point in time, and are collected through short, self-completed questionnaires. PROMs are advocated for use in clinical trials; 1,2 they are also proposed as key performance indicators for evaluating health systems. 3 Some PROMs are domain-specific, focusing on a particular aspect of health, such as behaviour; other instruments are multidimensional with sub-scales that assess various aspects of health and wellbeing. PROMs can be condition-specific, designed for use by people with a particular health problem; or they can be generic and therefore appropriate for anyone to report their health; or chronic-generic, designed for people with any long-term health conditions. Preference-based measures incorporate a weighting of scores based on a reference valuation of health states into a single index score; they are used in economic evaluations to assess cost-effectiveness. 2 'Neurodisability' is an umbrella term commonly used in the UK for a range of functional problems of neurological origin. Previously, we proposed a definition of neurodisability for children that was supported by many professionals and parents, and indicated a similar grouping of conditions in other countries, albeit with different terminology. 4 For some applications in neurodisability, a condition-specific PROM may be preferable if available; for instance, condition-specific measures exist for cerebral palsy (CP) 5 and epilepsy. 6 However, it is also common for generic PROMs to be utilized in neurodisability, especially for comparison across conditions or with normative samples. Individually, many conditions that result in a neurodisability are rare, but when grouped together they are common. Hence there are situations when it will be expedient for children with neurodisability to be grouped for research, service evaluation, or audits.
When selecting PROMs for a specific purpose, it is necessary to examine both the construct that is being assessed and the measurement properties of candidate instruments. 7 Mapping of items in PROMs using the International Classification of Functioning, Disability and Health for Children and Youth is useful to understand the content assessed by questionnaires. 8,9 Measurement properties should, ideally, have been evaluated in samples representative of the intended population to determine whether the instrument is applicable for that population. 1 Language and cultural issues also affect how people interpret and respond to questions; hence one cannot simply assume that PROMs perform consistently across languages and cultures. 10,11 Therefore, the US Food and Drug Administration recommends that evidence be provided of the process used to test measurement properties in the language where assessments will be made. 1 Scale development methodology has evolved in recent years and approaches using item response theory are more commonly utilized; such approaches use mathematical models to examine responses to individual items in questionnaires and offer better scale precision. 12 Methods for appraising the evidence of psychometric performance of measures have also become more standardized. 13 The COnsensus-based Standards for selection of health Measurement INstruments (COSMIN) was developed to enable a standardized assessment of the methodological quality of research studies evaluating measurement properties of tools. 13,14 In previous papers we documented a systematic review of generic multidimensional PROMs for children and young people, in which we mapped the content to the International Classification of Functioning, Disability and Health for Children and Youth insofar as it was possible, 9 and appraised studies evaluating measurement properties in general population samples. 15
In this study we build upon that foundation by focusing on evaluations of the PROMs identified in the previous systematic review where they have been tested in samples of populations with neurodisabilities. In this instance, as the aim was to examine which instruments could be considered robust for application across children with neurodisability, we also included chronic-generic tools. We use the word 'PROM' to refer to the group of questionnaires (different versions according to age group, length, or responder) of a certain instrument; we use the word 'questionnaire' to refer to a specific version of an instrument. A list of the PROMs, the different types of questionnaires, and full names is presented in Table I.

METHOD
Search strategy
Candidate generic PROMs were identified and catalogued as part of a previous systematic review. 9 In addition, chronic-generic PROMs were included as they could be used across neurodisability conditions; three eligible chronic-generic tools were identified in our broader research programme (DISABKIDS, Functional Disability Index, Neuro-QOL). For this review, three groups of search terms were combined: the names of the PROMs and their synonyms; terms for children; and terms for neurodisability, for which both free text and medical subject headings were used.
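The combination of the three groups of search terms can be illustrated with a minimal sketch. It assumes that synonyms within a group are joined with OR and that the groups are joined with AND, which is the usual way such strategies are built; the term lists are short, hypothetical examples, and the syntax is not that of the strategy actually run (an example of which is given in Appendix S1).

```python
def or_block(terms):
    """Join the synonyms of one group with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*term_groups):
    """Combine several groups of terms so that a record must match every group."""
    return " AND ".join(or_block(group) for group in term_groups)


# Hypothetical, deliberately incomplete term lists (assumptions for illustration).
prom_terms = ["PedsQL", "Pediatric Quality of Life Inventory", "KIDSCREEN"]
child_terms = ["child*", "adolescen*"]
neuro_terms = ["cerebral palsy", "epilepsy", "autism"]

query = build_query(prom_terms, child_terms, neuro_terms)
print(query)
```

In a real search each database requires its own field tags and subject headings, so a string of this kind would only be a starting point for translation into each interface.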
Searches were conducted on MEDLINE (including in-process and other non-indexed citations), Embase, PsycINFO, and AMED (via OvidSP), CINAHL (via EBSCOhost), and the National Health Service Economic Evaluation Database.

• Variable methodological quality of studies; improved quality in more recent evaluations.
• Evidence lacking for instruments to assess meaningful changes in health.

Inclusion and exclusion criteria
Articles were selected when written in English and reporting on a study that was (1) specifically designed to evaluate the psychometric properties of candidate PROMs using an English language version of the questionnaire, (2) conducted in a population including at least 10% children up to 18 years old with neurodisability, or mixed chronic conditions including neurodisability, and (3) published in a peer-reviewed journal. Articles were excluded if (1) the PROM was used as a criterion standard to test another instrument, (2) less than 10% of the study population was younger than 18 years, or (3) less than 10% of the study sample was diagnosed with neurodisability.

Study selection
Titles and abstracts of all unique citations were screened against the eligibility criteria by two reviewers (AJ and RG/CM); any disagreements were resolved by discussion. The full text of any potentially relevant article was retrieved and screened using the same procedure. A flowchart describing the process of study selection can be found in Figure 1.

Assessment of methodological quality of included articles
For each included paper, the COSMIN checklist was used to appraise the methodological quality of the study and the completeness of the report. 13 We assessed the methods and reporting of how the following properties had been tested: internal consistency, reliability, measurement error, content validity, structural validity, hypothesis testing, criterion validity, and responsiveness. Cross-cultural validity was not examined as the purpose of the work was to inform UK health services policy where currently only English language versions are administered. The COSMIN checklist uses a 'worst score counts' rating of methodological quality as excellent, good, fair, or poor based on factors such as adequate sample size and appropriate statistical methods used. 16 The checklist was administered by two reviewers (AJ and CM); discrepancies were resolved by discussion.
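The 'worst score counts' rule can be sketched as follows. This is an illustration only: the four rating labels are those named above, but the checklist item names in the example are hypothetical and the function is not part of any COSMIN software.

```python
# Ratings ordered from worst to best, as named in the text.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def worst_score_counts(item_ratings):
    """Return the lowest rating among the checklist items for one property."""
    ratings = list(item_ratings)
    if not ratings:
        raise ValueError("at least one item rating is required")
    # The overall score is the rating with the smallest position in RATING_ORDER.
    return min(ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one measurement property in one study.
example = {
    "sample size": "good",
    "handling of missing data": "fair",
    "statistical method": "excellent",
}
print(worst_score_counts(example.values()))  # prints "fair"
```

Under this rule a single weak design element determines the overall rating, which is why a study with otherwise sound methods can still be rated fair or poor for a given property.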

Data extraction
For each included paper the following descriptive data were extracted using a standardized, piloted data extraction form: first author name and year, name and version of the instrument (including child or parent version), study aim, study population (participants' characteristics including type of neurodisability and diagnosis), number of participants, age range, mean age (and standard deviation), and setting or country where the study was conducted. Data were extracted by one reviewer (RG/AJ) and a 10% sample was checked by a second (AJ/CM). Then, any data on evidence of the measurement properties of instruments were extracted including content validity (theoretical framework and/or qualitative research), construct validity (structural validity concerning how domain sub-scales were determined for instance using factor analysis, and hypothesis testing to verify sub-scales measure the intended construct), internal consistency (including domain sub-scales where appropriate), test-retest reliability, proxy reliability (between child and parent), precision, and responsiveness (whether the increases/decreases in scores can be considered robust and exceed measurement error). Data were extracted by one reviewer (RG/AJ) and checked by a second (AJ/CM); disagreements were resolved by discussion.

Appraisal of measurement properties and summary of evidence
Evidence of measurement properties was judged using standardized reference criteria and thresholds (Table II). 12,17 These data were summarized in a single rating for each measurement property following methods commonly used for the presentation of such findings. 18,19 To summarize available evidence, we took into account the following elements (Table III): (1) data extracted from included studies, with reference to standard criteria (Table II); (2) the methodological quality of studies (COSMIN) and number of studies; and (3) the thoroughness of testing, giving further weight to any studies that appeared not to have been conducted by the original developers. 20 Two reviewers (AJ and CM) made the judgement through discussion based on available evidence.

RESULTS
We found 48 papers that report evaluations of measurement properties of 12 PROMs (Table IV): 37 papers for seven generic PROMs (CHIP, CHQ, CQoL, KIDSCREEN, PedsQL, SLSS, and YQOL), seven papers for two chronic-generic PROMs (DISABKIDS and Neuro-QOL), and four papers for three preference-based measures (HUI, EQ-5D-Y, and CHSCS-PS). Twenty papers described evaluations of the PedsQL 4.0 in various neurodisability samples, and 11 papers pertained to various versions of the CHQ. The most common conditions in samples were CP, epilepsy, attention-deficit-hyperactivity disorder (ADHD), autism spectrum disorder (ASD), and traumatic brain injury. The evaluations spanned children in a variety of age groups from 2 to 18 years old. The evaluations were performed in Canada, USA, Europe, and Australia.
The methodological quality of the included studies was variable (Table V). Internal consistency, test-retest reliability, and construct validity (hypothesis testing) have been more frequently studied in neurodisability samples; several studies have examined structural validity; very few studies have evaluated responsiveness and measurement error. A summary appraisal of the evidence for measurement properties of each PROM is given in Table VI.
Figure 1: Flowchart describing the process of study selection. Recoverable box labels: papers retrieved for full-text screening by previous systematic review (n=180); title/abstract screening (n=2386); duplicates (n=9); papers selected for data extraction (n=28; n=38).
The PedsQL has been evaluated with children with a wide range of neurodisability including CP, ADHD, ASD, acquired brain injury, neuromuscular, and neuro-oncology conditions. Although there is supportive evidence for the structural validity and test-retest reliability of the PedsQL, there is conflicting evidence for the internal consistency of the subscales, particularly the school functioning domain, which scored consistently low (0.45-0.65). 21-24 Other papers reported values of Cronbach's alpha below 0.7 for emotional functioning, 23 social functioning, 25,26 and physical functioning. 22 We found conflicting evidence for precision; overall floor and ceiling effects were less than 15% for most scales, except social functioning (up to 36%). 21,27 The responsiveness of the PedsQL was assessed in one poor-quality study, thus a rating was not determined. 28 Several studies reported child-proxy reliability, all reporting low to moderate agreement (intraclass correlation coefficient 0.10-0.75, with most between 0.20 and 0.60).
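For readers less familiar with the statistics behind these judgements, the following sketch shows how Cronbach's alpha and floor and ceiling effects are commonly computed. It uses the standard formula for alpha (not restated in this review) and a small hypothetical data set; the thresholds of 0.7 and 15% are those referred to above, and the function names are our own.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores do not vary")
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def floor_ceiling(scores, lowest, highest):
    """Proportions of respondents at the lowest and highest possible score."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    at_floor = sum(s == lowest for s in scores) / n
    at_ceiling = sum(s == highest for s in scores) / n
    return at_floor, at_ceiling


# Hypothetical responses of four children to a three-item sub-scale.
rows = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [3, 3, 4]]
alpha = cronbach_alpha(rows)
print(round(alpha, 3), alpha >= 0.7)

# Hypothetical scale scores on a 0 to 100 scale.
floor, ceiling = floor_ceiling([0, 0, 50, 100], lowest=0, highest=100)
print(floor > 0.15, ceiling > 0.15)
```

With samples as small as this the estimates would be far too unstable to interpret; the example is intended only to make the calculations concrete.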
Versions of the CHQ have been evaluated with children with CP, ADHD, acquired brain injury, and epilepsy. There is supportive evidence for the structural validity of child and parent report versions of the CHQ and internal consistency of the child report version. Both parent versions show poor results for ceiling and floor effects on several scales; ceiling effects are found for most of the individual scales, with scores up to 86% for role and social functioning. 29 One study reports low values of Cronbach's alpha for the domains family cohesion, bodily pain role/social functioning, and role/social limitations of the 28-item version; 25 three studies report conflicting findings for the 50-item version, with one paper reporting supportive evidence for all domains 30 and two studies reporting low alpha scores for general health perceptions and family impact (emotional and time). 29,31 We also found conflicting evidence for the CHQ-PF50 for construct validity and responsiveness.
The DISABKIDS was developed for and with children who have chronic health conditions including CP and epilepsy. Supportive evidence from methodologically robust studies exists for content validity, construct validity, and internal consistency of the 37-item version, and there is favourable evidence for structural validity, test-retest reliability, and precision. Evidence did not support child-proxy reliability. For the 6-item version for younger children, evidence supports the content validity, structural validity, and test-retest reliability, but is conflicting for internal consistency, with values of Cronbach's alpha dropping just below 0.70 for the child version. 32
KIDSCREEN-52 has been evaluated in one study using Rasch analysis with data from children with CP in countries across Europe; the findings for the English language version were reported separately. 33 Evidence supports structural validity, construct validity, and precision of KIDSCREEN-52. Supporting evidence was found for internal consistency and test-retest reliability for the 10-item version in one study of poorer quality, including children with CP. 25
One methodologically robust study evaluated the SLSS and BMSLSS with adolescents with conditions including CP, acquired brain injury, and ASD, providing evidence for construct validity, structural validity, internal consistency, and test-retest reliability. 34 Two papers evaluating the CHIP-CE parent report version with children with ADHD support structural validity, construct validity, internal consistency, and precision. 35,36
Each of the four preference-based measures has been evaluated in one study. Evidence from hypothesis testing supports the construct validity of the CHSCS-PS and HUI3, but was inconsistent for the EQ-5D-Y. The child-proxy reliability of the HUI2 was not rated as the study was of poor quality.
PR, parent report; ADHD, attention-deficit-hyperactivity disorder; SR, self-report; SRFNDP, separate results reported for population with neurodisability; CP, cerebral palsy; DMD, Duchenne muscular dystrophy; CYP, children and young people; SRFESP, separate results reported for English speaking population; MD, muscular dystrophy; TBI, traumatic brain injury; VLBW, very low birthweight; QoL, quality of life; HRQoL, health-related quality of life; SMA, spinal muscular atrophy; CNS, central nervous system; ASD, autism spectrum disorder; DSM-IV, Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition.
The CQoL was developed with children with intellectual disability, chronic physical disorders, and psychiatric disorders, and parents. The study reporting the development and preliminary testing provides supportive evidence for content validity; internal validity and test-retest reliability could not be determined as these elements of the study were of poor quality.
The YQOL was developed for and with children with disabilities. However, the study reporting on the content validity of the instrument does not state which conditions were included. 37 A companion paper reports supportive evidence on structural and construct validity, and internal consistency in a study of moderate quality, including children with ADHD or mobility disability.
We identified three papers reporting on the development and initial testing of the Neuro-QOL; two papers report the same data. 38,39 Epilepsy and muscular dystrophy were selected as conditions for test development of the paediatric Neuro-QOL item pool. The content validity, reported in three papers, was rated as good. Domains were identified through a literature review, expert interviews, parent and carer focus groups, and keyword search. 39,40 Cognitive interviews were conducted with children aged 10 to 18 years to ensure appropriate understanding and literacy levels. 38 Other measurement properties were not rated owing to the poor quality of the studies.

DISCUSSION
This review identified 12 multidimensional PROMs, with 18 versions of questionnaires, that have been evaluated with children with various neurodisability conditions, including CP, ADHD, ASD, epilepsy, acquired brain injury, neuromuscular and neuro-oncology conditions. The PedsQL and CHQ have been evaluated more than other instruments, though some of the evidence undermines confidence in their ability to produce robust measurement. On the basis of this appraisal, the DISABKIDS appears to have more evidence to support its measurement properties in samples of children with neurodisability. None of the PROMs has been evaluated comprehensively across all relevant measurement properties, with responsiveness and measurement error being the least studied.
The paucity of evidence available for the properties of responsiveness and measurement error should be a concern for anyone wishing to use the instruments to measure change, or for those seeking to interpret the findings of studies in which these PROMs have been used to assess change. This gap needs to be addressed in paediatric populations with neurodisability to inform decisions about what constitutes meaningful change scores. Changes in scores may be statistically significant, especially in large samples, but may not be clinically important. One index of clinically important change, the minimal clinically important difference, is the mean change in score reported by the respondents who indicate that they had noticed some small change. 41 The minimal clinically important difference has been evaluated for the PedsQL in a sample of children with diabetes; 42 nevertheless one cannot necessarily assume this difference will be the same for children with neurodisability conditions. Other ways to address the lack of evidence for responsiveness include the minimum detectable change, which is an indication of the amount of change required to have confidence that any observed change is beyond measurement error; a common standard is to use a 90% confidence level. 43 The effect size is calculated by dividing the amount of change by the standard deviation of the baseline score. 44 Revicki et al. 45 suggest calculating different indices of minimal change, and for these to triangulate towards a range of values, in which confidence increases with replication.
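The indices described above can be made concrete with a short sketch. The effect size follows the definition given in the text. The formulas for the standard error of measurement, SD × √(1 − reliability), and for the minimum detectable change, z × √2 × SEM with z = 1.645 for a 90% confidence level, are the conventional ones and are assumptions here, as they are not stated in this review; all data values are hypothetical.

```python
import math
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM, assuming a reliability coefficient such as a test-retest ICC."""
    return sd * math.sqrt(1 - reliability)


def minimum_detectable_change(sd, reliability, z=1.645):
    """Conventional MDC; z = 1.645 corresponds to a 90% confidence level."""
    return z * math.sqrt(2) * standard_error_of_measurement(sd, reliability)


# Hypothetical scale scores for three children at baseline and follow-up.
baseline = [50, 60, 70]
follow_up = [55, 65, 75]
print(effect_size(baseline, follow_up))  # 0.5

# Hypothetical baseline SD of 10 and test-retest reliability of 0.75.
print(round(minimum_detectable_change(sd=10, reliability=0.75), 2))  # 11.63
```

In this hypothetical case the mean change of 5 points corresponds to a moderate effect size yet is smaller than the minimum detectable change, which illustrates why an observed change cannot be interpreted without an estimate of measurement error.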
There appears to be a dearth of evaluations of the measurement properties of preference-based measures in children with neurodisability, adding to the lack of evidence for these instruments in general populations. 15 Internal consistency may conflict with the underlying theory of health economic instruments, 46 but the properties of face, content and construct validity, and test-retest reliability remain requisite. Lack of evidence for these measurement properties undermines confidence in health economic evaluations based on preference-based measures. We did not examine the methods used to derive the scaling of the preference-based measures; the methods for creating the preference weighting were assumed to produce interval-level measurement. 47 As the purpose of preference-based measures is to quantify the value or strength of preference for health change, the means for assuming and eliciting preference values should be critically assessed. 46
The information from this review makes it difficult to recommend a multidimensional PROM for use in paediatric neurodisability based on measurement properties established in relevant conditions. Our review of evaluations of generic multidimensional PROMs in general population samples identified 41 potentially eligible PROMs, 9 and identified 126 papers that reported evidence of the measurement properties of 25 PROMs using English-language versions in general population samples. 15 Although robust evidence was lacking for one or more properties for all PROMs, there was evidence to support more measurement properties for the CHIP, Healthy Pathways, 48 KIDSCREEN, and MSLSS. The CHU-9D 49 was the preference-based measure with greater evidence of adequate measurement properties. Except for the Healthy Pathways and CHU-9D, these PROMs have been tested with children and young people affected by neurodisability; the evidence shows a similar pattern albeit supported by fewer studies. Most noticeably absent for all these PROMs are studies examining content validity. Thus these PROMs might be leading candidates for further testing in groups with neurodisability, particularly the properties of responsiveness and longitudinal validity.
Tests of responsiveness and longitudinal validity assess how scale scores change over time and whether the direction and magnitude of the changes reflect what would be expected on the basis of theory determined in advance, ideally incorporating a comparison with a group not expected to change. In the absence of evidence of responsiveness, those selecting PROMs should appraise whether the aspects of health assessed by tools and the response options to questions suggest that these are 'likely' to change in their specific context for application.
Although PROMs are generally designed for use as group measures in service evaluations, audits, and research, there is also growing interest in using them clinically as individualized measures. 50,51 The proposed criterion for test-retest reliability is more stringent for individualized use (intraclass correlation coefficient >0.9), 17 and such high levels of stability would need to be demonstrated in paediatric neurodisability.
Aside from the standard measurement properties, there are several other criteria that apply when selecting a candidate PROM. These include appropriateness, acceptability to potential respondents, and feasibility: for example the burden on respondents and those administering and processing data. 17 We studied the appropriateness of existing generic and chronic-generic PROMs for children with neurodisability by asking whether they cover the more important aspects of health for this particular group. We sought to identify a core set of outcomes that could be assessed using PROMs for these children; that is, outcomes beyond mortality and morbidity. To this end we performed qualitative research separately with children and parents, 52 a Delphi survey with health professionals, 53 and held a prioritization meeting. 54 This work produced a core set of outcomes deemed important to children and/or parents that were aspects of health targeted by National Health Service clinicians. The domains were communication, emotional wellbeing, pain, sleep, mobility, self-care, independence, mental health, community and social life, behaviour, toileting, and safety. However, none of the identified PROMs capture all these key domains. Adding to this the scarce evidence of good overall psychometric performance for existing measures in a population with neurodisability, there could be a place to refine or develop existing PROMs accordingly.
There are some limitations to this systematic review; most are a consequence of the strict inclusion criteria. Neurodisability comprises a vast number of conditions, and although we included other general descriptions and MeSH (Medical Subject Headings) terms for developmental disabilities, we only had three key marker conditions (CP, autism, and epilepsy) and relevant variations on neuro-motor, neuropsychiatric, and developmental disabilities. Although we updated searches for evaluation studies up to 30 July 2014, we did not repeat the systematic search to identify any new PROMs after September 2012. Hence, we will not have included any new PROMs published after this date; however, we are not aware of any such PROMs that would meet the eligibility criteria.
One of our inclusion criteria was published peer-reviewed reports of studies that specifically set out to evaluate measurement properties of PROMs. Hence, we excluded papers that might have presented incidental evidence from studies where PROMs were used in observational or experimental studies. However, information from studies that were not designed specifically to test measurement properties can be misleading. Studies testing responsiveness require testing of some a priori hypothesis in a longitudinal study, whereas evaluative trials typically test interventions of unknown effectiveness. Therefore, for instance, observing no change could be interpreted as either a blunt, non-responsive measure or an ineffective intervention, and it is not possible to determine which is true. 55 In addition, we will have omitted any information that may be contained in manuals, if these data have not been published in peer-reviewed journals. We justify this as peer review provides some level of quality assurance to the evidence being appraised. We included studies with children and young people with chronic conditions, providing the samples included neurodisability. Hence, we did not appraise studies examining PROMs with children with other conditions (e.g. arthritis or asthma).
Limiting the review to studies where an English version of the PROM was administered excluded some PROMs from further analyses. Two PROMs excluded from this review that may warrant further investigation are the ITQoL (for infants), 56 which was developed in the Netherlands and for which an English translation is available, although no published studies of this version were found, and the TNO-AZL (TACQOL, TAPQOL, and TAAQOL). 57 If studies had been included that used versions of questionnaires in languages other than English, then further evidence would have emerged, for instance regarding the KINDL 58 and the plethora of translated versions of the more popular instruments such as PedsQL. Nevertheless, psychometric performance cannot be assumed across languages and cultures; 11 therefore, in our view, limiting the review to evaluations of English-language versions is a relative strength of this review.
There remains much scope for research in evaluating multidimensional PROMs to measure health outcomes in paediatric neurodisability, particularly in testing item invariance across conditions and the responsiveness of PROM scores to quantify meaningful change that is beyond measurement error.
also benefited from support from NIHR PenCLAHRC and the Charity Cerebra. The views and opinions expressed in this paper are those of the authors and not necessarily those of the National Health Service, the NIHR, the Department of Health, or Cerebra. We are grateful to co-investigators in the broader project, particularly Colin Green and Jo Thompson Coon for their contribution to the development of the protocol, and Anna Stimson for her administrative support. The authors have stated that they had no interests that could be perceived as posing a conflict or bias.

SUPPORTING INFORMATION
The following additional material may be found online:
Table SI: PROMs (group of questionnaires), the different versions (according to age group, length, or responder), acronyms, and reference citations.
Appendix S1: An example of the search strategy used on Ovid MEDLINE(R), In-Process & Other Non-Indexed Citations, and Ovid MEDLINE(R) (1946 to present).