*1. Developing the instrument* |

Physiological device | Does this device agree sufficiently well with one in current use? | Method comparison study | Mean difference, 95% limits of agreement† *κ* with 95% confidence interval‡ |

How do I handle repeated measures and duplicates? | | Analysis of variance or confirmatory factor analysis† Multilevel modelling†‡ |

How do I compare this device with a known gold standard? | Diagnostic test study | Receiver operating characteristic curve, area under the curve with 95% confidence interval† Sensitivity, specificity, positive predictive value, 95% confidence interval‡ |

Problems: poor study design, e.g. no device calibration, time lapse between sequential measurements; insufficient control of sources of variability in design and/or analysis; repeated measures treated as independent observations to boost small samples |
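The mean difference and 95% limits of agreement listed for a method comparison study can be sketched as follows. This is a minimal illustration, assuming one pair of readings per child and approximately normally distributed differences; the function name `limits_of_agreement` and the readings are hypothetical and not taken from the source.

```python
import statistics


def limits_of_agreement(new_device, current_device):
    """Return (mean difference, lower limit, upper limit) for paired readings.

    Assumes one pair of readings per child. Repeated measures on the same
    child are not independent and need a different analysis.
    """
    if len(new_device) != len(current_device):
        raise ValueError("each child needs one reading from each device")
    if len(new_device) < 2:
        raise ValueError("at least two pairs of readings are needed")
    differences = [a - b for a, b in zip(new_device, current_device)]
    mean_difference = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)  # sample SD (n - 1 denominator)
    half_width = 1.96 * sd_difference
    return mean_difference, mean_difference - half_width, mean_difference + half_width


# Hypothetical readings, for illustration only.
new_readings = [10.0, 12.0, 14.0, 16.0]
current_readings = [9.0, 12.0, 13.0, 16.0]
bias, lower, upper = limits_of_agreement(new_readings, current_readings)
print(f"mean difference {bias:.2f}, limits of agreement {lower:.2f} to {upper:.2f}")
```

Whether the resulting limits are narrow enough is a clinical judgement that should be fixed before the data are collected; the calculation itself does not answer it.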

Questionnaire | Creation of items | Focus groups, interviews, expert review | Qualitative analysis to identify themes and to assess face, content validity |

Do the items (e.g. measured on a Likert scale) work reasonably well? | Pilot study to evaluate item performance on a small sample | Mean, minimum, maximum responses; percentage missing; floor and ceiling effects; item–total correlations |

Problems: lack of use of age-appropriate language and instrument formatting; range of vocabulary available in different languages may affect translation and scoring; poor item performance because of cultural misunderstandings |

*2. Establishing reliability and validity* |

Reliability | Does the instrument give stability and consistency of measurement? | Intrarater or test–retest (using the same assessor on two different occasions); interrater (using two different assessors on the same occasion) | ICC (also mean difference, 95% limits of agreement may be useful)† *κ*, weighted *κ* with 95% confidence interval, latent class analysis‡ |

Does the instrument have good internal properties (reliability)? | Item performance evaluation on larger sample | As in stage 1 above for pilot study |

(For questionnaire-type instruments typically using dichotomous items or Likert scales) | Examine internal structure and number of domains | Cronbach's *α*, split-half coefficient, Kuder–Richardson 20-method Factor analysis†‡, item response analysis‡ |

Problems: Incomplete data for occasions or assessors, e.g. because of upset child or parent did not return for second assessment; sample size too small for robust factor analysis or item response analysis |
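Cronbach's *α*, listed above for examining internal consistency, can be computed from a table of item scores. The sketch below assumes complete data (every child answers every item) and uses sample variances throughout; the function name `cronbach_alpha` and the scores are hypothetical.

```python
import statistics


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Assumes complete data: every respondent has a score for every item.
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("at least two items are needed")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every respondent must answer every item")
    # Transpose so that each element holds all responses to one item.
    item_columns = list(zip(*scores))
    sum_item_variances = sum(statistics.variance(col) for col in item_columns)
    total_variance = statistics.variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical responses from four children to two Likert items.
responses = [[1, 2], [2, 1], [3, 4], [4, 3]]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

With incomplete data, as noted under Problems, respondents would have to be dropped or their scores imputed before this calculation applies.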

Validity | Does the instrument measure what it is supposed to be measuring? | Convergent–divergent validity | Correlation (Pearson or Spearman rank)† |

| Extreme groups; known groups; concurrent validity using external criterion; predictive validity | *t*-test, Mann–Whitney *U*, Cuzick test for trend, analysis of variance, Kruskal–Wallis etc.† *χ*²-test‡ |

| Criterion validity (with gold standard) | As in stage 1 above for diagnostic test study |

Problems: may not be feasible to assess all listed types of validity as may require too many assessments for a child to tolerate; no known gold standard criterion may be available for children |

*3. Measuring change over time and discriminating between subjects* |

Magnitude of change | How do I assess the magnitude of change from baseline to end point? | Pilot study and/or main intervention study | Cohen's effect size, Guyatt's responsiveness statistic, standard error of measurement† |

Is there a better way of taking into account pretest and baseline measures (e.g. to avoid regression to the mean)? | Randomized or non-randomized two-group comparison on a large sample | Analysis of covariance to compare groups† Logistic or ordinal regression‡ |

Measurement error | How do I handle measurement error (particularly in dietary data)? | Take additional measurements by using a reference instrument | Use reference instrument (e.g. diet diary, biomarker, till receipts) to calibrate or adjust results |

Multiple measurements | How do I account for multiple measurements taken over time? | | Select the baseline and most important follow-up time point only and analyse as above; use summary measures (e.g. mean, peak, area under curve, gradient) to obtain one measure for each child; longitudinal data analysis, growth curve modelling†‡ |

Problems: establishing meaningful cut-offs and magnitudes of change that can be interpreted in different groups of children; missing data, especially when follow-up visits are necessary for sick children; use of proxy respondents and age-specific questionnaires; scaling of items |
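The summary-measures approach named above (mean, peak, area under curve, gradient) reduces each child's series to a single value that can then be analysed with standard methods. The sketch below is one possible reading, assuming the trapezoidal rule for the area under the curve and an ordinary least-squares slope for the gradient; the function name `summary_measures` and the data are hypothetical.

```python
import statistics


def summary_measures(times, values):
    """Reduce one child's repeated measurements to single summary values.

    times and values are equal-length sequences with times in increasing
    order. Returns a dict with the mean, peak, trapezoidal area under the
    curve and least-squares gradient.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs")
    # Trapezoidal rule: average of adjacent values times the time interval.
    area = sum(
        (values[i] + values[i + 1]) / 2 * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )
    # Ordinary least-squares slope of value on time.
    mean_time = statistics.mean(times)
    mean_value = statistics.mean(values)
    numerator = sum((t - mean_time) * (v - mean_value) for t, v in zip(times, values))
    denominator = sum((t - mean_time) ** 2 for t in times)
    return {
        "mean": mean_value,
        "peak": max(values),
        "auc": area,
        "gradient": numerator / denominator,
    }


# Hypothetical measurements for one child at four visits.
child_summary = summary_measures([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
print(child_summary)
```

The summary should be chosen in advance to match the clinical question, for example the peak for a response that rises and falls, or the gradient for steady growth.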

*4. Reference values for a healthy population* |

Reference range | How do I create reference values for a normal, healthy population? | Take measurements from a large consecutive or random sample of children from schools or community | Mean, 95% reference range†, or median, 2.5th and 97.5th centiles† |

Gender specific | What if the measurements vary by gender (usually identified by a bimodal distribution)? | Separate the sample measurements into those for boys and girls | Calculate the reference range for each gender group separately |

Age specific | What if the measurements vary for children of different ages? | Separate sample measurements into age groups; use the whole sample | Calculate age-specific reference ranges for each age group separately; linear regression, fractional polynomials† Logistic regression, regression splines‡ |

*z*-scores | How do I measure and compare deviations from the average across different groups of children? | | Express measurements in terms of *z*-score units from the mean (i.e. reference mean for that age and gender) and compare summary statistics for *z*-scores across groups |

Anthropometric measurements | Is there a way of dealing with variability in growth and non-linear change, e.g. in weight for age, weight for height and BMI? | | *z*-scores, LMS (lambda, mu, sigma) method (skewed data), generalized additive models for location, scale and shape |

Problems: need large samples of children from the reference population; need established reference means or medians to calculate *z*-scores; more sophisticated methods (e.g. LMS and generalized additive models for location, scale and shape) require specialist software |
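The simplest calculations in this section, a 95% reference range from a healthy sample and a *z*-score against an age- and gender-specific reference mean, can be sketched as follows. The sketch assumes approximately normally distributed measurements; skewed data call for centiles or the LMS method instead. The function names, the reference table and all values are hypothetical.

```python
import statistics


def reference_range(values):
    """95% reference range (mean -/+ 1.96 SD) from a healthy reference sample."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return mean - 1.96 * sd, mean + 1.96 * sd


def z_score(value, reference_mean, reference_sd):
    """Deviation of one measurement from the reference mean, in SD units."""
    if reference_sd <= 0:
        raise ValueError("reference SD must be positive")
    return (value - reference_mean) / reference_sd


# Hypothetical reference values keyed by (age in years, gender).
reference_table = {
    (6, "girl"): {"mean": 20.0, "sd": 2.0},
    (6, "boy"): {"mean": 21.0, "sd": 2.5},
}

reference = reference_table[(6, "girl")]
z = z_score(22.0, reference["mean"], reference["sd"])
low, high = reference_range([8.0, 10.0, 12.0])  # far too few children in practice
print(f"z-score {z:.2f}; reference range {low:.2f} to {high:.2f}")
```

Because each *z*-score is expressed relative to the child's own age and gender group, summary statistics of *z*-scores can be compared across groups that differ in age or gender mix.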