Comparison of four statistical approaches to score child development: a study of Malawian children


Corresponding Author Yin-Bun Cheung, CTERU, Block A #03-02, 226 Outram Road, Singapore 169039. Tel.: +65 6325 7107; Fax: +65 6324 2700; E-mail:


Objective  Assessment of child development often results in a multitude of binary outcome data. There is no agreed way to use them to score the developmental status of children. Conventional methods include age-standardized Z-scores and simple sum of number of passes. Recently two approaches based on the Rasch model and the concept of ‘developmental age’ have been proposed. This study aims to compare the performance of the four approaches.

Methods  In a longitudinal study, 473 Malawian children were measured for growth status at age 36 months and administered a new test of developmental milestones between age 3 and 6 years. The test consisted of four domains: gross motor (GM), fine motor (FM), social and language development. The four approaches were used to score the developmental level of each child in each domain, and the results compared.

Results  In this sample, the approach based on the Rasch model provided development scores that were more normally distributed than the other approaches did. The four sets of scores were highly correlated with each other. They gave similar estimates of the effect of height-for-age on GM, social and language development. In FM development, the maximum difference in the effect size estimates was only 0.04 standard deviation despite its statistical significance (P = 0.009).

Conclusion  The four approaches were practically equivalent in the context of the estimation of an intervention effect or association. Their relative advantages and disadvantages are discussed. None of them can be universally recommended.




Child development is an important aspect of child health and an important step to reaching the Millennium Development Goals, but many children in developing countries are failing to achieve their developmental potential (Grantham-McGregor et al. 2007). In epidemiological and intervention studies, an association with child development may be quantified as an effect size, indicating the degree of change in the response (in standard deviation, SD) per unit change in the exposure variable or in the presence of an intervention (Machin et al. 1997). Child development can be assessed by inventories of milestones. A tester assesses the child’s ability in performing a series of tasks. A child may succeed or fail to pass each of these items, resulting in a multitude of binary outcomes. Some examples of commonly used inventories include the Bayley Scales of Infant Development (Bayley 1993) and the Griffiths Scale of Mental Development (Griffiths 1970). Many of the commonly used development tests were originally developed in Western Europe or Northern America. Cultural adjustment and appropriate modification are needed before they can be used in other societies. One such recent work is a test developed for use in children up to age 6 years in Malawi, southern Africa (Gladstone et al. 2008). The test consists of four domains: gross motor (GM), fine motor (FM), social and language development, and 110 items in all.

Having administered a developmental test, the challenge is to use the multitude of binary data to score the developmental status of a child. How to use these data in epidemiological and intervention studies has remained debatable (Jacobusse et al. 2006; Drachler et al. 2007; Jacobusse & Buuren 2007). One simple approach is to count the total number of successes. This approach has two limitations.

First, it does not adjust for variation in age at assessment. So, the same score may represent different developmental statuses for children at different ages. Consequently, it is quite common to use an approach that standardizes the scores according to age-specific reference data and results in an age-standardized Z-score. The Bayley Scales, for example, use this approach (Bayley 1993). Similarly, some instruments adjust for age and create percentile scores instead of Z-scores. These two methods share a similar basis but the percentile scores tend to give a non-normal, flat distribution. Here we will only discuss the Z-scores.

Second, the counting of successes allows all items to contribute equally to the raw scores and Z-scores regardless of the items’ difficulty level. Drachler et al. (2007) recently proposed an approach to score child development that takes into account the difficulty level of each item. The resultant developmental score is the natural logarithm of the ratio of the child’s ‘ability age’ to actual age. A negative (positive) value means developmental delay (advance). One drawback of this score is that it cannot be estimated for children who either pass or fail all test items.

Another recently proposed approach offers a quantitative developmental score based on the Rasch model (Jacobusse et al. 2006; Jacobusse & Buuren 2007). This approach does not standardize for age, but it has the advantage that the scores are on the same metric at all ages and can therefore be compared across ages. Under this approach, children who have the same number of passes will have the same score even if they pass different items with different levels of difficulty. There is no consensus on the most appropriate way to score child development. Both the simple count (SC) and Z-scores are seen in the literature. Some studies analysed the items separately, leading to a problem of multiple testing.

To our knowledge there has been no empirical comparison of the four scoring approaches. This article aims to compare the four approaches to the scoring of child development in the context of paediatric epidemiology and intervention studies by drawing on data from the Malawian study (Gladstone et al. 2008). In this context, the focus of analysis and the motivation of scoring child development usually concern the estimation of an association between child development (response variable) and a risk factor or intervention (exposure variable), or the effect size. This article does not aim to study how best to use developmental tests in clinical services.


Participants and data collection

The Lungwena Child Survival Study is an ongoing prospective cohort study of children born in 1995 and 1996 in Lungwena, a rural area in southern Malawi. Anthropometric measurements were collected at monthly intervals from birth up to the age of 18 months, 3-month intervals from 18 to 60 months and wider intervals thereafter. Details of the cohort study have been described previously (Maleta et al. 2003). A sub-study was conducted to develop an inventory for the assessment of child development. The sample for this sub-study was the cohort of survivors aged between 3 and 6 years and their younger siblings, excluding those who had significant disability, severe malnutrition, ≤34 weeks of gestation or twin births. The children were assessed on one occasion in 2000 to 2001 on a home visit by research assistants trained by a paediatrician in the study team (MG). The assessment took approximately 35 min to complete. Items were scored as either pass or fail, or ‘don’t know’ if the child was uncooperative or unwell. Items were administered until the child failed seven consecutive items in the same domain. After this, it was assumed that the child would not attain any further milestones in that domain. Details of the sub-study on child development have been described earlier (Gladstone et al. 2008). The present work is a secondary analysis of the previously reported data. Only the cohort members were included in this work. The study was approved by the National Health Science Research Committee, Malawi.

Scoring algorithms

All procedures deal with each of the four domains of development separately, giving four domain scores for each child. The first approach simply counts the total number of items passed. In this article we will call this the SC.

The second approach standardizes the SC for age. For the present study, we do not have a separate population reference. Hence the standardization is internal. The children assessed were arranged into quintiles according to age. Within each quintile the mean and SD were calculated and a Z-score was obtained for each child by subtracting the age-specific mean from the child’s SC and then dividing the difference by the age-specific SD.
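The internal standardization described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `age_standardized_z`, the use of the sample SD, and the rule for splitting the age-sorted children into five nearly equal groups are all assumptions.

```python
from statistics import mean, stdev

def age_standardized_z(ages, counts, n_groups=5):
    """Internal Z-scores: standardize each simple count within its age group.

    Children are sorted by age and cut into n_groups nearly equal groups
    (quintiles when n_groups=5). How ties at group boundaries are handled
    is an assumption of this sketch.
    """
    n = len(ages)
    order = sorted(range(n), key=lambda i: ages[i])
    z = [0.0] * n
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        group = [counts[i] for i in idx]
        # Sample SD (assumption); raises an error if a group has < 2 children,
        # and divides by zero if all counts in a group are identical.
        m, s = mean(group), stdev(group)
        for i in idx:
            z[i] = (counts[i] - m) / s
    return z

# Hypothetical data: ten children, two per age group.
ages = [3.1, 3.3, 3.5, 3.7, 3.9, 4.1, 4.3, 4.5, 4.7, 4.9]
counts = [5, 7, 6, 8, 9, 11, 10, 12, 13, 15]
z_scores = age_standardized_z(ages, counts)
```

Because the reference is internal, the Z-scores have mean zero within each age group by construction, so they describe a child's standing relative to same-aged peers in this sample only.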

The third approach contrasts the ‘ability age’ and the actual age of a child. The score is computed in three steps (Drachler et al. 2007). First, a logistic regression model with the natural logarithm of age at assessment as an independent variable is estimated for each binary item. This gives the intercept (αj) and regression coefficient (βj) that characterize the probability of success or failure in test item j in relation to age. Second, separately for each child, a logistic regression model:

$$\operatorname{logit} P(Y_j = 1) = \alpha_j + \beta_j \log(\text{ability age})$$

is estimated, where Yj, j = 1, 2, …, p, are the binary items and p is the total number of items. In this equation, both αj and βj are known values from step 1: αj enters as an ‘offset’ term with a fixed coefficient of 1 and βj serves as the independent variable. The log (ability age) is the regression coefficient to be estimated from the child’s data. This is the value that maximizes the likelihood of observing the profile of test outcomes of this child. Finally, the score is computed as the log of the ratio of the estimated ability age to actual age at assessment. We will refer to this score as the log age ratio (LAR) for brevity. A disadvantage of this scoring method is that it cannot give a score to children who either fail or pass all items. In such cases the maximization process does not converge and a missing value is assigned (Drachler et al. 2007). We will discuss alternatives to the treatment of this problem later.
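The second and third steps can be sketched in Python as follows, assuming the per-child model is logit P(Yj = 1) = αj + βj × log(ability age), as the description implies. The function name `log_age_ratio`, the Newton-Raphson solver with a capped step, and the example item parameters are assumptions made for illustration; they are not taken from Drachler et al. (2007).

```python
import math

def log_age_ratio(y, alpha, beta, age, max_iter=100, tol=1e-10):
    """Estimate log(ability age) by maximum likelihood, then return the LAR.

    y      : list of 0/1 item outcomes for one child
    alpha  : item intercepts from step 1 (treated as known offsets)
    beta   : item slopes on log(age) from step 1 (assumed positive)
    age    : actual age at assessment, in the same units used in step 1
    Returns None when the child passes or fails every item, because the
    likelihood then has no finite maximum.
    """
    if all(v == 1 for v in y) or all(v == 0 for v in y):
        return None
    t = math.log(age)  # starting value: ability age equal to actual age
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(a + b * t))) for a, b in zip(alpha, beta)]
        score = sum(b * (yi - pi) for b, yi, pi in zip(beta, y, p))
        info = sum(b * b * pi * (1.0 - pi) for b, pi in zip(beta, p))
        step = max(-1.0, min(1.0, score / info))  # capped Newton step
        t += step
        if abs(step) < tol:
            break
    return t - math.log(age)

# Hypothetical item parameters: four items of increasing difficulty.
alpha = [-1.0, -2.0, -3.0, -4.0]
beta = [2.0, 2.0, 2.0, 2.0]
lar = log_age_ratio([1, 1, 0, 0], alpha, beta, age=4.0)
```

In this hypothetical example the child passes the two easier items and fails the two harder ones, which gives an estimated log ability age of 1.25 and therefore a negative LAR for a child whose actual age is 4.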

The fourth approach is based on the Rasch model (Jacobusse et al. 2006; Jacobusse & Buuren 2007), which describes the probability that child i passes item j as

$$P(Y_{ij} = 1) = \frac{\exp(\theta_i - \delta_j)}{1 + \exp(\theta_i - \delta_j)}$$

where Yij is the binary outcome for child i on item j, θi is the ability parameter for child i and δj is the difficulty parameter for item j. The estimate of θi scaled to have mean 50 and SD 10 is the developmental score Jacobusse and colleagues proposed. We will refer to this as the Rasch score (RS). In contrast to the LAR, the RS can be estimated for all children, including those who fail or pass all items (Hoijtink & Boomsma 1995). We estimated the Rasch model using a Stata programme written by Hardouin (2007).
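The following Python sketch illustrates the standard Rasch response function and why children with the same number of passes receive the same score. It is not the Stata routine used in the study: the item difficulties are hypothetical and treated as known, ability is estimated by plain maximum likelihood (which, unlike the estimator cited above, has no finite solution for all-pass or all-fail profiles), and the function names and the use of the sample SD in the rescaling are assumptions.

```python
import math
from statistics import mean, stdev

def rasch_prob(theta, delta):
    """Probability of passing an item of difficulty delta at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def theta_ml(responses, deltas, max_iter=100, tol=1e-10):
    """Maximum likelihood ability estimate for known item difficulties.

    The likelihood equation is sum_j P_j(theta) = number of passes, so the
    estimate depends on the responses only through their total.
    """
    r = sum(responses)
    if r == 0 or r == len(responses):
        return None  # plain ML is not finite for extreme profiles
    theta = 0.0
    for _ in range(max_iter):
        p = [rasch_prob(theta, d) for d in deltas]
        step = (r - sum(p)) / sum(pi * (1.0 - pi) for pi in p)
        step = max(-1.0, min(1.0, step))  # capped Newton step
        theta += step
        if abs(step) < tol:
            break
    return theta

def rescale(thetas, target_mean=50.0, target_sd=10.0):
    """Linearly rescale ability estimates to a chosen mean and SD."""
    m, s = mean(thetas), stdev(thetas)
    return [target_mean + target_sd * (t - m) / s for t in thetas]

deltas = [-1.0, 0.0, 1.0]               # hypothetical item difficulties
theta_a = theta_ml([1, 1, 0], deltas)   # passes the two easier items
theta_b = theta_ml([0, 1, 1], deltas)   # passes the two harder items
```

Both response patterns contain two passes, so `theta_a` and `theta_b` coincide. The score is therefore a monotone function of the simple count on a common set of items.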

Statistical analysis

We compared the four scores in three respects. First, we described and contrasted their distributions and examined which of them tended to be more normally distributed, using the V statistic described in Royston (1991). V > 1.4 represents a statistically significant deviation from normality (P < 0.05); a larger V indicates a larger degree of deviation from normality. Second, we estimated the pair-wise Pearson’s and Spearman’s correlation coefficients between the scores in order to see how similar they are. Third, we calculated the height-for-age Z-score (HAZ) at age 36 months using the WHO growth standards (WHO Multicentre Growth Reference Study Group 2006) and estimated and compared the effect size (in standard deviation) per unit change in HAZ by ordinary least squares regression, with gender and age at assessment as covariates. Testing for equality of effect sizes across the four scores was based on Zellner’s Seemingly Unrelated Regression method (Greene 2003).
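The effect size estimation in the third step can be sketched as below. This is an illustration under stated assumptions: the score is divided by its sample SD before regression so that the HAZ coefficient is expressed in SD units, the normal equations are solved directly, and the function names and data are hypothetical. The Seemingly Unrelated Regression test of equal effect sizes is not reproduced here.

```python
from statistics import stdev

def ols(X, y):
    """Least squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def effect_size_per_haz(score, haz, male, age):
    """Change in the score, in SD units, per unit increase in HAZ,
    adjusted for gender and age at assessment."""
    sd = stdev(score)
    y = [s / sd for s in score]
    X = [[1.0, h, m, a] for h, m, a in zip(haz, male, age)]
    return ols(X, y)[1]  # coefficient on HAZ

# Hypothetical data for six children.
haz = [-2.0, -1.0, 0.0, 1.0, -1.5, 0.5]
male = [1, 0, 1, 0, 0, 1]
age = [3.2, 4.1, 5.0, 3.8, 4.6, 5.5]
score = [2 + 0.5 * h + 1.0 * m + 3.0 * a for h, m, a in zip(haz, male, age)]
es = effect_size_per_haz(score, haz, male, age)
```

Expressing every score in units of its own SD is what makes the coefficients comparable across the four scoring approaches, whose raw scales differ.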

The present sample was assessed at mean age 4.2 years (SD = 0.7) and so the items for infants and younger children are irrelevant (i.e. all passes). Only items with a pass rate below 100% in this sample were retained for the analysis. The developmental scores for each domain were estimated and analysed for subjects with non-missing values in HAZ and items in that domain. For the purpose of comparison with the LAR, which is not estimable if a child passes or fails all items, the analysis excluded these subjects, but they were included in the processes of deriving the other three scores.


Participant profile

A total of 473 cohort members (233 males) were assessed; the median age was 4.05 years (range 3.13–6.15 years). The number of subjects with no missing values in HAZ at 36 months and the milestones for inclusion in the studies of GM, FM, social and language development were 384, 360, 399 and 389, respectively.

Distribution of developmental scores

Table 1 is a descriptive summary of the four scores for each of the four domains of child development. There were 20, 48, 31 and 36 children, respectively, who passed or failed all GM, FM, social and language items and therefore had no LAR. They were excluded from the analysis and comparison here. Most scores were negatively skewed (skewness < 0) in this age range and more pointed (kurtosis > 3) than a normal distribution would be. None of the score distributions shows evidence of normality. However, in the GM, FM and language domains, the RS were closer to normality than the other scores (V = 4.1, 13.4 and 3.2, respectively). Especially in GM and language development, the RS’s degree of skewness (−0.5 and −0.4) and kurtosis (3.2 and 3.3) departed only mildly from the normal values of 0 and 3. The LAR of social development was closer to normality than the other three scores (V = 5.6).
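For reference, the skewness and kurtosis conventions used in this comparison (0 and 3 for a normal distribution) correspond to the moment-based definitions sketched below. The function name is hypothetical, the data are illustrative, and Royston's V statistic is not computed here.

```python
def skewness_kurtosis(x):
    """Moment-based skewness (m3 / m2^1.5) and kurtosis (m4 / m2^2).

    Under these definitions a normal distribution has skewness 0 and
    kurtosis 3, so kurtosis > 3 indicates a more pointed distribution.
    """
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Illustrative symmetric data: skewness is 0 and kurtosis is below 3.
skew, kurt = skewness_kurtosis([1, 2, 3, 4, 5])
```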

Table 1.   Distribution of four developmental scores in four domains
Domain | Statistics | Simple count | Z-score | Log age ratio | Rasch score
Gross motor (18 items; n = 364) | Mean | 13.3 | −
Fine motor (13 items; n = 312) | Mean | 9.1 | −
Social (13 items; n = 368) | Mean | 9.8 | −
Language (10 items; n = 353) | Mean | 6.2 | −0.1 | −0.1 | 48.0

Correlation analysis

Table 2 presents the correlations between the scores. The Spearman’s correlation coefficients show that the two scores without age adjustment, that is SC and RS, had exact agreement in ranks (coefficients = 1.00). The two scores with age adjustment, that is Z-score and LAR, were also strongly correlated (from 0.80 to 0.92). Correlation between the scores with and without age adjustment was more modest (from 0.61 to 0.83). Pearson’s correlation coefficients gave similar results.

Table 2.   Pearson’s and Spearman’s correlation coefficients between developmental scores
Domain | | Pearson | Spearman
Gross motor | Z-score | 0.71 0.76 | 0.61 0.65
Fine motor | Z-score | 0.71 0.88 | 0.64 0.83
Social | Z-score | 0.84 0.79 | 0.83 0.80
Language | Z-score | 0.75 0.76 | 0.68 0.77

Comparison of effect size

Table 3 shows the effect size per unit increase in height-for-age at 36 months, with sex and age as covariates in the regression analysis. In GM development, the four scores increased by about a quarter of a standard deviation (0.23–0.28) per SD increase in HAZ. There was no statistically significant difference between the four effect size estimates (P = 0.070). The Z-statistics (effect size divided by standard error of effect size) were also similar (from 5.62 to 6.46). In FM development, the effect size estimates for SC, Z-score and RS were all 0.28; that for LAR was 0.32. The difference between these effect size estimates was only 0.32 − 0.28 = 0.04 SD, although it was statistically significant (P = 0.009). In the social domain, the effect size estimates were slightly below 0.2 SD (0.17–0.20). There was no significant difference between the four estimates (P = 0.428). The Z-statistics of all the aforementioned effect sizes were larger than 2.58 and so all were statistically significant at the 1% level (P < 0.01), indicating associations between HAZ and these three aspects of child development. HAZ was not associated with language development scores: the four effect size estimates were all close to zero and not statistically significant. Again, there was no significant difference between the effect size estimates (P = 0.320).

Table 3.   Effect size in relation to one unit increase in height-for-age at 36 months†
Domain | | Simple count | Z-score | Log age ratio | Rasch score | Test for equal effect size (P-value)
Gross motor | Effect size | 0.
Fine motor | Effect size | 0.28 | 0.28 | 0.32 | 0.28 | 0.009
Social | Effect size | 0.
Language | Effect size | −0.04 | −0.05 | −0.02 | −0.04 | 0.320

†Effect size in SD per unit increase in height-for-age at 36 months, estimated by ordinary least squares regression with adjustment for gender and age at assessment.
Z = effect size divided by standard error of effect size.


Tests of child development typically result in a large array of binary data. To avoid the inflated risk of type I error resulting from multiple testing and to avoid being overwhelmed by a multitude of data, it is desirable to develop a summary measure of child development. There is no universally agreed way to do this. One common concern is whether the outcome scores follow a normal distribution. This concern is often overemphasized. The t-test and ordinary least squares regression are robust to deviations from normality, especially when the sample size is large (Heeren & D’Agostino 1987; Gujarati 1995; Cheung et al. 2008). This concern is more relevant when the sample size is small, though there is no fixed rule on how small is too small. In the present comparison in a Malawian sample, none of the four methods provided developmental scores that closely follow a normal distribution. Nevertheless, the RS performed better in this regard, providing scores that had smaller V, skewness closer to zero and kurtosis closer to three than the others in several domains.

Despite very different conceptual frameworks and technical procedures, the four scores are strongly correlated. The relatively modest correlation between the age-adjusted and -unadjusted scores is a result of the children being measured at variable ages. The strong correlation coefficients suggest that, in research practice, the practical difference in employing the different scoring methods may not be significant.

In paediatric epidemiology and intervention studies, the primary objective is often to estimate an association, or effect size. Related to this is the testing of the null hypothesis of effect size 0. In the context of behavioural sciences, Cohen suggested that an effect size of about 0.2 SD is a small effect (Machin et al. 1997). Stunting is an established predictor of a range of developmental and educational outcomes (Grantham-McGregor et al. 2007). We have compared the effect sizes in relation to one unit increase in HAZ at age 36 months. The use of different scoring algorithms made only minor variations in the effect size estimates. In three of four domains, the variations were not statistically significant. In FM development, where the variation in effect size estimates was statistically significant (P = 0.009), three methods gave an identical estimate of 0.28 whereas the LAR gave an estimate of 0.32. The difference between these high and low estimates was only 0.04 SD per unit increase in HAZ, which is substantially smaller than the ‘small’ effect suggested by Cohen. It would take a five-unit difference in HAZ for the discrepancy between the LAR and the other three scores to accumulate to a ‘small’ level of 0.2 SD. Such a minor difference between scoring algorithms is unlikely to be of scientific significance in paediatric research.
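The arithmetic behind the five-unit figure can be written out as follows, where ΔHAZ denotes the difference in HAZ between two children (a symbol introduced here only for illustration):

```latex
\[
(0.32 - 0.28)\ \text{SD per HAZ unit} \times \Delta\text{HAZ} = 0.2\ \text{SD}
\quad\Longrightarrow\quad
\Delta\text{HAZ} = \frac{0.2}{0.04} = 5
\]
```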

We maintain that for many epidemiological and intervention studies, where the purpose is to estimate an effect size, there is no practical difference between the methods. Nevertheless, other research purposes may arise from time to time and they may be better served by one of the approaches. For example, the RS is not standardized for age and the values are comparable across age groups. If one’s purpose is to study the acceleration and deceleration of development in relation to age, this is likely to be the method of choice.

One may also consider ease and meaning in the presentation and interpretation of data. For example, the mean RS has no intrinsic meaning because it depends on how the score is scaled. In the proposal of Jacobusse and colleagues, it is scaled to have a mean of 50 (Jacobusse et al. 2006; Jacobusse & Buuren 2007). In contrast, the LAR (or its exponential, the ratio of ability age to actual age) has a straightforward interpretation: it indicates whether a child’s ability age is above or below his/her actual age.

One drawback of the LAR as proposed by Drachler et al. (2007) is that it is not defined if a child passes or fails all test items, because the estimation of ‘ability age’ does not converge. This may be a minor issue if the test items vary substantially in difficulty for a sample of children in a particular age range. In such situations the number of children with all passes or all failures would be small. Otherwise, the missing values in LAR would mean not only a reduced sample size but also a possibility of bias. Hence, the other three methods may be preferred. One way to deal with this problem of the LAR is to set the missing values to a high (or low) score for cases with all passes (or failures) and analyse them as right (or left) censored values. The calculable highest and lowest ‘ability age’ may be used to form the censoring thresholds. The possible statistical procedures for the analysis of censored data include Tobit (Greene 2003), censored least absolute deviations (Powell 1984) and the more typical survival analysis techniques (Machin et al. 2006). However, some of these techniques to deal with censoring, for example Tobit, are not robust to violations of their distributional assumptions.
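To make the censoring idea concrete, the sketch below evaluates a normal-theory log-likelihood of the kind that a Tobit analysis maximizes, for a model with a single mean and SD. It is an illustration only: the function name, the status coding and the example values are assumptions, and a full Tobit regression would replace the constant mean with a linear predictor and maximize over its parameters.

```python
import math
from statistics import NormalDist

def censored_normal_loglik(values, status, mu, sigma):
    """Log-likelihood of LAR values under a normal model with censoring.

    status:  0 = LAR observed
             1 = right-censored at `value` (child passed all items)
            -1 = left-censored at `value` (child failed all items)
    For censored children, `value` is the censoring threshold, for example
    the highest or lowest calculable score.
    """
    dist = NormalDist(mu, sigma)
    total = 0.0
    for v, s in zip(values, status):
        if s == 0:
            total += math.log(dist.pdf(v))        # density of observed score
        elif s == 1:
            total += math.log(1.0 - dist.cdf(v))  # P(true score > threshold)
        else:
            total += math.log(dist.cdf(v))        # P(true score < threshold)
    return total

# Hypothetical example: two observed scores, one child who passed all items
# (right-censored at 0.6) and one who failed all items (left-censored at -0.7).
ll = censored_normal_loglik([0.1, -0.2, 0.6, -0.7], [0, 0, 1, -1], mu=0.0, sigma=0.3)
```

Censored children contribute a tail probability rather than a density, so they remain in the analysis instead of being dropped as missing. The reliance on the normal distribution in these terms is the source of the sensitivity to distributional assumptions noted above.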

In conclusion, this empirical comparison of the four approaches to the scoring of child development suggests that the four methods provide scores that are highly correlated and equivalent for the purpose of estimation of effect size. The simplest approach of counting the total number of successes is as useful as the much more statistically advanced methods in this context. For studies with smaller sample size, where normality in the data is a concern, there is some sign that the RS is preferable. It may be that at other times the research purposes and consideration of interpretability may require a particular method. One needs to consider the drawback of the LAR in not providing scores for all subjects. An approach to deal with this as a censoring problem is proposed.


The study was funded by grants from the Academy of Finland (grants 200720 and 109796), the Foundation for Paediatric Research in Finland, and the Medical Research Fund of Tampere University Hospital. XD received a scholarship from the Finnish Centre for International Mobility. The funders played no role in the study’s implementation, analysis or reporting.