Measurement Issues: Screening and diagnostic instruments for autism spectrum disorders – lessons from research and practice
Background and Scope
Significant progress has been made over the past two decades in the development of screening and diagnostic instruments for autism spectrum disorders (ASDs). This article reviews this progress, including recent innovations, focussing on those instruments for which the strongest research data on validity exists, and then addresses issues arising from their use in clinical settings.
Research studies have evaluated the ability of screens to prospectively identify cases of ASD in population-based and clinically referred samples, as well as the accuracy of diagnostic instruments to map onto ‘gold standard’ clinical best estimate diagnosis. However, extension of the findings to clinical services must be done with caution, with a full understanding that instrument properties are sample-specific. Furthermore, we are limited by the lack of a true test for ASD which remains a behaviourally defined disorder. In addition, screening and diagnostic instruments help clinicians least in the cases where they are most in want of direction, as their accuracy will always be lower for marginal cases.
Instruments help clinicians to collect detailed, structured information and increase accuracy and reliability of referral for in-depth assessment and recommendations for support, but further research is needed to refine their effective use in clinical settings.
Key Practitioner Message
- Over the past two decades screening instruments (many) and diagnostic measures (few) for ASD have been developed, although few are well-evaluated
- Extension of research findings to clinical services must be done with caution, understanding that instrument properties are sample-specific
- Screening and diagnostic instruments help clinicians least in the cases where they are most in want of direction as their accuracy will always be lower for marginal cases
- Time-consuming and expensive ASD-specific diagnostic instruments will not always be feasible or appropriate, but clinical teams benefit from practitioners being trained in these methods
Autism spectrum disorders (ASDs) affect approximately 1% of children (Baird et al., 2006; Center for Disease Control, 2009). The term ASD is used to describe a range of neurodevelopmental conditions that demonstrate considerable phenotypic heterogeneity, both in terms of presentation at any one age and across development, and which are likely to differ in underlying aetiology (Levy, Mandell & Schultz, 2009); this has led to biologists adopting the term ‘the autisms’ (Geschwind & Levitt, 2007). However, individuals with an ASD share common impairments in social relatedness and reciprocity, the use of language for communication and rigid and repetitive ways of thinking and behaving (DSM-IV-TR; American Psychiatric Association, 2000; ICD-10; World Health Organisation, 1993). Proposed revision for DSM-5 combines the social and communication impairments into one domain and includes sensory with repetitive behaviours (www.dsm5.org).
Notwithstanding progress in our understanding of the genetic and neurodevelopmental processes that lead to ASD, clinical diagnosis is reliant on the developmental and behavioural presentation. Until the 1990s it was rare for children to receive a diagnosis of autism until the age of 3 or 4 years. Many children are now first identified in toddlerhood, although others, in particular those with average or above average language and cognitive abilities, are not diagnosed until school age or older (Charman & Baird, 2002; Mandell, Novak & Zubritsky, 2005; Manning et al., 2011).
With the recognition that ASDs are relatively common, and in many but not all cases can be identified early, there has been both burgeoning clinical interest and research activity in the development of ASD-specific screening and diagnostic instruments. In the present review, we will focus on those instruments for which the strongest research data on validity exists; this applies to many more screeners than diagnostic instruments. One critical comment is that it would be better for the clinical and research fields if we knew more about fewer screening instruments, rather than the somewhat inadequate information on validity that we do have available for this ever-increasing list of ASD screens (see Al-Qabandi, Gorter & Rosenbaum, 2011; for a critical review on early screening).
Issues in screening and surveillance
Screening and surveillance are different, but related activities involving the detection of impairments with a view to prevention or amelioration of disability. Screening is the prospective identification of unrecognised disorder by the application of specific tests or examinations. Surveillance refers to the ongoing and systematic collection of data relevant to the identification of a disorder over time by an integrated health system.
Several parameters of screening instruments are important in assessing their efficacy and utility:
- Sensitivity is the proportion of individuals with a disorder who have a positive screen result,
- Specificity is the proportion of individuals without a disorder who have a negative screen result,
- Positive predictive value (PPV) is the proportion of individuals with a positive screen result who have the disorder.
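These three parameters can be written compactly in terms of the cells of a two-by-two table of screen result against true diagnostic status. The symbols TP, FP, FN and TN (true positives, false positives, false negatives and true negatives) are notation introduced here for illustration and are not used elsewhere in this review.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{PPV} = \frac{TP}{TP + FP}
\]
```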
Sensitivity is required to be high in order that the screen misses few cases of the disorder (avoiding falsely reassuring parents and professionals). Specificity is required to be high in order that few cases without the disorder are screen positive (avoiding falsely alarming parents and costly referral for in-depth assessment). When the sensitivity and specificity of a screen remain constant, the PPV is lower the rarer a disorder is within the population (Clark & Harrington, 1999). Hence, PPV will be lower in population than in referred samples. Glascoe (1996) has estimated that acceptable sensitivity and specificity for developmental screening tests are 70%–80%, reflecting the nature and complexity of measuring the continuous process of child development (American Academy of Pediatrics, 2001; National Institute of Clinical Excellence, 2011).
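The dependence of PPV on prevalence follows from Bayes' rule. The following is a minimal sketch only: the function name `ppv`, the screen with sensitivity and specificity of 0.80, and the 30% prevalence for a referred sample are assumptions chosen for illustration, not figures from any study reviewed here.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence                    # P(screen+, disorder)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(screen+, no disorder)
    return true_pos / (true_pos + false_pos)


# The same hypothetical screen (Se = Sp = 0.80) applied in two settings.
population = ppv(0.80, 0.80, 0.01)  # prevalence of about 1%, as in an unselected population
referred = ppv(0.80, 0.80, 0.30)    # assumed prevalence in a referred sample

print(f"PPV, population sample: {population:.2f}")
print(f"PPV, referred sample:   {referred:.2f}")
```

With sensitivity and specificity held fixed, the PPV in this sketch is far lower in the low-prevalence setting, which is the pattern described above for population versus referred samples.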
Surveillance involves a parent-professional partnership that combines the observations of parents with the knowledge of the professional and the deployment of specific tests. There is evidence that the use of screening instruments in combination with asking parents about their concerns improves the efficiency of an instrument (Glascoe, 1999). In this context, the use of specific screens where there is some concern on the part of the parent or professional can be a useful adjunct to clinical judgement.
Population prospective screening studies
Fewer than a handful of population screening studies have been conducted, and even fewer have undertaken the long-term follow-up required to fully ascertain sensitivity – i.e., identifying cases missed by systematically revisiting the sample at a later age point. Baron-Cohen and colleagues developed the Checklist for Autism in Toddlers (CHAT) that assesses simple joint attention and pretend play behaviours by parental report and health practitioner observation through direct testing. In a sample of 41 18-month-old younger siblings of children diagnosed with autism, the four children in the at-risk group who were subsequently diagnosed with autism were identified by the CHAT (Baron-Cohen, Allen & Gillberg, 1992). Baird et al. (2000) went on to test the effectiveness of the CHAT in a population of 16,235 18-month-olds as part of routine health surveillance. Children who failed all five joint attention and pretend play key items were the ‘high risk for autism’ group; children who failed both items measuring protodeclarative pointing (pointing for interest) were the ‘medium risk for autism’ group. To minimise false positives, a two-stage screening procedure was adopted. Children who were initially screen positive (at the high or medium risk threshold) received a second administration of the screen 1 month later, via a telephone follow-up. Used in this two-stage way, the PPV for all ASDs of the high and medium risk thresholds combined was good (59%). However, sensitivity was poor (21%), indicating that four fifths of children subsequently identified as having ASD in the study population were missed by the screen. Had only a one-stage screening procedure been adopted, sensitivity would have increased to 35%, although in clinical use this would have entailed the assessment of more screen false positives.
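The CHAT risk thresholds and the two-stage procedure described above can be sketched as a simple decision rule. This is an illustrative sketch only: the item labels are placeholders rather than the CHAT's own item names, the label "none" for screen negatives is an assumption, and the sketch assumes that the two pointing items are among the five key items and that a child is screen positive only if a risk threshold is reached on both administrations.

```python
def chat_risk(failed_items: set, key_items: set, pointing_items: set) -> str:
    """Return the risk tier for one administration of the screen."""
    if key_items <= failed_items:        # failed all five key items
        return "high"
    if pointing_items <= failed_items:   # failed both protodeclarative pointing items
        return "medium"
    return "none"                        # assumed label for screen negatives


def two_stage_positive(first: str, second: str) -> bool:
    """Screen positive only if both administrations reach a risk threshold."""
    return first != "none" and second != "none"


# Placeholder item labels (hypothetical).
KEY_ITEMS = {"k1", "k2", "k3", "k4", "k5"}
POINTING_ITEMS = {"k2", "k5"}

first = chat_risk({"k2", "k5", "x9"}, KEY_ITEMS, POINTING_ITEMS)  # first administration
second = chat_risk({"k5"}, KEY_ITEMS, POINTING_ITEMS)             # readministration 1 month later
print(first, second, two_stage_positive(first, second))
```

In this example the child reaches the medium risk threshold at the first administration but not at the second, so would not be counted as screen positive under the two-stage rule.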
Buitelaar and colleagues in the Netherlands developed a screening instrument (ESAT: Early Screening of Autistic Traits) to identify ASD (Willemsen-Swinkels, Dietz, & van Daalen, 2006). Dietz, Willemsen-Swinkels, Daalen, Engeland and Buitelaar (2006) completed screening of 31,724 children at 14 months of age. Health practitioners at a well-baby clinic appointment administered an initial screen of four items (measuring varied play with toys, readability of emotional expression and sensory abnormalities). If a child failed one or more of the four items they were offered a follow-up home visit where a longer version of the ESAT (14 items that included many social communication items, such as eye contact, response to name, etc.) was administered. Children who failed three or more items of the 14-item ESAT were invited for a diagnostic evaluation at a mean age of 23 months. The ESAT did correctly identify children with ASD (n = 18) but also identified children with language disorder (LD; n = 18) and intellectual disability (ID; n = 13). The items that discriminated best between children with and without ASD were items assessing early social communication impairments, including ‘shows interest in people’, ‘smiles directly’ and ‘reacts when spoken to’. However, at this earlier age compared to the CHAT study (14 vs. 18 months) these early signs appear to be less specific to ASD, identifying almost as many children who went on to have LD and ID. This and other studies cannot report on the sensitivity of these signs to all cases of ASD until a population follow-up of the whole sample has been conducted to identify missed cases.
Screening in referred samples
Robins, Fein, Barton and Green (2001) developed a modified version of the CHAT (the M-CHAT) that included additional items measuring other aspects of early social communication impairments characteristic of autism (e.g. response to name, imitation) as well as repetitive behaviours (e.g. unusual finger mannerisms) and sensory abnormalities (e.g. oversensitivity to noise). The M-CHAT is a parent report instrument and the health practitioner does no direct testing. In their initial report, Robins et al. (2001) tested 1122 unselected children (initially at 18 months, but subsequently at 24 months of age) and 171 children referred for early intervention services (considered to be at high risk of having an ASD or other developmental disability). Following analysis of the first 600 returns, a cut-off was set as failing two from six ‘critical items’ or any three items from the total of 23 items (Robins et al., 2001). Once a child failed the M-CHAT the research team readministered the screen by telephone, and if a child still scored above cut-off the family was invited for an assessment. Of the 58 children who failed on both administrations of the M-CHAT, 39 subsequently received an ASD diagnosis and the remaining 19 were found to have language or global developmental delay. Only 3 of the 39 children with ASD were from the unselected population, with the majority being identified from the sample referred for early intervention services. The items that best discriminated between children with ASD and children with other developmental problems were those that measured joint attention behaviours (pointing, bringing things to show), social relatedness (interest in other children, imitation) and communication (response to name). Robins et al. (2001) calculated sensitivity, specificity and PPV for various combinations of M-CHAT items and demonstrated that in this largely referred sample its instrument parameters were reasonably strong (PPV for two-stage 68%).
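The M-CHAT cut-off described above (failing two of the six critical items, or any three of the 23 items) can be expressed as a short rule. This is a sketch under stated assumptions: the function name is hypothetical and the item numbers given for the critical items are placeholders, since the text above does not list which items are critical.

```python
def mchat_fails(failed_items: set, critical_items: set) -> bool:
    """Apply the cut-off: >= 2 critical items failed, or >= 3 items failed overall."""
    critical_failed = len(failed_items & critical_items)
    return critical_failed >= 2 or len(failed_items) >= 3


# Placeholder item numbers standing in for the six critical items (hypothetical).
CRITICAL = {1, 2, 3, 4, 5, 6}

print(mchat_fails({1, 2}, CRITICAL))        # two critical items failed
print(mchat_fails({1, 10}, CRITICAL))       # one critical item, two items in total
print(mchat_fails({10, 11, 12}, CRITICAL))  # three non-critical items failed
```

In the procedure described above, a child who fails on this rule is rescreened by telephone, and only those who still score above the cut-off are invited for assessment.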
Two recent studies (Kleinman et al., 2008; Robins, 2008) have reported on the M-CHAT with new samples of 3793 and 4797 children, respectively, aged 14–30 months. In a sample combining an unselected and a high risk (referred) sample, Kleinman et al. (2008) found a PPV of 0.36 for the initial screening, which improved to 0.74 for the screening plus the follow-up telephone interview. Again, most cases were identified from the ‘high risk’ sample of children referred for early intervention services or because of a developmental concern. Robins (2008) found a much lower (0.06) PPV in an unselected sample attending well-child visits, but following the telephone interview the PPV increased to 0.57. Follow-up studies will allow us to estimate the instrument's parameters when used on an unselected population, in particular its sensitivity in detecting cases of ASDs in children about whom there had not been previous developmental concerns. Canal-Bedia et al. (2011) report on a Spanish translation of the M-CHAT and found screening plus follow-up PPVs of 0.35 and 0.19 in a combined sample of low and high risk (referred) (N = 2,480) and low risk (N = 2,055) children, respectively.
Wetherby and colleagues (Wetherby, Brosnan-Maddox, Peace & Newton, 2008; Wetherby et al., 2004) developed an early screening tool that can be used from 6 to 24 months of age – the Infant Toddler Checklist (ITC). The ITC is a broader developmental screen that successfully identifies children with developmental delay as well as children with an ASD. The ITC is a 24-item parent completed screen that asks about early social responsiveness, gestures, babble and early language and motor development. In their most recent work, 5385 families with infants were recruited between 6 and 24 months of age and then rescreened every 3 months (Wetherby et al., 2008). They have shown that it is possible in some cases to prospectively identify children who will go on to have a diagnosis of an ASD towards the end of the first year of life, although most cases were identified during the second year. PPV was low at 6–8 months (43%), but increased to over 70% in the second year of life (Wetherby et al., 2008).
Overall these prospective screening studies (summarised in Table 1) have shown that it is possible to prospectively identify ASD, including in children about whom parents and professionals did not have preexisting concerns, from the age of 18 months or even earlier in the second year of life. The most common early signs captured by the screen are impairments or delays in early emerging social communication behaviours, such as response to name, joint attention and play behaviours, although sensory abnormalities or a restricted repertoire of play activities might also be early indicators of later ASD. However, these early signs are neither universal nor specific to ASD as opposed to other neurodevelopmental disorders. In addition, one-stage screening has been shown to have low PPV with the attendant risk of overreferral. These research findings suggest caution about recommending universal population screening. Despite this, the American Academy of Pediatrics recently published guidelines that recommended routine use of screens to help community paediatricians to identify at-risk cases at 18- and 24-month health checks (Johnson & Myers, 2007; see Al-Qabandi et al., 2011). It remains to be determined whether screening instrument parameters warrant this clinical guidance. However, screens can be a useful adjunct to ongoing parent-practitioner surveillance.
Table 1. Prospective screening studies
|Study||Screen||Age (months)||Design and sample||Format||Instrument parameters|
|Baird et al. (2000)||CHAT||18||Two-stage population screening (N = 16,235)||Parental questionnaire and health practitioner observation||Se = 0.21 (two-stage)a; PPV = 0.59 (two-stage)|
|Dietz et al. (2006)||ESAT||14||Two-stage population screening (N = 31,724)||Health practitioner completes items following interview with parents||PPV ASD = 0.25 (two-stage); PPV DD = 0.68 (two-stage)|
|Robins et al. (2001)||M-CHAT||24||Two-stage combined high-risk and low-risk (N = 1293)||Parent questionnaire||PPV = 0.36 (one-stage); PPV = 0.68 (two-stage)|
|Kleinman et al. (2008)||M-CHAT||16–30||Two-stage combined high-risk and low-risk (N = 3793)||Parent questionnaire||PPV = 0.36 (one-stage); PPV = 0.74 (two-stage)|
|Robins (2008)||M-CHAT||16–27||Two-stage low-risk (N = 4797)||Parent questionnaire||PPV = 0.06 (one-stage); PPV = 0.57 (two-stage)|
|Wetherby et al. (2004)||ITC||6–24||Population screening (n = 5385); multiple stage follow-ups||Parent questionnaire||PPV DD = 0.43 (6–8 months); PPV DD = 0.79 (21–24 months)|
Two recent, innovative studies remind us that screening instruments do not act in isolation to improve early detection. When introducing screening instruments to practitioners, one is inevitably involved in training the practitioners to use the screen and about the early signs of ASD. In a controlled study, Oosterling, Wensing, et al. (2010) showed that in combination, training health professionals in the early signs of autism and introducing use of the ESAT screen and a clear referral and diagnostic pathway led to a reduction in age of diagnosis from 83 to 64 months. The proportion of cases diagnosed before the age of 24 months increased from 12.7% to 24.3%. Dereu et al. (2010) also provided training and introduced a new screening instrument, the Checklist for Early Signs of Developmental Disorders (CESDD), to daycare workers caring for 6808 children between 3 and 39 months of age. Dereu et al. report a moderate to high sensitivity of 0.68, but a low PPV of 0.10, although the sample has yet to be followed up to determine cases missed. By being trained in the use of screens and the early signs of ASD, health and education practitioners will develop their expertise to identify possible ASD cases and come to understand the benefits but also the potential limitations (e.g. false positives) of using the screening instruments themselves.
One question is whether the ultimate goal should be the development of a universal population screen to identify undetected cases of ASD or the development of instruments that can be used, in combination with parental and professional expressions of concern regarding a child's development, in ongoing health surveillance. Another consideration is whether screens should target ASD specifically or whether they should also attempt to identify children with language and general developmental delays or other neurodevelopmental conditions.
Screening with older children
The Social Communication Questionnaire (SCQ; Rutter, Bailey & Lord, 2003; Rutter, LeCouteur & Lord, 2003) is a screening tool for ASD based on the Autism Diagnostic Interview-Revised (ADI-R; Lord, Rutter & LeCouteur, 1994) and is increasingly widely used in research and clinical practice. It consists of 40 items answered yes/no and is suitable for both verbal and nonverbal children. In the initial validation study that included children and adults, the SCQ discriminated well between ASD and non-ASD cases with a sensitivity of 0.85 and a specificity of 0.75 (Berument, Rutter, Lord, Pickles & Bailey, 1999) (Table 2). Chandler et al. (2007) found the SCQ to work well with at-risk 9-to-10-year-old children (sensitivity 0.88, specificity 0.72). Chandler et al. (2007) also reported that in two small general population samples between 4% and 5% of children scored above the ASD cut-off, including 1.5% who scored above the autism cut-off. Many of these high-scoring children had an ASD diagnosis and almost all (~90%) had a diagnosed neurodevelopmental disorder. Corsello et al. (2007) reported reduced sensitivity (0.68) for detecting ASD in children below the age of 5 years compared with a sensitivity of 0.80 for children 11 years and older. Recently the first cross-cultural validation study also found strong instrument properties in a German sample (Bölte, Holtmann & Poustka, 2008). The ability of the SCQ to discriminate between ASD and non-ASD cases in two samples of 3-to-6-year-old children has recently been reported. In two studies Eaves and colleagues found moderate sensitivity (0.70) and specificity (0.79; Eaves, Wingert & Ho, 2006; Eaves, Wingert, Ho & Mickelson, 2006), but reduced specificity (0.54) in the second study (Eaves, Wingert & Ho, 2006).
One recent study has reported the ability of the SCQ to detect ASD in a younger (20–40 month old) referred/high-risk sample and found a sensitivity of 0.66 and a PPV of 0.79 (Oosterling, Wensing, et al., 2010; Oosterling, Roos, et al., 2010; Oosterling, Rommelse, et al., 2010).
Table 2. Studies reporting on the Social Communication Questionnaire (SCQ) and the Social Responsiveness Scale (SRS) – autism spectrum disorder (ASD) vs. nonspectrum
|Study||Instrument||Age range (years)|
|Berument et al. (1999)||SCQ||4–40|
|Chandler et al. (2007)||SCQ||9–10|
|Corsello et al. (2007)||SCQ||2–16|
|Eaves, Wingert and Ho (2006), Eaves, Wingert, Ho, et al. (2006)||SCQ||3–7|
|Eaves, Wingert and Ho (2006)||SCQ||4–6|
|Charman et al. (2007)||SRS||9–13|
|Constantino et al. (2007)||SRS||4–18|
The Social Responsiveness Scale (SRS; Constantino & Gruber, 2005; Constantino et al., 2000) is a parent and teacher completed questionnaire with 65 items rated on a 4-point Likert scale (from ‘not true’ to ‘almost always true’). Charman et al. (2007) found that the SRS had a specificity of 0.78, a sensitivity of 0.67 and a PPV of 0.63 in a clinical sample (N = 119) of 9 to 13 year olds, with higher PPV (0.78) in the high IQ (>70) compared with the low IQ subsample (0.52). Constantino et al. (2007) found that in a large sample (N = 442) of children with ASD and non-ASD, a combined parent and teacher T-score cut-point of >60 yielded a sensitivity of 0.75 and a specificity of 0.96 (PPV not reported). A German version of the SRS also reported strong instrument properties (Bölte, Poustka & Constantino, 2008).
We have chosen to focus on the two most widely used and best-validated ASD screening instruments for children and adolescents. Other ASD screens for children and young people have been developed over the past decade, including the Childhood Autism Screening Test (CAST; Williams et al., 2008, 2005) and the Autism Spectrum Screening Questionnaire (ASSQ; Ehlers et al., 1999; Posserud, Lundervold & Gillberg, 2006, 2009); summaries of their instrument properties are available in other recent reviews (Norris & Lecavalier, 2010). Only a few studies have directly compared different screens against each other, providing insufficient data at this stage to make clear recommendations regarding which screens ‘work best’ at a particular age or ability level. Charman et al. (2007) found that the SRS and the Children's Communication Checklist (CCC; Bishop, 1998) have lower precision than the SCQ in the subsample with low IQ (<70). Oosterling et al. (2009) reported on the ESAT, SCQ and CHAT screens in a high risk sample of children screened around 30 months of age (8–44 months, N = 238). The instrument with the highest PPV was the CHAT (0.97 for five high risk key items), but this was associated with the lowest sensitivity (0.18) as was found in the original CHAT population study. The ESAT and the SCQ both had moderate PPV (0.68 and 0.79, respectively) and moderate sensitivity (0.88 and 0.66). The decision of which screen to use will be sample-specific in both clinical and research use, and some understanding of how sample and respondent characteristics might systematically affect instrument properties is required to make recommendations for one screen over another.
Clinical issues in screening and surveillance
When screening for undetected cases of ASD, some parents' first recognition that something might be wrong may follow ‘failure’ of a screen and consequent discussion about their child's development with the professional involved. For a parent to make use of information about their child, it first has to make sense and they have to be ready to agree on it. Recognition, belief and acceptance can be particularly difficult when the professional is giving completely unexpected information. One of the benefits of active surveillance is the opportunity to discuss ‘risk status’ with parents and what it means when a particular child fails a screen. In practice, being screen positive does not constitute a diagnosis, even when tests have very high positive predictive value; rather, the initial screening process should be seen as the beginning of a dialogue between the parent and professional about the child's development, with additional assessments being couched as helpful checks to make sure things are going OK.
Another caution is that screening results are sample-specific. The prevalence of ASD cases within a sample, the characteristics (e.g. clinical diagnosis, IQ, age) of the ASD and non-ASD cases, family factors (e.g. parental education, parental knowledge about autism) and methodological factors, including whether the screen was completed prior to or after a diagnostic assessment, can all affect how a screen performs. Consequently, the utility of any particular screening instrument and the application of any particular cut-point for identifying ‘screen positives’ depend both on the sample characteristics and on the intended purpose of screening. Charman et al. (2007) outline different hypothetical clinical and research scenarios that illustrate how three different screens would perform on different tasks. The choice of which screen to use, and for which particular purpose, critically depends on the relative costs of false positives and false negatives. It is also important to remember that these costs tend to fall on different parties. False positives involve costly further investigation and parental anxiety. False negatives may deprive children of clinical and education resources or place the burden of provision entirely on parents.
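One way to make the trade-off between false positives and false negatives concrete is to compare the expected cost per child screened at different cut-points. The sketch below is purely illustrative: the two cut-points, their sensitivity and specificity values, the prevalence and the relative costs are all hypothetical, and it is not the analysis reported by Charman et al. (2007).

```python
def expected_cost(se: float, sp: float, prevalence: float,
                  cost_fp: float, cost_fn: float) -> float:
    """Expected misclassification cost per child screened."""
    false_neg_rate = prevalence * (1.0 - se)          # cases missed by the screen
    false_pos_rate = (1.0 - prevalence) * (1.0 - sp)  # non-cases flagged by the screen
    return false_neg_rate * cost_fn + false_pos_rate * cost_fp


# Hypothetical cut-points for one screen: name -> (sensitivity, specificity).
CUT_POINTS = {"lenient": (0.90, 0.60), "strict": (0.65, 0.90)}


def best_cut_point(prevalence: float, cost_fp: float, cost_fn: float) -> str:
    """Return the cut-point with the lowest expected cost."""
    return min(
        CUT_POINTS,
        key=lambda name: expected_cost(*CUT_POINTS[name], prevalence, cost_fp, cost_fn),
    )


# Missing a case weighted five times as heavily as a false alarm, and the reverse.
print(best_cut_point(prevalence=0.30, cost_fp=1.0, cost_fn=5.0))
print(best_cut_point(prevalence=0.30, cost_fp=5.0, cost_fn=1.0))
```

Under these assumed values, the lenient cut-point is preferred when missed cases are weighted more heavily, and the strict cut-point when false alarms are. The sketch puts both costs on a single scale, which is a simplification given that, as noted above, they tend to fall on different parties.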
ASD screening instruments function to identify children in need of further monitoring or diagnostic evaluation. At that point, standardised autism diagnostic instruments are often employed to structure the information-gathering from both parents and identified children within a diagnostic assessment. The existence of and ongoing improvements to such measures are associated with more accurate diagnosis of ASD, including the ability to reliably describe milder and younger cases, as well as increased comparability of research findings, based on better agreement as to “caseness” across research teams. However, as with screening tools, diagnostic instruments are often limited by inadequate power to correctly identify individuals with and without ASD. Furthermore, the estimates of such performance validity for each particular measure are necessarily limited by the absence of an absolute test for ASD, and as such are influenced by clinical experience in diagnosing ASD, training and experience in using the diagnostic measure and evolution within the field in terms of what is recognised and labelled “ASD.” Four commonly used autism diagnostic instruments are reviewed briefly below in terms of intended purpose, administration and scoring and psychometric properties. Instrument parameters of these measures based on the largest available samples are briefly summarised in Table 3. For more detailed psychometric information, see the National Institute of Clinical Excellence guidelines (National Institute of Clinical Excellence, 2011).
Table 3. Overview of autism diagnostic instruments
|Instrument||Intended population||Format||Domains assessed||Reliability||Validity|
|ADOS||Age 12 months and up; any verbal level; NVMA ≥ 12 monthsa||Examiner-patient interactive observation||Social interaction, communication, play and imagination, repetitive behaviours||Toddler module: Mκw = 0.76 for 30/38 items; mean exact agreement = 84%.a See also Lord et al., 2012, for revised algorithm reliability||AUT vs. NS: Se = 91–98, Sp = 84–94 (N = 1072)b; Se = 82–94, Sp = 80–100 (N = 949)c; Se = 87–92, Sp = 71–76 (N = 337)d. Non-Autism ASD vs. NS: Se = 72–84, Sp = 76–83 (N = 649)b; Se = 60–95, Sp = 75–100 (N = 238)c; Se = 53–86, Sp = 62–63 (N = 354)d. Toddlers: Se = 88–91, Sp = 91 (n = 104, ASD vs. NS)a. See Gray, Tonge and Sweeney (2008), Gray et al. (2008) and Oosterling, Roos, et al. (2010) for more data from Modules 1 & 2, and Lord, Rutter, DiLavore and Risi (1999) for data on Module 4|
|ADI-R||Ages 12 months and up;e any verbal level; NVMA ≥ 10 months|| ||Social interaction, communication, play and imagination, repetitive behaviours, developmental milestones, problem behaviours and special skills||κw = 0.73–0.78, mean agreement = 90–93% on all items (n = 10 AUT, 10 NS)f; κw = 0.62–0.89 (preschoolers; n = 51 ASD, 43 NS)g||Se = 89, Sp = 59, PPV = 74, NPV = 81 (N = 960, ASD vs. NS)h; Se = 83, Sp = 72, PPV = 82, NPV = 74 (N = 270 under age 3, ASD vs. NS)h; Se = 72, Sp = 79 (N = 137, AUT+ID vs. NS+ID); Se = 77, Sp = 63 (N = 136, non-AUT ASD+ID vs. NS+ID)i|
|DISCO||All ages, verbal levels, and mental ages||Examiner-parent/caregiver interview (+ outside information)||Social interaction, communication, imagination, repetitive behaviours, developmental level and daily living skills, problem behaviours||κ = 0.75+ for 80% of all itemsj,k; κ = 0.75+ for 50%–68% of items across algorithmsl||Se = 100, Sp = 55 (N = 67, ASD vs. NS)k; Se = 82, Sp = 83 (N = 57, ASD vs. NS)l|
|3Di||Children with normal-range abilities, unselected clinical and general population samples||Examiner-parent/caregiver interview (computerised; 45–180 min, depending on optional module inclusion)||Social interaction, communication, repetitive behaviours, demographics, developmental history, motor skills; optional: comorbid symptoms||ICC = 0.86+ (n = 23 ASD, 27 NS)m||Correctly identified 100% of 27 children with ASD and 98% of 60 typical children and 33 with unknown diagnosesm|
The Autism Diagnostic Interview-Revised
The ADI-R (Lord et al., 1994) is a standardised, semistructured interview that is administered by a trained clinician to a parent or caretaker familiar with the developmental history and current behaviour of the individual being evaluated. Scoring is based on the interviewer's judgment of the behavioural reports obtained, rather than on the informant's judgment. Administration and scoring of this interview takes approximately 1.5–3 hr in a face-to-face setting. The ADI-R version published by Western Psychological Services (WPS) can be used to assess those with mental ages above 24 months (Rutter, Bailey & Lord, 2003; Rutter, LeCouteur & Lord, 2003); newly created algorithms extend the use to children aged 12–47 months and down to nonverbal mental ages of 10 months (Kim & Lord, 2012).
The ADI-R comprises 93 items focusing on Early Development, Language/Communication, Reciprocal Social Interactions and Restricted, Repetitive Behaviours and Interests. Most items include distinct current and historical scores, the latter based either on the period between the individual's fourth and fifth birthdays (“Most Abnormal 4–5”) or the point in the individual's lifetime at which the behaviour in question was regarded as most atypical (“Ever”). Scores are assigned on a 0–3 scale; higher numbers indicate more definite presence or greater severity of symptoms. Diagnostic algorithms for children aged 2:0–3:11, or 4:0 and older are based on sums of specific item scores across the domains noted above. Algorithm cut-offs for all domains must be met or exceeded to achieve the instrument classification of “autism.” Current behaviour algorithms exist, but do not yield classifications.
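The requirement that cut-offs be met or exceeded in every domain can be sketched as follows. This is an illustration of the rule's logic only: the function name, the shortened domain labels and the cut-off values are arbitrary placeholders and should not be read as the published ADI-R algorithm.

```python
def meets_autism_classification(domain_totals: dict, cutoffs: dict) -> bool:
    """True only if the summed score meets or exceeds the cut-off in every domain."""
    return all(domain_totals[domain] >= cutoff for domain, cutoff in cutoffs.items())


# Arbitrary placeholder cut-offs (hypothetical values for illustration).
CUTOFFS = {"social": 5, "communication": 5, "repetitive": 2}

meets_all = meets_autism_classification(
    {"social": 9, "communication": 6, "repetitive": 2}, CUTOFFS
)
misses_one = meets_autism_classification(
    {"social": 9, "communication": 6, "repetitive": 1}, CUTOFFS
)
print(meets_all, misses_one)
```

Because the rule is conjunctive, a high total in one domain cannot compensate for a total below the cut-off in another.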
In two large independent samples aged 3 and older (N = 960 American; N = 232 Canadian), the ADI-R correctly identified 89%–95% of children with ASD; however, it yielded a nonspectrum classification for only 56%–59% of children with non-ASD disorders (Risi et al., 2006). When used in combination with the Autism Diagnostic Observation Schedule (ADOS), this specificity improved to 77% and 75% by sample. This supports the recommendation by de Bildt et al. (2004) that the ADI-R and ADOS are most valuable in combination. WPS ADI-R diagnostic algorithms have been found to be overinclusive for individuals with nonverbal mental ages below 18 months and those with severe to profound intellectual disability (Lord, Storoschuk, Rutter & Pickles, 1993; Nordin & Gillberg, 1998; Risi et al., 2006); see Kim & Lord, 2012, for algorithm revisions to address those issues.
Interrater reliability was found to be very good in a sample of 20 children with autism or intellectual or language impairments (Lord et al., 1994), as well as in a larger sample of 94 preschoolers (Lord et al., 1993). A later study of seven reliable examiners rating one administration reported good to perfect agreement on 87% of items (Cicchetti, Lord, Koenig, Klin & Volkmar, 2008). Original test-retest data were available for six children: with blind interviewers administering the measure 2–3 months later, exact agreement exceeded 83% for all but 6 items (Lord et al., 1994).
The Diagnostic Interview for Social and Communication Disorders (DISCO)
The Diagnostic Interview for Social and Communication Disorders (DISCO; Wing, Leekam, Libby, Gould & Larcombe, 2002), now in its 11th version, is a semistructured, standardised interview used in ASD assessment for diagnosis and educational and/or treatment planning. Like the ADI-R, the DISCO is administered face-to-face by a trained examiner interviewing a parent or close caregiver of an individual with suspected ASD. However, ratings on the DISCO can be based on any available information, including direct observation of the individual or reports from teachers or other caregivers. The instrument is intended for individuals of all chronological and mental ages, although published data on the diagnostic validity of the measure in special populations (such as very young children or those with above average or impaired intellectual ability) are limited.
The DISCO takes approximately 2–4 hr to administer (including scoring). It encompasses 362 items covering domains of social interaction, communication, imagination and repetitive behaviours, as well as domains assessing developmental levels and daily living skills, and non-ASD-specific behaviours, such as problems with attention, overactivity, sexual or psychiatric difficulties and other challenging behaviours. Developmental items are rated on a 3-point scale of “delay,” “minor delay,” or “no problem”; atypical behaviour items receive separate ratings based on current behaviour and most atypical behaviour ever, with a scale of “severe,” “minor,” or “not present.” The DISCO was originally intended to assist in clinical assessment and treatment planning for an individual rather than to yield a categorical diagnosis; however, computerised diagnostic algorithms for research use have been created (Leekam, Libby, Wing, Gould & Gillberg, 2000; Leekam, Libby, Wing, Gould & Taylor, 2002). These include two algorithms operationalising the ICD-10/DSM-IV criteria for autistic disorder and Asperger's Disorder, a 5-item algorithm based on Wing and Gould (1979) criteria for ASDs and an algorithm based on Ehlers and Gillberg (1993) criteria for Asperger's Disorder.
In a DISCO-9 validity study of children aged 3–11 years (Leekam et al., 2002), all 36 children with ASD were correctly identified by either Wing & Gould or ICD-10 autism algorithms; however, 10 of 17 with intellectual disability and 4 of 14 with language disorders were also identified as “ASD.” In a study using the Swedish translation of the DISCO-10, the ICD-10 autism algorithm identified 42 of 51 individuals with an ASD diagnosis and incorrectly identified 1 individual of 6 without ASD (Nygren et al., 2009). Another Swedish study reported that 89 adult longitudinal participants with an ASD were correctly identified by the DISCO as having an autism spectrum condition by either ICD-10 autism or atypical autism algorithms (Billstedt, Gillberg & Gillberg, 2007). Interrater reliability of the DISCO-10 was assessed in 33 children (aged 2.5–15) and 7 adults (20–38 years), with over 90% of items showing good or excellent interrater reliability and the majority of algorithm items showing at least moderate reliability (Nygren et al., 2009).
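The counts reported above can be converted into sensitivity and specificity, which makes them easier to compare with the figures quoted for other instruments. The sketch below is a worked illustration: the proportions are derived here from the counts given in the text and are not quoted from the cited papers, and pooling the two DISCO-9 comparison groups into one non-ASD group is an assumption made for this example.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of individuals with ASD whom the instrument identifies."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of individuals without ASD whom the instrument does not flag."""
    return true_neg / (true_neg + false_pos)


# DISCO-9 (Leekam et al., 2002): 36 of 36 with ASD identified; 10 of 17 with
# intellectual disability and 4 of 14 with language disorder also flagged.
disco9_sens = sensitivity(true_pos=36, false_neg=0)
disco9_spec = specificity(true_neg=(17 - 10) + (14 - 4), false_pos=10 + 4)

# DISCO-10 (Nygren et al., 2009): 42 of 51 with ASD identified;
# 1 of 6 without ASD incorrectly identified.
disco10_sens = sensitivity(true_pos=42, false_neg=51 - 42)
disco10_spec = specificity(true_neg=6 - 1, false_pos=1)

print(f"DISCO-9:  sensitivity {disco9_sens:.2f}, specificity {disco9_spec:.2f}")
print(f"DISCO-10: sensitivity {disco10_sens:.2f}, specificity {disco10_spec:.2f}")
# DISCO-9:  sensitivity 1.00, specificity 0.55
# DISCO-10: sensitivity 0.82, specificity 0.83
```

The DISCO-10 specificity rests on only six individuals without ASD, so that proportion in particular should be read as a rough indication rather than a stable estimate.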
The Developmental, Dimensional and Diagnostic Interview (3Di)
The Developmental, Dimensional and Diagnostic Interview (3Di; Skuse et al., 2004) is a parent/caregiver interview administered face-to-face by a trained examiner using a laptop computer. Prior to the interview, parent/caregivers complete questionnaires that are entered into the software and used to tailor the order and wording of questions asked in the in-person interview. Parent responses are entered directly into the software in the moment, and immediately following the 90–180 min interview, a computer-generated report, including algorithm scores and classification, is available to the clinician. In addition, a spreadsheet of up to 1300 3Di variables is automatically created. The 3Di was developed primarily to assess ASD symptoms in children with average-range abilities and to differentiate ASD and nonspectrum conditions in a general population; the authors suggest it may also be used in populations with moderate to severe intellectual disability (Skuse et al., 2004).
The 3Di comprises mandatory modules covering autism spectrum symptoms (266 items), optional modules on comorbid symptoms (291 items) and 183 items concerning patient demographics, family background, developmental history and motor skills. A short 53-item form of the 3Di has also been piloted (3Di-sv; Santosh et al., 2009). On both the original and short forms, response options vary within 3-point scales. Computer-generated algorithms weight and sum responses within domains, although published reports are not clear as to which domains are represented in the algorithm, how items were chosen for algorithm inclusion and which classifications are produced.
The majority of data on the reliability and validity of the 3Di comes from the authors' original 2004 article (Skuse et al., 2004) and the published report of the short form (Santosh et al., 2009). In the former report, the measure discriminated between 27 children with ASD, 60 typically developing children and 33 with unreported diagnoses with 100% sensitivity and 98% specificity. In the latter report, sensitivity of the short form ranged from 0.90 to 0.96 and specificity ranged from 0.85 to 0.96 by domain in an independent sample of 439 children, 58% of whom had ASD (Santosh et al., 2009).
The Autism Diagnostic Observation Schedule
The ADOS (Lord et al., 2000) is a semistructured, standardised observation of children and adults referred for ASD. With the recent addition of a Toddler module (Luyster et al., 2009), the ADOS has five development- and language-dependent modules, which are 30–60 min protocols of activities that are based on talking and/or play-based interaction. Trained examiners choose the appropriate module for the individual's age (from 12 months through adulthood) and verbal level (from no words to fluent speech), and follow that protocol of activities using standardised materials (e.g., books, toys), for a semistructured social interaction. An adapted ADOS module to address diagnostic needs in older individuals with minimal language is currently undergoing validity testing (Hus & Lord, in preparation).
Like its companion measure, the ADI-R, the ADOS was created by operationalising DSM-IV criteria for autism. Item scores on a 0–3 scale, with higher scores indicating greater symptom severity, are assigned immediately after completing the administration. Specific items from Communication, Reciprocal Social Interaction and Restricted and Repetitive Behaviour domains comprise algorithms, which yield a classification of “autism,” “autism spectrum disorder,” or “non-spectrum.” In 2007, revised algorithms were created with the same number of items and of similar content across modules 1–3 (Gotham, Risi, Pickles & Lord, 2007), and in 2009, two algorithms were created to correspond to the Toddler module (Luyster et al., 2009). In these seven new algorithms, raw scores of algorithm items from a “Social Affect” and a “Restricted Repetitive Behaviour” domain are summed and applied to one set of cut-offs to yield the instrument classification. In addition, Toddler algorithm totals can be located within three “ranges of concern,” to discuss the scores dimensionally rather than applying a cut-off score. For the original algorithm still in use for Module 4 (for older adolescents and adults with fluent speech), separate cut-offs exist for Communication, Social Interaction and the combination of those two totals; all three sets of cut-offs must be met or exceeded to achieve an “autism” or “autism spectrum” classification.
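The two decision rules described above differ in structure, and a short sketch may make the contrast concrete. In the revised algorithms a single summed total is compared with one set of cut-offs, whereas the original Module 4 algorithm requires three separate sets of cut-offs to be met at once. The code below is an illustration under stated assumptions: all cut-off values are hypothetical placeholders rather than the published ADOS values, and the names `Cutoffs`, `classify_revised` and `classify_module4` are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    """A pair of thresholds; values used below are hypothetical placeholders."""
    autism_spectrum: int
    autism: int


def classify_revised(social_affect: int, rrb: int, cutoffs: Cutoffs) -> str:
    """Revised rule: sum the two domain totals and apply one set of cut-offs."""
    total = social_affect + rrb
    if total >= cutoffs.autism:
        return "autism"
    if total >= cutoffs.autism_spectrum:
        return "autism spectrum"
    return "non-spectrum"


def classify_module4(communication: int, social: int, comm_cut: Cutoffs,
                     social_cut: Cutoffs, combined_cut: Cutoffs) -> str:
    """Original Module 4 rule: all three sets of cut-offs must be met."""
    pairs = [
        (communication, comm_cut),
        (social, social_cut),
        (communication + social, combined_cut),
    ]
    if all(value >= cut.autism for value, cut in pairs):
        return "autism"
    if all(value >= cut.autism_spectrum for value, cut in pairs):
        return "autism spectrum"
    return "non-spectrum"


# Placeholder thresholds, chosen only to show how the rules behave.
revised = classify_revised(9, 2, Cutoffs(autism_spectrum=8, autism=12))
module4 = classify_module4(
    communication=4, social=3,
    comm_cut=Cutoffs(2, 3), social_cut=Cutoffs(4, 6), combined_cut=Cutoffs(7, 10),
)
print(revised)  # autism spectrum
print(module4)  # non-spectrum
```

In the Module 4 example the combined total reaches its placeholder threshold, but the Social Interaction total does not, so no spectrum classification is returned; under the revised rule only the summed total matters.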
Both original and revised diagnostic algorithms have strong predictive validity against best estimate clinical diagnoses, with the revised set of algorithms showing minimal association between ADOS totals and chronological age, generally decreased association between ADOS total and verbal IQ when compared to the original algorithms, and improved sensitivity in lower-functioning developmental groups (Gotham et al., 2007, 2008; Luyster et al., 2009). In 2009, de Bildt et al. offered a caveat that the inclusion of the RRB domain towards the classification cut-off may lead to overinclusion of children with cognitive impairments on the revised algorithms (de Bildt et al., 2009). Of note, the ADOS performs better in autism clinic samples in which the information gained from the measure is directly used in clinicians' diagnostic decisions; in samples in which the ADOS examiner is not the primary diagnostician (such as the Canadian sample in Risi et al., 2006; or the American medical center sample in Molloy, Murray, Akers, Mitchell & Manning-Courtney, 2011), the predictive validity of the measure tends to be significantly lower.
Interrater reliability, internal consistency and test-retest reliability of the measure were found to be good to excellent in the original ADOS reliability sample of 98 individuals and 12 reliable examiners, as well as in updated data on the revised algorithms (see the ADOS-2 manual, Lord et al., 2012) and the toddler module (Luyster et al., 2009).
Issues in the use of autism diagnostic measures
Because of their strong discriminant validity, the ADI-R and ADOS have been translated into over 18 languages and are used worldwide to establish caseness and aid in treatment planning; they have also been linked to diverse genetic and neuroimaging findings. Scores from both instruments have also been used to measure severity of autism symptoms and changes over time; however, it is important to keep in mind that these measures were developed with the goal of differentiating individuals with and without ASD. Recent updates to the ADOS have resulted in the creation of a 10-point calibrated severity scale proposed as an alternative method of quantifying ASD severity with greater independence from participant characteristics, such as chronological age and IQ (Gotham, Pickles & Lord, 2009). Calibrated severity scores do not measure functional impairment, but rather provide an alternative for comparing ADOS scores across modules and time. Although initial replications provide some evidence for the utility of the metric (de Bildt et al., 2011), it has not yet been widely studied.
A strength of the DISCO is that it provides an overall profile of skills and abilities, challenges and disabilities and areas of needed remediation. Therefore, for clinical purposes, it allows simultaneous information-gathering for diagnosis and treatment planning. The 3Di offers a dimensional measurement of symptoms, assessment of comorbid symptoms, a less time-consuming training protocol than comparable measures and quickly and easily accessible results. However, the base of empirical research on the 3Di is extremely small, and predictive validity estimates have been generated largely in comparison to typically developing children, a circumstance which inflates predictive validity scores and does not reflect the populations in which ASD diagnostic decision-making usually takes place. In light of the strengths of this measure, it will be worthwhile to explore the utility of the 3Di in larger, more diverse ASD populations, as well as its ability to differentiate between children with ASD and those with clearly defined and reported cognitive, language and behavioural disorders.
For most autism diagnostic measures, training is time-consuming and can be both expensive and difficult to procure. One route to research reliability on the ADI-R and ADOS is to complete three consecutive reliable administrations with a research-reliable examiner (at 90% exact item agreement on both the protocol and algorithm items for the ADI-R, and 80% for the ADOS). A second route is to attend training workshops, score video-recorded administrations and submit recordings of one's own administrations, which are then scored by a reliable examiner to check for acceptable reliability. Training on the DISCO involves a 3-day introductory workshop, an additional 2-day workshop and subsequently meeting quality standards for accreditation. 3Di training is completed through DVDs and internet-based training modules, although in-person training courses are also available.
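The first route to research reliability amounts to a simple calculation: the proportion of items on which the trainee's score exactly matches the reliable examiner's score, checked across three consecutive administrations. The sketch below illustrates that calculation using the 90% and 80% thresholds stated above. It simplifies by using one agreement figure per administration, whereas the ADI-R criterion applies to protocol and algorithm items separately, and the function names and example scores are hypothetical.

```python
from typing import Sequence

ADI_R_THRESHOLD = 0.90  # exact item agreement required for the ADI-R
ADOS_THRESHOLD = 0.80   # exact item agreement required for the ADOS


def exact_agreement(trainee: Sequence[int], reference: Sequence[int]) -> float:
    """Proportion of items on which the two sets of scores match exactly."""
    if len(trainee) != len(reference) or len(trainee) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(t == r for t, r in zip(trainee, reference))
    return matches / len(trainee)


def has_three_consecutive(agreements: Sequence[float], threshold: float) -> bool:
    """True if three administrations in a row reach the agreement threshold."""
    run = 0
    for agreement in agreements:
        run = run + 1 if agreement >= threshold else 0
        if run >= 3:
            return True
    return False


# Invented scores for one five-item administration: four of five items match.
one_administration = exact_agreement([0, 1, 2, 3, 0], [0, 1, 2, 2, 0])

# Invented agreement figures for five successive administrations.
history = [0.85, 0.79, 0.81, 0.90, 0.80]
print(one_administration)                               # 0.8
print(has_three_consecutive(history, ADOS_THRESHOLD))   # True
print(has_three_consecutive(history, ADI_R_THRESHOLD))  # False
```

Exact agreement is a strict criterion: a score of 2 against a reference score of 3 counts as a disagreement even though the two raters differ by only one point.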
Another drawback of commonly used parent interviews is the length of administration. For some samples and purposes, the SCQ may be a viable substitute for the ADI-R, in particular when used in combination with the ADOS (see Chandler et al., 2007; Corsello et al., 2007). In addition, a shortened telephone screener, the Autism Screening Interview, has been created based in part on the ADI-R and currently is undergoing validation (S. Bishop and C. Lord, personal communication).
In general, an ASD evaluation should include, at minimum, a caregiver-based developmental history, a direct observation of the referred individual using a semistructured observational measure and measures of cognitive, language and adaptive skill (see also Gotham, Bishop & Lord, 2011, for a more detailed discussion of autism diagnostic assessment). Despite the strong predictive validity of some of the assessment tools described above, an individual's diagnosis of ASD should never depend on the diagnostic classification of a single measure or combination of measures. In addition, professionals must be realistic about the limitations of diagnostic instruments: ultimately these measures cannot “solve” a difficult diagnostic decision, and they may not be universally necessary (e.g., for clear-cut cases in which accurate diagnosis is the sole aim). Experienced clinical judgment is essential for accurate diagnosis, regardless of how carefully a clinician weighs the benefits and disadvantages of commonly used instruments. Nevertheless, choosing the best performing instruments for a particular clinical need can provide invaluable standardisation and structure to support clinical judgement and to aid in treatment planning and recommendations.
Over the past two decades, both the research and clinical fields have benefited from the development of ASD screening and diagnostic instruments. Both types of instrument provide valuable sources of information about a child or young person, which can help clinicians make more informed judgements about onward referral and diagnosis. However, they do not ‘do the job’ for the clinical team in that no instrument score equates to a diagnosis. In addition, screening and diagnostic instruments help clinicians least in the cases where they are most in want of direction, as their accuracy will always be lower for marginal cases. Depending on service configuration, time-consuming and expensive ASD-specific diagnostic instruments will not always be feasible or appropriate, but in our experience, clinical teams benefit from practitioners being trained in these methods. Whilst further research hopefully will refine our knowledge of both screening and diagnostic instruments and where they can prove useful, to date many of them remain inadequately studied and their instrument parameters, whilst promising, are insufficient to recommend them as universal clinical tools.
This review article was invited by CAMH Editors for publication in the Measurement Issues section of the journal.
The work of TC on screening is supported by the COST Action BM1004. Research at the Centre for Research in Autism and Education (CRAE) is supported by the Clothworkers' Foundation and Pears Foundation. The work of KG on diagnostic instruments has been supported by United States National Institute of Health grants R01HD065277 (PI: Somer Bishop), R01 MH066469 (PI: Catherine Lord) and an Autism Speaks Predoctoral Training Fellowship.
The authors have declared that they have no competing or potential conflicts of interest arising from the publication of this article. KG will be an author on the as-yet-unpublished Autism Diagnostic Observation Schedule, 2nd edition (ADOS-2); royalties for this will be donated to charity.