What are we optimizing for in autism screening? Examination of algorithmic changes in the M‐CHAT

The objective of the present study was to compare the performance of the new M‐CHAT‐R algorithm with that of the original M‐CHAT algorithm. The main purpose was to examine whether the algorithmic changes increase identification of children later diagnosed with ASD, and whether there is a trade‐off when changing algorithms. We included 54,463 screened cases from the Norwegian Mother and Child Cohort Study. Children were screened using the 23 items of the M‐CHAT at 18 months, and the performance of the M‐CHAT‐R algorithm was compared to that of the M‐CHAT algorithm on these 23 items. In total, 337 individuals were later diagnosed with ASD. Using the M‐CHAT‐R algorithm decreased the number of correctly identified children with ASD by 12 compared to the M‐CHAT, and no child with ASD who screened negative on the M‐CHAT criteria screened positive on the M‐CHAT‐R algorithm. McNemar's nonparametric test showed that this difference in identifying ASD was statistically significant. In short, applying the 20‐item M‐CHAT‐R scoring criteria to the 23‐item M‐CHAT resulted in decreased sensitivity and increased specificity for identifying children with ASD, a trade‐off that needs further investigation in terms of cost‐effectiveness. Further research is needed to optimize screening for ASD in the early developmental period and to increase identification of children who currently screen false negative.


INTRODUCTION
Early identification of children on a developmental path to autism spectrum disorder (ASD) is vital for providing early, tailored intervention. However, early identification is challenging due to the heterogeneous nature of ASD in terms of symptom patterns and the onset time of symptom patterns (Chawarska et al., 2007; Ozonoff et al., 2010; Zwaigenbaum et al., 2015). While for some children, symptoms are evident during infancy and early in development, for others, symptoms are difficult to detect until social expectations exceed social abilities (Ozonoff et al., 2015). Thus, screening instruments for children in the early developmental period might not pick up children that have more subtle ASD symptom expression, but rather those with more severe disabilities regardless of ASD diagnosis (Øien et al., 2018; Stenberg et al., 2021).
[Correction added on 1st December 2021, after first online publication: The third affiliation has been updated.] Synnve Schjølberg and Frederick Shic shared first authorship.
One of the first systematic and widely used early developmental screening instruments for ASD was the Checklist for Autism in Toddlers (CHAT; Baird et al., 2000; Baron-Cohen et al., 1992). The Modified Checklist for Autism in Toddlers (M-CHAT) was subsequently derived from the CHAT in 2001 (Robins et al., 2001), broadening the symptom list to capture a larger proportion of the children with ASD. Since then, the M-CHAT and its derivative instruments have become some of the most widely used early screening instruments for ASD in young children, contributing to the early identification of children with ASD across the globe (Stewart & Lee, 2017). However, recent studies have reported that the M-CHAT, like the CHAT, yields high numbers of false negatives and false positives and thus performs clearly suboptimally (Baird et al., 2011; Carbone et al., 2020; Guthrie et al., 2019; Øien et al., 2018; Stenberg et al., 2014, 2021). The high number of false positives using the M-CHAT and its derivatives (Guthrie et al., 2019; Øien et al., 2018; Stenberg et al., 2014, 2021) adds to the discussion regarding the utility and cost-effectiveness of universal screening (Baird et al., 2011; Guthrie et al., 2019; Hickey et al., 2021; McPheeters et al., 2016; Øien et al., 2018; Siu et al., 2016; Stenberg et al., 2014, 2021; Surén et al., 2019; Yuen, Penner, et al., 2018). A high rate of false positives may lead to unnecessary anxiety for some parents. However, one benefit of positive results when screening for ASD is the potential for identifying other disabilities and difficulties that also require specialized health services. Research is not clear on whether adopting new criteria and/or new cutoff scores improves both the sensitivity and the specificity of screening instruments, or whether there is a trade-off between rates of false positives and false negatives.
Given the ongoing debate over the current lack of evidence for universal screening for ASD, obtaining this knowledge is crucial for understanding how attempts at optimization may impact and influence the "true costs" of autism screening.
A 20-item revision of the M-CHAT that primarily focused on reducing the rate of false positives was published in 2014 (Robins et al., 2014): the Modified Checklist for Autism in Toddlers, Revised (M-CHAT-R). The revision removed three items that were poor predictors of ASD and retained 20 items (with rewording and new exemplification for 12 of the 20 retained items and reordering of items). To help resolve potential ambiguity in item interpretation in the original M-CHAT, descriptive examples of each question were added in the M-CHAT-R. In addition, the risk score calculation algorithms were modified in the revision. Furthermore, during the transition from the M-CHAT to the M-CHAT-R, the standard follow-up (Robins et al., 2001) was more rigorously operationalized as part of the standard operating procedure for screening administration (yielding the M-CHAT-R/F, i.e., the M-CHAT-R questionnaire with follow-up interview). The purpose of the follow-up interview, for both the original M-CHAT and the M-CHAT-R, was to provide additional diagnostic accuracy when children scored in an "intermediate range" of risk on the questionnaire portion. In this work, we do not consider the follow-up interview, which is often irregularly administered in practice (Wallis et al., 2020).
However, more research is needed to assess the impact and trade-offs of methodological optimizations in ASD screening in the general population and in children of different ages to answer the question, "what are we optimizing for?" This includes understanding factors related to different aspects of assessment: for false positives, exploring symptom overlap with other neurodevelopmental disorders, and for false negatives, identifying broader symptom patterns than those currently considered core ASD symptoms. A limitation of much of the prior research on M-CHAT-related screening instruments, such as the original descriptive validation paper on the M-CHAT-R/F (Robins et al., 2014), is that these studies do not conduct prospective follow-up of all children, and as such tend to focus only on false positives while neglecting false negatives.
There are currently no studies that have simultaneously administered both the M-CHAT and the M-CHAT-R instruments with or without follow-up. For this reason, direct comparisons between the two measures are currently impossible. However, it is still possible to examine changes in screening performance due to algorithmic changes made in the transition from the M-CHAT to the M-CHAT-R.
This study aimed to evaluate whether the original M-CHAT's efficacy in identifying ASD could be optimized, by comparing the originally recommended M-CHAT cut-off criteria with a 20-item M-CHAT (M-CHAT 20) that was created from the original 23-item M-CHAT so as to replicate as closely as possible the changes incorporated into the M-CHAT-R. The 20 items of the M-CHAT 20 were the same as those retained in the M-CHAT-R, and the cut-off criteria applied for ASD risk status were the same as those recommended for the M-CHAT-R. Specifically, this study examines trade-offs in rates of false positives and false negatives between the original M-CHAT and the M-CHAT 20.

Participants
The present study utilizes data collected in the Norwegian Mother, Father, and Child Cohort Study (MoBa; Magnus et al., 2016). The Autism Birth Cohort (ABC) Study is a sub-study in the MoBa which aims to identify all ASD cases within MoBa (Skjaerven et al., 2006; Stoltenberg et al., 2010; Surén et al., 2019). MoBa is a national prospective general population pregnancy cohort that includes 114,552 children born between 1999 and 2009 (Magnus et al., 2016). Parents who agreed to participate in MoBa and the ABC study signed an informed consent form in each study. The study was approved by the Regional Committee for Medical and Health Research Ethics South East. MoBa data version 9 was used. In the present study, the child's status as ASD or non-ASD was determined based on the discharge diagnosis listed in the National Patient Registry (NPR) or on the diagnostic conclusion in the ABC study (at approximately 42 months). Diagnoses from the NPR are obtained prospectively, and are provided by specialized health services in Norway in clinics that conduct ASD-specific assessments utilizing gold standard instruments such as the ADOS and the ADI-R, together with other instruments including measures of cognitive and adaptive ability. The youngest children that participated in the MoBa study turn 12 years of age in 2021.

Study sample
This study uses data collected prospectively in the MoBa study and the ABC study. The primary focus is on the early developmental period of children whose parents received and returned the 18-month questionnaire, relating ASD-relevant characteristics to a later diagnosis of ASD from the NPR linkage or by ABC discharge diagnosis.
The complete M-CHAT was included as one section (translated and back-translated, and with items listed in the correct order) in the MoBa 18-month questionnaire from March 2005 through January 2011. Children whose parents returned the 18-month questionnaire and completed all 23 items from the M-CHAT (Robins et al., 2001) were included in the final sample (N = 54,463). Of the final sample, 337 children were later identified with an ASD diagnosis through the NPR or in the ABC clinic (mean age 42 months).

Original M-CHAT (2001) scoring
The original M-CHAT (Robins et al., 2001; for clarity, referred to here as the M-CHAT 23) is a 23-item, yes-no, parent-completed checklist developed for children aged 16-30 months. It was designed for completion by the childcare providers in the waiting room of well-baby clinics. A positive screening status depends on failing (1) two or more of the six critical discriminative items (the Crit6 criterion) and/or (2) three or more of the 23 items (the Tot23 criterion). When the M-CHAT 23 is used as a screening measure, a follow-up phone interview of screen positives is recommended to reduce false positives. These follow-up phone interviews were not conducted in the MoBa study due to their prohibitive cost.
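The two-part decision rule described above can be written out as a short sketch. This is an illustration only, not the authors' scoring code: the function name and the `critical_items` parameter are hypothetical, and because this section does not list which six items are critical, the caller must supply their numbers.

```python
from typing import Iterable, Sequence


def mchat23_screen_positive(failed: Sequence[bool],
                            critical_items: Iterable[int]) -> bool:
    """Return True if the original 23-item M-CHAT screen is positive.

    failed[i] is True when item i + 1 was failed (an at-risk response).
    critical_items holds the 1-based numbers of the six critical items.
    Assumption: these numbers are supplied by the caller, since they are
    not listed in this section.
    """
    if len(failed) != 23:
        raise ValueError("expected responses to all 23 items")
    critical = set(critical_items)
    if len(critical) != 6 or not all(1 <= i <= 23 for i in critical):
        raise ValueError("expected six distinct item numbers in 1..23")

    tot23 = sum(failed)                           # failures across all 23 items
    crit6 = sum(failed[i - 1] for i in critical)  # failures on critical items

    # Positive if either criterion is met: Crit6 >= 2 and/or Tot23 >= 3.
    return crit6 >= 2 or tot23 >= 3
```

Under this rule, a child who fails only two items screens positive if both are critical items and negative otherwise, which is why the critical-item list matters for borderline scores.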

M-CHAT-R (2014) scoring and M-CHAT 20 adaptation
With the introduction of the M-CHAT-R by Diana Robins and colleagues, and subsequent validation of the instrument (Robins et al., 2014), 20 out of 23 items from the M-CHAT 23 constituted the revised version, as the revision found three items to perform below par.
The present study explores the optimization of a screening checklist by testing the efficacy of excluding the three least predictive items from the M-CHAT 23 (Robins et al., 2001) and of changing the cut-off criteria for identifying children at risk for ASD. For clarity, we call this adapted measure the M-CHAT 20. It is important to note that the sample in this study was only administered the M-CHAT 23 at 18 months of age, and not the M-CHAT-R/F. Only the cut-offs and algorithm of the M-CHAT-R/F were applied to the original M-CHAT 23, similar to the procedure employed by Guthrie et al. (2019), to generate M-CHAT 20 scores.
For this 20-item M-CHAT 20, we used the cut-off criteria developed for the M-CHAT-R (Robins et al., 2014), as the cut-off criteria for screen positives were changed in the revision: a total score of item failures across the 20 items (Tot20) of 0-2 is regarded as low-risk (no action necessary), a score of 3-7 is considered medium-risk (needs further follow-up to ascertain more information on the "at-risk" responses), and a score of 8-20 is regarded as high-risk (skip the follow-up sequence and directly refer the child for a developmental and diagnostic assessment to determine if the child has ASD). In the present study, a score of 3 or above is regarded as "at-risk." It is important to note that children who failed two or fewer items would not have received a follow-up on either algorithm, even though some of these children would go on to receive a diagnosis of ASD. It is also important to note that, in the present study, neither children screening medium-risk nor children screening high-risk received a follow-up interview as implemented in the "F" portion of the M-CHAT-R/F.
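As a minimal sketch of the adaptation just described, the code below drops three items from a 23-item response vector and applies the Tot20 bands (0-2, 3-7, 8-20). The function names are hypothetical, and the numbers of the three dropped items are passed in by the caller because they are not listed in this section.

```python
from typing import Iterable, List, Sequence


def to_mchat20(failed23: Sequence[bool],
               dropped_items: Iterable[int]) -> List[bool]:
    """Drop three items from 23-item responses, leaving 20 items.

    dropped_items holds the 1-based numbers of the removed items.
    Assumption: supplied by the caller; not listed in this section.
    """
    dropped = set(dropped_items)
    if len(failed23) != 23:
        raise ValueError("expected responses to all 23 items")
    if len(dropped) != 3 or not all(1 <= i <= 23 for i in dropped):
        raise ValueError("expected three distinct item numbers in 1..23")
    return [f for i, f in enumerate(failed23, start=1) if i not in dropped]


def mchat20_risk_band(failed20: Sequence[bool]) -> str:
    """Map the Tot20 failure count to the M-CHAT-R risk bands."""
    if len(failed20) != 20:
        raise ValueError("expected responses to 20 items")
    tot20 = sum(failed20)
    if tot20 <= 2:
        return "low"       # 0-2: no action necessary
    if tot20 <= 7:
        return "medium"    # 3-7: follow-up interview recommended
    return "high"          # 8-20: refer directly for assessment


def mchat20_at_risk(failed20: Sequence[bool]) -> bool:
    """'At-risk' as used in the present study: Tot20 of 3 or above."""
    return mchat20_risk_band(failed20) != "low"
```

Because no follow-up interview was administered, the medium and high bands collapse into a single "at-risk" outcome in this study; the band function is kept separate only to mirror the published cut-offs.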

Statistical analyses
To examine whether the M-CHAT 20 reduced false positives and false negatives compared to the original M-CHAT 23, 2 × 2 cross-tabulations comparing M-CHAT 20 versus M-CHAT 23 screening results were assembled for each outcome group (ASD or non-ASD; Table 1). These tables were used to (1) calculate sensitivity (SE), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV; Table 2), and (2) identify significant differences in identification between criteria using McNemar's nonparametric test.
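The quantities named above can be sketched in a few lines of standard-library Python. This is an illustration rather than the analysis code used in the study: the function names are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the text does not state which variant was applied. Confidence intervals are not computed here.

```python
from math import comb
from typing import Dict


def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV from a 2 x 2 table of
    screening result (positive/negative) by outcome (ASD/non-ASD)."""
    return {
        "SE": tp / (tp + fn),   # share of ASD cases that screen positive
        "SP": tn / (tn + fp),   # share of non-ASD cases that screen negative
        "PPV": tp / (tp + fp),  # share of screen positives with ASD
        "NPV": tn / (tn + fn),  # share of screen negatives without ASD
    }


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: positive on criterion A, negative on criterion B
    c: negative on criterion A, positive on criterion B
    Under the null hypothesis the discordant pairs split 50/50, so the
    p-value is a two-sided binomial tail probability.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant pairs among children with ASD, as reported in the abstract:
# 12 positive on M-CHAT only, 0 positive on M-CHAT-R only.
p_value = mcnemar_exact_p(12, 0)
```

Only the discordant pairs enter McNemar's test, so children classified the same way by both algorithms do not affect the p-value.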

RESULTS
In total, the parents of 54,463 children responded to the 18-month questionnaire, which was returned to the Norwegian Institute of Public Health at a mean child age of 19 months (M = 19.02, SD = 1.21). Of these children, 337 were diagnosed with ASD later in childhood.
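Table 2. Sensitivity (SE), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV) with 95% confidence intervals for the M-CHAT 23 and M-CHAT 20 algorithms and algorithm components.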

Performance of the M-CHAT 20 cut-off criteria
For children with an eventual outcome of ASD, 2 × 2 tables revealed that the M-CHAT 20 criterion (Tot20 cutoff) led to 93 (

DISCUSSION
The present study evaluated the impact of scoring algorithm changes from the M-CHAT 23 to the M-CHAT-R by applying M-CHAT 23 and M-CHAT-R scoring criteria to a large sample of individuals with long-term developmental follow-up for ASD. Findings indicated that moving to M-CHAT-R scoring criteria (i.e., the M-CHAT 20 version) decreased the false positive rate by 2.4 percentage points (1291/54126 children with no ASD diagnosis) at the cost of a 3.6 percentage point increase in the false negative rate (12/337 children with ASD). This trade-off was in line with the findings of the validation study of the M-CHAT-R as compared to the original M-CHAT 23 (Robins et al., 2014). These relatively small changes should be considered in the context of the overall performance of the M-CHAT 23, with or without M-CHAT-R scoring algorithms applied. As in Guthrie et al. (2019), Stenberg et al. (2014), and Øien et al. (2018), high numbers of false negatives and false positives were present, with most children with ASD (at least 68.8%, 232/337) screening negative and relatively few children without ASD (at most 7.5%, 4048/54126) screening positive. While a positive screen in a child without ASD may have triggered undue alarm in the family, children with false negative screens, who ultimately received an ASD diagnosis, would not have received any follow-up under the M-CHAT 20 criterion for children scoring between 0 and 2. This is in line with previous studies reporting that most children with a later diagnosis of ASD who were screened at 18 months of age in prospective general population cohorts are not identified at that age (Guthrie et al., 2019; Øien et al., 2018; Stenberg et al., 2014). Some of the M-CHAT-R's enhanced efficacy in identifying children at risk can be achieved by selecting the most efficient items from a checklist and deleting those with poor performance.
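The arithmetic behind these percentages can be checked directly from the counts quoted in the paragraph above. The sketch below uses only those counts; the M-CHAT 20 cell counts are derived here by addition and subtraction and are not copied from the paper's tables, and the function and variable names are hypothetical.

```python
from typing import Dict

N_ASD = 337          # children later diagnosed with ASD
N_NON_ASD = 54126    # children with no ASD diagnosis

FN_23 = 232          # ASD children screening negative on the M-CHAT 23
FP_23 = 4048         # non-ASD children screening positive on the M-CHAT 23
EXTRA_FN = 12        # additional false negatives under M-CHAT-R scoring
FEWER_FP = 1291      # false positives removed under M-CHAT-R scoring


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def tradeoff_summary() -> Dict[str, float]:
    # Derived counts for the M-CHAT 20 (an inference from the quoted
    # figures, consistent with 93 correctly identified ASD children).
    fn_20 = FN_23 + EXTRA_FN
    fp_20 = FP_23 - FEWER_FP
    return {
        "fn_rate_23": pct(FN_23, N_ASD),
        "fn_rate_20": pct(fn_20, N_ASD),
        "fp_rate_23": pct(FP_23, N_NON_ASD),
        "fp_rate_20": pct(fp_20, N_NON_ASD),
        "fn_increase_points": pct(EXTRA_FN, N_ASD),
        "fp_decrease_points": pct(FEWER_FP, N_NON_ASD),
    }


summary = tradeoff_summary()
```

Read this way, the algorithm change moves the false negative rate from about 68.8% to about 72.4% and the false positive rate from about 7.5% to about 5.1%, so the changes are in percentage points of each outcome group rather than relative reductions.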
To reiterate, our findings suggest that the M-CHAT 20 increases the false negative rate while reducing the false positive rate, that is, it improves specificity but reduces sensitivity. However, neither the M-CHAT 23 nor the M-CHAT 20 performs adequately in identifying children with a prospective diagnosis of ASD (high number of false negatives), as revealed in previous studies (Guthrie et al., 2019; Øien et al., 2018; Stenberg et al., 2014; Yuen, Penner, et al., 2018). The original CHAT showed similar difficulties in identifying ASD in a general population (Baird et al., 2001). It is important to note that the suboptimal performance and the systematic identification of the more severely affected children (Stenberg et al., 2021) are not exclusive to the M-CHAT(-R) at 18 and 24 months, but also seem to be present with other screening instruments, such as the Social Communication Questionnaire (SCQ) at 36 months (Surén et al., 2019).
As shown in Stenberg et al. (2021), many children deemed "at-risk" for ASD at 18 months were later diagnosed with other developmental disabilities, indicating that a change in criterion might reduce the identification of other developmental disabilities. Thus, a trade-off of increasing the specificity and decreasing the sensitivity might ultimately lead to fewer children being identified who go on to develop ASD, as well as to missing children with other developmental disabilities who have valid needs for early identification. Reducing the sensitivity might increase the age of diagnosis and delay access to early intervention. However, we acknowledge that using screening instruments in primary care has advantages: it helps clinicians familiarize themselves with symptom patterns, and the instruments do identify some children at 18 months of age. Screening might also increase the knowledge in pediatric and well-visit clinics of early signs and symptoms of ASD. It is also established that these instruments work well when there is parental concern. Screening instruments can help specify the difficulties that children have at a given time in their developmental course. In particular, efforts to identify more of the children who currently screen false negative should be the most pressing concern in research on early identification. In this context, it is crucial to systematically assess behavioral differences in children at well-baby clinics using different developmental instruments and, additionally, to use caution when interpreting both positive and negative screens, because of the high number of false negatives. Because symptoms might not be evident at 18 months, we might ask ourselves if we are asking the wrong questions or using inappropriate measures.
In addition to developmental surveillance, sensitivity to parental concern, and using sound clinical judgment, it might be necessary to revisit constructs of ASD at different timepoints to improve early identification of the disorder. Thinking about screening instruments as "one measure to rule them all" may be utopian as various instruments serve different roles in identifying children with ASD.
When considering updates to screening measures, it may be critical to ask what is being changed and how that changes the weight of the diagnostic process. As recent studies have found (Guthrie et al., 2019; Øien et al., 2018; Stenberg et al., 2014, 2021; Sturner, Howard, Bergmann, Morrel, et al., 2017; Sturner, Howard, Bergmann, Stewart, & Afarian, 2017), screening for ASD is not as straightforward as would be implied by the original instrument publications and associated validation studies. In particular, replication of results from validation studies in longitudinal and prospective studies is necessary to understand mechanisms of symptom patterns and later outcomes for both false positives and false negatives. Reduction of false positives is important if the aim of the screening process is to identify only cases of ASD; however, it could be argued that detecting other developmental disabilities is of equal importance.

CONCLUSION
The results suggest that the original 23-item M-CHAT (labeled the M-CHAT 23 in this work), after removing three items and using M-CHAT-R scoring criteria (labeled the M-CHAT 20 in this work), has lower sensitivity than the original M-CHAT 23 using all items with its original scoring method. In line with expectations generated by the original M-CHAT-R/F validation studies, however, we also found increased specificity for identifying children with ASD when using M-CHAT-R scoring on the M-CHAT 23. Still, a more extensive investigation into different pathways to diagnosis is needed to tailor more dynamic instruments that identify sets of markers for children whom current screening instruments miss at 18 months of age. One option is to compare two algorithms in epidemiological-type samples while changing the number of items, and to use complementary analyses, such as moving cutoff points. The main advantage of doing this in epidemiological-type samples is the possibility to study the trade-offs and to optimize performance.
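The complementary analysis mentioned above, moving the cutoff point, can be illustrated with a short sketch. The function name is hypothetical and the scores in the usage lines are invented toy values, not study data; the sketch only shows how sensitivity and specificity move in opposite directions as the cutoff rises.

```python
from typing import List, Sequence, Tuple


def sweep_cutoffs(scores: Sequence[int],
                  has_asd: Sequence[bool],
                  cutoffs: Sequence[int]) -> List[Tuple[int, float, float]]:
    """Return (cutoff, sensitivity, specificity) for each cutoff.

    A child screens positive when the total failure score >= cutoff.
    """
    if len(scores) != len(has_asd):
        raise ValueError("scores and has_asd must have the same length")
    n_asd = sum(has_asd)
    n_non = len(has_asd) - n_asd
    if n_asd == 0 or n_non == 0:
        raise ValueError("need at least one ASD and one non-ASD case")

    rows = []
    for cutoff in cutoffs:
        tp = sum(1 for s, d in zip(scores, has_asd) if d and s >= cutoff)
        tn = sum(1 for s, d in zip(scores, has_asd) if not d and s < cutoff)
        rows.append((cutoff, tp / n_asd, tn / n_non))
    return rows


# Toy data (hypothetical): four ASD cases followed by six non-ASD cases.
toy_scores = [1, 2, 4, 9, 0, 0, 1, 1, 2, 3]
toy_has_asd = [True] * 4 + [False] * 6
toy_rows = sweep_cutoffs(toy_scores, toy_has_asd, cutoffs=[1, 2, 3, 4])
```

In a sample with complete prospective follow-up, the same sweep would show what each candidate cutoff costs in missed cases for every false positive it removes.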
To conclude, it is important that clinicians exhibit caution when interpreting the result of a screening, whether positive or negative. For a positive result, caution is of great importance as the PPV is universally suboptimal. For a negative result, caution is needed because more than two-thirds of children with a later diagnosis of ASD will not be identified by any extant or prior criteria or cut-off, not even being flagged as being at sufficient risk to receive a follow-up interview. These limitations may result from multiple factors. The M-CHAT-R/F algorithm may improve the false positive issue; however, the false negative issue is still to be tackled, as it persists with the new algorithm. This could be a result of symptoms not being evident or prototypical at 18 months of age, and thus it might be that these individuals would not meet the criteria for an ASD diagnosis at this age using gold standard instruments either. That is, they might not have ASD at an immediate evaluation but meet the criteria for ASD later in the developmental period. As highlighted by Øien et al. (2018), also using MoBa data, children who screened negative but later received a diagnosis showed developmental atypicalities not seen in true negatives, even though both groups had the same screening status at 18 months of age, which might indicate subtler developmental issues. This highlights the need for developmental surveillance, as it may also be the case that we are asking the wrong questions or performing the wrong tests. Children should therefore be followed up in terms of development at different timepoints regardless of screening status early in the developmental period. There is a clear need for continued improvement in this domain, but it is similarly essential to consider what is actually being changed and how those changes impact the weight of the diagnostic process.

Limitations
A limitation of this study is that the MoBa questionnaire did not include the M-CHAT-R item wording and item sequence. More specifically, it is not possible to know if the rewording or resequencing of the items and the additional examples that are added to each item in the M-CHAT-R would affect the results in a positive direction. As noted, very few studies, if any, are able to conduct such analyses on the same set of children with ASD. The items still preserve the same phenomenology, even without the exemplification, and it seems likely that most children with a future diagnosis of ASD would still score below the cutoff for follow-up or "at-risk" status on the revised version, as on the original version of the instrument. Furthermore, we did not conduct the follow-up of individuals screening positive, and thus the reduction of false positives is based solely on the algorithmic change applied to the M-CHAT 23. We had neither full access to the MoBa study nor access to outcomes associated with developmental disabilities other than ASD, so it is not possible to report, with the current dataset, how many of the false positives went on to receive other diagnoses. However, we have previously reported information from the Autism Birth Cohort (ABC) study (Stenberg et al., 2021), and we have added the information from the ABC sub-study (N = 1033) to the Appendix of this article to highlight the large number of false positives that received other diagnoses at assessment at 42 months. This provides additional clarity on outcomes likely associated with the false positive group. Future studies aim to use the NPR to show how many of the MoBa children received other diagnoses; however, these data were not available to the authors at this time.

ACKNOWLEDGMENTS
The authors would like to acknowledge and thank all participating families and their children.

APPENDIX A: AUTISM BIRTH COHORT STUDY: OVERVIEW OF DIAGNOSTIC OUTCOME VERSUS SCREENING STATUS