Measurement bias in caregiver‐report of early childhood behavior problems across demographic factors in an ECHO‐wide diverse sample

Abstract
Background: Research and clinical practice rely heavily on caregiver‐report measures, such as the Child Behavior Checklist 1.5–5 (CBCL/1.5‐5), to gather information about early childhood behavior problems and to screen for child psychopathology. While studies have shown that demographic variables influence caregiver ratings of behavior problems, the extent to which the CBCL/1.5‐5 functions equivalently at the item level across diverse samples is unknown.
Methods: Item‐level data of CBCL/1.5‐5 from a large sample of young children (N = 9087) were drawn from 26 cohorts in the Environmental influences on Child Health Outcomes program. Factor analyses and the alignment method were applied to examine measurement invariance (MI) and differential item functioning (DIF) across child (age, sex, bilingual status, and neurodevelopmental disorders), and caregiver (sex, education level, household income level, depression, and language version administered) characteristics. Child race was examined in sensitivity analyses.
Results: Items with the most impactful DIF across child and caregiver groupings were identified for Internalizing, Externalizing, and Total Problems. The robust item sets, excluding the high DIF items, showed good reliability and high correlation with the original Internalizing and Total Problems scales, with lower reliability for Externalizing. Language version of CBCL administration, education level and sex of the caregiver respondent showed the most significant impact on MI, followed by child age. Sensitivity analyses revealed that child race has a unique impact on DIF over and above socioeconomic status.
Conclusions: The CBCL/1.5‐5, a caregiver‐report measure of early childhood behavior problems, showed bias across demographic groups. Robust item sets with less DIF can measure Internalizing and Total Problems equally as well as the full item sets, with slightly lower reliability for Externalizing, and can be crosswalked to the metric of the full item set, enabling calculation of normed T scores based on more robust item sets.


INTRODUCTION
Behavior problems, including noncompliance, emotional distress and outbursts, and disruptive behavior, are common during the toddler and preschool years (Wakschlag et al., 2007). While many such behavior problems are developmentally normative, some young children exhibit clinically significant behavior problems that disrupt child and family functioning (Keenan & Wakschlag, 2002). Population-based studies have reported prevalence estimates of elevated behavioral problems ranging from 7% to about 20% during the preschool years (Bayer et al., 2012; Briggs-Gowan et al., 2001; Egger & Angold, 2006; Stülb et al., 2019). Importantly, clinically significant early behavior problems may be predictive of later psychopathology and adverse outcomes in adolescence and adulthood (Campbell, 1995; Eisenberg et al., 2009; Mathiesen & Sanson, 2000; Prior et al., 1992), underscoring the importance of early identification and intervention.

Key points
• Caregiver characteristics, especially language version, education level and sex of the caregiver respondent, greatly impacted the differential functioning of many CBCL/1.5-5 items.
• Child age also influenced the measurement of child behavior problems on the CBCL/1.5-5.
• SES variables (i.e., caregiver education and household income levels) cannot fully account for measurement bias related to child race.

Reliable and valid measures are central to assessment and treatment. For preschoolers, the Child Behavior Checklist for Ages 1.5-5 (CBCL/1.5-5) is one of the most commonly used measures for screening for behavior problems in both clinical and research settings (Ivanova et al., 2010; Medeiros et al., 2017), as it shows good reliability and validity in studies globally (Ivanova et al., 2010; Konold et al., 2003; Rescorla et al., 2011). In their initial validation, Achenbach and Rescorla (2001) described a factor structure of seven syndrome scales, which were then grouped under two second-order factors of Internalizing and Externalizing Problems. Subsequent confirmatory factor analyses have generally found this structure holds in samples with different compositions of nationality, culture, sex, and diagnosis (Ivanova et al., 2010; Konold et al., 2003; Koot et al., 1997; Medeiros et al., 2017; Tan et al., 2007). Accumulating research evidence supports the use of the two broad domains of Internalizing and Externalizing as transdiagnostic constructs for profiling clinically significant behavior problems in young children (Achenbach et al., 2016; Krueger & Markon, 2011).
Despite basic psychometric validations of the CBCL/1.5-5 (e.g., reliability, validity, structure), questions remain about whether it measures behavior problems equivalently across samples with different demographic compositions (i.e., does the same score reflect similar levels of problems regardless of child or caregiver demographic characteristics?). This warrants further investigation, as previous research has shown that measured/observed levels of child behavior problems can be affected by child factors, including age, sex, bilingual status, and developmental level (Carneiro et al., 2016; Chen, 2010; Sun et al., 2021; Wakschlag et al., 2017), as well as by caregiver characteristics, including informant's primary language, sex (father vs. mother), socioeconomic status (SES), and mental health (Davis & Qi, 2022; Flouri et al., 2017; Müller et al., 2011; Schroeder et al., 2010). Given that observed raw score differences across demographic groups could result from measurement bias and/or true group differences, it is necessary to ensure measurement equivalence, or to account for measurement bias if it exists, before concluding that measured differences across groups reflect actual differences in levels of behavior problems (Vandenberg & Lance, 2000). For example, developmentally, younger children might use whining (Item 97) to communicate more often regardless of their levels of internalizing problems, in which case high scores on this item might not be as reflective of the true levels of internalizing problems in younger children as in older children.
Analyses of measurement invariance (MI) and differential item functioning (DIF) allow examination of measurement bias across groups to facilitate the estimation of actual group differences after accounting for measurement bias. Surprisingly, despite its widespread use, little MI/DIF work has investigated whether the CBCL/1.5-5 functions equivalently across groups that differ by child and caregiver characteristics. We identified only three studies that investigated MI/DIF of the CBCL/1.5-5 (Dovgan et al., 2019; Gross et al., 2006; Rescorla et al., 2019). Rescorla and colleagues (2019) reported that the CBCL/1.5-5 Autism Spectrum Subscale functioned equivalently across ages 18 months, 3 years, and 5 years, demonstrating its potential utility for tracking longitudinal changes across the preschool years. In a study of children with autism spectrum disorder (ASD), Dovgan et al. (2019) found that the CBCL/1.5-5 syndrome subscales of emotional reactivity, anxious/depressed, and somatic complaints were non-invariant between those with and without intellectual disability (ID), raising concerns about the application of these syndrome scales in those with ASD and ID.
When examining MI of Internalizing and Externalizing scales across groups defined by parent race/ethnicity, family income, and language version (English vs. Spanish), Gross et al. (2006) found differential functioning of items on both the Internalizing and Externalizing scales. However, their findings are limited by the selective set of grouping variables tested. In sum, studies of the CBCL/1.5-5 support partial invariance of the scale across certain subgroups, and further investigation is warranted to inform the use of the CBCL/1.5-5 in diverse samples.
To date, no study has systematically assessed the MI/DIF of the CBCL/1.5-5 across a broad range of child and caregiver-level characteristics, likely due to the lack of a large enough sample with sufficient diversity across multiple relevant characteristics. This type of MI/DIF analysis is necessary to ensure that the measure can be reliably and validly employed to measure behavior problems in diverse groups of young children. Data from the National Institutes of Health (NIH) Environmental influences on Child Health Outcomes (ECHO) program provide a unique opportunity for a comprehensive psychometric examination of the CBCL/1.5-5 across multiple child and caregiver characteristics. Therefore, the current study leveraged these data to examine the configural invariance and item-level MI/DIF of the CBCL/1.5-5 across a wide array of child and caregiver (i.e., respondent who completed the CBCL/1.5-5) characteristics that are available in the ECHO dataset and have been reported to influence caregiver-report of child behavior problems. Moreover, we also explored MI/DIF across child race groups in the context of SES variables as sensitivity analyses.

Participants
Data for the current study were drawn from the NIH ECHO program.
For more information on ECHO, see Blaisdell et al. (2021). The aggregated ECHO dataset was queried to identify cases from 26 out of 69 cohorts that met the inclusion criteria: (a) CBCL/1.5-5 was administered to caregivers of children aged 18-71 months; (b) CBCL/1.5-5 item-level data were available; (c) data were available on at least one of the child or caregiver-characteristic variables of interest.
When multiple timepoints of CBCL/1.5-5 data were available for a single child, the first administration was taken, yielding a final sample size of 9087 CBCL/1.5-5 administrations. See Table 1 for distributions of child and caregiver characteristics. For each MI/DIF testing, only individuals with data on the assessed variable(s) were included in the corresponding analyses (i.e., individuals missing data on any of the specific variables were excluded for the specific set of analyses), allowing for the largest inclusion sample possible for each analysis (see Table 1 for proportions of missing data by child and caregiver characteristics).
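The selection rule described above (one administration per child, within the eligible age range) can be illustrated with a minimal Python sketch. The record layout (`child_id`, `age_months`) is hypothetical, and treating "first administration" as the youngest age at administration is an assumption of this example rather than a detail confirmed by the text.

```python
from typing import Dict, Iterable, List, NamedTuple


class Administration(NamedTuple):
    """One CBCL/1.5-5 administration (hypothetical record layout)."""
    child_id: str
    age_months: float


def first_administrations(records: Iterable[Administration]) -> List[Administration]:
    """Keep the earliest in-range administration for each child."""
    earliest: Dict[str, Administration] = {}
    for rec in records:
        # Inclusion criterion (a): child aged 18-71 months at administration.
        # Treating this as 18 <= age < 72 is an assumption of this sketch.
        if not 18 <= rec.age_months < 72:
            continue
        kept = earliest.get(rec.child_id)
        if kept is None or rec.age_months < kept.age_months:
            earliest[rec.child_id] = rec
    return list(earliest.values())


records = [
    Administration("A", 30.0),
    Administration("A", 24.0),  # earlier timepoint for the same child
    Administration("B", 40.0),
    Administration("C", 80.0),  # outside the eligible age range
]
sample = first_administrations(records)
```

In this toy input, child A contributes only the 24-month administration and child C is excluded entirely.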
MEASUREMENT BIAS IN CBCL/1.5-5

Measures
CBCL/1.5-5 (Achenbach & Rescorla, 2001) requires caregivers to rate their child's behaviors "now or within the past 2 months" on items describing behavior problems on a three-point scale: 0 "Not

Grouping variables for measurement invariance/differential item functioning testing
We tested MI/DIF across univariate groupings defined by single child and caregiver characteristics separately, and across multivariate groupings defined by two child and caregiver variables in combination. Throughout, we use the term grouping to refer to subgroups defined by various levels of a given characteristic or combination of characteristics.

Univariate groupings
Child characteristics of interest included biological sex (male vs. female), age at the administration of CBCL/1.5-5 (18-27, 27-36, 36-45, 45-54, and 54-72 months, with children at the group boundaries assigned to the older group), bilingual/multilingual status (yes vs. no), and caregiver-reported diagnosis of a neurodevelopmental disorder (NDD) (none vs. any). For child bilingual and NDD variables, we included data from the same timepoint (i.e., within 6 months of CBCL/1.5-5 administration date) when available and otherwise used historic data from the closest previous timepoint to keep more cases for analysis. Bilingual status was determined using a harmonized indicator of bilingual/multilingual exposures, incorporating direct reports of bilingual exposure and language of administration for child tests within ECHO. We used the following NDD diagnoses reported on the ECHO medical history form: ASD, intellectual/developmental disorders, attention deficit/hyperactivity disorder, learning disability, and speech disorder. If the child was reported to have one or more of these diagnoses, they were classified as having NDD.
Caregiver characteristics included respondent sex (male vs. female), language version of CBCL administration (English vs. Spanish; the only two available in the ECHO data), annual household income levels (<$30,000; $30,000-$49,999; $50,000-$74,999; $75,000-$99,999; ≥$100,000), caregiver education level, and caregiver depression. Given the low number of male caregivers completing CBCL/1.5-5, caregiver educational level and depression status were only examined for female caregivers (mostly mothers). Education level was harmonized to derive four categories: less than high school; high school diploma or equivalent; some college; Bachelor's degree and above. Caregiver depression was indexed using the PROMIS® Depression T score (mean of 50 and standard deviation of 10 normed in reference to U.S. adults) as a common metric to which multiple instruments have been linked (Blackwell et al., 2021; Choi et al., 2014; Kaat et al., 2017). For the current analyses, caregiver depression was dichotomized to form groups according to the recommended threshold of T score = 60 to distinguish cases with elevated depression levels. For caregiver depression status, only data from the same timepoint of CBCL/1.5-5 were used; for other caregiver demographic variables, historic data from the closest previous timepoint were used when the variable was not available from the same timepoint as CBCL/1.5-5.
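As a small illustration of the child-level groupings described above, the following Python sketch assigns an age band with boundary ages placed in the older group and flags NDD status when any listed diagnosis is reported. The function names and diagnosis label strings are hypothetical, and the half-open 18 to 72 month range is an assumption of this example.

```python
import bisect
from typing import Iterable

# Lower boundaries of the second through fifth age bands, in months.
AGE_BOUNDARIES = [27, 36, 45, 54]
AGE_LABELS = ["18-27", "27-36", "36-45", "45-54", "54-72"]

# Hypothetical label strings for the NDD diagnoses named in the text.
NDD_DIAGNOSES = {
    "ASD",
    "intellectual/developmental disorder",
    "ADHD",
    "learning disability",
    "speech disorder",
}


def age_group(age_months: float) -> str:
    """Return the age band; a child exactly on a boundary goes to the older band."""
    if not 18 <= age_months < 72:
        raise ValueError("age outside the 18-72 month range")
    # bisect_right sends boundary values (e.g., 27) to the band that starts there.
    return AGE_LABELS[bisect.bisect_right(AGE_BOUNDARIES, age_months)]


def has_ndd(reported_diagnoses: Iterable[str]) -> bool:
    """Classify as NDD if one or more of the listed diagnoses is reported."""
    return any(d in NDD_DIAGNOSES for d in reported_diagnoses)
```

For instance, a child aged exactly 27 months falls in the "27-36" band rather than "18-27".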

Multivariate groupings
To understand the differential impact of variables on MI/DIF, we examined multivariate groupings defined by combinations of child or caregiver characteristics. Based on data availability and known confounding impacts on child behavior problems, we assessed MI/DIF across groupings defined by: (a) child sex and age; (b) child sex and NDD diagnosis (May et al., 2019); and (c) caregiver education and income levels (Braveman et al., 2005). To yield sufficient sample sizes, caregiver education was collapsed into two categories (less than Bachelor's degree, Bachelor's degree or higher) and income was collapsed into three categories (<$50,000, $50,000-$74,999, $75,000 or more).
Sensitivity Analyses. We conceptualized race as a social-cultural variable that is confounded with SES variables (Cheng, Goodman, & The Committee on Pediatric Research, 2015) and thus conducted sensitivity analyses to evaluate the unique impact of child race/ethnicity on MI/DIF, as previous studies have shown the impact of race on the measurement of behavior problems and psychopathology in children (Gross et al., 2006; Vaughn-Coaxum et al., 2016).
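The collapsing of education and income categories for the multivariate SES grouping can be sketched as a pair of lookup tables. This is only an illustration: the category label strings and the `ses_grouping` helper are hypothetical, while the collapsed categories follow the description above.

```python
from typing import Tuple

# Four harmonized education categories collapsed to two.
EDUCATION_COLLAPSE = {
    "less than high school": "less than Bachelor's degree",
    "high school diploma or equivalent": "less than Bachelor's degree",
    "some college": "less than Bachelor's degree",
    "Bachelor's degree and above": "Bachelor's degree or higher",
}

# Five income levels collapsed to three.
INCOME_COLLAPSE = {
    "<$30,000": "<$50,000",
    "$30,000-$49,999": "<$50,000",
    "$50,000-$74,999": "$50,000-$74,999",
    "$75,000-$99,999": "$75,000 or more",
    ">=$100,000": "$75,000 or more",
}


def ses_grouping(education: str, income: str) -> Tuple[str, str]:
    """Return the (education, income) cell used for the multivariate grouping."""
    return EDUCATION_COLLAPSE[education], INCOME_COLLAPSE[income]


# All cells that the collapsed categories can produce (2 x 3 = 6).
ALL_CELLS = {
    (edu, inc)
    for edu in EDUCATION_COLLAPSE.values()
    for inc in INCOME_COLLAPSE.values()
}
```

Crossing the two collapsed variables yields six groups, which is what keeps each cell large enough for group-specific estimation.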

Statistical analyses
Separate analyses were conducted for each of the univariate and multivariate groupings for each latent construct of Internalizing, Externalizing, and Total Problems.All item response theory (IRT) models were estimated using the graded response model (Samejima, 1997) using the mirt package (Chalmers, 2012)  ) from IRT to commonly used criteria (Yuan & Chan, 2005).Standardized root mean squared residual below 0.08 was used as the criterion for adequate fit (Hu & Bentler, 1999), while RMSEA is reported as a secondary index of fit.For Total Problems, the unidimensional and bifactor models were compared with respect to Akaike information criterion (AIC) and sample size-adjusted Bayesian Information Criterion (SABIC), where a difference greater than 10 indicated a superior fit for the model with lower deviance (Raftery, 1995).
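The fit criteria above can be expressed compactly. The sketch below uses the standard definitions of AIC and sample size-adjusted BIC computed from a model log-likelihood; the function names are hypothetical, and the decision helper simply encodes the "difference greater than 10" and "SRMSR below 0.08" rules stated in the text. It is not the fitting code used in the study.

```python
import math


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * n_params


def sabic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Sample size-adjusted BIC: the penalty uses ln((N + 2) / 24) instead of ln(N)."""
    return -2.0 * log_lik + n_params * math.log((n_obs + 2) / 24.0)


def preferred_model(ic_unidimensional: float, ic_bifactor: float,
                    threshold: float = 10.0) -> str:
    """Prefer the model with the lower criterion only if the gap exceeds the threshold."""
    diff = ic_unidimensional - ic_bifactor
    if diff > threshold:
        return "bifactor"
    if diff < -threshold:
        return "unidimensional"
    return "no clear preference"


def adequate_fit(srmsr: float, cutoff: float = 0.08) -> bool:
    """SRMSR below the cutoff is treated as adequate fit."""
    return srmsr < cutoff
```

A criterion gap of 10 or less is reported as "no clear preference" here; how such ties would be handled in practice is a choice of this sketch.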
Alignment method for measurement invariance/differential item functioning estimation

Statistical testing of measurement invariance/differential item functioning
We tested DIF for statistical significance using ANOVA, treating parameter estimates and standard errors analogously to sample statistics (mean and standard error, respectively) in conventional ANOVA. Significance served as a screen before the impact of DIF was examined. For each grouping, an item was determined to have statistically significant DIF if any Bonferroni-corrected p value was less than 0.05, where the correction was applied for all item parameters. For multivariate groupings, two-way ANOVA was used, allowing DIF by each variable and the interaction between variables to be partitioned. To evaluate the differential impact of variables on MI/DIF, η² statistics were calculated for the main and interaction effects of DIF on each item, allowing comparisons of impact on the magnitude of DIF among the three sources. See Appendix S1 for additional details.
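The item-level decision rule can be sketched as follows. This example assumes that the p values for an item's parameters have already been obtained from the ANOVA-style tests, and it leaves the size of the Bonferroni family as an explicit argument (`n_tests`), because the text does not spell out whether the family is the parameters of one item or of all items. Function names are hypothetical.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], n_tests: int) -> List[float]:
    """Bonferroni-adjust p values for a family of n_tests comparisons (capped at 1)."""
    return [min(1.0, p * n_tests) for p in p_values]


def item_has_significant_dif(param_p_values: Sequence[float], n_tests: int,
                             alpha: float = 0.05) -> bool:
    """Flag an item if any of its corrected parameter-level p values is below alpha."""
    return any(p < alpha for p in bonferroni(param_p_values, n_tests))
```

For a three-category item with one slope and two thresholds, `item_has_significant_dif([0.01, 0.20, 0.50], n_tests=3)` flags the item, whereas raw p values of 0.02, 0.20, and 0.50 would not survive the correction.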

Assessing the impact of measurement invariance/differential item functioning
To steer away from reliance on significance testing, we used the un-

Linking of robust and full item sets
For each domain, we used equipercentile equating (Kolen & Brennan, 2004), implemented in the equate package in R (Albano, 2016), to derive crosswalk tables that can be used to link total scores from the robust item sets to the total score metric of the original full item set. Then, the linked total scores can be used to derive the corresponding T scores using conversion tables available from the CBCL scoring manual. Correlations and mean differences between the scores from the full (i.e., originally published scale) and the linked scores from the robust item sets were computed to assess the reliability and bias, respectively, of the resulting linkages. Lastly, we conducted group comparisons of T scores based on the robust item sets across levels of each child and caregiver characteristic to examine whether substantive group differences remain after removing items with impactful measurement bias.
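To make the linking step concrete, here is a simplified Python sketch of equipercentile linking for integer sum scores. It uses mid-percentile ranks and a uniform continuization of the discrete score distributions, with no presmoothing; it is an illustration of the general technique under those assumptions and is not the `equate` package implementation used in the study. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def cumulative_proportions(scores: Sequence[int], max_score: int) -> List[float]:
    """Cumulative proportion of examinees at or below each score 0..max_score."""
    counts = Counter(scores)
    n = len(scores)
    cum, running = [], 0
    for s in range(max_score + 1):
        running += counts.get(s, 0)
        cum.append(running / n)
    return cum


def percentile_rank(score: int, cum: List[float]) -> float:
    """Mid-percentile rank (as a proportion) of an integer score."""
    below = cum[score - 1] if score > 0 else 0.0
    return below + 0.5 * (cum[score] - below)


def score_at_percentile(p: float, cum: List[float]) -> float:
    """Inverse of the continuized percentile rank function."""
    for s, upper in enumerate(cum):
        lower = cum[s - 1] if s > 0 else 0.0
        if upper > p:
            # Each integer score s is spread uniformly over [s - 0.5, s + 0.5).
            return s - 0.5 + (p - lower) / (upper - lower)
    return len(cum) - 0.5  # p equals 1: upper edge of the highest score


def crosswalk(robust: Sequence[int], full: Sequence[int],
              robust_max: int, full_max: int) -> Dict[int, float]:
    """Map each robust-set sum score to the full-set score with the same percentile rank."""
    cum_r = cumulative_proportions(robust, robust_max)
    cum_f = cumulative_proportions(full, full_max)
    return {s: score_at_percentile(percentile_rank(s, cum_r), cum_f)
            for s in range(robust_max + 1)}


# Toy data in which every full score is exactly twice the robust score.
table = crosswalk([0, 1, 1, 2, 3], [0, 2, 2, 4, 6], robust_max=3, full_max=6)
```

In the toy data the linked scores recover the doubling relationship, which is a quick sanity check that equal percentile ranks are being matched.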

Configural invariance of CBCL/1.5-5
Both AIC and SABIC indicated that the bifactor model fit better than the unidimensional model across all subsamples for Total Problems.
We, therefore, used the bifactor model parameters for DIF testing.
Configural invariance was tested across all groupings of interest.
Figure 1 shows the distribution of SRMSR and RMSEA values across groupings for each latent construct, estimated using the unidimen-

Significance and meaningfulness of item-level differential item functioning
Given the large number of results generated and the limited manuscript space, below we focus on the main takeaways of our findings, summarizing alignment results with respect to specific assessments of DIF in the CBCL/1.5-5. Detailed results of item parameter estimates, significance, effect sizes (η²) of DIF, and UIDS statistics across all groupings and constructs can be found in the OSF repository.
Across the child and caregiver groupings we tested, we observed significant DIF in most of the items across groupings. Only a small set of items showed no DIF across child characteristics tested (see Table 2), while all items showed DIF across caregiver characteristics tested. Because this number of items was so small and not representative of the breadth of the full CBCL/1.5-5, we instead sought to identify which items had the most DIF and, when removed, would yield a robust item set with scores comparable across diverse samples.
To better understand the meaningfulness of DIF in the CBCL/1.5-5, Figure 2 presents items and groupings in which DIF was found to be significant and with UIDS greater than 0.  (03, 09, 15, 16, 20, 22, 29, 33, 36, 37, 54, 64, 76, 81, 96, 97) showing UIDS >0.1 on at least three out of nine univariate groupings tested. Taken together, the impact of DIF was higher for Externalizing than Internalizing and was the highest for Total Problems. Distributions of median UIDS of all items can be found in Figure S1. (n item = 14), and Total Problems (n item = 83). Reliabilities for the full and the robust items were all near or above 0.9 and in nearly identical ranges for Internalizing and Total Problems (see Figure 3 with the solid and dotted lines closely overlapping).

TABLE 3 CBCL/1.5-5 robust item sets with less differential item functioning for Internalizing, Externalizing, and Total Problems.
To understand the impact of different groupings, Figure 4 4) showed main effects of the groupings were associated with larger magnitudes of DIF than the interactions.
Notably, child age and NDD status were associated with larger DIF than child sex or their interactions with child sex.One outlier item (Item 76, speech problem) showed the largest magnitudes of DIF between those with and without NDD.Household income level showed larger magnitudes of DIF effect sizes than caregiver education level, especially for Externalizing Problems.For differential test functioning, respondent sex showed the largest STDS on the measurement of Total Problems, followed by caregiver education levels on Total Problems (see Figure 5).
Sensitivity analyses with child race groupings showed impactful DIF across race categories on multiple items on each latent construct, with impactful DIF aggregated on similar items as identified above.
While child race, caregiver education, and income all showed unique associations with DIF, child race was associated with larger magnitudes of DIF than the SES variables after adjusting for each other and the interaction, indicating that measurement bias across racial groups cannot be fully accounted for by SES variables (see Table S2). Similarly, the multivariate analyses with language version revealed that language version was associated with more DIF than child bilingual status but with less DIF than caregiver education level (see Table S3 for effect sizes, Figure S2 for model fit, Table S6 for items without significant DIF, and Figure S5 for items with impactful DIF).

T scores based on the robust item sets and tests of group differences
Correlations between linked scores (i.e., linking summed scores for the robust item sets to the metric of the full CBCL) and the observed summed scores of the full item sets were 0.963 for Internalizing, 0.965 for Externalizing, and 0.989 for Total Problems. The raw mean differences between linked and observed scores, an indicator of linking bias, were −0.004, 0.008, and 0.007 for Total Problems, Internalizing, and Externalizing, respectively, indicating very minimal bias (see Table S4).
Group means and standard deviations of T scores based on robust item sets, and ANOVA tests of mean differences in each grouping, are presented in Table 5 (group comparisons of the full item set are shown in Table S5). We observed the largest group differences in T scores based on robust item sets across Caregiver Education levels (η² ≥ 0.023) on all three latent constructs, followed by income (η² ≥ 0.020) and Caregiver Depression (η² ≥ 0.018), with the smallest differences in Internalizing and the largest differences in Total Problems.
For child characteristics, Child NDD status (η² ≥ 0.021) showed small to medium effects, with the smallest differences in Externalizing.

DISCUSSION
The CBCL/1.5-5 has been widely applied in clinical practice and research, but little information is available on its measurement equivalence across subgroups defined by various child and caregiver characteristics. We conducted the largest study of MI/DIF of the CBCL/1.5-5 (N = 9087) to date and identified multiple sources of measurement bias across child and caregiver characteristics. Our findings directly inform the use of the CBCL/1.5-5 and have implications for the measurement of childhood behavior problems more broadly.
Results of the factor analyses confirmed the unidimensional structure of Internalizing and Externalizing broad domains and the bifactor structure of Total Problems across child and caregiver groupings, demonstrating that both symptom clusters and the general psychopathology ("p") factor account for distinct sources of variance in ratings of child psychopathology. These findings are consistent with prior conceptualizations of the CBCL/1.5-5 (Achenbach & Rescorla, 2001) and previous studies in different samples (Achenbach et al., 2016), providing further evidence for the utility of broad-domain constructs.
Item-level analyses identified multiple sources of DIF, raising concerns about the ubiquitous application of the CBCL/1.5-5 in diverse samples without adjusting for measurement bias. Caregiver demographic variables were associated with a larger magnitude of measurement bias than child characteristics across all three constructs as indicated by UIDS and STDS, with the greatest DIF/Differential Test Functioning (DTF) arising from language version, caregiver education, and respondent sex. Among them, language version showed the largest impact on MI/DIF across all constructs. It is possible that both the information loss due to translation and differences in cultural expectations/interpretations of child behavior problems contribute to the noninvariance of the English and Spanish versions of the CBCL/1.5-5. For example, as discussed by Gross et al. (2006), Item 97 ("whining") is translated as "queja" in the Spanish version, which means "complain" in English and may have different connotations for parents. This underscores the need to consider validation efforts separately from simple translation of measurement tools in order to ensure equivalence across language translations.
We further observed an impact of respondent sex and caregiver education on DIF.It is possible that parents' ethnotheories of desirable and maladaptive behaviors in children (Olson et al., 2019), Nevertheless, we found small-to-moderate differences in Internalizing, Externalizing, and Total Problems across caregiver depression groups using T-scores derived from robust item sets, indicating that observed differences across caregiver depression groups were not attributable to item bias (Chi & Hinshaw, 2002;Gartstein et al., 2009).
Group comparison of T scores using robust item sets showed that differences between groups of language version and respondent sex were close to zero, while the original scores of Internalizing differed significantly across language version and respondent sex (see Table S5). These findings further underscore the need to assess and account for measurement bias resulting from informant characteristics when interpreting results from caregiver-report measures.
For example, father-mother discrepancies in perceived levels of child behavior problems have been reported by many previous studies but were not significant in our sample after accounting for measurement bias. This is likely because the observed discrepancies are at least partially due to differences in how caregivers report specific items.
Thus, while DIF is not always problematic and is, in fact, often informative for measure development and application by showing group-specific patterns of item responding, accounting for DIF across these factors is necessary for teasing apart measurement bias from true differences in the construct of interest (e.g., behavior problems) between groups.
Child age showed a larger impact on DIF than child sex, indicating the importance of developmental considerations when measuring psychopathology even within the relatively narrow age range of early childhood (18-71 months) (Wakschlag et al., 2010).These findings suggest that some child behavior problems might not be informative in caregiver-report of psychopathology for children of different ages.
Therefore, normative ranges of child behaviors should be clarified by research, and care should be taken when administering identical item sets across age groups.Moreover, our sensitivity analyses showed that the DIF associated with child race cannot be fully accounted for by SES variables (caregiver education and income), indicating a need for further investigation of potential measurement bias associated with race (e.g., race-related cultural differences, trauma experiences).
In terms of measuring Internalizing problems, items with the most measurement bias were from the Anxious/Depressed and Emotionally Reactive subscales.Studies on anxiety and depression in young children have repeatedly emphasized the challenge of generalizing adult diagnostic criteria to young children (Carter et al., 2004;Luby et al., 2002;Tandon et al., 2009), and relying heavily on caregivers to  stubborn).Taken together, many items with impactful DIF lack clear descriptions of observable behaviors, which might be more susceptible to bias driven by child and caregiver characteristics.Thus, measurement of early behavior problems may be enhanced by carefully operationalizing specific behaviors relevant to constructs under investigation to reduce the risk of bias (Merrell, 2001).
Implications for using the CBCL/1.5-5
Given the large amount of DIF identified in our analyses of CBCL/1.5-5 items, it is important to consider the effects of measurement bias when using the CBCL/1.5-5. Practically, DIF could be handled in two ways: (a) by applying group-specific parameters for score calculation, or (b) by removing items with impactful DIF. Given the technical requirements to apply group-specific algorithms, we recommend that CBCL/1.5-5 consumers follow the guideline below to administer the robust item sets with reduced DIF: (1) administer robust item sets listed in Table 3; (2) calculate summed scores for the desired domain; (3) use the crosswalk table in the OSF repository for this study to convert robust item set sum scores to approximate full item set sum scores; (4) convert sum scores to T scores using CBCL score conversion tables. Researchers and clinicians can then apply the recommended cut-offs based on the CBCL manual: 60-63 as borderline range, and above 63 as the clinical range. Changes in construct validity notwithstanding, the resulting T scores are expected to be comparable to the T score from the original CBCL but with fewer biases across groups that differ on the child and caregiver characteristics detailed above.
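The four-step guideline can be sketched as a short scoring routine. The crosswalk and T score tables below are made-up placeholders used only to make the example runnable; they are not values from the OSF repository or the CBCL manual, and the function names and the "below borderline" label are hypothetical.

```python
from typing import Dict, Sequence


def sum_score(item_scores: Sequence[int]) -> int:
    """Step 2: sum the 0/1/2 ratings of the robust items for one domain."""
    for rating in item_scores:
        if rating not in (0, 1, 2):
            raise ValueError("CBCL/1.5-5 items are rated 0, 1, or 2")
    return sum(item_scores)


def robust_t_score(item_scores: Sequence[int],
                   crosswalk: Dict[int, int],
                   t_table: Dict[int, int]) -> int:
    """Steps 3 and 4: link to the full-set metric, then look up the T score."""
    linked_sum = crosswalk[sum_score(item_scores)]
    return t_table[linked_sum]


def classify_t_score(t_score: float) -> str:
    """Apply the cut-offs quoted in the text: 60-63 borderline, above 63 clinical."""
    if t_score > 63:
        return "clinical"
    if t_score >= 60:
        return "borderline"
    return "below borderline"


# Placeholder tables for illustration only (NOT real crosswalk or norm values).
example_crosswalk = {0: 0, 1: 2, 2: 3, 3: 5}
example_t_table = {0: 29, 2: 41, 3: 45, 5: 51}

t = robust_t_score([1, 0, 1], example_crosswalk, example_t_table)
label = classify_t_score(t)
```

In real use, the two dictionaries would be populated from the study's crosswalk table and the CBCL score conversion tables for the relevant domain.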

Limitations and strengths
We only highlighted the main takeaways of our study and did not present all results in full detail in the manuscript. Decisions on how to winnow these results were made with attention to presentability while maintaining statistical rigor and transparency. First, although different decisions, such as the threshold of "impactful" DIF, could have been made, we believe the core conclusions would remain the same. We share item-level parameter estimates and standard errors via the OSF repository for interested readers to conduct their own queries. Second, we chose to focus on the broad domains of Internalizing, Externalizing, and Total Problems. Thus, findings should not be generalized to the syndrome scales or researcher-defined item subsets (MI/DIF analyses for syndrome scales are included in Appendix S2, given the emphasis here on broadband domains). Additionally, the current analysis could not distinguish between uniform DIF (i.e., DIF affecting all items in the same way) and differences in latent mean and variance. Future work including multiple raters can help clarify this distinction. Finally, the current analysis did not examine item discrimination and severity based on the criteria of clinical concerns due to the lack of such information, so robust item sets are not optimized for identifying clinical-range problems.

• Robust item sets with less DIF can reliably capture Internalizing, Externalizing, and Total Problems with less measurement bias.

Demographic information of the analytical sample of caregiver-child dyads in Environmental influences on Child Health Outcomes (ECHO) (N = 9087).
in R 4.1.0 (R Core Team, 2022).

Configural invariance testing

Establishing configural invariance (similar number of factors and loading pattern) is the first step in MI/DIF examination. For Internalizing and Externalizing, we estimated a unidimensional model in each subsample. For Total Problems, we estimated both a unidimensional model and a bifactor model (i.e., Internalizing and Externalizing items loaded on two separate factors and all items loaded on the general factor representing Total Problems, with all factors mutually orthogonal). Model fit was evaluated using the standardized root mean squared residual (SRMSR) and root mean squared error of approximation (RMSEA), given their demonstrated performance advantage in IRT models (Maydeu-Olivares & Joe, 2014) and the general incomparability of other estimable fit measures (e.g., comparative fit index [CFI], Tucker-Lewis index [TLI] …

… unsigned item difference in the sample (UIDS; Meade, 2010) to quantify the impact of DIF on CBCL items and construct scores (i.e., Internalizing, Externalizing, and Total Problems). UIDS is calculated by comparing expected item scores using model parameters estimated from different groups; its magnitude reflects the impact of differences in item parameter estimates on scores and can be interpreted on the scale of raw item scores. Using a threshold of UIDS > 0.1, representing one-tenth of a point of score difference on the item, we identified items with the most impactful DIF for each construct, operationalized as those with UIDS > 0.1 across more than one univariate grouping for Internalizing and Externalizing, and more than two for Total Problems. Removing these items thus yielded a more measurement-invariant (i.e., robust) item set for each construct. Moreover, to assess the aggregated impact of item-level DIF, the Signed Test Difference in the Sample (STDS) was calculated to represent differential test functioning. See Appendix S1 for additional details on UIDS and STDS calculation.

The effect of removing these items on reliability was assessed by plotting IRT reliability of measurement, calculated as one minus the reciprocal of test information, as a function of the latent trait, comparing values from the full item set to the robust item set.
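To make the effect-size and reliability calculations concrete, the following is a minimal sketch, not the authors' code. It assumes a graded response model for items scored 0 to 2, takes UIDS as the mean absolute difference in expected item scores evaluated at the focal group's trait values, takes STDS as the sum of the signed item differences, and assumes a latent variance of 1 for reliability. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def cumulative_probs(theta, a, b):
    """P(X >= k | theta) for k = 1..K under a graded response model (assumed)."""
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def expected_score(theta, a, b):
    """Expected item score, i.e. the sum over k of P(X >= k | theta)."""
    return cumulative_probs(theta, a, b).sum(axis=1)

def item_information(theta, a, b):
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    star = cumulative_probs(theta, a, b)
    n = star.shape[0]
    # Pad with P(X >= 0) = 1 and P(X >= K + 1) = 0.
    full = np.hstack([np.ones((n, 1)), star, np.zeros((n, 1))])
    cat = full[:, :-1] - full[:, 1:]            # category probabilities
    slope = full * (1.0 - full)                 # d(cumulative)/d(theta), divided by a
    dcat = a * (slope[:, :-1] - slope[:, 1:])   # d(category prob)/d(theta)
    return (dcat ** 2 / cat).sum(axis=1)

def sids_uids(theta_focal, ref, foc):
    """Signed and unsigned item differences, averaged over focal-group thetas."""
    diff = expected_score(theta_focal, *foc) - expected_score(theta_focal, *ref)
    return diff.mean(), np.abs(diff).mean()

# Hypothetical (a, [b1, b2]) parameters for two items in each group.
rng = np.random.default_rng(0)
theta_focal = rng.normal(size=500)
reference = [(1.5, [0.0, 1.5]), (1.2, [0.5, 2.0])]
focal = [(1.5, [0.4, 1.9]), (1.2, [0.5, 2.0])]  # item 1 thresholds shifted, item 2 identical

pairs = [sids_uids(theta_focal, r, f) for r, f in zip(reference, focal)]
sids = np.array([p[0] for p in pairs])
uids = np.array([p[1] for p in pairs])
stds = sids.sum()        # aggregated, test-level signed difference
flagged = uids > 0.1     # threshold used in the text

# Reliability as one minus the reciprocal of test information, over a theta grid.
grid = np.linspace(-3, 3, 61)
test_info = sum(item_information(grid, a, b) for a, b in reference)
reliability = 1.0 - 1.0 / test_info
```

Because UIDS is on the raw item-score scale, the 0.1 threshold corresponds directly to one-tenth of a score point. With only two toy items, test information is low, so the resulting reliability values are not representative of a full scale.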

Items with Unsigned Item Difference in the Sample (UIDS) > 0.1 in Univariate Groupings across Latent Constructs. Ext, Externalizing domain; Ext + Tot, Externalizing and Total Problems; Int, Internalizing domain; Int + Tot, Internalizing and Total Problems; Tot, Total Problems; UIDS, Unsigned Item Difference in the Sample; RespSex, Respondent Sex; NDD, Neurodevelopmental Disorder. Items with no significant differential item functioning (DIF) or no UIDS > 0.1 are excluded from the figure. Items are grouped by domain and sorted in decreasing order of the number of groupings in which they show UIDS > 0.1; groupings are sorted by the number of items with UIDS > 0.1 within the grouping. The numbers in the two right-hand columns show the number of groupings with UIDS > 0.1 for each item within domains (Internalizing or Externalizing, and Total Problems).

presents the distribution of UIDS by univariate groupings and domains for items with significant DIF. Notably, the impactful DIF across items concentrated on language version, caregiver education level, caregiver sex, and child age. The largest sources of DIF for caregiver

FIGURE 3 Item response theory (IRT) Reliability for Full and Robust Item Sets. Full, full CBCL item set; Robust, item set with high-DIF items removed.

FIGURE 4 Distribution of Unsigned Item Difference in the Sample (UIDS) by Grouping and Latent Constructs. NDD, Any Neurodevelopmental Disorder; CareDep, Caregiver Depression with a cutoff of T-score ≥ 60 for clinical range; RespSex, Respondent Sex.

Median and maximum effect sizes (eta-squared) of main and interaction effects on differential item functioning (DIF).
their expectations about child development, influence how parents perceive and rate their child's behavior. Hence, in line with recommendations to consider how caregiver informants' background might influence the assessment of child psychopathology (De Los Reyes & Kazdin, 2005), these findings provide empirical evidence of the potential impact of caregiver characteristics on the measurement of specific child behavior problems. Notably, we observed little impactful DIF related to caregiver depression, despite the depression-distortion hypothesis suggesting that depressed caregivers perceive more problems in their children (De Los Reyes & Kazdin, 2005). Since only a small proportion of caregivers (3.4%) met the clinical cutoff of T-score ≥ 60, our analyses might have lacked the power to detect possible bias across caregiver depression status.
Group comparisons of T-scores of robust item set across child and caregiver characteristics across domains.
infer the internal states (e.g., feelings) of young children. As for Externalizing, almost half of the items showed impactful DIF (10 out of 24), with three out of five items on the Attention Problems scale showing high DIF. Removing a large number of high-DIF items to construct the robust item set might change the construct validity of the Externalizing scale. Externalizing items with the most DIF are adjectives that may imply character judgment (e.g., defiant,