Computerized adaptive testing and short form development for child and adolescent oral health patient‐reported outcomes measurement

Abstract
Objectives: To develop computerized adaptive testing (CAT) and short forms of self‐report oral health measures that are predictive of both the children's oral health status index (COHSI) and the children's oral health referral recommendation (COHRR) scales, for children and adolescents, ages 8–17.
Material and methods: Using final item calibration parameters (discrimination and difficulty parameters) from the item response theory analysis, we performed post hoc CAT simulation. Items most frequently administered in the simulation were considered for inclusion in the final oral health assessment toolkits, and the eight best-performing items were selected for each of the COHSI and COHRR.
Results: Two previously identified unidimensional sets of self‐report items, consisting of 19 items for the COHSI and 22 items for the COHRR, were administered through CAT, resulting in eight‐item short forms for both the COHSI and COHRR. Correlations between the simulated CAT scores and the full item bank scores representing the latent trait were r = .94 for COHSI and r = .96 for COHRR, demonstrating high reliability of the CAT and short forms.
Conclusions: Using established rigorous measurement development standards, the CAT and corresponding eight‐item short forms for COHSI and COHRR were developed to assess the oral health status of children and adolescents, ages 8–17. These measures demonstrated good psychometric properties and can have clinical utility in oral health screening and evaluation and clinical referral recommendations.


| INTRODUCTION
The importance of maintaining oral health has been noted in Healthy People 2020 (Office of Disease Prevention and Health Promotion, 2011). Because of poor diets and oral hygiene, many children and adolescents are especially vulnerable to dental caries and other oral health problems. To address these issues, patient-reported measures can be an essential part of patient-centered care (Forrest et al., 2014; Perlin, Kolodner, & Roswell, 2004; Snyder, Jensen, Segal, & Wu, 2013). Our previous research developed self-report items to assess oral health status for children and adolescents (Liu et al., 2016; Maida et al., 2015). Based on an item bank and various statistical approaches, short forms have recently been developed for children and adolescents aged 8 to 17, using the framework and methodology of the Patient Reported Outcomes Measurement Information System (PROMIS®) (Marcus et al., 2018; Wang et al., 2018).
Computerized adaptive testing (CAT), based on item parameters estimated from item response theory (IRT), enables more accurate estimation of the underlying concepts being measured while minimizing response burden (Cella et al., 2007). One advantage of CAT is that items are selected from a database (item bank) based on the respondent's previous answers, using a preset computerized algorithm derived from item information functions (Weiss & Kingsbury, 1984). As a result, each assessment is individualized to the respondent, based on the patient's symptom level at the time of answering the survey. In addition, CAT algorithms allow the same respondent to answer different items over time, depending on how symptoms change, while still maintaining comparability of that patient's scores across time points.
Compared with a fixed short form, CAT can achieve a higher level of measurement precision with few items (Lai et al., 2011). This paper presents results of a CAT simulation and derives short forms for two existing oral health measures for children and adolescents, ages 8-17, which are predictive of both the children's oral health status index (COHSI) and the children's oral health referral recommendation (COHRR) scale. We also compare the performance of the CAT and the generated short forms with that of the full-length scales.

| Procedures
The study procedures have been reported elsewhere and are briefly summarized below. During field testing from August 2015 to October 2016, all children had dental examinations onsite at dental clinics in Los Angeles County. Data were obtained from dental examinations, and survey questions were answered by children (age 8-17) and their parents or guardians during field testing. Clinical examination results were used to obtain the COHSI, which estimates children's overall oral health status (decayed, missing, and filled teeth) and occlusal status (Koch, Gershen, & Marcus, 1985). Samejima's graded response model was used to estimate item parameters (discrimination and difficulty) for the 19-item COHSI scale and the 22-item COHRR scale.
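As an illustration of the model named above, the following minimal Python sketch computes response category probabilities under Samejima's graded response model with a logistic link. The function name and the item parameters are hypothetical and chosen only for demonstration; they are not the calibrated COHSI or COHRR parameters.

```python
import math

def grm_category_probs(theta, a, b):
    """Category probabilities under Samejima's graded response model.

    theta: latent trait value
    a:     item discrimination
    b:     strictly increasing thresholds (difficulties), one per boundary
           between adjacent response categories
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (lowest) and 0 (beyond highest).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

# Hypothetical four-category item (illustrative values only).
probs = grm_category_probs(theta=0.0, a=1.8, b=[-1.5, -0.5, 0.7])
print([round(p, 3) for p in probs])
```

The discrimination parameter controls how sharply the cumulative curves rise, and the thresholds locate where on the θ scale each step between adjacent categories becomes more likely than not.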

| Short-form item selection
Eighty-eight items related to oral health status were included in the questionnaire used to create the 19-item COHSI and 22-item COHRR scales. Liu et al. selected short forms based on item parameters for the 19-item COHSI and 22-item COHRR full-length scales and incorporated input from content experts. Items with higher discrimination and a wider range of difficulties were selected. In this paper, post hoc CAT simulation was used to identify the most frequently administered items for possible inclusion in the short forms.
Intraclass correlations between estimated scores of short form and full-length scales were used to assess the extent to which the short forms capture the information in the full-length scales and compare measurement reliability (information).

| Computerized adaptive testing simulations
Because the expected information varies with the distribution of the data (Choi, 2009; Choi, Reise, Pilkonis, Hays, & Cella, 2010), two normal distributions were evaluated: the standard normal distribution, N(0, 1), and a normal distribution with a mean of 0.0 and a standard deviation of 1.5, N(0, 1.5). Items were ranked based on four criteria: the percentage of time selected in CAT simulations, the discrimination parameter of each item, the expected information under N(0, 1), and the expected information under N(0, 1.5).
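The expected-information criteria can be sketched as follows. This is a minimal illustration that assumes a graded response model and approximates the expectation by a weighted sum over a grid of θ values; it is not the Firestar implementation, and the function names, grid settings, and item parameters are hypothetical.

```python
import math

def grm_item_information(theta, a, b):
    """Fisher information of one graded response model item at theta."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    info = 0.0
    for k in range(len(b) + 1):
        p = cum[k] - cum[k + 1]  # category probability
        # Derivative of the category probability with respect to theta.
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        info += dp * dp / p
    return info

def expected_information(a, b, sd, lo=-6.0, hi=6.0, n=241):
    """Item information averaged over a normal density with mean 0 and the given SD."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    weights = [math.exp(-0.5 * (t / sd) ** 2) for t in grid]
    total = sum(weights)  # normalizing makes the weights a discrete density
    return sum(w * grm_item_information(t, a, b) for t, w in zip(grid, weights)) / total

# Hypothetical item (illustrative values only).
a, b = 1.8, [-1.5, -0.5, 0.7]
print(round(expected_information(a, b, sd=1.0), 3))
print(round(expected_information(a, b, sd=1.5), 3))
```

Under the wider N(0, 1.5) distribution, items whose thresholds lie far from the mean receive relatively more weight, which is why the two distributions can rank items differently.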
Simulations involved a series of steps (Yu et al., 2012). First, latent trait scores (θ) for the COHSI and COHRR items were estimated using maximum likelihood estimation. Second, the θ score for each respondent was adaptively estimated based on item responses from the calibration sample. Third, the CAT θ estimates were compared with the long-form θ estimates as a function of the number of items administered in the CAT. Finally, adaptive test lengths were chosen to yield the greatest similarity between the CAT θ estimates and the long-form θ estimates with a minimum number of CAT items.
We used the computer program Firestar (version 1.2.2) for CAT simulation (Choi, 2009). The best eight items across the four criteria above were selected based on the literature on the optimal length of short forms (Reise & Henson, 2000). The first item administered in the CAT was chosen to provide maximum information at the mean of the population distribution of the latent oral health scale (θ). Items were selected based on Maximum Posterior Weighted Information, which has been shown to perform best among item selection methods (Choi, 2009). In our CAT simulation, Firestar generated "virtual" respondents with predefined oral health scores, equally distributed on the latent oral health measurement continuum, from worst to best oral health (Lai et al., 2011). Each "virtual" respondent first completed the item with the largest expected information under the prior distribution; an initial oral health score was estimated; the item with the largest information given that estimate was selected as the next item; and the oral health score was then re-estimated based on the respondent's item responses (Lai et al., 2011). This iteration continued until the stopping rule was reached: either the standard error of measurement fell below 0.3 or eight items had been administered.
We used the default PROMIS CAT settings with a minimum of four items. Finally, the simulated oral health scores obtained from CAT were compared with scores based on completion of all oral health items.
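The simulation loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the Firestar code: it assumes a graded response model, a standard normal prior, expected a posteriori (EAP) scoring on a fixed θ grid, and the stopping rule stated above (standard error below 0.3 after at least four items, or eight items administered). The item bank, function names, and random seed are hypothetical.

```python
import math
import random

GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points for theta

def cumulative(theta, a, b):
    return [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]

def probs(theta, a, b):
    cum = cumulative(theta, a, b)
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def information(theta, a, b):
    cum = cumulative(theta, a, b)
    total = 0.0
    for k in range(len(b) + 1):
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        total += dp * dp / (cum[k] - cum[k + 1])
    return total

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def simulate_cat(true_theta, bank, rng, min_items=4, max_items=8, se_stop=0.3):
    """Run one simulated CAT; returns (theta estimate, standard error, items used)."""
    posterior = normalize([math.exp(-0.5 * t * t) for t in GRID])  # N(0, 1) prior
    used = []
    while True:
        # Maximum posterior weighted information among items not yet administered.
        candidates = [j for j in range(len(bank)) if j not in used]
        j = max(candidates, key=lambda i: sum(
            w * information(t, *bank[i]) for t, w in zip(GRID, posterior)))
        used.append(j)
        # Draw the virtual respondent's answer from the model at the true theta.
        p = probs(true_theta, *bank[j])
        x = rng.choices(range(len(p)), weights=p)[0]
        # Bayesian update, then EAP estimate and posterior SD as the standard error.
        posterior = normalize(
            [w * probs(t, *bank[j])[x] for t, w in zip(GRID, posterior)])
        est = sum(t * w for t, w in zip(GRID, posterior))
        se = math.sqrt(sum((t - est) ** 2 * w for t, w in zip(GRID, posterior)))
        if (len(used) >= min_items and se < se_stop) or len(used) >= max_items:
            return est, se, used

# Hypothetical 10-item bank of (discrimination, thresholds) pairs.
bank = [(1.0 + 0.15 * j, [-2.0 + 0.2 * j, -1.0 + 0.2 * j, 0.2 * j]) for j in range(10)]
est, se, used = simulate_cat(true_theta=-1.0, bank=bank, rng=random.Random(0))
print(round(est, 2), round(se, 2), used)
```

Because the first selection uses the prior as the weighting distribution, every virtual respondent starts with the same item. Later selections diverge as the posterior concentrates around each respondent's score.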

| Sample characteristics
The study recruited 334 individuals from 12 dental clinics in Greater Los Angeles (Table 1; Liu et al., 2018). The sample was 48% female and had a mean age of 12 years (SD = 3); 42% of the sample was Hispanic or Latino, 21% was White, 13% was Asian, and 8% was African American. The overall mean COHSI was 89 (SD = 9); 52% were referred to continue routine care, 16% needed to see a dentist at their earliest convenience, 25% needed to see a dentist within the next 2 weeks, and 7% needed care immediately.

| Item response theory parameter estimates and scoring
IRT models were fit to the COHSI and COHRR items, and the parameter estimates (difficulty and discrimination) were obtained. The set of items in the long form serves as the foundation for the development of the fixed-format short form and the CAT.
The IRT θ score measures the latent trait, where a higher θ score indicates better COHSI and COHRR. θ scores across the items ranged from −2.5 to 2.1, with a mean of 0.0021 ± 1.8 and a median of −0.034.

| Computerized adaptive testing simulations
We used CAT simulations of all items in the COHSI and COHRR item banks to estimate θ scores for each respondent. We then computed the correlation of each CAT score with the final calibration score based on the full-length COHSI and COHRR scales and plotted the correlations as a function of the number of items administered (Figures 1 and 2). The eight-item CATs for COHSI and COHRR provided score correlations of .94 and .96 with the full-length COHSI and COHRR scales, respectively. These high correlations show that CAT can produce comparable score estimates with a limited number of items.
TABLE 1 Characteristics of the children and adolescents in the field test (reprinted with permission from Table 1 of "Short form development for oral health patient-reported outcome evaluation in children and adolescents")
Children's oral health status index, M (SD): 89 (9)

The CAT simulation results show that some items provide more information about study participants and are therefore more valuable than others. For example, for COHSI, Items 4, 5, and 9 provide so much information that they are always used, regardless of the simulated participant's oral health level (Reise & Henson, 2000). On the other hand, Items 10 and 14-19 provide little information; as a result, they are never administered, even when the simulated participant's oral health level is almost equal to the item's difficulty threshold. Such items are good candidates for exclusion from the short form.
We ranked all COHSI and COHRR items according to these evaluation criteria. Tables 2 and 3 show the ranking results for the COHSI and COHRR item banks.
The items that were selected for the non-CAT-based short form are bolded in Table 2 for COHSI and Table 3 for COHRR.
These items were selected based on higher slopes, wider ranges of threshold parameters, representation of oral health related domains, and opinions from content experts. There are eight items in the COHSI short form and seven items in the COHRR short form.
For the COHSI item bank, the top eight items according to the CAT simulation results (i.e., the last three columns of …
TABLE 3 Short form item selection order of patient-reported outcomes measurement information system oral health referral item bank
Notes to Tables 2 and 3:
b Ranks based on discrimination parameter (how well the item discriminates between respondents with low or high symptom levels).
c Ranks based on number of times that each item was selected in CAT simulations and discrimination parameter.
d Ranks based on expected information that each item has under the normal distribution with a mean of 0 and standard deviation of 1 and discrimination parameter.
e Ranks based on expected information that each item has under the distribution with a mean of 0 and standard deviation of 1.5 and discrimination parameter.
FIGURE 5 Test information curve for children's oral health status index (long form vs. CAT vs. short form). The test information of 3.3 derived from item response theory on the left-side y axis is roughly equivalent to the reliability of .70 derived from classical test theory on the right-side y axis. Therefore, the portions of the curves above the horizontal line (test information of 3.3, reliability of .70) indicate the section of the theta scale with reliability of .70 or above. CAT, computerized adaptive testing
FIGURE 6 Test information curve for children's oral health referral recommendation (long form vs. CAT vs. short form). The test information of 3.3 derived from item response theory on the left-side y axis is roughly equivalent to the reliability of .70 derived from classical test theory on the right-side y axis. Therefore, the portions of the curves above the horizontal line (test information of 3.3, reliability of .70) indicate the section of the theta scale with reliability of .70 or above. CAT, computerized adaptive testing
Using Figures 5 and 6, the ranges of reliable scores (i.e., scores with an expected reliability ≥ .70 or test information ≥ 3.3) for COHSI were as follows: the full bank provided reliable scores in the range (−3.0, 1.2); the eight-item CAT provided reliable scores in the range (−3.0, 0); and the non-CAT-based short form provided reliable scores in the range (−3.0, −0.8), which is somewhat more constricted than that of the eight-item CAT. For COHRR, the full bank provided reliable scores in the range (−3.0, 0.6); the eight-item CAT provided reliable scores in the range (−3.0, 0.1); and the non-CAT-based short form provided reliable scores in the range (−3.0, −0.5), which is again somewhat more constricted than that of the eight-item CAT.
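The correspondence between test information of 3.3 and reliability of .70 follows from a standard approximation: on a θ metric with unit variance, the standard error is 1/√information and reliability is roughly 1 minus the squared standard error. The short Python sketch below illustrates that arithmetic; the function names are hypothetical, and the unit-variance assumption is ours.

```python
import math

def standard_error_from_information(info):
    """Standard error of the theta estimate implied by test information."""
    return 1.0 / math.sqrt(info)

def reliability_from_information(info):
    """Approximate reliability, assuming theta has unit variance."""
    return 1.0 - 1.0 / info

def information_from_standard_error(se):
    """Test information implied by a given standard error."""
    return 1.0 / (se * se)

# The threshold used in the test information figures.
print(round(reliability_from_information(3.3), 2))
# The CAT stopping rule (standard error below 0.3) expressed as information and reliability.
info_at_stop = information_from_standard_error(0.3)
print(round(info_at_stop, 1), round(reliability_from_information(info_at_stop), 2))
```

Under the same approximation, the stopping rule of a standard error below 0.3 corresponds to a reliability of about .91, a stricter target than the .70 line drawn in the figures.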
These results demonstrate that the eight-item COHSI CAT and eight-item COHRR CAT are generally more precise than the corresponding short forms.

| DISCUSSION
Several key limitations should be noted. First, for IRT analysis, a sample size of 500 or more is recommended for accurate estimation of the latent variable and stable calibration of item parameters (Reeve & Fayers, 2005). The sample size in this study is relatively small. A larger sample could provide more stable estimates of the IRT parameters and more precise estimates of item parameters at the high end of the oral health measurement continuum. Second, both versions of the measures lack precision for children with average or greater levels of oral health. Thus, the scales may be useful to screen for oral health problems, but they may lack sensitivity as outcome measures for children with improving oral health. Third, the cohort has generally good oral health and is predominantly Hispanic; as a result, the generalizability of the findings to the general population may be limited and may therefore require additional questions. In addition, CAT simulations were performed on the same data set that was used for IRT calibration, which may also limit generalizability. In our CAT simulation, we used Firestar to generate "virtual" respondents with predefined oral health scores, equally distributed on the latent oral health measurement continuum, which attenuated this problem. We plan to test the measures on a prospective sample in the future. Fourth, the validity of the CAT and the eight-item short forms needs to be investigated further because they do not inherently possess the same psychometric characteristics as the long form (Smith, McCarthy, & Anderson, 2000).

| CONCLUSIONS
Using established rigorous measurement development standards, the CAT and corresponding eight-item short form for oral health measures and referrals were developed for children and adolescents, ages 8-17.
These measures demonstrated good psychometric properties and can have clinical utility in oral health screening and evaluation and clinical referral recommendations. This study contributes to ongoing efforts to create short but efficient oral health assessment toolkits.
Further validation of these IRT-based CAT and short form measures in an independent sample of children in clinical populations is essential for them to play a pivotal role in dental clinical decision making.

FUNDING INFORMATION
This research was supported by a National Institute of Dental and Craniofacial Research grant to the University of California, Los Angeles [U01DE022648].