Measuring mentalizing: A comparison of scoring methods for the hinting task

Abstract
Objective: The Social Cognition Psychometric Evaluation (SCOPE) study supported the utility and practicality of the Hinting task as a measure of social cognition/mentalizing in clinical trials, specifically with the SCOPE authors' stringent scoring system. However, it remains unclear whether the SCOPE scoring system is necessary for the task to be judged as psychometrically sound.
Method: Independent raters rescored data from the three phases of SCOPE using the Hinting task's original scoring criteria. Psychometric properties of the task when scored with the original criteria versus the more stringent SCOPE criteria were compared in a large sample of individuals with chronic schizophrenia (n = 397) and matched controls (n = 300) as well as a smaller sample of individuals with early psychosis (n = 38) and controls (n = 39).
Results: In both samples, SCOPE criteria resulted in lowered average scores and reduced ceiling effects. Further, revised scoring resulted in strengthened relationships between the Hinting task and outcome measures in the chronic sample, and better differentiated early psychosis patients from controls. Conversely, test-retest reliability and internal consistency estimates were not improved using revised scoring and remained suboptimal, particularly for healthy controls.
Conclusion: Overall, SCOPE scoring criteria improved some psychometric properties and clinical utility, suggesting that these criteria should be considered for implementation.

The Hinting Task (Corcoran, Mercer, & Frith, 1995) is one of the most widely used assessments for measuring mentalizing abilities in patients with schizophrenia, and has been administered to individuals diagnosed with autism spectrum disorders (Morrison et al., 2019), Parkinson's Disease (Kosutzka et al., 2019), and moderate-to-severe traumatic brain injury (Tousignant et al., 2018). Although this task was designed to measure deficits in clinical populations via 10 vignettes assessing an individual's ability to infer intent from indirect speech, it has been criticized for its poor psychometric properties (Davidson, Lesser, Parente, & Fiszdon, 2018; Mallawaarachchi, Cotton, Anderson, Killackey, & Allott, 2019). Specifically, this task has demonstrated ceiling effects in both patients with schizophrenia spectrum disorders (Lindgren et al., 2018; Marjoram et al., 2006; Roberts & Penn, 2009; Versmissen et al., 2008) and healthy controls (Corcoran & Frith, 2003; Corcoran & Frith, 2005), indicating that this measure may underestimate or inaccurately reflect true mentalizing abilities in clinical and nonclinical samples. In one psychosocial treatment study, Roberts and Penn (2009) found that over half of participants (57%) scored at normative levels on the task at baseline (a score of 17 or above out of 20), potentially limiting the ability to observe improvement in subsequent measurements.
Despite these limitations, the Hinting Task was selected for consideration in the Social Cognition Psychometric Evaluation (SCOPE) study, which sought to identify the best available measures of social cognition for use in clinical trials of schizophrenia spectrum illnesses (Pinkham et al., 2014; Pinkham, Penn, Green, & Harvey, 2016). Results from the final validation phase of SCOPE supported the utility and practicality of the Hinting task in clinical research (Pinkham, Harvey, & Penn, 2018): in contrast to the aforementioned criticisms of the task, the measure demonstrated limited floor and ceiling effects in patients and healthy controls (less than 7% of the total sample). Notably, the authors of SCOPE developed and utilized a novel, more stringent scoring system for the Hinting task throughout SCOPE, which may have resulted in improved psychometric properties over the original scoring method (Pinkham et al., 2016). Findings from SCOPE suggest that the Hinting task demonstrated small practice effects (patients: initial phase d_z = .19; final validation study d_z = .15; healthy controls: initial phase d_z = .31; final validation study d_z = .18) and adequate test-retest reliability in patients (initial phase r = .639; final validation study r = .695), with slightly lower test-retest reliability in healthy controls (initial phase r = .424; final validation study r = .509) (Pinkham et al., 2016; Pinkham et al., 2018). Further, the Hinting task was identified as a significant predictor of real-world outcomes, including functional capacity, social competence, social functioning, and community-living skills (Pinkham et al., 2016; Pinkham et al., 2018).
Building upon literature that suggests that length of illness may impact social cognitive ability, Ludwig et al. (2017) investigated whether the utility and practicality of social cognitive tasks utilized in the primary SCOPE study extended to younger individuals with first episode psychosis (FEP). Within their sample of individuals with FEP (mean age = 23.45), the authors noted that the Hinting task and revised scoring system showed good test-retest reliability (r = .74) and limited practice effects (d_z = .41-.64) in FEP (Ludwig et al., 2017). Further, only two patients (<6% of the sample) showed floor/ceiling effects.
Consistent with the results of SCOPE, the Hinting task demonstrated sound psychometric properties and was shown to be a significant predictor of real-world outcomes for individuals early in the course of illness when the more stringent scoring method developed by the SCOPE authors was utilized.
The reported results from the SCOPE study highlight that the Hinting task is appropriate for use in patients with psychosis, regardless of stage of illness when utilizing a more stringent scoring system. However, it is unclear how these psychometric properties compare to the original scoring criteria and whether the more stringent criteria are necessary and warrant widespread adoption. The current study therefore used all available SCOPE data to compare the psychometric properties of the Hinting task when scored with the SCOPE system to those obtained with the original scoring criteria in both chronic and FEP. We hypothesized that the revised SCOPE scoring criteria would result in overall improved psychometric properties of the Hinting Task, specifically higher estimates of test-retest reliability and internal consistency. By reducing the overall number of participants scoring at ceiling, we also anticipated that the SCOPE scoring system would introduce more variability within the sample, and result in stronger associations with functional outcome measures.

| Participants
Collapsing across the three SCOPE phases resulted in a sample of 790 unique participants. The sample was then divided into either patients with chronic schizophrenia spectrum diagnoses and matched healthy controls, or early psychosis patients and their matched healthy controls. Sixteen participants (12 patients and four healthy controls) were omitted from the chronic analyses because they were extreme outliers (−3 SD) under either the original or SCOPE scoring system. This resulted in final sample sizes of 697 participants in the "chronic" analyses (397 patients with schizophrenia spectrum diagnoses and 300 healthy controls) and 77 participants in the "early psychosis" analyses (38 patients with schizophrenia spectrum diagnoses and 39 healthy controls). Demographic information for both samples is provided in Table 1, and results are discussed for each of these groups individually below.

| Hinting task and scoring criteria
During administration of the Hinting task (Corcoran et al., 1995), the rater reads aloud a short vignette describing an interaction between two characters. Each of the 10 passages ends with one character dropping a hint, and participants are asked to indicate what the character truly meant. If the participant is inaccurate in their assessment of the character's intent, the rater provides a second hint, allowing the participant to receive partial credit. In the SCOPE administration of the task, participants could ask for the vignette or additional hints to be read again as needed; however, no additional queries were administered to elicit more detailed responses. Individual items are scored from 0 to 2, with a "2" indicating perfect understanding of the intention of the character in the scene, a "1" indicating partial credit, and a "0" indicating failure to infer intention. Performance is indexed as a total score that can range from 0 to 20.
[...] behind her. Patsy says to John, 'Gosh! These suitcases are a nuisance,'" the participant is asked to infer the intent of Patsy's statement. In the original scoring method, any response indicating that Patsy would like help with her suitcases earns full credit. The SCOPE criteria, however, emphasize that in each of the vignettes, a direct request is being made of the second character in the scenario. In the example above, a correct answer must identify not only that Patsy needs help with her suitcases but also that she wants John, specifically, to help carry them. Correct answers are required to include the intention of the person as well as a request of the other person in the scenario. Furthermore, the SCOPE authors identified key phrases or words that assist in reducing variability in rater scoring. These revised scoring criteria are provided in Appendix S1.
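The item-level scoring described above (ten items, each scored 0, 1, or 2, summed to a 0-20 total) can be illustrated with a minimal sketch. The function name `hinting_total` and the item scores in the usage example are hypothetical and are not taken from the SCOPE materials; the sketch only shows how the same set of responses could yield a lower total when a rater applies stricter criteria.

```python
def hinting_total(item_scores):
    """Sum ten Hinting task item scores (each 0, 1, or 2) into a 0-20 total."""
    if len(item_scores) != 10:
        raise ValueError("the Hinting task has 10 items")
    if any(score not in (0, 1, 2) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, or 2")
    return sum(item_scores)


# Hypothetical example: one participant's responses scored twice.
# Under the stricter criteria, three items drop from full to partial credit.
original_items = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
stricter_items = [2, 1, 2, 2, 1, 2, 2, 2, 1, 2]

original_total = hinting_total(original_items)
stricter_total = hinting_total(stricter_items)
```

In this invented case the participant would sit at ceiling under the first set of item scores but not under the second, which is the kind of shift the rescoring comparison is designed to detect.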

| Functional outcome measures
Data for functional outcomes from the SCOPE study were utilized to assess the relationship between Hinting task scoring methods and both informant report and performance-based functioning. Social

| Neurocognition and symptom assessment
As part of the SCOPE protocol, premorbid IQ was estimated at the participant's initial visit using The Wide Range Achievement Test (WRAT-3) reading subscale (Wilkinson, 1993

| Procedures
This study utilized data from the three phases of the SCOPE study, in which participants completed a comprehensive social cognitive battery at two time points, approximately 2-4 weeks apart. Information regarding participant recruitment and administration procedures for the SCOPE study has previously been reported (Pinkham et al., 2016; Pinkham et al., 2018). Hinting task responses were recorded verbatim and scored in real time using the stricter, revised SCOPE criteria. Given the availability of these verbatim responses, independent raters who had not previously been trained on the SCOPE scoring criteria were able to review all participant responses and rescore them, assessing whether the responses were sufficient for full or partial credit based upon the original criteria. Three independent raters were trained to good reliability, ICC (1,3) = 0.840, based upon the original scoring guidelines provided by Corcoran et al. (1995).

| Statistical analyses
Analyses followed the statistical plan used in the original SCOPE study, and psychometric properties for the chronic and early psychosis groups were analyzed separately. For the chronic sample, distributions of the Hinting task scores were first assessed for both the original and SCOPE scoring criteria. Outliers, defined as −3 SD from the mean for either scoring system, were excluded from analyses (n = 12 patients, four controls). Test-retest reliability was computed using Pearson's r correlation coefficients, whereas internal consistency was evaluated via Cronbach's alpha. Practice effects (paired-samples t-tests with Cohen's d_z) and floor/ceiling effects (number of participants scoring at 0 or scoring 100%) were assessed to determine utility as a repeated measure. Additionally, although we define ceiling effects as perfect performance (scoring 20 out of 20), it is also important to consider whether the task allows room for improvement in clinical trials (Murthy, Xu, Zhong, & Harvey, 2019). We therefore also report the number of participants achieving near-perfect scores at initial testing (≥ 17 out of 20) using each scoring method. Finally, to examine relationship to functional outcomes, Pearson's r correlations were calculated with the Hinting task score from the participant's initial visit.
Partial correlations between Hinting scores and outcomes while controlling for MCCB neurocognitive performance were also calculated.
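The indices named in this analysis plan can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' analysis code: the function names are hypothetical, sample (n − 1) variances are assumed for Cronbach's alpha, and Cohen's d_z is taken as the mean of the paired differences divided by their standard deviation.

```python
import math
from statistics import mean, stdev, variance


def pearson_r(x, y):
    """Pearson correlation, e.g. visit 1 vs. visit 2 totals for test-retest reliability."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` holds one list of k item scores per participant."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def cohens_dz(visit1, visit2):
    """Practice effect: mean paired difference divided by the SD of the differences."""
    diffs = [b - a for a, b in zip(visit1, visit2)]
    return mean(diffs) / stdev(diffs)


def floor_ceiling_counts(totals, max_score=20, near_perfect=17):
    """Count participants at floor (0), at ceiling (max), and at or above the near-perfect cutoff."""
    return {
        "floor": sum(t == 0 for t in totals),
        "ceiling": sum(t == max_score for t in totals),
        "near_perfect": sum(t >= near_perfect for t in totals),
    }
```

Applying each function once to totals from the original criteria and once to totals from the SCOPE criteria would give the paired estimates that the scoring comparison relies on.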
Independent-samples t-tests with Cohen's d examined group differences between patients and controls, and paired t-tests were used to examine mean differences between the original scoring criteria and SCOPE scoring. Fisher's z was used to compare test-retest reliability indices between scoring criteria, as well as the relationships between scoring criteria and functional outcomes. Feldt tests (Feldt, Woodruff, & Salih, 1987) were performed to compare estimates of internal consistency between scoring criteria. As the current study collapses samples across the three phases of the SCOPE study to form the chronic sample, current results differ slightly from previously published reports. Appendix S1 provides more direct comparisons of individual samples from each of the previously published SCOPE psychometric papers with rescored results.
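As a rough illustration of comparing two correlations with Fisher's r-to-z transformation, the sketch below uses the simplest form of the test, which treats the two correlations as coming from independent samples. The text does not specify which variant was applied, and correlations computed on the same participants are dependent, so this is an assumption made for illustration; the function name is hypothetical.

```python
import math


def fisher_z_compare(r1, n1, r2, n2):
    """Compare two correlations via Fisher's r-to-z (independent-samples form).

    Returns the z statistic and a two-tailed p value from the normal distribution.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # r-to-z transformation
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed
```

A dependent-correlations variant would additionally require the correlation between the two scoring systems themselves, which is why the choice of test matters when both indices come from one sample.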
For the early psychosis sample, analyses were similar; however, as the early psychosis sample was considerably smaller, analyses were run both with and without outliers. Two participants were identified as outliers (one patient and one healthy control); however, their removal from the data did not significantly impact the pattern of results. Therefore, results below are reported with outliers included, although analyses excluding these participants can be found in Appendix S1.

| Early psychosis sample
Similar to the chronic sample, test-retest reliability as assessed by both the original and SCOPE scoring criteria was within benchmark standards for the patient sample, but below standards for healthy controls. Scoring criteria did not significantly impact estimates of test-retest reliability for either group within this sample.

| Chronic sample
Internal consistency was calculated for both initial and follow-up task administrations for both groups (Table 2). Across the sample, these scores did not reach recommended levels for internal consistency (α = .80; Nunnally, 1967), and scoring criteria did not significantly impact estimates of internal consistency for either group at any time point.
[...] Table 3, statistical comparisons can be found in Table 4). The total number of participants scoring at ceiling was greatly reduced when using the SCOPE scoring criteria, falling from 58 participants (20% of sample) to 15 (5%) in the control group at the initial visit and from 80 (28%) to 22 (8%) at retest. For the patient sample, the overall number of participants receiving a perfect score was reduced from 30 participants (8% of sample) to 3 (0.8%) at the initial visit and from 55 (15%) to 6 (1.6%) at follow-up. In considering near-perfect performance and the benefits of capturing potential improvements resulting from treatment, SCOPE scoring

| Early psychosis sample
For FEP patients, performance on the task significantly improved with repeated testing, with small practice effects under the original criteria and moderate effects under the SCOPE scoring criteria. Healthy controls, however, did not demonstrate any significant practice effect for either the original or SCOPE scoring criteria. When directly comparing scoring methods for the early psychosis sample, mean scores at each time point were significantly reduced for both patients and healthy controls when utilizing the SCOPE scoring criteria (Table 4). Similar to the chronic sample, SCOPE scoring criteria greatly reduced the number of participants scoring at ceiling, reducing the total number for healthy controls at the initial visit from 9 (25% of the sample) to 4 (11%), and from 10 (29%) to 3 (8%) at follow-up. SCOPE criteria reduced the total number of patients scoring at ceiling from 5 (16% of sample) to 1 (3%) at the initial visit, and from 11 (31%) to 2 (6%) at follow-up. The number of individuals scoring near-perfect also decreased with SCOPE scoring, from 29 to 19 for patients and from 34 to 30 for healthy controls.
3.4 | Relationship to functional outcomes

| Chronic sample
Correlations between measures of functioning and both original and SCOPE scoring of the Hinting task can be found in Table 5. For the chronic patient population, small but significant correlations were observed between performance on the Hinting task and performance measures of functional capacity (UPSA), social competence (SSPA), and the informant-rated measure of daily functioning (SLOF Informant).

SCOPE scoring significantly increased the relationship between Hinting task performance and functional capacity.
We also observed significant correlations between neurocognitive ability and functional outcomes. Thus, in order to assess the unique relationship between social cognitive ability and functional outcome measures, we calculated partial correlations between Hinting scores and functional outcome measures, including all neurocognitive subscales as covariates. The small but significant relationship between Hinting performance and functional capacity was retained after introducing covariates, though scoring system no longer significantly impacted this relationship. The relationship between Hinting performance and social competence was no longer significant after controlling for neurocognitive ability. After controlling for neurocognitive ability, scoring system significantly impacted the small but significant relationship between Hinting performance and informant ratings of daily functioning, with the SCOPE criteria revealing a stronger relationship. No significant correlations were found between the Hinting task scores or neurocognitive subscales and the self-reported measure of daily functioning (SLOF Self-report) in the chronic sample. Correlations between neurocognitive subscales and Hinting scores are presented in Appendix S1.
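The logic of a partial correlation can be sketched with the first-order formula, which removes the influence of a single covariate from the correlation between two variables. This is a deliberate simplification: the analyses described above entered all neurocognitive subscales as covariates, which would require a regression-based (residualized) approach rather than this one-covariate formula. The function name and the input values in the usage example are hypothetical.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for one covariate z.

    r_xy: correlation between the two variables of interest
    r_xz, r_yz: correlation of each variable with the covariate
    """
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical values: a task-outcome correlation of .50 shrinks once a
# covariate correlated .50 with both variables is partialled out.
adjusted = partial_r(0.50, 0.50, 0.50)
```

When the covariate is unrelated to both variables the formula returns the original correlation unchanged, which is why a relationship that survives adjustment is read as reflecting something beyond the covariate.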

| Early psychosis sample
Performance on the Hinting task was significantly related to functional capacity (UPSA) and social competence (SSPA) when using both the SCOPE and original scoring criteria, though these were small effects. Unlike the chronic sample, Hinting task scores were not significantly related to informant-rated daily functioning (SLOF Informant).
Scoring criteria did not have any significant impact on these analyses, suggesting comparable relationships to functioning across scoring methods for this sample. We did not observe significant relationships between functional outcome measures and neurocognitive ability, with one exception of letter-number span (a measure of working memory) on functional capacity. Nevertheless, when adding neurocognitive ability as a covariate in correlational analyses, the partial correlations between hinting performance and functional outcome were no longer statistically significant for this small sample.
3.5 | Group differences

3.5.1 | Chronic sample
Direct comparisons between patients and healthy controls are presented in Table 6. The groups significantly differed on task performance using the original scoring criteria at visits 1 and 2, with patients scoring lower than healthy controls at both time points. The updated SCOPE scoring retained these group differences with very comparable effect sizes.

| Early psychosis sample
Similar to the chronic sample, patients and healthy controls significantly differed on task performance when utilizing the original scoring criteria at visit 1 yet failed to meet significance at visit 2. Utilizing the updated SCOPE scoring system resulted in larger effect sizes at the initial visit and significantly differentiated between patient and control samples at retest.

| DISCUSSION
The current study aimed to examine the psychometric properties of the Hinting task when using stricter scoring criteria developed as part of the SCOPE study, compared to the same data scored with the original scoring system. Chronic patients and early psychosis patients were analyzed separately to determine if the psychometric properties differed according to stage of illness. Although our overall results are mixed, the more stringent SCOPE criteria addressed key concerns regarding the Hinting task, namely reducing ceiling effects and better differentiating between patients and controls in an early psychosis sample; however, the updated criteria failed to significantly improve other psychometric properties. Further, we observed unique improvements in the relationship between performance and outcome measures when implementing a stricter scoring system, which may have added clinical utility when using this task.
Overall, SCOPE scoring criteria significantly lowered the mean scores for all examined groups, and the number of participants scoring at ceiling for the task was greatly reduced when implementing the SCOPE scoring across all samples. Importantly, the SCOPE criteria still differentiated patient and control groups for the chronic sample in a manner that was highly similar to the original scoring system, and significantly differentiated between early psychosis patients and matched controls with larger effect sizes. As the presence of ceiling effects has been raised as one of the primary criticisms against the Hinting task, particularly in relation to its suitability for use in clinical trials, these findings suggest that the revised SCOPE criteria may partially remedy this problem. Relatedly, applying the more stringent scoring system did not appear to disproportionately impact patient scores relative to those of healthy controls.
Results were not uniformly positive however, and within the chronic sample, SCOPE scoring resulted in mixed impacts on psychometric properties of the task. Specifically, SCOPE scoring resulted in test-retest reliability within the acceptable range despite a statistically significant decrease relative to the original scoring for the chronic patient sample. For healthy controls, scoring criteria did not impact test-retest reliability, and both scoring systems resulted in estimates below the desired range. Estimates of internal consistency did not meet recommended levels for either scoring criteria, and no significant improvement was seen using one scoring system over the other. Additionally, we only observed small practice effects on performance from initial to follow-up visits, using either scoring criteria.
Importantly, correlations between Hinting total scores at the initial visit and functional capacity were significantly improved using SCOPE scoring, and when controlling for neurocognitive ability, we observed a similar increase in the relationship between informant rated daily functioning and hinting performance. It is unsurprising that neurocognitive ability may partially account for the significant increase in the relationship between SCOPE scoring and functional capacity, as SCOPE scoring may tap into problem solving aspects of financial and communication skills that are assessed in the UPSA.
However, the uniquely significant relationship between Hinting performance and informant reports of daily functioning may indicate that SCOPE scoring is measuring unique aspects of social ability/understanding that are related to interpersonal interactions above and beyond neurocognitive ability.
The results for our early psychosis sample similarly indicate that the SCOPE criteria provided some unique benefits with minimal costs to the psychometric properties of the task. The most notable observation within this sample was that SCOPE criteria better differentiated between patients and controls, with rather large effects at the initial visit and smaller, albeit significant, effects at follow-up. As with the chronic sample, both scoring methods resulted in test-retest reliability within the acceptable range for the patient sample, whereas estimates for healthy controls were below cut-offs. We did not observe any significant impact of scoring system on test-retest reliability across groups. Although Cronbach's alpha was below desired levels for all groups and time points, we did observe a significant improvement in internal consistency using SCOPE scoring criteria for healthy controls at the initial visit. As noted above, this may be due to the limited variability in item responses within this sample, and ways to improve this aspect of the task are discussed below. Additionally, we observed an increase in practice effects for the early psychosis patient sample, with moderate effects when using the SCOPE criteria compared to small effects observed in the original criteria. Notably, we did not see any significant increased relationship between Hinting performance and outcome measures when using either scoring criteria.
The different pattern of results in our early psychosis sample indicates that age or stage of illness may impact some of the psychometric properties of the SCOPE scoring. As noted in Ludwig et al. (2017), it is plausible that patients early in the course of illness may either retain levels of premorbid functioning or exhibit reduced deficits in ToM, impacting some of our results (i.e., increased practice effects, or nonsignificant associations between occupational skills and performance on the Hinting task). The current results may also be confounded by the fact that our early psychosis sample was significantly smaller than the chronic sample, as well as younger, more educated, and scoring higher on a measure of premorbid IQ. Taken together, these findings suggest that the task's ability to detect improved performance may be restricted for some individuals, and that researchers may need to assess potential costs and benefits of utilizing the Hinting Task along with the SCOPE scoring system in patient samples with attenuated social cognitive deficits and in healthy samples.
It is also important to note that even though SCOPE scoring reduced the total number of those with near-perfect scores, these results highlight a potential inherent limitation of the task. In the chronic sample, 24% of patients and 47% of controls scored in the near-perfect range, and over 50% of the early psychosis patients and controls scored in the near-perfect range even with the more restrictive SCOPE scoring criteria. Although we believe that the SCOPE scoring system addresses key concerns with using the Hinting Task in its current form, there are several avenues for researchers to further improve the psychometric properties of the Hinting Task, especially for use in healthy and more normative patient samples. Specifically, researchers could create and test alternate vignettes to either add to the current task and increase total number of items, or to replace items that perform poorly to increase internal consistency and construct validity. This work would also benefit from the creation of an alternate form of the current task for use in clinical trials. Researchers could also expand the rating scale beyond the 0-2 rating per item to provide more nuances in subject response; however, this would greatly impact the ease with which the task can currently be administered. Finally, other suggested improvements would be to standardize the task through electronic/digital methods to reduce the interrater variability; however, this could be challenging given the task's current open-ended prompt structure. Although SCOPE scoring does not address all the challenges associated with the Hinting task, we believe it provides incremental benefits that warrant adoption until more substantial task improvements can be validated and peer reviewed.
In summary, the present study demonstrated that the SCOPE scoring criteria improve key psychometric properties of the Hinting task and increase relationships with outcome measures when administered to persons diagnosed with a schizophrenia spectrum disorder. Stricter scoring, as demonstrated by lowered group means and reductions in near-perfect scores on the task, allows for more variability not only within patient samples but also within healthy samples. As such, employing the more stringent scoring criteria from the SCOPE study may lead to more accurate assessment of ToM deficits and reduce potential statistical violations when directly comparing performance between clinical and non-clinical samples. Limiting ceiling effects increases the utility of the Hinting task in clinical and research settings. Further, stronger relationships with outcome measures indicate a more precise measurement of functionally important aspects of ToM, thus arguably resulting in a stronger tool for clinical research. It is important to note that test-retest reliability and internal consistency decreased in some samples with use of the SCOPE scoring system, and this should be taken into consideration when using the stricter scoring system. Despite these limitations, this study clarifies and emphasizes that the SCOPE study endorsement of the Hinting task as acceptable for use in clinical trials carries the caveat that the more stringent scoring should be used. As such, we strongly recommend a wider adoption of the revised scoring system when using the Hinting task.