Equivalent short forms of the Situational Feature Recognition Test 2: Psychometric properties and analysis of interform equivalence and test–retest reliability

Abstract
Objective: To obtain two equivalent short forms of the “Situational Feature Recognition Test, Version 2,” a social perception test, and their psychometric properties.
Methods: Patients with schizophrenia (n = 101) were assessed at two different times. Statistical analyses were performed as follows: (1) Cronbach's alpha was used to assess reliability; (2) Spearman correlations, Wilcoxon signed‐rank test, and a 2 (form) × 2 (time) repeated measures multivariate analysis of variance were used to analyse interform equivalence; (3) Sensitivity to change was studied by a 2 (group) × 2 (time) repeated measures multivariate analysis of variance; (4) Spearman correlations were employed to assess test–retest reliability, convergent and discriminant validity, and relationship with functionality and symptoms.
Results: The short forms showed good internal consistency at both times. A significant, moderate correlation between forms was found, along with no statistically significant form × time interaction. Hits and false positives of both forms were moderately correlated at both times. The group × time interaction was significant, especially for hits, when assessing sensitivity to change. Both forms were significantly correlated with other social cognition domains and with functionality.
Conclusions: Results of this study support the use of short forms of the Situational Feature Recognition Test, Version 2, especially in clinical trials and longitudinal studies among patients with schizophrenia.

According to an exhaustive meta-analysis (Savla et al., 2013), SC research usually presents methodological limitations due to the assessment tools used to measure a remarkably complex construct. This is of special importance in clinical trials and longitudinal studies (Grant, Lawrence, Preti, Wykes, & Cella, 2017; Pinkham, 2014; Pinkham, Penn, Green, & Harvey, 2016). The psychometric properties of many SP and other SC measures have not been fully investigated to date (Pinkham et al., 2016). Moreover, many of the assessment tools have not been tested for their use as repeated measures and some of their properties, such as test-retest reliability or possible learning effects, have not been studied at all. These limitations compromise their use, particularly in clinical trials and longitudinal studies (Grant et al., 2017; Green et al., 2008). In addition, the relationship between some SC measures and patients' functionality is not usually assessed, even though it is well known that performance on SC and especially SP might provide important data for the study of the patient's functional outcomes (Fett et al., 2011).
SP refers to the ability to decode and interpret social cues in others by integrating information about the social context and knowledge in order to make a judgement about others' behaviours (Pinkham, 2014). This ability is particularly necessary when the stimulus itself is ambivalent or confusing (e.g., tears can be interpreted as signs of sadness or joy depending on the context but can rarely be interpreted correctly on the basis of the stimulus alone). SP is highly related to "situational schemata," a term proposed by Corrigan and Green (1993) to refer to the interpersonal information acquired from the situation per se that guides interpersonal responses to a specific stimulus. As far as the authors are aware, to date, only three of the SP measures commonly used to assess patients with schizophrenia have equivalent forms: The Social Attribution Test-Multiple Choice (SAT-MC; Bell, Fiszdon, Greig, & Wexler, 2010), The Awareness of Social Inferences Test (McDonald, Flanagan, Rollins, & Kinch, 2003), and The Trustworthiness/Approachability Task (Adolphs, Tranel, & Damasio, 1998). Equivalent forms of a test are tests created to measure the same construct, which are as similar as possible in terms of the distribution of item difficulty and item content, with high intercorrelation between them (Kelley, 1942). Equivalent forms have an important role to play, especially in clinical trials and longitudinal studies, in order to avoid learning effects without the need to change the measure used at different times of assessment. None of the three SP assessment tools initially proposed by the Social Cognition Psychometric Evaluation (SCOPE) study for identifying and improving the existing SC measures in schizophrenia has an alternative equivalent form, which illustrates the paucity of equivalent forms among SP tests (Pinkham et al., 2016).
A new SP measure with equivalent forms was later included in the SCOPE final validation study: the SAT-MC (Bell et al., 2010). However, this instrument showed poorer psychometric properties when compared with those used for assessing the rest of the SC domains (Pinkham, Harvey, & Penn, 2018), leaving the SP domain without a recommended task. The lack of SP tests with reliable equivalent forms can also be noted by observing the measures included in other recent reviews and meta-analyses (Grant et al., 2017; Savla et al., 2013). Of the nine different SP measures included in a review of SC clinical trials on schizophrenia, none had validated equivalent forms available (Grant et al., 2017). Similarly, in the meta-analysis mentioned above (Savla et al., 2013), which examined the deficits of all SC domains in schizophrenia, only one measure presented an equivalent form (The Trustworthiness/Approachability Task; Adolphs et al., 1998), among the more than 10 SP measures that were included.
An additional challenge for SP assessment among patients with schizophrenia is the time needed to administer the measure. In general, current SP assessment tasks may take between 20 and 35 min to perform, as is the case for the Half Profile of Nonverbal Sensitivity (PONS; Ambady, Hallahan, & Rosenthal, 1995), the Interpersonal Perception Task-15 (Costanzo & Archer, 1989), and the Relationships Across Domains task (Sergi et al., 2009). Taking into account the overall examination time that an exhaustive neuropsychological assessment usually involves, this time might be excessive for participants, resulting in their performance being compromised.
All the limitations listed above are found not only in English instruments but also in tools in other languages. As previously noted elsewhere (Gómez-Gastiasoro, Peña, Zubiaurre-Elorza, Ibarretxe-Bilbao & Ojeda, 2018), most SP measures lack Spanish adaptations and validations.
One of the SP tests commonly used among patients with schizophrenia that has also shown good psychometric properties is the Situational Feature Recognition Test 2 (SFRT-2; Corrigan & Green, 1993; Corrigan, Silverman, Stephenson, Nugent-Hirschbeck, & Buican, 1996). This assessment tool presents nine social situations along with a list of related and unrelated actions (actions that are usually performed in a given situation) and a list of related and unrelated goals (goals that people usually try to accomplish in a given situation) for each situation (Corrigan et al., 1996; Corrigan & Green, 1993; see Figures 1 and 2 for a sample item). For each situation, the participant is asked to mark all the actions and goals that they think are related to that situation. This assessment tool has been adapted to Spanish and validated and has obtained good psychometric properties (Gómez-Gastiasoro, Peña, Zubiaurre-Elorza, Ibarretxe-Bilbao & Ojeda, 2018). However, like many of the most commonly used SP measures, the full version of the SFRT-2 has no equivalent forms and takes about 15-20 min to complete, depending on the cognitive status of the participant.
The main objective of the present study was to develop two equivalent short forms of the original SFRT-2 test in a sample of native Spanish-speaking patients with schizophrenia. In addition, we intended to assess the psychometric properties of the short forms from a classical test theory perspective, in terms of (1) internal consistency, (2) interform equivalence, (3) sensitivity to change, (4) test-retest reliability, (5) convergent and discriminant validity in relation to other SC measures and neurocognition scores respectively, and (6) convergent validity in relation to functional impairment and symptom severity. We hypothesized that (1) (Corrigan et al., 1996; Corrigan & Green, 1993). These were conducted at two different times, 3 months apart. Neuropsychological

| Procedure
Patients were involved in a project that originally assessed the efficacy of the REHACOP cognitive rehabilitation program for psychosis (Peña et al., 2016). All the patients were first assessed at the beginning of the rehabilitation program and then again after 3 months of treatment. No incentives were provided to patients (either at the beginning or at the end of the clinical trial), so neither baseline nor posttreatment performance was influenced by incentives. The same full form of the SFRT-2 (Corrigan et al., 1996) was applied at both times. The SFRT-2 was never administered in its short forms, and performance on this measure was not a criterion for patients' inclusion in the clinical trial.
In order to develop the two equivalent short forms of the SFRT-2, eight situations were selected from the original nine, to obtain an even number of situations. They were separated into two different forms, each of them consisting of four situations. The selection was performed taking into account the patients' degree of familiarity with the situations, in an attempt to include two familiar and two unfamiliar situations in each of the abbreviated forms. The familiarity of the situations was assessed by means of a scale in which patients had to indicate their degree of familiarity with the situation (1 = totally familiar; 2 = very familiar; 3 = familiar; 4 = neutral; 5 = unfamiliar; 6 = very unfamiliar; 7 = totally unfamiliar). All the situations were classified as "familiar" except "building an igloo" and "performing surgery," which were rated as "totally unfamiliar" and "performing an ultrasound," which was rated as "neutral." The "playing Monopoly" situation was discarded after a preliminary reliability analysis was performed with the same sample used for this study, as it was found that the internal consistency of the short forms decreased when this situation was used.
Each of the situations maintained the 12 options for both actions and goals previously presented in the Spanish adaptation and validation of the SFRT-2 (Gómez-Gastiasoro, Peña, Zubiaurre-Elorza, Ibarretxe-Bilbao & Ojeda, 2018). The situations included in the first form were "building an igloo," "reading in a library," "driving a car," and "performing an ultrasound" (mean familiarity = 4.5), whereas the second form included "taking a test," "celebrating first communion," "having a haircut," and "performing surgery" (mean familiarity = 4.0).
Each of the eight situations was linked to a list of actions and a list of goals. As in the Spanish adaptation, each list contained five possible hits and seven possible false positives (see the example item given in Figures 1 and 2). Performance was indexed as the total scores obtained in action hits (ranging from 0 to 20), action false positives (ranging from 0 to 28), goal hits (ranging from 0 to 20), and goal false positives (ranging from 0 to 28). Administration time for each short form was estimated at 5 min, based on six independent participants who completed the short forms. The full forms (1 and 2) are shown in Appendices 1 and 2, respectively.
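The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' scoring procedure: the function names `score_list` and `score_form`, the data layout, and the option labels 1-12 are hypothetical. It assumes that each list has 12 options, of which five are keyed as related (possible hits) and seven as unrelated (possible false positives).

```python
def score_list(marked, keyed_related):
    """Return (hits, false_positives) for one 12-option list.

    marked: set of option labels the participant ticked.
    keyed_related: set of the five options keyed as related to the situation.
    """
    hits = len(marked & keyed_related)             # ticked and keyed as related
    false_positives = len(marked - keyed_related)  # ticked but keyed as unrelated
    return hits, false_positives


def score_form(responses):
    """Sum hits and false positives over the situations of one short form.

    responses: list of (marked, keyed_related) pairs, one per situation.
    """
    total_hits = 0
    total_fp = 0
    for marked, keyed in responses:
        hits, fp = score_list(marked, keyed)
        total_hits += hits
        total_fp += fp
    return total_hits, total_fp


# Hypothetical action lists for the four situations of one short form.
# Options are labelled 1-12; options 1-5 are ASSUMED to be the keyed hits.
keyed = set(range(1, 6))
example = [
    ({1, 2, 3, 4, 5}, keyed),    # all five hits, no false positives
    ({1, 2, 9}, keyed),          # two hits, one false positive
    (set(), keyed),              # nothing marked
    (set(range(1, 13)), keyed),  # everything marked: five hits, seven false positives
]
action_hits, action_fp = score_form(example)
print(action_hits, action_fp)
```

With four situations per form, marking every option in every list gives the maxima stated in the text (20 hits and 28 false positives per list type).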

| Other cognitive and SC variables
The neuropsychological battery also included a premorbid IQ measure (Word Accentuation Test; Del Ser, González-Montalvo, Martínez-Espinosa, Delgado-Villapalos, & Bermejo, 1997), designed specifically for Spanish speakers. In this test, participants are asked to read aloud some uncommon words written without the accent mark, stressing the correct syllable (Del Ser et al., 1997). Raw scores were converted into estimated full scale IQ based on Gomar et al. (2011). Other neurocognitive measures were also included, such as The Hopkins Verbal Learning Test-Revised (HVLT-R; Benedict, Schretlen, Groninger, & Brandt, 1998), for verbal learning and memory; the Trail Making Test, Parts A and B, for processing speed and cognitive flexibility (Reitan & Wolfson, 1985) (Happé, 1994), the Mayer-Salovey-Caruso Emotional Intelligence Test for EP (Mayer, Salovey, & Caruso, 2002), and the self-serving bias index of the Attributional Style Questionnaire for AS (Peterson et al., 1982;Sanjuan & Magallares, 2006).

| Statistical analyses
The Kolmogorov-Smirnov test was used to assess the distribution of the variables. A number of statistical analyses were performed depending on whether the variables were normally or non-normally distributed. Hit and false positives composite scores were obtained in order to calculate the interform equivalence, test-retest reliability, and relationship with functionality. Some statistical analyses were performed for the first and second testing times. The analyses performed at the first testing time (Time 1) were carried out on 101 patients (patients who received cognitive rehabilitation and patients in the control group). However, the analyses performed at the second testing time (Time 2) were carried out only on the 47 patients in the control group, in order to avoid the effects of cognitive rehabilitation on the scores obtained. As the SFRT-2 was never administered in short form, all the analyses performed were post-hoc manipulations of data collected from the full form. All the analyses were conducted using SPSS v.23 (SPSS Inc., Chicago, IL, USA).

| Reliability
The reliability of the short forms of the SFRT-2 was examined by assessing the internal consistency of the total action hits, goal hits, action false positives, and goal false positives separately for Time 1 (n = 101) and Time 2 (n = 47), and for both forms by means of Cronbach's alpha.
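For readers unfamiliar with the coefficient, Cronbach's alpha for k items is k/(k − 1) × (1 − Σ item variances / variance of the total score). The sketch below is a minimal illustration of that formula. It assumes, hypothetically, that the "items" are the per-situation scores of one index (for example, action hits in each of the four situations of a form); the source does not state how the items were defined, and the data are invented.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a participants-by-items score matrix.

    item_scores: list of rows, one per participant; each row holds k item scores.
    Uses sample variances (n - 1 denominator).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    columns = list(zip(*item_scores))  # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Invented data: action hits (0-5) in four situations for five participants.
hypothetical_hits = [
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [2, 3, 3, 2],
]
alpha = cronbach_alpha(hypothetical_hits)
print(round(alpha, 2))
```

In the study design this computation would be repeated for each index (action hits, goal hits, action false positives, goal false positives), each form, and each testing time.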

| Interform equivalence
Interform equivalence between the two short forms of the SFRT-2 was assessed by analysing the relationship between both short forms' hits and false positives at Time 1 (n = 101) by means of Spearman correlations. As in other studies that assess cognitive alternative forms, reliability coefficients upwards of .60-.70 were established as indicating clinical usefulness and robustness (Geffen, Butterworth, & Geffen, 1994). In addition, Wilcoxon signed-rank tests were used in order to assess the differences between forms in hits and false positives, for both actions and goals, at Time 1 (n = 101) and Time 2 (n = 47) separately. In this case, interform equivalence was established based on the nonsignificant differences between forms and the effect sizes obtained. Finally, a repeated measures analysis (n = 47) was performed by means of a 2 (form) × 2 (time) repeated measures multivariate analysis of variance (MANOVA) in order to assess the interaction between form and time factors for action hits and false positives, and goal hits and false positives independently. In this case, time was included as a within-subjects factor (with two levels: Time 1 and Time 2) and the dichotomous variable of form (Form 1 vs. Form 2) was included as an intersubject factor. SFRT-2 scores (action hits, action false positives, goal hits, and goal false positives) were included as dependent variables. The form × time interaction was studied in order to assess the differences. For this analysis, equivalence between forms was established on the basis of a nonsignificant form × time interaction and the effect sizes obtained.
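The paired comparison between forms can be illustrated with a small, self-contained Wilcoxon signed-rank sketch. The authors ran their analyses in SPSS; the code below is not their procedure. It assumes a normal approximation with tie correction, drops zero differences, and reports the effect size as r = |z| / √n with n taken as the number of non-zero pairs, which is one of several conventions in use. The scores are invented, and the normal approximation is rough for samples this small.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(form1, form2):
    """Wilcoxon signed-rank test for paired scores (normal approximation)."""
    diffs = [a - b for a, b in zip(form1, form2) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Count how many differences share each rank, for the tie correction.
    tie_counts = {}
    for r in ranks:
        tie_counts[r] = tie_counts.get(r, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values()) / 48
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    effect_r = abs(z) / math.sqrt(n)  # ASSUMED convention for the effect size
    return {"n": n, "w_plus": w_plus, "w_minus": w_minus,
            "z": z, "p": p_two_sided, "r": effect_r}


# Invented action-hit totals for six participants on Form 1 and Form 2.
form1_hits = [12, 15, 9, 14, 11, 13]
form2_hits = [10, 15, 10, 11, 13, 9]
result = wilcoxon_signed_rank(form1_hits, form2_hits)
print(result)
```

Under the criterion described above, a nonsignificant p value together with a small effect size would be read as supporting equivalence between forms for that index.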

| Test-retest reliability
Spearman correlation analyses were performed in order to assess the relationship between hits and false positives at Time 1 and hits and false positives at Time 2, for both short forms separately (n = 47). Reliability coefficients upwards of .60-.70 were established as indicating clinical usefulness and robustness (Geffen et al., 1994).
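As a concrete illustration of this step, Spearman's rho can be computed as the Pearson correlation of the rank-transformed scores. The sketch below is a minimal standard-library version, not the SPSS routine the authors used. The Time 1 and Time 2 scores are invented, and using .60 (the lower bound of the .60-.70 range cited above) as a single cut-off is an assumption made only for the example.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented hit totals for six control-group participants at both times.
time1 = [12, 15, 9, 14, 11, 18]
time2 = [11, 16, 10, 12, 13, 17]
rho = spearman_rho(time1, time2)

ASSUMED_CUTOFF = 0.60  # lower bound of the .60-.70 range; an assumption
print(round(rho, 2), rho >= ASSUMED_CUTOFF)
```

The same function applies to the interform correlations at Time 1, with the two inputs being the Form 1 and Form 2 scores instead of the two testing times.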

| Convergent and discriminant validity
In order to assess convergent and discriminant validity, two composite scores were calculated. The first one (α = 0.80) included neurocognition measures such as the HVLT-R learning and long-term trials,

| Relationship with functional and symptom severity variables
Spearman correlation analyses were also used in order to assess the relationship between both short forms' hits and false positives and functionality. Correlation analyses were performed including the hits and false positives of Forms 1 and 2 and the UPSA and GAF total scores at Time 1 (n = 101).

| Clinical and SP characteristics
Clinical characteristics and data about performance on the SFRT-2 are shown in Table 1. Data were divided by sample for Time 1 (n = 101) and sample for Time 2 (n = 47). As expected, there were more men than women in the groups in both cases.  Table 2).  Table 4).

| Sensitivity to change
For Form 1, the 2 × 2 repeated measures MANOVA showed significant effects for the group × time interaction for goal hits (p = 0.030) and a trend to significance for action hits (p = 0.053). No significant interaction was shown for false positives, either in actions or in goals. Similar results were found for Form 2, with significant group × time interaction for action hits (p = 0.004) and a trend to significance in goal hits (p = 0.083). Again, no significant interaction was found for false positives in either of the lists (actions and goals; Table 5).
Note. Cronbach's alpha coefficients for Forms 1 and 2 at Times 1 and 2. Form 1 includes: "building an igloo," "reading in a library," "driving a car," and "performing an ultrasound." Form 2 includes: "taking a test," "celebrating the first communion," "getting a haircut," and "performing surgery." Abbreviation: SFRT-2, Situational Feature Recognition Test 2.
also between Form 2 at Time 1 and Form 2 at Time 2. Correlation coefficients ranged from .62 to .73 (see Table 6).

| Convergent and discriminant validity
Both forms' indices showed a significant correlation with the SC composite score with the exception of goal false positives. Correlation indices were low to medium, ranging from .20 to .45. SFRT-2 short forms were also correlated with the neurocognition composite score, but to a lesser extent. In this case, neither action nor goal false positives showed a significant correlation with the composite scores. Correlation indices were low for all the measures, ranging from .15 to .37 (see Table 7).

| Relationship between short forms and functionality and symptom severity variables
Results showed that both hits and false positives of Forms 1 and 2 correlated with the UPSA total score, with coefficients ranging from 0.33 to 0.41, whereas GAF scores only correlated with the false positives for Form 1 (see Table 8).

| DISCUSSION
This study presents two equivalent short forms of the SFRT-2 for SP assessment in patients with schizophrenia. The two short forms showed good internal consistency both at Time 1 and Time 2. Both forms' indices were related to each other, and no differences were found between forms when considering time effects, whereas patients performed better on Form 2 than on Form 1 when time was not considered, questioning interform equivalence. Both forms showed good test-retest reliability and sensitivity to change, especially for hits scores. In addition, hits and false positives for both short forms of the SFRT-2 proved to be related to functional outcome and other SC measures.
Internal consistency indices ranged from acceptable to excellent, similar to those obtained in the SC measures selected by the SCOPE study (Pinkham et al., 2016), such as the Bell Lysaker Emotion Recognition Task (Bryson, Bell, & Lysaker, 1997), the Penn Emotion Recognition Test (ER-40; Kohler et al., 2003), the Reading the Mind in the Eyes Test (Eyes; Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001), the Hinting task (Corcoran, Mercer, & Frith, 1995), the Relationships Across Domains test (Sergi et al., 2009), the Trustworthiness Task (Trust; Adolphs et al., 1998), and the Awareness of Social Inferences Test (McDonald et al., 2003). These indices were also in line with those presented in the original version of the test (from α = .75 to α = .84; Corrigan et al., 1996). These internal consistency indices suggest that both SFRT-2 short forms are reliable SP measures to assess patients with schizophrenia.
The need for standardized and validated equivalent forms of social cognitive measures, especially SP measures, has been highlighted by specialists in studies such as the SCOPE (Pinkham et al., 2016, 2018). Specifically, the present short forms of the SFRT-2 showed interform equivalence in terms of interrelationship reliability coefficients between forms and also equivalence and stability when considering assessment time effects on form equivalence. In addition, the reliability indices obtained when correlating both forms proved high enough to indicate clinical usefulness and robustness according to the equivalence assessment of other cognition measures' alternative forms (Geffen et al., 1994). In contrast, when considering However, performance mean scores did not differ by more than one point from one form to another, which might be interpreted as not clinically significant. Nevertheless, these results might point to nonequivalence between forms when assessing performance differences in each of the forms. It is difficult to compare these results to those obtained by other SC measures since, to our knowledge, only three SP assessment tools currently present alternative forms. One of those measures, the SAT-MC, also showed differences between forms when patients' performance in both alternative forms was compared in the last SCOPE study (Pinkham et al., 2018) but not in the original manuscript which presented the alternative forms (Johannesen, Fiszdon, Weinstein, Ciosek, & Bell, 2018). However, correlation between SAT-MC forms was not assessed in the SCOPE study (Pinkham et al., 2018), although it was studied in the original manuscript, with good outcomes (Johannesen et al., 2018). The apparent variability between the three different analyses used to assess interform equivalence is also hard to compare with other studies, since the three analyses are very rarely reported jointly for the same assessment tool. However, similar interform equivalence has been obtained by other test forms assessing verbal memory, such as the HVLT-R (Benedict et al., 1998), a well-recognized neurocognition measure that has been recommended for neuropsychological assessment in clinical trials of patients with schizophrenia by the MATRICS initiative (Nuechterlein et al., 2008).
[Table: Test-retest reliability (Spearman ρ); n = 47]
In addition, hit scores of both forms showed changes after the intervention, but results were far from significant for false positives. The lack of significance for false positive responsiveness could indicate that these indices presented some kind of ceiling effect, preventing them from being sensitive to changes after a cognitive intervention. However, changes did not follow the expected pattern when assessing sensitivity to change (stability in the control group and improved scores at Time 2 in the experimental group). In this case, the changes were due to stable SFRT-2 scores in the experimental group and a decrease in these scores in the control group. Therefore, this would not reflect sensitivity to change as it is commonly understood. Nevertheless, as far as the authors are aware, there is a lack of data about the pattern of longitudinal changes on the SFRT-2 in patients with schizophrenia when no intervention is implemented. Therefore, it is not clear whether the change observed in the control group is the typical pattern.
As far as the utility of the short forms as repeated measures is concerned, test-retest reliability indices (rho ranging from .62 to .73) were also similar to those obtained by the SCOPE study (Pinkham et al., 2016, 2018).
Whereas ToM and EP seemed to be highly related to SP (Grant et al., 2017), AS has been shown to have a weaker relationship with this domain (Bell et al., 2010; Mancuso, Horan, Kern, & Green, 2011 (Mancuso et al., 2011; Sergi et al., 2007).
In addition, scores obtained in the short forms of the SFRT-2 were shown to be related to functional and symptom severity measures, especially to functional competence scores measured by the UPSA test. The idea that SC would to some extent be related to, or even explain, some variance in functional outcome has been well demonstrated (for a review, see Fett et al., 2011). In fact, relationship to functional outcome is one of the most important characteristics to be taken into account when choosing SC measures to be used in clinical trials (Pinkham, 2014; Pinkham et al., 2016). Regarding the short forms of the SFRT-2, hits and false positives were found to be related to functional outcome. Relationship coefficients were moderate and similar to those obtained when assessing the relationship between SC measures selected by the SCOPE study and UPSA total scores (Pinkham et al., 2016, 2018). These results suggest that the short forms of the SFRT-2 might be useful when trying to predict patients' functional outcome and symptom severity. It is also noteworthy that, by using either of the two short forms of the SFRT-2, the test administration time was reduced from 15 min (in the original version) to 5 min. The SCOPE study stated that SC instruments with administration times under 10 min are practical and tolerable for participants. Unlike most of the existing SP measures, the short forms of the SFRT-2 provided reliable SP scores in 5 min. Despite the wide variety of SC tests available, administration time is still a challenge for SC assessment. As described by the SCOPE study, the administration times of some SC measures range from 20 to even 35 min, depending on the task (Pinkham et al., 2016). This can reduce the usefulness of the task, as well as making the assessment more unpleasant for the patient.
Taking this into account, the short forms of the SFRT-2 might represent one of the most practical available measures of SP, which could be especially useful in clinical trials or when employing a large neuropsychological battery.
Despite the good psychometric characteristics of the short forms of the SFRT-2, some limitations of the present study merit further discussion. First, although the sample was large for the analyses performed at Time 1, sample size was reduced by half for all tests carried out at Time 2, due to the involvement of some of the patients in a rehabilitation program. Therefore, it would be appropriate to repeat the test-
In conclusion, the short forms of the SFRT-2 seem to be reliable and practical SP measures for assessing this SC domain in patients with schizophrenia. Their psychometric properties, and especially the good test-retest data obtained and the sensitivity to change shown by some of their indices, suggest that they are suitable for inclusion in clinical trials in order to assess SP performance and changes over time. This would contribute to gaining a better understanding of the effectiveness of cognitive interventions and longitudinal studies regarding SP performance in patients with schizophrenia.

CONFLICT OF INTEREST STATEMENT
The authors report no conflicts of interest.

AUTHORSHIP
Ainara Gómez-Gastiasoro, Javier Peña, and Leire Zubiaurre-Elorza have made substantial contributions to the conception and design, acquisition of data, and analysis and interpretation of data. Ainara Gómez-Gastiasoro, Javier Peña, Leire Zubiaurre-Elorza, Naroa Ibarretxe-Bilbao, and Natalia Ojeda have been involved in drafting the manuscript and revising it critically for substantial intellectual content.
All authors have given final approval of the version to be published.