Stability, reliability, and validity of the THINC‐it screening tool for cognitive impairment in depression: A psychometric exploration in healthy volunteers

Abstract

Objectives: There is a need for a brief, reliable, valid, and sensitive tool for screening cognitive deficits in patients with major depressive disorder (MDD). This paper examines the psychometric characteristics of THINC-it, a cognitive assessment tool composed of four objective measures of cognition and a self-rated assessment, in subjects without mental disorders.

Methods: N = 100 healthy controls with no current or past history of depression were tested on four sequential assessments to examine the temporal stability, reliability, and convergent validity of the THINC-it tests. We examined temporal reliability across 1 week and stability via three consecutive assessments. Consistency of assessment by the study rater (intrarater reliability) was calculated using data from the second and third of these consecutive assessments.

Results: Test-retest reliability correlations varied between Pearson's r = 0.75 and 0.8, and intrarater reliability between 0.7 and 0.93. Stability analysis for the primary measure of each test yielded within-subject standard deviation values between 5.9 and 11.23 for accuracy measures and between 0.735 and 17.3 seconds for latency measures. Convergent validity was in the acceptable range for three tasks but low for the Symbol Check task.

Conclusions: The analysis shows high levels of reliability and stability. Levels of convergent validity were modest but acceptable for all but one test.


Funding information
H. Lundbeck A/S
| INTRODUCTION

Cognitive difficulties are a well-established feature of major depressive disorder (MDD; Rock, Roiser, Riedel, & Blackwell, 2014). These cognitive difficulties have been shown to be present at the first episode of depression (Lee, Hermens, Porter, & Redoblado-Hodge, 2012), and a significant number of patients continue to experience difficulties between depressive episodes (Conradi, Ormel, & de Jonge, 2011; Roca et al., 2015). Clinical research has yielded various candidate measures for assessing, evaluating, and detecting cognitive difficulties in patients with MDD (see Harrison, Lam, Baune, & McIntyre, 2016). This literature has been helpful in identifying the cognitive domains in which depression-associated impairment is observed, as well as the magnitude of these effects. It has also informed the selection of cognitive assessments appropriate for screening for cognitive difficulties in patients with MDD.
Routine screening for cognitive deficits in patients with depression remains rare unless other, nonmood diagnoses are contemplated (McAllister-Williams et al., 2017); this is especially the case in older patients, in whom dementia might be suspected. When cognitive performance is assessed, it is typically with brief, portmanteau tests such as the Mini-Mental State Examination and the Montreal Cognitive Assessment (Folstein, Folstein, & McHugh, 1975; Nasreddine et al., 2005). Both measures have a useful role to play as brief, bedside tests of global cognitive function, but they are very unlikely to detect the types and severity of cognitive dysfunction seen in MDD. Standardized tests for measuring cognitive difficulties in patients with MDD would offer health care professionals a further option for assessment. The demands of contemporary patient care require that screening and the measurement of change in patients' status be conducted with reliable, sensitive, and valid measures (Harrison, 2016).

Other computerized cognitive measures have been employed to assess the cognitive performance of patients with MDD. For example, the Cambridge Neuropsychological Test Automated Battery (CANTAB) system has been employed in several studies, the results of which were recently reported in a meta-analysis (Rock et al., 2014). Additionally, both the CogState system (Harrison & Maruff, 2008) and the CNS Vital Signs assessment (Gualtieri & Johnson, 2006) have been employed to investigate cognitive function in depression. These systems contain computerized paradigms designed to index key cognitive domains. However, a challenge to employing one of these proprietary testing platforms is task duration. Potential clinical users informed us that a brief assessment was required, and candidate tests of, for example, executive function can be rather lengthy. The Groton Maze Learning Test requires at least 7 minutes (https://www.cogstate.com/cognitive-tests/groton-maze-learning/), and CANTAB measures such as the "One Touch Stockings of Cambridge" test require at least 10 minutes to administer (http://www.cambridgecognition.com/cantab/cognitive-tests/executive-function/one-touch-stockings-of-cambridge-ots/). A further issue militating against the use of these systems was our need to provide the tests free of charge.
Recently, we have developed and field-tested a novel screening tool for health care professionals called THINC-it that assesses key domains of function known to be compromised in patients with MDD. THINC-it is a digital, gamified cognitive assessment tool developed by the THINC Task Force (http://thinc.progress.im), composed of experts in psychology, psychiatry, primary care, psychometrics, neuroscience, and scale development. The THINC-it tool is accessed via computers/tablets and is composed of well-known cognitive paradigms. The selected paradigms were chosen on the basis of their prior use with patients with MDD and their brevity. A further selection principle was to employ paradigms that are acknowledged to index performance in the key cognitive areas of working memory, attention, and executive function. The "One-Back" paradigm (Kirchner, 1958) was selected as a measure of working memory ("Symbol Check"); Choice Reaction Time (Donders, 1969) as the measure of attention ("Spotter"); and Part B of the Trail Making Test (Strauss, Sherman, & Spreen, 2006) as a measure of executive function ("Trails"). In addition to these paradigms, it was also decided to include a computerized variant ("Codebreaker") of the Digit Symbol Substitution Test (DSST) paradigm (Lezak, 1995). Competent DSST performance is considered dependent on the functional integrity of various cognitive skills, including working memory, attention, and executive function (Harrison, Lophaven, & Olsen, 2016). In addition to these four objective measures of cognition, THINC-it also includes the Perceived Deficits Questionnaire-Depression, 5-item (PDQ-5-D) as a subjective measure of cognitive function. This measure asks the patient to rate his/her performance regarding attention/concentration, planning/organization, and retrospective and prospective memory (Lam et al., 2013; Lovera et al., 2006). Each of the THINC-it assessments has been employed in studies of patients with MDD.

| METHODS

A full description of the study methodology is available at https://clinicaltrials.gov/ct2/show/NCT02508493. Briefly, the first study visit was planned to allow for correlations between THINC-it and comparison tests to determine levels of concurrent validity. Calculations of within-subject standard deviation (WSD) across all four THINC-it assessments allowed for estimates of test stability to be made. Correlations between Visit 1 Assessments 2 and 3 were designed to determine the levels of intrarater reliability. Study Visit 2 was incorporated to provide estimates of temporal reliability, determined by correlating performance across the two visits.
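For clarity, the following minimal sketch (Python) illustrates how the stability and reliability statistics described above can be computed; the array layout and values are illustrative placeholders, not study data.

```python
import numpy as np
from scipy import stats

# Placeholder layout: one row per subject, one column per administration
# of a single THINC-it measure (Visit 1 Assessments 1-3, then Visit 2).
rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=(100, 4))

# Stability: within-subject standard deviation (WSD), here computed as the
# square root of the mean within-subject variance across the three
# consecutive Visit 1 administrations.
wsd = np.sqrt(scores[:, :3].var(axis=1, ddof=1).mean())

# Intrarater reliability: Pearson's r between Visit 1 Assessments 2 and 3.
r_intra, _ = stats.pearsonr(scores[:, 1], scores[:, 2])

# Temporal (test-retest) reliability: Visit 1 Assessment 1 vs. Visit 2.
r_retest, _ = stats.pearsonr(scores[:, 0], scores[:, 3])

print(f"WSD = {wsd:.2f}, intrarater r = {r_intra:.2f}, test-retest r = {r_retest:.2f}")
```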

| Study participants
The N = 100 subjects were recruited via social media, responding to announcements on www.kijiji.com that sought healthy controls willing to participate in a study evaluating cognitive function. All subjects were carefully screened for current and past mental disorder using the Mini International Neuropsychiatric Interview for Diagnostic and Statistical Manual of Mental Disorders (Sheehan et al., 2006). Subjects were included if they had neither a current nor a past mental disorder and no first-degree relative with an established lifetime diagnosis of a mood or other psychiatric disorder. Exclusion criteria were (a) unstable medical disorder(s); (b) any medication that, in the opinion of the investigator, may have affected cognitive function (e.g., corticosteroids and beta-blockers); and (c) consumption of alcohol within 8 hr prior to THINC-it tool administration. Participants received financial compensation of 50.00 CAD per visit.

| Assessments
The assessment instruments comprised the iPad version of the THINC-it tool (i.e., Spotter, Symbol Check, Codebreaker, Trails, and PDQ-5-D); the Identification Task and One-Back Memory (OBK) task from the CogState battery; and the pencil-and-paper versions of the DSST, Trail Making Test Part B (TMT-B), and PDQ-5-D. The National Adult Reading Test-Revised was also included as an estimate of premorbid IQ (Nelson, 1982). For each THINC-it test, a primary measure was selected for analysis; we selected measures analogous to those most typically chosen for the CogState and paper-and-pencil measures. These measures are reported in Table 1. THINC-it takes approximately 15 min to administer, and its instructions require only minimal education (i.e., Grade 6).

| Procedures
The order of administration of the THINC-it tool and the comparison tests (the two CogState tasks and the pencil-and-paper test versions) was alternated between study participants to account for potential order effects. The THINC-it component tests were administered in the same sequence for all study participants: Spotter, Symbol Check, Codebreaker, Trails, and PDQ-5-D. The comparison tests were administered in the following order: Identification Task, OBK, DSST, TMT-B, and PDQ-5-D. All participants completed the full set of cognitive assessments (i.e., THINC-it tool, CogState, and pencil-and-paper tests) on four occasions: three times during the first visit and once during the second visit (see Figure 1 for details of the study flow).
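As an illustration only, this counterbalancing can be expressed as a simple assignment rule; the even/odd scheme below is our assumption, not the study protocol.

```python
# Sketch of counterbalanced administration order. Within each block the
# test order is fixed; only the block order alternates between participants.
THINC_IT = ["Spotter", "Symbol Check", "Codebreaker", "Trails", "PDQ-5-D"]
COMPARISON = ["Identification Task", "OBK", "DSST", "TMT-B", "PDQ-5-D"]

def battery_order(participant_index: int) -> list[str]:
    # Illustrative alternation rule: even-indexed participants start
    # with THINC-it, odd-indexed participants with the comparison tests.
    if participant_index % 2 == 0:
        return THINC_IT + COMPARISON
    return COMPARISON + THINC_IT

for i in range(2):
    print(i, battery_order(i))
```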

| RESULTS
The demographics and clinical characteristics of our study participants are presented in Table 2. The mean age of the 58 female and 42 male subjects was 40 years (SD = 14.4). The group was relatively well educated and had a National Adult Reading Test-Revised estimate equivalent to a full-scale IQ of 111.87 (SD = 6.63). Data for n = 8 healthy controls who did not complete the THINC-it assessment in its entirety were excluded; the primary reasons were inability or unwillingness to complete the tasks.
Median and mean values for all THINC-it and comparison tasks were found to be similar, and so we report only means and confidence intervals (CIs) based on standard errors of the mean. As the data met the requirements for parametric analysis, we calculated Pearson's r correlations in all cases. These values are reported by Visit and Assessment in Table 3 for each THINC-it test measure and for a composite score of all four THINC-it tasks, together with the same statistics for all four paper-and-pencil versions of the selected paradigms.
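The confidence intervals referred to here follow directly from the standard error of the mean; a minimal sketch (Python; the input data are placeholders):

```python
import numpy as np

def mean_with_ci(x, z: float = 1.96):
    """Mean and 95% confidence interval based on the standard error of
    the mean, as used for the values reported in Table 3."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    sem = x.std(ddof=1) / np.sqrt(x.size)
    return m, (m - z * sem, m + z * sem)

print(mean_with_ci([52.0, 48.5, 50.1, 49.7, 51.3]))
```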
Temporal, or "test-retest," reliability was calculated by correlating Visit 1 Assessment 1 scores with Visit 2 scores, which were separated by 1 week. This analysis yielded Pearson's r correlations for the four objective measures varying between 0.74 and 0.81 (all significant at p < 0.001) and a value of 0.72 for the PDQ-5-D (see Table 4). These correlations were higher when comparisons were made between Visit 1 Assessment 3 and Visit 2, likely reflecting greater performance stability after repeated exposure. We also report the temporal reliability correlation between Visit 1 Assessments 2 and 3.
Temporal reliability correlations for a THINC-it composite score are also reported in Table 4. Kline (2000, p. 28) suggests 0.7 as the minimum acceptable level of internal consistency, and this was observed for the self-report PDQ-5-D questionnaire. The observed Cronbach's alpha of 0.76 is also below the level at which a test might be considered a "bloated specific," which can occur when the test items tend to measure essentially the same construct.
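For reference, Cronbach's alpha for a multi-item questionnaire such as the PDQ-5-D can be computed as follows (a sketch; the response matrix is a placeholder, not study data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an items matrix with one row per respondent
    and one column per questionnaire item (e.g., the five PDQ-5-D items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Placeholder responses: 10 respondents x 5 items, rated 1-5.
rng = np.random.default_rng(1)
print(cronbach_alpha(rng.integers(1, 6, size=(10, 5))))
```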
| DISCUSSION

Although the primary ambition in developing THINC-it is to provide a cognition screening instrument, we have also sought to imbue the system with test characteristics that would facilitate the evaluation of cognitive change in group studies, and potentially also in individuals. Such an approach has been advocated for some time (Harrison & Maruff, 2008), and emphasis has been placed on the need for reliable measures (Harrison, 2016). This methodology relies on the calculation of a Reliable Change Index (RCI). A variety of methods have been proposed for determining RCI values, and most rely on measures of temporal reliability (Jacobson & Truax, 1991).

Convergent validity was acceptable for three of the four objective THINC-it tests but low for the Symbol Check task. One possible reason for this is the difference in task demands. The typical One-Back Task requires a binary "yes" or "no" decision and response depending on whether the current stimulus is the same as that presented on the previous trial. Symbol Check requires the study participant to respond by touching the previous symbol, a choice among five possibilities. This typically requires study participants to rapidly switch their attention between the stimulus sequence and the response options. In contrast, the traditional binary decision version does not typically require visual attention to the possible responses.
It seems likely that Symbol Check taxes attentional and executive resources to a greater degree than traditional versions of the One-Back paradigm. The relative lack of convergent validity suggests that this task is not a robust proxy for the standard One-Back Task, exemplified in our study by the CogState version.
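For illustration, the Jacobson and Truax (1991) formulation of the RCI mentioned above can be computed as follows; the inputs are placeholders, and the |RCI| > 1.96 criterion noted in the comments is the convention usually paired with this method.

```python
import math

def reliable_change_index(x1: float, x2: float, sd_baseline: float, r_xx: float) -> float:
    """Jacobson & Truax (1991) RCI: the change score divided by the
    standard error of the difference, derived from the baseline SD and
    the measure's test-retest reliability (r_xx)."""
    sem = sd_baseline * math.sqrt(1 - r_xx)   # standard error of measurement
    s_diff = math.sqrt(2 * sem ** 2)          # SE of a difference score
    return (x2 - x1) / s_diff

# Placeholder example: retest reliability of 0.8 (within the range reported
# above) and an illustrative baseline SD of 10; |RCI| > 1.96 is the usual
# criterion for reliable change.
print(reliable_change_index(x1=50, x2=58, sd_baseline=10, r_xx=0.8))  # ~1.26
```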
A third element of test reliability in our study was our investigation of intrarater reliability. This analysis yielded reliability scores for all four tests exceeding the usual minimum acceptable level of 0.7, varying between 0.7 for the Trails test and a high of 0.93 for the Spotter task.
A possible limitation of the study is that the volunteer cohort was well educated (mean years of education = 16.3, SD = 2.7), with a mean estimated full-scale IQ almost 12 points above the population mean; performance on cognition tests is influenced by both factors. For example, Mitrushina et al. (2005, p. 653) suggest subtracting 6.45 seconds from standard TMT-B norm values for every year of education beyond 14 years. In our study, the mean TMT-B score on first assessment was 78 seconds. The meta-analysis of TMT-B performance by age provided by Mitrushina et al. (2005) reports mean scores ranging from 54 seconds for 16- to 29-year-old study participants to 105 seconds at the top of our age range. The 40- to 44-year-old cohort is the closest to the mean age of our sample (mean age = 40, SD = 14.38), and for this age group the reported mean score is 65 seconds, substantially faster than our group's 78 seconds. However, it must be recalled that in our study Part A of the TMT was not administered; it seems likely that completing Part A has a facilitative effect on Part B completion, and this may account for the observed difference.

A further issue to be taken into consideration is that study participants were recruited using social media. While social media platforms are commonly used by all sections of the general public, respondents may be among those individuals who are most comfortable with digital technology.
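To make the size of this education adjustment concrete, a small worked example under the norms just cited (illustrative arithmetic only):

```python
# Worked example of the education adjustment described above, using the
# Mitrushina et al. (2005) norm for 40- to 44-year-olds and the cohort
# values reported in the text.
norm_40_44 = 65.0       # mean TMT-B completion time (seconds) for 40-44 year-olds
mean_yoe = 16.3         # mean years of education in this cohort
adjustment = 6.45 * max(0.0, mean_yoe - 14)   # seconds subtracted per extra year
expected = norm_40_44 - adjustment
print(f"education-adjusted expectation ~ {expected:.1f} s vs. observed 78 s")
# ~50.2 s, which makes the observed 78 s mean look slower still; the absence
# of TMT Part A practice is one candidate explanation offered in the text.
```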
In summary, this validation study of THINC-it has shown the selected measures to be temporally reliable and stable, to show high levels of intrarater reliability, and, with the exception of Symbol Check, to exhibit expected levels of convergent validity. These observations support the use of THINC-it as a brief cognitive testing system with the potential to be employed as a robust measure of cognitive change.

DECLARATION OF INTEREST STATEMENT
In addition to receipt of funding from Lundbeck for the present study,