Interrater agreement in dementia diagnosis: A systematic review and meta‐analysis

Dementia remains a clinical diagnosis with a degree of subjective assessment and potential for interrater disagreement. We described interrater agreement of clinical dementia diagnosis for various diagnostic criteria.

A variety of validated diagnostic criteria are available. Many of these have been superseded by newer diagnostic criteria, but others, established by different organisations, are in concurrent use. There is typically some uncertainty in the diagnosis of dementia and its subtypes, which is reflected in the fact that many diagnostic criteria for dementia subtypes allow 'probable' and 'possible' diagnoses depending on how closely the patient's symptoms conform to the archetypal case.
These diagnostic criteria are often used and referred to as reference standard (or gold standard) tests. In particular, they often serve as reference standards in studies of the diagnostic test accuracy (DTA) of neuropsychological assessments 4 and of biomarkers. 3 Although diagnostic criteria seek to operationalize dementia assessment, there remains an element of subjectivity. Thus, a potential source of imperfection in these diagnostic criteria is suboptimal interrater agreement, defined as the degree to which two or more raters make the same diagnosis under similar assessment conditions. 5 Imperfect interrater agreement is an important potential source of error in the diagnostic criteria and could cause several issues. For example, it could lead to inaccurate diagnosis, which in turn leads to inappropriate treatment in clinical practice. It could also cause the estimates from DTA studies to be over- or under-estimated. DTA studies are carried out to determine the accuracy and appropriate thresholds of cognitive assessments and biomarkers, so the implications of biased estimates could be significant. Imperfect agreement could also bias randomised controlled trials (RCTs).
For instance, prevention studies will use dementia incidence as an outcome; imperfect interrater agreement will lead to misclassification, which means that the study could require a larger sample size to show an effect. Another issue arising from imperfect agreement is bias in the estimates of inter-criteria studies. These are studies which assess the agreement between different diagnostic criteria, to determine whether they identify the same groups of patients.
Imperfect interrater agreement could cause the results of these studies to be questionable. Finally, imperfect interrater agreement could also lead to bias in the estimates of prevalence rates of clinical dementia.
A systematic review and meta-analysis has not been previously conducted on this topic. The aim of this systematic review and meta-analysis is to determine the interrater agreement for dementia diagnostic criteria.

| METHODS
Where appropriate, we followed the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA 6 ) best practice guidelines for design, conduct and reporting of the systematic review.

| Search strategy and inclusion criteria
We searched MEDLINE (Ovid), EMBASE (Ovid), PsycINFO (EBSCO) and CINAHL (EBSCO) for relevant studies until April 2020. We created a search syntax based on concepts of dementia and interrater agreement and used a validated search filter for dementia. The full search strategy is provided in Table S1.

All aspects of searching, data extraction and analyses were performed by two raters (EC and EV) working independently. At each stage results were compared and consensus reached, with recourse to a third reviewer if required. Titles and abstracts generated from the search were screened to determine relevance. Full-text articles were screened to determine eligibility. The inclusion/exclusion criteria were: (1) the study measured and quantified interrater agreement for at least one set of clinical dementia diagnostic criteria (note that for this paper, this includes studies with a short (≤2 weeks) time interval between measurements, sometimes called test-retest reliability 7 studies, as long as they used different raters), (2) the dementia subtype(s) studied were clearly stated, (3) the diagnostic criteria used were clearly stated, and (4) the study was published in English. Conference proceedings, clinical guidelines, dissertations, as well as letters and commentaries, were excluded.
Reference lists of included studies were hand-searched for additional eligible studies. Study authors were contacted to obtain the full-text when it was not available.

| Diagnostic criteria for dementia
For all-cause dementia, the diagnostic criteria considered for inclusion were the iterations of the International Classification of Diseases (e.g., ICD-10 8 ), the iterations of the DSM (e.g., DSM-III-R, 9 DSM-IV, 10

| Study quality assessment
Study quality was assessed using criteria based on the Guidelines for Reporting Reliability and Agreement Studies (GRRAS 5 ). There were 10 items, and each study could be awarded a maximum of one point per item. If a study scored 4 or less, study quality was deemed low; between 5 and 7, moderate; and 8-10, high. The items assessed were dementia reference standard criteria, dementia subtype, assessor population, sample size calculation, sampling methods, blinding between raters (rater-rater blinding), whether raters were aware of patients' previous diagnoses (rater-patient blinding), assessment timing, and whether agreement was described with a measure of uncertainty. If enough data were available, a sensitivity analysis limited to studies classified as high quality was planned.
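The banding rule above (score ≤4 low, 5-7 moderate, 8-10 high) can be sketched as a small function; the function name and signature are ours, not from the review.

```python
def grras_quality(score: int) -> str:
    """Map a 0-10 GRRAS-based checklist score to the quality band
    used in the review: <=4 low, 5-7 moderate, 8-10 high."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score <= 4:
        return "low"
    if score <= 7:
        return "moderate"
    return "high"
```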

| Data extraction and statistical analyses
Data extraction was independently undertaken by two authors (EC and EV) according to The Cochrane Handbook guidelines, 22 findings were reported according to PRISMA 6 guidance, and any disagreements were settled by consensus between authors. For study characteristics, we extracted the number of subjects used for the assessment of interrater agreement, number and descriptions of raters, country in which the study was conducted, whether the study was conducted at a single site, dementia type(s) evaluated, diagnostic criteria evaluated, estimated dementia prevalences, study settings, age information, gender, education information, ethnicity, whether any of the raters' diagnoses were made face-to-face with the patient, information on the severity of dementia, and sampling information.
Ethnicity was defined as 'white,' 'non-white' or 'mixed,' where studies reporting a population consisting of at least 80% of the same ethnicity were classified as 'white' or 'non-white.' Studies reporting a population consisting of less than 80% 'white' or 'non-white' were classified as 'mixed.' If ethnicity was not reported for a study, it was defined based on the predominant ethnicity of the country or countries where the study was conducted.
Where reported, information on severity of cognitive decline was also recorded. Studies were labelled 'harder-to-classify' if over two-thirds of patients were reported to belong to categories indicating they were more difficult to classify (i.e., 'possible' diagnoses, mild dementia or MCI). Conversely, studies were labelled 'easier-to-classify' if over two-thirds of patients were in categories indicating they were more straightforward to classify, such as probable dementias, severe dementia and healthy controls. Dementia severity was defined as 'mixed' if less than two-thirds of patients fitted into either of the aforementioned categories. We used reported study information for this where available; if it was not available, we used the mean classifications from the raters in the study.
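The two-thirds labelling rule above can be expressed as a short function; the name, argument layout and counting of categories are our illustrative assumptions.

```python
def severity_label(harder: int, easier: int, total: int) -> str:
    """Label a study's severity mix using the two-thirds rule.

    harder: patients in 'possible', mild dementia or MCI categories
    easier: patients in probable, severe dementia or healthy-control categories
    total:  all patients in the study
    """
    if harder > 2 * total / 3:
        return "harder-to-classify"
    if easier > 2 * total / 3:
        return "easier-to-classify"
    return "mixed"
```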
Agreement was primarily described using the kappa-statistic, a measure of interrater agreement that takes into account the possibility of agreement occurring by chance. For comparisons which were ordinal (e.g., probable vs. possible vs. no dementia) with two raters, we used weighted kappa-statistics with linear weights to account for the ordered structure, where possible. In addition, for studies which supplied sufficient data, we calculated Gwet's AC1 23 and linearly-weighted AC1 (i.e., Gwet's AC2 24 ) statistics using the irrCAC package 25 in R and compared them to the kappa-statistics. For kappa-statistics and AC1/2, we make reference to the classifications according to Altman. 26 Estimates were tabulated by diagnostic criteria, study and comparison being made. Where at least four estimates were available, kappa-statistics were pooled by dementia type, diagnostic criteria and comparison, by fitting random-effects meta-analytic models in R using the metafor 27 and metaviz 28 packages, following the method described by Sun et al. 29 For these pooled estimates we presented both confidence intervals (CIs) and prediction intervals (PIs) 30,31 ; the latter better reflect the variation across different settings in the presence of heterogeneity and indicate the expected estimate for a future study. 30 Where possible, we also repeated these meta-analyses using AC1/2 coefficients with the same method. Any estimates which were not pooled were summarised narratively. If studies reported more than one assessment (e.g., before and after a standardisation meeting), we used the first assessment. If studies did not report CIs, where possible these were calculated in R with the packages fmsb, 32 irr 33 and boot 34 using the standard error, observed agreement or contingency tables.
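As an illustration of the agreement coefficients discussed above, the following sketch computes Cohen's kappa, linearly weighted kappa and Gwet's AC1 from a two-rater contingency table. This is a plain-Python sketch of the standard formulas, not the R/irrCAC code the review actually used; the function name is ours.

```python
def agreement_stats(table):
    """Chance-corrected agreement coefficients from a square k x k
    contingency table (rows: rater 1, columns: rater 2)."""
    k = len(table)
    n = float(sum(sum(row) for row in table))
    p = [[c / n for c in row] for row in table]            # cell proportions
    row = [sum(p[i]) for i in range(k)]                    # rater-1 marginals
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]  # rater-2 marginals

    # Cohen's kappa: observed vs. chance agreement from the marginals
    po = sum(p[i][i] for i in range(k))
    pe = sum(row[i] * col[i] for i in range(k))
    kappa = (po - pe) / (1 - pe)

    # Linearly weighted kappa: partial credit w_ij = 1 - |i - j| / (k - 1)
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    po_w = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    pe_w = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    kappa_w = (po_w - pe_w) / (1 - pe_w)

    # Gwet's AC1: chance agreement based on average category prevalence,
    # which makes it less sensitive to skewed marginals than kappa
    pi = [(row[q] + col[q]) / 2 for q in range(k)]
    pe_g = sum(pq * (1 - pq) for pq in pi) / (k - 1)
    ac1 = (po - pe_g) / (1 - pe_g)
    return kappa, kappa_w, ac1
```

For a 2×2 table the linear weights reduce to the identity, so weighted and unweighted kappa coincide; the coefficients diverge for ordinal comparisons with three or more categories (e.g., probable vs. possible vs. no dementia).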
The proportion of variability that was due to between-study heterogeneity rather than within-study variability was assessed using the I-squared statistic. 22 A 95% CI was given, rather than only a point estimate, since the I-squared statistic is known to be biased unless a large number of studies is available. 35 For the main meta-analysis, we pooled kappa-statistics for the presence or absence of dementia, by dementia type and diagnostic criteria. We also conducted a second meta-analysis looking at agreement for comparisons which take the uncertainty of diagnosis into account. To determine agreement for a given dementia type, we also conducted exploratory meta-analyses for each dementia type regardless of the classification system used. If a sufficient number of studies were available, we planned to perform subgroup analyses for any pooled estimates obtained, by dementia severity category, ethnicity and study setting, as well as meta-regression models to investigate sources of between-study heterogeneity. If possible, publication bias was to be assessed by inspecting funnel plots for asymmetry and using Begg's and Egger's tests. 22 Sensitivity analysis was also planned, to explore the effect of each individual study on the overall pooled estimates.
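The random-effects pooling, I-squared and prediction-interval machinery described above can be sketched as follows. This is a generic DerSimonian-Laird sketch with normal-approximation intervals, not the metafor/Sun et al. implementation used in the review, and the function name is ours.

```python
import math

def random_effects_pool(estimates, variances):
    """DerSimonian-Laird random-effects pooling of study estimates.

    Returns (pooled estimate, I-squared, 95% CI, 95% prediction interval).
    """
    k = len(estimates)
    w = [1 / v for v in variances]                     # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                 # between-study variance
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0  # I-squared

    w_star = [1 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)

    # Prediction interval: adds tau2, so it widens under heterogeneity
    # (a t-distribution is normally used in practice; normal approx. here)
    se_pred = math.sqrt(se ** 2 + tau2)
    pi = (pooled - 1.96 * se_pred, pooled + 1.96 * se_pred)
    return pooled, i2, ci, pi
```

Because the prediction interval incorporates the between-study variance tau-squared, it is never narrower than the CI, which is why it better reflects what to expect in a future study under heterogeneity.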

| Identified studies
A total of 7577 titles were screened (Figure 1). Sixty-nine full texts were assessed for eligibility. Twenty-five of these studies did not assess interrater agreement; 13 did, but not for clinical dementia. Six studies assessed interrater agreement for clinical dementia, but not for one of the specified diagnostic criteria. For two studies, the diagnostic criteria used were not clear, and for one study (Kukull 1990

| Study characteristics
The studies included in the systematic review investigated different dementia subtypes and used various diagnostic criteria for dementia (Table 1 and Table S2). They had a wide range of estimated dementia prevalences (0.03-1.00). Thirteen studies reported sufficient information to estimate dementia severity: of these, five were 'easier-to-classify', seven 'mixed' and one 'harder-to-classify'. Eight studies reported information on education. One reported that 84% of participants had education 'greater than high school', one 'at least high school', six reported mean or median education of at least 10 years, and one reported a mean education of only 4.9 years. Nineteen studies were conducted in predominantly white populations and three in mixed populations. Thirteen studies were conducted in secondary care settings, five in community settings and four in both secondary care and community settings. Fourteen studies reported sampling information: three used random sampling, three stratified, six convenience and two consecutive sampling. Nine studies were conducted at a single site, and 16 studies had more than two raters.

| Study quality
All studies which underwent quality assessment using the GRRAS obtained a score between 2 and 9. Eleven studies were rated as low quality, 10 as moderate and one as high. The assessment of quality is described in Table S3.

| Exploratory analyses
For the exploratory meta-analyses, some studies reported multiple kappa-statistics for a given dementia type because they assessed more than one set of diagnostic criteria. Hence, we conducted separate analyses using the lowest (Figure S1) and highest (Figure S2

| Sensitivity analyses
For the first main analysis using dichotomous classifications (Figure 2), the point estimate of the I-squared statistic suggests that the fraction of the between-study variance due to heterogeneity may be substantial (I-squared = 76); however, the uncertainty is large (95% CI; Figures 3 and 4). The large degree of uncertainty in the I-squared statistic is not surprising, given the small number of studies. 35 We did not perform meta-regression or subgroup analyses to explore sources of heterogeneity for the first meta-analysis, or any publication bias tests, due to the small number of studies. 22 We performed leave-one-out analyses to investigate the influence of each study on the pooled estimates, and these did not change appreciably for either the main analyses (Figures S3, S4 and S5) or the exploratory analyses (Figure S6).
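The leave-one-out procedure described above simply re-pools the estimates after excluding each study in turn; if no single exclusion moves the pooled value appreciably, no single study is driving the result. A minimal sketch (names are ours; `iv_pool` is a plain inverse-variance pool standing in for the full random-effects model):

```python
def iv_pool(estimates, variances):
    """Simple fixed-effect inverse-variance pooled estimate."""
    w = [1 / v for v in variances]
    return sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)

def leave_one_out(estimates, variances, pool=iv_pool):
    """Re-pool after excluding each study in turn to gauge its influence.

    `pool` is any function (estimates, variances) -> point estimate.
    Returns one pooled estimate per excluded study.
    """
    results = []
    for i in range(len(estimates)):
        ests = estimates[:i] + estimates[i + 1:]
        vars_ = variances[:i] + variances[i + 1:]
        results.append(pool(ests, vars_))
    return results
```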

| Narrative synthesis
In this section, we discuss some results (see Table 2

| Strengths and limitations
The main strength of this study is that it is the first systematic review and meta-analysis assessing interrater agreement in clinical dementia diagnosis. We used a comprehensive search strategy, assessed the most recent evidence on interrater agreement for dementia diagnosis, and computed multiple agreement coefficients where possible. This study has some limitations.
Namely, we had to exclude studies not published in English due to resource limitations. Estimates of agreement can vary from one study to another because of differences in study settings and in rater and subject characteristics. Owing to the limited number of studies that assessed agreement for any given criteria and dementia type, we did not have enough information to conduct subgroup analyses or meta-regression to investigate sources of heterogeneity, or to assess publication bias. We were also unable to compute AC1/2 statistics for all studies. Lastly, there was an insufficient number of studies to obtain pooled estimates for some criteria and dementia types.

| Comparison with other studies
An evidence-based review conducted in 2001 60 included some of the studies that were part of our systematic review. However, it was not a systematic review, and it did not attempt meta-analysis.

| Implications for research and clinical practice
The fact that we found a relatively good level of agreement for the pooled measures is probably not surprising, considering that dementia diagnosis leaves less room for subjectivity than other conditions that may be more susceptible to interrater disagreement. Since these criteria are often assumed to be perfect in research, our findings should be interpreted against that assumption. For example, even small levels of misclassification have potentially important implications for studies where the outcome is incident dementia; RCTs, for instance, may require larger sample sizes to compensate. One study 34 investigated the impact of imperfect interrater agreement on trial power for stroke studies, and found that just 5% misclassification in both trial arms could result in a 20% increase in the required sample size. This also has implications for cohort studies investigating conversion rates from MCI to dementia. 65 The diagnostic criteria evaluated in this review are often used as reference tests and assumed to be 100% accurate in DTA studies in order to apply standard methods 66
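The mechanism by which misclassification inflates trial sample sizes can be illustrated with a standard two-proportion sample-size approximation. The numbers below are hypothetical and the model (non-differential misclassification with equal sensitivity and specificity in both arms) is our simplifying assumption; this is not a reproduction of the cited stroke study.

```python
import math

def n_per_arm(p1, p2, z_alpha=1.959964, z_beta=0.841621):
    """Approximate per-arm sample size for comparing two proportions
    (two-sided alpha = 0.05, power = 0.80 by default)."""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

def misclassified(p, sens, spec):
    """Observed event proportion when the true outcome proportion p is
    measured with given sensitivity and specificity (non-differential)."""
    return sens * p + (1 - spec) * (1 - p)

# Hypothetical illustration: 5% incident dementia in controls vs. 3.5%
# under treatment, with diagnosis at 95% sensitivity and specificity.
n_true = n_per_arm(0.05, 0.035)
n_obs = n_per_arm(misclassified(0.05, 0.95, 0.95),
                  misclassified(0.035, 0.95, 0.95))
inflation = n_obs / n_true
```

The observed risk difference shrinks by a factor of (sensitivity + specificity − 1) while the outcome variance grows, so the required sample size can rise sharply even for apparently modest misclassification.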

| CONCLUSION
We found evidence to suggest that the DSM-III-R for dementia has 'moderate-good' agreement, the NINCDS-ADRDA for AD 'good' agreement for presence or absence of dementia and 'moderate-good' when differentiating probable and possible cases, and the ICD-10 for VaD 'good-very good' agreement, according to the Altman 26 classifications. Evidence was more limited for other criteria and dementia subtypes, and only one study was rated as high quality, which suggested that the DSM-5 has 'good' agreement for presence or absence of dementia. We also found some evidence from a smaller study suggesting that the NIA-AA has 'good' agreement for distinguishing dementia, MCI and normal cognition. Few estimates obtained were at the upper end of the 'very good' range (>0.90). Future research should verify these findings, further investigate interrater agreement for clinical dementia (particularly for DLB and FTD, and for newer diagnostic criteria) using multiple measures of agreement, investigate the effect that relatively high (but imperfect) interrater agreement has on studies which assume clinical diagnostic criteria are perfect, and employ methods to adjust for imperfection.