The bumpy road to achieve reliability of clinical profile characteristics in psychosis and related disorders

Abstract

Objectives: Profile characteristics are factors that are relevant for diagnosis, prognosis or treatment. The present study aims to develop a set of clinically relevant profile characteristics and to determine the inter-rater reliability (IRR) of the selected profile characteristics.

Methods: Potential profile characteristics were determined by literature review. IRR was assessed by comparing scores on profile characteristics determined by two researchers. We conducted three subsequent studies: (1) assessment of pre-training IRR, (2) IRR following implementation of an instruction manual, (3) IRR after optimizing scoring methods. IRR was measured with the Intraclass Correlation Coefficient (ICC).

Results: IRR scores of the profile characteristic Illegal activities were high across the three studies (ICC ≥ 0.75). Following training procedures in studies 2 and 3, reliability estimates remained low to moderate (ICC < 0.75) for the profile characteristics Support of relatives, Aggression recent and lifetime, Substance use and Insight recent. IRR scores of the other eight profile characteristics varied from low to high across studies.

Conclusion: IRR scores of profile characteristics were highly variable, and mostly inadequate in all three studies. Consequently, further research should focus on specification of severity scores of profile characteristics, optimizing scoring methods and re-evaluation of IRR.

analysis (Borsboom & Cramer, 2013), the trans-diagnostic approach (Nelson, McGorry, Wichers, Wigman, & Hartmann, 2017) or clinical profiling (Wigman et al., 2013). These novel methods aim to disentangle the diverse presentation and outcome parameters of the disease into specific categories, which are relatively easy to recognize and may form a target for intervention (Kahn & Keefe, 2013).
The ultimate goal is to develop diagnostic tools that can improve prediction of outcome, individualized treatment and prevention of illness progression.
Clinical profiling has been proposed to create a comprehensive set of clinically relevant prognostic factors (Wigman et al., 2013). Profile characteristics may represent current as well as lifetime patient characteristics, relevant for diagnosis and prognosis.
By profiling we may personalize the diagnosis by targeting the large variation in symptomatology, comorbidity and social participation.
Clinical profiling of psychiatric illness may offer various benefits at the level of scientific research and clinical practice, in comparison to other methods that aim to address heterogeneity. First, a set of clinically relevant factors could help clinicians to quickly recognize specific problem areas and to specify their treatment plan according to these profile characteristics. In fact, clinical profiling surpasses the current diagnostic categories and can be utilized in the trans-diagnostic approach. Second, profile characteristics can be easily applied by clinicians at the individual patient level, in comparison to the network approach, which is predominantly applied to group-level symptoms in research (Bos & Wanders, 2016). To perform network analysis at the individual patient level, longitudinal time series of experience sampling data are needed (Yang et al., 2018). Third, profiling in scientific research could provide a more comprehensive and detailed description of patient characteristics, and thereby contribute to the selection and comparison of study populations.
Last, for several of the proposed profile characteristics specific therapeutic interventions are available, such as interventions to enhance social interactions, health education or medication adherence training (Abdel Aziz et al., 2016; Dodell-Feder, Tully, & Hooker, 2015).
To date, a comprehensive overview of profile characteristics for psychosis and related disorders has been lacking. In addition, no studies have yet addressed the inter-rater reliability (IRR) of assessing profile characteristics. Assessment of IRR is an important first step in psychometric evaluation, considering that sufficient reliability of profile characteristics is a prerequisite for any further validity research (Kobak, Kane, Thase, & Nierenberg, 2007). Additionally, poor reliability leads to high variability of data and reduced power to find relevant differences (Perkins, Wyatt, & Bartko, 2000). An important tool to achieve sufficient reliability is training of raters, and this training should include information about profile characteristics, specifications for ratings and feedback on performance by experienced clinicians. Typically, training of raters leads to an increase in reliability estimates (Kobak, Engelhardt, Williams, & Lipsitz, 2004).
Consequently, in the present study we first aim to develop an overview of clinically relevant and feasible profile characteristics, and subsequently determine the IRR of these profile characteristics in patients presenting with recent onset psychotic symptoms.
Third, we set out to evaluate whether training procedures for raters improve IRR. We hypothesize that IRR scores of profile characteristics without training procedures will be insufficient, and training of raters will substantially improve reliability estimates.

| Sample characteristics
Table 1 shows the demographic and clinical characteristics of the separate studies. In total we included 139 patients: 99 patients in study 1, 20 patients in study 2 and another 20 patients in study 3. We included smaller samples in studies 2 and 3 as we only aimed to evaluate the effect of training procedures on IRR. The majority of patients were diagnosed with a schizophrenia spectrum disorder and used antipsychotic medication. No significant differences in age, gender, global assessment of functioning (GAF), use of antipsychotic medication or diagnosis were found between the samples that participated in the three studies.

| Design
The design of the current study was cross-sectional and comprised three sub-studies. We included outpatients from the diagnostic center of the Early Psychosis Unit at the department of psychiatry of the University Medical Center Amsterdam, location Academic Medical Center. Three independent clinicians (authors AT and TJB, and IL, who is named in the acknowledgments) performed inter-rater reliability assessments across the sub-studies.

| Procedure
The main reason for referral to our diagnostic center was recent onset of psychotic symptoms. The diagnostic interview was conducted by a psychiatry resident or a psychologist in training to become a healthcare psychologist, supervised by a clinical psychologist and a psychiatrist. The psychiatrist and clinical psychologist were present during the first 30 min of the interview. After the diagnostic interview a comprehensive formal report was written by the psychiatry resident or psychologist in training, supervised by the psychiatrist or clinical psychologist. This report was written in a predefined format, and included the current psychiatric condition, illness history, information from relatives and other relevant information for diagnosis and treatment. It was a comprehensive document of approximately four to six pages. This formal report based on the diagnostic interview was used for IRR purposes.

| Profile characteristics
We selected clinically relevant profile characteristics based on an extensive review performed by the committee that developed the clinical guidelines for the diagnosis and treatment of psychotic disorders in the Netherlands (Trimbos, 2019). To substantiate the latter

| Inter-rater reliability
The general procedure of IRR assessment was as follows: two clinicians independently determined severity scores of profile characteristics by thoroughly evaluating the formal report written after the diagnostic assessment. The clinicians did not see the patient and only had access to the formal report. IRR was assessed by comparing severity scores of profile characteristics determined by both clinicians. We chose this procedure because it is in line with clinical practice, in which clinicians frequently base decisions on written information from colleagues. Moreover, it is in accordance with previous studies concerning reliability of medical diagnosis based on electronic record review (Chung, Chiang, Chou, Chu, & Chang, 2010; Kang et al., 2013; Varmdal et al., 2015). Estimation of IRR by multiple raters who merely observe video fragments may reduce information variance, leading to artificially inflated reliability estimates (Kobak et al., 2004). To evaluate the influence of training procedures on the IRR of profile characteristics, we conducted three subsequent studies.

Study 1 (n = 99)
Pre-training IRR was determined by comparing severity scores of profile characteristics assessed by two independent researchers (A.T. and I.L., mentioned in the acknowledgments). Both researchers had access to the description of the profile characteristics as shown in Table 2; there was no other training or instruction beforehand.

Study 2 (n = 20)
Following findings of insufficient pre-training IRR in study 1, an instruction manual was developed in which specific criteria for each profile characteristic severity score were described with practical examples and instructions. Two researchers (A.T. and T.B.) thoroughly discussed the instruction manual and independently determined profile characteristic scores based on the formal report.

Study 3 (n = 20)
Based on the findings of insufficient IRR of some profile characteristics after introducing an instruction manual and training, we adjusted the instruction manual by extending the criteria for severity scores. Furthermore, we specified the sections of the formal report from which information should be retrieved to determine profile characteristic scores. We also introduced search terms specific to each profile characteristic for exploring the report with the search function, for example "school," "work" or "physical ill." With these adjustments we aimed to ensure that both raters retrieved the same information from the report and based their profile characteristic scores on the same data. Subsequently, two researchers (A.T. and T.B.) independently assessed profile characteristic scores based on a new set of formal reports.
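
As an illustration of how an ICC can be obtained from the paired severity scores of two raters, the following is a minimal sketch in Python. It assumes a two-way random-effects, absolute-agreement, single-rater model, often written ICC(2,1); the ICC form actually used in the three studies is not stated in this section, so this choice is an assumption. The function name `icc_2_1` and the example scores are hypothetical and are not study data.

```python
def icc_2_1(scores):
    """Return ICC(2,1) for a complete table of ratings.

    scores: list of rows, one row per patient, one column per rater.
    Assumes no missing values and at least two patients and two raters.
    """
    n = len(scores)        # number of patients (rows)
    k = len(scores[0])     # number of raters (columns)
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Single-rater, absolute-agreement ICC
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical severity scores (0-3) from two raters for five reports
example_scores = [[0, 0], [1, 2], [2, 2], [3, 2], [1, 1]]
example_icc = icc_2_1(example_scores)
```

Because this form penalizes systematic differences between raters as well as random disagreement, a rater who consistently scores one point higher than a colleague lowers the estimate. A consistency-type ICC would ignore such an offset.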

| Inter-rater reliability
In

| DISCUSSION
In the present study we created a comprehensive set of prognostic factors that were considered clinically relevant and feasible in regular clinical practice. We investigated the IRR of these profile characteristics in patients with recent onset psychotic disorders. In addition, we explored whether training procedures and scoring instructions improved reliability estimates of profile characteristics.
The selected profile characteristics represent important aspects of symptomatology, comorbidity and psycho-social functioning relevant for prognosis and treatment in psychosis and related disorders.
Our findings showed that the majority of IRR scores of profile characteristics were insufficient, and that training procedures provided some improvement in IRR of some profile characteristics.
However, after training, some profile characteristics actually showed reduced IRR.
We found no comparable studies concerning the IRR of profile characteristics; therefore, we are not able to compare our results with previous studies. On the other hand, psychiatric classifications such as schizoaffective disorder or attention deficit disorder also show highly variable reliability estimates (Grabemann et al., 2017; Santelmann, Franklin, Busshoff, & Baethge, 2016). Even beyond the scope of psychiatry, the diagnosis of allergic laryngitis and classification systems for scoring tibia plateau fractures have demonstrated insufficient or highly diverse IRR scores (Millar, Arnold, Thewlis, Fraysse, & Solomon, 2018; Stachler & Dworkin-Valenti, 2017). We therefore conclude that the IRR of profile characteristics, psychiatric classifications and at least certain diagnoses in general medicine is highly variable.
When examining our results in more detail, pre-training IRR of profile characteristics in study 1 was generally inadequate, except for the profile characteristics Involuntary Care and Illegal Activities.
These results are in line with previous research showing insufficient pre-training IRR of observational instruments (Berendsen et al., 2019; Muller & Dragicevic, 2003; Muller et al., 1998). Utilizing profile characteristics or other psychiatric instruments without instruction and training for raters should therefore be avoided.
Following implementation of the instruction manual, four profile characteristics reached sufficient reliability, while IRR scores of the majority of profile characteristics remained poor. After the manual was complemented by optimizing scoring methods and adding more detailed specifications and instructions for the raters, several profile characteristics achieved sufficient IRR scores, whereas the observed IRR scores of eight profile characteristics remained insufficient.
Previous studies concerning training procedures followed by reliability assessment generally show improvement of IRR. For instance, a commonly applied instrument such as the Positive and Negative Syndrome Scale demonstrated a substantial increase in IRR after training (Kobak et al., 2004; Muller et al., 1998). On the other hand, there is conflicting evidence as to whether intensive training procedures for the Hamilton Depression Rating Scale (HDRS) actually lead to improvement of the IRR (Demitrack, Faries, Herrera, DeBrota, & Potter, 1998; Muller & Dragicevic, 2003). Various reasons have been stated for the non-improving reliability estimates of the HDRS after training, for instance rater selection or shortcomings in item content.
Despite great effort, we were not able to achieve sufficient reliability scores for all profile characteristics. We hypothesize that this could be due to two main reasons. To start, although we aimed to set clear specifications of severity scores in the instruction manual, some BERENDSEN ET AL.  (Demitrack et al., 1998).
Another possible cause of the observed low IRR might be that information was too frequently missing from the diagnostic report to adequately determine profile characteristic scores. Raters might have been obliged to select options based on incomplete information. This could have led to increased information variance, which is a major source of low reliability. Nevertheless, our finding of large disagreement among raters when obtaining information from medical records is disturbing, considering the substantial number of studies that are based on collecting information from medical records.
An important question that remains is whether profiling is the best approach to address heterogeneity in psychiatric illness. Profile characteristics are easily applicable for clinicians and not labor intensive. However, it proved difficult to achieve overall reliability in the assessment of the profile characteristics. From that perspective, introduction into clinical practice or scientific research is not warranted at the current time, since insufficient reliability affects study power, placebo response and possibly clinical decision making (Berendsen et al., 2019; Kobak et al., 2010; Perkins et al., 2000).
Consequently, effort should be focused on improving reliability of the assessment of these prognostic factors.
Our study should be viewed in light of several limitations. First, the sample sizes in studies 2 and 3 were too small to permit strong conclusions concerning improvement or deterioration of reliability scores. Therefore, we conducted no formal statistical comparisons between ICC scores of the different studies. We merely assessed whether ICC scores moved between the established categories for interpreting IRR scores as low, moderate or high (Koo & Li, 2016).
Second, we only tested IRR. Evaluating other forms of reliability, such as internal consistency or test-retest agreement, is necessary to complement knowledge concerning the psychometric properties of profile characteristics. Third, we did not conduct a systematic search; therefore, it is possible that not all studies with prognostic factors for patients with psychotic disorders and related illness were included.
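
The category-based comparison described in the first limitation can be made explicit with a small sketch. The high threshold (ICC ≥ 0.75) is the one used in this article. The 0.5 boundary between low and moderate is an assumption, taken from the general guideline of Koo and Li (2016), who label values below 0.5 as poor. The function name `classify_icc` and the example values are hypothetical and are not results from the three studies.

```python
def classify_icc(icc, moderate_cutoff=0.5, high_cutoff=0.75):
    """Map an ICC estimate to a coarse interpretation category.

    high_cutoff follows the ICC >= 0.75 criterion used in this article;
    moderate_cutoff is an assumed boundary, not stated in the text.
    """
    if icc >= high_cutoff:
        return "high"
    if icc >= moderate_cutoff:
        return "moderate"
    return "low"


# Hypothetical ICC estimates for one profile characteristic across studies
example_iccs = {"study 1": 0.42, "study 2": 0.61, "study 3": 0.78}
example_categories = {study: classify_icc(v) for study, v in example_iccs.items()}
```

A comparison of this kind only registers change when an estimate crosses a boundary, so two estimates on either side of 0.75 may differ less than two estimates within the same category.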
We consider it a strength of our study that we created a comprehensive set of clinically relevant profile characteristics.
Another strength of our study is the use of training procedures for raters with repeated reliability assessment.
We found that IRR scores of clinical profile characteristics were highly variable and mostly poor, even after training procedures.
Therefore, we strongly recommend further developing the definitions and training procedures for profile characteristics, and evaluating IRR and other psychometric properties, before clinical profiling is introduced in clinical practice. Furthermore, we recommend that future research explore the relationship between profile characteristics and underlying psychological mechanisms, for instance how the profile characteristic support system may be related to attachment disruptions (Fraley, 2019). Understanding these fundamental mechanisms may support the search for novel targets of therapeutic