Developing and testing inter‐rater reliability of a data collection tool for patient health records on end‐of‐life care of neurological patients in an acute hospital ward

Abstract

Aim: To develop and test a data collection tool, the Neurological End-of-Life Care Assessment Tool (NEOLCAT), for extracting data from patient health records (PHRs) on end-of-life care of neurological patients in an acute hospital ward.

Design: Instrument development and inter-rater reliability (IRR) assessment.

Method: The NEOLCAT was constructed from patient care items obtained from clinical guidelines and literature on end-of-life care. Expert clinicians reviewed the items. Using percentage agreement and Fleiss' kappa, we calculated IRR on 32 nominal items out of 76 items.

Results: The IRR of the NEOLCAT showed 89% (range 83%-95%) overall categorical percentage agreement. The categorical Fleiss' kappa coefficient was 0.84 (range 0.71-0.91). There was fair or moderate agreement on six items, and substantial or almost perfect agreement on 26 items.

Conclusion: The NEOLCAT shows promising psychometric properties for studying clinical components of care of neurological patients at the end of life on an acute hospital ward but could be further developed in future studies.


| INTRODUCTION
Studies investigating end-of-life care of neurological patients highlight that it is a complex subject (Alonso et al., 2016; Hussain et al., 2018). To shed light on current practice concerning end-of-life care of neurological patients in an acute neurological ward, we decided to retrospectively retrieve data from patient health records (PHRs). The aim of this paper is to describe the design and inter-rater reliability testing of a data collection tool to map end-of-life care for neurological patients in the last 7 days of their life. The last 7 days were selected because, within that time frame, indicators of impending death of neurological patients could be expected to have manifested and been noted (Hui et al., 2015). Designing the data collection tool is part of a larger study of diagnosing dying among neurological patients in an acute hospital ward.

| BACKGROUND
The final stage of palliative care, or end-of-life care, refers to healthcare provided in the last weeks and months of life. End-of-life care is given to patients when medical treatment to cure their disease has been brought to an end (Cohen-Mansfield et al., 2018). The ultimate goal is to relieve suffering, for instance by optimizing pain management and providing psychological and social support to assist patients and their next-of-kin in managing the physical, emotional, social, and spiritual burden of imminent death (Huskamp et al., 2012).
High quality end-of-life care requires communication between the patient, the healthcare professionals involved in the care, and relatives, in order to create a shared understanding of the person's values and treatment preferences (Fernando & Hughes, 2019; Gonella et al., 2020; Sinuff et al., 2015; Taffurelli et al., 2020). This communication ultimately leads to an individualized care plan that is consistent with the patient's values and needs, considering what treatments, including assessments and interventions, will or will not be used to manage the symptoms of the life-threatening disease (Sinuff et al., 2015). Such individualized care, based on highly tuned clinical judgement, using clinical indicators along with experience and clinical wisdom, is paramount for the right course of action in end-of-life care (Kennedy et al., 2014).
To shed light on current practice in relation to end-of-life care of neurological patients, a retrospective design is attractive. The main strength of this non-intrusive research design is that patients and relatives are not burdened by data collection at this delicate point in their lives (Kaji et al., 2014).
PHRs are widely used in retrospective studies. The records can be in both electronic and paper form (Gregory & Radovinsky, 2012). PHRs contain already documented information, which is intended for patient care, not research purposes.

However, there are multiple advantages of using PHRs to obtain research data. The most prominent one is that they can provide a large amount of clinically significant data at relatively little cost, without taking up patients' time or disturbing patients in any way (Gregory & Radovinsky, 2012; Kaji et al., 2014). PHRs in electronic form (electronic health records, EHRs) are more reliable than paper records, as they are stored permanently and there is less chance of losing data. The shortcomings of retrospective studies of PHRs are that the data are already collected and unchangeable, so the researcher cannot influence what information was documented; there may be incomplete or missing data; specific patient information may be lacking; and there may be difficulties in interpreting or verifying the information (Feder, 2017; Fortney & Steward, 2015; Jansen et al., 2005). There is also variability in the quality of how healthcare professionals originally documented patient information in the PHRs (Cassidy et al., 2002; Gianinazzi et al., 2015; Gregory & Radovinsky, 2012; Kaji et al., 2014).
With regard to extracting the data, the data abstractors may read, interpret, code, and transcribe what is written in the patients' records differently from what was initially intended, resulting in low sensitivity and specificity of the data collected (Jansen et al., 2005; Kaji et al., 2014). The use of standardized data collection methods or guidelines to ensure consistent collection of data enhances the quality of the data and minimizes bias (Gregory & Radovinsky, 2012; Jansen et al., 2005; Kaji et al., 2014). Reporting on how inter-rater reliability (IRR) in data collection is achieved shows transparency and may avoid inconsistencies in extraction of the data both within and between the data abstractors (Alonso et al., 2016; Cox et al., 2011; Fortney & Steward, 2015; Kaji et al., 2014). Reports on IRR with percentage agreements and kappa calculations from studies where data have been collected from PHRs have shown the data collection tools to be reliable, with high sensitivity and specificity (Gianinazzi et al., 2015; Ntlholang et al., 2016; Yawn & Wollan, 2005).

Several studies of patients with various neurological diseases have been conducted to improve care and provide evidence for the need of person-centered end-of-life care, with some studies using retrospective PHRs (Alonso et al., 2016; Cheng et al., 2017; Hussain et al., 2018; Liu et al., 2017; Munroe et al., 2007; Ntlholang et al., 2016; Quadri et al., 2018; Wang et al., 2018; Williams et al., 2019). These studies provide limited information on the data collection tools that were used, their psychometric properties, and the training of data abstractors.
Systematically scrutinizing the literature generated a list of possible items, which was compiled in an Excel file. Of note is the emphasis placed on communication with relatives and on spirituality. Although these are important aspects of end-of-life care for neurological patients, the format for documenting them in the EHR has yet to be developed, which made it unlikely that they would be identified and extracted (Forde-Johnston et al., 2022; Sjöberg et al., 2021). The spectrum of these items in the NEOLCAT is therefore limited.

| Expert advice
The second phase consisted of an iterative process of developing the item list further to its completion. This included conversations with professional experts in palliative care, neurology, and health informatics to gain consensus on the items that would finally make up the NEOLCAT. Over several formal and informal meetings, the item list was refined based on the experts' comments and suggestions and finally approved. The NEOLCAT has both objective and subjective items, and the wording of each item was carefully studied.

| Ensuring and reporting the inter-rater reliability of the NEOLCAT
The third phase entailed ensuring a consistent data collection procedure and testing the IRR of the NEOLCAT. A coding manual was developed with definitions of the items in the NEOLCAT and instructions on how to locate each item in the PHRs. The coding manual was made as accurate as possible, with guiding images showing how and where the data abstractors could find the exact information. An example of a definition in the coding manual is given in Figure 1.
The data abstractor team consisted of two students, in medicine and psychology, and one registered nurse. They underwent individualized training as data abstractors, which consisted of a detailed presentation of the study and instruction in using the coding manual and extracting the data. Following this, the abstractor team met formally twice with the primary researcher (GJ) for co-ordination.
Face-to-face contact between the researcher and each of the abstractors continued with numerous meetings during the data collection period. During this training period, inter-rater reliability was calculated three times, and the mismatches found were explored and corrected. This allowed for discussion of ways of handling conflicting data and for clearing up misunderstandings, which was key to facilitating and gaining a shared understanding of each of the items in the NEOLCAT. An example of an issue that needed to be resolved was misunderstandings about the dates of radiological examinations: the study is limited to the last 7 days of life, but on some occasions the data abstractors wrongly used information documented over a longer range than 7 days. This was corrected after meetings with the researcher. The average time for extracting data from one patient's EHR was about 60 min.

| Testing inter-rater reliability
Inter-rater reliability is defined as the degree of agreement or consensus between raters; in this case, the consistency with which the data abstractors extracted data with the NEOLCAT (Landis & Koch, 1977; LeBreton & Senter, 2008; McHugh, 2012; Polit & Beck, 2017). It was established by regular communication and meetings between the researchers and data abstractors to discuss and handle conflicting data from the patient records, and evaluated by calculating percentage agreement and the Fleiss' kappa coefficient. The Fleiss' kappa coefficient was categorized as fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), or almost perfect agreement (0.81-1.00) (Landis & Koch, 1977). High percentage agreement should be consistent with a high Fleiss' kappa coefficient; it is therefore important to present both calculations (Feinstein & Cicchetti, 1990; To et al., 2008).
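For reference, Fleiss' kappa can be stated as follows (notation ours, following Fleiss' standard formulation): with $N$ subjects, $n$ raters per subject, $k$ nominal categories, and $n_{ij}$ the number of raters assigning subject $i$ to category $j$,

```latex
\bar{P} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^{2}-n\right),
\qquad
\bar{P}_{e} = \sum_{j=1}^{k} p_{j}^{2},
\quad
p_{j}=\frac{1}{Nn}\sum_{i=1}^{N} n_{ij},
\qquad
\kappa = \frac{\bar{P}-\bar{P}_{e}}{1-\bar{P}_{e}}
```

Here $\bar{P}$ is the mean observed agreement across subjects and $\bar{P}_{e}$ the agreement expected by chance; $\kappa = 1$ indicates perfect agreement and $\kappa \le 0$ agreement no better than chance.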
Percentage agreement was calculated without delay during the study period, as soon as the three abstractors had finished abstracting the same PHRs. All three data abstractors extracted data from eight randomly selected PHRs, after which the percentage agreement and Fleiss' kappa were calculated, taking into account agreement occurring by chance (Cassidy et al., 2002; Feinstein & Cicchetti, 1990; Landis & Koch, 1977; Polit & Beck, 2017; To et al., 2008; Zaiontz, 2019). A percentage agreement under 80% was given special consideration, and action was taken to rectify data extraction on that item. Disagreement decreased substantially after issues with data extraction from the first three PHRs had been resolved. The data gained in the last round of rectifying the data extraction are used in this paper.

FIGURE 1 Definition of fatigue. An example from the coding manual.
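As a minimal sketch of the two calculations, the following Python functions compute all-rater percentage agreement and Fleiss' kappa from a subjects-by-categories count table. The ratings shown are hypothetical (three raters, a binary item, eight records), not the NEOLCAT data:

```python
def percentage_agreement(table):
    """Share of subjects on which all raters chose the same category.

    table[i][j] = number of raters who assigned subject i to category j;
    every row must sum to the same number of raters.
    """
    n_raters = sum(table[0])
    return sum(1 for row in table if max(row) == n_raters) / len(table)


def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories count table."""
    N = len(table)      # number of subjects rated
    n = sum(table[0])   # raters per subject
    k = len(table[0])   # number of nominal categories
    # Mean observed agreement across subjects
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in table) / N
    # Chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 3 abstractors rating a binary item
# (e.g. symptom documented yes/no) on 8 randomly selected PHRs.
ratings = [[3, 0], [3, 0], [2, 1], [0, 3], [3, 0], [0, 3], [1, 2], [3, 0]]
print(percentage_agreement(ratings))    # 0.75 (full agreement on 6 of 8 records)
print(round(fleiss_kappa(ratings), 2))  # 0.64, substantial on the Landis-Koch scale
```

Note how chance correction separates the two measures: 75% raw agreement here corresponds to a kappa of only about 0.64, which is why both values are reported.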
With multiple data abstractors, it is possible that IRR can be compromised (Gianinazzi et al., 2015). Gregory and Radovinsky (2012) highlighted that to obtain high IRR it is necessary to conduct continuous monitoring and periodic reviews, to report in detail on limitations encountered through the data abstraction process, and to explain how these limitations were addressed.

| RESULTS
We report the average percentage agreement and average Fleiss' kappa coefficient for the 32 items of the NEOLCAT, which belong to six categories of one to 16 items each (see Table 2; see Supplement 1 for all categories and items). This is followed by the congruence between percentage agreement and Fleiss' kappa for the NEOLCAT items, classified into fair, moderate, substantial, and almost perfect agreement.

| Percentage agreement
Percentage agreement on the 32 items of the NEOLCAT varied from 71% to 100%, with 89% overall percentage agreement (see Table 3). The percentage agreement of the six categories ranged between 83% and 95%: the category Health professionals' contribution was lowest (83%) and Medically invasive and other treatments highest (95%). The range of percentage agreement was widest in the category Clinical signs and symptoms (71%-100%).

| Fleiss' kappa coefficient
The Fleiss' kappa coefficient of the 32 items varied between 0.24 and 1.0, with an overall coefficient of 0.84 (see Table 3).

| Congruence between percentage agreement and Fleiss' kappa coefficient
Congruence between the percentage agreement and the Fleiss' kappa coefficient is presented for all 32 items.

| Moderate congruence
There was moderate agreement for four items: consciousness, dyspnea, blood tests at 48 h, and mobility in the 3-7 days before death. The percentage agreement ranged from 71% to 83%. The kappa coefficient was reasonably consistent, ranging from 0.50 to 0.60.

| Substantial congruence
There was substantial agreement for 11 items, among them: progression of disease both at 3-7 days and at 48 h before death, consciousness in 3-7 days, dyspnea and nausea in 3-7 days, vital signs in 3-7 days, and urine tests.

| DISCUSSION
There was fair congruence on two items and moderate congruence on four. Incongruence could be expected on a few items and can be explained by a lack of rigour in interpreting clinical information in free text and by difficulties in locating items in the PHRs.
The category with the lowest percentage agreement had only one item, and its data were to be extracted from free text. Items of the category Laboratory and other tests undertaken showed acceptable but lower percentage agreement and Fleiss' kappa coefficients than could be expected, since the interpretation required is minimal. These data were, however, found in free text rather than at a fixed location in the PHR, which may explain the low values.
Several of the items in the category Clinical signs and symptoms (12 items) required interpretation of free text, yet had acceptable values. It is concluded that, overall, the IRR was acceptable.
Low percentage agreement and Fleiss' kappa coefficients can to a great extent be explained by a lack of rigour in the coding manual's guidelines on how to interpret clinical information and where to locate data. The coding manual therefore needs revision of those items.
In this study, it was not possible to extract data from the PHRs of all patients simultaneously from the data warehouse. For that to be possible, the PHR needs a more rigorous structure, built into the coding system. We therefore manually extracted data for one patient at a time, which is both time-consuming and increases the likelihood of mistakes. An important implication of this study is, therefore, the need for improvements in how the PHR is structured and coded.
The general assumption is that end-of-life care involves teamwork (Fernando & Hughes, 2019; Kennedy et al., 2014).

| Strengths and limitations
It is a considerable strength of the study to have substantive literature for item selection and expert advice from different healthcare professionals contributing to the design of the NEOLCAT. There were, however, only a couple of instruments on which to ground the NEOLCAT. Another strength is the rigour of the data extraction process, the training of the data abstractors, and the use of a coding manual.
There are inbuilt limitations to data collection tools aimed at researching PHRs, most importantly that the research data are unchangeable and intended for patient care, not research purposes. The lack of structure and coding possibilities in the PHR is a limitation as well. Items on communication with patient and family, and on spirituality, are incomplete in the NEOLCAT. There should be a strong focus on these issues in future development of the NEOLCAT.

| CONCLUSIONS
The NEOLCAT data collection tool, which was developed to extract data from PHRs on end-of-life care of neurological patients in an acute hospital ward, shows promising psychometric properties but could be further developed in future studies.

AUTHOR CONTRIBUTIONS
GJ, EH, and HJ were responsible for the study conception and design. RV and GBT supervised the statistical analysis. GJ, EH, VS, AT, RV, GBT, and HJ worked collectively on interpretation of data. GJ was responsible for drafting the manuscript and all the co-authors provided ideas for improvement.

ACKNOWLEDGEMENTS
First and foremost, we thank the data abstractors Andrea Jona Eggertsdottir, Berglind Osk Olafsdottir and Mona Sif Hadaya. We also owe gratitude to Marianne E. Klinke for important insights into designing and developing the DCT. The manuscript has been proofread by Anchor English.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

FUNDING INFORMATION
This study was supported by grants received by GJ from the following institutions: The Icelandic Nurses' Association; The University of Iceland; and Landspitali, The National University Hospital of Iceland. The funding sources were not involved in the data analysis and interpretation of results.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

TABLE 3 Inter-rater reliability.