Machine learning‐enabled multitrust audit of stroke comorbidities using natural language processing

With the increasing adoption of electronic records in the health system, machine learning‐enabled techniques offer the opportunity for greater computer‐assisted curation of these data for audit and research purposes. In this project, we evaluate the consistency of traditional curation methods used in routine clinical practice against a new machine learning‐enabled tool, MedCAT, for the extraction of the stroke comorbidities recorded within the UK's Sentinel Stroke National Audit Programme (SSNAP) initiative.


INTRODUCTION
Medical records are a rich source of information, continuously accessed by health care professionals to help care for their patients and community. The benefits of trawling through swathes of medical notes are clear, including understanding the individual in the acute setting; audit and service evaluation [1][2][3]; and identifying patterns embedded in a disease population for research [4][5][6]. With the increasing adoption of electronic records in the health system [7][8][9][10], using computers to analyse all these data has been a common objective [11][12][13]. However, accurate extraction of medical concepts from unstructured data, like free text, requires an understanding of the language used, something that is relatively simple for a human but extremely challenging for a computer.
Over the past decade, advances in a branch of machine learning known as natural language processing (NLP) have enabled the translation of free text into a standardized, structured set of medical terms that can be subsequently analysed by a computer [14]. These tools have the potential to automate and support data collection; however, evaluation with real-world clinical data across medical specialties, such as stroke, has been limited [15]. CogStack is an open-source software ecosystem that incorporates both the structured and unstructured components of the electronic health record (EHR). The MedCAT Toolkit [16] is the component of this ecosystem that extracts medical concepts from free text and maps them onto a standardized clinical vocabulary.

Conventional registries and national audits use standardized case report forms to provide periodic submissions into centralized databases. The Sentinel Stroke National Audit Programme (SSNAP) [17] is a health care quality improvement programme collecting stroke patient data that represent >90% of all cases in England, Wales, and Northern Ireland. With 100,000 stroke cases per annum [17], this is a time-pressured, labour-intensive exercise conducted manually by a team of clinical coders and/or clinicians.
Although manual curation is the current gold standard, these pressures increase the risk of errors [18][19][20] and limit the timeliness of the data to some months after the event, negatively impacting the utility of the collected data.
In this project, we evaluated the manually inputted SSNAP data from three different National Health Service (NHS) hospitals against a manually reviewed sample for the four stroke comorbidities routinely collected as part of the SSNAP initiative. We also trained MedCAT on a set of manually annotated stroke documents, to identify the same four comorbidities, and then applied the model to the inpatient stroke records of three different NHS hospitals, comparing the MedCAT performances against the corresponding manually inputted SSNAP data and the manually reviewed subsample.

MedCAT algorithm training and data
The base MedCAT algorithm was trained in an unsupervised manner on the entire KCH EHR data consisting of more than 18 million documents [16], and received additional training from 301 and 373 annotated documents in endocrinology and cardiology, respectively. For our study, further training on stroke-specific comorbidities was provided through 500 KCH-PRUH annotated documents obtained from 2015 to October 2020, stratified by patient, age, and gender, using the method described in Kraljevic et al. [16]. This only included free-text information documented by clinical staff, and excluded information from other systems like blood results, investigation reports, outpatient letters, and vital observations. MedCAT counted the number of instances a concept was mentioned (e.g., atrial fibrillation [AF]) and generated a total count for each patient episode. This only included references for the presence of the concept relating to a patient. Phrases such as "this patient does not have AF" or "a family history of AF" would not increase the count. Because MedCAT is mapped onto the SNOMED-CT library, counts for child concepts defined by the inbuilt "IS A" hierarchical relationship were merged to reflect the data collected in SSNAP. Table S3 shows the list of SNOMED-CT concepts used to emulate the SSNAP comorbidities.
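As a sketch of this counting step (the annotation format and the concept IDs below are illustrative, not MedCAT's actual output schema), affirmed patient-related mentions can be tallied per episode and rolled up the "IS A" hierarchy to the SSNAP-level parent concept:

```python
from collections import Counter

# Illustrative SNOMED-CT "IS A" roll-up: child concept ID -> SSNAP-level parent.
# The real mapping is given by the concepts listed in Table S3.
IS_A_PARENT = {
    "49436004": "49436004",   # atrial fibrillation -> itself
    "282825002": "49436004",  # a child AF concept -> atrial fibrillation
}

def merge_episode_counts(annotations):
    """Count affirmed concept mentions for one patient episode.

    `annotations` is a list of dicts in an assumed shape:
    {"cui": <concept ID>, "meta": {"presence": bool, "subject": str}}.
    Negated mentions ("does not have AF") and family-history mentions
    are excluded, mirroring the counting rule described in the text.
    """
    counts = Counter()
    for ann in annotations:
        meta = ann["meta"]
        if not meta["presence"] or meta["subject"] != "patient":
            continue  # only affirmed, patient-related mentions are counted
        # Merge child concepts into their SSNAP-level parent.
        parent = IS_A_PARENT.get(ann["cui"], ann["cui"])
        counts[parent] += 1
    return counts
```

In this sketch the roll-up table is a flat dict; in practice the full transitive "IS A" closure from SNOMED-CT would be used.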
The MedCAT concept count was converted to a binary state by applying a threshold, above which a patient would be diagnosed with the comorbidity for the specific admission episode. Two different document periods were examined, based on the recorded admission and discharge timestamps: (i) 12 h prior to admission to 12 h after discharge (admission period) and (ii) January 2015 to 12 h after discharge (2015-to-discharge).
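A minimal sketch of these two steps, assuming admission and discharge timestamps are available as `datetime` objects and that the count threshold is inclusive (the text does not state whether the bound is strict):

```python
from datetime import datetime, timedelta

def in_admission_period(doc_time, admitted, discharged):
    """Admission period: 12 h before admission to 12 h after discharge."""
    return admitted - timedelta(hours=12) <= doc_time <= discharged + timedelta(hours=12)

def has_comorbidity(count, threshold):
    """Binarize the episode-level concept count for a comorbidity."""
    return count >= threshold
```

The 2015-to-discharge period would simply replace the lower bound with a fixed cut-off of January 2015.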

SSNAP data
The SSNAP governing body has released protocols and guidelines for data curation, with each participating site responsible for its own curation of data [17]. Although the data collected for SSNAP have evolved with the changing face of stroke, the data definition for SSNAP remained constant during the period assessed in our study.
Using the local SSNAP data from each hospital, the comorbidities AF, hypertension, congestive cardiac failure (CCF), and diabetes were extracted. SSNAP collects both "atrial fibrillation" and "new atrial fibrillation," where the patient cannot be positive for both labels. For this project, we combined these two groups to represent whether the comorbidity of AF was present for this admission episode and used this to perform the subsequent analyses. Diabetes included both Type 1 and Type 2 diabetes mellitus.
In addition, the stroke type for the admission episode, acute ischaemic stroke (AIS) or primary intracerebral haemorrhage (PIH), was recorded.

Subsample reread (KCH and GSTT): Ground truth
To evaluate the performance of the two auditing methods, a mutual reference dataset was curated to represent the ground truth. Two subsamples of 100 patient episodes each were randomly selected from the KCH and GSTT datasets. A Wilcoxon rank-sum test was used to assess whether there were significant differences between the subsample reread and its parent dataset.
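For illustration, the rank-sum statistic can be computed without a statistics package; the sketch below uses the normal approximation, which is suitable for samples of the size used here (n = 100). Ties share average ranks, and the tie correction to the variance is omitted for simplicity:

```python
import math
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) test, normal approximation."""
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[combined[k][1]] = mid_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])  # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p
```

Identical samples give z = 0 and p ≈ 1; in practice a library routine with exact tie handling would be preferred.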
For each comorbidity, the sensitivity, specificity, precision, negative predictive value, and F1 score were calculated. The level of agreement between the subsample reread and the auditing methods was measured using the Cohen kappa, whereas the McNemar test was used to assess the difference in performance between the auditing methods. All statistical analyses were performed using MATLAB 2020b [22].
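All of these agreement metrics derive from the 2x2 confusion matrix between a method's binary labels and the ground truth. A minimal sketch (assuming every cell of the matrix is non-empty, so no zero-division guards are needed):

```python
def binary_metrics(pred, truth):
    """Agreement metrics between binary predictions and ground-truth labels."""
    tp = sum(p and t for p, t in zip(pred, truth))
    tn = sum(not p and not t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(not p and t for p, t in zip(pred, truth))
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)                       # sensitivity (recall)
    spec = tn / (tn + fp)                       # specificity
    prec = tp / (tp + fp)                       # precision (PPV)
    npv = tn / (tn + fn)                        # negative predictive value
    f1 = 2 * prec * sens / (prec + sens)        # harmonic mean of precision and sensitivity
    po = (tp + tn) / n                          # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement
    kappa = (po - pe) / (1 - pe)                # Cohen kappa
    return {"sensitivity": sens, "specificity": spec, "precision": prec,
            "npv": npv, "f1": f1, "kappa": kappa}
```

The McNemar test would additionally compare the two auditing methods' discordant errors against the same ground truth, which is not shown here.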

RESULTS
The mean ages for the KCH-PRUH and GSTT datasets were 71.5 and 70.9 years, respectively. There was no significant difference between the subsample reread and the respective parent dataset for age, proportion of females, AIS, and comorbidities. The absolute values for each dataset are displayed in Table 1. There was a significantly higher proportion of females at the GSTT site compared with KCH-PRUH (p = 0.045). The prevalence of comorbidities between the KCH-PRUH and GSTT sites was not significantly different except for diabetes mellitus (p = 0.045).
The F1 score is the harmonic mean between the sensitivity and precision (positive predictive value). For the KCH-PRUH and GSTT datasets, compared against SSNAP, MedCAT was able to determine the type of stroke using documents from 2015-to-discharge with peak F1 scores of 0.92 and 0.95, respectively (Table S2).

Comparison A: MedCAT compared against SSNAP
Comparing MedCAT against SSNAP, the performance of MedCAT was similar between the two document inclusion periods, with the peak F1 scores obtained at threshold values between two and eight counts (Table S1). MedCAT's performance as a function of the count threshold for the type of stroke and the four comorbidities is provided in Figures S1 and S2. The corresponding area under the receiver operating characteristic curve plots for the four comorbidities are displayed in Figure S3.
The peak F1 scores for AF and diabetes were obtained using documents from the admission period only, whereas the peak for CCF was obtained using all documents (2015-to-discharge); for hypertension there was no difference between the periods (Table S1). The deterioration in F1 score for AF in the GSTT data compared with the KCH-PRUH data was primarily driven by low precision (i.e., false positives), likely related to the number of acronyms for atrial fibrillation (e.g., "AF," "PAF," "AFib," "atrial fib") not encountered in the training sample.

Comparison B and C: MedCAT/SSNAP compared against the subsample reread (ground truth)
Interrater agreement for the two sites was high, with a Cohen kappa greater than 0.80 for all comorbidities.

DISCUSSION
The SSNAP is a national health care quality improvement programme supporting the delivery of evidence-based care for stroke.
It has helped shape stroke services in the UK by measuring process of care, with data collected under time pressure in a continuous and contemporaneous manner. A hub-and-spoke system exists in the UK, where a hyperacute stroke unit (HASU) provides hyperacute intervention and care to a large area containing multiple smaller stroke units (SUs) that manage longer term rehabilitation needs. With a high annual incidence of stroke, the task of data collection is shared, with each hospital required to submit data for every patient that passes through its unit. In this project, we have examined the four comorbidities recorded by SSNAP over a 15-month period and evaluated the consistency of current audit practices, as well as a new machine learning-enabled method, MedCAT, against a manually reviewed set of patient episodes.
To evaluate an auditing method, a ground truth needs to be specified, and will inevitably require human operator involvement.
Here, we have referred to our "subsample reread" as the ground truth and compared both the MedCAT and SSNAP methods against this dataset. Although this subsample reread is potentially vulnerable to the same risks of human error as the SSNAP method, it was importantly not encumbered by the same time pressures that afflict SSNAP, focussed on comorbidities alone, and was performed entirely by clinicians, who would be better able to interpret the medical vernacular and extract the appropriate concepts. Moreover, the interrater consistency for both sites was strong, with a Cohen kappa greater than 0.80 for all comorbidities.
The KCH and PRUH sites operate both an HASU and an SU, with more than 1800 patient episodes annually between them. There was good consistency between SSNAP and our subsample reread for AF, hypertension, and diabetes. The F1 score for CCF was poor, primarily driven by low sensitivity. This is partly explained by the low prevalence of CCF within the subsample reread, with small numbers of detection errors incurring a large deterioration in performance.
Importantly, the prevalence of heart failure in a nonenriched population is low. This difference may explain the apparent deterioration in F1 score for AF, as the inpatient documentation is likely to have a greater emphasis on rehabilitation requirements rather than the aetiology of the stroke.
The low sensitivity of the SSNAP data compared with the subsample reread at KCH and GSTT would indicate that sufficient information is present within the documents available at each site. The absence of time pressure, and the requirement to extract only the comorbidities rather than all the SSNAP concepts, are likely to have contributed to the greater sensitivity of the subsample reread, especially when a more comprehensive review of the records is required.
MedCAT uses NLP to extract concepts from the free text and maps them onto a standardized clinical vocabulary, SNOMED-CT.
Intuitively, a more accurate picture of patients' stroke risk profiles will be obtained from a review of their entire medical histories.
Clinical teams will review a patient's extensive history to identify potentially relevant stroke risk factors using sources from within and external to the hospital. This phenomenon is demonstrated in the performance of MedCAT, with higher F1 scores obtained when using all available documents (Table S2). However, this needs to be considered against the running costs of a physician's time, and of a potential dedicated full-time curator. Fourth, despite the enriched population, the prevalence of the comorbidities was low, except for hypertension. Consequently, the training of the MedCAT model on the stroke-specific concepts was biased toward those concepts with greater representation within the population. It is unsurprising to see both AF and CCF perform worse than hypertension and diabetes. This is highlighted by errors where MedCAT misattributes the abbreviation "AF" to "atrial fibrillation" rather than "artificial feed" despite the nutritional context of the entry. Nevertheless, this issue can be addressed with further focussed training or more sophisticated NLP models (e.g., transformer-based models).

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.