Natural language processing for the assessment of cardiovascular disease comorbidities: The cardio‐Canary comorbidity project

Abstract Objective: Accurate ascertainment of comorbidities is paramount in clinical research. While manual adjudication is labor-intensive and expensive, the adoption of electronic health records enables computational analysis of free-text documentation using natural language processing (NLP) tools. Hypothesis: We sought to develop highly accurate NLP modules to assess for the presence of five key cardiovascular comorbidities in a large electronic health record system. Methods: One thousand clinical notes were randomly selected from a cardiovascular registry at Mass General Brigham. Trained physicians manually adjudicated these notes for the following five diagnostic comorbidities: hypertension, dyslipidemia, diabetes, coronary artery disease, and stroke/transient ischemic attack. Using the open-source Canary NLP system, five separate NLP modules were designed based on 800 "training-set" notes and validated on 200 "test-set" notes. Results: Across the five NLP modules, the sentence-level and note-level sensitivity, specificity, and positive predictive value were always greater than 85% and most often greater than 90%. Accuracy tended to be highest for conditions with greater diagnostic clarity (e.g., diabetes and hypertension) and slightly lower for conditions whose diagnostic challenges (e.g., myocardial infarction and embolic stroke) may lead to less definitive documentation. Conclusion: We designed five open-source, highly accurate NLP modules that can be used to assess for the presence of important cardiovascular comorbidities in free-text health records. These modules have been placed in the public domain and can be used for clinical research, trial recruitment, and population management at any institution, as well as serve as the basis for further development of cardiovascular NLP tools.


K E Y W O R D S

cardiovascular comorbidities, natural language processing

1 | INTRODUCTION

Extracting and accurately categorizing medical comorbidities is paramount in clinical research. 1 The traditional approach to identifying comorbidities through manual adjudication is labor-intensive and expensive. However, the ever-expanding adoption of electronic health data makes it possible to automate this process. While relying on structured data sources such as coded problem lists or billing codes can be an efficient way to capture medical comorbidities, structured data often have poor sensitivity, which can introduce bias into analytic work. [2][3][4][5][6] Accordingly, innovative and efficient methods of analyzing free-text documentation are crucial to realizing the electronic health record's potential to advance medical research. 7,8

One common approach to analyzing free-text information is to deploy natural language processing (NLP) systems. NLP can be implemented using machine learning 9 or human-designed "heuristic" technologies. 10,11 Machine learning technologies are increasingly able to model non-linear linguistic relationships and can be trained quickly on large annotated datasets. On the other hand, human-designed heuristic-based NLP tools are characterized by transparency (allowing for easier correction of errors) and do not require specialized high-performing hardware such as graphics processing units. Additionally, human-designed NLP techniques can be developed using smaller annotated datasets, as they incorporate their designers' knowledge of language as well as professional vernacular. 11,12 NLP has been implemented in numerous clinical applications [11][12][13][14][15][16][17][18][19] and continues to be developed across a host of critical domains to transform natural language into data ready for computational work.
Although there have been NLP systems developed to assess for the presence of cardiovascular comorbidities in narrative electronic health data, 20,21 their portability and implementation within other health-system databases face questions of validity. 22 Accordingly, we sought to develop and validate NLP modules for key cardiovascular comorbidities using the longitudinal electronic health records within the Mass General Brigham system, a large tertiary care medical system in Boston, MA. Our aim was to accurately assess for the presence of major cardiovascular comorbidities-as documented by clinicians in free-text form-in a system-wide longitudinal health care record.

2 | METHODS
We developed five distinct NLP modules to assess for the presence of the following cardiovascular comorbidities: (a) hypertension; (b) dyslipidemia (any subtype); (c) diabetes; (d) coronary artery disease (CAD); (e) non-hemorrhagic stroke and transient ischemic attack (TIA).
Each module was designed to assess for language that is diagnostic of these comorbidities on a phrase-by-phrase and sentence-by-sentence level. For instance, a sentence stating that, "Mrs. Smith has a history of hyperlipidemia" or one that stated, "Mrs. Smith has a history of elevated cholesterol" would be considered semantically equivalent and diagnostic of dyslipidemia. Similarly, a phrase stating, "History: uncontrolled hemoglobin A1c" would be considered diagnostic of diabetes.
The ability to develop algorithms that extract phrase- and sentence-level details to determine the presence of a diagnostic concept allows for the construction of highly accurate NLP modules.
Ultimately, the goal was to design NLP algorithms that are able to recognize phraseology that clinicians use in regular practice to represent the diagnostic concepts of interest.
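As an illustration of this phrase-level approach, the sketch below shows how a small set of synonym patterns could map different clinician phraseology onto a single diagnostic concept. The pattern list and function name are illustrative assumptions, not part of the published Canary modules:

```python
import re

# Hypothetical synonym patterns for one diagnostic concept (dyslipidemia).
# These are illustrative only; the actual Canary modules define far richer
# word classes and phrase structures.
DYSLIPIDEMIA_PATTERNS = [
    r"\bhyperlipidemia\b",
    r"\bdyslipidemia\b",
    r"\belevated cholesterol\b",
    r"\bhigh cholesterol\b",
]

def sentence_is_diagnostic(sentence: str) -> bool:
    """Return True if the sentence contains diagnostic dyslipidemia language."""
    text = sentence.lower()
    return any(re.search(pattern, text) for pattern in DYSLIPIDEMIA_PATTERNS)
```

Under this scheme, "Mrs. Smith has a history of hyperlipidemia" and "Mrs. Smith has a history of elevated cholesterol" both resolve to the same positive dyslipidemia output.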
For conditions where non-binary classifications provide valuable information, we sought to develop algorithms able to characterize multiple levels of clinically useful information in order to obtain granular diagnostic data. Accordingly, the modules for diabetes, CAD, and non-hemorrhagic stroke/TIA were designed to capture additional, secondary levels of diagnostic information.

Agreement between the two adjudicators in the test set, as measured by Cohen's kappa, is given in Table 1 and ranged from 0.961 to 1.00. Any adjudication discrepancies in the test set were resolved through a joint meeting between the two physicians to create a final validation test set against which the NLP software output was then compared. The physician adjudicators were not involved in the development of the NLP modules, and the designer of the NLP modules was blinded to the test-set adjudication.
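For reference, Cohen's kappa for two raters assigning binary labels follows the standard formula; the minimal sketch below is shown for clarity and is not the software actually used in the study:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning binary (0/1) labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # proportion of positives from rater A
    p_b = sum(rater_b) / n  # proportion of positives from rater B
    # Expected chance agreement: both positive or both negative by chance.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```

(The sketch omits the degenerate case where expected agreement equals 1, in which kappa is undefined.)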

| Adjudication
Each physician was instructed to extract diagnostic information through a sentence-by-sentence review of each clinical note. Accordingly, if there were multiple sentences in a given note that referenced a history of coronary artery disease, each was logged as a positive reference to that diagnostic concept. See Table 2 for the number of unique note-level and sentence-level positive references for each diagnostic concept in the test set. The secure, web-based software platform REDCap 23,24 (Research Electronic Data Capture) was used for data entry. Individual REDCap forms for each of the five diagnostic concepts were developed to facilitate information entry by the team of physician adjudicators. See the Appendix S1 for representative designs of the REDCap forms.
When multiple levels of diagnostic information were available within a given phrase or sentence, the adjudicators were instructed to input all available classification information through the use of "radio buttons" in the REDCap forms. For instance, if a sentence stated: "Mr. Smith has a history of CAD s/p MI in 2018 requiring 2 stents to his LAD," the adjudicators were instructed to check off the boxes for "CAD General," "MI," and "Revascularization" as shown in Figure 2.
This process made it possible to obtain detailed information from sentence-level references and to program the NLP algorithms to recognize complex, multi-layered diagnostic concepts.

| NLP development
NLP algorithms were created using the open-source Canary NLP platform. 17,19,[25][26][27][28] We elected to use the Canary NLP system for the following reasons: (a) it implements NLP algorithms transparently, facilitating error correction; (b) it is easily portable to other institutions and datasets; and (c) it was previously shown to achieve higher accuracy than other NLP methodologies. 28

FIGURE 1 Note selection process. Schematic overview of the note selection process for manual adjudication of the five diagnostic concepts targeted for NLP development. Each of the training sets and the test set contained 200 unique notes with the same proportion of outpatient and hospital discharge summaries. The NLP designer was blinded to the gold-standard test-set adjudication.

For each distinct NLP module, a unique set of word classes was created. Word classes contain sets of semantically related words that can be used to create phrase structures. A simplified word class from the CAD module is: 1. >CAD - cad, coronary disease, coronary heart disease, ischemic heart disease.
In addition to defined word classes, Canary allows for the creation of an ">UNKNOWN" word class, which accounts for sentences with undefined words. Phrase structures are then created from word classes to form meaningful units of information, which can later be extracted as numbered outputs for analytic work. A phrase structure to capture a sentence such as, "The patient had 2 stents placed in his LAD in July 2018" is shown in Figure 3. This example sentence referencing the placement of stents in a coronary artery would then resolve to an output indicating that the patient had a coronary revascularization procedure. In the CAD module, for example, there were more than 40 distinct word classes, more than 600 unique heuristic-based phrase structures, and 70 numbered output types.

FIGURE 2 Example adjudication of multi-layered diagnostic sentence. Example sentence and associated REDCap form showing how adjudicators were instructed to input all available classification information for multi-layered diagnostic information. In the sentence, "Mr. Smith has a history of CAD s/p MI in 2018 requiring 2 stents to his LAD," adjudicators would click "unspecified CAD," "myocardial infarction," and "revascularization" to capture all available data points.
Each module was designed with its own unique set of word classes and heuristic-based phrase structures to maximize diagnostic accuracy. For the 800 training-set notes, a rigorous iterative process was performed whereby unique and often multi-layered phrase structures were created to capture positive references to the diagnostic concepts of interest. When additional phrase structures improved sensitivity but decreased specificity, specificity was favored and such heuristics were excluded from the final algorithms.
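The word-class and phrase-structure mechanism described above can be sketched as follows. The class names, vocabulary, and matching logic here are simplified assumptions for illustration; the actual CAD module contains more than 40 word classes and over 600 phrase structures:

```python
import re

# Simplified word classes: sets of semantically related terms.
WORD_CLASSES = {
    ">CAD": {"cad", "coronary disease", "coronary heart disease",
             "ischemic heart disease"},
    ">STENT": {"stent", "stents"},
    ">ARTERY": {"lad", "rca", "circumflex"},
}

def contains_class(sentence: str, cls: str) -> bool:
    """True if any term of the given word class appears in the sentence."""
    text = sentence.lower()
    return any(re.search(rf"\b{re.escape(term)}\b", text)
               for term in WORD_CLASSES[cls])

def matches_phrase_structure(sentence: str, structure) -> bool:
    """True if every word class of the structure is present in the sentence.
    (Real Canary phrase structures also encode ordering and intervening
    >UNKNOWN words, which this sketch omits.)"""
    return all(contains_class(sentence, cls) for cls in structure)

# A toy "revascularization" phrase structure: stent language plus an artery.
REVASCULARIZATION = (">STENT", ">ARTERY")
```

With these definitions, "The patient had 2 stents placed in his LAD in July 2018" matches the revascularization structure and would resolve to the corresponding numbered output.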
In addition to capturing positive references to the desired diagnostic concepts, the NLP system was designed to exclude negations and family history. As such, sentences describing a patient's family history of a condition were not counted as positive references.

FIGURE 3 Schematic of building NLP phrase structures. Schematic of building phrase structures to capture diagnostic concepts using defined word classes. This example sentence referencing the placement of stents in a coronary artery would then resolve to an output indicating that the patient had a coronary revascularization procedure.
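A minimal pre-filter for negation and family history might look like the following. The trigger lists are illustrative assumptions; in the actual Canary modules these exclusions are encoded in the phrase structures themselves:

```python
import re

# Hypothetical trigger phrases; a production module would need a far more
# complete list and attention to the scope of each trigger within a sentence.
NEGATION_TRIGGERS = [r"\bno\b", r"\bdenies\b", r"\bnegative for\b", r"\bwithout\b"]
FAMILY_TRIGGERS = [r"\bfamily history\b", r"\bmother\b", r"\bfather\b",
                   r"\bbrother\b", r"\bsister\b"]

def is_excluded(sentence: str) -> bool:
    """True when a sentence should not count as a positive reference."""
    text = sentence.lower()
    return any(re.search(t, text) for t in NEGATION_TRIGGERS + FAMILY_TRIGGERS)
```

For example, "Her father had a stroke at age 60" and "Patient denies diabetes" would both be excluded, while "History of hypertension" would pass through to the diagnostic phrase structures.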

3 | RESULTS
For each NLP module, we calculated the following metrics on each unique sentence-level reference: sensitivity and positive predictive value (PPV). On the document level, we calculated the sensitivity, PPV, and specificity of each algorithm. In addition, we calculated the corrected sensitivity, corrected PPV, and corrected specificity for each module to account for true positive references that were identified by the NLP system but missed by the manual physician adjudication. For the three modules that contained multi-layered outputs, we further calculated the sensitivity, specificity, and PPV of each distinct subcategory.
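The metrics above follow standard definitions from the 2x2 confusion table. The sketch below also shows one plausible implementation of the "corrected" variants, in which references found by the NLP system but missed by manual adjudication are recounted as true positives rather than false positives; this reading is an assumption based on the description above:

```python
def performance_metrics(tp, fp, fn, tn):
    """Sensitivity, positive predictive value (PPV), and specificity."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return sensitivity, ppv, specificity

def corrected_metrics(tp, fp, fn, tn, nlp_hits_missed_by_adjudicators):
    """Recount NLP-identified references that manual review missed as true
    positives (they were provisionally scored as false positives)."""
    tp += nlp_hits_missed_by_adjudicators
    fp -= nlp_hits_missed_by_adjudicators
    return performance_metrics(tp, fp, fn, tn)
```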
The performance of each of the five modules is given in Table 3.
The NLP modules demonstrated robust performance for all the studied disease states and were particularly accurate for the hypertension, dyslipidemia, and stroke modules, with greater than 95% PPV at the note level. For the three modules that had additional subcategories (diabetes, CAD, and stroke), the performance of each subcategory is presented in Table 4. For two of the subcategories (type I diabetes and ST-segment elevation myocardial infarction), there were no references to these diagnostic concepts within the test-set notes. Accordingly, we could not calculate performance characteristics for these subcategories. Two additional subcategories (references to a greater than 50% coronary stenosis, and unstable angina) had 10 or fewer references and are reported separately in the Appendix S1.

4 | DISCUSSION
Through a meticulous development and validation process, we designed five highly accurate NLP modules that can be used to assess for the presence of important cardiovascular comorbidities in free-text electronic health records. When putting our metrics in the context of other methods of extracting such data, such as using ICD billing codes, it is clear that rigorous NLP modules have the potential to significantly improve the accuracy of coding cardiovascular comorbidity data. Across all five modules, we almost always achieved sensitivity, specificity, and PPV of greater than 90%. This compares to sensitivities as low as 35% for stroke,6 61% for hypertension,2 and 57% for coronary artery disease2 in previously published work on the accuracy of ICD coding for the ascertainment of cardiovascular risk factors.
Unlike administrative billing codes, which are entered episodically and intermittently, our NLP modules accurately extract data from individual sentences within free-text documentation. This allows for a significant increase in the sensitivity of extracting such data, especially for patients who have only a limited number of medical encounters. Additionally, because administrative billing codes were not designed for medical research purposes, they are subject to both miscoding and under-coding, realities which significantly impact their validity. Our NLP modules demonstrate the power of accurately extracting data from the rich narrative of free-text documentation that is the backbone of clinical electronic health data.
Another commonly used approach for computational analysis of text is statistical analysis, also known as machine learning. Machine learning methods can also attain high accuracy but typically result in "black box" models where the reasons for categorization of a particular piece of text are not clear to an external observer. This leads to difficulties in adapting machine learning-based NLP tools between institutions that may have distinct clinical vernacular, and forces development of NLP tools from scratch at every organization and for every task, consuming scarce resources and impeding progress of the field. 29 With that in mind, in this study we pursued the approach of a more transparent, human-designed heuristic-based NLP technology that allows tracing of each step of text analysis as well as easy modification of NLP tools to correct errors or add new functionality. We have placed the NLP modules we have designed in the public domain. 30 We expect that their portability and transparency will allow them to serve as the foundation for a family of cardiovascular NLP tools.

The accurate extraction of data from clinical records is critically important for prospective and retrospective clinical research, including recruitment for clinical trials and population-based studies. As demonstrated through our work, NLP has the potential to accurately identify disease states from the electronic medical record, enabling the robust description of baseline characteristics. Our five NLP modules, specifically built to identify individuals with cardiovascular disease comorbidities, constitute a highly accurate, open-source system that will allow researchers to better understand the baseline characteristics of the patients in their research cohorts.

ACKNOWLEDGMENTS
Adam N. Berman is supported by a T32 postdoctoral training grant from the National Heart, Lung, and Blood Institute (T32 HL094301).