Identifying signs and symptoms of AL amyloidosis in electronic health records using natural language processing, diagnosis codes, and manually abstracted registry data


To the Editor: AL amyloidosis, a plasma cell disorder caused by extracellular deposition of misfolded proteins, is a rare disease with an estimated incidence of 12 to 14 cases per million person-years. 1 Patients with early-stage disease have a relatively high survival rate, with nearly 80% of patients alive 5 years after diagnosis. 2 However, delayed diagnosis reduces the survival rate substantially; patients diagnosed at stage IIIB have a 5-year overall survival of about 10%. 3 Diagnosis of AL amyloidosis is often delayed because of the non-specificity of early symptoms and the rarity of the condition. 4 In a retrospective study of 1500 patients with newly diagnosed AL amyloidosis, the median time from sign or symptom onset to diagnosis was 2.7 years. 5 Electronic health records (EHRs) present an avenue for identifying patients with suspected AL amyloidosis based on their symptoms, which could enable earlier diagnosis and treatment. EHRs are especially useful for investigating rare diseases because recruiting sufficient patients for prospective clinical trials is difficult. EHRs contain a wealth of clinical insight across structured tables (e.g., diagnosis codes, medications, and laboratory results) and unstructured free text (e.g., admission/discharge summaries, physician notes, and descriptions of conditions), which can be used collectively to identify early indications.  Table S1.
We considered 15 signs and symptoms: ascites, atrial fibrillation or flutter, autonomic neuropathy, carpal tunnel, congestive heart failure, dyspnea, edema, fatigue, lightheadedness, proteinuria, orthostatic hypotension, paresthesia, pericardial effusion, peripheral neuropathy, and pleural effusion. These were selected because they are characteristic of AL amyloidosis, present in the registry, and relatively common in patients (>3% prevalence). We considered signs and symptoms around the time of diagnosis since longitudinal data were not available for many patients, and we did not consider post-diagnosis signs and symptoms since they could be secondary to treatment.
Three data sources were used for identifying signs and symptoms: (1) a manually curated registry, (2) structured diagnosis codes, and  (Table S2). The notes, which we automatically curated with a neural network-based NLP algorithm, were the third data source. 6 The algorithm classifies a sign/symptom synonym and its surrounding text fragment with one of the following labels: "Yes" (confirmed), "No" (ruled out), "Maybe" (suspected), or "Other" (alternate context, e.g., family history of the sign or symptom) (Figure S1). 6 This data source is referred to as "augmented curation". Lists of synonyms for each sign and symptom (Table S3) were curated with input from a hematologist (A. Dispenzieri) to ensure alignment with categories in the registry. Each synonym classified with a "Yes" sentiment was counted as a record, while other classifications were not. ICD codes and notes timestamped 1 year before to 90 days after AL amyloidosis diagnosis and prior to initiation of treatment were considered. For a patient to be counted as having a sign/symptom according to a given data source, the patient needed at least one record of the sign/symptom in that data source.
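The counting rule described above (a "Yes" record inside the observation window and before treatment) can be illustrated with a minimal sketch. The function names, the 365-day reading of "1 year", the inclusive window boundaries, and the strict "before treatment" comparison are all assumptions made for illustration; they are not taken from the study's code.

```python
from datetime import date, timedelta


def in_observation_window(record_date, diagnosis_date, treatment_start=None):
    """Return True if a record falls 1 year before to 90 days after diagnosis
    and strictly before treatment start (when a treatment date is known).

    Assumptions: "1 year" is taken as 365 days and both window ends are inclusive.
    """
    window_start = diagnosis_date - timedelta(days=365)
    window_end = diagnosis_date + timedelta(days=90)
    if not (window_start <= record_date <= window_end):
        return False
    if treatment_start is not None and record_date >= treatment_start:
        return False
    return True


def has_sign(records, diagnosis_date, treatment_start=None):
    """records: iterable of (record_date, label) pairs for one patient and one sign.

    The patient counts as having the sign if at least one record is labeled
    "Yes" and lies inside the observation window.
    """
    return any(
        label == "Yes"
        and in_observation_window(record_date, diagnosis_date, treatment_start)
        for record_date, label in records
    )
```

For example, with a hypothetical diagnosis date of 1 June 2020, a "Yes" mention on 1 May 2020 counts, whereas a "Maybe" mention on the same date, or a "Yes" mention dated after treatment start, does not.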
The number of cases identified from each data source and the overlap across data sources are reflected in Euler diagrams (Figure 1).
Congestive heart failure (38.5%) and pleural effusion (32.3%) had the highest levels of concordance across all data sources (Figure 1 and Table S4). Lightheadedness (3.8%), atrial fibrillation/flutter (5.3%), and paresthesia (8.6%) had the lowest concordance across the data sources. There was relatively high concordance between augmented curation and the registry for the most prevalent signs and symptoms:   (Table S4).
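The overlap counts behind a three-set Euler diagram can be sketched with plain set operations. The function names are hypothetical, and the concordance formula shown (patients found by all three sources divided by patients found by any source) is only one plausible reading; the text here does not state how the reported percentages were computed.

```python
def euler_regions(registry, icd, notes):
    """Count patients in each of the seven regions of a three-set Euler diagram.

    Each argument is a collection of patient identifiers flagged by that source.
    """
    registry, icd, notes = set(registry), set(icd), set(notes)
    return {
        "registry_only": len(registry - icd - notes),
        "icd_only": len(icd - registry - notes),
        "notes_only": len(notes - registry - icd),
        "registry_icd": len((registry & icd) - notes),
        "registry_notes": len((registry & notes) - icd),
        "icd_notes": len((icd & notes) - registry),
        "all_three": len(registry & icd & notes),
    }


def concordance_all(registry, icd, notes):
    """Assumed definition: share of patients flagged by any source who are
    flagged by all three sources. Returns 0.0 when no patient is flagged."""
    registry, icd, notes = set(registry), set(icd), set(notes)
    union = registry | icd | notes
    if not union:
        return 0.0
    return len(registry & icd & notes) / len(union)
```

The seven region counts always sum to the size of the union, which corresponds to the total "n" used to scale each diagram.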
We evaluated the accuracy of each data source for proteinuria by deriving a "gold standard" patient set based on laboratory data. For this, we considered all patients with at least one laboratory measurement for urine protein occurring 1 year before to 90 days after the AL amyloidosis diagnosis date and prior to initiation of treatment, followed by a clinical note within 0 to 15 days; 974 patients (of 1223 in the study population) met these criteria. Individuals who had at least one measurement ≥0.5 grams of urine protein/24 hours during the study period were counted as positive for laboratory test-derived proteinuria (423 patients). Using this "gold standard" patient set, we computed specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV) metrics for patient sets identified by each data source.
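The four metrics follow from a standard 2x2 comparison of each data source against the laboratory-derived reference set. The sketch below is illustrative only: the function and variable names are hypothetical, and the patient identifiers in the usage note are toy values rather than study data.

```python
import math

# Threshold stated in the text: at least one measurement >= 0.5 g urine protein/24 h.
PROTEINURIA_THRESHOLD_G_PER_24H = 0.5


def lab_positive(measurements):
    """True if any urine protein measurement (g/24 h) meets the threshold."""
    return any(m >= PROTEINURIA_THRESHOLD_G_PER_24H for m in measurements)


def diagnostic_metrics(flagged, gold_positive, evaluated):
    """Compare one data source against the reference set.

    flagged: patients the data source marks as having the sign.
    gold_positive: patients positive by the laboratory reference.
    evaluated: all patients eligible for the comparison.
    """
    evaluated = set(evaluated)
    flagged = set(flagged) & evaluated
    positive = set(gold_positive) & evaluated
    negative = evaluated - positive

    tp = len(flagged & positive)   # flagged and truly positive
    fp = len(flagged & negative)   # flagged but reference-negative
    fn = len(positive - flagged)   # missed positives
    tn = len(negative - flagged)   # correctly not flagged

    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN rather than raising an error.
        return numerator / denominator if denominator else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

With ten toy patients numbered 0 to 9, reference positives {0, 1, 2, 3}, and a source flagging {0, 1, 4}, the sensitivity is 2/4, the specificity 5/6, the PPV 2/3, and the NPV 5/7.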
Augmented curation and registry yielded similar results in terms of spec- and PPV (85.3%), but substantially lower sensitivity (32.2%) and NPV (48.1%), affirming that ICD codes miss many true positive cases.
For each of the signs and symptoms, we further investigated a random sample of 10 cases (150 cases in total) identified by augmented curation alone. We manually reviewed all notes containing a mention of the sign/symptom in the observation window, and then assigned one of the following labels for each patient-symptom pair: "Present, attributed to AL amyloidosis", "Present, attributed to another condition/treatment", "Present, no attribution", or "Not present". Of the 150 cases reviewed, the symptom was confirmed to be present in 141, while 9 were false positives (Figure S2). Of the 141 cases, the symptom was not attributed to any condition in 90, attributed to another condition or treatment in 25, and attributed to AL amyloidosis in 26. Symptoms most commonly attributed to conditions other than AL amyloidosis included peripheral neuropathy (5), proteinuria (4), congestive heart failure (3), and dyspnea (3). Peripheral neuropathy was attributed to diabetes, trauma/overuse from long-distance running, and vincristine. Proteinuria was attributed to chronic kidney disease, glomerulonephritis, and diuretics. Congestive heart failure was attributed to heart attack, mild hypertension, hyperlipidemia, and pomalidomide (as an adverse event during treatment of multiple myeloma). Dyspnea was attributed to cerebrovascular disease, depression, fatigue, and promethazine and fentanyl given for abdominal pain.
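The review counts above can be checked with simple arithmetic. The sketch below uses only the numbers reported in this paragraph; the dictionary keys are shortened label names chosen for illustration, and the derived proportion applies to the reviewed sample only.

```python
# Counts reported for the 150 manually reviewed patient-symptom pairs.
review_counts = {
    "present_no_attribution": 90,
    "present_other_condition_or_treatment": 25,
    "present_al_amyloidosis": 26,
    "not_present": 9,
}

total_reviewed = sum(review_counts.values())
confirmed_present = total_reviewed - review_counts["not_present"]

# Share of sampled augmented-curation-only cases where the symptom was present.
share_present = confirmed_present / total_reviewed

print(total_reviewed, confirmed_present, share_present)  # 150 141 0.94
```

The three "present" categories sum to 141, consistent with the reported total, and 141 of 150 sampled cases (94%) were confirmed on manual review.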
Overall, augmented curation was highly accurate in identifying conditions experienced by patients, but these conditions were often not explicitly linked to AL amyloidosis in the clinical notes. In contrast, the registry dataset was manually curated to include only signs and symptoms attributed to AL amyloidosis. For developing screening algorithms that facilitate earlier diagnosis, symptoms identified via augmented curation may be more useful, since they are captured at their earliest manifestation and do not rely upon the clinician already suspecting AL amyloidosis.
These findings demonstrate that an NLP-based approach is valu-

DATA AVAILABILITY STATEMENT
The data sharing policy of Janssen Pharmaceutical Companies of Johnson & Johnson is available at https://www.janssen.com/clinicaltrials/transparency. These data were made available by Mayo Clinic for the current study and are not publicly available due to the inclusion of protected health information (PHI). To request data from this

FIGURE 1 Euler diagrams showing the intersections between the three data sources for the 15 signs and symptoms. The number of patients with the sign or symptom according to the data source(s) is given in each section. Diagrams are sorted by the total prevalence of the sign or symptom across the three data sources, "n", and the area for each Euler diagram is proportional to n. Corresponding percentages are shown in Table S4. ICD, International Classification of Diseases.