CodeMapper: semiautomatic coding of case definitions. A contribution from the ADVANCE project

Abstract

Background
Assessment of drug and vaccine effects by combining information from different healthcare databases in the European Union requires extensive efforts in the harmonization of codes as different vocabularies are being used across countries. In this paper, we present a web application called CodeMapper, which assists in the mapping of case definitions to codes from different vocabularies, while keeping a transparent record of the complete mapping process.

Methods
CodeMapper builds upon coding vocabularies contained in the Metathesaurus of the Unified Medical Language System. The mapping approach consists of three phases. First, medical concepts are automatically identified in a free-text case definition. Second, the user revises the set of medical concepts by adding or removing concepts, or expanding them to related concepts that are more general or more specific. Finally, the selected concepts are projected to codes from the targeted coding vocabularies. We evaluated the application by comparing codes that were automatically generated from case definitions by applying CodeMapper's concept identification and successive concept expansion, with reference codes that were manually created in a previous epidemiological study.

Results
Automated concept identification alone had a sensitivity of 0.246 and positive predictive value (PPV) of 0.420 for reproducing the reference codes. Three successive steps of concept expansion increased sensitivity to 0.953 and PPV to 0.616.

Conclusions
Automatic concept identification in the case definition alone was insufficient to reproduce the reference codes, but CodeMapper's operations for concept expansion provide an effective, efficient, and transparent way for reproducing the reference codes.

requires several steps to achieve consistency between databases. A case definition that describes the event in the study protocol is translated into an operational definition, which is then mapped for each vocabulary into a set of codes that represents the event. The code sets are combined into queries for case identification and harmonized between databases by comparison with benchmarks from the literature and by feedback from the database custodians.
The creation of code sets for each vocabulary from the textual case definitions has been largely a manual process. Given the number and complexity of the targeted vocabularies, the mapping and harmonization process can pose an important bottleneck to the rapid implementation of collaborative epidemiological studies. 9,10 Furthermore, the rationale for including or excluding individual codes is not consistently documented, which hampers the possible reuse of code sets and queries in subsequent studies.
A previous attempt to accelerate the creation of code sets from multiple vocabularies was made in the EU-ADR project. 10-12 Medical concepts like diseases, symptoms, laboratory procedures, or tests were automatically identified in a case definition using the MetaMap program. 13 Code sets representing the concepts in the targeted vocabularies were then generated using the Unified Medical Language System (UMLS), 14 a biomedical terminology system that integrates many vocabularies including coding vocabularies commonly used in EHR databases. Whereas the identification of concepts and their projection to codes was automated, the overall workflow was not integrated or recorded to facilitate the later reuse of the mapping. The approach was also applied in other European projects like GRIP (http://www.grip-network.org), VAESCO (http://www.vaesco.net), and EMIF (http://www.emif.eu). Similar collaborative studies in the Asian and Pacific region deal with less heterogeneous medical vocabularies (Mini-Sentinel, 15 PRISM, 16 VSD, 17 and AsPEN 18). Instead of adapting the event identification algorithm to the different databases, databases can also be mapped to a standardized coding system. A single event identification algorithm can then be used in different databases. This approach has been pursued in OMOP 19 and OHDSI. 20

We present a web application called CodeMapper, which has been developed in the Accelerated Development of Vaccine Benefit-Risk Collaboration in Europe (ADVANCE) project (http://www.advancevaccines.eu). It is based on the EU-ADR approach and assists in mapping case definitions to code sets from different vocabularies while keeping a record of the complete mapping process. We evaluate the application by comparing code sets that were automatically generated by CodeMapper with reference code sets that were manually created in a previous epidemiological study.
First, medical concepts are automatically identified in a free-text case definition. The user can then revise the set of medical concepts by adding or removing concepts or by expanding a concept to more general or more specific concepts. For example, the concept Coughing can be expanded to more general concepts such as Respiratory disorders and Abnormal breathing. Expanding it to concepts that are more specific results in subtypes of coughing such as Paroxysmal cough and Evening cough. Finally, each concept is represented by (possibly several) codes in the targeted vocabularies, and the projection of the concepts to codes forms the result of the mapping process. In this section, we will describe the mapping approach, the CodeMapper application, and an evaluation of the approach.

| Mapping approach
CodeMapper builds upon information from the Metathesaurus of the UMLS. The Metathesaurus is a compendium of many medical vocabularies, which have been integrated by assigning equivalent codes and terms from different source vocabularies to the same concepts. Each concept in the UMLS is identified by a concept unique identifier (CUI). For example, the concept Coughing (CUI: C0010200) is among others associated with the codes 786.2 (ICD-9 CM), R05 (ICD-10), and XC07I (CTv3).
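The projection from concepts to codes can be illustrated with a minimal sketch. It is not CodeMapper's implementation: the link table, the vocabulary labels (`ICD9CM`, `ICD10`, `CTV3`), and the function name `project` are assumptions made for illustration, and only the three Coughing codes are taken from the text above.

```python
from collections import defaultdict

# Hypothetical excerpt of concept-to-code links (CUI, vocabulary label, code).
# Only the codes for Coughing (C0010200) come from the text; the vocabulary
# labels are illustrative names, not necessarily those used by CodeMapper.
CONCEPT_CODES = [
    ("C0010200", "ICD9CM", "786.2"),
    ("C0010200", "ICD10", "R05"),
    ("C0010200", "CTV3", "XC07I"),
]


def project(cuis, target_vocabularies, links=CONCEPT_CODES):
    """Return {vocabulary: sorted list of codes} for the selected concepts."""
    selected = set(cuis)
    targets = set(target_vocabularies)
    result = defaultdict(set)
    for cui, vocabulary, code in links:
        if cui in selected and vocabulary in targets:
            result[vocabulary].add(code)
    # Keep every requested vocabulary in the output, even if no code was found.
    return {vocabulary: sorted(result[vocabulary]) for vocabulary in target_vocabularies}


if __name__ == "__main__":
    print(project(["C0010200"], ["ICD9CM", "ICD10"]))
```

Because one concept may carry several codes in a vocabulary, and some concepts carry none, the sketch returns a list per vocabulary and keeps empty lists visible rather than dropping them.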
The Metathesaurus contains more than 1 million concepts connected to codes from 201 vocabularies. Each concept is assigned to 1 or more of 127 semantic types, which define broad conceptual categories like Disease or syndrome, Finding, or Substance. To provide even broader structure, semantic types are combined into 15 semantic groups. 21 We

| Application
The CodeMapper application is implemented as a web application and freely available for noncommercial use (https://euadr.erasmusmc.nl/CodeMapper). CodeMapper has three screens. On the first screen, the user enters a clinical case definition of an event as free text. Medical concepts are automatically identified in the text and highlighted inline. By default, only concepts that belong to the semantic group of Disorders are preselected for further processing in the application, but the user can select and deselect any identified concept depending on their relevance for the described event.
The second screen displays the mapping as a table with one row for each medical concept, and one column for each targeted vocabulary. The third screen shows a list of all operations that have been performed, for later traceability of the mapping process. When the user saves the mapping, he or she is asked to provide a summary of the modifications, which is incorporated into the mapping history. After saving, the mapping and history lists are available to other users of the application. Comments can be attached to concepts to capture the discussion about the mapping. Concepts can be categorized by tags. Finally, the user can download the mapping as a spreadsheet file, for example, to incorporate the codes into extraction queries.
The spreadsheet file comprises the original free-text case definition, the concepts of the mapping, the codes for the targeted vocabulary, and the full history of the mapping process.

| Evaluation
We evaluated the effectiveness of CodeMapper's approach for creating realistic code sets by comparing code sets that were generated with CodeMapper with manually created reference code sets. We used the case definitions and reference code sets from the FP7-funded SAFEGUARD project (http://www.safeguard-diabetes.org). 27 SAFEGUARD studied nine events: acute pancreatitis, bladder cancer, hemorrhagic stroke, heart failure, ischemic stroke, acute myocardial infarction, pancreatic cancer, sudden cardiac death, and ventricular arrhythmia. One event (sudden cardiac death) was excluded from the evaluation because of several missing code sets, and another (heart failure) because the case definition contained only a short symptomatic description of the event, unrelated to the codes representing the event. The events were mapped for nine EHR databases with four vocabularies. Overall, the reference code sets contained 420 codes (Table 1).
The size of the reference code sets varied widely between vocabularies: on average, the code sets for Read-2 contained 48.3 codes, whereas the code sets for ICPC-2 contained 1.1 codes. This discrepancy is firstly due to the differences of granularity of the vocabularies (Read-2 has 77290 codes in the Metathesaurus, ICPC-2 only 1397). Second, the queries to the IPCI database (to which the ICPC-2 code sets are targeted) were supported by keyword searches.

Different code sets were generated fully automatically by CodeMapper for the events of the reference project based on the same case definitions. The baseline code sets resulted from the concepts identified in the case definition (Figure 3). We then simulated the actions of an "informed user" who seeks to improve the sensitivity of the mapping. We assumed that this user would expand the concepts and, from all possible concepts that are more general or more specific, would only retain those that are relevant to the event. Based on the reference set we were able to automatically simulate the "informed user's" actions. The resultant set of concepts defined a new code set, which always contained all codes from the preceding code set. We simulated three of these expansion steps on successive concept sets.
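The simulated expansion can be sketched as follows. This is a hedged illustration, not the evaluation code used in the study: the relevance rule (a related concept is retained if at least one of its codes occurs in the reference set) is an assumption about how the reference set informs the simulated user, and the concept identifiers, codes, and function names are hypothetical toy values.

```python
def expand_once(concepts, related, codes_of, reference_codes):
    """One simulated expansion step.

    Every more general or more specific concept in `related` is considered,
    and it is retained only if it maps to at least one reference code
    (assumed relevance rule).
    """
    expanded = set(concepts)
    for concept in concepts:
        for candidate in related.get(concept, ()):
            if codes_of.get(candidate, set()) & reference_codes:
                expanded.add(candidate)
    return expanded


def codes_for(concepts, codes_of):
    """Union of the codes of all concepts in the set."""
    return set().union(*(codes_of.get(c, set()) for c in concepts))


def simulate(baseline, related, codes_of, reference_codes, n_steps):
    """Return the code sets for the baseline and for each expansion step."""
    concepts = set(baseline)
    code_sets = [codes_for(concepts, codes_of)]
    for _ in range(n_steps):
        concepts = expand_once(concepts, related, codes_of, reference_codes)
        code_sets.append(codes_for(concepts, codes_of))
    return code_sets


# Hypothetical toy data: concept "A" is found in the case definition,
# "B" and "C" are related to "A", and "D" is related to "B".
RELATED = {"A": {"B", "C"}, "B": {"D"}}
CODES_OF = {"A": {"a1"}, "B": {"b1"}, "C": {"c1"}, "D": {"d1"}}
REFERENCE = {"a1", "b1", "d1"}

if __name__ == "__main__":
    for step, codes in enumerate(simulate({"A"}, RELATED, CODES_OF, REFERENCE, n_steps=2)):
        print(step, sorted(codes))
```

In this toy run, concept "C" is never retained because its only code is absent from the reference set, and each code set contains all codes of the preceding one, matching the property stated above.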
For each target vocabulary and event, the generated code set was compared with the reference code set. We determined the number of true-positive codes (TP), false-positive codes (FP), and false-negative codes (FN), and computed sensitivity (TP / (TP + FN)) and positive predictive value (PPV) (TP / (TP + FP)). We report for each vocabulary the sensitivity and PPV averaged over all events in the reference set.
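As a worked illustration of these measures, the following sketch computes sensitivity and PPV for one generated code set and averages them over events. The function names and toy code sets are hypothetical, and skipping events with an undefined measure (zero denominator) when averaging is an assumption, since the text does not say how such cases were handled.

```python
from statistics import mean


def confusion_counts(generated, reference):
    """Return (TP, FP, FN) for a generated code set against a reference set."""
    generated, reference = set(generated), set(reference)
    tp = len(generated & reference)   # codes in both sets
    fp = len(generated - reference)   # generated but not in the reference
    fn = len(reference - generated)   # in the reference but not generated
    return tp, fp, fn


def sensitivity_and_ppv(generated, reference):
    """Sensitivity = TP / (TP + FN), PPV = TP / (TP + FP); None if undefined."""
    tp, fp, fn = confusion_counts(generated, reference)
    sensitivity = tp / (tp + fn) if tp + fn else None
    ppv = tp / (tp + fp) if tp + fp else None
    return sensitivity, ppv


def average_over_events(event_pairs):
    """Average both measures over (generated, reference) pairs, one per event.

    Assumption: events with an undefined measure are left out of that average.
    """
    scores = [sensitivity_and_ppv(g, r) for g, r in event_pairs]
    sens = [s for s, _ in scores if s is not None]
    ppvs = [p for _, p in scores if p is not None]
    return (mean(sens) if sens else None, mean(ppvs) if ppvs else None)


if __name__ == "__main__":
    # Hypothetical code sets for two events in one vocabulary.
    pairs = [
        ({"a", "b", "c"}, {"a", "b", "d", "e"}),  # TP=2, FP=1, FN=2
        ({"x"}, {"x"}),                           # TP=1, FP=0, FN=0
    ]
    print(average_over_events(pairs))
```

For the first toy event, sensitivity is 2 / (2 + 2) = 0.5 and PPV is 2 / (2 + 1) ≈ 0.667; averaging with the perfectly matched second event gives 0.75 and approximately 0.833.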

| Error analysis
We then carried out an automatic error analysis of the false-positive and false-negative codes after the third expansion step (Figure 4).

| Baseline
The baseline mapping created by CodeMapper had an average sensitivity of 0.246 for reproducing the reference code sets (Table 2). The average PPV of the baseline mapping was 0.420. Without filtering by the semantic group of Disorders, the number of concepts would increase from 46 to 77 without affecting the sensitivity of the generated code sets.

| Concept expansion
The sensitivity of the baseline mapping greatly improved in the first expansion step, to 0.818. Sensitivity further increased in the second (0.940) and third (0.953) expansion steps. All ICPC-2 codes were produced after the first expansion step and all ICD-10 codes were produced after the second step. The sensitivity increased incrementally for Read-2 and ICD-9 CM. The PPV improved after one expansion step

| Error analysis
False-positive codes were generated in all vocabularies after the third expansion step (Table 3).

| DISCUSSION
In this article, we presented the CodeMapper web application that assists in the mapping of textual case definitions to code sets from multiple vocabularies, which is often a bottleneck in the implementation of epidemiological multi-database studies. We showed the effectiveness of CodeMapper's approach by simulating an informed usage of the application.

When exclusion criteria are indicated in the case definition, CodeMapper's approach can be applied to map them to codes, but they must then manually be marked for exclusion to inform the data extraction process. Automatic negation extraction 32

ETHICS STATEMENT
The authors state that no ethical approval was needed.