Combining open-source natural language processing tools to parse clinical practice guidelines



Natural language processing (NLP) has been used to process text pertaining to patient records and narratives. However, most of the methods used were developed for specific systems, so new research is necessary to assess whether such methods can be easily retargeted for new applications and goals, with the same performance. In this paper, open-source tools are reused as building blocks on which a new system is built. The aim of our work is to evaluate the applicability of the current NLP technology to a new domain: automatic knowledge acquisition of diagnostic and therapeutic procedures from clinical practice guideline free-text documents. In order to do this, two publicly available syntactic parsers, several terminology resources and a tool oriented to identify semantic predications were tailored to increase the performance of each tool individually. We apply this new approach to 171 sentences selected by the experts from a clinical guideline, and compare the results with those of the tools applied with no tailoring. The results of this paper show that with some adaptation, open-source NLP tools can be retargeted for new tasks, providing an accuracy that is equivalent to the methods designed for specific tasks.

1. Introduction

Clinical practice guidelines (CPGs) form a substantial and important source of knowledge about evidence-based diagnostic and therapeutic recommendations. It has been widely recognized that accessing this knowledge at the point of care would improve the quality of healthcare and reduce unnecessary costs (Grimshaw & Russel, 1994; Rosser et al., 2001). In order to implement CPGs as electronic applications, it is essential to transform the CPG document text into an electronic format. Thus, knowledge engineers, with the assistance of clinical experts, translate medical text into some formal language specifically developed to represent CPGs (de Clercq et al., 2004; Isern & Moreno, 2008). However, manual knowledge acquisition from CPGs is still complex and labour-intensive, despite the unquestionable assistance provided by computer tools such as AsbruView (Kosara et al., 2001), AREZZO (InferMed, 2007), Tallis (Steele & Fox, 2002) or Protege (Gennari et al., 2003), among others. Thus, there seems to be a need for tools that automate, at least in part, knowledge extraction from CPG free-text documents.

Natural language processing (NLP) is an active line of research in healthcare, playing a decisive role in assisting and enhancing the process of medical care, especially regarding clinical information stored as narrative text in electronic health records, physician notes or biomedical literature (Chapman & Cohen, 2009). Healthcare entails not only patient information but also clinical knowledge about evidence-based recommendations on diagnostic and therapeutic procedures. Therefore, the current NLP technology can be crucial to bring CPG-based evidence to the point of care. However, expanding the reach of current biomedical NLP technology is not trivial, as many of the systems developed to date are application-driven (Demner-Fushman et al., 2009). Therefore, new research is necessary to assess whether such methods can be easily retargeted for new applications and goals, with the same performance.

In this paper, open-source tools are reused as building blocks on which a new system is built. The aim of our work is to evaluate the applicability of the current NLP technology to a new domain: automatic knowledge acquisition of diagnostic and therapeutic procedures from CPG free-text documents. The main idea is to enrich these documents with an ontology, with the aim of making them computer-interpretable. However, the reliability of the current NLP technology and knowledge engineering (KE) techniques is not guaranteed. First, text pre-processing tools can cause errors when they are used in new domains, and these errors can propagate upward and create more errors at the next levels (Demner-Fushman et al., 2009). Second, named entity recognition (NER) entails finding medical entities in the text and correctly interpreting their meaning. Terminologies are the main knowledge resources used to assist NER in clinical domains, as they are usually large-scale and publicly available resources. Even so, NER has to deal with the problem of ambiguity (Friedman et al., 2004; Denecke, 2008; Gschwandtner et al., 2010). Third, in contrast to the favourable results in NER, the advance in relation extraction research has been less productive, at least in providing technology. Many works carried out to date involve this task as part of a full clinical information extraction system (Rindflesch & Fiszman, 2003; Denecke, 2008), and only a few publicly available open-source tools have been proposed, such as SemRep (Rindflesch & Bodenreider, 2006).

In order to overcome these drawbacks, several tools were adapted to parse CPG documents with the aim of identifying first diagnosis and therapy entities and then meaningful relationships among these entities. Our approach applies a sequential combination of several methods classically used in NLP, to gradually map sentences to an ontology. We have applied this new approach to 171 sentences describing diagnostic and therapeutic procedures. These sentences were selected by a group of experts taking part in the research project HYGIA. This project has promoted the acquisition, formalization and adaptation of knowledge from CPGs in order to describe care pathways. The results of this paper show the applicability of integrating several open-source tools to parse CPG documents.

2. Research background

NLP systems can be developed using KE, machine learning (ML) or hybrid methods (Krauthammer & Nenadic, 2004). ML systems have become more widespread since results are attained earlier (Jin et al., 2006; Roberts et al., 2009). However, automatic learning is particularly difficult if not much clinical data are available for training and, more often than not, researchers do not have any such resources available. Nowadays, KE approaches are less successful because of the manual work they require and the difficulty of reusing them in other applications. Even so, mainly for the recognition of complicated structures, these approaches offer good results. This is the case of many research works recognizing entities and extracting relationships from patient reports. Many of them apply KE approaches, using syntactic parsers with domain-specific grammar rules, lexicons and ontologies. The RECIT system (Rassinoux et al., 1994) combines syntactic and semantic information to extract conceptual graphs expressing the meaning of components in free-text sentences. MedLEE (Friedman et al., 2004) is a rule-based system successfully used to process reports from different domains (e.g. radiology, pathology, electrocardiography). SeReMeD (Denecke, 2008) is a method used to generate knowledge representations from chest X-ray reports, supported by the Unified Medical Language System (UMLS). To date, we can also find very relevant works on information extraction from CPG texts. Kaiser et al. (2007) have modelled treatment processes by automatically identifying relevant guideline text parts using information extraction methods. Serban et al. (2007) simplify knowledge acquisition by automatically extracting control knowledge (such as clinical action decomposition or sequencing) using linguistic patterns, and an ontology that was specifically constructed for this system. Recently, MapFace (Gschwandtner et al., 2010), an interactive editor, has been proposed in order to simplify the enrichment of CPG documents with UMLS concepts.

There are four main steps in NLP (Demner-Fushman et al., 2009): text pre-processing, NER, context extraction and relationship extraction. Currently, we can find available text pre-processing tools providing sentence detectors, tokenizers, part-of-speech (POS) taggers, Treebank chunkers or Treebank parsers. Examples of these tools are OpenNLP, which is also part of integrated NLP toolkits such as GATE (Cunningham et al., 2002), and Stanford, which has been integrated into other NLP browsers and interfaces. With regard to NER in medical domains, MetaMap (Aronson & Lang, 2010) is the most used tool nowadays, although other UMLS resources have been applied (Huang et al., 2006; Taboada et al., 2009). The UMLS (Lindberg et al., 1993) provides the terminology resource for MetaMap. It consists of several knowledge sources providing terminological information. The largest knowledge source is the Metathesaurus, which contains information about medical concepts, synonyms and the relationships between them. Semantic types (STs) are a set of basic semantic categories used to classify the concepts in the Metathesaurus. Examples of STs are diagnostic procedure or therapeutic or preventive procedure. STs are related to each other by hierarchical (‘is-a’) and non-hierarchical (‘diagnoses’, ‘treats’, etc.) relationships, making up the so-called semantic network (SN). On the other hand, understanding the context from which an entity is extracted is essential to transform text into an electronic format. Context recognition can involve the identification of a clinical condition, negations or patient health antecedents. One of the most popular algorithms proposed in the literature for context recognition is NegEx (Chapman et al., 2001), an algorithm used to identify negation in sentences. NegEx has been integrated into the latest releases of MetaMap. Finally, SemRep (Rindflesch & Bodenreider, 2006), a tool oriented to extract knowledge in the form of semantic predications, is also available to researchers.

3. Case study

In order to evaluate the current NLP technologies in automated knowledge acquisition, we used a public CPG for chronic heart failure (CHF), several open-source tools, and we applied performance evaluation criteria usual in NLP research.

3.1. The text

We executed the case study using 171 sentences from a free-text document: the guideline for the diagnosis and treatment of CHF published by the European Society of Cardiology (Swedberg, 2005). The sentences were selected by a group of experts and they were used in this work for verifying the effectiveness of our approach. Initially, the clinically relevant contents of the CPG were underlined by the group of experts and then, the underlined sentences were manually annotated with the diagnosis context that they refer to: CHF, CHF with Preserved Left Ventricular Ejection Fraction (PLVEF) or CHF in elderly patients.

3.2. The open-source NLP tools

The open-source NLP tools used in this case study were the following: the OpenNLP and Stanford parsers, SemRep and the UMLS NormalizeString service. All these tools were selected according to the following criteria: availability, high quality as tested by other researchers and reported in the literature, and our research group's previous experience in using them.

3.3. Performance evaluation criteria

The main difficulty when evaluating NLP over clinical documents is the almost complete absence of gold standards for the domain. Hence, we validated the automated approach with a manual gold standard, including sentences annotated by a clinically trained annotator. Evaluation metrics are usually interpreted in terms of precision and recall. Precision is a measurement of the accuracy of the units that are suggested as entities or relationships, and it is typically measured as the ratio of true positives (correctly suggested units) over all the suggested units. Recall designates the proportion to which entities or relationships are recognized, and it is generally measured as the ratio of true positives over all the units that should be recognized. The overall performance is usually calculated by the well-known F-measure, the harmonic mean of precision and recall.
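The three metrics can be written down in a few lines. The following Python sketch is only illustrative: the function names and the counts in the usage note are assumptions, not part of the evaluation software used in this study.

```python
def precision_recall(true_positives, suggested, expected):
    """Return (precision, recall) from raw counts.

    true_positives: suggested units that are correct
    suggested:      all units proposed by the system
    expected:       all units that should have been recognized
    """
    precision = true_positives / suggested if suggested else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, with the hypothetical counts `precision_recall(80, 100, 90)` the precision is 0.8 and the recall is 80/90. Applied to rates rather than counts, `f_measure(0.81, 0.92)` is approximately 0.86, which agrees with the F1-measure reported for the OpenNLP parser in Section 5.1.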

4. Methods

Our approach extracts entities and relationships from sentences in natural language. By entity, we mean some clinical concept or event (disease, treatment, symptom, sign, etc.), and by relationship, some semantic predication on a clinical diagnostic or therapeutic procedure. These predications are composed of two arguments, that is, two entities asserted in the sentence. In addition, the generation of computer-executable knowledge requires understanding the context from which entities are extracted. For example, processing diagnostic and therapeutic procedures will require not only recognizing the procedure to be applied, which is given by the entities and their relationships, but also determining the clinical condition under which the procedure is recommended. In Figure 1, we can see an example sentence from the guideline of the case study, and the informally expressed knowledge that needs to be recognized. In this particular situation, the sentence involves a therapeutic procedure for CHF, as an alternative to another therapeutic procedure, but only in the context of patients with a particular sign or pathologic function.

Figure 1.

A sample sentence and descriptive knowledge to be recognized.

In order to extract relationships while taking the patient context into account, our approach produces a shallow syntactic analysis, a UMLS-based NER and a conversion of syntactic structures into semantic relationships. Figure 2 shows the main building blocks of our approach. First, the tool SemRep is used to extract information from the set of sentences, in the form of semantic predications. SemRep produces two types of results: a UMLS-based NER using the MetaMap tool (Aronson & Lang, 2010) and a predication recognition using the SN relationships. In Figure 3, we can see the results of SemRep for the sentence in Figure 1. Then, our approach automatically checks these results and, in the absence of the expected entities or predications, it applies the steps classically used in NLP: text pre-processing, NER, context extraction and relationship extraction. Figure 4 shows the application of these four main steps to the sample sentence.

Figure 2.

Building blocks of our natural language processing (NLP) approach.

Figure 3.

The results of SemRep for the sentence in Figure 1.

Figure 4.

An example of the steps classically used in natural language processing.

4.1. Recognizing entities and extracting predications using SemRep

Each sentence is mapped to a set of UMLS concepts, and a set of semantic predications between two of the recognized concepts, using SemRep. Each semantic predication is some type of relationship in the SN, such as diagnoses, treats, causes or location of. As we are only interested in extracting diagnostic and therapeutic procedures, our approach is restricted to some UMLS STs (diagnostic, laboratory, therapeutic or preventive procedures; signs or symptoms, findings, diseases or syndromes and pathologic functions; pharmacologic substances), two types of SN predications (treats and diagnoses) and a type of predication not included in the SN (has adverse effects). An example of the original output of SemRep for the sentence in Figure 1 is given in Figure 3. Seven UMLS entities and three relationships are automatically recognized by SemRep. Our approach rules out unexpected results, for instance the three entities and the three relations highlighted in bold type, as they do not belong to the required semantic or predication types. In the absence of the expected entities and predications, our approach applies the steps described in the following sections.
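The filtering step can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the `Entity` and `Predication` classes, the lower-case spellings of the semantic types and the rule that both arguments must survive the filter are hypothetical choices made for the example, and they do not reproduce the actual output format of SemRep.

```python
from dataclasses import dataclass

# Semantic types and SN predications named in the text. The spellings are
# illustrative and are not SemRep's own output codes.
ALLOWED_STS = {
    "diagnostic procedure",
    "laboratory procedure",
    "therapeutic or preventive procedure",
    "sign or symptom",
    "finding",
    "disease or syndrome",
    "pathologic function",
    "pharmacologic substance",
}
ALLOWED_PREDICATES = {"treats", "diagnoses"}


@dataclass(frozen=True)
class Entity:
    name: str
    semantic_type: str


@dataclass(frozen=True)
class Predication:
    subject: Entity
    predicate: str
    obj: Entity


def filter_semrep_output(entities, predications):
    """Keep entities of the required STs and predications of the required
    types whose two arguments were both kept (an assumption of this sketch)."""
    kept_entities = [e for e in entities if e.semantic_type in ALLOWED_STS]
    kept = set(kept_entities)
    kept_predications = [
        p for p in predications
        if p.predicate in ALLOWED_PREDICATES and p.subject in kept and p.obj in kept
    ]
    return kept_entities, kept_predications
```

If both returned lists are empty for a sentence, the fallback steps of the following sections would be triggered.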

4.2. Text pre-processing

Two open-source parsers (OpenNLP and Stanford) were used to provide a shallow syntactic analysis. The POS tags used by the Stanford and OpenNLP parsers are from the Penn Treebank tag set. Once we obtain the parse trees from the two parsers, our method traverses these trees, looking for NP–VP and PP–NP–VP tuples (see text pre-processing in Figure 4). These are called the expected tuples. NP–VP tuples involve a noun phrase (NP), generally representing the subject of the sentence, and a verb phrase (VP), normally representing the predicate of the sentence, whereas PP–NP–VP tuples are NP–VP tuples preceded by a prepositional phrase (PP). These tuples are useful for extracting binary semantic relationships between medical entities. The method compares the results of the two parsers and rules out those tuples differing from the expected tuples, improving, in this way, on the performance of each parser individually. For the sample sentence, both parsers provide a correct analysis, but for the sentence ‘Breathlessness, ankle swelling, and fatigue are the characteristic symptoms and signs of chronic heart failure’, the OpenNLP parser produces an analysis differing from the expected tuples, whereas the Stanford parser produces an expected analysis. Hence, in this last situation, the method chooses the Stanford analysis and rules out the OpenNLP analysis.
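The search for expected tuples can be illustrated with a small Python sketch. It assumes, for the sake of a self-contained example, that each parse tree has already been converted into nested tuples of the form `(label, child, ...)` with words as string leaves; the function names, the punctuation set and the order in which the two analyses are tried are hypothetical and are not taken from the parsers' APIs.

```python
PUNCTUATION_LABELS = {",", ".", ":", "``", "''"}


def node_label(node):
    """Label of a tree node, or None for a word (string leaf)."""
    return node[0] if isinstance(node, tuple) else None


def expected_tuples(tree):
    """Collect NP-VP and PP-NP-VP sequences among sibling phrases."""
    if not isinstance(tree, tuple):
        return []
    # Ignore words and punctuation so that "PP , NP VP" still counts.
    children = [
        c for c in tree[1:]
        if isinstance(c, tuple) and node_label(c) not in PUNCTUATION_LABELS
    ]
    labels = [node_label(c) for c in children]
    found = []
    for i in range(len(labels) - 1):
        if labels[i] == "NP" and labels[i + 1] == "VP":
            if i > 0 and labels[i - 1] == "PP":
                found.append(("PP", "NP", "VP"))
            else:
                found.append(("NP", "VP"))
    for child in children:
        found.extend(expected_tuples(child))
    return found


def choose_analysis(opennlp_tree, stanford_tree):
    """Return (parser name, tree) for the first analysis that contains an
    expected tuple, or None if neither does. The order is arbitrary here."""
    for name, tree in (("OpenNLP", opennlp_tree), ("Stanford", stanford_tree)):
        if expected_tuples(tree):
            return name, tree
    return None
```

With this representation, an analysis that attaches the subject and the predicate as sibling NP and VP nodes is accepted, while an analysis lacking such a pair yields an empty list and is ruled out.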

4.3. NER

Most of the syntactic patterns of the medical entities in the guideline fit an NP. Within a sentence, there can be two types of NPs (Huang et al., 2006): maximal NPs, that is, NPs of maximum length that include other phrases, such as NPs, PPs or relative clauses; and base NPs, that is, minimal NPs that do not include other NP descendants. As there is no optimal level of NPs to be mapped to the UMLS, our approach attempts to match maximal NPs (those that were not mapped with SemRep) using the UMLS NormalizeString service. If no mapping is found for a maximal NP, it is divided into its internal NPs or constituents (i.e. nouns, adjectives, etc.) and a request is sent to the UMLS database in order to map them to some UMLS concept. In the sample sentence, only the NP angiotensin receptor blockers is mapped to the pharmacologic substance angiotensin ii receptor antagonist (Figure 4), as some NPs were successfully mapped by SemRep. For the rest, this step could find no mapping.
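The fallback from maximal NPs to their constituents can be sketched as follows. This Python example is only an approximation: the `lookup` dictionary is a hypothetical stand-in for the UMLS NormalizeString service and the UMLS database, `normalize` is a crude substitute for real lexical normalization, and the toy entries are illustrative.

```python
def normalize(text):
    """Crude stand-in for string normalization: lowercase, collapse spaces."""
    return " ".join(text.lower().split())


def map_noun_phrase(maximal_np, constituents, lookup):
    """Try to map the maximal NP; if that fails, map its constituents.

    lookup: dict from normalized string to concept name (hypothetical
            replacement for a terminology service).
    Returns a dict {phrase: concept}; empty if nothing could be mapped.
    """
    concept = lookup.get(normalize(maximal_np))
    if concept is not None:
        return {maximal_np: concept}
    mapped = {}
    for part in constituents:
        concept = lookup.get(normalize(part))
        if concept is not None:
            mapped[part] = concept
    return mapped


# Toy terminology for illustration only.
TOY_LOOKUP = {
    "angiotensin receptor blockers": "angiotensin ii receptor antagonist",
    "cough": "Coughing",
    "angioedema": "Angioedema",
}
```

Under these assumptions, `map_noun_phrase("Angiotensin receptor blockers", [], TOY_LOOKUP)` maps the whole phrase, whereas a phrase such as "cough or angioedema" is only mapped through its constituents.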

4.4. Patient context extraction

The patient context, under which the diagnostic or the therapeutic procedure is recommended, is automatically identified by means of a set of regular expressions that are triggered by types of NPs or PPs. Some of the regular expressions are: in patients with *, in patients who *, in 〈Pathologic Function〉 [with | from | and | or 〈Pathologic Function〉], and in * [patients tolerant | intolerant] to 〈Pharmacologic Substance〉 *, where * may be a finding, a pathologic function or a pharmacologic substance.
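A rough Python rendering of some of these patterns is given below. It is a sketch under simplifying assumptions: the patterns work on plain strings rather than on NPs and PPs typed with UMLS semantic types, the context is taken to end at the next comma, semicolon or full stop, and the function name and sentences used to exercise it are invented.

```python
import re

# Surface-level approximations of three of the patterns described in the text.
CONTEXT_PATTERNS = [
    re.compile(r"\bin patients with (?P<context>[^,.;]+)", re.IGNORECASE),
    re.compile(r"\bin patients who (?P<context>[^,.;]+)", re.IGNORECASE),
    re.compile(
        r"\bin (?:\w+ )*?patients (?:tolerant|intolerant) to (?P<context>[^,.;]+)",
        re.IGNORECASE,
    ),
]


def find_patient_context(sentence):
    """Return the text of the first patient context found, or None."""
    for pattern in CONTEXT_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group("context").strip()
    return None
```

For the invented sentence "Drug A is an alternative to drug B in patients with cough or angioedema." the function returns the span "cough or angioedema", which would then be passed on for entity recognition.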

Once the patient context is recognized in the sentence, the method identifies its central entities and the links among them, which can be AND or OR links. Continuing with the sample sentence (Figure 4), the patient context is represented by means of the two concepts Coughing and Angioedema, linked by the connective OR.
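The identification of the links can be illustrated with a short Python sketch. It assumes that the context is available as a plain string and that a single kind of connective is used; the function name is hypothetical, and the entities returned are surface strings that would still have to be mapped to UMLS concepts.

```python
import re


def split_context(context):
    """Split a context span into its entities and the connective linking them.

    Returns (entities, connective) where connective is "AND", "OR" or None.
    Mixed connectives are rejected: they would need a proper parse.
    """
    parts = re.split(r"\s+(and|or)\s+", context.strip(), flags=re.IGNORECASE)
    entities = [p.strip() for p in parts[0::2]]
    connectives = {p.upper() for p in parts[1::2]}
    if not connectives:
        return entities, None
    if len(connectives) > 1:
        raise ValueError("mixed AND/OR links are not handled by this sketch")
    return entities, connectives.pop()
```

For the context string "cough or angioedema", this returns the two entities together with the connective OR, mirroring the representation described above.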

4.5. Relationship extraction

The relationship extraction phase is carried out in several steps. First, the method identifies the leading concepts in the sentence from the set of both recognized and context entities. Entities corresponding to very generic terms, such as symptom, sign, finding, procedure, therapy, patient, etc. are considered non-leading concepts, whereas the rest of the entities are considered leading concepts. In the sample sentence, all entities (plus the diagnosis context CHF) are classified as leading concepts. Then, a set of predefined rules identifies verbs, nouns and non-leading concepts as relationships in the SN, and entities in the sentences as arguments in the relationships treats, diagnoses or has adverse effects. The relationships are only possible if the STs of the candidate entities match the STs of the arguments in the relationship. In Figure 4, the relationship treats is triggered by the verb uses and the candidate entities angiotensin ii receptor antagonist (pharmacologic substance) and CHF (disease or syndrome). Later, attributes are identified and added to the relationship. Possible attributes of the relationship treats between a pharmacologic substance and a disease or syndrome are other pharmacologic substances that can be administered alternatively or in combination. In the sample sentence, angiotensin ii receptor antagonist is an alternative to angiotensin-converting enzyme inhibitors in CHF. Finally, the patient context is added to the relationship.
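The core of these rules, namely discarding non-leading concepts and checking that the STs of the candidate arguments match the relationship, can be sketched in Python as follows. Only the trigger "uses" and the treats signature (pharmacologic substance, disease or syndrome) come from the example in the text; the data structures and function names are assumptions, and attribute handling is omitted.

```python
GENERIC_TERMS = {"symptom", "sign", "finding", "procedure", "therapy", "patient"}

# Trigger word -> relationship. Only this entry is taken from the text.
TRIGGERS = {"uses": "treats"}

# Relationship -> accepted (subject ST, object ST) pairs. A real rule set
# would list more pairs and the other two relationships.
SIGNATURES = {
    "treats": {("pharmacologic substance", "disease or syndrome")},
}


def is_leading(name):
    """Very generic terms are non-leading concepts."""
    return name.lower() not in GENERIC_TERMS


def extract_relationships(trigger_word, entities, context=None):
    """entities: list of (name, semantic type) pairs found in the sentence.

    Returns one dict per pair of leading concepts whose STs match the
    signature of the triggered relationship, with the context attached.
    """
    relation = TRIGGERS.get(trigger_word.lower())
    if relation is None:
        return []
    leading = [e for e in entities if is_leading(e[0])]
    results = []
    for subj_name, subj_st in leading:
        for obj_name, obj_st in leading:
            if (subj_st, obj_st) in SIGNATURES.get(relation, set()):
                results.append({
                    "subject": subj_name,
                    "relation": relation,
                    "object": obj_name,
                    "context": context,
                })
    return results
```

Called with the trigger "uses", the entities angiotensin ii receptor antagonist and CHF, and the context (Coughing OR Angioedema), the sketch yields a single treats relationship carrying that context.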

5. Results

An evaluation of the method performance was made in order to determine the method accuracy. For each sentence of the case study, one or more entities and relationships were generated, using our approach, and compared to the entities and relationships that were manually marked. For the evaluation, the following criteria were considered. A relationship is interpreted as correct if it (1) contains the two entities described in the sentence as arguments, and a predication, and (2) both the entities and the predication are the same as the manually created ones. A patient context is considered correct if the entities and the resulting expression are the same as the manually created ones.
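The correctness criterion for relationships amounts to an exact match between extracted and manually created triples. A minimal Python sketch, assuming that each relationship is represented as a hypothetical `(entity, predication, entity)` tuple of strings, could be:

```python
def compare_with_gold(predicted, gold):
    """Exact-match comparison of extracted relationships with the gold standard.

    A predicted triple counts as correct only if both entities and the
    predication are identical to a manually created triple.
    """
    predicted, gold = set(predicted), set(gold)
    return {
        "correct": predicted & gold,    # true positives
        "spurious": predicted - gold,   # extracted but not in the gold standard
        "missed": gold - predicted,     # in the gold standard but not extracted
    }
```

The sizes of the three sets are the counts from which precision and recall, as defined in Section 3.3, are computed.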

5.1. Text pre-processing

Of 171 sentences, the results of the identification of NPs, using each parser independently and after an automated checking, achieved 81% precision, 92% recall and 86% F1-measure for the OpenNLP parser, and 68% precision, 96% recall and 80% F1-measure for the Stanford parser. Using a combined approach of the two parsers, our method achieved 86% precision, 96% recall and 91% F1-measure.

5.1.1. Interpretation

Evaluating the discrepancies between the two parsers, we identified several situations. First, the Stanford tool wrongly parsed sentences with a coordinated phrase inside a PP, such as patients with PLVEF or diastolic dysfunction in CHF. Phrases of this type often appear in the sentences and are the main reason for its low precision. Second, some medical names ending in -y (such as ventriculectomy or aneurysmectomy) were incorrectly parsed as adverbs, which also decreased the precision. However, in long and complicated sentences with subordinate phrases, the Stanford parser performs better than the OpenNLP parser.

5.2. NER

Of 171 sentences, the extraction of entities achieved a precision of around 83% and a recall of 91%, using SemRep. In total, 87% of the entities were correctly and completely mapped to the corresponding Metathesaurus concepts (F-measure), using SemRep. These results were substantially improved when we applied the NER step to the NPs not mapped with SemRep: a precision of 98% and a recall of 95%. In Figure 5, we show the precision, recall and F-measure of SemRep alone, and of the combination of SemRep and our NER, for each of the eight STs of our domain. The precision of SemRep for extracting laboratory procedures, findings and pathologic functions was low (around 35%, 70% and 75%, respectively), whereas the recall was only low for therapeutic procedures (around 60%).

Figure 5.

Results of applying the general steps of our approach.

5.2.1. Interpretation

Evaluating the discrepancies between the automated and the manually recognized entities, we have detected two main situations that decrease the precision of SemRep. Note that SemRep uses MetaMap to recognize concepts in the Metathesaurus and MetaMap provides the best candidate entities to cover the text. On some occasions, MetaMap provides more than one candidate covering the same part of the text. For example, for the NP s-creatinine in a sentence, MetaMap provided two best candidates: creatinine (biologically active substance, organic chemical) and creatinine (creatinine finding). In situations like this, SemRep chooses one (possibly at random). We have detected that this operation was made repeatedly on entities corresponding to the ST laboratory procedures. This was a source of low precision. An improvement would be the possibility of configuring this operation when several best candidates are provided by MetaMap. In this way, SemRep would take account of the particular heuristics of each application. On other occasions, entities recognized by SemRep do not agree with those provided by MetaMap. The cause may be that SemRep (via MetaMap) accesses a different UMLS release than MetaMap does when it is run independently. In the sample sentence of this paper, SemRep recognized the NP angiotensin receptor blockers as an angiotensin receptor (a receptor or a protein), whereas running MetaMap independently, the NP is recognized as angiotensin ii receptor antagonist (substance). Again, an improvement would be the possibility of configuring this operation in SemRep.

5.3. Relationship extraction

Of 171 sentences, SemRep extracted 240 relationships, achieving a precision of around 80%. In total, SemRep extracted 16 different types of SN relationships, mainly treats (48%), process of (24%) and coexist with (8%) relationships. As we were only interested in extracting two types of SN predications (treats and diagnoses), initially, our method only took into account the predication treats provided by SemRep (48% of the total). Then, our method correctly ruled out those predications whose arguments were not leading concepts. In total, 42% of the predications treats provided by SemRep were discarded and only 86 treats relationships were extracted. An example of a ruled-out predication is: angiotensin ii receptor antagonist TREATS patients. These results were substantially improved when we applied the second relation extraction to the sentences, which recognized 137 additional relationships. In total, the approach extracted 223 relationships: diagnoses (69), treats (151) and has adverse effects (3), achieving a precision of 82% and a recall of 78%.

5.3.1. Interpretation

Although the precision of SemRep is high, the number of predications relevant for our application is low: 39% of the total extracted predications. However, SemRep is a general tool oriented to recognize a wide range of relationships between all the UMLS concepts identified in the sentence, whereas our method only generates three types of predications between leading concepts and, more importantly, it takes into account the patient context. The latter is carried out by breaking down the sentence into two pieces, the patient context and the predication part, and only the latter is used to provide the relationship. We think that if SemRep were tailored in the same way as our method, the recall would substantially increase. On the other hand, comparative syntactic structures resulted in a wrong identification of leading concepts, and thus of predications. An example of an incorrectly represented phrase is anti-arrhythmic drugs other than beta-blockers. Syntactically complex sentences, including co-references and relative clauses, led to missing or wrong representations of relationships. Another source of errors in both methods is the inappropriate representation of negative sentences.

5.4. Patient context extraction

Of 171 sentences, only 72 sentences include a patient context (42%). Our method achieved 81% precision, 69% recall and 75% F1-measure for context extraction.

5.4.1. Interpretation

We have detected several situations that decrease the performance of our method. Low recall is mainly due to (1) the lack of some recognized entities and (2) the wrong identification of the context by means of the predefined regular expressions. In order to resolve this drawback, in the future, we plan to explore the possibility of also using some predications extracted by SemRep, such as process of or coexist with. Regarding precision, complex syntactic expressions led to wrong or incomplete representations of the context.

6. Limitations and future work

A limiting factor to the breadth of our approach is that it can only be applied to extract descriptive knowledge on diagnostic and therapeutic procedures. This is an important part of knowledge in CPGs, but another important content is the set of actions to be taken in a particular situation. For example, management actions in CHF are assess severity of symptoms, determine aetiology of CHF, etc. The extraction and representation of these types of actions is particularly difficult with the available resources, as many medical actions are not included in these resources or they are classified into semantic categories not useful for our purposes, which makes their automated acquisition error-prone. In addition, this type of knowledge often includes control structures, such as action decomposition or sequencing, which need to be extracted at the same time. Future investigation will explore other alternatives to extract this type of knowledge, such as the use of linguistic patterns (Serban et al., 2007; Peleg & Tu, 2009).

One intention of SemRep is to extract a wide range of predications between the UMLS concepts identified by MetaMap in the text. SemRep achieves a high precision, but when it is applied to a specific domain, such as the case study of this paper, the coverage of extracted predications considered relevant decreases substantially. The issue of suitably tailoring SemRep with less implementation effort is important and it has not been completely solved as yet. In this paper, we propose a method for tailoring, but it still uses excessive resources. As for future work, we aim to consider other ways of tailoring SemRep that take into account the advantages of the method described here: the detection of leading concepts and the extraction of patient context before extracting predications.

On the other hand, the success of electronic CPGs depends not only on the quality of the knowledge used to represent and execute diagnostic and therapeutic recommendations but also on the degree of interoperability with the electronic health record (EHR). One intention of both terminologies, such as SNOMED CT, and archetypes is to facilitate sharing of EHRs. Future investigation will also analyse terminologies and archetypes as mediators to integrate electronic CPGs and EHRs.

Finally, evaluating our method on different clinical texts can help us to determine the scope of the method application. The adaptability of our method will also be tested in the future, in order to extract phenotypic information from clinical notes.

7. Conclusions

This paper confirms the advantage of adapting several open-source tools to extract meaningful therapeutic and diagnostic predications from CPG documents. Our proposal offers a systematic approach to combine different NLP tools and it serves as an example and promising method to automatically encode diagnosis and treatment recommendations with mappings to standardized terminology. In particular, this information is relevant for medical decision support, promoting the guideline content to be delivered automatically at the point of care.


This work has been funded by the Spanish Ministerio de Educación y Ciencia, through the national research projects HYGIA (TIN2006-15453-C04-02) and TermiMed (TIN2009-14159-C05-05).


  • Maria Taboada is associate professor of computer science and artificial intelligence at University of Santiago de Compostela. Her current research interests include knowledge engineering, ontology and terminology mapping, and archetype modelling in medicine.

  • Maria Meizoso is research assistant at University of Santiago de Compostela. Her current research interests include knowledge extraction and natural language processing of medical guidelines.

  • Diego Martinez is associate professor of applied physics at the University of Santiago de Compostela. His current research interests include knowledge, ontology engineering and qualitative simulation.

  • David Riano is associate professor of computer science at the Universitat Rovira i Virgili and head of the research group on artificial intelligence and medical informatics. His current research interests include healthcare knowledge management, data mining and machine learning.

  • Albert Alonso is senior researcher at Hospital Clinic (Information Systems Directorate). His current research interests include new models of integrated care services targeting chronic conditions and health technology assessment and evaluation methods in eHealth.

