Papers written by researchers in medical and life sciences are a valuable source of information even for non-experts looking for knowledge related to rare diseases, but only if those non-experts can read English. If researchers create descriptors of their papers in the form of description logic (DL) ABoxes (assertion components) according to a DL ontology, then by using currently available software, computers can reason over the ABoxes to infer semantic consequences of the assertions in the descriptor. One open issue is how best to render information contained in the ABox for a particular user based on that user's knowledge requirements and background knowledge, including language preference. Natural language generation (NLG) is a method for rendering computer-interpretable statements (content models) in human-readable form (natural language text). ABoxes could be used as content models with particularly rich semantics for NLG. In particular, ABoxes could be used to generate expressions of expert knowledge in languages other than the original language that are more accurate and better tailored to the user's cognitive state than the output of existing methods for translating scientific papers. A method for generating natural language expressions from ABoxes in English and Japanese is presented and compared with a state-of-the-art expert translation software package.
A large amount of expert knowledge in medical and life sciences has recently become freely available over the web through services such as PubMed, PLoS (Public Library of Science) and BioMed Central. This information is accessed not only by researchers and medical practitioners, but also increasingly by laypersons, such as friends and family members of patients suffering from rare diseases that are difficult to treat, who seek to discover new knowledge related to potential cures for the disease in question. However, because the contents of these information services are currently almost entirely in English, their usefulness is essentially restricted to native English speakers. In a country where the mother tongue is not English, non-experts are unlikely to profit from the shared information unless its content is translated into the native language. Furthermore, the problem of discovering expert knowledge, for example knowledge related to treatment of a rare disease, is compounded by recent increases both in the complexity of medical treatment and in the volume of available information that might be related to the treatment (Beasley 2000, Shenk 1997). One reason is that conventional media for sharing scientific knowledge such as papers and conference presentations, although well-tested and reliable, remain static, specific, and non-interactive (Natarajan et al. 2005, Gerstein et al. 2007).
There are essentially two ways to provide translations of natural language text (figure 1). One is to do it by hand, and the other is to use computer programs for automatic translation. Translating documents by hand is time-consuming and expensive. On the other hand, while software for automatically translating from one language to another has come a long way, automatic translations still suffer from frequent errors in meaning, so automatic translation cannot yet attain the level of semantic accuracy required to meet the needs of laypersons searching for medical knowledge.
Automated Discourse Generation or Natural Language Generation (NLG) is a branch of Natural Language Processing (NLP) concerned with creating natural language artifacts (e.g. texts and speech) automatically (Bateman 2002). In essence, NLG works in the opposite direction to conventional NLP. While conventional NLP, such as information retrieval and information extraction, seeks to extract meaning from natural language in a computer-interpretable way, NLG seeks to generate natural language representations of computer-interpretable content, representations that read naturally and are easy for humans to understand. Several researchers in the field of NLG have attempted to apply NLG techniques to give computers a more intelligent role in translating natural language text from one language to another (Power et al. 2003, Paris et al. 1995). Applications of NLG to language translation can be understood as replacing the conventional direct translation process from one language to another with an indirect process whereby the source document is initially converted into a computer-interpretable description written in some formalized or “controlled” language, e.g. an interlingua. That computer-interpretable descriptor is then used to build a natural language document in the target language (Paris et al. 1995). Consequently, NLG offers a third approach to converting documents from one natural language to another (figure 1).
NLG is based on using a computer-interpretable content model to generate human-readable text. Many different content models have been proposed for different types of NLG applications, ranging from standard tree structures to more flexible DAG representations (Nicolov et al. 1996, Reiter 1995). A particularly attractive type of content model is one based on some form of logic, such as a description logic. Description logics (DL) are a family of logics based on first-order predicate logic that are defined by the types of operators available for describing concepts and relations in terms of various primitives. In particular, the SHOIN “flavor” of DL has attracted attention because it is implemented in the OWL-DL web ontology language, which has been proposed by the World Wide Web Consortium to describe ontologies on the Semantic Web (World Wide Web Consortium 2004). A DL includes two main components. The first is a terminological component, called a TBox, which provides a controlled vocabulary mapping to concepts and relationship types in the domain described. The second is an assertion component, called an ABox, which is populated with instances of TBox classes and relationships asserted between pairs of instances.
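The split between TBox and ABox can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption: the class and property names are hypothetical, and the domain and range checks are a closed-world simplification (an OWL-DL reasoner would instead use domain and range axioms to infer class membership).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TBox:
    """Terminological component: a controlled vocabulary of classes and properties."""
    # class name -> parent class name (None for a root class)
    classes: Dict[str, Optional[str]]
    # property name -> (domain class, range class)
    properties: Dict[str, Tuple[str, str]]

    def is_a(self, cls: str, ancestor: str) -> bool:
        """Walk up the class hierarchy to test whether cls is subsumed by ancestor."""
        current: Optional[str] = cls
        while current is not None:
            if current == ancestor:
                return True
            current = self.classes[current]
        return False


@dataclass
class ABox:
    """Assertion component: instances of TBox classes and relationships between them."""
    tbox: TBox
    # instance id -> class name
    instances: Dict[str, str] = field(default_factory=dict)
    # (subject id, property name, object id)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_instance(self, ident: str, cls: str) -> None:
        if cls not in self.tbox.classes:
            raise ValueError(f"unknown class: {cls}")
        self.instances[ident] = cls

    def relate(self, subj: str, prop: str, obj: str) -> None:
        # Simplified check: subject and object must fit the property's domain and range.
        domain, rng = self.tbox.properties[prop]
        if not (self.tbox.is_a(self.instances[subj], domain)
                and self.tbox.is_a(self.instances[obj], rng)):
            raise ValueError(f"{prop} cannot link {subj} to {obj}")
        self.relations.append((subj, prop, obj))


# Hypothetical vocabulary, not taken from any real ontology.
tbox = TBox(
    classes={"Thing": None, "Organism": "Thing", "Fish": "Organism", "Protein": "Thing"},
    properties={"isExpressedIn": ("Protein", "Organism")},
)
abox = ABox(tbox)
abox.add_instance("pigment1", "Protein")
abox.add_instance("fish1", "Fish")
abox.relate("pigment1", "isExpressedIn", "fish1")
```

The TBox here only fixes what may be said; the ABox holds what a particular author actually asserts, which is the part that later serves as a content model.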
Bateman proposed the use of ontology engineering techniques and DL to create an Upper Model for “organizing domain knowledge appropriately for linguistic realization” (Bateman 1990). Several researchers have investigated the effectiveness of using different forms of DL as content models for NLG (Danlos & El Ghali 2002, Dymetman 2002, Bateman et al. 2005, Kruijff-Korbayova & Kruijff 1999). For example, Gabsdil and colleagues proposed a content model based on DL as an engine for text-based computer adventure games (Gabsdil et al. 2001).
A web-based platform, called EKOSS for Expert Knowledge Ontology-based Semantic Search, has been developed to support computer-aided sharing, discovery and integration of expert knowledge (Kraines et al. 2006). Through EKOSS, people interested in sharing expert knowledge can create computer-interpretable descriptors of that knowledge in the form of DL ABoxes. In addition to various knowledge sharing and mining applications, these ABoxes could be used as content models for NLG.
In this paper, we report a method for using the ABoxes created in the EKOSS system to generate natural language text in the two main languages that are supported by the EKOSS system: English and Japanese. Section 2 gives details of the tools and components that form the basis of our NLG approach. Section 3 presents results of a comparative analysis of the NLG text with text created by a state-of-the-art expert translation software package. Section 4 gives a discussion of the results, and Section 5 suggests possible applications of the approach of generating natural language text from DL ABoxes and ontologies.
2.1 EKOSS: A Semantic Web system for authoring ABoxes
Semantic Web technologies could be used to enable people interested in sharing expert knowledge to create computer-interpretable descriptors of that knowledge (Berners-Lee & Hendler 2001, Wang et al. 2005). Expert knowledge often needs to be described not simply with a “bag of words”, but by also giving the specific relationships between the entities represented by those words (Sheth et al. 2003, Saric et al. 2006). Describing knowledge with explicit, typed relationships that capture the way in which one entity modifies or interacts with another requires a representation language that supports such relationships between concepts, such as an ontology grounded in some DL (Di Noia et al. 2007). EKOSS provides a collaborative web environment that lets users create the descriptors, called semantic statements, using DL ontologies represented in OWL-DL as formalized knowledge representation languages. The administrator of an EKOSS system can load as many ontologies as desired. A user of the EKOSS system can then choose from any of the available ontologies to create a semantic statement using web interfaces provided by the system. EKOSS uses the TBox of the DL ontology to provide the language for expressing knowledge in the targeted domain. The actual shared knowledge is then described by the creators of that knowledge with ABoxes that are populated with instances of TBox classes to represent entities in the scope of the shared knowledge together with relationships between the instances that are stipulated in the shared knowledge. EKOSS utilizes those semantic statements to provide a number of services, such as semantic searching and knowledge mining (Kraines et al. 2006).
2.2 A DL ontology for life sciences
Many ontologies have appeared in the domain of life sciences and medicine. Most of the ontologies are expressed as directed acyclic graphs in formats such as OBO, with little or no information on the types and logical properties of the relationships. Therefore, those ontologies are not suitable for the approach we have proposed. There are several major ontologies in life sciences and medicine that are founded in a description logic, and most of those have representations in OWL-DL. In particular, we initially tried using the GALEN ontology (Rector & Rogers 2000). However, given the focus on Medline papers, we decided to apply the basic knowledge model in the GALEN ontology to a subset of the MeSH (“Medical Subject Headings”) controlled vocabulary (US National Library of Medicine 2007), a subset that corresponded to the corpus of text that we were targeting (work in progress). The result is a DL ontology with an upper level structure roughly corresponding to the GALEN ontology that contains about 1500 classes corresponding to MeSH terms and about 100 properties that are based on the UMLS semantic network (McCray 2003). The preexisting taxonomic structure of the MeSH terms was analyzed and reclassified in terms of the relationships described by the properties from the UMLS semantic network, mainly using existential restrictions (OWL-DL someValuesFrom restrictions). The top level structure of the ontology is shown in figure 2.
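As a hedged illustration of what such an existential restriction looks like in DL notation, consider the axiom below. The class and property names (Hepatocyte, Cell, Liver, isPartOf) are hypothetical placeholders chosen for readability; they are not taken from the ontology described here.

```latex
\[
\mathit{Hepatocyte} \sqsubseteq \mathit{Cell} \sqcap \exists\, \mathit{isPartOf}.\mathit{Liver}
\]
```

The axiom reads: every instance of Hepatocyte is a Cell that is related by isPartOf to at least one instance of Liver. In OWL-DL, the second conjunct corresponds to a someValuesFrom restriction on the isPartOf property.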
2.3 Using Semantic Statement ABoxes for NLG
The approach of sharing knowledge through human authored ABoxes is based on the following premises:
1. Computer-interpretable descriptors of scientific papers can be more accurate and semantically rich if they are created by the original authors, possibly as a part of the publication process.
2. Researchers have a strong incentive to create such descriptors, which we call “semantic statements”, because computers are increasingly relied upon to provide “match-making” services between knowledge providers and knowledge users.
3. Using existing technologies, a computer platform for sharing knowledge encapsulated in scientific papers using computer-interpretable semantic statements created by the authors could be constructed that is easy enough to use and that provides sufficient benefit to authors to motivate them to use the system.
Much skepticism has been expressed regarding the feasibility of getting researchers to create the kind of descriptors that we describe here (e.g. Hahn et al. 2007). However, we believe that the problem is not that researchers are unable to create such descriptors, but rather that they are unwilling, mainly because the benefits of doing so are not clear. The work presented here aims to provide a clear form of “instant gratification” from creating semantic statements. Adopting a hybrid approach that integrates natural language processing techniques to assist human authors in creating semantic statements might be an effective way to ease the burden on the researcher (Craven 2000, Rzhetsky et al. 2008).
Consequently, the NLG approach we propose for translating natural language text actually involves a human component, but upstream of the content model creation process instead of downstream (figure 1). This offers several advantages. First, because the human author of the knowledge resource takes responsibility for creating the computer-interpretable descriptor of the resource, that descriptor can be expected to reflect the author's intended meaning accurately. Second, the human author does not need any knowledge of the target language, only an understanding of the ontology used to create the descriptor. Third, because the computer is provided with a descriptor of the shared knowledge resource that it can “understand” or reason against, it can make accurate representations of the meaning in any natural language for which an appropriate mapping is provided, representations that in principle should not have any errors in meaning. In fact, representations can be generated in other kinds of “language”: for example, a computer could take a research paper by a molecular biologist on the effect of a particular protein in a disease-related metabolic pathway and generate a description of it in a language understandable to a public health-care worker with only basic knowledge of biology. Moreover, because the component for rendering the natural language representations in the target language is computerized, we can set parameters for tuning the generation algorithm, for example to balance between verbosity and simplicity, or accuracy and readability (Bateman et al. 2005).
Using the EKOSS system, 392 semantic statements have been created mainly by graduate students in life sciences as DL ABoxes based on the life sciences ontology presented in section 2.2 that describe the semantic contents of the abstracts of papers selected from Medline. The students were able to create one statement, with an average of 19 instances and 22 relationships, in about 3 hours following a short tutorial session. This provides evidence that researchers can indeed create computer-interpretable semantic descriptors of their papers if they choose to do so. The semantic statements act as computer-readable semantic representations of the human-readable scientific papers. For example, in other work we have applied automatic semantic graph matching algorithms to those semantic statements to generate network graphs showing the semantic similarity of the research topics of the different papers (work to be published).
As discussed above, the semantic statement ABoxes could also be used as content models for NLG in order to generate human-readable representations of the knowledge codified in the ABoxes in any language for which there is a mapping provided from DL representation to natural language representation. Because those natural language representations are based on content models created by the original authors, they should be more accurate than translations generated by computer software and even by human translators who may not have the same level of understanding of the represented knowledge as the original author (Paris et al. 1995, Warner 2007, Gil-Leiva & Alonso-Arroyo 2007, Seringhaus & Gerstein 2008). Furthermore, although we do not address them here, application of standard components of natural language translation software packages for rendering the finished text should make it possible to ensure that the text generated by the NLG is highly readable. As an additional merit, the natural language representations of the DL ABoxes could provide a useful feedback mechanism for helping the creator of a semantic statement to confirm that the intended meaning is accurately reflected in the ABox (Power et al. 1998).
We have used a template approach for sentence planning and surface realization of natural language text representing the content of the ABoxes constructed based on DL ontologies. The classes and properties of the ontology are all given short descriptive phrases in each of the supported languages (English and Japanese here). We represent each instance in the ABox with the descriptive phrase for the class of the instance in the target language together with a simple translation of the instance label if the user provided one. In English, this is rendered as “a classname instancelabel” if an instance label is provided, or just “a classname” if no label is given. Subsequent renderings of the instance are given in the form “the instancelabel” or “the classname”. For each instance in the ABox that is the subject or domain of a property relationship with another instance, a sentence is generated with that instance as the subject and each property leading from that instance to another instance as a predicate phrase. Properties are rendered with the descriptive phrase in the target language that includes anchors for the subject and object of the property relationship. We also include some simple techniques to handle plurality and sequences of identical relationships. The techniques for rendering the Japanese natural language text are similar, with appropriate reversals of sentence order to handle the differences of grammar between English and Japanese.
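The English rendering rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EKOSS implementation: the phrase tables are hypothetical, property phrases are stored as predicate phrases with only an `{o}` anchor (the subject is implicit at the start of the sentence), and plurality, a/an selection, and merging of identical relationships are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instance:
    ident: str                    # unique id within the ABox
    cls: str                      # TBox class of the instance
    label: Optional[str] = None   # optional user-supplied instance label


# Hypothetical English descriptive phrases for classes and properties.
CLASS_PHRASE = {"Pigment": "pigment", "Fish": "fish", "Wavelength": "wavelength"}
PROPERTY_PHRASE = {
    "isExpressedIn": "is expressed in {o}",
    "absorbs": "absorbs {o}",
    "has": "has {o}",
}


def noun_phrase(inst, mentioned):
    """First mention: 'a classname [instancelabel]'; later: 'the instancelabel' or 'the classname'."""
    cls_phrase = CLASS_PHRASE[inst.cls]
    if inst in mentioned:
        return f"the {inst.label or cls_phrase}"
    mentioned.add(inst)
    return f"a {cls_phrase} {inst.label}" if inst.label else f"a {cls_phrase}"


def render(relations):
    """Turn (subject, property, object) triples into one sentence per subject instance."""
    by_subject = {}
    for subj, prop, obj in relations:
        by_subject.setdefault(subj, []).append((prop, obj))
    mentioned = set()
    sentences = []
    for subj, predicates in by_subject.items():
        subject_np = noun_phrase(subj, mentioned)
        phrases = [PROPERTY_PHRASE[prop].format(o=noun_phrase(obj, mentioned))
                   for prop, obj in predicates]
        sentence = f"{subject_np} {' and '.join(phrases)}."
        sentences.append(sentence[0].upper() + sentence[1:])
    return sentences


# Hypothetical ABox fragment.
pigment = Instance("i1", "Pigment", "SWS2")
goldfish = Instance("i2", "Fish", "goldfish")
wavelength = Instance("i3", "Wavelength")
relations = [
    (pigment, "isExpressedIn", goldfish),
    (pigment, "absorbs", wavelength),
    (goldfish, "has", pigment),
]
for line in render(relations):
    print(line)
```

A Japanese renderer along the same lines would need its own phrase tables and a sentence frame that places the predicate after the object; that part is left out here because it depends on language-specific templates the sketch does not attempt to reproduce.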
In order to illustrate the effectiveness of the NLG approach described here, we present an example using a semantic statement that was authored based on the abstract of the paper “Spectral differentiation of blue opsins between phylogenetically close but ecologically distant goldfish and zebrafish” (Chinen et al. 2005). To avoid introducing bias, this paper was chosen randomly from the corpus of semantic statements that we created using the life sciences ontology. A graph view of the semantic statement is shown in figure 3. In figures 4 to 8, we show the original abstract, sentence by sentence. For each sentence of the abstract, we also show both the Japanese and English texts that are generated using NLG based on the semantic statement ABox together with an automatic translation to Japanese created by the medical translation software “MedTranser2008” (Cross Language 2008). We have indicated in red the places in the Japanese text created by “MedTranser” where the translation failed to capture the original semantics correctly, and we give a direct translation of the “MedTranser” output back to English to illustrate where “MedTranser” fails to translate the semantics of the original abstract correctly. For the Japanese text created by NLG, the Japanese versions of class names and property names are drawn from the labels given in the life sciences ontology, and the instance labels, shown in blue, are translated using a basic freeware translation tool (http://translate.google.com/translate_t#).
Overall, as assessed by our Japanese colleagues, “MedTranser” creates an excellent translation that is easy to read and essentially correct grammatically. However, we identified several important mistranslations that could lead to serious misunderstandings by readers of the translated text.
In the “MedTranser” translation of the first sentence shown in figure 4, the main problem is that the position of the word “ryouhou”, Japanese for “both”, is incorrect. Also, the scientific term “family” has been translated to the general word for a household family: “kazoku”. Although the translation generated using the NLG approach is not quite as natural as the “MedTranser” translation and misses the part about the differing surroundings, the meaning of the text that is generated is accurate.
The translation by “MedTranser” of the sentence in figure 5 is almost completely incorrect. As in the previous sentence, although the text generated based on the semantic statement is incomplete and somewhat convoluted, the rendered text is accurate and precise, containing no errors in meaning.
In the sentence shown in figure 6, the problem in the “MedTranser” translation is that the position of the word “pi-ku” (peak) is incorrect. The text rendered by the NLG not only handles this problem, it also clarifies a number of other ambiguous expressions in the original sentence.
In the sentence shown in figure 7, the technical term “residue 94” is incorrectly translated to “nokori94”, which means the remaining 94. Because the NLG approach draws on the expert vocabulary provided by the life sciences ontology, the term “residue 94” is given the correct meaning of “residue located at the 94th position of the amino acid chain”, thereby addressing an important ambiguity in the original text as well.
In the sentence shown in figure 8, the problem with the “MedTranser” translation is that the translation given for the word “reconstructed”, “saiken shita” whose direct translation is “rebuilt”, does not fit accurately with the sentence context. Also, the term “likelihood-based” has been incorrectly translated using the common Japanese word for possibility: “kanousei”. As in the previous sentence, the NLG approach can draw on the ontology to give the correct term for “likelihood-based”. Although the NLG approach is not able to provide a direct translation of “reconstructed” either (due to the limited scope of the ontology), it represents the semantic essence – the relationship between “ancestral SWS2 pigment”, “likelihood-based Bayesian statistics” and “site-directed mutagenesis” – clearly and accurately.
4. Discussion of results
The analysis of the translation results in the previous section identified several errors in semantics occurring even with the state-of-the-art medical translation software “MedTranser2008”. The NLG approach based on EKOSS semantic statements in the form of DL ABoxes can give semantically accurate translations in each of those cases because the computer NLG algorithm is provided with a clear and accurate semantic representation of the content to be rendered. Although this may appear a foregone conclusion, even when using a computer-interpretable interlingua, it is an open question whether such a controlled language can be expressive enough to convey the semantics of the main ideas expressed in the original text (Power et al. 2003). Any computer-interpretable knowledge representation language is necessarily a simplification of human language, and even in the single case examined here, the life sciences ontology could not represent several expressions in the original text. On the other hand, due to the logical rigor of the ontology, the author of the semantic statement was forced to clarify the ambiguities in the original text, resulting in final natural language texts created through the NLG approach that were clearer and more precise than even the original text upon which they were based. This is a result of the active human involvement in the translation process, specifically in the disambiguation step involved in creating the computer-interpretable representation from the original natural language text (Paris et al. 1995).
The natural language output that we have given is not as readable as the output from the translation software. However, we believe that this is simply an issue of the sophistication of the surface realization and morphology steps of the text generation process. Our work has focused on the issues of strategic selection of content – “what to say” – rather than tactical expression of content – “how to say it” (Bateman 2002). Reiter noted that an effective approach to automatic text generation is to use NLG techniques for content determination and sentence planning, and to use non-NLG techniques such as templates for low-level syntactic realization (Reiter 1995). It should be possible to apply the surface realization techniques used in automatic translation software packages, such as “MedTranser”, to produce a more polished and readable output from a semantic statement ABox.
5. Conclusions and applications
A long-term goal of the EKOSS project is to integrate the step of creating computer-interpretable semantic descriptors into the overall process of submitting, reviewing, and publishing scientific papers (Kraines et al. 2006, Seringhaus & Gerstein 2008, Berners-Lee & Hendler 2001). By doing this, it will be possible to provide knowledge sharing and discovery services based on inference of specific semantic relationships in the same way that many journals today use author-selected keywords attached to published papers to aid in topic searches. Furthermore, if computer-interpretable descriptors can be accumulated in this way, it should be possible to apply the ideas demonstrated in this paper for generating accurate translations of the semantic content of the original paper, using different languages and even different worldviews, to repositories of scientific publications on the scale of Medline.
Additionally, the research described here could play an important role in assisting authors of scientific papers to create computer-interpretable descriptors more easily and accurately, a critical factor in achieving this long-term goal (Kraines et al. 2006). As the human author creates the computer-interpretable descriptor, e.g. the EKOSS semantic statement ABox, the NLG can automatically generate a natural language expression of the ABox in real time (Power et al. 1998). This feedback helps the author check that the ABox does in fact reflect the intended meaning accurately. The NLG techniques described here could be integrated with the overall EKOSS system in order to realize a more user-friendly platform for creating and sharing computer-interpretable descriptors of expert knowledge in the life sciences and other fields of scientific research, from which a wide range of computational services could be provided.
We thank Toshihisa Takagi, Daisuke Hoshiyama, Takaki Makino, and Haruo Mizutani for suggestions and comments regarding the work presented in this paper, particularly the quality of the Japanese language text produced by automatic translation and by natural language generation from the semantic statement. Hideo Ogimura created the semantic statement used in this paper as part of a corpus of statements based on abstracts of papers from Medline. We are grateful to the President's Office of the University of Tokyo for providing funding support.