Standard Article

Ontologies for natural language processing

Part 4. Bioinformatics

4.7. Structuring and Integrating Data

Specialist Review

  1. Yves A. Lussier

Published Online: 15 APR 2005

DOI: 10.1002/047001153X.g408212

Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics

How to Cite

Lussier, Y. A. 2005. Ontologies for natural language processing. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. 4:4.7:84.

Author Information

  1. Columbia University, New York, NY, USA


Rapid technological improvements in biomedical computational semantics and natural language processing (NLP) are profoundly transforming the reuse of knowledge found in scientific journal articles and semistructured biomedical databases. Both contain textual entries that can be transformed and annotated using formal language representations, description logic, and ontologies (such as the Gene Ontology), and then stored in ontology-anchored databases. Recent advances in the standardization of electronic document interchange formats (e.g., XML, RDF, OWL) also contribute to the reuse of information and knowledge. While the scientific biomedical literature remains the pinnacle of knowledge in breadth and depth compared with derivative databases (e.g., PubMed), its role is challenged by alternate receptacles of original knowledge, such as biomedical databases containing semistructured textual entries and highly computable data (e.g., GenBank). As a result of the development of digital libraries and semistructured textual databases, automated tools are increasingly being researched to create data structures expressed as declarative knowledge, a highly computable representation based on logic. Although literary knowledge currently exceeds declarative knowledge in quantity, its notable value is compromised by the cost of retrieval, because it is buried in an overwhelming and growing number of scientific articles. Similarly, populating biomedical databases from scientific journals, generally accomplished via manual annotation, is a rate-limiting and costly process. In contrast, automated data structures derived from NLP portend instantaneous and comprehensive linguistic knowledge across boundless scientific articles and research communities.
Increasing effort has been invested in translating the linguistic data structures generated by NLP into ontology-anchored declarative data sets, in order to obtain otherwise unattainable large-scale or cross-disciplinary inferences. Additionally, NLP based on Harris' Sublanguage Theory challenges the very nature by which we conceive and maintain biomedical ontologies. This article focuses on the challenges brought on by the convergence of biomedical ontologies, originally developed to structure databases, and linguistic data structures produced by NLP operating on unstructured or semistructured corpora. The theories underlying the convergence of NLP, biomedical informatics, and ontologies are addressed first, followed by a description of the properties of ontologies that make them suitable for NLP, and a succinct analysis of their readiness for use by NLP systems.


  • ontologies;
  • natural language processing;
  • controlled terminologies;
  • coding;
  • indexing;
  • computational semantics;
  • semantic networks;
  • computational terminologies;
  • biomedical databases;
  • knowledge bases