Universal abstracting



Abstracts are brief summaries of the content of a work, and they have long been used to improve international accessibility and/or dissemination, e.g., in the form of English abstracts for articles published in non-English languages.

Universal abstracts are similar in that they summarize the meaning of a work, but the indexer creates them in a special lingua franca that makes them available in any language, not just, say, English.

Universal abstracting is performed by an indexer using software that guides him or her in creating a language-independent summary of the abstracted work. The indexer writes the abstract in a stylized form of his or her own language; internally, the software builds a knowledge representation that combines multilingual controlled vocabularies with a universal grammar based on Montague semantics for natural language. This form enables high-quality automatic translation, because the internal representation is universal and localizes to any language or mode of communication.

State of the art

The feasibility of a universal knowledge representation that localizes to a large number of languages was shown by the WebALT project (Strotmann & Seppälä, 2005; Strotmann, Ng'ang'a, & Caprotti, 2005; Ng'ang'a, 2006) for the case of small pieces of mathematics education content. While the language fragment was purposely kept small for the project demonstrator, it handled several languages, including Finnish and several Germanic and Romance languages, and complex sentence structures in all of them. The purpose of the project's multilingual component was to overcome a problem that has so far been intractable for traditional machine translation methods: the translated content had to be guaranteed to express exactly the intended meaning of the original, and to express it correctly.

Theoretical background

The fundamental idea behind universal abstracting is to take multilingual thesauri and add a universal grammar component to compose the individual concepts contained in the thesaurus into grammatical text.

The universal grammar first and foremost needs to support a compositional formal semantics. This enables us to systematically construct phrases, sentences, and texts that both express the intended meaning and do so syntactically correctly, and to do so in any language or mode of expression.

The formal (or Montague) semantics branch of linguistics has long used the lambda calculus as a formal knowledge representation that compositionally mirrors both the syntax of a language fragment and its meaning. The Lambek calculus has served as a mathematical framework for expressing the compositional rules that map between syntactic expressions and their formal representations (van Benthem & ter Meulen, 1997).

Linguists traditionally start from a text fragment and derive a lambda expression that represents its meaning. In WebALT, we took the converse approach: the indexer uses a tool to create something like a lambda expression as a knowledge representation, which is later mapped compositionally, concept by concept and structure by structure, to grammatical text that expresses its meaning in a natural language, in fact in any natural language. As an example, the sentence "Calculate the derivative of f at 1, where f is the inverse of the polynomial x^2 + 1." is created in the form of the (OpenMath) lambda term (lambda [f]. (diff(f))(1)) (inverse(lambda [x]. plus(power(x,2),1))) (Strotmann & al., 2005). This term is translated automatically into the above English sentence, or into a sentence in any other language for which translations of the terms (e.g., ‘derivative’ for diff) and a language generation module are available.
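The compositional mapping from a lambda term to text can be illustrated with a minimal Python sketch. It is not the WebALT or GF implementation: the class names, the two-entry THESAURUS, the phrase templates, and the "calculate", "at", and "where" markers are all assumptions made for illustration, and the sketch renders an anonymous function simply by its body, so it omits the word "polynomial" from the example sentence.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    """Lambda abstraction, e.g. lambda [x]. body."""
    param: str
    body: "Term"


@dataclass(frozen=True)
class Con:
    """A thesaurus concept applied to argument terms, e.g. plus(x, 1)."""
    concept: str
    args: Tuple["Term", ...]


@dataclass(frozen=True)
class Apply:
    """Application of one term to another, e.g. (diff(f))(1)."""
    fn: "Term"
    arg: "Term"


Term = Union[Num, Var, Lam, Con, Apply]

# Hypothetical multilingual thesaurus: concept id -> phrase template.
THESAURUS: Dict[str, Dict[str, str]] = {
    "en": {
        "plus": "{0} + {1}",
        "power": "{0}^{1}",
        "inverse": "the inverse of {0}",
        "diff": "the derivative of {0}",
        "at": "{0} at {1}",
        "where": "{0}, where {1} is {2}",
        "calculate": "Calculate {0}.",
    },
    "de": {
        "plus": "{0} + {1}",
        "power": "{0}^{1}",
        "inverse": "die Umkehrfunktion von {0}",
        "diff": "die Ableitung von {0}",
        "at": "{0} an der Stelle {1}",
        "where": "{0}, wobei {1} {2} ist",
        "calculate": "Berechne {0}.",
    },
}


def linearize(term: Term, lang: str) -> str:
    """Map a term to text, one concept and one structure at a time."""
    lex = THESAURUS[lang]
    if isinstance(term, Num):
        return str(term.value)
    if isinstance(term, Var):
        return term.name
    if isinstance(term, Lam):
        # Simplification: an anonymous function is rendered by its body.
        return linearize(term.body, lang)
    if isinstance(term, Con):
        parts = [linearize(a, lang) for a in term.args]
        return lex[term.concept].format(*parts)
    if isinstance(term, Apply):
        arg = linearize(term.arg, lang)
        if isinstance(term.fn, Lam):
            # (lambda [v]. body)(arg) reads as "body, where v is arg".
            body = linearize(term.fn.body, lang)
            return lex["where"].format(body, term.fn.param, arg)
        return lex["at"].format(linearize(term.fn, lang), arg)
    raise TypeError(f"unsupported term: {term!r}")


# The example term, wrapped in an assumed imperative marker "calculate".
TERM = Con("calculate", (
    Apply(
        Lam("f", Apply(Con("diff", (Var("f"),)), Num(1))),
        Con("inverse", (
            Lam("x", Con("plus", (Con("power", (Var("x"), Num(2))), Num(1)))),
        )),
    ),
))

for lang in ("en", "de"):
    print(linearize(TERM, lang))
```

Adding a language in this sketch means adding one more template table; the term itself stays unchanged. A real generation engine additionally has to handle agreement, inflection, and word order, which fixed templates cannot do.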

Universal abstracting in practice

There are not many scientists in the world able to compose text fragments in this formal semantic form from scratch. A key contribution of the WebALT project was, therefore, an editor (WebALT Project, 2006) that lets non-specialists create this form of text. Using it, an indexer creates a universal abstract in a stylized form of his or her own language, restricted to words and phrases taken from a multilingual thesaurus, which are composed into complete sentences and short texts. The editor acts as a phrase or word completion mechanism while enforcing the use of controlled vocabulary.
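The completion-and-enforcement behaviour described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the WebALT editor: the VOCABULARY entries, the concept identifiers, and the function names complete and accept are hypothetical.

```python
from typing import Dict, List

# Hypothetical controlled vocabulary: word in the indexer's language -> concept id.
VOCABULARY: Dict[str, str] = {
    "derivative": "diff",
    "integral": "int",
    "inverse": "inverse",
    "polynomial": "polynomial",
}


def complete(prefix: str, vocabulary: Dict[str, str]) -> List[str]:
    """Offer every controlled-vocabulary word that starts with the typed prefix."""
    p = prefix.casefold()
    return sorted(word for word in vocabulary if word.startswith(p))


def accept(word: str, vocabulary: Dict[str, str]) -> str:
    """Return the concept id for a word, rejecting anything outside the vocabulary."""
    key = word.casefold()
    if key not in vocabulary:
        raise ValueError(f"'{word}' is not in the controlled vocabulary")
    return vocabulary[key]


print(complete("in", VOCABULARY))
print(accept("Derivative", VOCABULARY))
```

Because only accepted words reach the internal representation, every concept in the resulting abstract is guaranteed to have a thesaurus entry that can later be translated.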

While the input language for this editor is stylized to keep things feasible, the text generation mechanism that translates the resulting formal abstract into any number of languages will produce fully grammatical text which expresses the intended meaning in any language that has available (a) a translation of the specific controlled vocabulary and (b) a general natural language generation engine.

The WebALT project used the open-source GF (Grammatical Framework) software both to define the stylized input language for the editor and for the natural language generation component. GF (Ranta, 2004) “knows” the basic syntactic structures of a number of languages; as of 2009, efforts are underway to complete support for the set of all 25 official European Union languages (Ranta, 2009). This is made possible by GF's ability to let structurally related languages share significant amounts of code.

As its formal semantic representation, the WebALT project used OpenMath (Abbott & al., 1998), a markup language that its members had helped develop as a formal knowledge representation for mathematics. As used in that project, OpenMath was structurally a variant of the lambda calculus, and it defined its own language-independent and extensible ontology for basic mathematical concepts. This formal semantic language was extended with a small set of linguistic markers in order to create a fully functional knowledge representation with support for fundamental linguistic features. The project also extended the language-independent ontology defined by OpenMath into a multilingual thesaurus for mathematical concepts.

Enabling universal abstracting

In the LIS community, multilingual thesauri are a well-researched topic, and in some areas such as medical informatics they are under intense and already well-advanced development (e.g., Marko & al., 2006). For universal abstracting to become feasible, it is therefore first and foremost the universal grammar component that needs to be added, although the multilingual editing and natural language generation components are also still under development.

In this poster, we therefore propose investigating the possibility of introducing universal abstracting into international library practice. We expect that all components of this innovative approach to universal accessibility of knowledge and information will grow through this integration, in particular the theory and practice of multilingual thesauri.