A Shallow Text Processing Core Engine


Günter Neumann
DFKI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
neumann@dfki.de


In this article we present SMES–SPPC, a high–performance system for intelligent extraction of structured data from free text documents. SMES–SPPC consists of a set of domain–adaptive shallow core components that are realized by means of cascaded weighted finite–state machines and generic dynamic tries. The system has been fully implemented for German; it includes morphological and on–line compound analysis, efficient POS–filtering, high–performance named–entity recognition, and chunk parsing based on a novel divide–and–conquer strategy. The approach has proved very useful for processing free word order languages such as German. SMES–SPPC is fast (more than 6000 words per second on a standard PC) and achieves high linguistic coverage, in particular with the divide–and–conquer parsing strategy, for which we obtained an f–measure of 87.14% on unseen data.
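To make the notion of a dynamic trie concrete, the following minimal sketch shows a trie that can be extended at run time and queried for the longest lexicon entry starting at a given position, the kind of lookup an on–line compound analysis might rely on. It is an illustration only, not the SMES–SPPC implementation: the class name `Trie`, the methods `insert` and `longest_prefix`, and the toy lexicon entries are all assumptions made for this example.

```python
class Trie:
    """Dictionary-backed trie; new entries can be added at any time."""

    def __init__(self):
        self.children = {}   # maps a character to the child node
        self.value = None    # payload stored where a complete entry ends

    def insert(self, word, value):
        """Add `word` with an associated payload (hypothetical lexical info)."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def longest_prefix(self, text, start=0):
        """Return (end, value) for the longest entry that begins at
        text[start], or None if no entry matches there."""
        node, best = self, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.value is not None:
                best = (i + 1, node.value)
        return best


# Toy lexicon; the entries and tags are invented for illustration.
lexicon = Trie()
lexicon.insert("haus", "N:haus")
lexicon.insert("hausboot", "N:hausboot")
lexicon.insert("boot", "N:boot")
```

Calling `lexicon.longest_prefix("hausbootfahrt")` prefers the longer entry `hausboot` over `haus`, and passing a `start` offset allows the remainder of a word to be segmented from any position.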