Background: Biosystematics and Biosystematics Publishing

  1. Top of page
  2. Background: Biosystematics and Biosystematics Publishing
  3. Semantic-based Quality Insurance System for Biosystematics Publishing
  4. Conclusion and Future Work
  5. Acknowledgements
  6. References

Biosystematics literature contains accounts of living organisms and their fossil records. Keeping the recorded information accurate and complete is essential for specimen identification and subsequent research. Like all other publishers, publishers of biosystematics rely on expert human editors to maintain quality standards. However, because biosystematics descriptions are complex and have special requirements, the biosystematics community widely acknowledges that these editors need additional help to ensure each description meets the essential requirements.

Two major requirements unique to good biosystematics descriptions stem from the hierarchical structure of taxonomy and the essential use of descriptions for specimen identification. The hierarchical relationships are formed on the shared characters of different taxa. A taxon is a name that designates an organism or a group of organisms and is assigned a taxonomic rank (e.g. family, genus), which places it at a particular level in the taxonomy reflecting evolutionary relationships. A description of a taxon at a certain rank (e.g. family) should include the characters shared by all its lower taxa (e.g. genus, species, variety). As with superclass/subclass definitions in object-oriented programming, properties that define a superclass (i.e. a higher-rank taxon) are not to be repeated in its subclasses (i.e. lower taxa). This first requirement avoids confusion by ensuring that no information is unnecessarily repeated. The second requirement, commonly referred to as “parallelism”, ensures that necessary information is not left out. To support specimen identification, the characters described for taxa of the same rank must be parallel so that they form a basis for contrast. Perfect parallelism is crucial for creating identification keys, but it is difficult to enforce manually because it commonly involves a large number of descriptions (e.g. all species of a genus) and a large number of characters.
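The two requirements can be illustrated with a minimal sketch. It assumes, purely for illustration, that each description has already been reduced to a dictionary mapping character names to values; the function names and the sample characters are hypothetical and are not taken from any published flora.

```python
def redundant_characters(higher, lower):
    """Characters a lower taxon repeats, with the same value, from its higher taxon."""
    return {name for name, value in lower.items() if higher.get(name) == value}


def missing_characters(siblings):
    """For each taxon of the same rank, characters some sibling describes but it omits."""
    all_names = set()
    for characters in siblings.values():
        all_names |= set(characters)
    return {taxon: all_names - set(chars) for taxon, chars in siblings.items()}


# Hypothetical data: one genus-level description and two species-level ones.
genus = {"leaf arrangement": "alternate"}
species_a = {"leaf arrangement": "alternate", "seed shape": "ovoid"}
species_b = {"seed architecture": "winged"}

print(redundant_characters(genus, species_a))
print(missing_characters({"species_a": species_a, "species_b": species_b}))
```

The first call flags information that violates the no-repetition requirement; the second lists, per taxon, the gaps that break parallelism.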

In summer 2008, we collaborated with and were funded by the Flora of North America Project, a non-profit biosystematics publisher, to investigate the feasibility of a semantic-based quality insurance (SQI) system for biosystematics publishing. The publishers of Flora of China and Treatise on Invertebrate Paleontology also expressed an interest in such a system.

Semantic-based Quality Insurance System for Biosystematics Publishing

The design goal of the SQI system is to use the unsupervised, machine-learning-based semantic annotation technique we developed over the last three years (Cui, 2008a, 2008b) to 1) automatically check for parallelism in description manuscripts of the same rank, 2) perform de-duplication in description manuscripts of different ranks, 3) ensure that recommended terminology (e.g. terms defined in the FNA Glossary (Kiger & Porter, 2001)) is used in manuscripts, and discover and propose new candidate terms for inclusion in the glossary, and 4) semantically mark up unstructured manuscripts in XML, which can later be converted to SDD format (SDD, 2008) for automated identification key generation.

All these features rely on the unsupervised annotation technique, which identifies and learns relevant concepts, including organ/structure names (e.g. leaves), their modifiers (e.g. cauline leaves), and values of characters (e.g. margin entire), directly from the raw plain text of description collections, which in this case are extracted from published Flora volumes. The learned concepts are saved in a MySQL database and used to annotate individual description manuscripts, which are then checked for redundancy, parallelism, and terminology consistency (see Figures 2 and 3 for examples of annotated segments).

However, a description is only one of many sections in a taxon record. To extract description sections and to fully annotate taxon records, we developed a parser that takes a volume of text, segments it into individual taxon records, and then parses each record into sections such as nomenclature, description, habitat, distribution, and discussion (Figure 1). All sections except the description are further parsed to explicitly annotate information such as scientific name, authority, common names, citations, and global (continents, countries) and local (states, provinces) distribution. Parsing was fairly straightforward using the styling clues (font, size, etc.) in the case of Flora of North America (tested on Vol. 5 and Vol. 19, which were selected by the FNA Project), and is likely to be as straightforward for Flora of China and Treatise on Invertebrate Paleontology, as they all used MS Word to style the text close to the final stage of the editorial process. By saving a .doc file in .docx format and unzipping it, we obtain the text with style information encoded in XML, which can then be parsed with ease. Since styles may be applied inconsistently in long texts, the parser allows the user to configure the styles and verifies the consistency of their application. The user must correct any identified inconsistencies before moving on to the next parsing phases. The parser treats description sections differently because in biosystematics publishing these sections often have little or no style applied. The parser collects all description sections from the entire volume and calls on the unsupervised learning algorithms to annotate these sections with semantic tags.
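The .docx step described above can be sketched as follows. This is a minimal illustration, not the parser built for the project: it reads only the paragraph-level style identifier from `word/document.xml`, whereas the styling clues mentioned above (font, size) may also live in run-level properties. The function name `styled_paragraphs` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def styled_paragraphs(docx_file):
    """Yield (style_id, text) for each paragraph of a .docx path or file-like object."""
    # A .docx file is a zip archive; the main body is stored as XML inside it.
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        # Paragraphs with no explicit style are reported with style_id None.
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        yield style_id, text
```

A caller could group consecutive paragraphs by `style_id` to segment a volume into records and sections, and count how often each style occurs to spot inconsistent application before parsing further.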

Figure 1. Flora of North America Semantic Parser

To support the quality checking functions, we further developed the unsupervised annotation algorithms to annotate characters and their values (e.g. coloration = “dark red_brown”) and to relate characters to the relevant organs. The association between characters and their values was achieved by the learning algorithm with help from the FNA Glossary. While further improvements to the algorithms are needed, the preliminary results are promising. Figures 2 and 3 show some examples of annotated segments of descriptions. Annotations appear as tags surrounding the textual descriptions.

Figure 2. Identify Unnecessary Redundant Information

Figure 3. Identify Missing Information

With all of the above in place, we adapted the Longest Common Subsequence algorithm (Cormen, Leiserson, Rivest, & Stein, 2001) to identify redundant characters across ranks and missing characters within the same rank. Here the semantic tags of a description (a tag consists of a tag name and several attribute/value pairs) form a sequence. While the algorithm looks for the longest common subsequence of two sequences, it also identifies the organs and characters that are part of that subsequence (i.e. parallel or redundant information) and those that are not (i.e. non-parallel information). Figure 2 shows an example of a redundant character/value pair (plane_shape, shown in blue) that is described in a species description and then unnecessarily repeated in the description of a variety (a lower rank than species). Figure 3 shows an example of a lack of parallelism between two species descriptions of the same genus: when describing seeds, one species description states their architecture, while the other omits it and gives their shape instead.
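The comparison can be sketched with the textbook dynamic-programming form of the Longest Common Subsequence. This is an illustration only, not the adapted algorithm used in the SQI system: tags are simplified here to hypothetical (organ, character) tuples, and the function name `lcs_split` is an assumption.

```python
def lcs_split(a, b):
    """Split two tag sequences into (common, only_in_a, only_in_b) along one LCS."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table once, sorting each tag into shared or unshared.
    common, only_a, only_b = [], [], []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            only_a.append(a[i])
            i += 1
        else:
            only_b.append(b[j])
            j += 1
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return common, only_a, only_b


# Hypothetical tag sequences for two species of the same genus.
species_1 = [("leaves", "shape"), ("seeds", "architecture"), ("seeds", "color")]
species_2 = [("leaves", "shape"), ("seeds", "shape"), ("seeds", "color")]
common, only_1, only_2 = lcs_split(species_1, species_2)
```

For two descriptions of the same rank, the unshared lists point to non-parallel characters, as in the seed example of Figure 3. For descriptions of different ranks, the tuples would also need to carry the character value, so that the common list contains only the repeated character/value pairs.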

Conclusion and Future Work

Semantic annotation of biodiversity literature has been shown to improve information retrieval performance and user satisfaction levels (Tang & Heidorn, 2007). While further research is needed, our collaborative research with biosystematics publishers suggests that this technique also promises improved efficiency and effectiveness in other stages of the information cycle, such as creation and publication. We will continue to improve the character annotation algorithms and implement the user interface of the SQI system, in addition to conducting system- and user-centered performance evaluations. Eventually we hope to adapt the SQI system to perform quality checking of biosystematics information on the World Wide Web.

References

  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C. (2001). Introduction to Algorithms. MIT Press and McGraw-Hill.
  • Cui, H. (2008a). Approaches to Semantic Mark up for Natural Heritage Literature. Proceedings of the iConference 2008.
  • Cui, H. (2008b). Unsupervised Semantic Markup of Literature for Biodiversity Digital Libraries. Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 25-28).
  • Kiger, R. W., Porter, D. M. (2001). Categorical Glossary for the Flora of North America Project. Retrieved January 3, 2009, from http://huntbot.andrew.cmu.edu/HIBD/Departments/DB-INTRO/IntroFNA.shtml
  • SDD. (2008). SDD Wiki. Retrieved July 10, 2008, from TDWG Wiki: http://wiki.tdwg.org/SDD
  • Tang, X., Heidorn, P. B. (2007). Using Automatically Extracted Information in Species Page Retrieval. Retrieved July 10, 2008, from Proceedings of TDWG 2007: http://www.tdwg.org/proceedings/article/view/195/0