Towards a cross-disciplinary notion of data level in data curation
Originally focused on scientific data, the data curation community is now taking up the curation problems of humanities data as well. Sharing concepts and terminology across these domains should be valuable both for the practice of data curation and for the education of data curation professionals. To convene a discussion of the possibilities, we outline an exercise mapping NASA's well-known "data levels" to practices and concepts in traditional textual criticism.
Data curation is concerned with the fundamental principles and best practices for managing the entire lifecycle of data: creation, management, exploitation, enhancement, and preservation (Rusbridge, 2007). The GSLIS Data Curation Education Program (DCEP) is a specialization within the ALA-accredited master of science degree at the Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign; its development was supported by a grant from the Institute of Museum and Library Services (IMLS). Although DCEP's original focus was on scientific data, in 2008 GSLIS was awarded another IMLS grant: to extend the DCEP program to include humanities data.
The new digital context for humanistic inquiry challenges the methods and techniques that support traditional curatorial objectives such as authenticity, provenance, authoritative reference, and annotation. It also challenges our understanding of those objectives themselves, pointing to an urgent need for better-theorized curatorial practices. We recognize, however, that the humanities disciplines have already evolved sophisticated curation practices and theories, and in the last 50 years these have been refined and extended to support computation and digital storage. It is a fundamental principle of the DCEP humanities program that this rich tradition must inform any development of a humanities data curation curriculum.
The Value of Shared Concepts
The emerging data curation community now conceptualizes itself as including humanities and social science data as well as scientific data (Crane et al., 2007). This immediately presents a challenge: to what extent can shared frameworks of common concepts do justice to best practices, current and future, in such widely disparate areas?
The advantages of even a partially shared framework of concepts and terminology are substantial. In the future, data curation services may often be provided by a single organizational unit that serves multiple disciplines, uses a single set of repository tools, and maintains a single set of policies (Swan & Brown, 2008). Unnecessary diversity will create a burden for design, implementation, documentation, and training. In addition, concealing similarities impedes recognition of opportunities for simplification and efficiency, and the re-use of successful strategies in different domains.
Similarly, shared concepts and terminologies will benefit the professional education of data curators. This is in part because data curators will find positions in organizations such as those just described, and in part because a coherent integrated curriculum has pedagogical advantages. But perhaps most important, the more the data curation curriculum reflects commonalities that exist across superficially disparate data practices, the more robust and effective the theory and techniques we teach in our curriculum are likely to be (Cragin et al., 2007).
Data Levels in Earth Science and Textual Criticism
One concept that is a particularly interesting candidate for cross-disciplinary comparison is “data level”, the categorization of data with respect to the extent to which it is “raw” or “processed”. [The notion of data level is closely related to the key data curation concepts of provenance and lineage (Bose & Frew, 2005; Freire et al., 2008)]
As a first exercise in conceptual alignment we propose comparing the widely used NASA data level categories for remote sensing data (NASA, 1986) with traditional notions of scholarly transcription and editing found in the theory of textual criticism, or textual philology. Because the NASA data levels are not well characterized conceptually with respect to their epistemic or methodological significance, but are primarily specified operationally, in terms of what specific data features can be found at what level, we defer conceptual analysis of data level and focus initially on what key features of levels of transcription and editing seem to correspond, intuitively, to the features characteristic of the different NASA data levels. What follows is, however, intended only to motivate the exercise, not to carry it out.
Comparison of NASA Data Levels with Levels of Textual Transcription and Editing
NASA Level 0: “…unprocessed instrument data at full resolutions.”
TextCrit Level 0: Unprocessed text images.
In digital textual editing these might be images of textual documents (codex or manuscript pages, for instance) recorded in raster formats (bitmaps), prior to any additional processing such as format conversion or compression. Because they are at the original scanning resolution they may contain details that can be lost when smaller, more manageable versions are generated from them. At the same time, because they have not been enhanced in any way, features apparent only after processing may not be evident when rendered with standard viewers. Such files are usually inappropriate for digital research as they are large, often in inappropriate formats, and do not make textual content computationally available at the character level; nevertheless, they are the evidentiary foundation for further scholarship.
NASA Level 1A: “…unprocessed instrument data at full resolution, time referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters (i.e., platform ephemeris) computed and appended but not applied to the Level 0 data.”
TextCrit Level 1A: Unprocessed text images, annotated with the identification of the hardware and software used, any configuration or calibration information, the time and place of the scanning, the organization or persons conducting the imaging, and a [non-descriptive] identification of the object imaged.
Here information about the scanning process is included, as well as a specification of what was scanned. This will include “objective” administrative and technical metadata elements for raster images (NARA, 2004). [The identification of the object scanned should not contain any unnecessary inferred information about the object and should be understood only as identifying, not describing, the object; the combination of an agency and a registrar's accession number would be an example of a relatively non-interpretative identification, while the characterization of the document as written by a particular author would involve inference and description inappropriate at Level 1.]
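A Level 1A record of the sort described above can be sketched as a simple data structure. In this sketch the field names, identifiers, and equipment details are illustrative assumptions only, not a standard schema; a real project would follow an established element set such as the NARA technical metadata elements cited above:

```python
# Sketch of a Level-1A-style metadata record for one scanned page image.
# All field names and values below are hypothetical illustrations.
level_1a_record = {
    # Identification only -- no description of, or inference about, the object.
    # Agency code plus registrar's accession number (hypothetical example):
    "object_id": "UIUC-RBML/acc-1912-0042",
    "image_file": "page_017.tiff",
    # Scanning-process metadata, analogous to calibration coefficients
    # appended (but not applied) at NASA Level 1A:
    "resolution_dpi": 600,
    "bit_depth": 48,
    "color_calibration_target": "IT8.7/2",
    "capture_software": "vendor capture suite v3.1 (hypothetical)",
    "scan_timestamp": "2008-11-03T14:22:00-06:00",
    "scan_location": "Urbana, IL",
    "scanning_agent": "University Library Digitization Unit",
}

def is_non_interpretative(record):
    """Check that a record carries no descriptive or inferential fields
    (e.g., no authorship attribution), in the spirit of Level 1A."""
    forbidden = {"author", "title", "date_of_composition", "genre"}
    return forbidden.isdisjoint(record)
```

The point of the `is_non_interpretative` check is simply to make the Level 1A constraint operational: identification is permitted, description and inference are not.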
NASA Level 2: “Derived environmental variables (e.g., ocean wave height, soil moisture, ice concentration) at the same resolution and location as the Level 1 source data.”
TextCrit Level 2: Derived representation of text content and structure, mapped to locations in the Level 1 source data.
These might be transcriptions of Level 1 text images with accompanying markup indicating textual components that may be considered relatively non-interpretative. Many difficult decisions arise at this point: What counts as a data feature to be recorded? Are ligatured characters to be identified as such? Are purely typographic variations (e.g., “v”s for “u”s) distinguished? Are (apparently) non-meaningful design ornaments to be recorded? The literatures of textual criticism, bibliography, documentary editing, and diplomatics provide extensive discussions of possible transcription policies. It is likely that several sub-levels will be needed to map scholarly textual practices corresponding to NASA Level 2.
Another issue is, obviously, what counts as interpretation. Printed glyphs and characters might be considered identifiable with (usually) relatively little interpretation. Text components that can be identified with relatively little interpretation might be chapter and paragraph boundaries, extracts, stanzas, and other things for which there is reasonably non-controversial visual evidence — while on the other hand the distinction of place names and person names, or the identification of referenced persons, places, and things, are examples of annotations that are probably too inferential for Level 2.
A plausible data product at this level might be an XML/TEI document (TEI, 2008) intended for broad use by diverse scholars, but without annotation as to relationships to other texts, other physical instantiations of this text, or other raster files carrying this text, or the identification of mentioned persons, places, times, etc. Such XML/TEI documents not only seem a close match to NASA Level 2 in nature, but often seem to play the same role in some humanities disciplines (in history and literary studies, for instance) as NASA Level 2 data plays in earth science: they are processed enough to be useful, but include little information that might be considered interpretation and inference by scholars using them.
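To make the Level 2 picture concrete, the following sketch (using Python's standard ElementTree) processes a simplified TEI-like fragment. The element names follow TEI usage (`pb`, `facs`, `p`) but the fragment is invented for illustration and is not schema-valid TEI; the sample text, with its “v”-for-“u” spellings, is likewise contrived. Note what the markup records (page and paragraph structure, a pointer back to the Level 1 raster source) and what it deliberately omits (no `persName`, `placeName`, or other interpretative annotation):

```python
import xml.etree.ElementTree as ET

# Simplified TEI-like Level 2 product: structure plus a link to the
# Level 1 source image, but no interpretative markup. Illustrative only.
transcription = """
<text>
  <body>
    <pb n="17" facs="page_017.tiff"/>
    <p>It was the best of times, it was the worst of times.</p>
    <p>We had euery thing before vs, we had nothing before vs.</p>
  </body>
</text>
"""

root = ET.fromstring(transcription)

# A Level 2 consumer can recover structure and lineage mechanically:
page_breaks = root.findall(".//pb")
paragraphs = root.findall(".//p")
source_images = [pb.get("facs") for pb in page_breaks]
```

The `facs` attribute is what ties this Level 2 product back to its Level 1 evidentiary source, the analogue of Level 2 data retaining “the same resolution and location as the Level 1 source data”.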
NASA Level 3: “Variables mapped on uniform spacetime grid scales, usually with some completeness and consistency properties (e.g., missing points interpolated, complete regions mosaicked together from multiple orbits).”
TextCrit Level 3: Representation of textual content and structure mapped on to (perhaps multiple) carriers with described structure (e.g., physical bibliography), with expansion of abbreviations and interpolation of missing text.
Locating the digital text in a larger bibliographic space of physical objects and editions corresponds well to mapping data to a “uniform spacetime grid”. This allows, for instance, several raster images and corresponding TEI documents to be mapped to the same book, or the same edition, perhaps “mosaicking” several XML/TEI documents together. The standard text critical practice of supplying missing text where there are “lacunae” caused by illegibility, physical damage, etc. corresponds well to the interpolation of missing points. [The question arises though as to whether these two interventions are really at the same epistemic level of processing and inference.]
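The supplying of missing text can also be sketched in the same simplified TEI-like terms. Here an illegible reading is filled in with a `supplied` element, so the interpolated text remains distinguishable from what was actually transcribed, just as interpolated grid points are flagged in Level 3 remote sensing data. The fragment and helper functions are illustrative assumptions, not schema-valid TEI or standard TEI tooling:

```python
import xml.etree.ElementTree as ET

# A lacuna (here, damage) filled with a <supplied> reading, kept
# distinguishable from the transcribed text. Illustrative fragment only.
fragment = ET.fromstring(
    '<p>We had euery thing before '
    '<supplied reason="damage">vs</supplied>.</p>'
)

def diplomatic_text(elem):
    """The text as actually transcribed: supplied readings omitted."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "supplied":
            parts.append(child.text or "")
        parts.append(child.tail or "")
    return "".join(parts)

def reading_text(elem):
    """The text with supplied readings included, as in a reader's edition."""
    return "".join(elem.itertext())
```

Because the intervention is marked rather than silently merged, both the conservative (diplomatic) and the interpolated (reading) views remain derivable from the same document, which bears directly on the bracketed question above about epistemic levels.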
NASA Level 4: “Model output or results from analyses of lower-level data (i.e., variables that were not measured by the instruments but instead are derived from these measurements).”
TextCrit Level 4: Textual history including scribal transmission, seriation, intended but unrealized texts, etc. Possibly also person, name, and date disambiguation.
This level would contain the results of other aspects of traditional textual criticism, such as the correction (emendation) of textual errors made by the original authors, scribes, or compositors, the interpolation of missing manuscripts, the determination of order of composition (seriation), or the establishment of an intended text that was never realized due to compositor errors. None of these things are “directly measured”, but they are derived from what is measured.
The preceding is intended only to initiate discussion. A full treatment will require, among other things, (i) a precise conceptual analysis of the intended epistemic or methodological significance of the operationally characterized NASA data levels; (ii) taxonomies of both conceptual and operational features; and (iii) an assessment of how well the proposed operational definitions correspond to genuine epistemic or methodological boundaries. There is much to puzzle over.
A question of particular interest for the design of a data curation curriculum is the possibility that there are fundamental differences between cultural and scientific data that will bear on characterization of data level. For instance, what is data and what is theory varies from discipline to discipline in the humanities: a scholarly edition is data for a literary critic or cultural historian, but theory for a textual philologist. Another possible difference is the frequent role of human judgment and intuition in moving from one data level to another — even when algorithmic techniques are employed (in computational seriation or stemmatics for instance) human judgment remains salient. And finally, there is the intentionality, the aboutness, of humanities data: an inscribed text not only has a causal history, but semantic features as well: it has meaning and may be about some person, place or thing. But whether any of these are real differences between science and humanities data remains to be seen. The exercise has been set — let the discussion begin.
We have benefited from discussions with Carole L. Palmer and John MacMullen as well as from discussions at the e-Research Roundtable, Center for Informatics Research in Science and Scholarship, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.