With the increasing amount of data generated in geoscience research, it becomes critical to describe data sets in meaningful ways. Many data sets are described using XML metadata, which has proved a useful means of expressing data characteristics. An ontological representation is another way of describing data sets, with the benefits of rich semantics, convenient linkage to other data sets, and good interoperability with other data. This study represents geoscience data sets as an ontology based on an existing metadata description and on the nature of the data set. It takes the case of Vortex2 data, a regional weather forecast data set collected in Spring 2010, to showcase how forecast data can be represented in an ontology by using the existing metadata information. The result is another representation of the data set, with added semantics and potential functionality compared to the previous metadata representation.
Scientific activities generate a multitude of data sets, which are not only useful to the data creators but are also of significant value to other researchers now and into the future. The description of a scientific data set thus becomes critical to its effective sharing and reuse. In the geosciences (e.g., earth science, atmospheric science, oceanography), data sets have been described in different ways, such as flat file descriptions, exceptionally long file names, and XML metadata. These representations have different representational powers and have played important roles in the context of their use. In recent times, more and more resources, including data, are being represented by ontologies on the semantic web, using languages such as OWL and RDF. Ontologies have the advantage of describing resources at the semantic level and thus are a good candidate for data set description. This study demonstrates an effort to represent geoscience data sets using an ontology, a practice that is not yet common but that comes with several benefits.
First, ontology representation focuses on semantics and offers rich semantics for machine understanding. The rich semantics in the description, e.g. concepts and relations, can enhance the access, sharing, and reuse of data sets. Second, ontology representation facilitates interaction with other resources in the semantic web. For example, a data set can be easily connected to another data set, a person, an organization, or a collection in the linked data space. Third, ontology languages such as OWL and RDF ensure interoperability between descriptions of different data sets, thus making it possible to use those data sets together.
Since many geoscience data sets are already represented by metadata, in this study we do not start from scratch but rather develop an ontology representation for data sets based on existing metadata. In order to develop a generalizable methodology for converting metadata to an ontology representation, we use a specific data set, a weather forecast data set with a metadata description, as a test bed to demonstrate the process. Our methodology and application may shed light on similar practice elsewhere.
VORTEX2 DATA AND METADATA
The Vortex2 project was an NSF-funded 6-week field campaign held in Spring 2010 to gather data about severe storms across the middle part of the United States. The LEAD II project at Indiana University contributed to the field campaign by generating 5 short-term regional forecasts each morning of the field campaign (http://newsinfo.iu.edu/news/page/normal/14369.html).
These short-term forecasts, 10-25 hours in duration and executed over a region about the size of a US state, were used to understand the weather patterns of the day for a location. The results of the forecast model were converted into images and movies that were then immediately made available at a web site for viewing on a cell phone by field researchers. On a particular forecast day, the LEAD II system would retrieve the latest observational weather data to use in the forecast, making the forecasts more accurate. Over the course of 6 weeks the LEAD II team generated 175 weather forecasts and over 9000 graphical products (images and movies). These data were carefully curated using the FGDC (Federal Geographic Data Committee) metadata schema. Each forecast, its derived products, and metadata were bundled into a single forecast bundle.
A weather forecast generates 11 types of data in total, including the WRF (Weather Research and Forecasting) model output itself and images and movies of specific attributes (e.g., vorticity, radar, precipitation data) that capture different readings of a severe storm and are of great importance for forecast analysis. The FGDC metadata schema is specifically designed for geospatial data and is composed of 7 geospatial aspects comprising 334 elements.
The whole effort resulted in 175 forecast bundles. Each forecast bundle contains images and model output data. We call the data from one forecast “a forecast set” and the specific image/output data within a forecast set “a forecast subset” here. For example, for a forecast run at 14:00:00Z on 2010/05/16, we have a general forecast set and forecast subsets for WRF model output, precipitation images, and so on.
Each forecast set has an XML metadata description defined by the FGDC schema. The metadata employs 3 geospatial aspects and 59 elements from FGDC. The forecast sets and metadata files match one to one, so we have 175 metadata files in total as the basis for conversion to an ontology representation. Figure 1 shows an example of a forecast set, the 11 types of data it contains, and the XML metadata record describing the forecast set.
There have been prior studies presenting approaches to converting metadata schemas to ontologies. These approaches are generally generic: given an arbitrary metadata document (e.g. an XML file), they address how it can be automatically transformed to an ontology representation (e.g. an OWL or RDF file). For example, an issue often addressed in the conversion is how XML components, such as sequences, elements/attributes, and identifiers, should be represented in RDF (Battle, 2006; Martens et al., 2011). The generic approaches usually convert metadata records on the syntactic level, in the sense that they consider the mapping between XML and RDF structural components.
Ferdinand et al.'s (2004) generic approach includes two types of mapping: mapping from XML to RDF (e.g. making XML elements/attributes RDF properties) and mapping from XML schema to OWL (e.g. making an XML schema complexType an OWL class). Battle (2006) presents the Gloze toolkit, which maps XML to RDF and RDF back to XML on the basis of XML schema. The argument is that there is no need to create a mapping language; instead, XML schema itself serves as the language that describes the important mapping information when translating between XML and RDF formats. XML elements/attributes are mapped to data/object properties in RDF, and XML sequences are addressed by a solution based on RDF sequences. Van Deursen et al. (2008) propose another generic approach, which employs a mapping file to build a connection between an XML schema and an OWL ontology and further helps transform XML records to RDF instances. Martens et al. (2011) introduce a procedure to convert an XML-based metadata schema to an OWL ontology using the example of DIG35, a metadata standard for describing still images. They automate the conversion process by using a tool that maps XML to RDF based on an XML document.
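To make the idea of a purely syntactic mapping concrete, the following is a minimal sketch in Python. It is not the algorithm of any of the tools cited above; it only illustrates the general pattern in which leaf elements become properties with literal values and nested elements become properties pointing to a new node. The sample record, the function name xml_to_triples, and the blank-node labels are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative record only; the element names are placeholders.
RECORD = """
<IdentificationInformation>
  <Citation>
    <Title>Vortex2 forecast</Title>
  </Citation>
  <Abstract>Short-term regional forecast</Abstract>
</IdentificationInformation>
"""

def xml_to_triples(element, subject, triples=None, counter=None):
    """Structural mapping: every child element becomes a property of `subject`."""
    if triples is None:
        triples = []
    if counter is None:
        counter = [0]  # mutable counter shared across recursive calls
    for child in element:
        if len(child) == 0:
            # Leaf element: property whose value is a literal.
            triples.append((subject, child.tag, (child.text or "").strip()))
        else:
            # Nested element: property whose value is a fresh blank node.
            counter[0] += 1
            node = f"_:b{counter[0]}"
            triples.append((subject, child.tag, node))
            xml_to_triples(child, node, triples, counter)
    return triples

root = ET.fromstring(RECORD.strip())
triples = xml_to_triples(root, "_:record")
for triple in triples:
    print(triple)
```

Note that nothing in this mapping knows what a title or a citation means; it follows the XML structure only, which is the limitation the semantic layer discussed next is meant to address.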
These related efforts focus on the syntactic part of conversion, whereas our study also tackles the semantic portion. We think the conversion involves not only syntactic mapping between XML and RDF, but also explicit semantic representation of the objects contained in XML records. The conversion is therefore also context dependent: the conversion approach relies on the context of the described entities. Conversions of geoscience metadata to ontologies can make use of these generic approaches, and at the same time we need to consider conversion issues specific to the context of the particular data set.
We propose an approach to conversion that employs mapping on two layers: the concept layer and the instance layer. The concept layer deals with concepts (classes in the ontology) relevant to the data set, and the instance layer addresses instances of concepts occurring in the metadata files. The abovementioned generic approaches shed light on concept layer mapping, for example, mapping from XML elements/attributes to OWL properties. In addition, the context of the forecast data is analyzed to provide semantically meaningful classes and properties to the ontology. The instance layer mapping relies on the concept layer structure and extracts useful metadata snippets and values as instances in the resulting ontology. We present our conversion methodology in the steps below.
Step 1. Identifying major concepts/classes in context
The concepts of the forecast set are context dependent, and therefore we looked into the Vortex2 context and the XML metadata records to find major concepts. The identified concepts serve as classes in the ontology representation. Since the Vortex2 data set is the central entity in our study, we identify the concept “data set” as a critical concept for the ontology. Correspondingly, it becomes the “Dataset” class in our OWL ontology. As the Vortex2 data set and its subsets (e.g. precipitation data, vorticity data) all belong to the “Dataset” class conceptually, they are made its subclasses.
Step 2. Identifying other useful classes
Other important classes are identified from the metadata schema and the Vortex2 context as well. These classes mainly facilitate the representation of the data sets. Overall, the metadata describes Vortex2 data from three aspects using the FGDC vocabulary: identification information, entity attribute information, and metadata information. These three aspects are three major elements in the XML metadata and embed several levels of sub-elements. We consider these important objects and make them classes in the ontology. The three classes are used to describe Vortex2 data in the ontology. For example, we have the following statement in OWL: Vortex2Data isDescribedBy some IdentificationInformation (meaning the Vortex2 data set is described by identification information).
We loosely follow one rule to identify classes from the XML metadata: if an XML element contains sub-elements instead of data values, we call it a complex object and make it a class in the ontology, as with the three classes mentioned above. It is not made an ontology property because it is described by its sub-elements, and making it a class facilitates further description of it. If an XML element contains only data values, it is called a simple object and is made a data property in the ontology. As Figure 2 shows, in the XML snippet, <SpatialDomain> and <BoundingCoordinates> are complex objects and are thus made OWL classes, whereas <WestBoundingCoordinate> and its three siblings are mapped to OWL data properties because they are simple objects.
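The rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually used in the study, where this step was carried out manually: the sibling element names and all coordinate values in the snippet are assumed for the example, and the function name classify_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

# Snippet modeled on the Figure 2 example. The three sibling names and the
# coordinate values are assumptions made for illustration.
SNIPPET = """
<SpatialDomain>
  <BoundingCoordinates>
    <WestBoundingCoordinate>-104.0</WestBoundingCoordinate>
    <EastBoundingCoordinate>-94.0</EastBoundingCoordinate>
    <NorthBoundingCoordinate>40.0</NorthBoundingCoordinate>
    <SouthBoundingCoordinate>33.0</SouthBoundingCoordinate>
  </BoundingCoordinates>
</SpatialDomain>
"""

def classify_elements(root):
    """Split element names into class candidates and data property candidates."""
    classes, data_properties = [], []
    for element in root.iter():  # document order, root included
        # Complex object (has sub-elements) -> class;
        # simple object (data values only) -> data property.
        target = classes if len(element) > 0 else data_properties
        if element.tag not in target:
            target.append(element.tag)
    return classes, data_properties

classes, data_properties = classify_elements(ET.fromstring(SNIPPET.strip()))
print("classes:", classes)
print("data properties:", data_properties)
```

The output of such a pass is only a first draft of the concept layer; candidates still need to be reviewed against the context of the data set.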
We then refine the ontology based on our understanding of the context. For example, <Originator> is an element in the XML schema and a sub-element of <CitationInformation>. It is a simple object because it contains only data values, namely the names of the creators of the data set. Following the above rule, we would make “originator” a data property and the data values literals in the ontology. However, since the <Originator> data value actually lists names of people who are originators, it makes more sense to have a statement in OWL saying that Vortex2Data hasOriginator only Person (meaning that any originator of the Vortex2 data set belongs to the Person class) instead of simply treating the data values as literals. Accordingly, <Originator> is made not a data property but an object property in the ontology, because its value in the statement is now a class (Person). The following figure shows the mapping of this example.
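One way to picture this refinement is as a small table of context-specific overrides consulted before the default rule is applied. The sketch below is illustrative only; the table OVERRIDES, the function map_simple_element, and the lower-camel-case naming of data properties are assumptions, not part of the original methodology.

```python
# Context-specific overrides: simple objects whose values denote entities
# rather than plain literals. Only the Originator entry comes from the text.
OVERRIDES = {
    "Originator": ("hasOriginator", "Person"),
}

def map_simple_element(tag):
    """Map a simple XML element to an ontology property description."""
    if tag in OVERRIDES:
        name, range_class = OVERRIDES[tag]
        return {"kind": "ObjectProperty", "name": name, "range": range_class}
    # Default rule: a simple object becomes a data property with literal values.
    # The lower-camel-case property name is an assumed convention.
    return {"kind": "DataProperty", "name": tag[0].lower() + tag[1:], "range": "Literal"}

originator = map_simple_element("Originator")
west = map_simple_element("WestBoundingCoordinate")
print(originator)
print(west)
```

Keeping the overrides in one place makes the context-dependent decisions explicit and separate from the generic rule.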
Step 3. Describing classes
The description of ontology classes is likewise constructed from the metadata records and the context. We first loosely follow a rule to describe classes: classes that were originally complex objects in the XML schema are described by their sub-elements in the XML schema. For example, Vortex2Data has three sub-elements that are also mapped to OWL classes: identification information and two others. In this case, we write the OWL statement as Vortex2Data isDescribedBy some IdentificationInformation (meaning the Vortex2 data set is described by identification information), and the OWL statements for the other two classes are similar.
As above, we analyze the context to better describe the OWL classes. For example, Vortex2Data and its sub data sets are all subclasses of the Dataset class; moreover, conceptually Vortex2Data also contains the sub data sets. This is obvious to a human, but it is not yet explicitly represented in the ontology. We define the relation between Vortex2Data and its sub data sets as a whole-part relationship. Properties are attached to the Vortex2Data class to represent this whole-part relationship, e.g. Vortex2Data contains some CapeData (meaning the Vortex2 data set contains Cape data as a subset), where contains is the object property indicating the whole-part relationship.
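For readers less familiar with OWL, the class descriptions above can be written out as Turtle. The sketch below generates the axioms as text using plain Python. It assumes that each "some" statement is read as a subclass axiom with an existential restriction, which is the usual reading when such an expression is entered as a superclass in Protégé; the namespace http://example.org/vortex2# and the helper names are hypothetical.

```python
NS = "http://example.org/vortex2#"  # hypothetical namespace, not from the paper

PREFIXES = "\n".join([
    f"@prefix : <{NS}> .",
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
])

def subclass_of(child, parent):
    """Named-class subsumption, e.g. Vortex2Data is a kind of Dataset."""
    return f":{child} rdfs:subClassOf :{parent} ."

def some_values_from(cls, prop, filler):
    """Assumed reading of 'cls prop some filler' as an existential restriction."""
    return (
        f":{cls} rdfs:subClassOf [ a owl:Restriction ;\n"
        f"    owl:onProperty :{prop} ;\n"
        f"    owl:someValuesFrom :{filler} ] ."
    )

axioms = [
    ":isDescribedBy a owl:ObjectProperty .",
    ":contains a owl:ObjectProperty .",
    subclass_of("Vortex2Data", "Dataset"),
    subclass_of("CapeData", "Dataset"),
    some_values_from("Vortex2Data", "isDescribedBy", "IdentificationInformation"),
    some_values_from("Vortex2Data", "contains", "CapeData"),
]

turtle = PREFIXES + "\n\n" + "\n".join(axioms) + "\n"
print(turtle)
```

The two subclass axioms capture the "is a" hierarchy, while the contains restriction captures the separate whole-part relationship between a forecast set and its subsets.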
Step 4. Using external classes and properties
We make maximum use of existing external classes and properties to avoid reinventing the wheel and to allow interoperability with other data sets. If a class or property in the ontology is present in an external ontology, preferably a frequently used one, we employ the class/property of that ontology instead of creating our own. We look into the DC (Dublin Core), VIVO, SWEET (Semantic Web for Earth and Environmental Terminology), and FOAF (Friend Of A Friend) ontologies to find potentially useful classes and properties. For example, the CitationInformation class has a property “has publication date of,” which records the publication date of the data set. DC terms have an “issued” property for publication date, and therefore we use this DC property to describe the CitationInformation class: CitationInformation issued only dateTime.
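The reuse policy can be pictured as a lookup that prefers an external term and falls back to a local one. The following sketch is an illustration only: the local namespace and the local name hasPublicationDate are hypothetical, and the use of foaf:Person for the Person class is an assumption (the text names FOAF as a source but does not say which terms were adopted). The Dublin Core "issued" IRI is the standard DCMI Terms identifier.

```python
LOCAL_NS = "http://example.org/vortex2#"  # hypothetical local namespace

# Local concept name -> external term to reuse.
EXTERNAL_TERMS = {
    # From the text: publication date is expressed with the DC "issued" property.
    "hasPublicationDate": "http://purl.org/dc/terms/issued",
    # Assumption for illustration: Person is taken from FOAF.
    "Person": "http://xmlns.com/foaf/0.1/Person",
}

def resolve_term(local_name):
    """Return the external IRI when one is registered, else a local IRI."""
    return EXTERNAL_TERMS.get(local_name, LOCAL_NS + local_name)

issued_iri = resolve_term("hasPublicationDate")
local_iri = resolve_term("CitationInformation")
print(issued_iri)
print(local_iri)
```

Resolving every class and property name through one such function keeps the choice between external and local terms consistent across the ontology.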
Step 5. Extracting instances from metadata files
The first four steps deal with conversion on the concept layer, and this last step addresses the instance layer. Information about instances is extracted from the metadata files to add instances to the abstract ontology. The OWL ontology provides the skeleton of the data sets and describes their characteristics, whereas the instances are resource pointers to the real objects, such as particular data sets and particular persons. We represent the instances using RDF, which describes resources as triples.
There are 175 weather forecasts and derived visual products and thus we have 175 Vortex2Data instances. Each Vortex2Data instance contains 11 sub data set instances, whose information can also be extracted from the XML files. Besides the data set instances, instances of other classes are also extracted from the XML file automatically and are organized in the ontology according to their class relationship.
The first four steps of the methodology are manual work, in which ontology classes and properties are derived and written into the ontology. We use the Protégé toolkit to construct the OWL ontology, part of which is presented in Figure 4. The last step, instance extraction, is automated with the Jena toolkit, a toolkit for handling RDF, OWL, etc.: instance information comes from the metadata files, so once the structure of the ontology is known, it can be extracted and added automatically.
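The instance extraction step can be illustrated with a short sketch. The study performs this step with Jena; the version below instead uses only the Python standard library and writes N-Triples lines directly, so it should be read as an outline of the idea rather than the actual implementation. The namespace, the sample record, the placeholder person names, and the helper names slug and extract_instances are all assumptions.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/vortex2#"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Illustrative metadata record; names and title are placeholders.
METADATA = """
<metadata>
  <IdentificationInformation>
    <CitationInformation>
      <Originator>Jane Doe</Originator>
      <Originator>John Roe</Originator>
      <Title>forecast-2010-05-16T14</Title>
    </CitationInformation>
  </IdentificationInformation>
</metadata>
"""

def slug(text):
    """Turn free text into a string that is safe inside an IRI."""
    return "".join(ch if ch.isalnum() else "_" for ch in text.strip())

def extract_instances(xml_text):
    """Emit N-Triples for one data set instance and its originators."""
    root = ET.fromstring(xml_text.strip())
    dataset = f"<{NS}{slug(root.findtext('.//Title'))}>"
    lines = [f"{dataset} <{RDF_TYPE}> <{NS}Vortex2Data> ."]
    for originator in root.iter("Originator"):
        person = f"<{NS}{slug(originator.text)}>"
        # Each originator becomes a Person instance linked by hasOriginator.
        lines.append(f"{person} <{RDF_TYPE}> <{NS}Person> .")
        lines.append(f"{dataset} <{NS}hasOriginator> {person} .")
    return lines

lines = extract_instances(METADATA)
print("\n".join(lines))
```

Running the same function over each of the metadata files would yield one data set instance per forecast, with the class and property names fixed by the concept layer built in the earlier steps.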
CONCLUSION AND FUTURE WORK
In this study, we present our methodology for converting an XML metadata description for a short-term regional weather forecast and related data sets to an ontological representation. The methodology addresses conversion on both concept and instance layers and considers both syntactic and semantic mapping between metadata schema and ontology. The methodology can be generalized to other geoscience data sets as well.
LEAD II used FGDC to represent a suite of related products derived from a single forecast, but FGDC is not strong in its ability to relate products to one another. As future work, our ontology will first be extended to relate the data products within a forecast bundle. In addition, information and semantics can be lost or misinterpreted during the conversion from XML to ontology, and thus evaluation is an important way of examining the conversion quality. Secondly, we will more deeply explore the temporal and spatial representation of geoscience data sets, since these are intrinsic characteristics of geoscience data. The Vortex2 data contains temporal information, such as forecast start date, model start time, and forecast duration, and spatial information, such as bounding coordinates describing geographical locations. The representation of this information needs to be further enhanced to provide better data description and access to end users.