The semantics of metadata attributes is sometimes simply a matter of stipulation: the schema designers agree on what semantics they want their attributes to have, and those decisions determine attribute semantics. However, metadata elements are usually defined in natural language sentences that make use of familiar common words that are themselves neither formally defined nor plausibly primitive. In these cases it is difficult to know what was intended by schema designers, even assuming they had clear and settled intentions. More importantly, the semantic formalization of metadata schemas often happens after the schema is completed and already in use. In that case the issue is determining the semantics of metadata attributes as they are actually used, not the intentions of the schema designers. For these reasons and others, testing conjectured metadata rules against actual metadata is important. To explore how this testing might be accomplished, we built an RDF repository containing descriptions from the IMLS Digital Collections and Content (DCC) project (Wickett et al., 2009).
Since CIMR rules are developed in first order logic and refer to facts implied by metadata, we wanted to create an environment for testing that used a knowledge representation language that could support straightforward translation from rules into queries. RDF was therefore a natural choice for data representation, as it is explicitly based on a fragment of first order predicate logic, encoding information as subject-predicate-object “triples”, which are expressively equivalent to the two place predicates we use in CIMR rules. The associated query language SPARQL provides a mechanism for investigating metadata assignments, and related semantic web ontology and rule languages (OWL and SWRL) can support additional modeling and inferencing. Finally, because we anticipate these rules being used in the emerging semantic web and linked data environment, we wanted to do our testing in a similar architecture.
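The correspondence between triples and two place predicates can be illustrated with a minimal sketch in plain Python. It is not the testbed itself: the identifiers (item1, collA), the predicate names (isGatheredInto, type), and the helper functions are all hypothetical, and the pattern matcher only loosely imitates what a SPARQL triple pattern would do.

```python
# Hypothetical sample data: each triple (subject, predicate, object) stands for
# the two place atomic statement predicate(subject, object).
TRIPLES = {
    ("item1", "isGatheredInto", "collA"),
    ("item2", "isGatheredInto", "collA"),
    ("collA", "type", "Image"),
    ("item1", "type", "Image"),
}


def holds(triples, predicate, subject, obj):
    """Return True if the atomic statement predicate(subject, obj) is asserted."""
    return (subject, predicate, obj) in triples


def match(triples, subject=None, predicate=None, obj=None):
    """Return, in sorted order, the triples agreeing with every non-None position."""
    pattern = (subject, predicate, obj)
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


# Which subjects are asserted to be gathered into collA?
members = [s for (s, _, _) in match(TRIPLES, predicate="isGatheredInto", obj="collA")]
print(members)
```

Leaving a position of the pattern unspecified plays the role of a query variable, which is what makes the translation from a rule antecedent into a query fairly direct.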
Collections from the IMLS DCC Collection Registry were chosen on the basis of the availability of item descriptions and the appearance of certain properties in those item descriptions. We were interested in examining patterns of values for type, format, temporal, and geographic elements and selected collections with item descriptions that used those attributes. For details on the processing of the OAI-PMH XML records into RDF, see Urban et al. (2010).
Problems for Rule Testing
Our metadata repository was built to support empirical testing of candidate rules, allowing us to search real world metadata for prima facie confirmation or refutation of our conjectures. However, testing these rules in an RDF repository is not as straightforward as it might seem.
The rules in the CIMR framework are universally quantified conditionals, evaluated as true if and only if the consequent is true for every case where the antecedent is true. Testing a rule by refutation therefore involves searching for an apparent counterexample, a case where (taking repository statements at face value) the antecedent of the rule is true and the consequent is false. Such a counterexample would then be evidence that the rule is false, although alternative explanations, such as errors made by metadata cataloguers or problems in subsequent processing, must also be considered. If no apparent counterexample is found, that might be taken, in the right circumstances, as providing some confirmation of the rule.
Evaluating alternative explanations of apparent counterexamples, and deciding in what circumstances the absence of counterexamples counts as confirmation, are familiar challenges in evaluating conditional claims of any kind. But there is a more distinctive and interesting problem in using logic-based queries to search for counterexamples to conditional rules in an RDF repository.
In the formal semantics of first order logic the truth value of a logical formula is relative to an interpretation, which is a series of statements assigning predicates to individual things, giving enough information to determine a truth value for the formula. Given an interpretation, a universally quantified conditional is evaluated by searching the list of statements for a counterexample: a statement or group of statements where the antecedent of the conditional is true and the consequent is false. If there is no counterexample, then the conditional is true in the interpretation; if there is a counterexample, then the conditional is false in the interpretation.
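The evaluation procedure just described can be sketched as follows. The rule used here, isGatheredInto(x, y) and type(y, z) implies type(x, z), is a hypothetical stand-in for a CIMR rule, and the data and function names are likewise assumptions made for illustration. The sketch treats the listed statements as a complete interpretation, so any atomic statement that is not listed counts as false.

```python
# A small interpretation, given as the complete list of true atomic statements.
# All identifiers and predicate names are hypothetical.
INTERPRETATION = {
    ("item1", "isGatheredInto", "collA"),
    ("item2", "isGatheredInto", "collA"),
    ("collA", "type", "Image"),
    ("item1", "type", "Image"),
}


def counterexamples(interpretation):
    """Find cases where the antecedent is true and the consequent is false.

    Hypothetical rule: isGatheredInto(x, y) and type(y, z) -> type(x, z).
    Unlisted statements are treated as false.
    """
    found = []
    for (x, p1, y) in interpretation:
        if p1 != "isGatheredInto":
            continue
        for (y2, p2, z) in interpretation:
            if p2 == "type" and y2 == y:
                # Antecedent holds for (x, y, z); check the consequent.
                if (x, "type", z) not in interpretation:
                    found.append((x, y, z))
    return sorted(found)


def rule_is_true(interpretation):
    """The conditional is true in the interpretation iff no counterexample exists."""
    return not counterexamples(interpretation)


print(counterexamples(INTERPRETATION))
print(rule_is_true(INTERPRETATION))
```

In this interpretation item2 supplies a counterexample only because the statement type(item2, Image) is unlisted and therefore false.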
Like a logical interpretation, our RDF testbed is also a series of statements assigning predicates to individuals, and so might be thought to promise a simple approach to rule evaluation. However, the parallel fails in an interesting way.
When interpretations are defined they typically make assignments by listing all true atomic statements; the atomic statements not listed are considered false in that interpretation. This closed world assumption, that whatever is not asserted as true is assumed false, is not appropriate for a repository whose statements are directly derived from metadata records. Metadata records are created in circumstances where there is typically no expectation that statements not asserted as true are assumed false, the policy of “mandatory if applicable” being the exception, not the rule. Rather, metadata repositories must be understood as making the “open world assumption,” where the absence of a statement does not license inferring the negation of that statement. Metadata is not an exception here; knowledge representation projects, including the semantic web and linked data, also make the open world assumption.
Refutation of our rules in the testbed would still be possible, even under the open world assumption, if the repository contained explicit negations of statements implied by a rule, and in fact specifications of logical interpretations do sometimes indicate which atomic statements are false. However, RDF cannot express denials of that sort, as it does not have logical negation.
Finally, all of our rules imply only positive atomic statements; there are no negations of atomic statements in rule consequents. This means that it will not be possible for the rule, rather than the repository, to supply the negation needed for a counterexample.
Taken together, these three things (the open world assumption, the lack of negation in RDF, and rules that imply only positive atomic statements) mean that it is not possible, without additional axioms, to find statements in our RDF repository which are counterexamples to our rules.
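A minimal sketch of the consequence, again using a hypothetical rule (isGatheredInto(x, y) and type(y, z) implies type(x, z)) and hypothetical data: when the repository is read under the open world assumption and contains only positive statements, each instance of the antecedent can be classified as confirmed or unknown, but never as refuted.

```python
# Hypothetical repository of positive statements only (RDF has no negation).
REPOSITORY = {
    ("item1", "isGatheredInto", "collA"),
    ("item2", "isGatheredInto", "collA"),
    ("collA", "type", "Image"),
    ("item1", "type", "Image"),
}


def classify_open_world(repository):
    """Classify each antecedent instance of the hypothetical rule.

    Under the open world assumption a missing consequent is 'unknown',
    not false, so no instance is ever classified as 'refuted'.
    """
    results = {}
    for (x, p1, y) in repository:
        if p1 != "isGatheredInto":
            continue
        for (y2, p2, z) in repository:
            if p2 == "type" and y2 == y:
                asserted = (x, "type", z) in repository
                results[(x, y, z)] = "confirmed" if asserted else "unknown"
    return results


for instance, status in sorted(classify_open_world(REPOSITORY).items()):
    print(instance, status)
```

The instance involving item2 that would count as a counterexample in a closed interpretation is merely unknown here.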
Direct refutation is only possible with additional axioms, ones allowing inferences from the presence of some properties to the absence of others. Such axioms are a natural part of formal ontologies and can be expressed in semantic web languages such as OWL and SWRL. However, these ontologies have not been developed for DCMI metadata. Our empirical testing is therefore currently focused on exploration and confirmation, rather than refutation, searching for rules that are a best match to patterns observed in the data. Refutation based on additional metadata semantics is planned for future work.
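To illustrate how an additional axiom could make refutation possible, the sketch below assumes a purely hypothetical disjointness axiom stating that nothing has both the type value Image and the type value Sound. This axiom is invented for the example and is not drawn from DCMI or from any existing ontology; the rule, data, and function names are hypothetical as well.

```python
# Hypothetical repository of positive statements.
REPOSITORY = {
    ("item1", "isGatheredInto", "collA"),
    ("item2", "isGatheredInto", "collA"),
    ("item3", "isGatheredInto", "collA"),
    ("collA", "type", "Image"),
    ("item1", "type", "Image"),
    ("item2", "type", "Sound"),
}

# Hypothetical axiom: these pairs of type values cannot hold of the same thing.
DISJOINT_VALUES = {frozenset({"Image", "Sound"})}


def refuted_instances(repository, disjoint_values):
    """Find antecedent instances whose consequent is excluded by a disjointness axiom.

    Hypothetical rule: isGatheredInto(x, y) and type(y, z) -> type(x, z).
    An instance is refuted when x is asserted to have a value w disjoint from z,
    since the axiom then lets us infer that type(x, z) is false.
    """
    refuted = []
    for (x, p1, y) in repository:
        if p1 != "isGatheredInto":
            continue
        for (y2, p2, z) in repository:
            if p2 != "type" or y2 != y:
                continue
            for (x2, p3, w) in repository:
                if x2 == x and p3 == "type" and frozenset({z, w}) in disjoint_values:
                    refuted.append((x, y, z, w))
    return sorted(refuted)


print(refuted_instances(REPOSITORY, DISJOINT_VALUES))
```

Here item2 yields a counterexample because of the assumed axiom, while item3, which has no type statement at all, remains unknown rather than refuted.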