Working primarily in the Semantic Web environment, we harvested 25 indexing languages, each of which met three criteria: (1) it was freely available, (2) it was considered valuable by a body of designers or users (it was always in use and was most often maintained by, and given the imprimatur of, a respected organization), and (3) it was used to index or curate documents. These languages spanned what we commonly understand to be thesauri, ontologies, and folksonomies. The domain of each indexing language was of secondary concern for this research question; we accounted for it in our analysis, but it did not shape our sampling procedure. The 25 indexing languages that met these criteria are listed online (Good & Tennis, 2008).
Here, when discussing indexing languages, we refer only to the set of terms that compose them - the tags in folksonomies, and the concept labels in thesauri and ontologies. Each of the terms in these indexing languages was subjected to a normalization process meant to help consistently delineate the boundaries of compound words. We carried out four normalization procedures: (1) all non-word characters (e.g., “;”, “_”, “-”) were mapped to spaces using a regular expression (\W), so the term “automatic-ontology_evaluation” would become “automatic ontology evaluation,” (2) case-delineated compound words were mapped to space-separated words (e.g., “camelCase” becomes “camel case”), (3) all words were converted to lower case, and (4) any duplicate terms were removed.
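As an illustration, the four normalization steps might be sketched in Python as follows. This is a minimal sketch and not the code used in the study: the function names `normalize_term` and `normalize_term_set` are hypothetical, and the character class `[\W_]` is an assumption made because Python's `\W` alone does not match the underscore.

```python
import re


def normalize_term(term: str) -> str:
    """Apply normalization steps (1)-(3) to a single term."""
    # (1) Map non-word characters to spaces. Assumption: the underscore is
    # added explicitly, since Python's \W treats "_" as a word character.
    term = re.sub(r"[\W_]+", " ", term)
    # (2) Split case-delineated compounds: "camelCase" -> "camel Case".
    term = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", term)
    # (3) Convert to lower case and trim surrounding whitespace.
    return term.lower().strip()


def normalize_term_set(terms):
    """Normalize every term, then (4) drop duplicates, keeping first-seen order."""
    seen = {}
    for raw in terms:
        normalized = normalize_term(raw)
        if normalized:  # ignore terms that normalize to the empty string
            seen.setdefault(normalized, None)
    return list(seen)
```

For example, `normalize_term("automatic-ontology_evaluation")` yields `"automatic ontology evaluation"`, and `normalize_term("camelCase")` yields `"camel case"`.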
For the primary analysis, a variety of measurements were recorded for each term set. These metrics, summarized in Table 1, included indicators of the size of the term sets, the lengths of the terms, and the apparent levels of modularity within the sets. Measures of modularity expose the structure of the term set based on the proportions of multi-word terms and the degrees of sub-term re-use. These measures fall into two main categories, Observed Linguistic Precoordination (OLP) and Compositionality.
OLP indicates whether a term appears to be a union of multiple terms based on syntactic separators. For example, the MeSH term “Fibroblast Growth Factor” would be observed to be a linguistic precoordination of the terms “Fibroblast”, “Growth”, and “Factor” based on the presence of spaces between the terms. We categorize terms as uniterms (one term), duplets (combinations of two terms), triplets (combinations of three terms), or quadplus terms (combinations of four or more terms). Using these categorizations, we also record the “flexibility” of a term set as the fraction of sub-terms (the terms that are used to compose duplets, triplets, and quadplus terms) that also appear as uniterms.
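The OLP categories and the flexibility measure can be sketched as below. This is an illustrative sketch only: the function name `olp_profile` is hypothetical, terms are assumed to be already normalized and space-separated, and flexibility is assumed to be computed over distinct sub-terms.

```python
from collections import Counter


def olp_profile(terms):
    """Count OLP categories and compute flexibility for a normalized term set."""
    terms = set(terms)
    counts = Counter(uniterms=0, duplets=0, triplets=0, quadplus=0)
    uniterms = set()
    subterms = set()  # distinct words used inside multi-word terms
    for term in terms:
        words = term.split()
        if not words:
            continue  # skip empty strings
        if len(words) == 1:
            counts["uniterms"] += 1
            uniterms.add(term)
        else:
            category = {2: "duplets", 3: "triplets"}.get(len(words), "quadplus")
            counts[category] += 1
            subterms.update(words)
    total = sum(counts.values())
    fractions = {k: (v / total if total else 0.0) for k, v in counts.items()}
    # Flexibility: share of distinct sub-terms that also occur as uniterms.
    flexibility = len(subterms & uniterms) / len(subterms) if subterms else 0.0
    return {"counts": dict(counts), "fractions": fractions, "flexibility": flexibility}
```

For the toy set `["fibroblast growth factor", "growth", "gene", "gene ontology"]`, five distinct sub-terms are used in multi-word terms and two of them (“growth” and “gene”) also appear as uniterms, giving a flexibility of 0.4.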
The OLP measurements were adapted from characteristics of indexing languages introduced by Van Slype, who, in the process of comparing thesauri to the ISO Standard, identified a number of simple measures for gauging the extent of a thesaurus (Bureau-Marcel-Van-Dijk, 1976). His measures provide numbers that give the basic extent of these indexing languages. These were proposed as benchmarks for standards revision. They outlined the anatomy of the sample of thesauri in English, French, and German. Our intent in using a subset of these metrics here is to provide a means to generate such an anatomical description of any indexing language.
The OLP measures were extended with related measures of “compositionality” (Ogren, Cohen, Acquaah-Mensah, Eberlein, & Hunter, 2004). Compositionality measures include (a) the number of terms that contain another complete term as a proper substring, (b) the number of terms that are contained by another term as a proper substring, (c) the number of different complements used in these compositions, and (d) the number of different compositions created with each contained term. A complement is a subterm that is not itself an independent member of the set of terms. For example, the term set containing the two terms “macrophage” and “derived from macrophage” contains one complement - “derived from”. A composition is a combination of one term from the term set with another set of terms (forming the suffix and/or the prefix to this term) to form another term in the set. For example, in the Association for Computing Machinery (ACM) subject listing, the term “software program verification” contains three subterms that are also independent terms (“software”, “program”, and “verification”). According to our definition, this term would be counted as three compositions - “software”+suffix, prefix+“program”+suffix, and prefix+“verification”. As another example, the term “denotational semantics” would result in only one composition because “semantics” is an independent term while “denotational” is not (and thus is a complement as defined above).
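One way to make these definitions concrete is the sketch below. It rests on several assumptions that are not stated in the text: containment is tested on whole words rather than on raw characters, each occurrence of a contained term counts as one composition, and a complement is taken to be any non-empty prefix or suffix left over that is not itself a term in the set. The function name `compositionality` is hypothetical.

```python
from collections import Counter


def compositionality(terms):
    """Compute containment, complement, and composition counts for a term set."""
    term_set = set(terms)
    contains, contained_by, complements = set(), set(), set()
    per_term = Counter()  # compositions created with each contained term
    for outer in term_set:
        words = outer.split()
        n = len(words)
        for i in range(n):
            for j in range(i + 1, n + 1):
                if j - i == n:
                    continue  # the whole term is not a proper sub-sequence
                inner = " ".join(words[i:j])
                if inner not in term_set:
                    continue
                contains.add(outer)
                contained_by.add(inner)
                per_term[inner] += 1
                # Leftover prefix/suffix pieces that are not terms are complements.
                for piece in (" ".join(words[:i]), " ".join(words[j:])):
                    if piece and piece not in term_set:
                        complements.add(piece)
    return {
        "contains_another": len(contains),
        "contained_by_another": len(contained_by),
        "complements": complements,
        "compositions": sum(per_term.values()),
        "compositions_per_term": dict(per_term),
    }
```

Under these assumptions, the set {“macrophage”, “derived from macrophage”} yields one composition and the single complement “derived from”, and a set containing “software”, “program”, “verification”, and “software program verification” yields three compositions, matching the examples above.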
Modularity, though not indicative of conceptual structure or meaning, is indicative of the factors that go into the semantics of an indexing language and shape its use. Here we are guided by Soergel's rubric from concept description and semantic factoring. He tells us: “we may note that often conceptual structure is reflected in linguistic structure; often multi-word terms do designate a compound concept, and the single terms designate or very nearly designate the semantic factors. Example: Steel pipes = steel:pipes [demonstrating the factoring]. This fact can be used in thesaurus building” (Soergel, 1974, p. 75). The combination of terms, or the factoring out of semantics, is theoretically important for another reason: it shapes the result of indexing, which we call indexes here.
Together, these measurements combine to begin to form a descriptive picture of the anatomy of the many diverse term sets used for indexing. Table 1 lists and provides brief definitions for all of the term set measurements taken.
Table 1. Parameters of Indexing Languages
|Parameter|Definition|
|---|---|
|Number of distinct terms|The number of syntactically unique terms in the set.|
|Term length|The length of the terms in the set. We report the mean, minimum, maximum, median, standard deviation, skewness, and coefficient of variation for the term lengths in a term set.|
|OLP uniterms, duplets, triplets, quadplus|We report both the total number and the fraction of each of these categories in the whole term set.|
|OLP flexibility|The fraction of OLP sub-terms (the independent terms that are used to compose precoordinated terms) that also appear as uniterms.|
|OLP number of subterms per term|The number of subterms per term is zero for a uniterm (“gene”), two for a duplet (“gene ontology”), three for a triplet (“cell biology class”), and so on. We report the mean, maximum, minimum, and median number of subterms per term in a term set.|
|Contains another|The terms that contain another term from the same set. Both the total and the proportion of terms that contain another are reported.|
|Contained by another|The terms that are contained by another term from the same set. Both the total and the proportion of terms that are contained by another are reported.|
|Complements|A complement is a subterm that is not itself an independent member of the set of terms. The total number of distinct complements is reported.|
|Compositions|A composition is a combination of one term from the term set with another set of terms (forming the suffix and/or the prefix to this term) to form another term in the set. The total number of compositions is reported.|
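The term-length statistics listed in Table 1 could be computed along the following lines. This sketch makes assumptions the text does not settle: term length is measured in characters, the standard deviation is the sample standard deviation, and skewness is the uncorrected moment coefficient (statistical packages may apply a small-sample correction, so values can differ slightly). The function name `term_length_stats` is hypothetical.

```python
import statistics


def term_length_stats(terms):
    """Summarize term lengths (in characters) for a non-empty term set."""
    lengths = [len(t) for t in terms]
    if not lengths:
        raise ValueError("term set must not be empty")
    mean = statistics.fmean(lengths)
    # Sample standard deviation; defined as 0.0 for a single term.
    sd = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    # Second and third central moments for the moment coefficient of skewness.
    m2 = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    m3 = sum((x - mean) ** 3 for x in lengths) / len(lengths)
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "mean": mean,
        "min": min(lengths),
        "max": max(lengths),
        "median": statistics.median(lengths),
        "sd": sd,
        "skewness": skewness,
        "cv": sd / mean if mean else 0.0,  # coefficient of variation
    }
```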
To enable visualization of the relationships between variables of widely varying magnitudes, the non-ratio data, such as the sizes of the term sets, were log-transformed and then mapped to a 0-1 scale by dividing each value by the largest value in the set. The variables were then plotted on radar graphs to provide a clear visual representation of the distinct shapes of these indexing languages.
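The rescaling step can be sketched as follows. This is a minimal sketch under stated assumptions: all values are at least 1 (so that the logarithms are non-negative) and at least one value exceeds 1; the function name `log_scale` is hypothetical.

```python
import math


def log_scale(values):
    """Log-transform values, then divide by the largest log to reach a 0-1 scale."""
    if any(v < 1 for v in values):
        raise ValueError("values must be >= 1 so that logs are non-negative")
    logged = [math.log(v) for v in values]
    top = max(logged)
    if top == 0:
        raise ValueError("at least one value must exceed 1")
    return [x / top for x in logged]
```

Because each logged value is divided by the largest logged value, the choice of logarithm base does not affect the result. For instance, term-set sizes of 10, 100, and 1000 map to approximately 0.33, 0.67, and 1.0.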
In addition, we applied cluster analysis to the normalized data using seven variables that, upon earlier inspection, were highly variable in the sample. Those variables were: % of uniterms, % of duplets, flexibility, % contained by another, standard deviation of term length, skewness of term length, and number of complements. We used Ward's method of cluster analysis in SPSS. This allowed us to create a dendrogram illustrating the clusters of normalized indexing languages.
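The analysis itself was run in SPSS; the sketch below is only meant to illustrate the merging criterion behind Ward's method and does not reproduce that software's output. It assumes each indexing language is represented as a vector of the seven normalized variables and uses squared Euclidean distance; the function name `ward_merges` is hypothetical. At each step, it merges the pair of clusters whose union produces the smallest increase in within-cluster sum of squares.

```python
def ward_merges(points):
    """Agglomerate points with Ward's criterion; return the merge history.

    Each history entry is (cluster_id_a, cluster_id_b, cost, new_size), where
    cost is the increase in within-cluster sum of squares caused by the merge.
    Original points get ids 0..n-1; merged clusters get ids n, n+1, ...
    """
    # Each cluster is stored as (centroid, size).
    clusters = {i: ([float(x) for x in p], 1) for i, p in enumerate(points)}
    next_id = len(points)
    history = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
                # Ward's criterion: increase in within-cluster sum of squares.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(na * u + nb * v) / (na + nb) for u, v in zip(ca, cb)]
        clusters[next_id] = (centroid, na + nb)
        history.append((a, b, cost, na + nb))
        next_id += 1
    return history
```

The merge history contains the information a dendrogram displays: which clusters were joined, in what order, and at what cost.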