Query log analysis can provide valuable information for improving information retrieval performance. This paper reports findings from a query log mining project in which query terms falling in the very long tail of low to zero similarity scores (against the controlled vocabulary) were analyzed using similarity algorithms. The query log data was collected from the Gateway to Educational Materials (GEM). The limited number of terms in the GEM controlled vocabulary was a major cause of the long tail of low or zero similarity scores for the query terms. To mitigate this limitation, we used the general-purpose (domain-independent) ontology WordNet and the community-created Wikipedia as a bridge to establish semantic relatedness between the GEM controlled vocabulary (as well as new concept classes identified by human experts) and user query terms. The two sources, WordNet and Wikipedia, were complementary in mapping different types of query terms. A combination of both sources achieved a modest rate of mapping accuracy. The paper discusses the implications of the findings for automatic semantic analysis and for vocabulary development and validation.
Query logs record what users enter into a system when they search for information. The information captured can be useful for studying user search behavior when combined with session data. Numerous publications have reported analysis results from the perspectives of search behavior, interactions, cognitive processes, algorithm enhancements, and clustering, which have been reviewed by Jansen and Spink (2006). Because query terms represent the information needs of users, they are a valuable source not only for studying user search behavior, but also for enriching or validating an existing controlled vocabulary with new terms and topics. This paper focuses on analyzing the relationships between user query terms and controlled vocabulary, which involves three steps: 1) preprocessing query log data from the Gateway to Educational Materials (GEM), 2) processing query terms using WordNet and Wikipedia, and 3) using different similarity algorithms to automatically mine the semantic relationships between query terms and controlled vocabulary/new knowledge classes. The following sections first review relevant research and then describe the methodology and experiment results. Implications of the research findings are discussed in the conclusion section.
2. Related Research
Log analysis has many versions depending on the type of log data used and purpose of analysis. Access log is a source commonly used for evaluating access to and usage of Web pages and user behavior. By mining sequences and status of access to Web pages, researchers can discover the navigation patterns and generalize concept hierarchies (Spiliopoulou, 2000). Transaction logs originally refer to the history of actions executed by database management systems, but this term has also been used for query logs that are generated in information search systems (Jones et al., 2000; Jansen & Spink, 2006). Another term “query logs” also appears frequently in literature and is used interchangeably with “user logs” and the two mentioned above. Query logs keep information on what query terms a user entered, under what query options a query was executed, and how many results were returned to the user. Query log data may be associated with user's authentication information if login is required, and associated with web server log data to obtain access sequence information about queries. Although the data in these logs vary in format and detail, they are common in that all of them record users' interactions with the system and keep information about what actions users took to access and use the system.
Query log analysis has been used as a strategy to improve the effectiveness of information retrieval because the log data reflects user search trends and can predict future human interactions with search engines (Shokouhi, 2006). Researchers have studied frequencies of log data components such as the number of terms per query and the frequency distribution of query terms (Jansen et al., 2000; Spink et al., 2001), or correlations among query terms and pairs of query terms (Silverstein et al., 1999). Statistical descriptions of query log data have been used to study the features of user search behavior on commercial web search engines (Silverstein et al., 1999; Jansen et al., 1998). The findings from these studies show that typical search behavior today includes shorter queries than those used in traditional information retrieval systems and impatient result browsing (i.e., users rarely go beyond the first page when checking search results).
Query clustering and classification are often used in information retrieval to categorize queries in order to return relevant information to searchers. Kim and Seo (2005) developed an FAQ system using query log clustering. Their experiment showed that both precision and recall outperformed the traditional system, which demonstrated that query log clustering was a useful bridge between queries and FAQs (Kim & Seo, 2005). Query classification is another application of query log analysis in information retrieval. It is usually done by applying either supervised learning (Beitzel et al., 2005) or unsupervised learning (Wen et al., 2002; Beeferman, 2000) techniques. The supervised method refers to machine learning techniques such as Support Vector Machines and decision trees that are used together with human tagging as training data. In contrast, unsupervised learning focuses on computing similarity scores between queries without human judgment. The most frequently used algorithms include K-means and the expectation-maximization (EM) algorithm.
Automatic semantic similarity analysis is essentially calculating similarity scores between the target and source. Semantic similarity can be viewed as a special subset of semantic relatedness (Warin, 2004). Semantic relatedness refers to how two concepts are related, using any kind of relation, while semantic similarity only considers the IS-A relation (hypernymy/hyponymy). There are three different approaches to measuring similarity:
- 1.Lexical-based approach (Leacock & Chodorow, 1998; Wu & Palmer 1994): measures similarity based on path length and depth on certain lexical resources. The closer the two concept nodes are in taxonomy, the higher the degree of similarity is. WordNet is frequently used by this approach.
- 2.Corpus-based approach (Landauer & Dumais, 1997): measures similarity based on the probability or entropy calculated from corpus. The performance of this method is related to the corpus chosen. Some studies such as (Rapp, 2002) used syntagmatic or paradigmatic relationships to calculate the semantic similarity, which is also corpus-driven.
- 3.Hybrid approach (Lin, 1998; Resnik, 1995; Jiang & Conrath, 1997): uses both lexical and corpus resources to calculate the similarity scores.
Although query logs have been used to study user behavior and have served as the source for clustering and classification in information retrieval, most of them came from commercial Internet search engines, and the query terms covered a wide range of subjects in which controlled vocabulary was rarely involved. While users of Internet search engines are generally not concerned about controlled vocabulary, the usefulness and effectiveness of controlled vocabulary in information retrieval have been proven in specialized search systems such as the Unified Medical Language System (UMLS) (Aronson et al., 1997). Most digital libraries built for educational purposes offer a search option for using controlled vocabulary. However, keyword or full-text search has been found to be dominant (Qin & Prado, 2005). While many factors could have contributed to this search behavior, the lack of semantic relations between the controlled vocabulary and keywords puts users at a disadvantage because general users are not trained to think about the exact relationships, e.g., synonymy and polysemy, between the terms they use for searching and the controlled vocabulary (Dumais et al., 1990). The project reported in this paper used query log data from an educational digital library to mine semantic relationships between user query terms and the controlled vocabulary. We used a revised hybrid approach for calculating similarity scores that involved general (domain-independent) ontological resources to process query terms and match the result with the controlled vocabulary. The performance of matching was evaluated by using community-created (Wikipedia) and expert-created (WordNet) resources.
Search engines in educational digital libraries often use controlled vocabulary as one of the search options. Unfortunately, the lack of robust vocabularies and the heavy reliance on keywords in user searches caused a large number of zero hits or too low / too high number of hits for many searches (Qin and Prado, 2005). The research reported in this paper is an experiment for testing performances of three approaches used in mining user query terms for enriching the controlled vocabulary.
Domain-specific vocabularies may be built by using classical query classification, such as the method used by Beitzel et al. (2005). One problem with this method is that, because user queries often contain only one or two words, they contain too few features for machine learning algorithms to generate a reasonable recall rate. Another approach is to use general-purpose vocabulary sources to facilitate the construction of domain-specific vocabularies. WordNet and Wikipedia are two examples of general-purpose vocabulary sources. We used the second approach, in which WordNet and Wikipedia served as a “connecting bridge” to facilitate the knowledge capturing process. In doing this, we need to address two different but related questions:
- 1.Will the general-purpose ontology or knowledge repository be able to provide convincing coverage for the educational knowledge domain?
- 2.Will general ontological entity relationships be able to produce high-quality matches between query terms and controlled vocabulary in the educational domain?
GEM as an educational digital library provides access to Internet-based lesson plans, instructional units, and other educational materials. In fall 2003, by courtesy of the Gateway to Educational Materials (GEM), we collected query logs from a four-month (February, March, April, and August) period that generated 411,898 queries. We then wrote SQL programs to parse the queries in order to obtain a master list of query terms. The master list contained 1,044,043 query terms, including grade numbers and/or terms, topical keywords, book/movie titles, names for persons and organizations, geographical names, chronological expressions, and many other categories. We calculated the similarity scores between query terms and those in ERIC Thesaurus by using the built-in fuzzy grouping function in Microsoft SQL Server 2005. Description analysis of the query terms was reported in (Qin & Prado 2005).
We used two sources for processing the query terms. One is WordNet, the most frequently used structured lexical ontology, developed at Princeton University (Miller 1995). Unlike conventional thesauri, WordNet is indexed by senses and contains various kinds of semantic relations such as synonymy, antonymy, hyponymy, and meronymy. The relatedness between word senses is critically important for identifying similarities (Richardson et al., 1994), which will be elaborated in later sections.
The other source used for processing the query components is Wikipedia (www.wikipedia.org), a user-generated knowledge repository. Each article in Wikipedia represents a concept (mainly named entities and domain-specific terms) and is indexed by several subject categories. The fact that all of its content is created and maintained by users from all over the world makes it a useful resource for our purpose. WordNet is suitable for processing linguistic relationships for single words only. Wikipedia compensates for the limitations of WordNet by offering up-to-date information on current events, news, and concept changes. Compared to other encyclopedias (e.g., Encyclopedia Britannica, http://www.britannica.com), which are strictly controlled by a professional editorial team, Wikipedia has better coverage and comparable quality (Giles, 2005).
4. Similarity based on Edit Distance
A direct method for matching queries against controlled vocabulary (CV) is string (fuzzy) matching, namely calculating the edit distance between a query (including stop words) and a controlled vocabulary term, which is based on the number of edit operations, such as deletion, insertion, and substitution of sub-strings. Similar approaches have been proposed by Gusfield (1997).
In our experiment, we employed the Dice Coefficient, a popular similarity measure, to describe the similarity between query and CV strings. The Dice measure is defined as twice the number of common sub-strings shared by two strings divided by the total number of sub-strings in both tested entities. A Dice Coefficient of one (1) indicates that the strings are identical, whereas zero (0) indicates orthogonal strings. The formula can be expressed as follows:

$$\mathrm{Dice}(Q, CV) = \frac{2 \cdot SN(Q \cap CV)}{SN(Q) + SN(CV)}$$

where SN(Q ∩ CV) denotes the number of common sub-strings shared by the query term and the controlled vocabulary term, and SN(Q) and SN(CV) denote the number of sub-strings within the query term and the controlled vocabulary term, respectively. We used bi-gram sub-strings as the test unit.
The above method is simple yet has a distinctive advantage: unlike the classical vector space model (Salton et al., 1975), it takes string order into account. For example, the existing CV term “team teaching” is different from “teaching team” even though both contain the same tokens. In other words, a high Dice Coefficient score (e.g., > 0.85) implies a high probability that the two strings are similar in both linguistic form and semantics, especially when the strings are long (based on our experiment and human judgment).
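The bi-gram Dice measure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming character bi-grams over lowercased strings and multiset counting; the function names `bigrams` and `dice` are hypothetical, and the sketch is not the SQL Server fuzzy grouping function used in the experiment, so individual scores may differ slightly from those reported.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of character bi-grams for a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice(query, cv_term):
    """Dice coefficient: 2 * shared bi-grams / (bi-grams in query + bi-grams in CV term)."""
    q, cv = bigrams(query), bigrams(cv_term)
    total = sum(q.values()) + sum(cv.values())
    if total == 0:
        return 0.0  # both strings are too short to contain a bi-gram
    shared = sum((q & cv).values())  # multiset intersection keeps the minimum count
    return 2 * shared / total


# Word order matters because the bi-grams spanning the space differ.
same_order = dice("team teaching", "team teaching")
swapped = dice("team teaching", "teaching team")
lookalike = dice("observation", "reservation")
```

Under these assumptions, identical strings score 1, the swapped phrase scores slightly below 1, and the pair “observation” / “reservation” scores 0.8, which illustrates how a high string similarity can occur without any semantic relationship.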
|Similarity Score||Query terms||Similarity Score||Query terms|
The Dice Coefficient, however, can be of limited use when there is a very long tail of low scores. The small number of controlled terms used in our experiment (364 terms) was dwarfed by the large number of unique query terms (87295). The Dice Coefficient score distribution in Figure 1 shows that, among the 87295 unique query terms, scores below 0.5 accounted for more than 90% of the total. The perfect and near-perfect matches between query components and controlled vocabulary (Dice Coefficient scores > 0.85) accounted for less than 4% (2.54% + 0.14% + 0.51% = 3.19%). Compared to the long-tail percentage (90.56%, Dice Coefficient < 0.50), the “good matches” can be considered somewhat trivial.
One problem with the Dice Coefficient method is that a high similarity score does not necessarily mean a good match. For example, the similarity score between “observation” and “reservation” is 0.81; however, there is almost no semantic relationship between the two words, although only a few such cases were identified. To identify semantic relationships (similarity) between query terms and controlled vocabulary, we introduced two sources to process the long-tailed query terms in the dataset.
5. The WordNet Approach
As a large lexical level ontological database, WordNet is frequently used by various Natural Language Processing (NLP), Information Retrieval (IR), and Text Mining systems. WordNet can provide accurate definition of word senses and diverse relationships between words in English.
| Query terms | Count | Percentage (%) |
| Evaluation query terms | 659 | 100.00 |
| Found directly from WordNet | 103 | 15.63 |
| Found after stemming | 72 | 10.93 |
| All query terms found | 175 | 26.56 |
| Unable to find from WordNet | 484 | 73.44 |
Although WordNet is a powerful tool with universal coverage, it has limitations as a domain-independent tool. For instance, Hearst (1998) mentioned that WordNet's coverage of proper nouns in a specific domain is rather sparse, but the proper nouns are often important in application tasks. WordNet as a lexical resource can help us test the performance of query and controlled vocabulary match in the specific domain (educational digital library) and address two questions:
- 1.To what extent does WordNet cover query terms in the educational domain?
- 2.How well does WordNet perform in identifying the relatedness between query and controlled vocabulary strings in the education domain?
In this experiment, we randomly selected 659 queries from the log data for evaluation purposes and manually mapped controlled vocabulary terms to WordNet words (as CV classes), such as “Art”, “Sport”, and “History.”
5.1 Coverage Test
To address the first question, we searched the WordNet database for the whole query log dataset, including proper nouns. During this process, a stemming algorithm (such as the Porter stemmer) is necessary to process the query strings. We were able to locate most single-token noun queries in WordNet (except some numbers, rarely used terminology, or class labels), but failed to identify most proper nouns, e.g., “1001 Arabian Nights” and “12th amendment”, except for a few (e.g., “Adolf Hitler”) that were defined in WordNet. The majority of unsuccessful matches included queries for person names, titles of books, software, movies, or geographical names. Because WordNet is a linguistic tool rather than an encyclopedia, it cannot always provide complete and accurate word sense information. For example, “Hamlet” has three senses in WordNet:
- 1.Hamlet, crossroads – (a community of people smaller than a village)
- 2.Hamlet – (the hero of William Shakespeare's tragedy who hoped to avenge the murder of his father)
- 3.Village, hamlet – (a settlement smaller than a town)
There is no mention in the above senses that “Hamlet” is also a “film.” A similar example is “Vista,” which WordNet defines as “view, aspect, prospect, scene, vista, panorama,” but none of these is associated with “software” or “operating system.” The lack of proper nouns and up-to-date definitions in WordNet makes it challenging to address the second question mentioned above. The data in Figure 2 shows that among the 659 query terms, we were able to find exact WordNet matches for 103 single-word query terms, and an additional 72 after applying the Porter stemmer transformation. The test demonstrated that WordNet could cover only about a quarter of the query terms (175, 26.56%). This result is consistent with the testing result for the large collection of 87295 query terms.
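The two-stage lookup behind the coverage test (direct match first, then a match after stemming) can be sketched as follows. This is only an illustration under stated assumptions: `crude_stem` is a deliberately rough stand-in for the Porter stemmer, and the lexicon and queries are made-up toy data rather than WordNet or the GEM log.

```python
def crude_stem(word):
    """Rough stand-in for a Porter stemmer: strip a few common suffixes."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            # "stories" -> "story"; other suffixes are simply removed
            return word[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return word


def coverage(queries, lexicon):
    """Count queries found directly, found only after stemming, and not found."""
    counts = {"direct": 0, "stemmed": 0, "missing": 0}
    for query in queries:
        term = query.lower().strip()
        if term in lexicon:
            counts["direct"] += 1
        elif crude_stem(term) in lexicon:
            counts["stemmed"] += 1
        else:
            counts["missing"] += 1
    return counts


# Hypothetical toy data for illustration only.
toy_lexicon = {"volcano", "dinosaur", "story"}
toy_queries = ["volcano", "dinosaurs", "stories", "1001 arabian nights"]
result = coverage(toy_queries, toy_lexicon)
```

In the toy run, one query is found directly, two are found after stemming, and the proper noun is not found, mirroring the pattern (though not the numbers) reported in Figure 2.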
5.2 Relationships between Query Terms and CV
To identify the relatedness between query terms and CV, we calculated the semantic relationships between them by analyzing the paths provided by WordNet. The semantic similarity between words was represented by the hierarchically structured lexical information in WordNet, namely IS-A or hypernymy/hyponymy relations (Warin, 2004). Most taxonomy-based semantic similarity algorithms are evaluated in terms of the distance between the words or phrases (nodes in the tree structure) in the taxonomy: the shorter the path from one node to another, the more similar they are. In the case of multiple paths, the shortest path represents the strongest relationship. In the WordNet taxonomy, it has been noted that the semantic “distance” covered by individual taxonomic links is variable, because certain sub-taxonomies are much denser than others. This intrinsic problem of WordNet can be reduced by using the maximum depth (from the lowest node to the top) of the taxonomy in which both words occur to normalize the similarity score.
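One common way to normalize a shortest-path score by taxonomy depth is the Leacock and Chodorow measure cited earlier, which takes the negative log of the path length divided by twice the maximum depth. The sketch below illustrates that idea on a hypothetical toy taxonomy; it is an assumption for illustration, not necessarily the normalization used in the experiment, and the path length is counted in nodes (links plus one).

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parent), not the real WordNet hierarchy.
PARENT = {
    "cat": "feline", "feline": "carnivore",
    "dog": "canine", "canine": "carnivore",
    "carnivore": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
}


def hypernym_chain(concept):
    """Return the concept followed by its hypernyms up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of IS-A links on the path between a and b through their lowest common ancestor."""
    chain_a, chain_b = hypernym_chain(a), hypernym_chain(b)
    for steps_up, node in enumerate(chain_a):
        if node in chain_b:
            return steps_up + chain_b.index(node)
    return None  # no common ancestor


def max_depth():
    """Depth of the deepest node, counted in nodes from that node up to the root."""
    nodes = set(PARENT) | set(PARENT.values())
    return max(len(hypernym_chain(node)) for node in nodes)


def normalized_path_similarity(a, b):
    """-log(path length in nodes / (2 * maximum depth)); higher means more similar."""
    links = path_length(a, b)
    if links is None:
        return 0.0
    return -math.log((links + 1) / (2 * max_depth()))


near = normalized_path_similarity("cat", "dog")
far = normalized_path_similarity("cat", "sparrow")
```

In this toy tree, “cat” and “dog” are four links apart while “cat” and “sparrow” are six links apart, so the first pair receives the higher score.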
The Lin algorithm (Lin, 1998) identifies the degree of relatedness between query terms and CV based on a probability model and information content. The basic mechanism of this algorithm is first to find the Lowest Common Subsumer (LCS) of a query term and a controlled vocabulary term through WordNet, e.g., the LCS for both “Apple” and “Pear” is “Fruit,” and then to calculate the probabilities (computed from corpus frequencies). The following formula (Lin, 1998) computes the similarity between query terms and controlled vocabulary:

$$Sim_{Lin}(Q, CV) = \frac{2 \cdot \left(-\log p(LCS(Q, CV))\right)}{\left(-\log p(Q)\right) + \left(-\log p(CV)\right)}$$

where SimLin(Q, CV) represents the similarity score between the query term and CV, −log p(LCS(Q, CV)) is the negative log of the probability of the LCS of the query term and CV, and (−log p(Q)) + (−log p(CV)) is the sum of the negative logs of the probabilities of the query term and CV, respectively. Figure 1 is an example to demonstrate how the LCS is calculated. In the WordNet database, feline is the hypernym of the query term cat, and canine is the hypernym of the query term dog. Carnivore is the immediate hypernym of both feline and canine, and hence the LCS of cat and dog. This IS-A relationship network can be identified by using the WordNet database and can be written as:

cat IS-A feline IS-A carnivore
dog IS-A canine IS-A carnivore
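The Lin computation described above can be sketched on a small example. The taxonomy and the frequency counts below are hypothetical toy values chosen for illustration, not WordNet data or real corpus statistics, and the function names are assumptions; the point is only to show how the LCS and the information content combine into a score.

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parent); not the real WordNet graph.
PARENT = {
    "cat": "feline",
    "dog": "canine",
    "feline": "carnivore",
    "canine": "carnivore",
    "carnivore": "animal",
    "bird": "animal",
}

# Made-up corpus frequencies for the leaf concepts, for illustration only.
LEAF_COUNTS = {"cat": 40, "dog": 50, "bird": 10}


def ancestors(concept):
    """Return the concept followed by its hypernyms up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def concept_probabilities():
    """Propagate leaf counts upward so each node also counts its descendants."""
    counts = {}
    for leaf, n in LEAF_COUNTS.items():
        for node in ancestors(leaf):
            counts[node] = counts.get(node, 0) + n
    total = sum(LEAF_COUNTS.values())
    return {node: n / total for node, n in counts.items()}


def lowest_common_subsumer(a, b):
    """First ancestor of `a` (walking upward) that is also an ancestor of `b`."""
    b_ancestors = set(ancestors(b))
    for node in ancestors(a):
        if node in b_ancestors:
            return node
    return None


def lin_similarity(a, b, prob):
    """2 * IC(LCS) / (IC(a) + IC(b)), where IC(c) = -log p(c)."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return 0.0
    info = lambda c: -math.log(prob[c])
    denominator = info(a) + info(b)
    return 2 * info(lcs) / denominator if denominator > 0 else 0.0


prob = concept_probabilities()
lcs_cat_dog = lowest_common_subsumer("cat", "dog")
score_cat_dog = lin_similarity("cat", "dog", prob)
```

With these toy counts the LCS of cat and dog is carnivore, and the score falls between 0 and 1; a pair whose only common subsumer is the root receives a score of 0 because the root carries no information content.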
From similarity scores calculated for all query terms against the candidate controlled vocabulary classes (provided by human expert), we found that the good matches with a high precision rate were those query terms identified from WordNet. Figure 3 presents the experiment result at three similarity levels. The percentages of correct matches suggest that semantic relationships mapped between query terms and CV classes are reliable as long as the position of target query term can be identified in the WordNet tree. When similarity scores are greater than 0.3, the precision rate of the correct match increased to 98.46%, a near perfect match. Having carefully examined the incorrect matches, we found that some popular word senses were absent from WordNet as mentioned earlier in this paper. The incorrect match example “Adobe” in Figure 3 (similarity score = 0.295) was a result of word sense presence/absence from WordNet: the available definition for adobe is “a kind of clay or brick” which led to “construction,” while the word sense for adobe as a proper name is absent.
Overall, the result (Figure 3) from semantic path based match was encouraging and showed a better precision rate than that of the WordNet coverage test outcome (Figure 2). This demonstrates that even though we can only find a portion of educational query terms, the semantic matches provided by the semantic relationships in WordNet show a high level of accuracy.
6. The Combined Wikipedia & WordNet Approach
The result from the last section demonstrates that, although WordNet was able to provide accurate matches for the query terms, there was a low rate of recall and coverage (only 26.56% of query terms matched WordNet equivalents) and some important word senses were absent from WordNet. To increase the lexical and semantic coverage of query terms, additional sources would be needed to compensate for the limitations of WordNet. The fact that our data was collected from educational digital library query logs provides us with two advantages: first, the query terms are concentrated in the academic/educational domain; second, they contain a large number of proper nouns. A user-created knowledge repository such as Wikipedia would satisfy the need for domain-specific terms and proper names and provide closer and fuller matches for the query terms. Figure 4 compares some major characteristics of Wikipedia and WordNet.
6.1 Coverage of Query Terms by Wikipedia
Similar to the WordNet approach, we searched Wikipedia for the 484 query terms that did not match any words in WordNet. Most query terms found to have matches in WordNet were also located in Wikipedia. We were able to locate an additional 194 query terms in Wikipedia, most of which were proper nouns, e.g., “three billy goats gruff” (Norwegian fairy tale). Furthermore, we partially matched 168 query terms in Wikipedia, e.g., “1950's rock and roll” matched the “rock and roll” entry with “1950's” as the modifier.
| Characteristic | WordNet | Wikipedia |
| Maintenance | Not often updated | Updated frequently and regularly |
| Indexed unit | Word senses | Categories / Semantic tags |
| Relationship | Accurate relationship | Inexact relationship |
| Coverage | Word / Common phrase | Word / Proper noun |
| Query terms | Count | Percentage (%) |
| Query terms found in WordNet | 175 | 26.56 |
| Wikipedia exact match | 194 | 29.44 |
| Wikipedia partial match | 168 | 25.49 |
| Unfound in either WordNet or Wikipedia | 122 | 18.51 |
The experiment (Figure 5) produced an encouraging result: 194 + 168 = 362 additional query terms were found in Wikipedia, which increased the coverage of the 659 query terms by 29.44% + 25.49% = 54.93%. The result supported our hypothesis that a general-purpose knowledge repository such as Wikipedia can cover most query terms in the educational domain.
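As a quick consistency check, the percentages can be recomputed from the counts reported in Figure 5 (a worked calculation from the reported numbers, not an additional result):

$$\frac{194 + 168}{659} = \frac{362}{659} \approx 54.93\%, \qquad \frac{175 + 362}{659} = \frac{537}{659} \approx 81.49\%, \qquad \frac{122}{659} \approx 18.51\%$$

The second value is the combined WordNet and Wikipedia coverage, and the third is the share of query terms found in neither source.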
6.2 Calculation of relationship between query and CV through WordNet + Wikipedia
Wikipedia employs a two-way category system (Voss, 2006), known as article-centric and category-centric. From the article-centric perspective, a user/creator is authorized to provide subject categories (similar to semantic tags) to Wikipedia pages. The subject categories can be considered as collaborative tagging (Zesch, 2007) to Wikipedia articles. From the category-centric perspective, each category may be assigned to one or more articles that are semantically related. Two examples below demonstrate this method:
Proper noun: 1001 Arabian Nights
Wikipedia categories:
Motif of harmful sensation
The Book of One Thousand and One Nights

Proper noun: 1906 earthquake
Wikipedia categories:
History of the San Francisco Bay Area
1906 San Francisco earthquake
The analysis result showed that the query terms had 5.66 categories (semantic tags) on average and most categories are phrases instead of a single word. However, Wikipedia cannot present reliable relationships between categories the same way as WordNet does due to its uncontrolled nature. Even though Wikipedia has a high semantic coverage for our data, we cannot calculate the similarity scores simply based on its category hierarchy.
To improve the reliability of the expected relationships, we developed a vote algorithm to match the query terms and CV classes with the help of WordNet. We used WordNet in the algorithm because of its high-quality matching (see section 5.2). The vote algorithm includes the following steps:
- 1)Search the first category/semantic tag from WordNet, recorded as WN-category
- 2)If the category phrase does not exist in WordNet, remove the modifier and re-search
- 3)Calculate the Similarity between (WN-category, CV-class) by using Lin algorithm
- 4)Select the highest score CV-class, and the category will “Vote” for that
- 5)Repeat the last four steps and let every category “Vote” for its closest CV-class
- 6)The highest voted CV-class will be the result
For example, there were 10 categories in the query “1001 Arabian Nights” (listed above). Using the vote algorithm, each category would vote for its own CV class and the result would be “Literature” scores 5, “Music” scores 2, “Politics” scores 2, and “Software” scores 1. Therefore, the best choice of CV class is “Literature” for query “1001 Arabian Nights.”
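The voting steps above can be sketched as follows. This is a minimal sketch under stated assumptions: modifiers are assumed to precede the head noun, so "removing the modifier" is approximated by dropping leading words; the lexicon stands in for a WordNet lookup; the similarity table stands in for Lin scores; and the categories, classes, and scores are made-up toy values, not data from the experiment.

```python
from collections import Counter


def normalize(category, lexicon):
    """Steps 1-2: look the category up; if absent, drop leading modifier words and retry."""
    words = category.lower().split()
    while words:
        candidate = " ".join(words)
        if candidate in lexicon:
            return candidate
        words = words[1:]  # assumption: modifiers precede the head noun
    return None


def vote(categories, cv_classes, lexicon, similarity):
    """Steps 3-6: each category votes for its most similar CV class; the top class wins."""
    votes = Counter()
    for category in categories:
        wn_category = normalize(category, lexicon)
        if wn_category is None:
            continue  # a category with no lexicon entry cannot vote
        best = max(cv_classes, key=lambda cv: similarity(wn_category, cv))
        votes[best] += 1
    winner = votes.most_common(1)[0][0] if votes else None
    return winner, votes


# Hypothetical toy inputs for illustration only.
toy_lexicon = {"literature", "folklore", "songs"}
toy_scores = {
    ("literature", "Literature"): 0.9,
    ("folklore", "Literature"): 0.6,
    ("folklore", "Music"): 0.2,
    ("songs", "Music"): 0.8,
}
toy_similarity = lambda wn_category, cv: toy_scores.get((wn_category, cv), 0.0)

winner, votes = vote(
    ["Medieval literature", "Persian folklore", "Film songs", "Unknown thing"],
    ["Literature", "Music", "Politics"],
    toy_lexicon,
    toy_similarity,
)
```

In this toy run, two categories vote for “Literature” and one for “Music,” so “Literature” is selected. Ties are resolved here by list order and by `Counter.most_common`, which is a design choice of the sketch; the paper does not state how ties were handled.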
Since the WordNet method had already achieved high precision, we focused only on the 362 cases that did not have a match in WordNet. The percentages of correct and incorrect matches calculated by using the vote algorithm are shown in Figure 6.
Compared with the result from using WordNet, this matching result was not as good as what WordNet achieved. It is still interesting, though, because Wikipedia was able to compensate for the weaknesses in WordNet. Figure 6 shows that Wikipedia achieved more than one-third correct matches among those query terms not found in WordNet.
|Query terms not found in WordNet but found in Wikipedia||362||100|
| 1996 Olympic Games | 2000 Olympic Games | 2008 Olympic Games |
| History of Atlanta | 2000 in Australia | Sports festivals hosted in China |
| 1996 Summer | Olympic mascots | 2008 Summer |
| 1996 in sports | 2000 Summer | Future sporting events |
| | Sports festivals hosted in Australia | |
The varied category names for the same topic demonstrate that there are no rules or guidelines for tagging the Wikipedia articles. Any terms or phrase structures may be used by anyone to describe the same or different concepts. If we were to further increase the rate of high-quality semantic matching, we would need to employ NLP techniques to process the article content.
In this paper we described the experiment results from using three different methods to test the performance of matching query terms in an educational digital library against its controlled vocabulary. We have a number of observations from the study:
- ▸The educational digital library users tended to use domain-specific nouns and proper nouns with various kinds of modifiers in their queries, which made the data cleaner than those from general-purpose commercial search engines.
- ▸When the controlled vocabulary had only limited terms to cover the domain knowledge, the edit distance-based match was not efficient because it would produce a very long tail with low accuracy rate (more than 96% with Dice Coefficient < 0.80).
- ▸It is possible to use domain independent ontology (expert-generated) to act as the connecting bridge to match query terms against the CV. Our experiment with WordNet demonstrates that, even though there was a low coverage rate (26.56%) for the query terms in our dataset, the precision rate could be decently high (83.33% – 98.46% depending on the threshold).
- ▸User-generated knowledge repositories such as Wikipedia could bridge the matching between query terms and CV terms. Our experiment with Wikipedia suggested a high coverage of educational user queries (81.49% when combined with WordNet). By using a vote algorithm based on user-provided subject categories, we achieved a modest rate of accurate matches (35.64%).
- ▸Studies of this kind need to consider issues in both query log mining and controlled vocabulary. The balance point between accuracy and coverage in this research slightly favored the latter. To improve the matching rate, more studies with natural language processing and machine learning algorithms are needed in the future.
This study provides evidence for supporting the use of general-purpose ontologies and user-generated knowledge repositories to identify semantic relationships between query terms and domain-specific CV in the absence of powerful controlled vocabularies, which is summarized in Figure 2.