Discourse analysis of online chat reference interviews for modeling online information-seeking dialogues

Authors


Abstract

Recent developments in Web technologies have enabled a rapid increase in computer-mediated information-seeking communication. This manuscript presents the dissertation research the author is undertaking, which will involve a discourse analysis of online chat reference dialogues. The goal of the study is to examine existing information-seeking behavior models in the recent context of computer-mediated communication (CMC), and to augment those theories by developing a dialogue model of information-seeking CMC. In doing so, the study will develop an annotation scheme that incorporates dialogue act analysis into the analysis of information behavior. The contribution of the study will be to inform the design of information retrieval systems that utilize CMC as an information resource or incorporate interactive user interfaces.

1 Introduction

While fully automated Web search engines have become the primary tool for information searching by the majority of Web users, human-in-the-loop systems such as product review systems (e.g. Amazon), community-based QA systems (e.g. Yahoo! Answers), and online chat services with information experts (e.g. help desk and virtual reference services) have also been increasing in popularity. By facilitating computer-mediated communication (CMC), these systems provide opportunities for information-seeking communication in natural language texts, which enables interactions that are more complex and richer in content than the exchanges of keywords and hyperlinks characteristic of traditional ‘one-shot’ Web search engines. Furthermore, current data storage technologies enable these service providers to store entire exchanges of texts, which creates a potentially useful information resource in itself and certainly valuable data for analyzing human information-seeking behavior through CMC. Although various methods have been proposed and studied to improve the performance of ‘one-shot’ information retrieval (IR) systems, fewer studies have examined the processes of information-seeking interactions and incorporated them into the design of IR systems. Therefore, this study aims to verify the assumptions behind the existing information-seeking behavior models in a more recent context, one that utilizes CMC, and to augment or refine the models by developing a dialogue model of information-seeking CMC. This re-examination of the models is valuable not because the models need yet another empirical validation, but because it will show if and how CMC has changed information-seeking behavior. Some researchers have suggested that the development of Web information technologies may have changed people's attitudes towards information seeking (Radford and Connaway, 2007).
In order to achieve these goals, the study will conduct a discourse analysis of online chat reference dialogues. By incorporating linguistic analysis, the study aims to provide a more holistic view of information-seeking interactions than previous studies of information-seeking behavior.

The rest of this manuscript is structured as follows. Section 2 presents related studies: models of information-seeking behaviors, discourse analysis and information seeking, and interactive information retrieval. Section 3 describes the proposed study, method, research questions, data, and the current progress, which is a pilot study. Lastly, Section 4 presents current issues and concluding remarks.

2 Related Studies

2.1 Models of Information-Seeking Behaviors

The theoretical development in the study of information-seeking behaviors is best represented by the development of models of information-seeking behaviors. A model is an abstract representation of concepts and relations among the concepts that collectively describe aspects of human information-seeking behavior. Models vary in their scope, range, and granularity of representation, but tend to focus on a more specific aspect than theories do. Among the various models, the ASK hypothesis by Belkin et al. (1982) and the berrypicking model by Bates (1989) are two classic models that emphasized the importance of interactions in information-seeking processes. The ASK hypothesis states that information seekers are not able to specify their information need precisely at the very early stage of information seeking, thus suggesting that IR systems need to employ strategies that elicit users' information needs interactively. The berrypicking model states that users' information needs are not static, but evolve throughout IR processes. It thus suggests that IR systems need to be able to provide not only a single piece of information, but also a series of information pieces, which may be obtained from different information resources and/or by different search strategies. Other suggested models of information seeking include Saracevic's Stratified Interaction Model (Saracevic, 1996), Kuhlthau's Information Search Process Model (Kuhlthau, 1991), Wilson's Information Behavior Models (Wilson, 1999), and Ingwersen's Cognitive IR Model (Ingwersen, 1996).

2.2 Discourse Analysis and Information Seeking

Based upon the ASK hypothesis, Belkin and other researchers at City University London conducted a series of studies in the 1980s with the goal of revealing the functions and structure of information-seeking interactions between intermediaries and information seekers, using methods and techniques developed for discourse linguistic studies (Belkin et al., 1983; Brooks and Belkin, 1983; Daniels et al., 1985; Daniels, 1986; Brooks, 1986; Brooks et al., 1986). The studies suggested that analysis of interactions between a trained intermediary and an information seeker can contribute to better designs of IR systems that facilitate interactions between the system and its user. The focus of the majority of IR research, however, has since shifted to system experiments based on “single-shot” search and evaluation, following the Cranfield experiment discipline developed by Cleverdon (Cleverdon, 1967). As a result, fewer studies have incorporated the notion of information-seeking interactions into IR research to this date.
Relevance feedback (Rocchio, 1971) is a classic technique that utilizes iterative feedback from users, but as Bates (1989) points out, the method assumes that users' information needs never change throughout the search process and thus misses the important function of interactions, i.e., understanding the user's information need within its context. In the field of library and information science, analyses of information-seeking behaviors are often conducted for the purpose of evaluating library services. Such studies often employ a discourse analysis method, because it allows researchers to analyze the interactions with minimal interference to the participants. Discourse analysis may also provide theoretical justifications for the phenomena of interest and the dimensions of analysis. For example, Westbrook (2007) analyzed a chat reference log in terms of the use of formality indicators, based on politeness theory (Brown and Levinson, 1987), and suggested that librarians' greater control of formality nuances encouraged a more effective search for users.
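To make the static-need assumption behind relevance feedback concrete, the classic Rocchio query modification can be sketched as follows. This is a minimal illustration with invented toy vectors and the commonly cited default weights, not a reconstruction of any particular system:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio (1971) query modification: move the query vector toward the
    centroid of relevant documents and away from non-relevant ones. The
    original query is always retained (weighted by alpha), which encodes
    the assumption that the underlying information need does not change."""
    q = alpha * np.asarray(query, dtype=float)
    if relevant:
        q += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

# Toy 3-term vocabulary: the reformulated query drifts toward the judged
# relevant document while staying anchored to the initial request.
q_new = rocchio([1.0, 0.0, 0.0],
                relevant=[[0.0, 1.0, 0.0]],
                nonrelevant=[[0.0, 0.0, 1.0]])
```

As Bates's critique suggests, no term of this update can represent a *changed* need: the initial query vector is never discarded, only re-weighted.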

2.3 Interactive Information Retrieval

The Text Retrieval Conference (TREC) is an annual evaluation conference, organized by the US National Institute of Standards and Technology (NIST), and has been the primary venue for IR researchers in the US to conduct batch-style retrieval tests on large collections. In the early years, TREC had an ‘interactive’ track, where queries were formulated by a human searcher while interacting with the data through the system. However, the inherent conflict between the need to control variables as an experiment and the need to include human searchers for the interactive component created difficulties in designing the evaluation of systems (Beaulieu et al., 1996), and the track was eventually discontinued. In order to bring the interactive components back to TREC, the High Accuracy Retrieval of Documents (HARD) track was introduced in 2003 (Allan, 2003). It started with three components: rich metadata, passage retrieval, and clarification dialogues, but clarification dialogues became the sole focus of the track in 2005 (Allan, 2005). Here, a clarification dialogue was defined as a single iteration of an information request from a system to its user and an information provision in response from the user to the system, through a ‘clarification form’ using standard Web interface elements, e.g., HTML texts, pull-down menus, text boxes, etc. The question was whether clarification dialogues would improve search performance. According to Allan (2005), some systems demonstrated “appreciable average gains from using clarification forms”.
Using a trained intermediary for creating the clarification forms, Lin et al. (2006) show that human involvement significantly improves IR performance and claim that analysis of interactions between a trained intermediary and an information seeker can inform the design of interactive IR systems. The proposed study is motivated by this assumption: the analysis of information-seeking interactions will benefit the design of interactive IR systems. The study is reminiscent of the research by Belkin et al. (Belkin et al., 1983; Brooks and Belkin, 1983; Daniels et al., 1985; Daniels, 1986; Brooks, 1986; Brooks et al., 1986) and other studies in the 1980s, such as Saracevic et al. (1988), which share this assumption, but it is unique in the following two aspects: 1) the study incorporates dialogue acts, analyzing the sociolinguistic functions of communicative behaviors, which, the author believes, will provide new insights into the analysis of information-seeking interactions; and 2) the analysis will be of transcripts of online chat dialogues, a relatively new form of communication with growing importance.

3 Proposed Study

3.1 Method

The study will apply discourse analysis to online chat reference dialogues, using a dialogue act annotation scheme. The annotation scheme is currently being developed through a pilot study, focusing on identifying the following two aspects of the interactions: 1) functions related to achieving the information-seeking task (information-seeking functions) and 2) functions related to the maintenance and management of the dialogue (communicative functions). The idea of having these two aspects is theoretically motivated by dialogue act theory (Bunt, 1994), as described in the next paragraph. The annotated data will be compared with a discourse model that is presumed by the traditional information-seeking behavior models. A new discourse model of online chat reference dialogues will be developed based on the annotated data. Fig. 1 illustrates the overall research design and the deliverables at each stage. Brief descriptions of the dialogue act annotations and discourse modeling follow. The subsequent subsections will describe the research questions, the data, the annotation environment, and lastly the pilot study, which is the current work in progress.

Figure 1.

The Research Design of the Proposed Study

* The oval shapes represent deliverables at each stage. In the first literature review, discourse models (DMs) of existing information-seeking theories are identified. Then in the pilot study, an annotation scheme is developed using pilot data and existing annotation schemes. Lastly, the main study will produce annotated data and the discourse model of the online chat reference dialogues.

Dialogue Act Analysis

Dialogue acts classify and analyze functions of utterances in a dialogue. In this study, a dialogue is conceptually defined as a series of exchanges of linguistic utterances by two cognitive agents in a certain context. Like speech acts (Austin, 1975; Searle, 1969), dialogue acts treat an utterance in a dialogue as a kind of action, but they also incorporate theoretical developments in analyzing dialogues in terms of the maintenance or management of a dialogue, such as understanding of turn-taking mechanisms (Sacks et al., 1974) or specification of underlying expectations for adjacency pairs (Goffman, 1981) – pairs of utterances such as a question and an answer or a greeting and a response. Dialogue act analysis is used to create a structural representation of a dialogue for designing a spoken dialogue system (Jurafsky and Martin, 2008). Of particular interest to this study, Bunt (1994) proposed the Dynamic Interpretation Theory, which states that dialogues are carried out by participants performing two kinds of tasks: 1) tasks to achieve the goal that motivated the dialogue and 2) tasks to maintain the dialogue itself in order to achieve goals associated with the context of the dialogue. In this study, these notions are translated into the information-seeking functions and communicative functions defined earlier, with respect to the context of the dialogues in the data, namely, the online chat reference service.
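The two-dimensional view above can be sketched as an annotation structure in which each utterance may carry a label in either or both dimensions. The labels and the sample exchange are the author's hypothetical illustrations, not the study's actual scheme:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    speaker: str                           # "patron" or "librarian"
    text: str
    # Two annotation dimensions, following Bunt's (1994) distinction:
    task_function: Optional[str] = None    # information-seeking function
    comm_function: Optional[str] = None    # dialogue-management function

# A hypothetical opening exchange from a chat reference session.
dialogue = [
    Utterance("patron", "Hi, are you there?", comm_function="contact-check"),
    Utterance("patron", "I need articles on water quality.",
              task_function="information-request"),
    Utterance("librarian", "Hello! Let me take a look.",
              task_function="search-plan", comm_function="greeting"),
]

# A single utterance may carry functions in both dimensions at once.
dual = [u for u in dialogue if u.task_function and u.comm_function]
```

Allowing both dimensions on one utterance matters: the librarian's turn above simultaneously greets the patron and announces a search plan, and a single-label scheme would be forced to discard one of those functions.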

Discourse Modeling

A discourse model is an individual's mental model of an on-going discourse (Jurafsky and Martin, 2008). A discourse model can be represented as a graph-based structure such as a Hidden Markov Model (HMM) or a Finite State Automaton (FSA). The study hypothesizes that existing IR systems presume a certain discourse model of information-seeking interactions, and it will therefore construct hypothetical models based on reviews of the literature. For example, relevance feedback mechanisms assume that an ideal information-seeking process involves minimal interaction between the user and the system, consisting of an initial search request from the user, followed by iterations of a presentation of search results by the system and feedback from the user on those results (Salton, 1971). Based on the annotations of the data, the author aims to develop a discourse model of online chat reference dialogues.
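The relevance-feedback discourse model just described can be rendered as a small FSA. The state and event names here are illustrative stand-ins chosen by the editor, not labels from the study:

```python
# A hypothetical FSA for the discourse model presumed by relevance
# feedback: an initial request, then iterations of result presentation
# and user feedback, ending when the user accepts the results.
TRANSITIONS = {
    ("start", "request"): "results",
    ("results", "feedback"): "results",   # the feedback loop
    ("results", "accept"): "done",
}

def run(events, state="start"):
    """Consume a sequence of dialogue events; return the final state,
    or None if an event is not licensed by the model."""
    for e in events:
        state = TRANSITIONS.get((state, e))
        if state is None:
            return None
    return state

final = run(["request", "feedback", "feedback", "accept"])
```

Note what the model cannot express: there is no transition for the user reformulating the need itself, which is precisely the behavior the berrypicking model predicts and this study expects to observe in chat reference dialogues.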

3.2 Research Questions

The following research questions reflect the research goals in light of the method described above:

Q1: What is the discourse model of online chat reference dialogues?

Q1.a: What are the roles of information-seeking functions in online chat reference dialogues and how do they occur in the dialogues?

Q1.b: What are the roles of communicative functions in online chat reference dialogues and how do they occur in the dialogues?

Q2: How do the existing models of information-seeking behavior fit in with the discourse model of online chat reference dialogues?

Q3: How does understanding the discourse model of online chat reference dialogues help research of IR?

Q3.a: How does it contribute to utilizing online dialogues as an information resource?

Q3.b: How does it contribute to facilitating information-seeking interactions?

Q1 will be answered by constructing an annotation scheme, applying the scheme to the data, and analyzing the annotated data. The annotation scheme will define the dialogue acts that the study will look for, and the analysis will aim to find structural patterns in the annotated data. Specifically, the author aims to construct graph representations of the discourse that illustrate how online chat reference dialogues proceed and what the consequences are (e.g., was the user's information need satisfied?).

Q2 will be answered by comparing the annotated data with the discourse model that the existing information-seeking models presume. Any refinement, if needed, will be noted and reflected in the newly constructed discourse model.

Lastly, in answering Q3, the study will provide implications for the design of IR systems in two respects: utilization of online dialogues as an information resource (Q3.a) and facilitation of information-seeking interactions (Q3.b). For Q3.a, the author will examine how dialogue acts, as additional inputs to an IR system, can contribute to improving the retrieval of information from online dialogue archives. For example, an IR system may be enabled to identify dialogues that are likely to contain the information that the user is looking for, thus improving the precision/recall of the system. For Q3.b, the author will look for structural patterns in dialogue acts that lead to successful or unsuccessful reference sessions.

3.3 Data

The data, provided by the Online Computer Library Center (OCLC), are a log of virtual reference service dialogues collected between December 2005 and December 2006.[1] Currently 450 sessions, which consist of 8,066 lines of messages, are formatted and stored in a MySQL database for analysis. Any information that may identify the participants of the reference sessions, e.g., e-mail addresses, names, and phone numbers, was replaced by a placeholder with a general descriptor (e.g., [Patron's name], [Librarian's e-mail address]) prior to the release of the data by OCLC. Using online chat reference transcripts as data has several advantages over other forms of textual data in terms of the validity of the data and the feasibility of analysis. First, the data are near-complete reproductions of actual interactions, i.e., the texts available to researchers are exactly the texts that the conversation participants exchanged, in contrast to transcripts of face-to-face reference sessions, where some non-linguistic and para-linguistic aspects of the interaction, such as tones, gestures, or eye movements, are lost. Second, the data were collected unobtrusively in a natural environment, i.e., from the users' point of view, the environment is the same whether or not the dialogues are logged. Third, the interactions are always between two people, thus presumably involving fewer complications in conversational threads than community-based question-answering sites (e.g., Yahoo! Answers or WikiAnswers), where many people may answer a question. And lastly, the responder to a question is always a professional reference librarian, and thus the author assumes the conversations are purposefully structured toward solving the information need of the user and involve less sidetracking.
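The bracketed-descriptor redaction described above can be approximated with simple pattern substitution. The patterns and descriptors below are illustrative guesses by the editor, not OCLC's actual anonymization procedure:

```python
import re

# Hypothetical redaction rules approximating the OCLC preprocessing:
# identifying strings are replaced by a bracketed general descriptor.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[E-mail address]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[Phone number]"),
]

def redact(line: str) -> str:
    """Replace any matched identifying string with its descriptor."""
    for pattern, descriptor in PATTERNS:
        line = pattern.sub(descriptor, line)
    return line

cleaned = redact("Contact me at jane.doe@example.edu or 555-123-4567.")
```

A descriptor-based placeholder (rather than outright deletion) preserves the dialogue act carried by the utterance, so the redacted transcript remains annotatable.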

3.4 Annotation Environment

The current annotation environment was developed using Java, with general Web/Database application development tools: MySQL, Hibernate, and Spring MVC. The tool allows researchers to annotate each utterance with multiple annotation schemes while modifying existing annotation schemes or adding a new one. Currently, the tool is used by a single coder (the author), but in the main study, the tool will be modified to support multiple coders.

3.5 Work in Progress: Pilot Study

The primary goal of the pilot study is to develop the annotation scheme that will be used in the main study, while exercising the same tasks as in the main study and testing the annotation environment.

A new annotation scheme is being developed by examining four existing annotation schemes against a pilot dataset, a randomly selected subset of the data. The following four schemes have been used to cover broad aspects of dialogues while describing the necessary details: 1) the functions of the general information provision mechanism proposed by Belkin et al. (1983); 2) the dialogue act annotation scheme DIT++, proposed by Bunt (2000); 3) the taxonomies of clarifying questions and information provisions used by Radford and Connaway (2005) to analyze digital reference dialogues; and 4) the annotation scheme that Crouch and Lucia (1980) developed for the analysis of reference dialogues. The selection of the schemes is not meant to be definitive; more may be added and some may be dropped as the study progresses. The first three schemes were chosen to represent three disciplines: information-seeking behaviors, dialogue acts, and digital reference; the fourth scheme was added later for its broad coverage and detailed code book.

The new scheme is being developed by finding the strengths and shortcomings of the current schemes as they are applied to the data. The new scheme will be developed to meet the following goals: 1) to cover all the utterances in the data; 2) to explain relations among annotations and levels of analysis; and 3) to help identify progress in the information-seeking process. For example, the existing schemes largely lack distinctions among the functions of information provision by intermediaries, such as explaining the next search plan or strategy, explaining the current work that the intermediary is performing, suggesting alternative resources, and so on. While question negotiation is generally perceived as consisting of clarifying questions from intermediaries and information provision from information seekers, it seems to be the case that information-seeking dialogues in reality involve information requests and provisions from both participants. Thus, the new annotation scheme must be able to distinguish and describe such functions. The organization of the annotation labels will also be important as the scheme grows richer. The author plans to organize the scheme hierarchically with multiple levels, each of which represents a level of analysis. The top level will classify functions in terms of the type (or dimension) of the function, i.e., information-seeking or communicative. The lower levels will distinguish functions in progressively finer detail.
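The hierarchical organization just described can be sketched as a nested taxonomy in which a leaf code resolves to a path from its top-level dimension down. All labels here are the author's hypothetical illustrations, not the scheme under development:

```python
# A hypothetical slice of a hierarchical annotation scheme: the top
# level separates the two dimensions; lower levels refine each function.
SCHEME = {
    "information-seeking": {
        "request": ["initial-question", "clarifying-question"],
        "provision": ["answer", "search-plan", "resource-suggestion"],
    },
    "communicative": {
        "opening": ["greeting", "contact-check"],
        "closing": ["thanks", "farewell"],
    },
}

def full_label(leaf):
    """Return the path from the top-level dimension to a leaf code,
    or None if the code is not in the scheme."""
    for dim, groups in SCHEME.items():
        for group, leaves in groups.items():
            if leaf in leaves:
                return f"{dim}/{group}/{leaf}"
    return None

label = full_label("search-plan")
```

One benefit of such a hierarchy is that analyses can be run at any level: coarse patterns over the two dimensions, or fine-grained patterns over leaf codes, without re-annotating the data.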

4 Current Issues and Concluding Remarks

4.1 Current Issues

This subsection briefly discusses the current issues of the study.

Interactions between intermediary and information system:

The study will look only at the interactions between the intermediary and the information seeker, not at interactions between the intermediary and the information system. For example, the study will not look at changes in the query strategies that the intermediary employs (e.g., the keywords used). This means that annotators will not know of changes in query strategies unless the intermediary explicitly mentions them in the dialogue. This may affect the theorizing about the interactions.

Analytical method:

As described in 3.1, the annotated data will be analyzed in order to find “structural patterns” – graph-based structures that represent transitions of states in the information-seeking process. This may be done either manually or statistically (by constructing an HMM or FSA). While statistical methods are preferred for their precision and objectivity, the amount of data may not be sufficient to construct a model converged enough to be interpretable, which calls into question the appropriateness of constructing graph-based representations as part of the method for the study.
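The statistical route mentioned above amounts, in its simplest first-order form, to maximum-likelihood estimation of transition probabilities between annotated acts. The label sequences below are invented for illustration; with so few sessions the estimates are coarse, which is exactly the convergence concern:

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Maximum-likelihood estimate of first-order transition
    probabilities P(next act | current act) from annotated sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

# Two invented annotated sessions using hypothetical act labels.
sessions = [
    ["greeting", "request", "answer", "thanks"],
    ["greeting", "request", "clarify", "request", "answer"],
]
probs = transition_probabilities(sessions)
```

With only two sessions, a transition observed once dominates its row; reliable estimates would require many more sessions per state, which is why manual pattern-finding remains a fallback for the study.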

4.2 Concluding Remarks

The study is still in the early stage of the pilot, and the author is still looking for alternative approaches to some of the problems that have started surfacing. By the time of the conference, the author aims to have finished the development of the annotation scheme and defended the proposal of the study. The poster will present the scheme and preliminary observations, as well as the overall description of the study presented in this manuscript.

Footnotes

  1. The data were originally prepared for an on-going research project by Radford and Connaway (2005) and became available to the author by courtesy of Dr. Radford, Dr. Connaway, and the OCLC.
