Information search and retrieval in microblogs



Modern information retrieval (IR) has come to terms with numerous new media in efforts to help people find information in increasingly diverse settings. Among these new media are so-called microblogs. A microblog is a stream of text that is written by an author over time. It comprises many very brief updates that are presented to the microblog's readers in reverse-chronological order. Today, the service called Twitter is the most popular microblogging platform. Although microblogging is increasingly popular, methods for organizing and providing access to microblog data are still new. This review offers an introduction to the problems that face researchers and developers of IR systems in microblog settings. After an overview of microblogs and the behavior surrounding them, the review describes established problems in microblog retrieval, such as entity search and sentiment analysis, and modeling abstractions, such as authority and quality. The review also treats user-created metadata that often appear in microblogs. Because the problem of microblog search is so new, the review concludes with a discussion of particularly pressing research issues yet to be studied in the field.


The term microblog describes an increasingly common information medium. Usually comprising brief textual entries written on an ongoing basis, a typical microblog is written by a single person or entity and is read by anywhere from zero to hundreds of thousands of “followers.” Users of microblog services such as Twitter are likely to maintain a microblog of their own. Likewise, each user typically follows some number of other people's microblogs. As millions of users broadcast updates to their followers, microblogs appear attractive insofar as they promise access to timely information written by people we have chosen to pay attention to.

However, finding and managing information in growing masses of microblog data are not trivial tasks. This review discusses the most pressing problems faced by those of us who would like to help people use microblog data as they pursue their information needs. To contextualize this review's motivation, consider four questions that fall under the broad rubric of microblog information retrieval (IR).

  1. There are millions of microbloggers. Which of those millions of people write often enough on topics that interest me and with sufficient authority that I should follow them?
  2. I follow over 1,000 people's feeds because they all post interesting material at least occasionally. But following 1,000 feeds, each of which is updated several times daily, leads to information overload. How can I winnow my incoming microblog updates so that I see only the most interesting entries?
  3. I plan to buy a new computer and am deciding between two models. I know that people voice opinions on Twitter about the merits of each model. How can I find trustworthy opinions on the matter, and how can I synthesize these opinions to help me choose a computer to buy?
  4. A wildfire is burning in my neighborhood. Many of my neighbors are posting information about the situation (such as evacuation zones, aid station statuses, and weather conditions in specific locations) to Twitter. What information relevant to my situation is available in this setting? Also, what information could I contribute that would be of use to others?

Several points are worth noting with respect to these questions. In question 1, the unit of retrieval is a person. Instead of textual documents, the person asking question 1 expects an IR system to suggest people who are worth following. Question 2 imagines a scenario in which the unit of filtering is the individual post. Thus, in microblog IR, the unit of retrieval is often either not obvious or in need of formalizing to permit tractable retrieval. Question 3 points to the role that subjectivity and opinion play in microblog IR. Because a great deal of microblog data expresses opinions, IR systems built on these data invite hopes of answering questions related to abstract matters such as consensus or dispute. Finally, in question 4 we see the role that time and place play in microblog IR. People post to microblogs at particular times in particular places. Rapidly updated data and knowledge of authors' locations suggest that microblog IR traffics in a notion of relevance that is often bounded—with fine granularity—by time and place.

The sections that follow elaborate on these ideas. Although this review focuses on IR, I must stress that microblog IR is in its infancy and thus my discussion ranges to cognate problems as well. Many of the problems that we face in microblog IR are part of established research areas. For instance, question 1 above poses problems very similar to expert finding, a well-studied IR problem. Question 3 involves modeling opinion, a problem that is widely treated in the area of sentiment analysis. More broadly, the popularity of microblogs must be understood in the larger context of increasingly pervasive social computing in general. In the remainder of this review, then, I will articulate the main challenges and opportunities facing those who study microblog IR. But doing so will require us to visit a broad palette of related literatures.

Overview of Microblogs

A microblog is usually maintained by an individual person or entity. It comprises periodic and brief posts (often called status updates) that collectively offer readers (usually called followers) timely information deemed interesting by the microblog's author. What constitutes interesting information is of course subjective. Many microblogs are highly personal and of interest only to their authors' close acquaintances. Other microblogs have a wider audience. For instance, thousands of people follow updates to celebrities' microblogs, affording these celebrities new venues for cultivating their fan base. Other microblogs have wide audiences because their content has broad appeal. Many people follow a small number of elite Twitter users because these people tend to post information that is prima facie of value.

At the time of this writing, several microblog services exist. For example, Facebook, Tumblr, Orkut, and FourSquare all offer microblog-type services. However, at this time the most visible microblogging platform is Twitter. Because of its widespread use among microbloggers, in the remainder of this review I will use examples, data, and vocabulary taken from the context of Twitter use.

Each user of Twitter maintains his or her own microblog. A user's microblog comprises individual posts, called status updates, each comprising a text string of no more than 140 characters. Twitter posts are also known as tweets. Although statuses are brief, they often point to external material by including a hyperlink to a document on the web, or to an image or other uploaded content. In Twitter jargon, a given user's sequence of statuses is called his or her timeline. For our purposes, this is identical to his or her microblog.

Twitter was launched in 2006 and saw rapid growth in membership when a critical mass of participants in the 2007 South by Southwest conference in Austin, Texas, adopted the service (Douglas, 2007; Terdiman, 2007). Since that time, Twitter's popularity has grown at an increasing rate (Beaumont, 2010; Twitter, 2010). Writing in The Daily Telegraph, Beaumont reports that as of February 2010, users post approximately 50 million tweets per day.

People's motivations for microblogging are diverse (Jansen, Zhang, Sobel, & Chowdury, 2009; Java, Song, Finin, & Tseng, 2007; Krishnamurthy, Gill, & Arlitt, 2008; Zhao & Rosson, 2009), ranging from offering autobiographical updates to reporting on current news events (Diakopoulos & Shamma, 2010; Shamma, Kennedy, & Churchill, 2009) or crisis situations (Longueville, R. S. Smith, & Luraschi, 2009; Sakaki, Okazaki, & Matsuo, 2010; Vieweg, Hughes, Starbird, & Palen, 2010). People's use of Twitter to communicate during political unrest (e.g., the disputed 2009 election in Iran) brought increased attention to microblogs as a communication medium (Evgeny, 2009; Grossman, 2009). Twitter has played a role in important events, but the service also allows people to communicate among a relatively small social circle, and a sizeable part of Twitter's success owes a debt to this function.

Although the reasons for Twitter's popularity are complex, artifacts of people's interactions with Twitter are readily observable, thanks, in large part, to the API that allows developers to download data from Twitter programmatically. The availability of Twitter data has led a host of researchers to inquire after high-level questions regarding people's use of microblogs. At the broadest level, initial research addresses these questions:

  • Who are the users of microblog services?

  • What qualities characterize the social networks that these users maintain?

  • How does information flow through microblogs?

Two of the earliest studies of the demographics and network statistics of Twitter are those by Java et al. (2007) and Krishnamurthy et al. (2008). Both of these studies analyze samples (gathered in different ways) of tweets to characterize people's interactions with Twitter. Some findings here are not surprising. For example, Twitter data such as people's number of status updates and their number of friends and followers follow power law distributions. As in many informetric settings, a large proportion of data is created by a small number of highly engaged people. Java et al. note that the power law exponent for Twitter social graphs is approximately −2.4, a number that is very similar to the parameter for graphs on the web (pp. 58–59).

However, less predictable results also emerged from these studies. As of 2008, Twitter use was far-reaching geographically speaking. Although the United States contained the lion's share of Twitter activity, these studies found heavy use internationally as well. Of particular interest is the finding that many people's social networks contain geographically diverse entries, with connections often spanning several continents (Java et al., 2007). Additionally, Twitter networks tend to show strong reciprocity. According to Huberman, Romero, and Wu (2009), “90 percent of a user's friends reciprocate attention by being friends of the user as well.” That is, people whom person x follows are likely also to be followers of person x.
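The notion of reciprocity can be made concrete with a small sketch. The code below assumes a toy follow graph stored as a dictionary and measures the fraction of follow edges that are returned; the user names and the edge-level definition are illustrative assumptions, not the measure used by Huberman et al.

```python
def reciprocity(follows):
    """Fraction of follow edges (a -> b) for which b also follows a.

    `follows` maps each user to the set of users he or she follows.
    This edge-level definition is an assumption made for illustration.
    """
    edges = [(a, b) for a, friends in follows.items() for b in friends]
    if not edges:
        return 0.0
    mutual = sum(1 for a, b in edges if a in follows.get(b, set()))
    return mutual / len(edges)


# Hypothetical three-user network: x and y follow each other; z follows no one.
toy = {"x": {"y", "z"}, "y": {"x"}, "z": set()}
print(round(reciprocity(toy), 3))
```

In this toy network two of the three follow edges are reciprocated.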

Digression: A Microblog Corpus and a Vocabulary for Discussing IR

During this review, I will make reference to statistics calculated from a corpus of microblog data that I have collected and used in my own research. Twitter has made acquiring microblog data trivial by exposing their content via a streaming API. Twitter's Streaming API allows a developer to harvest a putatively random sample of tweets very quickly. However, this stream (“the garden hose” in Twitter jargon) delivers tweets out of context. If tweet a by author A is sampled, then it is unlikely that other tweets by A will find their way into the sample. Nor is it likely that tweets by other people, B, C, D, etc., who are members of author A's social network, will be represented in the sample. It is true that this context could be recovered by using other methods in the Twitter API, but on a large scale, such contextualization becomes an engineering burden, exacerbated by API rate limits and latency in API responses.

To build a more naturalistic test bed of microblog data, I decided to focus on a single (but broadly construed) community. The core of this collection comprises Twitter users who have interests in matters relating to IR and human-computer interaction.

My goal in collecting these data was twofold:

  1. I wanted to capture data that allows analysis of the social aspects of microblogging.
  2. I wanted the corpus to contain a large, heterogeneous topic base.

I aimed to build a collection where interactions between people and the diffusion of ideas within social circles could be observed. However, to allow for realistic analysis, the data set should not hew too narrow a line with respect to topical coverage.

I identified the core set of authors in this corpus by tracking tweets that were written in a 2-week period—1 week before and after the date of the announcement of paper acceptances and rejections for the 2010 ACM SIGIR conference (March 24, 2010). My hope was to use this event as an opportunity to identify people in the IR community who use Twitter. To accomplish this, I used the Twitter Streaming API, tracking all tweets using any of the words in Table 1.

Table 1. Words used to track tweets using Twitter's Streaming API during the initial data collection phase.

The words geneva, workshop, and retrieval intentionally admitted tweets by authors not involved in IR research. The last three search terms (iiix, cfp, and hcir) were included to capture discussion of related conferences. Tracking this activity identified 49 individual users.

To augment this set, I harvested each of the 49 users' friends and followers on July 17, 2010, for a total of 10,111 users. After removing accounts marked as “private,” the corpus contained 9,294 authors in the “community.”

In this review, I report data that were obtained by tracking all tweets written by these 9,294 people beginning at midnight (CDT) on July 18, 2010. Though I continue to store this community's output, for the purposes of this review, I limit analysis to tweets written before 11:59 P.M. (CDT) on August 28, 2010. Summary statistics for this corpus appear in Table 2. To be clear, it bears stressing that the sample is not random.

The final 2 rows of Table 2 report counts for two types of user-generated metadata. Each of these data types is described in detail below. But for the sake of contextualization, a brief definition here is in order. Hashtags are simply character strings that begin with the # character. They are often topical in nature and serve to collocate tweets related to a particular subject. Mentions are strings preceded with the @ sign followed by a Twitter user's screen name. They mark a tweet as being directed to a particular user.
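As an illustration of the two definitions above, the following sketch pulls hashtags and mentions out of a tweet with regular expressions. The patterns are simplifying assumptions (Twitter's own tokenization rules are more involved), and the screen name in the example is hypothetical.

```python
import re

# Assumed patterns: '#' or '@' followed by word characters, and not preceded
# by a word character (so that e-mail addresses are not read as mentions).
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def extract_metadata(tweet):
    """Return (hashtags, mentions) found in a tweet, lowercased."""
    hashtags = [h.lower() for h in HASHTAG_RE.findall(tweet)]
    mentions = [m.lower() for m in MENTION_RE.findall(tweet)]
    return hashtags, mentions


example = "@alice great talk on microblog search #sigir2010 #IR"
tags, users = extract_metadata(example)
print(tags)   # ['sigir2010', 'ir']
print(users)  # ['alice']
```

Counting the tweets for which either list is non-empty would yield figures of the kind reported in the final rows of Table 2.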

Table 2. Summary statistics for corpus used to motivate examples in this review.

Statistic                                  Value
Number of tweets                           2,314,460
Number of users                            9,297
Median tweet length (characters)           88
Median number of tweets per user           90
Median number of friends per user          447
Median number of followers per user        888
Tweets containing at least one hashtag     363,153
Tweets containing at least one mention     1,304,198

Given a corpus of microblog data, the IR literature gives us many approaches to conducting search and retrieval based on textual queries. To situate our discussion, I will rely primarily on the language modeling framework when discussing IR operations (Ponte & Croft, 1998; Zhai & Lafferty, 2004). I will focus on language modeling because of its simplicity, its pervasiveness in the IR literature, and its demonstrated effectiveness. Although a full treatment of language modeling IR is beyond the scope of this review, I will present a brief overview in the following paragraphs.

Language modeling IR relies on generative models to rank documents against queries. Under this approach, we assume that the terms in a document di were generated by a particular probability distribution Mi. Given a keyword query q we examine the quantity:

\[ \Pr(d_i \mid q) \propto \Pr(q \mid d_i)\,\Pr(d_i) \tag{1} \]

where the proportionality is due to Bayes' rule and Pr(q|di) is taken to be Pr(q| Mi). The prior probability for a document Pr(di) is often taken to be uniform and is thus omitted. In this case, we simply rank documents by the likelihood that their underlying language models generated the words in q.

The likelihood of q given Mi is typically calculated under the assumption that terms are statistically independent. This assumption gives:

\[ \Pr(q \mid M_i) = \prod_{w \in q} \Pr(w \mid M_i) \tag{2} \]

To avoid numeric underflow, we typically use log probabilities, replacing the product in Equation 2 with a summation. In either case, in the language modeling approach, documents whose corresponding language models have a high probability of generating q are ranked higher than those with low probability.

The choice of distributional family and the fitting methods for inducing Pr(w|Mi) are treated in the language modeling literature. Typically, Mi is a multinomial distribution over the p terms in the indexing vocabulary, characterized by Θi, a p-dimensional probability vector whose jth element gives the probability of observing the jth term on an observation from Mi. We assume that Θi is estimated using maximum likelihood with Bayesian smoothing using Dirichlet priors and hyperparameter μ=1500. Interested readers may find more detail on estimation, smoothing, and document ranking using language models in Zhai and Lafferty (2004).
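To make the ranking procedure concrete, the following minimal sketch scores documents by log query likelihood under a Dirichlet-smoothed multinomial, where Pr(w|Mi) = (c(w, di) + μ Pr(w|C)) / (|di| + μ), c(w, di) is the count of w in di, and Pr(w|C) is the term's relative frequency in the whole collection. Whitespace tokenization, the function names, and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

MU = 1500  # Dirichlet hyperparameter stated in the text


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real indexing.
    return text.lower().split()


def collection_model(docs):
    """Relative frequency of each term over the whole collection, Pr(w|C)."""
    counts = Counter()
    for d in docs:
        counts.update(tokenize(d))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_query_likelihood(query, doc, coll, mu=MU):
    """Sum of log Pr(w|M_i) over query terms (the log of the product form)."""
    tf = Counter(tokenize(doc))
    dlen = sum(tf.values())
    score = 0.0
    for w in tokenize(query):
        p = (tf[w] + mu * coll.get(w, 0.0)) / (dlen + mu)
        if p == 0.0:  # term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs):
    """Return (score, document) pairs in decreasing order of query likelihood."""
    coll = collection_model(docs)
    scored = [(log_query_likelihood(query, d, coll), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy collection of tweet-like documents.
docs = [
    "wildfire evacuation zones updated",
    "new laptop review",
    "wildfire smoke and weather conditions",
]
for score, doc in rank("wildfire evacuation", docs):
    print(round(score, 4), doc)
```

With documents as short as tweets, μ=1500 lets the collection model dominate each estimate, so score differences between documents are small even when term overlap differs.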

Two Types of Search in Microblog Systems

The term microblog IR easily leads to confusion. This is because people search for microblog information in at least two ways:

  • Broadcasting questions to their followers in hopes that people in their social network will answer them.

  • Conducting searches over preexisting microblog data in hopes of discovering relevant information that has already been written.

This distinction has been made before (Sullivan, 2009a, 2009b). Sullivan discusses “how people use Twitter itself … to ask for help directly, especially when in the past, they might have first turned to a search engine.” This is in contrast to “Twitter search … where people can explicitly do a search against past tweets to find information.” I will refer to the first of these modes of microblog search as asking and will call the second approach retrieving.

Of course, asking and retrieving aren't mutually exclusive. It is conceivable, for instance, that a system could aid a user in asking by helping him or her to retrieve a group of people who are likely to lend helpful answers. Alternatively, a person might ask his followers to recommend hashtags that collocate tweets related to his information need. Nonetheless, the difference between asking and retrieving in microblogs is worth our attention.

Asking for Information

The activity that I have called asking has been shown to be common in people's use of microblogs as well as other social media. Research in this vein has shown that many tweets involve asking or answering questions. Morris, Teevan, and Panovich (2010) report results from a survey in which half of respondents had used Facebook or Twitter to ask questions. Efron and Winget (2010) analyzed two corpora of tweets and found that 13% of tweets in one corpus and 16% of tweets in the other contained questions. It bears noting, though, that DiMicco et al. (2008), using a stricter definition of questions, found only three questions in a sample of 200 tweets.

The information interaction entailed by asking in microblogs is similar to the operations of online Q&A sites such as Yahoo Answers as well as older systems such as e-mail forums. In these domains, the searcher posts a question, typically articulated in some detail, to an online space where other users can peruse questions and offer answers. Typically, online Q&A sites connect askers and answerers who do not know each other and who may not be identifiable.

Research into online Q&A sites has shown that this approach to addressing information needs, though popular, suffers several drawbacks. Zhang, Ackerman, Adamic, and Nam (2007) report lengthy delays between when questions were asked and when they were answered. In the sample analyzed by Hsieh and Counts, 20% of questions were never answered. Of course, the quality of answers in online Q&A sites varies, but Shah and Pomerantz (2010) found it difficult to predict which answers users would actually find helpful. Although Q&A sites are similar to microblog asking, the shortcomings outlined here suggest room for improvement.

Why should microblog asking offer different results than online Q&A services? Asking questions via microblogs seems like nothing more than an informal online Q&A activity. However, several points distinguish microblog asking from most established Q&A sites:

  • Questions are posed only to those people who have chosen to follow the asker.

  • The asker is identified not only by his or her screen name, but also by the history of tweets that he or she has posted, lending context to a given question.

  • Because of the compressed format of microblogs, questions and answers must be expressed succinctly.

The first point above locates microblog asking in the domain of so-called social search (Evans & Chi, 2008; Horowitz & Kamvar, 2010). In the sense suggested by Evans and Chi, social search entails pursuing an information need by consulting (via whatever medium) with people whom the searcher knows. Evans and Chi found that survey respondents often engaged socially before, during, and after the search process itself. They argue that social interaction is a key factor in the search experience.

Discussing a problem with a colleague down the hall and e-mailing a question to friends are both social search actions. However, social search may also be computationally mediated (Horowitz & Kamvar, 2010), as in the case of the service called Aardvark. Here, a searcher poses a question to the vark system. The system predicts which other users of vark are likely to have expertise on the question's topic, routing the question to people who are likely to provide useful answers.

Asking questions via microblogs shares characteristics with the notion of social search. In the case of microblogs, a searcher's question reaches an audience limited to those people who have chosen to read his or her tweets. If I broadcast a question via Twitter, then I am asking a group of friends (loosely construed) for responses. Thus, the mode of microblog search that I call asking entails an avenue to pursue social search that complements venues such as face-to-face discussion, e-mailed questions, etc.

Retrieving Information

What I have called retrieving is similar to traditional, ad hoc IR. Interactions of this type are likely to involve a “query” that is posed against an index of microblog data. Based on some matching criteria, the system finds putatively relevant “documents” in the index. These results, organized by algorithmic and design principles, are then presented to the searcher.

I have put the terms query and documents in quotes above because, as our discussion will show later, what constitutes a query or a document in microblog retrieval systems is quite fluid. Additionally, the matching criteria and result presentation techniques may vary from system to system (perhaps one system returns a ranked list of tweets, while another presents a word cloud). However, the process outlined here captures a typical search process in the context of retrieving microblog data.

Key Problems in Microblog Search

Hearst, Hurst, and Dumais (2008) asked, “What should blog search look like?” In 2011, we may ask the same question of microblog search. The fundamental challenges in microblog search have yet to be detailed, as noted by Golovchinsky and Efron, who write, “Little attention … has been paid to how people search Twitter, and to how they explore returned search result sets” (2010, p. 2). What information needs do people bring to microblogs and what forms might tractable queries take? What are useful units of retrieval in the microblog setting? What constitutes relevance in a microblog search session? Factors such as influence, authority, and timeliness probably weigh on what makes some microblog posts more useful than others. But how to account for these qualities is not obvious.

But in setting an agenda for microblog search, the IR community need not start from scratch. The IR literature contains findings, techniques, and vocabularies that serve as good starting points for research on microblogs. The following subsections outline how established problems in IR have made or could make their way into retrieval settings over microblogs.

Sentiment Analysis and Opinion Mining

A key challenge in contemporary text mining is so-called sentiment analysis (Pang & Lee, 2008). Sentiment analysis has informed IR research especially in the context of retrieval of blog data, as formalized in the TREC blog track (Macdonald, Ounis, & Soboroff, 2007; Ounis, Macdonald, de Rijke, Mishne, & Soboroff, 2006; Ounis, Macdonald, & Soboroff, 2008). Not surprisingly, findings from the blog retrieval community have ready interpretation in the context of microblogs.

Given a document d, sentiment analysis typically pursues one of two goals. First, an algorithm may classify d as either containing opinionated content or not (i.e., taking an objective tone). Second, assuming that document d does contain opinionated language with respect to topic T, an algorithm may ask whether d evinces a positive or negative opinion of T overall. This type of analysis has found its way into IR to support queries such as “find documents that contain favorable or unfavorable discussion of Citizen Kane.”

The majority of sentiment analysis operates by identifying opinion-expressing terms and estimating the importance of those terms in the texts to be analyzed. Common sources of opinion-expressing terms are the OpinionFinder lexicon, as described in Wilson, Wiebe, and Hoffmann (2005), and the General Inquirer lexicon, as described in Hatzivassiloglou and McKeown (1997). These resources are ontologies that enumerate words whose presence in a text indicates the expression of emotion or opinion. Precisely how the knowledge encoded in these lexicons is used by a sentiment analysis system varies. But a common approach to identifying the “semantic orientation” (i.e., positive or negative sentiment) of a text lies in a supervised machine learning setting, where we train a model on labeled data, making predictions on the basis of the target text's evidentiary features.
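A very simple way to see how an opinion lexicon can be used is a term-counting baseline, sketched below. This is not the supervised approach described above, and the two tiny word lists are hypothetical stand-ins rather than entries taken from OpinionFinder or the General Inquirer.

```python
# Hypothetical miniature lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}


def polarity(text):
    """Label a text as objective, positive, negative, or mixed by counting
    lexicon terms. A baseline sketch, not a trained classifier."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == 0 and neg == 0:
        return "objective"  # no opinion-bearing terms were found
    if pos == neg:
        return "mixed"
    return "positive" if pos > neg else "negative"


print(polarity("I love this laptop, great screen!"))
print(polarity("The battery is terrible"))
print(polarity("Shipping on Monday"))
```

In a supervised setting, counts such as `pos` and `neg` would more plausibly serve as features for a model trained on labeled data than as a decision rule in themselves.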

Like blog data, microblog posts often express opinions or emotion (Diakopoulos & Shamma, 2010; Jansen et al., 2009). Analyzing a corpus of Twitter data, Diakopoulos and Shamma found that “the tenor of the tweets was distinctly negative” (p. 1197). In the context of e-commerce, Jansen et al. write that “microblogs offer immediate sentiment and provide insight in affective reactions toward products … .” (p. 2170). Because tweets are often informal and expressive of opinions, the problems of sentiment detection and polarity identification have a clear role to play in microblog retrieval.

For instance, microblog data has shown promise as a gauge of political sentiment. Tumasjan, Sprenger, Sandner, and Welpe (2010) analyzed a sample of Twitter posts related to a German federal election. Constructing “sentiment profiles” for each candidate, Tumasjan et al. argue that “sentiment profiles of politicians and parties … plausibly reflect many nuances of the election campaign” (p. 183). Similar findings appear in the context of more abstract sociopolitical issues. In work such as Bollen, Pepe, and Mao (2009) and O'Connor, Balasubramanyan, Routledge, and Smith (2010), researchers have found that phenomena such as consumer confidence at a future point in time can be predicted by analyzing Twitter text.

Although microblogs' utility in gauging or predicting public opinion is compelling, other research has shown that Twitter data provide leverage in more general sentiment analysis tasks (Pak & Paroubek, 2010). Like other social media, microblogs afford data that allow researchers to answer questions such as “What is the public's current attitude towards universal healthcare?” In an IR setting, the sort of opinion detection that microblogs enable supports queries such as “Which are the best Thai restaurants in San Francisco's Marina district? Is the movie Inception worth seeing?”

Entity Search

One issue that makes microblog retrieval a compelling research area is its youth. Even the most basic problems in microblog IR remain to be identified. Among these basic problems is defining useful units of retrieval. Given a corpus of tweets and a user with a particular information need, precisely what should an IR system present to the searcher?

Early IR systems helped people find literature by searching textual surrogates of documents or books. Ad hoc IR as exemplified by early iterations of the Text REtrieval Conference as well as a good deal of web search are predicated on the idea that a retrieval system runs a user's keyword query against a collection of documents whose full text has been indexed, ultimately delivering a ranked list of documents.

Given a microblog corpus and a keyword query, returning individual tweets might be helpful. But it seems likely that showing a user a list of tweets taken out of context is not the best way to solve realistic information needs. If we suspect that a ranked list of tweets is not the best (or at least not the only) way to retrieve and present microblog information, then what approach would be better?

We can find one answer to the question “What is a useful unit of retrieval in microblog IR?” in the established field of entity search. Entity search is simply a type of IR in which we begin with a corpus of documents but return information about objects or actors that exert influence in those documents. Perhaps the best-known type of entity search is expert finding, in which the unit of retrieval is a person who has expertise on a topic identified by a user's query. For example, Balog and de Rijke (2006) discuss the problem of finding topical experts in corpora of e-mail, a challenge that saw a great deal of attention in the context of the TREC enterprise track (Balog et al., 2008) and which has continued to attract attention (Balog, Azzopardi, & de Rijke, 2009; Balog & de Rijke, 2006).

Entity search plays a natural role in microblog IR, in which the appropriate unit of retrieval is likely to depend on the type of information need that the searcher brings to the search session. Current microblogging services such as Twitter are based on a publish/follow model, in which a person reads posts that are written by other people whom the reader “follows.” This dynamic leads to the obvious problem of finding suitable people to follow. A keen question for users of microblogs is, whose Twitter stream should I read and why? Or, I am interested in topic X. Who posts regularly and with authority on that topic?

A key problem in expert finding is the matter of author representation. Given a corpus C containing N documents, NA of which were written by author A, an expert finding system must induce a model for A's topical output. The task of finding people to follow in a microblogging environment presents a nearly identical problem.

Several methods of expert finding have been proposed in the IR literature. In what has been termed the virtual document approach, an author is represented simply by the totality of documents that he or she has written (Balog et al., 2006; Craswell, Hawking, Vercoustre, & Wilkins, 2004). That is, for each author, we construct a “virtual document” that comprises the concatenated text of his or her NA documents. Based on this virtual document, we may conduct retrieval as usual. Given a query q and a corpus of virtual documents built from a collection of posts, we may rank the virtual documents in decreasing order of query likelihood using Equation 2.
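The virtual document approach can be sketched in a few lines. The code below concatenates each author's posts and ranks authors by Dirichlet-smoothed log query likelihood; the tokenization, function names, author names, and posts are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_virtual_documents(posts):
    """Concatenate each author's posts into one token list (a virtual document).

    `posts` is an iterable of (author, text) pairs.
    """
    virtual = defaultdict(list)
    for author, text in posts:
        virtual[author].extend(text.lower().split())
    return dict(virtual)


def rank_authors(query, posts, mu=1500):
    """Rank authors by log query likelihood of their virtual documents."""
    virtual = build_virtual_documents(posts)
    coll = Counter()
    for tokens in virtual.values():
        coll.update(tokens)
    total = sum(coll.values())
    scores = {}
    for author, tokens in virtual.items():
        tf = Counter(tokens)
        score = 0.0
        for w in query.lower().split():
            # Dirichlet-smoothed estimate of Pr(w | author's language model)
            p = (tf[w] + mu * coll[w] / total) / (len(tokens) + mu)
            score += math.log(p) if p > 0 else float("-inf")
        scores[author] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical posts by two hypothetical authors.
posts = [
    ("ann", "language models for retrieval"),
    ("ann", "retrieval evaluation at trec"),
    ("bob", "coffee in geneva"),
    ("bob", "workshop on retrieval"),
]
print(rank_authors("retrieval", posts))  # ['ann', 'bob']
```

Because every post by an author is pooled, prolific authors produce long virtual documents, which is one reason the smoothing parameter matters in this setting.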

An alternative approach is offered in Macdonald (2009), in which a variety of voting models are applied to the expert search problem. In these approaches, we perform a search against a corpus of documents (not virtual documents). We then process the result set by analyzing the authors of each document. As we proceed, each author is given a “vote” when we encounter a document written by him or her. Precisely how we tally the vote varies from method to method—e.g., simple majority, Borda counting, Condorcet voting.
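Two of the simplest tallies can be sketched as follows. These are simplified illustrations of vote counting and Borda-style scoring over a ranked result list, not the exact formulations in Macdonald (2009); the document identifiers and author names are hypothetical.

```python
from collections import Counter, defaultdict


def votes(ranked_docs):
    """One vote per retrieved document, credited to its author.

    `ranked_docs` is a list of (doc_id, author) pairs in rank order.
    """
    return Counter(author for _, author in ranked_docs)


def borda(ranked_docs):
    """Borda-style tally: a document at 0-based rank r among n results
    contributes n - r points to its author."""
    n = len(ranked_docs)
    scores = defaultdict(int)
    for rank, (_, author) in enumerate(ranked_docs):
        scores[author] += n - rank
    return dict(scores)


# Hypothetical result list for some query, best match first.
results = [("t1", "ann"), ("t2", "ann"), ("t3", "bob"), ("t4", "bob"), ("t5", "bob")]
print(votes(results).most_common())  # [('bob', 3), ('ann', 2)]
print(borda(results))                # {'ann': 9, 'bob': 6}
```

The two tallies disagree here: bob has more retrieved documents, but ann's documents rank higher, which illustrates why the choice of voting method affects which authors are suggested.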

Although finding people is certainly an important problem in microblog IR, entities lend themselves to consideration for search in other microblog settings as well. Lists of authors, groups of tweets that comprise a multitweet conversation, and question and answer pairs are all candidates for treatment of this kind.

User-Generated Metadata

A sea change in people's interaction with information came in the early 2000s. This decade saw the development and maturation of social media, online environments that allow people to interact in novel computationally mediated ways. With respect to information science, a key development brought by these media was the proliferation of user-generated content. In contrast to early web resources, newer services allow, and often rely upon, their users to create, edit, and mediate access to information. The popularity of user-generated content, particularly in the context of social media, has spawned a host of novel IR problems (King, Li, Xue, & Tang, 2009).

An important aspect of microblog ecology is authors' use of informal metadata. Services such as Twitter make few demands on content. But in the course of their interactions, user communities have invented and adopted a variety of metadata conventions aimed at extending their texts' expressiveness.

For instance, the convention of using so-called hashtags has become pervasive in the Twitter community.¹¹ Hashtags are simply character strings preceded by a # sign. People's reasons for applying hashtags vary from enhancing topical access to lending tweets tongue-in-cheek humor. For instance, many people tweeting from the 2010 SIGIR conference added the hashtag #sigir2010 to their posts in order to help others collocate this information. On the other hand, a SIGIR attendee added the tag #genevaishotandhasnoairconditioning to one tweet.

Less widely used, but similar to hashtags, are metadata that Chris Messina proposed (2009) and that blogger Chris Blow (2009) terms “slash tags.” A slash tag simply begins with the / character. Messina originally proposed a modest three slash tags: /via, /by, and /cc (a re-posting of someone's earlier tweet, a citation to a possibly external resource, and carbon copy, respectively). Only 10,004 tweets in the corpus contain at least one of these slash tags.

Precisely how to marshal these user-generated metadata is an open question. But the matter has seen some treatment in the literature.

Efron (2010) describes the problem of “hashtag retrieval,” a type of entity search. The goal of hashtag retrieval is to find, for a topical query q, a ranked list of hashtags, such that highly ranked tags have a strong tendency to mark tweets on q's topic. This task has at least three potential uses:

  • Ad hoc tag retrieval: Help a searcher find tags that are often applied to tweets on a topic that he or she wishes to stay abreast of. This allows the user to “follow” particular tags.

  • Query expansion: In the paper cited above, Efron found that during pseudo-relevance feedback, hashtags provide especially strong data for query expansion. By limiting expansion terms to hashtags, a significant improvement in three effectiveness metrics was observed with respect to both a baseline (nonexpanded) model and a model using standard terms for query expansion.

  • Result display: Hashtags could be used to arrange results of searches for tweets (or other entities), providing a de facto clustering mechanism for organizing returned documents.

Hashtag retrieval is a concrete example of the entity search problem described in the previous subsection. Given a query, we wish to find a ranked list of hashtags, where our knowledge about each hashtag is induced from an analysis of its use in observed tweets. A natural way to approach this problem is via language modeling, using Equation 2. Here, we create a virtual document for a hashtag h, in which the virtual document comprises the concatenated text of all tweets containing h. We use this virtual document to estimate the parameters of a language model. Retrieval then proceeds by calculating the (log) likelihood of the query, given the estimated model for each hashtag in our corpus.

Table 3 shows the top five results of three ad hoc searches over hashtags in the data described in Table 1. Without investigation into the typical usage of each tag, it is difficult to assess the quality of these retrievals. They do, however, appear to be plausibly useful to a person who is interested in the topics that generated each query. For instance, a person who was eager to keep abreast of news on the 2010 Gulf oil spill might indeed wish to read tweets that were tagged with #blacktide by their authors.

Table 3. Results for three ad hoc hashtag searches.

hadoop nosql            Semantic web            Gulf oil spill
−10.20 #hadoop          −10.43 #semantic        −10.86 #oilspill
−10.53 #nosql           −10.94 #linkeddata      −12.30 #blacktide
−11.87 #hbase           −11.38 #semweb          −12.81 #eco
−12.36 #bigdata         −12.17 #rdfa            −13.04 #environment
−12.38 #mapreduce       −12.88 #a11y            −13.47 #ocean

Note. Each column lists the top five hashtags retrieved for the query shown in its heading, along with the query log-likelihood calculated from each tag's estimated language model.

Other user-embedded metadata involves explicit representations of social linkages. An author may direct a tweet to a particular user by including @<user> in the tweet, in which <user> is the screen name of the person to whom the tweet is directed. Huberman et al. (2009) find that about 25% of tweets contain an @ direction. Most third-party microblog clients display any @ mentions directed to their owner prominently, even if the owner does not follow the tweet's author.

Direction via the @ sign is used in a variety of ways on Twitter. It is common to see tweets with a structure such as:

@milesEfron are you nearly done with your AIS review?

Likewise, we often see tweets with this type of structure:

Congratulations to @userX on his new article in JASIS&T.

In the first example, the tweet author is asking a question of the particular user milesEfron. It is worth noting that such a message will be visible to all of the author's followers. Presumably, the tweet author believes that this message is at least of passing interest to his or her other followers.¹²

The second example sheds some light on the motivation for broadcasting @-directed messages. Tweets of this kind alert other users that a social linkage exists between the author and the user to whom the message is sent. Thus, such a message demonstrates a social bond between two users, a function alluded to in boyd (2009). Additionally, @ mentions allow followers of a tweet's author to learn about potentially interesting users (i.e., the user to whom the message is directed). The @ directive may also be applied in a more rhetorical fashion, along the lines of:

Why is the @twitter search failing to find tweets from last month?

In this case, the author directs the message to Twitter's account, although there is no implication that Twitter follows this author. In some sense, this mention is acting as a hashtag. In all likelihood, managers of the twitter user account track all such mentions, “following” them via a persistent search. More rhetorically, the author of this tweet is signaling that Twitter is especially important to the topical nature of the text. Here, in other words, it is difficult to articulate the difference between discussing @twitter search and #twitter search.

These ambiguities notwithstanding, microblog posts are often replete with metadata despite their brevity. Marshaling these metadata to improve retrieval is a promising avenue in microblog IR research.
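A first step in marshaling these metadata is simply to extract them from tweet text. The sketch below uses deliberately simplified regular expressions for hashtags, @ mentions, and Messina's three slash tags. The patterns are assumptions made for illustration and do not reproduce the tokenization rules of Twitter or of any particular client.

```python
import re

# Simplified patterns (assumptions): a tag or screen name is a run of word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
# Slash tags must start a token, so "and/or" style text is not matched.
SLASH_TAG = re.compile(r"(?<!\S)/(via|by|cc)\b")

def extract_metadata(tweet):
    """Return the user-generated metadata found in one tweet."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG.findall(tweet)],
        "mentions": MENTION.findall(tweet),
        "slash_tags": SLASH_TAG.findall(tweet),
    }

# Invented example tweet.
meta = extract_metadata("Great talk by @userX at #sigir2010 /via @milesEfron")
```

Note that extraction alone does not resolve the ambiguity discussed above: a mention such as @twitter is captured as a mention even when it functions rhetorically as a topical tag.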

Authority and Influence

It is well-known that modern web retrieval relies on hyperlink structure to induce models of authority that help deliver retrieval sets that are not only on topic but also of high quality. Algorithms such as PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1999) build models of authority by finding the steady state of differently defined Markov processes. The intuition behind these approaches is that a given resource is likely to be of high quality if it has inbound links from many other high-quality resources.

A similar intuition has found its way into microblog search. The matter has been most widely studied in the estimation of the influence of particular people in microblog environments. In a 2009 blog post (Tunkelang, 2009), Daniel Tunkelang proposed “a Twitter analog to PageRank.” His metric, dubbed TunkRank, is indeed similar to PageRank, lending each person influence by summing the influence of those people who follow him or her:

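$$\mathrm{Influence}(X) = \sum_{Y \in \mathrm{Followers}(X)} \frac{1 + p \cdot \mathrm{Influence}(Y)}{|\mathrm{Following}(Y)|} \tag{3}$$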

where Followers(.) is the set of people following a given user and Following(.) is the set of people a given user follows and p is a real-valued number corresponding to the probability that a given tweet is re-tweeted. A person's TunkRank score is similar to a web resource's PageRank, in which the score reflects the probability that a tweet by that person will be read. Though TunkRank isn't a proper probability, it could easily be normalized to behave as one, allowing us to incorporate its model of influence into document ranking as the prior probability in Equation 1.
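Because each person's influence depends on the influence of his or her followers, the score is naturally computed by iteration. The following minimal sketch assumes the commonly cited form of TunkRank, in which each follower Y of X contributes (1 + p · Influence(Y)) / |Following(Y)| to X's score. The value of p, the iteration count, and the toy graph are hypothetical.

```python
def tunkrank(following, p=0.05, iterations=50):
    """Iteratively estimate TunkRank-style influence.

    following maps each user to the set of users he or she follows.
    """
    users = set(following)
    for followed in following.values():
        users |= set(followed)

    # Invert the graph: followers[x] is the set of users who follow x.
    followers = {u: set() for u in users}
    for y, followed in following.items():
        for x in followed:
            followers[x].add(y)

    influence = {u: 0.0 for u in users}
    for _ in range(iterations):
        influence = {
            x: sum(
                (1.0 + p * influence[y]) / len(following[y])
                for y in followers[x]
            )
            for x in users
        }
    return influence

# Toy graph: a and b follow c; c follows a; nobody follows b.
following = {"a": {"c"}, "b": {"c"}, "c": {"a"}}
influence = tunkrank(following)
```

Dividing by |Following(Y)| means that a follower who follows very many accounts contributes little, which is consistent with the metric's reported robustness to indiscriminate follow relationships.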

TunkRank is only one among many so-called “peopleRank” metrics (Kamvar, Schlosser, & Garcia-Molina, 2003; Lee, Kwak, Park, & Moon, 2010; Weng, Lim, Jiang, & He, 2010), but in recent research it showed robustness for the problem of identifying accounts that post spam on Twitter (Gayo-Avello, 2010). Gayo-Avello found that among five prestige algorithms based on users' social graph, TunkRank was the least sensitive to the presence of “nepotistic links,” follower/following relationships that give a false impression of a user's network centrality.

Figure 1 visualizes the relationship between users' TunkRank scores¹³ and their immediately observable social statistics: number of friends and number of followers on Twitter. The panels in Figure 1 show that TunkRank correlates positively with both friend count and follower count,¹⁴ but that popularity does not necessarily translate to strong influence as measured by TunkRank. In particular, the bottom of each panel shows that some people with a great breadth of friends or followers are not influential according to TunkRank, a fact that bears on the results reported by Gayo-Avello.

Figure 1.

The relationship between users' number of followers (friends) and TunkRank.

Note. The x-axis of the left panel is the log of the number of people who follow each user. The y-axis gives the log TunkRank. The right panel gives data for each user's number of “friends” (i.e., people whom they follow).

Temporal Issues

No branch of IR can afford to neglect the matter of time in information interactions. But time has a prominence in microblog search that is especially keen. A great deal of recent research and development treats so-called real-time search. The term real-time search refers to the problem of keeping indexes and the search engines they support current to within a very low tolerance for latency. However, time plays a role in microblog IR that is not limited to real-time search. Many tweets treat topics that have a limited time horizon—conferences come and go, news stories fade in importance, and newly released movies are eclipsed by still newer movies. Additionally, people's social networks change over time, providing information that could prove useful in retrieval.

Twitter has brought new research areas into the broad field of real-time search. The flood of data coming from Twitter and other social networking services has brought new challenges to commercial search engines (not least among them, Twitter's own search service). According to David Geer (2010), search engine developers hope to bring microblog data into their indexes quickly both to enable direct access to those data and to improve results for other time-critical arenas such as news search. But it should be stressed that while microblogs pose novel challenges for search engines, real-time search follows on an established body of literature on so-called streaming algorithms (Alon, Matias, & Szegedy, 1996; Charikar, Chekuri, Feder, & Motwani, 1997; Charikar, O'Callaghan, & Panigrahy, 2003; Henzinger, Raghavan, & Rajagopalan, 1999).

Immediacy is one important facet of time's role in microblog search. However, temporal concerns enter into microblog IR in other senses too. For instance, Lee et al. (2010) bring a temporal approach to bear on the problem of user influence modeling as described above. They note that although social networks play an important role in user influence, users who report important stories early in the stories' life cycle are likely to be especially influential. Alonso, Gertz, and Baeza-Yates (2009) also introduce a nuanced temporal dimension into IR. They use documents' lexical content to construct “temporal profiles” that, in turn, allow them to cluster search results (in Twitter and other domains) by time.

Figure 2 suggests another way in which time bears on microblog search. The figure shows the results of four queries posed against the corpus described in Table 2. The number of hits (i.e., tweets) is displayed in bar plots, in which each bar is a day (the left of the x-axis is July 18 and the right is August 28). Each bar's height gives the number of hits in our corpus that were written on the corresponding day.

Figure 2.

Frequency of hits on four searches. Taken from data described in Table 1. The x-axis of each panel is the date (from July 18–August 28, 2010). Each bar is the number of tweets returned for the query shown in the panel title on each day.

The panels in Figure 2 show several distinct distributions with respect to time. The queries sigir and jetblue flight attendant are strongly unimodal. The term sigir was mentioned commonly during the 2010 ACM SIGIR conference. The distribution for jetblue flight attendant shows a similar rise and drop of activity. But instead of a formal event, the peak for this query owes its strength to the outburst of a flight attendant on a JetBlue Airways flight, an event that drew strong and amused attention on Twitter. A query such as iphone has a less visibly temporal character, with a number of daily hits that does not change systematically over time (although the distribution does appear to have several weakly evident modes). Finally, the story of the miners stranded underground in Chile emerged immediately after the cave-in that trapped them. Though the graph ends mid-story, we can see that the hits for this query rise suddenly, indicating the emergence of a trending topic.

Figure 2 suggests that time plays a strong role not only in finding trending topics, but also in terms of more retrospective queries. We might, for instance, wish to arrange tweets returned for a given query differently for a news story than for a query such as iphone, or even for a query about an academic conference.
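The daily hit counts plotted in Figure 2 are straightforward to compute. The sketch below assumes that each tweet is available as a (timestamp, text) pair and that a tweet counts as a hit when it contains every query term; both the data layout and the conjunctive matching rule are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, datetime

def daily_hits(tweets, query):
    """Count, per calendar day, the tweets that contain every query term."""
    terms = query.lower().split()
    counts = Counter()
    for timestamp, text in tweets:
        tokens = set(text.lower().split())
        if all(term in tokens for term in terms):
            counts[timestamp.date()] += 1
    return counts

# Invented tweets, for illustration only.
tweets = [
    (datetime(2010, 7, 19, 9, 30), "heading to sigir in geneva"),
    (datetime(2010, 7, 19, 14, 5), "sigir keynote was great"),
    (datetime(2010, 7, 20, 8, 0), "new iphone case arrived"),
    (datetime(2010, 8, 9, 18, 45), "that jetblue flight attendant story"),
]
hits = daily_hits(tweets, "sigir")
```

A per-day profile of this kind could then be inspected for shape (a single peak, a sudden rise, or no systematic change) before deciding how to order or group the results for a query.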

Concluding Remarks: Continuing Issues in Microblog Search

In this review, I have chosen to focus on a few topics in some detail. Of course this approach comes at the expense of a treatment of many other important considerations. By way of conclusion, this section discusses several outstanding problems.

Geographical Information in Microblog Search

Locale bears on microblog search in several ways. Most obviously, relevance for many plausible queries over microblog data will be contingent on geographical concerns (this is no different from many other types of IR). A query such as movie times issued in San Francisco is ostensibly different from the same lexical query issued from Champaign, Illinois. Queries such as downtown traffic, allergen levels, and where are all of these ambulances going all entail implicit geographical references. Earlier in this review I discussed user-generated metadata in microblogs. Geographic metadata also find their way into microblog posts in the form of the latitude and longitude coordinates from which a given tweet was sent—the so-called geocode data for the tweet. Though many users opt not to include geocode data in their tweets, the ability to insert this information is available in many microblog client applications.

Microblogs, Information Needs, and Queries

In my discussion of how people use Twitter to ask questions, I reviewed literature concerned with the nature of the information needs that people bring to microblogs. But these findings do not readily translate to the other sense in which we considered IR—retrieving existent microblog data. One issue that complicates our anticipation of how people would like to search microblogs lies in the fluidity of the unit of retrieval in these settings. It is true that we could retrieve and rank tweets in decreasing likelihood (predicted by lexical evidence) that they will satisfy a person's information need. But the entity search scenarios that we have outlined invite a different experience and thus a different type of information need. For example, it is one thing to build a system that lets people find tweets about a particular piece of consumer electronics. It is another matter to help people find people, or communities of people, who write often and clearly on the matter of consumer electronics.

Because microblog search is in its infancy, it is important to be mindful of realistic, compelling use cases now. As we build pilot systems and undertake research agendas, we implicitly take as axiomatic what it is that people will search for and how they will engage in this searching. Queries by example over a variety of entities using many criteria (e.g., time, place, topic, authority) for organizing results seem like plausible interactions, at least as plausible as simple keyword searches for indexed tweets. There is a great deal of creativity to be applied to design and research in the space of supporting user needs in microblog search.

Microblog Search Evaluation

For many years, variations on the Cranfield model of evaluation have allowed IR researchers to assess the merits of proposed innovations. The Cranfield model, with its reliance on access to canonical corpora, queries, and corresponding relevance judgments has always engendered debate with respect to its relation to actual search effectiveness. Nonetheless, Cranfield-style experiments have dominated modern IR research in an era that has seen innovations in the field that are difficult to discount.

Assessing the effectiveness of searches in microblog environments does not preclude Cranfield-type analysis. Indeed, we could craft a set of queries to run against a corpus of documents (or entities) and acquire query-document relevance judgments on which to base calculations of such familiar statistics as precision and recall. But a number of issues must guide such an approach:

  • Relevance. The matter of how relevant a document is to a query has always been crucial to successful Cranfield-style evaluation and it has always been controversial; microblogs don't change this. What is different in the context of microblogs is the array of new problems and criteria that enter into the matter of retrieval. Precisely what task a user is trying to complete surely bears on what constitutes relevance. To this consideration, we must add factors such as temporality, locality, authority, which I have discussed already in this review. All of these factors are conflated in a naturalistic idea of successful retrieval. This does not qualitatively set microblogs apart from other data. But it is the case that microblogs' idiosyncrasies merit careful deliberation in designing experimental settings.

  • Corpora. Thanks to microblog services' APIs, acquiring microblog data is easy. However, it is not clear how we should gather these data when creating corpora for IR experimentation (using the Cranfield model or any other approach). The putatively random sample that comes from the Twitter API's “garden hose” feed gives us a cross-section of Twitter data. Setting aside the question of how random this sample actually is, we might also ask if a random sample is desirable. The answer to this question surely hinges on what we are studying. Analyses that take into account the experience of particular users, for instance, should presumably harvest data that comprise those users' “social horizons” (i.e., the people writing nearby that person, with respect to social network proximity, topic, geographic location, etc.).

  • Recency. If we built a microblog test collection, would it be useful in a year? Would it be useful in 10 years? Of course, any test collection ages: The queries of the early TREC collections have little topical resonance today. Yet these early news wire collections can still be valuable. It is less clear that the work of building a microblog test collection would yield the enduring value that we have seen with many TREC collections. Not only would topics of interest come and go, but people's use of microblogs would also be likely to change over time. Today, we use hashtags and mentions in our tweets. How will we tweet a year from now? Different conventions and different rhetorical and topical preoccupations stand to make microblogs (and queries we would pose against them) very different in a short time.
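For reference, the set-based precision and recall statistics mentioned above can be computed from relevance judgments as in the following minimal sketch. The tweet identifiers and judgments are hypothetical, and none of the considerations in the list (temporality, locality, authority) is modeled here.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical retrieved tweets and relevance judgments.
precision, recall = precision_recall(
    ["t1", "t2", "t3", "t4"],
    {"t1", "t3", "t9"},
)
```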

None of this is to say that Cranfield-style evaluation does not have a place in microblog IR research. Surely it does. But we should be strategic in crafting assessment methodologies at this early stage of research and development in microblog retrieval. Serious consideration of naturalistic and behavioral methods of assessing system performance will no doubt have a large impact on future research, as we work to make our studies both realistic and generalizable.

Microblogs form part of the vital and rapidly changing landscape of technologies that mediate people's contemporary interactions with information. People post to Twitter; they write on Facebook; they search the web, browsing articles on Wikipedia. They do all this while also finding information offline, via face-to-face relationships, in newspapers, in libraries, etc. Building search services using microblog data is challenging for many reasons outlined in this review. But chief among those challenges is the fact that microblogs already act as de facto, informal search services. They provide venues for people to ask questions, answer questions, and make suggestions. As microblog IR research moves forward, it will be important to avoid simply mimicking this “guerrilla” social search function of microblogs. The question that faces us is, given so much data created by so many actively engaged people, how can we apply our expertise as IR researchers to identify and solve information problems whose solutions are latent in the microblog arena?


  2. The “value” of Twitter updates is, of course, subjective. A 2009 study (Kelly, 2009) classified 41% of tweets as “pointless babble,” an opinion rejected by danah boyd (2009).

  8. I omitted the followers of one core community member because his social network was unmanageably large.

  11. The motivation and convention for using hashtags was described by the blogger Chris Messina (Messina, 2007).

  12. Authors may contact other users privately using a so-called direct message.

  13. TunkRank scores were obtained from Adams (2009).

  14. The Pearson correlation between log-followers and log-TunkRank is 0.519. Correlation between log-friends and log-TunkRank is 0.264.