Crowdsourcing the indexing of film and television media



In this paper we describe a project that explores how advances in information technology could be used to make film and television media more accessible to both scholarly and non-scholarly audiences. By indexing, at a detailed level, a range of time-synchronized and non-time-synchronized elements in a test collection of 12 films and 8 television programs, we demonstrate how structured data representing many aspects of media content can be produced in a streamlined manner, and discuss how this work could potentially be augmented with automated indexing to be more efficient. We present examples of how this data can be utilized to produce a variety of tools and artifacts that make film and television media more accessible, and suggest that crowdsourcing could be an effective strategy for accomplishing this work on a larger scale. This research contributes to the growing body of literature exploring how multimedia collections can be made more accessible and useful for a variety of purposes.


Access to film and television media has historically been a critical factor in the development of academic disciplines involving film and television. The formation of the Film Library at the Museum of Modern Art (MoMA) in 1935, for example, provided academic institutions with the opportunity to engage with the material around which they could organize their first film studies courses and “definitively establish[ed] new possibilities for the academic study of cinema” (Pollan, 2007, p. 15). This film library served the film studies community by collecting and archiving films and providing a foundation for the broad growth of film studies curricula introduced in American universities in the 1960's by “establishing a study center with films, books, journals, stills, and related documents” (p. 15).

As the academic study of film and television matured and became relevant to a broader range of disciplines in the arts and humanities from the 1960's through the 1980's, technology increasingly played a part in the evolution of scholarly research and educational goals. Whereas in the 1960's and 1970's film was primarily accessed by scholars and students through museums, festivals, and public screenings (or through individual access to low-quality Super-8 or 16-mm prints), the availability of media on videotape and laserdisc in the 1980's made repeated viewings more practical, and enabled increased focus on film-specific analyses and broad histories based on in-depth research (Altman, 2009). The increased availability of media on DVD in the 1990's continued this trend and contributed to research and teaching that more often involves study of a larger corpus: “genre studies, cultural studies, national studies, or studies of specific periods” (Altman, 2009, p. 133).

It is noteworthy, however, that the technological advancements that have been leveraged thus far to improve scholarly access to media were developed for the consumer market and appropriated for use in academia. Despite rapid advances in information technology in the first decade of this century, and the recent proliferation of online film and television media, we have not yet seen changes in scholarly research and teaching that take significant advantage of the possibilities offered by these developments. Indeed, many of the examples that demonstrate the potential of today's technological affordances for the study of film and television come not from academia but from non-scholars, often “fans,” sharing ideas, speculations, and new ways of looking at their favorite content through blogs, online discussions, visualizations, even cartoons (see Figure 1).

This paper describes work done as part of a larger project that focuses on how advances in information technology and knowledge from the field of information science might be used to improve and expand access to film and television media. Specifically, we share work that explores how film and television media could be indexed to be more accessible, the potential benefits such indexing would enable, and how the task of indexing large volumes of content at a detailed level could be accomplished.

Figure 1.

Movie plots visualization in Web comic XKCD¹


Given the potential value of applying current information technology to film and television media for the benefit of both scholars and non-scholar “fans” with demonstrated interest in this material, we focused our current work on the following research questions:

  • In what new ways might film and television media be made more accessible for scholars and non-scholars?

  • To what extent can we rely on automated techniques, versus techniques that require human effort, to produce data that can be used to make film and television media more accessible?

  • How could the significant level of work required to develop useful data for large film and television collections be feasibly accomplished?


Our research is based on the premise that applying information technology to film and television media can lead to new systems and methods for teaching, studying, and enjoying this content, by both scholars and non-scholars. We explore this premise below, followed by a review of work that has been done to develop techniques for indexing film and television media, and how crowdsourcing has been utilized in large-scale projects.

Information Technology in Film and Television Studies

Andreano (2007) suggests there is a “missing link” between scholars and film archives (p. 84), a combination of the difficulty in finding and accessing film content and the challenge of locating desired units of information from specific media. Consumer technologies are limited in their ability to overcome these hurdles in that they primarily provide easy access to only one film at a time, and that access at its most granular is through DVD chapters, a relatively crude level of retrieval. Augst and O'Connor (1999) pointed out this limitation over a decade ago, noting that while videodisc technology did enable “random access to points within a film text and a limited form restructuring the text” (p. 346), using the technology was awkward and a common goal of comparing two sections of film or a section of film with an associated document (e.g., script or journal article) was impractical. While Augst and O'Connor made the argument for the benefits of the digital environment for studying film, there has since been relatively limited work in this area.

Projects such as Stephen Mamber's “Digital Hitchcock” (1990's) and Lauren Rabinowitz and Greg Easley's “The Rebecca Project” (1995) were early steps towards providing students and scholars with richer access to film material, each focusing, however, on a single film. The Virtual Screening Room (VSR), an educational computing project at MIT, was developed as a film browsing and searching system that uses timestamp synchronized transcripts of a larger set of films (Ronfard, 2004). This system presents a user with an interface for browsing film via short clips, providing analyses oriented towards film editing. However, VSR is intended to serve as an “electronic textbook” for teaching film, rather than a general system for scholarly film analysis (MIT Center for Educational Computing Initiatives, 2006).

MovieBrowser, another project aimed at students of film, is a Web-based system that uses film shot-based segmentation and aggregation techniques to support movie analysis and browsing in a film studies context (Ali & Smeaton, 2009). In a study aimed at measuring the benefit of using new technology for this purpose, seven students in a film theory and history course at Dublin City University were given two film analysis tasks to complete, using either a conventional DVD player interface or the MovieBrowser system. Both quantitative data (e.g., interaction clicks) and qualitative data (analysis of task outcomes and interview questions) were collected. The authors note the limitations of this study (small sample size, artificial study environment, and limited time to complete the assigned tasks), but their findings showed improved student outcomes and more positive feedback with the MovieBrowser system compared to the DVD player. Interestingly, students took longer to complete the tasks with the MovieBrowser system; the authors suggest this might provide additional support for the system, because it enabled students to become more engaged with their task.

Although not focused on narrative film, a current project at Boston's PBS station, WGBH, serves scholars by providing online access to their in-house generated media content (Michael, Todorovic, & Beer, 2009). They have found that scholars working with media have different needs than educators or the general public, and seek an “intense level” of information about resources (p. 22). WGBH is thus indexing resources at the “sub-item or shot-log level” (p. 22) rather than simply cataloging the entity as a complete work as is traditionally done with a standard such as Machine-Readable Cataloging (MARC). As described more fully later in this paper, a similar level of indexing is also central to our current work.

Techniques for Indexing Multimedia

Without indexing that provides more granular access points, scholars, students, and others interested in film have to invest considerable time and effort to locate specific locations or sequences in the media (Andreano, 2007). Strong arguments for creating more detailed, time-coded “shotlists” (Terris, 1998) have been made; alas, previous attempts to do so, such as the effort made in the original British National Film and Television Archive in 1935, have not been sustainable (Andreano, 2007). Developments in content-based image retrieval (CBIR) over the past 20 years, however, have increased the practicality of producing detailed indexing of film and television collections by potentially reducing the amount of human effort required.

CBIR techniques, which are used for both still and moving images, typically consist of automated content-based detection using low-level (pixel-level) attributes such as color, shape, or texture to infer the existence of higher-level features such as faces, specific types of objects, or settings (Christel & Conescu, 2005). In moving image content, low-level features can also be fairly reliably used to detect distinct shots (Smeaton, 2004); representative frames from shots can also be used as visual surrogates for semantic units of a film (e.g., to produce a visual storyboard). A more difficult area of research is grouping detected shots into larger units such as scenes, which currently works best in “well-structured video domains” such as broadcast TV news (Smeaton, 2004, p. 7). More recently, automated video indexing techniques have used visual features in combination with audio and text transcript information to improve identification of shot content (Enser, 2008). Automated video indexing continues to be an active area of research, led in part by annual NIST-sponsored TRECVID workshops that provide a venue for research teams to develop and evaluate video retrieval techniques and systems with realistic tasks and video testbeds (Smeaton, Over, & Kraaij, 2004), often consisting of news, documentary, or sports content.
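As a rough illustration of how low-level features can signal shot boundaries, the following minimal sketch compares gray-level histograms of consecutive frames and flags a boundary when the difference exceeds a threshold. It is not the method of any system cited above; the frame representation (flat lists of 0-255 intensities), the bin count, and the threshold are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Return a normalized gray-level histogram for one frame.

    A frame is assumed to be a flat sequence of 0-255 intensities.
    """
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [count / total for count in counts]


def detect_shot_boundaries(
    frames: Sequence[Sequence[int]], threshold: float = 0.5, bins: int = 8
) -> List[int]:
    """Return indices of frames that appear to start a new shot."""
    boundaries = []
    previous = None
    for index, frame in enumerate(frames):
        current = histogram(frame, bins)
        if previous is not None:
            # L1 distance between normalized histograms lies in [0, 2].
            distance = sum(abs(a - b) for a, b in zip(current, previous))
            if distance > threshold:
                boundaries.append(index)
        previous = current
    return boundaries


# Toy input (hypothetical): three dark frames followed by three bright frames.
dark = [10] * 16
bright = [240] * 16
toy_frames = [dark, dark, dark, bright, bright, bright]
shot_starts = detect_shot_boundaries(toy_frames)
```

A hard cut produces a large histogram difference, which is why this kind of detection tends to be fairly reliable; gradual transitions and grouping shots into scenes would need more than a single fixed threshold.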

Because of the less consistent structure of fictional film and television media, it is more challenging to apply automatic indexing techniques in this domain, but considerable work has been done. Rasheed and Shah (2005) used shot detection and visual similarity of shots to cluster shots in movie scenes. Other approaches attempt to use low-level features to automatically detect “events” that occur in a film or television program. In Zhai, Rasheed, and Shah (2005), the authors classified film and television clips as conversation, suspense, and action, while Lehane, O'Connor, Lee, and Smeaton (2007) classified portions of content as dialogue, exciting, or montage events. An evaluation in the latter study with a test collection of ten films and nine television shows resulted in an average of 90% or better recall for each of the event types, with at least 59% average precision.
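To make the recall and precision figures concrete, the sketch below computes both measures from counts of correct detections, false alarms, and missed events. The counts are hypothetical and are not data from any cited study; they are chosen only so the results land near the reported range.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a set of detected events."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts: 100 real events, 90 of them found, plus 60 false alarms.
precision, recall = precision_recall(
    true_positives=90, false_positives=60, false_negatives=10
)
```

With these assumed counts, recall is 0.9 and precision is 0.6: nearly every real event is found, but four in ten detections would still need to be discarded by a human reviewer.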

There is a fundamental problem with the use of automated techniques for indexing film and television media, however, called the semantic gap, or “the lack of coincidence between the information that one can extract from the (visual) data and the interpretation that the same data has for a user in a given situation” (Smeulders, Worring, Santini, Gupta, & Jain, 2000, p. 1352). While a range of promising techniques to reduce the semantic gap when using low-level features are currently being investigated, it is also clear that at the present time low-level features are not always sufficient to describe things users might want to find in visual content (Liu, Lu, & Ma, 2007). Supporting a range of useful access points to film and television media necessitates some level of human indexing.

Crowdsourcing Large-Scale Projects

If only a subset of the indexing necessary to support a detailed level of access to film and television media can be automated in the foreseeable future, human indexers will have to do the rest. Until recently, such an effort might have been feasible only in situations where an organization has determined the potential benefits of such indexing are worth the costs of a team of indexers. Since 1999, for example, The Music Genome Project, which is the database that runs Pandora Radio, has employed a small number of musicians to musically analyze songs based on a number of predefined attributes (Westergren, 2008). Creating these “musicological fingerprints” takes around 20 minutes for each song, and approximately 10,000–12,000 songs are added to the database each month (Westergren, 2008; Pandora FAQ). A company called Jinni is creating a similar “genome” for movies by employing “a team of film professionals” to index feature films by characteristics such as mood, tone, and plot elements, aimed at providing a commercial film search and recommendation service for a general audience (Jinni – The Movie Genome, 2010).

While these expert-driven sites can justify a significant investment in indexing with the expectation of resulting revenue, for many media collection efforts this is not an option. Recent advances in information technology and the development of the “commons-based peer production” model (Benkler, 2006, p. 107), however, provide a new opportunity: crowdsourcing (Howe, 2006). Although this term is fairly new, group volunteerism has been a byproduct of the World Wide Web since its inception. Drawing inspiration from Richard Stallman's GNU project, Linux, the open source operating system started by Linus Torvalds in 1991, is one of the earlier products created from what Eric S. Raymond described as a “bazaar” style of software development (2000). By breaking away from the more traditional style of code development, typified by a small team overseen by a manager, Torvalds turned Linux into a community creation. Coders were free to contribute what they could, when they could.

When given the right tools within a collaborative environment, an army of volunteer users can create large-scale projects, such as Linux, in a comparatively short amount of time, and problems can be rapidly solved. Raymond summarized the self-healing properties of open source software in the axiom, “given enough eyeballs, all bugs are shallow” (2000, Abstract). Wikipedia contributors take a similar approach, although in this case the corrections are more self-reinforcing than self-healing. When the community encounters vandalism, incorrect page additions are quickly reverted to their previous state. In part, this phenomenon can be attributed to a sense of “community introspection” that facilitates consensus building (Viégas, Wattenberg, & Dave, 2004). Communities interested in what they perceive as correct data will fight to protect it.

The success of Wikipedia has inspired many other niche wikis, often centered on a specific topic.² Their ubiquity supports Clay Shirky's claim that these types of tools, which have greatly lowered the cost of forming groups, become more interesting as they become more mundane (2008). Luis von Ahn's experiments in crowdsourcing data collection using different types of games as a platform (Games With A Purpose) have shown the feasibility of accomplishing tasks too costly under a financial reward system (2009). Many other examples, such as OpenStreetMap for mapping, Galaxy Zoo for astronomical research, and the Steve Project for art museum collections, show that people participate in a crowdsourced project because of a conscious desire to contribute to the mission of that project.

Television and film fans have already demonstrated interest in commons-based peer production by creating information sharing hubs, effectively crowdsourcing analysis of their favorite series or movie. Episodic series in particular provide fertile ground for discussion as viewers try to decipher mysteries central to the narrative. Twin Peaks (1990) was one of the first shows to exploit Usenet and capitalize on networked communication. One fan “provided a detailed sequence of all the narrative events (both those explicitly related and those implied by textual references) and updated it following each new episode,” while another “built a library of digitized sounds from the series” (Jenkins, 1995, p. 54). With the help of VCRs, fans freeze-framed scenes in their hunt for clues that might reveal the identity of Laura Palmer's killer. Others looked at films Twin Peaks actors had previously starred in, supplementary merchandise like Laura Palmer's diary (written by Lynch's daughter), the Julee Cruise album Floating Into The Night (featuring lyrics by Lynch), and films whose plot lines inspired Twin Peaks. The VCR was a tool that enabled viewers to treat the series more like a manuscript, while “the computer net,” Jenkins argues, “allowed a scriptural culture to evolve around the circulation and interpretation of that manuscript” (p. 54).

This active process of participatory analysis can be synthesized in any number of textual or visual forms. Postmodern series like Twin Peaks benefit from visual comparisons as a way of understanding intertextuality (Figure 2). Lost (2004), a contemporary show partly inspired by Twin Peaks, has an equally devoted fan base operating in a similar manner. The “Lostpedia,”³ a wiki dedicated solely to Lost, is an extensive, fan-curated breakdown of the series, charting character relationships, plot lines, literary techniques, locations, themes, and cultural references.

Figure 2.

Fan-generated film comparison


To investigate our research questions related to exploring the potential value of applying current information technology to film and television media for the benefit of both scholars and non-scholar “fans,” we developed a test collection consisting of 12 feature films and 8 episodes of Twin Peaks (the first season), indexing each at a detailed level. This work is part of a larger effort to develop a framework for the crowdsourcing of film and television indexing. While fully describing that framework is beyond the scope of this paper, the work detailed in the sections that follow is aimed at supporting a scenario in which fans, students, scholars, and others with interests in film and television media can contribute effort to index and annotate that media at a granular level and/or use available indexed data to generate artifacts (e.g., an annotated list of film clips on a given theme, a visualization, an interactive tool) that reflect their interests.

Developing an Appropriate Metadata Schema

A foundational component of this research is a metadata schema that prescribes the elements of film and television media we believe are useful to index. As noted by Enser (2008), “visual asset management has lacked the adherence to universal standards of cataloguing and classification which characterized traditional library practice with text-based material” (p. 532) and as such there is no clearly established metadata schema that we could simply adopt with confidence that it would serve the goals of this project. There do exist schemas intended for film and television media, but for different reasons none of these is ideal. The Moving Image Collections (MIC) schema is intended for film and video (Moving Image Collections, 2004), but its emphasis is on describing titles and collections, not time-based details of video content. MPEG-7 is a metadata standard developed specifically for cataloging details of digital audio and video, including low-level features, objects within a video, and time-based elements (Martinez, 2004). MPEG-7 is very complex, however, and despite being an approved standard for a decade there are few practical examples of its use in accessible projects. Closer to the goals of our project, the Public Broadcasting Metadata Dictionary Project has produced a schema called PBCore intended for use by public broadcasters and related communities (PBCore Metadata, 2005). While PBCore does contain many elements that we have adopted as part of our global elements, it also possesses many elements not applicable to our project and lacks support for the detailed time-based elements we require. So while the PBCore, MPEG-7, and MIC schemas all served as relevant models in the initial stage of our work, we developed our project-specific metadata in several subsequent stages.

First, there are well-known film and television databases, such as IMDb and Netflix, which provide general guides to what high-level metadata is useful and expected when describing film and television media. We adopted many of these common elements for our non-time-based metadata.

For our time-based metadata (metadata elements that are synchronized to the media content), we developed preliminary ideas of useful elements through course projects in two instances of a semester-long graduate-level course at the School of Information at the University of Texas at Austin. Students worked both individually and in groups to develop ideas for potentially useful ways to index a film, indexed an actual film, and discussed their indexing decisions and outcomes with other groups.

Finally, based on both feedback from the student indexers and an examination of what worked well in this pilot indexing and what did not, we developed a more complete yet still tentative metadata schema. This tentative schema was further revised based on the indexing experience with a small set of new films using the interface described next.

Indexing a Sample Set of Film and Television Media

The software used in the initial indexing done in the Digital Media Collections course was a product called GLIFOS gmCreator. This software enables, among other features, the synchronization of transcript text to a multimedia file. It is not, however, designed for the level of indexing we sought to do in this project, so we worked with the developers of GLIFOS to create a customized version of the software specifically oriented towards film indexing.

Figure 3 shows the interface to our customized GLIFOS product. This interface enables efficient indexing, working within a Web browser and saving the completed indexing work in an XML file.

Figure 3.

Project interface for indexing global metadata

As shown in Figures 3 and 4, much of the indexing is done simply by choosing options from pulldown menus that consist of values from controlled vocabularies.

Figure 4.

Interface for indexing time-based metadata

Developing a Sample Set of Generated Artifacts

To explore the potential utility of the detailed indexing of film and television media, we developed a range of artifacts—visualizations, tools, interface mockups—that can be produced with metadata generated by the indexing. These are quickly produced examples of what is possible, designed to obtain feedback from end-users on the potential value of the indexed data and artifacts. Based on this feedback, we can design a complete system that will better support the tasks and goals of potential users.


In this section we describe our completed metadata schema, the data set generated from the indexing work, and the range of sample artifacts produced from the data set.

Metadata Schema

The metadata schema is a key component of this project in that it determines which media elements can be indexed through the indexing interface and what data will thereby be available for the creation of artifacts. The schema consists of several distinct high-level components. One component is the project-wide controlled vocabulary elements. As shown in Table 1, these define the vocabularies for elements that are used when cataloging any film or television program. Katz (2005) and Trottier (1998) influenced the choice of these particular terms.

Table 1. Project-wide controlled vocabulary elements.
Element      Vocabulary Values
Set_type     Interior, Exterior, Montage
Time         Day, Night, Dusk, Dawn, Unknown
Shot_type    Close Up, Close Shot, Medium Close Shot, Medium Shot, Medium Long Shot, Over-the-Shoulder Shot, Point-of-View, Two-Shot, Long Shot, Establishing Shot, Tracking Shot, Insert, Pan, Zoom, Title Card

The other metadata schema components are film-specific. One of these is the collection of elements that are not time-based (i.e., not synchronized to the media), shown in Table 2. Elements marked with an asterisk in the tables in this section are repeatable fields (i.e., can have multiple values).

Table 2. Non-time-based metadata elements.
Global elements                     Contributors
Alternate Title*                    Writer*
Release Date                        Producer*
Genre*                              Production Company*
Color                               Film Editor*
Synopsis                            Costume Designer*
                                    Sound Designer*
                                    Production Designer*

Character Name*                     Location*: Name, Description
Actor: Name, Sex, Headshot image
Song*: Title, Author,               Sound*: Title, Type,
Motif*: Title, Type,                Commentary*: Title, Type,

Table 3 shows the table of contents metadata. This is a hierarchical structure that follows a common definition of film grammar, consisting of film sequences, each of which contains one or more scenes, which themselves contain one or more shots, as illustrated in Figure 3a.

Figure 3a.

Hierarchical structure used in table of contents

Metadata for the table of contents includes the timecode at which each sequence, scene, and shot begins along with additional information that varies depending on the type of structure. Some of this additional metadata (Character and Location) consists of a reference to one of the previously defined controlled vocabulary elements listed in Table 2 (e.g., each character in a scene refers to a Cast ≫ Character element), while other elements, such as Set_type, Time, and Shot_type, consist of a value selected from the project-wide controlled vocabulary shown in Table 1 (e.g., Set_type is one of Interior, Exterior, or Montage).

Table 3. Table of contents metadata elements.
Sequence     Start_time, Title
Scene        Start_time, Set_type, Time, Location, Character*
Shot         Start_time, Shot_type
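The sequence, scene, and shot hierarchy in Table 3 could be represented in code roughly as follows. This is a minimal sketch, not the project's actual data model: the class names, the use of seconds for Start_time, the sample values, and the `shot_at` helper are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Shot:
    start_time: float  # assumed: seconds from the start of the media
    shot_type: str     # a Shot_type value from Table 1


@dataclass
class Scene:
    start_time: float
    set_type: str      # Interior, Exterior, or Montage
    time: str          # Day, Night, Dusk, Dawn, or Unknown
    location: str
    characters: List[str] = field(default_factory=list)
    shots: List[Shot] = field(default_factory=list)


@dataclass
class Sequence:
    start_time: float
    title: str
    scenes: List[Scene] = field(default_factory=list)


def shot_at(
    sequences: List[Sequence], timecode: float
) -> Optional[Tuple[Sequence, Scene, Shot]]:
    """Return the (sequence, scene, shot) playing at a timecode.

    Assumes every level is stored in chronological order.
    """
    found = None
    for sequence in sequences:
        for scene in sequence.scenes:
            for shot in scene.shots:
                if shot.start_time <= timecode:
                    found = (sequence, scene, shot)
                else:
                    return found
    return found


# Hypothetical sample data with placeholder names.
toc = [
    Sequence(0.0, "Opening", [
        Scene(0.0, "Exterior", "Dawn", "Location 1", ["Character A"], [
            Shot(0.0, "Establishing Shot"),
            Shot(12.5, "Medium Shot"),
        ]),
        Scene(40.0, "Interior", "Day", "Location 2",
              ["Character A", "Character B"], [
            Shot(40.0, "Two-Shot"),
        ]),
    ]),
]
found = shot_at(toc, 15.0)
```

Because only start times are stored, each unit implicitly ends where the next one begins, which is why the lookup walks the shots in order rather than checking explicit end times.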

The remaining metadata elements are time-based elements synchronized to timecode in the media. Sounds and songs are finite in number (there are only so many sounds and songs within a given film or show), although the indexer may be selective about which are included due to time constraints. Motifs and commentary, on the other hand, are potentially boundless. The indexer again exercises personal judgment when choosing which to include.⁴

Table 4. User defined metadata elements.
Sound        Start_time, Stop_time, Sound, Note
Motif        Start_time, Stop_time, Motif, Note
Song         Start_time, Stop_time, Song
Commentary   Start_time, Stop_time, Title, Description, Reference*: Start_time, Stop_time

The last time-based metadata elements are script related (Table 5). Scripts are processed into small chunks, each of which is synchronized to timecode in the film or television media and includes the speaker of the dialogue and the actual dialogue spoken.

Table 5. Script metadata element.
Script       Start_time, Stop_time, Speaker, Text

Sample Data Set

The basis for most user-generated artifacts will be the indexed data described by the metadata schema and produced with the indexing tool. As it is indexed, data is stored in an XML file (see the small representative snippet in Figure 4a).

Figure 4a.

Snippet from XML output file

XML provides a structured, human-readable file format, and enables the indexed data to be filtered, transformed, or converted to other metadata schemas. For example, we converted all indexed data in our sample set to a relational database format. The data is identical in each format, but each has its own advantages when used with different tools to create artifacts. For this reason, in an operational system, we expect to also provide data sets in comma-separated value (CSV), JSON, and MySQL dump file formats.
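The kind of conversion described above might look like the following sketch, which reads shot records from XML and re-serializes them as JSON and CSV using only the Python standard library. The XML element and attribute names here are hypothetical stand-ins; the project's actual output format may differ.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical XML; the real element and attribute names may differ.
SAMPLE_XML = """
<film title="Example Film">
  <shot start_time="00:00:00" shot_type="Establishing Shot"/>
  <shot start_time="00:00:12" shot_type="Medium Shot"/>
  <shot start_time="00:00:31" shot_type="Close Up"/>
</film>
"""


def shots_from_xml(xml_text):
    """Extract each <shot> element's attributes as a dictionary."""
    root = ET.fromstring(xml_text.strip())
    return [dict(shot.attrib) for shot in root.iter("shot")]


def to_json(rows):
    """Serialize the shot records as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the shot records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["start_time", "shot_type"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = shots_from_xml(SAMPLE_XML)
json_text = to_json(rows)
csv_text = to_csv(rows)
```

The same records come out in each format; only the container changes, so a visualization library that expects JSON and a spreadsheet that expects CSV can both work from one indexing pass.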

Sample Generated Artifacts

The data sets containing the detailed indexing described above enable many potential uses. In this section, we present a few examples to illustrate the possibilities.

Figure 5.

Keyframe overview of an episode

Figures 5 and 6 are two different overview visualizations for separate episodes of Twin Peaks. Before exploring an episode, a user may want an overview of it to identify points of interest and gain a sense of the episode's internal scene structure. In Figure 5, additional detail is available by clicking on any individual frame to bring up an enlarged view with contextual metadata for the episode – sequence, scene, shot, and characters who appear in the scene.

Figure 6 is an alternative structural overview, plotting shot types against running time. Here we can quickly see the most common shot types, and where certain shots cluster within the episode. Again, clicking on a shot provides a visual preview and contextual metadata.

Figure 6.

Shot analysis of Twin Peaks episode
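Summaries like those behind a shot-type overview can be derived directly from the table of contents data. The sketch below counts shot types and derives shot lengths from consecutive start times; the sample shots and running time are hypothetical, and the function is an illustration rather than the code used to produce the figures.

```python
from collections import Counter

# Hypothetical (start_time_seconds, shot_type) pairs in chronological order.
shots = [
    (0.0, "Establishing Shot"),
    (8.0, "Medium Shot"),
    (15.0, "Close Up"),
    (21.0, "Medium Shot"),
    (30.0, "Close Up"),
]
running_time = 36.0  # hypothetical end of the excerpt, in seconds


def shot_statistics(shots, running_time):
    """Return shot-type counts, per-shot durations, and average shot length."""
    counts = Counter(shot_type for _, shot_type in shots)
    # A shot lasts until the next shot begins (or until the media ends).
    ends = [start for start, _ in shots[1:]] + [running_time]
    durations = [end - start for (start, _), end in zip(shots, ends)]
    average = sum(durations) / len(durations)
    return counts, durations, average


counts, durations, average_length = shot_statistics(shots, running_time)
```

Plotting each shot's start time against its type would give a chart in the spirit of the structural overview, and the same counts and averages could be grouped by director for comparative statistics.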

Twin Peaks, like many television series, employed an array of different directors throughout its run. Figure 7 shows three of those directors paired with statistics generated from our indexing. Whether or not all of the statistics are useful data points remains to be seen, but clearly each director leaves an imprint on the episode he or she directs. Users doing comparative directorial analysis should find these statistics illuminating.

Figure 7.

Twin Peaks director comparison excerpt

The detail of our indexed data enables close examination of character interactions within a film or television series. Figure 8, for example, shows a visualization to help understand character co-occurrence within scenes in the first episode of Twin Peaks. Each row and column in the interaction matrix represents a character in the episode. A cell in the matrix is filled in when the characters in the intersecting row and column appear in the same scene.

Figure 8.

Character co-occurrence interaction matrix
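A co-occurrence matrix of this kind can be computed from the Character references attached to each scene. The following is a minimal sketch with placeholder character names and hypothetical scene casts; it counts shared scenes, so a cell would be "filled in" wherever the count is greater than zero.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scene casts with placeholder character names.
scene_casts = [
    ["A", "B"],
    ["B", "C"],
    ["A", "B", "C"],
    ["D"],
]


def co_occurrence(scene_casts):
    """Return (characters, matrix) where matrix[i][j] counts shared scenes."""
    characters = sorted({name for cast in scene_casts for name in cast})
    index = {name: i for i, name in enumerate(characters)}
    matrix = [[0] * len(characters) for _ in characters]
    for cast in scene_casts:
        for first, second in combinations(sorted(set(cast)), 2):
            matrix[index[first]][index[second]] += 1
            matrix[index[second]][index[first]] += 1
    return characters, matrix


def scene_counts(scene_casts):
    """Return the number of scenes in which each character appears."""
    return Counter(name for cast in scene_casts for name in set(cast))


characters, matrix = co_occurrence(scene_casts)
appearances = scene_counts(scene_casts)
```

The matrix is symmetric, and the per-character scene counts are the quantity that would size each character's circle in an arc-diagram view of the same data.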

Figure 9 illustrates the same character co-occurrence data, this time in the form of an arc diagram. The diagram represents a character with a circle, sized relative to the number of scenes in which that character appears, with an arc connecting characters that appear in the same scene.

Figure 9.

Character co-occurrence arc diagram

Figure 10 demonstrates the interactive comparison possibilities opened up by a collection of data sets that are based on the same metadata schema. This preliminary interface presents a dual viewer where users can select different clips for analysis, and then save them to a pool for larger group comparisons. Inter-episode and within-series comparisons are suggested in Figure 10, but the interface could just as well accommodate film-to-film or film-to-television show comparisons. Scholars and non-scholars interested in David Lynch might compare and contrast sounds, motifs, and characters that appear in Twin Peaks with those in Lynch's films.

Figure 10.

Comparison interface example


Generating the data set necessary to produce the artifacts described above required a substantial amount of the authors' time: roughly 10 to 15 times the running time of the source material. Twin Peaks season one, for example, consisting of a 90-minute pilot and seven 45-minute episodes, took about 85 person-hours to index. Generating this sort of data for the broader range of film and television media necessary for general use by scholars and non-scholars, using the same methods employed in our work thus far, would require significant human effort. The Pandora and Jinni models that rely on expert indexers will not scale at this level of effort, at least in a non-commercial context. How might this work be accomplished more efficiently?
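The Twin Peaks figures can be checked with a short calculation. The variable names are ours; the numbers are those given above.

```python
# Running time of Twin Peaks season one, as described in the text.
pilot_minutes = 90
episode_minutes = 45
episode_count = 7
indexing_person_hours = 85

running_hours = (pilot_minutes + episode_count * episode_minutes) / 60  # 6.75 hours
effort_ratio = indexing_person_hours / running_hours  # about 12.6
```

At about 12.6 person-hours per hour of content, the season falls within the 10 to 15 times range reported for the collection as a whole.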

Our indexing experience reveals that the bulk of indexing time was spent on the table of contents and script metadata elements, specifically identifying new shots and synchronizing them to timecode, identifying shot types, and making speaker role assignments. While we are not aware of any current automated technique that can reliably identify the types of shots we have outlined in our schema vocabulary, continued advances in automated video indexing, particularly shot boundary detection, have the potential to produce preliminary sequence, scene, and shot divisions for initial “stubs,” which could then be corrected by human indexers as necessary. Attempts at automatically classifying scenes as containing specific types of events also show promise (Zhai, Rasheed, & Shah, 2005; Lehane, O'Connor, Lee, & Smeaton, 2007). Similarly, although there may not be practical solutions today for automating the identification of scores, songs, and sounds embedded in films or television shows, these elements do have an identifiable fingerprint, and applications like Shazam show that this fingerprint, for songs at least, is exploitable for music discovery. The improvement of shot, scene, and sequence detection algorithms, along with emergent music and sound databases for retrieval comparisons, could in time realistically replace much time-consuming manual indexing work.
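To make the idea of automatically generated shot “stubs” concrete, the sketch below applies one simple and widely known approach to shot boundary detection: comparing intensity histograms of consecutive frames and flagging a cut when they differ sharply. This is an illustrative assumption, not a technique used in our project; the frame representation (a flat list of grayscale pixel values), the bin count, and the threshold are all hypothetical choices, and the sketch assumes every frame has the same number of pixels.

```python
def histogram(frame, bins=4, max_value=256):
    """Count the grayscale pixel values of one frame into equal-width bins."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return counts


def detect_cuts(frames, threshold=0.5):
    """Return the indices of frames assumed to start a new shot."""
    if not frames:
        return []
    cuts = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        # Normalized L1 distance between histograms: 0 = identical, 1 = disjoint.
        diff = sum(abs(a - b) for a, b in zip(previous, current)) / (2 * len(frames[i]))
        if diff > threshold:
            cuts.append(i)
        previous = current
    return cuts


# Toy "video": three dark frames followed by two bright frames (invented data).
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 250]
frames = [dark, dark, dark, bright, bright]

# Frame 0 always opens the first shot; each detected cut opens another stub.
stub_starts = [0] + detect_cuts(frames)  # [0, 3]
```

A hard threshold of this kind is likely to miss gradual transitions such as dissolves and fades, which is one reason the resulting stubs would still need correction by human indexers.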

A partially automated system would be well supported by crowdsourced editing. Luis von Ahn's ESP Game, in which paired players submit descriptive tags for an image until their tags match, is a useful model for harnessing the crowd to perform a task that computers cannot do. Similar systems (or games) could be developed that present users with short video excerpts based on automatically generated time codes for shots, songs, and sounds. Users would then be tasked with correctly identifying our metadata elements in accordance with our shot vocabulary, or supplying tags for song and sound cues.
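The agreement rule at the heart of such a game can be sketched as follows. This is a simplified model under our own assumptions: players submit one tag per round, tags are compared after lowercasing and trimming whitespace, and the shot labels are placeholders rather than terms from our shot vocabulary.

```python
from itertools import zip_longest


def first_agreed_tag(player_a_tags, player_b_tags):
    """Replay both players' submissions round by round and return the first tag
    that both have submitted, or None if they never agree."""
    seen_a, seen_b = set(), set()
    for tag_a, tag_b in zip_longest(player_a_tags, player_b_tags):
        if tag_a is not None:
            seen_a.add(tag_a.strip().lower())
        if tag_b is not None:
            seen_b.add(tag_b.strip().lower())
        agreed = seen_a & seen_b
        if agreed:
            # If several tags match in the same round, pick one deterministically.
            return sorted(agreed)[0]
    return None


# Two players label the same automatically generated video excerpt.
label = first_agreed_tag(
    ["close-up", "two shot", "wide shot"],
    ["medium shot", "Wide Shot"],
)  # "wide shot"
```

Requiring agreement between independent players acts as a basic quality check, since a label is accepted only when two people arrive at it separately.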

In a hybrid system, the division of labor required for fully indexed film or television content plays to the strengths of volunteer-based systems. Users will have not only different skill sets, but also different levels of motivation and expertise. Perhaps contrary to intuition, this does not appear to pose a significant obstacle to success. For instance, a crowdsourced project that relies on social, rather than financial, incentives for participation is not necessarily susceptible to lower-quality results (Mason & Watts, 2009).

While crowdsourced contributions inevitably will be of varying quality, even bad data is better than no data at all, because bad data is a call for improvement. Shirky claims that stubs on Wikipedia motivate those who enjoy editing but are reluctant to start new topics: “many more people are willing to make a bad article better than are willing to start a good article from scratch” (2008, p. 122). We believe the film or television equivalent of a Wikipedia stub would entice users with different skill sets to contribute. Understanding shot types is a different skill from identifying sound design and foley work, but both skill sets can be brought to bear on a system that supports crowdsourced data.

The crowd's many eyeballs are exceedingly efficient at data correction, but what about the crowd's interpretive capability? Lostpedia and similar niche wikis suggest that platforms that facilitate open discussions provide a space for rapid, communal analysis, assuming one has the technological access and ability; debates over themes, characters, and references beget further discussions. No single indexer could possibly match the analysis put forth by the crowd in an equal amount of time. Issues with data quality, such as interpretive vandalism, should be reduced according to the same self-reinforcing principle that guides Wikipedia, given a sufficiently sized audience.


The work described in this paper seeks to take advantage of advances in information technology to explore new ways film and television media can be made more accessible for both scholars and non-scholars with interests in this domain. By indexing, at a detailed level, a range of time-synchronized and non-time-synchronized elements in a sample collection of films and television programs, we have illustrated how a rich set of structured data representing many aspects of media content can be produced in a streamlined manner. We have also shown how this data can be utilized to produce a variety of tools and artifacts that make film and television media more accessible.

For our approach to be more feasibly applied on a larger scale, replacing some of our manual indexing decisions with automated indexing would be beneficial. Automated processing of video content is an active area of research, and advances in this area could be usefully applied to our project. At the same time, further work is required to determine if the quality of automated indexing techniques that might substitute for manual indexing in the near future is good enough to justify replacing the human element.

Even if the data for some elements of the metadata schema used in this project could be generated automatically, it is likely that manual indexing will be a necessary part of the process for the foreseeable future. We have described how existing approaches to crowdsourcing could be reasonable models for our project, where the use of “stubs” and the presentation of indexing as games or intellectual challenges could effectively leverage the distributed expertise and knowledge of domain experts and fans to accomplish large-scale manual indexing.

An important thread of future work on this project is obtaining feedback from our intended audience. Reactions and suggestions related to our metadata schema, indexing interface, datasets, and the usefulness of generated artifacts will inform further development of the project.

Providing new forms of access to film and television media has potential legal risks, primarily related to copyright law. We believe that requiring the indexing tool to directly access a DVD or otherwise lawfully acquired local copy of the media mitigates the primary risk. No additional copy of the media would need to be made or distributed, because the user would access his or her own copy of the work. The generated data sets are facts, which are not ordinarily subject to copyright. If produced within the spirit of the law, generated artifacts should be considered fair use. Nevertheless, creating a platform that successfully navigates these risks needs to be considered in greater detail, and is a current thread of our research.

Finally, we are still exploring the range of artifacts that can be produced from our indexed data. There is significant previous research on tools and systems for working with collections of digital video, but much of this work is based on a more limited range of time-based metadata. The possibilities that exist from the depth of indexing produced in our current work leave much to explore.


This work was funded in part by a John P. Commons Teaching Fellowship from the University of Texas at Austin School of Information. We also thank Niki Arroyave, Rodrigo Arias, and the rest of the team at Glifos, as well as Carlos Ovalle, and we acknowledge the valuable effort of all the UT iSchool students who contributed to this project.


  2. Wikia, whose mission is to “enable communities to create, share and discover content on any topic in any language,” is one such repository of niche wikis.