User-assigned preferred entry point levels for Web image searching: A comparison with the Pyramid model

Introduction and Previous Studies

Pictures are important to us; they provoke reaction, stimulate ideas, and rekindle memories. Images on the Web are no exception. However, Web images can be difficult to search and retrieve. Few studies have been conducted to identify the ‘entry points’ (Jolicoeur, Gluck, & Kosslyn, 1984), or the most commonly preferred detail level of description, used by online image searchers. Based on the idea of entry points, every object has one particular entry point at which contact is first made with semantic memory, otherwise known as the ‘basic level’ (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). However, as Jolicoeur, Gluck, and Kosslyn concluded, visual objects are not always primarily identified at the basic level. Entry points may vary according to the user's domain knowledge of the items or concepts represented in a picture. At the same time, general terms may be needed for Web use (Goodrum & Spink, 2001). Image searchers' chosen entry points, when manifested as folksonomy-based ‘tags’ (descriptive terms chosen with no restrictions placed on users' descriptions), could potentially replace or enhance the authoritative and pre-defined lists of subject terms assigned by select experts (Neal, 2007; Neal, 2008). Tags are commonly used as descriptive indexing terms on social websites such as the photograph sharing website Flickr (http://www.flickr.com).

Having librarians assign subject headings or other controlled vocabulary terms is not ideal in every situation, since the terms they choose are based on their own interpretations (Krause, 1988). Greisdorf and O'Connor (2005) found that a strong personal element exists in individuals' image descriptions. Shatford (1986) addresses the complex issue of determining the ‘ofness’ and the ‘aboutness’ present in a visual document.

The ten-level Pyramid model (Jaimes & Chang, 2000; Jorgensen, Jaimes, Benitez, & Chang, 2001) provides a guide for describing images based on four levels of visual perception (e.g., color, texture) and six levels of semantic concepts (e.g., objects, events). Within its levels, it distinguishes image content in a variety of ways, such as generic, specific, and abstract semantic content. The purpose of this study was to develop and test a new Pyramid-influenced hierarchical model for categories of image description based on participants' descriptions of photographs.

Methodology

First, we analyzed the all-time most popular tags used on Flickr (http://www.flickr.com/photos/tags) in order to identify some predominant image tag categories. The following eight categories emerged: Color, Living Thing, Time, Events, Places, Concepts, Environment, and Objects. We then chose 16 photographs from Dr. Neal's personal photo collection that seemed to best represent these categories, two pictures per category. A total of 138 people participated in the online survey, which was distributed via SurveyMonkey. As part of the survey, participants responded to the following prompt about each image: “Provide your first, instant reaction to the content of each picture. Only spend a few seconds looking at the picture. Describe what you see in 10 words or less.” The purpose of this was to elicit participants' preferred entry points for the 16 provided photographs.

After collecting the data and performing a preliminary analysis, we created a hierarchy based on the participants' survey responses. The hierarchy contains seven main categories. Moving from the highest to the lowest level of abstraction, the categories are Abstract Scene, Abstract Object, Specific Scene, Specific Object, Generic Scene, Generic Object, and Physical Content. Six of these seven main categories have subcategories, and each category and subcategory includes a definition. Using content analysis, two coders classified each participant's response by category and subcategory in the hierarchy. They were instructed to assign the first category in the list that they thought described the participant's response, thus assigning the highest applicable level of abstraction. We then analyzed the results from the two coders and compared them with the Pyramid model.
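The precedence rule the coders followed can be summarized schematically. The sketch below is purely illustrative (the actual coding was human judgment, not automated); the assign_category helper and the example input are hypothetical, but the ordered list mirrors the hierarchy described above.

```python
# Illustrative sketch only: the study's coding was done manually by two coders.
# This shows the precedence rule -- assign the first (highest-abstraction)
# category in the ordered hierarchy that the coder judges applicable.

HIERARCHY = [            # highest to lowest level of abstraction
    "Abstract Scene",
    "Abstract Object",
    "Specific Scene",
    "Specific Object",
    "Generic Scene",
    "Generic Object",
    "Physical Content",
]

def assign_category(applicable):
    """Return the highest-abstraction category among those judged applicable
    to a participant's response."""
    for category in HIERARCHY:
        if category in applicable:
            return category
    return None

# Example: if a coder judged both "Generic Object" and "Generic Scene"
# applicable, the precedence rule assigns "Generic Scene".
print(assign_category({"Generic Object", "Generic Scene"}))
```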

Results and Discussion

At the main category level, Cohen's kappa was 0.56. Given that values over 0.70 are considered satisfactory in most situations, this intercoder reliability may seem slightly low, but it was computed over seven categories, unlike other studies that address fewer categories. For example, Rorissa and Iyer's (2008) study used only three categories (subordinate, basic, and superordinate) in the coding process. Thus, a kappa value of 0.56 across seven categories may be considered more than moderately reliable. Furthermore, only the data for which both coders were in agreement were analyzed. Additionally, the kappa value in the current study reflects the subjective nature of image descriptions, as well as the range of possible interpretations of those descriptions.
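For reference, Cohen's kappa corrects the coders' observed agreement for the agreement expected by chance:

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the observed proportion of agreement between the two coders and p_e is the agreement expected by chance, derived from the coders' marginal category distributions. As a purely illustrative back-calculation (assuming, hypothetically, uniform use of the seven main categories), p_e ≈ 1/7 ≈ 0.14, and κ = 0.56 would then imply p_o ≈ 0.56(1 − 0.14) + 0.14 ≈ 0.62, i.e., agreement on roughly 62% of the coded responses.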

We defined the most frequent subcategories as those the coders chose at least 150 times. The ten most frequently chosen subcategories accounted for 70.9% of the assigned codes. Generic Scene (Event) was the most frequently chosen (11.2%), followed by Abstract Scene (Emotion) (9.8%). The remaining top choices, in descending order of frequency, were Generic Object (Artificial Object) (9.4%), Generic Scene (Location with Specific Information) (6.9%), Generic Scene (Nature Scene) (6.8%), Generic Object (Human with Location Information) (5.7%), Abstract Scene (Opinion) (5.6%), Generic Object (Artificial Object with Color Description) (5.5%), Abstract Scene (Talk Bubbles) (5.2%), and Generic Scene (Location) (4.9%).
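The percentages above are simple proportions of all assigned codes. A minimal sketch of how such a tally could be reproduced is shown below; the code labels listed are hypothetical stand-ins, not the study's actual data.

```python
from collections import Counter

# Hypothetical list of subcategory codes assigned by the coders
# (one entry per coded response); not the study's actual data.
assigned_codes = [
    "Generic Scene (Event)",
    "Abstract Scene (Emotion)",
    "Generic Scene (Event)",
    "Generic Object (Artificial Object)",
]

counts = Counter(assigned_codes)
total = len(assigned_codes)

# Report each subcategory as a percentage of all assigned codes.
for subcategory, n in counts.most_common():
    print(f"{subcategory}: {n / total:.1%}")
```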

Based on the coding and analysis, participants' chosen entry points fell under only three of the seven main categories: Generic Scene, Generic Object, and Abstract Scene. Five of the image types (Events, Objects, Living Thing, Places, and Time) were coded under Abstract Scene. Among the ten levels in the Pyramid model, only Generic Object, Abstract Scene, and Generic Scene were used in the participants' descriptions. Also related to the Pyramid model, the various image types frequently belong to the same main categories and subcategories. For example, the most frequent subcategory for Color is Generic Scene (Nature Scene), and the one for Concepts is Generic Scene (Event). It seems necessary to subdivide the Pyramid model into subcategories in order to more accurately reflect preferred entry points, especially for the most frequently used categories.

The two coders never agreed on the most frequent subcategory for both images within any of the Flickr tag-based categories (such as Color or Time); that is, they did not agree on any subcategory for the image pairs derived from the categories we created from the Flickr tags. In many cases, when the two images within the same Flickr tag-based category produced conflicting results, the most frequent category chosen from our hierarchy was Generic Object. While the authors' categorization of the Flickr tags was itself a subjective process, these results indicate the challenge of achieving interindexer consistency.

Our results also indicate that neither the size nor the location of an object was the main factor in participants' descriptions; rather, the context surrounding the object was a powerful factor. For example, small artificial objects were often noticed and the images were categorized accordingly: images representing Event (‘Birthday’) and Environment (‘Sunshine over a statue’) were coded as Generic Object. Thus, our results illustrate the need for further investigation of the factors that influence a user's decision about whether an image is about a Generic Object.

Our findings emphasize the importance of a ‘bottom-up’ approach to Web image description in situations where it is impossible for a few experts to describe numerous images. We believe that user-based techniques such as tagging must be actively used in indexing Web images. The findings of this study will inform continued research on Web 2.0-based image retrieval, topical relevance in image data, and novel interface designs for image retrieval systems.
