Abstract


Image categorization research has created knowledge on the types of attributes humans use in interpreting image similarity. Little effort has gone into studying the effect of wider contextual factors on the image groupings people create. This paper reports the results of an investigation into the effect of associated text on magazine image categorization. Image journalism professionals performed a two-phase free sorting of 100 test images, and the resulting data were analyzed both qualitatively and quantitatively, using grounded theory methods and hierarchical clustering. The categorization behavior of the two groups was rather similar, but there was a statistically significant difference in the types of names given to the categories constructed. The results indicate that having page context available makes descriptions based on the overall theme or story of the image more likely. When the context is withheld, people are more prone to describe the people, objects and scenes portrayed in the images, and to combine various categorization criteria. This has implications for the design of interfaces for image archival and retrieval.


Introduction


Image categorization has been studied to gain knowledge on image similarity for use in, for example, content-based image retrieval applications. These applications commonly use algorithms based on the visual features of images, such as color or texture. However, in many contexts, such as websites and journalism, images are presented with associated text rather than by themselves. The associated text often influences the human interpretation of image content, for example when the intended message of an artwork is discovered through its title. Image retrieval methods also attempt to make use of the available text, since text retrieval techniques are more straightforward. Text thus plays into both user interpretation and technical retrieval of images. The possible effect of associated text on image categorization has not been studied, even though creating meaningful image groupings is a key issue in image retrieval research. Such a study would shed more light on the basis of image categorization and the role of text in these image similarity evaluations. It would support the design of multimodal image retrieval systems which employ both the semantic information provided by associated text and the visual information provided by image features. The results would also inform the presentation of image retrieval result sets by investigating the types of meaningful image groupings people construct. The aim of this study was to evaluate how associated text, in the form of magazine page context, affects the categorization of magazine images.

Past research on image categorization


Image categories

Image similarity and categorization have been studied via subjective categorization experiments. Results indicate that humans most commonly evaluate image similarity on a high semantic level (Mojsilovic & Rogowitz 2001) and describe image content on an interpretational, as opposed to perceptual, level (Rorissa & Hastings 2004). This often results in generic thematic image categories such as city scenes or landscapes. The importance of this description level is reflected in the categorization systems of commercial image collections, which often rely on thematic categories. However, image categories are sometimes also created based on more abstract image features, resulting in abstract categories, e.g. love (Greisdorf & O'Connor 2002). Image syntax is also employed in some image similarity evaluations, resulting in categories based on visual elements, e.g. round (Laine-Hernandez & Westman 2006).

For magazine images there exists no common typology or classification. In journalism, photographs are often typified through the photojournalistic processes through which they are created and selected for publication. For example, Hall (1981) distinguishes between (1) photographs which have been commissioned on a certain subject or event, (2) photographs taken without a specific news context in mind and later used to illustrate a news story, and (3) photographs not intended as news photographs which by accident become useful. Beyond this classification of news photographs, reportage photographs and (photo) illustrations are commonly identified as journalistic imagery.

The relationship between images and text

Images are open to a variety of possible meanings and interpretations by their viewers. In many contexts images are surrounded by related text that works as the outer context of the image. For example, Hall (1981) notes that the caption and headline play an important role in connecting news images to a story. This text-image relation is an instance of the so-called relationship attributes of images (Shatford Layne 1994). Shatford Layne argues that both the existence and the nature of these relationships should be indexed alongside the biographical and subject attributes of images. Several theories describe the nature of the possible relations between text and images.

Barthes' (1977) foundational study of image-text relations is based on a simple logic of three possibilities of how images and text relate to one another: 1. text supporting image (anchorage), 2. image supporting text (illustration), and 3. the two being equal (relay). In anchorage, a piece of text, such as a caption, provides the reader with a link between the image and its context. The text guides the interpretation of the image content: “it is a very common practice for the captions to news photographs to tell us, in words, exactly how the subject's expression ought to be read” (Hall 1981). In illustration the text comes first and the image elaborates the text by forming an illustration of it. In relay there is a complementary relation between the image and text in that each contributes to the overall message meant to be conveyed.

Martinec and Salway (2005) present a two-fold framework for the relations of images and text in both old and new media. They, like Barthes, specify the relative status of the text and images as equal (image and text independent or image and text complementary) or unequal (image subordinate to text or text subordinate to image). Further, they analyze the logico-semantic relationships between text and images, distinguishing between elaboration, extension and enhancement. In elaboration, the text and images depict and refer to the same participants, processes and circumstances. If new but related information is referred to or depicted, there is extension. If related temporal, spatial or causal information is provided, this is called enhancement.

Marsh and White (2003) also present a taxonomy of relationships between images and text. The development of the taxonomy is based on analyses of research in the fields of children's literature, dictionary development, education, journalism, and library and information design. The taxonomy was evaluated by analyzing text-image pairs on Web pages with educational content for children, online newspapers, and retail business pages. The highest-level division is based on the image's relation to the text, as follows:

  • close relation to the text: reiterate, organize, relate, condense, explain,

  • little relation to the text: decorate, elicit emotion, control, or

  • beyond the text: interpret, develop, transform.

These classifications are general but seem applicable to journalistic imagery and its relationships with other elements on the page of a publication. For example, Martinec and Salway (2005) note that different types of news photographs are used to serve different functions, so that photographs that elaborate text are often portraits while enhancement is often brought about by an image depicting a general scene.

Methodology


The aim of this study was to discover if and how the page context of magazine images, i.e. the relationship between the image and text on the page, affects the types of categories people construct when evaluating image similarity. A subjective image sorting study was conducted to answer the following research questions: Is there an effect of page context on magazine photograph similarity evaluations? How does the inclusion or exclusion of page context affect the types of categories constructed?

Participants and procedure

A total of 24 subjects (23 female, 1 male) participated in the study. The participants were staff members at picture agencies (8), newspapers (7), museum photograph archives (5) and magazines (4) who as part of their job select, annotate and/or archive photographs. Their job description varied from newspaper archivist to museum researcher. The average age of the subjects was 43 years. We used a between-subjects design to evaluate the effect of inclusion/exclusion of page context on image categorization. The subjects were divided into two groups of 12: those categorizing images in page context (context group) and those categorizing images with the context removed (no context group).

The experimental procedure consisted of two phases for both groups. In phase 1 all the photographs were handed to the subjects in random order in a pile. The subjects were instructed to go through the photographs and sort them into an unrestricted number of piles according to their similarity so that photographs similar to each other would be in the same pile. The subjects were told to decide on their own the basis on which they would evaluate similarity. There was no time limit. At the end of phase 1, the subjects were asked to describe the similarity element in each pile, i.e. to name the piles. The experimenter wrote down these category names. The photographs were then placed into a single pile and shuffled.

In phase 2 of the experiment the subjects were shown the category list generated during phase 1. The subjects were asked to go through all the photographs again, one at a time, and to write the number of each photograph next to each category the photograph could belong to; each photograph could thus be assigned to one or more categories. The subjects were told that they need not remember how they had categorized the photographs in phase 1. Again, there was no time limit. Phase 2 was completed by saving the category listing.

Material

The test material was taken from five Finnish magazines of different genres (women's, economy, general, high visual and travel). The magazines were chosen based on their wide circulation within their segment as well as their varied photographic content and photojournalistic style. For each magazine, the issue from week 7 of Spring 2006 was used. The editorial photographs in the magazines were numbered, excluding photographs with more than 20% of their area covered by text or graphics, photographs that spanned two pages, and photographs where any dimension was smaller than 4 cm. Twenty photographs were chosen at random from the numbered photographs in each magazine. No two photographs were chosen from the same page.

For the context group the magazine pages on which the one hundred selected photographs appeared were cut out and the page headers and footers removed in order to make the subjects pay less attention to which magazines the pages were from. The amount of context on a page varied from a simple caption to a full article including other visuals. Each page was cropped to the same size and glued on grey cardboard. The pages were numbered in a random order.

For the no context group the same photographs were used, but with the surrounding context removed. Each photograph was cut out of its page and glued on grey cardboard. If text or another visual element had been printed over the photograph, the additional element was hidden by covering it with black ink. The photographs were numbered in a random order.

Data analysis

The data were analyzed both quantitatively and qualitatively. A data-based qualitative analysis drawing on grounded theory methodology was carried out on the category names provided by the subjects. If the category name included more than one term, and thus possibly more than one basis for categorization, the corresponding number of instances was created from the category name. For example, the category named “people and work” yielded the instances “people” and “work”. All of the resulting instances were grouped iteratively by placing similar instances together. Gradually, top-level classes containing identifiable sub-level classes emerged from the data. The data were then coded according to these classes. One category was dismissed from the analysis due to its ambiguous nature. The reader should note that in this paper the terms category and class carry distinct meanings.

Category refers to a photograph group created by a subject in the sorting experiment. Class refers to an instance of category names coded according to the class structure. A single category (people and work) may include references to multiple classes (People-person and Theme-work). This is referred to as a multi-class category.
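The instance-splitting step described above can be sketched as follows. This is a minimal illustration only: the delimiter set ('and', ',', '/', '&') is a hypothetical assumption, as the coding in the study was done manually.

```python
import re

def split_category_name(name: str) -> list[str]:
    """Split a category name such as 'people and work' into candidate
    class instances ('people', 'work').

    The delimiters used here ('and', ',', '/', '&') are an assumption
    for illustration; in the study the splitting was performed by hand.
    """
    parts = re.split(r"\s*(?:\band\b|,|/|&)\s*", name)
    return [part for part in parts if part]

split_category_name("people and work")   # ['people', 'work']
```

A single-term name such as “portraits” passes through unchanged, yielding one instance.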

The use of different class levels in the category names is reported as frequencies and percentages. All reported statistically significant differences are based on two-tailed t-tests with unequal variances or, when specified, chi-squared tests on frequency data. The categorization data were analyzed using hierarchical cluster analysis in Matlab, based on the pairwise occurrences of photographs in categories. Due to the procedure in phase 2 of our experiment, no common similarity measure such as percent overlap or a co-occurrence measure could be used, as these could have resulted in distances smaller than 0 in cases where two photographs co-occurred in several groups. Therefore the similarity of two photographs was calculated using a modified percent overlap measure. The similarity P of two photographs i and j was the ratio of the number of placements of i and j in the same category to the total number of placements of i, where the total number of placements of i was smaller than or equal to that of j. The formula thus became P = p(i, j)/p(i), where p(i) ≤ p(j). The modified percent overlap gave a measure of similarity, which was then converted to a measure of dissimilarity or distance D = 1 − P.
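The distance computation can be sketched as below. This is an illustrative Python re-implementation, not the Matlab code used in the study; the representation of categories as sets of photo indices is an assumption.

```python
import numpy as np

def overlap_distance(categories, n_photos):
    """Pairwise distance matrix from a list of categories (each a set of
    photo indices), using the modified percent overlap:
    P = p(i, j) / p(i) with p(i) <= p(j), and D = 1 - P.

    Because p(i, j) can never exceed the smaller placement count,
    P stays in [0, 1] and D stays non-negative.
    """
    placements = np.zeros(n_photos)        # p(i): total placements of photo i
    co = np.zeros((n_photos, n_photos))    # p(i, j): co-placements of i and j
    for cat in categories:
        for i in cat:
            placements[i] += 1
            for j in cat:
                if i != j:
                    co[i, j] += 1
    D = np.ones((n_photos, n_photos))
    for i in range(n_photos):
        D[i, i] = 0.0
        for j in range(i + 1, n_photos):
            denom = min(placements[i], placements[j])
            if denom > 0:
                D[i, j] = D[j, i] = 1.0 - co[i, j] / denom
    return D

# Toy example: three categories over three photos.
cats = [{0, 1}, {0, 1, 2}, {0, 2}]
D = overlap_distance(cats, 3)
```

In the toy example, photos 0 and 1 co-occur in both of photo 1's placements, so their distance is 0, while photos 1 and 2 share only one of two placements, giving a distance of 0.5.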

Results


Categories constructed and time spent

Subjects in both groups formed on average 12 categories. The number of categories constructed varied slightly more in the no context group. Altogether 292 categories were collected from the test subjects: 147 from the context group and 145 from the no context group. Including the page context in the task did not significantly lengthen the time needed for sorting. The complete experiment took on average a little over an hour per participant to complete. Table 1 shows the number of categories constructed by the subjects and the time taken for the entire experiment.

Table 1. Number of categories constructed and time elapsed

  Subject group   No. of categories         Time [min]
                  min  max  mean  std    min  max  mean  std
  Context           8   17    12    3     41   81    62   13
  No context        5   20    12    4     41   85    61   12

Table 2 displays the number of category placements in phase 2, where it was possible to assign a single image to multiple categories. In phase 2 a photograph was assigned on average to 1.25 (context group) or 1.36 (no context group) categories. In 77% of cases the participants of the context group placed an image into a single category; in 21% there were two placements and in 2% three or four placements of one photograph. For the no context group, these figures were 73%, 22% and 4%, respectively. In 1% of cases the subjects in the no context group categorized an image into four or five categories.

Table 2. Number of category placements in phase 2

  Subject group   min  max  mean  std
  Context         100  153   123   21
  No context      103  208   136   30

Types of categories

The categories constructed by the subjects in the context group were predominantly thematic (e.g. categories culture, traffic, travel, and fashion). Several subjects formed a category consisting of photographs of people and simply named it “people”. Some distinguished between common people, celebrities and those who represented some organization. One subject divided the photographs of people into three categories based on the shooting distance (face, upper body or full body). Some categories described the function of the photograph, stating e.g. that they were symbol images or illustrations. The time aspect of photography was present in the several categories named “history” or “historical”.

The subjects in the no context group also constructed categories based on the overall theme of the image. Almost as commonly, categories referred to people, either simply by stating “photographs of people” or by further specifying the number of people (e.g. category group shots), shooting distance (e.g. passport images) or the context of the people (e.g. people at work). Many categories were based on the function of the photographs (e.g. advertising/marketing images). Also frequent were categories based on the objects appearing in the photograph (e.g. buildings).

The category names provided by the subjects were analyzed in order to study the sorting criteria employed. A total of 339 class instances were created from the 292 categories constructed by the subjects. Table 3 shows the distribution of classes in the category names for the context and no context groups. A chi-squared test was conducted to find out whether the difference between the subject groups was statistically significant. The expected frequencies at the sub-level classes were too low for the requirements of chi-squared analysis. At the level of the nine top-level classes, the use of the classes differed between the context and no context groups (p<0.001). Subjects who were provided with the page context used the top-level class Theme much more: the share of thematic classes was over double that of the no context group (51% vs. 25%). The participants categorizing images without context created more categories based on the People (24% vs. 20%), Scene (6.8% vs. 3.1%) or Objects (11% vs. 2.5%) portrayed in the photograph. Figure 1 illustrates the effect of page context on categorization.

Table 3. Percentage distribution of classes in category names provided by the subjects

Figure 1. Percentage distribution of top-level classes by subject group


Multi-class categories

Some categories included more than one basis for categorization and referenced multiple classes from Table 3. The share of these multi-class categories was smaller in the context group than in the no context group (9.5% vs. 21.5%). This difference was statistically significant (p<0.001). Altogether 45 category names (14 context, 31 no context) of the total collected 292 were divided into several classes, resulting in 93 class instances. 42 categories included 2 classes, and 3 categories included 3 classes each. Each multi-class category was divided into an average of 2.07 classes for the context group and 2.06 classes for the no context group.

The category names that were separated into several classes were most often given to categories of photographs of people. Over 75% of all multi-class categories included a People class as one of the classes used in the category name. It was often combined with classes Visual & Photography (category facial shots of people) or Description (lone male). Function classes were present in over 22% of the multi-class categories, most often combined with a People class (character shots of people). Theme was used in 16% of multi-class categories, most commonly with People class (people at work). Object and Scene classes were present in 13% and 11% of all multi-class categories, respectively. Description was used in 27% and Visual & Photography in 22% of all multi-class categories.

Some top-level classes were often combined with another class, while other classes were used mostly on their own. Description was used only in conjunction with another class, while Affective was not used at all in multi-class categories. The share of all instances of class Visual & Photography which occurred in multi-class categories was quite high (71%) as was that of People (55%). The class People was used more often in multi-class categories than either Scene (29%) or Object (26%). Story and Theme were mostly used on their own (7% and 6%). These shares reflect the different natures of the classes, as some were used as standalone descriptors and others mainly modified another class.

Clustering results

The results of the hierarchical cluster analysis are presented in Figures 2 and 3. The analysis was conducted using the subjective similarity data from phase 2. Twelve nodes are drawn, corresponding to the mean number of categories constructed by the subjects. The node labels have been extracted from the category names provided by the subjects. The nodes containing only a single image have been italicized. The cluster analysis was done with the average-linkage method also used by Rorissa and Hastings (2004). This produced better results for the current data sets than the complete-linkage method used by Laine-Hernandez and Westman (2006), Lohse et al. (1990), Teeselink et al. (2000), and Vailaya et al. (1998). The quality of the solution was evaluated by calculating the cophenetic correlation coefficient, which should be close to 1 for a high-quality solution. The coefficient value was 0.92 for the context group and 0.81 for the no context group.
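For illustration, average-linkage clustering and the cophenetic correlation coefficient can be computed with SciPy as follows. This is a sketch using a random stand-in distance matrix, not the study's data or its Matlab implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import squareform

# Random symmetric stand-in for a photo-by-photo distance matrix,
# such as the modified percent overlap distance.
rng = np.random.default_rng(0)
A = rng.random((20, 20))
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)

d = squareform(D)                    # condensed form required by linkage
Z = linkage(d, method="average")     # average-linkage agglomeration
c, _ = cophenet(Z, d)                # cophenetic correlation coefficient
labels = fcluster(Z, t=12, criterion="maxclust")  # cut tree into 12 nodes
```

A coefficient `c` close to 1 indicates that the dendrogram preserves the original pairwise distances well, which is how the 0.92 and 0.81 values above should be read.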


Figure 2. A dendrogram of the test photographs for the context group. 12 nodes are shown.



Figure 3. A dendrogram of the test photographs for the no context group. 12 nodes are shown.


Jaccard's coefficient was used to measure whether the categorizations differed between the groups. The measure has also been used by Lohse et al. (1994) and Rorissa and Hastings (2004) to test the consistency of subjects in image sorting tasks. The value of Jaccard's coefficient for the phase 1 categorizations of the two subject groups was 0.651, and for the phase 2 data it was 0.790. It may be concluded that the categorizations were similar. The context and no context groups share many similar nodes: fiction, portraits, style/fashion, interiors, symbolic and objects. The higher-level clusters have not been named, but in both groups one of the highest-level dividers seemed to be whether or not the photograph was of people (by themselves or in some scene). In the context group, the different surroundings and situations people were depicted in fragmented this division somewhat, as contexts such as travel were judged similar to e.g. interiors. The photographs without any people belong to nodes containing symbolic images, images of interiors, objects/devices and transportation/cars.
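Jaccard's coefficient between two categorizations is commonly computed over the sets of photo pairs placed in the same category. The sketch below assumes that pair-based form, which may differ in detail from the exact measure used in the study.

```python
from itertools import combinations

def same_category_pairs(categories):
    """Set of unordered photo pairs co-occurring in some category."""
    pairs = set()
    for cat in categories:
        pairs.update(frozenset(p) for p in combinations(sorted(cat), 2))
    return pairs

def jaccard(cats_a, cats_b):
    """Jaccard coefficient: shared pairs over all pairs grouped by
    either categorization (1.0 means identical pair structure)."""
    a = same_category_pairs(cats_a)
    b = same_category_pairs(cats_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy example: two categorizations of five photos.
j = jaccard([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}])
```

In the toy example the two categorizations share the pairs (1, 2) and (4, 5) out of six distinct pairs overall, giving a coefficient of 1/3.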

Discussion


Both subject groups created on average 12 image categories, taking an hour to complete the two phases of the experiment. The cluster analysis revealed similarity in the categorization behavior of the two groups. Most of the nodes were based on similarity of general semantic content, which is in line with previous results in image categorization (Mojsilovic & Rogowitz 2001; Rorissa & Hastings 2004). Some nodes also reflected abstract content (e.g. fiction) or image function (e.g. symbolic images). In the context group, the page context of the images was reflected in nodes based on evaluating, for example, whether the object depicted was a technical device or whether a situation shown was staged. It was understandably easier to interpret the reasons behind, for example, various groups being photographed (e.g. work team) from the page context, while subjects in the no context group needed to rely on criteria easily verifiable from the image, such as the number of people depicted (e.g. more than 1 person). These example categories included many of the same images, yet the names given to the categories were distinct. It seems that the inclusion or exclusion of page context had a larger effect on the naming of the categories than on the actual placement of photographs in them. The major difference between the subject groups in the clustering concerned photographs of people. In the no context group these appeared separated from pictures containing inanimate scenes or objects, while in the context group photographs of people in some thematic context (e.g. sightseeing) appeared in corresponding thematic nodes (e.g. travel), in addition to there being nodes specific to people in weaker contexts (e.g. portraits).

This suggests that while magazine image similarity was judged rather similarly with and without page context, the interpretations of, and names given to, the image groupings were more affected by the context. Having the page context available when categorizing magazine images more likely led to categorization on a single facet, whereas having no contextual information present more commonly led to categories combining several criteria. The context also produced categories on a very high semantic level, usually with only a single facet, most likely Theme (50%), People (20%) or Function (10%). Manipulation of the image context could potentially be used to elicit descriptions of images on different levels in the collection annotation phase, as having context available seems to make descriptions of the overall theme or story more likely, while withholding context seems to guide towards describing the people, objects and scenes depicted. It is interesting to note that including the page context did not lengthen the time subjects took to finish the tasks, so the differences cannot be attributed to the time spent.

Subjects stated that they had looked at the page context rather automatically. Some read parts of the articles and looked at other photographs on the page (if any), others simply skimmed the captions. The majority of the subjects said that they felt the context had affected the categories they constructed. They seemed to use the text mainly to gather more knowledge about the subject of the photograph and the reason for the photograph having been taken and published. One subject stated that she searched for information about the image from the text, wanting to find out the meaning of the image through the text. Another subject said that some photographs and publishing choices could not be understood without the text but the text “returned the function” into the photograph. One subject went as far as stating that she could not have done the classification without the text. This was echoed in some cases where the subjects in the no context group said that they would have needed to verify from text (which was not available in this group) if an image was from real life or e.g. a movie production (a few of the test images were in fact screen-shots from films) before deciding how to sort it.

Altogether it was clear that the subjects used the text to interpret the content and meaning of the images. Only two of the 12 subjects in the context group said they had made a conscious decision not to look at the page context, and one of the two admitted that she had not managed to avoid gathering information from it. Text was seen as something which anchored (Barthes 1977), explained (Marsh & White 2003) and elaborated or extended (Martinec & Salway 2005) the meaning and purpose of the images. The particular type of relationship between the text and image (e.g. their relative status) did not seem to function as a deciding factor in categorization; rather, it was the additional information acquired through the interpretation of the image context at large. One subject called this process “finding keywords”, reflecting the terminology of annotation processes. In the extreme case, the information used in the categorizations would be present in the text as the actual category names appearing in the articles or captions. The question then becomes whether, for example, the theme of an image could be extracted from the text accompanying it (Barnard & Forsyth 2001). The caveat here is maintaining a distinction between categorizing the images and the text, as their relationships can be varied and are not known a priori. The extraction of textual information could be anchored by visual image feature analysis, increasing the accuracy with which the correct image categories are detected.

Conclusions


This study was a first attempt at discovering the effects of contextual factors on image categorization. We believe it contributes to the understanding of how context affects the categorization of images. The results obtained here need to be verified and supplemented with a study using a wider sample of images, both from different magazines in the genres used and from other magazine and photojournalistic image genres. This study showed that humans also use contextual information, i.e. text and other images on the page, to interpret magazine image content. This indicates good possibilities for image retrieval approaches which employ text mining techniques in conjunction with content-based algorithms to discover the topic of journalistic images. The effect of page context on the names assigned to image categories has implications for the design of interfaces for the post-production annotation of journalistic images. Page context could be included when unified categorization on the level of theme is needed, and withheld when detailed descriptions of image function or of the people, scene and objects present in the image are required. Together with other research on image categorization, the results of this study provide information on the types of attributes human annotators use to categorize magazine images and to name these categories, thus guiding research on the types of (visual and textual) image information to exploit in image search applications.

Acknowledgements


The authors wish to acknowledge the support of The National Technology Agency of Finland for this research project. We would also like to thank Professor Pirkko Oittinen for her constructive comments.

References

  • Barnard, K. & Forsyth, D. (2001). Learning the Semantics of Words and Pictures. Proceedings of the Eighth International Conference on Computer Vision (ICCV'01).
  • Barthes, R. (1977). Image-Music-Text. London: Fontana.
  • Greisdorf, H. & O'Connor, B. (2002). What do users see? Exploring the cognitive nature of functional image retrieval. Proceedings of the 65th Annual Meeting of the American Society for Information Science and Technology.
  • Hall, S. (1981). The Determinations of News Photographs. In Cohen, S. & Young, J. (Eds.) The Manufacture of News: Social Problems, Deviance and the Mass Media. London: Constable, 226–243.
  • Laine-Hernandez, M. & Westman, S. (2006). Image Semantics in the Description and Categorization of Journalistic Photographs. Proceedings of the 69th Annual Meeting of the American Society for Information Science and Technology.
  • Marsh, E. E. & White, M. D. (2003). A taxonomy of relationships between images and text. Journal of Documentation, 59 (6), 647–672.
  • Martinec, R. & Salway, A. (2005). A system for image–text relations in new (and old) media. Visual Communication, 4 (3), 337–371.
  • Mojsilovic, A. & Rogowitz, B. (2001). A psychophysical approach to modeling image semantics. In B. Rogowitz & T. Pappas (Eds.) IS&T/SPIE Human Vision and Electronic Imaging VI, SPIE vol. 4299, 470–477.
  • Rorissa, A. & Hastings, S.K. (2004). Free sorting of images: Attributes used for categorization. Proceedings of the 67th Annual Meeting of the American Society for Information Science and Technology.
  • Shatford Layne, S. (1994). Some issues in the indexing of images. Journal of the American Society for Information Science, 45 (8), 583–588.