This poster reports a practitioner evaluation of a multifaceted image categorization model designed for journalistic imagery. We show that the model led to consistent categorizations and was evaluated as useful by the participants in the user test.
Research has shown that people evaluate image similarity at the conceptual level, constructing categories based on, for example, the people, animals and inanimate objects depicted; whether the scenes and objects are man-made or natural; and abstract concepts related to emotions, culture and visual elements (Greisdorf & O'Connor 2002; Laine-Hernandez & Westman 2006; Mojsilovic & Rogowitz 2001; Teeselink et al. 2000). Image categorization studies have usually focused on classifying images into single categories, despite the widespread acceptance of the need to categorize information objects on multiple facets. In a previous study (Laine-Hernandez & Westman 2008) we identified different types of image attributes (function, content, content descriptors) together with users' strategies for combining these to create multifaceted categories. Based on this, we constructed a multifaceted categorization model for magazine images. The aim of the present study was to evaluate the categorization model embedded in an image archive application. Expert participants used a specially designed interface to categorize a set of stock images. We analyzed both categorization behavior and subjective evaluations of the model. The goal was to see whether different participants would agree on the categorizations and whether they thought the model served the purposes of image categorization in magazines.
The participants (n=24), aged from 24 to 56 (mean=36.8, std=8.5), had several years of experience in editorial work (mean=8.6 years), image retrieval (8.8) and categorization (4.8). They worked in magazine editorial offices as journalists (n=7), graphic designers (n=3), photographers (n=10) and archivists (n=4). The majority (n=18) were female. Participants were shown three practice images followed by 20 stock photographs in random order, together with the categorization model of 10 main classes (see Table 1) and 72 subclasses. They were given a simulated work task (Borlund 2003) asking them to assign descriptive categories to the photographs so that other employees could find them by browsing the archive. Each photo was to be categorized according to its essential content. After categorizing each image, the participants filled out a post-task survey asking, for example, how well the model fit the image (Likert scale, 1=poorly; 5=well). After the test they were interviewed and asked to evaluate the model, for example by rating how useful each main class was in categorizing images (1=not useful; 5=very useful). We used Rolling's (1981) measure to quantify subject agreement on categorizations at the subclass level, calculating an average value for each image across all subject pairs (24*23 pairs). Based on matching category selections, we evaluated image similarity by conducting multidimensional scaling (MDS) using a distance measure derived from the Dice measure.
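The agreement computation can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: it assumes Rolling's measure in its commonly cited form, 2C/(A+B), where C is the number of subclasses two participants both selected and A and B are the numbers each selected (numerically the same as the Dice coefficient). The subclass labels in the example are hypothetical, and treating the Dice-derived distance as one minus the coefficient is likewise an assumption.

```python
from itertools import combinations


def rolling(a: set[str], b: set[str]) -> float:
    """Rolling's consistency between two selections: 2*|a & b| / (|a| + |b|)."""
    total = len(a) + len(b)
    if total == 0:
        # Assumption: two empty selections are scored as no agreement.
        return 0.0
    return 2 * len(a & b) / total


def mean_pairwise_agreement(selections: list[set[str]]) -> float:
    """Average Rolling's measure over all participant pairs for one image."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two participants")
    return sum(rolling(a, b) for a, b in pairs) / len(pairs)


def dice_distance(a: set[str], b: set[str]) -> float:
    """Assumed Dice-derived distance: one minus the Dice coefficient."""
    return 1.0 - rolling(a, b)


# Hypothetical subclass selections by three participants for one image.
example = [
    {"People/Woman", "Scene/Urban"},
    {"People/Woman"},
    {"People/Woman", "Scene/Urban", "Theme/Fashion"},
]
print(round(mean_pairwise_agreement(example), 3))  # 0.656
```

Because the measure is symmetric, averaging over the 276 unordered pairs of 24 participants gives the same value as averaging over all 24*23 ordered pairs.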
The average number of category placements per image was 8.2 (std=3.7, min=1, max=27). Table 1 displays the number of times each main class was used on average per photo and participant, its share of all categorizations and the average evaluated usefulness of the class in categorizing images. On a 5-point Likert scale (1=completely disagree; 5=completely agree), the participants found that the model fit the images to be categorized rather well overall (mean=3.4). They thought using several categories made the categorization easier (4.3) and improved the categorization result (4.5). The participants thought the model was suitable for categorizing wider collections of images (3.4) and for professional use (3.4). The majority (75%) felt there were too few categories available, while the rest found the number suitable.
Classifier agreement and model fit are visualized in the MDS results in Figure 1. The average classifier agreement for the test images ranged between 0.33 and 0.56; the average across all images was 0.47. The images with the highest Rolling's measure depicted people. Participant-evaluated fit of the model for each test image ranged between 2.87 and 4.00. The correlation between Rolling's measure and participant-evaluated fit was 0.42.
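A correlation of this kind, between per-image agreement and per-image fit ratings, could be computed as in the sketch below. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the function name is illustrative only.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson's r between two equally long sequences (assumed coefficient)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)
```

In this setting `xs` would hold the 20 per-image average Rolling values and `ys` the 20 per-image mean fit ratings.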
Table 1. Main level categories and their use
The categorizations made with the help of the model were multifaceted in 99.6% of the cases. The Function and main content (Theme, People, Object, Scene) categories were thought to be very useful in categorizing photographs. The remaining content descriptor categories were thought to be somewhat less useful, although together they accounted for a fifth of all category selections. The average Rolling's measure of 0.47 means that, on average, a little less than half of the category selections matched within a pair of participants. This may be judged fair for 24 different participants. The categorization model will be further developed based on the results of this evaluation and the feedback given by the participants. The ultimate goal is to improve current categorization practices at magazines to enable image retrieval and reuse by the editorial staff. This work is part of an ongoing effort to create a standard categorization scheme for magazine imagery.