Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Aalto University School of Science and Technology
This paper describes a comparison of categorization criteria for three image genres. Two experiments were conducted, where naïve participants freely sorted stock photographs and abstract/surreal graphics. The results were compared to a previous study on magazine image categorization. The study also aimed to validate and generalize an existing framework for image categorization. Stock photographs were categorized mostly based on the presence of people, and whether they depicted objects or scenes. For abstract images, visual attributes were used the most. The lightness/darkness of images and their user-evaluated abstractness/representativeness also emerged as important criteria for categorization. We found that image categorization criteria for magazine and stock photographs are fairly similar, while the bases for categorizing abstract images differ more from the former two, most notably in the use of visual sorting criteria. However, according to the results of this study, people tend to use descriptors related to both image content and image production technique and style, as well as to interpret the affective impression of the images in a way that remains constant across image genres. These facets are present in the evaluated categorization framework which was deemed valid for these genres.
The ability to represent images in meaningful and useful groups is essential for the ever-growing number of applications and services involving large collections of images, for both professional and personal use. Being central issues for visual information retrieval, subjective image categorization and description have been studied extensively. The subjects of the images in these studies vary from narrow topics such as images of people (Rorissa & Hastings, 2004; Rorissa & Iyer, 2008) or grayscale images of trees/forest (Greisdorf & O'Connor, 2001), to “generic” photographs with extremely varied content including e.g. people, objects and scenery (from a Kodak Photo-CD, Teeselink et al., 2000; from PhotoDisc, Mojsilovic & Rogowitz, 2001). They can also be selected based on a specific image genre such as vacation images (Vailaya et al., 1998), journalistic images (Laine-Hernandez & Westman, 2006; Sormunen et al., 1999) or magazine photographs (Laine-Hernandez & Westman, 2008; Westman & Laine-Hernandez, 2008), and as such may cover a wide range of semantic content.
The contents of the images have an influence on the categorization results. For instance, using photographs of people with the backgrounds removed the participants of Rorissa and Hastings (2004) formed the following main categories: exercising, single men/women, working/busy, couples, poses, entertainment/fun, costume and facial expression. Vailaya et al. (1998), on the other hand, used outdoor vacation images which subjects sorted into the following categories: forests and farmlands, natural scenery and mountains, beach and water scenes, pathways, sunset/sunrise scenes, city scenes, bridges and city scenes with water, monuments, scenes of Washington DC, a mixed class of city and natural scenes, and face images. A comparison of the category labels in these two studies reveals the effect of the material that is being categorized.
The above-mentioned prior research has revealed that people evaluate image similarity mostly on a high, conceptual (as opposed to low, syntactic) level, based on the presence of people in the photographs, distinguishing e.g. between people, animals and inanimate objects as well as according to whether the scenes and objects in the images are man-made or natural, e.g. buildings vs. landscapes (Laine-Hernandez & Westman, 2008; Mojsilovic & Rogowitz, 2001; Teeselink et al., 2000). Laine-Hernandez and Westman (2006) found that people also create image categories based on abstract concepts related to emotions or atmosphere, cultural references and visual elements. Professional image users (journalists) have been shown to evaluate image similarity based on criteria varying from syntactic/visual to highly abstract: shooting distance and angle, colors, composition, cropping, direction (horizontal/vertical), background, direction of movement, objects in the image, number of people in the image, action, facial expressions and gestures, and abstract theme (Sormunen et al., 1999).
Most previous studies have involved a descriptive analysis of the category labels used but have not generated a hierarchical image categorization framework. Laine-Hernandez and Westman (2008), however, employed a qualitative data-based analysis on the category names given by their participants and produced the framework for magazine image categorization shown in Table 1.
The goals of the current study are to:
1.discover which image attributes (similarity criteria) are used when categorizing stock photographs and abstract images, and
2.validate the class framework of Laine-Hernandez & Westman (2008) by studying whether it can be generalized across image genres.
We set out to compare three different genres of images: magazine photographs, stock photographs, and abstract images In the case of magazine photographs, we looked at the results obtained by Laine-Hernandez and Westman (2008), and for stock photographs and abstract images we conducted two new experiments with identical procedure. As in many previous image categorization studies (e.g. Laine-Hernandez & Westman, 2006; Rorissa & Hastings, 2004; Vailaya et al., 1998), we used free sorting to obtain the image categories. The first two image genres (magazine and stock photographs) are more similar to each other − due to the production technique but also in style and semantic content − compared to the third genre (computer-generated graphic abstract images). By selecting these materials we hope to address the potentially different degrees of applicability of the results on image categorization. Some aspects of magazine image categorization behavior may apply to stock photographs but not abstract images, while other aspects might be shared between all three image genres under study.
The study of Laine-Hernandez and Westman (2008) involved two participant groups, expert (staff members at magazines, newspapers, picture agencies and museum photograph archives) and non-expert (engineering students or university employees). Differences were found in the two groups'categorization behavior. For the two new experiments we recruited non-expert participants, and therefore only look at the results of Laine-Hernandez and Westman obtained with non-experts (N=18).
Experiment I – Stock photographs
One hundred stock photographs were chosen at random from a collection of 4000+ colored stock photographs with varied content originally collected from the stock photography service Photos.com. Examples of the test images are shown in Figure 1. The photographs were printed so that the longer side of each photo measured roughly 15 cm. Each photograph was glued on grey cardboard for easier handling.
The participants (N=20, 10 female) were engineering students and university staff. Their age varied from 19 to 55 (mean 27). They were rewarded for their effort with movie tickets.
The participant was handed the test images in one random-ordered pile. The written task instruction was:
Sort the images into piles according to their similarity so that images similar to each other are in the same pile. You decide the basis on which you evaluate similarity. There are no ‘correct’ answers, but it is about what you experience as similarity. You decide the number of piles and how many images there are in each pile. An individual image can also form a pile. Take your time in completing the task. There is no time limit for the experiment.
The participant was also told that there would be a discussion about the piles after the completion of the task. After completing the sorting task the participant was asked to explain the basis on which they has formed each pile, i.e. to describe and name them.
Experiment II – Abstract/surreal images
Test image candidates were collected from various websites (caedes.net, deviantart.com, digitalart.org, flickr.com creative commons, sxc.hu). All the collected images were keyworded as “abstract” or “abstract/surreal” on these sites. The images were vector art, 3D rendered art, fractal art, computer desktop backgrounds etc. We chose images based on high user-rating if provided by the website. In order to be able to produce decent-quality prints, we required that the smaller side of the image be at least 500 pixels and the longer side at least 800 pixels. We looked for color images that had visually complex, abstract content, but also images with semi-representative content (e.g. 3D modeled images) were included in order to make the test image collection varied within the genre. We collected approximately 250 test image candidates, and then selected the final one hundred test images from those at random. Examples of test images are shown in Figure 2.
The images were printed at a resolution of 180 dpi, with the length of the longer side of the images varying between 11.2 and 12.7 cm. The images were glued on grey cardboard for easier handling. Image orientation was marked with a small circle drawn under each image denoting bottom, and all images were handed to the participant in correct orientation.
The participants (N=20, 9 female) were recruited through engineering students' newsgroups. Their age varied from 20 to 38 (mean 24). They were rewarded for their effort with movie tickets.
The first part of the procedure was identical to the procedure described for Experiment I. Category names were not required for single image categories due to expected difficulty in describing these categories.
After sorting the images and describing the resulting categories, 9 of the participants (4 female) carried out an additional abstractness/representativeness evaluation task. The evaluation procedure was taken from the context of image quality evaluation (Leisti et al., 2009). The test images were placed in one pile in random order. A paper rating scale with numbers from 1 to 7 appearing at equal intervals as well as the words “abstract” and “representative” written in the ends of the scale was placed on a table. The participant was first asked to select the most abstract test image and place it on number 1 (“abstract”). The participant was then asked to select the most representative test image and place it on number 7 (“representative”). After that, the participant was asked to place the remaining images on numbers 1 to 7 so that the representativeness grows linearly from 1 to 7. A mean opinion score was calculated from the participants' evaluations for each test image.
Qualitative analysis of category names
The category names or descriptions given by the participants in Experiments I and II were analyzed qualitatively by the authors and assigned to the main and subclasses shown in Table 1. The reader should note that in this paper, the terms “category” and “class” carry distinct meanings. Category refers to an image group created by a participant in an experiment. Class refers to an instance of category names coded according to the categorization framework developed by Laine-Hernandez & Westman (2008). If the category included references to multiple classes (i.e. multiclass category), two or more class instances were created from the name. For example, if the category was described as “Ocean-themed, blue images”, the instances “ocean-theme” (Theme – Misc. theme) and “blue” (Visual − Color) were separated.
The same procedure was employed by Laine-Hernandez and Westman (2008) and the results obtained here will be compared to those. Data from the second phase of the sorting procedure in Laine-Hernandez and Westman (2008) is disregarded here, as the results regarding the use of classes were determined by the first part of the experiment, identical to the procedure in Experiments I and II.
For purposes of statistical comparisons, the distributions to be compared were tested for normality using the Shapiro-Wilk test. If they were normally distributed, parametric tests for two or more independent samples (t-test or one-way analysis of variance (ANOVA), respectively) were applied. If they were not normally distributed, we used non-parametric tests for two or more independent samples (Mann-Whitney U test or Kruskal-Wallis test).
Multidimensional scaling (MDS)
The sorting data from Experiments I and II was also visualized in Matlab (The MathWorks) using non-metric multidimensional scaling. MDS is a multivariate data analysis method that seeks to express the structure of similarity data (given in the form of a distance matrix) spatially in a given (low) number of dimensions. The Standardized Residual Sum of Square (STRESS) value is used to evaluate the goodness of fit of an MDS solution. A squared stress, which is normalized with the sum of 4th powers of the inter-point distances, was used in the analysis. A stress value <0.025 has been considered excellent, 0.05 good, 0.10 fair and >0.20 poor (Kruskal, 1964). These are only rough guidelines and not applicable in all situations as the stress value depends on the number of stimuli and dimensions. When the number of points (here images) is much larger than the number of dimensions of the space they are represented in, as in our case (100 images in a two-dimensional space), higher stress values may be acceptable (Borg & Groenen, 2005). In practice stress values below 0.2 are taken to indicate an acceptable fit of the solution to the similarity data (Cox & Cox, 2001).
In order to obtain the MDS solution the data from each experiment was first converted to an aggregate dissimilarity (distance) matrix as follows. The percent overlap for each pair of photographs i and j was calculated as the ratio of the number of subjects who placed both i and j in the same category to the total number of subjects. The percent overlap Sij gives a measure of similarity, which was then converted to a measure of dissimilarity: δij = 1 − Sij. MDS was performed on these dissimilarity matrices.
Correspondence analysis is an exploratory multivariate statistical technique used to analyze contingency tables, i.e. two-way and multi-way tables containing some measure of correspondence between the rows and columns. We used correspondence analysis to analyze the usage of subclasses in Experiments I and II. This was done after separating the multiclass categories into two or more categories. The image sorting data by itself was not used in the analysis, but only the connections between images and subclasses. Only those subclasses were included that were connected to at least two images. The contingency table was formed by assigning subclasses (C) to columns and images (I) to rows. The value in cell (Ci,Ij) was determined by the number of times subclass Ci was used as category description for image Ij.
Correspondence analysis (CA)
CA was carried out with R (R Development Core Team, 2010), an environment for statistical computing, using the package FactoMineR (Husson et al., 2008). The results were further visualized using Matlab.
Results: Number of categories
The numbers of categories formed for different image genres are listed in Table 2. The amount of test images was the same (100 images) in all three experiments. For magazine photographs, participants created 298 categories (out of which 2 were excluded from further analysis because of ambiguous meaning) in total, for stock photographs 352 categories and for abstract images 419 categories (out of which 2 were excluded from analysis because of ambiguous meaning, and 68 because they were single-image categories which needed not be named by the participant). According to a one-way ANOVA there was no statistically significant difference between the three experiments in the number of categories formed.
Out of the magazine photograph categories, 33.8% included two or more bases for sorting, resulting in a total of 409 class instances. Magazine photograph category names contained 1.38 classes on average. Of the stock photograph categories, 23.6% were multiclass categories, resulting in 442 class instances. Stock photograph category names contained an average of 1.26 classes. Of the analyzed abstract image categories, 38.1% were multiclass categories, and resulted in a total of 507 class instances. Abstract image category names contained 1.45 classes on average.
It was also possible to form categories of single images. For magazine photographs 16% of the categories consisted of one image, for stock photographs 19% and for abstract images 21%. According to a Kruskal-Wallis test there was no significant difference between the three experiments in the number of single-image categories formed.
Table 2. Number of categories for different image types
Time spent on sorting
With stock photographs the participants spent on average 16 minutes (min 7, max 32, sd 7) sorting the images (excluding the time it took to name the categories), with magazines photographs 24 minutes (min 9, max 45, sd 11), and with abstract images 33 minutes (min 13, max 76, sd 18).
According to Mann-Whitney U tests the difference between stock photographs and the other two image genres was significant (p<0.02), but the difference between magazine photographs and abstract images was not significant (p=0.11).
Category types: Category descriptions
For magazine photographs, according to Laine-Hernandez and Westman (2008) “non-expert participants often formed categories on various semantic levels instead of a controlled categorization on a thematic level.” Category descriptions often contained named objects, scenes and aspects related to the people in the images. Categories were also formed based on affective aspects, i.e. emotional impact or the mood interpreted in the photograph.
Stock photograph categories mostly referred to the main semantic content of the images − often using just one word (e.g. animals, buildings, portraits, food) or a combination of two simple criteria combined (e.g., close-ups of objects, modern cityscapes, people working, nature details).
Most of the category descriptions for abstract images were more complex than the ones for stock photographs. The participants did not know beforehand that they would have to name the categories (just that they would have to discuss them in some way), so they probably grouped images based on some feeling or intuition, resulting in category descriptions such as: “A feel of realism, but also artistic mess; a combination of a drawing and realism”, or “3D modeled, the same mood, a very coherent group, spheres”, or “bright colors, they remind me of stereotypical computer graphics, could be (computer) desktop backgrounds, rich, lots of details”. On the other hand, there were also many concise category names targeting various semantic levels (space, modern art, graffiti, flowers, landscapes, spheres, splashes, spirals etc.).
Main class usage
The percentage distributions of main class usage for magazine photographs, stock photographs and abstract images and are listed in Table 3 and visualized in Figure 3. The exact usage percents of the main and subclasses for each image genre are also listed in the table in Appendix 1. In order to make the comparison between photographs and non-photographs (i.e. abstract images) possible, the main class Photography was extended to include aspects related to production technique (e.g. drawing, painting, comic-like).
For magazine photographs main classes People (25.2%), Theme (22.2%), Object (12.2%) and Scene (10.3%) were used the most. The least-used classes were Visual (2.0%) and Function (3.2%). Stock photographs were also categorized mostly based on their main semantic content. The main classes that were used the most were Object (23.1%), Scene (18.6%), Theme (16.5%) and People (15.8%). The four most-used main classes were the same (although not in the same order) for magazine and stock photographs. The least used classes for stock photographs were Affective (1.4%) and Visual (2.0%).
Table 3. Percentage distribution of main classes for different image genres. Significant differences between all three image genres are marked with ***. Significant differences between abstract images vs. magazine and stock photographs are marked with **. Significant difference between magazine and stock photographs is marked with *
In contrast to magazine and stock photographs, abstract images were categorized the most based on visual features. The most used main classes were Visual (38.1%), Object (20.1%), Function (9.9%) and Description (8.7%). The least used classes were People (1.6%) and Story (2.6%).
According to a Kruskal-Wallis test the usage of the following three main classes did not significantly differ between the three experiments: Affective, Description and Photography. For the rest of the main classes we used the Mann-Whitney U test for two independent samples to see where the pairwise differences occurred. For classes Function, Story, Theme and Visual there was no significant difference between magazine and stock photographs, but the abstract images differed from the other two (Function p=0.050, Story p<0.007, Theme p<0.001, Visual p=0.000). For class Object the only significant difference occurred between magazine and stock photographs (p=0.018), and for class Function abstract images differed from both stock photographs (p=0.050) and magazine photographs (p=0.044). For classes People and Scene the differences were significant between all three images genres (p<0.001).The usage of the main classes was thus most similar between magazine and stock photographs; for seven out of ten main classes there was no significant difference.
We also analyzed for how many images (out of 100) each main class was used for each image genre. The results are depicted in Figure 4. The classes that were used to describe the most images across all genres are Description (80% of all images), Theme (74%), Function (71%) and Photography (71%).
A visualization of the two-dimensional MDS solution for stock photographs is presented in Figure 5. The stress value for the solution is 0.07. In the MDS several clusters of images can be identified, e.g. photographs of people, urban scenes, nature scenes, animals, non-living objects and food photographs. The diagonal axes of the solution seem to differentiate between 1) images with people and images without people, and 2) whether the images depict objects or scenes.
A visualization of the two-dimensional MDS solution for abstract images is presented in Figure 6. The stress value of the solution is 0.17. This means that inter-categorizer agreement was lower for abstract images than for stock photographs.
Dark images seem to be placed in the bottom-left corner of the MDS visualization. We therefore converted the images to the CIELAB color space (Fairchild, 2005) and calculated their mean lightness. We then calculated the Pearson correlation coefficient r between the mean lightness and MDS x+y values of the images which resulted in r=0.756 (p<0.001), indicating strong correlation. Representative images seem to be located in the top left part of the MDS visualization. We therefore calculated the correlation between the abstractness scores (calculated as the mean of the participants' evaluations) and MDS y−x coordinates of the image resulting in r=0.772 (p<0.001), also indicating strong correlation.
Correspondence analysis results
The result of the correspondence analysis for the stock photographs is shown in Figure 7. Dimension 1 (horizontal axis) explains 14.3% of the total inertia (variance) of the data and dimension 2 (vertical axis) explains 13.5%. The two dimensions together then account for 27.8% of the inertia.
The subclasses that correlate positively the most with dimension 1 are People − Person (r=0.87) and People − Portraits (r=0.74), and the ones that correlate negatively Description- Property (r=−0.56) and Scene – Landscape (r=−0.53). The subclasses that correlate positively the most with dimension 2 are Object- Buildings (r=0.69) and Scene- Cityscape (r=0.67), and the ones that correlate negatively Scene- Nature (r=−0.70) and Scene- Landscape (r=−0.55). All these correlations are statistically significant (p<0.001). Dimension 1 seems to differentiate between images of people and images with no people. Dimension 2 appears to represent the distinction between man-made and natural scenes/objects.
The result of the correspondence analysis for abstract images is shown in Figure 8. Dimension 1 explains 17.0% of the inertia and dimension 2 explains 15.4%, adding up to 32.4% of the total inertia. Even though the two dimensions explain more of the variance than the two dimensions for stock photographs, the solution is not as clear to interpret as the one for stock photographs (Figure 7). This is an indication of the complexity of similarity evaluations for abstract images. It is also partially due to the fact that two of the images – the only ones with clearly identifiable human figures – are isolated and the rest of the images are tightly clustered.
The subclasses correlating positively the most with dimension 1 are Photography – Technique (r=0.70) and Function – Advertisement (r=0.52), and the ones that correlate negatively Theme – Misc. theme (r=−0.76) and Photography – Distance (r=−0.71). The subclasses that correlate positively the most with dimension 2 are People – Person (r=0.72) and Scene – Nature (r=0.43), and the ones that correlate negatively Object – Non-living (r=−0.66) and Description – Property (r=−0.52). All these correlations are statistically significant (p<0.001). The semantic aspects represented by the two axes are not as clear as in the case of stock photographs, but subclasses People – person and Description – property can be found in the ends of both dimension 1 of stock photographs and dimension 2 of abstract images.
Magazine photograph category names contained on average 1.38 classes, stock photograph category names 1.26 classes, and abstract image category names 1.45 classes. These findings are roughly in line with Jörgensen's (1995) results, where one third of image group names were composed of multiple (on average 1.5) terms. There is, however, some variation, which could be interpreted as an indication of the complexity of categorization criteria. The time taken to sort the images can be interpreted to correlate with the difficulty of grouping the test images into coherent categories. Taken together, these two findings indicate that image categorization was the most simple for stock photographs and the most complex for abstract images. The more complex the categorization is, the more difficult it is to represent the criteria employed to perform it (e.g. in the form of a categorization framework).
Dimensions of similarity evaluations
The dimensions of image similarity criteria as revealed by MDS were different for stock photographs and abstract images. The result for stock photographs was largely similar to the results reported in earlier studies (Laine-Hernandez & Westman, 2008; Rogowitz et al., 1998; Teeselink et al 2000). The result for abstract images, however, revealed entirely different bases for categorization: the lightness vs. darkness and abstractness vs. representativeness of the images. These dimensions are not about the main semantic content of the images, but describe their visual and stylistic/representational aspects.
For stock photographs, the correspondence analysis conducted using sublevel class information revealed very similar axes as the MDS conducted by Rogowitz et al. (1998): man-made (buildings & cityscape) vs. natural (landscape & nature) and more human-like (person & portraits) vs. less human-like (property & landscape). Considering that the image sorting data was not used at all in the correspondence analysis, the result shows that the categorization framework of Laine-Hernandez and Westman (2008) is capable of accurate characterization of stock photographs when compared to user-evaluated image similarity. For abstract images the result was not as clear, indicating that the framework is perhaps less suitable for non-photographs and some other image genres.
Additions to the categorization framework
During the data analysis of Experiments I and II, a few possible additions to the categorization framework (Laine-Hernandez & Westman 2008) emerged and were used in the data analysis for stock photographs and abstract images. Most notably, the addition of the subclass Object- nature objects would logically fill the gap left between subclasses non-living and animals. The subclass includes plants, flowers etc. and was used in both experiments. Because of the occurrences in stock photograph category names, the theme technology was extended to include industry. In addition, especially for abstract images but to a lesser extent also for stock photographs, the function art images appeared in the category names. Furthermore, visual impression (e.g. clear, fuzzy) was used as a categorization criterion for abstract images. The addition of these subclasses does not change the results of the comparisons between the three image genres, because the main class level remained unmodified.
Applicability of the framework for image categorization
Based on our results on main class usage, the categorization framework developed for magazine photographs (Laine-Hernandez & Westman, 2008) can be extended to apply to stock photographs. For seven out of ten main classes there were no significant usage differences. The only main classes with significant differences were People, Scene and Object. Since these classes describe the main semantic content of the images, it is not surprising that their usage differs between different image genres and even between different image collections within genres. It should, however, be noted that despite the differences, these three classes were all among the four most-used classes for both magazine and stock photographs.
Abstract images differed more from the two photograph genres; only three out of ten main classes did not show a statistically significant difference across all three image genres: Affective, Description and Photography. The usage of these classes was at best moderate if evaluated by the percentage usage of them in the category descriptions. However, taking into account the number of images they were used to categorize, classes Description,Function and Photography were among the four most-used main classes if averaged over the three image genres. This shows that they form important and widely applicable bases for image categorizations across image genres.
The largest differences across all three image genres occurred in the use of main classes People,Scene and Visual. Classes People and Scene were seldom used in the categorization of abstract images. As mentioned above, the differences can be explained by the subject matter of the test images. Of the 100 stock photographs, 38 both depicted people as the main subject matter and were used by the participants as sorting criteria. The study of Laine-Hernandez and Westman (2008) included photographs from five different magazine types with each of them including photographs of people, one of them “predominantly” so. In contrast to this, only 2 of the 100 abstract images contained people depicted clearly enough for the participants to use them as sorting criteria, and these two images were strongly categorized based on this aspect (evident also from the correspondence result in Figure 8). This demonstrates again the importance of the presence of people in images. Class Visual, on the other hand, was used rarely in the categorization of photographs but in the case of abstract images it was by far the most used class. Visual aspects are therefore important in image categorization across genres.
This paper presented the results of two subjective image categorization experiments and a comparison between them as well as an earlier study on the categorization of magazine photographs. We found that the image categorization criteria for magazine and stock photographs are fairly similar, while the bases for categorizing abstract images differ more from the former two. According to the results of this study, the image categorization framework developed for magazine photographs can fairly well be generalized to apply to different image genres, and especially well to stock photographs. The categorization framework is detailed enough to be able to characterize the contents and similarity of stock photographs.
This study was funded by the Academy of Finland and the National Technology Agency of Finland.
APPENDIX 1 Percentage distribution of main and subclasses in the three experiments. New additions to the class hierarchy in italics.