In the eye of the beholder: Is color classification consistent among human observers?

Abstract
Colorful displays have evolved in multiple plant and animal species as signals to mutualists, antagonists, competitors, mates, and other potential receivers. Studies of color have long relied on subjective classifications of color by human observers. However, humans have a limited ability to perceive color compared to other animals, and human biological, cultural, and environmental variables can influence color perception. Here, we test the consistency of human color classification using fruit color as a model system. We used reflectance data of 67 tropical fruits and surveyed 786 participants to assess the degree to which (a) participants of different cultural and linguistic backgrounds agree on color classification of fruits; and (b) human classification to a discrete set of commonly used colors (e.g., red, blue, green) corresponds to natural clusters based on light reflectance measures processed through visual systems of other animals. We find that individual humans tend to agree on the colors they attribute to fruits across language groups. However, these colors do not correspond to clearly discernible clusters in di- or tetrachromatic visual systems. These results indicate that subjective color categorizations tend to be consistent among observers and can be used for large synthetic studies, but also that they do not fully reflect natural categories that are relevant to animal observers.

Prior to and following the advent of spectroscopic advances in color quantification (Bennett et al., 1994; Endler, 1990; Valenta et al., 2018; Vorobyev & Osorio, 1998), many studies on the ecological and evolutionary relevance of color relied on subjective, human categorizations of color (Brodie, 2017; Burns et al., 2009; Lu et al., 2019; Onstein et al., 2019, 2020; Sinnott-Armstrong et al., 2018). Critiques of these approaches note that most animals do not share human color vision phenotypes, and therefore, these assessments are at best unreliable, and at worst irrelevant (Cronin et al., 2014; Kemp et al., 2015; Valenta et al., 2018). The majority of humans are trichromats, possessing three different types of cones.
The human trichromacy phenotype is exceedingly rare across the mammalian Class and is shared only with our closest relatives, the diurnal catarrhines of the Order Primates (Jacobs, 2008a). Most other mammals are dichromatic, making them unable to chromatically distinguish between greens and reds (Jacobs, 2008b), while most birds are able to discern color across a wider range of the spectrum than mammals (Osorio & Vorobyev, 2008). Given the limited ability of humans to detect color relative to most diurnal, nonmammalian species, it is nearly guaranteed that humans cannot perceive the full range of color signals in nature and instead can only detect a subset (Bergeron & Fuller, 2018). In many cases, the human perception of color likely differs substantially from the way that color is perceived by the intended signal recipient, whether they be an antagonist or mutualist (Cronin et al., 2014; Ruxton et al., 2019). This is particularly true for color signals that exist outside of human perception limits, for example, in the range of ultraviolet reflectance (~300-400 nm), which humans are incapable of detecting (Honkavaara et al., 2002).
In addition to variation in color vision across the animal kingdom, subjective human assessments of color can be confounded by several factors that are difficult to control for. There is extensive literature on the relationships between language and color perception and categorization (Goldstein et al., 2009; Lindsey & Brown, 2019; Martinovic et al., 2020; Roberson & Hanley, 2007; Thierry et al., 2009; Witzel, 2019). Cross-linguistic studies have found evidence that color categorization is strongly linked to language (Athanasopoulos et al., 2010) and, within a given language, can be affected by cultural variation (González-Perilli et al., 2017). This is particularly problematic for studies that integrate subjective color descriptions collected across geographical areas with speakers of different languages (Brodie, 2017; Sinnott-Armstrong et al., 2018). Further, although most humans are trichromatic, there is compelling and recent behavioral and molecular evidence that some human females are functionally tetrachromatic, with an ability to distinguish chromatic variation beyond what is normally observed in humans (Jordan & Mollon, 2019). More commonly, approximately 2% of human males are thought to lack one of the three cone types permitting trichromacy, and these human dichromats vary in their color perception compared to both trichromats and to other dichromats with different photoreceptor phenotypes (Álvaro et al., 2015).
Thus, even among humans, subjective assignments of colors can vary: colors may not be consistently described by different observers, due to biological, linguistic, and cultural variation that may drive human color perception and categorization.
In this study, we used a free, publicly available surveying platform to investigate the consistency of human color perception and categorization, using images of wild fruits as a model system. We showed 67 images of wild fruits for which we obtained spectroscopic data to 786 volunteers from across the globe and asked them to classify each fruit as one of ten colors that are commonly used in seed dispersal literature (red, orange, yellow, green, blue, purple, pink, brown, white, and black). We first test to what degree participants agree on color classifications, across the sample and between speakers of different languages. We then test whether human classification corresponds to natural clusters of fruit color in the eyes of nonhuman observers using reflectance spectra of the same fruits to model their quantum catch in the eyes of a dichromatic mammal and a tetrachromatic bird. In both, we first look at the classification to the ten commonly used categories (red, orange, yellow, green, blue, purple, pink, brown, white, and black) and then look at two higher-level divisions that have proposed ecological or evolutionary significance to fruit ecology: conspicuous versus cryptic colors, and colors associated with bird or mammal dispersal. Conspicuous colors are identified in the literature as colors that contrast with background foliage, for example, red, yellow, orange, whereas cryptic fruits are those that do not contrast with background foliage, for example, green, brown, based on a human trichromatic phenotype (Melin et al., 2008). We report that, in most cases, participants show a high degree of agreement on color categorization, but that there are discrepancies among participants, especially between speakers of different languages. 
Moreover, we find that reflectance spectra of the fruits classified to separate color categories by human observers show a high degree of overlap once processed through a nonhuman animal visual system, indicating that human color categories do not correspond well to the color clusters perceived by nonhuman observers.

| METHODS
We assembled a photographic database of 67 wild fruiting species from two sites in Madagascar (Ranomafana National Park, Ankarafantsika National Park) and one site in Uganda (Kibale National Park). Photographs were collected from the authors' own photographic databases of fruits. For all fruits in the database, we had previously measured their reflectance in the field, relative to a Spectralon white reflectance standard (Labsphere, North Sutton, NH), using a Jaz portable spectrometer and a PX-2 pulsed xenon lamp (Ocean Insight, Orlando, FL) emitting a D-65 light source. The fruit scanning angle was fixed at 45° using a probe holder, and external light was blocked using thick black fabric. Each fruit was scanned 3-5 times, and the resulting spectrograms represent mean spectral reflectance. See Valenta et al. (2018) for detailed methods. The list of species is available in the supplementary materials.
Using freely available Google Forms software, we generated an online survey that had participants view a photograph of a single fruit species and select one of ten colors that best described the target fruit: red, orange, yellow, green, blue, purple, pink, brown, white, and black (online supplementary materials). Because the photographs often portrayed multiple fruits of the same species, we identified the target fruit with a white circle (Figure S1). Each target fruit and its associated color query was presented on a single page, allowing participants to focus on one target fruit at a time. We selected the ten color categories to coincide with manuscripts that rely on fruit color categorizations for their analyses (Janson, 1983; Onstein et al., 2020; Sinnott-Armstrong et al., 2018). To collect data on additional variables that may influence color perception, we asked each participant to report their age, biological sex, native language, and whether they had any known color vision deficiency (either suspected or medically diagnosed). Because electronic devices can vary in their use of color and light, thereby changing the representation of a given image, we additionally requested information about the electronic device used to complete the survey (type, brand, and year). To exclude the possibility that variance in display types introduced substantial amounts of noise, we repeated all analyses (see below) on a subset of participants who reported using Apple iPhone models from 2017 onwards. These devices are equipped with OLED displays ("true black") and are expected to provide highly comparable, if not identical, color displays. The results were practically identical to the results of the entire dataset (Figure S2), which led us to conclude that differences in display types did not contribute a significant amount of noise. Therefore, to increase the sample size and linguistic diversity of our sample, we used the full dataset in all analyses.
Informed consent was obtained from all participants, and the research was in accordance with relevant guidelines and regulations. All research was approved by the University of Florida Institutional Review Board for research on human subjects (IRB Protocol #202001589).
We circulated the survey online, allowing for an opportunistic, snowball sampling technique, and collected responses between April 23, 2020 and May 7, 2020. We removed all responses by individuals under the age of 18, per the requirements of the ethics review board (IRB, University of Florida). To reduce the potential impact of charging and brightness settings, participants were asked to ensure their viewing device was plugged in and charging and to maximize their device's screen brightness. In total, we analyzed 786 survey responses, of which 20 participants self-reported suspected or diagnosed color vision deficiencies (e.g., red-green colorblindness). We included these individuals in all analyses under the assumption that many observers contributing to published reports of fruit color, especially among field assistants, may not report or even be aware of their color vision deficiencies. Additionally, the results are qualitatively identical even if they are excluded from the analyses.
To test whether color classifications among participants were significantly more consistent than expected by chance, we conducted a randomization test. Using the collected data, we first classified each fruit to a single color based on the plurality vote, that is, the color most commonly attributed to the fruit among all participants, and calculated the "consensus index": the percentage of individuals that assigned a fruit species to the most commonly assigned color. We then simulated 999 randomized datasets that assumed colors were randomly assigned to each fruit. For each simulated dataset, we calculated the percentage of randomized responses that matched the color originally attributed to these fruits by a plurality of participants in the collected data. We then compared the collected classifications to the generated distribution to obtain a p-value. We further used descriptive statistics to estimate the degree of discrepancy between participants. We calculated the percentage of misclassifications (classification of a fruit to a color different from the one determined by the plurality of participants) to determine whether any colors tended to be interrelated in their classifications (e.g., whether fruits with a plurality of red classifications were more likely to be misclassified as orange than as green). We then conducted two analyses that divided the ten colors into two major bins: conspicuous versus cryptic colors, and colors associated with bird versus mammal dispersal.
To assess whether fruits are likely to be similarly classified by nonhuman observers, we used the reflectance data of all 67 fruit species included in the survey. We standardized the data by trimming the reflectance under and above the visual spectrum (400-700 nm), smoothed it using a running average with pavo:procspec (Maia et al., 2019), and converted the reflectance to relative amounts (i.e., standardizing the total reflectance across all samples).
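The consensus index and randomization test described above can be illustrated with a minimal sketch. It is not the code used in the study: it assumes that "randomly assigned" means uniform sampling over the ten colors, it tests a single fruit rather than all fruits of a color, and it applies an add-one correction to the p-value. The function names and the example responses are hypothetical.

```python
import random
from collections import Counter

COLORS = ["red", "orange", "yellow", "green", "blue",
          "purple", "pink", "brown", "white", "black"]

def consensus_index(responses):
    """Return (plurality color, share of responses choosing it) for one fruit."""
    color, count = Counter(responses).most_common(1)[0]
    return color, count / len(responses)

def randomization_p_value(responses, n_sim=999, seed=0):
    """One-sided p-value: how often do randomly assigned colors match the
    plurality color at least as often as the observed responses do?
    Assumption: random assignment is uniform over the ten colors."""
    rng = random.Random(seed)
    plurality, observed = consensus_index(responses)
    n = len(responses)
    at_least = 0
    for _ in range(n_sim):
        simulated = [rng.choice(COLORS) for _ in range(n)]
        if simulated.count(plurality) / n >= observed:
            at_least += 1
    # Add-one correction: the observed data count as one of the datasets.
    return (at_least + 1) / (n_sim + 1)

# Hypothetical responses for a single fruit (not data from the study).
responses = ["red"] * 70 + ["orange"] * 20 + ["pink"] * 10
color, index = consensus_index(responses)
p = randomization_p_value(responses)
```

With 999 simulated datasets, the smallest p-value this sketch can return is 0.001, which is why the number of simulations bounds the reported significance level.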
We then calculated the quantum catch for each photoreceptor for each fruit in two model organisms, representing two common visual systems: dichromatic mammals (dog) and tetrachromatic UV-perceiving birds (average avian pigment sensitivity), assuming homogeneous illumination, transmission, and background using pavo:vismodel. We then used pavo:coldist to calculate the chromatic distance between each pair of fruits in the dog and avian model systems. To test whether clusters are distinguishable, we used PERMANOVA on the quantum catch on each photoreceptor and 999 permutations. To visualize the results and examine to what degree human-defined clusters are apparent in these two visual systems, we conducted a principal coordinate analysis (PCoA) on the resulting distance matrices.
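As a rough illustration of what a quantum catch is, the sketch below multiplies a normalized reflectance spectrum by a receptor sensitivity curve and sums over wavelengths, assuming a flat illuminant. It does not reproduce pavo's visual models: the Gaussian sensitivity curves, their peaks and widths, and both example spectra are invented for illustration, and no receptor-noise chromatic distance is computed.

```python
import math

WAVELENGTHS = list(range(400, 701))  # nm, 1-nm steps over 400-700 nm

def gaussian_sensitivity(peak_nm, width_nm=40.0):
    """Hypothetical bell-shaped receptor sensitivity, normalized to sum to 1."""
    raw = [math.exp(-0.5 * ((w - peak_nm) / width_nm) ** 2) for w in WAVELENGTHS]
    total = sum(raw)
    return [v / total for v in raw]

def normalize_spectrum(reflectance):
    """Scale a spectrum so its values sum to 1 (relative reflectance)."""
    total = sum(reflectance)
    return [v / total for v in reflectance]

def quantum_catch(reflectance, sensitivities):
    """Q_i = sum over wavelengths of R(lambda) * S_i(lambda), flat illuminant."""
    return [sum(r * s for r, s in zip(reflectance, sens))
            for sens in sensitivities]

# Illustrative two-receptor (dichromat-like) system; peaks are assumptions.
dichromat = [gaussian_sensitivity(430), gaussian_sensitivity(555)]

# Two made-up spectra: a step rising at 600 nm and a bump centered at 530 nm.
reddish = normalize_spectrum([0.05 if w < 600 else 0.6 for w in WAVELENGTHS])
greenish = normalize_spectrum(
    [0.05 + 0.5 * math.exp(-0.5 * ((w - 530) / 30) ** 2) for w in WAVELENGTHS])

q_red = quantum_catch(reddish, dichromat)
q_green = quantum_catch(greenish, dichromat)
```

Each fruit is thereby reduced to one number per photoreceptor, so a dichromat represents every spectrum with only two values, which is one reason spectra that humans name differently can end up close together.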

| Agreement among human observers
Fruits were consistently assigned to the same color across survey participants (randomization tests, all ten colors: p < .01). This indicates that colors are not assigned randomly and that the majority of participants agreed on the color of each fruit. At the same time, classifications never reached true consensus: misclassifications (compared to the majority classification) were apparent in all color groups, with high degrees of disagreement over white, pink, orange, red, and purple fruits, as opposed to near consensus in brown, blue, and green fruits (Figure 1). Notably, fruits that were classified as white by half of the participants were classified as green or yellow by the other half.
Fruits were also consistently assigned to the four bins (conspicuous versus cryptic; bird versus mammal), but with differing degrees of consistency (Figure 2). Fruits classified as conspicuous (red, orange, yellow, pink) versus cryptic (green, blue, purple, brown, white, black) were assigned to these bins by 90.2 ± 12.3% and 94.3 ± 10.7% of participants, respectively (Figure 2a). Fruit colors associated with bird dispersal (red, blue, purple, pink, white, black) versus mammal dispersal (orange, yellow, green, brown) were classified as such with an average agreement of 87.4 ± 16.5% and 95.2 ± 11.4%, respectively (Figure 2b). However, disagreement among participants was not negligible, particularly for colors related to the bird dispersal syndrome, in which, on average, 12.6 ± 16.5% of participants classified fruits in such a way that they would be assigned to colors related to a mammalian dispersal syndrome.
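The binning step can be sketched as follows, using the color-to-bin assignments listed above. This is an illustrative reading of the procedure, not the authors' code: the helper name `bin_agreement` and the example responses are hypothetical, and agreement is taken to be the share of participants whose chosen color falls in the same bin as the plurality color.

```python
from collections import Counter

# Bin membership as listed in the text; all other colors fall in the
# complementary bin (cryptic, or mammal dispersal).
CONSPICUOUS = {"red", "orange", "yellow", "pink"}
BIRD = {"red", "blue", "purple", "pink", "white", "black"}

def bin_agreement(responses, in_bin_colors):
    """Share of responses that land in the same bin as the plurality color."""
    plurality = Counter(responses).most_common(1)[0][0]
    plurality_in_bin = plurality in in_bin_colors
    same = sum((c in in_bin_colors) == plurality_in_bin for c in responses)
    return same / len(responses)

# Hypothetical responses for one fruit (not data from the study).
responses = ["red"] * 60 + ["orange"] * 25 + ["purple"] * 15

conspicuous_agreement = bin_agreement(responses, CONSPICUOUS)
bird_agreement = bin_agreement(responses, BIRD)
```

In this made-up case, the orange responses stay in the conspicuous bin with red but move to the mammal dispersal bin, so the same color-level disagreement costs more agreement under one binning scheme than the other.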
Although native speakers of different languages showed a high degree of consistency, some clear discrepancies arose: In the six best-represented languages in our sample, 22.3% of fruits were classified by a plurality of participants to at least two different colors (Figure S3). Classifications to color bins were more consistent across languages because misclassifications tended to include colors in the same bin as the color chosen by the plurality (e.g., purple and black).
For 3% of fruits, speakers of different languages misclassified fruits in such a way as to alter their placement with respect to the conspicuous or cryptic bins. For 7.5% of fruits, this type of misclassification altered their placement with respect to bird or mammal dispersal syndromes.
FIGURE 1 Degree of agreement among participants on fruit color classification. Inner circles represent raw data. Each inner circle is a fruit assigned by at least one participant to a certain color, and the circle size corresponds to the percentage of participants assigning a color to that fruit. Lines connect the most commonly assigned color for each fruit to other colors used to classify the same fruit. Distances between dots within color do not represent real differences in participant classification and are the result of random placement by the algorithm to avoid overlap between points. For example, a large red circle connected to a medium orange and small brown circle indicates that the fruit was classified primarily as red, but with a significant share of misclassifications to orange, and a small minority to brown. The outer section provides summary statistics for each color: The large circles give the percent of participants who agreed with the plurality opinion, and the smaller circles around show the breakdown of the misclassifications.

| Agreement between human observers and animal visual systems
Color classifications by humans showed partial agreement with the clustering patterns of chromatic distances based on the visual systems of a dichromatic mammal (dog) and a UV-perceiving avian.

| DISCUSSION
Our goals were to identify whether human observers consistently classify fruits to the same colors and to assess the degree to which these classifications reflect how nonhuman observers may view the same fruit. Overall, we found that there was a high degree of consistency in color categorizations across survey participants, indicating that, in the human visible spectrum, color assessments made by human observers are generally consistent. However, there was some interesting variation among color classifications. For some colors, for example, green, there was a very high level of agreement, whereas for other colors, for example, white, categorization was less consistent.
Although fruits were consistently categorized to the same color, subsequent binning of those colors into ecologically relevant categories, as was done in previous studies (Onstein et al., 2020; Sinnott-Armstrong et al., 2018), revealed interesting variation. In particular, human assessments of fruit colors related to a bird dispersal syndrome were not fully consistent: Across all four bins, the greatest discrepancy resulted from participants classifying fruits in such a way that they would be categorized with mammal dispersal colors, whereas the fruit's plurality color placed them in the bird dispersal bin. Perhaps unsurprisingly, this indicates that, as mammals, humans may be less reliable at identifying color signals that are associated with bird dispersal (e.g., white), compared to signals associated with mammal dispersal (e.g., green). In addition, across different languages, discrepancies were also greater after fruits were assigned to dispersal syndromes. Thus, although subjective single-color categorizations might be consistent, their subsequent interpretation may introduce a non-negligible amount of noise.
Comparison of human color classifications and the quantum catch of the reflectance data of the 67 species in a dichromatic mammal and a tetrachromatic bird revealed both the validity of human classifications and their limitations. Among birds, whose color discrimination capacities surpass humans', some of the more common colors (red, green, yellow) formed clear clusters that were statistically distinguishable from most other human-classified colors. At the same time, some were not, and all showed substantial overlap with other colors, indicating that a high degree of consistency in human color perception does not guarantee that the resulting categories are distinct to other observers. For example, fruits classified as green by humans can, in the eyes of birds, be closer to fruits classified as white, brown, or blue by humans. Nonetheless, the fact that many colors were distinguishable, and that the main functional bins were also clearly distinct, is an indication that human classification, while not noiseless, is a reasonable proxy for chromatic variance.
FIGURE 2 Classification of fruits to major bins: (a) conspicuous versus cryptic, and (b) bird versus mammal dispersal syndromes. Conspicuous: red, orange, yellow, pink. Cryptic: green, blue, purple, brown, white, black. Bird: red, blue, purple, pink, white, black. Mammal: orange, yellow, green, brown. Each circle is a fruit classified into either bin. Circle size corresponds to the share of participants who classified the fruit to the bin, and lines connect the same fruit if classified by at least some participants to the other bin. Location within a bin (e.g., "Bird") is meaningless. Dots were jittered around the center of each category to visualize the variance in agreement among participants for each individual fruit.
An additional consideration is that participants in the survey viewed fruit photographs on different devices and screens, and the resulting variation in hue, saturation, and chromaticity could have influenced our results. While we cannot exclude that this introduced some noise, we believe that it has little effect on our results for several reasons. First, this noise is expected to reduce interparticipant agreement, thus making the analysis even more conservative and strengthening our conclusion that participants show a high degree of agreement. Second, our analysis of a subset of the participants who used very similar devices reproduces the results, thereby indicating that display types did not play a major role in subjective color classification. Third, potential variation introduced by differences in viewing devices likely pales in comparison to that introduced by the myriad light and viewing conditions in the field, particularly in tropical forests where our fruit samples were taken (Endler, 1993; Yoshimura & Yamashita, 2012). Furthermore, many field studies rely on observers to categorize the color of ripe fruits and to identify which fruits are ripe among a possible array of fruits at different developmental stages (Brodie, 2017; Burns et al., 2009; Lu et al., 2019; Onstein et al., 2019; Sinnott-Armstrong et al., 2018). For example, in the field, an observer may identify a species that, from their perspective, transforms from an unripe green state, through yellow, to orange, to red, to black. The observer may decide that the orange or red stage represents the color at ripeness, whereas the black stage [...]
[...] between human observers is likely to be low. However, studies relying on subjective color classifications should exercise caution since human color categorizations can be affected by physiological, linguistic, and/or cultural biases and do not correspond well to the colors perceived by animals with different color vision phenotypes.
FIGURE 3 Chromatic distances between fruit in two visual systems. Principal coordinate analysis of chromatic distances based on quantum catch in dogs (dichromatic mammals) and an average UV-perceiving avian. Each dot is a single fruit. Dot colors represent the human plurality classification.
[...] Figure 3. In all graphs, the x-axis is PCoA1, and the y-axis is PCoA2. Functional bins are classified based on commonly used classifications: bird (black, white, red, pink, purple, blue), mammal (green, brown, orange, yellow); cryptic (brown, black, green, blue, white, purple), conspicuous (red, orange, yellow, pink). In the top panel, blue dots represent colors binned to a bird dispersal syndrome, and yellow dots represent colors binned to a mammal dispersal syndrome, based on the plurality opinion. In the bottom panel, red dots represent conspicuous colors, and green dots represent cryptic colors, again based on the plurality opinion.

ACKNOWLEDGMENTS
ON gratefully acknowledges the support of iDiv funded by the German Research Foundation (DFG-FZT 118, 202548816) and grant nr. NE 2156/3-1 by the German Science Foundation.

CONFLICT OF INTEREST
KV, SLB, and ON declare that publication of this manuscript will enhance their CV and publication statistics and thus improve their academic prestige and employment status. Y-DJ has no conflict of interest to declare.