Is color data from citizen science photographs reliable for biodiversity research?

Abstract Color research continuously demands better methods and larger sample sizes. Citizen science (CS) projects are producing an ever‐growing geo‐ and time‐referenced set of photographs of organisms. These datasets have the potential to make a huge contribution to color research, but the reliability of these data needs to be tested before widespread implementation. We compared color extracted from CS photographs with color measured under controlled lighting conditions (i.e., the current gold standard, spectrometry) for both birds and plants. First, we tested the ability of CS photographs to quantify interspecific variability by comparing >9,000 CS photographs of 537 Australian bird species with controlled museum spectrometry data. Second, we tested the ability of CS photographs to quantify intraspecific variability by measuring petal color data for two plant species using seven methods/sources with varying levels of control. For interspecific questions, we found that by averaging out variability through a large sample size, CS photographs capture a large proportion of across‐species variation in plumage color within the visual part of the spectrum (R² = 0.68–0.71 for RGB space and 0.72–0.77 for CIE‐LAB space). Between 12 and 14 photographs per species are necessary to achieve this averaging effect for interspecific studies. Unsurprisingly, the CS photographs taken with commercial cameras failed to capture information in the UV part of the spectrum. For intraspecific questions, decreasing levels of control increase the color variation, but averaging over larger sample sizes can partially mitigate this, aside from particular issues related to saturation and irregularities in light capture. CS photographs offer a very large sample size across space and time, which provides statistical power for many color research questions.
This study shows that CS photographs contain data that lines up closely with controlled measurements within the visual spectrum if the sample size is large enough, highlighting the potential of CS photographs for both interspecific and intraspecific ecological or biological questions. With regard to analyzing color in CS photographs, we suggest, as a starting point, to measure multiple random points within the ROI of each photograph for both patterned and unpatterned patches and approach the recommended sample size of 12–14 photographs per species for interspecific studies. Overall, this study provides groundwork in analyzing the reliability of a novel method, which can propel the field of studying color forward.


| INTRODUCTION
Organism color is a visually remarkable, yet complex trait. Aspects of color are easily observed, and as such, much about its function, production, perception, and evolution is known. An organism's color can play vital roles in physiology, providing thermoregulatory (Caro, 2005; Stelbrink et al. 2019), photosynthetic, and photoprotective (Brenner & Hearing, 2008; Hirschberg, 2001) functions by regulating the absorption of light. More sophisticated functions of color evolved with the development of color vision in complex organisms, allowing coloration to be used in visual cues and signaling, such as the colorful plumage in birds for courtship and mating, or brightly colored flowers that attract animal pollinators (Dyer et al. 2012).
Color is often used to study evolutionary processes such as selection and drift (Hoekstra, 2006). Increasingly, patterns at large scales, including color variation across biogeographic regions (Dalrymple et al. 2015) and throughout time (Zeuss et al. 2014), have proven to be interesting. In this line of color research, large sample sizes of observations through space and time are needed and these may prove difficult to obtain.
The study of color has traditionally been difficult for two main reasons: traditional observations of color have been subjective descriptions instead of quantitative measurements (Endler, 1990), and best practices of measuring color are expensive and time-consuming (Leighton et al. 2016). Shifting from an abstract, qualitative description of color to quantitative data results in greater rigor in scientific studies, advancing our knowledge of color. Recent studies involving color in ecology and evolution have used spectrometry (Dalrymple et al. 2015, 2018; Delhey, 2015; Shrestha et al. 2013), photography (Dalrymple et al. 2018; Shrestha et al. 2013; Tapia-McClung et al. 2016), or in some cases scanning of illustrations (Dale et al. 2015; Pinkert et al. 2017; Stelbrink et al. 2019). The success of these approaches for different questions suggests that different aspects of color research will continue to draw from different data sources.
There are still logistical issues which create limitations to the use of these modern methods in ecology and evolution. While the use of spectrometers has become the standard in measuring color objectively (Badiane et al. 2017; Endler, 1990; Johnsen, 2016), it is still expensive (Byers, 2006) and technically difficult (Johnsen, 2016).
Spectrometers can produce greatly varied measurements with different lighting conditions, and from variation in either angle of illumination or angle of observation (Johnsen, 2016), factors that are rarely kept constant across different setups. While commercial photography offers a more affordable, practical, and accessible alternative to spectrometry, there is often a great deal of information loss compared to spectrometry measurements. Nonetheless, the sampling process associated with both methods is still most likely limited to existing specimens or biodiversity at a particular time and place. Large-scale macroecological research thus requires that scientists invest much time, labor, and funding to be able to gather sufficient data across spatial and temporal scales (Pocock et al. 2017).
One potential way to obtain biological data at broad spatial and temporal scales is by using citizen science (CS). CS, which refers to the contribution of scientific data from people outside of the professional scientific community regardless of citizenship status, mostly comprises data collection that requires little to no additional training or equipment (Rotman et al. 2012). This effort harnesses observers who already engage in hobbies and activities like birdwatching (Silvertown, 2009), bug-catching (Yoshioka, 2013), and wildlife photography (Nowak et al. 2020), and seeks to broaden both scientific contribution and engagement with nature. Compared to traditional data collection efforts, a major advantage of CS efforts is the large number of contributors, which expands the temporal and geographic scope of data collection (Pocock et al. 2015) and allows scientists to focus on analysis rather than collection of data (Cohn, 2008). The use of CS photographs in color research (Leighton et al. 2016; Moore et al. 2019; Parkinson et al. 2016) is still relatively new. The lack of control and standardization in CS photography subjects it to numerous potential issues.
For example, color appearance and measurement in photographs are susceptible to variation from the effects of different shutter speeds, lighting conditions, noise (Jackowski et al. 1997), camera exposure levels, and specifications of the individual cameras used (Byers, 2006). Additionally, these effects vary depending on the subject of the photograph, likely associated with camera sensitivity to the optical properties of different colors and surfaces.
Therefore, prior to the broad-scale application of CS photographs in color research, there needs to be a fair assessment and accounting of its limitations, as well as a quantification of how color information from CS photographs corresponds with color information from controlled spectrometry.
Our overall objective was to assess the ability of CS photographs to quantify color for both interspecific (using birds as a study system) and intraspecific (using plants as a study system) questions. We first tested the ability of CS photographs to support interspecific comparisons (i.e., among species) using color information for >500 bird species. We then tested whether specimen age, patterning in the plumage, or different subjective color families influence color measurement. Second, we quantified the ability of CS photographs to capture intraspecific variability in color by analyzing color across a spectrum of methodological control, ranging from spectrometry to CS photographs, for two distinct plant species (Figure 1). We hypothesized that the variability in color measurements would increase from controlled (i.e., spectrometry) to uncontrolled (i.e., CS photographs) methods. For both objectives, our analyses were conducted in three common color spaces: RGB (red green blue), HSV (hue saturation value), and CIE-LAB (abbreviated as Lab). Ultimately, our work will demonstrate the largely untapped potential of CS photographs and establish the reliability of this novel method for use in biodiversity research.

| Species selection and sampling of citizen science photographs
We used photographs from the Macaulay Library CS project. The Macaulay Library (https://www.macaulaylibrary.org/) houses over 20 million photographs of birds, contributed by volunteer birdwatchers all over the world. Each photograph is rated by the birdwatching community based on its quality (see: https://help.ebird.org/customer/en/portal/articles/2665949-photo-quality-rating-guidelines?b_id=1928), ranging from 1 (barely identifiable to species) to 5 (excellent, in-focus photo). For Australian species with a large number of photographs (>50), we requested a random subsample of 4- and 5-star photographs (maximum of 50 per species) from the Macaulay Library. For any other species, we requested all photographs greater than a 3-star rating. We obtained 22,754 photographs of 563 species which we then manually filtered again to exclude poor photographs that we considered to have been inaccurately rated or that were clearly duplicates.
For each species, all photographs were sorted by visual appearance into adult or juvenile, male or female if sexually dichromatic, and breeding or nonbreeding plumage where applicable, using Pizzey and Knight (2012). To limit the influence of intraspecific color variation on our analysis, we excluded the following photographs: (a) juveniles; (b) nonbreeding plumage birds; (c) females with obviously different coloration; and (d) birds in molt or in between forms. All morphs or races for a species were included as long as they were either all sexually dichromatic or all sexually monochromatic.
We focused on the upper breast because it is commonly used in comparisons of bird colors (Dale et al. 2015; Mcqueen et al. 2019) and because it is commonly visible in photographs. The photographs were filtered further to obtain only those with a visible upper breast patch, large enough that measuring its color would be feasible (detailed in Section 2.1.3). Patterned patches were treated no differently from solid patches, and with color-blocked patches, one solid color was chosen to be the region of interest (ROI; Figure S1).
FIGURE 1 The spectrum of control in the study of color is represented by citizen science photographs, high-quality citizen science photographs, controlled photography, and controlled spectrometry. Sample size generally decreases as level of control increases. We analyzed interspecific variability in birds, comparing color between high-quality citizen science photographs and controlled spectrometry. We also analyzed intraspecific variability in plants, comparing color across all four levels of control
Information on each bird species and its upper breast plumage was recorded to note where multiple morphs are present and if the ROI is patterned or part of a color-blocked patch.

| Measuring colors in citizen science photographs
For each CS photograph that met the criteria above, we used colorZapper (Valcu & Dale, 2014) to retrieve color information in both RGB and HSV color spaces. Three random points were selected within the ROI of each photograph (Figure S1). Multiple colors in patterned patches (see Figure S1) were not treated as separate colors because (a) in many photographs it is difficult to define the borders between spots/streaks and background color (unlike in color-blocked patches where there is a more distinct separation between solid colors), and (b) we tried to measure patches as similarly as possible to how spectrometry is conducted, which measures an average color within a patterned area. A maximum of 25 photographs per species was measured, and where multiple individuals were present in a photograph, only one bird was measured. Because of the potential for misinterpreting the location of the upper breast on some species, we inspected the RGB values from the museum spectral measurements in comparison with our data to ensure we measured the same part of the bird.

| Processing bird spectral measurements
We used spectral data on 555 Australian bird species from Delhey (2015). These data consist of reflectance spectra of plumage on museum specimens taken using a spectrometer under controlled lighting conditions (for detailed methods, see Delhey, 2015). After resolving taxonomic differences, we were left with a total of 537 species that were in common with species obtained from the Macaulay Library. Because the spectral data also encompassed birds of both sexes and a total of 17 plumage patches, this was again filtered to only include measurements of the upper breast plumage of male birds. We converted the spectral measurements into RGB values using the R library pavo (Maia et al. 2013). We also used a psychophysical model of avian color vision (see detailed methods in Supplementary Methods).

| Statistical analyses
Our analyses were conducted in R version 3.6.2 and used the tidyverse workflow (Wickham et al. 2019). We performed analyses at two levels: the species mean (N = 537) and the individual photograph (N = 9,441). The three measurements within the ROI (Figure S1) were averaged to obtain a mean R, G, and B value for each photograph.
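The two levels of aggregation described above can be illustrated with a minimal Python sketch. This is not the authors' R/tidyverse code; the record layout `(species, photo_id, (R, G, B))`, the function names, and the example values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def channel_means(rgb_values):
    """Average a list of (R, G, B) tuples channel by channel."""
    return tuple(mean(channel) for channel in zip(*rgb_values))


def photo_means(points):
    """Collapse the sampled ROI points of each photograph into one mean RGB."""
    by_photo = defaultdict(list)
    for species, photo_id, rgb in points:
        by_photo[(species, photo_id)].append(rgb)
    return {key: channel_means(values) for key, values in by_photo.items()}


def species_means(per_photo):
    """Average the per-photograph means into one mean RGB per species."""
    by_species = defaultdict(list)
    for (species, _photo_id), rgb in per_photo.items():
        by_species[species].append(rgb)
    return {sp: channel_means(values) for sp, values in by_species.items()}


# Hypothetical data: three ROI points per photograph (values are made up).
example_points = [
    ("Species A", "a1", (200, 100, 50)),
    ("Species A", "a1", (210, 110, 50)),
    ("Species A", "a1", (190, 90, 50)),
    ("Species A", "a2", (220, 120, 70)),
    ("Species A", "a2", (220, 120, 70)),
    ("Species A", "a2", (220, 120, 70)),
    ("Species B", "b1", (10, 20, 30)),
    ("Species B", "b1", (20, 30, 40)),
    ("Species B", "b1", (30, 40, 50)),
]

per_photo = photo_means(example_points)      # individual-photograph level
per_species = species_means(per_photo)       # species-mean level
```

Averaging within photographs first, and only then across photographs, gives every photograph equal weight in the species mean regardless of how many points were sampled in it.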
First, we performed an overall analysis on the relationship between CS and museum color measurements at the species level. We plotted species mean values for CS against corresponding species mean values for museum and ran linear models to obtain R² values for each color component in RGB space. The same analyses were then done at the individual (photograph) level, which used individual values for CS and corresponding species mean values for the museum measurements.
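As a rough illustration of the species-level comparison, the following Python sketch fits an ordinary least-squares line for one color channel and reports R². It is a stand-in for the linear models the authors ran in R; the function name and the toy numbers are assumptions, not values from the study.

```python
def linear_fit_r2(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, R^2).

    Here x would hold museum species means and y the matching CS species
    means for a single channel (for example R).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical channel values for four species (made-up numbers).
museum_r = [1.0, 2.0, 3.0, 4.0]
cs_r = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2 = linear_fit_r2(museum_r, cs_r)
```

The same function would be applied separately to each of R, G, and B, once with species means on both axes and once with individual photograph values on the CS axis.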
Because we had sampled varying numbers of photographs per species ranging from 1 to 25, we explored whether increasing sample size produced lower residuals. Species-level CS measurements were regressed against museum species mean measurements to obtain these residuals. We visualized the effect of increasing sample size on the mean color estimate by taking the absolute values of the residuals and averaging these for each number of photographs per species (1-25). We also tested several hypotheses to identify other explanatory variables. As the museum data were obtained from specimens of varying ages, we analyzed whether specimen age affects color. This involved using linear models on residual plots to obtain R² and p-values for R, G, and B. We also analyzed whether patterned patches/ROIs measure differently from unpatterned ones. We then considered if RGB values within different color families (e.g., white, pink, or yellow) are captured differently by commonly used cameras. First, species were categorized subjectively into color families by which color(s) appear on their upper breast ROI. Those with multiple colors on their ROIs (e.g., patterned patches, multiple forms) were included in multiple color families (Table S1). Linear models were then run on plots of individual CS measurements versus museum species mean measurements to obtain residuals. Standard deviations were calculated for each species. These were displayed in box plots to show differences in accuracy and precision across different color families.
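The sample-size analysis can be sketched as follows: fit the regression of CS species means on museum species means, take absolute residuals, and average them within each number of photographs per species. This Python version is only an illustration under assumed names and made-up data; the authors' implementation was in R.

```python
from collections import defaultdict


def fit_line(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_residual_by_n(records):
    """records: iterable of (n_photos, museum_mean, cs_mean), one per species.

    Returns {n_photos: mean absolute residual} for one color channel.
    """
    records = list(records)
    slope, intercept = fit_line([r[1] for r in records],
                                [r[2] for r in records])
    grouped = defaultdict(list)
    for n_photos, museum_mean, cs_mean in records:
        predicted = intercept + slope * museum_mean
        grouped[n_photos].append(abs(cs_mean - predicted))
    return {n: sum(v) / len(v) for n, v in sorted(grouped.items())}


# Hypothetical species: two with 2 photographs each, two with 20 each.
example_records = [
    (2, 0.0, -4.0),
    (2, 0.0, 4.0),
    (20, 10.0, 9.0),
    (20, 10.0, 11.0),
]
residual_curve = mean_abs_residual_by_n(example_records)
```

Plotting the returned values against the number of photographs gives the kind of curve on which a plateau in precision can be read off.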
We also analyzed the data in two other color spaces, HSV and Lab. The latter color space is unique in that it separates chromatic (the wavelength of photons, a and b) from achromatic (the amount of photons, L) variation. As we had done with the RGB measurements, we averaged the HSV measurements from colorZapper to obtain mean values for each photograph. Lab equivalents were obtained by converting RGB measurements using patchPlot (Bruneau, 2013), followed by averaging. Museum data had to be converted for both HSV and Lab, using functions built into base R version 3.6.2 (R Core Team, 2019) and patchPlot (Bruneau, 2013). HSV values were further analyzed for precision by color family for both CS and museum data. Finally, we ran linear models predicting museum species means in bird visual space from CS species means in Lab space. The achromatic component of bird visual space was predicted from the achromatic (L) component of Lab space, and each of the x, y, and z dimensions for each eye type (U and V) was predicted using a linear model with the Lab chromatic components a and b as predictors.
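To make the color-space conversions concrete, here is a small Python sketch using the standard-library `colorsys` module for HSV and the textbook sRGB to CIE-LAB formulas for Lab. It assumes 8-bit sRGB input and a D65 white point; the authors used colorZapper, base R, and patchPlot, whose settings may differ, so values from this sketch are not guaranteed to match theirs.

```python
import colorsys

# Assumed reference white (D65, 2-degree observer), Y normalized to 1.
D65_WHITE = (0.95047, 1.0, 1.08883)


def rgb_to_hsv_255(rgb):
    """8-bit (R, G, B) to (H, S, V), each on a 0-1 scale."""
    r, g, b = (c / 255 for c in rgb)
    return colorsys.rgb_to_hsv(r, g, b)


def _srgb_to_linear(c):
    """Undo the sRGB transfer curve for one 8-bit channel."""
    c = c / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    """Piecewise cube-root function used in the CIE-LAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(rgb):
    """8-bit sRGB (R, G, B) to CIE-LAB (L, a, b) under the D65 assumption."""
    r, g, b = (_srgb_to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = (_lab_f(v / w) for v, w in zip((x, y, z), D65_WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


red_hsv = rgb_to_hsv_255((255, 0, 0))
red_lab = rgb_to_lab((255, 0, 0))
white_lab = rgb_to_lab((255, 255, 255))
```

In this representation L carries the achromatic (lightness) variation, while a and b carry the chromatic variation, which is the separation the text relies on.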

| Species selection and sampling
We used two common species in this study, representing different subjective colors: Macroptilium atropurpureum and Senna pendula.
Photographs were manually filtered for quality and whether the photographed flower is large enough to measure its color without difficulty. An additional step was performed to separate high-quality photographs from lower quality photographs, using eBird guidelines adapted for plants (see above).

| Spectrometry
Spectrometry was performed inside an enclosed setup ( Figure S2) to minimize the influence of outside lighting sources. This con-

| Statistical analyses
To prepare our data for analysis, RGB values from Section 2.2.2 were averaged for each individual flower, and RGB values from Section 2.2.3 were averaged for each photograph. These were then plotted together to visualize and analyze differences in color measurements across all treatments. Standard deviations in RGB and Lab were computed for each treatment. We then carried out Bartlett's test to check for homogeneity of variances across treatment groups, followed by a Kruskal-Wallis test to determine whether the differences across treatment groups are statistically significant.
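A simplified Python sketch of the per-treatment summary is given below: sample standard deviations per treatment and the tie-corrected Kruskal-Wallis H statistic, computed by hand. Bartlett's test and the p-value (which would come from a chi-square distribution with k - 1 degrees of freedom) are omitted; in practice a statistics library would supply both. The treatment names and numbers are hypothetical.

```python
from statistics import stdev


def treatment_sds(groups):
    """Sample standard deviation of one color channel for each treatment."""
    return {name: stdev(values) for name, values in groups.items()}


def kruskal_wallis_h(samples):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples.

    Assumes the pooled values are not all identical (the tie correction
    would otherwise be zero).
    """
    pooled = sorted(v for sample in samples for v in sample)
    n = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_term = sum(
        sum(rank_of[v] for v in sample) ** 2 / len(sample)
        for sample in samples
    )
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))


# Hypothetical R-channel values for three treatments (made-up numbers).
example_groups = {
    "treatment_1": [1.0, 2.0, 3.0],
    "treatment_2": [4.0, 5.0, 6.0],
    "treatment_3": [7.0, 8.0, 9.0],
}
sds = treatment_sds(example_groups)
h_stat = kruskal_wallis_h(list(example_groups.values()))
```

Because the Kruskal-Wallis test is rank-based, it does not require the equal-variance assumption that Bartlett's test checks, which is one reason to pair the two.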

| Interspecific variability
We analyzed a total of 9,441 CS photographs of 537 bird species. The mean number of photographs for a species was 17.6 ± 8.1. We found a strong relationship between CS and museum color measurements at the species level, with R² values of 0.71, 0.71, and 0.68 for R, G, and B, respectively (Figure 2). When the relationship was analyzed with individual CS photographs, we found lower R² values of 0.48, 0.47, and 0.46 for R, G, and B, respectively (Figure S3). Museum measurements on average measured noticeably higher for R, while CS measurements on average measured higher for B (Figure S4). Increasing the number of photographs measured for each species appeared to reduce residuals; however, the increase in precision plateaued at approximately 12-14 photographs (Figure 3).
Residuals for R and G in RGB space do not seem to be affected by specimen age; however, there is a weak effect in B (Figure 4).
There is no observable bias in patterned patches (Figure 4), suggesting that our method of measuring color in CS photographs is comparable to the spectrometry method applied in Delhey (2015). R measurements in red ROIs are higher on average in CS photographs, especially noticeable because of the considerably lower G and B measurements (Figure 5). R measurements for red patches also stand out for high imprecision, varying greatly across photographs (Table 1). We found mixed results in HSV, with strong R² values for S and V at 0.61 and 0.71, but 0.15 for H (Table 1; Figure S5). H values are highly variable, especially in CS data, across almost all color families, and pink and purple ROIs measure poorly for H even under controlled spectrometry in museum data (Figure S6).
In our analysis of how well color in CS photographs corresponds with measurements in bird visual space, we found a strong relationship between the achromatic components from each space, with an R² value of 0.73 (Table 2). Chromatic components correspond well for both U- and V-type eyes in y and z, all with R² values exceeding 0.7. However, the relationships are weaker for both eye types in x, with values of 0.1 and 0.46, respectively.

| Intraspecific variability
We analyzed 48 CS photographs and 8 individual specimens for Senna pendula, and 62 CS photographs and 5 individual specimens for Macroptilium atropurpureum. RGB values for lower quality CS photographs show high variability (Figure 6). We measured extremely high R and G values in Senna pendula, especially for "Olympus no flash," which appear condensed at 255 (Figure 6), indicating a saturation issue (Stevens et al. 2007); therefore, Senna pendula data were excluded from further analyses.
FIGURE 2 Linear model of the relationship between citizen science and museum colors in RGB space. Each point is a species mean in citizen science data plotted against the corresponding species mean in museum data
FIGURE 3 Relationship between sample size and precision. RGB measurements in the citizen science data were averaged at the species level and plotted against corresponding species mean values in the museum data to obtain residuals. Absolute residuals were then averaged for each number of photographs per species from 1 to 25
In RGB space, the variability of CS photographs is higher than all other methods for Macroptilium atropurpureum except for "Phone flash," while "Olympus flash" showed the lowest variability, even lower than spectrometry (Table 3). However, in Lab space, where achromatic variation is separated from chromatic variation, the high standard deviation in "Phone flash" was found to be mostly from the lightness (L) component (Table 4). Additionally, spectrometry measurements had the lowest variability in all three Lab space components, with values of 2.29, 2.87, and 1.23, respectively. Our subjective rating system, which separated low-quality from high-quality photographs, did not reduce variability in the latter group. The variances across treatment groups for R, G, and B are different (p < .001 for all six tests). Following that, all Kruskal-Wallis tests had p-values of less than .001.

| DISCUSSION
Using photographs of both birds and plants, we demonstrate the potential use of CS photographs in the future of color research for both interspecific and intraspecific ecological and biological questions. For our interspecific objective (i.e., taking a mean across many samples per bird species), we found strong relationships between CS and museum color measurements in both RGB and Lab color spaces, suggesting that CS photographs capture a significant amount of color information for interspecific studies. For our intraspecific objective (i.e., using photographs of two plant species), we found that along the spectrum of control, intraspecific variability was overall greatest for CS photographs and lowest for spectrometer measurements. We demonstrated that improvement in the precision of a species' mean color estimate slows greatly after 12-14 photographs, suggesting that a larger sample size to a point does help to average out method-dependent variability across individual photographs, albeit to a lesser extent compared to interspecific studies. Moreover, given the growing popularity of CS (and contributions of photographs), this is likely a feasible sample size for many questions. In this study, the mean number of photographs per bird species was 17.6, but we recognize that available sample sizes will depend on the question.
FIGURE 4 Panels (a), (b), and (c) show the relationship between color and specimen age. Residuals were obtained by plotting individual specimen measurements against species mean measurements. Panel (d) shows differences in bias in patterned and unpatterned patches; there was no significant difference for any of the three contrasts (p = 1). Residuals were obtained from plotting individual citizen science measurements against corresponding museum species mean measurements
Using photographs of organisms in nature is closer to the functional purpose of the color (i.e., attracting a mate or a pollinator) compared to museum measurements, but the natural setting introduces a great deal of variability in the measurement. Differences in lighting and equipment can contribute greatly to noise and imprecision, obscuring true results when studying color in an ecological context. In our analyses, we found strong relationships at both the individual and species levels, despite the variability, in RGB space and even more so in Lab space. Interestingly, because the Lab color space separates chromatic (a and b) from achromatic components (L), our analysis at the individual level shows that the variation is more related to the combination of how brightly lit a patch is and the capture of light by individual cameras versus chromatic differences (Table 1).
Aside from the level of control, we also evaluated three factors-patterning, specimen age, and color families-for their  Whether bird specimens accurately preserve color in living birds is an area of research with mixed results (Armenta et al. 2008; Doucet & Hill, 2009; McNett & Marchetti, 2005), but there was minimal effect of specimen age in this specific study. Future research should confirm these comparisons between CS photographs and museum specimens for a larger number of samples of museum specimens, given the potential of confounding factors such as the variability between individuals that exists in nature. Overall, we did not find substantial differences across the subjectively assigned color families, although there are noticeable points in red and black. All museum measurements on average are higher in R values, but red ROIs specifically measure higher for R in CS photographs, which are also the most imprecise values. This could be due to saturation in photographs and camera processing; however, within our sample we found that cases of R-saturated red patches were rare. Black ROIs were especially variable. One possible explanation for variability in black ROIs is glossy feathers in several species (Maia et al. 2011 but sometimes measurement noise is amplified and interpreted as such by visual models (Schaefer et al. 2007), and this can increase variability as well. Thus, high variation and complex lighting effects should be expected in black ROIs when using CS photographs to study color.
One set of research questions using color seeks to understand the role of color in the context of nonhuman visual systems. Our results suggest that CS photographs may be useful for some, but not all, of these questions. We show that CS photographs in Lab space have strong predictive power for three of the four dimensions of the two bird color spaces, and the poorly predicted dimensions include mainly information from the UV part of the spectrum. UV is not captured by the sensors in consumer cameras, and as such, only a subset of questions related to bird color vision may be addressed with CS photographs. We note, however, that variation in UV reflectance independent of variation in the rest of the visual spectrum is quite rare in birds (Andersson, 1999) and that this dimension of chromatic variation (x, SD = 1.42, 1.03) is much less variable than the other chromatic axes considered here (y, SD = 2.65, 2.37, and z, SD = 2.67, 2.62), which will contribute as well to weak correlations with photography data. and Lab spaces are much more evident and significant due to the separation of variation into chromatic and achromatic components, and we found that the Lab space performed slightly better in this study.
We also found that a high-quality photograph as per eBird guidelines is not a strict necessity for gleaning color data, as evidenced by the increase in standard deviation in higher quality photographs versus lower quality photographs of Macroptilium atropurpureum. It will likely be beneficial to have a separate set of guidelines for rating how suitable a photograph is for color analysis. The current method of manually extracting measurement points from photographs can be taxing and time-consuming for a large sample size. Future work in this space should look to use automated machine learning techniques such as image extraction (Ott et al. 2020) to streamline this process.
The current assumption in color research is that spectrometry produces color measurements that are the most accurate and precise. Yet, arbitrary decisions across many study setups all have an effect on the output. We have provided strong evidence that studies conducted at the lowest end of methodological control (i.e., CS photographs) can provide reliable results in the future of color research, while significantly reducing costs for data collection. With regard to analyzing color in CS photographs, we suggest, as a starting point, to measure multiple random points within the ROI of each photograph for both patterned and unpatterned patches and approach the recommended sample size of 12-14 photographs per species for interspecific studies. With future research and continuous development, it is certainly possible to further refine techniques in using CS photographs, minimize trade-offs, and subsequently introduce it into mainstream methods of studying color.

ACKNOWLEDGEMENTS
We would like to thank all the contributors on iNaturalist and eBird, as well as Macaulay Library for collecting and organizing all the images in this study. We would also like to thank G. Taseski for his expertise in plant identification and for collecting the plant specimens, and C. Baxter for training and advice on spectrometry.

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in Zenodo (https://doi.org/10.5281/zenodo.4505774).