Overcoming biodiversity blindness: Secondary data in primary citizen science observations

1. In the face of the global biodiversity crisis, collecting comprehensive data and making the best use of existing data are becoming increasingly important to understand patterns and drivers of environmental and biological phenomena at different scales.
2. Here we address the concept of secondary data, which refers to additional information unintentionally captured in species records, especially in multimedia-based citizen science reports. We argue that secondary data can provide a wealth of ecologically relevant information, the utilisation of which can enhance our understanding of traits and interactions among individual organisms, populations and biodiversity dynamics in general.
3. We explore the possibilities offered by secondary data and describe their main types and sources. An overview of research in this field provides a synthesis of the results already achieved using secondary data and different approaches to information extraction.
4. Finally, we discuss challenges to the widespread use of secondary data, such as


THE UNTAPPED INFORMATION IN EXISTING BIODIVERSITY DATA
Citizen science contributes enormously to biodiversity monitoring (Chandler et al., 2017) by providing data that are potentially as useful as those collected by professional scientists, especially for research over large spatial and temporal extents (Callaghan et al., 2020). However, there are still major gaps in the taxonomic, temporal and spatial coverage of the biodiversity data needed to track changes in species' abundance and distribution (Amano et al., 2016; Feldman et al., 2021; Wetzel et al., 2018). Furthermore, biodiversity encompasses not only the diversity of organisms but also the diversity of interactions between them and with their environment, which has been given scant attention by citizen science (Chandler et al., 2017; Groom et al., 2021). Ecological interactions are the foundation of ecology and the architecture of ecosystems, but observing these relationships is challenging (Jordano, 2016). The sheer number and complexity of interactions, as well as their detection probability in the field, limit the number of species, locations and time periods that can be studied.
However, both occurrence and interaction data are needed to understand, address and mitigate the consequences of the five major drivers of biodiversity loss (Díaz et al., 2019). For instance, land-use change may affect host-vector dynamics (Spence Beaulieu et al., 2019), pollution may lead to adaptations in species traits (Rech et al., 2022), climate change can affect trophic cascades and distribution shifts (van Gils et al., 2016), overexploitation may change former mutualisms (Speziale et al., 2018) and biological invasions can result in novel plant-pollinator networks (Parra-Tabla & Arceo-Gómez, 2021). The variety of direct and indirect effects of biotic and abiotic interactions is difficult to study, which poses a challenge that can only be addressed through global collective effort (Díaz et al., 2019).
To achieve more accurate and reliable monitoring and interaction data, we need to improve and integrate current methods (Besson et al., 2022; Kühl et al., 2020; van Klink et al., 2022), to better analyse existing data (Probert et al., 2022), and to extract more information from already collected data (Johnston et al., 2022).
In the latter case, one untapped source of abundant data is the corpus of digital images and other media generated and shared on citizen science platforms. As of May 2023, four such websites alone (Artportalen, iNaturalist, Observation.org, and Pl@ntNet) had collectively published over 78 million images through the Global Biodiversity Information Facility (GBIF). Almost all these photos were taken of organisms or signs of their presence in situ (i.e. nests, faeces, tracks, etc.) and thus may capture ecologically relevant information as a by-product. This type of additional information has been termed 'secondary data' (Callaghan et al., 2021, see Box 1 for details). Having recognised the potential, a growing number of studies have examined ecological questions using incidentally captured information. Yet, so far, the scientific community has been relatively blind to the opportunities to extend our knowledge of biodiversity from secondary data. We therefore posit a 'biodiversity blindness', drawing on the concept used to describe the public's lack of attention to the presence and diversity of plant and animal life (Moscoe & Hanes, 2019).
Following our definition of secondary data (Box 1), we do not explicitly refer to the metadata (e.g. timestamp or geolocation) associated with the occurrence record, though they also represent potential secondary data sources and play a major role in selecting datasets and their analyses. An important element in utilising secondary data is the explorative character of this research method, as the nature of this unintentionally recorded information may be unclear. The additional information may document biotic interactions and co-occurrences, which could provide not only important ecological information on the observed species but also records of the bycatch for respective monitoring programmes. Secondary data may also include details on morphology, behaviour, habitat and various other aspects of a species' traits and ecology.
This paper explores the opportunities and pitfalls of extracting secondary data from multimedia records of biodiversity. We also examine how advances in artificial intelligence could accelerate data extraction. Our goal is to illuminate the hidden treasure of biodiversity data contained in citizen science multimedia records and available openly to the scientific community. While the efforts to explore and exploit the realm of secondary data are still in their infancy, they already demonstrate numerous opportunities to enrich and inform biodiversity research.

BOX 1 What are secondary data?
Secondary data in the context of citizen science in biodiversity research, as we define them, refer to a subset of information that is unintentionally captured alongside primary data. Primary citizen science data collected for a specific research focus, such as monitoring the distribution of a species, provide information on the location and date of record of the species in addition to evidence of its occurrence. These primary data of 'what?', 'when?' and 'where?' are the intended focus of many ad hoc observing portals. In contrast, secondary data are ancillary details that are also present in the materials collected but were not the intended subject of the study. Indeed, the observer may have been unaware of the secondary information they evidenced.
Secondary data can offer valuable opportunities for additional research and analysis, enriching our understanding of ecosystems' functioning, population dynamics, natural behaviours and environmental conditions. They represent any retrievable pieces of information that can be seen on an image or in a video, heard on an acoustic recording or be included in a descriptive text. The information they contain may relate to features of individuals or populations, biotic interactions (including human-nature interactions), landscape and environmental conditions, or any other biotic or abiotic features or their combination (Figure 1). We recognise, of course, that the subject of investigation may be something other than the mere detection of species. In complex citizen science programmes, the primary data can only be separated from secondary data with reference to the objectives of the project. For example, in the COASST project (Parrish et al., 2017), citizen scientists record bird carcasses on beaches, providing not only an image but also a wealth of information about the morphology of the carcass and the state of the environment. In addition to the primary information collected, the images of dead birds may contain even more information than the research project anticipated, such as the presence of necrophagous species. Thus, secondary data are data that the methodology was not intended to capture, though there are no sharp demarcation lines.

RESEARCH OPPORTUNITIES AND TYPES OF SECONDARY DATA
Extracting secondary data from existing citizen science sources helps to address universal challenges of biodiversity research, such as taxonomic bias, detectability of species and their interactions, and recognition of spatio-temporal dynamics.
Taxonomic bias towards charismatic or well-known species in citizen science data poses a challenge for researchers and limits the possibilities of launching projects that deal with less popular, cryptic or under-researched species. In addition, although simple, unstructured programmes generate high numbers of citizen science observations, information that is more complex to record, such as an individual's health condition, is not purposefully collected and therefore not formally documented. Similarly, studies that are less engaging, such as those that are time-consuming, physically challenging or in less attractive localities, are discriminated against. Extracting secondary data from existing observations could be fruitful to fill such data needs. For example, diurnal or seasonal activity patterns or vocal characteristics of rare species could be retrieved from soundscapes or the background of audio recordings of focal species. From images, occurrences of less charismatic arthropods or pathogenic fungi living on photographed plants could be extracted. In the latter case, in 2010, citizen scientists monitored and scored leaf damage on horse-chestnut trees (Aesculus hippocastanum) caused by the leafminer Cameraria ohridella in Great Britain (Pocock & Evans, 2014).
Today, the infestation could also be detected as secondary data in images on which horse-chestnut trees are the primary observation, thereby improving data availability at low additional cost and resource expenditure. Using secondary data to take advantage of the taxonomic bias towards well-documented species also brings additional research opportunities. When observations for a given species are widely available, one can extract data on multiple aspects of interest, such as morphological traits or biotic interactions, without spending time and resources on launching and running a new raw data collection campaign. For example, Putman et al. (2021) used images of the secretive but thoroughly documented lizard Elgaria multicarinata that were primarily collected for determining the species' distribution in Southern California. From the images, the authors assessed predation pressure and health condition by measuring the lizards' tails and by looking for ectoparasites in the animals' ear regions. Likewise, citizen science photos have been used to identify subtle morphological differences between two very similar species of grasshopper and to establish their distributions (Pélissié et al., 2023).
Coincidental evidence can mitigate low detection probabilities. By identifying a rare species in primary observations of other species, whether that is through biotic interaction or an incidental co-occurrence, the pool of observations can be enlarged. A citizen science project in Australia has shown that co-occurrence of a common and rare possum species can lead to more detections of the latter (Steven et al., 2021). Aside from potentially increasing sample sizes of monitoring data for the benefit of statistical analyses, we can improve our understanding of ecological impacts on other species, including people. We envisage applications in pollination dynamics, invasion impact or climate change research.
For example, we can potentially study the preferred flower species and colour in a network of native and exotic bumblebees and host plants (Catron et al., 2023; Fontúrbel et al., 2023). Another example is hair loss in moose (Alces alces) and wapiti (Cervus canadensis) caused by the expanding distribution of winter tick (Dermacentor albipictus) due to climate warming in Yukon, Canada (Chenery, 2023). Serendipity also plays a role in revealing ecological interactions: Rosa et al. (2022) not only found new and supposedly extinct species as primary observations, but also novel predatory interactions that were accidentally captured in the iNaturalist images of marine snails. Such chance discoveries based on the background information could be especially useful in invasion science, where secondary data may reveal new or hidden invasions or previously undocumented ecological processes that facilitate or hinder invasions.
Extracting secondary data from a series of observations across space and time can also support efforts to move from a mere single-species snapshot (an occurrence record) to spatio-temporal biodiversity dynamics. Using timestamps and geolocation metadata of citizen science observations to investigate spatial and temporal dynamics has been successfully applied before (e.g. Feldman et al., 2021; Newson et al., 2016). Given a sufficient temporal span and frequency of observations, we suggest linking secondary data to such a stamp to obtain a variety of observable dynamics. For example, a series of landscape images would not only contribute to monitoring data (e.g. the abundance and distribution of species on the images), but can also be useful for studying phenological dynamics at the community level (Hofmeester et al., 2020). This context applies to different scales and scopes of consideration, specifically on the level of individuals, populations, communities, the surrounding environment and the human dimension. For each of these levels, Table 1 gives extensive lists of types of information contained in secondary data.
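The idea of linking a secondary-data variable to the timestamp of the primary observation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record structure, the function name `monthly_feature_share` and the example dates are hypothetical, and the boolean flag stands for any feature (e.g. flowering in the image background) that has already been extracted manually or automatically.

```python
from collections import defaultdict
from datetime import date


def monthly_feature_share(records):
    """Return {(year, month): share of records in which the feature was seen}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for observed_on, has_feature in records:
        key = (observed_on.year, observed_on.month)
        totals[key] += 1
        if has_feature:
            hits[key] += 1
    # Sort the keys so the result reads as a time series.
    return {key: hits[key] / totals[key] for key in sorted(totals)}


# Hypothetical records: (observation date from metadata, feature visible in image?)
records = [
    (date(2022, 4, 3), False),
    (date(2022, 4, 18), True),
    (date(2022, 5, 2), True),
    (date(2022, 5, 20), True),
]
shares = monthly_feature_share(records)
```

A series of such monthly shares is a crude phenological signal; a real analysis would also need to account for uneven observer effort across months.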

TABLE 1 Types of information extractable from secondary data from citizen science projects. As the literature on secondary data is scarce and obscure due to a missing common terminology, we also list examples that used secondary information in combination with other sources and approaches (e.g. iEcology or literature). The table groups the publications by level (human-nature interactions, features of the individual, etc.) as described. For each study it gives the feature of interest (extracted data) and the data elements used to extract it as well as a short description (study example) of the content and the methods and sources used (source and extraction method).

[Table body not recovered. Column headings: Study example; Source and extraction method (footnote a); Reference. First level heading: Observer/human-nature interactions.]

SECONDARY DATA ARE SLOWLY DIFFUSING INTO THE SCIENTIFIC LITERATURE
Studies using secondary data (Table 1) have mostly focused on the extraction of morphological information, such as the pigmentation on wings of Calopterygidae damselflies (Drury et al., 2019), coloration patterns of grass snakes (Fritz & Ihlow, 2022), and intra- and interspecific variabilities in coloration of birds and plants (Laitly et al., 2021). Some studies also used secondary data to assess
Citizen science has already proven useful in mapping and tracking biological invasions (Encarnação et al., 2021). The additional information that comes with secondary data could reveal even more aspects of the invasion process, thereby supporting invasive species management. For example, initial studies have explored the host plants of introduced pollinators (Bila Dubaić et al., 2022; Guariento et al., 2019; Pernat et al., 2022) and cavity occupancy by wild honey bees (Apis mellifera) in Australia (Saunders et al., 2021).
Reanalysis of images has also been used for trait-based studies to characterise, for example, particulate matter in the global oceans (Trudnowska et al., 2021) and the feeding habits of marine copepods (Vilgrain et al., 2021), although these studies did not use citizen science data. The potential for extracting functional traits from images, either directly measured or inferred by combining visible features with context metrics from the metadata, has been thoroughly considered for plankton (Orenstein et al., 2022).

BOX 2 Secondary data, conservation culturomics and iEcology: What is the difference?
The use of secondary data in research is similar to the emerging areas of conservation culturomics and iEcology (Jarić et al., 2020). Culturomics seeks to understand human culture through the quantitative analysis of changes in word frequencies in large bodies of digital texts (Michel et al., 2011). In the context of biodiversity, the emergent area of 'conservation culturomics' focuses on the relationship between people and nature (Ladle et al., 2016), informed by contents of various types of online data. iEcology, on the other hand, is an umbrella term for analysing various types of digital data generated or collected for purposes other than ecological research to obtain insights into ecological questions. In contrast, in citizen science projects, people consciously contribute to the goal of a particular activity, such as biodiversity monitoring or invasive species detection (Marchante et al., 2023).

Table 1, footnote a: Only the sources and methods used to obtain secondary data whose data type was the subject of the study (feature, interaction, etc.) are listed. In most cases, the analyses also used geo-locations, dates and metadata that were part of the primary observation.

SECONDARY DATA EXTRACTION COULD BE ACHIEVED ALONG A GRADIENT OF HUMAN AND ARTIFICIAL INTELLIGENCE
As approaches to obtain secondary data are just emerging, such data are still mainly extracted manually. This can be challenging when thousands of images need to be interpreted and evaluated. For example, the aforementioned study of anther-smut infection within the Caryophyllaceae examined 79,801 iNaturalist images (Kido & Hood, 2020). There is much to be gained from automation that could scale up the process to millions of images, particularly for pre-selecting images and recognising relevant image features. For example, computer vision could be used to extract and analyse information on colour in images, such as the greenness of plants (Yuke, 2019), and deep learning models could detect, count and classify specific features of interest (Bjerge et al., 2023; Mann et al., 2022). Likewise, algorithms and pretrained dictionaries in Natural Language Processing could leverage the use of textual content, such as image captions, commentaries and tags in secondary data. Automated systems would also facilitate real-time analysis of biodiversity dynamics, making them particularly useful for informing decision-makers regarding effects of conservation efforts or as early warning tools (van Klink et al., 2022).
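As a simple illustration of colour-based extraction, the sketch below computes a greenness index from pixel values. It assumes the green chromatic coordinate, G / (R + G + B), as the index; the text above does not specify which measure was used in the cited work, so this choice, the function name and the example pixel values are all assumptions. In practice the pixels would be read from an image file with an imaging library rather than typed in by hand.

```python
def green_chromatic_coordinate(pixels):
    """Mean G / (R + G + B) over an iterable of (R, G, B) pixel tuples."""
    values = []
    for r, g, b in pixels:
        total = r + g + b
        if total == 0:
            # Pure black pixels carry no colour information; skip them
            # to avoid a division by zero.
            continue
        values.append(g / total)
    if not values:
        raise ValueError("no usable pixels")
    return sum(values) / len(values)


# Hypothetical pixel samples from a leafy and a bare image region.
leafy = [(30, 120, 30), (40, 160, 40)]
bare = [(120, 100, 80), (90, 75, 60)]

leafy_gcc = green_chromatic_coordinate(leafy)
bare_gcc = green_chromatic_coordinate(bare)
```

Normalising by the summed channels makes the index less sensitive to overall brightness than the raw green channel, which matters for images taken under uncontrolled lighting.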
Despite the obvious appeal of machine learning for automatic data extraction from citizen science sources, several obstacles must be overcome before its full potential can be realised. Developing robust models that effectively handle diverse and noisy datasets is challenging and resource-intensive. Nevertheless, for some tasks, existing tools may be customised or applied directly. Multiple trained deep learning models to screen multimedia for human or natural objects are freely available. For instance, object detection models, which are often pretrained and benchmarked on the COCO dataset (Lin et al., 2014) containing 80 different object categories (including birds and other animals), may already provide relevant secondary data output. Moreover, models and resources exist for specific groups of organisms and data types: Merlin Bird ID and BirdNET (Kahl et al., 2021) for bird detection based on sound (the former can also identify species from images); the Pl@ntNet API for plants; a test dataset for insects created by Bjerge et al. (2023); FishID for fish species in images; MegaDetector or TrapTagger for animals in camera trap photos; and BatDetect2 and BatNet for bats in sound recordings (Aodha et al., 2022; Krivek et al., 2023). Additionally, customised models can be trained on open datasets, for example, FathomNet for marine organisms (Katija et al., 2022), Pl@ntNet for plants and iNaturalist for a range of different species. Importantly, even with readily available models, manual effort and expertise are required to ensure the anticipated model behaviour and performance on new data.
In other cases, models and analysis pipelines may need to be developed from scratch. Where models or training data are not available, the cost-benefit ratio of developing new artificial intelligence models should be weighed against the use of human-mediated approaches. For example, Mann et al. (2022) developed an approach to automatically detect flowering plants in images, which were then examined by citizen scientists for the rare presence of insects.
Efficient processing is relevant when dealing with large amounts of data, but it is critical to consider the resources needed. Developing custom automated methods, with their broader usefulness and applicability, and setting up and maintaining manual processing pipelines (e.g. citizen science projects or recruiting and managing volunteers) may differ in terms of time, costs and personnel demands as well as output quality. In any case, to address the uncertainty in exploratory analyses of secondary data variables, that is, to get an idea of what kind of additional information primary datasets contain, a subset of data would most often be analysed manually. This pre-processing can inform researchers about which methodologies to apply for larger scale extraction of information.
We expect that future secondary data extraction will be performed on a continuum between fully human and fully automated approaches with the respective advantages and disadvantages along this spectrum. Hybrid intelligence, that is, the combination of deep learning and human diligence (Mann, 2022; Rafner et al., 2021), can be effectively used to extract and analyse secondary data. Primary data can be filtered manually and, if necessary, annotated or immediately tested for relevant secondary data in the case of an existing algorithm. Conversely, one or more features can be selected from images (or other types of media) by an algorithm to be processed afterwards by a human (e.g. for annotation, validation or analysis; Figure 3).
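One way to picture the algorithm-first end of this continuum is a triage step in which a model scores each image and only uncertain cases are passed to a human. The sketch below is a minimal illustration, not a method described in the studies cited above: the function name `triage`, the thresholds of 0.9 and 0.1, and the image identifiers and scores are all hypothetical, and the scores are assumed to come from some existing classifier.

```python
def triage(predictions, accept_at=0.9, reject_at=0.1):
    """Split scored images into auto-accepted, auto-rejected and human-review lists.

    predictions: iterable of (image_id, score), where score is the model's
    estimated probability that the feature of interest is present.
    """
    accepted, rejected, review = [], [], []
    for image_id, score in predictions:
        if score >= accept_at:
            accepted.append(image_id)
        elif score <= reject_at:
            rejected.append(image_id)
        else:
            # The model is unsure, so a human annotator decides.
            review.append(image_id)
    return accepted, rejected, review


# Hypothetical classifier output for four images.
predictions = [
    ("img_001", 0.97),
    ("img_002", 0.55),
    ("img_003", 0.04),
    ("img_004", 0.90),
]
accepted, rejected, review = triage(predictions)
```

Moving the two thresholds closer together shifts work from people to the model; moving them apart does the reverse, which is one concrete way of choosing a position on the human-to-automated spectrum.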
Another challenge in applying artificial intelligence in secondary data studies is not knowing which data variables to look for and how to select or develop a potential identification algorithm or, simply put, how to search for the unknown unknowns. A human eye is able to identify the unexpected while the algorithm only recognises what is expected of it, that is, what it was trained for. In order to leverage the power of artificial intelligence for effective data extraction, data collection generally needs to be guided by precise research questions and must be based on a priori identification of variables of interest. As detection models trained to recognise an increasing number of objects, or segmentation models able to distinguish different areas in images (Kirillov et al., 2023)

WHY ARE WE NOT THERE YET?
Citizen science multimedia records are clearly more than meets the eye. To protect biodiversity, it is not only essential to inventory and monitor species, but also to understand the ecological networks they are part of. By giving many examples of current and possible future areas of application, we have demonstrated how secondary data offer the opportunity to extend and complement systematically collected interaction and monitoring data. Although we are convinced of the great potential in the untapped information, we still see some challenges to overcome and specific pitfalls to address.
Similar to the early days of the citizen science movement, the issue of bias can cast doubt on this new resource. Indeed, we suspect a similar bias in secondary data as in primary data (Isaac et al., 2014). Secondary data, however, would be less influenced by known recording behaviour (e.g. aesthetic preferences or charisma of observed target species) and more affected by previously less considered human actions. Staging of observed species in particular locations and environments, and cultural differences in what can be appropriately photographed, are imaginable examples. In these cases, scientists using secondary data can benefit from accelerated development and discussions in analysing opportunistic data to correct for bias (Johnston et al., 2022). Transparent handling of potential biases should be a given in both metadata and corresponding publications.
Of greater concern are biases from data generated by citizen science projects with unknown scientific goals. For example, if citizens in a project document a particular plant solely in forest habitat, a higher-level analysis of that plant's habitats based on image backgrounds would record an unrepresentatively high proportion of this plant in forests. Therefore, the source of data should be known, that is, in unclear cases, the project organisers would also need to be con-
Since improving metadata according to the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles is a worldwide effort, this problem will hopefully be solved with time. Similarly, the development of new and better machine-aided object recognition will allow large amounts of secondary data to be processed automatically in the future. In addition, approaches have been developed to not only recognise objects in multimedia data sources, but also to differentiate by anomalies or other species-specific features such as plant or animal colours (e.g. Hantak et al., 2022; Perez-Udell et al., 2023). Ultimately, the ever-improving models used to generate ecological networks also help to turn information into knowledge.
A more pressing issue is the legality and ethical defensibility of using millions of secondary data sources for purposes other than those intended when the primary observation was recorded and posted. Considerations of ethical and privacy issues are not exclusive to secondary data. They are also pertinent regarding the primary data used in iEcology and culturomics (Jarić et al., 2020), from which secondary data can be derived. While much primary data (e.g. online texts, images, videos or audio recordings) are publicly available, and in many cases, people have given consent to their availability (e.g. by registering in citizen science or social media platforms), researchers are required to give careful consideration to how they collect, use and share these data (Di Minin et al., 2021; Thompson et al., 2021; Zimmer, 2010).
Ethics are particularly relevant when dealing with online data from social media, where work is often used or distributed without the owners' consent. In fact, most social media platforms allow posting nearly any content, as they are not able to automatically identify copyrighted material. This issue is less prominent on citizen science platforms, as the users have stronger control of posted data and media licensing. But the way such information is shared and scraped still opens various possibilities of copyright infringement in the digital space. As such, there is considerable uncertainty regarding situations in which acquiring permission and crediting authorship becomes mandatory. If content is not under a Creative Commons licence, checking licences can become a time-consuming process. Especially when dealing with big data that are derived from multiple sources, it may be unfeasible to directly contact the media owners to get permission for use (Leighton et al., 2016).
Ethical issues have to be carefully considered and are especially delicate when secondary data allow recognition of people, or allow the identification of contentious human interactions, such as illegal fisheries (Sbragaglia et al., 2021), poaching or trade in wild organisms (Di Minin et al., 2019; Zimmer, 2010). Di Minin et al. (2021) have suggested a set of guidelines that can help address ethical concerns in research when using such data. Likewise, while publicly sharing species location information is useful for research, disclosing the location and identification of rare or threatened species can become a threat to their conservation (Lindenmayer & Scheele, 2017).
Although citizen science platforms such as iNaturalist already consider 'taxon geoprivacy' as a way to safeguard the locations of species 'at risk', a sensitive species as secondary data would still come with full coordinates if not recognised as such.
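The gap described here could in principle be closed by re-checking a record once secondary taxa have been identified in it. The following sketch shows the logic only. It is not iNaturalist's geoprivacy mechanism: the function name `protect_location`, the record fields, the 0.2-degree grid cell and the taxon names are all assumptions made for illustration.

```python
import math


def protect_location(record, secondary_taxa, sensitive_taxa, cell_deg=0.2):
    """Return a copy of the record, with coarsened coordinates if any
    taxon found as secondary data is on the sensitive list."""
    protected = dict(record)
    if not set(secondary_taxa) & set(sensitive_taxa):
        return protected  # nothing sensitive found, coordinates unchanged
    # Snap to the south-west corner of a cell_deg x cell_deg grid cell.
    protected["lat"] = round(math.floor(record["lat"] / cell_deg) * cell_deg, 6)
    protected["lon"] = round(math.floor(record["lon"] / cell_deg) * cell_deg, 6)
    protected["obscured"] = True
    return protected


# Hypothetical primary record of a common plant, with a sensitive orchid
# later identified in the image background.
record = {"taxon": "Common plant", "lat": 52.37, "lon": 13.41}
sensitive = {"Rare orchid"}

obscured = protect_location(record, ["Rare orchid", "Bumblebee"], sensitive)
unchanged = protect_location(record, ["Bumblebee"], sensitive)
```

The function returns a copy rather than modifying the input, so the exact coordinates can still be kept in a restricted-access store while only the coarsened version is published.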
Finally, it is most important that awareness of the existence and potential of secondary data grows among scientists. With our contribution, we aim to open the eyes of the scientific community to overcome 'biodiversity blindness' and acknowledge the wealth of information far beyond the location and date of a species observation in the millions of freely available multimedia files. Beyond this blindness to a treasure of data, studies and projects dedicated to the topic may also go unrecognised as such due to a lack of common terminology. Therefore, we would like to establish the term secondary data as proposed by Callaghan et al. (2021) or at least stimulate a discussion about terminology, so that a corresponding field of research can grow.
It should be clear to the community that this approach applies not only to data from citizen science, social media or webpages, but also to data collected by scientists in the field or in the laboratory.
The multiple benefits demonstrated here should convince people to make the (raw) data available to the public according to the FAIR principles, be it via GBIF, GitHub or other openly accessible repositories. As with all new and innovative methods, a transition period will be necessary before this approach is fully integrated into the research toolkit. Again, we draw comparisons with iEcology, culturomics and citizen science in that secondary data are utilised in a complementary and supportive way to other data sources and verified with ground truthing. Efforts to explore and use secondary data, although still in their early stages, are already demonstrating many ways to enrich and inform biodiversity research.

Figure 2 illustrates how secondary data can add contextual dimensions to primary species observations, thereby mitigating 'biodiversity blindness' by expanding on the information in citizen science multimedia records beyond geographical locations and time.

human-nature interactions such as bat handling during the COVID crisis (Van der Jeucht et al., 2021), to classify marine habitats using image backgrounds (Bolt et al., 2022), and to identify plants visited by hummingbirds (Marín-Gómez et al., 2022). Secondary data from citizen science are often combined with iEcology or culturomics data sources (Jarić et al., 2020) or museum collections (Box 2). Examples include a dietary study of African snakes (Maritz & Maritz, 2020), arthropod parasitism by hairworms (Doherty et al., 2021) and the distribution of anther-smut disease in the Caryophyllaceae plant family (Kido & Hood, 2020).

AUTHOR CONTRIBUTIONS
Nadja Pernat conceived the ideas and planned and facilitated a 3-day workshop in November 2022 that was attended by all authors; Nadja Pernat, Yuval Itescu, Jasmijn Hillaert, Cristina Preda and Marina Golivets researched the reviewed articles; Nadja Pernat, Jasmijn Hillaert and Susan Canavan created the visualisations; Nadja Pernat, Quentin Groom and David M. Richardson led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.