Species traits influence the process of biodiversity inventorying: a case study using the British butterfly database

The description of how biological information is compiled over time is essential to detect temporal biases in biodiversity data that could directly influence the utility, comparability, and reliability of ecological and biogeographical studies. We explore trends in species recording over time using one of the most spatially and temporally comprehensive country-level databases for any group of insects in the world – the database of butterfly occurrences from Great Britain. Firstly, we used two crucial milestones (the year in which the taxonomic inventory is complete, i.e., when the last species was recorded, and the year in which all species are recorded together for the first time) to delimit three main phases in the process of biodiversity recording (taxonomic, faunistic and exhaustive phases). Secondly, we aimed to quantify how far species features (attractiveness and detectability) influence the process of recording through time. During the first stage of biodiversity compilation, when the main aim is to complete the taxonomic inventory (taxonomic period), entomologists tend to record attractive species more frequently. However, once the inventory is complete, particularly in the period during which more spatially and temporally comprehensive information about species distribution is amassed (the exhaustive period), the recording pattern clearly shifts towards more detectable species. Common, highly detectable species are undersampled in the first phase of biodiversity data compilation and oversampled in the final stages. Awareness of such temporal patterns in recording is necessary in order to correctly interpret and address bias in insect biodiversity trends.


Introduction
Spatial and/or temporal biases and gaps in biodiversity data can directly influence the utility, comparability, and reliability of ecological and evolutionary studies (Hortal et al., 2007). Although growing concern about biodiversity loss underscores the need to quantify and understand temporal changes (Dornelas et al., 2013; Tessarolo et al., 2017; Cardoso & Leather, 2019; IPBES, 2019; Montgomery et al., 2020), existing research on biodiversity data quality has typically focused on spatial biases in biodiversity databases (e.g., Pardo et al., 2013; Ruete, 2015; Lobo et al., 2018; Shirey et al., 2021). The description of how biological information is compiled over time is essential to understand which factors influence species recording processes through time and, thus, accurately quantify biodiversity trends (Cabrero-Sañudo & Lobo, 2003; Boakes et al., 2010; Costello et al., 2015; Isaac & Pocock, 2015).
Discriminating the successive stages of biodiversity recording using a high quality database helps to avoid, or at least to be aware of, future bias in recorded species traits in other less complete biodiversity databases. This knowledge would enable scientists and conservation organisations to assess the workforce needed to achieve reliable biodiversity information. It is especially important in the case of insects, which despite being one of the most diverse and functionally important animal groups, are greatly under-sampled worldwide (Leather et al., 2008; Cardoso et al., 2011). Indeed, to date, most studies of faunistic databases have reported a dearth of complete and extensive inventories for insect taxa (e.g., Romo et al., 2006; Sánchez-Fernández et al., 2008; Santos et al., 2010; Bruno et al., 2012; Ballesteros-Mejia et al., 2013; Fattorini, 2013; Lobo et al., 2018). Within insects, diurnal Lepidoptera are affected by undersampling to a lesser degree than other taxa (Troudet et al., 2017), in all likelihood due to their relatively large size and aesthetic appeal, and, as a result, are a particularly useful taxonomic group for investigating patterns of biodiversity recording.
In this study, we explore trends in how species are recorded over time to evaluate temporal biases in biodiversity data compilation using one of the most spatially and temporally comprehensive country-level databases for any group of insects in the world (Sánchez-Fernández et al., 2021): the Butterflies for the New Millennium (BNM) database of Great Britain (GB). This database contains more than 10 million species occurrence records gathered during more than 200 years for a butterfly fauna of 58 current breeding butterfly species. Dennis et al. (2006) found that more apparent butterflies were recorded earlier than those that are less conspicuous. Here, we go further by examining the temporal consistency and heterogeneity of the BNM database. We aim to identify some of the main foci and tendencies of entomologists and recorders driving the biodiversity inventory process by examining temporal trends in the number of records in relation to species traits. Specifically, our aims are to delimit the main periods in the process of biodiversity inventory and to quantify the relative importance through time of species traits associated with attractiveness and detectability.

Methods
We analysed data extracted from the BNM database managed by Butterfly Conservation. The BNM database comprises butterfly occurrence records (unique combinations of species × recorder × location × date) for the United Kingdom, Isle of Man and Channel Islands, many of which are the result of opportunistic, non-standardised sampling by community scientists (verified by a network of expert volunteers). The BNM database was created in 1995 but incorporates a substantial volume of historical records, particularly those gathered by a previous recording scheme that led to the first butterfly atlas of Britain and Ireland (Heath et al., 1984). The extract used for this study contained all records of the 58 breeding species in GB from 1800 to 2014 (see Fox et al., 2015; Sánchez-Fernández et al., 2021). Species that occur only as occasional, usually non-breeding, immigrants were excluded. Species that have become extinct in GB were also removed (including the reintroduced Phengaris arion) so that temporal trends could be constructed for species observed throughout the complete study period, although the data for Nymphalis polychloros were retained since it may have recently recolonised GB naturally after several decades of absence (see Fig. S1 in the Supporting Information). The final extract contained a total of 10 046 366 records (see Data S1 in the Supporting Information).

Measures of butterfly attractiveness and detectability
Two approaches utilising species traits were used to assess data collection over time, here referred to as 'attractiveness' and 'detectability' (see Data S2 in the Supporting Information). Attractiveness values were obtained from Dennis et al. (2006). These authors used a scoring system in which the mean subjective perception of visual apparency (conspicuousness) was determined independently by 13 researchers who work on butterflies; each researcher assigned an apparency score ranging from 1 (low conspicuousness) to 5 (high conspicuousness) to each species, and the results were calibrated against image analysis of the wing surfaces for a subset of the species (males, n = 50, Pearson correlation coefficient r = 0.89; females, n = 42, r = 0.87; P < 0.0001) (for details, see Dennis et al., 2006).
To account for butterfly detectability, we compiled information on several trait categories that account for species functional and ecological characteristics that can be used as proxies for detectability (see Supporting Information Table S1): larval diet breadth, adult feeding, type of roost and rest sites, overwintering life cycle stage, voltinism, mobility (flight period and mobility capacity), and occurrence in different habitats (Dennis, 2010; Middleton-Welling et al., 2018). These detectability traits included continuous, ordinal, multiple-choice and fuzzy-coded traits and comprise 55 specific categories (see Supporting Information Table S1). For multiple- and fuzzy-coded traits, different states were organised into various columns representing different affinity categories (e.g. overwintering phase was divided into 'egg', 'larvae', 'pupae' and 'adult' categories). For multiple-coded traits, one or multiple options were assigned for each species in a binary way. For fuzzy-coded traits, affinity values were distributed across the categories of each trait for each species, according to the frequency of occurrence within the species. This approach is called fuzzy coding (Chevenet et al., 1994) and entails compiling the intraspecific biological information available for each species. Before analysing data, multiple- and fuzzy-coded data were converted to percentages of affinity for each trait to obtain a standardised representation. To obtain a single value of detectability of each species, we summarised the trait space using a principal coordinate analysis (PCoA) selecting the first axis as a surrogate of detectability. This first axis explained 27.2% of the total variability (see the factor scores of each trait in the Supporting Information Table S1). Gower's coefficient was used since it provides trait-based species dissimilarity calculations when binary, numerical and categorical attributes are considered simultaneously (Pavoine et al., 2009; Maire et al., 2015).
Positive values of this factor are associated with highly detectable species, i.e., mobile, multivoltine, polyphagous species that use a broad range of nectar sources and occur in highly modified biotopes (e.g. gardens, arable crops and road banks).
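The ordination step described above can be sketched as follows. This is only an illustrative sketch, not the analysis run for this study: the four species labels, the three traits and their values are invented, the function names `gower` and `first_pcoa_axis` are hypothetical, and the leading axis is obtained by simple power iteration, which assumes that the dominant eigenvalue of the centred matrix is positive.

```python
import math

def gower(a, b, ranges):
    """Gower dissimilarity: numeric traits contribute |a - b| / range,
    categorical traits (range None) contribute 0 for a match, 1 otherwise."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r
    return total / len(ranges)

def first_pcoa_axis(dist, iterations=200):
    """Return (scores, eigenvalue) for the first principal coordinate."""
    n = len(dist)
    # Double-centre the matrix of -0.5 * d^2 (classical PCoA).
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in a]
    grand = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
         for i in range(n)]
    # Power iteration for the leading eigenvector (assumed positive eigenvalue).
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eig = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [x * math.sqrt(eig) for x in v], eig

# Invented traits: (broods per year, mobility score, overwintering stage).
traits = {
    "A": (1, 1, "larva"),
    "B": (1, 2, "larva"),
    "C": (3, 8, "adult"),
    "D": (2, 9, "adult"),
}
ranges = [2.0, 8.0, None]  # observed range of each numeric trait
names = sorted(traits)
dist = [[gower(traits[p], traits[q], ranges) for q in names] for p in names]
scores, eigenvalue = first_pcoa_axis(dist)
```

The sign of a principal coordinate is arbitrary, so in practice the axis would be oriented (as in the text) so that positive values correspond to the mobile, multivoltine species.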

Data analysis
For each year, we calculated both the number of records collected and the number of species recorded. We also estimated the cumulative number of butterfly species observed in consecutive years, as well as the proportion gained annually over the total number of butterfly species. These data were used to delimit two main temporal points that could be considered crucial milestones to define phases in the process of biodiversity recording: (i) the year in which the taxonomic inventory of GB is complete, i.e., when the last species was recorded, and (ii) the year in which all the species (58 butterflies) were recorded together for the first time. These two years delimit three key temporal phases in any biodiversity database: a first period that relates to the establishment of the national inventory (hereafter 'taxonomic period'), a second in which species occurrence data are accumulated but at a relatively slow rate, mainly devoted to establishing the distributional range of each species (hereafter 'faunistic period'), and a third period in which more spatially and temporally comprehensive information about species distribution is amassed (hereafter 'exhaustive period').
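As an illustration of how the two milestones can be derived from occurrence records, a minimal sketch is given below. The function names `recording_milestones` and `phase` and the toy records are hypothetical; the only assumption is that each record can be reduced to a (year, species) pair.

```python
def recording_milestones(records, n_species):
    """records: iterable of (year, species) pairs.
    Returns (year the cumulative inventory is complete,
             first year in which all species are recorded together).
    Either value is None if the milestone is never reached."""
    by_year = {}
    for year, species in records:
        by_year.setdefault(year, set()).add(species)

    seen = set()
    inventory_complete = None
    all_together = None
    for year in sorted(by_year):
        seen |= by_year[year]  # cumulative species list
        if inventory_complete is None and len(seen) == n_species:
            inventory_complete = year
        if all_together is None and len(by_year[year]) == n_species:
            all_together = year
    return inventory_complete, all_together

def phase(year, inventory_complete, all_together):
    """Assign a year to one of the three recording phases."""
    if year <= inventory_complete:
        return "taxonomic"
    if year <= all_together:
        return "faunistic"
    return "exhaustive"

# Toy data set with three species.
toy = [(1800, "a"), (1810, "a"), (1810, "b"), (1820, "c"),
       (1830, "a"), (1830, "b"), (1830, "c"), (1840, "a")]
complete_year, together_year = recording_milestones(toy, n_species=3)
```

In the toy data the cumulative inventory is complete in 1820 and all three species are first recorded in the same year in 1830, so 1825 falls in the faunistic period and 1840 in the exhaustive period.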
Multiple linear regressions were used to quantify the influence of attractiveness and detectability in determining the number of records accumulated for each species in each of the main periods of biodiversity inventory and over the whole time series. Before analysis, we log-transformed response variables (records collected during taxonomic, exhaustive and total periods) to reduce skewness. We displayed back-transformed fitted values to represent relationships between variables better. For each model, we obtained partitioned variance for detectability and attractiveness using the R package variancePartition (Hoffman & Schadt, 2016). This package calculates variance fractions based on the sum of squares explained by each predictor, i.e. the total amount of variance explained by each factor (unique plus attributable shared fraction). All regression models were validated by visually checking their residuals for normality and homoscedasticity.
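The regression step can be illustrated with the sketch below. It is not the code used for this study and does not reproduce the variancePartition calculation: it fits an ordinary least squares model to log-transformed record counts and, as a simple stand-in, reports the drop in R² when each predictor is removed. All data are synthetic (records are generated so that they depend only on detectability), and the helper names `solve` and `ols` are hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(predictors, y):
    """Fit y = b0 + b1*x1 + ...; return (coefficients, R squared)."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

# Synthetic species: records depend on detectability only.
attractiveness = [1, 3, 2, 5, 4, 2]
detectability = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
records = [math.exp(2.0 + 1.5 * d) for d in detectability]
log_records = [math.log(r) for r in records]  # log-transformed response

beta_full, r2_full = ols(list(zip(attractiveness, detectability)), log_records)
_, r2_attr_only = ols([(a,) for a in attractiveness], log_records)
_, r2_det_only = ols([(d,) for d in detectability], log_records)

unique_detectability = r2_full - r2_attr_only
unique_attractiveness = r2_full - r2_det_only

# Back-transformed fitted value for a species with detectability 1.0
# and attractiveness 3 (hypothetical values).
fitted_records = math.exp(beta_full[0] + beta_full[1] * 3 + beta_full[2] * 1.0)
```

Because the synthetic records were generated from detectability alone, almost all of the explained variance is attributed to that predictor; real data would also need the residual checks described in the text.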
All analyses were conducted using R version 3.5.3 (R Development Core Team, 2019).

Results
The BNM database extract contained butterfly species occurrence records across a period of 215 years. Most records are recent, as 50% were gathered during the last 11 years of the study period (2004-2014) (Fig. 1; Fig. S1 in the Supporting Information).

Main periods in the process of biodiversity inventory
The complete GB butterfly inventory defined for this study was established in 1877 (Figs. 1 and 2). The last species to be recorded, Thymelicus lineola, was not recognised until 1889, but earlier specimens were subsequently identified in collections. The first year in which all the GB butterfly species were recorded at the same time was 1939 (Figs. 1 and 2). Thus, the three periods were delimited as follows: the 'taxonomic period' (up to 1877), the 'faunistic period' (from 1878 to 1939), and finally, the 'exhaustive period' (from 1940 to 2014).

Bias related to species features captured over time
Detectability was the most important variable in explaining the total number of records collated for each species (R² = 55.4%; P < 0.01), with the number of records rising exponentially with increasing detectability (Fig. 3). On the other hand, butterfly attractiveness had a very low (R² = 0.2%; P = 0.66) association with the cumulative number of records per species (see Fig. 3). However, the importance of detectability and attractiveness in explaining the variability in the number of records of each species changed over time (see Table 1 and Fig. 3). In the taxonomic period, attractiveness showed a weak, but significant, positive relationship with the total number of records of each species in the database (R² = 6.5%; P < 0.05). In contrast, detectability showed a weak, significant negative correlation with the number of records (R² = 4.2%; P < 0.05). After the taxonomic period, no significant relationships were found between attractiveness and number of records. There was a relatively strong positive linear relation between detectability and the number of records in the faunistic period (R² = 30.9%; P < 0.01), a relationship that increased in strength during the exhaustive period (R² = 54.7%; P < 0.01) (see Table 1; Fig. 3).

Discussion
The butterflies of Great Britain are probably the best-studied insect group in the world with the largest amount of data per species and the longest record of observations. The temporal patterns derived from the study of such an intensively recorded, long-term dataset allow us to discriminate several temporal phases in which the attractiveness and detectability traits of species variably affect recording. Our results highlight the importance of species characteristics in the historical process of biodiversity inventorying and may shed light on the future prospects of initiatives compiling information about other (including hyperdiverse) taxonomic groups around the world.
Fig. 3. Relationship among species traits (detectability and attractiveness) and the sampling effort (number of records) carried out in each one of the three periods of the database inventory process (taxonomic, faunistic and exhaustive) and for the total time period considered. Blue lines represent fitted values for significant relationships (P-value ≤ 0.05; see Table 1).
We chose two different points in time to exemplify the main phase changes in the process of biodiversity inventory. Using these time points, we identified three distinct periods in the process of accumulating butterfly biodiversity knowledge in GB. The first is an initial taxonomic phase in which the efforts of naturalists were focussed on the discovery of new uncollected species. Although this process is typically unplanned and uncoordinated, the result of the individual actions of naturalists has the effect of making the discovery of new species more and more difficult. The attractiveness and charismatic character of the species seem to exert a moderate role in recording during this phase. This period is followed by another in which the effort of the naturalists is progressively directed towards gathering more and more information about the distribution of the species (biogeography after taxonomy). Here, the conspicuousness of different species does not seem to influence recording. Instead, detectability, assessed using species traits as proxies, becomes highly relevant. Thus, a greater number of records are gathered for polyphagous species, those that are bivoltine or multivoltine, have a high dispersal ability, and frequently occur in highly modified land-use types (such as road and river banks, gardens and arable crops) (see Supporting Information Table S1). Finally, a third phase of exhaustive study (intensive surveying, encouraged by organised recording schemes) can be differentiated, in which species detectability is even more strongly related to the number of records of each species, as occurs in the selection of species for conservation programmes (Gunnthorsdottir, 2001; Leandro et al., 2017). In the taxonomic period, the most recorded species were Euphydryas aurinia, Nymphalis polychloros, Leptidea sinapis and Boloria euphrosyne, i.e. species with a relatively high attractiveness, while in the exhaustive period, the species accumulating the greatest numbers of records are Maniola jurtina, Pieris rapae, P. brassicae, Pararge aegeria, Aglais urticae and A. io, the most abundant, widespread and easily detectable species.
Our results indicate a change in the recording process once the national inventory is taxonomically complete. It can be reasonably expected that the progressive difficulty of finding new species in a country or region and the accumulation of taxonomic resources (keys, collections, journals, taxonomical revisions, etc.) would increasingly facilitate efforts towards the completion of regional inventories (Brunbjerg et al., 2019). The variables used here as surrogates of detectability positively correlate very closely with the geographical range, the occurrence on islands, and the population size of GB butterfly species (Dennis et al., 2000, 2004). Thus, after a first phase of progressive increase in taxonomic knowledge not influenced by the differential detectability of the species, a second phase begins in which the data accumulation is biased towards the most detectable species. The accumulation of observations in 'rare' or 'common' species could be an indicator of the stage of the biodiversity recording process. Identifying 'common' and 'rare' species according to their 90th or 10th percentile distribution in the number of records and occupied 5' cells leads to the observation that the ratio between the records of both types of species increases notably throughout the time period studied (Fig. 4).
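The common/rare ratio described above could be computed along the following lines. This is a simplified sketch under stated assumptions: species are classified using the total number of records only (the number of occupied 5' cells is ignored), percentiles use linear interpolation, and the function names and toy counts are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def common_rare_ratio(counts_by_year):
    """counts_by_year: {year: {species: number of records}}.
    Returns (common species, rare species, {year: common/rare record ratio})."""
    totals = {}
    for counts in counts_by_year.values():
        for sp, n in counts.items():
            totals[sp] = totals.get(sp, 0) + n
    upper = percentile(list(totals.values()), 90)
    lower = percentile(list(totals.values()), 10)
    common = {sp for sp, n in totals.items() if n > upper}
    rare = {sp for sp, n in totals.items() if n < lower}
    ratios = {}
    for year, counts in counts_by_year.items():
        c = sum(n for sp, n in counts.items() if sp in common)
        r = sum(n for sp, n in counts.items() if sp in rare)
        ratios[year] = c / r if r else None  # undefined without rare records
    return common, rare, ratios

# Toy data: one common species "c", one rare species "r", nine others.
early = {"c": 2, "r": 1}
late = {"c": 1000, "r": 4}
for i in range(1, 10):
    early["f%d" % i] = 2
    late["f%d" % i] = 10 * i + 10
common, rare, ratios = common_rare_ratio({1850: early, 2000: late})
```

In the toy data the ratio rises from 2 in the early year to 250 in the late year, mimicking the increase in the share of records contributed by common species.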
At the end of the taxonomic period, the mean number of records per species in the database was 34, and at the end of the faunistic period it was 434.2. By 2014, this value had increased to 173 213. The faunistic and exhaustive phases could be difficult to discriminate and could in fact be a systematic process in which particular idiosyncratic events can be determined (Lee et al., 2005). For example, there is a quirk relating to 1939 in the BNM data. When historical data were being added during the 1970s and 1980s, sources that described species occurrence as 'before the war' were assigned to the year 1939. Thus 1939 could be an outlier in terms of species occurrence records that determine, in our case, the first year in which all British species were recorded. However, this fact does not greatly change recording trends as the following years showed similar results. On the other hand, even in this case, the taxonomic period might not be quite over, considering the impact that new molecular techniques are having on butterfly taxonomy in Europe (Dinca et al., 2011; Hernández-Roldán et al., 2016). Indeed, Prof T.G. Shreeve and colleagues may well have found a unique Polyommatus lineage in the Outer Hebrides (Arif et al., 2020). Also, we expect new butterfly species to arrive in GB with climate change (Harrison et al., 2006; Sánchez-Fernández et al., 2021). Be that as it may, temporal changes in the common/rare ratio of species seem to suggest that faunistic and exhaustive phases could be discriminated by the existence of a turning point in this ratio which marks the beginning of the recording increase of 'common' (i.e. abundant and widespread) species against 'rare' ones. Analyses of other species occurrence data sets are required to determine the usefulness of this pattern in identifying the level of biodiversity knowledge in a country or region.
Recently, the evidence and potential consequences of declines in insect biodiversity have aroused great scientific, media and public attention (Goulson, 2019; Habel et al., 2019; Cardoso et al., 2020; Warren et al., 2021), with some studies calling for immediate policy responses (Forister et al., 2019; Harvey et al., 2020). However, noting the heterogeneity of insect responses, other studies have highlighted the need for more data (Montgomery et al., 2020; Wagner et al., 2021) to avoid risky overextrapolation from the limited current evidence (Didham et al., 2020). In order to estimate insect trends accurately and thus better understand the full extent of global biodiversity change, it is essential to know the current state of each biodiversity database, as some comparative results could be simply a consequence of sampling bias. Here, we demonstrate that common, highly detectable species are undersampled in the first phase of biodiversity data compilation and oversampled in the final stages. From a general point of view, it should be noted that currently the great majority of biodiversity databases in the world are still in the initial stages of data compilation (Ball-Damerow et al., 2019), so these distributional databases, especially in the case of insects, are very likely in the taxonomic or, at best, in the faunistic periods of data collection. Awareness of such temporal patterns in recording is necessary in order to interpret correctly and address bias in insect biodiversity trends.
Fig. 4. Temporal changes in the logarithm of the number of records of 'common' (blue circles) and 'rare' (orange circles) butterfly species in Great Britain, and temporal change in the ratio of common/rare species (grey circles). Common species are defined as those above the 90th percentile in the number of records or in the number of 5' occurrence cells considering the whole data. Rare species are those below the 10th percentile in these same variables. The dashed line represents the LOWESS fit of the ratio variable. Arrows indicate the two main temporal points delimited in the process of biodiversity recording.