A weighting method to improve habitat association analysis: tested on British carabids

Analysis of species’ habitat associations is important for biodiversity conservation and spatial ecology. The original phi coefficient of association is a simple method that gives both positive and negative associations of individual species with habitats. The method originates in assessing the association of plant species with habitats, sampled by quadrats. Using this method for mobile animals creates problems as records often have imprecise locations, and would require either using only records related to a single habitat or arbitrarily choosing a single habitat to assign. We propose and test a new weighted version of the index that retains more records, which improves association estimates and allows assessment of more species. It weights habitats that lie within the area covered by the species record with their certainty level, in our case study, the proportion of the grid cell covered by that habitat. We used carabid beetle data from the National Biodiversity Network atlas and CEH Land Cover Map 2015 across Great Britain to compare the original method with the weighted version. We used presence-only data, assigning species absences using a threshold based on the number of other species found at a location, and conducted a sensitivity analysis of this threshold. Qualitative descriptions of habitat associations were used as independent validation data. The weighted index allowed the analysis of 52 additional species (19% more) and gave results with as few as 50 records. For the species we could analyse using both indices, the weighted index explained 70% of the qualitative validation data compared to 68% for the original, indicating no accuracy loss. The weighted phi coefficient of association provides an improved method for habitat analysis giving information on preferred and avoided habitats for mobile species that have limited records, and can be used in modelling and analysis that directs conservation policy and practice.


Introduction
Habitat association analysis is used in determining the likely habitat requirements of individual species (Cole et al. 2010). These requirements are important, for example, in studying impacts of habitat loss and fragmentation (Maclean et al. 2011), dispersal and habitat connectivity (Brodie et al. 2016), and modelling foraging and movement over landscapes, such as in pollinator models (Lonsdorf et al. 2009) and conservation prioritisation (Pouzols and Moilanen 2014). Such analyses are particularly important when planning landscapes for conservation: for example, in assessing the impact of adding a patch of habitat for certain species, it is also necessary to understand which species avoid that habitat. Lawton et al. (2010) highlight that the approaches available for designing ecological networks are limited by the availability of evidence, usually using expert consensus. Habitat association analysis contributes to this evidence base.
Searching the literature for habitat association or preference returns many examples of species distribution models (SDMs) and indicator species analysis. Examples of analysis looking at preference of a species to each of several alternative habitats are returned less often. For example, SDMs predict where species are likely to be found within a landscape, with habitat type being only one factor (De Lima et al. 2016). Indicator species analysis identifies species that best represent a habitat or group of habitats, and is used in monitoring habitat condition (Hill et al. 1975, De Gasperis et al. 2016. Direct analysis of which habitats a species prefers and which it avoids, which is particularly useful in conservation planning, are few. In this paper, we consider a direct approach to determine habitat association, which comprises the relative preference of a species for multiple habitats. Information on habitat associations is generally derived from expert knowledge (Lonsdorf et al. 2009) or analysis over a small geographic area (Ball et al. 2013, De Lima et al. 2016, Ferrão et al. 2018) and is often limited to associations with a single habitat or a few broad habitats (Webb et al. 2017). Large-scale analysis of habitat association de-emphasises the less frequent recordings of a species in a habitat in which the species is transient, which could be misconstrued at a small-scale as association. Although SDMs (Petit et al. 2003, Phillips et al. 2006, Porto et al. 2018) and indicator species analyses (Hill et al. 1975, Gardner 1991, Ricotta et al. 2015 are often done over large scales, this is rare for analysis of the preference of a species . Exceptions are Eyre and Luff (2004), who used ordination to study habitat preferences of carabids in North East England and the Scottish Border, and Redhead et al.(2016) who used general linear mixed effects models to study butterfly habitats across Britain. Eyre and Luff (2004) used ordination in a straightforward way, giving each carabid species a weighted average from positive to negative for each habitat. They did, however, point out that care should be taken in interpreting their findings due to some anomalous results. Redhead et al. (2016) used the coefficients from their model to derive associations. Their method worked well, albeit with large variation in the associations within individual species, but needed approximately 5000 records to ensure accuracy. They used this approach, as other methods required more precise locations information than the 1 km they used. De Cáceres and Legendre (2009) created a framework for ecologists explaining when to use IndVal or an alternative, the Phi coefficient of correlation (Pearson 1896). We focus in this paper on the Phi coefficient of correlation, ("correlation index") which like IndVal is simpler than ordination. Unlike IndVal, the correlation index gives a negative association value when a species appears to avoid a habitat, and uses species' absences to provide extra information (De Cáceres and Legendre 2009). The Phi coefficient gives degree of preference for a habitat compared to other groups. By contrast IndVal assesses how much the target site group matches a set of sites where the species is found and is an indicator species analysis. The correlation index was created by Karl Pearson (1896) and at its simplest is the binary version of the Pearson's correlation (De Cáceres et al. 2008). It is the preferred method in plant science for calculating site fidelity (De Cáceres and Legendre 2009), but has not been adopted more generally despite De Cáceres and Legendre's (2009) framework. The index uses two binary vectors to describe a location: one representing presence or absence of the species and the other whether a location is the habitat of interest. The index does not incorporate uncertainty in the habitat of the location, it is either habitat or not. Species records often have a degree of uncertainty, particularly concerning the spatial resolution of the record. The area covered by the resolution of the record may contain multiple habitats. The binary nature of the correlation index requires either removal of mixed or uncertain habitat data or a judgement as to which habitat to assign. While this might be considered as an error in the record, movement of individuals from preferred into adjacent less-preferred habitats is common (Ries et al. 2004), and so the precise location in which a mobile individual is found may not be in a preferred habitat. To incorporate these issues, we propose a new version of the correlation index, adding a third vector to each record, which is a weighting based on the certainty of the habitats at a location. We present the weighting as the proportion of a Accepted Ar ticle 'This article is protected by copyright. All rights reserved.' particular habitat in 100 m grid cell. However, the weighting could be the probability of correctly classifying a habitat from remote sensing or a combination of weightings. In this paper, we present our weighted version of the correlation index and test it against the original version using a case study of carabid beetles of Great Britain. We also carry out a partial validation of the correlation indices using qualitative data from species descriptions. The analysis uses records from the UK National Biodiversity Network (NBN) Atlas (2018) and Centre for Ecology and Hydrology (CEH) Land Cover Map 2015 (LCM2015) (Rowland et al. 2017a). We used a method that considers the number of other species within the family found at a location as proxy for survey effort (Hickling et al. 2006, Redhead et al. 2016. We use an absence threshold of 14 carabid species and conduct a sensitivity analysis of the threshold value. Most species have fewer than 1000 records. We, therefore, ascertain how many records are required to give a valid estimate of habitat association.

Correlation indices
The original correlation index uses binary presence-absence with each location assigned to one group Where is the total number of locations ( ), the number of locations with the habitat of interest ( ∑ ), is total number of occurrences across all locations ( ∑ ), and is the number of occurrences in habitat of interest ( ∑ ). In the event that a location is not a point location and instead covers an area, a location could contain more than one habitat. For example, in location 4 (Table 1), an area location contains acid grassland (2%), inland rock (59%) and heather (39%). We do not know in which habitat the species was found, therefore when calculating the original index, either only locations that contain a single habitat could be included or a habitat would need to be chosen. We might choose to discard all locations with more than one habitat. This would leave locations 1, 2, 3, 5, and 7 in Table 1. The carabid species of interest is then either present or absent within that single land cover type. Using this approach can remove a large proportion of the data, sometimes making a species unviable for analysis. Another way of conducting the unweighted analysis would have been to choose the habitat covering the largest proportion of the 100 m location; a version of the analysis doing this can be found in Appendix 1. To allow the use of a larger proportion of the data, we created a weighted version of the index (equation 2).
This version added a third non-binary vector of the weighting of each habitat at each location ( ). This weighting could be any by which each location sums to one (for example land cover classification certainty) but we used the proportion of each habitat. All three vectors have length . is still the total number of locations( ∑ )r, and is still the total number of occurrences across all locations( ∑ ). The values of and are the same as they would be if each of the locations only had a weighting of one (a single habitat in our example). The and values change however, now denoted as and . These can be calculated as ∑ (lower than a hypothetical would be) and ∑ (smaller than a hypothetical ). So for only the data in Table 1 (assuming no threshold was applied) and with Inland rock being the habitat of interest; 0.59 + 1 = 1.59, = 0.59 + 1 + 1 = 2.59, N = 7 and n = 2 and therefore using the equation for the weighted correlation index (equation 2) gives; The weighted version balances the reduced terms in the numerator of the equation with in the numerator and denominator meaning the equation still gives both positively and negatively correlated habitats. If all of the locations within the analysis are certain (one habitat), the weighting is 1 and the Accepted Ar ticle 'This article is protected by copyright. All rights reserved.' result is the same as the original correlation index. See Appendix 2 for matrix representation of the data and other equations.
We calculated both the original and the uncertainty-weighted correlation index ϕ values and permutated (De Cáceres and Legendre 2009) to get a p-value for each habitat and for each carabid species. See De Cáceres and Legendre (2009) for additional considerations when conducting permutation tests.

Data
We used the large volume of carabid (Coleoptera: Carabidae) location records and high-quality land cover data available in Great Britain.

Carabid data
The National Biodiversity Network (NBN) atlas (2017) contains presence records for many species, at 100 m resolution resulting from the six digit Ordnance Survey grid reference (Telfer 2006). We downloaded all records of carabid locations from the NBN atlas on the 7/11/2017 and initially selected those above an arbitrary threshold of at least 10 records (268 species). We converted the coordinates into 100 m grid cells, with the coordinates representing the bottom left corner, using ArcGIS (v 10.4.1 © 2016 ESRI, Redlands, California). NBN species names were checked and synonyms corrected using the Natural History Museum UK species inventory checklist (Raper 2014). Remaining synonyms were corrected using the checklist in Luff (2007). These steps increased the number of records for species with accepted names on these checklists. The NBN does not include absence data. A species cannot be considered to be absent from all locations where it is not recorded. To allow us to have confidence that a species was genuinely not at a particular location, we counted the number of other species found in each location as a measure of survey effort. Following Hickling et al (2006) we considered a location to be a true absence if it had more records than a threshold number of other carabids. The threshold number of species is arbitrary. For butterflies, Redhead et al. (2016) used a value of 10% of the species pool (5 species). Using 10% of the carabid species would have required 28 or more species, giving only 94 locations across Britain. We used a threshold of 14 species (5%) giving 556 potential absence grid cells and conducted a sensitivity analysis of this value. Absence locations for a species are the remainder of these 556 grid cells after removing those containing the species of interest.

Land cover
We used the vector LCM2015 for Great Britain (Rowland et al. 2017a) to provide habitat data. LCM2015 contains 21 land cover classes based on the UK Biodiversity Action Plan Broad Habitats (Jackson 2000). These classes are assigned to Ordnance Survey Master Map polygons using a Random Forest object-based classification of satellite Landsat-8 (30 m resolution) and AWIFS data (60 m resolution) (Rowland et al. 2017b). Polygons smaller than 0.5 ha or less than 50 m in width are merged into neighbouring polygons. This can remove linear habitats such as those within freshwater, only capturing larger water bodies and wide rivers (Rowland et al. 2017b). We intersected LCM2015 data with the 100 m NBN squares and calculated the proportion of each habitat at each location. In principle, one might include temperature or altitude, or group land cover classes. Analysing a large number of alternative habitats can lead to a loss of power. Therefore, if dividing some habitats, others should be amalgamated. But here, for simplicity, we used the LCM2015 classes as they are without further classification.

Validation data
To allow validation of both weighted and original correlation indices we used information from Luff, (2007) "The Carabidae (ground beetles) of Britain and Ireland". Luff (2007) is a comprehensive text on British carabid identification including descriptions of where the species might be found. We used only habitat preferences within the British Isles due to differences in associations to other parts of Europe (Eversham andTelfer 1994, Desender et al. 2005). Luff, (2007) stated the preferred habitat for each carabid species in a descriptive way; for example, "In most habitats, especially agricultural fields, gardens and other disturbed, open and dry situations" (p. 68, Trechus quadristriatus). Luff (2007) did not create the book as a database of species associations. It was, therefore, necessary to convert the text into a database against which we could compare our analyses. We developed a method using as little subjective interpretation as possible. We looked at all words in the descriptions in Luff (2007) of habitat and picked out those words naming a habitat. We then translated these, into either an individual or group of LCM2015 habitat classes. For example, "moorland" in Luff was translated as including Inland rock (in LCM2015 documentation included under "Mountain, heath, Accepted Ar ticle 'This article is protected by copyright. All rights reserved.' bog" (Rowland et al. 2017b)), Acid grassland, Heather grassland, and Heather & Bog. Where Luff's habitat descriptions represented a group of land covers, the group was included in the database as an aggregate class against which to check the analysis. For a table showing a full list of the words used and resulting LCM2015 habitat classes and aggregates (see Table S1).

Analyses
The NBN data contained a separate record for each species at each relevant 100 m location meaning that individual locations appeared multiple times. We created a version of the data with each location represented once, giving presence or absence (absences determined as described above) for each species at that location. We created this wide format version by using an R script to go through each location and assign a new binary column of presence for each species. Table 1 shows an example of the data after preprocessing. The correlation index and permutations of the analysis, for each species and versions of the method, were processed using the JASMIN cluster (Lawrence et al. 2013). The R scripts for all analyses are given in Appendix 3.

Sensitivity analysis
We conducted a sensitivity analysis of threshold number of species used to define absence locations by using Spearman's rank correlation to determine to what extent the order of the habitat associations from positive to negative ϕ changed using seven (2.5% of the total species number) and 28 (10%) species number thresholds compared with the baseline of 14 . We also compared the order of habitat associations from positive to negative ϕ between the weighted and original index for each species using Spearman's rank correlation.

Validation
The correlation index results for each carabid species were validated by comparing them to the database created from Luff (2007) (section 2.3). For each species, we calculated the percentage of "Luff habitats" that were also found to be significantly (p-value ≤ 0.05) and positively associated habitats in our correlation analysis for that species.

Results
By allowing the use of locations containing more than one habitat, the weighted index used more records for each species and therefore included 52 extra species; 19% more. For example, for Bembidion prasinum the original method only included 14 records, but the weighted method used 79 records. Luff (2007) describes this species as living in shingle near running water. The original method did not include freshwater at all due to a lack of records. The weighted method associated the species most strongly with freshwater. Comparing the rank of the habitats based on their Phi score for the weighted and original analyses for this species using Spearman's rank, the rho value was only 0.62. The species that have far fewer records in the original than the weighted version, like B. prasinum, drove the average correlation down. In most cases where both species had many records, the rank correlation was higher. One exception to high correlation with many records is Curtonotus aulicus that had 106 original and 258 weighted records. The original version had freshwater non-significantly (p = 0.392) positive despite this being described as a dry habitat species (Luff 2007). The weighted analysis of C. aulicus had freshwater as the habitat most significantly (p = 8.00 × 10 -04 ) avoided.

Validation
Using the 14 species threshold for absence, the original version had 207 and the weighted version 264 species with at least one significant habitat association. Furthermore, the weighted and original indices gave similar ranked habitat associations, with the average Spearman's rank correlation 0.82 (SE 0.008) between the two indices. That is not to say however, that significant results sensibly described the habitat of the species. We, therefore validated the correlation results against the database created from Luff (2007). Considering the average (across species) percentage match of our analyses to Luff habitats, the original analysis identified on average 68% (using 187 species) of Luff habitats and the weighted analysis 70% (using 239 species). This is not a great deal more on average, but does include more species. In the original version, all of the Luff habitats were identified for 94 species and at least one Luff habitat for 157 species. In the weighted analysis, all of the Luff habitats were selected for 126 species and at least one Luff habitat for 205 species. Comparing with Luff (2007), the weighted version matched 18 species less Accepted Ar ticle 'This article is protected by copyright. All rights reserved.' well than the original version, 141 matched as well, and 28 matched better. Overall, using only the species analysed using both methods, the weighted version matched 6.8% on average better compared to the original version. Fig 1 shows the graphical comparison of the two versions of the index. The weighted version generally gave a slightly higher percentage matches for species with a moderate to large number of records, and included more species with few records.

Individual species examples.
Here we give examples showing comparisons between the original and weighted version of the index, the improvement using the weighted method and establishing how few records are required to give a reasonable estimate of habitat preference. For the full dataset of all carabids analysed see Appendix 4.

Original vs weighted index
Abax parallelepipedus is described by Luff (2007) as a woodland and moorland species. Due to insufficient data, the original version failed to classify three habitats, despite having 176 records, but did show a preference for woodland and heather grassland (Fig 2). The weighted method classified all habitats and captured the woodland and more of the moorland habitat types. For Acupalpus dubius neither analysis matched Luff ("In litter, moss and tussocks near fresh water"(p.175) translated as Freshwater), but may give additional information (Fig 2) as an association was found with "Fen, marsh and swamp", potentially represent the moss and tussocks of Luff's description. The analyses identifies freshwater for other waterside species (see Appendix 4), this therefore is not a consistent problem with detecting freshwater. Two examples are; Anthracus consputus and Trechoblemus micros, which both Luff and our analysis classify as freshwater species. Calathus fuscipes and Loricera pilicornis are two examples of species that matched Luff habitats better in the original than the weighted version, which failed to match open grassland and suburban respectively (Fig 3). For both species the named habitat remained positively associated in the weighted analysis, but had higher p-values, 0.16 and 0.23 respectively.

Number of records required
Species with between 10 and 35 records in the weighted analysis gave matches with an average of 66% of Luff habitats. With so few presence records, however, the analysis had less power to differentiate habitats and to detect significance. For Amara curta the analyses was not able to detect any avoided habitats and analysis failed to pick up on the heath association suggested by Luff (2007). With 50 or 60 records, as in the case of Bracteon litorale or Harpalus anxius, the analysis was more able to differentiate the individual habitats. Bracteon litorale, which Luff (2007) describes as "On bare sand and fine shingle near rivers or standing water", was associated in our analysis with broadleaved woodland and improved grassland, as well as agreeing with Luff by including freshwater. For Harpalus anxius, the analysis seemed to select the dunes of Luff's description well, with supralittoral sediments the most preferred habitat, but did not select heaths. Additionally a positive association with saltmarsh was identified, which is often near dunes (Fig 4).

Sensitivity analysis
Spearman's rank correlation values were high when comparing habitat association calculated with the threshold value of 14 to a threshold of seven or 28 (Table 2). Even comparing the seven to the 28 threshold, the rank of the habitats remained consistent.

Discussion
Our new weighted version of the Phi correlation index allowed substantially more records to be included for each species and therefore increased the number of species that could be analysed and improved the predictions of habitat association. The use of the number of species records as a proxy of survey effort was robust, being insensitive to the threshold for defining absence locations. The weighted analysis was able to give accurate results with as few as 50 records, and the use of absences enhanced the ability to determine habitat associations. Informative results using so few records are in stark contrast to other methods which require thousands of records for each species. Redhead et al. (2016) suggest that few taxa are well-enough recorded to provide so many records, our improved method will be applicable to many more taxa. For example, 35% of cerambycid beetles have 50 or more records in Great Britain (44% for carabids). Our method also gives a target for recording the rarer specialist species, whose conservation most requires an evidence base (Lawton et al. 2010).

Accepted Ar ticle
'This article is protected by copyright. All rights reserved.' As the number of records gets very large the Phi coefficient becomes the Ochiai index, which is itself related to a modified version of IndVal (De Cáceres et al. 2008). The number of records in the data we have are not large and the Ochiai index was therefore not applicable. It is possible, however, to extend both the non-equalized and group-equalised IndVal in a similar way to the phi coefficient we present in this paper by adding habitat weighting. The values still range between zero and one and the weighted version gives a value for more of the habitats. The results of weighting IndVal have not been tested, but this could be done in future research identifying indicator species. To facilitate such a test, this capability is included in our PhiCor R package. Dufrêne and Legendre (1997) used carabid data from pitfall traps to validate IndVal originally. The capture locations of all individuals were known precisely. However, besides using the weighting for imprecise locations, as presented in our case study on the phi coefficient, the weighting method could be useful in cases with precise locations for a number of the indices presented in De Cáceres and Legendre (2009). These cases include species foraging or dispersing into neighbouring habitats (McIntire et al. 2013), source-sink dynamics of plants (Kadmon and Shmida 1990), or to account for the uncertainty of land cover classification (Morton et al. 2011). It may even be worthwhile drawing buffers around record locations so as to include information on surrounding habitat. Unlike species distribution models, the correlation index does not suffer from overfitting (Breiner et al. 2015). However, as numbers of presences and absences differ between species, comparison among species is not straightforward. The maximum ϕ values vary with the number of records and are rarely comparable between species. The rank of the habitats is comparable but where two species have similar ranks for a habitat they may not have the same affinity. The number of positive habitats for each species, however, is positively correlated with the degree to which a species is categorised independently as generalist vs specialist (see Appendix 5). One possible way of increasing the comparability between species is to use the group equalised correlation index (Tichy and Chytry 2006). Beyer et al. (2010) reviewed the factors influencing habitat preference of species, arguing that species which are found more often are so because the habitat is more common. Tichy & Chytry (2006) suggested a group (habitat) equalised version of the correlation index. This version modifies several of the inputs by the number of groups. In our case, group equalising usually resulted in the same habitats having significant associations, although the ϕ values were often different. As an example, Bembidion lampros is associated in the non-equalised analysis with arable followed by conifer and urban. In the equalised analysis the same habitats are retained in the top three, but now the beetle is most associated with coniferous, urban and then arable. A weighted group-equalised version (Appendix 6) did not match the Luff (2007) validation data quite as well, but is included in the full output (Appendix 4). It should be noted that species may not be equally detectable in different habitats and therefore, where the data is available a similar equalisation could be done using detectability. The analysis we have conducted agrees to some extent with previous smaller scale studies of carabids using different analytical approaches. Eyre and Luff (2004) used constrained ordination with 126 carabid species against the proportion of 12 habitats within 1 km squares across north-east England and south-east Scotland. Some of their results agree with ours, although, as an example, their analysis suggests a higher preference of Abax parallelepipedus for inland water than broadleaved woodland. Eyre and Luff (2004) point out that some unexpected relationships of species and land covers suggest care is needed when interpreting their results and that the low eigenvalues and cumulative percentage variation suggest noisy data. Within the literature the same species is sometimes attributed to different habitats in different studies without clear information on where this association information stems from or the species' other associations. An example is Pterostichus madidus, which is variously described as inhabiting dry open, urban, moorland or grassland (Butterfield et al. 1995, Dennis et al. 2004, Angold et al. 2006, Morecroft et al. 2009), with Luff (2007 describing the species as "woodland, garden and dry grassland". Our analysis agrees with all of these habitats, suggesting the species is associated with a wide range of habitats. The method we present provides a robust method of presenting all the associations of a species, which can be used to paint a clearer picture of habitat associations. We chose in the main analysis to remove record locations with more than one habitat. Another option was to choose the most abundant habitat. We conducted a version of the unweighted analysis choosing the most abundant habitat in each 100 m square. This version matched the Luff (2007) validation less well than the unweighted version removing records (Appendix 1). This is likely due to misclassification of the habitat that the species was found in or the loss of information about which habitat individuals of the species could have been in prior to being caught. In conclusion, our new weighted method demonstrates an improvement to the Phi coefficient of association, which is simpler than ordination, requires fewer records than regression, and gives habitat preference and avoidance. Our method allows for uncertainty in the habitats associated with the record location and is ideal for mobile species, which may be found outside of preferred habitats. It utilises more Accepted Ar ticle 'This article is protected by copyright. All rights reserved.' of existing sources of data, including every habitat within a non-point location, giving quantitative information on habitat preference. Our work provides guidance on the flexible threshold defining absence records and targets for the number of records necessary to achieve a reasonable result for each species. The method is usable as-is to provide detailed data usable in conservation planning and the case study provides the carabid analysis ready to use in modelling and improving interpretation of the results of future studies. Having established the method as working for carabids, the method would benefit from further testing with different taxa.