Testing the effectiveness of the forest integrity assessment: A field-based tool for estimating the condition of tropical forest

1. Global targets to halt biodiversity loss and mitigate climate change will require protecting rainforest beyond current protected area networks. This necessitates responsible forest stewardship from a diverse range of companies and communities, who need to understand the conservation value of the forests they manage and to identify areas for targeted improvements.


KEYWORDS
ecological integrity, forest quality, forest set-aside, high carbon stock, human-modified, rapid assessment, tropical ecology

INTRODUCTION

Globally, forests are at increasing risk of degradation or conversion to agriculture as the need for food and other resources continues to rise (Hosonuma et al., 2012). Yet these complex ecosystems support high levels of biodiversity, harbour rare, threatened and endangered species, store and sequester large amounts of carbon, regulate local and global climate systems, and maintain soil, hydrological and other ecosystem services (Watson et al., 2018). Provision of ecosystem services is at its highest where forests are in the best condition, by which we mean that they more closely resemble intact, or primary, habitat (e.g. in terms of carbon: Wang et al., 2001; biodiversity: Tawatao et al., 2014; water quality: Luke et al., 2017). Continued assessment and monitoring of forests, coupled with the ongoing maintenance and enhancement of forests to improve their condition, are therefore essential if global targets on biodiversity and climate mitigation are to be met.
To curb continued deforestation and degradation, conservation initiatives incentivize companies, communities and individuals to manage forest areas outside of protected areas. Industry-based certification standards, such as the Roundtable on Sustainable Palm Oil (RSPO) and the Forest Stewardship Council (FSC), require companies to set aside and manage natural forests within their management units. The High Conservation Value (HCV) approach has been widely adopted by certification standards such as the RSPO, and by corporations aiming to conserve biodiversity and meet avoided deforestation commitments (www.hcvnetwork.org), while additional areas are now being set aside by the palm oil, pulp and paper, and cocoa sectors under the High Carbon Stock (HCS) approach (www.highcarbonstock.org). Community forest stewardship, for example via REDD+ (Reducing Emissions from Deforestation and forest Degradation) or ecotourism-based forest restoration schemes, is also becoming a frequent means of achieving forest conservation while also benefiting local livelihoods (Holck, 2008; Kunjuraman & Aziz, 2019). The widespread adoption of these schemes creates new opportunities for increased forest protection and improved management, but constraints may exist with respect to the ongoing stewardship of these vital ecosystems, which, in some cases, falls to institutions or individuals that have limited capacity in forest and conservation management.
Existing forest monitoring techniques often require a high level of technical knowledge and can be both time consuming and expensive (Gibbs et al., 2007). Over recent years, interest has grown in remote sensing (e.g. by using satellites or drones) as a means of delivering forest monitoring (Finer et al., 2018). While these techniques provide information about forest structure and biomass over large areas, they also require technical expertise, and are thus often inaccessible to smaller operations or local communities. Data derived from remotely-sensed imagery can also mask important subcanopy aspects of conservation value and disturbance (such as hunting or the presence/absence of endangered species; Green et al., 2019).
Existing community-based forest-monitoring approaches focus largely on counting and measuring trees. While relatively simple, these approaches require a degree of expert knowledge, are time consuming, and conservation managers can find it challenging to translate their outputs into information of practical relevance (Holck, 2008). Other tools, such as SMART (https://smartconservationtools.org/), focus on monitoring threats to wildlife. SMART relies on patrols and has been shown to be effective in enhancing the protection of endangered species, but it is contingent on continuous on-the-ground patrolling, places far less emphasis on wider ecosystem quality, and is only an option for well-funded and well-staffed conservation programmes (Critchlow et al., 2017; Hoette et al., 2016). Many of these techniques focus on individual aspects of the forest ecosystem, such as forest structure or specific species, and, as a consequence, encourage forest managers to maintain a narrow focus rather than a broader view of the whole ecosystem. Ecological integrity assessment methods recognize the need to understand the multiple interacting characteristics that contribute to ecosystem functioning, and thus the provisioning of key ecosystem services (Tierney et al., 2009; Wurtzebach & Schultz, 2016). However, these processes can be complex to assess and monitor, so a simpler, low-cost and rapid technique is needed to assess forest condition.
The Forest Integrity Assessment (FIA) tool assesses multiple facets of forest condition, while also addressing issues of time, resources and the need for technical expertise. The tool was developed by the High Conservation Value Resource Network (HCVRN), in partnership with the SE Asia Rainforest Research Partnership (SEARRP) in the Malaysian context (https://hcvnetwork.org/library/forest-integrity-assessmenttool/), as a rapid (< 1 hour to complete) means of conducting broad assessments of forest condition via a cheap and efficient approach that does not require expert knowledge or extensive resources. Until now, however, the robustness of scores generated by the tool has not been tested. To enable such a test, we first conducted a large-scale field trial of the survey tool, completing 967 assessor surveys across 16 sites in the dipterocarp rainforests of Sabah, Malaysian Borneo. We use these trial data to test three key performance aspects of the tool relevant to its wider application: (1) how well the scores derived from the FIA agree with independent metrics of forest condition; (2) whether scoring is affected by the background and characteristics of assessors; and (3) whether the survey is as efficient as it can be, that is, whether all questions within the survey tool discriminate effectively between sites in good or poor condition. We conclude by discussing where and how the FIA tool might be deployed as a rapid and low-cost means of assessing forest condition.

TABLE 1 Indicators of forest condition, their description and reference in the ecological literature, and the value of the R-squared statistic when compared to the FIA scores of our assessors.

The survey tool
The FIA survey tool has been adapted for different forest types globally (Lindhe & Drakenberg, 2019), and we tested the version developed for lowland dipterocarp rainforest (Lindhe et al., 2015). Until recent clearances, lowland dipterocarp rainforest was the dominant forest type across South East Asia, and it represents the modal pre-clearance forest type in areas managed by RSPO member oil palm companies.
Dipterocarp forest has also been surveyed by a substantial number of independent research projects, and we utilize the resulting datasets in this study.

Once the assessment began, the assessors were asked not to discuss answers with one another or to seek any further clarifications. The survey was conducted along a pre-designated 500 m transect at each site.

Collecting survey test data
Assessors spent a total of 1 h walking each transect and considering their answers.

Test #1: Comparison of survey test scores to independent forest condition metrics
To ascertain the ability of the survey to correctly identify sites of good or poor forest condition, we compared scores from the full survey test dataset with data from independent published studies that were conducted at, or whose spatial coverage overlapped with, our 16 test sites. We used a total of 967 assessor surveys for this test; note that not every assessor visited every site, but nearly all did. The validation data were derived from independent studies and encompassed a variety of aspects of forest condition or conservation value (Table 1, Supplementary Figure 2), namely the species richness of dipterocarp trees (Yeong et al., 2016a) and of ants (Tawatao et al., 2014), aboveground carbon stocks (Asner et al., 2018), and vegetation structure complexity, decomposition rate and various aspects of ecosystem function in dipterocarp trees (Yeong et al., 2016a, 2016b). In almost all cases, these independent data were collected on the same forest trails, plots or coordinates as our assessments. The exception was the generation of aboveground carbon estimates (which were provided as a 30 m × 30 m resolution raster), where the mean value across a circular buffer of diameter 1.5 km was extracted using the coordinates for each site. We used these datasets to calculate R-squared statistics between these components of forest condition and the survey scores generated by our assessors (Table 1).
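The R-squared statistic above is the coefficient of determination for a simple linear fit of each independent metric against mean FIA site scores. A minimal Python sketch of that calculation follows; the site-level values are fabricated for illustration only (the paper's analyses were run in R), and the function name is our own.

```python
import numpy as np

def r_squared(fia_scores, metric):
    """Coefficient of determination for a simple linear fit of an
    independent forest condition metric against mean FIA site scores."""
    fia_scores = np.asarray(fia_scores, dtype=float)
    metric = np.asarray(metric, dtype=float)
    slope, intercept = np.polyfit(fia_scores, metric, deg=1)
    predicted = slope * fia_scores + intercept
    ss_res = np.sum((metric - predicted) ** 2)   # residual sum of squares
    ss_tot = np.sum((metric - np.mean(metric)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical mean FIA scores and carbon values for six sites:
fia = [12, 20, 28, 35, 41, 46]
carbon = [40, 95, 120, 180, 210, 250]   # fabricated Mg C per ha
print(round(r_squared(fia, carbon), 2))
```

For a least-squares linear fit this statistic lies between 0 and 1, so it can be read directly as the proportion of variance in the independent metric explained by the FIA scores.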

Test #2: Examining assessor effects on survey scoring
To establish whether the FIA tool could be rolled out more widely, irrespective of the background and other characteristics of the assessors involved, we needed to understand how survey scoring may have been affected by prior knowledge and expertise. We also needed to assess the extent to which individual assessors scored consistently across sites. We therefore used the test data to examine the effect of these factors on survey scoring using a Generalized Linear Mixed Modelling (GLMM) approach. For these analyses, and to link the scoring of particular individuals across sites, we constructed GLMMs using test data from a subset of 34 assessors who (1) chose not to remain anonymous and (2) had identified themselves across at least five survey sites (n = 34 assessors, n = 493 total assessor surveys). Explanatory variables in models included the highest educational qualification obtained (four categories: (1) 'primary school', (2) 'secondary school', (3) 'pre-university' (diploma or Malaysian matriculation programme) and (4) 'degree' (university undergraduate degree)); prior forest knowledge (two self-assessed categories: prior knowledge or no prior knowledge); age (three categories: aged < 31, 31-50 or > 50); and gender (M/F). Candidate models were constructed using all possible combinations of these variables, with individual assessor identity (assessor 1, assessor 2, etc.) specified as a random effect (intercept) in all models. GLMMs were constructed using the 'lme4' package in R (Bates et al., 2015), model selection was performed using AIC (Akaike information criterion; Burnham & Anderson, 2002), and the associated marginal and conditional R-squared values were estimated using the Delta method (Bartoń, 2020).
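The all-subsets AIC comparison can be sketched as follows. This is a deliberate simplification, not the paper's code: it fits ordinary least-squares models to simulated data in Python rather than GLMMs in 'lme4', and the covariate values and effect sizes are fabricated. Models within two AIC points of the best model are retained as having 'substantial' support, following Burnham & Anderson (2002).

```python
import itertools
import numpy as np

def gaussian_aic(rss, n, k):
    """AIC for a least-squares fit: n*ln(RSS/n) + 2*(k + 1),
    counting the residual variance as one extra parameter."""
    return n * np.log(rss / n) + 2 * (k + 1)

rng = np.random.default_rng(1)
n = 120
# Hypothetical assessor covariates (stand-ins for education,
# prior knowledge, age and gender)
covariates = {name: rng.normal(size=n)
              for name in ["education", "prior_knowledge", "age", "gender"]}
# Simulated survey scores in which only prior knowledge has a (small) effect
score = 30 + 0.5 * covariates["prior_knowledge"] + rng.normal(size=n)

results = []
for r in range(len(covariates) + 1):
    for subset in itertools.combinations(sorted(covariates), r):
        X = np.column_stack([np.ones(n)] + [covariates[v] for v in subset])
        beta, *_ = np.linalg.lstsq(X, score, rcond=None)
        rss = float(np.sum((score - X @ beta) ** 2))
        results.append((gaussian_aic(rss, n, X.shape[1]), subset))

results.sort()
best_aic = results[0][0]
# Candidate models within 2 AIC points of the best model ('substantial' support)
supported = [s for a, s in results if a - best_aic <= 2]
print(supported)
```

With the simulated effect above, every supported model contains the one genuinely informative covariate; the remaining spread of supported models mirrors the paper's finding that several near-equivalent formulations can fall within two AIC points of the 'best' one.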

Test #3: Determining survey effectiveness and identifying improvements
We tested two key performance aspects of the survey, namely (1) survey effectiveness -how effective the survey questions were at discriminating between sites of better or worse condition (hereafter 'discriminative ability'), and (2) response consistency -how consistent different assessors were when surveying the same sites (hereafter 'agreement rate').
To estimate the discriminative ability of questions, we first calculated the mean answer for each of the 50 questions at each site (0 ≤ x ≤ 1), across all assessors. We then calculated the standard deviation in these means across all 16 sites to obtain a discriminative ability score for each question (hereafter referred to as the 'discriminative ability' measure). Using the mean answer for each question at each site ensured that variability associated with the assessors' answers at each site did not contribute towards our measure of the question's ability to discriminate between sites.
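As an illustration of this calculation, here is a short Python sketch. We assume answers are coded 0/1 and use the population standard deviation; both are our assumptions, since the paper's analyses were run in R.

```python
import numpy as np

def discriminative_ability(answers):
    """answers: 2-D array of 0/1 responses for a single question,
    shape (n_assessors, n_sites). Returns the standard deviation
    across sites of the per-site mean answer, so that within-site
    assessor variability does not contribute to the measure."""
    answers = np.asarray(answers, dtype=float)
    site_means = answers.mean(axis=0)       # mean answer at each site
    return float(site_means.std(ddof=0))    # spread of those means

# A question answered identically at every site discriminates nothing:
uniform = np.ones((30, 16))
print(discriminative_ability(uniform))      # 0.0

# A question answered 0 at half the sites and 1 at the other half
# discriminates maximally:
contrast = np.tile([0] * 8 + [1] * 8, (30, 1))
print(discriminative_ability(contrast))     # 0.5
```

Averaging within sites first is the key step: a question on which assessors disagree wildly but whose site means are stable still receives a high discriminative ability score.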
To estimate an agreement rate score that accounts for scorers agreeing simply by chance, we calculated Fleiss's kappa for each survey question using the 'irr' package in R (Gamer et al., 2019). Fleiss's kappa is a dimensionless score with a maximum of one, where one indicates complete agreement, zero indicates agreement no better than chance, and negative values indicate agreement worse than that expected by chance.
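The paper computed Fleiss's kappa with the 'irr' package in R; the underlying calculation can be sketched in Python from a subjects × categories table of rating counts. The example tables here are fabricated.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss's kappa from a subjects x categories table of rating
    counts; each row sums to the (constant) number of raters."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                         # raters per subject
    p_cat = counts.sum(axis=0) / counts.sum()   # overall category proportions
    # Per-subject agreement: fraction of concordant rater pairs
    p_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()                          # observed agreement
    p_e = np.sum(p_cat ** 2)                    # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement: five raters all pick the same category at each site
perfect = np.array([[5, 0], [0, 5], [5, 0], [0, 5]])
print(round(fleiss_kappa(perfect), 2))   # 1.0
```

Note that the chance-correction term can push kappa below zero when observed agreement falls short of chance, which is exactly what the Results report for two of the survey questions.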
Both these measures were derived from a subset of the full data, wherein 30 responses (the maximum number of responses available for all 16 sites) were randomly chosen and analysed for each of the 16 sites. This was to ensure that each site was represented by the same number of survey responses, that is, a balanced design, with each site contributing equally to the overall scoring for each aspect of survey performance (30 responses per site × 16 sites = 480 total assessor surveys).
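The balanced subsampling step might be sketched like this in Python; the site names, array shapes and response values are our own illustrative assumptions.

```python
import numpy as np

def balanced_subsample(responses, per_site, seed=0):
    """responses: dict mapping site -> 2-D array of assessor surveys
    (rows = assessors, columns = question answers). Draws the same
    number of surveys per site, without replacement."""
    rng = np.random.default_rng(seed)
    subset = {}
    for site, surveys in responses.items():
        rows = rng.choice(len(surveys), size=per_site, replace=False)
        subset[site] = surveys[rows]
    return subset

# Fabricated raw data: 16 sites, each with 30-70 surveys of 50 questions
rng = np.random.default_rng(42)
raw = {f"site_{i}": rng.integers(0, 2, size=(rng.integers(30, 71), 50))
       for i in range(16)}

balanced = balanced_subsample(raw, per_site=30)
print(sum(len(v) for v in balanced.values()))  # 480
```

Sampling without replacement down to the smallest common site total gives every site equal weight in the per-question agreement and discriminative ability scores.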
Finally, in the interest of optimizing the survey, we tested the effect of removing survey questions where agreement and/or discriminative ability was low, using the balanced subset of 480 assessor surveys.
Questions were removed in groups (see Results for details), and final scores were recalculated and then up-weighted to the equivalent of a score out of 50, to enable comparisons with scores from the full survey. R-squared statistics were also recalculated to test for any change in agreement with the independent forest condition metrics (see Section 2.3).

FIGURE 1 Agreement between FIA score (x-axes throughout) and independent metrics of forest condition (y-axes) sampled across a range of patch sizes in Sabah, Malaysian Borneo. The species richness values represent raw (non-bootstrapped) counts of (a) dipterocarp species (> 30 cm diameter at breast height; Yeong et al., 2016a) and (b) ant species (quadrats; Tawatao et al., 2014); the remaining panels show (c) vegetation structural complexity or 'forest quality' score (dimensionless; Yeong et al., 2016a), (d) aboveground carbon (Mg C per ha; Asner et al., 2018), (e) litter decomposition (% leaf litter mass lost over 120 days; Yeong et al., 2016a), (f) dipterocarp seedling prevalence (n seedlings, max 4; Yeong et al., 2021), (g) fruit prevalence (n fruits, max 4; Yeong et al., 2021) and (h) dipterocarp seedling survival (%; Yeong et al., 2016b). Error bars represent ± 95% confidence intervals. Where they appear, dark green circles represent surveys conducted in continuous forest, while lighter green circles represent surveys conducted in forest fragments; symbol size is proportional to the cube root of fragment area.
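The up-weighting of reduced-survey scores is a simple proportional rescaling; a one-line sketch (the function name is our own) is:

```python
def upweight(score, questions_remaining, full_survey=50):
    """Rescale a score from a reduced question set to the
    equivalent of a score out of 50."""
    return score * full_survey / questions_remaining

# e.g. a site scoring 18 on the 27 questions remaining after removing
# the 23 'Panel C' questions (50 - 23 = 27) maps to ~33.3 out of 50
print(round(upweight(18, 27), 1))  # 33.3
```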

Test #1: Comparison of survey test scores to independent forest condition metrics
Overall, survey scores ranged from 10 to 47 out of 50, indicating that our sample of sites spanned a wide range of forest condition. The survey scores generated by our assessors agreed with the majority of the indicators of forest condition that we extracted from the independent studies specified in the methods (Figure 1, Table 1). Among the strongest associations was dipterocarp seedling prevalence (R² = 0.78, Figure 1f; Table 1). There was reasonable agreement between scores and the alpha diversity of the forests they were collected in (R² = 0.50 vs. dipterocarps, 0.54 vs. ants; Figure 1a,b), considering the high variability of measured diversity at these sites (Tawatao et al., 2014; Yeong et al., 2016a). Finally, some aspects of functioning did not seem to be associated with survey score (dipterocarp fruiting, Figure 1g; dipterocarp seedling survival, Figure 1h). Overall, these results offer good support for the use of the FIA tool as an estimate of the relative condition of forest sites.

FIGURE 2 Performance of the FIA questionnaire. Each question (n = 50 questions) is represented by a smaller circle and is scored based on its discriminative ability (y-axis) and the rate at which recorders agreed on the answer at each site, Fleiss's kappa score (x-axis). Grand means for each criterion (n = 7 criteria) appear as larger circles, while the global mean score for the whole questionnaire appears as a black diamond (see key). Panels (A to D, delimited by dotted lines) demarcate question scores relative to the global mean.

Test #2: Examining assessor effects on survey scoring
We found limited evidence for assessor effects on survey scoring. A GLMM containing the prior knowledge variable (alone) was the 'best' performing model in the set (Burnham & Anderson, 2002), but five other models with different formulations achieved an AIC score within two points of this model, thus also achieving 'substantial' support (Supplementary Table 2). Importantly, the marginal R-squared statistics for these six models (as for all models in the set) were low (maximum 0.06, minimum 0.03), indicating that assessor characteristics accounted for no more than 6% of the variation in scoring (Supplementary Table 2). The conditional R-squared statistic for the participant identity random effect was 0.18, indicating that 18% of the variation in scoring was due to participant identity. It is therefore likely that individual-level variability between assessors had a stronger influence on scoring than the other characteristics (education, prior knowledge, age or gender).

Test #3: Determining survey effectiveness and identifying improvements
Discriminative ability of questions varied widely (Figure 2). There was a large amount of variation in answers to questions on landscape, topography or trees, whereas answers to disturbance questions were more consistent. One question had zero spread of answers: Q4, which asked whether the fragment was 1 ha or larger. This question therefore had zero discriminative power across the range of sites we visited. The average Fleiss's kappa score across questions was 0.25, indicating fair agreement between scorers across sites (Figure 2). The level of agreement for two questions was worse than that expected by chance alone; these were questions relating to the presence of waterfalls (Q14) and off-trail visibility (Q46). Overall, questions with a higher level of agreement tended also to be better at discriminating across sites (Figure 2).

FIGURE 3 Effect of survey question removal on (a) FIA score and (b) R-squared scores versus independent metrics of forest condition. Neither the removal of the worst 10 questions for agreement, nor the removal of all 'Panel C' questions (i.e. questions with low agreement and low discriminative ability, n = 23 questions), had a substantial effect on the ranking of sites by FIA score (a) or on the estimates for R-squared calculated using independent metrics of forest condition (b). Using only Panel B questions (those with a high agreement rate and high discriminative ability, n = 20 questions) to calculate FIA scores had a mild effect on the order of site rankings, but an overall adverse effect on R-squared estimates.
Removing questions tended to have a negligible effect on the relative ranking of sites based on the average survey score (Figure 3a) and also had little effect on the value of R-squared statistics calculated against independent metrics of forest condition (Figure 3b). Removing every question appearing in the lower left-hand panel of Figure 2 (labelled 'C'), corresponding to all survey questions with below average discriminative ability and below average agreement rate (n = 23 questions), tended to raise scores for sites in good condition, and in two instances this would have changed the ordering of site rankings (by ±1 rank, Figure 3a). Removing Figure 2 panel 'C' questions also mildly worsened R-squared scores, by a mean of −0.008 (Figure 3b).
Using only those questions with above-average discriminative ability and agreement rate (n = 20 questions; upper right-hand panel labelled 'B' in Figure 2) had a similar but slightly stronger effect on site rankings (Figure 3a), but again reduced the value of R-squared statistics versus the forest condition metrics (by a mean of −0.026, relative to the full survey). Removing the 'worst' 10 questions (as per assessor agreement) had a negligible effect on site rankings (Figure 3a) and on R-squared scores (Figure 3b, mean change: +0.001).

Agreement with independent forest condition metrics
The FIA tool was effective in measuring forest condition across our criteria of vegetation structure complexity, biodiversity and ecosystem functioning ( Figure 1). Strong agreement with vegetation structure and aboveground carbon estimates might have been expected, given the emphasis on structure-related questions in the survey, such as the size and number of trees (n = 13 structure questions of 50 in total). This nevertheless demonstrated that assessors were able to accurately identify critical aspects of forest structure that reflect the forest condition and conservation value without spending large amounts of time taking detailed measurements of tree size and identity. The strength of these associations also indicated the potential applicability of this tool to REDD+ schemes and the HCS approach, which both use carbon and vegetation structure as proxies for forest 'value' (Brofeldt et al., 2014;Rosoman et al., 2017).
Vegetation structure and carbon stocks have been shown to correlate closely with biodiversity in tropical forest habitats (Deere et al., 2018;Gao et al., 2014;Lindenmayer et al., 2000;Magnago et al., 2015), and this was the case among our test sites (Tawatao et al., 2014). For the two biodiversity datasets available at our study sites (dipterocarp tree diversity and ant diversity), the survey responses were reasonably well correlated with these datasets, particularly given that the FIA assessors were not required to count or identify species. We would expect that the survey would perform similarly against many other groups of species that are forest dependent, and hence vary in occurrence in relation to forest condition. We were not able to test the FIA tool against datasets of vertebrate biodiversity, which is often of concern to conservation initiatives, but sections of the survey that ask questions about evidence of human disturbance including hunting, as well as signs or sightings of mammals, provide potentially important insights into threat levels to vertebrates (Brodie et al., 2015;Green et al., 2019).
The functioning of forest ecosystems is critical to their longer term ability to support biodiversity and ecosystem services (Tierney et al., 2009;Wurtzebach & Schultz, 2016). Although assessors were not asked to identify specific forest functions, our survey scores showed good agreement with aspects of ecosystem functioning, including regeneration and decomposition rates, indicating that the characteristics covered in the survey are highly relevant to ecological processes as well as structure and diversity.
It should be noted that the characteristics of a forest that make it of conservation value differ depending on the particular conservation goal and are ultimately a value judgement. Our aim was to measure forest condition against the assumption that an intact forest ecosystem will provide the widest range of important services. We believe the tool goes some way to both address the complexity of the forest ecosystem and reduce the impact of value judgments that any individual metric might place on what constitutes 'good condition' by including a range of important elements such as vegetation structure, fauna and indicators of human disturbance. Therefore, although a site cannot achieve a perfect score if any of these elements are reduced, a site can still score well based on a range of different criteria. Questions such as those relating to saplings and fruits also indicate potential for the site to recover. It will be useful for managers to scrutinize the elements of the survey that contribute to the score, rather than simply use the total score, in order to understand the site condition in relation to specific conservation goals and to develop effective management.
Developing supporting guidance on interpreting scores for subsequent conservation management would therefore be beneficial.

Usability and consistency
Overall, assessors were consistent at ranking sites by their relative condition. Accuracy and usability are key criteria for the success of the tool in improving the uptake of effective forest monitoring among a wider range of forest stewards with varied backgrounds and levels of expertise. Scoring by individuals had strong internal consistency across sites, and there was no evidence that prior knowledge or experience of forest ecology influenced the ability of assessors to discriminate between sites of different condition. Community forest monitoring for REDD+, which uses more involved vegetation measurement protocols, was found to be similarly reliable when undertaken by non-experts (Holck, 2008). The FIA tool enables the inclusion of other forest properties beyond vegetation structure, with a quicker and simpler approach, and without requiring additional expertise or training.
It should, however, be noted that the use of a subset of assessors who had chosen not to remain anonymous could have introduced some level of bias into the derived scores. For example, those who were comfortable identifying themselves may have been more confident in their knowledge of forests, while those less confident may have opted to remain anonymous. It may also be the case that the assessors who opted to self-identify were less likely to deviate from what they perceived to be the consensus for the site, that is, to provide some sort of 'right' answer ('impression management'; Drescher et al., 2013). It is difficult to speculate on what the net effect of these biases might be, but we would highlight that our tests on the full dataset (anonymous or otherwise) showed that assessor characteristics did not affect scoring in any substantial way.
The assessors in our study had the opportunity to ask for clarifications from a scientist before starting the survey. On the whole, questions were fully understood, and clarifications were requested only for unfamiliar technical words such as 'ravine', 'ephemeral' or 'cicada chimney'.
This could have some effect on surveys conducted without an expert present; however, we would expect that future revisions to the FIA, including simpler wording, further written explanation or provision of photos or diagrams would largely solve this issue.
Although assessors were consistent in their ranking of sites, our results also show that there is substantial variability in the absolute scores recorded by each assessor at each site, and thus scores for intact forest sometimes overlapped scores for sites likely to have been in poorer condition. Therefore, individual scores cannot currently be used to ascertain a specific level of forest condition. The FIA tool can, however, be used to rank sites, to understand which areas may be of higher or lower conservation value and to identify suitable targets for restoration activities. It may also be possible to use the FIA to monitor the condition of forest sites over time.

Applications
The FIA tool is able to rank the condition of forest areas based on our key criteria. For this reason, the tool is competitive alongside alternative methods such as tree enumeration, remote sensing or wildlife monitoring, because it is able to capture vital information across all these aspects at a fraction of the cost or time and without the need for technical expertise. These attributes would enable projects to rapidly assess forests on the ground in a way that allows for engagement and participation by the wider community. Our results also indicate a continuing need for field surveys, even with full access to remote sensing data, because a number of aspects of forest condition for which measurement via remote sensing is difficult or impossible, such as signs of leeches (Q41), epiphytes (Q33) and sub-canopy tree metrics (Q15-25), scored highly for discriminative ability at our sites (Figure 2).
The FIA tool may have applications for any project that requires information about forest condition. These could include ecotourism restoration initiatives or the monitoring of conservation set-asides, such as those in RSPO-certified oil palm plantations. Some conservation initiatives require specific information on particular aspects of forest condition, such as aboveground carbon stocks (e.g. in carbon accounting schemes) or the abundance of particular focal species of conservation concern (e.g. orangutan conservation programmes). The FIA tool is no substitute for the detailed and focused measurements required for these sorts of projects. It may, however, complement or supplement such measurements, enabling field staff and local communities to cheaply, efficiently and systematically capture information about the wider condition of the forest that could be pertinent to the focus of the project. For example, it may provide information about vegetation degradation that may affect habitat for focal species, or identify human disturbance that could impact on carbon stocks in the future. We argue that regularly monitoring the full range of aspects of forest condition, through the use of the FIA, could therefore contribute towards conservation goals, as well as having wider non-target benefits.

CONCLUSIONS
1. The FIA tool is effective in ranking sites in terms of condition, but variation among assessors means that it is important that the same individual is used to conduct comparisons of sites over space or time.
Alternatively, taking a mean score across multiple assessments of the same site is likely to improve the robustness of condition estimates.
2. More information and training would enhance the accuracy of the survey. Some common sources of inaccuracy could be mitigated by the provision of photos and other visual aids to help understand survey questions. However, to maximize uptake it is important to balance the need for improved accuracy with the need for the survey to remain quick, cost-effective and accessible for non-experts.
3. The tool was shown to be effective in discriminating among forests of varying conditions, but we did not test whether it is sufficiently sensitive to detect changes over time, or how repeatable scores are by individuals for the same site, which are important factors for monitoring purposes. More testing is needed to understand whether it can be used to monitor restoration projects, for example, and if so, the requisite frequency and intensity of surveying that would be required.
4. While its simplicity may not provide the detail needed for focused conservation projects, the FIA tool provides a robust and systematic means of monitoring forest set-asides, providing rapid monitoring data that are accessible to a wide range of potential users.

DATA AVAILABILITY STATEMENT
Survey data are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.8kprr4xnb. The testing data were drawn from publicly available datasets; see Table 1 for references.