A regionally scalable habitat typology for assessing benthic habitats and fish communities: Application to New Caledonia reefs and lagoons

Abstract Scalable assessments of biodiversity are required to successfully and adaptively manage coastal ecosystems. Assessments must account for habitat variations at multiple spatial scales, including the small scales (<100 m) at which biotic and abiotic habitat components structure the distribution of fauna, including fishes. Associated challenges include achieving consistent habitat descriptions and upscaling from in situ‐monitored stations to larger scales. We developed a methodology for (a) determining habitat types consistent across scales within large management units, (b) characterizing heterogeneities within each habitat, and (c) predicting habitat from new survey data. It relies on clustering techniques and supervised classification rules and was applied to a set of 3,145 underwater video observations of fish and benthic habitats collected in all reef and lagoon habitats around New Caledonia. A baseline habitat typology was established with five habitat types clearly characterized by abiotic and biotic attributes. In a complex mosaic of habitats, habitat type is an indispensable covariate for explaining spatial variations in fish communities. Habitat types were further described by 26 rules capturing the range of habitat features encountered. Rules provided intuitive habitat descriptions and predicted habitat type for new monitoring observations, both straightforwardly and with known confidence. Images are convenient for interacting with managers and stakeholders. Our scheme is (a) consistent at the scale of New Caledonia reefs and lagoons (1.4 million km2) and (b) ubiquitous by providing data in all habitats, for example, showcasing a substantial fish abundance in rarely monitored soft‐bottom habitats. Both features must be part of an ecosystem‐based monitoring strategy relevant for management. This is the first study applying data mining techniques to in situ measurements to characterize coastal habitats over regional‐scale management areas. 
This approach can be applied to other types of observations and other ecosystems to characterize and predict local ecological assets for assessments at larger scales.


| INTRODUCTION
Assessing the ecological status of ecosystems and natural resources in the face of anthropogenic and environmental stressors is necessary to inform and guide appropriate management decisions (Mumby & Steneck, 2008). Consistently with an ecosystem-based (EB) approach to management (Long, Charles, & Stephenson, 2015), assessments of biodiversity and resource status are necessary at the scale of large spatial entities such as territories or regional ecosystems. In this paper, assessment refers to periodic evaluation of changes in monitoring-based indicators of biodiversity linked to management targets, for example, for marine protected areas (MPA) (Hockings, Stolton, Leverington, Dudley, & Courrau, 2006). However, the spatial and temporal distribution of biodiversity indicators depends on both management-related factors (anthropogenic pressures and/or protected area status) and environmental factors, such as habitat, which must thus be accounted for in monitoring and assessment. It has long been acknowledged that the spatial distribution of natural communities is largely shaped by the characteristics and availability of their habitat in the environment (Bell, McCoy, & Mushinsky, 1991).
This paper focuses on benthic coastal habitats described by geometric parameters, for example, complexity, rugosity (Charbonnel, Ruitton, Serre, Harmelin, & Jensen, 2002), and other measures of configuration or landscape metrics (Grober-Dunsmore et al., 2008), geomorphology (e.g., Andréfouët & Torres-Pullizza, 2004), and biotic and abiotic covers. Small-scale (<100 m) patchiness of habitats is preferably captured by in situ measurements. Here, we characterize benthic habitats at observation scale using panoramic underwater video. Measurements of habitats and fish communities were collected on both hard substrates and soft-bottom areas within vast marine managed areas where periodic assessment of both habitats and fish communities is required.
To be used as an explanatory factor in assessments, a concise description of habitat is needed at each station. In the past, habitat typologies (also termed systematic classification schemes; Mumby & Harborne, 1999) have been obtained from quadrat and distance-based transect data using unsupervised multivariate methods such as factorial and cluster analyses (Ferraris et al., 2005; Mumby & Harborne, 1999). The cluster index forms a concise habitat proxy (covariate) for explaining spatial variations of fish assemblages (Ferraris et al., 2005) or for informing management and science through standardized maps. Yet, this synthetic proxy neglects within-habitat heterogeneity, which also influences spatial variations of macrofauna (see above). In addition, predicting habitat from data collected either in follow-up monitoring surveys or at other locations is tedious, as it requires mathematical computations, namely projecting the new data onto the clusters.
In the case of large databases, mining techniques are an appropriate and efficient way to determine meaningful association rules between variables of interest in the form of sets of conditions on their values, along with measures of confidence and frequency (Agrawal, Imielinski, & Swami, 1993; Fournier-Viger, Wu, & Tseng, 2012; Han, Pei, Yin, & Mao, 2004). Significant rules are typically frequent patterns encountered in the data set at hand (Han et al., 2004), but methods have also been developed for mining rare patterns (Piri, Delen, Liu, & Paiva, 2018).
Using both clustering techniques and supervised classification rules, we developed a methodology for (a) devising a habitat typology consistent across scales within large management units; (b) characterizing heterogeneities within each habitat type; and (c) predicting habitat from new survey data. The methodology was applied to a comprehensive data set of underwater video observations collected in New Caledonia (NC, Southwest Pacific).
of the lagoon, the NC Exclusive Economic Zone (EEZ) comprises remote well-preserved reefs, islands, and atolls that make up the Coral Sea Marine Park (CSMP, 1,300,000 km²) declared in 2014 (Figure 1). Aside from the CSMP, 15,743 km² (i.e., 80%) of reef and lagoon areas were declared a World Heritage (WH) serial property in 2008 due to the exceptionally high diversity of their coral reef ecosystems (https://whc.unesco.org/en/list/1115). Both WH and CSMP management involve periodic monitoring for assessment and reporting on fish resources and biodiversity.

| Observation equipment
Data for benthic habitat and fishes were collected using a remote unbaited rotating underwater video system (STAVIRO). A standardized procedure for sampling design, field operations, image annotation, and data analysis has been described elsewhere. The STAVIRO system consists of an HD video camera and a motor programmed to rotate the camera housing by 60° every 30 s (1 rotation ~ 3 min), yielding 6 contiguous fixed frames per 360° rotation. This relatively lightweight (6 kg) system was dropped from the boat at the station location and set horizontally on the sea bed. The system was left for 15-20 min to record the video over three complete undisturbed rotations.
between March and September, outside of the summer season. The sampling design at each site was stratified using geomorphological maps (Andréfouët & Torres-Pullizza, 2004) and included main reef areas and associated soft-bottom habitats. Within each stratum, stations were distributed to cover the entire site area and account for management status (marine protected area (MPA), WH property, unprotected areas). In total, 3,145 stations were sampled (Figure 1) at depths ranging between 1 and 41 m.

| Data validation and image analysis
After fieldwork, video footage was validated when (a) underwater visibility (estimated from reference images; see below) was at least 5 m, and (b) the field of view was not obstructed by any sea floor or benthos relief that would prevent image analysis within a 5-m radius around the system. For each valid video, habitat attributes (Table 1) were evaluated from a single rotation for an estimated 5-m radius around the video system, corresponding to an observed surface area of ca. 78.5 m². Each attribute was evaluated in each frame, and values were then averaged over the six frames of the rotation.
FIGURE 1 Study area showing distribution of 3,145 sampling stations (red). Inset: location of NC in the southwest Pacific, with the perimeter of the EEZ and external boundary of the Coral Sea Marine Park (CSMP) in green. The CSMP coastal boundary is the barrier reef surrounding the main island and the three islands of the Loyalty archipelago (including Lifou) located between Astrolabe and Walpole. Boundaries of the World Heritage property are in orange.
Fish and other marine animals were identified at the most precise taxonomic level based on a reference species list, and counted on each frame and for each of three undisturbed rotations within a 5-m radius around the system. The reference list included 42 families (Appendix 1). For each species at each station, abundance was calculated as the mean count over three rotations, which averaged out the variability between rotations. Abundances were expressed in densities as numbers of individuals per 100 m² (ind/100 m²). Species richness was the number of species observed within a 5-m radius around the camera during the three rotations.
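The density calculation described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the function name, variable names, and example counts are hypothetical, and the sketch assumes that the observed area is the full disc of 5-m radius (ca. 78.5 m²) and that each rotation yields one total count per species.

```python
import math

OBSERVATION_RADIUS_M = 5.0  # radius analysed around the video system (from the text)
OBSERVED_AREA_M2 = math.pi * OBSERVATION_RADIUS_M ** 2  # ca. 78.5 m2


def density_per_100_m2(counts_per_rotation):
    """Mean count over rotations, scaled to individuals per 100 m2."""
    if not counts_per_rotation:
        raise ValueError("at least one rotation count is required")
    mean_count = sum(counts_per_rotation) / len(counts_per_rotation)
    return mean_count * 100.0 / OBSERVED_AREA_M2


# Hypothetical counts of one species over the three rotations of a station.
example_density = density_per_100_m2([4, 6, 5])
```

With these hypothetical counts, the mean count is 5 individuals, which corresponds to roughly 6.4 ind/100 m² for the assumed observed area.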
Estimation of visibility, attributes, and 5-m radius followed training of annotators with reference images comprising bright and dark fish silhouettes of several sizes filmed at a range of distances and in several visibility conditions. Training was validated after successful joint analyses of a set of images were conducted with an expert.

| Data analysis
Our classification method had two steps: (a) producing the habitat proxy (cluster index) summarizing habitat attributes at each station, and (b) deriving classification rules for describing within-cluster heterogeneity and predicting habitat (Figure 2).

| Constructing the typology and the habitat proxy
In this broadly distributed data set, biotic covers differed strikingly between well-preserved remote sites, and coastal sites subject to anthropogenic pressures. The typology was constructed from the 2,609 coastal stations only, and the remote stations were a posteriori projected on the typology to avoid: (a) a systematic contrast of remote and coastal stations due to average differences in live coral cover; and (b) failure to discriminate between habitat variations between coastal areas. The clusters of coastal stations were obtained by combining principal component analysis, hierarchical ascending clustering, and Random Forest (RF) modeling (Breiman, 2001, Appendix 2). Based on this typology of coastal stations, habitat was predicted at the 536 remote stations using a second RF model (Appendix 2). Clusters were characterized by habitat attributes by testing differences in means between each cluster and the overall set of stations (Pelletier & Ferraris, 2000).
Note: Topography and complexity scores range between 1 and 5. Percent covers (PC) refer to the observed surface area. "Macroalgae" does not include encrusting algae. "Other algae" mostly includes algal turf, that is, typically low-lying (mm to cm tall) layer of algae (Connell et al., 2014). "Dead coral" still retains a coral shape. Habitat annotation was derived from Clua et al. (1996).
FIGURE 2 Analytical workflow: methods (left) and outputs (right)
The distribution of the habitat proxy was mapped at the scale of the entire territory and at the site scale. The relevance of the clusters as a habitat proxy for explaining spatial variations of fish communities was illustrated by: (a) testing the effect of habitat for two widely used metrics, overall fish abundance and species richness; and (b) computing and plotting frequency per family in each habitat.

| Classification rules
Classification rules are used to describe multivariate data sets (Appendix 3). In this paper, a classification rule is made of a set of conditions on habitat attributes that imply a specific habitat (here, the habitat proxy). The 1,000 rules with maximum support (Top1000 rules) were searched for three values of the minimum confidence (min_conf), 80%, 90%, and 95%, producing three sets of 1,000 rules.
The Top1000 rulesets were then selected and reorganized based on expert knowledge, to achieve a compromise between representativeness (i.e., a large proportion of the stations in each habitat were described by the rules with a high confidence level) and parsimony (not having too many rules). Each rule had to (a) include a condition on the archetypical attribute of each habitat; (b) comprise up to four conditions on habitat attributes; and (c) not overlap with another rule.
Expert knowledge was also useful to identify specific habitat attributes that were relevant to describe within-habitat heterogeneity. Including a condition on such an attribute in some rules increased the rules' confidence by making them more specific to the habitat type. In some habitats, rules with lower confidence were considered to increase their support. The resulting set of expert-selected rules was then used for describing within-habitat heterogeneity. We then assessed the ability of this set of rules to predict habitat considering the confidence level for each habitat type and over all habitats.

| Habitat typology and proxy
Five clusters (i.e., habitat types) were retained, each clearly characterized by an archetypical attribute and named accordingly. Three habitats pertained to soft sand-dominated bottoms (Macroalgae, Seagrass, Sandy), while two habitats corresponded to dominant hard substrates (Live Coral and Debris). In each cluster, the archetypical attribute was larger than 15%, except in the Live Coral habitat, to which 113 stations displaying a lower live coral cover had been assigned because they also had a substantial dead coral cover. These stations were set aside from the coastal station data set, which was then used to train an RF classification model (based on 1,000 trees, out-of-bag (OOB) error of 3.9%). From this model, habitat was predicted for the 113 stations: 77 and 35 stations were classified in the Debris and Sandy habitats, respectively, and one in the Live Coral habitat. The second RF model, trained from this consolidated typology (based on 1,000 trees, OOB error of 4.1%), served to predict habitat for the 536 oceanic remote stations. These were assigned to the Live Coral (48%), Sandy (27%), and Debris (25%) habitats.
The final clusters with all the stations were described by habitat attributes, illustrating the heterogeneity inherent to each habitat. The distribution of the habitat proxy across sites illustrated differences between sites (Figure 3, Appendix 5). Soft-bottom habitats were more frequent on the western coast, consistent with a larger and shallower lagoon area. Hence, the prevalence of fringing seagrass beds was outstanding in Bourail (WH property) and macroalgae fields were common in Nouméa and Ouano areas. In contrast, stations in the Live Coral habitat were numerous at oceanic sites (48% at oceanic stations versus 17% at coastal stations; Figure 3, Appendix 4).
The ability of the habitat proxy to explain variations in fish communities was first illustrated by comparing overall abundance density and species richness (SR) across habitats (Figure 4).

| Habitat heterogeneity explained through classification rules
For the Macroalgae habitat, four rules described 95% of stations with an 80% overall confidence.
Note: Highly significant attributes (p < 10⁻⁵⁰) are in bold. Higher (resp. lower) mean in cluster signifies that the mean attribute was higher (resp. lower) in the cluster than on average over all stations (statistics and boxplots in Appendix 4).

| Habitat prediction from rules
Based on the 1,000 rules obtained from the TopK algorithm, habitat was predicted correctly for 70% to 84% of the stations depending on the required confidence level (Table 5, column 5). However, the Macroalgae habitat could not be predicted at all (columns 2-4), because with fewer stations, it was described by rules with smaller supports that were not among the 1,000 rules with maximum support.
The 26 expert-selected rules (Tables 3 and 4) may be used instead for prediction. Overall, the 26 rules selected enabled more stations to be correctly classified in each habitat, and with higher confidence than the Top1000 rules.

| A regionally scalable habitat proxy for consistent assessments
We have developed a methodology to construct a habitat classification that we applied to a large data set of sampling stations distributed over the entire reef and lagoon areas of NC. Overall, our results illustrated the strong dependence of fishes upon very small-scale (<100 m²) habitat features, both biotic and abiotic (see references in §1), and the feasibility of a scalable approach.
Note: "Algae" cover corresponds to the sum of "erect algae" and "algal turf" covers. Rules with a confidence lower than 70% are in italics.

TABLE 4
The habitat proxy was successfully used in assessments of benthic habitats and fish communities in NC (see §4.4).
Owing to the spatial coverage of the data, general patterns of habitat distribution in vast reef and lagoon areas were for the first time evidenced from comprehensive field measurements covering the entire EEZ of New Caledonia: the prevalence of Seagrass and Macroalgae habitats in the western lagoon of the main island and, importantly, high live coral covers frequently observed at oceanic reefs remote from anthropogenic pressures.

| Classification rules for habitat description and prediction
Supervised classification rules constitute a novel approach to habitat description previously achieved through clustering, nonparametric multidimensional scaling (Davis et al., 2016; Giménez-Casalduero, Gomariz-Castillo, & Calvín, 2011), or other statistical modeling.
In cluster analysis, within-habitat heterogeneity is measured through the variance of habitat attributes in each cluster.

| Implications for conservation and management
TABLE 5 Proportion of stations classified and corresponding confidence for the (a) Top1000 rules with three conditions (columns 2-4), and (b) rules from Tables 3 and 4 (columns 6-7)
The habitat proxy was derived from a comprehensive baseline data set comprising areas subject to a range of anthropogenic pressures, and it is consistent at the scale of New Caledonia's EEZ, including the 1.3 million km² CSMP and the 15,743 km² WH property. It has been successfully used in a number of assessments of the ecological status of fish communities, biotic covers, and other marine animals such as turtles (e.g., Pelletier, Bockel, Roman, Carpentier, & Laugier, 2016; Pelletier et al., 2014; Schohn, Bockel, Carpentier, & Pelletier, 2017), where it better explained habitat-related variations of biodiversity than geomorphological maps. The abundance, diversity, and community composition observed in the five habitats showed that an ecosystem-based monitoring strategy must encompass not only reef areas (hard-bottom areas) but also soft-bottom areas, such as in this case Sandy, Seagrass, and Macroalgae habitats. Designing future surveys will benefit from our results, and the ruleset will be used to predict habitat with high confidence for any new observation.
Both the rules and the habitat proxy are thus useful tools for monitoring-based assessment of habitat and associated macrofauna.
The study relies on an underwater video technique, which simultaneously records benthic habitat and fishes at the same exact spatial scale, and at a relatively low cost per observation. This modeling approach may apply to any data set aimed at characterizing and predicting local habitat for assessments at larger scales. More generally, it could apply to other data sets where an observation is described by a number of attributes, for example, habitat attributes or species presence or abundance, obtained from other observation protocols.
As the numbers and sizes of monitoring data sets grow, robust data analysis tools and methods are needed to (a) update the knowledge base as monitoring is conducted; (b) summarize numerous ecological attributes into a tractable nontechnical description; and (c) use these synthetic descriptions in assessments. Easy-to-understand descriptions, ideally complemented by in situ images and maps (Appendix 5), support the uptake of outcomes by scientists, by managers, and by a broader audience. Complementary efforts to develop interfaces that facilitate knowledge uptake by end users are underway.

ACKNOWLEDGMENTS
This work was part of the AMBIO project funded by IFREMER, NC Government and Provinces, the Conservatoire des Espaces Naturels of NC, the Ministry of Ecology, the French Initiative for Coral Reefs (IFRECOR), and the MPA Agency. We thank Alan Williams for insightful comments on the paper. The authors thank three anonymous reviewers and the journal's editor and associate editor for their detailed and most useful reviews.

CONFLICT OF INTEREST
None declared.

METHOD DETAILS FOR THE CONSTRUCTION OF THE HABITAT TYPOLOGY
The clusters were first obtained through a combination of principal component analysis (PCA) and hierarchical ascending clustering (HAC) (Pelletier & Ferraris, 2000). The number of clusters was determined from Ward's (1963) criterion based on a trade-off between relevance and parsimony.
Clusters were then checked for stations not assigned to the most relevant cluster, which may occur in unsupervised techniques. Hence, in each cluster characterized by an archetypical biotic cover, stations with this cover less than 15% were set aside. The 15% value corresponded to the presence of the biotic cover on a single frame of the station and was a reasonable expert-based threshold. A random forest (RF) algorithm (Breiman, 2001) was trained from the other coastal stations and then used to predict the cluster (i.e., habitat proxy) for each station set aside, which allowed the station to be reclassified in a more appropriate cluster with a known confidence level.
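The set-aside step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation (their analyses were run in R): the data layout, attribute names, and function name are hypothetical, and only the 15% threshold comes from the text.

```python
THRESHOLD_PCT = 15.0  # expert-based threshold on the archetypical cover (from the text)


def split_by_archetypical_cover(stations, archetype_of_cluster, threshold=THRESHOLD_PCT):
    """Separate stations into (kept, set_aside).

    A station is set aside when the cover of the attribute that is
    archetypical of its cluster is below the threshold.
    """
    kept, set_aside = [], []
    for station in stations:
        attribute = archetype_of_cluster[station["cluster"]]
        if station["covers"].get(attribute, 0.0) < threshold:
            set_aside.append(station)
        else:
            kept.append(station)
    return kept, set_aside


# Hypothetical stations and hypothetical attribute names.
stations = [
    {"id": "S1", "cluster": "Live Coral", "covers": {"live_coral": 42.0, "dead_coral": 10.0}},
    {"id": "S2", "cluster": "Live Coral", "covers": {"live_coral": 8.0, "dead_coral": 35.0}},
    {"id": "S3", "cluster": "Seagrass", "covers": {"seagrass": 60.0}},
]
archetype_of_cluster = {"Live Coral": "live_coral", "Seagrass": "seagrass"}

kept, set_aside = split_by_archetypical_cover(stations, archetype_of_cluster)
```

In this sketch, the kept stations would form the training set of the classifier, and the stations set aside (here the hypothetical station S2) would be the ones whose cluster is predicted afterwards; the classifier itself is not shown.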
To assign a habitat to the 536 remote stations, a second RF model was then trained from the typology of coastal stations and used to predict habitat at these remote stations.
Resulting clusters were characterized by habitat descriptors by testing differences in means between each cluster and the overall set of stations (Pelletier & Ferraris, 2000). Analyses were performed with R 3.5.1 (R Core Team, 2018) using the FactoMineR package (V 1.41, Lê, Josse, & Husson, 2008) and the randomForest package (Liaw & Wiener, 2002).

APPENDIX 3 METHOD DETAILS FOR CLASSIFICATION RULES
What are classification rules?
Association rules are used to describe multivariate data sets, particularly for mining large data sets of categorical variables (Agrawal et al., 1993). An association rule r is an implication of the form r: R → Q, with R the antecedent of the rule and Q the consequent of the rule. Classification rules are association rules whose consequent is a label, for example, a class index. The label of the consequent Q was here the habitat proxy from the typology, while the antecedent R comprised the conditions on the habitat descriptors.
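As an illustration of how such a rule can be evaluated against a set of stations, the sketch below computes the support and confidence of a rule R → Q following the definitions given in Table A2 (support as the number of stations satisfying the conditions in R, confidence as Card(r)/Card(R)). It is a hedged Python sketch, not the algorithm used in the study: the encoding of a condition as an interval (attribute, lower bound, upper bound), the attribute names, and the example values are all assumptions.

```python
def satisfies(attributes, conditions):
    """True if every condition (name, low, high) holds, i.e. low <= value < high."""
    return all(low <= attributes[name] < high for name, low, high in conditions)


def support_and_confidence(stations, conditions, habitat):
    """Return (support, confidence) of the rule 'conditions -> habitat'.

    support: number of stations satisfying the antecedent conditions.
    confidence: share of those stations whose label is `habitat`
                (None when no station satisfies the antecedent).
    """
    matching = [s for s in stations if satisfies(s["attributes"], conditions)]
    support = len(matching)
    if support == 0:
        return 0, None
    correct = sum(1 for s in matching if s["habitat"] == habitat)
    return support, correct / support


# Hypothetical stations: seagrass cover (%) and habitat label from the typology.
stations = [
    {"attributes": {"seagrass": 40.0}, "habitat": "Seagrass"},
    {"attributes": {"seagrass": 20.0}, "habitat": "Seagrass"},
    {"attributes": {"seagrass": 18.0}, "habitat": "Sandy"},
    {"attributes": {"seagrass": 5.0}, "habitat": "Sandy"},
]

# Hypothetical rule: seagrass cover of at least 15% implies the Seagrass habitat.
support, confidence = support_and_confidence(stations, [("seagrass", 15.0, 101.0)], "Seagrass")
```

Here three of the four hypothetical stations satisfy the antecedent and two of those three carry the Seagrass label, so the rule has a support of 3 and a confidence of 2/3.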

Additional constraints for the supervised classification algorithm
A huge number of rules may result from the combination of conditions on categorical variables. Constraints are thus set to discover the most interesting and relevant rules (McGarry, 2005). Objective constraints are interestingness measures (Freitas, 1999).
The TopKRules algorithm was implemented using the SPMF software (Fournier-Viger et al., 2014).

TABLE A1 List of taxonomic families considered in image analysis and in the metrics reported in Figure 4
Acanthuridae, Lutjanidae

Expert-based constraints to select among the rules
The 1,000 rules found by the algorithm were selected and reorganized, in order to achieve a compromise between representativeness (i.e., a large proportion of the stations in each habitat were described by the rules with a high confidence level) and parsimony (not having too many rules). In addition to the constraints on support and confidence, the following constraints were thus set: each rule had to (a) include a condition on the archetypical attribute of each habitat; (b) comprise up to four conditions on habitat attributes; and (c) not overlap with another rule. A large support meant the rule described a frequent pattern, and this was desirable since we aimed at identifying rules accounting for as many stations as possible in each habitat. A large confidence indicated that the rule would reliably predict habitat from habitat descriptors, which was also an objective of the analysis.
Specific habitat attributes not considered in the typology were included in the rules to increase confidence when they were deemed relevant to describe within-habitat heterogeneity. In some habitats, rules with a lower confidence were useful to increase support and gain more explanation about within-habitat heterogeneity. The resulting set of rules was then used to describe this heterogeneity.

Habitat prediction from classification rules
In the case of classification rules, the set of solutions forms a classification model ordered by decreasing support and confidence. This model was used to predict the label of a new individual by finding the first rule it satisfies within the set of solutions. We determined the ability of both the Top1000 and the expert-selected rules to predict habitat with a satisfactory confidence level.
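The first-match prediction described above can be sketched as follows. This is a minimal Python sketch under assumptions, not the implementation used in the study: rules are encoded as dictionaries with interval conditions (attribute, lower bound, upper bound), the ordering is taken to be by decreasing support with ties broken by decreasing confidence, and all rule values and attribute names are hypothetical.

```python
def order_rules(rules):
    """Order rules by decreasing support, then decreasing confidence (assumed ordering)."""
    return sorted(rules, key=lambda r: (r["support"], r["confidence"]), reverse=True)


def rule_matches(attributes, rule):
    """True if the station meets every condition; a missing attribute never matches."""
    missing = float("nan")  # comparisons with NaN are always False
    return all(low <= attributes.get(name, missing) < high
               for name, low, high in rule["conditions"])


def predict_habitat(attributes, ordered_rules):
    """Return (habitat, confidence) of the first rule satisfied, or (None, None)."""
    for rule in ordered_rules:
        if rule_matches(attributes, rule):
            return rule["habitat"], rule["confidence"]
    return None, None


# Hypothetical rules; supports and confidences are illustrative values only.
rules = order_rules([
    {"conditions": [("seagrass", 15.0, 101.0)],
     "habitat": "Seagrass", "support": 120, "confidence": 0.90},
    {"conditions": [("live_coral", 15.0, 101.0), ("depth", 0.0, 20.0)],
     "habitat": "Live Coral", "support": 300, "confidence": 0.85},
])

# Hypothetical new observation.
prediction = predict_habitat({"seagrass": 30.0, "live_coral": 5.0, "depth": 8.0}, rules)
```

Returning the confidence of the matching rule alongside the predicted habitat reflects the idea that each prediction comes with a known confidence level; an observation that satisfies no rule is left unclassified in this sketch.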

APPENDIX 4
Statistics and boxplots for habitat attributes

TABLE A2 Algorithm settings for the supervised classification algorithm used (TopKRules)

Parameter | Definition | Relevance to the study's objective
Number of conditions | Number of conditions in the subset R | A simple rule is preferred. But more complex rules may be needed to assign more stations to clusters
K | Number of rules r to be searched for | More rules make it possible to assign more stations to clusters
Confidence | Proportion of stations that are correctly assigned to the cluster based on the rule, i.e., Card(r)/Card(R) | A high confidence level is needed to classify stations correctly. In return, particular (and thus rare) stations may be assigned with a lower confidence level
Support | Number of individuals satisfying the rule (potentially not belonging to the cluster if confidence level is smaller than one) | Rules with a larger support are preferred as they correspond to more frequently encountered conditions. However, particular conditions also occur in relation to specific features of habitat
Note: The higher the confidence, the smaller the rule's support, meaning a trade-off between support and confidence.
28.9 ± 27.5

TABLE A3 Mean and standard deviation of habitat attributes in each habitat, with numbers of stations in parentheses
5.9 ± 8.6 3.5 ± 6.7 Boulder (%) Depth (m) 7.6 ± 9.0 9.0 ± 4.9 9.2 ± 5.3 6.7 ± 5.2 9.1 ± 6.7 6.8 ± 6.0 13.3 ± 10.3 8.7 ± 7.0 13.5 ± 8.9
Note: Values for remote oceanic stations shown separately for habitats found at these locations. Values for archetypical attributes in each habitat are in bold.

TABLE A6 Frequency per family in each habitat of the typology
Note: Families are ordered by decreasing overall frequency (frequency computed over all habitats). For each family, the three habitats with maximum frequency are in bold.