Utilizing the density of inventory samples to define a hybrid lattice for species distribution models: DISTRIB‐II for 135 eastern U.S. trees

Abstract Species distribution models (SDMs) provide useful information about potential presence or absence, and environmental conditions suitable for a species; and high‐resolution models across large extents are desirable. A primary feature of SDMs is the underlying spatial resolution, which can be chosen for many reasons, though we propose that a hybrid lattice, in which grid cell sizes vary with the density of forest inventory plots, provides benefits over uniform grids. We examine how the spatial grain size affected overall model performance for the Random Forest‐based SDM, DISTRIB, which was updated with recent forest inventories, climate, and soil data, and used a hybrid lattice derived from inventory densities. Modeled habitat suitability was compared between a uniform grid of 10 × 10 and a hybrid lattice of 10 × 10 and 20 × 20 km grids to assess potential improvements. The resulting DISTRIB‐II models for 125 eastern U.S. tree species provide information on individual habitat suitability that can be mapped and statistically analyzed to understand current and potential changes. Model performance metrics were comparable among the hybrid lattice and 10‐km grids; however, the hybrid lattice models generally had higher overall model reliability scores and were likely more representative of the inventory data. Our efforts to update DISTRIB models with current information aims to produce a more representative depiction of recent conditions by accounting for the spatial density of forest inventory data and using the latest climate data. Additionally, we developed an approach that leverages a hybrid lattice to maximize the spatial information within the models and recommend that similar modeling efforts be used to evaluate the spatial density of response and predictor data and derive a modeling grid that best represents the environment.


| INTRODUC TI ON
Modeling a species' potential niche and mapping habitat suitability (HS) are standard practices for environmental research examining aspects of an ecosystem that influence species distributions, especially those impacted by ongoing change (Guisan et al., 2013). Such efforts can assist in defining a species' habitat range, be combined with field sampling to verify model performance, or aid in conservation and resource planning. Whether process-based (Wang et al., 2014;Yospin et al., 2014) or statistical (Iverson, Prasad, Matthews, & Peters, 2008;Prasad, Iverson, Matthews, & Peters, 2016;Warwell, Rehfeldt, & Crookston, 2010), habitat modeling requires, at a minimum, information about the species' occurrence (i.e., importance or presence/absence, often from in situ plot data) and spatially indexed environmental conditions. Knowing what environmental information to include can be challenging in that variables are (a) scale dependent since values can differ at locations of species presence compared to aggregated values provided to the model (Kadmon, Farber, & Danin, 2003) and (b) must be relevant to the scope of the model, whether a macro-versus site-level analysis or predicting occurrence versus abundance.
Models that encompass regional extents (e.g., areas greater than ~10,000 km 2 ) generally rely on remotely sensed data representing environmental conditions, and the spatial resolutions of these datasets have improved over the past few decades (Pettorelli et al., 2014).
Additionally, techniques have been developed to downscale and relate climate data to local scales (Daly et al., 2008;Wang, Hamann, Spittlehouse, & Murdock, 2012) and HS models are being developed at these finer resolutions (Franklin et al., 2013;Gottschalk, Aue, Hotes, & Ekschmitt, 2011). While it seems advantageous to model HS at the finest resolution possible, over large extents the resultant models may not adequately match the spatial density of inventory data used for model training, which often only represents a fraction of the modeling extent. Additionally, issues related to model extrapolation (see Dormann, 2007;Peters, Herrick, Urban, Gardner, & Breshears, 2004;Rastetter, 2017) need consideration related to what is being modeled.
The availability and accuracy of in situ inventory data collected by field sampling can be the biggest limitation to spatial resolutions when modeling over large extents. Field inventories are costly to implement at high densities, and thus, inventories may not fully sample representative habitats for rare species (Guisan et al., 2006;Mao & Colwell, 2005). Regardless of these drawbacks, inventory data are generally well-suited for ecological modeling of habitats, provided positional errors are minimal (e.g., via data aggregation or smoothing) and/or modeling approaches that can reduce the influence of such errors are employed (Guisan et al., 2007). Dealing with locations where inventory data are not available can be done by omitting these locations from the model altogether or excluding them from the training dataset to then predict a value at these locations.
However, the area of omitted or predicted locations can be quite large depending on the extent and grain size of the model in relation to the density of inventory samples.
We propose that the spatial density of inventory plots be used to develop a hybrid lattice of grid cells (Stevens, 1997;Tsui & Brimicombe, 1997) for summarizing model predictor variables. The U.S. Department of Agriculture, Forest Service Forest Inventory and Analysis (FIA) dataset is a systematic random sample of forest conditions with one survey plot per ~2,428 ha (Bechtold & Patterson, 2005); however, due to the spatial distribution of forestland, densified sampling in some states from state-funded sampling, and the randomness of plot locations, each cell within the modeling lattice will have a varying number of inventory plots. One solution is to use Thiessen or Voronoi polygons to provide spatial structure among inventory plots, where irregular-shaped polygons containing a single data point at their centroid partition the landscape. Holland, Aegerter, Dytham, and Smith (2007) examined regular and irregular geometries for use in modeling movement across a landscape and concluded that irregular geometries reduced bias resulting from the spatial structure representing the landscape. Irregular geometries are ideal for nonparametric point-pattern analyses (Boots, 1980;Vincent, Haworth, Griffiths, & Collins, 1976), since no assumptions about the statistical distribution of response and dependent variables are made. However, a drawback to Thiessen polygons is that in regions that are sparsely sampled, large polygons represent a single inventory plot which may be uncharacteristic of the total area (Wilkin, King, & Sheldon, 2009). Therefore, a gridded network may provide a better representation of landscape conditions irrespective of sampling densities since variance of environmental conditions within each grid is reduced compared to the entire extent.
In this paper, we propose that a hybrid lattice may incorporate benefits of both Thiessen polygons and uniform grid networks for HS modeling. We compare whether models parameterized with data summarized with nested grids of both 10 × 10 and 20 × 20 km cells perform better in terms of model accuracy and reliability than 10 × 10 km uniform grid models. Concurrently, the evaluation of the hybrid lattice provides an update to our HS model, DISTRIB (described below), in which DISTRIB-II attempted to model 135 tree species of the eastern United States.

| Species distribution model parameterization
The HS model, DISTRIB, uses FIA (www.fia.fs.fed.us) data to derive individual tree species importance values (IV, i.e., weighted abundance; Curtis & McIntosh, 1951) which are correlated to environmental conditions using Random Forest (RF hereafter, Iverson et al., 2008;Prasad, Iverson, & Liaw, 2006). It used a grain size of 20 × 20 km to summarize 38 environmental variables and aggregate species IVs, generally among two or more inventory plots, within the eastern United States (Iverson et al., 2008). DISTRIB has been used to predict potential current and future HS for 134 tree species under various scenarios of climate change; outputs are available from the Climate Change Tree Atlas (www.fs.fed.us/nrs/atlas , Prasad, Iverson, Peters, & Matthews, 2014) as are various vulnerability assessments (Brandt et al., 2017;Swanston et al., 2011) and general summaries of potential impacts Prasad et al., 2016).
We introduce DISTRIB-II, which incorporates an overhaul of data sources, updates to the RF modeling technique, and the hybrid lattice approach of modeling. DISTRIB-II takes advantage of the increased resolution of available environmental variables, and the newer and more comprehensive inventory data available through the FIA database. The FIA program collects and reports information about the nation's forest lands, and beginning in 1999, implemented annual inventories completed over a 5-to 7-year cycle (O'Connell et al., 2017). However, due to insufficient funding, cycles for some states were extended. Sampling of forest conditions and individual trees ≥5.0 inches in DBH is performed on four 24-foot radius subplots (O'Connell et al., 2017). For privacy, locations and information have been "fuzzied and swapped," respectively (see Lister et al., 2005), and we use these records to calculate individual IV from the number of stems (e.g., relative density) and basal area (e.g., relative dominance). The resulting IVs range from 0 to 100 and were used as the response value for 135 eastern U.S. tree species (Appendix A1) and 45 environmental variables (Table 1) were aggregated from native resolutions to a 100 (10 × 10 km) and 400 km 2 (20 × 20 km) lattice. Efforts to improve modeled HS by increasing the spatial resolution at which data are provided to RF have been conducted , and 84,204 annualized FIA records (Forest Inventory & Analysis Database, 2017) from the most recently completed cycle for 37 eastern states sampled during the period 2000-2016 were processed for DISTRIB-II. Most states completed inventory cycles in 6 years initiated in 2005, 2007, or 2008; however, Louisiana, Texas, and West Virginia had longer cycles (11, 12, and 10 years, respectively). The underlying response grids used to develop HS models were refined to 10-and 20-km grids (Figure 1), where each 20-km cell was divided into four 10-km cells. In addition to using a uniform grid (hereafter DISTRIB-10), a hybrid lattice (hereafter DISTRIB-hybrid), composed of 10-and 20-km grids established by FIA plot density, was used to represent landscape conditions. An iterative algorithm determined whether sufficient FIA plots existed within each 20-km grid to warrant increasing the resolution to 10 km. The 10-km grids were accepted if ≥50% of the four 10-km cells within a 20-km cell contained two or more FIA plots; otherwise, the focal 20-km cell was retained.
In DISTRIB-II, RF models were developed using the randomForest library (Liaw & Wiener, 2002)  Only species occurring in at least 60 grid cells were considered for HS modeling. For each species, cells were excluded from the training data if (a) fewer than two FIA plots were present, (b) forest cover was <5% defined by the 2006 NLCD (Fry et al., 2011) (classes 41, 42, 43, and 90), or (c) the mean IV was >1.5 times the interquartile range of all cell IVs for the species so outlier values would not influence the models. We excluded cells from the training dataset with <5% forest cover (i.e., highly agricultural regions) containing two or more FIA plots because environmental drivers in those cells are likely to only marginally relate to forest species. We excluded IV outliers because they are unlikely to represent the broad 100-or 400-km 2 area, likely representing an artifact of recent forest or land use change. RF was parameterized to generate 1,001 regression trees, use eight randomly chosen predictor variables at each node (i.e., mtry), and grow each regression tree with a minimum of 10 observations. We set mtry to eight instead of the default, one-third of the number of predictors (15 in this case), because predictor set redundancy resulted in better model performance statistics with fewer variables used at each node. Once the RF model was trained, predictions of IV were made to all cells regardless of size for the hybrid lattice, whether sufficient FIA plots were present, or percent forest cover was less than five percent.

| Modeling species' importance
Each of the 1,001 regression trees built by RF provides information about the predicted IV, and the default is to report the mean prediction. However, the resampling of only eight of 45 variables at each node can result in spurious trees due to, for example, omission of an entire class (e.g., climate); while this does not influence overall prediction (Breiman, 2001), outliers can influence prediction distributions at a given cell (Roy & Larocque, 2012). Therefore, we compared the mean predicted value to the median for each cell; if the median = 0 and among all 1,001 predicted values the coefficient of variation (CV) ≥2.75, then 0 was used as the predicted IV rather than the mean, which was 0 < IV mean <8 among all species.
This "mean-median" combination is a modification to the approach suggested by Roy and Larocque (2012) which limits the influence on outlier predictions, yet retaining some marginally suitable habitat (e.g., mean prediction). In doing so, it gives more weight to half of the forest predicting a zero compared to a few trees predicting values >0 when the deviation of values is 2.75 times greater than the mean.

| Evaluating model performance
We assessed statistical performance, or model reliability (ModRel), for each species' model among the DISTRIB-hybrid and DISTRIB-10 with five variables: (a) a pseudo-R 2 obtained from the RF model (RF R 2 ); (b) a Fuzzy Kappa (FK) comparing the imputed RF map to the FIA-derived map (Hagen-Zanker, 2006, 2009; (c) a true skill statistic (TSS) of the imputed RF, after removing records with very high CV (e.g., mean-median combination); (d) the deviance of the CV (CVdev) among 30 regression trees via bagging (Iverson et al., 2008;Prasad et al., 2006); and (e) the stability of the top five variables (Top5) from 30 regression trees (Iverson et al., 2008). The five variables were normalized to a 0-1 scale and weighted as follows to arrive at a final ModRel score: 0.33*RF R 2 + 0.33*FK + 0.11*TSS + 0.11*CVdev + 0 .11*Top5 which gives more weighting to RF R 2 and FK, a primary performance metric and a comparison of predicted to observed TA B L E 1 Environmental data used to predict habitat suitability of eastern U.S. tree species. Data were either aggregated to 10-and 20km grids or derived from aggregated data  Farr et al. (2007). c Forsythe, Rykiel, Stahl, Wu, and Schoolfield (1995 values, respectively. Then, ModRel scores were assigned to one of four classes: high (ModRel ≥ 0.7), medium (0.7 > ModRel > 0.54), low (0.55 > ModRel ≥ 0.14), and unreliable (ModRel < 0.14). Any species with negative RF R 2 were deemed unreliable and excluded from HS modeling.
Described in Iverson et al. (2008), FK is derived from a cell-bycell comparison between the FIA IV and RF-modeled IV (see Prasad et al., 2006), producing a 0-1 scale where one is a perfect match.
FK is a better measure than percentage correct because the Kappa statistics account for uneven quantities of classes (Hagen-Zanker, 2006, 2009, while the "fuzzy" part considers the proximity between classes (e.g., IV 1-3 vs. IV 4-6), which are a closer match than classes farther apart (e.g., IV 0 vs. IV 21-30).
The variation among 30 regression trees via bagging allowed an assessment of the consistency of the model outputs and provided two components of the ModRel scoring (Prasad et al., 2006 represents an absence for the species. In addition to calculating the performance statistics, confidence values (Wager, Hastie, & Efron, 2014) were calculated as a percentage of the 1,001 predicted values within one standard deviation of the mean or the median absolute deviation for records utilizing the mean-median combination, respectively. These confidence values can then be mapped to reveal spatial patterns of performance.
F I G U R E 1 Extent of DISTRIB models and distribution of 10-km and 20-km grid cells within the hybrid lattice

| Species' range, detection, and abundance
For each of the 135 species modeled, information related to the spatial distribution of FIA data was used to classify the (a) distribution as narrow or wide, (b) density of FIA plots (commonness) as dense or sparse, and (c) FIA mean IV (i.e., abundance) as high or low. These classifications allow us to collectively evaluate the quality of the models as well as generalize some species characteristics. A species' distribution was considered narrow if the area of grid cells with FIA IV > 0 occupied <10% of the eastern United States; otherwise, it was assigned as wide. The density of FIA plots was considered dense if ≥40% of FIA plots among grid cells with IV > 0 for the species reported presence; otherwise, density was assigned to sparse. The abundance was considered high for average IV ≥ 6.0, the median of mean IV where the species occurs among all species, and low for values <6. These three categories were combined and coded with the first letter, where WDH indicates the species has a wide distribution, dense FIA plot ratio, and high IV. Codes for species that were withdrawn due to poor performance metrics were appended with an "X."

| Predictor variable importance
An evaluation of predictor importance was performed as described by Iverson et al. (2008) where a variable importance index

| Representing species importance
Across the eastern United States, uniform grids containing 41,681 and 10,691 cells had an average of 2.9 and 9.4 FIA plots at 10 and 20 km, respectively (Figure 2a,b). In contrast, the hybrid lattice contained 29,357 cells with an average of 3.2 FIA plots. Among the 20-km grids, a maximum of 40 FIA plots occurred within some cells as a result intensified inventories. However, the hybrid lattice, of which 84.7% of the 29,357 cells are 10-km grids, reduced the number of FIA plots to range from one to 12, with a mean plot count similar to the 10-km uniform grid ( Figure 2c). Thus, the hybrid lattice may provide a more representative sampling of the species' distribution across environmental conditions compared to a uniform 20-km grid, which aggregates information from 2 to 40 plots for modeling.

| Predicting species importance
For several species, the resulting DISTRIB-hybrid models contained

| Evaluating model performance
The same modeling framework and underlying data were used to model HS for 135 species, and we examined how changes to the model's spatial resolution affected performance statistics (

| Species' range characteristics
Among the 135 species modeled, 96 were considered to have narrow distributions within the eastern United States, 85 had sparse FIA densities, and 68 had low mean IVs (Table 2). Among the 10 species withdrawn resulting from narrow and uncommon distributions, one was from the NSH (i.e., narrow range, sparse density, high abundance) and nine were from NSL (narrow range, sparse density, low abundance) classes. Scaled RF R 2 and FK tended to be higher among species with dense FIA records rather than sparse records, especially for narrow distributions (
The DISTRIB-II models are intended to provide macroscale information about the IVs for 125 tree species modeled within the eastern United States under current and future (see  conditions. The models are highly dependent on FIA plot data, and the hybrid lattice approach is somewhat akin to the first DISTRIB models which predicted species IVs among counties using regression trees (Iverson & Prasad, 1998). While the underlying framework (e.g., statistical regression trees, climate, elevation, and soil data) of the DISTRIB-II modeling approach is similar to previous versions, predicted values between DISTRIB versions will differ as a result of changes in source data and modifications to how environmental data were processed. Thus, each DISTRIB version represents a snapshot derived from current data and scientific knowledge, and one would expect each version to differ in the modeled spatial patterns and trends of abundance for each species.
In our current efforts, differences arise from using a hybrid lattice approach, but also from (a) , 1981-2010) to account for recent changes and potential stressors that may be reflected by the inventories compared to earlier climate normals.
Trees are long-lived, and newly established individuals are likely responding to more recent climatic conditions and disturbances; using current conditions, we aim to capture these responses when exploring changes under future projections.
By refining the spatial resolution of DISTRIB-II models and examining how cells size affected model performance, we have shown that for some metrics, the DISTRIB-hybrid models had higher model performance as compared to the DISTRIB-10 models. This behavior is likely due to added information in cells that would otherwise be excluded from DISTRIB-10 training data because of too few FIA plots. However, by combining metrics that target specific aspects (e.g., presence/absence or fuzzy values) of the overall model's performance, the DISTRIB-hybrid models for many species perform similarly or better than DISTRIB-10. In the case of FK scores, however, the DISTRIB-10 models were higher than the DISTRIB-hybrid

| Comparing the hybrid grid to other approaches
Thiessen polygons, though used in models for birds (Schlicht,  amphibians (Holcombe, Stohlgren, & Jarnevich, 2007), like other irregular geometries used to model movement (Holland et al., 2007), have not been used generally to model vegetation patterns. This trend may be a result from the underlying vegetation inventory area having higher densities of sampling data or the plethora of gridded digital data available. Our use of a hybrid lattice borrows the spatial structure component of Thiessen polygons, but retains the uniformness of gridded data to stratify the landscape.
Approaches that use regional models (Ellenwood, Krist, & Romero, 2015) within a larger extent to predict local HS are useful when considering current ranges, but may not fully capture range-wide conditions necessary to explore habitat changes arising from climate change. Therefore, we did not attempt to employ such techniques, but acknowledge that for current conditions, such approaches may improve model performance by reducing zero-inflated datasets (Savage, Lawrence, & Squires, 2015). Other modeling efforts have produced predictions at spatial resolutions <500 m (Evans & Cushman, 2009;Rehfeldt, Worrall, Marchetti, & Crookston, 2015;Wilson, Lister, & Riemann, 2012) and as computational power increases and downscaled climate datasets become more widely available, this trend may become more prominent. However, while high-resolution habitat models are desirable, the sampling density of inventory data remains a limitation in that sparsely sampled regions may not provide sufficient information for modeling.
A spatial grid, derived by the density of a focal object (e.g., FIA plots), may provide a more representative dataset for model training, and as such, we explored here the potential use of a hybrid lattice to model tree HS across the eastern United States, which has forest coverage ranging from nearly null in the "Corn Belt" to ~100% in the Appalachian Mountains. We found that the hybrid lattice capitalizes on the varying density of FIA plots to maximize the information content across the region; with ~85% of cells across the eastern United States having sufficient FIA plots for the 10-km grid, we realize a fourfold increase in spatial resolution on tree species attributes over our earlier estimates with a 20-km grid.

| Transferability of modeling approach
Forest managers and decision makers often rely on modeled HS, primarily based on a uniform grid, which can have spatial resolutions too coarse for local needs. Although data aggregation techniques can readily process gridded and vector datasets independently, and methods exist to aggregate values between these two forms, results can include accuracy errors (Openshaw, 1983;Turner, O'Neill, Gardner, & Milne, 1989). The hybrid lattice approach presented here was applied to forest inventory data which was the main limitation to increasing the spatial resolution for our HS models. All environmental predictor datasets were available at native resolutions <1-km grids. The process to derive the hybrid lattice can be useful for any information meeting specified criteria and applied iteratively to create multinested grids across the landscape.
Random Forest does not directly consider any spatial information (e.g., proximity or size), and our models were developed with tabular data, rather than gridded datasets. Other statistical techniques commonly used to model HS would have to accept tabular records or vector data, as raster data would have to be based on the smallest area within the hybrid lattice. A raster dataset would produce many duplicate values among the larger grids of a hybrid lattice and could influence the models similarly to zero-inflated datasets by artificially increasing combinations of response and covariates. Thus, we believe the hybrid lattice approach presented here represents an improvement in DISTRIB-II, and we advocate for its wider use in HS modeling.

ACK N OWLED G M ENTS
We are grateful to the field crews that collect the Forest Inventory and Analysis data used in our models, in which this work would not be possible as well as sources of digital environmental datasets. We thank Jenifer Costanza, Bryce Adams, and two anonymous reviewers who provided comments on earlier versions of this manuscript. DISTRIB-II model output will be made available from https ://www. fs.fed.us/nrs/atlas and archived at https ://www.fs.usda.gov/rds/ archi ve/

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R S ' CO NTR I B UTI O N S
MPP processed environmental data, modeled habitat suitability, and led the writing of the manuscript; LRI and SNM developed model reliability scores and contributed to the manuscript; AMP processed FIA records, provided guidance on Random Forest modeling, and contributed to the manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data presented in this paper will be made available at https ://www. TA B L E A 2 Confident values among Importance Value classes for DISTRIB-10 and DISTRIB-hybrid models for 135 tree species. Confidence values are a combination of the percentage of RF trees predicting a mean value within ± 1 standard deviation of the mean or the median absolute deviation among the 1,001 regression trees    Note: Variable names are described in Table 1. SumVarImp1 = sum of predictor importance scores from Random Forest (percent increase) across all species. SumRankRecip2 = sum of the reciprocal of rank of each predictor across all species. FreqTop103 = frequency of the top 10 predictors with the highest importance across all species. VarImpIndx4 = Variable Importance Index is overall mean importance defined by three normalized metrics which indicates the variables influence in models for the 135 species.

TA B L E A 4
The number of FIA inventory plots corresponding to the hybrid lattice and uniform grid cells for the region depicted in Figure 4 E and F