Estimating carrying capacity for juvenile salmon using quantile random forest models

. Establishing robust methods and metrics to evaluate habitat quality is critical for the recovery of endangered Paci ﬁ c salmonids ( Oncorhynchus spp.). A variety of modeling approaches are used for status and trend monitoring of anadromous species throughout the Paci ﬁ c Northwest, USA, but current methods may fail to capture the complex relationship between ﬁ sh and habitat and are often limited in predictive power beyond speci ﬁ c watersheds. Further, the focus on species distribution and abundance is not easily manipulated to predict carrying capacity and traditional stock-recruitment analyses are reliant on long-term data which are not always available. In this study, we developed a quantile random forest model to provide estimates of habitat carrying capacity for Chinook salmon ( O. tshawytscha ) parr during the summer months, at both the site and watershed scale. Quantile random forest models allow for the considera-tion of noisy data, correlated variables, and non-linear relationships: common features in ﬁ sh – habitat datasets. We leveraged Columbia Habitat Monitoring Program data to select habitat co-variates and predict capacity at those sites. We also identi ﬁ ed a set of globally available attributes to extrapolate capacity estimate predictions throughout wadeable streams within the Columbia River basin. Total capacity estimates for watersheds closely matched estimates from alternative ﬁ sh productivity models. Carrying capacity estimates based on quantile random forest models, like those presented here, provide managers a framework to guide the identi ﬁ cation, prioritization, and development of habitat rehabilitation actions to recover salmon populations.


INTRODUCTION
The decline of anadromous Pacific salmon (Oncorhynchus spp.) across the Pacific Northwest, USA has prompted numerous actions aimed at reversing that trend. These actions are often categorized into four H's: harvest modification, hatchery practices, hydrosystem operations, and habitat rehabilitation. Problematically, there is substantial uncertainty regarding the degree of change that can be exerted across and within these categories, and what combination of changes will most cost-effectively and sustainably reduce mortality. Freshwater habitat capacity deficits have recently been identified as a major factor directly impacting population abundance which has been largely overlooked in Columbia River Basin salmonids (Bond et al. 2019, Hinrichsen and Paulsen 2020, NOAA Fisheries 2020. Specifically, restoring salmonid carrying capacity through tributary rehabilitation actions has been identified as a key component of recovery efforts for salmon and steelhead (O. mykiss) in the Pacific Northwest, USA (NOAA Fisheries 2016a,b). Efforts have included increasing and improving existing habitat for both spawning adults and rearing juveniles. However, estimating habitat carrying capacity (both historic and contemporary) for various life stages of Pacific salmon, as well as identifying important habitat characteristics that influence capacity, has been an ongoing but necessary challenge (Bond et al. 2019, Hinrichsen and Paulsen 2020, NOAA Fisheries 2020. Reliable methods to better understand fish-habitat relationships and estimate capacity are necessary to identify those salmon and steelhead life stages that are limited by habitat capacity to better direct tributary rehabilitation efforts. When it comes to estimating carrying capacity, spawner-recruit models are the gold standard (Moussalli andHilborn 1986, Myers et al. 1999). However, such models require a long time-series of accurate estimates of abundance for adults and juveniles, with variation in the number of adults. Such data are unavailable in most watersheds (Cramer and Ackerman 2009), and they do not necessarily allow one to link capacity to habitat characteristics, except perhaps at the watershed scale. Bioenergetics approaches, such as the net rate of energy intake (NREI) have been applied to salmonids to estimate capacity on the 200-600 m reach scale (Wall et al. 2016). However, there are some potential issues with how the food supply (i.e., invertebrate drift) is measured with these methods that could lead to biases in capacity estimates (Dodrill and Yackulic 2016) as well as difficulty in properly constraining drift depletion and inter-species competition, and computational and spatial limitation of this modeling approach (Wall et al. 2016, Carmichael et al. 2020). In addition, those authors did not take the step of scaling the capacity predictions at the reach scale to entire watersheds. In contrast, Sweka and Mackey (2010) estimated carrying capacity of Atlantic salmon (Salmo salar) parr at the watershed scale, using a quantile regression approach, but the only habitat covariate they included was cumulative drainage area. Estimates of salmonid carrying capacity that leverage fish-habitat relationships are lacking at the watershed scale in the Pacific Northwest.
Most studies that have investigated fish-habitat relationships focus on predicting a species' distribution (presence/absence) or the average abundance or density: neither of which can be easily manipulated to predict carrying capacity. Further, many of these studies focus on only one or two measures of habitat. Fausch et al. (1988) conducted a thorough review of attempts to predict the abundance of fish from measurable habitat covariates from 1950 to 1985 and found that the vast majority of multiple linear regression models failed to detect a significant fish-habitat signal. Since that review, there has been progress in identifying some fish-habitat relationships for several salmonid species. Nickelson et al. (1992) demonstrated that juvenile coho salmon (O. kisutch) were found in higher densities within pool habitat on the Oregon coast. Similarly, pool and pond densities were good predictors of coho smolt densities in western Washington (Sharma and Hilborn 2001). Bryant and Woodsmith (2009) found that juvenile coho abundance was positively related to large wood at the reach scale; however, their results demonstrated a negative relationship between abundance and the number of pools. Braun and Reynolds (2011) similarly found positive associations between spawner densities of sockeye salmon (O. nerka) in the Fraser River and large wood, in addition to positive relationships to percent undercuts and percent pools. Densities of adult spawning coho were also higher in forested areas compared with urban or agricultural areas in the Snohomish River watershed (Pess et al. 2002). Mossop and Bradford (2006) examined juvenile Chinook salmon (O. tshawytscha) in the Yukon River and found positive correlations between the log of fish density and several metrics related to residual pool dimensions and large woody debris abundance, as well as a negative correlation between fish density and gradient. These studies were focused on predicting observed fish densities, not necessarily capacity, and for most of them the predictive extent is limited to a particular watershed. In addition, they all assumed some form of linear fish-habitat relationship which often results in weak predictive power.
A number of studies have used other modeling approaches to elicit fish-habitat relationships. Dunham et al. (2002) used a quantile regression approach to show a negative relationship between cutthroat trout (O. clarkii) densities and the width:depth ratio of a stream for the upper quantiles of trout density. The same approach was also used to map the potential extent of sole (Solea solea) in the English Channel and southern North Sea (Eastwood et al. 2003). Machine learning models such as boosted regression trees and random forests have been used to examine species biomass, diversity, and distribution for a number of different species (Pittman et al. 2009, Knudby et al. 2010, Compton et al. 2012). The results from these studies highlight the importance and effectiveness of using techniques that can accommodate non-linear fish-habitat relationships, and provide motivation for furthering research in this realm.
For the purposes of this paper, we define carrying capacity as the maximum number of individuals that can be supported given the quantity and quality of habitat available at a given life stage. We assume that higher observed relative densities within a given life stage are a function of habitat quantity and quality. Furthermore, we assert that observed fish density is a poor proxy of habitat capacity owing to both a paucity of individuals, especially for threatened or endangered species, and the existence of unmeasured variables that may serve to limit capacity. To address this, we have developed a model to estimate juvenile rearing capacity for Pacific salmon in wadeable streams based on quantile random forest (QRF; Meinshausen 2006) models using measurements of fish abundance and density, and habitat characteristics. QRF models combine the theory and justification of quantile regression modeling (Koenker andBassett 1978, Cade andNoon 2003) with the flexibility and framework of random forest models (Breiman 2001). They account for unmeasured variables and can be used to describe the entire distribution of predicted fish densities for a given set of habitat conditions, not just the mean expected density. Random forest models have been shown to outperform more standard parametric models in predicting fish-habitat relationships in other contexts (Knudby et al. 2010). Quantile random forests share many of the benefits of random forest models, such as the ability to capture non-linear relationships between independent and dependent variables, naturally incorporate interactions between covariates, and work with untransformed data while being robust to outliers (Prasad et al. 2006). In addition, quantile regression models have been used in a variety of ecological systems to estimate the effect of limiting factors (Terrell et al. 1996, Cade andNoon 2003).
The fish abundance/density and habitat data used to fit the QRF model presented here were available from seven watersheds within the interior Columbia River basin, Pacific Northwest, USA. Within the interior Columbia River basin two major runs of Chinook salmon occur, streamtype (i.e., spring/summer-run) and ocean-type (i.e., fall run), each characterized by different life history characteristics. Stream-type Chinook salmon adults enter freshwater from the ocean earlier in the year, spawn in the upper reaches of a watershed, and the juveniles rear for up to 16 months in freshwater before entering the ocean as smolts. Ocean-type Chinook salmon adults enter freshwater later (e.g., fall or winter), spawn lower in the watershed, and the juveniles may spend between several weeks and six months in freshwater before migrating to the ocean as subyearlings. Here, we focus on stream-type Chinook salmon, and in particular the juvenile summer rearing period during low flow, during which juveniles are often termed parr, referring to the camouflage markings that occur on their sides during this life stage. Data presented here are from Chinook salmon populations in the Upper Columbia River spring-run and Snake River spring/summer-run Evolutionary Significant Units (ESU). The Upper Columbia spring-run ESU is listed as endangered under the Endangered Species Act, the Snake River spring/summer-run is listed as threatened (NOAA Fisheries 2016a,b). Hereafter, we refer to both ESUs simply as Chinook salmon.
In this study, we developed a QRF model to: 1. identify measured habitat characteristics that are most strongly associated with observed Chinook salmon parr abundance and density, 2. elicit fish-habitat relationships for those habitat characteristics identified as important for determining fish abundance and v www.esajournals.org density, using paired fish and habitat measurements, 3. predict contemporary habitat carrying capacity at all sites where the important habitat characteristics are measured, 4. extrapolate capacity predictions at measured habitat sites across a watershed using globally available attribute data to estimate the Chinook salmon parr capacity of that watershed, and 5. validate estimates of carrying capacity from our approach across multiple watersheds using independent estimates of capacity (e.g., spawner-recruit relationships).
Our study incorporates multiple measures of stream habitat to estimate fish-habitat relationships that encompass the collinear nature of most stream habitat metrics and can be used to predict carrying capacity. Our approach moves across several spatial scales, inferring fish-habitat relationships from detailed, localized habitat data and extrapolating capacity predictions across wide swaths of unsampled locations. Additionally, this approach for estimating life stage specific habitat-based carrying capacity can be used to quantitatively identify the magnitude of tributary habitat rehabilitation necessary to support delisting. Given the multitude of (often correlated) habitat metrics and the potentially non-linear fish-habitat relationships that define capacity as a function of habitat, we explore the application of QRF modeling to habitat capacity estimation, validated using data from Columbia River Chinook salmon. For perhaps the first time, the necessity of tributary habitat rehabilitation can be demonstrated, and the magnitude of required change can be placed in context with the other H's.

Study site
Habitat data used in our study were collected from eleven watersheds within the interior Columbia River basin, Pacific Northwest, USA (Fig.1). The Columbia River basin covers more than 668,000 km 2 draining large portions of Idaho, Oregon, and Washington, and smaller portions of Montana, Nevada, Utah, and Wyoming, as well as the southeastern portion of British Columbia. The habitat data used to populate the QRF model were collected by the Columbia Habitat Monitoring Program (CHaMP; Volk et al. 2017) and were downloaded from https:// www.champmonitoring.org. Data from the following eleven CHaMP watersheds were used in this study: Asotin, Entiat, John Day, Lemhi, Methow, Minam, South Fork Salmon, Tucannon, Upper Grande Ronde, Wenatchee, and Yankee Fork Salmon. Juvenile density and abundance data were collected in a subset of seven watersheds (see Table 1 and Fig. 1) at CHaMP survey reaches and were graciously provided by a number of agencies and projects, including the Integrated Status and Effectiveness Monitoring Project (Volk et al. 2017).

Data
CHaMP sites are 200-600 m reaches in wadeable streams across select watersheds within the interior Columbia River basin. The sites were selected based on a spatially balanced generalized random tessellation stratified sample selection algorithm Olsen 1999, 2004). Habitat data within CHaMP sites were collected using the CHaMP protocol (CHaMP 2016) which calls for field data collection during the low-flow period, typically from June through October. CHaMP habitat data include, but are not limited to, measurements describing channel complexity, channel units, disturbance, fish cover, large woody debris, riparian cover, stream size (depth, width, discharge), substrate, temperature, macroinvertebrate productivity, and water quality.
Juvenile fish surveys were conducted for Chinook salmon parr during the summer low-flow season at many of the same sites surveyed using the CHaMP protocol. Survey methods included mark-recapture, three-pass removal sampling, two-pass removal sampling, single-pass electrofishing, and snorkeling. These data were used to estimate Chinook salmon parr abundance at all CHaMP sites where fish survey data were available. Three-pass removal estimates used the Carle-Strub estimator (Carle and Strub 1978), following advice from Hedger et al. (2013). Twopass removal estimates used the estimator described by Seber (2002). Mark-recapture estimates used Chapman's modified Lincoln-Peterson estimator (Chapman 1951) and were deemed v www.esajournals.org valid if they met the criteria described in Robson and Regier (1964). These estimates were made using the removal function from the FSA package (Ogle et al. 2020) or the closedp.bc function from the Rcapture package (Rivest and Baillargeon 2019) in R software (R Core Team 2019). Snorkel counts were transformed to abundance estimates using paired snorkel-electrofishing sites to calibrate snorkel counts. For sites with invalid estimates or that were sampled with a single electrofishing pass, we developed an estimate of capture probability based on valid estimates, using a binomial generalized linear mixed effects model. Fixed effects were species, wetted width of the site, density of fish caught on the first pass, and all possible two-way interactions.
We included a random effect for fish crew/watershed. We used this model to predict abundances based on the number of fish caught on the first pass and any other covariates.
Abundance estimates at all sites were then translated into linear (parr/m) fish densities which were paired with the associated CHaMP habitat data. For sites that were sampled in multiple years, only the fish and habitat data from the year with the highest observed fish density was retained to avoid possible pseudoreplication, while remaining consistent with our goal of estimating carrying capacity. After removing duplicate sites, our initial dataset contained 327 unique sites with paired fish-habitat data (Table 1). We did explore using areal fish v www.esajournals.org densities (parr/m 2 ) as the response but found very similar results; so, in the interest of brevity we only present results based on linear fish densities.

Habitat covariate selection
A key step in developing a QRF model to predict fish capacities was selecting the habitat covariates to include in the model. The CHaMP program generated more than 100 habitat metrics at each site, many of which were correlated with each other to one degree or another, as is often the case with stream habitat variables. We sought to include a small set of covariates that were not overly redundant (i.e., not highly correlated), described many aspects of stream habitat (e.g., substrate, temperature, and complexity), and were highly associated with the observed fish densities, presumably because they contained information about what types of habitat fish sought or avoided. Full details of how the twelve covariates used in the QRF model were selected can be found in Appendix S1.

QRF model fit
Using the selected habitat covariates (Table 2), we fit a QRF model to predict habitat rearing capacity for Chinook salmon parr during summer months using the natural log of fish densities as the response. After constructing a random forest, predictions of the mean response can be made by averaging the predictions of all trees, similar to the expected value predictions from a statistical regression model. The individual predictions from each tree, viewed collectively, describe the entire distribution of the predicted response; therefore, the random forest model can be used in the same way as other quantile regression methods to predict any quantile of the The sum of station discharge across all stations. Station discharge is calculated as depth 9 velocity 9 station increment for all stations except first and last. Station discharge for first and last station is 0.5 9 station width 9 depth 9 velocity. 3 Avg. August temp. Temperature Average predicted daily August temperature from NorWest, averaged across the years 2002-2011. 4 Width:depth ratio Complexity Average width to depth ratio of the wetted channel measured from cross-sections. Depths represent an average of depths along each cross-section. 5 Fines Substrate Average percentage of pool-tail substrates comprised of sediment <6 mm. 6 Fish cover Cover Percent of wetted area with the following types of cover: aquatic vegetation, artificial, woody debris, and terrestrial vegetation. 7 Channel unit frequency ChannelUnit Number of channel units per 100 m. 8 Depth complexity Complexity Standard deviation of water depths within the wetted channel. 9 Large wood freq. in pools Wood Total volume of large wood pieces within the wetted channel and slow water/pool channel units, scaled by site length. 10 Riparian canopy Riparian Percent of riparian canopy with some vegetation. 11 Substrate: D16 Substrate Diameter of the 16th percentile particle derived from pebble counts. 12 Braidedness Complexity Ratio of the total length of the wetted mainstem channel plus side channels and the length of the mainstem channel.
Note: Metrics are ranked in the order of relative importance.
v www.esajournals.org response. There were missing values for some habitat data; thus, any site visit with more than three missing covariates was removed from the dataset and the remaining missing habitat values were imputed using the missForest R package (Stekhoven andB€ uhlmann 2012, Stekhoven 2013). We fit the QRF models using the quantreg-Forest function from the quantregForest package (Meinshausen 2017) in R software (R Core Team 2019), incorporating data from 327 records (paired fish-habitat data) and twelve habitat covariates (27.2 data points per covariate; Table 2). The 90th quantile of the predicted distribution was used as a proxy for carrying capacity following the suggestion of Sweka and Mackey (2010), and to avoid higher quantiles that draw from the very upper tails of observed fish density, where the variability of predictions may be influenced by small sample size issues. After model fitting, the QRF model was then used to predict capacity at sites with measurements of the habitat covariates that were used to fit the model. In our case, this includes all sites within CHaMP basins in the interior Columbia River basin. For CHaMP sites that were sampled in multiple years, we first calculated the mean for each habitat metric among years to make predictions. In total, we generated 589 predictions of Chinook salmon parr capacity, during summer months, for the following basins: Entiat, Grande Ronde (including Minam), John Day, Lemhi, Methow, South Fork Salmon, Tucannon, Wenatchee, and Yankee Fork Salmon. CHaMP sampled between 1% and 28% of the Chinook domain within these watersheds, with an average of 11%.

Extrapolating to other sites
To predict capacity at larger spatial scales, such as the watershed or population, we developed an extrapolation model based on globally available attributes (GAA) which were available for the entirety of tributary habitat utilized by a given population. The GAA data used here were taken from the list of generalized random tessellation stratified master sample sites that the CHaMP sites were originally selected from (Larsen et al. 2008(Larsen et al. , 2016. Possible covariates included temperature range, elevation, watershed, the first principal component of a natural feature classification and human disturbance classification (Whittier et al. 2011), the square root of cumulative drainage area, stream power, slope, channel type, bankfull width, and average August temperature ( Table 3). The natural log of the CHaMP site capacity predictions (parr/m) was used as the response variable in a multiple linear regression model that incorporated the design weights of the CHaMP sites using the svyglm function from the survey package (Lumley 2020) in R software (R Core Team 2019). The design weights are generated from how much of the watershed each site is meant to represent. Because the CHaMP sites were selected from strata that usually comprised unequal portions of that watershed, these weights must be accounted for to lead to v www.esajournals.org unbiased model coefficients (Nahorniak et al. 2015). We fit two different extrapolation models, one that included watershed as a covariate for use in predicting capacity within CHaMP watersheds, and one that did not for predicting everywhere else. We then made predictions of linear capacity at all master sample sites throughout the interior Columbia River basin, generally spaced about one kilometer apart. These points do not represent specific segments of streams, however, so some form of spatial averaging of capacity predictions to generate larger scale capacity estimates is necessary.
To summarize capacity at larger scales, the mean linear capacity (e.g., parr/m) of the master sample points along a particular tributary is multiplied by the length of that tributary. We first restricted the upstream limit of master sample points and lengths of streams to those within the domain of spring/summer-run Chinook salmon, as defined by StreamNet (http://www.streamnet. org) or used expert opinion from local biologists, v www.esajournals.org and the downstream limit by when streams were no longer wadeable (often determined by some combination of estimated bankfull width and cumulative drainage area). The capacities of various tributaries could then be summed to estimate capacity at almost any spatial scale. A conceptual diagram showing the data and modeling framework of the QRF and extrapolation models is shown in Fig. 2.

Model validation
Spawner-recruit data from several watersheds within the interior Columbia River basin were compiled to validate the extrapolated QRF estimates of Chinook salmon parr capacity. Some watersheds had direct estimates of parr, while some had estimates of pre-smolts and smolts (i.e., fall and spring emigrants) from rotary screw traps. For the latter, estimates of parr were calculated using estimates of over-winter survival to back-calculate parr from smolt estimates, and then adding that to pre-smolt estimates. A series of spawner-recruit functions were then fit to this data including Beverton-Holt, Ricker, and hockey stick (Froese 2008), using the FSA package (Ogle et al. 2020) in R. Estimates of capacity from each of these spawner-recruit curves were compared with QRF estimates of capacity for the same regions.
All code and data for the analyses presented here can be found in a GitHub repository (https://doi.org/10.5281/zenodo.4300942).

Habitat covariate selection
We categorized 165 habitat measurements collected using the CHaMP habitat protocol (CHaMP 2016) into eleven habitat categories, and for each habitat covariate the Maximal Information Criteria (MIC) value was calculated based on the strength of association between the habitat covariate and the response variable, parr density (parr/m; see Appendix S1 for further details). We chose the following twelve CHaMP habitat covariates to fit the QRF model: wetted width, observed discharge, average August temperature, wetted width:depth ratio, percent fines less than 6 mm, total percent fish cover, channel unit frequency, standard deviation of the wetted depth, frequency of large wood in pools, percent riparian canopy cover, lower quantile of substrate size (D16), and braidedness (Table 2).

QRF model
A QRF model was fit using those metrics and the quantregForest package (Meinshausen 2006) in R (R Core Team 2019), and the 90th quantile of the predicted distribution was used as a proxy for carrying capacity. After model fit, we examined the relative importance of each habitat covariate included in the model (Fig. 3), quantified by the average decrease in residual sum of squares for splits on that variable amidst the trees in the random forest, implemented by the importance function from the randomForest package (Liaw and Wiener 2002). Moreover, QRF models allow one to visually examine the marginal effect of each habitat covariate on the quantile of interest using partial dependence plots. These plots show the marginal effect of changing a single habitat covariate while maintaining all other covariates at their mean values (Fig. 4). However, given that many habitat metrics are somewhat correlated, these marginal effects are often not independent of one another and care should be taken when interpreting them. After model fitting, the QRF model was used to predict habitat capacity at all CHaMP sites within the interior Columbia River basin.

Extrapolating to other sites
We fit a linear regression extrapolation using QRF-based predictions of capacity at all CHaMP sites as the response, and various GAAs as the independent variables. The coefficients for the extrapolation model can be found in Table 3 and the summary of the model fit in Table 4. From this, we calculated estimates of capacity at every master sample point in the Columbia River basin, each representing roughly one kilometer of stream length.

Model validation
Estimates of Chinook salmon parr capacity from the QRF and extrapolation models were comparable to independent estimates from spawner-recruit data (Table 5, Fig. 5). QRF estimates had overlapping confidence intervals with one or more of the Beverton-Holt, Ricker, or hockey stick model estimates in each of the nine locations where comparisons were possible v www.esajournals.org (Fig. 5). Potential additional uncertainty was not accounted for in estimates of spawners-per-redd or spawners-per-parr, which would increase the confidence intervals around spawner-recruit estimates and the overlap among estimates. Correlations between parr capacity estimates from the QRF model and spawner-recruit models ranged from 0.710 (Beverton-Holt) to 0.966 (Ricker). This favorable comparison provides strong validation as the spawner-recruit estimates of Chinook salmon parr capacity were developed from completely independent datasets and using entirely different methods.

A tool to estimate habitat capacity
In this study, we developed a novel approach to estimate the capacity of habitat to support Chinook salmon parr during summer months in wadeable streams. Our model can be used to quantify juvenile rearing capacity in Chinook salmon watersheds or populations and, in turn, quantify the magnitude of tributary habitat rehabilitation that may be necessary to support Endangered Species Act delisting. The QRF and extrapolation models presented here provide useful tools toward the prioritization, implementation, and evaluation of habitat rehabilitation actions to recover depleted salmon populations. Moreover, these models can be applied to multiple stages within the life cycle (e.g., parr, smolt, adult). Estimates of habitat carrying capacity for multiple life stages will allow biologists and managers to identify what life stages and/or specific habitat patches may be limiting. As an example, QRF models and associated extrapolation models may demonstrate that habitat for a given population is sufficient to support adult spawning required to achieve delisting targets, but that juvenile rearing capacity may not be sufficient to support the target abundance. In such a case, habitat rehabilitation actions may be most cost-effectively and sustainably directed toward improving juvenile rearing habitat. Models to estimate habitat carrying capacity for multiple life stages will help to better direct habitat restoration actions and help guide not only the type of action, but also the location at which an action is performed. The favorable comparison between QRF estimates of carrying capacity and the spawnerrecruit based estimates in select watersheds helps support and validate this approach. Although built from completely different data, when these multiple lines of evidence converge it lends credence to the QRF capacity prediction results.
There are two aspects that make this approach data hungry, meaning a large dataset is needed to fit a QRF model like this. First, random forest models generally require more data than parametric models, due to the lack of parametric distribution assumptions and the lack of an assumed form of the relationship  between dependent and independent variables. Second, it takes larger datasets to accurately predict the lower and higher quantiles in a quantile regression framework. For example, if a dataset consisted of thousands, rather than hundreds, of data points, a researcher might feel comfortable using the 95th or the 98th quantile as a proxy for capacity, rather than the 90th. Our dataset consisted of 327 sites, across a variety of habitats and years, providing contrast in all the habitat covariates and presumably satisfying the data hungriness of a QRF model, based on our validation with spawnerrecruit data.

Biological expectations from QRF model
The results of the QRF parr capacity model for Chinook salmon meet many biological expectations. Focusing on the partial dependence plots (Fig. 4), the QRF model predicts capacity to increase when the wetted width, discharge, and the width:depth ratio grow, temperatures are cooler (Brett 1952, Raleigh et al. 1986, Bjornn and Reiser 1991, there is less fine sediment (Hillman et al. 1987, Bjornn and Reiser 1991, Allen 2000, there is more fish cover (Hillman et al. 1987, Bjornn and Reiser 1991, Holecek et al. 2009), channel unit frequency increases, and the standard deviation of the wetted depth (a proxy for streambed complexity) increases. These are all patterns that emerged from the fish-habitat data, and where available, match those fish-habitat relationships identified qualitatively in other studies (Mossop and Bradford 2006).
The biggest driver of capacity identified in this study is stream size, whether measured by wetted width or discharge, which should be unsurprising since we are using fish per meter as our response. In many ways, these metrics define habitat quantity; however, other metrics used in our QRF model help define habitat quality, such as cooler temperatures in August, less pool-tail fine sediment, and higher channel unit frequencies (a measure of habitat complexity and surrogate for the number of pool-riffle sequences or potential sheer areas providing feeding zones), and fish cover. Metrics that describe habitat quantity set some bounds around possible capacity estimates, while metrics describing habitat quality refine those estimates to better match conditions at that site.

Extrapolation model
Fish are mobile creatures and determining the appropriate spatial scale to estimate how their capacity may be determined by habitat characteristics is important. In the summer, for Chinook salmon parr, our fish data clearly show movement between multiple channel units (e.g., pool, riffle, run), suggesting that fish are utilizing habitat at a larger scale than the channel unit. However, it is unlikely that they are moving up and down the entire watershed, and we believe the 200-600 m reaches used in this study are an Our extrapolation model was focused on extrapolating to other master sample points because that is the dataset available to us, but the methodology could be improved. Extrapolating to reaches on a stream network, as opposed to points on the landscape, could improve the interpretability of the results. This would require a stream network with relevant attributes attached to each reach, similar to the GAAs we used. Another approach could be to move toward sampling habitat in a more spatially continuous fashion, covering most or all of a watershed, and building a QRF model from that dataset. Even if the fish data were not collected continuously, estimates of capacity could be made directly from the QRF model across the entire stream . Spawner-recruit data from nine watersheds. Solid lines are the spawner-recruit curve, dashed lines are the estimated capacity, and shaded polygons depict the 95% confidence intervals of capacity. Red corresponds to Beverton-Holt models, purple to Ricker models, blue to hockey stick models, and green to QRF estimates. The QRF solid curve is a Beverton-Holt model with the capacity parameter fixed to the QRF estimate of capacity. A few curves with high capacity estimates were not plotted to improve readability. network without the need for an extrapolation model.
One of the potential downsides to the extrapolation approach used here is that the GAAs generally do not change through time, and therefore may not reflect the dynamic nature of changing stream habitat. While the QRF model itself uses habitat characteristics that can be observed to change over the course of several years, most GAAs are static, generally derived remotely or from another model. This is the nature of extrapolating to such large spatial extents; it can be impossible to gather actual habitat data on such a scale, but with improvements in remote sensing, spatially continuous data (modeled or measured) may be on the horizon (Tonina et al. 2019).
The future: improving habitat data Given the cost/extent of data necessary for QRF extrapolation in watersheds outside of the Columbia River basin, there is a pressing need to develop new tools for habitat analyses. Unmanned Aerial Systems (UAS or drones, commonly) are gaining popularity in wildlife and ecosystem monitoring for their ease of use, safety, accessibility, and cost-efficiency (Jones et al. 2006, Chabot andBird 2015). UAS produce high-resolution, permanent data at a fraction of the cost of on-the-ground habitat sampling. Advances in imaging techniques (e.g., multispectral imaging) and post-processing (e.g., automation of data collection from imagery) are already demonstrating an increase in the efficiency and accuracy of data collection (Whitehead and Hugenholtz 2014, LeCun et al. 2015, Weinstein 2018. Further, developments in Light Detection and Ranging (LiDAR) technology have allowed for the characterization of watershed scale geomorphologic and hydraulic variables not previously possible (McKean et al. 2008, Tonina et al. 2019. Development of a standardized protocol to incorporate remotely sensed data (LiDAR, aerial imagery) into the collection of habitat metrics would greatly increase the broadscale application of QRF. Rapid advances in drone technology further improve upon traditional habitat data collection by leveraging (1) sub-meter global navigation satellite system (GNSS) receivers; (2) cost-effective drone imagery collection, image stitching, and photogrammetry; and (3) semi-automated to automated data post-processing. All data collection efforts would be georeferenced and topologically compatible to increase repeatability of methods and data collection locations; a primary criticism of previous CHaMP survey efforts. The implementation of such a protocol would circumvent the need to extrapolate by collecting data for individual channel units in a rapid manner using remote sensing technologies, thereby reducing labor, providing a cost-effective tool for habitat data collection supporting status and trend evaluation and model products to better inform habitat rehabilitation prioritization and planning.

Conclusions and next steps
If a species' carrying capacity is defined or constrained, at least in part, by the habitat in which it lives, then illuminating statistically how such habitat impacts carrying capacity can lead to understanding how a species interacts with its environment. This understanding could be of crucial importance in the realm of conservation when dealing with an endangered or threatened species, but species/habitat interactions are a core element of ecological studies more generally. We have demonstrated how a quantile regression approach, coupled with a random forest framework, can be used to estimate these relationships and predict a habitat's capacity. As large ecological datasets become more accessible, and the ability to measure large swaths of habitat more feasible, we believe this approach has many potential applications, from the North American breeding bird survey to groundfish trawl surveys. The framework could also be applied to depleted, non-migratory, and isolated populations (e.g., desert pupfish Cyprinodon macularius) to identify limiting factors in populations and/or determine whether a given habitat patch could support a viable population if limiting factors were addressed. Capacity estimates could also be used to evaluate potential translocation sites to determine whether those sites could support an abundance considered viable before investing in translocation efforts.