Sensitivity Analysis of Wind and Turbulence Predictions With Mesoscale‐Coupled Large Eddy Simulations Using Ensemble Machine Learning

Coupling between mesoscale models and large‐eddy simulation (LES) models is increasingly used to more realistically represent the wide range of scales of atmospheric motions affecting boundary layer winds and turbulence that need to be simulated accurately for applications such as wind energy. However, such mesoscale‐to‐microscale coupled modeling frameworks are potentially affected by a large number of uncertain closure parameters. Here, we investigate the sensitivity associated with six closure parameters related to a 1.5‐order subgrid‐scale turbulence closure for an ensemble of mesoscale‐coupled LES. The simulations are performed using the Weather Research and Forecasting model nested from horizontal resolutions of greater than a kilometer down to tens of meters. Closure parameters are varied to generate perturbed parameter ensembles for two case studies of highly sheared, convective boundary layers observed in the Columbia Basin of Oregon and Washington during the Second Wind Forecast Improvement Project. Machine learning algorithms are used to explore the sensitivity of LES predictions, considering the effects of the perturbed physical parameters alongside categorical factors such as the case study identity, measurement location, and LES resolution. For the conditions we examine, a single parameter, the eddy viscosity coefficient, is the dominant source of parametric sensitivity and its importance is comparable to the categorical factors for several of the simulation response variables we examine.

Another challenge is the multiplicity of sources of uncertainty for mesoscale-to-microscale coupled modeling. Many of these sources of uncertainty, for example, specification of boundary and initial conditions and numerical discretization choices, are structural in nature. However, parametric uncertainties associated with poorly constrained parameter values within model closures are also important. Both mesoscale models and LES make use of parameterizations to represent unresolved scales of turbulent motions, surface layer interactions, and other physical processes. In a holistic quantification of parametric uncertainty, sensitivities propagating from both mesoscale and microscale modeling components would be taken into account in order to quantify the overall coupled modeling uncertainty. However, the large number of poorly constrained and possibly influential parameters (plus the relatively high computational cost of LES) renders such an approach infeasible. This implies the need for an approach that first identifies the influential parameters in each of the mesoscale and LES closures, and then performs combined mesoscale and LES uncertainty analyses considering only a significantly reduced but representative subset of parameters.
From the mesoscale modeling perspective, there has been significant progress made in examining parametric sensitivities of hub-height wind predictions. For example, using the WRF model (Skamarock et al., 2019) in a mesoscale configuration to simulate wintertime conditions in the complex terrain of the Columbia Basin region, Yang et al. (2017) analyzed the sensitivity of predicted turbine-height wind speeds to 12 parameters of the Mellor-Yamada-Nakanishi-Niino (MYNN) Level 2.5 planetary boundary-layer (PBL) scheme (Nakanishi & Niino, 2006, 2009 and 14 parameters of the revised MM5 surface-layer scheme (Jiménez et al., 2012). Promisingly, this study found that most of the uncertainty in predicted wind speeds was attributable to just a few of the parameters in each of the boundary-and surface-layer schemes. Berg et al. (2019) confirmed this finding for wintertime conditions, while Yang et al. (2019) showed that the sensitivity of WRF's Yonsei University PBL scheme (Hong et al., 2006) is also largely attributable to just a few parameters. Smith and Ancell (2019) used a similar methodology to examine sensitivity of wind ramp event predictions to initial conditions and parameters of the MYNN scheme. Importantly, these studies show that, even within a single region, the details of the sensitivity show diurnal (Yang et al., 2017 and seasonal variations  and can differ even between events of nominally similar types (Smith & Ancell, 2019). Here, we aim at understanding sensitivities of LES modeling for the same region.
In this work, we examine sensitivities to parameters of a widely-used 1.5-order closure for subgrid-scale (SGS) turbulent fluxes of momentum and scalars in simulations that use WRF's grid nesting capability to span horizontal grid resolutions from 1.35 km to 50 m. This closure solves a prognostic equation for turbulent (alternatively, "turbulence") kinetic energy (TKE) in order to provide a velocity scale for estimating an eddy diffusivity. In Section 2, we review this SGS closure, identify the parameters, and explain their prior uncertainty ranges. In Section 4 we describe the construction of a perturbed parameter ensemble (PPE) of simulations of two case studies with similarities in the large-scale flow scenario, but differences in the structure of fine-scale turbulence. Results of these simulations are qualitatively presented in Section 5. To quantitatively assess the relative significance of the SGS closure parameters for predictions of wind speed, direction, and turbulence levels, we construct random forest (RF) models of the PPE and calculate feature importance (FI) (Antoniadis et al., 2021;Genuer et al., 2010). These findings are presented in Section 6. Finally, we discuss our conclusions in Section 7.

Parameter Identification for a 1.5-Order SGS Turbulence Model
LESs typically employ a closure to represent turbulent fluxes associated with scales too fine to be resolved on the computational mesh. A number of such closures have been developed, but one of the most commonly used formulations for LES applications of WRF (as well as within many other LES models for atmospheric boundary layers) is based on the closure described by Deardorff (1980) (see, e.g., Gibbs and Fedorovich (2016) and references therein). This closure type is often referred to as a 1.5-order scheme because it solves a single prognostic equation for the SGS TKE = 1 2 ′ ′ rather than a full set of second-moment equations. SGS momentum (scalar) fluxes are then computed using an eddy viscosity (eddy diffusivity) approach. Here we do not undertake to thoroughly review such turbulence modeling approaches but rather aim to highlight the roles of specific parameters 10.1029/2022JD037150 3 of 21 within these closures and how we determine their range of values to test. In our discussion, we largely follow the notation used in documentation for WRF except as noted.

Description of the Model
The SGS TKE e provides a velocity scale which is combined with a length scale l and a coefficient c k to compute the eddy viscosity K m according to (1) Assuming homogeneous, isotropic turbulence and a sharp spectral cut-off filter, a theoretical value for c k can be obtained as where α is the Kolmogorov coefficient (Lilly, 1967). Taking α = 1.5 as in Lilly (1967) yields c k = 0.094, however other sources assume somewhat different values of α with correspondingly different values of c k . For example, Schumann (1991) uses α = 1.6. The limitations of such estimates are important to recognize, especially when we try to configure SGS closures for simulating non-homogeneous flows with a poorly defined, implicit, grid-based filter, and we note that WRF's default specification of this parameter, c k = 0.15, has a larger value than the theoretical estimates.
The length scale l in Equation 1 is an important aspect of the closure that helps it to remain applicable under different states of stratification. When the local grid-scale stratification is neutral or unstable, l = Δ, where Δ is the filter scale set by the grid resolution, that is, Δ = (Δ Δ Δ ) 1∕3 . Under local stable stratification, where the scales of turbulence are suppressed, l is limited by where N is the Brunt-Väisälä frequency and c n is a model coefficient. This closure can be understood by considering c n e 1/2 as an approximation to the standard deviation of the SGS vertical velocity fluctuations. Different values for c n can be obtained by assuming different levels of suppression of vertical velocity fluctuations relative to horizontal velocity fluctuations. WRF's default, which is also the value used by Deardorff (1980), is c n = 0.76. Nevertheless, smaller values, which indicate stronger suppression of vertical variations, are also plausible. For example, c n = 0.5 is suggested by Nieuwstadt (1990) and used by Gibbs and Fedorovich (2016). A reasonable upper bound is given by c n = 0.82, which results from the assumption that the velocity fluctuations are isotropic.
SGS scalar fluxes can be computed using K m and a closure for the turbulent Prandtl number Pr t . Deardorff (1980) proposes so that Pr t increases from 1/3 under neutral or unstable stratification to 1 under very stable stratification. While these limits are widely accepted, the details of how Pr t varies with stratification are far less clear (Gibbs & Fedorovich, 2016). To capture this uncertainty, we introduce an additional parameter into Equation 4 that varies the rate at which Pr t changes with decreasing l/Δ. Our modified expression is and the effect of the introduced parameter n Pr is illustrated in Figure 1a. Note we are not suggesting Equation 5 be used as a permanent modification to the SGS scheme, but rather employed here as a means to identify possible sensitivity to the turbulent Prandtl number closures that would otherwise be difficult to isolate.
Production (or destruction) of SGS TKE due to shear and buoyancy effects can be modeled self-consistently using the eddy viscosity/eddy diffusivity-based SGS flux closures. However, additional assumptions are needed to account for the dissipation of TKE associated with molecular viscosity. This quantity, which is denoted by D ϵ , is parameterized as where c ϵ is a model coefficient. At the first vertical grid level above the surface, an enhanced value c ϵ = 3.9 is used; at all other levels c ϵ is modified in response to local stability. We express this stability dependence in the form which differs slightly from WRF's default expression that includes a dependence on c k . Here c ϵN is the value of c ϵ when turbulent length scales are not limited by stability. As for c k , a theoretical estimate can be made by assuming homogeneous isotropic turbulence and a sharp spectral filter, giving Moeng and Wyngaard (1988) discuss a multiplicative correction factor of 3 1/4 that can be applied to such estimates to better account for the filter shape (spherical rather than cubical in wave number space).

In Equation 7
, c ϵS is the minimum coefficient obtained under very stable conditions. Following de Roode et al. (2017), we can determine a critical Richardson number Ri C at which e vanishes. In our notation, Ri C depends on the closure parameters as which we can rearranged to yield Figure 1b illustrates the values of c ϵS obtained with some different combinations of c k , c n , and Ri C .
Finally, although it is not a parameter of the SGS turbulence closure per se, we are interested in checking how uncertainty in the surface flux parameterization might impact the LES solutions. For this purpose, we apply a multiplicative factor z f to the baseline roughness length used within the MYNN surface layer scheme using the stability functions of Dyer and Hicks (1970).

Determination of Parameter Ranges
As discussed by Yang et al. (2017), some subjectivity is involved in determining the ranges of parameter values to be tested in a sensitivity analysis. The prior ranges should be sufficiently inclusive of plausible values without arbitrarily enhancing the importance of a parameter with over-relaxed bounds. The parameter ranges and WRF's default values (defined either explicitly or implicitly in the code) are summarized in Table 1. These ranges are inputs to a Latin hypercube sampling scheme (Helton & Davis, 2003;Stein, 1987).
A lower limit to the eddy viscosity coefficient c k can be developed from the theoretical estimate (Equation 2) by varying the Kolmogorov coefficient α by 10 percent around a nominal value of 1.5. This gives a minimum value of c k = 0.08. The upper limit is set by doubling the default value used by WRF to give c k = 0.3.
The theoretical estimate expressed as Equation 8 is similarly used to obtain a range between 0.8 and 1.1 for the neutral flow TKE dissipation coefficient, with application of the filter shape correction factor of Moeng and Wyngaard (1988) increasing the upper bound of values to 1.5.
The parameter c n is limited between 0.5 and 0.82 as explained in the discussion following Equation 3, and the critical SGS gradient Richardson number Ri C is allowed to vary between 0.1 and 1. Observational analyses (e.g., Grachev et al., 2013) commonly indicate values of about 0.2-0.25, which is consistent with WRF's assumptions.
The turbulent Prandtl number exponential factor n Pr is allowed to span between 0.01 and 100 to obtain limiting behaviors of Pr t between 1/3 and 1 (see Figure 1a) and the roughness length factor z f ranges between 1 and 2 following Yang et al. (2017).
Using these bounds, we generate 64 combinations of perturbed parameters using Latin hypercube sampling (Helton & Davis, 2003;Stein, 1987). Note we actually compute sample values of log(n Pr ) and log(z f ). Each of these 64 parameter combinations is used to generate one ensemble member of a PPE of nested WRF/WRF-LES simulations for each of two case studies, yielding a total of 128 such simulations. Additional details of the case studies and simulation setups are provided in the next section.

Case Selection and Simulation Configurations
We investigate two case studies, each spanning multiple hours, on July 22, 2016 and August 21, 2016. These cases were selected from periods observed during the Second Wind Forecast Improvement Project (WFIP2) campaign, which was carried out in the Columbia Basin region of Washington and Oregon Wilczak et al., 2019). Both of these days were identified in the WFIP2 event log (Wilczak et al., 2019) as being associated with marine push events. Marine pushes or marine intrusions (Januzzi, 1993;Mass et al., 1986)  The simulations are performed in WRF version 4.1.2 (with minor modifications to implement the perturbed parameters in the code) using three levels of nested domains, which are configured based on the findings of Rai et al. (2019). The domain configuration is illustrated in Figure 2. The outermost domain has a horizontal grid spacing of 1.35 km, slightly in excess of the boundary layer height, and simulation of the flow on this domain is performed in a mesoscale modeling mode using the MYNN level 2.5 PBL and surface scheme (Nakanishi & Niino, 2006, 2009. Initial and boundary conditions are specified from Global Forecast System reanalysis. This domain is treated identically among all ensemble members. For both case studies, the mesoscale domain is initialized at 12 UTC and the solution is integrated for 6 hr before initiating solutions on two nested domains with horizontal grid spacing of 150 and 50 m. Both of these nests are treated as LES and the same set of perturbed parameter values is used on each. All three domains are advanced for at least four additional hours, with a 2 hr period selected for detailed analysis. Specifically, we examine 20:20-22:20 UTC for July 22 and 20:00-22:00 UTC for August 21. These time intervals were chosen to obtain periods when the simulated wind speeds were fairly uniform in time while also allowing spin up of the nested domains. Observations were collected during WFIP2 using a variety of instruments at a number of different sites covering a broad swath of the Columbia Basin (Wilczak et al., 2019). However, the limited extent of the LES domains means that only a subset of these can be encompassed. Therefore, the simulation domains are positioned so that all three include the Physics Site-12 (PS-12) tower that had sonic anemometers positioned at 50 and 80 m vertical levels.
The Physics Site was a heavily instrumented area that was a key aspect of the WFIP2 observational strategy. Further details are provided in Wilczak et al. (2019). All domains also included the Wasco, OR airport site a few kilometers away. As described in Section 6, output from this second location is also used in the parameter sensitivity analysis. We note that the goal of our study is not to calibrate LES model parameters relative to observations; rather we are focused on determining which parameters among those we test are most influential and should be retained in future analyses. In the context of our study, comparisons with observed values are primarily used as a check that a relevant range of parameter values have been sampled in the PPE, as this range is not well established for many of the model parameters a priori.
Additionally, a common vertical grid with 129 levels is used for all nests. The levels are chosen to provide a gradual increase from 10 to 25 m vertical spacing over the lowest 1 km of the domain. Sensitivity to vertical resolution is not evaluated within our study, largely due to computational constraints, although in general simulated boundary layer winds and turbulence can be sensitive to the choice of vertical grid spacing. WRF's fifth (third) order scheme is used for horizontal (vertical) advection of momentum and scalars. The mesoscale domain time step is 3 s, with time step ratios of 10 and 5 applied to the successive nests. To help account for these differences in temporal resolution, all simulation data is either interpolated or downsampled to 1 Hz frequency (as are the observational data) prior to analysis.
In addition to the description of the key model settings provided here, sample WRF namelist input files for the simulations can be obtained from the Atmosphere to Electrons Data Archive and Portal (https://a2e.energy.gov/ about/dap).

Simulation Results
In this section, we present some key results of the PPE and examine parameter sensitivity on a qualitative basis, preparatory to creating machine learning models of the data to facilitate additional sensitivity analysis.  coherent structures in the flow. These structures can also be inferred from observations by examining streamwise and cross-stream turbulence timescales. Particular parameter values tend to enhance those structures so that shifts in their position produce a strong signal in the wind speed time series at a fixed location. Such structures are not as significant for the July case, indicating the importance of day-to-day variability even for cases associated with nominally similar meteorological phenomena. This change in the coherent structures in the flow is generally consistent with the higher wind speeds and lower surface heat fluxes for the August case relative to the July case, although understanding of the transitional behavior between roll and cellular organization of the convective boundary layer remains imperfect (e.g., Salesky et al., 2017) and is further complicated by terrain effects (Rai et al., 2017).
Surface sensible heat flux H at the PS-12 location ( Figure 4) is in excess of 300 W m −2 throughout the LES simulations. The relative range of U 10min at a height of 80 m among the ensemble members is three to six times greater than the relative range of H. This is suggestive that the spread in wind speed is attributable to factors in addition to variations in surface heat fluxes among the ensemble members (note that surface latent heat fluxes vary more weakly than sensible heat fluxes). Later results will show that wind speed predictions are not very sensitive to the surface roughness factor, which also supports this conclusion.
While predictions of winds and turbulence at hub-height have clear importance for wind energy applications, it is also important that models be able to capture the overall boundary layer structure well, as a prerequisite for simulating interactions within and between wind farms and for assessing the effects of wind farms on their environment. More generally, prediction of boundary layer height is an important component of modeling how the boundary layer interacts with the overlying layer of the atmosphere, entraining air with different characteristics of momentum and scalar concentrations. Figure 5 shows the variation of boundary layer height h over the 2-hr windows used for detailed analysis of the LES. This quantity is diagnosed by computing bulk Richardson numbers from 10-minute averaged profile data and using a critical value Ri b,c = 0, similar to the method for determining boundary layer height used in WRF's Yonsei University PBL scheme (Hong et al., 2006). For the July case, h is in general agreement among all three nest levels, although it is consistently lower on the coarser resolution LES domain compared to the higher resolution one. However, h predicted by the LES ensembles (at both resolutions) is markedly shallower than h obtained on the mesoscale nest for the August case, which is actually quite similar in value to h predicted on the mesoscale domain for July. This result again highlights the differences in the finer-scale structures of the flow on the two case days, which can only be fully revealed through higher resolution simulations.
The results presented in Figures 3-5 indicate the significant spread among ensemble members for a selection of several important quantities. Now we turn to relating the differences among the ensemble members to differences For example, Figure 6 relates the range in U 10min (averaged over the 2-hr windows defined above) to the six perturbed parameters, considering both the influence of the specific case study and grid resolution. Consistent with Figure 3, results of the July case show a low bias relative to the observations. For approximately c k > 0.2, there appears to be a decreasing trend in wind speed with increasing eddy viscosity coefficient for the 150 m resolution results. Otherwise, no clear relationships with other parameters can be detected by visual inspection alone. In contrast, there is a marked trend of decreasing U 10min with increasing c k at both resolutions and across the full range of c k values for the August case. As for the July case, relationships with the other parameters are not readily discernible.
The strength of turbulent fluctuations in the horizontal winds is an important quantity to be gleaned from LES, and Figure 7 considers the effect of parameter values on such predictions (here, specifically, the resolved TKE associated with timescales less than 10 min which is denoted as E <10min ). Only for the eddy viscosity parameter is a trend in values apparent, with E <10min declining for high c k until at some value, which depends on the resolution and case study, the flow is essentially non-turbulent. Note, however, that at smaller c k values (again, with this range of c k values being case and resolution dependent) the sensitivity of E <10min is quite weak. For both cases and resolutions, agreement with observed E <10min can be obtained using c k at or below approximately 0.1, which is close to the estimate of Lilly (1967) and equal to the value recommended by Deardorff (1980) and used in many LES codes. WRF's default c k = 0.15 seems to be an acceptable choice for the July case but too large for the August case.
While E <10min indicates the intensity of turbulence, it does not characterize its structure. For this purpose, we compute an integral timescale of the wind speed fluctuations. Although the magnitude of E <10min computed from observations was similar for both cases (black lines in Figure 7), the results in Figure 8 show the differences in how these fluctuations are organized: an integral timescale of just less than 90 s for the July case compared to nearly 400 s for the August case. The simulation results show a large range of predicted values. Repeating a by now familiar pattern, the overall sensitivity to c k appears much larger than to any other parameters, albeit with the details of the sensitivity to c k depending on whether c k falls into the range of low, intermediate, or high values for the specific case and resolution (as for E <10min ).
Although it might be expected that the tripling of horizontal grid spacing moving from the coarser, 150 m LES domain to the finer, 50 m LES domain would significantly improve the ability of the simulations to match observed characteristics of wind speed fluctuations, this does not seem to borne out in practice for these case studies. Depending on the choice of parameter values, model predictions at either resolution can approximate    the observed values. However, the value of increased horizontal resolution is seen in predictions of wind shear (Figure 9), evaluated between 50 and 80 m vertical levels and averaged over the 2 hr periods defined above. Only at the higher resolution are some ensemble members able to match the observed values, whereas shear is overpredicted at the coarser resolution. At 50 m horizontal resolution, best results are obtained with an eddy viscosity coefficient of around 0.1. Shear increases roughly linearly with c k until values plateau for approximately c k > 0.2. Sensitivity to other parameters appears quite limited.
Because the strong dependence of the model's predictions on c k might impede detection of more subtle relationships with other parameters, and because simulation predictions were clearly degraded at large values of c k , we created a subset of half the original ensemble members that were deemed to show better performance. The simulation quality was evaluated on the basis of subjectively chosen, but quantitative, criteria described below. Rather than directly selecting specific ensemble members for inclusion in the halved sub-set, we ranked the results of the finest grid spacing (50 m) simulations on the basis of the sum of the relative errors in their predictions of the three response variables shown in Figures 7-9, considering both the July and August cases, and selected the ensemble members with ranks 1 through 32. Although no explicit criterion on the value of c k was applied, this approach led to the selection of ensemble members with c k < 0.2, roughly halving the original range in c k values (see Table 1). The sampling range of the remaining parameters is preserved, as shown in Figure 10.
This subsetting differs from the splitting of the data set that is inherent in the construction of the RF models that are discussed in the next section, as each RF model considers a single response variable. In contrast, the subsetting of the ensemble that we make considers a combination of response variables that characterize different aspects of the simulated winds.

Sensitivity Analysis Using RF
Machine learning tools can be used to create meta-models of how the LES results respond to input parameters (referred to as features in this context). Various model types can be used, offering different strengths and weaknesses. Here we focus on sensitivity analysis using RF regression (Breiman, 2001) as the meta-modeling approach. An advantage of the RF approach is its suitability for identifying parameters which affect the model output nonlinearly and its ability to handle interactions between variables. In addition, as an ensemble machine learning approach, RF can address the issue of over-fitting. The application of RF models for variable selection and sensitivity analysis tasks is discussed by Genuer et al. (2010) and Antoniadis et al. (2021). We use the RF implementation made available through the scikit-learn library (Pedregosa et al., 2011).
Briefly, our methodology begins with the construction of a set of data with vectors of explanatory variables, or features, and vectors of responses. Here, the features are defined to include the values of the closure parameters (c k , c n , c ϵN , Ri C , n Pr , z f ) as well as discrete categorical features: the day/date ( The method proceeds by growing regression trees using a subset of the data referred to as the training data. The other data is reserved as testing data to allow the accuracy of the RF model to be assessed. Both the training and testing accuracy are expected to be satisfactory (e.g., greater than 90%) before we finalize a model. Once the RF model has been constructed, it can be used to assess the importance of any feature. RF FI is an impurity-based measure of importance that looks at the relative depth at which different features are used as decision nodes in the trees composing the RF. FI might be biased as it can overestimate the importance of continuous or numeric features relative to discrete features with only a few possible values (Strobl et al., 2007). The RF FI technique can be complemented by an alternative, more general (i.e., nonspecific to RF models) procedure called permutation importance (PI) ranking, where feature values are randomly permuted and the corresponding change in the model score is computed. The greater the decrease in the model's score, the more important the permuted parameter is deemed to be. We present FI and PI results and find them to be largely consistent. Additionally, we tested the alternative but related meta-modeling approach called gradient tree boosting (Friedman, 2001), as implemented via the XGBoost software library (Chen & Guestrin, 2016), as a verification of the RF analyses. Although these results are not presented here, they support the main conclusions of the RF analyses. To verify that differences in FI between the full and halved ensembles were not simply due to changing the number of samples, we examined FI scores of c k obtained from 40 randomly selected sub-ensembles of 32 parameter combinations each. For all response variables examined here but one, the FI of c k for the full ensemble is within the interquartile range of FI for the random sub-ensembles. The exception is the turbulence timescale, for which the full ensemble's c k FI is within the range of FI for the random sub-ensembles. However, the FI of c k for our selected sub-ensemble lies beyond or at the edge of the range of variability of FI scores of the random sub-ensembles for all response variables except the mean wind veer, and it is outside the interquartile range for all response variables.
Sensitivities in prediction of U 10min are examined in terms of the first and second moments of the wind speed distribution. Results are shown in Figure 11, where FI results (summing to unity) are shown by blue and orange bars, respectively, for the full and halved (i.e., the parameter combinations shown in Figure 10) ensembles and feature rankings using PI are indicated by the numerical annotations of the bars. Note that the testing accuracy of the RF model for wind speed variance of the halved ensemble was lower (77%) than for other response variables.
Predictions of the mean of U 10min is (unsurprisingly) most sensitive to the case date. For the full ensemble, the importance of c k is similar to that of the site or height. The FI of c k is diminished in the sub-ensemble, but the eddy viscosity coefficient remains the most important of the parametric features. PI rankings are consistent with the FI values. Turning to the variance of U 10min shown in the lower panel of Figure 11, c k has the greatest FI ranking among all features in the full ensemble (followed by the categorical features) and remains important in the sub-ensemble. PI ranks c k slightly lower among the most important features (consistent with the known biases of FI), but nonetheless clearly confirms that sensitivity to c k is much greater than for any other parametric features. This is also consistent with Figure 6. The key aspect to note is that c k shows high importance, even in the sub-ensemble with reduced range of c k , that is comparable to or greater than the importance of the categorical features. Sensitivity to the remaining parameters is low, confirming the earlier qualitative analysis (i.e., Figures 7  and 8). Figure 10. Parameter values of the halved ensemble. Marker color indicates the rank of that parameter combination in the error metric described in the text: dark blue corresponds to low rank (lower error) while green through yellow shades correspond to increasing rank (and increasing error).
Sensitivities in the prediction of vertical derivatives in wind speed (shear) and direction (veer) are presented in Figure 13. The "height" feature for shear and veer has a different interpretation than for the other response variables, as shear and veer are computed as the differences in wind speed and direction between a pair of heights.
We use three such pairings: 50 and 80 m (allowing comparison to the data of the sonic anemometers at the PS-12 site), 50 and 120 m, and 120 and 190 m. The latter two height pairings are selected to represent the extremes of shear and veer across a turbine blade during its rotation, given that modern, large land-based wind turbines have hub heights around 120 m and blade lengths around 70 m. Not surprisingly, the height feature is uniformly  identified as the most important feature for prediction of shear for both full and halved ensembles using either FI or PI. Veer, however, is quite strongly influenced by the site, which indicates an importance of local topographic features. Continuing the established pattern, c k dominates the importance of the parametric features for both shear and veer.
Ranking of variables in terms of FI and PI scores is informative regarding which variables should be considered influential, but is not conclusive. In other words, features that show lower relative importance may still be important in an absolute sense. There is also the possibility that an importance score is not a reliable indicator of the feature's real influence (e.g., due to insufficient sampling of values in the simulation ensemble). To address these potential concerns, two additional analyses were performed using the full ensemble data.
The goal of the first analysis is to determine a threshold for PI scores to indicate a greater-than-random influence on the response variables. The testing procedure consists of replacing an input feature with a dummy feature with randomly assigned values (here, draws from a normal distribution) and computing the PI of this dummy feature as a reference. As Table 2 summarizes, PI values of the "real" input features date, site, resolution, and c k exceeded the dummy variable's PI for all the response variables discussed here. Additionally, the height input feature has greater PI than the dummy variable for mean and variance of the 10-min wind speed and for the wind shear. The Prandtl number exponential factor n Pr has PI greater than the dummy for variance of the 10-min wind speed, sub-10-min TKE, and turbulence integral timescale.
The second analysis is aimed at gauging which FI scores show statistical significance, such that comparison of values among the input variables to ascribe relative importance is soundly based. This analysis is motivated by the fact (as detailed, e.g., by Wang et al., 2016) that variable importance measures can suffer from instability due to the intrinsic randomness in the construction of an RF model, even once variations due to data selection and RF model parameter choices have been set aside. Here, we generate 100 sample data sets for each input feature and response variable combination, in which the values of one input feature have been randomly permuted while leaving the other input features as in the original data set. An RF model is trained for each of the 100 sample data sets and FI scores are computed. We then compare the 95th percentile FI of the permuted feature to the FI of that feature in the original data set (the "original" FI). If an input feature's original FI is greater than the 95th percentile FI from the randomized samples, the input feature's FI score is deemed to be significant for that response variable. We note the 95th percentile threshold is not theoretically based, but rather reflects a heuristic choice. If an input feature's original FI is less than the threshold, this does not necessarily imply that the feature is unimportant. Rather, it indicates that the FI score should be interpreted with caution and that additional data (such as the PI scores discussed above) should be used to supplement the assessment of that feature's importance. The results of this analysis are summarized in Table 3, and are largely consistent with those shown in Table 2. Date, site, height, resolution, and c k importance scores show significance for most of the response variables. In contrast to the analysis of PI scores, FI scores for c ϵN and Ri c show significance for the sub-10 min TKE and turbulence integral time. For the latter quantity, the FI of z f also shows significance.
We interpret these results as confirming that c k should be prioritized among the model parametric input features for sensitivity testing, but that other parametric features should be retained for consideration whenever it is feasible to do so. This is especially true for applications that depend on higher order turbulence statistics or when considering atmospheric conditions that differ from those considered here, especially different atmospheric stability conditions. In particular, the influence of the turbulent Prandtl number specification cannot be totally dismissed in our results. Furthermore, unless only specific dates, measurement sites, model resolutions, or vertical levels need to be considered for an application, these factors should also be prioritized in simulation sensitivity analyses.
A filled square (■) indicates that the input feature's permutation importance is greater than that of the dummy variable; an unfilled square (□) indicates that the permutation importance of the dummy variable is greater.

Note.
A filled square (■) indicates that the input feature's feature importance in the original data set exceeds the 95th percentile of FI from the permuted samples; an unfilled square (□) indicates the opposite, and suggests that input feature's FI score for the given response variable should be interpreted with caution.

Conclusions
We created a collection of 128 mesoscale-to-microscale coupled model runs using the grid nesting capability of the WRF framework. The simulation ensembles are comprised of 64 combinations of perturbed parameters related to the 1.5-order SGS turbulence closure originated by Deardorff (1980) that are used to simulate two case studies from the WFIP2 campaign. Due to the large-scale weather pattern in the Columbia Basin region on these two dates associated with marine intrusion events, both case studies exhibit strong westerly winds in excess of 10 m s −1 and strong surface fluxes. However, spatial organization of turbulence and other detailed features differed between the 2 days in both observations and simulations. From the simulation data, we computed a number of quantities relevant to wind energy associated with wind speed, direction, turbulence characteristics, and vertical variations of the wind profile.
Qualitative analysis of the results as well as quantitative assessments of FI using methods based on RF meta-modeling were performed. Both these approaches highlighted that, among all the tested parameters, the eddy viscosity parameter c k stands out in importance. Even when the original 64 parameter combinations were subset to 32 (roughly having the range of c k ), c k remains more influential than any other parameter,and it is often of comparable or greater importance than categorical features such as horizontal grid spacing of the nested domain or height level at which the quantity is computed. The only feature that consistently exceeds c k in importance is the case study date. This finding reflects the important role of c k within the turbulence modeling approach, where the magnitude of the eddy viscosity affects both the dissipation of resolved-scale turbulent fluctuations and the closure of shear and buoyant production of SGS TKE. Our findings also indicate WRF's default value of this parameter, 0.15, is larger than ideal for these cases and a value around 0.1 (as originally used by Deardorff (1980) and commonly recommended for other LES models) yields better results.
In addition to ranking input features according to their PI and FI scores obtained from RF meta-modeling, further tests were performed to verify the significance of these importance scores relative to randomized inputs. The combined implications of the PI and FI analyses confirm the significance of c k 's importance scores for all response variables considered here, and also indicate that a few or all of the categorical features have significant importance scores for each of the response variables. The FI and PI of some other SGS closure parameters (in particular, the parameter related to the turbulent Prandtl number) show significance for second-order flow statistics, such as the wind speed variance, but do not meet our significance criteria for mean flow statistics.
Lest it be concluded that the utility of LES is degraded by this sensitivity to the eddy viscosity coefficient; we note that our results indicate that the range of c k values which is likely to yield satisfactory results is actually rather narrow and is consistent with recommendations in the literature for idealized LES, even though these are highly non-idealized simulations. This suggests that a useful suite of sensitivity tests can be conducted with a limited number of ensemble members. Furthermore, by assessing the sensitivity of our simulations using RF FI and PI metrics, we are able to provide context to the significance of parametric sensitivities with information on the importance of categorical factors such as the case study date, specific measurement site, and horizontal grid spacing (resolution). In particular, the former two factors show high importance for most of the response variables we evaluate, which argues there is a high value to accounting for mesoscale variability and details of terrain by performing realistic, mesoscale-to-microscale coupled simulations.
Finally, further work is needed to examine how these parameter sensitivities and FI rankings change under a wider array of case studies, and especially considering conditions of stable stratification such as might be encountered nocturnally over land or offshore. We also look toward combined analyses of mesoscale and LES sensitivities to more comprehensively understand the uncertainties of mesoscale-to-microscale coupled modeling, and especially those sensitivities that are most important for wind energy applications.

Data Availability Statement
Observational and simulation data used in the preparation of this article are available from the Atmosphere to Electrons Data Archive and Portal (https://a2e.energy.gov/about/dap). The sampled values of the perturbed parameters and example namelist files for the WRF simulations are also provided as supplementary information to the simulation data sets. We thank the DAP team for their assistance in archiving the simulation data. Martin) for the their effort in cataloging events. The WFIP2 study was also made possible by a large number of individuals involved in the deployment of the measurement systems. We also thank our colleagues on the Mesoscale-Microscale Coupling project for many helpful interactions in the course of this study, as well as the anonymous reviewers of this work whose comments led us to make valuable improvements.