Stratification of the vertical spread‐skill relation by radiosonde drift in a convective‐scale ensemble

Ensemble forecasting systems provide useful insight into the uncertainty in the prediction of the atmosphere. However, most analysis considers ensembles in latitude, longitude, and time. Here, the vertical aspects of the spread-skill relation are considered in a convective-scale ensemble via comparisons with radiosonde ascents. The specific focus is on the impact of stratifying the spread-skill relation by radiosonde drift. The drift acts as a proxy for the mobility of the atmosphere. The overall spread-skill relation shows the temperature has a better relation than the dewpoint. However, the total variance comparisons between model and observations indicate that the dewpoint is underspread throughout the atmosphere, whilst the temperature is overspread through the lower atmosphere and underspread aloft. This suggests that the model bias is influencing the spread-skill relation. Stratifying these results by the radiosonde drift indicates that the spread-skill relation, and model bias, for both temperature and dewpoint degrade with increased mobility. For the most mobile situations, the ensemble is underspread throughout the atmosphere. These results have implications for ensemble design in terms of the role and influence of the driving ensemble in regional systems, as more mobile situations will have a stronger dependence on the lateral boundary conditions. In the longer term, they may also imply that different strategies are required depending on the mobility of the synoptic conditions. This argues for more consideration of "on-demand" ensemble forecasting systems to allow a fairer representation of the uncertainty in different situations.


1 | INTRODUCTION
Ensemble forecasting systems aim to represent the uncertainty in the atmosphere; for precipitation, they are particularly useful when resolution increases (e.g., Clark et al., 2009). Ensemble quality is traditionally examined through properties related to its spread. One property that is often considered is the spread-skill relationship. This relationship has two aspects: (i) an unbiased, well-spread ensemble has similar skill to its spread, and (ii) the model total variance matches the climatological observed variance (e.g., Johnson & Bowler, 2009). The relation thus ensures that a well-spread ensemble will capture the variability in the atmosphere with a good spread-skill relation and a long-term variance that matches observations. The spread-skill relation has been considered several times using global ensembles (e.g., Hopson, 2014; Scherrer et al., 2004; Whitaker & Loughe, 1998) and, with adaptation, can also be considered for limited-area, high-resolution ensembles. Examples at the convective scale include adaptations to consider the spatial spread-skill relation. The relation often shows that convective-scale ensembles are underspread (e.g., Chen et al., 2018; Dey et al., 2014).
Ensemble spread is typically examined for specific levels or variables (e.g., surface, 500 hPa) but is now increasingly considered in the vertical (e.g., Flack, 2017; Melhauser & Zhang, 2012; Schwartz et al., 2014). Examination of the vertical aspects is particularly useful as the vertical state of the atmosphere is critical for forecasting certain types of weather phenomena (e.g., convection and precipitation type; Bourgouin, 2000). The vertical state of the atmosphere is regularly considered through observed profiles (e.g., radiosonde ascents), which can be compared against model profiles to examine spread-skill relations, understand physical processes, and ascertain skill (e.g., Bain et al., 2022; Hanley & Lean, 2021; Schwartz et al., 2014; Woodhams et al., 2022).
The work presented here is a development of analysis presented in Schwartz et al. (2014) and Bain et al. (2022). Schwartz et al. (2014) considered the spread-skill relation using radiosonde ascents as the basis for assessing their 15 km ensemble. They found that their Ensemble Kalman Filter driven ensemble was, on average, well-spread. However, the dewpoint values were underspread throughout the troposphere, and wind and temperature were overspread in the mid-upper troposphere. On the other hand, Bain et al. (2022) considered a small selection of cases from the UK Testbed Winter 2020. The differences in the spread-skill relationship between convective-scale and hectometric-scale models were examined. They indicated similar results between the two different simulations and postulated that this could be a function of the small domain size of the hectometric ensemble due to error refresh rates. They also noted that sharp inversions were not captured well in the ensemble, as in Hanley and Lean (2021).
To advance these studies, the vertical spread is explored in a month-long trial for a convective-scale, 12-member ensemble. More specifically, the question considered is: what is the impact of stratifying the spread-skill relation by radiosonde drift? The answer to this question will help indicate the quality of the ensemble in different situations. There could also be implications for the design of "on-demand" forecasting systems in a different context to that suggested by Flack et al. (2018) and references therein.
The rest of this article is set out as follows. The observations, models, and comparison techniques are presented in Section 2; the spread-skill relation is considered in Section 3; conclusions are drawn in Section 4.

2.1 | Radiosondes
Radiosondes are regularly launched from seven sites across the United Kingdom (Figure 1). The reporting times for the radiosondes are 1200 and 0000 UTC. In practice, the launches occur between 11-12 UTC and 23-00 UTC. The primary data collected are the temperature, dewpoint, wind speed, and Global Positioning System (GPS) location of the radiosonde. Further details on the UK radiosonde launching system and specifications are found in Ingleby and Edwards (2015). In this study, model comparisons are made with all available radiosondes launched during the month between 2300 Z January 31, 2022 and 1200 Z February 28, 2022, that is, 366 profiles. This period has been chosen due to the variability of the weather over the United Kingdom, which included three named extra-tropical storms and some transient ridges.

2.2 | Model
The Met Office Unified Model (UM) at version 13.0 has been used to create a 12-member ensemble with fixed 4.4 km grid spacing for the domain shown in Figure 1. The domain is larger, and the resolution lower, than the operational convective-scale ensemble: Met Office Global and Regional Ensemble Prediction System-UK (MOGREPS-UK). These differences were required to ensure that the radiosonde drift is fully captured within the domain for all ascents considered. The UM is run with the physics defined by the midlatitude component of the second Regional and Atmosphere Land (RAL2M) configuration (Bush et al., 2023). The ensemble derives its initial and boundary conditions from the Met Office Global and Regional Ensemble Prediction System-Global (MOGREPS-G), and the model physics perturbations are provided by the random parameter scheme (McCabe et al., 2016). The model is run with 90 vertical levels capped at 40 km. The forecasts are initialized daily at 0300 UTC from 27 January to 27 February for a duration of 120 h. Model variables are output on pressure levels every 25 hPa to ensure fair comparisons across the cases.
To compare against the radiosonde data, the dewpoint is required. However, the dewpoint is not directly output on pressure levels. Therefore, it is calculated from the temperature and relative humidity (RH). The method used to calculate the dewpoint follows Bolton (1980).
In terms of vapor pressure, the RH is defined as:

$$\mathrm{RH} = \frac{e}{e_s} \times 100\%, \tag{1}$$

for $e$, the vapor pressure, and $e_s$, the saturation vapor pressure. The vapor pressure can be defined as:

$$e = 6.112 \exp\left(\frac{17.67\,T}{T + 243.5}\right), \tag{2}$$

for $T$, the temperature (in °C). Calculating $e$ based on the dry-bulb temperature results in $e_s$. After calculation of $e$, Equation (2) can be rearranged for $T$. In this instance $T$ will be equivalent to the dewpoint temperature ($T_d$):

$$T_d = \frac{243.5 \ln\left(e/6.112\right)}{17.67 - \ln\left(e/6.112\right)}. \tag{3}$$

The manual calculation of the dewpoint could lead to small differences between the simulations and observations, as it is based on empirical formulae. However, these should be minor compared to other errors. Equation (3) is most accurate when applied to temperatures between −35 and 35 °C (Bolton, 1980). Therefore, dewpoints outside of this range are not included in the comparison. This results (on average) in the exclusion of dewpoints above 400 hPa.
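The dewpoint calculation described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: it assumes the widely quoted Bolton (1980) fit for saturation vapor pressure (constants 6.112 hPa, 17.67, and 243.5 °C), assumes RH is supplied in percent, and uses hypothetical function names. Dewpoints outside the −35 to 35 °C range are returned as `None`, mirroring the exclusion described in the text.

```python
import math

# Constants of the commonly quoted Bolton (1980) fit (assumed here).
A_HPA = 6.112    # hPa
B = 17.67        # dimensionless
C_DEGC = 243.5   # degrees Celsius


def saturation_vapour_pressure(temp_c):
    """Saturation vapour pressure (hPa) for a temperature in degrees Celsius."""
    return A_HPA * math.exp(B * temp_c / (temp_c + C_DEGC))


def dewpoint_from_rh(temp_c, rh_percent):
    """Dewpoint (deg C) from dry-bulb temperature (deg C) and RH (%).

    Returns None if RH is not positive or if the dewpoint falls outside
    the -35 to 35 deg C range where the fit is most accurate.
    """
    if rh_percent <= 0.0:
        return None
    # Vapour pressure from the definition of RH.
    e = rh_percent / 100.0 * saturation_vapour_pressure(temp_c)
    # Invert the saturation vapour pressure fit for temperature.
    log_ratio = math.log(e / A_HPA)
    td = C_DEGC * log_ratio / (B - log_ratio)
    if td < -35.0 or td > 35.0:
        return None
    return td
```

At saturation (RH of 100%) the function returns the dry-bulb temperature itself, which is a quick consistency check on the inversion.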

2.3 | Comparison techniques
The first aspect of the spread-skill relation is often defined through the root mean square difference and error, respectively (e.g., Hopson, 2014). However, as in Bain et al. (2022), here the skill is defined as the "ensemble members − observations", and the spread is defined as the differences between ensemble members for each unique pairing. When differences between the spread and skill are considered, the mean absolute difference and mean absolute error are used. This aspect is considered for the average of these metrics at each forecast time.
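As an illustration of these definitions, the sketch below computes the error and spread samples for a single variable at one level and one forecast time. It is not the authors' code: the function names are hypothetical, and the sign convention for the member-member differences (first member minus second in each unique pair) is an assumption.

```python
from itertools import combinations


def member_errors(members, observation):
    """Signed member-minus-observation differences (the skill sample)."""
    return [m - observation for m in members]


def member_differences(members):
    """Signed differences for each unique pair of members (the spread sample)."""
    return [a - b for a, b in combinations(members, 2)]


def mean_absolute(values):
    """Mean of the absolute values of a non-empty sample."""
    return sum(abs(v) for v in values) / len(values)


# Toy example: three members and one observation (values are made up).
members = [1.0, 2.0, 4.0]
mae = mean_absolute(member_errors(members, 2.0))   # mean absolute error
mad = mean_absolute(member_differences(members))   # mean absolute difference
```

For a 12-member ensemble there are 66 unique member pairings, so the spread sample at each level is considerably larger than the 12-value error sample.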
FIGURE 1 Orography for the full domain used for the ensemble simulations and the radiosonde launch locations in the British Isles: Lerwick (L), Albemarle (A), Castor Bay (CB), Watnall (W), Valentia (V), Herstmonceux (H), and Camborne (C).

The second aspect of the spread-skill relation should be considered through climatological variances (e.g., Johnson & Bowler, 2009). A full convective-scale climatology for the equivalent domain is unavailable. Therefore, the second aspect of the spread-skill relation is acknowledged by comparing the total variance of the ensemble and observations over the period considered. The total variance is that calculated by aggregating over the entire period. Stratification by radiosonde drift is not considered for this aspect of the relation due to the reduced sample size.
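The total variance comparison can be sketched as a simple ratio at one pressure level and lead time. The sketch below is illustrative only: it assumes that "aggregating over the entire period" means pooling every member from every case before taking the variance, and it uses the population variance; neither choice is confirmed by the text, and the function name is hypothetical.

```python
from statistics import pvariance


def total_variance_ratio(ensemble_values, observed_values):
    """Ratio of pooled ensemble variance to observed variance.

    ensemble_values: one list of member values per case (e.g., per ascent).
    observed_values: one observed value per case.
    A ratio above 1 suggests an overspread ensemble, below 1 an underspread one.
    """
    # Pool all members from all cases into a single sample.
    pooled = [value for case in ensemble_values for value in case]
    return pvariance(pooled) / pvariance(observed_values)


# Toy example with two cases and two members per case (values are made up).
ratio = total_variance_ratio([[0.0, 2.0], [4.0, 6.0]], [1.0, 5.0])
```

Repeating this for every pressure level and lead time would give profiles of the kind described for the second aspect of the relation.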
Following Laroche and Sarrazin (2013), the radiosonde drift is accounted for by following the balloon trajectory in the model. The grid point closest to the GPS location of the balloon at each pressure level is used. Accounting for radiosonde drift is important due to the distribution of maximum distance covered by the ascents in this period (Figure 2). The maximum distance covered is 1.28° of longitude. This distance is equivalent to 32 grid points. Therefore, using static columns would not lead to a fair comparison between the model and observations. Furthermore, due to the smooth nature of the field, neighborhoods would have limited impacts (e.g., Mittermaier, 2014). However, it is acknowledged that research is still needed into what a "representative" temperature in a neighborhood means (e.g., Roberts et al., 2023), and so this could be an important future avenue of investigation.
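A minimal sketch of the trajectory-following extraction is given below. It assumes a regular grid described by one-dimensional latitude and longitude axes (in whatever coordinate system the model uses) and a field stored as nested lists indexed `[level][lat][lon]`; these data structures and function names are hypothetical and chosen only to keep the example self-contained.

```python
def nearest_index(grid_coords, value):
    """Index of the grid coordinate closest to the requested value."""
    return min(range(len(grid_coords)), key=lambda i: abs(grid_coords[i] - value))


def model_profile_along_track(field, grid_lats, grid_lons, track):
    """Sample a model field at the grid point nearest the balloon at each level.

    field: nested lists indexed as field[level][lat_index][lon_index].
    track: list of (level_index, balloon_lat, balloon_lon), one per pressure level.
    """
    profile = []
    for level, lat, lon in track:
        j = nearest_index(grid_lats, lat)
        i = nearest_index(grid_lons, lon)
        profile.append(field[level][j][i])
    return profile


# Toy example: two levels on a 2 x 3 grid (values are made up).
field = [[[0, 1, 2], [3, 4, 5]], [[10, 11, 12], [13, 14, 15]]]
grid_lats = [50.0, 50.04]
grid_lons = [0.0, 0.04, 0.08]
track = [(0, 50.0, 0.01), (1, 50.03, 0.07)]
profile = model_profile_along_track(field, grid_lats, grid_lons, track)
```

A static-column comparison would instead reuse the launch-site indices at every level, which is the approach the text argues against for drifts spanning many grid points.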
For the stratification of the spread-skill relation, the radiosonde drift is divided into terciles based upon the great circle distance travelled by the radiosonde (Figure 2c). The lowest tercile represents stagnant conditions (<0.33°); the upper tercile represents mobile conditions (>0.47°).
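The stratification step can be sketched as a great circle distance followed by a threshold test. The haversine formula used here is a standard way to compute the angular distance and is an assumption about the method; the thresholds of 0.33 and 0.47 are taken from the text and are assumed to be in degrees. The function names and class labels are hypothetical.

```python
import math


def great_circle_degrees(lat1, lon1, lat2, lon2):
    """Great circle angular distance in degrees (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return math.degrees(2 * math.asin(math.sqrt(a)))


def drift_class(distance_deg, lower=0.33, upper=0.47):
    """Assign a drift distance (degrees) to the short, medium, or long tercile."""
    if distance_deg < lower:
        return "short"
    if distance_deg > upper:
        return "long"
    return "medium"
```

For example, `drift_class(great_circle_degrees(50.0, 0.0, 50.0, 1.0))` classifies an ascent that drifts one degree of longitude at 50° latitude; in practice the tercile limits would be recomputed from the drift distribution of the period being studied.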

3 | RESULTS
The spread-skill relation is first considered for the entire period (Section 3.1) to understand the context and behavior of the ensemble. Then, the spread-skill relation is stratified by radiosonde drift (Section 3.2).

3.1 | The spread-skill relation
Figure 3 shows the first aspect of the spread-skill relation. There are biases in the vertical profile of both temperature and dewpoint when considering the temporally averaged relation over all lead and start times (Figure 3a,b). This will impact the interpretation of the spread-skill relation. The member-member differences in both fields are close to zero. There is some structure throughout the vertical in the means, with greater variability located towards the tropopause in the temperature, and around 700 hPa in the dewpoint. There is also reduced variability below 800 hPa. The standard deviations about the means are larger in the error than in the spread, suggesting the model is underspread. Comparing the distributions of both errors and spread (Figure 3c,d) shows that both distributions are Gaussian. However, there is a greater frequency of spread focused around zero, compared to the error. Thus, the standard deviations, and distributions, provide additional information about the quality of the ensemble that accounts for the model bias. Considering the evolution in time averaged across all cases (Figure 3e,f), both fields tend to be underspread, with dewpoint being more underspread than temperature (which at times is close to being well-spread). As with the average over all times, the temperature is most underspread near the tropopause. The upper troposphere has improvements in the spread-skill relation at most lead times; the lower troposphere shows more variation but tends to be underspread at the earlier lead times (Figure 3e). On the other hand, other than in the lowest 100 hPa, the dewpoint, and as such humidity, is underspread at all times and levels of the troposphere.
The underspread nature of the ensemble at the tropopause is consistent with both Hanley and Lean (2021) and Bain et al. (2022). This is likely related to the model's tendency to under-represent inversions. The consistency of this factor in the ensemble helps confirm this as a model bias.
Figure 4 illustrates the second long-term aspect of the spread-skill relation and provides additional insight into the quality of the ensemble. The temperature (Figure 4a) shows a tendency to be overspread in the lower atmosphere, and the depth over which the ensemble is overspread grows with lead time. Towards the start of the forecast the ensemble has similar total and observed variance at 700 hPa; by the end of the forecast this is at approximately 450 hPa. This range is static over the first 81 h, but there is a rapid growth from T + 93 h onwards. In contrast, the dewpoint (Figure 4b) remains underspread above 925 hPa at all lead times, apart from in the middle atmosphere where the variances tend to equal each other.
The tendency for the ensemble to be overspread in the boundary layer implies that the boundary layer perturbations are potentially too large. Within RAL2M there are two schemes that produce perturbations in the boundary layer: a stochastic boundary layer perturbation scheme (discussed in Bush et al., 2020) and the random parameter scheme (McCabe et al., 2016). It is plausible that a more physically based stochastic scheme could be beneficial in producing the required level of spread, such as the Clark et al. (2021) scheme, which has smaller magnitude perturbations (and similar impacts on the forecast). The overdispersive boundary layer implies that further consideration needs to be given to perturbations aloft.
The patterns in spread characteristics from the total variance ratios match well with the comparisons of the standard deviations of the mean of the error and spread in Figure 3 (for example, there is a smaller standard deviation in the mean spread at the tropopause compared to that in the mean error). This provides additional evidence that the first aspect of the spread-skill relation is dominated by the bias in the model. However, the additional elements presented here aid in its interpretation. Given the variation in weather across the month, it is likely that the different synoptic conditions, and their mobility, could influence the quality of the ensemble. Thus, the spread-skill relation is next considered for different mobilities.

3.2 | Stratification by radiosonde drift
Ensemble spread is known to vary in different situations (e.g., Flack et al., 2018). Therefore, it is plausible that the results presented above could differ with changing conditions. A simple assessment of the variation under different synoptic mobilities is made via the use of radiosonde drift. Figure 5 shows the first aspect of the spread-skill relation for the three terciles of radiosonde drift (defined in Figure 2); it shows two factors related to the stratification: (i) the bias increases with increased mobility, and (ii) the model becomes more underspread with increased mobility.
These factors around the model bias and spread increasing with drift may be related. For example, the increase in the bias acts to complicate the interpretation of the spread-skill relation, as there is not an unbiased model. Despite this, considering the standard deviation range reduces the impact of the bias and indicates that the ensemble is underdispersive with increased mobility.
The spread-skill relation for both temperature and dewpoint (and thus humidity) for short drifts shows very close agreement between the spread and skill. The structures of the errors and spread are nearly identical (Figure 5a,c). This makes physical sense given the idea of error refresh rates (i.e., the time it takes the errors to leave the domain; appendix in Bain et al., 2022). In more stagnant conditions the perturbations will be able to grow upscale and influence the forecast within the domain, as they will not be removed from the domain as quickly.
FIGURE 3 The spread-skill relation for February 2022 for (a, c, e) temperature and (b, d, f) dewpoint. The spread-skill relation is shown as (a, b) a temporal average, (c, d) a relative histogram of the differences and errors combined, and (e, f) evolving with forecast lead time as vertical profiles in the atmosphere. The vertical profiles are considered between 950 and 200 hPa to ensure that there is no interpolation to pressures lower than the observed minimum surface pressure. In panels (a-d) the red lines represent the error and the blue lines the spread; in (a) and (b) the solid line is the mean, and the dashed lines are ±1 standard deviation around the mean.

FIGURE 4 Ratios of the total variance of the ensemble to the observed variance for (a) temperature and (b) dewpoint. The vertical profiles are considered between 950 and 200 hPa to ensure that there is no interpolation to pressures lower than the observed minimum surface pressure.

FIGURE 5 As Figure 3 but split by different radiosonde drift distances: (a-d) represent short radiosonde drifts, a maximum distance in either latitude or longitude of up to 0.33°; (e-h) represent medium radiosonde drifts, a maximum distance in either latitude or longitude of 0.33-0.47°; and (i-l) represent long radiosonde drifts, a maximum distance in either latitude or longitude exceeding 0.47°.

However, in mobile situations the error refresh rate will be faster and therefore the perturbations may not be able to have as much of an influence on the spread of the ensemble inside the domain. This can be illustrated by a simple calculation of the error refresh rates from the speed-distance-time relation. For the upper limit of each drift tercile considered, the error refresh rates are 93, 66, and 31 h, respectively, which corroborates the previous comments. These values assume that the balloon bursts 1 h from launch and that the errors travel in a straight line across the longest dimension of the domain, representing a somewhat idealized scenario. Whilst these will not always be the error refresh rates for the domain, they serve as an illustrative example of why there could be a reduction in the spread compared to the skill (and potentially an increased bias) with increased balloon drift.
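The speed-distance-time estimate of the error refresh rate can be written as a one-line calculation. The sketch below is illustrative: the function name is hypothetical, the drift and the longest domain dimension are assumed to be expressed in the same units (for example, degrees), and the domain length is left as a parameter because its value is not stated in the text.

```python
def error_refresh_time_hours(drift, domain_length, ascent_hours=1.0):
    """Time (h) for errors to cross the domain at the balloon's mean drift speed.

    drift: distance travelled by the balloon during its ascent.
    domain_length: longest dimension of the domain, in the same units as drift.
    ascent_hours: time from launch to burst (1 h is the assumption in the text).
    """
    speed = drift / ascent_hours   # distance units per hour
    return domain_length / speed


# Toy example: a drift of 0.5 units in 1 h across a domain 30 units long.
refresh = error_refresh_time_hours(0.5, 30.0)
```

Because the refresh time is inversely proportional to the drift, doubling the drift halves the time available for perturbations to grow before they are advected out of the domain.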
The dewpoint and temperature relations show similar results. The main difference is that the dewpoint error increases with height. This could be related to two factors: (i) a reduction of the number of available points due to the calculation of dewpoint, and (ii) an increase in the humidity bias with height.
The impact of the refresh rate implies a connection with the lateral boundary condition perturbations, as a faster refresh rate will likely be more dominated by the boundary conditions. This could imply that an equivalent investigation in the driving ensemble (MOGREPS-G), or with a different driving model, could be beneficial to help understand the impact of the driving model, as it is a fundamental part of the regional modeling system. It also raises questions as to the importance of domain size for ensembles: larger domains could show more benefit in a mobile situation than in a stagnant situation. On a shorter-term basis, this framework could be used for testing upgrades to aid in the decision-making process in operational systems. However, future visions could include feasibility studies on the use of an "on-demand" ensemble with varying domain sizes.
It is also important to note that the error refresh rate will not explain all of these results. For example, factors such as surface contrast or wind direction could also be important in explaining the ensemble quality.
All aspects of the spread-skill relation, including those discussed in Section 3.1, raise questions as to whether regime-dependent perturbations are required. Thus, the question of improving the quality of the ensemble will likely come down to optimal design of the perturbations, rather than their size. For example, effective perturbations to the tropopause could be influential in producing sensible spread aloft in the atmosphere. However, according to error growth theories (e.g., Selz & Craig, 2015), this should occur via the upscale growth process. Therefore, the mechanisms of error growth need to be considered in the design of ensembles, and as such this is an important area of research that needs to continue.

4 | SUMMARY
Ensembles are vital tools for representing and understanding the variability of the atmosphere. Their value is predominantly examined through the spread-skill relation. This can be split into two components representing (i) the differences between the error and spread, and (ii) differences in the total variance of model and observations. However, this relation is rarely considered in the vertical. Using a convective-scale ensemble and radiosonde ascents, the vertical spread-skill relation is assessed for forecasts during February 2022. The spread-skill relation is considered further by stratifying the relation by the distance travelled by the radiosonde during its ascent. The key findings are as follows:
• The temperature is predominantly underspread at the tropopause.
• The second component of the spread-skill relation indicates that temperature is overspread in the lower atmosphere; the height at which the model is overspread with respect to observational variance grows with lead time.
• The dewpoint is underspread throughout the atmosphere at all lead times.
• As radiosonde drift increases the spread-skill relation deteriorates: the largest drifts show the least dispersive ensemble.
The latter point implies that there is reduced spread in more mobile situations. This is supported through calculating the error refresh rate, which illustrates that there is limited chance of upscale growth of perturbations having an impact within the domain during the most mobile situations. In a practical context, this indicates that the ensemble will likely represent a fairer assessment of potential hazards, associated with phenomena sensitive to vertical profiles, in more stagnant conditions compared to those in more mobile situations.
The results presented here tend to agree with biases noted by Ingleby and Edwards (2015), Hanley and Lean (2021), and Bain et al. (2022) around the representation of inversions. However, boundary layer inversions were not detected within this study. This could be due to a lack of sharp boundary layer inversions over the period considered, and investigations over different seasons would be required to examine this further. In relation to the work presented in Schwartz et al. (2014), there were some differences, as there were no overspread characteristics in the upper troposphere in the temperature field. However, there was agreement that the dewpoint was underspread throughout the atmosphere, which suggests that a better understanding of humidity variation throughout the atmosphere is needed. It is worth noting that the Schwartz et al. (2014) ensemble had the benefit of data assimilation, had a different resolution, and considered a different time of year. These factors could help explain some of the differences presented here in relation to that study.
The quality of ensembles should be considered through both aspects of the spread-skill relation (as in Johnson & Bowler, 2009), but further attention needs to be paid to this relationship, and the growth of perturbations (and errors), in the vertical. Furthermore, this study highlights the limitations of drawing conclusions based on the average results of different weather types and highlights the need for regime or spectrum-based analysis (e.g., Flack et al., 2018). This has implications for the testing process when upgrading convective-scale ensembles. Further investigations could include understanding the overspread characteristics within the boundary layer and the role of different driving models. Longer-term work should consider the design of "on-demand" forecasting systems, with areas to consider including adaptive perturbation techniques and adaptive domain sizes. Continued investigation of these adaptive approaches is required to improve the benefits of "on-demand" forecasting systems, while balancing the added complexity, due to their potential to help improve forecasting and understanding of uncertainty in the atmosphere.

FIGURE 2 Relative frequency histograms for the distance travelled by the radiosonde for each launch location in the United Kingdom. The histograms indicate the distances in (a) longitude, (b) latitude, and (c) the great circle distance. The solid magenta line in (c) represents the limit of the first tercile (0.33°), and the dashed magenta line represents the limit of the second tercile (0.47°). The terciles are used for the stratification of the spread-skill relation. In each plot the solid line represents the Camborne ascent, the dark dashed line the Herstmonceux ascent, the dark dotted line the Watnall ascent, the dark dash-dotted line the Castor Bay ascent, the pale dashed line the Albemarle ascent, the pale dotted line the Lerwick ascent, and the pale dash-dotted line the Valentia ascent. Figure 1 shows the location of each launch site.