A scheme for verifying the spatial structure of extremes in numerical weather prediction: Exemplified for precipitation

This paper describes a novel scheme for assessing the ability of a numerical weather prediction (NWP) model to forecast the horizontal spatial structure of local extremes, for example, when comparing accumulated precipitation to a corresponding observed field. Both local maxima and minima are considered. The quality is measured by a score function between 0 and 1, which could potentially be adjusted for specific applications of NWP. The effect of neighbourhood size is included in order to consider the impact of resolvable scales. The scheme complements existing spatial verification schemes with added value and has the potential to be used both in operational verification and as a tool for revealing model deficiencies linked to individual forecast cases. The properties of the scheme are illustrated through various idealized cases. Verification results are presented for a real convective precipitation case and for a frontal precipitation case over several days. The scheme is discussed in the context of operational conditions, using synthetic fields of forecasts and analyses based on statistics of an operational NWP model at the Danish Meteorological Institute (DMI) and related observed data from a 5-year period. Different future options are outlined, including the potential application of the scheme for ensemble prediction systems.


| INTRODUCTION
From the outset of numerical weather prediction (NWP) modelling, there has been a need to verify the quality of the model predictions. Many aspects of model performance have been studied in the past and have led to a large variety of verification schemes. This is reflected in detailed documents from international meetings, for example, from the World Weather Research Programme (WWRP) Joint Working Group on Forecast Verification Research (http://www.cawcr.gov.au/projects/verification/).
There is an increasing focus on using NWP models to warn the public about localized precipitation events, for example, cloud bursts. It has become feasible to operate high-resolution cloud-resolving NWP models providing detailed precipitation forecasts of extreme events. However, these operational high-resolution systems are not perfect, and the value of the increased spatial detail needs to be assessed by appropriate verification methods. The challenges related to high-resolution NWP model systems are summarized by, for example, Yano et al. (2018).
Providing optimal information and clear communication on precipitation to NWP users is demanding because of the large possible variations of precipitation across regions or municipalities of a country. Each user of rain forecasts would like to know whether heavy precipitation or dry weather will occur at a given location. Unexpected large precipitation amounts such as cloudbursts in large cities may lead to significant property damage and related economic losses as a result of flooding. Knowledge about expected dry conditions in a given area is also very important, especially if such an area can be clearly distinguished from neighbouring areas with high probabilities of rain. Activities depending on dry conditions comprise numerous outdoor activities, for example, construction work, sport events, concerts, and so forth. As a consequence, ideal information from NWP to users would comprise a detailed geographical display of predicted rain and dry areas, and there is a need to verify this geographical distribution (structure) of areas with high and low precipitation amounts, in particular the highest and lowest values (dry spots). This requires a special type of verification, which is not easy to provide by traditional verification schemes. This paper describes a novel verification scheme named SLX (Structure of Local EXtremes), dedicated to this particular aspect of verification, considering both maxima and minima of predicted and analysed precipitation.
Precipitation is challenging to verify because of its large variability in space and time, as documented, for example, by Eggert et al. (2015) in their investigation of temporal and spatial scaling impacts on extreme precipitation over Germany. Fiener and Auerswald (2009) documented a variability of observed precipitation down to sub-km scales. Lovejoy and Mandelbrot (1985) pointed to fractal properties of rain. This highlights the need to compare model predictions with observations at different spatial and temporal scales. In recent years, high-resolution precipitation analyses have become available for NWP model verification. They are based on networks of weather radars and rain gauge measurements. An increasing number of precipitation observations are becoming available as crowd-sourced data for NWP (e.g., Hintz et al., 2019). While useful, these data require careful quality control to contribute to improving the quality of precipitation analyses.
The increasing horizontal resolution of operational NWP models down to cloud-resolving scale has led to another verification issue, the so-called 'double penalty' problem, for example, Bougeault (2003) among others. First, the model is penalized for not predicting an observed high value at the observed location, and then is further penalized for predicting a high value at a slightly incorrect location. Therefore, a new generation of verification schemes has been developed to address the double penalty problem. Ebert (2008, 2009) summarizes the concepts of spatial verification, and Gilleland et al. (2010) provide an instructive summary of four dominating types of spatial verification: neighbourhood, scale separation, features-based and field deformation, respectively. The neighbourhood methods include information in a neighbourhood around the grid point, which may be an observation or a forecast point, to assess the quality. This acts as a kind of filter. An example of this type of scheme, which is widely used in limited area NWP, is the Fractions Skill Score (FSS) (Roberts & Lean, 2008). FSS compares binary fields (0 and 1), derived by examining whether some threshold is exceeded or not in the forecast and analysis fields. The related fractions of 1 in neighbourhoods are compared as input to the score computation over the entire limited area.
Scale separation methods assess quality by a process of decomposing the fields into different scales. An example of this type is Harris et al. (2001) who examine multi-scale properties of a precipitation forecast. Features-based methods identify specific structures, objects or characteristics linked to the observed and forecasted fields and compose a score from that. A popular scheme of this type is the SAL scheme for limited area NWP (Wernli & Paulat, 2008). SAL stands for 'Structure, Amplitude and Location'.
Recently, the spatial verification methods have been extended to include distance measures. This category measures the spatial distance between certain meteorological events, forecasted and observed (Dorninger et al., 2018). In addition, a comprehensive set of data has been defined for international collaboration on spatial verification methods (Gilleland et al., 2020).
Details of the methodology related to the new scheme (denoted SLX) are described in Section 2. The data used as a basis for verifying the scheme at the Danish Meteorological Institute (DMI) are summarized in Section 3. An analysis of idealized examples is included in Section 4, while Section 5 looks at the scheme in the context of real-world examples. Section 6 is a discussion section, and conclusions are provided in Section 7.

| DEFINITION OF THE NEW SCORE
The intention of the novel verification scheme SLX (Structure of Local EXtremes) is to design a score with four sub-components measuring how well diagnosed local extremes (minimum or maximum) of a forecast or analysis field are reproduced in an absolute sense by the other field. Two scores identify, respectively, how forecasts match analysed maxima and minima in the model domain, and two other scores identify, respectively, how analysed values match identified forecasted maxima and minima. All components of the score assign values in the interval [0, 1], with 0 being poorest performance and 1 perfect agreement.
This means that local maxima and minima are identified in the forecast and analysis fields. When scores related to these values are computed on a daily basis, they do not represent model performance for extremes in a traditional sense, since weeks or months may be required to experience precipitation events above a certain threshold. However, post-processing of accumulated daily results over sufficiently long periods will eliminate this issue, since results associated with thresholds representing extremes may be extracted for further examination.
An early version of a scheme addressing the occurrence of local extremes has been operational at DMI for several years (Sass & Yang, 2012). The forecasted precipitation field is compared against the three highest and three lowest observed 12-h accumulated precipitation amounts measured at rain gauges in Denmark. The SLX scheme represents a substantial extension of the scheme used at DMI so far, since it is based on comparing complete horizontal fields of analysed and forecasted precipitation. It has been a priority to design the SLX method such that an output of the scores is produced daily for any input fields of analysis and forecast, including zero-valued fields. Figure 1 illustrates the SLX method. The scheme is based on the availability of a forecast field ɸ(i,j) and an analysis field Ψ(i,j) of accumulated precipitation over some period, typically one or more hours, where i, j represent the points in the two-dimensional field of the model domain to be verified. The two fields are assumed to have the same horizontal resolution. Figure 1 illustrates the geometrical principles of the scheme. Neighbourhoods are defined around local extreme points identified for both forecast and analysis. For simplicity, square neighbourhoods are defined, for example, similar to the neighbourhood treatment defined by Roberts and Lean (2008). The grid point neighbourhood width L used here is measured in number of grid points. The points up to a distance of L from the identified extreme point, measured in each of the two horizontal directions, are included in the neighbourhood. L = 0 means that comparisons are on the level of single grid points. The interval [−L, +L] defines the number of grid points of the neighbourhood in each of the two directions. The total number of grid points in a square neighbourhood is then (2L + 1)².
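As a minimal sketch of the neighbourhood definition (function and variable names are illustrative, not from the paper), the maximum of a field over the (2L + 1)² point square neighbourhood around a grid point can be extracted as:

```python
import numpy as np

def neighbourhood_max(field, i, j, L):
    """Maximum of `field` over the (2L + 1) x (2L + 1) square
    neighbourhood centred on grid point (i, j).  Assumes (i, j) is an
    internal point, so the full neighbourhood lies inside the array."""
    return field[i - L:i + L + 1, j - L:j + L + 1].max()

# A 7 x 7 field with a single wet point at (2, 5)
field = np.zeros((7, 7))
field[2, 5] = 3.0
# L = 0 compares single grid points; larger L widens the search square.
```

With L = 0, only the central point is seen; with L = 2, the neighbourhood around (3, 3) already contains the wet point at (2, 5).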
In Figure 1, L_max is the largest neighbourhood width of the chosen set-up. It is equal to the width of a boundary zone that defines a transition between internal points of the verification domain and external points. If there is only one verification domain of a limited area model, the external points will belong to the host model. The boundary zone is represented by the area between the two large black squares of the figure. The algorithm is designed such that the total domain may be divided into a given number of sub-domains defined by a number of grid points in the two horizontal directions. In this case, some boundary zones are between sub-domains and do not represent a transition to an external model. For large model areas of, say, 1000 or more grid points in both directions, it is reasonable to partition the model into sub-domains in order to distinguish regional differences of the computations in a final output. Alternatively, one may choose to compute averages over sub-domains as a final output. Testing different values of L (0 ≤ L ≤ L_max) is normally part of using SLX for a given weather situation or an idealized case. It is recommended to compute the score based on internal points only, in order to be able to include the same grid points in the computations and complete neighbourhoods of the points considered. This means that the identified local extremes are internal points.

FIGURE 1 Set-up for computing the structure of local extremes. Local extremes ob_max(K1) and ob_min(K2) are identified, corresponding to maxima and minima of the observed field (only one point of each type in the figure). These values are compared, respectively, with forecasted maxima ɸ_max(L,K1) and minima ɸ_min(L,K2) valid for the corresponding neighbourhood(s) around the observed extremes. Similarly, fc_max(K3) and fc_min(K4) correspond to forecasted local maxima and minima, respectively, and Ψ_max(L,K3), Ψ_min(L,K4) are the corresponding observed extremes valid for the associated neighbourhoods. Coloured squares illustrate the local neighbourhoods of width L used in the computation of the individual SLX scores. The largest value used in a chosen set-up is L_max. A boundary zone of width L_max is included to secure a consistent computation for internal points (see Section 2).
Selecting extreme points of the analysis and forecast fields requires a decision on a tolerance limit δ (kg/m²), which reflects the level of uncertainty of the analysis. Choosing a tolerance significantly larger than zero for non-zero precipitation extremes increases the likelihood that several points will be selected as extremes. A value of δ close to zero, for example, 0.1 kg/m², could be chosen as a default.
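One possible reading of the tolerance rule can be sketched as follows (an assumption on our part: here 'extremes' are taken to be all points within δ of the field maximum; the paper's exact selection rule may differ, and all names are illustrative):

```python
import numpy as np

def select_maxima(field, delta=0.1):
    """Return grid indices of points within `delta` of the field maximum.

    Illustrative reading of the tolerance rule: a larger delta admits
    more points as 'extremes'; delta ~ 0 tends to select single points.
    """
    top = field.max()
    ii, jj = np.where(field >= top - delta)
    return list(zip(ii.tolist(), jj.tolist()))
```

With a near-zero δ, only the single largest grid point is returned; raising δ lets near-ties in a coherent maximum area enter the verification with equal weight.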
Zero-valued dry areas will often exist in both the observation and forecast fields. In this situation, multiple points of zero value will automatically be selected as minima for the verification process. All the selected points will be treated with the same neighbourhood width L and contribute with equal weight to the score, as described in (4)-(7) below. The implications of choosing non-zero tolerance values are discussed later, in the context of idealized examples, and in Section 6.

Figure 1 displays the names of the local extreme points: ob_max(K1), ob_min(K2), fc_max(K3) and fc_min(K4), respectively. K1, K2, K3 and K4 are indexes of the extremes, 1 ≤ K1 ≤ M1, 1 ≤ K2 ≤ M2, 1 ≤ K3 ≤ M3, 1 ≤ K4 ≤ M4, with the number of extremes of each type being, respectively, M1, M2, M3 and M4. ob_max(K1) identifies an observed local maximum point at grid coordinates (i_K1, j_K1). Its value is compared with ɸ_max(L, K1), the maximum of the forecast field in the neighbourhood of the central point representing ob_max(K1):

ɸ_max(L, K1) = max{ɸ(i, j) : |i − i_K1| ≤ L, |j − j_K1| ≤ L}. (1a)

ob_min(K2) represents an observed local minimum point at (i_K2, j_K2). When computing the score, ob_min(K2) is compared with ɸ_min(L, K2), the minimum of the forecast field in the neighbourhood of the point representing ob_min(K2):

ɸ_min(L, K2) = min{ɸ(i, j) : |i − i_K2| ≤ L, |j − j_K2| ≤ L}. (1b)

fc_max(K3) represents a forecasted local maximum point at (i_K3, j_K3). When computing the local score, fc_max(K3) is compared with Ψ_max(L, K3), the maximum of the analysis field in the neighbourhood of the point representing fc_max(K3):

Ψ_max(L, K3) = max{Ψ(i, j) : |i − i_K3| ≤ L, |j − j_K3| ≤ L}. (1c)

fc_min(K4) represents a forecasted local minimum point at (i_K4, j_K4). When computing the local score, fc_min(K4) is compared with Ψ_min(L, K4), the minimum of the analysed field in the neighbourhood of the point representing fc_min(K4):

Ψ_min(L, K4) = min{Ψ(i, j) : |i − i_K4| ≤ L, |j − j_K4| ≤ L}. (1d)

Regarding, as an example, the score computation related to an observed maximum ob_max(K1) to be compared with the maximum of the forecast field ɸ_max(L, K1) in the neighbourhood: if these values are in very close agreement, the contribution of the point to the related total score, named SLX_ob_max, is close to 1 (maximum contribution). If the forecast field is very different from ob_max(K1) in the neighbourhood, the contribution to the score will be smaller, and possibly zero if the difference is sufficiently large. Correspondingly, the fit of a forecasted extreme maximum point to the extreme of the analysis points in the neighbourhood leads to a contribution to the score component SLX_fc_max. Finally, the scores related to minima are named SLX_ob_min and SLX_fc_min, respectively: the former determines how well the observed (analysed) minimum point agrees with the minimum of the forecast field in the neighbourhood, and the latter measures the degree of match of a forecasted minimum value to the analysis minimum in the neighbourhood of the forecasted minimum. SLX_ob_max, SLX_fc_max, SLX_ob_min and SLX_fc_min are defined in (4)-(7), utilizing values of the score function defined below.
The assessment of the local value of the score is defined by a score function S (Figure 2), and formulas (2a)-(2c), (3a)-(3b), detailed below, are used for both the local maxima and minima.
'ob' means observed extreme, 'ɸ' means forecasted value. k (kg/m²) represents a value below which a forecast should be assigned a perfect score of 1. It should be equal to or closely related to the observation uncertainty at small precipitation amounts. Currently, k = 0.1 kg/m². A is a dimensionless constant representing how fast the score decreases towards zero when the forecast is larger than the observation.
If ob > k:

S = ɸ/(ob − k) for 0 ≤ ɸ ≤ ob − k, (2a)
S = 1 for ob − k < ɸ ≤ ob, (2b)
S = max{1 − (ɸ − ob)/(A·ob), 0} for ɸ > ob. (2c)

If ob ≤ k:

S = 1 for ɸ ≤ k, (3a)
S = max{1 − (ɸ − k)/(A·k), 0} for ɸ > k. (3b)

It is seen that the present choice of score function S is piecewise linear and very simple. The most characteristic feature is that S is asymmetric due to the application of A = 4 in (2c) and (3b). At DMI, we find this asymmetry useful because of the risk of under-forecasting warning conditions for different user groups who depend on critical thresholds. Over-forecasting will trigger a warning and thus avoid or reduce economic losses. However, too much over-forecasting is not useful and may cause significant and unnecessary expenses to the users. Overprediction relative to the observation by a factor of 5 is set as a limit for useful forecasts. The score function should be thought of as a construct between model developers and NWP users, for example, forecasters. Other specific user groups may define the S function somewhat differently depending on their practical application. The challenge of defining a score function for precipitation is discussed further in Section 6. S decreases linearly from 1 to zero as the fraction (ɸ − k)/(A·k) increases from zero to 1 (Equation 3b). Using k = 0.1 kg/m², S decreases to zero at 0.5 kg/m². A gradual decrease towards zero for ɸ > 0.1 kg/m² is used in order to avoid a large sensitivity to small changes of forecast precipitation close to k. With these definitions, (3a) and (3b) define some asymmetry of the score function near zero. For simplicity, the same coupling as expressed by (3a) and (3b) has been adopted in (2b) and (2c). This also implies a small asymmetry in the score function (Figure 2) near the observation 'ob', since from (2b) S = 1 if ob − k ≤ ɸ ≤ ob while S < 1 for ɸ > ob. However, this asymmetry close to the observed value for ob > k may be adjusted for by imposing S = 1 in the interval [ob − k, ob + k] and an associated small adjustment of S above ob + k by computing S = max{1 − (ɸ − ob − k)/(A·ob − k), 0}. This will retain the formulation of a linear reduction of S to zero at 5·ob. The impact of this type of asymmetry may be considered small when using k = 0.1 kg/m².
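The piecewise-linear score function can be transcribed directly, with k = 0.1 and A = 4 (a sketch: the names are ours, and the linear rise from S = 0 at ɸ = 0 up to S = 1 at ɸ = ob − k in Equation (2a) is our reading of the piecewise-linear intervals of Figure 2):

```python
def score_S(ob, phi, k=0.1, A=4.0):
    """Piecewise-linear SLX score function S in [0, 1].

    ob:  observed (analysed) extreme value (kg/m^2)
    phi: forecast value to compare against (kg/m^2)
    k:   tolerance below which any small forecast is perfect
    A:   asymmetry constant; S reaches 0 at phi = (A + 1) * ob
    """
    if ob <= k:
        # Small observed value: perfect score up to k, then linear decay (3a, 3b)
        if phi <= k:
            return 1.0
        return max(1.0 - (phi - k) / (A * k), 0.0)
    # ob > k (2a-2c)
    if phi <= ob - k:
        return phi / (ob - k)            # linear rise from 0 to 1
    if phi <= ob:
        return 1.0                        # flat top between ob - k and ob
    return max(1.0 - (phi - ob) / (A * ob), 0.0)  # decay; 0 at 5 * ob
```

For example, with ob = 10 kg/m², a forecast of 50 kg/m² (a factor-of-5 overprediction) scores 0, while a forecast of 30 kg/m² scores 0.5.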
The steps of the verification process for the score component SLX_ob_max may be summarized as follows, assuming that a decision has been made on a possible partitioning of the verification into sub-domains as part of the verification set-up:

I. Choose the set of different neighbourhood widths to be used in the experiments.
II. Determine the values (and grid positions) of the observed maxima ob_max(K1).
III. Compute the forecasted maxima ɸ_max(L, K1) in the neighbourhood(s) of the extreme point(s).
IV. Insert ob_max(K1) as the observation 'ob' and ɸ_max(L, K1) as the forecast into the score function S of Figure 2. The two values and the distance between them determine the value of the score S_ob_max(K1).
V. Consider multiple extreme points, 1 ≤ K1 ≤ M1 (Equation 4), and compute the average of all contributions as the final value of the score (SLX_ob_max) linked to the chosen neighbourhood width.
VI. Repeat (III)-(V) for all chosen neighbourhood widths.
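Steps II-V can be sketched end to end for a single neighbourhood width (all names are illustrative; the extreme-selection rule used here, taking points within δ of the field maximum, is our assumption, with δ = 0 selecting only the global maximum points):

```python
import numpy as np

def score_S(ob, phi, k=0.1, A=4.0):
    # Piecewise-linear score function (see Section 2)
    if ob <= k:
        return 1.0 if phi <= k else max(1.0 - (phi - k) / (A * k), 0.0)
    if phi <= ob - k:
        return phi / (ob - k)
    if phi <= ob:
        return 1.0
    return max(1.0 - (phi - ob) / (A * ob), 0.0)

def slx_ob_max(analysis, forecast, L, delta=0.0):
    """Average score over all observed maxima (steps II-V for one L)."""
    top = analysis.max()
    points = np.argwhere(analysis >= top - delta)        # step II
    scores = []
    for i, j in points:
        phi_max = forecast[max(i - L, 0):i + L + 1,
                           max(j - L, 0):j + L + 1].max()  # step III
        scores.append(score_S(analysis[i, j], phi_max))    # step IV
    return sum(scores) / len(scores)                       # step V
```

Repeating the call over a list of L values (step VI) then traces how the score grows as the neighbourhood becomes wide enough to capture a displaced forecast maximum.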
The values of the individual scores, taking into account multiple occurrences of extremes, are given by (4)-(7), which describe the averages for each type of score:

SLX_ob_max = (1/M1) Σ_{K1=1}^{M1} S_ob_max(K1), (4)
SLX_fc_max = (1/M3) Σ_{K3=1}^{M3} S_fc_max(K3), (5)
SLX_ob_min = (1/M2) Σ_{K2=1}^{M2} S_ob_min(K2), (6)
SLX_fc_min = (1/M4) Σ_{K4=1}^{M4} S_fc_min(K4). (7)

S_ob_max(K1) is obtained by inserting for 'ob' and 'ɸ' the respective values, ob_max(K1) and ɸ_max(L, K1), in Equations (2a)-(2c), (3a) and (3b), and correspondingly for the other components.

FIGURE 2 The SLX score function used in this study. The score is defined between zero (poorest quality) and 1 (perfect forecast). The function is piecewise linear in the intervals [0, OB − k], [OB − k, OB], and [OB, 5·OB].
The total combined SLX score is defined as the average of the four individual components according to (8):

SLX = (SLX_ob_max + SLX_fc_max + SLX_ob_min + SLX_fc_min)/4. (8)
| DATA

The scheme is tested using idealized set-ups and forecasts from the HARMONIE-AROME cloud-resolving model system (Bengtsson et al., 2017) as used operationally at DMI. The verification is performed at a horizontal grid point resolution down to 750 m.
A quantitative precipitation analysis system has been developed at DMI, combining in situ observations with a detailed analysis of data from meteorological radars. The radar products, at a horizontal grid size down to 500 m and accumulation periods varying from 10 min to 24 h, have operational status and are being used routinely by forecasters for monitoring of weather and by NWP model developers for model validation and verification.
The theoretical techniques used to merge rain gauge and radar data in near real time are based on Kriging methods: (i) Kriging with external drift (KED) and (ii) Kriging with radar-based error corrections (KRE), which are described in the open literature (Goudenhoofdt & Delobbe, 2009; Sinclair & Pegram, 2005; Wackernagel, 2003). Further, both parametric and non-parametric variograms, describing the spatial statistical variation of precipitation, have been used (Schiemann et al., 2011; Velasco-Forero et al., 2009). The final product evaluated is the variance-weighted mean of these four products (KED and KRE with both parametric and non-parametric variograms).

| IDEALIZED EXAMPLES
The properties of the SLX scheme are first illustrated using idealized examples. The response of the scheme is examined for different precipitation structures displaced between analysis and forecast, for example, coherent precipitation structures and single grid points representing extremes. Also, an idealized precipitation band is studied. The forecasted precipitation band is parallel to the analysed one, but displaced by a number of grid points. Finally, the response of the scheme is studied with regard to field perturbations in forecast and analysis. The domain size used in the computations is 200 × 200 grid points.

| Impact of neighbourhood and scales
We first demonstrate how neighbourhoods and the horizontal distribution of extremes determine the score components. The influence of increasing neighbourhood size is studied in the context of situations with coherent groups of extremes or single points separated at some distance. These extreme points may be thought of as mesoscale precipitation areas that are not precisely forecasted with regard to size and location. Single points of maxima separated in space may represent maxima related to (local) showers. A small default tolerance value of δ ≈ 0 kg/m² for defining extremes favours the selection of single points as maxima; hence, coherent areas of maxima would normally require larger values of δ. Regarding minima, the focus is on zero-valued coherent areas, which may have any size without increasing δ. In Figure 3, the white circular areas represent observed and forecasted areas of zero precipitation, respectively. The yellowish areas represent observed and forecasted maxima, respectively, that are larger than the minimum k (kg/m²) defined in the score function (formulas 2a-2c and 3a-3b). The grey areas represent intermediate precipitation amounts, also larger than k, occurring in both forecast and analysis. Note that the intermediate forecasted precipitation is assumed to overlap the observed areas of maxima and minima. Correspondingly, the observed intermediate values of precipitation overlap the high and low extremes of the forecast values. This simplified set-up illustrates how the scores for zero-valued areas, forecasted or analysed, are influenced by overlapping areas of non-zero values, analysed or forecasted. Treating all points equally means that the individual contributions of zero-valued points to the score carry equal weight.
Regarding minima, SLX_fc_min < 1 as long as the neighbourhood width L is smaller than d_fc-min. Similarly, SLX_ob_min < 1 as long as L is smaller than d_ob-min. At these threshold neighbourhood widths, the scores become 1. For both SLX_fc_min and SLX_ob_min, no single point using a smaller L than d_min in a neighbourhood comparison contributes with a value of 1. It is possible to get scores equal to zero for L < d_min if the intermediate precipitation areas are dominated by values significantly above k (kg/m²), considering the score function used. The critical neighbourhood associated with d_fc-min is shown in Figure 3 as a dotted square. Similarly, the solid blue frame corresponds to d_ob-min.
This highlights the relevance of the scores for minima when zero-valued areas are embedded in an area dominated by non-zero precipitation values. Also, the importance of neighbourhood width is clear: The critical values, d fc-min and d ob-min respectively, determine scales when the scores achieve a value of 1.
Regarding the scores of maxima, SLX_ob_max and SLX_fc_max, the values will again be determined by a weighted average of the scores involving the accepted maximum points. A critical neighbourhood width for comparing maxima is then d_max. If fc_max exceeds ob_max by more than a factor of 5, both SLX_fc_max and SLX_ob_max will be zero for comparisons involving sufficiently large neighbourhoods that include values from both areas of maxima. In all other situations, a non-zero score will be obtained, since it is assumed that the values to compare with are above k (kg/m²); but values may get close to zero, for example, for large observed extremes ob_max if fc_max is small.
As a summary, the set-up of Figure 3 demonstrates that the combined SLX score may become zero for L < d min for separated dry areas in combination with areas of maxima involving sufficiently different extremes. The width of the neighbourhood sometimes plays a critical role when trying to quantify at which scales the minima and maxima are predicted with reasonably high scores.

| Precipitation band displaced
Precipitation along meteorological fronts is a common feature in the atmosphere and is highly relevant in the context of verification. We may consider the idealized situation where forecasted accumulated precipitation versus analysed precipitation occurs in frontal structures separated by a number of grid points (D). This is illustrated in Figure 4. The idealized precipitation bands (observed and forecasted) are both 10 grid points across, with a gradual increase of precipitation in steps of 5 kg/m² and a maximum of 15 kg/m² accumulation in the middle. Each step is thus 2 grid points wide. The observed and forecasted precipitation are zero outside the precipitation bands. The main characteristics of this set-up, with respect to resulting values of scores, are the number of grid points separating the precipitation bands, the neighbourhood width chosen and the fraction of grid points in the precipitation bands compared with the total number of points in the area verified. Figure 5 shows the SLX score for two displacements between the precipitation bands, equal to 15 and 25 grid points, respectively. The common decrease of the scores below 0.50 at neighbourhood widths below 6 points occurs for the following reason: some of the forecasted minima (= 0) chosen during the verification process occur under the non-zero precipitation values of the analysed precipitation band. Computing SLX_fc_min related to these points requires a certain neighbourhood width > 0 to contain analysed point values with zero precipitation, needed to achieve a positive contribution to SLX_fc_min. A similar argument may be used regarding zero analysed precipitation points under the non-zero forecasted precipitation band. Increasing the areal fraction of the precipitation bands compared with the total domain size would decrease the score further by this argument.

FIGURE 3 Neighbourhood and scales: ob_min = fc_min = 0 in white areas. ob_max and fc_max represent areas with maxima observed and forecasted, respectively. d_min is the distance in grid points between areas with minima. d_max is the distance between areas with maxima. Grey areas outside the minima and maxima represent intermediate precipitation amounts. d_ob-min represents the smallest necessary neighbourhood width to make SLX_ob_min = 1; d_fc-min represents the smallest necessary neighbourhood width to make SLX_fc_min = 1. For details, see text.
When computing the scores of SLX_ob_max and SLX_fc_max involving points of 15 kg/m², the distance between the precipitation bands determines the scores, which become equal to 1 when the neighbourhood width reaches the distance between the precipitation bands, as indicated in Figure 5. The fact that forecasted and observed minima agree over large parts of the domain implies that the combined SLX score is close to 0.50 for neighbourhood sizes less than the separation distance between the precipitation bands. All scores will be 1 if the two precipitation bands are not separated; then all values agree on the grid scale. Figure 6 illustrates situations where the range of forecast values gets increasingly separated from the range of analysed values. In this situation, the definition of the score function implies that SLX will decrease.
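The displaced-band set-up can be reproduced schematically (grid size and band profile follow the description above; the helper name, band placement and column orientation are illustrative):

```python
import numpy as np

def band_field(n=200, start=80, width_step=2, step=5.0):
    """200 x 200 field with a vertical precipitation band 10 points
    across, stepping 5 -> 10 -> 15 -> 10 -> 5 kg/m^2 in 2-point strips,
    i.e. a maximum of 15 kg/m^2 in the middle and zero outside."""
    field = np.zeros((n, n))
    profile = [step, 2 * step, 3 * step, 2 * step, step]
    for s, value in enumerate(profile):
        c = start + width_step * s
        field[:, c:c + width_step] = value
    return field

analysis = band_field(start=80)
forecast = band_field(start=95)   # band displaced by D = 15 grid points
```

Feeding such pairs of fields to the SLX computation for a range of neighbourhood widths reproduces the qualitative behaviour of Figure 5: the maxima scores reach 1 once the width reaches the displacement D.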

| Field perturbations
Figures 7 and 8 illustrate this in a specific set-up of perturbed forecasts relative to an analysis field. Figure 7a,b illustrates observed and forecasted fields, respectively, with perturbed 10 × 10 squares of magnitude ±c = 5 kg/m², displaced differently between analysis and forecast. The 'background' values of the observed and forecasted fields are b and d, respectively. The results of SLX obtained with the forecast background value increasing from 5 to 40 are shown in Figure 8. When b = d = 5, the extreme areas of analysis and forecast are just displaced, and the score will increase for sufficiently large neighbourhoods (dashed blue line). However, as expected under these experimental conditions, the results show that the score decreases towards a low value at high forecast background values. This is consistent with Figure 6.
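The perturbed-field set-up can be sketched as follows (block positions and the helper name are illustrative; only the background level, block size and ±c perturbation follow the description above):

```python
import numpy as np

def perturbed_field(n=200, background=5.0, c=5.0,
                    pos_plus=(40, 40), pos_minus=(120, 120)):
    """Uniform background field with one 10 x 10 block raised by +c and
    one lowered by -c (schematic version of the Figure 7 set-up)."""
    field = np.full((n, n), background)
    i, j = pos_plus
    field[i:i + 10, j:j + 10] = background + c
    i, j = pos_minus
    field[i:i + 10, j:j + 10] = background - c
    return field

analysis = perturbed_field(background=5.0)
# Forecast: higher background and displaced perturbation blocks
forecast = perturbed_field(background=20.0,
                           pos_plus=(50, 50), pos_minus=(130, 130))
```

Raising the forecast background d while keeping c fixed drives the forecast value range away from the analysed range, which is the mechanism behind the decreasing scores in Figure 8.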
Next, the impact of perturbing the background field b by the same amount c is shown, regarding both analysis and forecast (Figure 9a,b). Figure 10 shows the results applicable to different background fields up to 40. It is seen that improved scores are obtained as the general background level is increased while the perturbation ±c is fixed. This result is a consequence of the score function, since the ratio between forecast and analysis gets closer to 1. In view of the limited distance between the squares with perturbed values, it is seen that an increased neighbourhood size leads to higher scores.

FIGURE 4 Schematic picture of experimental conditions for displaced frontal precipitation bands. The observed and forecasted precipitation bands are displaced by D grid points. Each varies in intensity between 5 and 15 kg/m².

FIGURE 5 Results of SLX valid for the precipitation bands of Figure 4. NTOL is the neighbourhood width. An impact of the size and displacement of the precipitation band is seen (for details, see text).

FIGURE 6 Impact of bias. Qualitative illustration of the properties of SLX for increasing bias. Separation between observed and forecasted ranges leads to a decreasing value of SLX.

| REAL-WORLD EXAMPLES
The purpose of the SLX score is the verification of operational NWP, with output on a daily basis. To demonstrate what can be expected in an operational context, results are presented from real-world weather situations. The first is a summer convection event where forecasts from different origin times are compared at the same valid time. At first sight, these different forecasts are quite accurate, but the SLX verification is able to distinguish the quality of the individual forecasts. For the second case, some forecasts in December 2020 are examined as part of a pre-operational set-up. The period studied includes consecutive days linked to changing weather conditions, as a high pressure ridge affecting Denmark is replaced by frontal rain. Finally, in order to simulate the characteristics of some longer time statistics, forecast and observation data for a 5-year period, valid for summer conditions, are used as input to simulate properties of the SLX scheme under different assumptions, for example, forecasts simulating operational conditions and forecasts based on estimated 'climatological' precipitation frequencies of five precipitation categories. Forecasts with bias are also examined, as well as forecasts based on a uniform random value applied to the entire grid.

| A case of summer convection
The case studied is from 27 August 2019, with convection developing during the day over Denmark in a convectively unstable warm air mass. The analysis system and operational forecast model are used at a grid size of 2.5 km (see Section 3). The results of the forecasts are shown in Figure 11a-d. These forecasts of accumulated precipitation are all valid for 3 h, from 9 UTC to 12 UTC on 27 August. The initial times of the individual forecasts are shown in the figures, representing forecast lengths of 3, 6, 9 and 12 h, respectively. The corresponding analysis is shown in Figure 11e. The black crosses indicate the location of the maximum accumulation values of the domain, excluding the dark area outside the frames. The individual SLX components are shown in Figure 12a-d. The combined score is shown in Figure 12e (see figure text). The best scores (SLX and local maximum components) are obtained for the 6-h forecast. This agrees well with a visual inspection of Figure 11a-d, showing that this forecast has a rather accurate position and magnitude of the precipitation maximum. Lower scores for the 3-h forecast are sometimes seen in operational forecasts and may be linked to model spin-up. This concept is mentioned by Ulmer and Balss (2016).

FIGURE 7 Background values of the observed field and forecasted field are b and d, respectively. Blocks of 10 × 10 grid points are present as perturbations to the background field, with values b − c, b + c, respectively, for the observed field, and d − c, d + c for the forecast, c = 5 (kg/m²). The position of the perturbed blocks differs between analysis and forecast.

FIGURE 8 Results of experiments in the framework of Figure 7a,b. As the background value d increases in subsequent forecasts while the perturbation c = 5 remains constant, the combined SLX score decreases towards SLX = 0. The background value of the analysis is not changed; 'SLX_bf' in the legend indicates the background value of the forecast. NTOL is neighbourhood width.
Significantly lower scores are seen for the earliest forecast, started at 00 UTC on 27 August, especially when considering forecasting of extremes on the grid scale. A quite general feature is that scores become higher at a neighbourhood size of 5-10 grid points. This interval represents the smallest scales that the model can resolve. Another feature is that scores tend to be higher for the minima than for the maxima. This could be explained by larger areas with minima, for example, zero-valued precipitation, occurring in both forecast and analysis.
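The neighbourhood dependence can be illustrated with a minimal sketch (in Python, although the paper's implementation is in R): for an observed extreme at a given grid point, the best-matching forecast value within a square neighbourhood is scored, and enlarging the neighbourhood can only improve this best match, consistent with scores rising at widths of 5-10 grid points. The function `neighbourhood_best` and the toy `exact` score are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def neighbourhood_best(forecast, i, j, ob_extreme, width, score_fn):
    """Highest score achieved by any forecast value inside a square
    neighbourhood of half-width `width` centred on grid point (i, j),
    compared against the observed extreme value `ob_extreme`.
    `score_fn(fc, ob)` is assumed to map into [0, 1]."""
    window = forecast[max(0, i - width):i + width + 1,
                      max(0, j - width):j + width + 1]
    return max(score_fn(v, ob_extreme) for v in window.ravel())

# Toy score: full credit only for an exact match (purely illustrative).
exact = lambda fc, ob: 1.0 if fc == ob else 0.0
```

Because the search window grows with the neighbourhood width, the returned best score is non-decreasing in `width`.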

| Moving weather systems
Changing weather related to moving weather systems across Europe plays an important role in a daily verification context. The SLX scores are illustrated for weather conditions occurring over Denmark from 24 December to 28 December 2020. On 24 December, a low pressure system and its associated precipitation affect southernmost Denmark. This system moves towards the east and is followed by rising pressure and a flow of dry air from the north, with only a few scattered areas of precipitation. 25 December is dominated by dry weather, but precipitation from a frontal system reaches Denmark on the following day. This is linked to a deep cyclone (central pressure < 960 hPa) developing north of Scotland. Precipitation covering most of Denmark develops during the afternoon of 26 December and continues over 27-28 December. Figure 13a illustrates precipitation over Denmark occurring early in the almost dry period. The 3-h accumulation is shown for 04-07 UTC on 24 December, representing precipitation early in the day. Later, only small areas of scattered precipitation occur. The different precipitation characteristics of this changing weather pattern are reflected in the SLX scores of Figure 14. The verified model area covers Denmark in a latitude band of 54.5°N-57.8°N and a longitude band of 7.5°E-12.7°E. The operational model HARMONIE-AROME (see Section 3) is tested in a pre-operational verification domain using a grid size of 750 m. For simplicity, the results of Figure 14 are for one neighbourhood size only (L = 6 points). The verification is split into nine equally sized sub-domains with separate SLX computations that are averaged. Figure 14a represents forecast verification of the mainly dry period, while Figure 14b applies to the period with frontal precipitation.
FIGURE 11 Forecasted accumulated precipitation (kg/m²) valid from 9 UTC to 12 UTC on 27 August 2019 over the light squared areas of dimensions 325 km × 325 km. Panels (a), (b), (c) and (d) apply to forecasts starting at 00 UTC, 03 UTC, 06 UTC and 09 UTC, respectively. Panel (e) shows the corresponding analysed field of precipitation from 9 to 12 UTC on 27 August 2019. The black crosses indicate the maxima.

All operational runs are verified up to a forecast range of 48 h. They are executed every 3 h. The verification is based on 3-h accumulated precipitation. Figure 14a shows the averaged verification of all individual forecasts executed on 24 December. The latest forecast, up to 48 h, ends at 21 UTC on 26 December. Similarly, Figure 14b shows the averaged verification of all 48-h forecasts executed on 26 December.
It turns out that the dry conditions are well forecasted. As a consequence, the scores related with minima are very close to 1.0 in Figure 14a. However, long forecasts starting late on 24 December do not predict well the observed precipitation increase on 26 December. This is clearly reflected in the scores of maxima, in particular the score of forecast maxima reaching values below 0.50.

FIGURE 12 SLX scores for the four different forecast ranges 03 h, 06 h, 09 h and 12 h, valid at 12 UTC on 27 August 2019. The scores SLX-obmax and SLX-fcmax, related with observed and forecasted maxima, are shown in (a) and (b), respectively. The scores SLX-obmin and SLX-fcmin, related with minima, are shown in (c) and (d), respectively. The combined score (SLX) is shown in (e). The neighbourhood width is represented by NTOL.
For the wet period, all scores are much more alike (Figure 14b). This may be due to the dominating stratiform precipitation, which is more uniformly distributed horizontally than convective precipitation. Therefore, the scores related with minima are similar to the scores of maxima. The typical levels of SLX are smaller than the corresponding ones of the earlier, drier period. This may be explained by the difficulties in forecasting horizontal variations of non-zero precipitation. Towards the end of the second period, in the evening of 28 December, the scores tend to increase. This is related to drier conditions starting to appear (not shown).

| Simulating operational conditions
It is a goal to carry out detailed tests of the SLX scheme using NWP data for Denmark. However, a multi-year high-quality observed precipitation analysis data set is not yet available. To simulate how the SLX scheme would operate in an operational context, a simple method has been designed and tested, using archived NWP data and rain gauge data. The NWP model is the HARMONIE-AROME model (Bengtsson et al., 2017) operated at a grid size of 2.5 km. The data used to generate synthetic tests are from a summer month (June) covering the 5 years from 2014 to 2018. The simulated verification results will only represent the average performance of the model over this period, since the model version has changed somewhat during the 5 years. The forecasts and observations consist of data partitioned into five precipitation intervals (categories) and are provided as a contingency table; see Table 1. These data are valid for 12-h periods. The observations are based on 45 quality-controlled Danish stations. The forecast data apply to 6-18-h forecasts. The data of Table 1, comprising 5 × 5 = 25 precipitation classes, represent 6545 data values. The class of dry or almost dry conditions (≤0.2 kg/m²) dominates (4514 values).
We assume that the observation data can be used to approximate observed precipitation in a grid (2.5 × 2.5 km) to be compared with the NWP model data produced and stored at this resolution. A random number generator in R is used to select an arbitrary integer between 1 and the total number of 6545. The selected number is converted into one of the 25 precipitation classes, utilizing that each class is associated with an interval directly proportional to its relative frequency out of the 6545 values. Having chosen one of the 25 classes, specific values of forecast and observation are selected. The precipitation interval linked to the precipitation class is divided into 100 equidistant sub-intervals of equal probability. A random number between 1 and 100 is then used to choose a specific sub-interval and its corresponding precipitation value. Separate random numbers are drawn for observation and forecast to fix the precise values used within the precipitation class. The forecasted and observed precipitation fields of the model domain are simulated based on this. The upper bound of the highest class is set to 40 kg/m². This value is estimated from regional extreme precipitation analysis over Denmark (Madsen et al., 2017), indicating that 12-h accumulations of 40 kg/m² on a regional scale occur on a time scale of ~5 years.

FIGURE 13 Examples of radar-based analyses of precipitation, illustrating precipitation over Denmark during 24-26 December 2020. Two accumulation periods are shown: accumulation from 04 to 07 UTC on 24 December 2020 (a), representing precipitation early on 24 December, and accumulation from 16 to 19 UTC on 26 December 2020 (b). The latter represents frontal type of precipitation over Denmark starting on 26 December.
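The two-stage sampling described above (pick a class by relative frequency, then pick a value within the class from 100 equiprobable sub-intervals) can be sketched as follows. This is a Python illustration of the procedure (the paper uses R); the class bounds in `BOUNDS` are hypothetical placeholders, as the exact intervals of Table 1 are not reproduced here.

```python
import random

# Hypothetical (lower, upper) bounds in kg/m^2 for the five precipitation
# categories; the paper's actual bounds are those of its Table 1.
BOUNDS = [(0.0, 0.2), (0.2, 1.0), (1.0, 5.0), (5.0, 15.0), (15.0, 40.0)]

def sample_pair(counts, total, rng=random):
    """counts: 5x5 contingency-table counts (obs class x fc class).
    Returns one (observed, forecast) precipitation pair."""
    # Stage 1: pick one of the 25 classes proportionally to its frequency.
    n = rng.randint(1, total)
    cum = 0
    for i in range(5):
        for j in range(5):
            cum += counts[i][j]
            if n <= cum:
                return _value(BOUNDS[i], rng), _value(BOUNDS[j], rng)
    raise ValueError("total does not match the sum of counts")

def _value(bounds, rng):
    # Stage 2: split the class interval into 100 equidistant sub-intervals,
    # pick one with equal probability, and take its midpoint.
    lo, hi = bounds
    k = rng.randint(1, 100)
    step = (hi - lo) / 100.0
    return lo + (k - 0.5) * step
```

Separate random draws for the observed and forecasted value reproduce the independence within a class described in the text.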
However, it is not realistic to execute such a procedure independently for all grid points, since it would be almost impossible to generate the quasi-constant large-scale precipitation accumulations that occur in reality for the domain sizes considered. Therefore, an approximation is made by selecting a scale to compose the total field, choosing by random selection among squares of different sizes, for example, 5 × 5 squares of grid points or 10 × 10 squares, and so forth, up to a maximum size representing the entire domain. Constant values of forecast and analysis, respectively, are used inside each square. In this way, a spectrum of possible sizes is considered, from the smallest resolvable scale (approximately 5 × 5 grid points) up to a scale representing the entire domain. As a result, the relative occurrence of a precipitation class as well as different horizontal scales is taken into account. Figure 15 presents the results generated with this approach. In this case, a small model domain of 100 × 100 grid points has been chosen, with horizontal grid boxes of 5 × 5, 10 × 10, 20 × 20, 50 × 50 or 100 × 100 points, selected as random options. A total of 720 grids of model and observed states, respectively, have been generated for an experiment. This would, for example, correspond to verification of each hour over a month (24 × 30 = 720) in an imagined operational situation. Figure 15 shows a comparison between SLX scores of different simulations as a function of neighbourhood width. Simulated operational forecast data are represented by 'fc-stat', based on the 25 classes of Table 1. This is compared with the forecast verification 'fc-clim' based on 'climate', which is not a constant value but is based on the observed 'climatic' frequency of the individual precipitation categories. A modified contingency table has been constructed for this, using a forecast probability of each category based on the observed relative frequency of that category derived from the 5-year period.
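The composition of a synthetic field from randomly sized constant squares can be sketched as below. This Python sketch assumes one plausible reading of the text, namely that a single square size is drawn per realization and the whole grid is tiled with squares of that size; the `sampler` argument stands in for a draw from the contingency-table statistics and is a hypothetical interface.

```python
import random
import numpy as np

def synthetic_field(n=100, sizes=(5, 10, 20, 50, 100), sampler=None, rng=random):
    """Compose an n x n field by selecting one square size at random and
    filling each square with a single constant value drawn by `sampler`."""
    if sampler is None:
        sampler = lambda: rng.uniform(0.0, 40.0)   # flat draw from [0, 40]
    s = rng.choice(sizes)                          # one tile size per field
    field = np.empty((n, n))
    for i in range(0, n, s):
        for j in range(0, n, s):
            field[i:i + s, j:j + s] = sampler()
    return field
```

Generating one such field for the forecast and one for the observation, 720 times, reproduces the experiment size described in the text.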
Also, the impact of adding a constant value (0.5 kg/m²) to a standard forecast has been studied ('fc-bias-min'). Finally, a uniform random value forecast is verified ('fc-unif-random'). The field is generated by assuming equally likely values in the interval [0, 40] kg/m².
As expected, Figure 15 shows that the forecast model clearly produces superior scores compared with the alternatives. 'fc-clim' produces a reasonably high forecast score; it benefits from the high statistical weight of the smallest, quasi-dry precipitation class. The combined SLX score studied here is significantly affected by adding a constant bias of 0.5 kg/m² to all forecasts. The current choice of score function penalizes biases when dry conditions are observed. The constant-bias score is much lower than that of the 'climatic' forecast: since all observed dry conditions are over-forecasted by at least 0.5 kg/m², these cases contribute zero values to the total score. This result is reasonable if one accepts the definition of the score function for dry observed conditions, which significantly penalizes small precipitation amounts. It emphasizes that a full understanding of the SLX verification requires that attention is paid to the individual components related with minima and maxima. However, the averaged score provides a general overview of the quality. This subject is discussed further in Section 6. As expected, 'fc-unif-random' produces a very poor score. It is worth noting that this type of forecast is associated with a large bias.

| DISCUSSION AND FUTURE PROSPECTS
In this section, several aspects of the SLX verification are addressed: the definition of an adequate score function, the impact of defining tolerance for selecting extreme points, use of the SLX scheme operationally, and finally some future development prospects.

| Score function
Verifying precipitation using a score function is challenging for several reasons. First, the fact that precipitation is bounded below by zero, while no limit applies upward, creates a special situation that is not favourable for defining a symmetric score function. For another parameter, such as temperature, it would be a straightforward possibility to choose a symmetric score function centred around the observation. The score function used in the present study is very simple, with only three linear parts. A natural goal for an improved formulation could be a more advanced function allowing flexibility, with several parameter settings in accordance with demands from users who might want a dependency on particular thresholds. Symmetry around the observation in some interval would be a natural choice for non-dry conditions.

FIGURE 15 SLX statistical simulations related with operational conditions of June 2014-2018. 'fc-stat' represents SLX of operational data at different neighbourhood widths (NTOL). 'fc-clim' corresponds to a run with forecasts based on the observed statistics for each category during the period. 'fc-bias-min' corresponds to the score obtained if a bias of 0.5 kg/m² is added to all forecasts of the original simulations. 'fc-unif-random' corresponds to a random forecast [0, 40 kg/m²] applied uniformly over the grid.
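A flexible, user-parameterized variant of a three-part linear score function might be sketched as below. This Python sketch is a hypothetical stand-in: the paper's actual function is defined in its Figure 2, and the parameters `dry_tol`, `lo` and `hi` are illustrative assumptions showing how thresholds could be exposed to users.

```python
def make_score_fn(dry_tol=0.5, lo=0.5, hi=2.0):
    """Build a hypothetical three-part piecewise-linear score in [0, 1]:
      - observed dry (ob == 0): score falls linearly from 1 to 0 as the
        forecast grows from 0 to dry_tol (kg/m^2);
      - under-forecast (fc/ob <= 1): score rises linearly from 0 at
        ratio lo to 1 at ratio 1;
      - over-forecast: score falls linearly from 1 at ratio 1 to 0 at
        ratio hi."""
    def score(fc, ob):
        if ob == 0.0:
            return max(0.0, 1.0 - fc / dry_tol)
        r = fc / ob
        if r <= 1.0:
            return max(0.0, (r - lo) / (1.0 - lo))
        return max(0.0, (hi - r) / (hi - 1.0))
    return score
```

The `dry_tol` branch mimics the strong penalty for small over-forecasts in observed dry conditions discussed in Section 5; adjusting `lo` and `hi` would implement the threshold-dependent flexibility proposed above.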

| Tolerance for selecting extreme points
Choosing tolerance δ = 0 kg/m² for defining an analysed extreme means that the uncertainty is not taken into account. However, this uncertainty δ_ob (kg/m²) might be included in some way. It implies that points deviating by up to δ_ob from the analysed extreme may be selected for the score computation.
One may try to consider typical observation uncertainty related with rain gauges. According to Yang et al. (1998), such uncertainty varies substantially, for example, depending on meteorological conditions, and may amount to at least 20%. In the case of dry weather, the observation uncertainty is in practice zero. The currently used analysis should be considered an advanced combination of radar data and data from rain gauges, with quality control. Hence, the resulting uncertainty of the analysis field is difficult to estimate and is beyond the scope of the present paper.
Regarding forecasts, a concept of non-zero tolerance δ_fc (kg/m²) implies that more extreme forecast points tend to be selected, forming coherent groups. For maxima, this means that points within an interval δ_fc below the extreme maximum value will be selected as points to be verified. For minima, larger values deviating by up to δ_fc will be used. This will impact the scores SLX_fc_max and SLX_fc_min. This might be desirable, but it implies that a strict definition of an extreme value is relaxed.
Specifying both δ_ob and δ_fc as small fractions, for example, 5%-10% of the observed and forecasted extreme value, respectively, might be an attractive way of allowing non-zero tolerance of extremes. It tends to keep the sensitivity of the score computations at a low level, since only small fractions are involved. By contrast, a high sensitivity related with minima could result if a constant and relatively high tolerance δ were chosen. For maxima, a tolerance of, for example, 10% of a large value increases the likelihood that coherent groups of points would be realized.
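Selecting extreme points within a relative tolerance can be sketched as follows (a Python illustration; the function name and interface are assumptions, not the paper's implementation). Note that for minima the relative tolerance vanishes when the minimum is exactly zero, matching the remark that observation uncertainty is in practice zero in dry weather.

```python
import numpy as np

def extreme_points(field, kind="max", frac=0.10):
    """Grid points lying within a relative tolerance of the field extreme.
    `frac` is the tolerance as a fraction of the extreme value itself
    (e.g. 0.05-0.10 as suggested in the text). Returns (i, j) indices."""
    if kind == "max":
        ext = field.max()
        mask = field >= ext - frac * ext
    else:
        ext = field.min()
        mask = field <= ext + frac * abs(ext)  # zero tolerance when ext == 0
    return np.argwhere(mask)
```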

| SLX in an operational context
Prediction of accurate precipitation extremes is frequently requested by, for example, duty forecasters in an operational context. The SLX scores contribute in this respect with complementary information not available from, for example, FSS and SAL verification schemes mentioned in the introduction.
Forecasters normally represent the last step in a forecast chain. It is sometimes verified whether the forecaster improves the NWP forecasts in this last step, when measuring quality by a verification score. It is then relevant to ask whether the SLX score(s) are easy to improve by a forecaster, for example, by 'hedging', which implies some sort of degradation of the original forecast. Considering the combined score, it will be difficult to hedge for the following reasons. First, regarding minima, a forecast bias tends to reduce the scores, as shown in Section 5.3, which investigates the impact of bias in the case of small precipitation amounts. Therefore, forecasters need to avoid biases of small precipitation amounts as much as possible.
Second, regarding maxima, it is relevant to mention another feature of hedging. According to Murphy (1978, 1993), hedged forecast fields are smooth and lack sharpness. Hence, a horizontal smoothing of the predicted precipitation field is an act of hedging. This would be attractive to forecasters in order to avoid large errors near the observed maxima, for example, measured in an absolute sense by mean absolute error or SLX. However, a horizontal smoothing of high forecast values might lead to incorrect non-zero forecasts related with minima, for example, overlapping the areas with observed minima equal to zero. As a consequence, the associated score SLX_ob_min would be reduced. In general, achieving high scores related with maxima is challenging, as this requires predicting both location and amount with sufficient accuracy. Introducing a significant positive bias tends to give a lower score, as shown in the field perturbation experiment of Section 4.
When operating SLX on a daily basis, one may produce not only output of the score components but also the associated high and low values of the extremes. This provides an opportunity for computing verification results as a function of specific thresholds, for example, observed or forecasted maxima above high thresholds. In this way, SLX may be used for studying model behaviour for high extreme precipitation values selectively, as a post-processing step. Other options for post-processing are relevant, for example, to compute SLX and its separate components as distributions, showing the number of score values occurring in separate equidistant intervals between 0 and 1. When accumulating results from daily runs, statistics may be carried out for longer periods, for example, months or years.
For large model domains, division into sub-domains is recommended. This could imply that large limited-area NWP models with dimensions of, say, 1000 or 2000 grid points in each direction would be divided into, for example, ~100 sub-domains. Output from sub-domains could be averaged or stored separately. The latter provides an opportunity to diagnose problematic model output in specific sub-domains.
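The sub-domain division could be sketched as follows (Python for illustration; the implementation is in R, and `slx_fn` is a hypothetical placeholder for a combined SLX computation on one sub-domain).

```python
import numpy as np

def subdomain_scores(obs, fc, n_sub, slx_fn):
    """Split a square domain into n_sub x n_sub equally sized sub-domains,
    apply slx_fn(obs_block, fc_block) to each, and return both the grid of
    sub-domain scores (for diagnosing specific areas) and their average."""
    s = obs.shape[0] // n_sub
    grid = np.empty((n_sub, n_sub))
    for a in range(n_sub):
        for b in range(n_sub):
            block = np.s_[a * s:(a + 1) * s, b * s:(b + 1) * s]
            grid[a, b] = slx_fn(obs[block], fc[block])
    return grid, float(grid.mean())
```

Returning the full grid rather than only the mean is what makes it possible to pinpoint the problematic sub-domains mentioned above.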
The examples of Sections 4 and 5 illustrate the important role of neighbourhood width for computations inside a given model domain or sub-domain, showing at which scales various local extremes are captured. It is recommended to compute results for a number of neighbourhood widths from grid scale to a maximum width determined by a prioritized scale for predicting extremes, for example, 50 km.
In operational verification, it is often relevant to investigate how different models perform, for example, as a function of model resolution. From the users' point of view, the skill of a forecast should be considered on an absolute spatial scale, for example, squares of 50 × 50 km. The number of grid points at this scale will then depend on the model resolution. The high-resolution model will have more grid points in such squares and should be able to benefit from that in a comparison between the two models. Alternatively, the coarse-mesh model could be interpolated to the fine-mesh grid. It is assumed that the analysed precipitation is applied at the resolution of the high-resolution model; otherwise, possible details predicted at high resolution would not be rewarded.
The SLX concept can, in principle, be applied to a global model system. The maxima and minima could then be computed for any selection of neighbourhood sizes without considering the influence of lateral boundaries. A challenge would be to produce a sufficiently accurate precipitation analysis.
The current implementation is coded in R (https://www.R-project.org/). For modest domain sizes, with dimensions of a few hundred grid points in each direction, the execution time on, for example, a single CPU (GNU 4.8.4 compiler, Intel Core i5 processor at 1.70 GHz) is within a few minutes. Code optimizations may be useful in the case of large model domains, which are not tested in this investigation. In the case of many extreme points, the current code will run faster if the computations are based on only a subset of the extreme points. However, such a subset should be sufficiently large in order not to affect the result significantly.

| Future development prospects
The potential of the scheme to be used in the context of ensemble prediction systems of limited-area models, for example, the study by Frogner et al. (2019), is obvious. One may, for example, compute SLX in the context of percentiles as an alternative to the median. In the future, the convection-resolving model and the precipitation analysis may be run at sub-km grid sizes, for example, down to 500 m. The general strategy of the international collaboration developing this model system (https://hirlam.org) is to develop ensemble systems down to sub-km grid size with frequent updates and short forecast lengths, for example, shorter than 24 h. This would mean using SLX at short scales in space and time, with precipitation accumulation periods down to 1 h or less.
If model developers are to compare experiments for the same atmospheric case, it will be important to agree on common settings for the comparisons. First of all, the score function (Figure 2) should be the same. In addition, the size and location of the model domain and the accumulation periods should be agreed on.
The general concept of verifying local maxima and minima implies that the scheme could also be adapted to other variables than precipitation. For each parameter to be considered, an appropriate score function needs to be defined, and an appropriate analysis needs to be available for comparison with the forecast.

| CONCLUSIONS
The evolution of high-resolution numerical weather prediction (NWP) in recent decades has led to much more detailed forecasts in space and time. However, there is a need to verify these more detailed forecasts, both in order to improve models further and in order to communicate forecast skill to the NWP users. Precipitation is a challenging parameter to verify due to its very fine-scale structures. A new generation of high-resolution field verification of precipitation has been developed to meet these requirements. This verification depends on a detailed precipitation analysis. Based on experiences with the precipitation analysis developed at the Danish Meteorological Institute (DMI), it is concluded that quality control of observational data is a very important part of the system in order to ensure quality. This paper documents an example of a novel field verification scheme: the Structure of Local EXtremes (SLX) scheme. It includes verification of both local maxima and minima of observed and forecasted precipitation and has properties of both features-based and neighbourhood-based verification schemes. The properties of the scheme have been illustrated for both idealized precipitation cases and real-world forecast examples. The results indicate that the scheme is potentially very useful for investigating model weaknesses in individual cases and, more generally, for routine verification.
The scheme provides flexibility, since the verification output can be produced for separate sub-domains. Moreover, linked with the four separate scores of the SLX scheme, a variety of diagnostic output can be presented for different aspects of forecast quality. Statistics for high extreme thresholds may be generated as part of a post-processing step. The scheme will produce results for any analysis and forecast field, including zero-valued fields. The fact that the scheme verifies local extremes fits well with the increasing focus on ensemble prediction systems in the NWP community. The score function can be adapted to special needs of the user community. SLX is a good complement to other spatial verification schemes such as FSS and SAL.