Evaluation of three mainstream numerical weather prediction models with observations from meteorological mast IJmuiden at the North Sea

Numerical weather prediction models play an important role in the field of wind energy, for example, in power forecasting, resource assessment, wind farm (wake) simulations, and load assessment. Continuous evaluation of their performance is crucial for successful operations and further understanding of meteorology for wind energy purposes. However, extensive offshore observations are rarely available. In this paper, we use unique met mast and Lidar observations up to 315 m from met mast ‘‘IJmuiden,’’ located in the North Sea 85 km off the Dutch coast, to evaluate the representation of wind and other relevant variables in three mainstream meteorological models: ECMWF-IFS, HARMONIE-AROME, and WRF-ARW, for a wide range of weather conditions. Overall performance for hub-height wind speed is found to be comparable between the models, with a systematic wind speed bias <0.5 m/s and random wind speed errors (centered RMSE) <2 m/s. However, the model performance differs considerably between cases, with better performance for strong wind regimes and well-mixed wind and potential temperature profiles. Conditions characterized by moderate wind speeds combined with stable stratification, which typically produce substantial wind shear and power fluctuations, lead to the largest misrepresentations in all models.

For a successful downscaling procedure and other applications, it is first and foremost important that all models are thoroughly validated. Since model performance varies with location (orography and land use), atmospheric conditions, and quantity of interest, and the models themselves are continuously improved, evaluation is an ongoing activity. For example, a downscaling product for resource assessment from 2012 was based on a 45-km grid, 18 whereas the New European Wind Atlas 13 will be based on a 3-km grid. The growing interest in offshore wind farm development also calls for model evaluation that focuses on offshore conditions, which have received considerably less attention than onshore conditions in the past due to a lack of observations.
In this study, we use a unique dataset that was obtained over a 4-year period between 2012 and 2015 to evaluate the performance of three mainstream numerical weather prediction models. ''Met Mast IJmuiden'' (MMIJ) was installed in the North Sea between the Netherlands and the United Kingdom ( Figure 1). The exceptional distance to shore (85 km), length (4 years) and observation height (90 m mast, remote sensing up to 315 m) make the dataset well suited for an evaluation tailored to wind energy purposes at the Southern North Sea. The models included in this study are as follows: 1. The integrated forecasting system (IFS), the global operational weather model of the European Centre for Medium-range Weather Forecasts (ECMWF).
2. HARMONIE-AROME, a short-range regional weather forecasting model developed and used operationally by a consortium of European weather institutes.

The Weather Research and Forecasting (WRF) model, a community model maintained by the US University Corporation and National
Center for Atmospheric Research (UCAR/NCAR).
The results presented in this paper address the following research question: ''What is the typical model performance concerning (the vertical structure of) wind, temperature, and humidity and to what extent does this performance depend on the model itself and on atmospheric conditions?'' A wide variety of cases is analyzed, each characterized by a unique combination of wind speed, direction and large-scale atmospheric stability. This characterization provides an interesting and novel perspective on the results.
More details on the models and observations are given in sections 2.1 and 2.2. Subsequently, we explain the case-selection procedure and the statistics used (sections 2.3 and 2.4). The results are presented and discussed in Section 3. The paper concludes with a summary and conclusions in Section 4.

Description of the models
We use three mainstream numerical weather prediction models (IFS, HARMONIE, and WRF) that are all widely used for wind energy purposes but have a different focus.
The IFS is a global model that provides forecasts of up to 15 22 which gradually deactivates as grid spacing is refined. This scheme was found to perform well for resolutions up to 3 km. 23 Further details of the WRF configuration can be found in Appendix A.
It is clear that the model setup for each model is different. The WRF simulations were performed in-house, with a setup that is commonly used for case studies. The IFS and HARMONIE data were obtained from the operational archives of ECMWF and KNMI (the Dutch national weather service), respectively, and as such represent the operational model setup at the time of the MMIJ observations. Consequently, we evaluate the performance of each model in its typical setup, as if it were to be used straight away. However, to bridge the gap between the horizontal grid spacing, we performed an additional set of WRF simulations at a resolution (and vertical grid spacing) comparable with the IFS. These simulations will be referred to as ''WRF-coarse.'' Additional sensitivity runs with WRF-coarse were performed concerning the strength of the turbulent mixing in stable conditions, the impact of prognostic sea-surface temperature (SST), and the impact of momentum entrainment at the boundary-layer top.

Observations and model data
The IJmuiden dataset is described in detail in Werkhoven and Verhoef 24 and is previously analyzed in Kalverla et al 25  Additionally, pressure, temperature, and humidity were observed at 21 m (platform) and 90 m (tower top). All data are freely available from http://www.meteomastijmuiden.nl/data/. For more details, we refer to Kalverla et al. 25 In a preliminary study, we found that the model performance in terms of wind speed and direction at this location is very similar to other locations at the North Sea (also indicated in Figure 1). However, the observation height in combination with the distance to shore and the high quality of the MMIJ data is quite unique, and therefore, we focus on MMIJ in the presentation of the results.
The observations are available as 10-minute averages, whereas the model data from IFS and HARMONIE were available as hourly instantaneous model fields. WRF data for the MMIJ location was available at each time step, but for our consistency in the comparison, we used hourly instantaneous fields as well. The model data were aligned with the corresponding observational data (ie, the 10-min average closest to the full hour). This approach was validated using the WRF dataset. By using the high-frequency output and trying various combinations of averaging and interpolation, it was found that this approach leads to the smallest ''artificial'' model error statistics. Note that the instantaneous model values can be interpreted as spatial analogs of the corresponding time-averaged observations, although the exact analogy is dependent on the mean wind speed (eg, at a mean wind speed of 10 m/s, a grid cell of 3 km would represent the average wind over an interval of 300 s).
Model data from the nearest neighboring grid cells were linearly interpolated between the original model levels to match the MMIJ observation levels. All models were initialized at 12 UTC the day before the selected case date and forecast time series from 12 to 36 hours ahead were used for validation. This is a relevant time scale for power forecasts (eg, for power trading) and moreover, for this relatively short forecast horizon the effect of nonlinearities in the model equations is still small. As such, varying model performance will mainly be due to the (in)ability of the models to accurately capture the various atmospheric conditions.

Case selection: the uvs · t 2 method
The WRF simulations are run in house on a rather large domain, which limits the number of cases to be analyzed. Since we expect that the model performance will be different for different weather situations, we use a clustering algorithm to objectively select a wide variety of weather types. From previous experience, 25 we know that wind speed, direction, and atmospheric stability are useful parameters to distinguish between *https://www.ecmwf.int/en/forecasts/documentation-and-support/changes-ecmwf-model/ifs-documentation various wind regimes. However, these parameters can vary substantially in time or space within one case. In an intermittent turbulence regime, for example, there is no single characteristic Obukhov length to characterize the stability of the flow. This is because the Obukhov length is a very sensitive, internal parameter, based on local fluxes or gradients that are the result of (rather than the precondition for) the manifestation of boundary-layer processes.
To avoid this issue, the case selection is based on larger scale, external variables instead. The idea is that the observed conditions (eg, intermittent turbulence) are the manifestation of the external forcing: an airmass driven over a water surface by an ambient pressure field. It follows that for similar characteristics of the airmass, sea state, and pressure field, we expect similar flow regimes. This rationale is common in meteorological (downscaling) studies. 28 However, existing classifications are not optimized for the current application. First of all, the correlation with local observations is not always clear. For example, the Großwetterlagen classification 29 is not suitable because there can be many wind directions over the North Sea within one weather type. Moreover, existing classifications do not usually account for stability, while we know that stratification can make a large difference for the manifestation of the wind field. Therefore, we introduce a new clustering algorithm that better suits our needs.
We find that a suitable balance between large-scale weather type and the local flow is achieved if wind speed and direction are characterized in terms of the zonal and meridional geostrophic wind components u and v. Following Jones et al, 30 we use the south-north pressure difference over 10 • centered over the area of interest to calculate the west-east (u) wind component and vice versa for v. To account for large-scale density stratification s, we use the difference between the potential temperature at the 850-hPa level and the SST. Here, the 850-hPa potential temperature comprises information about the air mass, while the SST characterizes the sea state. The difference between the two was found to be a more robust clustering parameter than each of them separately (based on the criteria presented in section 3.1).
Daily averages of large-scale u, v, and s are derived from ERA-interim 31 mean sea-level pressure, SST, and 850-hPa potential temperature fields in the 2012 to 2015 MMIJ observation period. To improve the characterization of the airmass, the history of the flow is also relevant. 28 Therefore, we additionally include the previous day to obtain a six-parameter dataset (hence the subscript in uvs · t 2 ). We use a hierarchical clustering algorithm that iteratively (1) splits the largest cluster (judged by average cluster radius) along a plane perpendicular to its principle component, (2) computes new cluster centers, and (3) reassigns all points to the closest cluster center. 32 Once the algorithm has identified 30 clusters, we select the cases that are closest to the cluster centers. The number 30 is similar to other weather classification such as the Lamb Weather Types 30 and Großwetterlagen 29 and is aimed at selecting a wide range of characteristic weather conditions rather than exactly representing the climatology. Nevertheless, we perform several checks to verify the representativeness of our cases, which will be further discussed in sections 3.1. For HARMONIE, model data for six of these cases were unavailable. Therefore, the second closest (to the cluster center) dates from these clusters were used instead (see Appendix B).

Statistical analysis: error diagrams
The difference between model predictions p i and observations o i is defined as e i = p i − o i for each record i. A visualization in the form of a histogram or boxplot ( Figure 2A) immediately provides insight into the nature of the errors. If the errors follow a normal distribution (which is a reasonable first approximation for our data), the complete set can be characterized by a mean error or bias, , and a standard deviation of the error, . It can be shown that these two fundamental properties are directly related to a third commonly used error statistic, the root mean square error (RMSE) as 33 We can use this property to concisely visualize model performance for many cases in a single diagram by plotting systematic errors and random errors on the x-and y-axes, respectively ( Figure 2B). Interestingly, the distance to the origin in such diagrams represents the root mean square error, by virtue of Pythogoras' theorem. For convenience, we will refer to this type of figure as error diagrams in the remainder of this paper.

RESULTS AND DISCUSSION
In this section, the results are presented and discussed, including cross-comparison with existing studies. The results will be presented in the following order. Section 3.1 deals with the case selection. The overall results of the model validation are presented in section 3.2. Special attention is paid to the role of resolution in section 3.3 and to the representation of vertical gradients in section 3.4. Finally, the results are related to large-scale characteristics in section 3.5.

Case selection
We start with an inspection of the selected cases. A summary of each case, comprising case date; large-scale wind speed, direction, and stability; and number of cases in the corresponding cluster, is available in Appendix B. Figure 3 shows all 30 cases on a v versus u diagram. We find that the strongest winds are southwesterly, which corresponds to the typical climatology of the North Sea. Observed static stability at MMIJ varies within each case, but the relative occurrence of stable stratification is consistently larger for cases with a stronger large-scale density stratification The latter involves a random initialization step allowing us to estimate to some degree the uncertainty in the estimate. The figure demonstrates that this uncertainty is relatively small for most standard deviations but can be relatively large with respect to the bias since the bias is often very small. For wind speed and potential temperature, the uncertainty has more or less converged at 30 cases, while for mixing ratio, our partitioning algorithm deviates from the k-means uncertainty bands. With respect to the representation of vertical gradients, the uncertainty is larger, and especially, the standard deviation converges slowly.
Summarizing, we have sampled a variety of cases that allows us to investigate model performance under a wide range of forcing conditions.
Care should be taken to generalize these results quantitatively, because the uncertainty due to the limited amount of cases cannot always be neglected. However, the quick convergence of the error statistics is promising and suggests that this method could be extended to downscaling climatologies as well. Perhaps our choice of large-scale variables could be combined with the subset selection strategy of Rife et al, 34 who demonstrate that it is possible to obtain a fully representative subset of a climatology using 180 days. Figure 5 shows the error distribution of wind speed, wind direction, potential temperature, and mixing ratio at different altitudes for each model.

Overall performance
Means and standard deviations of these distributions are listed in Table 1. These results are weighed for the cluster size (see Appendix B). Note that the boxplots are quite symmetric, which supports the assumption that the errors approximately follow a normal distribution.    In all models, the wind direction is veered (turned clockwise) with respect to the observations by up to 10 • on average (Table 1). Performance is better near the surface than at 315 m. At 27 m, HARMONIE and WRF represent the wind direction better than the IFS, but at the other altitudes, performance is similar. The misrepresentation of surface winds has previously been attributed to excessive mixing under stable conditions, and the recent modifications to the IFS are expected to improve this deficiency. 19 The ECMWF website reports a decrease in the systematic wind direction error near the surface over land since then (<5 • after 60 hr). Also, in our results, there seems to be an improvement, but we cannot conclude that this is due to the modifications, because the frequency and intensity of stable stratification are generally lower for the cases after November 2013.
The IFS shows a warm bias of roughly 1 K and also a large spread ( ≈ 1 K). Oppositely, WRF and HARMONIE underestimate the temperature

Role of model resolution
Since we include model results from both the relatively coarse grid of the IFS and the refined grids of HARMONIE and WRF, the questions arise whether differences in results can be attributed to differences in model resolution and whether a higher resolution contributes to an improved forecast. Theoretically, we would expect that the higher resolution models are able to incorporate local features that are not resolved by the IFS. Also, a coarser resolution would effectively smooth the time series at a given location, as it can only represent the grid-cell average. Indeed, Table 1 shows that the IFS has the lowest standard deviation of the error for most variables, which is consistent with a smoother signal.
However, there are many other differences between the three models, so in order to isolate the effect of increased resolution, we performed additional WRF simulations on a resolution of 16 km, comparable with the IFS. Some of these additional results are included in the right subplots of Figure 5 and in Table 1. It turns out that the change in resolution has only a small impact on these results. For most variables, the spread of the errors is larger for the coarser resolution runs, instead of smaller.
Thus, the difference in model resolution does not seem to be responsible for differences between model results in this study and increased resolution does not necessarily lead to better results.

Boundary-layer structure
In the previous sections, we evaluated the model performance at individual altitudes. For wind energy purposes it is especially important that the models adequately reproduce the vertical boundary-layer structure as well. Misrepresentation of wind shear or density stratification can have adverse effect on turbine loads 3 and power yield. 8 Besides, the gradients of wind and temperature are intimately linked through the process of turbulent mixing. Table 1 also lists statistics of vertical gradients, but in this case, scatter plots are helpful to further understand the model performance ( Figure 6).
We first discuss the temperature stratification, which was calculated as the difference in virtual potential temperature between the top of the mast (90 m) and the platform (21 m). While neutral conditions represent most of the data, most scatter occurs for stable conditions. In the IFS results, we observe two different types of behavior: in one branch, the strong stable stratification is well represented, whereas in the other branch, the stratification is underestimated ( Figure 6A).  19 In our results, we do not see a clear difference in the representation of vertical gradients for the cases before and after November 2013. However, the frequency and strength of stable stratification were considerably lower in the cases after this update, so we cannot conclude whether the representation of (very) stable conditions improved or not based on our results.
With respect to the other modification, the addition of vertical levels in June 2013, there is one case with very stable stratification after this date (but before November 2013), which is reasonably well represented. Obviously, this is not enough to be conclusive, but it supports the hypothesis that the diverging model behavior, which we noticed in the scatter plot of temperature stratification, is related to modifications to the model.

A case against the average
Up to this point, we have focused on overall error distributions and obtained an impression of average model performance. However, if we focus on individual cases, we find substantial variability in model performance. To gain insight in the variability between individual cases, we will inspect the error diagrams of 115 m (≈ hub height) wind and 90-m potential temperature and humidity (Figure 7). Recall that the absolute distance to the origin represents the overall RMSE.
For hub-height wind speed ( Figure 7A), most cases cluster in an area with a mean bias between −1.5 and +1 m/s and of about 1.5 m/s. Again, we find a tendency to underestimate wind speeds on average. Exceptionally poor performance is found for cases 5 (WRF); 6 (IFS); 2, 9, and 21 (HARMONIE); 24 (WRF and HARMONIE); and 18 (all models). These are almost exclusively spring and summer cases, in which the sea water is relatively cold and stable conditions occur frequently. For wind direction ( Figure 7C), two third of the cases are clustered within 15 • on both axes, without obvious differences between models. Cases 5 and 18, which are among the worst represented cases, are both characterized by weak-wind regimes, for which wind direction is usually less well defined. Nevertheless, the positive direction bias (modeled wind direction is veered w.r.t. the observations) is clearly present for nearly all individual cases.
For temperature ( Figure 7B), we find a remarkable difference in performance between the three models. The cold bias of about 0.5 K in WRF and HARMONIE is clearly evident for all cases, and very well defined. In contrast, the IFS bias is very ''case-sensitive'': It varies between −1 and +3 K. Cases 5 and 18, which we already highlighted for poorly representing wind speed, are also outliers in the temperature plot. These cases share a characteristic large-scale stable stratification ( 850 − SST > 10 K). Errors in humidity are quite concentrated ( Figure 7D), but remarkably, all cases that have a relatively large variability also have a negative bias. This points to an underlying process that creates variability in moisture and the misrepresentation of which is associated with too dry boundary layers. Examples of such processes are frontal passages, precipitation, or dry-air entrainment at the boundary-layer top.
A similar figure for the vertical difference in virtual potential temperature ( Figure 8A) indicates that all models typically perform very well (see also Table 1) but a few cases are actually responsible for almost all of the overall error. Cases 2, 4, 5, 6, and 18 perform particularly poorly. Exactly these cases also underestimate the wind speed gradient ( Figure 8B) and have a large positive bias and spread in the wind direction gradient (ie, too little backing or too much veering with height; Figure 8D). These cases are characterized by (strong) stable stratification, or weak winds, or both.
The role of stability becomes especially clear from Figure 9, where we sorted the boxplots by the characteristic large-scale stability for that case. The representation of vertical boundary-layer structure is especially poor if the large-scale stability is strong, ie, when relatively warm air mass is advected over a cold sea surface. This is indisputable for the vertical differences shown but also apparent for 90-m potential temperature and 115-m wind direction (not shown). For 115-m wind speed and 90-m mixing ratio, the difference in model performance between stable and unstable conditions is less pronounced. Similar results are found for the other two models. It is worth noting that within each case, we find a both stable and unstable conditions locally (based on the temperature gradient near the surface). However, stable conditions are much more frequently observed for cases characterized by a strong large-scale stratification.

SUMMARY AND CONCLUSIONS
We have presented an evaluation of three mainstream meteorological models against unique observations up to 315 m at the North Sea. The IFS is the operational global weather model of the ECMWF, HARMONIE is the operational limited-area model used by a number of European weather agencies, and WRF-ARW is a community research model. These models are used in wind energy systems for forecasting, resource assessment, and increasingly to provide input for microscale simulations of wind-turbine interactions. The observations from MMIJ provide a unique opportunity to validate these models at a far (85 km) offshore location up to 315 m.
Our results show that there is substantial variability in model performance. Both WRF and the IFS consistently underestimate the wind speed by about 0.5 m/s. The spread of the errors, expressed as standard deviation of the error distribution, is below 2 m/s. HARMONIE has a smaller wind speed bias than the other models, which may be due to the additional data assimilation used in this model. The spread in the errors is smallest for the IFS, and we have shown that this is not due to the coarser grid.
Wind direction is found to be biased by up to 10 • , ie, the model winds are slightly veered (turned clockwise) with respect to the observations.
One might argue that a wind direction mismatch of <10 • is not particularly harmful. However, as Gaumond et al 45  Additionally, we find that the model results are very ''case-sensitive.'' All models struggle with the representation of stable conditions, leading to unrealistic boundary-layer structures. This is especially important because both power production (through enhanced wake losses) and turbine loads (through enhanced wind shear) are affected. Recently, Baas et al, 46 in line with Kalverla et al, 25 reported that over sea, strong wind shear can be maintained even under moderate to strong wind conditions if the density stratification is strong enough.
Our case-selection strategy, based on the clustering of relevant large-scale variables, was not only able to select a wide range of circulation patterns but also provided a reasonable representation of the underlying climatology. However, with the limited number of 30 cases included in this study, one should be careful to generalize the overall model scores presented in this study. This is especially true because the results in section 3.5 illustrate that the overall model performance as listed in Table 1 cannot always be projected on individual cases.
Finally, our results indicate that the large-scale stability, expressed as the difference between the potential temperature at 850 hPa and the SST, can be used to estimate the accuracy of model results. Since the large-scale stratification is easily determined, and also more robust and generic than the local stability, it could serve as guidance during the interpretation of model results.