WeatherBench 2: A benchmark for the next generation of data-driven global weather models

,


Introduction
Global, medium-range (1-14 day) numerical weather prediction (NWP) is a key component of modern weather forecasting (Bauer et al., 2015).It has huge economic and societal impact as many significant weather events occur on this time scale, such as heat waves, tropical-and extra-tropical cyclones, droughts or heavy precipitation leading to flooding.In addition to providing valuable forecasts, global NWP models serve a range of additional purposes: they provide boundary conditions for regional, high-resolution models; they are used to create (re-)analyses; and they serve researchers as tools to better understand the atmosphere.Current NWP models are based on discretizations of the governing equations describing fluid flow and thermodynamics (Kalnay, 2002).As of 2023, most global models have a horizontal grid spacing of less than 25 km.The European Center for 1 with state-of-the-art, deterministic scores for longer forecast lead times.Similarly, the FuXi model (Chen et al., 2023c) achieves deterministic scores similar to the IFS ensemble mean up to 15 days.SwinRDM (Chen et al., 2023b) combines a recurrent network with a diffusion model to provide granular predictions at 0.25 • resolution.ClimaX (Nguyen et al., 2023a) and Stormer (Nguyen et al., 2023b) are examples of models achieving competitive skill at lower resolution.NeuralGCM (Kochkov et al., 2023) represents the first hybrid ML-physics model that achieves state-of-the-art scores, including an ensemble version.
In light of the rapid advances of the field, the need for an updated benchmark that allows for easy comparison between different approaches becomes apparent.In comparison to WB1, WB2 supports much higher-resolution data and evaluation and adds additional metrics.This paper will lay out the design principles for WB2, followed by a detailed description of the evaluation metrics and datasets, as well as results on the headline scores.We will also discuss several issues with WB2 and potential future directions for this benchmark dataset.

Design decisions for WeatherBench 2
A general challenge of designing benchmarks for weather prediction is that weather is a very highdimensional and multi-faceted problem (Dueben et al., 2022).Every use case has slightly different requirements and quality measures.Take as an example model development at ECMWF, one of the world's leading operational NWP centers.ECMWF tracks a large number of metrics to evaluate their model performance.The headline scores (https://www.ecmwf.int/en/forecasts/quality-our-forecasts) and scorecard (https://sites.ecmwf.int/ifs/scorecards/)serve as concise summaries but are only the tip of the iceberg.In addition to quantitative scores, feedback is also be collected from in-house experts and end users, each of them interested in specific aspects of the forecast system.Recognizing the limitations of particular scores is even more important to keep in mind when evaluating ML-based approaches, which often violate standard assumptions in traditional NWP.All this is to say that no single metric or set of metrics will ever be able to fully describe what a "good" forecast is.WB2 does not attempt to do so.Similarly to ECMWF, WB2 defines a set of headline scores (described in Section 5) and evaluation tools that aim to capture key aspects of medium-range weather forecasting but are not meant to be exhaustive.Therefore, WB2 should not be seen as a traditional benchmark challenge with a single leaderboard but rather as a tool to compare different approaches on different aspects.
In its details, the WB2 evaluation protocol sticks closely to the forecast verification used by the WMO (, WMO) and operational weather centers like ECMWF.WB2 does not try to reinvent these standard verification protocols but rather aims to make them accessible to the community interested in AI models.This is done by providing open-source ground truth data and evaluation code, as well as by providing a single place to compare different traditional and AI approaches.
While WB2 defines the targets to be evaluated, it leaves the remaining modeling setup open (e.g., which inputs or model resolution to use).In this sense, WB2 is a benchmark for entire forecast systems, i.e., the combination of choice of input data, training setup and model architecture, rather than focusing on the models specifically.While this makes it harder to compare different model architectures, we believe that designing the non-model part of the forecast system components is equally as important and different approaches should be encouraged to accelerate progress in the field, as long as overfitting is avoided.However, it is important to recognize that different design choices can lead to advantages and disadvantages.One important choice is which input dataset to use for starting ("initializing") forecasts.Currently, most models use re-analysis datasets, in particular ERA5, which would not be available in an operational setup and provide an potential advantage over operational initial conditions (more details in Section 6.2).
Another priority of WB2 is to emphasize the importance of probabilistic prediction.Weather forecasting is an inherently uncertain endeavor because of chaotic error growth (Lorenz, 1963;Zhang et al., 2007), implying that even with perfect models and near perfect initial conditions there is a range of possible outcomes.In operational NWP, this fact led to the development of ensemble prediction systems in the 1990s (Palmer et al., 1993;Toth and Kalnay, 1993).In an ensemble, a number of forecasts (typically 10-100) are run with perturbed initial conditions and sometimes model physics, providing different potential realizations of future weather.One of the most important advantages of ensemble forecasts is that they can provide reliable information for decision making.For example, they can be used to estimate the probability of extreme events that a single, deterministic forecast might miss.Tropical cyclone track forecasts (often displayed as plumes) are a prime example of this.The data-driven models discussed above, with the exception of a perturbed ensemble experiment in the Pangu-Weather paper, all only provide deterministic forecasts.Similarly to the evolution of dynamical forecast systems, data-driven forecasts will need to become probabilistic to provide users with the most actionable information.This could be possible as a post-processing step on top of deterministic models, for example, through ensemble dressing or lagged forecasts.However, the success of NWP ensembles hints at the advantages of designing probabilistic forecast systems from the ground up, especially for capturing spatio-temporal correlations.For this reason, WB2 will include operational probabilistic verification metrics and baselines from the start.In data-driven modeling, there are several ways to create probabilistic forecasts, from predicting parameterized marginal distributions directly (Andrychowicz et al., 2023) to producing generative roll-outs.Some aspects of how to evaluate these forecasts are discussed in Section 6.3.
Finally, WB2 will provide a dynamic, open-source framework that can evolve with the needs of the ML-weather community.All data and code needed for evaluation are publicly available and can be extended to accommodate more detailed evaluation in the future, driven either by us or community contributors.

Data, baselines and data-driven models
Most of the datasets below are available in on Google Cloud Storage in Zarr format.For the latest details on these datasets, please visit https://weatherbench2.readthedocs.io/en/latest/data-guide.html.Table 1 provides a summary of key facts for each of the datasets/models, including the training resources.

ERA5
The ERA5 dataset (Hersbach et al., 2020) is used as the ground truth dataset for WB2 and as the training dataset for many of the data-driven approaches described above.ERA5 is a reanalysis dataset based on a 0.25 • (roughly 30 km) version of ECWMF's HRES model operational in 2016 (cycle 42r1) and ECMWF's 4D-Var data assimilation which uses a wide range of direct and remote sensing observations.ERA5 uses 12 hour assimilation windows from 21-09 and 09-21 UTC, during which a "prior" forecast initialized from the previous assimilation window is combined with observations to produce an analysis, a "best guess" of the Earth system state during this window.ERA5 data is available at hourly resolution from 1940 to present at the Copernicus Climate Data Store (https://cds.climate.copernicus.eu/).A subset of the ERA5 dataset is available on the cloud in the cloud-optimized Zarr format.
Note that using ERA5 for evaluation and training has some caveats.First, while attempting to be close to observations, ERA5 is a model simulation which is closer to the truth for some variables than for others.Especially for precipitation, ERA5 sometimes shows large differences to rain gauge measurements Lavers et al. (2022).We still include precipitation evaluation based on ERA5 here but advise caution when interpreting those results (for more details, see discussion in Section 6.1).Second, ERA5 uses a longer assimilation window compared to operational forecasts (see below).This helps to better constrain the guess of the atmospheric state but would delay the initialization of forecasts in a real time setting.At 00 UTC, for example, ERA5's assimilation window goes 9 hours "into the future".Operationally, the data assimilation window only extend 3 hours ahead from the forecast initialization time.One could use 06/18UTC initialization to level the playing field a little bit more.This issue has been extensively discussed in the supplement of (Lam et al., 2023, Fig. 10) who found that forecasts initialized from ERA5 at 00/12 UTC indeed perform better than those initialized at 06/18UTC, when the ERA5 analysis also only has a 3 hour look-ahead.The difference in RMSE for GraphCast was up to 5%.However, the 06/18UTC operational analysis are also produced using fewer DA cycles compared to the 00/12UTC analysis.For WB2, we chose to evaluate forecasts initialized at 00/12UTC.The main reasons for this are that some key baselines (IFS ENS and the ERA5 forecasts) were only available for 00/12 UTC initializations and that this has been the standard for most other ML-based evaluations so far, which allows us to directly include some of them here.

ERA5 forecasts
For research purposes, ECMWF ran a set of 10 day hindcasts initialized from ERA5 states at 00/12UTC with the same IFS model version used to create ERA5.These forecast are available in the MARS archive (MARS parameters: class=ea,stream=oper,expver=11,type=fc).Here we downloaded the variables required to compute the headline scores for the year 2020 (with the exception of total precipitation which was not available).Note that data until 5 days lead time is available at 6h intervals, and 12h intervals from 5 to 10 days lead time.
The ERA5 forecasts provide a like-for-like baseline for an AI model initialized from and evaluated against ERA5.They benefit from the same, longer assimilation window compared to the operational initial conditions and are run at 0.25 • resolution-similar to many modern AI methods.Because of the lower resolution and older model relative to the operational IFS HRES in 2020, one would expect the operational model to be more skillful by itself.

Climatology
The climatology is used for computing certain skill scores, in particular the anomaly correlation coefficient (ACC) and the stable equitable error in probability space (SEEPS), and as a baseline forecast.Here we follow Jung and Leutbecher (2008) and compute the climatology c as a function of the day of year (doy) and the time of day (tod ) by taking the mean of ERA5 data from 1990 to 2019 (inclusive) for each grid point.A sliding window of 61 days is used around each doy-tod combination with weights linearly decaying to zero from the center.This removes sample noise and makes the climatology smoother in time, at the expense of reducing the seasonal amplitude.
A probabilistic version of the climatology is created by taking each of the 30 years from 1990 to 2019 as an ensemble member, this time without smoothing.
Note that a 30 year climatology will include some climate drift, especially for temperature.Here, we do not apply any measure to correct for this.

IFS HRES
Our main baseline comes from operational forecasts created with ECMWF's IFS model, which, measured by standard verification metrics, are typically regarded as the best global, medium-range weather forecasts.Since 2016, the IFS in its HRES configuration has been run at 0.1 • (roughly 9 km) horizontal resolution.The operational model is updated regularly, approximately once to twice a year, which means that the exact model configuration might change during the evaluation period.Usually, updates are associated with slight improvements in most evaluation metrics, though not all.However, changes in the IFS are typically gradual.A comprehensive model description can be found at https://www.ecmwf.int/en/publications/ifs-documentation.A schedule of model upgrades can be found at https://confluence.ecmwf.int/display/FCST/Changes+to+the+forecasting+system.Initial conditions are created every 6 hours using an ensemble 4D-Var system using information from the previous assimilation cycle's forecast as well as observations in a +/-3 hour window.After accounting for the time to perform data assimilation and forward simulation, forecasts have a latency of 5.75 to 7 hours from the time at which they are initialized (https://confluence.ecmwf.int/display/DAC/Dissemination+schedule).Forecasts started at 00 and 12 UTC are run up to a lead time of 10 days.06 and 18 UTC initializations are run for 3.75 days.
Running a 15 day TCO1279 (∼ 9km) simulation in the setup used for the operational ensemble (using the new high-resolution setup) takes around 52 minutes with I/O and 46 minutes without I/O on 64 128-core (AMD EPYC Rome) nodes.

IFS HRES Initial Conditions
For evaluating IFS forecasts, we use operational analyses as ground truth rather than ERA5.This is because evaluating IFS forecasts against ERA5 would result in a non-zero error at time step t = 0 and therefore would put the IFS at an unfair disadvantage compared to data-driven models trained on and evaluated against ERA5.The difference is most pronounced during the early lead times and becomes smaller at longer lead times.See Fig. S2 for a comparison.Note that for some variables like 2m temperature differences persist even for longer lead times.This is likely due to a bias between the HRES analysis and ERA5.
Here, as in Lam et al. (2023) we use the initial conditions, i.e. the IFS HRES forecasts at t=0 as our "analysis".Note that this dataset is slightly different from the official analysis product on the ECMWF archive.This is because for the official analysis product an additional surface data assimilation step is taken.For ECMWF's internal evaluation, the official analysis product is used.However, the qualitative differences in the evaluation scores resulting from this discrepancy should be small except for the first time steps after initializations and are mostly limited to surface temperature.For very short lead times, results should not be over-interpreted anyway.Note that for precipitation accumulations we use ERA5 as a precipitation ground truth for all models.

IFS ENS
ECMWF also runs an ensemble version (ENS) of IFS, at 0.2 • resolution until to the 2023 upgrade (including the 2020 evaluation period used here), now at 0.1 • resolution.The ensemble consists of a control run and 50 perturbed members.The initial conditions of the perturbed members are created by running an ensemble of data assimilations (EDA) in which observation errors, model errors and boundary condition errors are represented by perturbations (https://confluence.ecmwf.int/display/FUG/Section+5.1.1+Ensemble+of+Data+Assimilations+-+EDA).The difference of each of the EDA analyses to the mean is then used as a perturbation to the HRES initial conditions.In addition, to more accurately represent forecast uncertainty, singular vector perturbations are added (https://confluence.ecmwf.int/display/FUG/Section+5.1.2+Singular+Vectors+-+SV) to the initial conditions.Model uncertainties during the forecasts are represented by stochastically perturbed parameterization tendencies (Buizza et al., 1999, SPPT), in which spatially and temporally correlated perturbations are added to the model physics tendencies.Ensemble forecasts are initialized at 00 and 12 UTC and run out to 15 days.

IFS ENS Mean
We also include the IFS ENS mean as a baseline, which we computed by simply averaging over the 50 members.The ensemble mean does not represent a realistic forecast but often performs very well on deterministic metrics.

Keisler (2022) Graph Neural Network
Keisler (2022) used a graph neural network architecture (Pfaff et al., 2021) with an encoder that maps the original 1 • latitude-longitude grid to an icosahedron grid, on which several rounds of message-passing computations are performed, before decoding back into latitude-longitude space.The model takes as input the atmospheric states at t = 0 and t = −6h and predicts the state at t = 6h.To forecast longer time horizons, the model's outputs are fed back as inputs autoregressively.The state consists of 6 three-dimensional variables at 13 pressure levels.ERA5 data is used for training with 1991, 2004 and 2017 used for validation, 2012, 2016

Pangu-Weather
Pangu-Weather (Bi et al., 2023) is a data-driven weather model based on a transformer architecture.It predicts the state of the atmosphere at t = t + ∆t based on the current state.The state is described by 5 upper-air variables and 4 surface variables on a 0.25 • horizontal grid (same as ERA5) with 13 vertical levels for the upper-air variables.The model is trained using ERA5 data from 1979 to 2017 (incl.)with 2019 for validation and 2018, 2020 and 2021 for testing.Here we evaluate forecasts for 2020.Four different model versions are trained for different prediction time steps ∆t = {1h, 3h, 6h, 24h}.To create forecasts for an arbitrary lead time model predictions are chained together autoregressively from the four different lead time models, using the fewest number of steps.For example, to create a 31h forecast, a 24h forecast is followed by a 6h and then a 1h forecast.The maximum lead time for the data used here is 7 days.Inference code for Pangu-Weather can be found at https://github.com/198808xc/Pangu-Weather.

Pangu-Weather (operational)
The scorecard also contains a version of Pangu-Weather initialized with the operational IFS HRES initial conditions (see above).

GraphCast
GraphCast (Lam et al., 2023) is similar in structure to Keisler (2022) but uses a higher resolution input with 6 upper-level variables on a 0.25 • horizontal grid with 37 vertical levels, and additionally 5 surface variables.Further, the processor operates on a multi-mesh, i.e. a nested structure of different resolution meshes.The model is also trained autoregressively up to a time horizon of 12 time steps (3 days).Here, we evaluate a version of GraphCast that was trained on ERA5 data from 1979 to 2019 (incl.).See Suppl.5.1 of Lam et al. (2023) for details.Code for GraphCast can be found at https://github.com/deepmind/graphcast.

GraphCast (operational)
The scorecard also contains a version of GraphCast fine-tuned and initialized with the operational IFS HRES initial conditions (see above).For mode detail, refer to https://github.com/deepmind/graphcast.

FuXi
FuXi (Chen et al., 2023c) is an autoregressive cascaded ML weather forecast system based on a transformer architecture with specific models trained for short (0-5 days), medium (5-10 days) and long-range (10-15 days) prediction.The model is trained on ERA5 data at 0.25 • horizontal resolution with 13 vertical levels.

SphericalCNN
Spherical Convolutional Neural Networks (Esteves et al., 2023) generalize Convolutional Neural Networks (CNNs) to functions on the sphere, by using spherical convolutions as the main linear operation.The weather model produces output at a resolution 1.4 • zonally and 0.7 • meridionally, with 7 vertical levels.Code can be found at https://github.com/google-research/spherical-cnn.

Evaluation protocol and metrics
The WB2 evaluation protocol sticks closely to the forecast verification used by the WMO (, WMO) and operational weather centers like ECMWF.Table 4 provides a list of the notation used.

Evaluation time period and initialization times
In its initial version WB2 uses the year 2020.2020 was chosen a) because it provides a compromise between recency and leaving several more recent years for independent testing if modeling groups desire to do so; and b) because data was available for many of the AI baseline models.One year is also a robust enough sample size for most of the metrics defined below.It should be noted, however, that for metrics focused on extreme events, e.g.hurricanes or heat waves, larger sample sizes are required.As WB2 is updated, evaluation can be changed to a more recent or longer period.
It is also worth discussing that many of the AI models were trained with data only up to 2018.Analysis for the GraphCast model shows that training with more recent data is advantageous (Lam et al., 2023).Further, the IFS model is continuously improved, suggesting that more recent forecasts are better.However, it is difficult comparing absolute values, since there are natural fluctuation year-by-year.In Fig. S1 we show scores for 2018 (also available on the website).The relative differences in scores between 2018 and 2020 are small, implying that, at least for the metrics shown here, the results are robust.
Evaluation is done on all 00 and 12 UTC initialization times for 2020, i.e. from January 1 2020 00UTC to December 31 12 UTC.This means that some forecasts will extend into 2021.We compared this to evaluating only forecast lead times that are valid in 2020 but the difference in the scores was negligible.The evaluation time step can be freely chosen.We use 6 hours as our highest resolution.

Evaluation resolution and area
Before the computation of the metrics, all forecasts and ground truths are first-order conservatively regridded to 1.5 • resolution.A regridding tool is available on the WB2 GitHub page.1.5 • is also used as the standard resolution for evaluation by the WMO (, WMO) and at ECMWF.Regridding to a common lower resolution allows all models to be evaluated without penalizing coarser-resolution models.Absolute metric values can differ significantly between resolutions but the relative differences between different models are consistent across different evaluation resolution (see Fig. S9 for a comparison of IFS HRES scores at different resolutions and consult the Weather-Bench 2 website to see all models evaluated at different resolutions).However, this should not give the false impression that higher-resolution models are not more accurate as relevant small scale details can be resolved.
All results shown in this paper are computed over all global grid points.Additionally, scores computed many other regions are shown on the website: https://sites.research.google/weatherbench.

Below-ground grid points
For some pressure levels (e.g.850hPa), grid points in high-altitude regions will be "below" ground; i.e., the surface pressure is actually smaller than the pressure level.ERA5 and IFS output will still provide interpolated values at these locations, even though they do not correspond to real physical variables.Here, we exclude below ground grid points from the computation of all metrics.To do so, we compute the climatological fraction of above-ground grid points (defined by geopotential > geopotential at surface) for each pressure level using the same doy-tod dependent method described in Section 3.2.We then use this fraction as an additional weight when computing spatial averages.We chose to implement the below-ground mask based on climatological values rather than the actual forecast or ground truth values because ML models, when explicitly trained against masked scores conditioned on the forecast or ground truth, could lead to biased predictions.

Deterministic metrics
All metrics are computed using an area-weighting over grid points.This is because on an equiangular latitude-longitude grid, grid cells at the poles have a much smaller area compared to grid cells at the equator.Weighing all cells equally would result in an inordinate bias towards the polar regions.The latitude weights w(i) are computed as: where θ u i and θ l i indicate upper and lower latitude bounds, respectively, for the grid cell with latitude index i.All vertical (pressure) levels are treated separately.For readability, no level index is included in the equations below.

Root mean squared error (RMSE)
The RMSE is defined for each variable and level as (2) Note that this agrees with the definition of the RMSE used by the WMO and ECMWF.In WB1, the time mean was taken outside the square root.We compared both versions and found differences to be small (less than 2% absolute difference on average).
The wind vector (WV) RMSE is computed as where u f , u o , v f and v o are the u and v components of wind in the forecast and observations, respectively.

Anomaly correlation coefficient (ACC)
The ACC is computed as the Pearson correlation coefficient of the anomalies with respect to the climatology c: where c t,l,i,j and c t,i,j indicate the climatology corresponding to the appropriate tod -doy combination.The ACC then is defined as (5) The range of ACC values goes from 1, indicating perfect correlation, to -1, indicating perfect anti-correlation.A climatological forecast has an ACC value of zero.ECMWF states that when "ACC value falls below 0.6 it is considered that the positioning of synoptic scale features ceases to have value for forecasting purposes" (https://confluence.ecmwf.int/display/FUG/Anomaly+Correlation+Coefficient).

Bias
The mean error, or simply bias, is computed for each location i, j as Bias l,i,j = 1 In addition, we compute the globally averaged root mean squared bias (RMSB) as

Stable Equitable Error in Probability Space -SEEPS
Traditional deterministic scores such as RMSE and ACC are not good choices for evaluating precipitation forecasts.This is because precipitation has a very skewed distribution and high spatiotemporal intermittency or unpredictability.Under such conditions, traditional scores heavily favor unrealistically smooth forecasts.This is the case for all variables but is especially dramatic for skewed variables.For this reason, ECMWF and WMO decided to use the SEEPS score (Rodwell et al., 2010) for their routine deterministic precipitation evaluation.The SEEPS score is based on a three-class categorization into "dry", "light" and "heavy" precipitation.The score is designed to discourage "hedging" (i.e.smooth forecasts) and be stable to parameter choices.For details behind the score, please consult Rodwell et al. (2010).Here, we describe how the score is computed and how the computation differs slightly from that in the original paper.We use a dry threshold of 0.25 mm/day (0.1 mm/day for 6 hourly accumulation; shown on the website).The remaining precipitation values are classified into light and heavy, so that climatologically there are twice as many light compared to heavy precipitation days.We compute this by calculating the 2/3rd quantile of non-dry days for each day-of-year according to the same procedure for computing a smooth climatology described in Section 3.2.This differs from most other SEEPS computations which uses monthly climatologies.Because the daily climatology is smooth though, this does not affect the results much.Forecast/observation pairs are then categorized into the three classes and a 3x3 contingency table is created for each forecast lead time.The contingency table is then multiplied by the scoring matrix S based on the yearly average climatological occurrence of dry days p 1 for each geographical location: where columns represent observed probabilities and rows represent forecast probabilities.
Very wet and very dry regions are excluded.Here we use 0.1 < p 1 < 0.85 as suggested by Rodwell et al. (2010).Finally an area-weighted average is taken over all locations.
In the figures, we show (1 -SEEPS), the more common skill score version that is positively oriented.

Continuous ranked probability score (CRPS)
Given scalar ground truth Y , and i.i.d.predictions X, X ′ , CRPS is defined as and Raftery, 2007).The skill term, E|X − Y | penalizes poor predictions, while − 1 2 E|X − X ′ | encourages spread.CRPS is minimized just when X is drawn from the same distribution as Y .To see this, consider two i.i.d.ground truth samples, Y, Y ′ , then subtract 1 2 E|Y − Y ′ | from CRPS to arrive at the divergence relation Thus, scalar CRPS is equal, up to a constant independent of the prediction, to the squared L2 difference of their cumulative distribution functions.In the case of a deterministic prediction, the CRPS reduces to the MAE.WeatherBench considers multi-dimensional predictions, f t,l , conditional on the initial ground truth o t−l .For these, CRPS is defined by averaging over time/latitude/longitude components.We also take advantage of M ≥ 2 predictions {f (1) , . . ., f (M ) }. Setting we define This is the time average of unbiased (conditional) CRPS (Zamo and Naveau, 2018).With large enough T , this is an accurate estimate of CRPS conditional on o t−l .It is therefore minimized by any prediction f t,l with the same component distributions as ground truth at time t, conditional on o t−l .
The second term ( 8) is efficiently computed in O(M log M ) time using O(M ) memory with a sort rather than a double summation.CRPS for deterministic predictions (where M = 1) is also supported, and in this case CRPS reduces to (weighted) mean absolute error.

Spread-skill ratio
The spread-skill ratio R is defined as the ratio between the ensemble spread and the RMSE of the ensemble mean f t,l,i,j = 1 M M m f t,l,i,j,m .
with var m being the variance in the ensemble dimension.
A well-calibrated ensemble forecast should have a spread-skill ratio of 1 (Fortin et al., 2014).Smaller values indicate an underdispersive forecast, while larger values indicate an overdispersive forecast.
Note that the spread-skill ratio is only a first-order test for calibration.To further diagnose ensemble calibration, rank histograms are a suitable choice (Wilks, 2006, Ch. 7.7.2).We plan to include those in WB2 soon.

Energy Spectra
Zonal spectral energy along lines of constant latitude is computed as a function of wavenumber (unitless), frequency (m −1 ) and wavelength (m).With f l discrete values along a zonal circle of constant latitude, with circumference C, the DFT F k is computed as The energy spectrum is then set to The factor of 2 appears for wavenumbers k > 0 since they account for both negative and positive frequency content.This choice of normalization ensures that Parseval's relation for energy holds: Supposing f l are sampled values of continuous function f (ℓ), for 0 < ℓ < C (m), then (C/L) is the spacing of longitudinal samples, whence To arrive at a final spectrum we average the zonal spectra for 30 • < |lat| < 60 • .

Headline scores
The headline scores are listed in Table 3.They reflect the most commonly evaluated variables in medium-range forecasting.The upper-level variables are chosen to capture the large-scale evolution of the atmosphere.Z500 and T850 are particularly good tracers of extra-tropical dynamics.Q700 provides a proxy for moisture transport and, indirectly, clouds.The surface variables are closely aligned with weather impact through 2 m temperature (T2M), 10 m wind speed (WS10) and 24 h precipitation accumulation (TP24hr).The mean sea level pressure (MSLP) is another measure of larger scale dynamics and is a good proxy for the strength of tropical and extra-tropical cyclones.

Results
The description of the results in this paper will focus on the general characteristics of the evaluation metrics and the baselines, with a smaller focus on comparing the data-driven models themselves.This is because the paper is a snapshot of the state-of-the-art of data-driven modeling at the time of writing and will almost certainly be outdated soon.For this reason, and because visualization of all models on a single graph makes it unreadable, we only show detailed scores for a smaller selection of models.Those models (GraphCast, Pangu-Weather and NeuralGCM) were chosen because of their current relevance in the community.For an up-to-date view of the data-driven state-of-the-art, including detailed scores of all models, visit the accompanying website: https: //sites.research.google/weatherbench.

Headline scores
Figs. 1 and 2 show scorecards of the deterministic headline scores relative to the IFS HRES baseline.Fig. 3 shows a scorecard of upper-level headline scores.Figs. 4, 5 and 6 show the deterministic and probabilistic scores as a function of lead time, respectively.Scores for the other metrics, variables and resolutions can be found on the WeatherBench 2 website.On RMSE, IFS ENS (mean) performs better than IFS HRES for longer lead times.For smaller scale variables this transition happens earlier.For example, for larger scale variables like pressure and temperature the IFS ENS (mean) has lower errors up to around 2 days, compared to up to around 12h for wind and humidity.The state-of-the-art, deterministic, data-driven methods (Pangu-Weather, GraphCast, FuXi and NeuralGCM) also score well compared to IFS HRES, with scores similar to those shown in the respective papers.In comparison to IFS ENS (mean), these deterministic models have lower errors up to 3-6 days, after which the ensemble mean has the lowest error.NeuralGCM ENS (mean) roughly matches the IFS ENS (mean) scores on RMSE.
It is also noticeable, especially in the plots relative to IFS HRES, that the scores for the datadriven methods have a strong 6h zig-zag pattern.This is an artifact of training and evaluating with ERA5, which has a 12h assimilation window.Forecasts initialized at 00/12UTC are initialized towards the beginning of the 09-21/21-09UTC ERA5 assimilation windows.This means that for the first 6h forecast, e.g., from 00 to 06 UTC, all the AI models have to learn is to emulate the IFS model version used in ERA5.For the next 6h window, e.g., from 06 to 12UTC, the models also have to learn the assimilation step-a harder task.If forecasts initialized at 06/18UTC were evaluated, the zig-zag pattern would be reversed.Note that in Lam et al. (2023), this pattern is   Figure 4: Global RMSE (SEEPS for TP24h) for headline variables for the year 2020.Note that for TP24h, IFS HRES and IFS ENS (mean) are evaluated against ERA5, since no precipitation accumulations are available for the analysis.Not all models/datasets have all variables available.not visible because evaluation is done on a 12h interval, thereby only sampling every second point shown here.Here we chose to show the 6h evaluation for completeness (except for NeuralGCM where we only had 12-hourly data).
Note that the larger error of IFS ENS (mean) at time t = 0 for 2m temperature stems from a difference in resolution between the ensemble (0.2 • ) and the analysis (0.1 • ).Results for the newly upgraded higher resolution ensemble (not included in WB2 yet) do not show this large initial error.
The precipitation verification using the SEEPS score shows that IFS HRES is the most skillful model with the "blurrier" models, such as IFS ENS (mean), being less skillful.This shows the advantage of using a categorical score compared to an average score such as RMSE.On RMSE, the order of the models is reversed (not shown).
For the probabilistic models, IFS ENS and NeuralGCM ENS show very similar scores on upperlevel variables.Their skill approaches that of the probabilistic climatology towards the end of the two week forecast horizon.This is in line with the expected ∼15 day limit of predictability of synoptic weather (Selz, 2019).For some variables, while the IFS ENS and NeuralGCM ENS skill curves flatten off, there is still a gap to the climatological skill.This could have several reasons: first, our methodology for computing the probabilistic climatology might not be a perfect representation of the true climatology; second; the forecast might still have some skill there.Especially for largescale variables such as pressure and temperature, there is evidence of skill in the sub-seasonal range; and third, especially for temperature variables, climate change has a non-negligible effect over the 30 years used to compute the climatology.
An analysis of the spread-skill relationship (Fig. 7) shows that except for precipitation, which is underdispersive, the IFS ensemble forecast is very well calibrated, especially for longer lead times.The initial hours show some spin-up with more spread in the first hours, followed by a 2 day dip in the spread-skill ratio.For geopotential and pressure, this trend is reversed.NeuralGCM ENS also shows good calibration for longer lead times but instead appears to be underdispersive at early lead times, perhaps due to the lack of initial condition spread.Here it is important to not over-interpret global spread-skill ratios.More fine-grained analysis of tropical vs extra-tropical dispersion show more differentiated behavior.These results are included on the website.
Fig. 8 also shows a time series of 6 day RMSE over Europe for all of 2020 for each of the models.Here, several forecast bust events are visible, where forecast skill drops significantly below normal (Rodwell et al., 2013).Interestingly, some bust cases are shared between the physical and AI models (e.g.around March 19) while for others AI models continue to perform well, despite IFS HRES and ENS performing badly (e.g.August 11).In some other cases, AI models spuriously perform badly.Generally, this implies that AI models show similar error characteristics to physical models.

Bias
Fig. 9 shows the RMS bias (RMSB) for the headline variables.Additionally, bias maps for 2m temperature, 24h precipitation, 10m wind speed and 700hPa specific humidity can be found in the supplement.It is important to note that one year is a small sample to compute bias statistics.In the bias maps in the supplement it is evident that dominant weather patterns in 2020 influence the bias.For example, 2020 saw a persistent heat wave in Siberia.All models tend to have a cold bias there since they fail to fully predict such a persistent pattern.Therefore, not too much should be interpreted into regional biases without considering the context.
What is noticeable is that while ML methods tend to have a lower average bias compared to physical methods, especially at early lead times, for wind speed GraphCast (and Keisler (2022); not shown) show a large increase in average bias with lead time.The bias maps (Figs.S6 and Figure 5: Global RMSE/SEEPS % difference compared to IFS HRES for headline variables for the year 2020.Negative values indicated lower RMSE.Note that for TP24h, IFS HRES and IFS ENS (mean) are evaluated against ERA5, since no precipitation accumulations are available for the analysis, and the metric is SEEPS (in this case not 1-SEEPS to have a consistent orientation with the other relative plots).Not all models/datasets have all variables available.Figure 9: Global RMSB for the headline variables for the year 2020.For Z500 and MSLP the climatological bias is comparatively large and is not shown within the figure limits S7) show that this is due to a consistent bias towards smaller wind speed values.This bias is not present when looking at U or V separately (see supplement of Lam et al. (2023)).What this indicates is that ML models trained to minimize each wind component separately, and doing a good job at doing so, can struggle to represent correlations between those components.Interestingly, Pangu-Weather does not have this bias.Neither does the hybrid NeuralGCM model.

Spectra
Fig. 10 shows the power spectra for four variables and four lead times.The ERA5 spectrum can be taken as a reference to compare between lead times as it is constant with lead time.IFS HRES generally has more energy on small scales compared to ERA5, likely due to its higher resolution, but is also quasi-constant with lead time which is to be expected for a physical model.The IFS ENS (mean) shows an increasing drop-off in power with lead time, starting with the smaller wavelengths and then also for the larger wavelengths for longer lead times.Here it is important to re-iterate that the ensemble mean does not represent a realistic realization of weather but rather the mean of the forecast distribution.In other words, the mean will become increasingly smooth with lead time, eventually converging to the climatological mean.The progression of the drop-off from smaller towards longer wavelengths is consistent with the model of upscale error growth first proposed by Lorenz (1969).For variables that vary more on small scales, such as humidity and precipitation, this effect is larger.
For Z500, the AI models tend to have more energy on small scales, more so for longer lead times.This increase in small scale energy for Z500 is in contrast to the behavior for the other variables where the AI models show a significant drop in small scale variability from 6h to 3d.This is a result of the aforementioned smoothing.After 3 days most AI models tend to have a constant spectrum.This is in line with the 24-72 h optimization windows.In other words, these models tend to optimize their blurring for those time ranges.The ensemble mean continues to become blurrier resulting in lower RMSE values for longer lead times.The spectra of NeuralGCM show less variance on small scales due to the model's lower resolution but the spectra are roughly consistent with lead time.
The spectra clearly show some of the blurring exhibited by some AI models.It is important to note though that having a spectrum that matches observations well is a necessary but not sufficient conditions for "realism".

Case studies
While case studies can only provide anecdotal evidence of model performance, they offer a more "human" and holistic view that skill scores cannot provide.Here we shows one such case study.Another case study of Hurricane Laura can be seen in Fig. S10 On the WB2 website, we eventually plan to extend this catalog of case studies.
Fig. 11 shows model forecasts and the corresponding ERA5 ground truth for storm Alex which brought damaging wind and rain across Europe (https://climate.copernicus.eu/esotc/2020/storm-alex).What is noticeable is that all models, physical and AI-based, predict the evolution of the storm with impressive accuracy and agreement up to four days lead time.October 2 saw strong winds over the North-West of France as well as parts of northern Spain, which was predicted well by all methods.On October 4, a cut-off trough developed cetered over the English channel.IFS HRES and even the ENS mean reasonably predict this whereas the AI methods favor a continuous trough stretching from Iceland to England.
Fig. S10 shows the evolution of Hurricane Laura, the most damaging hurricane of the 2020 season  (https://www.nhc.noaa.gov/data/tcr/AL132020_Laura.pdf).IFS HRES shows impressive skill in predicting the evolution and track of the hurricane up to 5 days in advance with the timing and location of landfall on August 8 almost perfectly forecasted.The IFS ENS mean, somewhat predictably, fails to predict the strength and location of the hurricane because it averages out several hurricane tracks.Pangu-Weather and GraphCast have a solid track forecast with reasonable cyclone structure but fail to predict the intensity in pressure and wind speed seen in the actual hurricane of the IFS forecast.GraphCast predicts landfall West of the actual location.NeuralGCM, for this case, shows a remarkable agreement with ERA5, on par with IFS HRES.We plan to add more case studies to the website in the future.While case studies should not be over-interpreted, they are evidence that while sometimes lacking in detail and fidelity, AI models can produce skillful extreme weather forecasts for events outside of their training dataset.

ERA5 as ground truth
Here, we use ERA5 as our ground truth dataset.However, as already discussed in Section 3.1, ERA5 is a model simulation that is kept close to observations, rather than direct observations.The quality of ERA5 depends on the variable in question.Large-scale dynamical variables like geopotential and upper-levels temperatures and wind tend to be well represented in reanalysis datasets.For surface variables, the sparsity of observations and difficulty of representing smallscale physics in the underlying model can cause larger discrepancies with observations.This is especially true for precipitation, which is not directly assimilated into ERA5 (e.g., through radar observations) and often show large differences to rain gauges or radar precipitation estimates (e.g., (Lavers et al., 2022), (Andrychowicz et al., 2023)).The precipitation evaluation using ERA5 shown here should really be seen as a placeholder for more accurate precipitation data.
Operational weather services like ECMWF verify their forecasts with direct observations, e.g., from weather stations, in addition to using assimilated ground truths.This is something we are looking to add to WeatherBench in the future.However, station observations come with their own issues: first, they are unevenly distributed with some regions being especially sparse in observations; second, station data comes in varying quality and requires careful quality control; and third, station data suffers from representativeness issues; that is, the station might not be representative of the model grid box it is compared to.
For these reasons, ERA5 still represents a good option for evaluation for most variables because of its temporal and spatial coverage but it should be kept in mind that it does not tell the full story.Notably, some regional ML and post-processing models already look at using data directly for training and evaluation (Andrychowicz et al., 2023;Demaeyer et al., 2023).

ERA5 initialization versus operational forecasting
Most data-driven methods so far have been trained and evaluated using ERA5 data as initial conditions.As already discussed in Section 3.1, ERA5 data would not be available in real time to initialize live, operational forecasts.For a true apples-to-apples comparison with operational physical models, ML forecasts would need to be initialized using operationally available initial conditions.The first such experiment has been done by Ben-Bouallegue et al. (2023a), who took the Pangu-Weather model trained on ERA5 and initialized it with operational IFS analyses.In addition, we include operational versions of Pangu-Weather and GraphCast in the WeatherBench scorecards.Across the board, the scores are very similar between the ERA5-initialized and operational versions, whether they have been fine-tuned (in the case of GraphCast) or not (in the case of Pangu-Weather).This suggests that the distribution shift between ERA5 and the operational IFS analysis is manageable for these AI models.Regardless, going forward it will be important to distinguish between ERA5-initialized and (quasi-)operational models as e.g.done in Table 1.For an apples-to-apples comparison, we encourage researchers to run and evaluate their models with operational analyses.

Forecast realism, probabilistic forecasts and extremes
One feature of current AI methods is that they produce unrealistically smooth forecasts, a trend that often becomes more pronounced with lead time.This is a direct consequence of these models minimizing a deterministic mean error (such as MSE).Because the evolution of the atmosphere is chaotic and forecast uncertainty grows with time, this will lead models to "hedge their bets" by predicting the mean of the potential forecast distribution.In a non-linear, high-dimensional system such as the atmosphere, the mean of many states is not a realistic state itself.The result are blurry forecasts that perform well on some skill metrics (like RMSE or ACC) but do not represent a possible weather state.This is evident in several analyses presented here: spectra show excessive blurring of AI models for longer lead times; the SEEPS score shows how blurry models fail to predict the right precipitation category; wind speed biases are evidence that current AI models have difficulties learning correlations between variables; and the case studies show that the intensity of local features such as wind speed, precipitation and cyclone intensity aren't represented.For some applications, having good forecasts of average weather is sufficient but for others AI models are not yet appropriate.This is naturally related to probabilistic prediction, i.e., predicting the full range of potential weather outcomes, which is so important for predicting the probability of extreme weather.Current, "traditional" forecast systems use ensembles for this purpose.AI methods could follow a similar approach by producing an ensemble of generative roll-outs, or they could directly predict probabilistic outcomes of interest, such as the probability distribution of precipitation as done in the MetNet family of models (Andrychowicz et al., 2023) or many post-processing models (e.g.Gneiting et al. (2005)).For many applications, such as predicting weather at a particular location, the latter approach might be more straightforward and sufficient, while for other applications, for example cyclone track forecasting or when humans interpret model output, temporal and spatial structure is important.Working closely with end users will be key in determining the most appropriate probabilistic representation for each application.
The probabilistic evaluation metrics proposed in WB2 (CRPS and spread/skill) are very much just the tip of the iceberg.They are univariate statistics, i.e., they ignore spatial and temporal correlations, and they do not specifically focus on extreme weather.Just like with the deterministic metrics (RMSE or ACC), better CRPS values do not automatically mean a more useful forecast.Weather forecasting is a very high-dimensional problem, which means that there won't be a single metric to determine forecast quality.Rather, evaluation will have to be guided by applications.

Post-processing
In this paper, the focus has been on data-driven forecast models.However, AI can also and has for a long time been used to post-process dynamical forecasts (Rasp and Lerch, 2018;Finn, 2021;Gneiting and Raftery, 2007;Grönquist et al., 2021).WB2 can equally well serve as a benchmark for post-processing models, on top of dynamical forecasts such as those produced by ECMWF.
In fact, comparing "purely" data-driven forecasting models with dynamical models with state-ofthe-art post-processing should be an insightful exercise going forward.Note that benchmarks for post-processing have been proposed by Ashkboos et al. (2022) including re-forecasts for training post-processing models and an extreme weighted CRPS score definition.Similarly, Demaeyer et al. (2023) present a benchmark for station-based post-processing.
Significant improvements over raw dynamical model output can be expected using post-processing.Previous studies (Rasp and Lerch, 2018;Ben-Bouallegue et al., 2023b;Finn, 2021) suggest that probabilistic forecasts can be improved by up to 20% in terms of CRPS, depending on the variable in question.Since data-driven methods, at least partly, already perform a post-processing implicitly, a fair comparison would be against post-processed dynamical models.Another interesting question is how much AI models can benefit from additional post-processing.Hopefully, these comparisons will be added to WB2 soon.

Conclusion
WeatherBench 2.0 is an updated benchmark for data-driven, global weather forecasting.It is motivated by the rapid advances in used AI methods since the publication of the original WeatherBench benchmark.WB2 is designed to stay close to the operational forecast evaluation used by many weather centers and to provide a robust framework for evaluating new methods against operational baselines.By providing evaluation code and data, we hope to speed up machine learning workflows and ensure the reproducibility of results.
WB2 is also designed to be a dynamic framework that will be updated with new metrics and models as this area of research evolves.Several possible extensions have already been discussed in this paper, for example including station observations and evaluating extremes.
and 2020 for testing and the remaining years from 1979 to 2020 for training.During training the model is trained to minimize the cumulative error of up to 12 time steps (3 days).

Figure 1 :
Figure 1: Deterministic headline scorecards for upper-level variables.Values show absolute RMSE.Colors denote % difference to the IFS HRES baseline.

Figure 2 :
Figure 2: Deterministic headline scorecards for surface variables.Values show absolute RMSE, with the exception of precipitation which shows SEEPS (evaluated against ERA5 in all cases).Colors denote % difference to the IFS HRES baseline.

Figure 3 :
Figure 3: Probabilistic headline scorecards for upper-level variables.Values show absolute CRPS.Colors denote % difference to the IFS ENS baseline.

Figure 6 :
Figure 6: Global CRPS for headline variables for the year 2020.

Figure 10 :
Figure 10: Power spectra of 500hPa geopotential, 700hPa specific humidity, 850hPa u-wind and 6h precipitation accumulation for different lead times.

Figure S2 :
Figure S2: Comparison of evaluating IFS HRES and ENS (mean) with analysis and ERA5 ground truth.

Figure S4 :
Figure S4: Global mean 850hPa temperature bias for 3, 5 and 10 day lead times.Below ground areas are grayed out.

Figure S6 :
Figure S6: Global mean 10m wind speed bias for 3, 5 and 10 day lead times.

Figure S7 :
Figure S7: Global mean 850hPa wind speed bias for 3, 5 and 10 day lead times.Below ground areas are grayed out.

Figure S8 :
Figure S8: Global mean 700hPa specific humidity bias for 3, 5 and 10 day lead times.Below ground areas are grayed out.

Figure S9 :
Figure S9: Comparison of RMSE scores for IFS HRES evaluated at different resolutions

Table 1 :
Table of datasets and models; ∆x refers to the horizontal resolution and "levels" to the number of vertical model levels used in the input and output.(*) See details in the text for IFS inference time.GPU = Graphics Processing Unit.TPU = Tensor Processing Unit.
combine a differential dynamical core with learned physics.NeuralGCM has been trained in a high-resolution deterministic

Table 2 :
Notation used in the evaluation metrics.version at 0.7 • and a lower-resolution ensemble version at 1.4 • .Both model versions use 32 vertical levels.

Table 3 :
List of headline scores.WV is short for Wind Vector; WS is short for Wind Speed.