A transformer neural network for predicting near-surface temperature

A new method based on the Transformer model is proposed for post-processing of numerical weather prediction (NWP) forecasts of 2 m air temperature. The Transformer is a machine learning (ML) model based on self-attention, which extracts information about which inputs are most important for the prediction. It is trained using time series input from NWP variables and crowd-sourced 2 m air temperature observations from more than 1000 private weather stations (PWSs). The performance of the new post-processing model is evaluated using both observational data from PWSs and completely independent observations from the Danish Meteorological Institute (DMI) network of surface synoptic observations (SYNOP) stations. The performance of the Transformer model is compared against the raw NWP forecast, as well as against two benchmark post-processing models: a linear regression (LR) model and a neural network (NN). The results evaluated using PWS observations show an improvement in the 2 m temperature forecasts with respect to both bias and standard deviation (STD) for all three post-processing models, with the Transformer model showing the largest improvement. The raw NWP forecast, LR, NN and Transformer model have a bias and STD of 0.34 and 1.96°C, 0.03 and 1.63°C, 0.10 and 1.53°C and 0.02 and 1.13°C, respectively. The corresponding results using DMI SYNOP stations also show improved forecasts, where the Transformer model performs better than both the raw NWP forecast and the two benchmark models. However, a dependence on distance to the coast and cold temperatures is observed.


| INTRODUCTION
Numerical weather prediction (NWP) models generally have systematic errors in their forecasts, for example, for near-surface variables such as temperature, wind speed and relative humidity. These errors are partly due to errors in the initial conditions and imperfect descriptions of dynamical and physical processes, including unresolved sub-grid processes (Mass et al., 2002; Nicolis et al., 2009; Tribbia & Baumhefner, 2004). Furthermore, NWP forecasts are gridded, which means that the forecast represents an average over a grid cell (Pielke, 2013). Although high-resolution NWP models have spatial resolutions of a few kilometres, they still have local systematic errors due to unresolved topography and local obstacles (Hart et al., 2005), and also due to errors in the numerical solution of the governing equations. In order to obtain weather forecasts that better match the locally observed weather, post-processing methods can be used to correct these errors.
The simplest post-processing methods are bias corrections, either through a fixed offset or through the use of a moving average. The moving average has been thoroughly investigated for post-processing of near-surface variables (McCollor & Stull, 2008; Stensrud & Yussouf, 2005; Sweeney & Lynch, 2011). The Kalman filter (Kalman, 1960) has also been extensively used for post-processing NWP forecasts (Alerskans & Kaas, 2021; Delle Monache et al., 2011; Galanis et al., 2006; Homleid, 1995; Roeger et al., 2003). One of the first post-processing methods used within the atmospheric sciences is model output statistics (MOS), where NWP forecasts and observations are related through a set of linear regression equations (Glahn & Lowry, 1972). MOS has been widely used for post-processing NWP forecasts (Cheng & Steenburgh, 2007; Hart et al., 2004; Jacks et al., 1990; Wilks & Hamill, 2007) and it is still used operationally by several national meteorological services.
These methods are all linear and can therefore only represent linear relationships. The atmosphere, however, is a highly non-linear system (Lorenz, 1963). Non-linear post-processing methods based on machine learning (ML) techniques have become increasingly popular and have shown good results when used to post-process NWP forecasts. A variety of ML models have been used to post-process short-range wind speed forecasts (Barbounis et al., 2006; Schicker et al., 2017; Veldkamp et al., 2021) and solar radiation forecasts (Lauret et al., 2014; Lima et al., 2016). ML-based models have also been used to post-process near-surface temperature forecasts. Eccel et al. (2007) compared several linear and non-linear post-processing methods, including two methods based on ML techniques, a neural network (NN) and Random Forest (RF), for post-processing of minimum temperatures. Overall, the best-performing method was RF; however, in some cases, the performance of the linear methods was comparable to that of the non-linear methods. Casaioli et al. (2003) and Marzban (2003) both obtained good results for post-processing of near-surface temperature forecasts using ML techniques.
Many operational meteorological forecasting centres are today also running ensemble forecasts, which provide probabilistic forecasts and enable a more accurate and meaningful framework, especially for longer forecast lead times (Buizza & Leutbecher, 2015; Houtekamer et al., 1996; Leutbecher & Palmer, 2008; Molteni et al., 1996; Toth & Kalnay, 1993). There exists a wide range of techniques for post-processing of probabilistic forecasts (Gneiting et al., 2007; Vannitsem et al., 2018, 2021). Standard statistical methods include ensemble model output statistics (EMOS), in which the forecast is corrected through the use of multiple linear regression, similar to the traditional MOS method for deterministic forecasts (Gneiting et al., 2005). Bayesian model averaging is another frequently used post-processing method, in which a forecast probability density function is obtained through a weighted average of the probability density functions based on the individual ensemble members (Hoeting et al., 1999; Raftery et al., 2005; Wilson et al., 2007). Other well-known post-processing methods include logistic regression (Hamill et al., 2004) and quantile regression (Bremnes, 2004). Recently, post-processing methods based on ML techniques have also been applied to probabilistic forecasts. Quantile regression forests have been used for post-processing of probabilistic forecasts of both temperature and wind speed and were found to perform better than traditional ensemble post-processing methods (Taillardat et al., 2016). Rasp and Lerch (2018) developed an NN for post-processing of ensemble near-surface temperature forecasts, in which both forecast data and station-specific information, such as station altitude, were used as input to the model. The NN was compared to standard post-processing methods and was found to perform the best. Bremnes (2020) used quantile regression in combination with an NN to post-process probabilistic wind speed forecasts, obtaining improved forecasts compared to other post-processing methods. ML-based methods have also been used for post-processing probabilistic precipitation forecasts with good results (Scheuerer et al., 2020).
The objective of this study is to evaluate the performance of a new ML model for post-processing of deterministic 2 m air temperature (T2m) forecasts. The model used in this study is based on the Transformer model, which was originally developed for sequence modelling and tested on machine translation tasks (Vaswani et al., 2017). The Transformer model has also been widely used for other natural language processing problems (Devlin et al., 2018; Liu et al., 2018; Radford et al., 2018) and models based on the Transformer have also been used for time series forecasting (Cai et al., 2020; Li et al., 2019; Wu et al., 2020). Furthermore, methods based on the Transformer model have recently been used within the atmospheric sciences for data-driven weather forecasting (Bilgin et al., 2021; Chattopadhyay et al., 2020) and post-processing of global ensemble forecasts (Finn, 2021).
Here we use forecasts from the Global Forecast System (GFS) NWP model and observational data from a network of private weather stations (PWSs) in order to produce site-specific T2m forecasts. Data from April 2019-December 2020 and observations from more than 1000 PWSs, located mainly in Denmark, are used to train the post-processing model. The model is then used to generate local 48-h T2m forecasts for individual PWSs. The Transformer model is also evaluated on observational data from a network of surface synoptic observations (SYNOP) stations in order to assess its performance on completely independent observations from another data source. The performance of the Transformer model is compared against the raw GFS forecast, as well as against two standard benchmark post-processing methods: a linear regression (LR) model and a feed-forward, fully connected NN.
The remainder of the paper is structured as follows: Section 2 presents the observational and forecast data used, followed by a description of the data pre-processing. The Transformer post-processing model is described in Section 3 and details of the model input variables are given in Section 4. Section 5 introduces the two benchmark post-processing models. Results are shown in Section 6, and Section 7 includes a discussion of the results and suggestions for future work. Finally, the conclusions are presented in Section 8.

| Observational data
Crowd-sourced observations are observations obtained from non-conventional sources, such as PWSs, vehicles and smartphones (Howe, 2006; Muller et al., 2015). The use of crowd-sourced observations in the atmospheric sciences, NWP included, is becoming increasingly widespread. In recent years, several study areas of interest have emerged, such as the use of pressure observations from smartphones (Hintz et al., 2019; Mass & Madaus, 2014; McNicholas & Mass, 2018), the quantification of the urban heat island effect (Meier et al., 2017; Steeneveld et al., 2011), the use of crowd-sourced wind observations (Droste et al., 2020; Hintz et al., 2020), the use of vehicle observations for assimilation in NWP models (Siems-Anderson et al., 2020), and the use of PWSs for post-processing of NWP forecasts (Nipen et al., 2020). Here, we use crowd-sourced observations from PWSs to post-process near-surface temperature forecasts.
Observations from non-traditional sources do not usually comply with the regulations defined by the World Meteorological Organization (WMO) for meteorological observations (WMO, 2018). The WMO rules for station placement have been developed to ensure that the observations are representative of the surrounding area. This means that stations should not be placed too close to obstacles, such as buildings and trees, on balconies or in direct sunlight. For PWSs, these standards are often not met. However, for the purpose of post-processing near-surface temperature forecasts in order to obtain more local forecasts, the representativity of the observations for a larger surrounding area is of less interest. In this case, the interest is in the locally observed temperature, under the premise that it is measured reasonably correctly, with good ventilation, minimized short- and long-wave radiative effects and dry instrument conditions.
The main strength of crowd-sourced observations is their high spatial resolution; observations from non-traditional sources vastly outnumber observations from official networks. Several networks of PWSs exist, such as the Weather Observations Website (WOW, 2021), Weather Underground (Weather Underground, 2021), Netatmo (Netatmo SAS, 2021) and FieldSense (FieldSense, 2021). In Denmark, both the Netatmo and FieldSense networks greatly outnumber the official network of weather stations operated by the Danish Meteorological Institute (DMI). Other well-known PWS networks, such as WOW and Weather Underground, however, are not as widely used in Denmark, which is the area of interest in this study. On the other hand, the main disadvantage of crowd-sourced data from PWSs is that the sensors are usually low-cost and not as accurate and well-calibrated as the measurement instruments used in official weather station networks (Bell et al., 2013, 2015).
In this study, observational data from FieldSense are used. The FieldSense network consists of approximately 1500 stations as of 26 July 2021 and is densest in rural areas and farmland. For the purpose of developing a post-processing method for near-surface temperature forecasts, T2m observations are used. The number of stations included in this study is 1093, whereas the number of locations is larger, approximately 2000. A location is here defined only by its latitude and longitude coordinates, which means that the same station might be counted twice if it has been moved during the period of interest. The temporal resolution of the observational network is 10 min. However, the full temporal resolution of the observations has not been utilized here, since only the observation closest in time to the output time of the NWP model is used.
In order to remove obviously erroneous observations, quality control (QC) is needed. The tests applied in this study consist of a range check, a rate-of-change check and a persistence check (Zahumenský, 2004). Furthermore, a manual inspection of the data was performed for each individual location.
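As an illustration of how such automatic checks might look, the sketch below implements a range check, a rate-of-change check and a persistence check in Python; all thresholds are hypothetical placeholders, not the values used in this study.

```python
import numpy as np

def qc_flags(t2m, t_min=-40.0, t_max=50.0, max_step=10.0, n_persist=12):
    """Return a boolean mask of observations passing three QC tests.
    All thresholds are illustrative assumptions, not the study's values."""
    ok = (t2m >= t_min) & (t2m <= t_max)           # range check
    step = np.abs(np.diff(t2m, prepend=t2m[0]))
    ok &= step <= max_step                          # rate-of-change check
    # persistence check: flag long runs of identical consecutive values
    for i in range(len(t2m) - n_persist + 1):
        window = t2m[i:i + n_persist]
        if np.all(window == window[0]):
            ok[i:i + n_persist] = False
    return ok
```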

| NWP data
The NWP data used in this study are forecast data from the GFS from the National Centers for Environmental Prediction (NCEP). Operational data can be accessed from the NOAA Operational Model Archive and Distribution System (NOMADS; https://nomads.ncep.noaa.gov/) and historical data can be accessed via the National Center for Atmospheric Research (NCAR) Research Data Archive (https://rda.ucar.edu/). The product used in this study is the GFS dataset with a resolution of 0.25° (NCEP, 2015).
GFS is a global non-hydrostatic NWP model with a finite-volume dynamical core. The horizontal spatial resolution is approximately 13 km and in the vertical the model is divided into 64 layers, with the top layer extending up to about 0.2 hPa (NOAA, 2021). A new forecast is issued four times a day, at 00, 06, 12 and 18 UTC. The first few days of the forecast output are hourly, whereafter the output is three-hourly. However, for historical data, output is only saved every 3 h, also for the shorter-range forecasts.
Forty-eight-hour forecasts were extracted for the period April 2019 to December 2020. Since historical data were used, the temporal resolution is 3-hourly. Extracted data include, among others, forecasts of 2 m air temperature and relative humidity, 10 m winds and radiation fluxes. Table 1 shows the full list of variables considered. Nearest-neighbour interpolation was used to estimate the forecast data at the station locations.

| Data pre-processing
To ensure good performance for NN-based ML models it is typically necessary to pre-process the data (Marzban, 2003). In the case of missing observational data, different approaches can be used. One approach is to replace the missing data with the sample average for the variable in question and another is to fill the data gaps using interpolation. The simplest approach is to exclude the missing observations altogether. Here, a combination of data exclusion and temporal interpolation is performed, as sketched below. When only very few data points are missing, the data are temporally interpolated. If longer data gaps exist, these data are excluded. Since the Transformer model uses time series input (see Section 3) it is necessary to split the dataset for each location into subsets when large data gaps exist. Therefore, the total number of subsets used in this study is approximately 5400. The length of each time series differs, from a single month up to more than a year, depending on how many and how long the data gaps in the original location datasets were.
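A minimal sketch of this gap handling, assuming the observations are held in a pandas time series at 3-h resolution and assuming an illustrative threshold of two consecutive missing steps for interpolation:

```python
import pandas as pd

def fill_or_split(series, max_gap=2):
    """Interpolate short gaps (up to `max_gap` consecutive missing steps,
    an assumed threshold) and split the series into subsets at longer gaps."""
    filled = series.interpolate(limit=max_gap, limit_area="inside")
    valid = filled.notna()
    # label contiguous valid runs: the counter increases at every remaining gap
    run_id = (~valid).cumsum()[valid]
    return [run for _, run in filled[valid].groupby(run_id)]
```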
Another important part of the pre-processing step is data normalization or standardization (Kotsiantis et al., 2006). This is a transformation that brings the data onto the same scale, and different methods can be used. Here, we have chosen to standardize the data by subtracting the mean of each variable from each data point and dividing by the standard deviation of the variable in question. However, for some variables, a cyclic encoding is instead performed. Some input variables, such as wind direction, have a finite range and there is often a discontinuity near the edges of the range. Such discontinuities have a negative effect on the training of, for example, NNs (Vamplew & Adams, 1998). In order to overcome this problem, a cyclic encoding of the variable in question needs to be performed. The standard way to encode such a discontinuous variable as wind direction ($\phi_{wd}$) is to transform it into two variables, $v_1$ and $v_2$, by taking the sine and cosine of the data:

$$v_1 = \sin(\phi_{wd}), \qquad v_2 = \cos(\phi_{wd}).$$

The combination of the sine and cosine variables contains the same information as the original discontinuous variable, though both the sine and cosine variables are continuous. A cyclic encoding is therefore performed for the discontinuous input variables.
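For concreteness, the two transformations could be implemented as follows; this is a sketch, assuming wind direction is given in degrees and that the mean and STD are computed on the training dataset:

```python
import numpy as np

def standardize(x, mean, std):
    """z-score standardization; mean and std come from the training dataset."""
    return (x - mean) / std

def encode_wind_direction(phi_wd_deg):
    """Cyclic encoding of wind direction: v1 = sin(phi), v2 = cos(phi)."""
    phi = np.deg2rad(phi_wd_deg)
    return np.sin(phi), np.cos(phi)
```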
The dataset, consisting of forecast and observational data, is split into three subsets in order to ensure that each step is performed on independent data. The training dataset is used to estimate model parameters, the validation dataset is used for model optimization and the test dataset is used for an independent evaluation of the performance of the models. The train, validation and test datasets consist of 2507, 1416 and 1416 location subsets, respectively. Figure 1 shows the spatial distribution of locations for the three datasets. As mentioned previously, the dataset for each individual station was split into subsets depending on the number of data gaps. Therefore, each of these subsets in the train, validation and test datasets contains data from a single PWS, although with differing temporal lengths. This also means that data from one station can be included in all three datasets; however, they then contain data from different time periods. All three datasets include data from the whole period, April 2019-December 2020.

| THE TRANSFORMER MODEL
Sequence-to-sequence (seq2seq) models are models that transform a given sequence of elements into another sequence. Seq2seq models have become state-of-the-art approaches for natural language processing problems (Cho et al., 2014; Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014), and are also widely used for time series forecasting (Cinar et al., 2017). The dominant types of seq2seq models include recurrent neural networks (RNNs; Schmidt, 2019), such as the Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997), convolutional neural networks (CNNs; O'Shea & Nash, 2015), and the more recent Transformer model (Vaswani et al., 2017). The post-processing model used here is based on the Transformer model, with the implementation based on the model developed by Radford et al. (2018). In this section, a description of the Transformer model used in this study is presented. For a more complete description of the general Transformer model, see Vaswani et al. (2017).
To put it simply, the Transformer model compares the elements in an input sequence, learns how they influence each other and uses this to predict the next element in the sequence. A flow diagram of the Transformer model used in this study is shown in Figure 2. First, three different linear projections are applied to the input data. This means that the input data are projected to three different spaces using different parameters, where the new projections are called queries, keys and values. As the purpose is to calculate how the input data in the sequence influence each other, it is necessary that the queries, keys and values are initialized differently; otherwise, the largest contribution would always come from comparing an input element with itself. The queries, keys and values can be thought of in terms of how a retrieval is done in a database. In a database retrieval, we have tabular data, where each value corresponds to a key. When a query is made to the database, it is compared to a set of keys. The best-matching key is then selected and the output is the associated value. In the case of the Transformer model, a form of weighted or proportional retrieval is performed, based on a set of weights describing how similar the keys are to the query (i.e., how similar the elements in the input sequence are to each other).
Next is the so-called self-attention mechanism, which is the basis of the Transformer model. Attention, first proposed by Bahdanau et al. (2014), is the mechanism of finding dependencies between input and output sequences (general attention) or between elements of the input sequence itself (self-attention). In the attention block, a comparison is performed between the queries and keys in order to calculate how the elements in the input sequence influence each other, that is, self-attention is applied. The comparison is performed using a scaled dot product, from which a similarity measure can be obtained. By applying this similarity measure to the values, the attention to each of the input elements based on the other input elements is obtained. In other words, the self-attention mechanism decides, at each prediction step, which other parts of the input sequence are important for the prediction. The attention calculation can be performed in parallel with different sets of parameters, or so-called heads, which is why it is called multi-head attention. This means that the different query-key-value sets in each attention head can learn different relationships.
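A minimal single-head sketch of this mechanism is given below; the actual model uses multiple heads in parallel, each with its own projection matrices, and the variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention (sketch). x: (seq_len, d_model);
    w_q, w_k, w_v: (d_model, d_k) projections with distinct parameters."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # queries, keys, values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # scaled dot-product similarity
    weights = F.softmax(scores, dim=-1)           # how much each element attends
    return weights @ v                            # weighted retrieval of values
```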
The attention mechanism has no concept of the order of the input sequence, as it only weights the elements in the input sequence without any notion of their order. Therefore, a positional encoding of the input data is used to give the model information about the relative or absolute position of the elements, so that the model is able to use the order of the input sequence.
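The text does not specify which encoding is used here; a common choice is the sinusoidal encoding of Vaswani et al. (2017), sketched below for an even model dimension:

```python
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (assumed scheme; Vaswani et al., 2017).
    Assumes d_model is even. Returned tensor is added to the projected inputs."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions
    return pe
```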
The multi-head attention is used together with a fully connected, feed-forward NN. The role of the feed-forward NN is to add further non-linearity and to increase the capacity of the model. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are applied in order to stabilize the training process. Finally, a linear layer is applied in order to obtain the output of the model.
The configuration of the Transformer model can be described by its hyperparameters. In the Transformer model developed here, we have used two parallel attention blocks. Furthermore, one layer is used, where a layer consists of a multi-head attention block followed by a feed-forward network. An initial learning rate of $10^{-3}$ is used with a standard decaying learning rate schedule. A dropout rate of 0.1 is used for regularizing the model. Furthermore, early stopping is also applied in order to prevent the model from overfitting.
The input data to the Transformer model consist of time series of both observational and forecast data. The Transformer model is given a time series of length 5 days with a fixed number of input features, and this is then used to predict the 2 m temperature for the next 48 h. Since a time series is used, not all lead times from all forecast cycles are used. Only the newest available forecasts are used to construct the time series input. This means that only lead times +3 h and +6 h from the forecasts from the previous 5 days are used. The length of the input time series was investigated. It was found that lengths shorter than 5 days did not contain enough information to obtain good results for the whole prediction period. Lengths longer than 5 days, on the other hand, did not improve the predictions. Therefore, a length of 5 days for the input time series was chosen.
The forecast data in the input series have been shifted one step in time, whereas the observational data have not been shifted. As an example, the model is given a time series input with a length of 5 days with both observational and forecast data. The observational data are from 12 UTC 3 January 2021 to 09 UTC 8 January 2021. The forecast data features, on the other hand, have been shifted in time so that the valid times for the forecast data are from 15 UTC 3 January 2021 to 12 UTC 8 January 2021. This is illustrated in the "input period" part of Figure 3. The temporal shift is performed in order to always use the forecast data for the time step we are predicting. The shift is only performed for the forecast data, since we do not have the observation for the time step we are predicting (it is what we want to predict).
The time series data are used to predict the observed 2 m air temperature one time step ahead. The temporal resolution is the same as the output frequency of the NWP model, namely 3 h. Taking the previous example, this means that the first prediction is valid at 12 UTC on 8 January 2021. For the next prediction step, the last predicted observed T2m and the next forecast lead time in the current forecast cycle are included, whereas the first values in the time series input are excluded. This is done in order to always have the same length of the input time series (5 days) for all prediction steps. The model is therefore auto-regressive, which means that it uses the previously generated element in the output sequence as input for the next prediction step. This part is illustrated in the "prediction period" part of Figure 3.
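The sliding-window logic can be sketched as follows; the `model` callable and the array layout are assumed interfaces for illustration, not the study's actual code:

```python
import numpy as np

WINDOW = 5 * 8  # 5 days at 3-h resolution = 40 time steps

def predict_48h(model, obs_t2m, fcst_features, n_steps=16):
    """Sliding-window auto-regressive prediction (sketch).
    obs_t2m:       (WINDOW,) observed T2m over the 5-day input period.
    fcst_features: (WINDOW + n_steps, n_feat) forecast features, already
                   shifted so each row is valid at the step being predicted."""
    obs = obs_t2m.copy()
    preds = []
    for step in range(n_steps):                           # 16 x 3 h = 48 h
        x = np.concatenate([obs[:, None],
                            fcst_features[step:step + WINDOW]], axis=1)
        t2m_hat = model(x)                                # one step ahead
        preds.append(t2m_hat)
        obs = np.append(obs[1:], t2m_hat)                 # drop oldest, append prediction
    return np.array(preds)
```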

| FEATURE SELECTION
Table 1 shows all variables considered as input features to the Transformer model. A feature importance analysis was performed using gradient backpropagation, similar to the analysis performed by Rußwurm and Körner (2020). The aim of the feature importance analysis is to reduce the number of input features and select only the most important ones, in order to reduce the complexity of the model and decrease the risk of overfitting (Goodfellow et al., 2016).
The basis for the gradient backpropagation feature importance method is that ML models are differentiable functions that approximate a mapping from an input to an output. During the training process, gradients of an error function (also called a cost or loss function) are backpropagated in order to update the model weights. The main idea behind gradient backpropagation feature importance is that the gradients can be backpropagated through the entire model, all the way back to individual elements of the input. This means that the gradients can be backpropagated back to specific times and features in the input time series. Therefore, these gradients give an indication of how each element in the input data affects the prediction. Vanishing gradients indicate that a change in the input does not affect the prediction. Positive or negative gradients, on the other hand, indicate that a change in the corresponding input element will have an impact on the prediction.
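In a framework with automatic differentiation, such as PyTorch, this analysis reduces to a few lines; the sketch below is illustrative and assumes a model that maps a batch of input time series to scalar predictions:

```python
import torch

def gradient_importance(model, x):
    """Gradient-backpropagation feature importance (sketch).
    x: (batch, seq_len, n_features). Returns the mean absolute gradient
    of the prediction with respect to each input element."""
    x = x.clone().detach().requires_grad_(True)
    y_hat = model(x)                    # (batch, 1) predictions
    y_hat.sum().backward()              # backpropagate through the whole model
    return x.grad.abs().mean(dim=0)     # (seq_len, n_features)
```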
A version of the Transformer model was trained using all input features listed in Table 1, except day of year and hour of the day. Neither of these auxiliary variables was included, since early results indicated that the use of these two variables led to model overfitting due to the training dataset being relatively small in terms of temporal extent. This means that forecast errors which depend on either the hour of the analysis or on the valid hour of the forecast are not considered. The gradient backpropagation analysis for this model was evaluated using the training dataset. Figure 4 shows the feature importance, as given by the magnitude of the gradients averaged over all predictions and stations, as a function of the index in the input time series. It should be noted that a logarithmic scale has been used for the ordinate.
The last, that is, the most recent, element in the input time series is by far the most important. Thereafter, high importance is seen for time series elements from previous days that have the same valid time as the valid time of the prediction. As an example, if the prediction is valid for 09 UTC 8 January 2021, then the indices in the input time series that correspond to 09 UTC on the previous days, that is, January 3-7, are also more important (i.e., indices 1, 9, 17, 25 and 33). Furthermore, the importance decreases with increasing temporal distance to the prediction; for example, the elements from 09 UTC 6 January 2021 are less important than the corresponding elements from 09 UTC 7 January 2021.

[Figure 3: Illustration showing the steps in the Transformer model for post-processing forecasts. For simplicity, only the observed (blue) and forecast (red) T2m are shown. The "input period" shows the input time series used for the first prediction step (t0). Here, the shift in the forecast data is evident, showing that the +3 h forecast is used together with the current observation. In the "prediction period" the input data are used to forecast the observed T2m one step at a time. Since the Transformer model is auto-regressive, the latest prediction is used for the next prediction step. This means that a sliding window is used in order to always use a time series with a length of 5 days, such that the oldest observation and forecast are discarded in the next prediction step and the new prediction and the next forecast value are included instead.]
The most important feature is the observed 2 m temperature, which is expected since this is what the model is to predict. The mean sea level pressure is the second most important feature, followed by the net short-wave radiation flux, the NWP 2 m temperature and the 850 hPa geopotential height. Some features have next to no impact on the prediction, such as the vertical velocity at 700 hPa, total cloud cover and total precipitation. Based on the estimated feature importance, several different feature combinations were tested in order to select the optimal input features. The final input features selected are marked in italics in Table 1.

| BENCHMARKING
To evaluate the Transformer model, its performance is compared against the raw GFS forecast, as well as two standard benchmark post-processing models: a linear regression model and a feed-forward, fully connected and non-temporal NN. The LR is a simple standard method for post-processing, and a comparison against it evaluates the benefits of using more complex, non-linear methods, whereas the comparison against the NN evaluates the benefits of the additional complexity added to the Transformer model, such as the use of self-attention to learn the temporal dynamics of the data.

| Linear regression
A linear regression (LR) model is used to linearly relate the predictand $Y$ (also called the dependent variable, that is, the output variable, which here is the observed 2 m temperature) to the predictors $X$ (also called regressors, i.e., the input variables):

$$\hat{Y} = XB,$$

where $\hat{Y}$ are the predictions from the regression model and $B$ are the regression coefficients, which are obtained by minimizing the error of the predictions of $Y$ given $X$ (Wilks, 2011). Here, the least-squares method is used,

$$B = \left(X^{T} X\right)^{-1} X^{T} Y,$$

where the superscript $T$ indicates the transpose and $-1$ the inverse. The LR model is used to predict the observed 2 m temperature using the same input variables as the Transformer model, except that it does not use the observed 2 m temperature. This is because the LR is not an auto-regressive model and therefore only the current observation is available. Instead, forecast lead time is included as an input variable in order to add a temporal dependence.
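As a concrete reference, the least-squares solution above can be computed directly; this is a generic sketch with a hypothetical helper `fit_lr`, not the study's implementation:

```python
import numpy as np

def fit_lr(X, Y):
    """Least-squares coefficients, B = (X^T X)^{-1} X^T Y (sketch).
    X: (N, p) predictors including a leading column of ones for the intercept;
    Y: (N,) observed 2 m temperature."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Post-processed forecast for new predictors X_new:
#   Y_hat = X_new @ fit_lr(X_train, Y_train)
```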

| Neural network
The other benchmark post-processing model considered is a feed-forward, fully connected NN. Here, a brief overview of the NN used in this study is given. For a more detailed and comprehensive description of NNs, see, for example, LeCun et al. (2015), Nielsen (2015) and Goodfellow et al. (2016).
An NN predicts a target value y given some input data x by applying a mapping f to x, such that f(x) = y. It consists of a sequence of layers, where each layer consists of one or more so-called neurons. An example of an NN with three layers is shown in Figure 5. The left-most and first layer is the so-called input layer, through which the NN receives its input data, and the right-most and last layer is the output layer, through which the final prediction of the NN is given. Any layer between the input and output layers is called a hidden layer. The connections between the neurons in the different layers are associated with weights, which represent the importance given to the corresponding input data and how large an impact they have on the final output (Bishop & Nasrabadi, 2006). In addition, an activation function is used for each connection to add non-linearity to the model and to allow for variable importance and a switch-like behaviour in how the model responds to input variables.

[Figure 5: Example of a neural network with an input layer consisting of three input neurons, one hidden layer with five neurons and an output layer consisting of one neuron.]
The neurons can be considered information-processing units, as they receive input, process it and produce an output. Consider a neuron $k$ in the $n$th hidden layer, which receives $m$ inputs, $x_1, x_2, \ldots, x_m$, where each input has a weight associated with it, $w_{k1}, w_{k2}, \ldots, w_{km}$. The inputs are linearly combined, weighted by their respective weights (Bishop & Nasrabadi, 2006):

$$u_k^{(n)} = \sum_{j=1}^{m} w_{kj}^{(n)} x_j + b_k^{(n)},$$

where $b_k$ is called the neuron's bias. The output of the neuron is then given by (Bishop & Nasrabadi, 2006)

$$y_k^{(n)} = \varphi\left(u_k^{(n)}\right),$$

where the superscript $(n)$ refers to the $n$th hidden layer and $\varphi(\cdot)$ is the activation function.
Here we have used a simple feed-forward, fully connected NN. It consists of an input layer with 15 input features, which are the same as for the LR, one hidden layer with 60 neurons and an output layer with one neuron, as we only have one output: the predicted T2m. The activation function used is the rectified linear unit (ReLU) function.
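This architecture can be written compactly, for example in PyTorch; the dropout rate below is an assumption, as the text states that dropout is used for the NN but not its value:

```python
import torch.nn as nn

# Sketch of the benchmark network: 15 input features, one hidden layer with
# 60 ReLU neurons, one output (the predicted T2m). Dropout rate is assumed.
benchmark_nn = nn.Sequential(
    nn.Linear(15, 60),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(60, 1),
)
```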
It is important to take measures against model overfitting, in which case the model learns to predict only the training data well and performs poorly on unseen data, that is, the generalization ability of the model becomes poor (Goodfellow et al., 2016). To ensure that the NN does not overfit, both early stopping (Bishop & Nasrabadi, 2006) and dropout (Goodfellow et al., 2016) are used.

| Verification metrics
The post-processing models are used to post-process GFS T2m forecasts. The aim is to produce forecasts that better match the locally observed weather and it is therefore assumed that the observations represent the truth. This means that the aim is to forecast the observed temperature, disregarding possible observational biases. Therefore, the performance of the post-processing models, and of the raw GFS forecast, is evaluated using the observed temperatures as the truth. The following verification metrics are used to assess performance (Wilks, 2011):

$$\mathrm{Bias} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{\mathrm{fcst},i} - y_{\mathrm{obs},i}\right),$$

$$\mathrm{STD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\mathrm{fcst},i} - y_{\mathrm{obs},i} - \mathrm{Bias}\right)^{2}},$$

where $y_{\mathrm{fcst},i}$ and $y_{\mathrm{obs},i}$ denote the $i$th forecast and observational data, respectively, and $N$ is the number of forecast-observation pairs. Additional verification is based on the following metrics:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_{\mathrm{fcst},i} - y_{\mathrm{obs},i}\right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\mathrm{fcst},i} - y_{\mathrm{obs},i}\right)^{2}},$$

$$r = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\mathrm{fcst},i} - \overline{y}_{\mathrm{fcst}}\right)\left(y_{\mathrm{obs},i} - \overline{y}_{\mathrm{obs}}\right)}{\sigma_{\mathrm{fcst}}\,\sigma_{\mathrm{obs}}},$$

where $\overline{y}_{\mathrm{fcst}}$ and $\overline{y}_{\mathrm{obs}}$ denote the averages and $\sigma_{\mathrm{fcst}}$ and $\sigma_{\mathrm{obs}}$ denote the STDs of the forecast and observations, respectively. Results based on these metrics have been added as Supporting Information.
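For reference, all five metrics can be computed from a pair of forecast and observation arrays; a minimal sketch:

```python
import numpy as np

def verification_metrics(y_fcst, y_obs):
    """The five verification metrics defined above (sketch)."""
    err = y_fcst - y_obs
    bias = err.mean()
    std = err.std()                        # STD of the forecast error
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r = np.corrcoef(y_fcst, y_obs)[0, 1]   # Pearson correlation
    return bias, std, mae, rmse, r
```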

| RESULTS
The post-processing models are applied to the data in the test dataset in order to post-process GFS T2m forecasts. Figure 6 shows the lead time summary statistics, with results averaged over all stations in the test dataset, for the overall bias, with the average calculated from the individual station biases in absolute values (mean(|bias_station|)), and the STD of the forecast minus observed T2m as a function of lead time. The raw GFS forecast has an average absolute bias and STD in the range of 0.6-0.7°C and 1.6-1.8°C, respectively. Both the LR and NN improve the forecasts and show similar performance to each other, with biases around 0.4°C and STDs in the range of 1.3-1.6°C, with the NN performing slightly better than the LR. The Transformer model performs the best and reduces the bias to 0.1-0.2°C, with the largest reduction seen for the shorter lead times. Similarly, the average STD is reduced to 0.8-1.3°C when using the Transformer model. Again, the largest improvement is seen for the shorter lead times. The corresponding results for the MAE, RMSE and correlation verification metrics are shown in Figures S1-S3, which show similar results to those presented in Figure 6.
A zigzag type of pattern can be seen in Figure 6, most notably for the bias of the raw GFS forecast. This zigzag pattern arises due to a difference in diurnal amplitude between the forecast and observations. Since the verification statistics are presented in a lead time format, the difference in diurnal amplitude between forecast and observations for the different forecast cycles will produce this kind of zigzag pattern.
In order to assess the performance of the models with respect to spatial dependencies, the geographical distributions of bias and STD are shown in Figures 7 and 8. The performance was calculated seasonally, since the locations contain data from different time periods and different seasons might have different error statistics. Each circle indicates a location and the colour represents the overall seasonal bias or STD, calculated over all lead times. The raw GFS forecast has mainly a warm bias for northern hemisphere winter (December-February) and autumn (September-November) months, whereas a cold bias is seen for most stations for northern hemisphere spring (March-May) and summer (June-August) months. Overall, the magnitude of the raw GFS forecast bias is largest for autumn and winter months and smaller for spring and summer months. No clear spatial pattern can be seen.
The LR and NN show very similar performance to each other. For winter months, they generally reduce the bias compared to the raw forecast; however, for stations where the GFS has a relatively small warm bias, the LR and NN tend to reduce the bias too much, such that they instead have a cold bias for these stations. For both spring and summer months, neither the LR nor the NN reduces the bias significantly and they, like the raw GFS forecast, have a cold bias for most stations. For autumn months, both models reduce the warm bias of the raw GFS forecast and most stations produce forecasts with a smaller bias. However, similar to the performance for winter months, for stations where the raw GFS forecast only has a slight warm bias, they instead have cold biases. Generally, the Transformer model reduces the bias of the raw GFS forecast for all seasons, with the largest improvements seen for autumn and winter months. It also performs significantly better than the other two post-processing methods. Furthermore, the seasonal difference in bias for the Transformer model is much smaller compared to that of the raw GFS forecast.

The raw GFS forecast has a higher STD for spring and summer months and a lower STD for autumn and winter months. A geographical pattern in the STD of the raw GFS forecast can be seen for spring and summer months: higher STDs are seen for stations located closer to western and southern coasts. This is most evident during summer months for locations in eastern Denmark (see Figure 8). The LR and NN models clearly reduce the STD for spring and summer months, mainly for stations with a relatively higher STD. For winter and autumn months, however, only a slight improvement can be seen for a few stations. Overall, the Transformer model reduces the STD of the raw GFS forecast for all seasons for a vast majority of stations. The largest improvement with respect to STD is seen in spring and summer months, where the highest STDs were seen for the raw forecast.
The corresponding results for the spatial distribution of MAE, RMSE and correlation are shown in Figures S4-S6. Here, similar results are seen; all three post-processing models improve upon the raw GFS forecast, with the Transformer model showing the largest improvement.
Figure 9 shows a Taylor diagram, which demonstrates how well the forecast and observation patterns match each other based on their correlation, centred root mean square error (CRMSE) and STD (Taylor, 2001). Correlation and STD are calculated for all models and the observations. The CRMSE of each of the models is calculated using the observations as the truth and is given by the distance from the point marked by a yellow square (i.e., the observations). The STD and correlation are plotted using polar coordinates, where the STD is given by the radial distance from the origin. This means that a modelled pattern that agrees well with the observed pattern will have a high correlation and a low CRMSE, as well as an STD close to that of the observations (i.e., close to the solid yellow curve). It should be noted that the Taylor diagram only provides information about centred pattern errors, since the means of the models have been subtracted.

[Figure 9: Taylor diagram showing results for the raw GFS forecast (red), the Transformer model (blue), the linear regression model (LR, black) and the neural network (NN, magenta). The standard deviation and correlation are plotted using polar coordinates, where the standard deviation is given by the radial distance from the origin (dotted circular lines) and the correlation is given by the azimuthal position (dashed straight lines). The CRMSE is given by the distance between the model and the PWS observations (solid circular lines), marked by a yellow square.]
The raw GFS forecast has a smaller STD than the observations, indicating that it has too little variability compared to the observations. All three post-processing models lie closer to the observations than the raw GFS forecast, with the NN and Transformer model having STDs closest to the observations. This indicates that the variability in the forecasts produced by these models is more similar to the actually observed variability. Furthermore, all three models have a higher correlation and a smaller CRMSE than the raw GFS forecast. Out of the three post-processing models, the Transformer model shows the best performance with respect to correlation and CRMSE, followed by the NN.

| Validation on observational data from SYNOP stations
All three datasets (train, validation and test) contain forecast data and observations from the same period, April 2019-December 2020. However, none of them contains observational data from the same location for the same time period, as described in Section 2.3. Note that, due to the spatial density of stations and the relatively coarse resolution of the forecast data, stations that are located relatively close to each other use the same forecast data. If these stations are distributed among the three datasets, the analysis of the post-processing models performed on the test subset might not be performed on completely independent forecast data. To evaluate the models' performance on completely independent forecast data, GFS forecast data from the period January-December 2018 have been used. Furthermore, the main disadvantage of crowd-sourced observations from PWSs is, as mentioned, that the sensors usually are not as well-calibrated as the instruments used in official weather station networks (Bell et al., 2013, 2015). A thorough bias check for each individual station in the PWS network has not been performed, and the actual bias of each individual station is therefore unknown. Furthermore, the bias may vary among stations, which means that a bias check for an individual station will not tell us anything about the bias of another station. In addition, it is not possible to compare the PWS observational data with forecasts from a high-resolution NWP model, for example, the DMI-HARMONIE NWP model (Bengtsson et al., 2017; Yang et al., 2017), in order to determine the individual station bias. This is partly because the NWP model has its own model bias and partly because of representativity errors; the NWP model and the observations do not necessarily represent the same area, even more so for PWSs, which do not have to follow the WMO location standards (WMO, 2018). Therefore, to test whether the post-processing models have learned to correct the actual NWP model error and not just possible instrument errors of the PWSs, observational data from the DMI SYNOP network are used to evaluate the performance of the models. It should be noted that these stations might also have representativity errors and instrument biases, although these are assumed to be smaller than for PWSs. Therefore, observational data from 50 DMI SYNOP stations and GFS forecast data from 2018 are used to assess the performance of the three post-processing models.
The spatial distributions of seasonal bias and STD are shown in Figures 10 and 11. The corresponding results for the spatial distribution of MAE, RMSE and correlation are shown in Figures S7-S9. Again, it can be seen that, in general, the raw GFS forecast has a warm bias for winter and autumn months and a cold bias for spring and summer months. For winter months, the Transformer model reduces the bias for stations where the raw GFS forecast exhibits a relatively large bias. For stations for which the raw GFS forecast has only a very small bias, the Transformer model tends to perform slightly worse, with a slightly larger warm bias compared to the raw forecast. A similar pattern is seen for the LR and NN models.
For both spring and autumn months, the Transformer model exhibits mainly a small warm bias for most stations. For these months, the Transformer model tends to perform better than the raw GFS forecast and it reduces the bias for most stations. The LR, on the other hand, has a cold bias for most stations during spring months, whereas the NN has a warm bias for most stations. The improvement compared to the raw GFS forecast, however, is only very slight. For autumn months, on the other hand, the LR and NN reduce the bias for most stations. For summer months, the LR and NN perform only slightly better than the raw GFS forecast for a few stations, whereas they perform slightly worse for others. The Transformer model, on the other hand, reduces the bias compared to the raw GFS forecast for a majority of stations for summer months. No clear spatial pattern in the distribution of bias can be seen for any of the models in any of the seasons.
Generally, the raw GFS forecast has a higher STD for spring and summer months and a lower STD for winter and autumn months. No spatial pattern can be seen in the geographical distribution of STD for the raw GFS forecast. The same seasonal pattern is seen for all three post-processing models, with higher STDs for spring and summer months and lower STDs for winter and autumn months. For winter months, only a slight difference in performance with respect to STD between the raw GFS forecast and the Transformer model is evident in Figure 11. For most stations, the performance is comparable, whereas for some stations the Transformer model performs better. For some coastal stations, however, the raw GFS forecast has a much lower STD than the Transformer model. The LR performs similarly to the Transformer model for winter months, whereas the NN performs slightly worse, also compared to the raw GFS forecast. Similar results are seen for autumn months; the Transformer model and LR show a slight improvement for a few stations compared to the raw GFS forecast, whereas the NN performs slightly worse. For both spring and summer months, the Transformer model generally exhibits lower STDs and performs better than the raw GFS forecast. However, the same dependence on distance to the coast is observed for the Transformer model for these months. Similar results are seen for the other two post-processing models.
Figure 11 showed a spatial dependence of the STD of the post-processing models, especially the Transformer model. More specifically, a worse performance for coastal stations could be seen. Most of the PWSs are located on or in the vicinity of fields and are therefore not located close to the coast and cannot be regarded as coastal stations. A possible explanation for the worse performance for coastal stations is therefore that the Transformer, LR and NN models have not seen any data from coastal stations during the training period. Due to the relatively coarse resolution of the GFS model, the nearest grid cell for some of the coastal stations might have too much marine influence. Furthermore, the wind direction can have a large impact on the difference between the forecast and observed T2m. For example, on-shore winds would affect the forecast differently compared to off-shore winds. The wind direction is used as an input feature for the models. However, in order to include the land/ocean effect for coastal stations, it would be necessary to construct a variable which includes information about whether the wind direction is on- or off-shore. Another option is to add an input feature such as distance to the coast or a topography-related feature. However, in either case, data from coastal stations would need to be included in the training dataset in order to improve the performance for coastal stations. The coastal DMI stations were included in the geographical analysis in order to investigate any spatial dependencies in the performance of the post-processing models. In subsequent analyses, the coastal DMI stations have been excluded in order to investigate the performance of the models on data that are more representative of what they were trained on.
Table 3 shows the overall performance of the models as validated using observations from DMI SYNOP stations, with results shown using both all stations and non-coastal stations only. The LR exhibits the largest bias for both all stations and non-coastal stations only: −0.21 and −0.24°C, respectively. The raw GFS forecast and the Transformer model have similar biases, whereas the NN model has biases close to zero. The NN, on the other hand, has the largest STD when evaluated using all stations, 1.78°C, even higher than the raw GFS forecast, which is otherwise the model with the highest STD. The Transformer model has the smallest STD, 1.49°C, followed by the LR, 1.62°C. For non-coastal stations, all three post-processing models reduce the STD compared to the raw GFS forecast, with the Transformer model performing the best, followed by the LR. Similar results are seen for the MAE, RMSE and correlation verification metrics.
Figure 12 shows the seasonal lead time summary statistics, with results averaged over all non-coastal DMI stations, for the overall bias, with the average calculated from the individual station biases in absolute values (mean(|bias_station|)), and the STD of the forecast minus observed T2m as a function of lead time. The corresponding lead time statistics based on MAE, RMSE and correlation are shown in Figures S10-S12, where similar results to those presented here can be seen.
The raw GFS forecast has the largest bias for autumn months and the highest STD is seen for spring and summer months. The Transformer model reduces the bias for all four seasons, with the largest improvements seen for autumn months and the smallest improvements seen for winter months. The LR and NN also reduce the bias, especially for autumn months, with smaller improvements seen for winter and spring months, whereas no clear improvement is seen for summer months. The performance of the post-processing models with respect to STD is seasonally dependent. For spring and summer months, the Transformer model reduces the STD and performs better than the GFS forecast for all lead times, with the largest improvements seen for the shorter lead times. For autumn and winter months, the Transformer model only outperforms the raw GFS forecast for the very short lead times, whereas for the longer lead times their performances are comparable. For the NN and LR, on the other hand, the largest improvement in STD is seen in spring months. For the other three seasons, only a very slight improvement compared to the raw GFS forecast is seen for the LR. The NN, on the other hand, performs slightly worse than the raw GFS forecast, especially for winter months.
A Taylor diagram is shown in Figure 13. Both the raw GFS forecast and the Transformer model have a smaller STD compared to the observations, which indicates that they underpredict the observed variability. However, the Transformer model is closer to the observations. The LR and NN, on the other hand, have STDs more similar to that of the observations. This means that the variability of the forecasts produced by these two models is closer to the observed variability.

[Figure 13: Taylor diagram showing results for the raw GFS forecast (red), the Transformer model (blue), the linear regression model (LR, black) and the neural network (NN, magenta). The standard deviation and correlation are plotted using polar coordinates, where the standard deviation is given by the radial distance from the origin (dotted circular lines) and the correlation is given by the azimuthal position (dashed straight lines). The CRMSE is given by the distance between the model and the observations (solid circular lines), marked by a yellow square. The results are based on observations from non-coastal DMI stations.]

| DISCUSSION

It was shown that the Transformer model outperforms the raw GFS forecast, as well as the other two post-processing models, when applied to the test dataset, with respect to both bias and STD, as well as MAE, RMSE and correlation. The better performance compared to the two benchmark post-processing models verifies the benefits of using a more complex, non-linear post-processing model, as well as the benefits of using self-attention to learn and include the temporal dynamics of the data. In addition, the Transformer model performs well on data from a period it has never seen before, as well as on an independent observational data source. Here, too, it is better than both the LR and NN models. This shows that the Transformer model has learned to correct actual NWP model errors and not just possible instrument biases.
A larger improvement for the Transformer model is seen for the shorter lead times.The reason for this is most likely that the Transformer model is a sequential model, meaning that it predicts the observed T 2m one- F I G U R E 1 3 Taylor diagram showing results for the raw GFS forecast (red), the Transformer model (blue), the linear regression model (LR, black) and the neural network (NN, magenta).The standard deviation and correlation are plotted using polar coordinates, where the standard deviation is given by the radial distance from the origin (dotted circular lines) and the correlation is given by the azimuthal position (dashed straight lines).The CRMSE is given by the distance between the model and the observation (solid circular lines), which is marked by a yellow square.The results are based on observations from non-coastal DMI stations.CRMSE, centred root mean square error; DMI, Danish Meteorological Institute; GFS, global forecast system; LR, linear regression; NN, neural network.
time step ahead.Furthermore, since the Transformer model is an auto-regressive model, the predicted temperature from the previous prediction step is included in the input time series for the next prediction step as the latest observed T 2m .As seen in Figure 4, and as mentioned in Section 4, the observed T 2m for the last, that is, the most recent, entry in the input time series has the largest impact on the prediction.Any error in the predicted temperature will therefore carry over to the next prediction and the error will therefore increase with increasing lead time.Furthermore, during training, the model is given a time series of data consisting of observational and forecast data for a 5 day period in order to predict only one time step ahead.Since the model input is a time series, only one value for each time is included.This means that not all lead times from all forecast cycles are used in the construction of the input time series.The input time series is constructed using only the newest available forecast, meaning that only lead times +3 h and +6 h are used.Therefore, the model has only been trained on lead times +3 h and +6 h and has not seen any forecast data for longer lead times.When the Transformer model is used for prediction, an input time series of forecast and observational data for a 5 day period is used.For the first prediction step, this time series consists of forecast data with lead times +3 h and +6 h only, exactly as in the training scenario.However, when the first two time steps have been predicted, that is, when predictions for +3 h and +6 h have been made, forecast data for lead times +9 h and up to +48 h are used as input in the time series in order to predict T 2m up to 48 h ahead.NWP models usually have lead time-dependent errors and due to the chaotic nature of the atmosphere, small errors in the initial conditions and model shortcomings lead to nonlinear error growth, which results in an increase in the model errors with increasing lead time (Lorenz, 1963).This might therefore introduce additional errors for the longer lead times, which might affect the performance of the Transformer model, and it likely contributes to the smaller improvement seen for the longer lead times.The validation of the independent SYNOP data showed a good performance for all post-processing models, especially for the Transformer model, for all four seasons with respect to bias.The performance with respect to STD, on the other hand, is best for spring and summer months, where the Transformer model outperforms both the raw GFS forecast and the benchmark post-processing models.However, for autumn and winter months an improvement in STD is only seen for the shorter lead times, whereas for the longer lead times the performance of the Transformer model and the raw GFS forecast is comparable.Both the LR and NN also have similar STDs to the raw GFS forecast, with no clear improvement seen.The corresponding analysis for validation on the PWS dataset (not shown) shows the same tendency.To determine if the performance of the Transformer model for autumn and winter months is due to a temperature dependence, the bias and STD of the T 2m error as a function of lead time and observed T 2m were calculated (not shown).For winter months, the results for the STD showed a clear dependence on temperature with worse performance for cold temperatures (<3 C) and longer lead times.The corresponding results for autumn months showed a worse performance for longer lead times in general, but especially for 
The validation on the independent SYNOP data showed a good performance for all post-processing models, especially for the Transformer model, for all four seasons with respect to bias. The performance with respect to STD, on the other hand, is best for the spring and summer months, where the Transformer model outperforms both the raw GFS forecast and the benchmark post-processing models. For the autumn and winter months, however, an improvement in STD is only seen for the shorter lead times, whereas for the longer lead times the performance of the Transformer model and the raw GFS forecast is comparable. Both the LR and the NN also have STDs similar to the raw GFS forecast, with no clear improvement seen. The corresponding analysis for validation on the PWS dataset (not shown) shows the same tendency.

To determine whether the performance of the Transformer model for the autumn and winter months is due to a temperature dependence, the bias and STD of the T2m error as a function of lead time and observed T2m were calculated (not shown). For the winter months, the results for the STD showed a clear dependence on temperature, with worse performance for cold temperatures (< 3 °C) and longer lead times. The corresponding results for the autumn months showed a worse performance for longer lead times in general, but especially for colder (< 5 °C) and warmer (> 15 °C) temperatures. For bias, the Transformer model performs worse for cold temperatures (< −5 °C) in the winter months and for temperatures below freezing at the longer lead times in the autumn months.
To further investigate any temperature dependence of the Transformer model, additional validation analyses were performed in which the Transformer model was used to predict T2m using GFS and DMI observational data from the period January-April 2021. A worse performance, with respect to both bias and STD, was seen for the Transformer model for February. February 2021 was very cold in Denmark, with several days where the temperature did not rise above freezing and with temperatures down to −20 °C (DMI, 2021b). Neither 2019 nor 2020 were years during which such cold temperatures were observed in Denmark (DMI, 2021a). This means that the Transformer model has not seen temperatures this cold before, which is most likely why it has difficulties in predicting the T2m for February 2021. In cold winter situations, such as those described here, extremely low local temperatures may occur, associated with very stable and shallow boundary layers. Models often have difficulties in correctly reflecting the weather situation in these circumstances, that is, a clear sky, calm to low wind conditions and a lack of turbulent mixing. The fact that the cloud cover is given low weight as a predictor in the Transformer model is most likely due to the few learning cases of this type.
The bias and STD of the T2m error as a function of lead time and observed T2m were calculated for the 2021 period as well (not shown). This showed that the Transformer model performed better with respect to bias than the raw GFS forecast for all lead times and temperatures, except for the very cold temperatures. For STD, on the other hand, a worse performance was seen mainly for temperatures in the range 1-6 °C. For the cold temperatures, a smaller STD was observed for the Transformer model compared to the raw GFS forecast. However, a small STD does not necessarily indicate better performance, especially if the bias is large.
Hence, it is clear that the Transformer model has a dependence on observed T2m, with worse performance for colder temperatures.
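The binned error analysis referred to above can be expressed compactly. The following is a minimal sketch using hypothetical column names and synthetic numbers, not the study's verification code:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format error table; column names are assumptions.
df = pd.DataFrame({
    "lead_time": np.tile(np.arange(3, 51, 3), 500),      # +3 h .. +48 h
    "t2m_obs":   np.random.default_rng(1).normal(8, 6, 8000),
    "error":     np.random.default_rng(2).normal(0.2, 1.5, 8000),
})

# Bin observed T2m into 2 degC classes and aggregate the forecast error
# per (lead time, temperature) bin, as in the analysis described above.
df["t2m_bin"] = pd.cut(df["t2m_obs"], bins=np.arange(-20, 32, 2))
stats = (df.groupby(["lead_time", "t2m_bin"], observed=True)["error"]
           .agg(bias="mean", std="std", n="count"))
print(stats.head())
```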
In order to allow the Transformer model to predict values that it has not seen before, and thereby alleviate the temperature dependence, a change of normalization was tested. Instead of using the standardization method, with mean and STD values from the training dataset, min/max normalization was used, with the minimum and maximum values taken from a plausible range of values for the variable in question. This allows the model to extrapolate what it has learned to ranges it has never seen before. However, the results were not improved by the change of normalization method. Most likely, other relationships and variables are more important for these very cold temperatures.

Furthermore, it is well known that ML models have a hard time predicting extreme values (Ribeiro & Moniz, 2020). If the training dataset does not include enough examples for a certain range, for example cold T2m, the ML model will have difficulties predicting these temperatures. In both the Transformer model and the NN, the model optimization and training are performed based on the minimization of a loss function. The loss function gives the average performance of the model across the whole range of training examples, which means that the most abundant cases have the largest impact on the model performance. Conversely, the rare cases have an almost negligible impact on the loss function, which means that the performance of the model for these cases will be poor (Ribeiro & Moniz, 2020). Therefore, both the Transformer model and the NN will have a hard time predicting temperatures at the extreme ends of the distribution, for which not many training examples exist, as demonstrated by the poor performance for very cold T2m. To perform better for very cold temperatures the models need to be trained on data which include these types of temperatures and weather situations, and perhaps make use of additional input features.
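The difference between the two normalization schemes is illustrated below in a minimal sketch; the training statistics and the plausible range are illustrative assumptions, not values from the study:

```python
import numpy as np

def standardize(x, train_mean, train_std):
    """z-score using training-set statistics; values outside the
    training distribution map to large |z| never seen by the model."""
    return (x - train_mean) / train_std

def minmax(x, lo, hi):
    """Min/max scaling over a plausible physical range rather than the
    observed training range, so that e.g. -20 degC still maps inside
    [0, 1] even if the training data never got that cold."""
    return (x - lo) / (hi - lo)

# Illustrative values only: a training set centred on 10 degC,
# scaled against an assumed plausible range of -30..40 degC.
t2m = np.array([-20.0, 0.0, 15.0])
print(standardize(t2m, train_mean=10.0, train_std=7.0))
print(minmax(t2m, lo=-30.0, hi=40.0))
```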
The post-processing models were trained on only slightly more than 1.5 years' worth of data. This means that they have probably not seen data with enough variation; that is, there are most likely several weather situations they have not encountered and therefore have not learned how to predict. The Field-Sense network of PWSs, however, is relatively new, and a longer historical archive of observations is not available. An interesting idea could instead be to include data from the DMI stations, and also SYNOP stations from other countries, in order to train the Transformer model on more diverse data spanning a greater range of temperatures and weather situations. Since the Transformer model has shown good results using observations from the DMI stations to predict T2m, it stands to reason that including these data in the training of the model could be beneficial.
Only the features listed in Table 1 were tested; other features could be interesting to include as well. As discussed previously, a constructed variable holding information about whether the wind direction is onshore or offshore could be added, and distance to the coast could also be used. Furthermore, a static stability parameter could potentially be important. Stability parameters hold information about vertical mixing and are important for describing the stability of the planetary boundary layer (Holton, 2004). Several studies have shown that NWP models in general tend to overestimate T2m, that is, they tend to have a warm bias, under conditions of strong static stability during winter (Atlaskin & Vihma, 2012; Haiden et al., 2018; Køltzow et al., 2019; Sandu et al., 2013). In order to improve the performance of the Transformer model, especially for the very cold temperatures, adding a stability parameter as a feature could therefore be beneficial. Other variables worth considering include station-specific variables, such as altitude. Rasp and Lerch (2018) showed that for predicting near-surface temperatures in Germany using an NN, the inclusion of station-specific information was necessary to obtain a good performance. The terrain in Denmark, however, is very flat, and the inclusion of an altitude input feature is not expected to significantly improve the results. For more mountainous regions, on the other hand, such an input feature is expected to be necessary, as stronger spatial variations in the NWP biases are expected there. Rasp and Lerch (2018) also found that the inclusion of station latitude and longitude improved the performance of the post-processing model. It should be investigated whether the inclusion of such input features could further improve the performance of the Transformer model.
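As an illustration of the constructed wind-direction feature, the sketch below computes a signed onshore component from the meteorological wind direction and the bearing of the local sea-to-land normal; the function and its inputs are hypothetical, not features from Table 1:

```python
import numpy as np

def onshore_component(wind_dir_deg, coast_normal_deg):
    """Signed onshore component of the wind direction.

    `wind_dir_deg` is the meteorological wind direction (degrees the
    wind blows FROM); `coast_normal_deg` is the bearing of the
    sea-to-land normal at the nearest coastline point (a hypothetical
    per-station constant). Returns a value in [-1, 1]:
    > 0 indicates onshore flow, < 0 offshore flow.
    """
    blow_to = (wind_dir_deg + 180.0) % 360.0   # direction wind blows TOWARDS
    diff = np.deg2rad(blow_to - coast_normal_deg)
    return np.cos(diff)

# Example: sea-to-land normal pointing east (90 deg) and a westerly
# wind (270 deg) blowing towards 90 deg, i.e. straight onshore -> +1.0.
print(onshore_component(270.0, 90.0))
```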
The performance of the Transformer model was evaluated against two standard post-processing methods: a linear regression (LR) model and an NN. Overall, the Transformer model performed better than both benchmark models. Other, more complex post-processing models exist as well. The non-linear auto-regressive with exogenous inputs (NARX) NN, which is similar to an RNN, has been successfully applied to hydrological time series forecasting (Chang et al., 2014; Wunsch et al., 2021). Another ML model that has shown good performance for time series prediction is the LSTM (Kreuzer et al., 2020). The performance of the Transformer model should therefore be investigated further by comparing it against more complex models, such as the NARX and the LSTM.
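Such a benchmark could be set up along the following lines. This is a minimal, hypothetical PyTorch sketch of an LSTM regressor for one-step-ahead T2m prediction, not a configuration from this study:

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Minimal LSTM regressor for one-step-ahead T2m prediction.

    A hypothetical benchmark sketch: the input is a time series of NWP
    predictors plus past observations; the output is the next T2m.
    """
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        # Regress T2m from the hidden state at the last time step.
        return self.head(out[:, -1, :]).squeeze(-1)

model = LSTMBaseline(n_features=10)
x = torch.randn(32, 40, 10)   # a batch of 40-step input series
t2m_pred = model(x)           # shape: (32,)
```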
In this study, we have focused on post-processing of deterministic forecasts of near-surface temperature. Probabilistic forecasts offer a more complete framework and can also be used to provide deterministic forecasts. In light of this, the application of the Transformer model to ensemble forecasts should be investigated. Recently, Finn (2021) used a self-attentive ensemble Transformer for member-by-member post-processing of gridded 2 m temperature forecasts, effectively calibrating the ensemble spread. The Transformer model presented in the current study could similarly be extended towards post-processing of probabilistic T2m forecasts.

| CONCLUSIONS
The performance of a new post-processing method, based on the Transformer ML model, for producing site-specific T2m forecasts was evaluated. The Transformer model was trained on forecasts from GFS and observational data from a network of PWSs for the period April 2019-December 2020. In addition to comparing the performance of the Transformer model to the raw GFS forecast, two benchmark post-processing models, a linear regression (LR) model and a feed-forward, fully connected NN, were used as well. The results were evaluated using a test dataset from the same period as the training dataset, with observational data from the same network of PWSs. These results showed an improvement with respect to both bias and STD compared to the raw GFS forecast. In addition, the Transformer model performed better than both benchmark post-processing methods. This confirms the benefits of using a more complex, non-linear post-processing model, as well as the benefits of using self-attention to learn and include the temporal dynamics of the data.
Further validation was performed on data from an independent period, as well as on observations from the DMI network of SYNOP stations, a completely independent observational data source that the Transformer model had never seen before. The seasonal and geographical distributions of bias and STD showed that all three post-processing models perform best for the spring and summer months, where the raw GFS forecast has larger errors, with the Transformer model showing the largest improvement of the three. Furthermore, a dependence on distance to the coast was revealed for the STD of the Transformer model. In order to improve the performance for coastal stations, it might be beneficial to include a constructed input feature that holds information about whether the wind direction is onshore or offshore, or that adds information about distance to the coast. Furthermore, it is necessary to include data from coastal stations in the training dataset in order to improve the performance at these stations.
It was shown that the Transformer model performs better than the raw GFS forecast, as well as the two benchmark post-processing models, with respect to bias for all seasons, although for the winter months only a small improvement compared to the raw GFS forecast was seen for the longer lead times. For STD, the Transformer model outperforms the raw GFS forecast, as well as the LR and NN, for the spring and summer months. For the autumn and winter months, on the other hand, the Transformer model and the two benchmark models only perform better than the raw GFS forecast for the shorter lead times. It was found that the Transformer model has a dependence on temperature, with worse performance for colder temperatures, which could explain the worse performance for the winter and autumn months. To address this dependence, it is suggested to investigate the importance of adding an input feature that holds information about the stability of the planetary boundary layer. Furthermore, additional training data for the very cold T2m range are necessary in order to improve the performance of the Transformer model for these cases.
Overall, the Transformer post-processing model shows promising results, despite being trained on less than 2 years' worth of data. In order to include more diverse data covering a wide variety of weather situations, an interesting idea could be to include SYNOP data in the training of the Transformer model.
Figure 1: Location of stations in the (a) training, (b) validation and (c) test datasets.

Figure 2: Flow diagram of the Transformer model structure used in this study.

Figure 6: Mean summary lead time statistics, with results averaged over all stations, for seasonal bias (solid), where the average over all stations is calculated from the bias for each individual station in absolute values (mean(|bias_station|)), and standard deviation (dashed) of forecast minus observed 2 m temperature. Results are shown for the raw GFS forecast (red), the Transformer model (blue), the linear regression model (LR) and the neural network (NN) as a function of lead time. The results are based on observations from the PWSs in the test dataset. GFS, global forecast system; PWSs, private weather stations.

Figure 7: Spatial distribution of seasonal bias of forecast minus observed 2 m temperatures. The first row (from the top) shows results for winter months (December-February), the second row for spring months (March-May), the third row for summer months (June-August) and the last row for autumn months (September-November). Each column shows results for one model: the first column shows results for GFS, the second for the Transformer model, the third for the linear regression model (LR) and the last for the neural network (NN). The results are based on observations from the PWSs in the test dataset. GFS, global forecast system; PWSs, private weather stations.
Figure: Spatial distribution of seasonal standard deviation of forecast minus observed 2 m temperatures. The first row (from the top) shows results for winter months (December-February), the second row for spring months (March-May), the third row for summer months (June-August) and the last row for autumn months (September-November). Each column shows results for one model: the first column shows results for GFS, the second for the Transformer model, the third for the linear regression model (LR) and the last for the neural network (NN). The results are based on observations from the PWSs in the test dataset. GFS, global forecast system; PWSs, private weather stations.
Figure: Spatial distribution of seasonal bias of forecast minus observed 2 m temperatures. The first row (from the top) shows results for winter months (December-February), the second row for spring months (March-May), the third row for summer months (June-August) and the last row for autumn months (September-November). Each column shows results for one model: the first column shows results for GFS, the second for the Transformer model, the third for the linear regression model (LR) and the last for the neural network (NN). The results are based on observations from DMI stations. DMI, Danish Meteorological Institute.
Figure: Spatial distribution of seasonal standard deviation of forecast minus observed 2 m temperatures. The first row (from the top) shows results for winter months (December-February), the second row for spring months (March-May), the third row for summer months (June-August) and the last row for autumn months (September-November). Each column shows results for one model: the first column shows results for GFS, the second for the Transformer model, the third for the linear regression model (LR) and the last for the neural network (NN). The results are based on observations from DMI stations. DMI, Danish Meteorological Institute.
Figure: Mean summary lead time statistics, with results averaged over all stations, for seasonal bias (solid), where the average over all stations is calculated from the bias for each individual station in absolute values (mean(|bias_station|)), and standard deviation (dashed) of forecast minus observed 2 m temperature. Results are shown for the raw GFS forecast (red), the Transformer model (blue), the linear regression model (LR, black) and the neural network (NN, magenta) as a function of lead time for (a) December-February, (b) March-May, (c) June-August and (d) September-November. The results are based on observations from non-coastal DMI stations. DMI, Danish Meteorological Institute; GFS, global forecast system; LR, linear regression; NN, neural network.
Table 1: List of input variables considered.
Table 2 shows the overall performance of the models. The raw GFS forecast has the largest bias, 0.34 °C, followed by the NN with a bias of 0.10 °C, whereas the LR and the Transformer model have biases of 0.03 and 0.02 °C, respectively. The STD verification metric shows similar results: the GFS forecast has the highest STD, 1.96 °C, followed by the LR and NN with STDs of 1.63 and 1.53 °C, respectively. The Transformer model performs best, with the lowest overall STD of 1.13 °C. The other three verification metrics, MAE, RMSE and correlation, show similar results, with the raw GFS forecast scoring worst and the Transformer model best.
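The five verification metrics reported in the table can be computed as follows; this is a minimal sketch of the standard definitions, not the study's own verification code:

```python
import numpy as np

def verification_metrics(forecast: np.ndarray, obs: np.ndarray) -> dict:
    """Bias, MAE, STD of the error, RMSE and correlation for paired
    forecast/observation arrays, following the usual definitions."""
    err = forecast - obs
    return {
        "bias": err.mean(),
        "MAE":  np.abs(err).mean(),
        "STD":  err.std(),                       # spread of the error
        "RMSE": np.sqrt((err**2).mean()),
        "corr": np.corrcoef(forecast, obs)[0, 1],
    }
```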
Table 2: Overall performance of the models with respect to bias, MAE, STD, RMSE and correlation.
Note: The statistics are based on data from the test dataset, using all lead times and stations. Abbreviations: GFS, global forecast system; LR, linear regression; MAE, mean absolute error; NN, neural network; RMSE, root mean square error; STD, standard deviation.
Table 3: Overall performance of the models with respect to bias, MAE, STD, RMSE and correlation.