A Comparative Study of Convolutional Neural Network Models for Wind Field Downscaling

We analyze the applicability of convolutional neural network (CNN) architectures for downscaling of short-range forecasts of near-surface winds on extended spatial domains. Short-range wind field forecasts (at the 100 m level) from ECMWF ERA5 reanalysis initial conditions at 31 km horizontal resolution are downscaled to mimic HRES (deterministic) short-range forecasts at 9 km resolution. We evaluate the downscaling quality of four exemplary model architectures and compare these against a multi-linear regression model. We conduct a qualitative and quantitative comparison of model predictions and examine whether the predictive skill of CNNs can be enhanced by incorporating additional atmospheric variables, such as geopotential height and forecast surface roughness, or static high-resolution fields, like land-sea mask and topography. We further propose DeepRU, a novel U-Net-based CNN architecture, which is able to infer situation-dependent wind structures that cannot be reconstructed by other models. Inferring a target 9 km resolution wind field from the low-resolution input fields over the Alpine area takes less than 10 milliseconds on our GPU target architecture, which compares favorably to an overhead in simulation time of minutes or hours between low- and high-resolution forecast simulations.


| INTRODUCTION AND CONTRIBUTION
Accurate prediction of near-surface wind fields is a topic of central interest in various fields of science and industry. Severe memory and performance costs of numerical weather simulations, however, limit the availability of fine-scale (high-resolution) predictions, especially when forecast data is required for extended spatial domains. While running global reanalyses and forecasts with a spatial resolution of around 30 km is computationally affordable (e.g., Hersbach et al., 2020), these models are unable to accurately reproduce wind climatology in regions with complex orography, such as mountain ranges. Since wind speed and direction are determined by localized interactions between air flow and surface topography, with sometimes the added complication of thermal forcing, accurate numerical simulation requires information on significantly finer length scales, particularly in regions that are topographically complex. For instance, (sub-grid-scale) topographic features such as steep slopes, valleys, mountain ridges or cliffs may induce wind shear, turbulence, and acceleration and deceleration patterns that cannot be resolved by global models lacking information on these factors. Moreover, meteorologically relevant factors such as vertical stability, snow cover, or the presence of nearby lakes, river beds, or sea can strongly influence local wind conditions (e.g., McQueen et al., 1995; Holtslag et al., 2013). In these regions, finer-resolution regional numerical models, with grid spacings of order kilometres or less, need to be applied in order to obtain reliable low-level winds (e.g., Salvador et al., 1999; Mass et al., 2002). One approach to circumvent costly high-resolution simulations over extended spatial scales is known as downscaling, i.e., inferring information on physical quantities at local scale from readily available low-resolution simulation data using suitable refinement processes. Downscaling is a long-standing topic of interest in many scientific disciplines, and in meteorological research in particular there exists a large variety of methods to downscale physical parameters. Such methods can be broadly classified into dynamical and empirical-statistical approaches (e.g., Hewitson and Crane, 1996; Rummukainen, 1997; Wilby and Wigley, 1997).
In dynamical downscaling (e.g., Rummukainen, 2010;Radić and Clarke, 2011;Xue et al., 2014;Kotlarski et al., 2014;Räisänen et al., 2004), high-resolution numerical models are used over limited sub-domains of the area of interest, and numerical model outputs on coarser scales provide boundary conditions for the simulations on the finer scale. While the restricted size of the model domain leads to a significant reduction of computational costs compared to global-domain simulations, dynamical downscaling still remains computationally demanding and time-consuming.
Statistical downscaling, on the other hand, aims to avoid the simulation at the finer scales, by using a coarse scale simulation (referred to as predictor data) to infer predictions at fine scale (referred to as predictand data). Correlations between the quantities at fine and coarse scales are learned by training statistical models on a set of known predictor-predictand data pairs.
Over time, a large number of empirical-statistical downscaling approaches have been developed that apply statistical regression methods for downscaling purposes, like (generalized) multi-linear regression methods (e.g., Chandler, 2005) or quantile mapping approaches (e.g., Wood et al., 2004). With recent developments in data-driven machine learning and computer science, however, more powerful modeling techniques have become available, which may have the potential to outperform previous methods in terms of both accuracy and efficiency. Only a few studies have examined the use of non-linear regression methods or more recent non-classical machine learning techniques (e.g., Eccel et al., 2007; Gaitan et al., 2014; Vandal et al., 2019). Specifically, the extent to which non-linear machine-learning approaches can provide additional value over classical methods is a question that has not yet been answered conclusively.
Deep-learning methods are among the most prominent examples of state-of-the-art machine-learning techniques (e.g., LeCun et al., 2015; Goodfellow et al., 2016). In particular, convolutional neural networks (CNNs) have found manifold application in complex image processing and understanding tasks (e.g., Guo et al., 2016). One of these is single-image super-resolution, i.e., the generation of high-resolution images from low-resolution images (e.g., Dong et al., 2014), which, formally, can be thought of as a task very similar to downscaling of climate variables. CNNs rely on expressing regression models that operate on an extended spatial domain as a set of localized linear models (localized filter kernels), which are applied repeatedly at varying spatial positions across the domain through convolution operations. The restriction of the model parameterization to local filter kernels effectively limits the number of trainable parameters, and thus reduces the tendency of the model to overfit spurious patterns in the data, while increasing model efficiency.
While also applicable to irregular graph-based data structures (Kipf and Welling, 2016), e.g., data defined on irregular grids, CNNs work most effectively with regular-gridded data in multi-dimensional array representations, facilitating an efficient parallel computation of optimization tasks on GPU-based compute hardware. Computational efficiency through parallelization is one of the major selling points of CNNs and should be considered as an important aspect during model design and data preparation.
Furthermore, more complex mappings can be learned by stacking multiple layers of convolution operations (increasing the depth of the models) and applying these successively, to generate more abstract feature representations. Similar to standard artificial neural networks (ANNs), applying non-linear activation functions between successive convolution layers can enable the model to learn non-linear mappings. Beyond purely sequential feature processing, more elaborate model design patterns, like skip connections between pairs of convolution layers (Srivastava et al., 2015), residual learning (e.g., He et al., 2016), or changes in the spatial resolution of internal feature representations (e.g., Ronneberger et al., 2015), can be leveraged to improve model performance.
CNNs are, thus, particularly well-suited for learning tasks involving spatially distributed data, which are often encountered in meteorology. Though CNN-based model architectures are increasingly adopted also in earth-system sciences (e.g., Shen, 2018;Reichstein et al., 2019;Vannitsem et al., 2020), their usage for downscaling applications has been rarely discussed (e.g., Vandal et al., 2018;Baño-Medina et al., 2019). In particular, earlier studies focused on simple CNN architectures, which do not make use of recent model design patterns, and thus do not exploit the full potential of state-of-the-art CNN architectures.

| Contribution
In this work, we perform a study of fully-convolutional neural network architectures for statistical downscaling of near-surface wind vector fields. The results are compared to those obtained by a multi-linear regression model, both w.r.t. quality and performance. We train models to predict the most likely outcome of a high-resolution simulation of near-surface winds 100 m above ground, based on low-resolution short-range wind-field forecasts as primary predictors. The data are defined on irregular octahedral and triangular reduced Gaussian grids with 9 km and 31 km horizontal resolution, respectively. To enable efficient processing of the data with CNNs and to avoid destroying local detail via interpolation, the data are mapped to regular grids through suitable padding. We view this work as an initial 'proof of concept' step, to pave the way to using finer resolutions, for both predictor and predictand. If the predictand scale could reach 1 or 2 km we would envisage a much greater range of practical applications emerging.
We compare the capabilities of different existing models, which reflect varying degrees of model complexity and elaboration.
Starting with a multi-linear regression model and a light-weight linear convolutional model, we continue the comparison with non-linear convolutional models of increasing complexity. By incorporating beneficial design patterns identified beforehand, in combination with adaptations in architectural design and training methodology, we propose DeepRU, a U-Net-based CNN model that improves on the reconstruction quality of existing architectures.
For all models, we analyze whether incorporating additional climate variables and high-resolution topography information, like surface altitude and the land-sea mask, improves the network's inference capabilities. We further train the models on sub-regions of the domain, to avoid learning relationships between low- and high-resolution winds purely based on geographic location, i.e., to avoid overfitting to a particular domain. The reconstruction quality of all downscaling models is compared to the high-resolution simulations of real-world weather situations for a topographically complex region in central and southern Europe for the period between March 2016 and September 2019 (Figure 1). Our key finding is that thought-out architecture design and appropriate model tuning enable network-based downscaling methods to efficiently generate high-resolution wind fields in which local and global scale structures are reproduced with high fidelity.
To further analyze the usability of network-based downscaling, the relationships between model complexity, network performance, and computational requirements such as memory consumption and prediction time are evaluated. We show how model depth as well as the design patterns in use, i.e., residual connections across successive convolution layers and U-shaped encoder-decoder architectures, can be leveraged to balance model complexity against prediction quality.

| Empirical-Statistical Downscaling
In describing downscaling options available at the time, Wilby and Wigley (1997) distinguish between regression methods, weather typing approaches and stochastic weather generators. Regression-based methods build upon the construction of parametric models, which are trained in an optimization procedure to establish a transfer function between low-resolution predictor variables and high-resolution predictands. Weather typing approaches, in contrast, rely on finding a suitable match between a set of predictor values and predictor value sets contained in the training data, in order to select the most appropriate weather pattern analogue (e.g., Zorita and von Storch, 1999). Stochastic weather generators provide a probabilistic approach and are trained to replicate spatio-temporal sample statistics, as implied by the training data (e.g., Wilks, 2010, 2012). A comprehensive review and comparison of empirical-statistical models for downscaling climate variables has been conducted by Maraun et al. (2015), who showed that many of the approaches perform generally well, but leave space for improvement. For instance, realistic replication of spatial variability in the high-resolution predictand variables remains a major challenge for many of the models.
Specifically addressing the problem of wind-field downscaling and forecasting, Pryor (2005) and Michelangeli et al. (2009) proposed distribution-based approaches for wind-field inference, and Huang et al. (2015) proposed a physical-statistical hybrid method for downscaling.
The question of which methods provide additional value over classical approaches has only been addressed by a number of smaller model-comparison studies, with varying results. While Eccel et al. (2007), Mao and Monahan (2018) and Vandal et al. (2019) found hardly any or no advantage in applying non-classical machine-learning methods, Gaitan et al. (2014) showed non-classical methods, artificial neural networks in particular, outperforming classical ones. More recently, Buzzi et al. (2019) used neural networks for nowcasting wind in the Swiss Alps, and obtained very skillful models.
These apparently contradictory findings raise the question of when, and under what conditions, deep learning methods can be profitably employed for downscaling.
Within meteorology, only a small number of studies have dealt with using CNNs for downscaling applications. For example, Vandal et al. (2018) proposed "DeepSD", a simple convolutional neural network for downscaling precipitation over extended spatial domains, and more recently, Baño-Medina et al. (2019) studied the performance of a set of convolutional neural networks for downscaling temperature and precipitation over Europe. Pan et al. (2019) proposed a similar architecture, again with a focus on precipitation.
While the influence of model complexity has been examined by Baño-Medina et al. (2019) in terms of model depth, i.e., the number of convolution layers, the models in use did not exploit recent design patterns, like skip or residual connections (e.g., Srivastava et al., 2015; He et al., 2016) or the fully-convolutional U-Net-like architecture (Ronneberger et al., 2015), which enable network models to achieve state-of-the-art results in computer vision tasks.

| Single-Image Super-Resolution
Computer vision, being the origin of a large number of technological developments in machine learning, provides a problem setting which is closely related to downscaling in meteorology and climatology: single-image super-resolution. There, the goal is to identify mappings which allow for increasing the resolution of single low-resolution input images, while maintaining visual quality and avoiding pixel artifacts and blurriness. Within this context, the use of deep learning has led to remarkable improvements compared to standard statistical models, and CNNs in particular were found to be successful (e.g., Dong et al., 2014; Sajjadi et al., 2017).

| TRAINING DATA
For model training and evaluation, we use short-range weather forecast data, which include near-surface wind field simulations at different scales. The data is taken from the ECMWF (European Centre for Medium-Range Weather Forecasts) MARS (Meteorological Archival and Retrieval System) archive (Maass, 2019) and covers a spatial domain in central and southern Europe.

| Domain Description
The training domain is restricted to 40°-50°N and 0°-20°E (see Figure 2(a)), and comprises sub-regions with varying orographic properties. Specifically, the domain contains the high mountains of the Alps, some smaller mountain ranges in central Europe, flat areas in France, parts of the Mediterranean Sea, and southwest-facing coastal regions of the Adriatic, to confront the employed models with challenging scenarios where winds are highly influenced by the topography. Especially in the Dinaric Alps, situated in the eastern part of the domain, topographically forced gap flows are known to be an important phenomenon (e.g., Lee et al., 2005; Belušić et al., 2013). Significant differences between the low- and high-resolution numerical simulation results are most commonly observed in and around mountain ranges and coast lines, leading to the question of whether downscaling techniques can learn these differences and accurately predict the high-resolution fields from the low-resolution versions.
FIGURE 2: (a) Map of the surface topography in Europe representing the data domain. (b) Low-resolution (N320) and high-resolution octahedral (O1280) Gaussian simulation grids used by ERA5 and HRES, respectively. Over our domain the high-resolution grid comprises about 3 times more grid points in longitude and about 4 times more in latitude.

| Low-and High-Resolution Simulations
As "low-resolution" input to our models, we use data derived from the ERA5 reanalysis product suite (Hersbach et al., 2020).
ERA5 is the fifth in the series of ECMWF global reanalyses, and provides estimates of the 3-dimensional global atmospheric state (climate) over time, based on four-dimensional variational (4D-Var) data assimilation of past observations into a recent version of the operational ECMWF numerical forecast model. Output is provided on a reduced Gaussian grid (N320) with a horizontal resolution of 31 km (0.28125°). In this study we use hourly forecast fields, from data times of 06:00 and 18:00 UTC, at time steps of T+1, T+2, ..., T+12 h. We use these short-range forecasts instead of the true reanalysis fields to avoid systematic small jumps in low-level winds seen in the latter at 09:00 and 21:00 UTC (documented in Hersbach et al., 2020).
The higher-resolution target dataset was provided by operational short-range forecasts from ECMWF's HRES (High RESolution) model, also at hourly intervals, initialized twice per day. HRES is a component of the ECMWF Integrated Forecast System (IFS) that can provide relatively accurate forecast products into the medium ranges (≥ 72 h ahead) (ECMWF, 2017).
HRES is the highest-resolution model available at ECMWF (∼9 km) and, as with reanalyses, incorporates observations and information about the Earth system as a prior for simulation runs. The output is provided on an octahedral reduced Gaussian grid (O1280). Forecast timesteps used were T+7, T+8, ..., T+18 h from the 00:00 UTC and 12:00 UTC runs. These were chosen as a compromise between being long enough to reduce any contamination from model spin-up, and short enough to retain forecast accuracy. The different spatial resolutions of ERA5 and HRES are illustrated in Figure 2(b). Products for HRES on the O1280 grid were first introduced operationally in March 2016, and so are only available from that point onwards. Therefore, we restrict our analysis to the period between March 2016 and October 2019.

| Predictor and Predictand Variables
Both the low-resolution predictors and the high-resolution predictands provide two wind variables, which contain spatio-temporal information on the horizontal wind components 100 m above ground. The wind variables are denoted by U (zonal wind) and V (meridional wind). At the same locations (i.e., grid points), land surface elevation (altitude, ALT) and a binary land-sea mask (LSM) are available in low- and high-resolution variants. These are used as static predictors.
From the low-resolution dataset, supplementary predictor variables are obtained and used as dynamical, i.e., time-varying, predictors. The additional variables were manually selected according to the following considerations:

• Boundary layer height (BLH) is a model diagnostic that describes the vertical extent of the lowest layer of the atmosphere within which interactions take place between the Earth's surface and the atmosphere (Stull, 2017). Its value typically ranges between about 0.3 and 3 km, and it is essentially a metric for low-level stability, with larger values implying deeper layers of instability-driven mixing. Earlier studies (e.g., Holtslag et al., 2013) found that boundary-layer effects can have a significant impact on model performance in numerical temperature and wind predictions. Therefore, BLH may encode information that affects the matching between the low- and high-resolution variants. Also, BLH can provide the model with information about diurnal cycles. For these various reasons there was clear potential for this standard model output variable to be a useful predictor.
• Forecast surface roughness (FSR) denotes the surface roughness as represented in the forecast, and thereby provides information on the frictional retardation of the near-surface airflow. Contributory factors are vegetation types and land cover like soil or snow. The only dynamic component in the ECMWF modelling architecture is snow cover; other aspects are fixed year-round. We expected a small but direct impact from the snow cover.
• Geopotential height at 500 hPa (Z500) designates the elevation of the 500 hPa pressure level above mean sea level, and typically has values around 5500 m. At this height, the pressure gradients and Coriolis force are typically in balance and winds are roughly parallel to Z500-isolines (see e.g. geostrophic winds in Wallace and Hobbs, 2006). Fields of Z500 very commonly serve as a proxy for forecasters of the general atmospheric flow structure and indeed synoptic pattern. So on the one hand one might expect a link with near-surface winds, but on the other the level is so far from the surface that it is unlikely to be a good predictor of local winds. This variable was partly included as a test of the veracity of our results. Even though on physical grounds we did not, overall, expect strong predictive skill from this variable, our results indicate an apparent influence on the inferred fields.

| Data Padding
The training data obtained from MARS is defined on irregular grids where the number of grid nodes per latitude decreases with increasing latitude. As CNNs require the input data as multi-dimensional data arrays, the data needs to be resampled onto a regular grid structure. Since resampling using interpolation can smooth out and even remove relevant structures, the initial data is instead copied into rectangular 2D grids and padded appropriately. To this end, the maximum number of longitude points, which occurs at the latitude nearest to the equator, is computed, and new points are padded onto the remaining latitudes of each grid (cf. Figure 3). This approach preserves the spatial adjacency of grid nodes for a large proportion of the nodes, which is important to facilitate proper learning of spatial correlations. The true distance between grid nodes in world space is, however, ignored in the training process. The padded points are marked in a binary mask, which is passed to the objective function during network training to distinguish between valid and padded values in the loss computation.
Padding is chosen based on the fact that convolutional neural networks take into account not only neighborhood relations but also relative changes between neighboring values. Zero-padding, which may cause steep gradients between neighboring values, is thus deemed unsuitable, and is replaced by replication padding using the values of the boundary grid points of the valid domain.

FIGURE 3: Example of padding and masking used to resample the initial (low-resolution) data from an irregular Gaussian grid to a Cartesian grid. Blue cells indicate the data points of the gridded wind field. The interior of the data domain is shown in light-blue; boundary points are drawn in dark-blue, with their values represented by numbers. A regular grid is achieved by padding new data points to the grid (light-red cells) while replicating the corresponding boundary values.
The initial low-and high-resolution data with respectively 1918 and 20416 grid points on irregular grids are mapped to regular grids of size 36 × 60 and 144 × 180 in latitude and longitude directions. This results in an increase in the number of grid points by a factor of 4 × 3 between low-resolution and high-resolution grids, which reflects the actual difference in resolution between ERA5 and HRES simulations (see Figure 2).
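The padding procedure can be sketched as follows. This is a minimal illustration assuming NumPy and a per-latitude-row representation of the reduced Gaussian grid; the helper name and the even split of the padding between the two row ends are our choices, as the exact placement is not specified above.

```python
import numpy as np

def pad_reduced_gaussian(rows, n_lon_max=None):
    """Pad one field from a reduced Gaussian grid into a regular 2D array.

    rows: list of 1D arrays, one per latitude, with fewer points towards the
    poles. Each row is replication-padded to the length of the longest row
    (the latitude nearest the equator); a binary mask marks the valid points
    so padded values can be excluded from the loss. Hypothetical helper.
    """
    n_lon = n_lon_max or max(len(r) for r in rows)
    field = np.empty((len(rows), n_lon), dtype=np.float32)
    mask = np.zeros((len(rows), n_lon), dtype=bool)
    for i, row in enumerate(rows):
        pad = n_lon - len(row)
        left, right = pad // 2, pad - pad // 2
        field[i] = np.pad(row, (left, right), mode="edge")  # replicate boundary values
        mask[i, left:n_lon - right] = True
    return field, mask

# Example: three latitude rows of decreasing length.
rows = [np.arange(6.0), np.arange(4.0), np.arange(2.0)]
field, mask = pad_reduced_gaussian(rows)
```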

| Data Scaling
Before training, the padded data are standardized by subtracting sample mean and dividing by sample standard deviation.
Standardization has proven useful in machine learning for improving the stability and convergence time of non-linear optimization methods (e.g., Ioffe and Szegedy, 2015;Srivastava et al., 2014). For time-dependent predictors, sample mean and standard deviation were computed node-wise from the snapshot statistics of the respective training datasets. Node-wise scaling is preferred over global domain scaling as spatial inhomogeneities are reduced, which we found to improve the downscaling results in our experiments. For static predictors, mean and standard deviation were computed from domain statistics. For sample standard deviations, we considered the unbiased ensemble estimate. Validation data is transformed accordingly before processing.
Standardization is performed also for the predictand variables. We found this useful due to strong differences in average wind speeds between coast or sea sites and mountain ranges. Further details are discussed in Sect. 5.
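As an illustration, node-wise standardization of a time-varying predictor might look as follows; array names and shapes are our assumptions, and the statistics are computed from the training split only, as described above.

```python
import numpy as np

# Node-wise standardization of a time-varying predictor (sketch).
train = np.random.randn(1000, 36, 60).astype(np.float32)  # (time, lat, lon)
valid = np.random.randn(200, 36, 60).astype(np.float32)

mean = train.mean(axis=0)            # per-node sample mean, shape (36, 60)
std = train.std(axis=0, ddof=1)      # unbiased sample standard deviation

train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std  # validation uses the training statistics
```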

| NETWORK ARCHITECTURES
All of the models we use and compare in this work are constructed as parametric mappings of the form

y = f(x; β),    (1)

wherein y represents the array of high-resolution predictands, x denotes the array of predictor variables, and β summarizes the model-specific parameters to be optimized during training.
We use in particular CNNs, which repeatedly apply convolution kernels of fixed size to gridded input data at varying spatial positions to capture different types of features. For the downscaling CNNs in our study, we consider low-resolution input predictor arrays of shape c_X^(LR) × s_lat × s_lon and, where used, high-resolution input arrays of shape c_X^(HR) × 4 s_lat × 3 s_lon. Choosing s_lat and s_lon smaller than the full grid extent corresponds to running the models on limited sub-domains, which we use for data augmentation, as discussed in Sect. 5. Predictands y are assumed to be of shape c_Y × 4 s_lat × 3 s_lon, with c_Y indicating the number of predictand variables. While c_Y = 2 is fixed for all our models, corresponding to the high-resolution wind components U and V, c_X^(LR) and c_X^(HR) vary depending on the predictors supplied to the models, as detailed in Sect. 6.2. In particular, some of the models are provided with low-resolution predictors exclusively, whereas other model configurations are additionally informed with high-resolution topography predictors.
The (rectangular) filter kernels are parametrized per convolution layer as arrays of shape c_in × c_out × k_lat × k_lon, with c_in and c_out denoting the numbers of input and output features of the layer, and k_lat and k_lon the spatial extents of the kernel filters in latitude and longitude. Due to the size of the kernel, the number of elements in convolution output arrays differs from that of the input arrays. To compensate for this, suitable replication padding between successive convolution layers is employed to keep the spatial shape of feature arrays constant throughout the series of convolutions.
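A minimal sketch of such a shape-preserving convolution layer in PyTorch (the framework used for all models in this work), assuming odd kernel sizes; the helper name is ours.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k_lat, k_lon):
    """Convolution that keeps the spatial shape constant via replication
    padding (sketch; assumes odd kernel sizes)."""
    return nn.Sequential(
        # ReplicationPad2d takes (left, right, top, bottom) = (lon, lon, lat, lat)
        nn.ReplicationPad2d((k_lon // 2, k_lon // 2, k_lat // 2, k_lat // 2)),
        nn.Conv2d(c_in, c_out, kernel_size=(k_lat, k_lon)),
    )

x = torch.randn(8, 2, 36, 60)          # (batch, channels, s_lat, s_lon)
y = conv_block(2, 56, 5, 5)(x)
print(y.shape)                         # torch.Size([8, 56, 36, 60])
```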
In the following, the details of the different model architectures used in our evaluation are described. A schematic summary of all models is provided in Figure 4.

| Linear Convolutional Network Model: LinearCNN
For our comparison, we implemented a linear convolutional model, which we call LinearCNN. In contrast to usual practice, we intentionally omit non-linear activation functions from the architecture to force the model to rely on local linear relationships between predictors and predictands.
Formally, the architecture is designed to mimic the effect of a linear model, which takes a section of the low-resolution predictor array of 5 × 5 pixels and outputs an estimate for the 4 × 3 pixel patch of the high-resolution wind field array that corresponds to the center pixel of the low-resolution section. Since, however, the data padding may cause distortions between neighboring pixels in latitude-longitude coordinates, the neighborhood correspondence between low-resolution and high-resolution array grids may be imperfect. To account for this, we find it beneficial to average the final estimate over multiple sample estimates, which are obtained from a small neighborhood around the central low-resolution pixel. LinearCNN, therefore, is designed to produce a high-resolution estimate of size 12 × 9, corresponding to the area covered by the interior 3 × 3 pixels of the low-resolution section. In this way, the averaging is realized automatically when applying the model to larger domains in convolution mode.
The model architecture consists of two branches for processing low-resolution and high-resolution inputs separately. The low-resolution branch is comprised of a single convolution layer and a successive transpose convolution (e.g., Dumoulin and Visin, 2016) to increase the resolution of the input features. Transpose convolutions can be understood as linear operations which learn to parameterize the gradient of a standard convolution. Whereas standard convolutions possess the ability to reduce the spatial resolution of feature arrays by skipping pixels between successive applications of the kernel, which is known as striding (e.g., Dumoulin and Visin, 2016), transpose convolutions can achieve an increase in resolution by parameterizing the kernel of a strided convolution. For our experiments, we select a kernel size of (k_lat, k_lon) = (5, 5) for the standard convolution, as well as kernel size (12, 9) and strides (4, 3) for the transpose convolution, reflecting the magnification ratio between low-resolution and high-resolution grids. Depending on the number of input variables c_X^(LR), the first convolution layer transforms the input predictors into a set of 25 c_X^(LR) latent features, which are then processed further by the transpose convolution. The number of features is selected to admit a full-rank linear mapping between low-resolution predictors and high-resolution predictand estimates, i.e., each of the single pixel estimates may depend on all of the covered predictor values, independently of the other estimates.
On the high-resolution branch, the predictors are fed into a single standard convolution layer with kernel size (9, 9). The outputs of this layer are directly added to those of the low-resolution transpose convolution. Empirically, we found that models with larger kernel sizes did not improve the performance.
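The two-branch design can be summarized in the following sketch, assuming the kernel sizes and strides named above; the transpose-convolution padding of (4, 3) is our choice so that a 36 × 60 input maps exactly onto the 144 × 180 target grid, and the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class LinearCNN(nn.Module):
    """Sketch of the two-branch linear model (no activation functions)."""

    def __init__(self, c_lr, c_hr):
        super().__init__()
        # Low-resolution branch: feature extraction plus learned 4x3 upsampling.
        self.lr_branch = nn.Sequential(
            nn.Conv2d(c_lr, 25 * c_lr, kernel_size=(5, 5),
                      padding=(2, 2), padding_mode="replicate"),
            nn.ConvTranspose2d(25 * c_lr, 2, kernel_size=(12, 9),
                               stride=(4, 3), padding=(4, 3)),
        )
        # High-resolution branch: a single (9, 9) convolution.
        self.hr_branch = nn.Conv2d(c_hr, 2, kernel_size=(9, 9),
                                   padding=(4, 4), padding_mode="replicate")

    def forward(self, x_lr, x_hr):
        return self.lr_branch(x_lr) + self.hr_branch(x_hr)

model = LinearCNN(c_lr=2, c_hr=2)
out = model(torch.randn(1, 2, 36, 60), torch.randn(1, 2, 144, 180))
print(out.shape)                       # torch.Size([1, 2, 144, 180])
```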

| Simple Non-Linear CNN: DeepSD
DeepSD is a simple non-linear CNN architecture, which has been proposed by Vandal et al. (2018). It builds on the SRCNN model of Dong et al. (2014), which infers the high-resolution image from a bicubic interpolation of the low-resolution image data. Although Vandal et al. (2018) propose to compose DeepSD of several instances of stacked SRCNNs for better predictions, we found that for the magnification ratio of 3x in longitude and 4x in latitude a single stage of SRCNN already attains results on a par with those achieved by stacked SRCNN configurations.
In the implementation of DeepSD we follow the design proposed by Dong et al. (2014) and Vandal et al. (2018). The first layer uses a large kernel size of (9, 9) to transform the input predictor fields into an abstract feature-space representation with 64 features, followed by a non-linear activation. The second layer applies a pixel-wise dimensionality reduction with a convolution of kernel size (1, 1) and 32 output features, and a second non-linear activation. The final layer applies a convolution with kernel size (5, 5) to transform the features to the target resolution. Vandal et al. (2018) further proposed to inform the model with high-resolution orography to learn the influence of the topography on the inferred climate variables. Hence, we include the high-resolution static orography predictors during training of all our DeepSD models. To match low-resolution and high-resolution predictors, the low-resolution predictors are first interpolated to high resolution using bicubic interpolation, and then concatenated with the high-resolution predictors to create a combined input array. A schematic of the HR-input block is shown in Figure 5.

| Hourglass-Shaped Non-Linear CNN: FSRCNN

The authors of FSRCNN (Dong et al., 2016) proposed an hourglass-shaped network architecture, where the highest number of feature channels is used for the outermost layers, while the channel sizes of the inner layers are reduced. This design pattern is intended to avoid costly computations while maintaining prediction quality.
In our experiments, we slightly adapt the architecture of FSRCNN and split the model into three parts: an input processing stage for primary feature extraction, a feature processing stage, and a super-resolution stage for successively increasing the resolution until the target resolution is reached.
The design of the input stage varies depending on the predictors in use. When employing low-resolution predictors exclusively, a single convolution layer of kernel size (5, 5) is used to transform the inputs into a set of 56 spatial feature fields, which coincides with the original design by Dong et al. (2016). For model configurations that employ both low-resolution and high-resolution predictors, a combined feature representation in the low-resolution spatial domain is created by applying the input block as depicted in Figure 5. We apply two independent convolution chains to low- and high-resolution predictors separately, and restrict the number of feature channels for both chains to c^(LR) = c^(HR) = 28. While on the low-resolution branch one single convolution with kernel size (5, 5) is used for feature extraction, the high-resolution branch consists of a sequence of strided convolutions with kernel sizes as indicated in Figure 5. This reduces the resolution of the features successively to low-resolution scale. The resulting features are concatenated with the previously computed low-resolution features and supplied to the feature processing stage.
The feature-processing stage again reflects the original design choices by Dong et al. (2016). In an hourglass-like architecture, a convolution with a (1, 1)-kernel is applied to reduce the number of feature channels from 56 to 12, which is then followed by a sequence of four convolution layers with kernel size (3, 3), 12 output feature channels, batch normalization and non-linear activation. The last convolution layer of the processing stage uses a (1, 1)-kernel to return to the 56 feature channels.
In the original FSRCNN, the resulting features are used as input for a single transpose convolution with a kernel size of (9, 9) for upsampling. In our experiments, however, we found that this very large transpose convolution can lead to slow training progress, and can even prevent training from convergence. Furthermore, (Odena et al., 2016) has shown that transpose convolutions can introduce checkerboard-like artifacts in the final prediction. To circumvent these problems, the extracted features are fed into a super-resolution block, as sketched in Figure 6, after the final batch normalization and non-linear activation layer of the feature extraction stage. Hence, we avoid transpose convolutions in our work and, instead, use bilinear upsampling first and apply conventional convolution afterwards to obtain an upsampled result (e.g., . In addition, we replace a single upsampling convolution with scaling factor (4, 3) by a sequence of 3 upsampling blocks with smaller scaling factors of (2, 1), (1, 3), and (2, 1) to obtain the final image in target resolution. The upsampling blocks are comprised of bilinear interpolation, convolution layers with kernel size (3, 3), (3, 5), or (3, 3), batch normalization, and a non-linear activation function.
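One such upsampling block might be realized as follows; this is a sketch under the stated scale factors and kernel sizes, with the channel count of 56 and the helper names being our assumptions.

```python
import torch
import torch.nn as nn

def upsample_block(channels, scale, kernel):
    """Bilinear upsampling followed by convolution, batch normalization and
    PReLU activation (sketch)."""
    pad = (kernel[0] // 2, kernel[1] // 2)
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
        nn.Conv2d(channels, channels, kernel_size=kernel,
                  padding=pad, padding_mode="replicate"),
        nn.BatchNorm2d(channels),
        nn.PReLU(channels),
    )

# The (4, 3) magnification split into three smaller steps, as described above:
stage = nn.Sequential(
    upsample_block(56, (2, 1), (3, 3)),
    upsample_block(56, (1, 3), (3, 5)),
    upsample_block(56, (2, 1), (3, 3)),
)
print(stage(torch.randn(1, 56, 36, 60)).shape)   # torch.Size([1, 56, 144, 180])
```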
Finally, upsampling is followed by an additional convolution layer with batch normalization and non-linear activation, and a single output convolution without any activation function. Being a non-linear model, all but the very last convolution layers in FSRCNN are followed by non-linear activations, which are realized as parametric rectified linear units (PReLU), as proposed in He et al. (2015).
Note that in the original FSRCNN architecture, batch normalization has not been used. In our experiments, however, we found it beneficial to regularize the feature representations through batch normalization, since the increased depth of our FSRCNN variant may lead to instabilities in training due to e.g. internal covariate shifts (Ioffe and Szegedy, 2015). By applying batch normalization after each convolution, we could successfully stabilize the training process.

| Deep Non-Linear CNN: EnhanceNet
Prior work in Deep Learning (e.g., Timofte et al., 2017, and references therein) has shown that increasing network depth can help improve prediction quality, and led to network architectures which outperform shallow networks. However, deep networks can easily introduce instabilities in the optimization process, which is typically based on backpropagation of gradients. Specifically, training may become inefficient due to vanishing gradients (Glorot and Bengio, 2010), which originate from the accumulation of small parameter gradients in the chain-rule-based estimation of model parameter updates. The sequential algorithm for gradient estimation causes an exponential decay of parameter updates in early layers of the network, and prevents the parameters from changing significantly during training. While batch normalization may help to stabilize network training, vanishing gradients remain an intrinsic problem of deep neural network architectures.
An effective way to address this problem is the integration of so-called short-cut connections. The purpose of these connections is to pass output features of earlier layers directly to a later stage in the network, effectively skipping parameter dependencies of intermediate model parts and circumventing the accumulation of small gradients. Two prominent examples are skip connections, as used by Srivastava et al. (2015) and Ronneberger et al. (2015), as well as residual connections, as proposed by He et al. (2016). With skip connections, the output of a previous layer is concatenated with the result of an intermediate layer.
An example is given in Figure 7, which is discussed in more detail in Sect. 4.5. Residual connections are similar to skip connections, but instead of being concatenated, the features before and after intermediate processing are added. This enables the model to learn mappings that are close to identity more directly.
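The two short-cut variants can be contrasted in a minimal sketch; the layer contents are placeholders, the point being only how the features are recombined.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual connection: the input is added back to the processed features."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)            # add: learns a correction to identity

class SkipBlock(nn.Module):
    """Skip connection: earlier features are concatenated instead of added."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)   # doubles the channel count
```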
As a deep CNN architecture with residual connections we selected EnhanceNet (Sajjadi et al., 2017), which has originally been proposed for image super-resolution. EnhanceNet is comprised of an input stage for raw feature extraction, followed by a stack of 20 convolution layers for feature processing, and a super-sampling stage (see Figure 6), similar to that of FSRCNN.
Residual learning is incorporated into the architecture in two variants. On the one hand, convolutions for feature processing are subdivided into 10 blocks of 2 layers each, where each block is wrapped by a residual connection. A schematic representation of one of these residual blocks is shown in Figure 6. On the other hand, bicubic interpolation is used to interpolate the low-resolution wind-field inputs to target resolution, yielding a baseline estimate for the high-resolution field, which is added to the model output.
For reasons of efficiency, the convolution layers of EnhanceNet use a kernel size of (3, 3). In our experiments, the number of feature channels is set to 64, which matches the parameters chosen in the original paper by Sajjadi et al. (2017).
The non-linear activation functions for EnhanceNet are realized through rectified linear units. Similar to LinearCNN and FSRCNN, we consider network variants with varying settings of low-resolution dynamical predictors, as well as with and without high-resolution topography. Depending on the predictor configuration, either a single convolution layer with kernel size (3, 3) or the input block depicted in Figure 5 is used for primary feature extraction. Since the main focus of our study is on pixel-wise accuracy of the downscaling results, we refrain from using perceptual and adversarial losses that are typically used in super-resolution image tasks (Sajjadi et al., 2017) and instead use pixel-wise losses as discussed in Sect. 5.3.

| DeepRU
Network architectures for super-resolution image generation have been optimized for natural images, which possess properties that are different from those of meteorological simulation results. For instance, natural images typically depict coherent objects, like cars or animals, with well-defined shapes and boundaries. In contrast, meteorological data contains different meteorological variables which vary smoothly yet less coherently across the domain. Therefore, we expect that more skillful models can be obtained by tailoring model architectures explicitly to meteorological data.
For the present application, we argue that near-surface wind systems result from a complex interplay between large-scale weather situation, i.e., continental-scale pressure distribution, and boundary-layer processes at finer horizontal scales. The correct treatment of physical processes at varying scales, therefore, appears as an important aspect in downscaling wind fields on extended spatial domains. This motivates the use of a model architecture that is not restricted to a single resolution scale for feature extraction, but uses different resolution stages to understand the data on multiple scales.
To account for this, we propose to use a U-Net architecture (Ronneberger et al., 2015) with residual connections (He et al., 2016) for downscaling, which we call Deep Residual U-Net (DeepRU). The U-Net architecture enables an efficient extraction of multi-scale features by design. It consists of two symmetric branches, which are connected by skip connections for simplified information transfer: an encoding branch, on which the data is encoded into an abstract reduced feature representation, and a decoding branch, on which the feature representations are then decoded to reconstruct wind fields at fine-scale target resolution.
During the encoding stage, the number of grid points is successively reduced, while at the same time the number of feature channels per grid point is increased. In this way, patterns of larger spatial extent can be extracted with small-size convolution kernels.
During the decoding stage, the features are super-sampled to a finer scale, while reducing the number of feature channels. The skip-connections enable a direct information flow between encoding and decoding stages at equal resolutions. By concatenating features from the encoding branch with corresponding features on the decoding branch before further processing, details in the data that could get lost during the compression process can be preserved and localized precisely. In recent work, the U-Net architecture has also been employed for super-resolution tasks (e.g., Lu and Chen, 2019).
DeepRU is a six-stage U-Net architecture with both residual and skip connections at every resolution stage (see Figure 7).
Similar to FSRCNN and EnhanceNet, we use the input blocks depicted in Figure 5 to transform the inputs to 64 primary low-resolution feature channels. We then super-sample the features using bilinear interpolation, to match the high-resolution grid of size 144 × 180. This option gave the most accurate downscaling results among a variety of alternatives we have tried. The high-resolution features are then fed into the adapted U-Net architecture. We use strided convolutions to downsample the features during encoding and bilinear interpolation with a successive convolution layer to increase the resolution again during decoding.
At each resolution stage, we apply batch normalization and leaky-ReLU activation before passing features to a residual block, as depicted in Figure 6. The residual blocks, originally proposed by He et al. (2016), have been slightly modified for the downscaling task. We find that extending the original residual block by another convolution layer before the addition operation leads to an increase in flexibility of the residuals, which translates to a better overall model performance.
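A sketch of this modified residual block, assuming (3, 3) kernels and leaky-ReLU activations as described above; the exact ordering of the layers inside the block is our assumption.

```python
import torch
import torch.nn as nn

class DeepRUResidualBlock(nn.Module):
    """Residual block extended by a third convolution before the addition,
    as described above (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),  # extra layer before addition
        )

    def forward(self, x):
        return x + self.body(x)

block = DeepRUResidualBlock(64)
print(block(torch.randn(1, 64, 36, 60)).shape)   # torch.Size([1, 64, 36, 60])
```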
We implemented skip connections so that a new combined input can be formed by concatenating the features from the encoding stage to the corresponding super-sampled features in the decoding stage. The combined input is then processed by a single convolution layer with batch normalization and leaky-ReLU activation to further reduce the number of feature channels.
The reduced features are finally passed to an additional residual block. After the last residual block at the target resolution in the decoding stage, a convolution layer is added to output a set of features, which are added to a bicubic interpolant of the low-resolution winds, resulting in the final wind field prediction.

| Localized Multi-Linear Regression Model: LinearEnsemble
To enable a comparison of the CNN models with more classical approaches, we also consider a model that is based on standard multi-linear regression instead of successive convolutions. Due to simplicity and interpretability, multi-linear regression models are frequently used in downscaling and post-processing tasks (e.g., Eccel et al., 2007;Fowler et al., 2007;Gaitan et al., 2014).
For multi-linear regression models, Eq. (1) can be rewritten in simplified form as

y = W x + b,    (2)

wherein W is a (c_Y d^(HR) × c_X d^(LR))-shaped matrix of weight parameters capturing linear relationships between flattened predictor vectors x ∈ ℝ^(c_X d^(LR)) and flattened predictand vectors y ∈ ℝ^(c_Y d^(HR)), and b ∈ ℝ^(c_Y d^(HR)) is a vector of offset parameters.
Again, c_X and c_Y denote the number of predictor and predictand variables per grid node, and d^(LR) and d^(HR) the numbers of nodes in the low-resolution and high-resolution grids, respectively. For our comparison, we limit the number of trainable parameters to O(k · d^(HR)), for some user-defined constant k ≤ d^(LR).
An ensemble of multi-linear regression models is trained, where each model uses the k nearest nodes from the low-resolution input to predict the wind components U and V at a single grid node of the high-resolution domain. This corresponds to an induced sparsity pattern on W, which allows at most k · c_X · c_Y · d^(HR) entries of W to be non-zero.
In contrast to the CNNs, we train only two different variants of the model ensemble. In a first step, we use only the low-resolution wind components U and V to inform the model, resulting in a channel number of c_X = 2. In a second step, we also add the complementary low-resolution dynamic predictors BLH, FSR, and Z500, resulting in a total of c_X = 5 predictor channels. Static predictors are not included in the training process, as the resulting contributions in Eq. (2) would be identical among samples, and can thus be incorporated into the offset vector b without loss of information. The k nearest low-resolution grid nodes are determined based on the standard L1 distance (in latitude-longitude space) to the target node. We empirically determined that neighborhood sizes beyond k = 16 did not improve the results significantly in our application.
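One ensemble member can be fitted with ordinary least squares, as in the following sketch; the helper name and the use of np.linalg.lstsq are our choices, and the offset b is absorbed as a constant column.

```python
import numpy as np

def fit_node_model(X_lr, y_node, neighbor_idx):
    """Least-squares fit of one ensemble member (sketch).

    X_lr: (n_samples, c_X, d_LR) low-resolution predictors;
    y_node: (n_samples, 2) high-resolution U/V target at one grid node;
    neighbor_idx: indices of the k nearest low-resolution nodes.
    """
    n = X_lr.shape[0]
    A = X_lr[:, :, neighbor_idx].reshape(n, -1)        # (n, c_X * k) predictors
    A = np.hstack([A, np.ones((n, 1))])                # constant column absorbs b
    coef, *_ = np.linalg.lstsq(A, y_node, rcond=None)  # (c_X * k + 1, 2) weights
    return coef

# Example with synthetic data: k = 16 neighbors, c_X = 5 predictor channels.
coef = fit_node_model(np.random.randn(500, 5, 1918),
                      np.random.randn(500, 2), np.arange(16))
```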

| TRAINING METHODOLOGY
The time range of about three years that is covered by our data is comparatively short when set in relation to the time scales commonly used to define "climatology". Moreover, temporal correlations between successive samples limit the number of independent examples of weather situations across the domain. This raises the need for efficient data splitting using cross validation, and for suitable methods to increase the number of training samples. In the following, we shed light on the training methodology and loss functions used in our experiments, and provide details on the optimization process.

| Cross Validation
For all models, including LinearEnsemble, we employ cross-validation with three cycles of model training and validation. In each cycle, we exclude a subset of the data from training. As the data exhibits both short-term temporal correlations on time scales of up to a few days and variations due to seasonality, we decided to pick full consecutive years of data for validation. This minimizes information overlap between training and validation data due to systematic correlations at the beginning and end of each validation period.

| Patch Training
To further increase diversity and variance of training samples, we perform CNN training on sub-patches of the full domain. This procedure limits the dimensionality of the model inputs, thus forcing the models to base their predictions on local information, and reducing the chance of over-fitting to statistical artefacts in the data. Specifically, fitting of potentially non-physical long-distance correlations is efficiently avoided.
From another perspective, patch training is advantageous due to an improved usage of static predictor information in comparison to full-domain training. Static predictors remain invariant when training on the full domain and can, thus, be ignored by the network or be leveraged to establish a network operation mode of local pattern matching instead of regression. In such a mode, models might learn to associate the invariant topography with preselected local patterns, learned by heart, instead of using the provided dynamic information for regression.
Confirming our expectations, we found that patch-trained models yield lower training and validation losses compared to models trained on the entire domain. Experiments show that intermediate patch sizes yield the best training results. For very small patch sizes, we observe a decrease in prediction quality, which may be attributed to a loss of context information due to insufficient data supply. These findings may also be related to the concept of the minimum skillful scale of the underlying low-resolution simulation (Benestad et al., 2008), i.e., the smallest spatial domain size, for which the low-resolution data provides a sufficient amount of information for the downscaling model to generate skillful predictions.
In our experiments, low-resolution data was processed in patches of size 24 × 36 and matched with the corresponding high-resolution patches of size 96 × 108. This was found to yield the most accurate full-grid predictions when applied to validation samples. The sub-patches for training were selected randomly for each predictor-predictand data pair and each training step, so that the induced randomness further decreases the chance of over-fitting to the training input. Note, however, that patching was applied exclusively during training of the models. For validation and evaluation of model performance, predictions were computed based on the full domain.
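Random patch selection during training can be sketched as follows, using the patch sizes and the 4 × 3 resolution ratio stated above; function and variable names are ours.

```python
import torch

def sample_patch(x_lr, y_hr, p_lat=24, p_lon=36):
    """Randomly crop matching low-/high-resolution patches (sketch).

    Uses the 24 x 36 low-resolution patch size stated above and the fixed
    4 x 3 resolution ratio to locate the corresponding 96 x 108 target patch.
    """
    i = torch.randint(0, x_lr.shape[-2] - p_lat + 1, (1,)).item()
    j = torch.randint(0, x_lr.shape[-1] - p_lon + 1, (1,)).item()
    lr = x_lr[..., i:i + p_lat, j:j + p_lon]
    hr = y_hr[..., 4 * i:4 * (i + p_lat), 3 * j:3 * (j + p_lon)]
    return lr, hr

lr, hr = sample_patch(torch.randn(5, 36, 60), torch.randn(2, 144, 180))
print(lr.shape, hr.shape)   # torch.Size([5, 24, 36]) torch.Size([2, 96, 108])
```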

| Loss functions
For measuring error magnitude between predictions and high-resolution targets, we consider different deviation measures, which put weight on distinct aspects of reconstruction accuracy. For optimization purposes, we consider spatially averaged deviation scores, whereas for further evaluation, we consider both average and local deviations.
Given that t_i and y_i represent the target and predicted wind vectors at node i, with 1 ≤ i ≤ d^(HR) indexing the nodes of the high-resolution grid, we consider, in the first place, the mean square error (MSE),

MSE(t, y) = ⟨ |t_i − y_i|² ⟩_D ,

where |·| indicates the standard L2 vector norm, and ⟨·⟩_D denotes an average over the spatial domain. The main advantage of MSE is its invariance with respect to rotations of local vector directions, i.e., predictand-prediction pairs which differ only by node-wise rotations of wind directions are assigned an identical deviation score.
However, a potential drawback of MSE is that local deviation scores scale quadratically with wind magnitude (the significance of this ultimately depends on the application). In particular, small-angle deviations in areas of large wind speeds may contribute strongly to the overall deviation score, whereas some strong directional deviations, such as opposite wind directions in areas of low wind speed, are hardly taken into account. This problem becomes particularly serious in scenarios where slow but strongly variable winds over mountainous areas are accompanied by increased wind speeds over the sea.
A solution to weaken the square dependence effect is to linearize MSE, resulting in the mean absolute error (MAE).
Unfortunately, even MAE does not fully overcome the scaling issue and inherits the problems of MSE. Considering angular deviations instead, for instance as measured by cosine dissimilarity, does not provide an alternative either, since angle-based deviation measures do not provide the model with information on differences in wind speed magnitude. A potential alternative would be to use a weighted average of the above-mentioned deviation metrics. However, we refrained from using such metrics as this would require an optimization of additional ad-hoc hyper-parameters.
An effective solution is to use the standard MSE and reduce spatial inhomogeneity through node-wise standardization of the target predictands. The models then learn to mimic a reduced representation of the non-standardized predictands, which can be converted back to true scale through the inverse of the linear standardization transform. As stated in Sect. 3, sample mean and standard deviation are computed from the respective training data set. For validation and evaluation purposes, we convert back to real-scale target predictands and predictions.
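Combining the pieces above, the training loss might look as follows: an MSE over standardized wind vectors, restricted to valid (non-padded) nodes via the binary mask from Sect. 3.4. This is a sketch; the tensor layouts are our assumptions.

```python
import torch

def masked_mse(pred, target, mask):
    """MSE over valid nodes only (sketch). pred/target: (batch, 2, lat, lon)
    standardized wind components; mask: (lat, lon) boolean, True where valid."""
    sq = ((pred - target) ** 2).sum(dim=1)            # |t_i - y_i|^2 per node
    return (sq * mask).sum() / (mask.sum() * pred.shape[0])

loss = masked_mse(torch.randn(4, 2, 144, 180),
                  torch.randn(4, 2, 144, 180),
                  torch.ones(144, 180, dtype=torch.bool))
```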

| Implementation and Optimization
All models have been implemented and evaluated in PyTorch (Paszke et al., 2019). Optimization is performed using the ADAM optimizer (Kingma and Ba, 2014) with an initial learning rate of 10^-3, which is reduced by a factor of 0.1 whenever the validation loss in terms of MSE does not decay by more than a fraction of 10^-4 over a period of 5 training epochs. The process is continued until a minimum learning rate of 10^-6 is reached. To guarantee a proper convergence of the models, we train for 150 epochs in each of the three runs per cross-validation cycle, without early stopping. Saturation of training and validation losses was usually achieved after 50-60 epochs, and both training and validation losses showed only minor variations beyond that point. In particular, we did not observe tendencies of additional over-fitting once the models converged.
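In PyTorch, the described setup corresponds to ADAM combined with a ReduceLROnPlateau schedule, as sketched below; the stand-in model and epoch helper are hypothetical, and the weight decay of 10^-4 anticipates the regularization settings described in the next subsection.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(2, 2, 3, padding=1)        # stand-in for any of the CNN models

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5, threshold=1e-4, min_lr=1e-6)

def run_epoch():
    """Hypothetical stand-in for one epoch of training plus validation;
    would return the validation MSE."""
    return torch.rand(1).item()

for epoch in range(150):                     # fixed budget, no early stopping
    val_mse = run_epoch()
    scheduler.step(val_mse)                  # reduce lr when the MSE plateaus
```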

| Regularization
During training, we employ weight decay with a rate of 10^-4 (Kingma and Welling, 2013). Additionally, the non-linear convolutional models use batch normalization (Ioffe and Szegedy, 2015) after each convolution operation, which we find to significantly accelerate training convergence. For DeepRU, we apply two-dimensional dropout regularization (Srivastava et al., 2014) with a dropout rate of 0.1 after each residual block; i.e., succeeding each residual block, a fraction of 0.1 of the respective output feature channels is selected randomly and set to zero. Although earlier studies reported performance issues when using batch normalization and dropout regularization in combination (see e.g., Li et al., 2019), we did not encounter any such negative effects.

| EVALUATION
To compare the different model architectures with respect to downscaling performance, we consider sample-wise deviations between target predictands and model predictions, and investigate the extent to which the predictions depend on particular predictors. To shed light on the importance of the choice of predictors, the CNN models are trained with four different predictor configurations: low-resolution wind fields and orography only; supplementary high-resolution orography predictors; additional low-resolution dynamic predictors; or the full set of predictors. The predictor settings are detailed in Table 1 and indicated with letters (A) through (D).

TABLE 1: Predictor configurations for model trainings with varying combinations of low-resolution (LR) and high-resolution (HR) predictors. c_X^(LR) and c_X^(HR) denote the total numbers of low-resolution and high-resolution predictor fields supplied to the models.
Exceptions from this strategy arise for DeepSD and LinearEnsemble. In the case of DeepSD, we refrain from suppressing the use of high-resolution static predictors, in order to stay close to the original implementation, which included high-resolution orography predictors by design. Therefore, for DeepSD, we only consider configurations (B) and (D). For LinearEnsemble we exclude static predictors in both low and high resolution, as by design the model does not take advantage of static predictors (see Sect. 4.6); we therefore consider only configurations (A) and (C).

| Run-time Performance and Memory Requirements
A general overview of the model performances with respect to the number of trainable parameters, memory consumption, and computational time for yearly or daily predictions is provided in Table 2. LinearEnsemble is exceptional here, as memory limitations arise from the need for rapidly accessible storage of the training data, rather than from optimization computations. As the nearest-neighbor positions vary irregularly with spatial position, data selection for LinearEnsemble cannot be realized through efficient array-slicing operations, as is the case for CNNs.
Nearest-neighbor indexing has to be performed for each linear model separately and was found to be too slow to be conducted at training time. As a result, data for the LinearEnsemble had to be preselected and stored with high redundancy during training.
For the full ensemble of 20416 linear models with 16 nearest neighbors, the 3-year dataset, including all low-resolution dynamic predictors, required the allocation of roughly 137 GiB of memory, which cannot feasibly be stored in RAM on a local machine with typically less than 32 GiB available. Hence, the data was outsourced to a separate HDF5 file and streamed from the hard drive during training, which results in, by a large margin, the longest training time among all trained models. The training times for the remaining models scaled with model complexity, with the longest observed for the most complex model, DeepRU.
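Out-of-core streaming of the preselected data can be sketched with h5py as follows; the dataset names and shapes are hypothetical:

```python
import h5py
import numpy as np

def stream_batches(path, batch_size=1024):
    """Yield mini-batches read directly from disk instead of RAM."""
    with h5py.File(path, "r") as f:
        predictors = f["predictors"]   # hypothetical layout: (samples, features)
        targets = f["targets"]         # hypothetical layout: (samples, 2)
        for start in range(0, predictors.shape[0], batch_size):
            sl = slice(start, start + batch_size)
            # Slicing an h5py dataset triggers the actual (slow) disk read.
            yield np.asarray(predictors[sl]), np.asarray(targets[sl])
```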
In contrast to the above, and for reasons of fair comparison, the computational time for model prediction (PR) is measured using a batch size of 220 for all networks; note that timings for loss computations and optimization are not included in the measurements. To compute the total time for model predictions, we use Python's timer module to measure the plain time required by a model to perform downscaling on all input hours of one year, in our case 8760 hours. As timings are often distorted by hardware communication and process management, we conducted 3 measurement runs for all models and averaged the results to obtain the final total prediction time. The time for a single-hour prediction is given by the ratio between the total computational time and the total number of inputs. We found that the measured time increased with model complexity, with the highest computational cost for DeepRU.
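The measurement procedure can be sketched as follows, with time.perf_counter standing in for the timing facility used:

```python
import time
import torch

@torch.no_grad()
def time_predictions(model, inputs, batch_size=220, runs=3):
    """Average total prediction time over several runs; returns (total, per-input)."""
    model.eval()
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        for i in range(0, inputs.shape[0], batch_size):
            model(inputs[i:i + batch_size])   # plain forward pass, no loss/optimizer
        totals.append(time.perf_counter() - start)
    total = sum(totals) / len(totals)
    return total, total / inputs.shape[0]
```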
Regarding the number of trainable parameters, the deeper non-linear models EnhanceNet and DeepRU comprise a significantly larger number of convolutional layers than the remaining models and, thus, require more memory to store the trained parameters. Consequently, the overall memory consumption scales with model complexity (see MEM column in Table 2). Despite the higher memory consumption of the non-linear models, in particular DeepRU, we found that they achieved the overall best results in our experiments, which is further discussed in the following sections.

| Quantitative Analysis
The statistics of spatially averaged MSE on the validation data are illustrated in Figure 8, confirming that both model architecture and predictor configuration affect the downscaling performance. The use of additional predictors, which proved to be useful for all the non-linear models, appears to have no effect on the performance of LinearCNN.
The model appears unsuited to extract useful correlations between low-resolution predictors and high-resolution wind fields. The reason for this is the restrictive parametrization scheme, which is unable to capture the random offsets and distortions between low- and high-resolution field variables caused by the data padding procedure (see Sect. 3.4). As the same linear model is shared across the entire domain, LinearCNN is forced to yield a single most-likely estimate, which, however, is inaccurate for most of the grid nodes and poor in spatial detail.
In contrast, LinearEnsemble takes advantage of the local parameterization and achieves considerably better results, comparable with or better than those of the non-linear models DeepSD and FSRCNN. The gain in performance, however, comes at the expense of a higher tendency of the model to overfit on the training data. Especially for model variants with a large number of predictors, either due to the use of additional dynamic predictors or a larger environment size k, one observes severe over-fitting.
This is visible also in Figure 8, as the maximum reconstruction error of LinearEnsemble models with the full predictor set (UV, Dyn, Oro(LR) and UV, Dyn, Oro(LR, HR)) exceeds the maximum error of even LinearCNN. L₂-regularization did not improve generalization performance, but increased the reconstruction error on both training and test data. For the non-linear models, in contrast, over-fitting could be minimized through weight decay during optimization, which has an effect similar to L₂-regularization, and through dropout regularization.
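The near-equivalence of weight decay and L₂-regularization is easiest to see for plain gradient descent; for Adam the correspondence is only approximate, since the penalty gradient is rescaled by the adaptive moments:

```latex
% Gradient descent on L(w) + (\lambda/2)\,\lVert w \rVert^2 with step size \eta:
w_{t+1} = w_t - \eta\,\nabla L(w_t) - \eta\lambda\, w_t
        = (1 - \eta\lambda)\, w_t - \eta\,\nabla L(w_t)
% i.e., each update multiplicatively decays the weights by the factor (1 - \eta\lambda).
```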
In agreement with earlier studies, FSRCNN achieves smaller downscaling errors than DeepSD. The quality of the downscaled wind fields, however, is slightly below that of the LinearEnsemble model for all predictor variants under consideration.
Nevertheless, prediction quality can be further improved by considering more complex models. EnhanceNet, which differs from FSRCNN by an increased number of convolution layers and the use of residual connections in combination with bicubic downscaling as additive baseline estimate, is the first model to surpass the performance of LinearEnsemble. Notably, EnhanceNet achieves slightly worse results than LinearEnsemble when omitting the high-resolution orography predictors, but catches up after adding the high-resolution predictors. The same is true for DeepRU, which achieves another reduction of MSE.
Comparing DeepRU and LinearEnsemble directly, we find that DeepRU not only reduces the MSE, but also takes advantage of additional predictors more effectively. Whereas LinearEnsemble responds with an increased tendency to overfit, DeepRU achieves a reduction in deviation score when supplied with high-resolution static and low-resolution dynamic predictors.
Specifically, model configuration (D) of DeepRU is the most accurate model in our comparison, with an average MSE of around 2.7 (ms⁻¹)².
FIGURE 9 Mean magnitude difference (top row) and mean cosine deviations (bottom row) between the target high-resolution forecast and the low-resolution forecast simulation (left), the prediction of LinearEnsemble (middle), and DeepRU (right). The average is taken over all 3 validation years.

| Spatial distribution of prediction errors
To examine the spatial distribution of reconstruction errors, we consider additional angular and magnitude-specific deviation measures, which we average over the sample distribution instead of the spatial domain. Specifically, we consider the cosine dissimilarity (CosDis), CosDis(t_i, y_i) = (1/2) ⟨1 − cos(t_i, y_i)⟩_X, for angular deviations between target predictands t_i and predictions y_i. Systematic deviations in wind speed are measured in terms of the magnitude difference (MD), which quantifies how much the respective models over- or underestimate wind speed magnitudes. In both measures, ⟨·⟩_X indicates the sample average over the validation sets of the 3 cross-validation cycles, respectively.

Figure 9 shows the spatial distribution of magnitude difference and cosine dissimilarity for low-resolution forecasts interpolated bilinearly to the high-resolution grid, as well as for the outputs of the best-performing DeepRU and LinearEnsemble models, relative to the high-resolution forecasts. Regarding the low-resolution simulation, velocities in specific regions near the coasts are not well captured and are mainly underestimated, with magnitude shifts greater than 1.0 ms⁻¹. Angular deviations are more pronounced in mountainous areas. Typical values of cosine dissimilarity range between 0.25 and 0.30, which corresponds to average deviation angles of more than 40°. In the northern part of the Mediterranean Sea, the magnitude difference plot for the low-resolution simulation suggests checker-board-like artifacts, which, however, are most likely due to a mismatch in spatial resolution and grid structure between the low-resolution and high-resolution grids, as well as the use of bilinear interpolation for visualization purposes.
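Both measures are straightforward to compute from paired wind fields; below is a sketch assuming arrays of shape (samples, 2, H, W), with the sign convention for MD (prediction minus target) being our assumption:

```python
import numpy as np

def cosine_dissimilarity(t, y, eps=1e-9):
    """Per-node CosDis in [0, 1]; 0 means perfectly aligned wind vectors."""
    dot = (t * y).sum(axis=1)
    norms = np.linalg.norm(t, axis=1) * np.linalg.norm(y, axis=1) + eps
    return 0.5 * (1.0 - dot / norms).mean(axis=0)   # sample average per grid node

def magnitude_difference(t, y):
    """Per-node MD; positive values indicate overestimated wind speeds."""
    return (np.linalg.norm(y, axis=1) - np.linalg.norm(t, axis=1)).mean(axis=0)
```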
In contrast to the low-resolution simulation, LinearEnsemble tends to underestimate wind magnitudes, on average, at all grid nodes. We expect that this is mainly caused by an underestimation of extreme winds by LinearEnsemble, which is a common problem of multilinear models optimized to minimize MSE losses (e.g., Bishop, 2006). As expected, cosine deviations for LinearEnsemble are much lower than for the low-resolution simulations. However, in areas close to the mountains, LinearEnsemble fails to properly predict extreme shifts in both magnitude and direction, for example those induced by ridge lines.
DeepRU shows overall better performance, with the lowest cosine and magnitude deviations. Prediction errors exhibit a spatial pattern similar to LinearEnsemble, but with generally smaller amplitudes. Furthermore, DeepRU outperforms LinearEnsemble in capturing local variance in wind speed magnitude and direction. As a result, magnitude differences appear less uniform, with over- and underestimation in flat areas and near the domain boundaries, where errors are caused by imperfect information due to convolution padding. In the Mediterranean Sea, magnitude errors show large-scale wave-like patterns, which, especially north of Corsica and east of Sardinia, resemble ringing artifacts due to the Gibbs phenomenon (Gibbs, 1898). This relates to the model's spectral representation of topography; issues arise in regions adjacent to where steep slopes meet flat land or sea. In fact, the provided topographic height fields contain very similar patterns, and altitudes over the sea appear invalid.

| Analysis of Feature Importance
For the model configurations trained on the full set of predictors (D), we also investigate the importance of particular predictors according to the method proposed by Breiman (2001). For this, we perturb the model inputs from the validation data set by randomly shuffling single predictors, and then measure the change in prediction error caused by the perturbation.
Let X = {x_1, . . . , x_t, . . . , x_T} be the (plain) validation dataset for the respective model run, with data samples x_t = (x_t^(1), . . . , x_t^(P)) comprising P predictor fields. For each predictor p, we shuffle the corresponding fields x_t^(p) randomly across all samples and define the change ratio ρ_t^(p) as the prediction error on the perturbed data relative to the error on the unperturbed data. Figure 10 illustrates the sample statistics of ρ_t^(p) for all models and predictors. Note that the more complex models achieve smaller errors on the unperturbed data; a similar increase in prediction error in terms of absolute deviation score therefore yields a larger change ratio for more complex models. This implies that the change ratios ρ_t^(p) should be interpreted in a model-specific context.

Assessing the relative importance of the remaining predictors, we find that the least information is extracted from FSR, as perturbations in this predictor hardly affect any of the models. As FSR is provided on the same coarse grid resolution as the predictor winds, all the information it provides could already be encapsulated in the winds themselves, so that most models learn to ignore the redundant information. Interestingly, LinearEnsemble is the only model that fits correlations between FSR and high-resolution winds, which may be related to the overfitting problem of the model. Perturbations in BLH also have only a slight impact on prediction performance. This was quite a surprising result, given that this quantity varies considerably over time, and that wind speeds at 100 m can be closely related to it, especially when BLH values are small. Z500 is leveraged mainly by the less complex models LinearCNN and DeepSD. Z500 provides information on large-scale weather patterns, and there is a known relationship between its gradients and 500 hPa geostrophic winds, which seems to be recognized most prominently by DeepSD. Nevertheless, direct links between Z500 and 100-m-winds tend to be relatively weak, which explains its minor impact on the performance of the other models.
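A sketch of this permutation procedure; the function below illustrates the change ratio ρ but is not the exact implementation used here:

```python
import numpy as np

def permutation_importance(predict, X, T, rng=None):
    """predict maps predictors (N, P, h, w) to predictions; T holds the targets.
    Returns one error ratio per predictor channel; values > 1 indicate importance."""
    rng = np.random.default_rng(rng)
    base_err = np.mean((predict(X) - T) ** 2)
    ratios = []
    for p in range(X.shape[1]):                        # loop over predictor channels
        Xp = X.copy()
        Xp[:, p] = Xp[rng.permutation(X.shape[0]), p]  # shuffle predictor p across samples
        ratios.append(np.mean((predict(Xp) - T) ** 2) / base_err)
    return np.array(ratios)
```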

| Analysis of Reconstructed Flow Patterns
The quantitative analysis provides high-level, abstract information on the overall downscaling performance of the models, yet it does not convey detailed information on the ability of the models to reproduce the complex flow patterns seen in the high-resolution simulation. To investigate this aspect in more detail, we select two example cases, which exhibit strong discrepancies between the low-resolution and high-resolution forecast simulations. The corresponding synoptic weather charts (Metcheck, 2020) are shown in Figure 13.

To visualize wind vector fields, we use Line Integral Convolution (LIC), introduced by Cabral and Leedom (1993). To generate a LIC visualization, a randomly sampled white-noise intensity image of user-defined resolution is convolved with a 1D smoothing kernel along streamlines of the vector field. Thus, while LIC generates high correlation between the intensities along a streamline, different streamlines are distinguished by low intensity correlation between them. In addition, color mapping is used to encode additional parameters, such as the local vector field magnitude. In contrast to alternative visualizations, such as vector glyphs or streamline plots, LIC provides a global and dense view of the vector field and avoids occlusion artifacts due to improper glyph size, as well as sparse sub-domains due to improper streamline seeding. A disadvantage of LIC is the ambiguity about which of two opposite directions is represented. To quantify local deviations, we consider cosine dissimilarity for wind direction and the absolute relative error (ARE) for wind speed, as well as the local L₂ deviation, which combines both aspects. Results for the outputs of the low-resolution simulation and model predictions are depicted in Figure 12.
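For illustration, a compact (and deliberately naive) LIC sketch with a box kernel and unit-step Euler streamline integration on a regular pixel grid; this is our own toy version of the algorithm, not the renderer used for the figures:

```python
import numpy as np

def lic(u, v, kernel_len=15, rng=None):
    """Convolve white noise along streamlines of the vector field (u, v)."""
    rng = np.random.default_rng(rng)
    h, w = u.shape
    noise = rng.random((h, w))               # white-noise input texture
    out = np.zeros_like(noise)
    for i in range(h):
        for j in range(w):
            acc, n = noise[i, j], 1
            for sign in (1.0, -1.0):         # trace the streamline forward and backward
                x, y = float(j), float(i)
                for _ in range(kernel_len // 2):
                    vx, vy = u[int(round(y)), int(round(x))], v[int(round(y)), int(round(x))]
                    speed = np.hypot(vx, vy)
                    if speed < 1e-9:
                        break
                    x += sign * vx / speed   # unit Euler step along the flow
                    y += sign * vy / speed
                    if not (0 <= round(y) < h and 0 <= round(x) < w):
                        break
                    acc += noise[int(round(y)), int(round(x))]
                    n += 1
            out[i, j] = acc / n              # box-kernel average along the streamline
    return out
```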
Based on the quantitative evaluation of all models in Sect. 6.2, it can be conjectured that both LinearEnsemble and DeepRU reconstruct meaningful downscaling results, with DeepRU achieving overall better prediction quality in scenarios of high inhomogeneity. As seen, e.g., in cases A (Adriatic Sea) and B (Austrian Alps) in Figure 11, LinearEnsemble tends not to reconstruct the flow features when there is a pronounced flow pattern mismatch between the low-resolution and high-resolution forecast simulations. DeepRU, in contrast, uses both local and global information about the orography, and presumably additional parameters, and is able to better replicate the HRES wind fields. Over the Adriatic Sea (A), for example, the HRES winds are mainly northwesterly, tangential to the coast, and regions of higher wind speed are more pronounced. LinearEnsemble relies solely on local information in the low-resolution fields and is not able to reconstruct the ground truth faithfully.
In areas of complex surface topography, such as near the Austrian Alps (B), variations in wind speed and direction are usually more pronounced, as wind fields are highly influenced by surface interactions. Here, both models learn a reasonable mapping and handle these cases quite well. According to the cosine dissimilarity (Figure 14 (a)), DeepRU performs slightly better than LinearEnsemble in terms of direction predictions. Also, DeepRU is better able to replicate extreme transitions in magnitude occurring on small spatial scales, which results in smaller relative and L₂ errors (see Figure 14 (b) and (c)).
A scenario with generally stronger and rather laminar flow, which exhibits some large differences in wind speed magnitude, is given in C, where fine-scale mountains slow down winds in eastern France. Since fluctuations in wind direction are small in this area, both models exhibit small errors in wind direction overall. Nonetheless, LinearEnsemble is unable to account for orography-mediated flow adjustments on small spatial scales, whilst DeepRU predicts deviations from laminar flow more precisely. This is also clearly demonstrated by the absolute relative errors in Figure 14 (b).
The second example is for March 19, 2017, 01:00 UTC. Figure 15 depicts LIC plots of the wind fields for the simulations and predictions, similar to Figure 11. As illustrated in Figure 13 (b), the weather pattern over our domain is dominated by an Alpine lee cyclone situated between Corsica and northwest Italy. Comparing the low-resolution and high-resolution forecast simulations, major parts of the flow are rather laminar, with high wind speeds of up to 18 ms⁻¹. Contrary to the low-resolution simulation, HRES exhibits sharper changes in magnitude over mountain ridges and edges, as well as stronger distortions in wind directions over the sea. Two particular cases with differences between the forecast simulations and the model predictions are highlighted in Figure 15 and labelled A and B.

FIGURE 14 Visualization of spatial deviations of the low-resolution simulation, LinearEnsemble, and DeepRU wind predictions compared with the output of the high-resolution simulation shown in Fig. 11. Here, the deviations are (a) cosine dissimilarity, (b) absolute relative error, and (c) L₂ norm.

FIGURE 15 Wind fields over Europe, as obtained from low-resolution and high-resolution short-range forecast simulations and model predictions for March 19, 2017, 01:00 UTC, similar to Fig. 11. Color-coding indicates the local wind velocity.
In case A, the outputs of both the low-resolution simulation and LinearEnsemble suggest a rather circular vortex pattern with moderate wind speeds over the Ligurian Sea, between the French Riviera and Corsica. The high-resolution simulation, in contrast, displays a distorted, more elongated flow pattern. Here, DeepRU elongates the flow around the vortex towards northern Italy and additionally enhances the southerly wind near the western coast of Corsica, which overall better mirrors the HRES prediction.
Case B emphasizes the wind field above northern Italy, where the flow is more inhomogeneous since regions of high wind speeds are interleaved with topographically triggered vortex structures.
Here, LinearEnsemble fails to capture the sharp magnitude changes seen in HRES along the mountain ridge of the Apennines and near the three marked lakes, which DeepRU reproduces considerably more faithfully.

| APPLICATIONS IN FORECASTING
As this was a proof-of-concept study, it was not intended that the CNN architectures trained here be used directly for forecasting. Indeed, the spatial resolutions of our predictor and predictand datasets are not competitive with current operational configurations. In Europe, for example, operations nowadays use global models with spatial resolution of ∼10-20 km, and for shorter leads, up to (e.g.) day 3, use limited area models (LAMs) with resolution of ∼1-4 km. Nonetheless, our results are sufficiently promising to provide a blueprint for future operational systems that successfully serve the needs of automated and forecaster-based predictions. So how might this work?
We would envisage first stepping down in scale, to use predictor and predictand resolutions of ∼5-10 km and ∼1-2 km, respectively. Regarding the predictors, international modelling centres such as ECMWF will upgrade their global ensembles to this resolution range in the next few years. Regarding the predictand, this is needed only for training, and so need not be run operationally in real time. One could use, for example, a 1- or 2-year global reanalysis-style dataset, similar to that described in Dueben et al. (2020), but created with repeated observation-based initialisations. This would deliver worldwide downscaling options for any region the user selected. An alternative would be to use real-time LAM output for any region for which it is available.
Real-time CNN predictions realised via this route could be used in different ways. Where no LAM coverage exists, predictions could be delivered for short- and medium-range lead times. Where LAMs are available, usage would focus on the medium range, and if the same LAM were used for training, this could nicely provide continuity across the LAM-global temporal boundary.
Another difficulty to address, at least in ECMWF output, is the apparently poor representation of 10-m-winds over mountains, which is the reason we used 100-m-winds in this study. This may improve in future, but if not, the CNN approach is flexible enough that one could use 100-m-winds as predictor and 10-m-winds as target, provided the latter were better represented at 1-2 km resolution, for which there is some evidence, at least for LAMs (Figure 7 in Hewson, 2019).
There are numerous application areas that need better, locally refined wind speed predictions. Renewable energy is clearly one. Others include local pollutant dispersal, coastal and open-water shipping, rig operations, leisure activities such as sailing, aviation, the construction industry, and warnings in general. Applications for which mean speed predictions are important across the full speed range, such as renewables, will potentially benefit most. For applications with a focus on extremes, more investigation will be needed; the training period may not be sufficient, and predictions may be systematically too weak or become unstable. For very hazardous but less rare gap-flow phenomena, however, we can be more optimistic: here we expect CNN predictions to deliver major benefits for users compared to raw model output.
Society requires not only predictions of mean wind speeds, but also forecasts of gusts, particularly extreme gusts, because of the dangers posed to life and infrastructure. Gusts have not been directly explored in this study. One might be able to convert mean speeds into reasonable gust forecasts using empirically defined gust-to-mean relationships (see e.g. Ashcroft, 1994), developed for different land surface types, although caution is needed for cyclone-related gusts, which tend to be the major wind-related hazard in the vicinity of storm tracks (e.g. in northern and western Europe). Low-level stability and destabilization mechanisms, as outlined in Hewson and Neu (2015), are of paramount importance for determining the strengths of phenomena such as the cold jet, warm jet and sting jet (see also Browning, 2004). In that context it is curious that the BLH parameter used in our study, which relates directly to stability, did not add much predictive value for the CNNs. Our use of a region that is relatively remote from storm tracks may explain this.
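Such a conversion could be as simple as a lookup of surface-type-dependent factors; the values below are invented placeholders for illustration only, not taken from Ashcroft (1994):

```python
# Hypothetical gust-to-mean factors per land surface type (placeholder values).
GUST_FACTOR = {"open_sea": 1.3, "open_land": 1.5, "forest": 1.8, "urban": 2.0}

def gust_estimate(mean_speed, surface_type):
    """Estimate gust speed (ms⁻¹) from mean wind speed via an empirical factor."""
    return GUST_FACTOR[surface_type] * mean_speed
```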
It is important to re-iterate that airflow, and thus winds, can be very scale-dependent. On meter scales, speeds around city buildings vary dramatically, whilst on a lake the behavior of a yacht can be influenced by clumps of bushes nearby. Indeed, scale dependence is more acute for wind than for other parameters, such as rainfall and temperature. Thus, increases in model resolution bring with them more and more application areas for forecasts, particularly for regions that are topographically and/or physically complex. This in turn brings longevity, whereby the method outlined in this paper, and variants of it, can remain useful for the foreseeable future as numerical weather prediction models continue to evolve.

| CONCLUSION
In this study, we have analyzed convolutional neural networks for downscaling of wind fields on extended spatial domains.
By going from a simple linear CNN to deeper and more elaborate non-linear models, we have investigated how the network complexity affects downscaling performance. We have further compared the performance of different CNNs to that of an ensemble of localized linear regression models.
Our study has shown that the prediction accuracy of the linear ensemble model is higher than what can be achieved with shallow non-linear CNN architectures. Especially for simplistic non-linear models with only a few convolution layers, the non-linearity even appears to hinder performance. We attribute this to the distortion of the wind field by the non-linear activations on its way through the network, which prevents the model from benefiting from simple mapping schemes, such as interpolation kernels.
The use of overly simplistic and shallow non-linear models may be one reason why earlier studies found no additional value in applying CNN-based machine-learning methods (e.g., Eccel et al., 2007).
Deeper and more complex network models, on the other hand, are able to discover skillful mappings by exploiting non-linear correlations for modelling the relationship between low-and high-resolution fields. Specifically, we found that all non-linear models in our study take advantage of additional high-resolution static predictor data, such as information on local orography. In comparison, the use of 3 pre-defined low-resolution dynamic predictors gave only minor improvements.
Among the non-linear CNNs, we identified EnhanceNet, previously proposed for single-image super-resolution, as a deep CNN that was able to beat the baseline linear ensemble model. With DeepRU, we proposed a novel deep residual U-Net architecture, which outperformed both the linear model and EnhanceNet in terms of reconstruction accuracy. The major advantage of DeepRU lies in its ability to process features at different spatial scales. This is particularly useful for the downscaling of wind fields, where local wind systems have to be consistent with large-scale flow patterns. Although we still observe some deviations between high-resolution model predictions and native high-resolution forecast simulations, we are confident that convolutional neural networks can provide promising downscaling results and add more value than linear models at a reasonable computational cost.
We conclude that deep CNN approaches are particularly effective for downscaling with high magnification ratios on large spatial domains. In this setting, the use of classical models becomes computationally inefficient, and linear link functions between predictor variables and predictands become insufficient to account for non-trivial variability in the local flow, e.g., due to pronounced flow distortion around obstacles. We found that deep CNNs are better suited for replicating this variance, especially in mountainous areas or over the sea near coasts, and expect that the same holds true at finer spatial scales.
An interesting question for future research is how to consider a more accurate treatment of the domain geometry in CNN-based downscaling. While data padding was found to be well-suited for reshaping irregular grids on domains of up to a few thousand kilometers of horizontal extent, increasing domain size even further may lead to distortion artifacts due to disregard of the spherical geometry of the Earth's surface. The same is true for interpolation-based resampling methods, where the horizontal spacing of the sampling points varies with latitude, limiting data resolution close to the equator and enforcing data redundancy closer to the poles. The use of more appropriate convolutional model architectures, like spherical CNNs for unstructured grids (e.g., Jiang et al., 2019), or geometric deep-learning approaches (e.g., Bronstein et al., 2017) in general, may help to overcome such limitations, thus increasing physical plausibility and data efficiency of the models.
From the exciting perspective of real-time application, one would ideally want to step down in scale and apply the results of this proof-of-concept study in a finer-resolution setting. We envisage that operational real-time forecast runs, single deterministic and/or ensemble, could be downscaled in real time to 1-2 km, over any pre-selected domains, for customer applications. This could be activated on a central cloud-type platform, or locally by customers to meet their own needs. Given the small number of low-resolution predictors, data transfer requirements for the second option would be minimal compared to, say, transferring four-dimensional (full-atmosphere) fields for many variables.
At such very high target resolutions, the correct treatment of ambiguity in the data becomes increasingly important, since the same coarse-scale flow pattern may correspond to multiple fine-scale realizations. Similar to stochastic weather generators, generative CNN models like variational auto-encoders (Kingma and Welling, 2013) or convolutional generative adversarial networks (e.g., Goodfellow et al., 2014;Radford et al., 2015) may provide promising approaches for building flexible models for ensemble-based probabilistic downscaling. Moreover, if the low-resolution feed were based on ensemble data itself, one could then generate a super-ensemble (i.e. ensemble of ensembles) to provide the final smooth-format probabilistic output for users.