WRF-based dynamical downscaling of ERA5 reanalysis data for High Mountain Asia: Towards a new version of the High Asia Refined analysis

The High Asia Refined analysis (HAR) is a regional atmospheric data set gen-erated by dynamical downscaling of the Final operational global analysis (FNL) using the Weather Research and Forecasting (WRF) model. It has been successfully and widely utilized. A new version (HAR v2) with longer temporal coverage and extended domains is currently under development. ERA5 reanalysis data is used as forcing data. This study aims to find the optimal set-up for the production of the HAR v2 to provide similar or even better accuracy as the HAR. First, we conducted a sensitivity study, in which different cumulus, microphysics, planetary boundary layer, and land surface model schemes were compared and validated against in situ observations. The technique for order preference by similarity to the ideal solution (TOPSIS) method was applied to identify the best schemes. Snow depth in ERA5 is overestimated in High Mountain Asia (HMA) and causes a cold bias in the WRF output. Therefore, we used Japanese 55-year Reanalysis (JRA-55) to correct snow depth initialized from ERA5 based on the linear scaling approach. After applying the best schemes identified by the TOPSIS method and correcting the initial snow depth, the model performance improves. Finally, we applied the improved set-up for the HAR v2 and computed a one-year run for 2011. Compared to the HAR, the HAR v2 has


| INTRODUCTION
High Mountain Asia (HMA) is a geographic region that includes the Tibetan Plateau (TP) and its surrounding mountain ranges, such as the Himalayas, the Karakoram, the Tian Shan, and so on. Climate-triggered natural hazards pose a threat to human lives in HMA, for example, big landslides regularly occurring in the Fergana basin along the foothills of Tian Shan (Roessner et al., 2005). Landslides are predetermined by static factors that can be derived from surface characteristics but are triggered by dynamic factors, which are mainly extreme and prolonged rainfall, as well as earthquakes (Dai and Lee 2002;Hong et al. 2007;Kirschbaum et al., 2012). Another example of climate-triggered natural hazards is the Glacier Lake Outburst Flood (GLOF). Glacier thinning and retreat in the Himalayas, caused by rising air temperature, have resulted in the formation of new glacier lakes and the enlargement of existing ones (ICIMOD, 2011). The sudden discharge of water from these lakes is known as GLOF and leads to extensive damage in downstream villages .
Availability of climate data with a high spatial and temporal resolution is crucial for a better understanding of climatic triggering mechanisms of these localized hazards. However, in HMA, in situ meteorological observations are sparsely and unevenly distributed (Hasson et al., 2016), for example, only a few stations are available in the western TP due to the harsh environment and complex terrain. Moreover, the existing stations are commonly situated at lower altitudes close to valley-based settlements or airports. Thus, our knowledge of the climate at high elevations, that is, where most of the hazards mentioned above occur, is still limited. Global reanalysis data can provide evenly distributed climate data, but they are still too coarse to resolve fundamental processes over complex terrains, such as orographicallyinduced precipitation, and therefore, they are not suitable for applications at regional to local scales (Leung et al., 2003;Lo et al., 2008;Feser et al., 2011). Here, regional climate models (RCMs) applying dynamical downscaling method have great potential to overcome this problem.
The High Asia Refined analysis (HAR, Maussion et al., 2011;2014) is a regional atmospheric data set generated by dynamical downscaling using the Weather Research and Forecasting (WRF) model version 3.3.1 (Skamarock and Klemp, 2008) as RCM. The Final operational global analysis (FNL) from the National Centres of Environmental Prediction (NCEP) was used as forcing data. The HAR covers the period from October 2000 to October 2014 and is available in 30 km (3-hourly interval) and 10 km (hourly interval) resolution. The HAR provides detailed and accurate gridded climate data for HMA region. It has been comprehensively analysed, especially in terms of precipitation and atmospheric water transport Curio et al., 2015;Pritchard et al., 2019;Li et al., 2020) and has been successfully applied in many research fields, such as glacier mass balance modelling , snow and energy balance modelling (Huintjes et al., 2015), and so forth.
However, the short temporal coverage of the HAR makes it unsuitable for long-term and climatological studies. Moreover, its 10 km domain does not cover the whole TP and Tian Shan, which further limits its application within these two regions. Therefore, a new version (HAR v2) with extended temporal coverage and a larger 10 km domain is developed. The state-of-the-art ERA5 reanalysis data set (Copernicus Climate Change Service (C3S), 2017) from the European Centre for Medium-Range Weather Forecast (ECMWF) is used as forcing data. We switch to ERA5 because it will eventually cover the period from 1950 to near real time (currently available from 1979 to near real time), which is much longer than FNL (available from 1999 to near real time).
The overall goal of this study is to find an optimal model set-up for the HAR v2 to provide similar or even better accuracy as the HAR. During the development of the HAR, the sensitivity of simulated precipitation to different physical parameterization schemes (PPSs) was already thoroughly tested (Maussion et al., 2011). However, changes in forcing data and domain configuration may have a significant impact on model output (Miguez-Macho et al., 2004;Leduc and Laprise, 2009;Kala et al., 2015;Huang and Gao, 2018). Thus, the PPSs used in the HAR might not be suitable for the HAR v2, and therefore, the first objective of the current study is to investigate the sensitivity of simulated total precipitation (Prcp) and air temperature at 2 m above ground (T2) to different cumulus (CU), microphysics (MP), planetary boundary layer (PBL) and land surface model (LSM) schemes. The technique for order preference by similarity to the ideal solution (TOPSIS) method is applied to determine the best PPSs. Snow depth in ERA5 over the TP is reported to be largely overestimated (Orsolini et al. 2019). Snow depth is an important quantity in the initial condition, which later on determines surface albedo, alters surface energy balance, and influences T2. We assume that the overestimated snow depth in ERA5 leads to an underestimation of T2, and by correcting the bias in snow depth, the cold bias might be reduced. The second objective is to validate this assumption and to examine the model's sensitivity to initial snow conditions. Snow depth from the Japanese 55-year Reanalysis (JRA-55) data set is used to correct snow depth initialized from ERA5. The final objective is to apply the best PPSs identified by the TOPSIS method and the snow correction approach as the final set-up for the HAR v2, and to compare the two versions of the HAR.
2 | METHODOLOGY 2.1 | WRF model set-up of the reference experiment (Ref) WRF version 4.0.3 (Skamarock et al., 2019) was employed as RCM for our sensitivity studies. We chose January and July 2011 as simulation periods to consider atmospheric conditions in summer and winter, which are under different influence by monsoon and mid-latitude westerlies. Initial and boundary conditions were derived from ERA5 reanalysis data set with 0.25 spatial resolution and hourly temporal resolution. The domain setup ( Figure 1) consisted of two-way nested domains with 30 km and 10 km grid spacing (hereinafter: d30km and d10km). Only the output from d10km was used in this study. One might argue that ERA5 already has a high resolution of $32 km, so the parent domain (d30km) might not be necessary. However, a one-day experiment, which directly downscaled ERA5 to 10 km resolution (see Supporting Information), shows that the large-scale circulation patterns are distorted in the direct downscaling approach, and the 500 hPa wind field from two-way nesting approach is closer to the forcing data ERA5 ( Figure S1). Thus, we kept the d30km as parent domain. In the vertical direction, 28 Eta-levels were used. Lake surface temperature was substituted by daily mean surface air temperature using the avg_tsfc.exe module in WRF. The forcing strategy was daily re-initialization adopted from the HAR. Each run started at 12:00 UTC and contained 36 hr, with the first 12 hr as spin-up time. This strategy avoids the model from deviating too far from the forcing data and provides computational flexibility since daily runs are totally independent of each other and can be computed in parallel and in any sequence. PPSs used in Ref are the same as in the HAR. The model set-up for Ref are summarized in Table 1.

| Design of sensitivity experiments and evaluation methods
At first, we conducted four sets of experiments to examine the performance of different CU, MP, PBL and LSM schemes ( Table 2, hereinafter, PPS experiments). In each experiment set, except for the reference scheme adopted from the HAR, we chose two additional schemes. The selected schemes fulfil at least one of the following criteria: (a) they are commonly used in the WRF community; (b) they have excellent performance according to previous studies and (c) they were not tested in Maussion et al. performance were calculated, which include mean bias (MB), mean absolute error (MAE), root mean square error (RMSE) and Spearman correlation coefficient (r s ) for Prcp and T2, respectively. They are defined as follows (Wilks, 2006): Where P i and O i are simulated monthly averages and observed monthly averages at each station, and P i − O i is the bias score at each station; N is the total number of stations; R pi and R oi are the ranks of simulated and observed daily averages over all stations; M is the total

Physical parameterization schemes
Longwave radiation RRTM scheme (Mlawer et al., 1997) Shortwave radiation Dudhia scheme (Dudhia, 1989) Cumulus (CU) Grell 3D scheme (Grell, 1993;Grell and Devenyi, 2002) Microphysics (MP) Thompson scheme (Thompson et al., 2008) Planetary boundary layer (PBL) Mellor-Yamada-Janjic scheme (Janjic, 1994) Land surface model (LSM) Unified Noah land surface model (Tewari et al., 2004) Surface layer Eta similarity scheme (Janjic, 1994) T A B L E 2 number of days. MB demonstrates the systematic deviation of the model from the observations. However, MB could sometimes be misleading due to the offset of positive and negative values. MAE shows the average magnitude of the error. Since RMSE gives more weight to errors with larger absolute values and is more sensitive to outliers, it provides information about the variability of the error distribution (Chai and Draxler, 2014;Willmott and Matsuura, 2005). The distribution-free Spearman rank correlation coefficient (r s ) measures the extent to which simulated and observed values tend to change together. In order to identify the best scheme for each experiment set, MB, MAE, RMSE and r s were utilized as criteria in the TOPSIS method (Section 2.3). In the next step, the identified best schemes were combined to conduct two runs ( Table 2): one without modifying initial snow depth (COMB) and one with bias correction of snow depth (COMB_S). We used JRA-55 to correct snow depth initialized from ERA5 (Section 2.4). Finally, the improved set-up was applied as the best set-up for the HAR v2 for a one-year run in 2011. The resulting one-year data set was then compared with observations and the HAR.

| Technique for order preference by similarity to the ideal solution (TOPSIS)
The TOPSIS method was applied to identify the best scheme among the sensitivity experiments. It is a multiple criteria decision-making method and aims at finding the optimal decision when the alternatives are numerous and conflicting. It was proposed by Hwang and Yoon (1981) and later applied in several studies, such as ranking general circulation models (Raju and Kumar, 2014;Jena et al., 2015;Li et al., 2019) and identifying the best PPSs in WRF (Sikder and Hossain, 2016;Stergiou et al., 2017). The basic concept of TOPSIS is to determine the best alternative that has the shortest and longest distance from the positive and negative ideal solution. We applied this method for each experiment set. The process was carried out following the description in Tzeng and Huang (2011): 1 Define the weighted normalized evaluation matrix containing m alternatives and n criteria: where a ij represents the criterion j for alternative i; w j is the weight for each criterion. In our case, there are three alternatives (Ref and other two schemes, i = 1, 2, 3) and 16 criteria (MB, MAE, RMSE and r s for Prcp and T2 in January and July, j = 1, 2, …16).
Determine the positive ideal solution (PIS j ) and the negative ideal solution (NIS j ) for each criterion: where J 1 and J 2 represent the benefit criteria (larger is better) and the cost criteria (smaller is better).
Calculate the Euclidean distances between each alternative to PISs (D + i ) and to NISs (D − i ): Define the closeness coefficient (CC i ) for each alternative as: Rank the alternatives by their CC in descending order. The alternative with the highest CC is chosen as the best scheme.

| Bias correction of snow depth
A systematic cold bias has been found over high elevated areas in WRF simulations (e.g., Gao et al., 2015;Karki et al., 2017;Bonekamp et al., 2018), including the HAR (Maussion, 2014;Pritchard et al., 2019). Several studies addressed this cold bias to snow-related processes (Tomasi et al., 2017;Meng et al., 2018). Orsolini et al. (2019) compared snow depth over the TP in five recent global reanalyses using in-situ and satellite remote sensing observations. They found that ERA5 largely overestimates snow depth, especially during winter. Compared to its predecessor ERA-interim, snow depth in ERA5 is still much higher, because ERA5 does not assimilate snow cover from Interactive Multi-Sensor Snow and Ice Mapping System (IMS) for areas above 1,500 m (Orsolini et al., 2019). A realistic representation of snow depth in the forcing data is crucial to accurately simulate snow cover, surface albedo, surface energy balance and T2. To overcome this issue, we could use snow depth from another data set to initialize WRF. But as mentioned before, we chose ERA5 as forcing data due to its long temporal coverage. If we introduce a third-party data set, the production of the HAR v2 would be dependent on it. To avoid this, we applied an alternative strategy that is already used in WRF, where land surface characteristic, such as snow-free albedo, leaf area index, and so forth, are given in the form of static data represented as 12-month climatology. Here, analogously, we used this concept of static data and introduced a gridded 12-month climatology of a scaling factor to correct snow-related variables in the initial conditions. JRA-55 (Kobayashi et al., 2015) developed by the Japan Meteorological Agency was utilized to calculate the scaling factors. JRA-55 has an excellent performance among reanalyses regarding snow depth (Orsolini et al., 2019) and also has a relatively longer temporal coverage. Monthly means of snow depth at the model resolution ($55 km) available from 1958 to 2013 were downloaded from NCAR Research Data Archive. We used a linear approach (Lenderink et al. 2007) to scale the initial snow depth by the ratio of long-term monthly means of JRA-55 and ERA5. First, snow depth of ERA5 was derived from snow water equivalent and snow density. The 12-month climatological snow depth of JRA-55 and ERA5 was calculated from 35-year monthly means . Between 1979 and 2013, the two data sets overlap. Second, the 12-month climatological snow depth was reprojected and linearly resampled to the grid of d30km and d10km, respectively. Then the gridded 12month climatology of the scaling factor was calculated for every domain as the ratio of reprojected climatological snow depth between JRA-55 and ERA5. The ratio of grid points where reprojected ERA5 snow depth was zero was set to one. Figure 2a depicts the 35-year mean annual cycle from ERA5 and JRA-55 averaged over the whole 35-year (1979-2013) mean annual cycle of snow depth from ERA5 and JRA-55 averaged over the whole area of d10km of the HAR v2. Map of scaling factor for d10km of HAR v2 in January (b) and July (c). The grid points where the value is equal to one are masked out area of d10km of the HAR v2. Figure 2b,c show the map of scaling factors applied for d10km in January and July, respectively. After running the real.exe program in WRF, files containing the initial conditions are generated. In the initial conditions, snow depth is represented by two variables: physical snow depth (SNOWH) and snow water equivalent (SNOW). We modified these two variables in the initial files by multiplying them with the scaling factor of the corresponding month.

| Observational data
To calculate the skill scores described in Section 2.2, in-situ observations of Prcp and T2 from Global Surface Summary of the Day (GSOD) provided by National Centres of Environmental Information (NCEI) were used. We selected stations with more than 80% of records within the simulation period. After filtering, 103 stations are available within d10km of the HAR v2 (all points in Figure 1b), and 57 stations are left within d10km of the HAR (red points in Figure 1b). We compared the station data with the nearest grid cell from the model. Because of the difference between station elevation and the elevation of grid cells in the model, we applied a constant lapse rate of −6.5 K·km −1 to correct simulated T2. No correction was applied for Prcp due to its high spatial and temporal variability. Due to the sparse and uneven distribution of GSOD stations in HMA, for example, hardly any station located in western TP and Pamir-Karakorum (Figure 1b), we also made use of a satellite-based gridded precipitation product from Tropical Rainfall Measuring Mission (TRMM). Here, Multi-Satellite Precipitation Analysis (TMPA, Huffman and Bolvin, 2015) Rainfall Estimate Product 3B42 Version 7 was applied to examine the spatial distribution of simulated Prcp in the HAR and the HAR v2. TMPA merges satellite measurement with gauge data. It has a spatial resolution of 0.25 and a quasi-global coverage between 50 N and 50 S.

| Sensitivity to PPSs
Statistical scores for all the PPS experiments and Ref are listed in Table 3. Besides, we use box-whisker plot to visualize the spatial distribution of bias scores (WRFobservation) over 103 GSOD stations. White triangles in each box in Figure 3 (Prcp) and Figure 4 (T2) correspond to the MB scores in Table 3.
For Prcp, Ref features a wet bias, with MB of 0.14 mm·day −1 and 0.70 mm·day −1 in January and July, respectively (  (Table S1). Sensitivities of Prcp to PPSs vary seasonally. The fluctuation of the boxes of all experiments implies that PPSs have a stronger influence on Prcp in July than in January (Figure 3). MB of Prcp ranges from 0.27 mm·day −1 to 1.15 mm·day −1 in July, and only from 0.10 mm day −1 to 0.18 mm·day −1 in January (Table 3). The main reason for this seasonal variability is that most GSOD stations are located in areas receiving more precipitation in summer than in winter Curio and Scherer, 2016). Wider boxes in July indicate a stronger spatial variation of bias score in summer (Figure 3). Compared to the other two ensemble CU schemes, CU1 (Kain-Fritsch-Cumulus Potential scheme) shows the best performance regarding MAE and MB scores. In contrary, RMSE of CU1 in July is higher than Ref, which implies that the station-wise bias in CU1 is more scattered. CU1 is a modified version of the Kain-Fritsch scheme with a better treatment of shallow cumuli (Berg et al., 2013). According to Qian et al. (2016), CU1 tends to suppress deep convections and consequently produces lower Prcp. All three MP schemes consider five hydrometeor species: cloud, rain, ice, snow and graupel. MP1 (Purdue Lin scheme) is a singlemoment scheme only predicting the mixing ratio for these hydrometeor species (Chen and Sun, 2002). Ref (Thompson scheme) uses a double-moment description for rain and ice (Thompson et al., 2008), while MP2 (Morrison two-moment scheme) uses a double-moment description for rain, ice, snow and graupel (Morrison et al., 2009). MP2 performs the best (Table 3), probably due to its double-moment prediction in all ice-phase particles (Orr et al., 2017). The nonlocal scheme PBL2 (YSU) has the best skill compared to the other two local schemes. PBL2 largely improves summer Prcp with MB of 0.27 mm·day −1 compared to MB of 0.70 mm·day −1 in Ref. As for LSM schemes, LSM1 uses the same scheme as Ref (Noah LSM) but with a mosaic approach (Li et al., 2013), which considers sub-grid variability of land use. However, it does not improve Prcp simulation. LSM2 is Noah LSM with multi-parameterization options (Noah-MP). It captures winter Prcp better but produces more Prcp in summer than Ref (Table 3).
The same analysis is performed for T2. The HAR has a better performance than Ref in winter (Table S1). All PPS experiments including Ref produce larger T2 bias in January than in July, with MB of T2 ranging from −0.06 K to 0.70 K in July and −0.39 K to −2.69 K in January (Table 3). The spatial variability of the bias score is also stronger in winter (Figure 4). Independently from PPSs applied, all experiments show an overall underestimation of T2 in winter. CU schemes hardly influence winter T2. CU1, which shows some improvement over Ref in simulating Prcp, produces larger summer warm bias than Ref. Both MP schemes show an improvement in summer, but MP1 produces larger winter cold bias. As for PBL schemes, PBL1 improves T2 simulation in winter but worsens it in summer. It is warmer than Ref in both seasons, which is in accordance with previous studies (Kleczek et al. 2014;Jänicke et al., 2017;Xu et al., 2019). Xu et al. (2019) suggests that this is because PBL1 considers turbulent exchange induced by steep terrain when calculating the turbulent diffusion coefficient. Besides, our results show that the nonlocal PBL2 scheme is colder than both local schemes, which is in contrast with some previous studies concluding that nonlocal schemes generally produce higher temperature than local schemes (Hu et al., 2010;Xie et al., 2012;Kleczek et al., 2014). This different behaviour might due to the unique PBL characteristics over HMA compared to the plain regions (Yang et al., 2004;Zou et al., 2005;Chen et al., 2013), which indicates that PPSs behave very differently in different regions, and the choice of PPSs always depends on the specific purpose. For LSM schemes, again, LSM1 does not show significant improvement. Noah-MP (LSM2) was reported to have better performance than Noah LSM (Ref) in previous studies Gao et al., 2017) and has been widely applied in numerical modelling studies in HMA (Collier and Immerzeel, 2015;Karki et al., 2017;Norris et al., 2017). However, our results show the opposite, that is, that Noah-MP is colder than Noah LSM. This might result from our relatively short spin-up time. All the studies mentioned above use a spin-up time longer than 2 weeks. Noah-MP needs longer spin-up time than Noah LSM to reach a climatological equilibrium state (Cai et al., 2014;Barlage et al., 2015;Gao et al., 2015).

| Combination run and sensitivity to initial snow depth
The results from Section 3.1 indicate that there is a tradeoff in model performance between Prcp and T2, as well as between January and July. No single PPS performs ideally in both seasons and for both quantities. For instance, PBL2 has a better skill in predicting Prcp, but it worsens the winter cold bias. PBL1 shows superior performance in terms of T2 during winter with the lowest MAE, MB and RMSE among all experiments, but it also features the highest warm bias in summer. Therefore, we used the TOPSIS method to find the optimal solution for each experiment set, giving the same weight to each criterion (MAE, MB, RMSE and r s ) for each quantity (Prcp and T2). Table 4 lists the closeness coefficient (CC) and TOPSIS ranking for each experiment set. We then used

T A B L E 3
Statistical scores for evaluation of PPS experiments over 103 GSOD stations within d10km of the HAR v2 (red and white points in Figure 1b) Table 5), which is mainly due to the larger winter cold bias induced by PBL2 (Table 3).
To validate the assumption that the large winter cold bias is partly induced by overestimation of snow depth in the initial conditions, we conducted a further simulation (COMB_S). The PPSs used in this experiment are the same as in COMB, but initial snow depth and snow water equivalent were corrected using the method described in Section 2.4.
After the correction, the prediction of winter T2 is largely improved with MB of −1.54 K in COMB and MB of 0.17 K in COMB_S (Table 5). The impact of snow correction on summer T2 is minor compared to winter. Comparing the station-wise bias scores (Figure 6c,e), the most affected stations are located in the TP and in Northern Pakistan, where snow depth differs the most between ERA5 and JRA-55 (Section 4.2). This snow correction approach leads to slightly higher Prcp in both months (Table 5), but COMB_S still performs better than Ref.

| Comparison of HAR v2 with HAR
COMB_S achieves a significantly better performance than Ref (Table 5). Therefore, the set-up of COMB_S was applied to the HAR v2. We switched to a newer version of WRF (V4.1) for the generation of the HAR v2 and firstly produced HAR v2 d10km data for the whole year of 2011. Figure S3 shows that the change of model version only has a minor impact on the output.
We compared daily Prcp from the HAR v2 with the HAR and GSOD data (Figure 7). Here, only 57 GSOD stations within the d10km of the HAR are used (red points in Figure 1b). The HAR shows an overall lower MB and is better in explaining the variance in observations than

Note:
The values that differ from ref of more than 10% are marked in bold and put between parentheses if they show a lower skill than Ref.
the HAR v2 (MB of 0.30 mm·day −1 and 0.36 mm·day −1 , r 2 of .61 and .57). The HAR v2 tends to produce more Prcp than the HAR in April and from September to December. However, this kind of comparison only demonstrates the ability of the two versions of the HAR to reproduce the overall Prcp amount. In addition, we utilized gridded Prcp products to examine their ability to reproduce the spatial distribution of Prcp. Figure 8 shows seasonal Prcp from the HAR, the HAR v2 and the TMPA in 2011. In winter, the HAR and the HAR v2 produce similar spatial patterns of Prcp with maximum Prcp over Pamir-Karakoram-western-Himalayas (PKwH) region. However, this maximum centre is not detected in the TMPA. In MAM, two Prcp maxima exist in both versions of the HAR: one in PKwH and one over the region near Brahmaputra River. Only the latter one is visible in the TMPA, but with lower Prcp amount. The HAR and the HAR v2 show the largest differences in JJA. The HAR v2 shares more spatial similarity with the TMPA, while the HAR produces higher Prcp amount in Indian, Pakistan and eastern TP. In SON, the HAR still shows higher Prcp amount in Indian. Same as in winter and spring, both versions of the HAR detect more Prcp in PKwH than the TMPA.
Daily T2 from the HAR and the HAR v2 are compared with each other and with GSOD ( Figure 9). Both data sets reproduce T2 seasonality well (r 2 = .99). The HAR v2 simulates T2 better with MB of −0.58 K

| Sources of uncertainties
One should always keep in mind that the accuracy of observational data and the method used for validation could lead to uncertainty of the result. In this study, we use freely available observations from GSOD as the ground truth to calculate statistical metrics. Station data from this data set underwent quality control before being published (NOAA, 2019). However, some GSOD stations seem to be outliers, for example, the station in southeastern TP featuring winter warm bias, while all the surrounding stations show cold bias (Figure 6a,c,e). With regards to Prcp, it is well known that rain gauges undercatch Prcp under strong wind conditions (Duchon and Essenberg, 2001;WMO, 2008). We used T2 and Prcp from the nearest grid point to compare them with observations. With a resolution of 10 km, the distance between the actual location of stations and the associated grid point could reach a few kilometres, which leads to a large difference in elevation over complex terrain. A constant lapse rate of −6.5 K·km −1 is applied to correct the simulated T2 values. For Prcp, no correction is applied. Therefore, the point-to-point comparison between modelled and observed Prcp always depicts discrepancies. Another source of uncertainty comes from the fact that we only conducted sensitivity experiments for the year 2011 due to the high computational costs. Due to the sparse distribution of GSOD stations, we also compared the spatial distribution of seasonal Prcp from two versions of the HRA with the TMPA. The results (Figure 8) indicate that the TMPA fails to reproduce Prcp over PKwH in winter, spring and autumn. This region is dominated by mid-latitude westerlies and experiences its maximum Prcp amount in winter (Archer and Fowler, 2004;Bookhagen and Burbank, 2010;Curio and Scherer, 2016;Mölg et al., 2018). The majority of Prcp in this region falls as snow . Satellitebased Prcp has a deficiency in detecting snowfall (Behrangi et al., 2014;Immerzeel et al., 2015). The Global Precipitation Measurement (GPM) mission, launched in February 2014, is a successor of TRMM and a next generation of global observations of precipitation from space. The onboard Dual-frequency Precipitation Radar is more sensitive than TRMM in detecting snowfall and light rainfall (NASA, 2019). GPM-based products are expected to be a better validation tool for modelled results after 2014.

| Snow correction approach
In this study, we propose a bias correction method for initial snow depth and snow water equivalent based on the concept of linear scaling approach. This approach has the advantage that only monthly climatological information is required. Figure 6 and Table 5 show that bias- Scatter plots of HAR v2 (a) and HAR (b) with statistical scores MB in mm day −1 and r 2 . Daily Prcp time series (c) corrected initial conditions improve winter T2 simulations.
To further explore the influence of snow depth on T2, monthly mean difference of T2, surface albedo and snow cover fraction between COMB and COMB_S in January and July are presented in Figure 10. In January, regions with the highest increase of T2 in COMB_S are Southeastern TP, the Himalayas, and the Tian Shan. These regions also feature lower albedo and lower snow cover fraction. The same relationship can be found in July. However, the increase of T2 is much weaker in July than in January, which is in accordance with the MB scores in Table 5.
In the Noah LSM, surface albedo (α) is calculated as: where α sn, free and α sn refer to snow-free albedo and snow albedo, respectively. f sn is snow cover fraction. α is proportional to f sn , which is defined as: In Equation 12, W is snow depth in water equivalent; W cr is a land-use-dependent threshold of snow depth, over which f sn is set to 1; a s is a distribution shape parameter. After snow correction, W decreases in most grid points in d10km (Figure 2), which leads to smaller f sn according to Equation 12. Smaller f sn results in lower surface albedo (Equation 11) and modifies the surface energy balance by reflecting less short-wave radiation, and thus, influences T2 in COMB_S. On the other hand, more absorption of shortwave radiation could lead to larger moist static energy (Meng et al., 2014), which could be a possible reason of the enhanced precipitation in COMB_S (Table 5). Note that snow-related changes affect T2 and Prcp through multiple complex processes. Further work is needed to quantify the impact of these processes on T2 and Prcp. The error sources of dynamical downscaling are twofold: deficiencies in model dynamics and physics, as well as inaccuracies in forcing data. After snow correction, the cold bias over TP is reduced but not eliminated. The distribution of T2 bias score in COMB_S (Figure 6e,f) share some similar pattern with the HAR (figure 2.4 in Maussion, 2014): cold bias over the TP but warm bias in Pakistan and Northwestern China. This implies that these patterns of biases are related to WRF itself, or the same errors exist in both ERA5 and FNL data sets. Cold bias over the TP is reported from many other WRF studies independent of the forcing data (ERA-interim: Gao et al., 2015;Karki et al., 2017;Bonekamp et al., 2018;Huang and Gao, 2018;FNL: Huang and Gao, 2018;Maussion, 2014;Zhou et al., 2018). Meng et al. (2018) pointed out that this is due to the overestimation of albedo over snow-covered areas in Noah LSM. After switching off albedo parameterization and replacing the albedo with MODIS time-varying albedo, their simulated T2 improved. According to Chen et al. (2014), the major weakness of LSM is snow processes, and Noah LSM needs a higher snow albedo to retain snow on ground since it only considers a single layer of snowpack.

| CONCLUSIONS
Validation of the reference experiment (Ref) indicates that, due to the changes in forcing data and domain configuration, the PPSs used in the HAR are no longer Scatter plots of HAR v2 (a) and HAR (b) with statistical scores MB in K and r 2 . Daily T2 time series (c) suitable for the HAR v2 since they produce large summer wet bias and winter cold bias. We examined model sensitivities to different PPSs and initial snow conditions to find an optimal set-up for the HAR v2. The results of PPS experiments reveal that no single PPS is optimally suited for both seasons (January and July) and both quantities (T2 and Prcp). Trade-offs always exist between quantities and between seasons. Kain-Fritsch cumulus potential scheme, Morrison 2moment scheme, Yonsei University scheme and Noah LSM are identified by the TOPSIS method as the best schemes. The combination run (COMB) inherits from these schemes and has a superior performance in Prcp but not in T2 when compared to Ref.
Overestimation of snow depth in ERA5 in most areas of d10km leads to higher snow cover fraction and higher surface albedo, which ultimately results in a cold bias. After correction of initial snow depth, cold bias is significantly reduced but not eliminated. This could imply a deficiency of Noah LSM. In future LSM development, improvement of representation of snow-related processes is needed.
After careful selection of PPSs and correction of initial snow depth, model performance is improved. The improved set-up was applied to produce 1 year of HAR v2 data for 2011. Compared to the old version, the HAR v2 generally produces slightly higher Prcp amounts, but the spatial distribution of seasonal Prcp matches better to observations. Both versions of the HAR show cold bias in spring, autumn and winter, but warm bias in summer. The HAR v2 has smaller magnitudes of these biases.
The HAR v2 is planned to have a temporal coverage of at least 30 years. Comprehensive validation against observations and comparison with the HAR are scheduled once data production is finished. The HAR v2 is developed within the framework of the "Climatic and Tectonic Natural Hazards in Central Asia (CaTeNA)" project to investigate the climatic triggering mechanism (e) (f) F I G U R E 1 0 Difference of T2 (a and b), surface albedo (c and d) and snow cover fraction (e and f) between COMB_s and COMB in of landslides in Central Asia. However, application of this data set will not be limited to this research field. With high resolution and extended temporal coverage, the HAR v2 can contribute to a better understanding of climate-related processes in the remote and data-sparse HMA region.

ACKNOWLEDGMENTS
This work was supported by the German Federal Ministry of Education and Research (BMBF) under the framework of the "Climatic and Tectonic Natural Hazards in Central Asia (CaTeNA)" project (Grant Number FKZ 03G0878G). The authors acknowledge support by the Open Access Publication Fund of TU Berlin. The authors thank the two anonymous reviewers for their constructive comments and suggestions, which helped the authors to improve the manuscript.