Evaluating the Evolution of ECMWF Precipitation Products Using Observational Data for Iran: From ERA40 to ERA5

European Center for Medium‐Range Weather Forecasts Reanalysis (ERA), one of the most widely used precipitation products, has evolved from ERA‐40 to ERA‐20CM, ERA‐20C, ERA‐Interim, and ERA5. Studies evaluating the performance of individual ERA products cannot adequately assess the evolution of the products. We compared the performance of all ERA precipitation products at daily, monthly, and annual timescales (1980–2018) using data from more than 2100 precipitation gauges in Iran. Results indicated that ERA‐40 performed worst, followed by ERA‐20CM, which showed only minor improvements over ERA‐40. ERA‐20C considerably outperformed its predecessors, benefiting from the assimilation of observational data. Although several previous studies have reported full superiority of ERA5 over ERA‐Interim, our results revealed several shortcomings in ERA5 compared with the ERA‐Interim estimates. ERA‐Interim and ERA5 performed best overall, with ERA‐Interim showing better statistical and categorical skill scores, and ERA5 performing better in estimating extreme precipitation. These results suggest that the accuracy of ERA precipitation products has improved from ERA‐40 to ERA‐Interim, but not consistently from ERA‐Interim to ERA5. This study employed a grid‐grid comparison approach by first creating a gridded reference data set through the spatial aggregation of point source observations. However, the results from a point‐grid approach showed no change in the overall ranking of products (despite slight changes in the error index values). These findings are useful for model development at a global scale and for hydrological applications in Iran.

Several studies have evaluated the performance of ERA precipitation products regionally and globally. Studies comparing continental precipitation have found ERA-Interim to be significantly better in capturing monthly precipitation variability than ERA-40 (Simmons et al., 2010). Evaluation of ERA-Interim for precipitation in the United States indicated that the data set had comparable performance to Global Precipitation Climatology Project (GPCP; Adler et al., 2003) for annual averages during the period 2000-2008 (Balsamo et al., 2010). A recent evaluation of ERA5 for North America (Tarek et al., 2020) indicated consistently lower precipitation bias in ERA5 than in ERA-Interim, and concluded that ERA5 can lead to systematically more accurate hydrological modeling. In a study in mainland China, Jiang et al. (2021) showed that ERA5 can successfully identify the spatial distribution and hotspots of precipitation, but underestimates extreme precipitation.
Although ERA products have been evaluated separately in different regions, there is still a need for specific comparison of performance for the different ERA products that have emerged over time. Previous studies evaluating individual ERA products are not comparable, mainly due to different evaluation approaches, spatiotemporal scale, and study periods. Many studies have also been conducted across different regions with varying uncertainty in ground truth observations, further impeding comprehensive comparison of ERA products. An exhaustive evaluation of several successive ERA precipitation products for countries with diverse climates and precipitation patterns (such as Iran) can be insightful, revealing critical information on the performance of successive ERA products and contributing to the development and improvement of future versions. While later ERA products are generally expected to show improvements in models, input data, and assimilation methods (Sun et al., 2018), this needs to be verified independently and across different regions.
In this study, we evaluated the performance of five successive ERA precipitation products, using data for Iran, and investigated the improvements in each version compared with its predecessors. Iran's diverse geography and hydro-climatological patterns, and access to independent observational data (i.e., precipitation data from gauges not used by global precipitation data sets) provided a unique opportunity to investigate the performance and evolution of the different ERA precipitation products.

Table 1 (excerpt)
• Non-physical trends and variability may be present in the record due to changes in the observing system
• Consistency of the representation of globally averaged temperatures in the mid-to-upper stratosphere has not improved in ERA5
ERA5 strengths compared to ERA-Interim (related to precipitation):
• Better global balance of precipitation and evaporation
• Better precipitation over land in the deep tropics
• Higher spatial and temporal resolutions
Note. Models' general characteristics, improvements, strengths, and limitations related to precipitation estimations are also summarized. References used to summarize information in Table 1

10.1029/2022EA002352 4 of 24
A few studies have evaluated the performance of ERA-Interim or ERA5 precipitation products for Iran (Darand & Khandu, 2020; Fallah et al., 2020; Khoshchehreh et al., 2021; Shayeghi et al., 2020; Shobeiri et al., 2021; Taghizadeh et al., 2021). However, none of them has focused on a comprehensive comparison of successive ERA precipitation products to demonstrate their evolution over time. These studies have used different study areas and reference observational data sets, preventing intercomparison of ERA products across Iran based on their results.
The aim of this study was to make a combined assessment of the different ERA precipitation products, using Iran as the study area, to answer the following research questions: (a) What uncertainties and error characteristics are associated with the different ERA precipitation products in estimating precipitation? (b) Do later ERA precipitation products represent an improvement on their predecessors? To this end, we used data from more than 2,100 gauges scattered throughout Iran to create the reference data set, against which outputs from the different ERA precipitation products were compared (Table 1). The recorded data covered a 38-year period (1980-2018), and the evaluation was performed at daily, monthly, and annual timescales. To conduct a comprehensive performance evaluation, inspect the error characteristics, and quantify the progress and evolution of the ERA precipitation products, different types of statistical indices and error decomposition approaches were employed.

Study Area
Iran was selected as the study area because most of its existing rain-gauge observations are not publicly accessible and have thus not been used in previous assessments of global precipitation products. Iran is located between 25°-40°N and 42°-63°E and occupies an area of 1,648,000 km² (Figure 1). The topography of Iran is diverse, with two mountain ranges, the Alborz Chain, which runs from northwest to northeast, and the Zagros Chain, which runs from northwest to the shores of the Persian Gulf, and two large deserts (Lout and Kavir) in the center of the Iranian Plateau. The elevation varies from 25 m below mean sea level (MSL) in northern coastal regions by the Caspian Sea to 5,600 m above MSL in the Alborz Chain. Based on the Extended De Martonne classification (Rahimi et al., 2013), Iran has a wide range of climates, from perhumid in the northwest and along the Caspian Sea coast to semi-humid, Mediterranean, semi-arid, and extra-arid in central and eastern areas.
Precipitation in Iran is influenced by various synoptic systems arriving from the north, northwest, west, south, and southeast (Sabziparvar et al., 2015). The Mediterranean and North Atlantic cyclones, along with the cold continental air mass, affect northwest and northern Iran, causing considerable amounts of annual precipitation in the north and northwest (up to 2,000 and 500 mm/year, respectively) (Sabziparvar et al., 2015). The Alborz and Zagros Chains largely block the atmospheric and frontal systems arriving from the north and northwest of Iran, causing an arid climate in the central region with annual precipitation of less than 50 mm/year (Yazdanpanah et al., 2017). Summertime Indian monsoon systems influence southern and southeastern parts of Iran, causing strong winds and sudden rain storms that can result in annual precipitation of up to 200 mm/year in these regions (Sabziparvar et al., 2015).

Observed Precipitation Data Set
Precipitation in Iran is measured by two separate gauge networks: (a) the Iran Meteorological Organization (IRIMO) network (synoptic stations) and (b) the Iran Water Management Research Institute network (TAMAB stations) (Figure 1). Although observed precipitation at the synoptic stations is regularly reported to the WMO, these stations account for only a fraction of the existing rain gauges in Iran. Most of Iran's rain-gauge observations are not freely accessible to the public and have thus not been used in assessments of global models and products. In this study, we used data from 479 synoptic and 1,646 TAMAB stations with records of daily precipitation for the selected study period (1980-2018). Before using the gauge data in assessments, we conducted quality control (QC) tests, which resulted in exclusion of data from six TAMAB precipitation gauges from the reference data set (see Section 2.3.1).

ERA Precipitation Products
The five different versions of ERA (ERA-40, ERA-20C, ERA-20CM, ERA-Interim, and ERA5) use models and data assimilation systems to reanalyze archived observations, creating global data sets describing the recent history of the atmosphere, land surface, and oceans (ECMWF, 2021). Table 1 provides a summary of the different versions of ERA products and their specifications, differences, and improvements with a main focus on precipitation related issues. A brief description of each data set is also given below.

ERA-40
ERA-40 is a 45-year second-generation reanalysis carried out by ECMWF in 2005 to produce the best possible set of analyses, given the changing observing system and the available computational resources. It began in September 1957, when the observing system had been enhanced, and ran until August 2002 (Uppala et al., 2005). The observations used in ERA-40 were accumulated from various sources, with assimilated data provided by a succession of satellite-borne instruments from the 1970s onward, supplemented with increasing numbers of observations from aircraft, ocean buoys, and other surface platforms, but with a declining number of radiosonde ascents since the late 1980s. The computational cost of ECMWF's operational four-dimensional variational (4D-Var) data assimilation system was too large for it to be used for ERA-40. An updated form of the 3D-Var analysis, used operationally at ECMWF between January 1996 and November 1997 (Andersson et al., 1998), was thus adopted in this product (Uppala et al., 2005). ERA-40 provides global gridded precipitation estimates for 1957-2002 at 6-hourly temporal resolution and ∼125 km spatial resolution (Table 1).

ERA-20CM
ERA-20CM is an ensemble of 10 atmospheric model integrations for the 20th Century (1899-2010) developed at ECMWF (Hersbach et al., 2015). The spatial resolution of ERA-20CM is ∼125 km and the temporal resolution is 3 hr (Table 1). Since no atmospheric observations were assimilated in ERA-20CM, this product cannot reproduce the data from actual synoptic gauges. However, the ERA-20CM ensemble product can provide a statistical estimate of the climate over the 20th Century and provides a good reference for the forced low-frequency variability of the atmosphere in the 20th century. Moreover, ERA-20CM is well suited for the projection of global warming and significant events onto other geophysical quantities not directly provided in the forcing data (Hersbach et al., 2015).

ERA-20C
The ECMWF's 20th Century reanalysis ERA-20C (1900-2010), an atmospheric general circulation model, uses the same configuration as the control member of the ERA-20CM ensemble. However, it is forced by observation-based analyses of sea surface temperature, sea ice cover, atmospheric composition changes, and solar forcing (Poli et al., 2016). The resulting climate trend estimations resemble those of ERA-20CM for the temperature and water cycle, but the assimilation of observations adds realism to synoptic timescales compared with ERA-20CM in regions that are covered by observations. The novel feature of ERA-20C compared with its predecessors was the availability of observation-based feedback information. The general quality of ERA-20C and its climate-related agreement with other products improves with the availability of observations (Poli et al., 2016).

ERA-Interim
The ERA-Interim project was conducted by ECMWF to prepare a new atmospheric reanalysis to replace ERA-40, which extended back to the early part of the 20th Century (Dee et al., 2011). ERA-Interim covers the period 1979-2019 and its gridded data include a large variety of 3-hourly surface parameters, describing the weather, ocean-wave and land-surface conditions, and 6-hourly upper-air parameters covering the troposphere and stratosphere (Dee et al., 2011). Vertical integrals of atmospheric flux, monthly averages for many parameters, and other derived fields have also been produced and published in the Copernicus portal (Berrisford et al., 2009). The spatial resolution of this data set is 79 km (Table 1).

ERA5
ERA5, the most recent ECMWF reanalysis product, provides a detailed record of the global atmosphere, land surface, and ocean waves from 1950 onward. Developed to replace ERA-Interim, ERA5 significantly enhanced the spatial resolution of ECMWF reanalysis to 31 km and provides hourly output and an uncertainty estimate from an ensemble of model runs (Hersbach et al., 2020). In addition, the representation of tropospheric processes appears to be significantly improved in ERA5, as it benefits from a decade of research and developments in modeling physical dynamics and in data assimilation techniques (Hennermann & Berrisford, 2018). Therefore, ERA5 can be expected to perform considerably better than ECMWF's previous products. Expected improvements include representation of tropical cyclones, global balance of precipitation and evaporation, precipitation over land in the deep tropics, soil moisture, and more consistent sea surface temperatures and sea ice (Hennermann & Berrisford, 2018).

Quality Control of the Observed Precipitation Data Set
Before evaluating the performance of the ERA precipitation products, we applied QC tests to the gauge observations and excluded inhomogeneous and suspicious gauges from the reference data set. The overall workflow of the QC tests implemented in this study is shown in Figure S1 in Supporting Information S1. Following Wijngaard et al. (2003), the QC tests were applied to the annual number of wet days (with a threshold of 1 mm/day) rather than to precipitation rates. After constructing the time series of annual wet days for all gauge stations, four statistical homogeneity tests, the Standard Normal Homogeneity Test (SNHT; Alexandersson, 1986), the Buishand range test (Buishand, 1982), the Pettitt test (Pettitt, 1979), and the Von Neumann ratio test (Von Neumann, 1941), were performed to check for departures of the time series from homogeneity (using the R package by Pohlert, 2016).

The null hypothesis of all four tests is that the annual numbers of wet days are independent and identically distributed. Under the alternative hypothesis, a stepwise shift (break) in the mean or a non-random distribution of the time series is assumed (for more details on the definitions of these tests, see Supporting Information S1).
Based on the results of the tests and using the classification scheme developed by Schönwiese and Rapp (1997) and Wijngaard et al. (2003), gauges in Iran were classified into three categories (Useful, Doubtful, and Suspicious). The Doubtful and Suspicious gauges were re-checked against their closest Useful gauges using the double mass curve test, and a final decision about exclusion of the gauges was then made (for more details on this test, see Supporting Information S1). Only six (of 1,640) TAMAB gauges did not pass the QC test and were excluded from the reference data set used in this study.
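As a minimal illustration of the first steps of this QC workflow, the Python sketch below counts annual wet days for one gauge and computes the Von Neumann ratio of the resulting series. It is not the R code used in the study: the function names, the input layout (a dictionary of daily values per year), and the choice to count days with precipitation at or above 1 mm as wet are assumptions made for this example.

```python
from statistics import mean

WET_DAY_THRESHOLD_MM = 1.0  # wet-day threshold stated in the text (1 mm/day)


def annual_wet_days(daily_by_year, threshold=WET_DAY_THRESHOLD_MM):
    """Count wet days per year.

    daily_by_year maps year -> list of daily totals in mm/day (hypothetical
    layout). None marks a missing day and is skipped. Treating values equal
    to the threshold as wet is an assumption of this sketch.
    """
    return {
        year: sum(1 for p in days if p is not None and p >= threshold)
        for year, days in daily_by_year.items()
    }


def von_neumann_ratio(series):
    """Sum of squared successive differences divided by the sum of squared
    deviations from the mean."""
    m = mean(series)
    denominator = sum((x - m) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("series is constant; the ratio is undefined")
    numerator = sum((a - b) ** 2 for a, b in zip(series, series[1:]))
    return numerator / denominator
```

For a homogeneous series the expected value of the ratio is 2, and values well below 2 point to a break or trend. The critical values needed for a formal test are not reproduced in this sketch.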

Spatio-Temporal Aggregation of the Reference and ERA Precipitation Data Set
Since the reference and ERA data sets had different temporal resolution, all sub-daily gauge (i.e., reference) data and gridded (i.e., ERA product) data were aggregated to daily time step. Monthly and annual time series were then built from daily values. During the temporal aggregations, monthly (annual) records with more than 5 days (2 months) of missing data in a month (year) were excluded (considered as NaNs).
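The missing-data rule described above can be sketched in Python as follows. The function names are hypothetical, and the treatment of months or years that are kept despite some gaps (summing only the available values) is an assumption of this example, since the text does not specify it.

```python
import math


def aggregate_total(values, max_missing):
    """Sum a block of values, or return NaN if too many are missing.

    Missing entries are given as None or NaN. If more than max_missing
    entries are missing, the whole block is excluded (NaN); otherwise the
    available values are summed (an assumption of this sketch).
    """
    valid = [v for v in values if v is not None and not math.isnan(v)]
    if len(values) - len(valid) > max_missing:
        return math.nan
    return sum(valid)


def monthly_total(daily_mm):
    # Rule from the text: more than 5 missing days -> month excluded.
    return aggregate_total(daily_mm, max_missing=5)


def annual_total(monthly_mm):
    # Rule from the text: more than 2 missing months -> year excluded.
    return aggregate_total(monthly_mm, max_missing=2)
```

For example, a 30-day month with five missing days is kept, whereas the same month with six missing days is returned as NaN and drops out of the monthly statistics.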
We employed a slightly modified version of the inverse distance weighting (IDW) method to obtain gridded estimates of precipitation from gauge observations at the spatial resolution of each ERA data set (Table 1). Figure S2 in Supporting Information S1 shows box plots of the number of gauge stations in each grid cell of the different ERA precipitation products. As can be seen in Figure S2 in Supporting Information S1, the ERA-40, ERA-20CM, and ERA-20C products, which have coarser spatial resolutions, have an average of around 15 gauges in each grid cell, while ERA-Interim and ERA5 have seven and two gauges per grid cell, respectively. Using the modified IDW method, three-dimensional distances (including the vertical distance of each gauge from the mean elevation of the grid cell) were calculated and used as the weights of all gauge stations falling into each ERA product grid cell (see Supporting Information S1 for more details). The gridded daily, monthly, and annual time series obtained for the reference observational data set at the spatial resolution of the ERA precipitation products were subjected to statistical analyses and error evaluations.
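The idea of weighting gauges by a three-dimensional distance can be illustrated with the Python sketch below. The exact formulation used in the study is given in Supporting Information S1 and is not reproduced here, so everything specific in this example is an assumption: the function name, the use of the cell center as the horizontal reference point, the Euclidean combination of horizontal and vertical offsets in the same length unit, and the distance exponent of 2.

```python
import math


def idw_cell_estimate(gauges, cell_x, cell_y, cell_mean_elev,
                      power=2.0, min_distance=1e-6):
    """Inverse-distance-weighted precipitation for one grid cell.

    gauges: list of (x, y, elevation, precipitation) tuples for the gauges
    that fall inside the cell; x, y, and elevation share one length unit
    (assumed). The vertical offset is measured from the cell's mean
    elevation. Returns NaN when the cell contains no gauges.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, elev, precip in gauges:
        distance = math.sqrt(
            (x - cell_x) ** 2 + (y - cell_y) ** 2 + (elev - cell_mean_elev) ** 2
        )
        # Guard against division by zero for a gauge at the reference point.
        weight = 1.0 / max(distance, min_distance) ** power
        weighted_sum += weight * precip
        weight_total += weight
    if weight_total == 0.0:
        return math.nan
    return weighted_sum / weight_total
```

With this weighting, a gauge whose elevation is far from the cell's mean elevation contributes less to the cell value than a gauge at a similar horizontal distance but a representative elevation.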

Statistical Indices
Kling-Gupta Efficiency (KGE; optimal value is 1) (Gupta et al., 2009; Kling et al., 2012) was chosen as the statistical index in this study, as it summarizes and combines three measures of model error: correlation coefficient, bias, and variability ratio (Kling et al., 2012). The KGE values were calculated for precipitation estimated by the ERA products compared with the associated observed values in the reference data set. The revised version of KGE (Kling et al., 2012) was used, to ensure that the bias and variability ratios were not cross-correlated. KGE is calculated as

$$\mathrm{KGE} = 1 - \sqrt{(\mathrm{CC} - 1)^2 + (\beta - 1)^2 + (\gamma - 1)^2}$$

$$\beta = \frac{\mu_{est}}{\mu_{obs}}, \qquad \gamma = \frac{\mathrm{CV}_{est}}{\mathrm{CV}_{obs}} = \frac{\sigma_{est}/\mu_{est}}{\sigma_{obs}/\mu_{obs}}$$

where CC is the Pearson product-moment correlation coefficient (optimal value is 1), $\beta$ is the bias ratio, defined as the ratio of the simulated to the observed mean value (optimal value is 1), and $\gamma$ is the variability ratio, computed by dividing the coefficient of variation (CV) of the simulated values by that of the observed values (optimal value is 1) (Kling et al., 2012). Here $\mu$ and $\sigma$ denote the mean and standard deviation of the estimated (est) and observed (obs) precipitation.
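A minimal Python sketch of this index is given below, assuming the Kling et al. (2012) form KGE = 1 − sqrt((CC − 1)² + (β − 1)² + (γ − 1)²) with β the ratio of means and γ the ratio of coefficients of variation. The function name and the use of population (rather than sample) standard deviations are choices made for this example; the ratio γ is the same under either convention.

```python
import math
from statistics import mean, pstdev


def kge_components(est, obs):
    """Return (KGE, CC, beta, gamma) for paired estimated/observed series.

    Raises ZeroDivisionError if either series has zero mean or zero
    variance (for example an all-dry grid cell); callers need to handle
    such cells separately.
    """
    mu_est, mu_obs = mean(est), mean(obs)
    sd_est, sd_obs = pstdev(est), pstdev(obs)
    cov = sum((e - mu_est) * (o - mu_obs) for e, o in zip(est, obs)) / len(obs)
    cc = cov / (sd_est * sd_obs)                    # correlation coefficient
    beta = mu_est / mu_obs                          # bias ratio
    gamma = (sd_est / mu_est) / (sd_obs / mu_obs)   # variability (CV) ratio
    kge = 1.0 - math.sqrt((cc - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return kge, cc, beta, gamma
```

As a check on the behavior of the components, an estimate that is exactly twice the observation at every time step has CC = 1 and γ = 1 but β = 2, so its KGE is 0: the penalty comes entirely from the bias term.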
Three additional statistical measures, correlation coefficient (CC), root mean squared error (RMSE), and relative bias (RBias), were also calculated, using the following equations:

$$\mathrm{CC} = \frac{\mathrm{cov}(P_{est}, P_{obs})}{\sigma_{est}\,\sigma_{obs}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(P_{est,t} - P_{obs,t}\right)^2}$$

$$\mathrm{RBias} = \frac{\sum_{t=1}^{T}\left(P_{est,t} - P_{obs,t}\right)}{\sum_{t=1}^{T} P_{obs,t}} \times 100\%$$

where $P_{est}$ and $P_{obs}$ indicate the estimated and observed precipitation values from the ERA products and the reference data set, respectively. For each pixel, $\mathrm{cov}(P_{est}, P_{obs})$ is the statistical covariance of the estimated and observed precipitation, and $\sigma_{est}$ and $\sigma_{obs}$ are the standard deviations of the respective data sets. $P_{obs,t}$ and $P_{est,t}$ are the observed and estimated precipitation at time step t at each grid cell, with T indicating the total number of time steps in a specific grid cell.
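These three measures can be sketched in Python as follows. The definitions used are the common ones (Pearson correlation, root of the mean squared difference, and the summed difference divided by the summed observations, expressed in percent); the exact RBias form is an assumption of this example, as are the function names.

```python
import math


def pearson_cc(est, obs):
    """Pearson correlation between estimated and observed series."""
    n = len(obs)
    mu_est, mu_obs = sum(est) / n, sum(obs) / n
    cov = sum((e - mu_est) * (o - mu_obs) for e, o in zip(est, obs))
    var_est = sum((e - mu_est) ** 2 for e in est)
    var_obs = sum((o - mu_obs) ** 2 for o in obs)
    return cov / math.sqrt(var_est * var_obs)


def rmse(est, obs):
    """Root mean squared error, in the units of the input (e.g., mm/day)."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(est, obs)) / len(obs))


def rbias_percent(est, obs):
    """Relative bias in percent; positive values indicate overestimation."""
    return 100.0 * sum(e - o for e, o in zip(est, obs)) / sum(obs)
```

A constant offset illustrates how the measures differ: adding 1 mm/day to every observation leaves CC at 1 but gives an RMSE of 1 mm/day and a positive RBias.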
All statistical indices were calculated for the daily and monthly time series. In addition, the observed precipitation data (and their associated ERA estimates) were categorized into four different classes: 0-200, 200-400, 400-600, and >600 mm/year, and KGE was calculated for each class to evaluate the performance of ERA products for regions with different climates and precipitation regimes (from dry to more humid regions).

Categorical Contingency Table Indices
Apart from evaluating the accuracy of models in estimating precipitation rate, it is important to verify their skill in detecting precipitation occurrence. This was done by creating a contingency table based on dichotomous estimations that return Yes if precipitation has occurred and No otherwise. The contingency table indices were then defined based on the number of Yes and No events in the reference and ERA products. The threshold specified to separate Yes and No events in this study was varied from 0 to 25 mm/day, in order to test the ability of the ERA products to detect precipitation at different precipitation rates. The contingency table indices applied were probability of detection (POD), false alarm ratio (FAR), bias, and Heidke skill score (HSS):

$$\mathrm{POD} = \frac{H}{H + M}$$

$$\mathrm{FAR} = \frac{FA}{H + FA}$$

$$\mathrm{Bias} = \frac{H + FA}{H + M}$$

$$\mathrm{HSS} = \frac{2\,(H \cdot CN - FA \cdot M)}{(H + M)(M + CN) + (H + FA)(FA + CN)}$$

where H, M, FA, and CN represent the number of Hit, Miss, False Alarm, and Correct Negative events, respectively.
POD represents the success rate of the model in estimating the occurrence of precipitation correctly (optimal value is 1), whereas FAR measures the fraction of estimated precipitation events that were non-rainy days in the observed data set (optimal value is 0). Bias measures the ratio of the number of estimated precipitation events to the number of observed rainy days and indicates whether the model tends to underforecast (Bias < 1) or overforecast (Bias > 1) precipitation occurrence. HSS measures the fraction of correct precipitation estimates by the model after eliminating those estimates that would be correct due to random chance (optimal value is 1). For more information on the definitions and details of the contingency table indices, see Murphy and Winkler (1987) or the website https://www.cawcr.gov.au/projects/verification/. The verification package in R, developed by NCAR (2015), was used to calculate the contingency table indices at the different thresholds tested in this study.
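The contingency table indices can be sketched in Python as below, using the standard definitions POD = H/(H + M), FAR = FA/(H + FA), Bias = (H + FA)/(H + M), and HSS = 2(H·CN − FA·M)/[(H + M)(M + CN) + (H + FA)(FA + CN)]. This is an illustrative stand-in, not the R verification package used in the study. The function names and the convention that an event is a value strictly above the threshold (so that a threshold of 0 mm/day separates wet from dry days) are assumptions of this example.

```python
import math


def contingency_counts(est, obs, threshold=0.0):
    """Return (hits, misses, false_alarms, correct_negatives)."""
    hits = misses = false_alarms = correct_negatives = 0
    for e, o in zip(est, obs):
        est_event, obs_event = e > threshold, o > threshold
        if est_event and obs_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif est_event:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def _ratio(numerator, denominator):
    # NaN when the index is undefined (e.g., no observed events at all).
    return numerator / denominator if denominator else math.nan


def categorical_scores(h, m, fa, cn):
    """POD, FAR, frequency bias, and Heidke skill score from the counts."""
    hss_den = (h + m) * (m + cn) + (h + fa) * (fa + cn)
    return {
        "POD": _ratio(h, h + m),
        "FAR": _ratio(fa, h + fa),
        "Bias": _ratio(h + fa, h + m),
        "HSS": _ratio(2 * (h * cn - fa * m), hss_den),
    }
```

Calling `contingency_counts` repeatedly with thresholds between 0 and 25 mm/day and passing the counts to `categorical_scores` gives the kind of threshold-dependent skill curves discussed in the text.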

Systematic and Random Error Decomposition
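Decomposition of errors into systematic and random components can help determine the source of errors in estimation/prediction models and provides very useful information for evaluating performance and identifying areas for future model enhancement. Systematic errors are reproducible inaccuracies that are consistently in the same direction (higher or lower than observed data), while random errors vary around observed data in different directions. The mean squared difference (MSD) error index, which measures the difference between observed and modeled values, can be decomposed into a systematic component ($\mathrm{MSD}_s$) and a random component ($\mathrm{MSD}_r$) as (Willmott, 1981)

$$\mathrm{MSD} = \frac{1}{T}\sum_{t=1}^{T}\left(P_{est,t} - P_{obs,t}\right)^2 = \mathrm{MSD}_s + \mathrm{MSD}_r$$

$$\mathrm{MSD}_s = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{P}_{est,t} - P_{obs,t}\right)^2$$

$$\mathrm{MSD}_r = \frac{1}{T}\sum_{t=1}^{T}\left(P_{est,t} - \hat{P}_{est,t}\right)^2$$

where $P_{est,t}$ and $P_{obs,t}$ are the estimated and observed precipitation at time step t, T is the total number of time steps, and $\hat{P}_{est}$ is obtained from the least squares linear regression relationship $\hat{P}_{est} = a + b\,P_{obs}$, where $a$ and $b$ are the intercept and slope, respectively.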
From the above equations, the ratio (MSDs∕MSD) represents the systematic error component, while (MSDr∕MSD) or (1 − MSDs∕MSD ) represents the random component of the total MSD value. Apart from the desirability of lower MSD values (total MSD, MSDs, and MSDr), a low ratio of systematic error to the random component is preferable, as it reflects a better model algorithm, in the present case capable of capturing the precipitation process (see Ghajarnia et al., 2018 for a graphical illustration of systematic and random error components).
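The decomposition can be sketched in Python as follows, assuming the Willmott (1981) formulation in which the estimates are regressed on the observations by ordinary least squares, the systematic part is the mean squared difference between the regression line and the observations, and the random part is the mean squared scatter of the estimates around that line. The function name is hypothetical.

```python
def msd_decomposition(est, obs):
    """Return (MSD, MSD_s, MSD_r) for paired estimated/observed series."""
    n = len(obs)
    mu_est, mu_obs = sum(est) / n, sum(obs) / n
    var_obs = sum((o - mu_obs) ** 2 for o in obs) / n
    if var_obs == 0:
        raise ValueError("observations are constant; regression is undefined")
    cov = sum((e - mu_est) * (o - mu_obs) for e, o in zip(est, obs)) / n
    slope = cov / var_obs
    intercept = mu_est - slope * mu_obs
    fitted = [intercept + slope * o for o in obs]  # regression-line estimates

    msd = sum((e - o) ** 2 for e, o in zip(est, obs)) / n
    msd_s = sum((f - o) ** 2 for f, o in zip(fitted, obs)) / n  # systematic
    msd_r = sum((e - f) ** 2 for e, f in zip(est, fitted)) / n  # random
    return msd, msd_s, msd_r
```

Because the line is fitted by least squares, the two components add up to the total MSD, so `msd_s / msd` gives the systematic fraction directly. An estimate that is an exact linear function of the observations (for example, est = 2·obs + 1) has no random component at all.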

Daily Evaluations
Comparison of scatter plots of daily precipitation estimates for the different ERA precipitation products against gauge observations (Figure 2) and assessment of skill scores revealed improvements in successive versions of the ERA precipitation products at the daily scale. ERA-40 and ERA-20CM were the worst-performing products among all ERA versions, as shown by the accumulation of points along the horizontal and vertical axes (Figures 2a and 2b), which indicates poor performance in capturing precipitation occurrence and rate. ERA-20C estimates outperformed those of the earlier versions (Figure 2c), presumably due to its assimilation of observational data, as explained earlier in Section 2.2.5 and Table 1. This performance improvement was reflected in higher CC (0.49, ∼310% increase compared with ERA-20CM) and lower RMSE (2.9 mm/day, ∼10% decrease compared with ERA-20CM) (Figure 2c). In addition, ERA-20C estimates were less concentrated along the horizontal and vertical axes and lay closer to the perfect agreement line. This improving trend was also seen for ERA-Interim and ERA5 (Figures 2d and 2e). However, although ERA5 was able to capture more extreme daily observations than ERA-Interim (enhanced distribution of points toward the upper right side of the perfect agreement line), its estimates were more scattered in the plot (especially around the horizontal and vertical axes), indicating more erroneous estimates (Figure 2e). ERA-Interim, the version before ERA5, outperformed the other products at the daily timescale, with the highest CC (0.59, compared with 0.55 for ERA5), lower RMSE (3 mm/day, compared with 3.8 mm/day for ERA5), better RBias (8%, compared with 30% for ERA5), and a better distribution of points around the perfect agreement line (Figure 2d). ERA5 significantly underestimated precipitation between ∼5 and ∼25 mm/day.
In Figure 2, the highest concentration of estimation-observation pairs in all panels was around the zero-zero value, reflecting the high number of non-rainy days in the time series, especially across the most arid regions in central Iran. It should also be noted that the different spatial resolutions of the ERA precipitation products can influence the error statistics. As shown in Table 1, the spatial resolution of the ERA precipitation products varies from 125 km in ERA-40 to 31 km in ERA5, and models with higher spatial resolution would be expected to provide a more realistic representation of precipitation than the coarser products. However, an increase in spatial resolution can also affect error propagation in numerical models, increasing error values and limiting the usefulness of model outputs. In fact, models with coarser spatial resolution smooth the precipitation field, which can improve statistics such as RMSE, correlation coefficient, or bias. Therefore, this issue of scale must be considered when interpreting the results and when comparing the performance of different ERA precipitation products.
Nevertheless, it is also worth noting that the issue of scale applies when comparing the first three versions of ERA precipitation products with ERA-Interim, as well as when comparing ERA-Interim with ERA5. As can be seen in Figure 2, the higher resolution of ERA-Interim considerably improved precipitation estimation compared with the previous coarser products. However, this did not happen between ERA-Interim and ERA5, although ERA5 has a much higher resolution than ERA-Interim. This means that, although the same issue of scale and the same methodological limitations apply to all of these comparisons, ERA-Interim improved on its predecessors while ERA5 did not improve on ERA-Interim to the same degree.
Figure 3 shows box plots of daily KGE and its components (CC, bias ratio, and variability ratio) for all grid cells, together with the spatial distribution across Iran. ERA-40 showed very poor performance, mainly due to high bias ratio (β) and low CC (Figure 3a). There was a significant improvement in all subsequent ERA precipitation products, as reflected in lower bias ratio and higher CC components (significantly better index values with shorter interquartile range, see Figures 3a and 3b). Median KGE for ERA-20CM, ERA-20C, ERA-Interim, and ERA5 was 0.1, 0.34, 0.45, and 0.39, which was a 93%, 124%, 132%, and 127% improvement, respectively, compared with ERA-40. Interestingly, there was no significant improvement in the variability ratio (γ) in ERA products after ERA-40. Again, and in contrast to general expectations, ERA-Interim had slightly higher median KGE (0.06, +16%), higher median CC (0.02, +4%), lower bias ratio (0.12, −10%), and lower variability ratio (0.12, −16%) compared with ERA5 (Figure 3). This indicates a relatively small decrease in ERA5 performance in precipitation estimation compared with ERA-Interim at the daily scale.
Evaluation of the spatial pattern of KGE for different ERA precipitation products across Iran (Figure 3e) showed that, in addition to the higher spatial resolution, the accuracy of ERA precipitation estimates improved from ERA-40 to ERA5. Compared with ERA-40 and ERA-20CM, which exhibited lower spatial skill in terms of KGE, ERA-20C, ERA-Interim, and ERA5 showed improved performance for most parts of Iran. In particular, ERA5, ERA-Interim, and partly ERA-20C showed better performance for mountainous regions in western Iran (Zagros Chain) (Figure 3e). However, grid cells located in the northwest and coastal regions around the Caspian Sea were associated with poor representations of precipitation rates, especially by ERA5. Precipitation in central arid areas and the Iranian deserts was mainly misrepresented by all ERA precipitation products, or not included in the analysis due to lack of gauge stations.
To assess how the ERA precipitation products performed in estimating precipitation of different intensities, KGE and its related components (CC, β, γ) were calculated separately for the four precipitation intensity categories (0-5, 5-10, 10-20, and >20 mm/day) (Figure 4). The size of the sample in each category can be found in Table S1 in Supporting Information S1. Based on the KGE values (Figure 4a), for the category 0-5 mm/day (representing more than 95% of the data, see Table S1 in Supporting Information S1), ERA-20C, ERA-Interim, ERA5, ERA-20CM, and ERA-40 provided daily estimates in descending order of accuracy (KGE = 0.2, 0.16, −0.5, −0.27, and −0.31, respectively). However, the order changed to ERA5, ERA-Interim, ERA-20C, ERA-20CM, and ERA-40 as precipitation intensity increased. For example, for precipitation >20 mm/day (around 1% of the sample size), KGE was −0.24, −0.35, −1.34, and <−3 for ERA5, ERA-Interim, ERA-20C, and ERA-20CM/ERA-40, respectively. These trends indicated that the two more recent ERA precipitation products, and in particular ERA5, performed better in capturing extreme precipitation events in Iran. This can be attributed to the higher spatial resolution of ERA-Interim (79 km) and ERA5 (31 km) compared with their predecessors (125 km), leading to better representation of local precipitation processes and intense rainfall. Similar patterns in CC, bias ratio, and variability ratio were found (Figures 4b-4d), confirming that ERA5 provided the best performance for the highest rainfall category (>20 mm/day).
The categorical metrics POD, FAR, and HSS were used to assess the ability of the ERA products to capture the occurrence or non-occurrence (precipitation estimated as zero by each product) of precipitation events (Figure 5). The POD of the ERA products improved steadily from ERA-40 to ERA5 (from 0.32 to 0.69) (Figure 5a), indicating a higher capability of ERA5 in detecting rainy days compared with its predecessors. This was a particularly important finding considering that ERA5 had the highest spatial resolution of all products.
Despite the improvement in POD for newer ERA precipitation products, the FAR of ERA5 (median 0.58) increased considerably compared with ERA-20C (median 0.47) and the FAR of ERA-Interim (median 0.48) increased slightly, but with larger interquartile range. The higher FAR in ERA5, indicating a higher probability of falsely reporting precipitation events, led to a lower median value of HSS (an overall measure summarizing all categorical metrics) for ERA5 (0.45) compared with ERA-Interim (0.49) (Figure 5b). This indicates that the overall ability of the ERA precipitation products in detecting precipitation events at daily scale in Iran improved from ERA-40 to ERA-Interim, but decreased in the most recent product (ERA5).
The spatial patterns in HSS (Figure 5b) also indicated low detection skill for ERA-40 and ERA-20CM in the study area. HSS improved for ERA-20C estimates and further for ERA-Interim estimates, mostly in the western mountainous region of Iran (Zagros Chain). ERA5 showed slightly weaker performance in terms of HSS, particularly in the northwest and northeast of the country. Figures S3 and S4 in Supporting Information S1 show maps of the spatial distribution of POD and FAR across Iran, based on the precipitation estimates of the different ERA products.
Calculation of the categorical metrics using different precipitation thresholds for rainy/non-rainy days revealed critical information on the capacity of the different ERA products in detecting precipitation (Figure 6). The FAR, POD, and HSS values for different thresholds in ERA-40 and ERA-20CM showed that these products had very little skill in detecting precipitation events in Iran, especially at higher intensities. More than 95% of the precipitation estimates by ERA-40 and ERA-20CM for rainfall events >10 mm/day were false alarms, and these products missed almost all >10 mm/day events (Figures 6a and 6b).
As expected, ERA-20C showed considerable improvement over ERA-20CM, and the improvement continued in ERA-Interim and ERA5, with the latter able to capture ∼95% of precipitation events at the 0 mm/day threshold (Figure 6b). However, this came at the cost of a FAR higher than 0.7, which means that more than 70% of ERA5 precipitation estimates for this rainfall category were false alarms. This can be attributed to the higher spatial resolution of ERA5 and its greater potential to detect local precipitation patterns that are not captured by the reference data set (due to the low number of gauges in those grid cells). In general, high values of both POD and FAR led to higher bias, as shown in Figure 6d for ERA5 at the threshold <1 mm/day. Overall, the performance of ERA5 in estimating precipitation in the range 0-1 mm/day was lower than that of ERA-Interim. In terms of HSS (Figure 6c), ERA5 outperformed ERA-Interim for intense precipitation (i.e., higher than 13 mm/day). In addition, there was a much more gradual increase in FAR for ERA5 compared with ERA-Interim, indicating a lower probability of false alarms by ERA5 for higher precipitation estimates. These findings, together with the higher POD values for ERA5 at all precipitation thresholds compared with ERA-Interim (Figure 6b), reveal that ERA5 had a higher capability in representing intense and extreme precipitation events, despite its overall weaker performance at lower precipitation rates. The enhanced performance in the extreme precipitation categories can again be attributed to the higher spatial resolution of ERA5 compared with ERA-Interim, which improves its ability to capture local precipitation processes. The higher spatial resolution of ERA5 may also have affected the growth and propagation of errors in the numerical algorithm of the product and thereby increased the final error values.
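The link between high POD, high FAR, and high bias follows directly from the contingency-table definitions, assuming that the bias shown in Figure 6d is the usual frequency bias B (estimated rainy days divided by observed rainy days). With H hits, M misses, and F false alarms:

```latex
\mathrm{POD} = \frac{H}{H+M}, \qquad
\mathrm{FAR} = \frac{F}{H+F}
\;\Rightarrow\;
1-\mathrm{FAR} = \frac{H}{H+F},
\qquad
B = \frac{H+F}{H+M} = \frac{\mathrm{POD}}{1-\mathrm{FAR}}
```

Under this assumption, POD ≈ 0.95 combined with FAR > 0.7 implies B > 0.95/0.3 ≈ 3.2, that is, a product reporting more than three times as many rainy days as the reference data set at that threshold.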
Figure 7 shows the spatial patterns in the systematic and random error components as percentages and as absolute values of MSD averaged over all grid cells across Iran. Among all ERA precipitation products, ERA-40 had the lowest systematic error component (low ratio of systematic error to the sum of systematic and random errors) (Figure 7a). However, it had very high random error values (MSDr) across all grid cells. Although ERA-20C improved on ERA-20CM's estimation algorithm and provided lower systematic and random errors, it still suffered from a higher mean MSDr error component in its precipitation estimates. ERA-Interim was the best-performing ERA precipitation product, giving the lowest relative systematic error for different parts of Iran (Figure 7a) and a lower mean MSDs than MSDr (Figure 7b). Finally, although ERA5 gave a lower relative systematic error (compared with its random error component), its absolute MSDr and MSD values were higher, which can degrade ERA5 precipitation estimates relative to ERA-Interim. These results indicate an overall improvement in ERA products from ERA-40 to ERA-Interim (with decreased systematic error), but slightly decreased performance in ERA5 in terms of both systematic and random error components. This may be caused by the higher resolution of ERA5, but it can also indicate that the ERA5 modeling approach needs further improvement to match the spatial resolution of its output.
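For readers unfamiliar with this kind of error decomposition, the sketch below shows one common way to split the mean squared difference (MSD) into systematic (MSDs) and random (MSDr) parts using a least-squares fit of estimates on observations. This is an assumed formulation (a Willmott-style decomposition); the exact procedure used in the study is defined in the methods section, and the function name `decompose_msd` is hypothetical.

```python
import numpy as np

def decompose_msd(obs, est):
    """Split MSD into systematic and random parts via a linear fit est ~ a + b*obs."""
    obs = np.asarray(obs, dtype=float)
    est = np.asarray(est, dtype=float)

    slope, intercept = np.polyfit(obs, est, 1)  # least-squares line
    fitted = intercept + slope * obs            # systematic part of the estimate

    msd_s = np.mean((fitted - obs) ** 2)        # systematic component
    msd_r = np.mean((est - fitted) ** 2)        # random component
    msd = np.mean((est - obs) ** 2)             # total; equals msd_s + msd_r

    total = msd_s + msd_r
    systematic_pct = 100.0 * msd_s / total if total > 0 else float("nan")
    return {"MSD": msd, "MSDs": msd_s, "MSDr": msd_r,
            "systematic_pct": systematic_pct}
```

With an ordinary least-squares fit the two components sum exactly to the total MSD, so a product can show a low systematic percentage (as reported for ERA-40) simply because its random component is very large.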

Monthly Evaluations
To evaluate the performance of ERA precipitation products at monthly timescale, the statistical and categorical evaluation metrics were re-calculated for the monthly time series. Scatter plots of monthly estimates versus observations for all ERA precipitation products (Figure 8) showed considerably improved performance of all products at the monthly scale compared with the daily scale, as was expected. This was reflected in paired points in the scatter plots being better distributed around the perfect agreement line (Figures 8a-8e). All products were also more capable of estimating precipitation in wetter months. As found for the daily results (see Figure 2), ERA-40 and ERA-20CM were again the two products with the weakest performance metrics, while ERA-20C had the highest CC (0.78) and lowest RMSE (22.9 mm/month) values. Compared with ERA-Interim and ERA5, ERA-20C estimates were more scattered around lower precipitation rates, indicating underestimation (lower RBias). This can be due to the lower spatial resolution of ERA-20C compared with ERA-Interim and ERA5, resulting in dampening of estimated and aggregated precipitation data over larger grid cells.
Comparisons of the performance of ERA-Interim and ERA5 (Figures 8d and 8e) indicated that, despite both having the same CC value (0.76), ERA5 achieved better performance in capturing wet months due to its higher spatial resolution, as the points reached closer to the upper right corner of the scatter plot (Figure 8e). However, ERA5 also had higher RMSE (32.5 mm/month) than ERA-Interim (26.7 mm/month).
Monthly estimates by the ERA precipitation products were also evaluated using KGE (for full results, see Figure S5 in Supporting Information S1). Compared with the daily results, the monthly results showed slightly higher median KGE values of −1.42, 0.3, 0.56, 0.65, and 0.61 for ERA-40, ERA-20CM, ERA-20C, ERA-Interim, and ERA5, respectively. They also showed higher monthly correlations, especially for ERA-40 and ERA-20CM (larger interquartile range toward higher CC in the box plots, see Figure S5a in Supporting Information S1), variability ratios closer to 1 (except for ERA-20C), and bias ratio values similar to those of the daily results. The spatial patterns in the gridded monthly KGE maps (see Figure S5e in Supporting Information S1) also showed higher KGE values, particularly for mountainous regions of western and south-western Iran. Overall, based on the monthly KGE calculations in Figure S5 in Supporting Information S1, ERA-Interim achieved the best performance of all products. However, it should be taken into account that ERA5 achieved comparable skill scores at a considerably higher spatial resolution.
To perform a more in-depth analysis of ERA products at the monthly scale, KGE and its components were also calculated for different mean annual precipitation categories (see Section 2.4.1) (Figure 9). These categories ranged from grid cells in the dry central regions of Iran, through moderate and wet highlands in the northwest, to humid and very wet coastal areas by the Caspian Sea in the north (see Table S2 in Supporting Information S1 for the size of the data subset in each category). According to the KGE values (Figure 9a), ERA-Interim was the best-performing ERA product in all categories except the wettest grid cells with the highest annual precipitation (≥600 mm/year), in which ERA5 outperformed ERA-Interim.
Both ERA5 and ERA-Interim improved on the estimates of their predecessors for all annual precipitation categories at the monthly timescale, which is a sign of successful evolution in ERA precipitation products.
Figure 10 shows the pattern of mean monthly precipitation according to the reference data set and the ERA precipitation products. ERA-40, with considerable overestimation, was not able to correctly capture the values or the seasonality of observed mean monthly precipitation (Figure 10a). The performance of ERA-20CM was considerably better than that of ERA-40, but there were still some inaccurate monthly patterns (e.g., in April, July, and December) (Figure 10b). ERA-20C, ERA-Interim, and ERA5 all correctly captured the trend in the mean monthly reference data set, but ERA-Interim provided the most accurate monthly estimates, while ERA-20C and ERA5 showed relatively constant biases (underestimation and overestimation, respectively).

Annual Evaluation
For each ERA product, annual precipitation across Iran was calculated by averaging the estimates over grid cells containing at least one observational gauge. A similar calculation was made for the reference data set, using the observational gauge data aggregated in the grid cells of the ERA products and applying the modified IDW method (see Section 2.3). Finally, the difference between these two time series was plotted (Figure 11).
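The annual comparison described above can be sketched as follows. The example assumes that the product and the reference data set are already available as arrays of annual totals on the same grid, and it uses a simple unweighted mean over cells (no area weighting), which is an assumption. The function name `annual_difference` and the array layout are hypothetical.

```python
import numpy as np

def annual_difference(product, reference, gauge_count):
    """Difference of country-mean annual precipitation (product minus reference).

    product, reference: arrays of shape (years, cells), annual totals in mm
    gauge_count: array of shape (cells,), number of gauges in each grid cell
    """
    product = np.asarray(product, dtype=float)
    reference = np.asarray(reference, dtype=float)
    has_gauge = np.asarray(gauge_count) >= 1         # keep cells with >= 1 gauge

    product_mean = product[:, has_gauge].mean(axis=1)
    reference_mean = reference[:, has_gauge].mean(axis=1)
    return product_mean - reference_mean             # one value per year
```

Positive values indicate years in which the product overestimates the reference data set; a series that stays above zero with little year-to-year change corresponds to a relatively constant positive bias.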
As shown in Figure 11, ERA-40 overestimated the reference data set at the annual scale. This issue was considerably improved in ERA-20CM and ERA-20C, although some erroneous annual fluctuations were observed, particularly in ERA-20CM. ERA-Interim and ERA5 again gave the lowest errors at the annual scale (as well as at daily and monthly scales, see previous sections), with ERA-Interim being closest overall to the reference data set. ERA5 annual precipitation estimates suffered from a relatively constant positive bias. Therefore, similarly to the daily and monthly results, and contrary to findings in previous studies in Iran (Fallah et al., 2020; Khoshchehreh et al., 2021; Taghizadeh et al., 2021), our evaluations at annual scale suggested that the evolution of ERA precipitation products was quite successful from ERA-40 to ERA-Interim, but ERA5 had less skill than ERA-Interim across different climates and geographical conditions in Iran. However, it is important to note that the spatial resolution of the estimates improved considerably from ERA-40 to ERA5, which can influence the uncertainties associated with the estimated and reference precipitation data sets created in this study.

The Issue of Scale
Global precipitation products are gridded data sets, whereas ground truth observations are mainly point-source measurements, which leads to a scale issue in performance evaluation studies. There are two main approaches for assessing product performance, (a) point-grid and (b) grid-grid comparison, each with its own advantages and disadvantages. In the first approach, gridded estimates from the product are compared directly with the point-source observations at rain gauges located inside the grid cells. In the second approach, a gridded reference data set is first produced by spatial interpolation of the point-source observations and then compared with the gridded precipitation product at the same grid cells. In the point-grid approach, the observational data set is not manipulated. However, comparing the average state of precipitation over a relatively large area (sometimes with great spatial variation), as given by the gridded product, with the point-source precipitation observed at a rain gauge introduces uncertainty in the final results. The grid-grid approach overcomes this uncertainty, but it introduces another type of uncertainty through the spatial aggregation of the point-source observations when creating the reference data set.
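The difference between the two approaches can be illustrated by how the observation/estimate pairs are built for a single time step. In the sketch below, a plain arithmetic mean of the gauges in each cell stands in for the spatial aggregation step; it is not the modified IDW method used in this study. The function name `build_pairs` and the input layout are hypothetical.

```python
import numpy as np

def build_pairs(gauge_values, gauge_cell, product_by_cell):
    """Return (point_grid, grid_grid) pairs of (reference, estimate) arrays.

    gauge_values: precipitation at each gauge (mm)
    gauge_cell: index of the grid cell containing each gauge
    product_by_cell: product estimate for every grid cell (mm)
    """
    gauge_values = np.asarray(gauge_values, dtype=float)
    gauge_cell = np.asarray(gauge_cell)
    product_by_cell = np.asarray(product_by_cell, dtype=float)

    # Point-grid: every gauge is paired with the estimate of its own cell
    point_grid = (gauge_values, product_by_cell[gauge_cell])

    # Grid-grid: gauges are first aggregated per cell (simple mean as a
    # stand-in), then each gauged cell is paired with the product estimate
    cells = np.unique(gauge_cell)
    cell_reference = np.array(
        [gauge_values[gauge_cell == c].mean() for c in cells]
    )
    grid_grid = (cell_reference, product_by_cell[cells])
    return point_grid, grid_grid
```

In the point-grid pairs, a cell with many gauges contributes many pairs that all share one estimate, whereas in the grid-grid pairs each gauged cell contributes exactly one pair. Any evaluation metric can then be computed on either set of pairs.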
In this study, we employed the second approach, as we had access to a sufficient number of stations with proper spatial distribution across Iran. However, to understand the impact of this methodological choice on the final results and on the ranking of the different ECMWF precipitation products, we also calculated the KGE index and its components using the point-grid approach. Figure 12 shows the medians of KGE, correlation (r), bias ratio (β), and variability ratio (γ) at the daily timescale across Iran for all products, calculated separately with the point-grid and grid-grid approaches (the corresponding results for the monthly time step are shown in Figure S6 in Supporting Information S1). The point-grid and grid-grid results show similar error index values for all products, with two exceptions: ERA-40, which has the largest grid cells and lowest spatial resolution, and the correlation coefficient of ERA5, which is slightly better than that of ERA-Interim in the point-grid approach (possibly a sign of partial improvement in the representation of local precipitation by ERA5). Nevertheless, despite these slight changes, the overall pattern of progress in ECMWF precipitation products from ERA-40 to ERA5 and the product rankings remain the same. This indicates that the choice between the point-grid and grid-grid approaches, and the associated scale issue, can influence the statistical calculations and marginally change the final results, but does not alter the ranking of products. This is in line with the findings of Saemian et al. (2021) in their evaluation of 44 precipitation products across Iran. Their comparison of the point-grid and grid-grid approaches also revealed that, although the error indices changed slightly depending on the approach, the overall ranking of the data sets remained unchanged.

Conclusions
This study investigated the performance of the five ERA precipitation products developed to date (ERA-40, ERA-20CM, ERA-20C, ERA-Interim, and ERA5) against a reference observational data set for Iran. Long-term historical daily precipitation observations from a total of 2119 gauges (479 synoptic and 1640 TAMAB stations) were quality-checked and used to create the reference data set through spatial aggregation of point-source observations within ERA grid cells. Different statistical and categorical metrics, time series comparisons, and error decomposition methods were applied to investigate uncertainties and error characteristics of the different ERA precipitation products and assess the accuracy of more recent versions compared with their predecessors when applied to Iran. Uncertainties and error values quantified for different spatial regions showed poor performance of all ERA products for coastal areas along the Caspian Sea in the north and for mountainous regions in northwest Iran. The estimates were more accurate (especially ERA-Interim and ERA5 estimates) for western (Zagros Mountain chain) and southwestern and southern (Persian Gulf coast) parts of Iran. ERA-40 was the worst-performing product in all analyses, with considerably high positive bias. ERA-20CM provided marginal improvements over ERA-40 but still contained serious errors, including high relative systematic error for different parts of Iran. By assimilation of observational data into ERA-20CM, ERA-20C provided a considerable improvement in all error indices compared with its predecessors. The two most recent products, ERA-Interim and ERA5, introduced lower error compared with earlier products and showed similar performance based on different evaluation metrics.
In addition to having higher spatial resolution, ERA5 outperformed all other ERA products through more accurate estimation of intense precipitation events in wetter regions of Iran. ERA5 also showed good capability for detecting the occurrence of precipitation events, with higher POD values than the other ERA products. However, based on the KGE, CC, and RMSE indices and its lower systematic error, ERA-Interim outperformed ERA5 (and the other products) overall. ERA-Interim and especially ERA5 had higher FAR values than their immediate predecessor (ERA-20C), which indicates more false alarms in their estimates that can be a source of positive bias. ERA-40, ERA-20CM, and to some extent ERA-20C could not correctly simulate the long-term pattern of monthly precipitation changes over Iran, while ERA-Interim and ERA5 showed considerably improved performance, with ERA-Interim being the most accurate ERA product based on the monthly patterns and analyses.
Overall, the findings at different temporal scales, based on various evaluation indices, suggested that the ERA precipitation products improved considerably from ERA-40 to ERA-Interim, but showed a slight decrease in performance in ERA5 based on some indices. This finding is particularly important, as several previous studies in Iran have identified ERA5 as the best ERA precipitation product, with consistent improvements over ERA-Interim. Differences between the study areas and gauge stations used in the reference data sets might have affected our final results in comparison with previous studies. It should also be noted that, despite the marginal decrease in some indices, ERA5 has significantly increased the spatial resolution of precipitation estimates, from 79 km in ERA-Interim and 125 km in the earlier products to 31 km in ERA5, and can thus better capture local precipitation patterns and precipitation extremes. Although this higher spatial resolution can enhance the spatial representation of precipitation phenomena and yield more accurate estimates of the precipitation field, it can also lead to the propagation, and thus increase, of errors in the numerical models and in the final model outputs. However, ERA-Interim (79 km) achieved a considerable improvement in all skill scores over its predecessors (125 km) across a similar step in spatial resolution, which indicates a more successful evolution from ERA-40 to ERA-Interim than from ERA-Interim to ERA5. This suggests that, for modeling precipitation over Iran, the current modeling capacity of ERA products is better matched to a spatial resolution of 79 km (ERA-Interim) than to 31 km (ERA5).
As a possible direction for future studies, we encourage the comparison of ERA5 and ERA-Interim precipitation products in other regions of the world, particularly using statistics such as the Fractions Skill Score, or variables such as the daily precipitation maximum over the study area, that show the added value of higher-resolution models. Finally, it should be noted that we used a grid-grid approach for our evaluation in this study; however, a comparison between the grid-grid and point-grid approaches showed that the results from both approaches are comparable, so this choice did not change the overall results.

Conflict of Interest
The authors declare no conflicts of interest relevant to this study.