Assessment of NA‐CORDEX regional climate models, reanalysis and in situ gridded‐observational data sets against the U.S. Climate Reference Network

Climate models' capability of reproducing the present climate at both global and regional scales still needs improvement. The assessment of model performance critically depends on the data sets used as comparators/references. Reanalysis and gridded observational data sets have been frequently used for this purpose. However, none of these can be considered an accurate reference data set because of their associated uncertainties and lack of full representativity. This paper, for the first time, uses in‐situ measurements from the National Oceanic and Atmospheric Administration U.S. Climate Reference Network (USCRN) spanning the period 2006–2020 to assess daily temperature and precipitation from a suite of dynamically downscaled regional climate models (RCMs; driven by ERA‐Interim) involved in NA‐CORDEX. The assessment is also extended to the most recent and widely used Earth system reanalyses (ERA5, ERA‐Interim, MERRA2 and NARR) and a few in situ‐based gridded data sets (Daymet, PRISM, Livneh and CPC). Results show that biases for the different data sets are seasonally and subregionally dependent. On average, reanalysis and in situ‐based gridded data sets are warmer (with biases exceeding 0.3°C) than USCRN year‐round, while RCMs are colder (warmer) in winter (summer) with biases ranging from −0.5 (0.9)°C for RCMs at 0.44° to −0.2 (1.4)°C for CRCM5‐UQAM‐11. In situ‐based gridded data sets provide the best performance in most of the Contiguous United States (CONUS) regions compared to reanalyses and RCMs, but still have biases in regions such as the Western mountains and the Pacific Northwest. Furthermore, in most US subregions, reanalysis data sets do not outperform reanalysis‐driven RCMs. Likewise, for both reanalysis data sets and RCMs, temperature and precipitation biases vary considerably depending on the local orography, with larger temperature biases for coarser model resolutions and precipitation biases for reanalysis.


KEYWORDS
high-resolution, NA-CORDEX, precipitation, reference measurements, regional climate models, temperature, USCRN

| INTRODUCTION
Near-surface temperature and precipitation are the most frequently used variables in climate studies. Models and observations have been extensively studied over the past decades to estimate trends in these variables and to assess the impact of global warming on health, food security, ecosystems and water supply. Both are highly variable in space and time and the historical time series are often prone to inhomogeneities due to changes in instrumentation, calculation algorithms, station re-locations and other factors, which must be adjusted to enable the identification of climate signals (Essa et al., 2022; Hausfather et al., 2016; Madonna et al., 2022). Moreover, observations over certain regions are sparse, resulting in large sampling uncertainties (Sy et al., 2021). Nevertheless, near-surface observations are still one of the main data sources for the evaluation of climate models.
Climate models are used to understand past, present, and future climate variability and change. However, one of the key challenges in their assessment is the spatial mismatch between observations and models (e.g., Zhang et al., 2011), the latter with resolutions varying from 10 to 50 km for regional climate models (RCMs) and from 50 to 300 km for the global climate models (GCMs) (Eyring et al., 2016; Taylor et al., 2012). The scale mismatch effects are more evident in climate models and Earth system reanalyses with a coarse horizontal resolution. Therefore, different interpolation techniques have been used in previous research to match near-surface observations to climate model scales (Comber & Zeng, 2019; Herrera et al., 2019; Militino et al., 2015). Despite the potential errors that can arise from data set interpolation procedures, these methods can still be applied with a certain level of confidence in regions where the spatial coverage of observations is dense. Widely used alternative comparators for assessing climate models are reanalysis and gridded observational data sets (e.g., Gibson et al., 2019; Srivastava et al., 2020, 2022). An increasing number of studies evaluate RCMs, accounting for observational uncertainties, by comparing their results with various gridded observational data sets globally (Harris et al., 2014, 2020) and in different regions, including the United States (Gibson et al., 2019; Srivastava et al., 2020, 2022) and Europe (e.g., Bandhauer et al., 2022; Haylock et al., 2008; Herold et al., 2016; Herrera et al., 2016, 2019; Kotlarski et al., 2019).
Consequently, different conclusions are often drawn and the real model performance is altered by the contribution of observational uncertainties (Gibson et al., 2019; Gómez-Navarro et al., 2012; Kotlarski et al., 2019; Prein & Gobiet, 2017; Srivastava et al., 2022). A debate on the uncertainties due to the usage of reanalysis and gridded surface observational data sets in the model evaluation has been ongoing for several years and was addressed in several papers from different regions, including the United States (e.g., Gervais et al., 2014; Gibson et al., 2019; Srivastava et al., 2020, 2022), Canada (e.g., Diaconescu et al., 2018) and Europe (e.g., Flaounas et al., 2012; Gómez-Navarro et al., 2012; Kotlarski et al., 2019; Prein & Gobiet, 2017). The comparison uncertainties are influenced by several factors, such as: (i) the use of interpolation methods/techniques mapping point-based site measurements to a regular grid (Contractor et al., 2015; Herrera et al., 2019; Hofstra et al., 2010), as well as the limitations of different interpolation techniques used for different topographic regions. Various methods, such as Inverse Distance Weighting, Kriging and Thiessen polygon, have been typically considered for both precipitation and temperature data (see Ribeiro et al., 2021; Shen et al., 2001; Teegavarapu et al., 2018). These methods exhibit different strengths and limitations, which can significantly impact the representation of the climate variable (Attorre et al., 2007; Avila et al., 2015; Contractor et al., 2015; Gervais et al., 2014); (ii) the representativeness in complex terrain, particularly when observations fail to properly capture the spatial dependence between the topography, land heterogeneities or coastal features and physical quantities. Furthermore, a key challenge in this respect is the limited coverage of station observations in numerous high-elevation regions. Moreover, remote sensing data may not be able to accurately capture meteorological fields in complex topography. For example, temperature measurements at high elevations may be biased due to the altitude-dependent decrease in air pressure. The related bias adjustment may be challenging in some cases (Pepin et al., 2022); and (iii) the way climate variables are adjusted on the orography, as well as the choice of the most appropriate spatial resolution for gridded observational data sets (Sandu et al., 2019). The choice of the target resolution depends on the application and the spatial scale of the climate variables being studied, but it can also affect the accuracy of the interpolation results. For example, the usage of a coarser resolution may result in oversimplification of the precipitation patterns, while a finer resolution may produce unrealistic results due to interpolation errors.
High-quality observational records are key to solving many of the issues in assessing climate models. Reference data sets (Thorne et al., 2017) and homogenized (i.e., bias-adjusted) historical time series (Madonna et al., 2022) are required because non-climatic discontinuities can compromise the interpretation of decadal climate variability and change. In the last two decades, international measurement programmes, such as the Global Climate Observing System, designed 'reference networks' to monitor climate with the objective of filling an important gap in the global observing system (https://gcos.wmo.int/en/home). Reference networks can provide long-term, high-quality climate data records that are traceable to SI standards and have quantified uncertainties (Thorne et al., 2017, 2018). The U.S. Climate Reference Network (USCRN) is one of the best examples of a reference network measuring near-surface air temperature and precipitation while also measuring several quantities of influence (Buban et al., 2020; Diamond et al., 2013; Madonna et al., 2023).
Furthermore, within the framework of the European CORDEX (EURO-CORDEX; Jacob et al., 2020) initiative, a substantial and voluntary effort has been undertaken to advance the field of regional climate and Earth system science in Europe. Operating as part of the World Climate Research Programme (WCRP)-Coordinated Regional Downscaling Experiment (CORDEX), EURO-CORDEX shares common objectives aimed at enhancing the evaluation of climate models and improving climate projection frameworks. More recently, Diez-Sierra et al. (2022) have conducted a major comprehensive exploration of RCM evaluation within the broader context of global initiatives such as CORDEX-CORE and Copernicus Climate Change Service (C3S) (Buontempo et al., 2022), encompassing regions, including North America and Europe. These studies collectively contribute to a holistic understanding of RCM performance across diverse geographic areas and various spatial scales.
In this work, the ability of RCMs participating in the North American Coordinated Regional Climate Downscaling Experiment (NA-CORDEX; Mearns et al., 2017; Bukovsky & Mearns, 2020) over the Contiguous United States (CONUS) is evaluated, along with the performance of atmospheric reanalysis and in situ-based gridded surface observational data sets. To address these issues, and in contrast to previous studies available in the literature (e.g., Gibson et al., 2019; Srivastava et al., 2020, 2022), observations from the USCRN network are used as the reference data set for comparison.
This paper focuses on three main scientific questions:
i. How well do NA-CORDEX climate models, in situ-based gridded observational and reanalysis data sets represent CONUS daily temperature and precipitation versus local reference in-situ observations?
ii. Are recent gridded-observational data sets and reanalysis products reliable for climate model evaluation?
iii. How does the improved resolution (from 0.44° to 0.22°/0.11°) in NA-CORDEX RCMs bring value to in-situ reference observations?
The remainder of the paper is structured as follows. Section 2 describes the USCRN data, RCMs, in situ-based gridded data sets and atmospheric reanalysis data used. The statistical metrics and the subregional evaluations are presented in Section 3. Section 4 assesses the performance of individual models and of the multi-model ensemble mean in simulating the CONUS local climate characteristics, along with a subregional assessment of RCM biases and of the uncertainties of reanalysis and gridded surface observational data sets. Finally, a discussion of the most relevant results together with the main conclusions and recommendations is provided in Section 5.

| U.S. Climate Reference Network
The U.S. Climate Reference Network (USCRN) (Diamond et al., 2013) is a systematic and sustained network of 139 stations deployed across the CONUS, Alaska and Hawaii. Stations are managed and maintained by the National Oceanic and Atmospheric Administration's (NOAA) National Centers for Environmental Information (NCEI). The primary goal of USCRN is to provide long-term homogeneous observations of temperature, precipitation, and soil moisture/soil temperature that can be used for current climate applications while also being coupled to past long-term observations for the detection and attribution of climate change (Diamond et al., 2013). USCRN stations use high-quality instruments to measure temperature, precipitation, wind speed, soil conditions, and other ancillary variables (https://www.ncdc.noaa.gov/crn/). For both temperature and precipitation, the concept of triple measurement redundancy (i.e., three collocated thermometers or rain gauges) is adopted in the data processing to improve the quality of the estimates. The quality of individual observations and the continuity of the records at each site are monitored on a routine basis. Specific information regarding USCRN instrumentation can also be found at www.ncdc.noaa.gov/crn/instrdoc.html.
USCRN has also been classified as a reference network in the frame of the European Union's Horizon 2020 Research Project GAIA-CLIM (Gap Analysis for Integrated Atmospheric ECV CLImate Monitoring) using a maturity matrix approach (Thorne et al., 2017). Among the 139 USCRN stations, only the 129 stations with at least 12 valid years (i.e., without missing data) during the 2006–2020 period across the CONUS (as depicted by black circles in Figure 1a) were used in this paper. The USCRN stations cover the United States uniformly and measure at different altitudes (Figure 1b,c) and in different climate regimes (Figure 1a). They are placed as far away from vegetation as possible, in rural environments expected to be free of human activities and land-use/land cover change effects (Sy & Quesada, 2020). Observations are performed to fulfil the requirements of the World Meteorological Organization (WMO), as well as US requirements for the variables being observed (WMO, 2008).
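The station selection rule above (at least 12 valid years within 2006–2020, where a valid year has no missing data) can be sketched as follows. This is only an illustration under stated assumptions: the input layout (one tuple of station identifier, year and number of valid daily values) and the function names are hypothetical and are not taken from the USCRN data distribution or from the processing used in this study.

```python
import calendar
from collections import defaultdict


def days_in_year(year):
    """Number of calendar days in the given year."""
    return 366 if calendar.isleap(year) else 365


def select_stations(daily_counts, min_valid_years=12,
                    first_year=2006, last_year=2020):
    """Return the sorted station ids with enough complete years.

    daily_counts: iterable of (station_id, year, n_valid_days) tuples
    (hypothetical layout). A year is treated as valid only when every
    day of that year has a value, i.e. no missing data.
    """
    complete_years = defaultdict(set)
    for station_id, year, n_valid_days in daily_counts:
        in_period = first_year <= year <= last_year
        if in_period and n_valid_days == days_in_year(year):
            complete_years[station_id].add(year)
    return sorted(sid for sid, years in complete_years.items()
                  if len(years) >= min_valid_years)
```

With such a filter, a station that has 11 complete years plus several partially missing years would be excluded, whereas a station with 13 complete years would be retained.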
In Figure 2, the USCRN climatological daily mean temperature and precipitation (2006–2020) for annual, winter and summer seasons are shown. For temperature (Figure 2, left column), annual patterns reveal strong spatial gradients over the CONUS with maximum values exceeding 25.4°C observed over the south, desert and southwest Pacific areas, while the smallest values of 2.0°C are observed over the northern parts of the MtWest and central areas. Similar spatial gradients are also observed in winter and summer, while the coldest temperature value of −13.3°C is recorded in the northern part of the MtWest and Central areas, mainly during winter cold air incursions from the Arctic. By contrast, the highest temperature values exceeding 37.9°C are observed in the desert in summertime. Precipitation (Figure 2, right column) patterns are highly dependent on locations and seasons with the highest values over the southeast and Pacific Northwest. The high precipitation patterns found over southeastern CONUS mainly occur in summer and are likely generated by tropical systems (Mitchell et al., 2019) or other mesoscale atmospheric circulations (Barlow et al., 2019). In contrast, the high precipitation over the northwestern CONUS (with values exceeding 20 mm/day at the Quinault station and 10 mm/day at the Darrington station) occurs in wintertime and may be due to synoptic-scale atmospheric fronts (e.g., Castro et al., 2012; Yu et al., 2022).

| Atmospheric reanalysis data sets
The list of data sets assessed in this study includes four widely used reanalyses and four gridded observational products. The atmospheric reanalysis products used in this paper include the global ERA-Interim (hereafter, ERAI) (Dee et al., 2011), ERA5 (Hersbach et al., 2020), MERRA-2 (Gelaro et al., 2017) and the North American Regional Reanalysis (NARR; Mesinger et al., 2006). ERA5 is the latest climate reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), providing hourly data on a regular latitude-longitude grid at 0.25° × 0.25° resolution (Hersbach et al., 2020). It replaces the ERA-Interim reanalysis (used to force NA-CORDEX simulations; see next section) and is based on the Integrated Forecasting System (IFS) Cy41r2, which became operational in 2016. It thus benefits from a decade of developments in model physics, core dynamics and data assimilation compared to ERAI. In addition to a significantly enhanced horizontal resolution of 31 km, compared to 80 km for ERAI, it has hourly output throughout, and an uncertainty estimate from an ensemble. The general set-up of ERA5, as well as a basic evaluation of characteristics, is provided by Hersbach et al. (2020) and data are publicly available through the Copernicus Climate Data Store (CDS; https://cds.climate.copernicus.eu).
Beyond the two ECMWF products, one of the most recent global atmospheric reanalysis products is MERRA-2 (Gelaro et al., 2017). MERRA-2 is the latest atmospheric global reanalysis of the modern satellite era produced by NASA's Global Modeling and Assimilation Office (GMAO). It also assimilates observation types not available to the earlier generation MERRA reanalysis (Rienecker et al., 2011) and includes updates to the Goddard Earth Observing System (GEOS) model and analysis scheme to provide a viable ongoing climate analysis beyond MERRA. Overall, the MERRA-2 system has most of the same basic features as the MERRA system but includes several important updates (Gelaro et al., 2017). The regional reanalysis product used in this study is the high-resolution North American Regional Reanalysis (NARR; Mesinger et al., 2006). NARR consists of a long-term, consistent, high-resolution climate data set for the North American domain, and is a major improvement upon the earlier regional reanalysis data sets in both resolution and accuracy. It notably differs from the other reanalysis products described above because it does assimilate rain gauge networks into its latent heating scheme (Bukovsky & Karoly, 2007). The direct assimilation of observed precipitation data sets makes it more of a 'hybrid' product compared to other reanalyses (Mesinger et al., 2006).

| In situ gridded observational data sets
Along with the reanalysis products, the four recent gridded-observational data sets used in this paper are the Climate Prediction Center (CPC) Unified CONUS data set (hereafter, CPC) (Higgins et al., 2000), Livneh (Livneh et al., 2013, 2015), the Oregon State University Parameter-Elevation Regressions on Independent Slopes Model (PRISM) (Daly et al., 2008), and the Daily Surface Weather Data on a 1-km Grid for North America, Version 4 (Daymet-v4) (Thornton et al., 2021). CPC (Higgins et al., 2000) is produced by NOAA. It uses station data from the US unified rain gauge data set, composed of multiple sources (Higgins et al., 2000). Based on the inverse-distance weighting interpolation algorithms of Cressman (1959), CPC was developed with the aim of creating regional analyses over the CONUS-Mexico and the South America domains (Higgins et al., 2000). The primary goal consists of developing a US Precipitation Quality Control (QC) system and analysis that improves the QC of rain gauge data used in precipitation analyses for the United States, thereby improving precipitation products and applications in support of climate monitoring, climate prediction, and applied research. Livneh (Livneh et al., 2013, 2015) is a station-based gridded data set at 1/16° (6 km) resolution. It was developed based on the meshing procedure of Maurer et al. (2002). Developed to create regional analyses over Mexico, the United States, and southern Canada, it accounts for orographic effects using the elevation-scaling procedure for precipitation climatology from 1961 to 1990. The PRISM data set (Daly et al., 2008) primarily uses station data from Cooperative Observer Program (COOP) stations and snowpack telemetry (SNOTEL), as well as several other smaller networks. The daily product also incorporates radar observations of 4 km resolution from the Advanced Hydro-Weather Prediction System over central and eastern CONUS. To correct the precipitation-elevation dependence, a linear method, using weights at each grid point based on elevation and location characteristics, is used. Finally, the Daymet-v4 (Thornton et al., 2021) daily surface weather data on a 1-km grid for North America is based on COOP and SNOTEL station networks, like PRISM. The precipitation-elevation dependence is also corrected using a weighted local linear regression.
Moreover, it is worth pointing out that, except for the PRISM data set, which includes USCRN data as an input source for the months between May and September and only after 2017 (Buban et al., 2020), the in situ gridded observational data sets do not use USCRN data in their input data. Additionally, as part of this study focuses on the period 2006–2014 due to the unavailability of USCRN before 2006 and the availability of model simulations up to 2014/2015 (model dependent), the risk of a circularity issue in the comparison among USCRN and in-situ gridded data sets is negligible (Buban et al., 2020). The comparison between gridded-observational data sets also allows a discussion of the uncertainties due to the potential impact of interpolating data from sparse surface observations. The main specifications of each data set and related references for further details are summarized in Table 1 for reanalysis products and Table 2 for the in situ-based gridded observation data sets.

| NA-CORDEX model ensemble
This study examines the NA-CORDEX model ensemble (Mearns et al., 2017) composed of seven RCMs: CRCM5-OUR (Martynov et al., 2013; Šeparović et al., 2013), CRCM5-UQAM (Martynov et al., 2013; Šeparović et al., 2013), RCA4 (Samuelsson et al., 2011), RegCM4 (Giorgi et al., 2012), WRF (Skamarock & Klemp, 2008), CanRCM4 (Scinocca et al., 2016) and HIRHAM5 (Christensen et al., 2007). The NA-CORDEX data set was retrieved from the https://www.earthsystemgrid.org/search/cordexsearch.html data archive. The experiment aims to add value to the existing body of RCMs by using multiple simulations with high spatial resolutions to facilitate RCM intercomparison studies and ultimately serve the impact and adaptation communities (Giorgi et al., 2009). Further details about the model data sets, horizontal resolutions and related references are summarized in Table 3. As the main purpose of this work is model evaluation, only historical experiment data not subject to any bias correction (labelled as 'Eval' and driven by the ERAI reanalysis) are considered. The RCMs are run at either 0.44° (50 km) or 0.22° (25 km), with a single higher-resolution run of the CRCM5-UQAM model at 0.11° (12.5 km) to enable a direct evaluation of potential added value from increased resolution (https://na-cordex.org/simulation-matrix.html).
TABLE 1 Characteristics of reanalysis products used in this study.

| Subregional assessment
The comparison between RCMs, reanalysis, and in situ-based gridded data sets refers to daily mean temperature and precipitation. It is carried out by using the nearest neighbour interpolation on a regular grid with respect to the locations of USCRN stations. Despite some limitations in matching station observations and grid point data, particularly for the coarsest gridded data sets, the nearest neighbour method has been found to have the advantage of preserving the physical properties of the model column, extreme values and temporal variability (e.g., Schwarz et al., 2017; Vautard et al., 2013). It is also widely used in the literature to assess the ability of RCMs to simulate local climate conditions (e.g., Buban et al., 2020; Diaconescu et al., 2018; Schwarz et al., 2017; Vautard et al., 2013). Furthermore, it should be noted that station observations remain the primary source of information for describing the historical climate of a certain region. Station observations also provide a more detailed view of local climate conditions, particularly over complex terrain where satellite data may not be accurate and in regions where intense precipitation events occur. Simulated gridded products are usually interpreted as mean values over the grid cell (see Chen & Knutson, 2008; Gervais et al., 2014 for more discussion) and tend to be smaller than the values recorded at single measurement stations. The data sets are aggregated annually and seasonally for summer (June, July and August; JJA) and winter (December, January and February; DJF). Biases for reanalysis and in situ-based gridded data sets are also assessed over the extended period from 2006 to 2020. For a consistent inter-comparison, reanalysis and in situ-based gridded products have been re-gridded to a single 0.44° CONUS land-only grid using a first-order conservative procedure (Jones, 1999). Part of the analysis presented in this study is focused on different CONUS subregions, according to the classification of Bukovsky (2011) (Figure 1a). Among the 29 subregions created in the North American Regional Climate Change Assessment Program (NARCCAP) domain (Bukovsky & Mearns, 2020; Mearns et al., 2017), grouped regions (Figure 1a) have been selected using the climate classification by Ricketts et al. (1999), dividing the CONUS into eight climatic subregions: Desert, PacificSW, PacificNW, MtWest, Central, East, South and Great Lakes. This division also accounts for the different orography: while the Eastern United States has a relatively flat topography and low elevation (Figure 1b,c), the Western CONUS, especially the Mountain West region (MtWest), features a highly fractured relief rising gradually from sea level to 3500 m, with magnitude gradually decreasing towards the West coast (Figure 1b,c).
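The nearest neighbour matching and the seasonal aggregation described above can be illustrated with a minimal sketch. It assumes a regular latitude-longitude grid described by two one-dimensional coordinate lists and uses the great-circle (haversine) distance to select the closest grid cell centre; native RCM grids (e.g., rotated-pole grids) would need two-dimensional coordinates. All function names and the input layout are hypothetical and are not taken from the processing chain of this study.

```python
import math

# Months belonging to the two seasons analysed in the text.
SEASONS = {"DJF": (12, 1, 2), "JJA": (6, 7, 8)}


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_grid_index(station_lat, station_lon, grid_lats, grid_lons):
    """Return (i, j) of the grid cell centre closest to the station."""
    best = None
    for i, glat in enumerate(grid_lats):
        for j, glon in enumerate(grid_lons):
            dist = haversine_km(station_lat, station_lon, glat, glon)
            if best is None or dist < best[0]:
                best = (dist, i, j)
    return best[1], best[2]


def seasonal_mean(daily_values, season):
    """Mean of the daily values falling in the months of the season.

    daily_values: iterable of (month, value) pairs (hypothetical layout).
    """
    months = SEASONS[season]
    selected = [value for month, value in daily_values if month in months]
    return sum(selected) / len(selected)
```

The brute-force search is adequate for about a hundred stations and a regional grid; a spatial index would be a natural replacement for larger problems.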

| Evaluation metrics
Regional averages are calculated by averaging only the points corresponding to USCRN stations in a given region. However, it is important to note that representativeness uncertainty may affect the regional averages in regions such as the Pacific Southwest, Pacific Northwest, and the Great Lakes, where the stations may be too sparse to fully represent the entire region. Nevertheless, as shown in the literature (e.g., Madonna et al., 2023), the resolution of RCMs and reanalyses (REA) is sufficient to reduce the representativeness uncertainty. Following previous studies (e.g., Alexander et al., 2006; Buban et al., 2020; Diaconescu et al., 2018), this approach assumes homogeneous local conditions. Different metrics were chosen to evaluate different aspects of the data sets. The metrics were computed for both temperature and precipitation and for each data set, covering all seasons and subregions. The data sets were first evaluated on their ability to reproduce the observed temperature and precipitation climatology over the entire CONUS and for each subregion. Then, biases in the seasonal cycle of the rainfall and temperature distributions were quantified. As a third step, the spatial-temporal variability throughout the subregions was assessed using Taylor diagrams (Taylor, 2001). Taylor diagrams summarize the main skill scores (correlation coefficient, standard deviation and normalized root mean square deviation) and have already been employed in the ranking of RCMs and reanalysis products in many studies over the CONUS regions (e.g., Gibson et al., 2019; Srivastava et al., 2022). To estimate correlations, a modified Taylor diagram using the Kendall rank correlation test τ (Croux & Dehon, 2010) is used. The Kendall rank correlation test has been demonstrated to be less sensitive to errors and discrepancies in data compared to Pearson or Spearman tests (e.g., Diouf et al., 2022). Finally, orography-dependent temperature/precipitation biases are examined using the Theil-Sen regression method (Siegel & Benson, 1982; Theil, 1992). The Theil-Sen slope estimator is a resistant and nonparametric regression method based on the median of pairwise slopes and is also found to be significantly more accurate than the simple linear regression method for skewed and heteroskedastic data (Sy et al., 2021).
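To make the two rank-based tools mentioned above concrete, the following sketch computes a Kendall rank correlation and a Theil-Sen slope from paired samples (e.g., bias versus station elevation). It is an illustration only: the tau-b variant, which corrects for ties, is assumed here because the text does not state which variant was used, and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b variant, assumed) of paired samples."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # pair tied in both variables: contributes nothing
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def theil_sen(x, y):
    """Theil-Sen estimator: median of all pairwise slopes.

    Returns (slope, intercept); the intercept is taken as the median of
    y - slope * x, one common convention.
    """
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept
```

Because the slope is a median over pairwise slopes, a single outlying station changes it far less than it would change an ordinary least-squares fit, which is the resistance property referred to in the text. Both routines are quadratic in the number of points, which is acceptable for samples of the order of a hundred stations.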

| Climatological biases
In Figure 3, the spatial patterns of the daily mean temperature bias estimated from the ensemble mean of RCMs at 0.44° and 0.22° resolutions (hereinafter, RCM-44 and RCM-22, respectively), from the single RCM at 0.11° resolution (hereinafter, CRCM5-UQAM-11), from reanalyses (hereinafter, REA) and from in situ-based gridded data sets (hereinafter, in-situ-based gridded), subsampled at USCRN station locations, are shown. On average, REA and in situ-based gridded data sets are generally warmer in both seasons with biases of 0.4°C in winter (and 0.9°C in summer) for REA and biases exceeding 0.3°C in summer for the in situ-based gridded data sets. In contrast, RCMs are generally colder (warmer) in wintertime (summertime) with biases ranging from −0.5 (0.9)°C for RCM-44 to −0.2 (1.4)°C for CRCM5-UQAM-11 (Figure 3). Overall, the spatial variability is well reproduced in both seasons with correlation values larger than 0.8 for almost all the data sets, except for a few RCMs. Regarding the skills (given by correlation, RMSE and standard deviation values), in situ-based gridded data sets provide the best performance compared to REA and RCMs, as summarized in Table 4. Further, all data sets show a cold (warm) bias in the western (eastern) parts of the United States in summer and wintertime, with large values up to ±8.0°C at some stations over the MtWest mountains, PacificNW and PacificSW, while warm biases of the same magnitude are found in summer for stations in the central United States. A closer agreement with USCRN is found in wintertime in the east and south. Regarding the sensitivity to the increased RCM resolution, overall, some bias reduction is found by increasing the resolution.
In Figure 4, biases for precipitation are shown. On average, all REA are drier over the CONUS, with mean bias exceeding −0.7 mm/day in winter (and −0.3 mm/day in summer). The largest bias is found in winter, exceeding −4.0 mm/day at the PacificNW and PacificSW regions (especially in Quinault and Darrington stations). For the in situ-based gridded data sets, biases are generally negligible over the entire CONUS in both seasons except for Darrington and Bodega stations (2.0 mm/day) in the PacificNW and PacificSW regions. For RCMs, biases are larger than other data sets in both seasons with a mean bias estimated around 0.01 mm/day (0.23 mm/day) for RCM-44, 0.2 mm/day (0.4 mm/day) for RCM-22 and 0.6 mm/day (0.34 mm/day) for CRCM5-UQAM-11 in winter (summer). Considering both seasons, REA and in situ-based gridded data sets show quite similar rainfall biases over the entire CONUS, while for RCMs (except for CRCM5-UQAM-11), opposite biases are simulated over the southeast part during both seasons, that is, dry in winter and wet in summer. This poor performance can be related to the misrepresentation of the tropical cyclones and/or of the mesoscale systems (Hsu et al., 2019). Considering correlations, values up to r = 0.8 are obtained for all data sets.
FIGURE 3 Spatial distribution of the daily mean temperature bias (°C), using USCRN as the reference, estimated from the ensemble means of reanalyses (REA), of models at 0.44°, 0.22° and 0.11° resolutions (RCM-44, RCM-22 and CRCM5-UQAM-11, respectively) and of the in situ-based gridded data sets (in situ-based gridded) for both winter (DJF, left panel) and summer (JJA, right panel). The skill scores (i.e., the spatial mean bias, Kendall rank correlations and RMSE values) are provided at the top left of each panel. DJF, December, January and February; JJA, June, July and August; RCM, regional climate model; RMSE, root mean square error; USCRN, U.S. Climate Reference Network.
A comparison of seasonal biases for the daily mean temperature within each subregion is shown in Figure 5. RCMs at all resolutions show significant cold biases of −3.0°C in winter over the deserts, MtWest mountains and the PacificNW regions, attributable to the misrepresentation of surface albedo in wintertime (Bonan, 1998; Li et al., 2016). Also, REA and in situ-based gridded data sets show large cold biases (−2.5°C for REA and −3.0°C for in situ-based gridded data sets) over the Pacific Northwest. Nevertheless, over the MtWest mountains, where the topography is relatively complex, the REA and in situ-based gridded data sets show a better agreement with USCRN compared to RCMs. In summer, warm biases are generally simulated by all data sets in most of the CONUS subregions, except in the Pacific Northwest and Desert, where a large cold bias of −3.0°C persists for all data sets.
In analogy to temperature, the comparison of seasonal daily mean biases for precipitation is shown in Figure 6. The bias in winter is limited to −1.0 mm/day for all data sets except over the PacificNW, where the bias is larger for REA. In summer, instead, in situ-based gridded data sets and REA show positive biases, while RCMs remain particularly dry over the Great Lakes and the southern and eastern parts. Similar conclusions as with temperature can be drawn about the RCM resolution increase. Considering the individual results from all data sets (Figure S1 for temperature and Figure S2 for precipitation), the spread is particularly large for RCMs at both 0.44° and 0.22° resolutions in all subregions, with values ranging from −2.5°C for WRF to 3.0°C for RCA4 in the Great Lakes during winter. For reanalysis and in situ-based gridded data sets, a larger spread is found in particular in the Pacific Northwest during both seasons. In terms of precipitation, there is a good agreement between the data sets except for reanalyses, which show biases ranging from −1.0 mm/day for MERRA2 to 1.0 mm/day for ERAI in the southern region.
Figure 7 explores the annual cycle of precipitation calculated for USCRN stations and the other data sets over all subregions. The observed annual cycle (solid black curve), computed from the monthly average of station-based daily precipitation, reflects the wet summer season observed over most subregions (East, Central, Great Lakes and South). On the other hand, over the PacificNW and PacificSW regions, the wet winter season, which is driven by fronts coming from the northwest, is also well characterized by the USCRN. Figure 7 shows that the phase and amplitude of the observed annual cycle are better reproduced by in situ-based gridded data sets than by REA and RCMs. This is also clear from the RMSE values estimated over the different subregions, except over the PacificNW and the MtWest mountains, where in situ-based gridded data sets are generally drier throughout the entire year (Figure 7e,g). This effect is likely due to the orographic mismatch. REA (black dashed line) properly reproduces the different phases of the annual cycle but underestimates the magnitude in most subregions, with a large ensemble uncertainty, especially in wintertime. RCMs (e.g., RCM-22 and RCM-44) are again generally wetter than USCRN, although they can properly reproduce the seasonal cycle. CRCM5-UQAM-11 (solid

TABLE 4 An overview of the estimated spatial mean bias in daily temperature and precipitation, the Kendall rank correlation and the root mean square error (RMSE) as presented in Figures 3 and 4. These metrics are calculated based on the ensemble mean of reanalysis (REA), models at resolutions of 0.44° (RCM-44), 0.22° (RCM-22) and 0.11° (CRCM5-UQAM-11), as well as from the in situ-based gridded data sets (in-situ-based).
FIGURE 7 Seasonal monthly mean rainfall distribution (mm/day) over the different subregions obtained from the ensemble mean of reanalysis (REA, black dashed line), of models at 0.44° (RCM-44, dark-red dashed line), at 0.22° (RCM-22, dark-green dashed line) and at 0.11° (CRCM5-UQAM-11, solid orange line) resolutions and of the in situ-based gridded data sets (in-situ-based, dark-blue dashed line). Red (green) shaded area shows the range within ±1 sigma of RCM-44 (RCM-22) grid spacing models, while grey (blue) shaded area shows the reanalysis (in situ-based gridded data sets) ensemble uncertainty. RCM, regional climate model.

and S15). In terms of individual results from the reanalyses, NARR underestimates USCRN in all subregions and, therefore, its performance is poorer than that of the other reanalysis products. The comparison of individual RCMs at all resolutions also shows a large spread. WRF largely overestimates USCRN values, especially in summer over the MtWest mountains, East and Central areas (Figure S3e,i).
In contrast, there is a closer agreement with USCRN for the different in situ-based gridded data sets. In summary, the annual cycle from in situ-based gridded data sets is closer to USCRN than that from both REA and RCMs, while REA data sets do not perform better than RCMs in most subregions. In fact, the large variability in the simulated annual cycle is particularly evident in the MtWest mountain region (Figure 7e) and can be due to the difference in topographic elevation between REA, RCMs and USCRN stations (see Section 4.2 above), the latter typically located in the valleys.
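The annual cycle discussed above is described as the monthly average of station-based daily precipitation. A minimal Python sketch of that aggregation is given below; the function name, the input layout (pairs of date and value, with `None` for missing days) and the synthetic record are assumptions made for illustration, and days from all years are simply pooled by calendar month.

```python
from collections import defaultdict
from datetime import date, timedelta

def monthly_annual_cycle(daily_records):
    """Average daily values by calendar month, skipping missing days.

    daily_records: iterable of (datetime.date, value or None).
    Returns {month: mean value} for months with at least one valid day.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in daily_records:
        if value is None:
            continue  # missing observation
        sums[day.month] += value
        counts[day.month] += 1
    return {month: sums[month] / counts[month] for month in sorted(counts)}

# Synthetic one-year record (illustrative only):
# 2 mm/day in January, 4 mm/day in July, missing otherwise.
start = date(2006, 1, 1)
records = []
for offset in range(365):
    day = start + timedelta(days=offset)
    records.append((day, {1: 2.0, 7: 4.0}.get(day.month)))

cycle = monthly_annual_cycle(records)
print(cycle)
```

Applying the same function to the station series and to each gridded or modelled series sampled at the station locations gives curves that can be compared in phase and amplitude.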
To rank the reliability of REA, in situ-based gridded data sets and RCMs in reproducing patterns of temperature and precipitation against USCRN, Taylor diagrams are also used. Figure 8 shows an example of a diagram for one subregion per variable (Desert for temperature, left panels; PacificSW for precipitation, right panels). Results for other subregions are shown in Figures S6, S7, S8 and S9. Regarding temperature across the various subregions, REA and the in situ-based gridded data sets provide quite similar skills, with the best skill scores obtained in the central United States (with correlation values up to r = 0.9), while the worst skill scores are obtained along the west coast (i.e., the PacificNW and PacificSW regions). RCMs have the best performance in the PacificSW in wintertime, with correlation values up to 0.7, while the poorest performance (r = 0.5 for RCM-44 and RCM-22, and r = 0.3 for CRCM5-UQAM-11) is obtained over the Desert (Figure 8, left panels). Nevertheless, the performance is relatively better in wintertime than in summertime (Figure S6 vs. Figure S7). For precipitation, the skill scores are small for all the data sets (Figure 8, right panels, for the PacificSW region, and Figures S8 and S9 for the other subregions) and the worst values (r = 0.2, associated with large variability) are found over the PacificSW in summertime, with the in situ-based gridded data sets performing slightly better in wintertime. Despite a good agreement with daily mean precipitation (Figure 6), biases are found in the local climate variability. Extending the period to 2006–2020 for both REA and in situ-based gridded data sets does not affect the skill for either temperature or precipitation (see Figures S10, S11, S12 and S13).
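The quantities summarized in a Taylor diagram can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name and sample values are hypothetical, and the sketch uses the Pearson correlation, for which the geometric relation between the three statistics holds exactly, whereas the Figure 8 caption reports Kendall rank correlations.

```python
import math

def taylor_statistics(dataset, reference):
    """Return normalized standard deviation, Pearson correlation and
    centred RMSE normalized by the reference standard deviation."""
    n = len(reference)
    d_mean = sum(dataset) / n
    r_mean = sum(reference) / n
    d_anom = [d - d_mean for d in dataset]
    r_anom = [r - r_mean for r in reference]
    d_std = math.sqrt(sum(a * a for a in d_anom) / n)
    r_std = math.sqrt(sum(a * a for a in r_anom) / n)
    corr = sum(a * b for a, b in zip(d_anom, r_anom)) / (n * d_std * r_std)
    crmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(d_anom, r_anom)) / n)
    return {
        "norm_std": d_std / r_std,
        "corr": corr,
        "norm_crmse": crmse / r_std,
    }

# Hypothetical daily series (illustrative values only)
reference_series = [1.0, 2.0, 3.0, 4.0, 5.0]
dataset_series = [1.5, 1.5, 3.5, 4.5, 6.0]

stats = taylor_statistics(dataset_series, reference_series)
# A constant offset changes the mean bias but none of the Taylor statistics.
shifted = taylor_statistics([r + 2.0 for r in reference_series], reference_series)
print(stats, shifted)
```

Because the mean is removed before the statistics are computed, a Taylor diagram says nothing about the mean bias, which is why it complements rather than replaces the bias maps and box plots.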
Based on the results from the individual data sets (represented by different symbols in Figures 8, S6, S7, S8 and S9), the reanalysis products display relatively similar performance for temperature in both seasons. However, ERA5 provides the best skills for precipitation in the PacificSW region (Figure 8, WRF demonstrates the best skills overall, with correlation values of up to r = 0.7 for temperature in the Desert and r = 0.6 for precipitation in the PacificSW. In contrast, RegCM4 exhibits the worst skill scores for both temperature and precipitation in both seasons (Figures S6, S7, S8 and S9 for the other subregions). For the in situ-based gridded data sets (blue symbols in Figures 8, S6 and S7), CPC shows the best skill scores in all subregions, with correlation values up to r = 0.9 over most of the subregions (East, Central, Great Lakes and South) for temperature in both seasons, while Daymet shows the worst skill. For precipitation, the best performance is shown by Livneh,

| Relationship to orography
Table 5 reports the elevation mismatch and root mean square error (RMSE) estimated between simulated elevations from RCMs/REA and actual elevations of USCRN stations. The highest average mismatch with respect to USCRN was found for REA (with values as high as 30.5 m and RMSE values higher than 215.8 m). The inconsistencies for RCMs are still present in the high-resolution simulations (RCM-44 vs. RCM-22/CRCM5-UQAM-11), indicating that the resolution is still insufficient to replicate the small-scale features of the orography. Additionally, geographical patterns of the average elevation mismatch derived from REA and RCMs are shown in Figure 9. Compared to USCRN, REA and RCMs display comparable elevation mismatches over the whole CONUS, with values varying within ±600 m at stations situated in the western part of the United States, notably in the complex orography of the MtWest. Moreover, REA and RCMs show similar uncertainty due to the orography. The Darrington station has the largest difference across all data sets (values greater than 1000 m). The temperature in REA and RCMs is therefore often lower than for USCRN (see Figure 3) due to the disparity between the heights of the simulated mean grid point and the station, with the latter being lower than the former.
The elevation-dependent biases follow approximately the value of the adiabatic lapse rate of −6.5°C/km, consistent with hydrostatic equilibrium and thermodynamic principles (Dutra et al., 2020).
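To make the magnitude of this effect concrete, the short Python sketch below converts an elevation mismatch into the temperature bias expected from the −6.5°C/km lapse rate quoted in the text. The function name and the example elevations are hypothetical, and the calculation assumes a constant lapse rate applied to the grid-minus-station height difference.

```python
LAPSE_RATE_C_PER_KM = -6.5  # value quoted in the text

def expected_temperature_bias(grid_elevation_m, station_elevation_m,
                              lapse_rate=LAPSE_RATE_C_PER_KM):
    """Temperature bias (°C) expected purely from the elevation mismatch.

    A grid point higher than the station (positive mismatch) yields a
    negative, i.e. cold, bias.
    """
    mismatch_km = (grid_elevation_m - station_elevation_m) / 1000.0
    return lapse_rate * mismatch_km

# Hypothetical case: grid point 1000 m above the station
example_bias = expected_temperature_bias(1600.0, 600.0)
print(example_bias)
```

Under this assumption, a mismatch of about 1000 m, the order of magnitude reported for the largest station difference, would on its own account for a cold bias of roughly 6.5°C.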
Precipitation-dependent temperature biases are also investigated in Figure S14. The negative relationship between temperature and precipitation biases is critical for understanding the impacts of a changing climate on snowpack, drought and heat stress. In summer, the driest RCMs are in general also the hottest (Figure 5), which results in a larger number of heat waves due to low moisture values, that is, dry summer soils heat up faster than wet soils (Miralles et al., 2019). This may also affect the land surface energy balance, with implications for local and downwind precipitation (Schumacher et al., 2022). By contrast, in winter, wet RCMs are usually too cold because of the presence of snow (Li et al., 2016). In addition, in winter, RCMs have typical problems in reproducing snow-related processes in regions with a complex orography (Bordoy & Burlando, 2013), mainly because of a poor representation of mountains (Dutra et al., 2012), which can further enhance cold waves and their duration.

| DISCUSSION AND CONCLUSION
This paper, for the first time, assesses the performance of NA-CORDEX RCMs, reanalysis (ERA5, ERA-Interim, MERRA2 and NARR) and in situ-based gridded data sets (Daymet, PRISM, Livneh and CPC) in the CONUS using reference temperature and precipitation observations from USCRN. The assessment of climate model performance in reproducing present climate conditions has typically been carried out using reanalysis and/or gridded near-surface observational data sets as comparators over various regions, including the United States (e.g., Gibson et al., 2019; Srivastava et al., 2020, 2022), Canada (e.g., Diaconescu et al., 2018) and Europe (Gómez-Navarro et al., 2012; Kotlarski et al., 2019; Prein & Gobiet, 2017). However, these data sets are affected by several uncertainties due to their spatial interpolation and representation of complex terrain (Napoli et al., 2019; Velasquez et al., 2019). Furthermore, it is worth pointing out that several existing papers (e.g., Gibson et al., 2019; Srivastava et al., 2020, 2022) used daily precipitation and/or temperature-based indices to rank model skills. However, indices-based analyses do not fully capture the model performance (Alexander et al., 2020) and can hence miss a complete assessment. Because USCRN is a reference comparator (Buban et al., 2020; Madonna et al., 2023; Thorne et al., 2017), using its measurements to assess the data sets listed above allows us to conclude that:

• The simulated mean biases among the data sets are primarily seasonally and sub-regionally dependent. (Napoli et al., 2019; Velasquez et al., 2019; Xie et al., 2007).
• Reanalyses are generally drier in both seasons and show large biases exceeding −4.0 mm/day in the Pacific Northwest. However, they are typically able to capture the phase of the precipitation annual cycle, although they underestimate the magnitude in most of the subregions, also with large uncertainty bounds (Figure 7e,g). Such findings are consistent with previous studies available in the literature (e.g., Alexander et al., 2020; Bador et al., 2020; Gibson et al., 2019; Srivastava et al., 2022).
• RCMs are generally wetter than USCRN over the CONUS but show opposite winter-summer patterns in the southeast United States, that is, dry in winter and wet in summer. They are also able to reproduce the different phases of the observed seasonal cycle but overestimate the amplitudes.
• The poorest RCM performance is found over the Desert, MtWest mountains and Pacific Northwest regions, with the largest biases in wintertime because of the discrepancies between the modelled surface albedo and orography (Li et al., 2016).

FIGURE 10 Contributions of orography to the simulated temperature (°C, in blue) and precipitation (mm/day, in red) mean biases. The relations between elevation mismatch (x-axis) and temperature and precipitation biases (y-axis) are drawn for the MtWest region, which is considered because of its complex topography. Correlations are calculated using the Kendall non-linear rank (τ) test between the simulated elevation difference and temperature/precipitation mean biases for all REA (a), RCM-44 (b), RCM-22 (c) and CRCM5-UQAM-11 (d). Two-star symbols (**) are added when the correlation is significant at the 99% confidence level (i.e., p < 0.01; 99% CI), while a one-star symbol (*) is added when the correlation is significant at the 95% level (i.e., p < 0.05; 95% CI). The slopes indicated at the top right of each panel are estimated using the median of pairwise non-parametric linear regression method (Sy et al., 2021).
The present study also reveals that RCMs and reanalysis still suffer from uncertainties due to the inaccurate altitude representation in the most complex orographic areas (Figure 9) such as MtWest and Pacific Northwest in the United States.These uncertainties are mainly due to: (i) the RCM grid scale and the sub-grid scale orography representation; (ii) the parameterization schemes and representation of near-surface processes (Diallo et al., 2019;Sy et al., 2017); and (iii) the orographic source data sets as well as the methodologies applied for deriving orography fields (Elvidge et al., 2019).
Our results also show that temperature and precipitation biases are significantly linked to orography, both in reanalysis and RCMs, and that temperature biases are larger at coarser resolutions (50 km, −7.9°C/km) than at finer ones (12.5 km, −5.1°C/km). For precipitation, the reanalysis bias is larger than that of RCMs. The positive elevation-dependent precipitation bias found for all data sets has been pointed out in very few publications (Kuhn & Olefs, 2020; Pepin et al., 2022). This bias has a higher sensitivity than the temperature bias because it depends on latitude, season, the shape of the mountain and the position of the station relative to the arrival direction of the air masses and to the orography. It can be influenced by various mechanisms such as land cover, clouds, aerosols and soil moisture, which covary with orographic characteristics (Pepin et al., 2015, 2022; Rangwala & Miller, 2012). In other words, these factors can significantly influence the spatial pattern of precipitation and temperature, especially in mountainous regions, where precipitation is typically higher on the windward side of the mountains and lower on the leeward side due to the rain shadow effect. The most common biases found in both reanalyses and RCMs are the overestimation of precipitation on the windward side of mountains and the underestimation on the leeward side (Chen et al., 2021; Dallan et al., 2023). This bias is often linked to an overestimation of the strength of the atmospheric circulation that drives moisture towards the mountains (Chen et al., 2021; Munday & Washington, 2018). Likewise, the elevation and orientation of mountains contribute to the temperature bias because they alter the local energy balance and, as a consequence, affect the near-surface temperature (Massey et al., 2016, 2017).
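The elevation-dependent slopes quoted above are described in the Figure 10 caption as being estimated with a median of pairwise slopes, a non-parametric approach commonly known as the Theil–Sen estimator. The sketch below is a minimal Python illustration of that idea under this assumption; the function name and the sample points are hypothetical and do not reproduce the study's data or exact procedure.

```python
from itertools import combinations
from statistics import median

def median_pairwise_slope(x, y):
    """Median of the slopes between all pairs of points (Theil-Sen style).

    Pairs with identical x are skipped because their slope is undefined.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if x2 != x1
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    return median(slopes)

# Hypothetical stations: elevation mismatch (km) and temperature bias (°C).
# Four points follow -6.5 °C/km exactly; the fourth entry is an outlier.
mismatch_km = [0.0, 0.2, 0.4, 0.6, 0.8]
temperature_bias = [0.0, -1.3, -2.6, 5.0, -5.2]

slope = median_pairwise_slope(mismatch_km, temperature_bias)
print(slope)
```

Taking the median rather than a least-squares fit keeps a single anomalous station from dominating the slope, which matters when only a limited number of stations fall within one subregion.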
Overall, our results indicate that in situ-based gridded data sets provide the best performance in most of the CONUS regions compared to reanalysis and RCMs, but still have biases in the MtWest mountains and the Pacific Northwest that need to be considered before they are used as a comparator for evaluating model performance. Also, reanalysis data sets do not outperform RCMs in most subregions. Hence, we recommend caution when using reanalyses, especially when assessing model performance in mountainous and coastal regions.
Finally, our study also highlights the need to improve our understanding of the influence of different climate drivers in high mountain regions. This would imply a densification of the climate reference network in regions with a complex orography, such as the MtWest mountains, and the availability of energy flux measurements. This will enhance the assessment of RCMs with a positive

FIGURE 1 U.S. Climate Reference Network (USCRN) stations distribution along with the different station locations representing stations with at least 12 valid years (i.e., without missing data) over the 2006–2020 period (black circles). Panel (b) displays the CONUS topographic height (in m) with surface elevation ranging from 0 to more than 3600 m. Panel (c) gives the different USCRN station elevation values (unit: m) indicated by the different colours. CONUS, Contiguous United States.

FIGURE 4 Same as in Figure 3 but for the spatial patterns of the daily mean precipitation biases (mm/day).

FIGURE 5 Daily mean temperature bias (°C) estimated from the ensemble mean of reanalyses (REA), of models at 0.44°, 0.22° and 0.11° resolutions (RCM-44, RCM-22 and CRCM5-UQAM-11, respectively), and of the in situ-based gridded data sets (in-situ-based), in each study subregion against USCRN for the period 2006–2014 and for both winter (DJF, top panel) and summer (JJA, bottom panel). The median value is indicated with a black line, while the lower hinge of each box is the Q1 quartile (25th) and the upper hinge is the Q3 quartile (75th). DJF, December, January and February; JJA, June, July and August; RCM, regional climate model; USCRN, U.S. Climate Reference Network.

FIGURE 6 Same as in Figure 5 but for the daily mean precipitation bias (mm/day).

FIGURE 8 DJF (top panels) and JJA (bottom panels) Taylor diagrams showing the comparison of temperature (left panels, over the Desert) and precipitation (right panels, over the PacificSW) among models, reanalysis and in situ-based gridded data sets against USCRN. The reference point (USCRN) is represented by a solid black circle. Symbols indicate the position of each individual data set and their ensemble means (represented by dots): green (dark-red) symbols for models at 0.44° (0.22°) resolutions; CRCM5-UQAM-11 at 0.11° resolution is represented by a dark-red symbol; red for reanalysis, and blue for the in situ-based gridded data sets over the 2006–2014 period. The dashed black lines on the outermost semicircle indicate Kendall rank correlations between USCRN and each data set. The blue dashed curves indicate the normalized standard deviations, while the grey dashed curves show the centred normalized root mean squared error (NRMSE). DJF, December, January and February; JJA, June, July and August; USCRN, U.S. Climate Reference Network.

TABLE 2 Characteristics of in situ-based gridded observational data sets used in this study.
TABLE 5 An overview of the spatial elevation mismatch and root mean square error (RMSE) values estimated between the simulated elevations from RCMs/REA and the actual elevations recorded at USCRN stations. These metrics are calculated based on the ensemble mean of reanalysis (REA) and of models at resolutions of 0.44° (RCM-44), 0.22° (RCM-22) and 0.11° (CRCM5-UQAM-11) with respect to the actual elevations of USCRN stations. Abbreviations: RCM, regional climate model; USCRN, U.S. Climate Reference Network.

FIGURE 9 Spatial patterns of the average elevation mismatch (m above sea level) estimated from the ensemble mean of reanalyses (REA) and models at 0.44°, 0.22° and 0.11° resolutions (RCM-44, RCM-22 and CRCM5-UQAM-11) in comparison to the actual elevations of USCRN stations. The skill scores (the spatial mean bias and RMSE values) are provided at the top left of each panel. RCM, regional climate model; RMSE, root mean square error; USCRN, U.S. Climate Reference Network.

• For temperature, reanalysis and gridded data sets are generally warmer in both seasons, while RCMs are generally colder in wintertime and warmer in summertime.
• Spatial patterns of the temperature mean bias are quite similar between data sets, with cold (warm) biases in the western (eastern) United States and with the effect of orography increasing the bias to values larger than ±8.0°C at stations over the MtWest mountains and the Pacific Northwest and Southwest.
• Overall, in situ-based gridded data sets are able to capture the daily mean patterns of precipitation and have the best skills in reproducing the phase and amplitude of the observed rainfall annual cycle compared to reanalysis and RCMs.
• The worst skills for precipitation for all data sets are found in the Pacific Northwest, where the largest rainfall is typically recorded at stations and on MtWest mountains likely due to the orographic enhancement