The Office of Water Prediction's Analysis of Record for Calibration, version 1.1: Dataset description and precipitation evaluation

Hydrologic models operated by the National Weather Service call for an accurate, consistent, high‐resolution, multi‐decade, continental‐scale record of hydrometeorological fields to serve as forcing data for model calibration. To serve this purpose, the Analysis of Record for Calibration was developed, and version 1.1 of the dataset is described in this study. Geospatial and scientific requirements, methods used in dataset generation, and input data sources are described. Given the prominent role of precipitation in model calibration, accurate and consistent precipitation is a particularly high priority for the analysis. To evaluate the analysis from this perspective, its daily precipitation is compared with surface observing stations over 43 years. The analysis exhibits low bias compared with other similar products. It also displays nonstationary bias behavior after 2015 due to the lack of a climatological constraint, as well as frequent occurrences of heavy‐to‐extreme precipitation that are often difficult to verify. These findings should be taken into account when the product is used for model calibration.

physical parameters (Khakbaz et al., 2012; Smith et al., 2003), followed by validation to confirm that those parameters are effective in minimizing error measures for retrospective streamflow predictions (Gupta et al., 2005), ordinarily covering time periods distinct from those used for calibration (Refsgaard, 1997; Shen et al., 2022).
The execution of these models, both in the retrospective calibration/validation phase and in real-time operations, relies upon hydrometeorological input data (e.g., Newman et al., 2015). Traditional operational hydrology requires basin-averaged temperature and precipitation time series (Khakbaz et al., 2012), whereas continental-scale hydrology often demands a more complete set of spatially and temporally discretized hydrometeorological forcing elements to drive land surface models (Xia et al., 2012).
To meet the calibration requirements of regional operational hydrology as practiced at NWS River Forecast Centers (RFCs) and of continental-scale hydrology as practiced at the NWS Office of Water Prediction's (OWP's) National Water Center (NWC), a high-resolution, multi-decade, CONUS-scale record of hydrometeorological elements is needed. The dataset developed for this purpose at OWP is known as the Analysis of Record for Calibration (AORC), often specified in this study as AORC-CONUS to distinguish it from counterparts that cover other regions. Beyond the CONUS domain, the AORC additionally consists of forcing data to support the calibration of model configurations for Hawaii, Puerto Rico and the U.S. Virgin Islands, and Alaska. However, this study concerns only the AORC-CONUS domain, which includes the contiguous U.S., southern Canada, and northern Mexico, on a geographic grid at 30-arc-second (approximately 800 m) resolution. AORC temporal coverage is hourly, from February 1979 to the present (more than 44 years at the time of this writing).
The remainder of this introduction discusses the history behind the creation of the AORC-CONUS dataset and summarizes its requirements. The study that follows provides additional details on the dataset, including its spatial domain, its input data sources, and the methods used in its generation, in particular how those methods evolve in time alongside changes in the availability and quality of input data sources.
Finally, the study provides an evaluation of AORC daily precipitation from 1979 to 2021, illustrating both the overall accuracy of AORC precipitation and the ways in which it responds to the evolution of its sources and methods.

| Background
1.1.1 | Planning and preparation (2010-2015)
In the early 2010s, the National Weather Service identified the need for a continental-scale, climatology-constrained, multi-decade, near-surface weather record, encompassing at minimum the contiguous United States (CONUS) and contributing watersheds and consistent in its quality, that is, in its bias and error structure. This analysis of record would drive distributed land surface and hydrologic models for the purpose of calibrating such models as operated at NWS RFCs. Around the same time, the NWS began planning for a CONUS-scale hydrologic modeling capability to be operated at a national center. In the context of that planned capability, which became the National Water Model (NWM), the analysis of record was imagined not just as a specifically historical product, but also as the final step in, and logical conclusion of, a robust operational process to generate spatially discretized (i.e., gridded) driving data for operational water forecasts.
Consequently, the analysis of record would consist of both a historical gridded dataset assembled from the best available data from recent decades, and a near-real-time component updated at short lags (from days to weeks) relative to real time. The transition from historical to near-real-time datasets was expected to coincide with the initiation of operations for the first NWM release calibrated using the analysis of record.

1.1.2 | AORC version 1.0 (2015-2020)
With the establishment of the Office of Water Prediction in 2015 and the development and implementation of the NWM in 2016, the assembly of the analysis of record continued, and it came to be known as the AORC. AORC-CONUS version 1.0 was completed in late 2019, and the first calibration of the NWM using AORC-CONUS 1.0 was performed in preparation for the release of NWM 2.1, which entered into operations on April 20, 2021.

Research Impact Statement
To improve water prediction in the United States, the Analysis of Record for Calibration offers a multi-decade driving dataset for hydrologic and land surface model calibration, validation, and reanalysis.

1.1.3 | AORC version 1.1 (2020-present)
After receiving feedback from several NWS RFCs and from the NWM 2.1 calibration/validation team, the Office of Water Prediction made numerous improvements to produce AORC-CONUS version 1.1; these are listed in Table 1. As those changes were implemented, the AORC-CONUS 1.1 team also established a near-real-time updating system, which produces hourly AORC grids at a 9-day lag relative to real time. NWM version 3.0, scheduled for release in 2023, is calibrated with AORC-CONUS version 1.1.

| Geophysical requirements
The geophysical requirements of the AORC are summarized in Table 2. Fulfillment of these requirements provides a dataset that can be used to drive most land surface and hydrologic models relevant to CONUS watersheds at a regional or continental scale, with sufficient spatial detail to produce re-analyses of land surface conditions (including snowpack, soil moisture, and runoff) and streamflow simulations such as those used in operational hydrologic forecasting. A 44-year (and expanding) time period is sufficient to satisfy the calibration requirements of most hydrologic modeling systems currently in use (Boulmaiz et al., 2020), including the NWM (Lahmers et al., 2021).
Existing datasets such as the North American Land Data Assimilation System project, phase 2 (NLDAS-2; Xia et al., 2009) and the Parameter-elevation Regressions on Independent Slopes Model (PRISM) dataset (Daly et al., 2008) partially fulfill the requirements of the AORC, but no single existing product sufficiently fulfills them all. For example, the grid resolution of NLDAS-2 (0.125°, about 12 km) is too coarse; PRISM provides only a subset of the needed hydrometeorological elements, and its primary coverage is restricted to the CONUS. To fulfill all the geophysical requirements of the AORC, multiple datasets are systematically assembled using an approach designed to take best advantage of the favorable qualities of each constituent.

| Scientific requirements
Operational hydrologic forecasts, including those within different cycles of the NWM, draw their forcing data from a variety of numerical weather prediction (NWP) systems. Given such diversity, it must be assumed that these forcing data, and their counterparts used for model calibration, are unbiased, have covariance structures similar to that of observed quantities, and possess favorable error statistics. Depending on hydrologic model adaptability and calibration strategy (Oudin et al., 2006; Wang et al., 2023), the extent to which this is or is not the case for the AORC dataset determines its viability for calibration, as bias and error in calibration forcings may lead to uncertain or poorly specified model parameters, directly impacting model reliability and performance (Renard et al., 2010). While accuracy is the key scientific requirement of a forcing dataset intended for hydrologic model calibration, temporal consistency of bias and error behavior across the periods for which hydrologic model calibration and validation are performed is of comparable importance (Mizukami & Smith, 2012). As this study shows, constraining precipitation via climatology is an effective means of maintaining temporal and spatial consistency, as long as it does not do away with actual extreme events and other real-world statistical anomalies.

TABLE 1 Improvements made for the CONUS Analysis of Record for Calibration (AORC-CONUS), version 1.1. Note: "NLDAS-2" refers to forcing data from the North American Land Data Assimilation System project, phase 2 (Xia et al., 2009). "CaPA" refers to the Government of Canada's 10-km Regional Deterministic Precipitation Analysis (RDPA) based on the Canadian Precipitation Analysis (CaPA) system (Fortin et al., 2018). "CPC" refers to the NOAA Climate Prediction Center's Global Unified Gauge-Based Analysis of Daily Precipitation (Xie et al., 2007).

| Spatial coverage and raster geometry
The AORC-CONUS dataset is generated on a geographic (longitude/latitude) World Geodetic System 1984 (WGS 84) coordinate system, at 30-arc-second (roughly 800 m) spatial resolution. The spatial coverage of the AORC-CONUS domain, illustrated in Figure 1, is a bounding box from roughly −125° E to −67° E longitude, and from roughly 25° N to 53° N latitude. While these coordinates do not fully encompass the National Water Model's operational CONUS domain, they do accommodate calibration within inland U.S. basins where model calibration is generally practiced, including the bounds of all CONUS RFC regions apart from the easternmost portion of the Northeast RFC region in New Brunswick, the southernmost portion of the West Gulf RFC region in Mexico, and the Florida Keys. Within its usable extents, the AORC dataset does not contain missing values in any inland location, including over large water bodies such as the Great Lakes. Its data/no-data mask, which excludes only offshore areas of the Atlantic Ocean, the Pacific Ocean, and the Gulf of Mexico, is largely consistent with that of forcing grids from NLDAS-2, an essential input data source for the AORC. Note that the full extents of the AORC grid as defined in its data files (labeled "AORC file extents" in Figure 1) exceed those cited above (labeled "AORC data extents" in Figure 1); those outermost rows and columns of data grids are filled with missing values, and are the remnant of an early set of required AORC extents that have since been adjusted inward.
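To give a sense of the size of such a raster, the short sketch below counts the cells implied by a longitude/latitude box at 30-arc-second spacing. The helper name `grid_shape` is hypothetical, and because the bounds quoted above are approximate ("roughly"), the resulting counts are illustrative only and are not the actual dimensions of the AORC data or file extents.

```python
def grid_shape(west, east, south, north, arcsec=30):
    """Return (rows, cols) for a lon/lat box at the given arc-second spacing.

    Hypothetical helper for illustration; bounds are in decimal degrees.
    """
    cells_per_degree = 3600 // arcsec  # 120 cells per degree at 30 arc-seconds
    n_cols = round((east - west) * cells_per_degree)
    n_rows = round((north - south) * cells_per_degree)
    return n_rows, n_cols


# Approximate data extents quoted in the text (assumed exact here for illustration).
rows, cols = grid_shape(-125.0, -67.0, 25.0, 53.0)
```

Under these assumed bounds, each hourly field would hold on the order of tens of millions of cells, which is one practical reason much of the processing described later is carried out on a coarser intermediate grid.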

| Essential climatologies
Because temperature and precipitation exert a dominant influence on hydrologic processes relative to other hydrometeorological elements (listed in Table 2), the AORC-CONUS 1.1 relies on gridded monthly 30-year climatologies of mean precipitation, mean daily minimum temperature, and mean daily maximum temperature to constrain the behavior of those elements. Prior to the development of the AORC, no single climatology grid in existence provided these quantities over the entire AORC-CONUS spatial domain; therefore, several products were combined to fulfill the need for a complete climatology.
For coverage outside the CONUS, a regional 1981-2010 station climatic values from Mexico's Servicio Meteorológico Nacional (SMN). These were interpolated to 30-arc-second resolution using the ANUSPLIN 4.4 thin-plate spline package (Hutchinson & Xu, 2013), with required gridded terrain heights provided by the Shuttle Radar Topography Mission (SRTM; Farr et al., 2007; NASA, 2013). The resulting precipitation climatology grids show appreciable improvement in terms of absolute accuracy relative to the NClimGrid-Monthly dataset, and to both the WorldClim v1.4 1960-1990 climatology (Hijmans et al., 2005) and the Uniatmos 1902-2011 climatology (Fernández Eguiarte et al., 2012). Precipitation root-mean-squared error (RMSE) is reduced by 25%-50% relative to the Uniatmos climatology for most months at stations excluded from the spline analysis. For temperature, RMSE is usually <1.5°C for both Uniatmos and OWP climatologies.
To reduce discontinuities at the boundaries between sources and to preserve information from the PRISM datasets over adjacent non-PRISM areas, NClimGrid-Monthly values within 40 km of PRISM cells were blended with nearby PRISM data via weighted average.Finally, the PRISM and NClimGrid-Monthly datasets are both undefined over large inland water bodies, such as the Great Lakes, Lake Winnipeg, and the Great Salt Lake.Values for these areas were estimated from 30-year (1981-2010) monthly mean values from NLDAS-2, which were then modified to agree with the original climatologies at water boundaries by treating adjacent PRISM values as boundary values and performing statistical relaxation using NLDAS-2 as the initial guess.
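The relaxation step for inland water bodies can be pictured with the following minimal sketch. It assumes a simple Jacobi iteration in which each water cell is repeatedly replaced by the mean of its four neighbours while surrounding land cells are held fixed as boundary values; the source does not specify the relaxation scheme or stopping rule, so the function `relax_fill`, its arguments, and the iteration count are all assumptions made for illustration.

```python
import numpy as np


def relax_fill(first_guess, boundary, lake_mask, n_iter=500):
    """Jacobi relaxation over masked (water) cells.

    first_guess : initial values over water (e.g. a coarse climatology)
    boundary    : values held fixed outside the mask
    lake_mask   : boolean array, True where values must be estimated
    The scheme and n_iter are assumptions, not the documented AORC method.
    """
    field = np.where(lake_mask, first_guess, boundary).astype(float)
    for _ in range(n_iter):
        padded = np.pad(field, 1, mode="edge")
        neighbour_mean = 0.25 * (
            padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]
        )
        # Only water cells are updated; land cells keep their boundary values.
        field = np.where(lake_mask, neighbour_mean, field)
    return field
```

In a scheme of this kind, the number of sweeps controls how much of the first-guess pattern survives in the interior of the water body: few sweeps adjust mainly the cells nearest the shoreline, whereas many sweeps converge toward a smooth field determined by the boundary values alone.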
The combined climatology produced through these efforts is hereafter referred to as the PRISM/NCEI/OWP climatology.

| Methodology for non-precipitation elements
While the subsequent portions of this study focus on AORC precipitation, in this section the sources and methods used to generate non-precipitation elements of the AORC are described.

| Air temperature
Apart from precipitation, 2-m air temperature is widely regarded as the physical element that most significantly influences runoff processes, via its approximately determinative relationships with potential evapotranspiration, precipitation phase, and snowmelt (McCabe & Wolock, 2011).
Furthermore, other near-surface hydrometeorological fields in the AORC are adjusted for consistency with temperature; therefore, we begin with that element.
FIGURE 1 This figure depicts bounding rectangles of the National Water Model (NWM) CONUS domain, AORC grids, and CONUS River Forecast Center (RFC) extents, in geographic (longitude/latitude) coordinates. The area shaded in green is the AORC data/no-data mask. The "NWM land mask" is included to illustrate the distortion between the Lambert conformal NWM grid and geographic coordinates, and to provide a familiar visual reference for the bounding rectangles. Finally, the inverse-projected extents of the polar stereographic Hydrologic Rainfall Analysis Project (HRAP; Reed & Maidment, 1999) grid are shown as an orange dotted line. A substantial portion of AORC processing is performed on this grid.

For the years 1979-2015, AORC-CONUS air temperature grids are generated by merging two datasets. The first consists of 2-m temperatures from hourly NLDAS-2 land surface forcing fields, delivered at 0.125° (~12-km) spatial resolution and 1-h temporal resolution, having been interpolated from a native 40-km/3-h resolution (Cosgrove et al., 2003). The second consists of Livneh et al. (2015) minimum and maximum temperatures, which have a 1/16° (~6-km) spatial resolution and are available at both daily and monthly frequencies (both frequencies contribute to AORC-CONUS temperatures). While the original Livneh et al. (2015) data were available only through 2013, the dataset was extended after publication, in 2016, to cover the years through 2015 using the same monthly bias/downscaling ratios generated for the 1981-2010 period.
This dataset is henceforth referred to as LIV16. In the most general terms, the merging process downscales hourly NLDAS-2 temperatures using a method designed to maximize their consistency with LIV16 daily temperature extrema, the latter having been climatologically downscaled using the difference between the PRISM/NCEI/OWP climatology of monthly temperature extrema and the corresponding averages from 30 years of monthly LIV16 temperature extrema.
In 2016, AORC-CONUS temperatures begin a transition to hourly temperature grids derived from the 2.5-km/1-h UnRestricted Mesoscale Analysis (URMA; Pondeca et al., 2015). URMA is generally considered an authoritative gridded analysis for model forecast verification within the NWS. For 2016 and 2017, URMA temperatures are climatologically downscaled and adjusted by the differences between monthly 2010-2015 URMA means and the corresponding 2010-2015 AORC means. Beginning in 2018, AORC-CONUS 1.1 temperatures are nearest-neighbor sampled directly from URMA grids, with no additional measures to downscale them from 2.5- to 1-km resolution.

| Pressure and specific humidity
From 1979 to 2017, terrain-based adjustment of hourly surface pressure from the input data source is performed following the calculation of 2-m temperatures in such a way as to preserve hydrostatic balance between the near-surface hydrometeorological state of the input source and its AORC counterpart, based on the difference between their surface elevations. For specific humidity, atmospheric demand for water vapor is preserved by maintaining the 2-m relative humidity of the input source, given the previously calculated AORC temperature and pressure. This approach is identical to that used for downscaling retrospective forcing fields in the original NLDAS project (Cosgrove et al., 2003).
The input source for surface pressure and 2-m specific humidity is NLDAS-2 for 1979-2015, followed by URMA thereafter, with Global Data Assimilation System (GDAS) cycles of the Global Forecast System (GFS) atmospheric model (Kleist et al., 2009) providing data outside the bounds of the URMA domain.From 2018 onward, no terrain-based adjustment of URMA pressure and specific humidity is performed because temperature, pressure, and humidity are all nearest-neighbor sampled, unmodified, from URMA grids.
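The terrain-based adjustment described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the operational code: pressure is moved between elevations with the hypsometric relation using the mean of the source and target temperatures, and specific humidity is recomputed so that relative humidity is unchanged. The function names, the Bolton-type saturation vapor pressure formula, and the constants are choices made here for illustration.

```python
import math

G = 9.80665    # gravitational acceleration, m s-2
RD = 287.05    # gas constant for dry air, J kg-1 K-1
EPS = 0.622    # ratio of molecular weights, water vapor / dry air


def sat_vapor_pressure(t_k):
    """Saturation vapor pressure (Pa) over water; Bolton-type fit (assumed)."""
    return 611.2 * math.exp(17.67 * (t_k - 273.15) / (t_k - 29.65))


def vapor_pressure(q, p):
    """Vapor pressure (Pa) from specific humidity q (kg/kg) and pressure p (Pa)."""
    return q * p / (EPS + (1.0 - EPS) * q)


def adjust_pressure_humidity(p_src, t_src, q_src, z_src, t_new, z_new):
    """Hydrostatic pressure shift plus humidity at constant relative humidity."""
    t_mean = 0.5 * (t_src + t_new)
    p_new = p_src * math.exp(-G * (z_new - z_src) / (RD * t_mean))
    rh = vapor_pressure(q_src, p_src) / sat_vapor_pressure(t_src)
    e_new = rh * sat_vapor_pressure(t_new)
    q_new = EPS * e_new / (p_new - (1.0 - EPS) * e_new)
    return p_new, q_new
```

When the source and target elevations and temperatures coincide, the function returns its inputs unchanged, which is the behavior one would expect of any consistent adjustment of this type.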

| Downward longwave radiation
From 1979 to 2015, hourly downward surface longwave radiation fluxes (DLWRF) from NLDAS-2 are adjusted for consistency with the AORC temperature and emissivity (a function of pressure, specific humidity, and temperature). As is the case for pressure and specific humidity, the procedure followed is identical to that described in Cosgrove et al. (2003). After 2015, URMA is the primary source of temperature, pressure, and specific humidity in the AORC. Nevertheless, adjustment of NLDAS-2 DLWRF using those temperatures and derived emissivities continues for 2016 and 2017, justified by the climatological similarity between NLDAS-2 and URMA temperatures during that time. Beginning in 2018, however, no such reasoning applies, and NLDAS-2 DLWRF grids are bilinearly interpolated to the AORC grid without further modification.

| Downward shortwave radiation
For downward surface shortwave radiation, no downscaling process apart from simple bilinear interpolation is applied to input data sources. NLDAS-2 is used from 1979 onward, with GDAS-derived grids available beginning in 2018 as a contingency for missing NLDAS-2 data.

| U and V wind
Hourly U and V wind grids at the 10-m anemometer height are bilinearly interpolated from NLDAS-2 from 1979 to 2017. From 2018 onward, the input grids are provided by URMA and by GDAS when and where URMA is unavailable.

| Methodology for precipitation
The general strategy for generating hourly gridded precipitation for the AORC, described with additional detail in the subsections below, begins with high-accuracy and low-bias monthly gridded precipitation estimates. These are conservatively disaggregated, first into daily (herein defined as the 24-h accumulation from 12 UTC to 12 UTC) and finally into hourly estimates, using datasets that are less constrained, and on their own more prone to error and bias, but provide the desired time step sizes. For the first 31 years, from 1979 to 2009 (to which this study refers as the "historical period"), this approach is explicitly followed. During an interim 6-year period from 2010 to 2015, daily estimates are subdivided into hourly amounts as during the historical period, but total monthly precipitation is not directly adjusted to match a gridded analysis; instead, it is constrained for climatological consistency. Finally, beginning in 2016, daily estimates are subdivided into hourly amounts, but those daily estimates are unfettered, with neither a monthly precipitation analysis nor a climatology to constrain them.
Temporal and spatial manipulation of observation data to produce gridded datasets can suppress extreme events present in the underlying observation data (Pierce et al., 2021). More generally, the statistical behavior of those gridded results may differ significantly from that of observations, in particular by possessing a higher frequency of low-intensity precipitation and a lower frequency of heavy precipitation (Ensor & Robeson, 2008). These impacts may be reduced in the AORC through the use of observation-based daily and hourly datasets to subdivide monthly precipitation from 1979 to 2015; however, these disaggregation methods, and the gridded datasets used in their implementation, themselves have the potential to misrepresent precipitation extremes and further disrupt the intensity-duration-frequency characteristics of the resulting dataset. A comparison of precipitation intensity distributions from the AORC with those of observations is outside the scope of this study, but users of the data are advised to carefully consider these matters.
The remainder of this section describes in more detail the methods used to generate monthly and associated daily precipitation amounts for the AORC for the three eras described above, followed by a discussion of disaggregation of daily precipitation into hourly amounts.

| Monthly and daily precipitation
The availability and accuracy of different gridded input datasets have evolved significantly over the years covered by the period of interest for the AORC (1979-present), as listed in Table 3. Consequently, the methods for combining them into a minimally biased and consistent product follow that evolution. Note that all processing of daily precipitation is performed on the 4.7625-km HRAP grid after converting input data sources, if necessary, to that coordinate system via budget regridding/reprojection, with downscaling to the AORC grid to be performed later.
The budget regridding/reprojection method used in AORC processing calculates each output grid value as the simple average over a 5 × 5 raster that subdivides the (HRAP) output cell into a 0.9525-km resolution subgrid, with bilinear interpolation of the input grid performed at the center of each of those 25 subgrid cells. This approach, provided by the IPOLATES software library via the wgrib2 utility (National Centers for Environmental Prediction, Version 0.2.0.4, February 2016), approximates more precise conservative regridding methods such as that described by Jones (1999).
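The idea behind budget regridding can be illustrated with a short NumPy sketch. It is a simplification made under explicit assumptions: output cell edges are given directly in the fractional index coordinates of the input grid (so no map projection is involved), and the functions `bilinear` and `budget_regrid` are hypothetical names written for this example rather than calls into IPOLATES or wgrib2.

```python
import numpy as np


def bilinear(grid, y, x):
    """Bilinear interpolation of grid at fractional index coordinates (y, x)."""
    y = np.clip(y, 0, grid.shape[0] - 1)
    x = np.clip(x, 0, grid.shape[1] - 1)
    y0 = np.minimum(np.floor(y).astype(int), grid.shape[0] - 2)
    x0 = np.minimum(np.floor(x).astype(int), grid.shape[1] - 2)
    fy, fx = y - y0, x - x0
    return ((1 - fy) * (1 - fx) * grid[y0, x0]
            + (1 - fy) * fx * grid[y0, x0 + 1]
            + fy * (1 - fx) * grid[y0 + 1, x0]
            + fy * fx * grid[y0 + 1, x0 + 1])


def budget_regrid(grid, out_y_edges, out_x_edges, n_sub=5):
    """Average n_sub x n_sub bilinear samples taken inside each output cell."""
    offsets = (np.arange(n_sub) + 0.5) / n_sub  # sub-cell centres as cell fractions
    out = np.empty((len(out_y_edges) - 1, len(out_x_edges) - 1))
    for j in range(out.shape[0]):
        ys = out_y_edges[j] + offsets * (out_y_edges[j + 1] - out_y_edges[j])
        for i in range(out.shape[1]):
            xs = out_x_edges[i] + offsets * (out_x_edges[i + 1] - out_x_edges[i])
            yy, xx = np.meshgrid(ys, xs, indexing="ij")
            out[j, i] = bilinear(grid, yy, xx).mean()
    return out
```

Averaging 25 interpolated samples per output cell, rather than taking a single interpolated value at the cell center, is what allows this approach to approximate area-conserving behavior when the output cells are coarser than the input cells.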
TABLE 3 Summary of input datasets for monthly and daily (i.e., the 24-h accumulation ending at 12 UTC) AORC precipitation.

When this circumstance arises, NLDAS-2 precipitation is used to fill areas of missing Stage IV data. In areas outside the coverage of LIV16 precipitation grids, NLDAS-2 daily precipitation grids are used with no monthly constraint.

| Interim period (2010-2015)
The PRISM/NCEI/OWP climatology described above in Section 2.2 covers the period 1981-2010, and is therefore no longer directly applicable after 2010; however, the LIV16 dataset does cover the years 2010-2015, as discussed in Section 2.3. To generate AORC precipitation grids for this period, monthly LIV16 gridded precipitation accumulations are not explicitly disaggregated into daily amounts as they are for the years 1979-2009. As is done for the latter portion of the historical period, daily precipitation estimates are built from Stage IV grids, with NLDAS-2 used outside the coverage of Stage IV grids. However, instead of adjusting these grids so their monthly accumulations match that of the LIV16 monthly product (as was done for 1979-2009), they are bias-adjusted such that the 6-year mean for each grid location for each calendar month is equal to the corresponding 6-year mean from the LIV16 dataset. Because the LIV16 dataset for 2010-2015 is generated using a spatial distribution and bias adjustment based on the 1981-2010 PRISM climatology, this approach effectively does the same for the AORC data for those years. Finally, internal evaluations revealed AORC-CONUS version 1.0 precipitation to possess high bias in the portion of southern Canada near the lower Great Lakes beginning in 2012 (see Gronewold et al., 2018), and replacement of NLDAS-2 with the Government of Canada's 10-km Regional Deterministic Precipitation Analysis (RDPA) based on the Canadian Precipitation Analysis (CaPA) system (Fortin et al., 2018) was retrospectively applied in this area beginning in May 2012 as part of the AORC-CONUS 1.1 effort (see Table 1). The area of interest, which includes lakes Huron, Erie, and Ontario, is defined by a polygon, bounded on the west by Lake Huron, on the north by a path between Sault Ste. Marie and Montreal, Canada, and on the south and east by the U.S. shores of the Great Lakes and the U.S./Canada border.
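The difference between the two constraints (matching each month's accumulation in the historical period versus matching only the multi-year mean for each calendar month in the interim period) can be made concrete with a small sketch. The arrays hold monthly totals for one calendar month with shape (years, cells); the function names and the treatment of zero denominators (a factor of 1) are assumptions made for illustration.

```python
import numpy as np


def safe_ratio(num, den):
    """Element-wise num/den, returning 1 where den is not positive (assumed rule)."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.ones(np.broadcast(num, den).shape)
    np.divide(num, den, out=out, where=den > 0)
    return out


def monthly_match_factors(target, estimate):
    """One factor per year and cell: each month's total is forced to the target."""
    return safe_ratio(target, estimate)


def climatological_factor(target, estimate):
    """One factor per cell: only the multi-year mean is forced to the target."""
    return safe_ratio(np.mean(target, axis=0), np.mean(estimate, axis=0))


# Two years, one grid cell: target totals 120 and 100, estimated totals 100 and 100.
target = np.array([[120.0], [100.0]])
estimate = np.array([[100.0], [100.0]])
per_month = monthly_match_factors(target, estimate)   # [[1.2], [1.0]]
per_period = climatological_factor(target, estimate)  # [1.1]
```

In this toy case the climatological factor scales both years by the same amount, so the multi-year mean agrees with the target while the individual monthly totals do not, which is the essential distinction between the interim and historical periods.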

| Near-real-time operations (post-2015)
The LIV16 dataset is unavailable after 2015. This circumstance has thus far prevented AORC precipitation from undergoing any bias adjustment for dates beginning in January 2016, and no disaggregation of monthly precipitation analysis is performed during this time either. Instead, daily precipitation amounts are generated by combining, in descending order of preference, Stage IV 24-h precipitation, the 0.5-degree CPC Global Unified Gauge-Based Analysis of Daily Precipitation (Xie et al., 2007), and daily CaPA grids, using the highest priority product available for each grid location. As was done from May 2012 through December 2015, CaPA daily precipitation supplants other precipitation estimates (primarily the CPC analysis) over the lower Great Lakes region in Canada. In spite of the lack of bias adjustment on precipitation analyses during this period (which includes near-real-time updates), improvements in QPE generation such as the key contribution of Multi-Radar, Multi-Sensor (MRMS) precipitation products (Zhang et al., 2016) have improved the skill of Stage IV QPE since early 2015. Even so, the lack of bias adjustment does alter the nature of AORC-CONUS 1.1 precipitation beginning in 2016, as the sections to follow illustrate.
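Selecting the highest priority product available at each grid location amounts to a cell-by-cell fallback, sketched below. The sketch assumes that all inputs have already been placed on a common grid and that missing values are encoded as NaN; the function name `merge_by_priority` is hypothetical, and the regional override by CaPA over the lower Great Lakes is not represented.

```python
import numpy as np


def merge_by_priority(*grids):
    """Fill each cell from the first grid (in priority order) that has data there."""
    merged = np.full_like(np.asarray(grids[0], dtype=float), np.nan)
    for grid in grids:
        grid = np.asarray(grid, dtype=float)
        # Keep values already filled; take this grid only where nothing is set yet.
        merged = np.where(np.isnan(merged), grid, merged)
    return merged


# Toy one-row grids in descending order of preference.
stage4 = np.array([1.0, np.nan, np.nan])
cpc = np.array([5.0, 2.0, np.nan])
capa = np.array([9.0, 9.0, 3.0])
daily = merge_by_priority(stage4, cpc, capa)
```

Each cell in the toy result is taken from the first source in the list that is defined there, so a lower-priority product contributes only where every higher-priority product is missing.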

| Hourly precipitation
Disaggregation of daily precipitation accumulations into 1-h amounts is based on gridded hourly precipitation from a variety of input sources, these varying significantly in their availability over the period of interest.These are listed in Table 4, and the timeline for which each is used, including occasional data gaps, is illustrated in Figure 2. Note that total daily accumulation is never altered by the disaggregation process.
For daily precipitation estimates from February 1979 through December 2015, all available hourly input sources are combined using a weighted average. The associated weighting grids for each day are generated so as to maximize the Nash-Sutcliffe Efficiency (NSE), calculated over regional subgrids, between aggregated 1-h precipitation and the associated daily precipitation estimate. In general, for any grid location, the weighted average is based on the best-performing (highest NSE) source receiving a weight of 1.0 and all others receiving weights of zero, but a smoothing process blends the weights to prevent discontinuous results.
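The selection of a best-performing hourly source can be sketched as follows for a single regional subgrid. The NSE is computed in its standard form, here across the cells of the subgrid; the function names are hypothetical, the choice of -inf for a constant reference field is an assumption, and the smoothing of weights across subgrids mentioned above is omitted.

```python
import numpy as np


def nse(sim, ref):
    """Nash-Sutcliffe Efficiency of sim against ref, computed over all cells."""
    sim = np.asarray(sim, dtype=float)
    ref = np.asarray(ref, dtype=float)
    denom = np.sum((ref - ref.mean()) ** 2)
    if denom <= 0:
        return -np.inf  # assumed convention when the reference has no variance
    return 1.0 - np.sum((sim - ref) ** 2) / denom


def best_source_weights(hourly_sources, daily):
    """Weight 1.0 for the source whose 24-h sum best matches daily, 0 otherwise."""
    scores = [nse(np.sum(h, axis=0), daily) for h in hourly_sources]
    weights = np.zeros(len(hourly_sources))
    weights[int(np.argmax(scores))] = 1.0
    return weights, scores
```

A call such as `best_source_weights([source_a, source_b], daily)` takes a list of arrays shaped (hours, rows, columns) and returns the unsmoothed weights together with the scores used to choose them.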
Beginning in 2016, a simpler method is used. Disaggregation at each HRAP grid location is performed using one of the following, in order of preference depending on what is available: NEXRAD Stage IV, NEXRAD Stage II, CMORPH, or GDAS.
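However the hourly source is chosen, the property that the daily accumulation is never altered follows from using the hourly data only as a set of fractions. The sketch below illustrates this; the function name is hypothetical, and the uniform split applied where the hourly source reports no precipitation at all is an assumption made here so that the example is complete, not a documented AORC rule.

```python
import numpy as np


def disaggregate_daily(daily, hourly_pattern):
    """Split daily totals into hourly amounts using hourly_pattern as fractions.

    daily          : array (rows, cols) of 24-h accumulations
    hourly_pattern : array (hours, rows, cols) from the chosen hourly source
    """
    hourly_pattern = np.asarray(hourly_pattern, dtype=float)
    daily = np.asarray(daily, dtype=float)
    n_hours = hourly_pattern.shape[0]
    pattern_total = hourly_pattern.sum(axis=0)
    # Assumed fallback: spread the daily total evenly where the pattern is all zero.
    fractions = np.full(hourly_pattern.shape, 1.0 / n_hours)
    np.divide(hourly_pattern, pattern_total, out=fractions, where=pattern_total > 0)
    return fractions * daily
```

Because the fractions at each cell sum to one by construction, summing the returned hourly amounts recovers the input daily grid to within floating-point rounding.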
As was the case with daily precipitation processing, the above steps are performed on the HRAP grid after budget regridding of input data sources to that coordinate system. Downscaling to AORC grid coordinates using the PRISM/NCEI/OWP climatology, and bias adjustment for consistency with LIV16 (monthly accumulations for 1979-2009, monthly climatologies for 2010-2015) is the final step in processing.

| PRECIPITATION EVALUATION
The following evaluation of AORC-CONUS 1.1 precipitation is focused on daily (24 h) precipitation amounts ending at 12 UTC. This approach allows for the use of a large, well-established database of observations, but it does not consider the question of how successfully AORC processing procedures subdivide daily precipitation into hourly amounts. A discussion of intense hourly precipitation occurrences in the AORC follows the evaluation results.

| Observation data
The observations with which AORC-CONUS 1.1 precipitation and its gridded input data sources are compared in this study are provided by the Global Historical Climatology Network-Daily (GHCN-D) database, version 3.28 (Menne, Durre, Korzeniewski, et al., 2012; Menne, Durre, Vose, et al., 2012). The subset of the database used in this study includes all observing locations in the U.S., Canada, and Mexico (US/CA/MX hereafter), excluding the following:
• Reports for which a local time of observation of 12:00 or later is indicated (reports for which no local time is indicated are not excluded; see the following discussion);
• "Global Summary of the Day" reports (source flag "S"), which are derived from hourly synoptic reports and may differ significantly, particularly in the 24-h period they represent, from daily precipitation measured at or near 12 UTC (Menne, Durre, Korzeniewski, et al., 2012; Menne, Durre, Vose, et al., 2012);
• Reports flagged as having failed any quality assurance check (Durre et al., 2010).

TABLE 4 Input data sources used to disaggregate 24-h precipitation into 1-h accumulations in AORC-CONUS 1.1.

The first two criteria ensure that only a small proportion of the remaining observations were collected more than 2-3 h from 12 UTC. A more detailed discussion of the reasoning behind this assertion, and of the timing of GHCN-D precipitation reports in general, is provided in the "Observation Times of GHCN-Daily Precipitation Reports" section of the Appendix.

| Evaluation methodology
For each evaluation covering a particular time period and spatial domain, all qualifying GHCN-D precipitation observations are selected from the US/CA/MX database. Gridded precipitation estimates under investigation are then aggregated, if necessary, into 24-h amounts ending at 12 UTC. These are nearest-neighbor sampled at station coordinates, then paired with corresponding precipitation observations at those dates and locations. Stations must report on at least 50% of the days covering an evaluation period to be included in that evaluation.
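The sampling and station-selection steps can be sketched as follows for a regular longitude/latitude grid. The function names and the grid description (coordinates of the first cell center plus a constant spacing) are assumptions made for illustration; grids on projected coordinate systems would require a coordinate transformation first.

```python
import numpy as np


def nearest_index(coord, first_center, spacing, n):
    """Index of the cell whose center is nearest to coord, clamped to the grid."""
    idx = int(round((coord - first_center) / spacing))
    return min(max(idx, 0), n - 1)


def sample_nearest(grid, lat, lon, lat0, lon0, dlat, dlon):
    """Nearest-neighbor value of grid at (lat, lon); rows follow latitude."""
    row = nearest_index(lat, lat0, dlat, grid.shape[0])
    col = nearest_index(lon, lon0, dlon, grid.shape[1])
    return grid[row, col]


def reports_enough(n_reports, n_days, min_fraction=0.5):
    """True if a station reported on at least min_fraction of the days."""
    return n_reports >= min_fraction * n_days
```

A station at (25.4° N, 124.6° W) on a 1° grid whose first cell center lies at (25° N, 125° W), for example, would be paired with the value in the first row and first column.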
To supplement the GHCN-D quality flags, two additional checks on precipitation reports are performed during each evaluation. First, any report of >10 inches of daily precipitation must be accompanied by at least nine other observations of >10 inches on the same day, without regard for the locations of these observations, to be included. Second, any report of either zero or >1 inch of daily precipitation must be corroborated by observations in its vicinity, if available. Otherwise, such a report is flagged as a local oddity and excluded from evaluation.
Specifically, a report of zero is flagged if more than 95% of the reporting stations in its neighborhood observed >1 inch, and 100% of its nearest neighbors (defined as the nearest 5% of reporting stations in the neighborhood) observed >1 inch. A report of >1 inch is flagged if more than 95% of the reporting stations in its neighborhood observed ≤0.1 inch, and 100% of its nearest neighbors observed ≤0.1 inch. Reports at stations located at elevations more than 2 standard deviations from their mean neighborhood elevation (based on the elevations of neighboring stations) are exempt from this check. The neighborhood size for this check is a 100-km radius around any reporting station. This distance effectively balances practicability with the likelihoods of type I and type II errors, and a typical decorrelation distance for daily precipitation is on the order of 2-3 times this radius (van Leth et al., 2021).
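The neighborhood check for a report of zero can be sketched as below; the check for a report of >1 inch is symmetric, with the comparison reversed and the 0.1-inch threshold substituted. Several details are assumptions made for this illustration: great-circle distance by the haversine formula, a count of nearest neighbors equal to 5% of the neighborhood rounded up (and at least one), no flag when the neighborhood is empty, and omission of the elevation exemption.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def flag_zero_report(station, neighbours, radius_km=100.0, wet_in=1.0):
    """Flag a zero report contradicted by its neighborhood.

    station    : (lat, lon) of the station reporting zero
    neighbours : list of (lat, lon, precip_inches) for other same-day reports
    """
    near = sorted(
        (haversine_km(station[0], station[1], la, lo), p) for la, lo, p in neighbours
    )
    near = [(d, p) for d, p in near if d <= radius_km]
    if not near:
        return False  # nothing available to corroborate or contradict the report
    wet_fraction = sum(p > wet_in for _, p in near) / len(near)
    k = max(1, math.ceil(0.05 * len(near)))  # assumed size of the "nearest 5%"
    closest_all_wet = all(p > wet_in for _, p in near[:k])
    return wet_fraction > 0.95 and closest_all_wet
```

Requiring both conditions means that a single nearby dry report is enough to retain the zero, which keeps the check conservative in convective situations where sharp gradients are plausible.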
Once observations and accompanying sampled precipitation estimates are collected and prepared, a variety of bias and skill metrics are calculated both across the spatial domain and the time domain. For metrics calculated across the spatial domain (which produce time series), data are collected over aggregation periods of 28 days. Metrics for a 28-day aggregation period are calculated using all the data in that sample population and attributed to the midpoint of the 28-day range. This approach produces more stable metrics for time series than what is obtained by calculating an independent metric for each day. Note that observations in any such sample population and their gridded, sampled counterparts are never themselves summed into 28-day accumulations; each daily pair of observed and gridded data at any given location is a discrete member of the sample, and any single observing station may provide up to 28 such pairs. For metrics calculated across the time domain, a simpler approach applies: all the daily pairs of observed and gridded precipitation attributed to a particular station participate in the computation of its bias and skill metrics.
Among all metrics examined, this study relies largely on the unrestricted accumulation (multiplicative) bias (UAMB; the ratio of total analysis precipitation to total observed precipitation) to present our evaluation results. In some circumstances, it is beneficial to categorize events on a contingency table and treat bias and error as exclusive categories; for example, to calculate bias based only on hits (cases in which both gridded and observed daily precipitation exceed some threshold amount). It may also be worthwhile to examine the statistical qualities of individual values of multiplicative bias (MB; the ratio of analysis precipitation to observed precipitation on a station-by-station, day-by-day basis). By contrast, UAMB is simply the sum of all gridded samples in a population divided by the sum of all corresponding observed values, without regard for whether any particular gridded value would be judged a categorical error by any standard. Among a variety of metrics, UAMB has been found to best illustrate the overall accuracy of the gridded precipitation products used in the AORC, as well as that of the AORC itself. It provides a practical measure of the aggregate effect of bias and error across a given temporal and/or spatial domain. The use of UAMB is largely consistent with a multiplicative error model, which has been shown to better represent errors in estimates of daily precipitation than an additive model (Tian et al., 2013) and is often used to characterize precipitation uncertainty in hydrologic modeling applications (Kavetski et al., 2006; McMillan et al., 2011).
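As a minimal sketch of how UAMB and the 28-day aggregation fit together, the following assumes that each sample is a (day, gridded, observed) record and that the midpoint of a period is taken as its 15th day. Both choices, and the function names, are illustrative assumptions rather than details given in the text.

```python
from collections import defaultdict
from datetime import date, timedelta


def uamb(pairs):
    """Sum of gridded values divided by sum of observed values.

    pairs is a list of (gridded, observed) daily amounts. Returns None when
    the observed total is zero, since the ratio is then undefined.
    """
    total_gridded = sum(g for g, _ in pairs)
    total_observed = sum(o for _, o in pairs)
    if total_observed == 0:
        return None
    return total_gridded / total_observed


def uamb_time_series(records, start, period_days=28):
    """Group daily pairs into consecutive periods and compute UAMB for each.

    records is an iterable of (day, gridded, observed), with day a
    datetime.date. Daily pairs are never summed per station first; every
    pair is a separate member of its period's sample.
    """
    bins = defaultdict(list)
    for day, gridded, observed in records:
        index = (day - start).days // period_days
        if index >= 0:
            bins[index].append((gridded, observed))
    series = []
    for index in sorted(bins):
        midpoint = start + timedelta(days=index * period_days + period_days // 2)
        series.append((midpoint, uamb(bins[index])))
    return series
```

Passing all pairs for one station to `uamb` directly gives the station-level (time-domain) value used for the bias maps.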
An additional examination of deviation and bias in AORC-CONUS 1.1 precipitation, provided in the "Error Statistics of AORC-CONUS 1.1 Precipitation" section of the Appendix, demonstrates that multiplicative bias (MB) exhibits a distinctly log-normal tendency. In this study, we address this tendency by using logarithmic coordinates for all UAMB (vertical) axes in time-series plots, constructing color ramps that are logarithmically symmetric about a bias value of 1.0, and using geometric statistics as defined by Koopmans et al. (1964) whenever possible to quantify the behavior of both UAMB and MB.
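For readers unfamiliar with geometric statistics, the sketch below computes a geometric mean (GM) and a geometric coefficient of variation (GCV) from positive bias values. The GCV form used here, sqrt(exp(s^2) - 1) with s^2 the sample variance of the natural logarithms, is the usual log-normal coefficient of variation; we assume, without confirmation from the text, that it matches the estimator the authors take from Koopmans et al. (1964).

```python
import math
import statistics


def geometric_mean(values):
    """Exponential of the mean of the natural logs (values must be > 0)."""
    logs = [math.log(v) for v in values]
    return math.exp(statistics.fmean(logs))


def geometric_cv(values):
    """Log-normal coefficient of variation: sqrt(exp(var(ln x)) - 1).

    Uses the sample variance of the logs, so at least two values are needed.
    Multiply by 100 to express the result as a percentage.
    """
    logs = [math.log(v) for v in values]
    return math.sqrt(math.exp(statistics.variance(logs)) - 1.0)
```

Both statistics treat a bias of 0.5 and a bias of 2.0 as equal and opposite departures from 1.0, which is the same symmetry that motivates the logarithmic axes and color ramps.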
In addition to AORC-CONUS 1.1 precipitation grids, 24-h Stage IV and NLDAS-2 precipitation estimates are also evaluated here, as these are the primary building blocks used to generate daily AORC precipitation. Daily LIV16 precipitation grids, while available, are not considered in this study. They do not contribute to AORC precipitation grids, as they are non-synoptic, representing accumulation periods ending at local midnight. Finally, it is worth noting that all NWM versions prior to 2.1 were calibrated using NLDAS-2 forcings; therefore, the behavior of that dataset may provide clues as to how the calibration of those NWM versions may have differed from later versions calibrated with the AORC.

| Overall bias behavior of the AORC
Figure 3 shows the UAMB of AORC, NLDAS-2, and Stage IV (beginning in 2002) precipitation calculated across the full CONUS domain from 1979 through 2021 using 28-day aggregation periods. This result demonstrates that the AORC shows little overall bias and generally less bias than its counterparts, with a minimum of 0.88, a maximum of 1.11, and a median value of 0.98. However, these values are calculated for the full CONUS domain, using all observations gathered over 28-day aggregation periods, and such an aggregation can conceal significant local and regional biases.

Individual captions describe the primary input datasets used during this period, including any constraints imposed on monthly precipitation totals. A legend for line colors is provided in Figure 3.
Stage IV QPE (2002-2003) show less improvement, likely due to the instability and seasonal bias extremes seen in the Stage IV during that time. Over the interim period from 2010 to 2015 (Figure 4c), the average precipitation for each calendar month is constrained by the 1981-2010 climatology, but monthly precipitation is not explicitly matched to LIV16 analysis. Consequently, more interannual variability is present during this time, with a mostly dry-biased period from early 2011 through mid-2013, followed by a mostly wet-biased period from mid-2013 through late 2015. Beginning in 2016, AORC precipitation is not constrained by climatology. The potential bias that this change might introduce is mitigated by improvement in the overall performance of the Stage IV QPE mosaics in late 2013 to early 2014, likely due to the operational deployment of MRMS QPE at that time. Operational MRMS products were rapidly integrated into QPE generation processes at many RFCs, and a sustained improvement in the bias performance of Stage IV QPE followed. However, AORC precipitation bias is also tightly correlated with that of the Stage IV beginning in 2016, and this correlation includes a clear annual variation in bias, with winter minima and summer maxima. This signal persists to varying degrees throughout all 20 years of the Stage IV QPE record examined in this study, but prior to 2016 the control asserted by LIV16 monthly precipitation and associated climatologies largely prevents the associated biases from impacting the AORC.
Examination of UAMB calculated across time at specific station locations reveals more about the performance of the AORC. In Figure 5, which shows UAMB at GHCN-D observing stations from 1979 to 2001, some bias extremes are apparent, particularly in Mexico and on coastlines, but with the exception of some dryness across the Western U.S. and Canadian Plains, most bias values are close to 1.0. The geometric mean (GM) of the UAMB values shown is 0.97, and the geometric coefficient of variation (GCV) is 7.8%. In Figure 6 (2002-2009), more extremes in bias, particularly dryness in the Western U.S., appear to have been introduced into the AORC. The influence of eastern portions of the domain, which contain more observing locations and suffer from less bias, prevents these dry areas from significantly affecting the time series of Figures 3 and 4. Still, the GM for this period is 0.98 and GCV is 11.5%, the latter value revealing an increase in spatial scatter over the results in Figure 5.
For the period from 2010 to 2015 (Figure 7), less systematic dryness in the Western U.S. has been replaced with greater overall variability in bias between both low (dry) and high (wet) results across the domain. The GM and GCV for this map are 1.00 and 13.2%. Finally, lacking climatological constraints, AORC precipitation during the period from 2016 to 2021 (Figure 8) suffers from far more significant systematic bias in several portions of the domain, including dry bias across all Western RFC regions, and still more increase in scatter relative to earlier periods:

FIGURE 7 UAMB of 24-h AORC-CONUS 1.1 precipitation compared with GHCN-D observations from January 2010 through December 2015.

| Other evaluation metrics
The UAMB captures much of the error and bias behavior of the AORC and related datasets; however, additional insights can be gained from examining other statistical metrics, particularly those which quantify categorical skill and variability in bias.
First, we present a discussion of categorical skill in the AORC, which delivers an event-by-event analysis of AORC precipitation that the UAMB, given its aggregate nature, does not provide. For this analysis, a precipitation event is defined as any observation or grid value that exceeds 0.1 inch in a day (constrained to 12 UTC to 12 UTC as is the case throughout this study). We limit this discussion to two basic, well-known metrics:

1. Probability of Detection (POD) = hits ÷ (hits + misses)
2. False Alarm Ratio (FAR) = false alarms ÷ (hits + false alarms)

Figure 9 depicts time series of POD and FAR for each of the four sub-periods previously examined. The AORC and its component counterparts exhibit a combination of biannual and annual variation in errors, in particular showing many local maxima in FAR near the summer solstice and local minima in POD near the winter solstice. These errors are particularly noticeable in the NLDAS-2 results, which impact the AORC across its full spatial domain through 2001, and continue to affect most areas outside the bounds of CONUS RFC regions through 2015, as discussed in Section 2.4.
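The two categorical metrics can be computed from the same daily pairs used for UAMB. The sketch below applies the 0.1 inch event threshold stated above; the function names and the (gridded, observed) pair layout are illustrative assumptions.

```python
def contingency_counts(pairs, threshold_in=0.1):
    """Count hits, misses and false alarms among (gridded, observed) pairs.

    An event is any value strictly greater than the threshold.
    """
    hits = misses = false_alarms = 0
    for gridded, observed in pairs:
        grid_event = gridded > threshold_in
        obs_event = observed > threshold_in
        if grid_event and obs_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif grid_event:
            false_alarms += 1
    return hits, misses, false_alarms


def pod_far(pairs, threshold_in=0.1):
    """Return (POD, FAR); either is None when its denominator is zero."""
    hits, misses, false_alarms = contingency_counts(pairs, threshold_in)
    pod = hits / (hits + misses) if hits + misses else None
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else None
    return pod, far
```

Pairs in which neither value exceeds the threshold (correct negatives) do not enter either metric.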
A note on the peculiarity in the POD and FAR in late 2007 (Figure 9b): this oddity occurs because for most of November and December 2007, NLDAS-2 precipitation coverage inside U.S. boundaries appears to be lagged by 4 days; for example, NLDAS-2 precipitation over the CONUS for December 14 is based on data from December 10.To our knowledge, this issue has not been reported elsewhere, and does not occur at any other time in the NLDAS-2 forcing data record.
Another characteristic of AORC behavior not illustrated by the UAMB is the spread in bias from location to location, and event to event.
For this study, we measured the spread in bias using the GCV in multiplicative bias (GCVMB). Figure 10 shows the GCVMB in AORC-CONUS 1.1 precipitation for the entire evaluation period. Results for each of the four sub-periods (not shown) are similar, though the magnitudes of GCVMB outside the CONUS are largest for the 2002-2009 period. This bias spread is likely attributable to anomalies and unrealistic gradients in NLDAS-2 precipitation outside U.S. boundaries; these have previously been reported by Gronewold et al. (2018). To illustrate this proposition, the UAMB for NLDAS-2 precipitation for each of the four eras examined in this study is displayed in Figure 11. Finally, regional variations in the UAMB of NLDAS-2 precipitation within the CONUS are not unlike regional anomalies reported by Mo et al. (2012).
The release of climatological constraints in the AORC beginning in January 2016 exposes the analysis to bias and error from that point forward. Figures 4d and 8 illustrate the change in precipitation bias across the AORC-CONUS spatial domain beginning in 2016. In the "Precipitation Evaluation for Two RFC Regions" section of the Appendix, we illustrate how this and other changes from one era to another are manifested on a more regional basis, focusing on two CONUS RFC regions.

| Unresolved heavy hourly precipitation amounts
During the development of AORC-CONUS 1.1, several high-priority fixes to the AORC component elements were implemented, including the removal of the very highest, and most fundamentally unrealistic, hourly precipitation values (see Table 1). Specifically, for all hours during which any AORC precipitation grid value exceeded 280 mm/h (11 in./h), the entire AORC-CONUS grid was replaced with NLDAS-2 precipitation. Note that currently, "world record exceedance" tests for hourly precipitation generally reference an observation of 12 inches of hourly rainfall in Holt, Missouri, on June 22, 1947 (Cerveny et al., 2007).
The removal of precipitation exceeding 11 in./h leaves open the possibility that unrealistic hourly precipitation amounts not reaching that threshold remain unresolved in AORC-CONUS 1.1. Indeed, as Figure 12 illustrates, if one identifies the maximum value from each hourly AORC-CONUS 1.1 precipitation grid, it is clear that extreme precipitation amounts still occur frequently in the dataset. There appear to exist three distinct eras visible in the figure: the period prior to 1995, the period from 1995 to 2001, and the post-2002 period. It is worth noting, in terms of hourly data used to disaggregate daily precipitation totals, a few key milestones (see Table 4; Figure 2) that are likely related to the changes distinguishing these eras:

• December 1994: the inclusion of Manually Digitized Radar (MDR; Miller & Kitzmiller, 2017) reflectivity data ends;
• January 1996: the inclusion of Weather Services International National Operational Weather Radar (WSI-NOWrad; Zhang et al., 2017) reflectivity grids begins;
• May 1996: the inclusion of NEXRAD Stage II (Lin & Mitchell, 2005) precipitation grids begins;
• January 2002: the inclusion of RFC-generated Stage IV precipitation estimates begins.

An inventory of maximum hourly precipitation in various ranges (using units of inches) for all 376,200 h (February 1979 through December 2021) of the AORC-CONUS 1.1 dataset considered for this study is provided in Table 5. Precipitation rates exceeding 50 mm/h (about 2 in./h) are conventionally categorized as "violent" (Met Office, 2012); therefore, these results indicate 38,635 discrete 1-h occurrences of violent rain over the 43-year period (10% of the hours in the dataset, averaging to almost 2.5 occurrences per day), and 134,941 h (36% of the hours in the dataset) during which heavy-to-violent rain (>1 in./h) occurred in at least one location on the AORC grid.
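The proportions quoted above follow directly from the hour counts. As a quick check, the short calculation below reproduces them using only the figures stated in the text and the calendar length of the period; the variable names are ours.

```python
from datetime import date

# Period considered: February 1979 through December 2021, inclusive.
n_days = (date(2022, 1, 1) - date(1979, 2, 1)).days
n_hours = 24 * n_days

violent_hours = 38_635   # hours whose grid maximum exceeds about 2 in./h
heavy_hours = 134_941    # hours whose grid maximum exceeds 1 in./h

violent_share = violent_hours / n_hours      # fraction of all hours
violent_per_day = violent_hours / n_days     # average occurrences per day
heavy_share = heavy_hours / n_hours          # fraction of all hours

print(f"{n_hours} h, violent {violent_share:.1%}, "
      f"{violent_per_day:.2f} per day, heavy-to-violent {heavy_share:.1%}")
```

Note that each count refers to hours in which at least one grid cell reached the given intensity, so the shares describe how often such a maximum occurs somewhere on the grid, not how much of the domain is affected.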
Next, we examine the question of whether heavy-to-violent hourly precipitation amounts are sustained or isolated in time. Short durations are expected when heavy-to-violent precipitation occurs, but we are nonetheless interested to know to what extent the anomalies identified here are localized in time. For a preliminary answer, we examined grid locations within a 75 km radius of each maximum hourly gridded value during the two adjacent hours of AORC precipitation, and identified the number of cases in which the grid for either or both of the adjacent hours includes a maximum value in this regional neighborhood within 25% of the reference maximum. These results are listed in the fourth column of Table 5. They indicate that the likelihood of a heavy-to-violent hourly maximum precipitation rate being sustained for more than 1 h in the same neighborhood is 62% for the (1.0, 2.0] in./h range, 57% for (2.0, 3.0] in./h, and decreases substantially as intensity increases beyond 3 in./h. This result suggests that a subset of the intense precipitation identified in this study is sustained for at least 2 h, but that most hourly maxima above 3-4 in./h are isolated in time.

a single ground report of 6.21 inches near Claude, Texas, for the hour ending at 00 UTC, and an associated report of 9.54 inches for the 24 h ending at 12 UTC on September 20, 2019, from the same observer. Both reports are outliers relative to other surface observations in the vicinity, where the highest 24-h accumulations reported were from 2 to 3 inches. They were, however, described by the WFO as valid reports.
Typically, when erroneous extreme precipitation intensities appear in the AORC-CONUS 1.1 dataset, they are the result of a minimal quality assurance process, which does not seek to identify spurious values present in input data sources or emerging from data processing steps. Figure 14 shows the locations and values of all hourly maxima from AORC-CONUS precipitation grids that exceed 2 in./h. This collection represents a subset of such events, since it includes at most one grid location (that of the maximum) for each hour, but given that the dataset we have examined covers 376,200 h, it provides a large sample, illustrating the spatial distribution of intense precipitation events throughout the AORC dataset. Note that the 6.6 in./h value from September 20, 2019, described in Figure 13, is visible over the Texas Panhandle. Only 11 of the 13 maxima above 10 in./h are visible in the figure because three occur at exactly the same grid location in northern Utah over a 12-h period around December 24, 2018.
Most occurrences of extreme precipitation are seen in areas frequently affected by tropical cyclones and severe convective storms.The figure suggests that while some extreme events present in the AORC-CONUS 1.1 are depicted accurately, it is likely that a significant subset of them are the result of insufficient quality assurance in AORC data processing.
In an effort intended to simulate a prototypical quality assurance test, each of the hourly maxima exceeding 2 in./h summarized in Table 5 was compared, if possible, with a database of hourly precipitation observations from surface stations covering the years 2005-2021. To determine whether the maxima were supported by observations, all observations within a 75 km radius of the hourly maximum's grid location and attributed to times within 90 min of the same date and time were collected. Whenever at least five such neighboring observations were found, those reporting precipitation within 25% of the maximum AORC grid value were judged as corroborating reports.

A1; Figure A4b), though the GM bias for the points in Figure A3b is slightly higher at 0.97. A wet bias in Stage IV QPE (GM = 1.16 in Table A1) over the MBRFC region emerges during 2010 (Figure A4c) and increases and persists thereafter (Figure A4d). This bias arises from a merging of different datasets in the Multisensor Precipitation Estimation (MPE) process at MBRFC, which serves the needs of operational hydrology for that region but impacts the accuracy of the Stage IV mosaic. From 2010 to 2015, this bias is, as has been noted previously, constrained by the process of maintaining climatological similarity between LIV16 and AORC for those years, but a slight wet bias does appear in the AORC over portions of the region during that period (Figure A3c), with a GM of 1.03 for the time series in Figure A4c and 1.02 for the points in Figure A3c.
Beginning in 2016, there is virtually no difference between AORC and Stage IV (Figure A4d), resulting in a distinct wet bias in AORC across the region, particularly its western portion (Figure A3d).

Figure 4 presents the time series of Figure 3 separated into each of the four sub-periods within which input data sources and methods were consistent. The latter part of the historical period, from roughly 2004 to 2009 (Figure 4b), in which Stage IV QPE mosaics contribute to daily precipitation estimates, shows a slight improvement in AORC bias over earlier periods. The first 2 years in which AORC uses daily

the GM and GCV in UAMB across the domain for the 2010-2015 map are 1.01 and 16.7%.

FIGURE 5 UAMB of 24-h AORC-CONUS 1.1 precipitation compared with GHCN-D observations from February 1979 through December 2001.

FIGURE 6 UAMB of 24-h AORC-CONUS 1.1 precipitation compared with GHCN-D observations from January 2002 through December 2009.
Of particular note in Figure 11 is the high (wet) bias in the lower Great Lakes region during 2010-2015, which is generally low (dry) during other eras. This anomaly is actually confined to the period from May 2012 through July 2013, but anomalous low bias also occurs in the region for brief periods in early 2012 and early 2013. As Figures 5-8 demonstrate, these issues have been addressed in AORC-CONUS 1.1 through monthly gridded bias adjustments through 2015, and via prioritizing CaPA over NLDAS-2 precipitation over the lower Great Lakes beginning in May 2012, but Figure 10 reveals that traces of the inaccuracies in NLDAS-2 along and outside U.S. boundaries remain, and appear as bias variability in the AORC precipitation.

FIGURE 8 UAMB of 24-h AORC-CONUS 1.1 precipitation compared with GHCN-D observations from January 2016 through December 2021.

FIGURE 9 Probability of Detection (POD) and False Alarm Ratio (FAR) time series for 24-h AORC-CONUS 1.1, NLDAS-2, and Stage IV precipitation compared with GHCN-D observations over the CONUS: (a) February 1979 through December 2001, (b) January 2002 through December 2009, (c) January 2010 through December 2015, and (d) January 2016 through December 2021. Vertical blue and red lines indicate January 1 and July 1, respectively. Individual captions describe the primary input datasets used during this period, including any constraints imposed on monthly precipitation totals. A legend for line colors is provided in Figure 3.

FIGURE 10 Geometric coefficient of variation (GCV) in multiplicative bias in AORC-CONUS 1.1 precipitation compared with GHCN-D observations from February 1979 through December 2021.

FIGURE 11 UAMB of NLDAS-2 precipitation compared with GHCN-D observations from (a) February 1979 through January 2001, (b) January 2002 through December 2009, (c) January 2010 through December 2015, and (d) January 2016 through December 2021. A color ramp legend is provided in Figures 5-8.

Figure 13 provides one example of intense precipitation, showing a maximum precipitation amount of 6.6 in./h in the Texas Panhandle on September 20, 2019, for the hour ending at 00 UTC. This result, and the precipitation in the surrounding area, is consistent with the Stage IV precipitation estimate for the same hour. The thunderstorms that produced this precipitation were reported by the NWS Weather Forecast Office (WFO) in Lubbock, TX (Lubbock WFO, 2019), to have generated "very heavy" rain, but the extreme amount seen here appears to be associated with

Figure A3 shows UAMB in AORC precipitation at station locations, as was done for the CONUS in Figures 5-8, but limited in this case to the MBRFC region. Figure A4 provides the corresponding UAMB time series. From 1979 to 2001, AORC precipitation is consistently unbiased across the region (Figures A3a and A4a). With the introduction of Stage IV daily QPE in 2002, dry biases arise in the mountainous Western portions of the domain (Figure A3b), and an annual variation in UAMB with winter minima and summer maxima appears and remains until about 2008 (Figure A4b). The climatological constraint imposed by the LIV16 data reduces the amplitude of this variation for AORC versus Stage IV, but the geometric time-averaged bias across the domain for the AORC is still somewhat low (0.93) from 2002 to 2009 (Table A1;

In Figure A5, UAMB values at individual observing stations for 2010-2015 (Figure A5a,b) and 2016-2021 (Figure A5c,d) are shown for Stage IV

Summary of geophysical requirements for the AORC over the CONUS domain.

Dataset Monthly or daily grids Spatial resolution Dates applied to AORC Notes
This period begins with LIV16 monthly precipitation grids covering the full 31-year period. For the first 23 years (February 1979 through December 2001), disaggregation from monthly LIV16 to daily amounts is based on NLDAS-2 daily (12 UTC to 12 UTC) precipitation accumulations. Beginning in January 2002, Stage IV daily precipitation grids (Du, 2011; Lin & Mitchell, 2005) mosaiced from U.S. RFC quantitative precipitation estimates (QPE) are used where available over the CONUS; consequently, the influence of NLDAS-2 on daily precipitation within the bounds of CONUS RFC regions is diminished after 2001. However, given the varied nature and priorities of RFC operations, the Stage IV mosaic occasionally has regions of missing data, typically associated with non-receipt of QPE from one or more RFCs.