Ensemble Representation of Satellite Precipitation Uncertainty Using a Nonstationary, Anisotropic Autocorrelation Model

The usefulness of satellite multi‐sensor precipitation (SMP) and other satellite‐informed precipitation products in water resources modeling can be hindered by substantial errors which vary considerably with spatiotemporal scale. One approach to cope with these errors is to combine SMPs with ensemble generation methods, such that each ensemble member reflects one plausible realization of the true—but unknown—precipitation. This requires replicating the spatiotemporal autocorrelation structure of SMP errors. The climatology of this structure is unknown for most locations due to a lack of ground‐reference observations, while the unique anisotropy and nonstationarity within any particular precipitation system limit the relevance of this climatology to the depiction of individual storm systems. Characterizing and simulating autocorrelation across spatiotemporal scales has thus been called a grand challenge within the precipitation community. We introduce the Space‐Time Rainfall Error and Autocorrelation Model (STREAM), which combines uncalibrated, anisotropic and nonstationary SMP spatiotemporal correlation modeling with a pixel‐scale precipitation error model to stochastically generate ensemble precipitation fields that resemble “ground truth” precipitation. We generate STREAM precipitation ensembles at high resolution (1‐hr, 0.1°) and evaluate these ensembles at multiple scales. STREAM ensembles consistently bracket ground‐truth observations and replicate the autocorrelation structure of ground‐truth precipitation fields. STREAM is compatible with pixel‐scale error/uncertainty formulations beyond those presented here, and could be applied to other precipitation sources such as numerical weather predictions or blended products. Although ground truth is used here to parameterize pixel‐scale uncertainty, if combined with other recent work in SMP uncertainty characterization, STREAM could be used without any ground data.

Errors in SMPs can arise from a variety of causes, including variable sensor accuracy and sampling error from infrequent satellite overpasses, and are modulated by retrieval conditions (e.g., Tan et al., 2016Tan et al., , 2018Tian & Peters-Lidard, 2007). Validation studies have demonstrated that errors tend to grow with latitude, precipitation intensity, terrain complexity, and in frozen or mixed-phase precipitation conditions (e.g., Aghakouchak et al., 2011;Shige et al., 2013). Spatial and temporal autocorrelation among SMP errors exists because the retrieval conditions and sampling limitations that impact a precipitation estimate at a given location and time tend to also impact estimates that are nearby in space or time. This autocorrelation means that error properties vary according to the level of spatial or temporal aggregation of the data (Quintero et al., 2016;Sarachi et al., 2015;Tang et al., 2016); specifically, errors generally diminish with increasing aggregation as errors tend to cancel.
When used to force water prediction models, errors in precipitation products lead to errors in model estimates of key variables such as streamflow, soil moisture, and groundwater storage (e.g., Falck et al., 2015;Hossain et al., 2004;Maggioni et al., 2011;Schreiner-McGraw & Ajami, 2020;Serpetzoglou et al., 2010). Precipitation uncertainty and error also depend on spatial and temporal resolution, with random errors tending to diminish with aggregation in space or time Quintero et al., 2016;Sarachi et al., 2015). The same is true when erroneous precipitation is used to predict streamflow, since river networks serve to aggregate rainfall-runoff errors over spatial and temporal scales (Maggioni et al., 2013;Nikolopoulos et al., 2010). Because of these issues and the limits they impose on large-scale water modeling, characterizing the space-time autocorrelation structure of SMP error at arbitrary space-time scales has been called a "grand challenge" for the precipitation community . This work takes aim at this grand challenge by attempting to model the space-time autocorrelation of SMP error; the proposed approach could also be applied to precipitation estimates from satellite-assimilating numerical weather prediction models or blended datasets due to the aforementioned broad similarities in their error/uncertainty characteristics. A significant challenge in addressing the space-time correlation structure of SMP error is the nonstationarity and anisotropy of SMP error structures, which this study hypothesizes are closely linked to the nonstationarity and anisotropy of rainfall fields themselves. For example, the spatiotemporal structure of SMP error is likely very different during an elongated frontal storm than during an isolated convective event or a highly-coherent tropical cyclone. This suggests that it would likely prove very challenging to develop robust characterizations of these structures based on a climatology of past storms, at least in a way that could be used operationally to supply uncertainty information to end users. As will be seen, we avoid such an approach.
It should be noted that the findings from the numerous validation studies that have assessed SMP accuracy relative to ground-reference data (e.g., Asong et al., 2017;Gadelha et al., 2019;Li et al., 2016;Tian et al., 2009 to name just a few) are not directly useful for SMP-based water modeling applications. This is because the metrics they calculate-such as mean squared or absolute errors, biases, and probabilities of detection and false alarms-do not readily translate into "new" (i.e., better) precipitation fields that are needed as model inputs. They do, however, highlight the challenge of providing better inputs by showing the prevalence, complexity, and magnitudes of such errors. In recognition of this, Gebremichael et al. (2011) called for a shift in SMP error characterization work towards "converting deterministic satellite rainfall estimates to probabilistic form by overlaying an estimated error distribution around the deterministic rainfall estimate." Several earlier efforts-detailed in Section 2-have answered this call by introducing techniques that can generate distributions to characterize the uncertainty of a specific SMP estimate.
While precipitation uncertainty is a random variable that can be described probabilistically-typically via a probability distribution describing the possible "true" (but unknown) precipitation rate at a given location and time-virtually all water prediction models are formulated to ingest deterministic precipitation estimates. This disconnect between probabilistic precipitation uncertainty and the need for deterministic input can be bridged by ensemble methods, in which multiple realizations of possible precipitation can be generated which, in their totality, reflect the range of uncertainty. These can be used to force an ensemble of water model simulations. Ensemble methods are well-developed in the numerical weather prediction community, with members created by perturbing initial conditions, boundary conditions, or parameters of a numerical atmospheric model or by assembling outputs from different models (Cuo et al., 2011). Ensemble methods are much less developed in the SMP community, for reasons that are difficult to summarize and beyond the scope of this work. Nonetheless, a relatively limited set of studies have used ensemble approaches to assess propagation of SMP error through hydrological and land surface models (Falck et Serpetzoglou et al., 2010;Shrestha et al., 2020). These studies relied on ground reference data both to characterize SMP uncertainty and to simulate the space-time correlation structure of SMP error.
This work introduces the Space-Time Rainfall Error and Autocorrelation Model (STREAM), which combines the simulated nonstationary, anisotropic space-time autocorrelation structure of precipitation error with pixel-scale estimates of precipitation uncertainty (Figure 1). STREAM uses an ensemble-based approach, generating realizations of "reference-like" precipitation fields-that is, fields that individually represent plausible realizations of the true (unknown) precipitation based on satellite precipitation estimates, and together represent the range of possible true rainfall ( Figure 1). While not demonstrated here, the output from STREAM can be ingested by hydrologic or land surface models without requiring any modification to these models' structures. Uniquely, STREAM's space-time autocorrelation component is calibration-free and requires no ground-reference data. This capability, demonstrated below, rests on the hypothesis-apparently confirmed by our results-that the known space-time structure of SMP fields themselves provides a useful approximation of the unknown space-time structure of SMP error fields. This paper is organized as follows: Past error modeling work is summarized in Section 2. Section 3 describes the study region and data. The methodologies for STREAM and a previous error modeling approach, a Two-Dimensional Satellite Rainfall Error Model (SREM2D), are covered in Section 4. Model results are shown and discussed in Sections 5 and 6, respectively, and the contributions of this work are summarized in Section 7.

Background-Satellite Precipitation Error Modeling
Although the terms "error" and "uncertainty" are sometimes used interchangeably in the literature, in this paper we use error to refer to quantifiable differences between specific precipitation estimates and higher accuracy "ground truth" precipitation estimates, and use uncertainty to refer to the distribution of the possible true values relative to a precipitation estimate. For instance, the error for a given precipitation estimate is a deterministic value which can be calculated provided that high-quality ground truth data is available. In the absence of ground truth, this error is unknowable, and thus the best we can hope for is to know the uncertainty for that estimate-e.g. a range or distribution of plausible values which could be estimated through a variety of methods including those reviewed here. Regardless, the past literature uses the term "error model" to describe a method that provides an estimated distribution or range of possible true values based on an SMP observation. We keep with that terminological convention.
To simulate fields of possible SMP errors in space and time, and thus generate precipitation ensembles, two components are required: (a) pixel-scale models of errors in precipitation intensity and occurrence, which characterize the SMP uncertainty associated with a single SMP estimate for a single control volume (invariably a grid cell) and time-step but do not consider the space-time autocorrelation structures between times and control volumes; and (b) space-time autocorrelation models, which simulate the correlation structure of SMP error, essentially facilitating the "stitching together" of pixel-scale errors into plausible realizations of error or rainfall fields. In prior work, both error modeling components, and the latter one in particular, have relied on extensive ground reference data for calibration. Additionally, space-time autocorrelation models have thus far neglected the nonstationarity and anisotropy in SMP error fields. Both categories face the challenge of representing the diversity of possible SMP errors-namely false alarms, missed precipitation, and hit errors (when a SMP estimate correctly detects rainfall but incorrectly estimates the magnitude). Some error models have focused entirely on hit cases while neglecting false alarms and missed cases (Reichle et al., 2007;Sarachi et al., 2015), while others handle rainfall detection and magnitude separately, resulting in disjointed formulations of precipitation uncertainty (Maggioni et al., 2014).
In pixel-scale error models, the uncertainty associated with a specific SMP estimate is described by a probability of precipitation and distribution of nonzero precipitation values which are conditional on the value of a particular SMP observation. Pixel-scale error models in literature include the Censored Shifted Gamma Distribution (CSGD; Wright et al., 2017), Precipitation Uncertainties for Satellite Hydrology (PUSH; Maggioni et al., 2014), and Probabilistic QPE using InfraRed Satellite Observations (PIRSO; Kirstetter et al., 2018), among others (Gebremichael et al., 2011;Yan & Gebremichael, 2009). Sarachi et al. (2015) utilized a generalized normal distribution to model SMP uncertainty across scales by interpolating pixel-scale model parameters across various space-time resolutions. This approach considered hit cases only and required calibration at several scales. Pixel-scale error models can be trained using co-located time series of SMP and ground reference data and are therefore amenable to calibration using records from sparse rain gage networks. Pixel-scale error models can also be "regionalized" by pooling together available training data from across a region to calibrate a regional error model (Hartke et al., 2020;Khan & Maggioni, 2020;Li et al., 2021). However, without a model of space-time autocorrelation of SMP errors, pixel-scale error models have no way to relate the uncertainty of an SMP estimate in one pixel to the uncertainty in nearby pixels in space and time.
Space-time autocorrelation models thus far have used calibration to characterize the climatological autocorrelation structure of precipitation error. The SREM2D was developed by Hossain and Anagnostou (2006) in order to generate ensembles of SMP-like rainfall fields which preserve the error characteristics of SMP fields. SREM2D combines a model of precipitation intensity and occurrence error with a space-time autocorrelation modeling component. Though SREM2D models the spatial correlation structure of SMP error fields as isotropic, these error fields often exhibit substantial anisotropy, reflecting the anisotropy inherent in real storm structures (Niemi et al., 2014;Zawadzki, 1973). Furthermore, SREM2D was not designed to represent differences in spatial autocorrelation of SMP error across a study area (i.e., spatial nonstationarity) and assumes that the average spatial correlation length calculated for a study region is representative of that region for all locations and time steps. Since the spatial correlation structure of SMP and SMP error can vary greatly at regional scales, this precludes SREM2D from application to large (i.e., subcontinental-to-global) scales. Because SREM2D relies on a climatological depiction of error autocorrelation, the model training and calibration process requires a gridded (or at least spatially extensive) ground-based precipitation data set. Such datasets are lacking in many parts of the world (Kidd et al., 2017), further limiting is general applicability. Though applied to radar rainfall rather than SMP, Villarini et al. (2009) introduced an error-driven generator to stochastically perturb radar fields while accounting for the spatial correlation of multiplicative error. However, this approach considered hit cases only, neglected temporal correlation and anisotropy in error correlation structures, and used a computationally intensive method to generate Gaussian noise. Space-time autocorrelation models that rely on climatologically-calibrated parameters to simulate space-time correlation are not designed to simulate the unique correlation structure-that is, varying degrees of anisotropy and correlation distances in space and time-of precipitation error that is associated with each new storm system.
The STREAM framework introduced in this article utilizes a calibration-free approach to modeling the space-time autocorrelation structure of precipitation error and provides a way to leverage pixel-scale estimates of precipitation uncertainty in space and time. Although this work utilizes a subcontinental study area, STREAM's approach of reproducing the local spatial autocorrelation structures of SMP fields enables continental-to-global-scale application.

Study Region
The study area covers the central US (Figure 1; 100°-85°W, 35°-45°N), a region known for high agricultural production (Prince et al., 2001) and also marked by flood events often caused by heavy, long-lasting precipitation that severely impact local communities (e.g., the 1993 Mississippi River and 2008 Iowa flood events; Budikova et al., 2010;Najibi et al., 2016;Nakamura et al., 2013;Smith et al., 2013). Intense events provide a significant portion of the region's annual precipitation total, and convective storm systems are frequent during the warm summer period. The topography of this region is fairly uniform (Andresen et al., 2012).

Rainfall Data
The NASA Integrated MultisatellitE Retrievals for Global Precipitation Measurement (IMERG) Version 06 product is available globally at a 30-min, 0.1° resolution and consists of three latency options Huffman, Stocker, et al., 2019): IMERG Early (4-hr latency; lacks some data sources and data processing elements of longer latencies), IMERG Late (12-hr latency), and IMERG Final product (approximately 2.5-month latency; includes a gage-based correction). IMERG precipitation estimates are calculated by merging data from passive microwave (PMW) sensors, intercalibrating PMW estimates with a dual-frequency precipitation radar aboard the Global Precipitation Measurement (GPM) Core Observatory satellite, and interpolating (or "morphing") the resulting estimates in time using water vapor motion vectors from MERRA-2 and GEOS-5 (see Tan et al., 2016 for more details). This study uses IMERG Early, aggregated to the hourly scale to match the radar-gage ground reference product; the approach could be readily applied to other IMERG latencies, as well as to other SMP products.
The NEXRAD Stage IV radar-gage product, available over Continental United States (CONUS) at an hourly, roughly 1/24° resolution (Du, 2011), is used as the ground reference in this study. Although NEXRAD's Stage IV product contains errors stemming from issues such as beam blockage and range from the nearest radar, we assume that the errors in this product are infrequent and negligible relative to IMERG, consistent with previous SMP studies (e.g., Aghakouchak et al., 2011) and consistent with our own prior experience using the data set in this region. We upscaled Stage IV to IMERG's native 0.1° resolution using bilinear interpolation.
IMERG-Early (hereinafter IMERG) and Stage IV data from 2005 to 2007 were used to calibrate both the CSGD and SREM2D models, while data from 2008 to 2013 was used to validate STREAM and SREM2D simulations. To minimize issues related to frozen precipitation and maintain an accurate ground-reference during model calibration and validation, Stage IV and IMERG data were used only for March through October, excluding months with greater likelihood of frozen precipitation in the study area (November-February). This is admittedly a limitation of our study that should be addressed in the future. For both Stage IV and IMERG, the threshold for precipitation detection was set to 0.1 mm/hr, below which all hourly estimates were set to zero. This detection threshold is consistent with previous SMP and radar-based studies (Germann & Zawadzki, 2002;Li et al., 2021).

Wind Data
As an approximation of the "steering winds" that govern the motion of storm systems, 850 mb wind fields were retrieved from the global MERRA-2 reanalysis product (Gelaro et al., 2017; Global Modeling and Assimilation Office (GMAO), 2015) at a hourly 0.5° by 0.625° resolution. These were regridded to 0.1°. These wind fields were used together with IMERG fields to simulate the temporal evolution and autocorrelation structure of SMP error in STREAM (described in Section 4.2). Testing using 500 mb wind vectors produced very similar results (not shown). Other sources of motion vectors could be used, including potentially those used in IMERG's aforementioned "morphing" space-time interpolation scheme. Those motion vectors are not publicly available, however, so were not considered here. This point is discussed further in Section 6.4. 10.1029/2021WR031650 6 of 21

Overview
The STREAM merges an autocorrelation model that is nonstationary and requires no ground reference data with pixel-scale uncertainty estimates which are generated by a calibrated SMP error model ( Figure 1). Section 4.2 describes the STREAM methodology for simulating space-time autocorrelation structures and combining these structures with pixel-scale uncertainty estimates to generate precipitation ensembles. Section 4.3 introduces the pixel-scale error model used in this work, the CSGD error model, and Section 4.4 describes a previously developed approach, the SREM2D. Nerini et al. (2017) introduced a non-stationary stochastic generator for radar precipitation fields using the short-space Fourier transform (SSFT). The Fourier power spectrum of a precipitation field (e.g., from weather radar or SMP) is convolved with Gaussian white noise to generate correlated Gaussian noise fields and ultimately produce an ensemble of precipitation forecasts which maintain the anisotropy and spatial correlation structure of observed radar rainfall fields. This methodology reproduces both the global and local power spectra of radar fields by using a moving window scheme. This moving window can thus capture spatial nonstationarity in field properties, since at any particular location the correlated noise is based on properties within the window. This SSFT-based non-stationary noise generator has since been incorporated into the pySTEPS Python library for short-range probabilistic precipitation forecasting, as a tool for generating ensemble nowcasts (Pulkkinen et al., 2019). While Nerini et al. (2017) used this tool to generate stochastic precipitation fields that replicate the local spatial correlation structure of observed radar rainfall fields, the authors emphasized that it could be applied to other applications involving complex non-stationary fields. Notably, this approach requires no calibration against ground truth measurements or parameterization of long-term precipitation behavior.

Correlated Noise Ensemble Generation
In the first step of STREAM, the pySTEPS noise generator described in Section 4.2.1 is applied to stochastically generate Gaussian noise that replicates the local spatial correlation structure of an IMERG field, including anisotropy ( Figure 2). After the initial noise field has been created for each ensemble member, each noise field is advected at each time step via steering winds (described in Section 3.3) using a semi-Lagrangian scheme. In such a scheme, a time derivative (in this application, 850 mb wind vectors) is used to calculate where the value arriving at a grid cell, termed the arrival point, originated from in the previous time step (Lauritzen et al., 2011;Staniforth & Cote, 1991). This semi-Lagrangian scheme is advantageous over a strictly Lagrangian one because it does not allow individual parcels (in our case, noise values) to all advect into a single region and leave some regions without parcels. Our semi-Lagrangian scheme also incorporates a new instance of correlated noise, or a "shock term" (Nerini et al., 2017) which is the second term on the righthand side of Equation 1: where is a noise value to be calculated at time t and position (i, j) in the field and −1, − , − is the noise value that has been advected by north-south and east-west wind vectors v t and u t from position (i − v t , j − u t ) at time step t −1 to position (i, j) at time step t. v t and u t are obtained by multiplying MERRA2 wind vectors, originally in units of m/s, by 3,600 s and dividing by 11,000 m, the approximate width of an IMERG pixel, to obtain units of 0.1° pixel hr −1 . is a new correlated Gaussian noise field based on the structure of IMERG at time t. The shock term is used to perturb the noise field and to incorporate the current IMERG spatial correlation structure, , into the noise field at each time step. This allows the noise field to evolve over time and to reflect the nonstationary spatial correlation structure of IMERG and IMERG error. We assume that the error fields are first order autoregressive in time, calculating α as the Pearson correlation coefficient between IMERG fields at time t and t −1 (Figure 2). Analysis of the temporal autocorrelation function (ACF) of IMERG error fields supports this autoregressive assumption (results not shown). After the noise ensemble has been generated for all time steps in the study period, the correlated Gaussian noise ensemble N(0,1) is transformed to uniform noise U(0,1) using the error function: where n gaussian is the noise field described in Equation 1.
Note that correlated noise fields rn t and temporal coefficient α are only calculated based on IMERG at time t when the IMERG field is "rainy," defined as when at least 5% of the study area registers rainfall ( Figure 2). During time steps with non-rainy fields, which are frequent at the hourly scale, the spatial correlation structure from the most recent rainy field is used to generate rn t . In either case, no parameters depend on a long-term climatology.

Precipitation Ensemble Generation
In the final step of STREAM, the correlated uniformly-distributed noise ensemble is combined with the CSGD error model. The CSGD model and training scheme methodology are described in Section 4.3. The standard uniform noise values generated by STREAM's correlation modeling component (Section 4.2.2) are inputted to the inverse CDF of the conditional CSGD generated for the unique IMERG rain rate estimate at each hourly time step and pixel. Thus, each noise value corresponds to a value of possible true precipitation conditional on a given IMERG estimate (further described in Section 4.3), correlated with surrounding pixels. The uniform noise ensemble is censored at 0.995 to guard against unrealistically extreme precipitation values generated when very high noise values are used to select a precipitation value from conditional CSGDs with long tails. The output of STREAM consists of an ensemble of three-dimensional (north-south, east-west, time) precipitation fields, with each ensemble member representing one realization of the possible true precipitation across the study region for all time steps in the study period.
We also generated "uncorrelated" precipitation ensembles by using white (uncorrelated) noise as input to the inverse CDF of conditional CSGDs, thus neglecting spatial and temporal correlation of errors. The precipitation ensemble generated in this way is henceforth referred to as the uncorrelated ensemble, though they are not strictly uncorrelated since the resulting precipitation fields will inevitably exhibit some autocorrelation stemming from the IMERG precipitation rates (albeit much weaker than that of the ground-reference, IMERG, or STREAM fields).

Censored Shifted Gamma Distribution Error Model
The CSGD model framework was introduced by Scheuerer and Hamill (2015) to model uncertainty in numerical weather forecasts, and was adapted in Wright et al. (2017) to characterize pixel-scale SMP error across CONUS. The CSGD is an adaptation of the two-parameter gamma distribution (here written in terms of its mean and standard deviation, but which can be reparametrized in terms of shape and scale parameters) with an additional "shift" parameter δ that shifts the probability density function (PDF) leftward ( Figure 3a). The density left of zero represents the probability of zero precipitation, while the density at any value greater than zero represents the likelihood of that amount of precipitation (Figures 3a and 3b). The reparameterized distribution is then left-censored at zero, replacing all negative values with zero. While previous precipitation error models either focused only on hit errors or required separate components to model rainfall occurrence and magnitude (see Section 2), the CSGD error model characterizes both the discrete and continuous components of satellite precipitation error using this single distribution. A regression model is trained based on contemporaneous co-located SMP and ground-truth observations to produce model parameters α 1 , α 2 , α 3 , α 4 ,… and, at any time t, unique "conditional" CSGD parameters μ(t), σ(t), and δ(t) as a function of those parameters and the SMP estimate R s (t): where is the mean of the SMP time series during the training period and (μ c , σ c , δ c ) are the parameters of the climatological CSGD, which is a CSGD fit to a time series of SMP observations. The climatological CSGD approximates the ground-truth climatology of a pixel or a region. All conditional CSGDs are thus versions of this climatological CSGD, adjusted (i.e., conditioned) to reflect the systematic bias and range of random error at a given SMP rain rate estimate, R s (t). Generally, the range of random error increases and the probability of zero precipitation reflected by the conditional CSGD decreases as rain rate increases (Figure 3a). The regression model defined by Equations 3-5 allows the model to capture nonlinear behavior of SMP error across increasing precipitation rates. A simpler linear regression system can also be used in the  regression framework can also incorporate additional contemporaneous covariates C 1 (t), C 2 (t),…, C n (t), such as temperature or precipitable water, that could help to further characterize SMP uncertainty. These covariates are incorporated into the regression framework using an adjusted version of Equation 3: For more information on the CSGD error model framework, see Scheuerer and Hamill (2015) and Wright et al. (2017).
In this study, we use wetted area ratio (WAR) for the first time as a covariate in the CSGD error model. WAR for any IMERG estimate R s (t) at a given pixel is the proportion of pixels within a distance of r pixels that record nonzero rainfall at time t. WAR ranges from a value of 0 when no pixels within radius r record nonzero precipitation, to 1, when all pixels within radius r record precipitation. Because WAR captures the spatial "context" of an IMERG observation, it is a useful covariate for predicting detection/non-detection performance within the CSGD framework. Figure 3c demonstrates that the probability of an IMERG estimate of nonzero rainfall being a correct detection is much greater if the associated WAR is high (i.e., close to 1.0) than if it is low. Likewise, the probability of IMERG correctly not detecting rainfall is highest when WAR is close to 0 (Figure 3c). A radius of r = 10 pixels was used to calculate WAR in this work; higher and lower values of r did not significantly alter CSGD error model performance (results not shown).
In this study, CSGD error model parameters are trained using time series "pooled" together from 25 co-located IMERG and Stage IV pixels (i.e., a 0.5° × 0.5° area). CSGD error model training for each 0.5° × 0.5° window in the study area is performed using the regression system defined in Equations 3-5 with the WAR covariate. The parameter estimation is completed via mean continuous ranked probability score (CRPS) minimization methods described in Scheuerer and Hamill (2015). Using time series from multiple pixels reduces sampling error and generates a more robust error model than model training using time series from a single IMERG pixel (not shown). This approach is suitable for the relatively homogenous terrain in the study area but may not be appropriate in more complex terrain where IMERG error characteristics are more closely tied to terrain heterogeneity.

SREM2D
The SREM2D error model was designed to generate ensembles of "satellite-like" fields that replicate the error properties of an SMP data set relative to a ground-reference (Hossain and Anagnostou (2006)). SREM2D separately accounts for the spatial correlation of detection errors and precipitation rate errors, and uses the Turning Bands algorithm (Mantoglou & Wilson, 1982) to generate 2-D Gaussian noise with correlation lengths matching that of the conditional error of SMP fields. In this work, SREM2D is run "in reverse" to generate reference-like rainfall fields that are closer to the ground-reference by replicating the error properties of Stage IV relative to IMERG Early. SREM2D has been used in this fashion previously in Falck et al. (2015) and Maggioni et al. (2013) to improve model-simulated streamflow estimates compared against hydrographs from SMPs. SREM2D parameters are trained using Stage IV and IMERG data for the 2005-2007 training period detailed in Section 3.2 and are listed in Table S1. Consistent with earlier SREM2D studies, additional trial-and-error calibration is needed (specifically, adjustment of the mean parameter) to minimize bias in SREM2D-perturbed fields. Note that these error parameters are calculated for the reference, Stage IV, relative to IMERG, highlighting SREM2D's need for ground reference data to characterize not only pixel-scale errors (akin to the CSGD approach) but also the spatiotemporal autocorrelation process (unlike STREAM, which doesn't require ground reference for this purpose).

Ensemble Performance Metrics
STREAM, SREM2D, and the uncorrelated ensembles were run at an hourly time step with an ensemble size of 50 for the evaluation period 2008-2013, excluding winter months (November-February). The spatial ACF, temporal ACF, probability of detection (POD), probability of false alarm (POFA), root mean square error (RMSE), and Containing Ratio were used to evaluate the performance of IMERG, the STREAM ensemble, the uncorrelated ensemble, and the SREM2D ensemble. The Containing Ratio (CR) is the proportion of observed data bracketed by the range of an ensemble, and has been used within the forecast verification and runoff modeling community to assess ensemble accuracy ( where [⋅] is an indicator function that equals 1 when the observed rainfall obs( ) falls between the lowest and highest values of the ensemble at time t and that equals 0 when the observation falls outside ensemble bounds. For deterministic evaluation metrics, including RMSE, POD, and POFA, the mean of the ensemble was evaluated.
We assessed the spatial autocorrelation length of IMERG and STREAM noise fields, which is the separation distance at which correlation is approximately equal to 0.3678 (as in Hossain & Anagnostou, 2006;Hossain & Huffman, 2008). Spatial and temporal linear ACFs were calculated for each ensemble member to assess the ability of STREAM to generate reference-like precipitation fields in space and time. We note that assessing the space-time correlation structure of precipitation ensemble fields is not equivalent to assessing the space-time correlation structure of the SMP error introduced to create these fields; however, the correlation structures of SMP error fields can vary depending on the specific mathematical definition of SMP error. Since precipitation fields that resemble a ground-reference are the ultimate objective of an ensemble-based SMP error model, we chose to evaluate the ability of STREAM ensemble members to replicate the space-time correlation structures of Stage IV.
The above metrics were calculated for all precipitation datasets and ensembles at four space-time resolutions: 1-hr 0.1°, 1-hr 0.25°, 24-hr 0.1°, and 24-hr, 0.25°. Precipitation fields were regridded to coarser spatial resolutions using bilinear interpolation. Figure 4 presents the spatial autocorrelation length of Stage IV, IMERG and STREAM noise ensemble members for each hourly time step in June 2008. The spatial autocorrelation length of the noise fields generally follows that of IMERG as intended by the STREAM methodology, except during nonrainy periods (i.e., June 13; Figure 4) when the STREAM autocorrelation model uses the spatial structure of the most recent rainy field to generate correlated noise. The autocorrelation length of STREAM noise fields follows Stage IV autocorrelation length when IMERG autocorrelation length matches Stage IV but deviates from Stage IV when IMERG autocorrelation length does not match that of Stage IV. Figure 5 shows IMERG, Stage IV, and outputs from the uncorrelated ensemble, STREAM autocorrelated noise and ensemble, and SREM2D ensemble for a six-hour period during a storm event in 2008 that led to heavy flooding in Cedar Rapids and Iowa City. Ensemble members shown in Figure 5 were chosen at random. While the uncorrelated ensemble fields do not resemble precipitation structures observed by Stage IV, STREAM and SREM2D fields visually resemble realistic precipitation structures from Stage IV, and STREAM also reproduces the observed anisotropy. The spatial correlation features generated in the STREAM noise fields clearly translate to similar spatial correlation features in STREAM precipitation fields. SREM2D fields exhibit less fine-scale anisotropic detail than STREAM, presumably due to its isotropic formulation.

Results
Figures S1 and S2 in Supporting Information S1 show model outputs for a disorganized convective precipitation case and a stratiform precipitation case, respectively. Figure 6 provides additional event-scale analysis of STREAM, showing cumulative hourly precipitation and daily precipitation rates for a heavy rainfall event in June 2013 in southcentral Wisconsin. Area-averaged precipitation was calculated for the inset area in Figure 6a. The spatial ACF of precipitation within the inset area was also calculated to assess STREAM ensemble performance during this event, confirming STREAM ensemble members' ability to replicate the spatial structure of Stage IV rainfall (Figure 6c). The STREAM ensemble brackets the observed cumulative precipitation over the course of the event, reducing IMERG's stark overestimation (Figure 6b), and generally brackets observed precipitation rates at the daily scale, with the exception of 2 days (Figure 6d). Note that the uncertainty described by the range of the STREAM ensemble is small on days with low IMERG estimates, but widens when IMERG observes nonzero rainfall, reflecting the greater range of random error in nonzero IMERG estimates (Figure 6d). The ensemble spread brackets the cumulative precipitation at the end of May in all years, regardless of whether IMERG over-or underestimates spring cumulative precipitation, except 2008, a year in which IMERG significantly underestimated cumulative precipitation. Precipitation in 2008 was well above the climatological average for all months shown in Figure 7, due in part to unprecedented rainfall occurring in the end of May and early June-conditions that likely pose a particular challenge for error modeling. Figure 8 presents RMSE, POD, and POFA calculated over the entire study area and validation period for IMERG and all ensemble products at four space-time resolutions. The RMSE of IMERG and all ensemble means increases sharply for extreme hourly rainfall rates (>8 mm/hr). The STREAM ensemble mean and uncorrelated ensemble mean exhibit reduced RMSE at all scales and across all rain rates, with the exception of heavy rain rates at an hourly scale. The mean of the uncorrelated ensemble performs identically to the STREAM ensemble mean at all scales because both ensemble means reflect a bias-corrected version of IMERG, but the range of the STREAM ensemble at coarser resolutions is much greater (compare Figure S5 in Supporting Information S1 to Figure 6). The higher RMSE of the SREM2D ensemble mean relative to the STREAM ensemble aligns with results from Maggioni et al. (2011), who found that the relative RMSE of SREM2D-perturbed rainfall was slightly greater than that of the original satellite product.
The STREAM ensemble mean and uncorrelated ensemble mean exhibit higher POD across all space-time scales. Notably, the STREAM ensemble mean and uncorrelated ensemble mean are able to simultaneously increase the POD while reducing the POFA at the hourly scale for precipitation rates greater than 1 mm/hr. The POFA of the STREAM and uncorrelated ensemble means are slightly higher than IMERG at the daily scale. The probability of false alarm at rates below 1 mm/hr at the hourly scale and below 10 mm/day at the daily scale is slightly higher for the STREAM ensemble mean relative to IMERG. However, at rates greater than 1 mm/hr (at both 0.1° and 0.25° spatial resolutions), the probability of false alarm is substantially lower for the STREAM ensemble mean than for IMERG (Figure 8).
The spatial ACFs in the east-west and north-south -directions (x-and y-, respectively) and the temporal ACF of IMERG, Stage IV, and ensemble fields are shown in Figure 9. Only 10 members of each ensemble from STREAM, SREM2D, and the uncorrelated ensemble are displayed for clarity; since the ACFs are calculated over a long validation period, the ACFs of individual members within each error modeling method are nearly identical.
The correlation structure of STREAM ensemble fields nearly matches that of Stage IV at every scale (Figure 9), although the spatial ACF of STREAM ensemble fields-both in the east-west and north-south directions-is slightly lower than the spatial ACF of Stage IV at the hourly scale. The lower spatial autocorrelation of STREAM fields relative to Stage IV is visible in output fields ( Figure 5). The uncorrelated ensemble members exhibit much lower spatial and temporal autocorrelation than Stage IV at the hourly scale, with the greatest difference at the finest spatial resolution. SREM2D ensemble fields replicate the spatial ACF of Stage IV well at the hourly scale, but also exhibit a lower temporal ACF than Stage IV at the hourly scale. SREM2D's underestimation of spatial autocorrelation at the daily scale is likely due to the effect of a low hourly temporal ACF during aggregation of ensemble fields from an hourly to daily resolution. Once ensemble fields are aggregated to a coarser resolution (24-hr, 0.25°), all error model ensembles roughly replicate the temporal and spatial ACFs of Stage IV.
The Containing Ratio (CR) and average ensemble range of the STREAM ensemble, SREM2D ensemble, and uncorrelated ensemble as a function of precipitation rate across four resolutions are presented in Figure 10. The STREAM ensemble consistently maintains a high CR (generally >0.8) across scales, though it dips at extreme rain rates. The STREAM ensemble brackets approximately 50% (70%) of the instances when ground-reference rainfall is greater than 8 mm/hr (35 mm/day) at an hourly (daily) resolution. SREM2D has a high containing ratio for Stage IV observations of zero precipitation but experiences a sharp decrease for nonzero values. The SREM2D ensemble fails to capture many of the nonzero ground-reference observations that the STREAM ensemble and uncorrelated ensemble successfully bracket, even though the SREM2D ensemble range is generally similar to or greater than that of the STREAM ensemble. The performance of the uncorrelated ensemble degrades with increasing scale; at a daily, 0.25° scale, the uncorrelated ensemble fails to capture over 60% of the instances when the ground-reference observes rain rates greater than 30 mm/day. This difference in performance at coarser space-time scales occurs because the STREAM ensemble incorporates error correlation structures which allow ensemble members to simulate regional over-and underestimation by IMERG, ensuring greater variability among ensemble members. Meanwhile, the uncorrelated ensemble aggregates adjacent pixels with randomly simulated under and over-estimation, averaging out random errors and preventing any simulation of regional over-or underestimation and resulting in a much lower ensemble range. 6. Discussion

Performance of STREAM Across Multiple Spatio-Temporal Scales
While the full STREAM ensemble at any given point in time and space represents the range of random error associated with IMERG, the ensemble mean represents the IMERG estimate adjusted only for systematic bias. Therefore, the mean of the ensemble will outperform individual ensemble members in terms of RMSE by strictly addressing systematic bias, but cannot capture the range of IMERG uncertainty on its own. The mean of the STREAM ensemble consistently reduces RMSE relative to IMERG across scales and rain rates except for hourly intensities greater than 10 mm/hr (Figure 8). This is likely due to the difficulty of predicting missed cases associated with heavy ground-reference rainfall. The STREAM ensemble mean has a higher or similar probability of detection relative to IMERG across rain rates and scales, with the greatest improvements achieved at lower precipitation rates ( Figure 8). A portion of this improvement is due to the incorporation of the wetted area ratio (WAR) in the pixel-scale CSGD error model, which helps predict IMERG missed cases based on the presence or absence of nearby IMERG rainfall ( Figure S1 in Supporting Information S1). At the hourly scale, the STREAM ensemble mean shows both a higher POD and lower POFA due to the use of the WAR covariate in the pixel-scale CSGD error model ( Figure S1 in Supporting Information S1). By removing the censoring of uniform noise greater than 0.995, the POD of STREAM can be slightly increased for low rain rates at the expense of a slight increase in POFA ( Figure S2 in Supporting Information S1). However, removal of this censoring component in STREAM can lead to "INF" values in the precipitation ensemble when extremely high noise values are ingested by the inverse CDF of conditional distributions. The STREAM ensemble's ability to bracket ground-reference observations at event and seasonal scales (Figures 6 and 7) suggests that STREAM would be well-suited to creating inputs to hydrologic, land surface, or drought monitoring models-a direction that will be pursued in follow-on work.
We reran ensemble generation and analysis using Stage IV in place of IMERG in the correlated noise generation scheme ( Figure 3) to understand if applying the ground-reference spatial correlation structure significantly improves ensemble performance; it does not (results not shown). This indicates that IMERG, although imperfect, provides valuable information about error correlation structures, on par with the information available through a ground-reference product.
STREAM was run for a 50-member ensemble in this work. Although performance metrics at the original 1-hr, 0.1° resolution of STREAM fields are not impacted by an increase in ensemble size past 25, performance metrics at coarser resolutions (24-hr, 0.25°) improve with increasing ensemble size until a size of roughly 50 (results not shown). This reflects the increasing number of permutations of native resolution errors and error correlation structures that are combined during rescaling to coarser resolutions, leading to a greater range of possible precipitation estimates at coarse resolutions; the implications of this for water resources modeling are unclear and will be explored in future work. It is likely that at resolutions coarser than 24-hr and 0.25°, a larger STREAM ensemble could be beneficial.
Although an ensemble-based approach is currently the most feasible way to incorporate precipitation uncertainty into applications that ingest deterministic data, a large number of ensemble members may be required to accurately represent precipitation uncertainty. This may require prohibitive computing resources for the storage of precipitation outputs and the computational demands of hydrologic or land surface models. Although this study does not address this challenge, we note that very little work has been done in attempting to adapt the structure of environmental models to probabilistic precipitation inputs. As summarized in Nogueira (2020), large-scale precipitation estimates involve substantial uncertainties; thus, the adaption of models to ingest probabilistic precipitation data is an appropriate way to account for precipitation uncertainty (e.g., Hartke et al., 2020).

Comparison With SREM2D Model
The STREAM ensemble meets or exceeds the performance of the SREM2D ensemble at all resolutions and rain rates except for the most extreme hourly rain rates (>10 mm/hr) when SREM2D exhibits a slightly higher containing ratio (Figure 10). SREM2D shows a particularly low containing ratio for light rainfall rates, meaning that SREM2D-perturbed IMERG fields often fail to bracket observed light rainfall rates. Visually, SREM2D fields exhibit more isotropic structure than IMERG, Stage IV, or STREAM ensemble fields ( Figure 5, S1 and S2 in Supporting Information S1). The noticeable drop in CR that occurs when observed rain rate shifts from zero to nonzero (Figure 10) is likely due to the separate handling of rainfall occurrence and hit errors in SREM2D. Even in the presence of plentiful ground data, a climatologically-trained approach to space-time correlation modeling, such as that used in SREM2D, is potentially problematic: each storm system is unique, so properties will deviate from a climatological training. The STREAM approach, in contrast, infers correlation structure properties directly from each storm and thus foregoes the need for calibration or ground-reference data. STREAM's ability to outperform SREM2D suggests that the use of observed SMP space-time correlation is an attractive and practical alternative to the calibration-based simulation of error correlation.

Comparison With Uncorrelated Error Modeling Approach
The briefest visual analysis of the uncorrelated ensemble fields reveals that they do not resemble real precipitation, instead exhibiting scattered precipitation and little structure ( Figure 5 and S1 in Supporting Information S1). The uncorrelated ensemble's ability to bracket observed precipitation rates in fact worsens as the ensemble is aggregated to coarser resolutions ( Figure 10). The uncorrelated error modeling approach demonstrates the inability of the pixel-scale CSGD error model to generate precipitation ensembles that preserve the space-time correlation structure of SMP error and that bracket ground observations across scales when not paired with STREAM or another autocorrelation modeling approach. The improved performance of STREAM relative to the uncorrelated ensemble emphasizes the central importance of simulating the space-time correlation structure of precipitation error. We note that the mean or median of each conditional distribution generated by the CSGD model could theoretically be taken and combined to generate a precipitation field without the use of any noise fields; however, this only generates a single field of precipitation in space and time and only corrects for systematic bias. This method would not simulate any of the substantial random errors that we know occur in SMP estimates of precipitation or allow hydrological models (or other water resources applications) to account for the range of possible SMP error.

STREAM Future Adaptions
Although the demonstration of STREAM in this work uses the CSGD error model, other pixel-scale error models, such as PUSH (Maggioni et al., 2014) or PIRSO  could likely be used within STREAM to represent IMERG uncertainty across arbitrary space-time scales. The CSGD error model is uniquely useful within STREAM, however, due to its ability to incorporate an arbitrary number of covariates to constrain pixel-scale uncertainty estimates.
One limitation of STREAM in modeling the error of stratiform rainfall that covers a large portion of a study area is that the nature of the uniform noise distribution prevents the model from modeling under-or over-estimation of precipitation over the majority of the study area (even though that may sometimes be the case). This deserves further consideration but is beyond our scope here. The straight-forward solution to this limitation is to select study areas that are substantially larger than the size of regional storm systems.
Although STREAM is run using an hourly time step in this work due to the hourly resolution of the uncertainty estimates generated by the CSGD error model, STREAM can be run at a finer scale (e.g., 30-min) if the uncertainty estimates being used are at the same resolution. The 850 mb steering winds from MERRA2 that are used here have a latency of several weeks. These data were chosen for illustrative purposes only; steering wind data could be obtained from lower-latency datasets such as from data-assimilating numerical weather forecasts or from the motion vectors used in the IMERG morphing scheme (Tan et al., 2019). This latter option would increase the consistency between how errors propagate over space and time within IMERG and how the correlated noise is propagated in STREAM's semi-Lagrangian advection scheme. This option was not pursued here since the IMERG motion vectors are not publicly available; this may be pursued in future work.

Conclusions
The potential of satellite multi-sensor precipitation (SMP) products-and other large-scale precipitation sources with similar error/uncertainty properties, such as satellite-assimilating numerical weather prediction (NWP) models and "blended" datasets that combine NWP and SMP data-in water resources modeling is limited by their uncertainties, which can mischaracterize both precipitation occurrence and intensity. Uncertainty during extreme precipitation events is particularly problematic for applications which assess hazards such as flooding or landsliding (e.g., Hartke et al., 2020;Jia et al., 2020;Prakash et al., 2016). Precipitation uncertainty and error vary according to spatial and temporal resolution, with random errors tending to "cancel out" when aggregated in space and time. SMP errors are autocorrelated in space and time, however, leading to regional (i.e., watershed scale) over-or underestimation by satellite-based products. This problem can be remedied using ensemble generation techniques that produce multiple plausible realizations of the unknown true precipitation field conditioned on the SMP observations. To incorporate precipitation uncertainty into applications which consider accumulated precipitation, such as flood prediction or drought monitoring, ensemble members must replicate the space-time correlation structure of precipitation error. This has been called a grand challenge within the precipitation community , while the usability of other large-scale precipitation datasets would benefit from breakthroughs.
The STREAM combines space-time correlation structures with a pixel scale precipitation error model to generate precipitation ensembles that can "bracket" the magnitude and replicate the correlation structure of higher-accuracy "ground truth" rainfall fields. SMP-based STREAM ensembles are generated at high resolution (1-hr, 0.1°) and are shown to outperform the satellite product IMERG at several spatiotemporal scales. STREAM relies minimally on ground-reference data during calibration and requires no ground-reference data to run. Specifically, the approach taken to model spacetime correlation does not require any ground data and does not even require a "training period," since all necessary properties are inferred from IMERG and wind fields. STREAM ensembles generated at a high resolution can be aggregated to arbitrary space-time scales for use in hydrologic or land surface models while preserving the characteristics of real precipitation at these scales. The ensemble output of STREAM can be ingested in water modeling applications with no modification to those models' structures. This enables water resource predictions that reflect input precipitation uncertainty, though the computational demands of ensemble simulations may become burdensome.
Pixel-scale uncertainty (i.e., the probabilistic uncertainty surrounding a satellite-based precipitation estimate at a single pixel and time step) is the most feasible way to characterize SMP uncertainty around the world. In data-limited regions, pixel-scale precipitation error models can leverage available ground-reference data (i.e., sparse rain gage records), and pixel-scale uncertainty estimates can also be obtained via other approaches (i.e., Kirstetter et al., 2018;Li et al., 2021). Taken alone, however, pixel scale uncertainty is of limited value in water resources applications because it offers no way to connect uncertainty estimates in space and time and thus no way to create realistic ensemble fields of precipitation error that can be rescaled to arbitrary resolutions. STREAM allows users to combine pixel-scale precipitation uncertainty in space and time while accounting for nonstationary SMP error correlation structures. While not explored here, it appears that any pixel-scale uncertainty model-and not just the CSGD approach used here-can fit into the STREAM framework.
To be applicable to continental-to-global scale applications, a space-time SMP error model must rely minimally or not at all on ground-reference data. STREAM is shown to outperform a previous rainfall error model (SREM2D), which utilized extensive gridded ground-reference data for training SMP error and correlation properties. This work demonstrates that the anisotropic, nonstationary space-time correlation structure of SMP errors can be modeled using only SMP fields and atmospheric motion vectors. Meanwhile, ongoing work has demonstrated that the GPM Dual Precipitation Radar (DPR) instrument, which is quite accurate relative to other space-based microwave and infrared sensors, can be used to train pixel-scale error models (Khan et al., 2018;Li et al., 2021). Combining that approach with STREAM would completely eliminate the need for ground reference data, providing tools that could be used anywhere around the globe-though not without some shortcomings (Li et al., 2021). In addition, the nonstationary and computationally efficient nature of the STREAM ensemble generation means that it could be applied at a global scale. Thus, while challenges remain, we believe that this work constitutes a meaningful step toward solving the grand challenge of characterizing precipitation error across arbitrary space-time scales.