Statistical downscaling provides a technique for deriving local-scale information of precipitation and temperature from numerical weather prediction model output. The K-nearest neighbor (K-nn) is a new analog-type approach that is used in this paper to downscale the National Centers for Environmental Prediction 1998 medium-range forecast model output. The K-nn algorithm queries the model archive for days similar to a given feature vector and, using empirical orthogonal function analysis, identifies a subset of K days similar to the feature day. These K days are then weighted using a bisquare weight function and randomly sampled to generate ensembles. A set of 15 medium-range forecast runs was used, and seven ensemble members were generated from each run. The ensemble of 105 members was then used to select the local-scale precipitation and temperature values in four diverse basins across the contiguous United States. These downscaled precipitation and temperature estimates were subsequently analyzed to test the performance of this downscaling approach. The downscaled ensembles were evaluated in terms of bias, the ranked probability skill score as a measure of forecast skill, spatial covariability between stations, temporal persistence, consistency between variables, and conditional bias, and were used to develop spread-skill relationships. Though this approach does not explicitly model the space-time variability of the weather fields at each individual station, the above statistics were extremely well captured. The K-nn method was also compared with a multiple-linear-regression-based downscaling model.
 Statistical downscaling provides a way to utilize output of climate models for local-scale applications. Typical grid sizes for global-scale simulations are on the order of 100–200 km, and the raw global-scale model output is of limited use when information is required at local scales. The objective of downscaling is to overcome this scale mismatch and to use the skill in atmospheric forecasts at local scales.
 In short, statistical downscaling develops relationships between large-scale atmospheric circulation variables and local climate information (e.g., precipitation and temperature observations at individual stations). Using these observed relationships, forecasts of atmospheric variables can be translated into forecasts of local climate variables. Several methods of varying complexity have been used in performing statistical downscaling. Zorita and von Storch  have classified existing statistical methods into three categories: (1) linear methods (e.g., canonical correlation analysis), (2) classification methods (e.g., weather generators and regression tree), and (3) deterministic nonlinear methods (e.g., neural networks). They also propose an analog method and compare the results with a method chosen from each of the above three categories to reconstruct average December–February (DJF) precipitation over the Iberian Peninsula for the period 1901–1989.
Widmann et al.  applied three different statistical downscaling methods that used simulated precipitation fields from the National Centers for Environmental Prediction-National Center for Atmospheric Research (NCEP-NCAR) reanalysis [Kalnay et al., 1996] as the predictor. These methods are (1) local rescaling of the simulated precipitation, (2) downscaling using singular value decomposition (SVD), and (3) local rescaling with a dynamical correction. The three methods were applied to reconstruct historical (1958–1994) wintertime precipitation over Oregon and Washington, and the authors concluded that local rescaling with dynamical correction and SVD-based downscaling yielded comparable skill over the Pacific Northwest region. Salathé  forced a hydrologic model of the Yakima River in central Washington with three downscaled precipitation fields to compare the effectiveness of the downscaling methods. One of these methods was an analog method that used a 1000-hPa geopotential height field from the NCEP-NCAR reanalysis as a predictor. Salathé  showed that downscaling by local scaling of simulated large-scale precipitation from the NCEP-NCAR model was quite successful in streamflow simulations in the Yakima basin.
 In this paper we present a downscaling methodology based on the K-nearest neighbor (K-nn) algorithm. The K-nn algorithm is described for use in a stochastic weather generator by Lall and Sharma , Rajagopalan and Lall , Buishand and Brandsma , and Yates et al. . The fundamental idea of the K-nn algorithm is to search for analogs of a feature vector (vector of variables for which analogs are sought) based on similarity criteria in the observed time series. In the weather generator model, the day immediately following the analog day is taken as the next day in the generated sequence, and the process is repeated. In the method presented here, local-scale station information is used for analog days selected on the basis of global-scale climate model output.
 Though transfer-function-based models (e.g., multiple linear regression, or MLR) are widely used [Antolik, 2000], the K-nn based approach developed here has several advantages. First, this method is data-driven and makes no assumptions about the underlying marginal and joint probability distributions of variables. For example, to downscale precipitation using MLR, we need a two-step process [e.g., Clark et al., 2004]. We need to account for the intermittent property of precipitation (typically modeled using a logistic regression), and then transform to normal space to satisfy the inherent normality criteria needed in least squares regression to model precipitation amounts. Second, K-nn based downscaling will be shown to intrinsically preserve the spatial covariability and consistency of the downscaled climate fields. Third, ensemble medium-range forecast (MRF) runs can be readily utilized in the downscaling process, and there is no need to use the ensemble mean of MRF predictors, as is normally used in regression models. Finally, the ensemble spread information from MRF runs can be utilized to develop spread-skill relationships, which is not possible in an MLR model [e.g., Clark et al., 2004].
 The K-nn downscaling methodology was tested on four example river basins distributed over the continental United States, covering both snowmelt- and rainfall-dominated hydrologic regimes. These four basins are (1) the Animas River in southwestern Colorado, (2) the east fork of the Carson River on the California/Nevada border, (3) the Cle Elum River in central Washington, and (4) the Alapaha River in southern Georgia (Figure 1).
Section 2 provides a description of the data used in the analysis. Section 3 describes the K-nn methodology developed for statistical downscaling. Section 4 presents a discussion of the results from the four example river basins. Section 5 is a summary of the techniques and results.
2. Data Description
2.1. The CDC Forecast Archive
 The NOAA-CIRES Climate Diagnostics Center (CDC) has generated a “reforecast” data set (1978 to present) using a fixed version (circa 1998) of the NCEP operational medium-range forecast (MRF) model [Hamill et al., 2004]. This is a spectral model and has a horizontal resolution of approximately 200 km, with 28 vertical layers (T62/L28). The archive consists of one control run plus 14 ensemble members, a total of 15 members. The control run is based on the global analysis from the NCEP-NCAR reanalysis project [Kalnay et al., 1996]. Initial perturbations for ensemble members are generated from the control run with the “breeding method” [Toth and Kalnay, 1993]. Each ensemble member consists of a 14-day forecast starting every day since 1 January 1978, and presently the model continues to be run in real time. The model outputs are saved at 0000 UT and 1200 UT. The 20-year archive data from 1 January 1979 to 31 December 1998 was used in this study.
 We used seven output variables [Clark and Hay, 2004] from each of the ensemble members in our analysis. The model output variables used are (1) the accumulated precipitation for a 12-hour period (e.g., 0000–1200 UT) at the surface, (2) mean sea level pressure, (3) total column precipitable water, (4) relative humidity at 700 hPa, (5) 2-m air temperature, (6) 10-m zonal wind speed, and (7) 10-m meridional wind speed.
2.2. Station Data
 This study employs daily precipitation, and maximum and minimum temperature data from the National Weather Service (NWS) manual cooperative (COOP) network of climate observing stations across the contiguous United States. These data were extracted from the National Climatic Data Center (NCDC) Summary of the Day (TD3200) data set [Eischeid et al., 2000]. Quality control performed by NCDC includes the procedures described by Reek et al.  that flag questionable data based on checks for (1) absurdly extreme values, (2) internal consistency among variables (e.g., maximum temperature less than minimum temperature), (3) constant temperature (e.g., 5 or more days with the same temperature are suspect), (4) excessive diurnal temperature range, (5) invalid relationships between precipitation, snowfall, and snow depth, and (6) unusual spikes in temperature time series. Records at most of these stations start in 1948 and continue through 1998.
 The four example basins (the Animas River, Colorado, referred to in the figures as anmas; the East Carson River, California and Nevada, carsn; the Cle Elum River, Washington, celum; and the Alapaha River, Georgia, alapa) were selected based on their geographical distribution and streamflow characteristics. The Animas, East Carson, and Cle Elum are snowmelt-dominated, and the Alapaha is a rainfall-dominated basin. We select the “best stations” in the COOP network that are located within a 100-km search radius of the center of these four basins: 15 stations in the Animas, 16 stations in the Carson, 18 stations in the Cle Elum, and 10 stations in the Alapaha (Table 1). These “best stations” are defined as those with less than 10% missing or questionable data over the analysis period, 1979–1998.
Table 1. Stations Used in the Four Study Basins
Animas (Colorado), 37.50°N, 107.50°W
Alapaha (Georgia), 31.35°N, 83.22°W
Cle Elum (Washington), 47.37°N, 121.05°W
Carson (California-Nevada), 38.55°N, 119.80°W
 The steps in downscaling the atmospheric variables to basin-scale precipitation and temperature using the K-nn algorithm are outlined in this section. The CDC NCEP-MRF forecast archive was retrieved and formatted to form a data matrix consisting of 7305 rows (corresponding to the number of days from 1 January 1979 to 31 December 1998) and 14 columns (corresponding to the number of lead times) for each of the seven variables (see section 2.1). Days similar to each of the 7305 × 14 days in the archive were identified using the K-nn algorithm. A description of the K-nn algorithm follows.
3.1. K-nn Algorithm
 Each of the 15 ensemble members of the MRF archive for each basin was examined individually. The steps of the K-nn algorithm for a given MRF ensemble member and basin are as follows:
 1. Compile a feature vector of MRF model output for a given day and forecast lead time. The feature vector ($\underline{x}_f$) consists of values for all the climate variables of the day (the feature day, f) for which we are trying to find the K-nearest neighbors. Since two model outputs, 0000 and 1200 UT, were available for each of the seven variables, the feature vector was assumed to consist of 14 variables:
$$\underline{x}_f = [x_1, x_2, \ldots, x_{14}] = [v_{11}, v_{21}, \ldots, v_{71}, v_{12}, v_{22}, \ldots, v_{72}] \qquad (1)$$
where $v_{ij}$ is the value of the climate variable i (i = 1,…, 7; the seven climate variables, see section 2.1) at time j (j = 1, 2; 0000 and 1200 UT) for the feature day f. Explicitly, $x_1 = v_{11}$; $x_2 = v_{21}$, and so on.
 2. Set a window of chosen width centered on the feature day f. We used a 14-day window (7 days lagged and 7 days lead) [Yates et al., 2003] starting with the first day of the archive (1 January 1979). The subset of data for a given variable now consists of all days over the 20-year period (1979–1998) but excluding day f within this 14-day window. Missing data, if any, also need to be accounted for; let $n_t$ be the total number of days with available data within the temporal window centered on the feature day f. So for the 14 variables (refer to step 1), the data matrix was reformatted to have $n_t$ rows and 14 columns. The structure of this data matrix ($[A]^f_{n_t \times 14}$) is
$$[A]^f_{n_t \times 14} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,14} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,14} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n_t,1} & a_{n_t,2} & \cdots & a_{n_t,14} \end{bmatrix} \qquad (2)$$
where $a_{i,j}$ is the value of the climate variable for time index i (i = 1,…, $n_t$) and for variable j (j = 1, …, 14).
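 3. Standardize matrix $[A]^f_{n_t \times 14}$. The standardized matrix $[S]^f_{n_t \times 14}$ is expressed as
$$[S]^f_{n_t \times 14} = [\underline{s}_1, \underline{s}_2, \ldots, \underline{s}_{14}] \qquad (3a)$$
$$\underline{s}_j = \frac{\underline{a}_j - \mu_j}{\sigma_j} \qquad (3b)$$
$$\underline{a}_j = [a_{1,j}, a_{2,j}, \ldots, a_{n_t,j}]^T \qquad (3c)$$
$$\mu_j = E[\underline{a}_j] \qquad (3d)$$
$$\sigma_j = \sqrt{E[(\underline{a}_j - \mu_j)^2]} \qquad (3e)$$
where the underbars represent vectors; $\underline{s}_j$ represents the vector of standardized values of vector $\underline{a}_j$ for variable j. The variable counter j loops from 1 through 14 (the total number of variables); $\mu_j$ and $\sigma_j$ are the mean and standard deviation, respectively, of variable j estimated from vector $\underline{a}_j$; E[ ] is the expected value; and superscript T represents the vector-matrix transpose operator.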
 4. Perform empirical orthogonal function (EOF) decomposition or principal component analysis (PCA) of matrix $[S]^f_{n_t \times 14}$. We first estimate the correlation/covariance matrix $[C]^f_{14 \times 14}$, which is given by
$$[C] = \frac{1}{n_t}[S]^T[S] \qquad (4)$$
where $[S]^T$ is the transpose of matrix [S] (the superscript f has been dropped for clarity; see equation (3a)). A singular value decomposition of $[C]^f_{14 \times 14}$ [Press et al., 1992] yields
$$[C] = [U][W][V]^T \qquad (5)$$
where [U] and [V] are the orthogonal matrices (order, 14 × 14), and [W] is a diagonal matrix of the same order whose elements are the eigenvalues ($\lambda_j$, j = 1, …, 14 such that $\lambda_1 > \lambda_2 > \ldots > \lambda_{14}$; corresponding to the 14 variables). Since $[C]^f_{14 \times 14}$ is symmetric, [U] = [V]. Each column of [U] (or [V]) is the eigenvector corresponding to a given eigenvalue $\lambda_j$. Let $\underline{u}_j$ be the eigenvector corresponding to eigenvalue $\lambda_j$. So
$$[U] = [\underline{u}_1, \underline{u}_2, \ldots, \underline{u}_{14}] \qquad (6)$$
The principal components (PCs) are then derived as
$$[P]^f_{n_t \times 14} = [S][U] \qquad (7a)$$
$$[P]^f_{n_t \times 14} = [\underline{p}_1, \underline{p}_2, \ldots, \underline{p}_{14}] \qquad (7b)$$
where $[P]^f_{n_t \times 14}$ is the principal component matrix for feature day f and column vector $\underline{p}_j$ is the jth principal component (j = 1,…, 14) of length $n_t$. The principal components that explained more than 1 percent of the total variance (total variance is given by the trace of matrix [W], i.e., tr[W]) for feature day f were retained. Let $n_{ret}$ be the number of PCs retained, with $n_{ret}$ < 14. Typically five PCs were retained.
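 5. Using summary statistics (mean and standard deviation, equations (3d) and (3e), respectively) from step 3, and eigenvectors from step 4, project the feature vector in step 1 on to eigenspace. Let the projected feature vector be $\underline{x}'_f$, which is given by
$$x'_j = \sum_{i=1}^{14} \frac{x_i - \mu_i}{\sigma_i}\, u_{ij}, \qquad j = 1, \ldots, n_{ret} \qquad (8a)$$
$$\underline{x}'_f = [x'_1, x'_2, \ldots, x'_{n_{ret}}] \qquad (8b)$$
where $x'_j$ are the elements of the projected feature vector $\underline{x}'_f$, and $u_{ij}$ is the ith element of the eigenvector corresponding to eigenvalue $\lambda_j$.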
 6. For each time element i (i = 1,…, $n_t$), compute the weighted Euclidean distance between the projected feature vector (equation (8b)) and the PCs (equation (7b)). The distance computation is carried out using only the $n_{ret}$ components. Let $d_i$ be the distance metric corresponding to day i, which is calculated as
$$d_i = \sqrt{\sum_{j=1}^{n_{ret}} \frac{\lambda_j}{\mathrm{tr}[W]} \left(x'_j - p_{ij}\right)^2} \qquad (9)$$
where $p_{ij}$ is the value of the jth PC for day i. The ratio $\lambda_j/\mathrm{tr}[W]$ is the weight and corresponds to the fraction of variance explained by PC $\underline{p}_j$. This gives a set of $n_t$ distances as possible neighbors of feature day f.
 7. Sort the distances $d_i$ in ascending order ($d_{(i)}$), and retain only the first K neighbors. The choice of K is based on the prescriptive choice of the square root of all possible candidates (i.e., $K = \sqrt{n_t}$) [Rajagopalan and Lall, 1999; Yates et al., 2003]. From the asymptotic arguments of Fukunaga , K should be chosen so as to be proportional to $n_t^{4/(d+4)}$, where d is the dimension of the vector (i.e., in our case, number of PCs retained, $n_{ret}$) for which the nearest-neighbor density is to be estimated, with the constant of proportionality dependent on the underlying density. One can also use objective criteria such as generalized cross validation (GCV) as proposed by Rajagopalan and Lall  or Lall and Sharma . For the sample sizes under consideration here, the choice of $K = \sqrt{n_t}$ was found to give consistent results for the simulated statistics. With a 14-day window, 20 years of data, and no missing data for days in the temporal window, the maximum number of nearest neighbors in our case was $\sqrt{14 \times 20} = \sqrt{280} \approx 17$ (rounded to nearest integer).
 8. Assign weight wi (0 < wi < 1) to each of the K neighbors using the bisquare weight function [Huber, 2003] based on distance d(i).
where d(K) is the distance (sorted) of neighbor K.
 9. Select a neighbor from the K neighbors as an analog for feature day f. A uniform random number [Press et al., 1992], u ∼ U[0, 1], is first generated, and if u ≥ w1, then the day corresponding to distance d(1) is selected. If u ≤ wK, then the day corresponding to d(K) is selected. For wK < u < w1, the day corresponding to d(i) is selected for which wi is closest to u.
 10. Repeat step 9 seven times to generate seven ensemble members.
 11. Repeat steps 1–10 for each of the days (7305) corresponding to a forecast lead time (14 lead times), a total of (7305 × 14) feature days in the archive.
 12. Repeat steps 1–11 fifteen times corresponding to the 15 MRF runs.
 13. Repeat steps 1–12 four times for the four study basins.
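As an illustration of steps 2 and 3, the following Python sketch builds the candidate window and standardizes a data matrix. It is not the code used in this study. The function names `window_days` and `standardize_columns` are hypothetical, the window is assumed to skip the feature day's own calendar position in every year (so that a complete window yields 14 × 20 = 280 candidates, consistent with step 7), and population statistics are assumed for the mean and standard deviation.

```python
from datetime import date, timedelta
import statistics


def window_days(feature_day, first_year, last_year, half_width=7):
    """Candidate analog dates: half_width days before and after the feature
    day's calendar position in every archive year (offset 0 is skipped)."""
    archive_start = date(first_year, 1, 1)
    archive_end = date(last_year, 12, 31)
    candidates = []
    for year in range(first_year, last_year + 1):
        try:
            center = feature_day.replace(year=year)
        except ValueError:  # 29 February in a non-leap year
            center = date(year, 2, 28)
        for offset in range(-half_width, half_width + 1):
            if offset == 0:
                continue  # assumption: the center day itself is not a candidate
            day = center + timedelta(days=offset)
            if archive_start <= day <= archive_end:
                candidates.append(day)
    return candidates


def standardize_columns(rows):
    """Standardize each column of a list-of-rows matrix (step 3).
    Returns the standardized rows plus the column means and standard deviations,
    which are needed again when the feature vector is projected in step 5."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation; constant columns are left unscaled.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


candidates = window_days(date(1985, 6, 15), 1979, 1998)
print(len(candidates))  # 280 for a complete window
```

Days that fall outside the archive (for example, late December 1978 when the feature day is in early January) are simply dropped, which is one way the count of available days can fall below 280.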
 Thus the final output for each of the four basins consisted of analog dates (pointers to physical dates were stored) corresponding to each day in the MRF archive (size, 7305), each forecast lead time (size, 14), and an ensemble of 105 members (seven realizations from each of the 15 MRF model runs). Note that the downscaling was carried out for the center point of each of the four basins.
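Steps 6 through 10 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: all function names are hypothetical, the principal components and projected feature vector are taken as given inputs, and the weight function is assumed to be the standard unnormalized bisquare kernel, (1 − (d(i)/d(K))²)², which may differ in normalization from the form used here.

```python
import math
import random


def weighted_distances(projected_feature, pcs, eigenvalues, n_ret):
    """Step 6: weighted Euclidean distance from the projected feature vector
    to each row of the PC matrix, using only the first n_ret components."""
    total_variance = sum(eigenvalues)  # trace of [W]
    weights = [lam / total_variance for lam in eigenvalues[:n_ret]]
    return [
        math.sqrt(sum(w * (x - p) ** 2
                      for w, x, p in zip(weights, projected_feature, row)))
        for row in pcs
    ]


def bisquare_weights(sorted_distances):
    """Step 8 (assumed unnormalized bisquare kernel): weight 1 at zero
    distance, falling to 0 at the distance of the Kth neighbor."""
    d_k = sorted_distances[-1]
    if d_k == 0.0:
        return [1.0] * len(sorted_distances)
    return [(1.0 - (d / d_k) ** 2) ** 2 for d in sorted_distances]


def pick_neighbor(weights, rng):
    """Step 9: compare a uniform random number with the neighbor weights."""
    u = rng.random()
    if u >= weights[0]:
        return 0
    if u <= weights[-1]:
        return len(weights) - 1
    return min(range(len(weights)), key=lambda i: abs(u - weights[i]))


def select_analogs(projected_feature, pcs, eigenvalues, n_ret,
                   n_members=7, seed=0):
    """Steps 6-10: return n_members row indices of selected analog days."""
    rng = random.Random(seed)
    distances = weighted_distances(projected_feature, pcs, eigenvalues, n_ret)
    order = sorted(range(len(distances)), key=distances.__getitem__)  # step 7
    k = max(1, round(math.sqrt(len(order))))  # K = sqrt(n_t)
    neighbors = order[:k]
    weights = bisquare_weights([distances[i] for i in neighbors])
    return [neighbors[pick_neighbor(weights, rng)] for _ in range(n_members)]
```

Because the returned values are row indices into the candidate window, they can be stored as pointers to physical dates, as described above.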
 The premise of this approach is that the atmospheric model captures the large-scale circulation patterns, which are assumed to be related to the synoptic-scale processes that generate the local weather. Since neighbors are sought only in the PC space of the atmospheric model, the choice of the downscaling location is not influenced by any basin feature (e.g., topography, precipitation shadow, etc.). Our choice of the downscaling location (i.e., centers of basins) was guided primarily by the intended application to hydrologic modeling by River Forecast Centers. Forecasting of precipitation and temperature fields at individual stations adjoining the basins is described in the next section.
3.2. Forecasting Precipitation and Temperature Fields at Individual Stations
 A 100-km search radius was used from the center of each basin to pick up the closest stations (see Table 1). The dates derived using the K-nn algorithm for a given basin were used to select from the daily-observed precipitation and temperature values for each of the adjoining stations of that basin. This then constitutes the downscaled precipitation and temperature for each of the stations used in this study. Several statistics were then calculated to analyze these downscaled precipitation and temperature fields, and these are presented in the next section.
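The lookup described above amounts to indexing every station's observed record with the same analog dates. The sketch below assumes a hypothetical data layout (a dictionary per station mapping dates to observed values) and hypothetical station identifiers; it is meant only to show why the spatial structure is carried over, since all stations are read on the same day.

```python
from datetime import date


def downscale_from_analogs(analog_dates, station_obs):
    """For every station, read the observation recorded on each analog date.
    A missing observation is returned as None."""
    return {
        station: [series.get(day) for day in analog_dates]
        for station, series in station_obs.items()
    }


# Hypothetical observations: station id -> {date: (precip_mm, tmax_C, tmin_C)}
station_obs = {
    "STN_A": {date(1985, 1, 3): (2.5, 1.0, -8.0), date(1991, 1, 7): (0.0, 4.0, -5.0)},
    "STN_B": {date(1985, 1, 3): (4.1, 0.0, -9.5), date(1991, 1, 7): (0.0, 3.0, -6.0)},
}
analog_dates = [date(1991, 1, 7), date(1985, 1, 3), date(1991, 1, 7)]
downscaled = downscale_from_analogs(analog_dates, station_obs)
```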
4. Results and Discussions
 The statistics used to analyze and verify the downscaled precipitation and temperature forecasts are (1) seasonal cycles of precipitation amount and temperature (results are shown only for maximum temperature), (2) bias, (3) spatial correlations, (4) forecast skill, (5) forecast reliability, (6) rank histograms, and (7) spread-skill relationships.
4.1. Seasonal Cycles of Precipitation Amount and Temperature
 We first analyzed the variation of the annual cycle of precipitation and temperature for the four study basins. In Figures 2 and 3 the annual cycles (derived from observations for the period 1979–1998) of precipitation and temperature, respectively, for selected COOP stations in the basins along with the ensemble spread (as box plots) for each month are presented. The COOP stations used were CO1609, GA0140, WA0456, and CA0931 for the Animas, Alapaha, Cle Elum, and East Carson, respectively (see Table 1 for locations). The box plots for each month were estimated from the 105 ensemble members and are shown for the forecast lead time of 5 days. The boxes in these plots (e.g., Figure 2) indicate the interquartile range of the simulations, and the whiskers show the 5th and 95th percentile of the simulations, while the open circles indicate values outside this range. The horizontal lines within the box indicate the median value, and the solid lines join values of the statistic from the observed data. Typically, if the statistics of the observed data fall within the box, it indicates that the simulations adequately reproduced the statistics of the historical data.
 In the case of precipitation (Figure 2), there is a wide regional variation in the amounts and timing of the maximum precipitation occurrences among the basins. The Alapaha, for example, has a precipitation peak in summer, but the Cle Elum is driest during the summer season (June–July–August). Also, it is well known that in the western United States, in particular during wintertime, the hydroclimate variables have a coherent spatial pattern [e.g., Rajagopalan and Lall, 1998]. The atmospheric models generally represent these synoptic scales quite well, and we see better simulations of precipitation amount in the snow-dominated Animas, Cle Elum, and East Carson basins (Figures 2a, 2c, and 2d, respectively) than in the rainfall-dominated Alapaha basin (Figure 2b). The K-nn downscaling model in all cases largely captures the seasonal variation of precipitation. Given that the K-nn algorithm was not explicitly designed to preserve monthly statistics, the seasonal cycle is fairly well captured. For maximum temperature (Figure 3), the downscaled values in all cases were able to capture the historical observations. Unlike precipitation, the ensemble spread (interquartile range in the box plots) was minimal in the case of temperature. Similar results were noted for other forecast lead times.
4.2. Bias
 Bias is defined as the deviation of the expected value of a given variable from its true value. We estimated the median absolute bias ($MAB_l$) for each forecast lead time (l) and month as follows:
$$\bar{O}_l = \frac{1}{ndays} \sum_{i=1}^{ndays} O_{il} \qquad (11a)$$
$$(\bar{Y}_l)_e = \frac{1}{ndays} \sum_{i=1}^{ndays} (Y_{il})_e \qquad (11b)$$
$$MAB_l = \underset{e = 1, \ldots, n}{\mathrm{median}} \left[\, \left| (\bar{Y}_l)_e - \bar{O}_l \right| \,\right] \qquad (11c)$$
where ndays is the total number of days in the time series for a given month (e.g., ndays = 620 for January from 20 years of data and with no missing values); $\bar{O}_l$ is the expected value of the observed variable (precipitation or temperature) for lead time l (i.e., climatological mean); and $O_{il}$ is the observation for day i and lead time l. Similarly, $(\bar{Y}_l)_e$ is the expected value of the downscaled variable for lead time l and ensemble member e, and $(Y_{il})_e$ is the downscaled variable value for day i, lead time l, and ensemble member e. We then calculate the absolute bias for a given ensemble member and use the n (equal to 105) ensemble members to calculate the median absolute bias ($MAB_l$) for lead time l (equation (11c)). For precipitation, the absolute bias was expressed as a percentage of $\bar{O}_l$. In other words, the absolute difference term within the square brackets of equation (11c) was expressed as $100 \times |(\bar{Y}_l)_e - \bar{O}_l| / \bar{O}_l$.
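The bias calculation described above can be illustrated with a short Python sketch. The function name `median_absolute_bias` and the toy numbers are hypothetical; the sketch assumes one observed series and one downscaled series per ensemble member, all for the same month and lead time.

```python
import statistics


def median_absolute_bias(observed, downscaled_members, as_percent=False):
    """Median over ensemble members of |mean(downscaled) - mean(observed)|.
    With as_percent=True the difference is expressed as a percentage of the
    observed mean, as done for precipitation."""
    observed_mean = statistics.fmean(observed)
    absolute_biases = []
    for member in downscaled_members:
        difference = abs(statistics.fmean(member) - observed_mean)
        if as_percent:
            difference = 100.0 * difference / observed_mean
        absolute_biases.append(difference)
    return statistics.median(absolute_biases)


# Toy example with three ensemble members instead of 105.
observed = [1.0, 2.0, 3.0]
members = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [0.0, 1.0, 2.0]]
mab = median_absolute_bias(observed, members)
mab_percent = median_absolute_bias(observed, members, as_percent=True)
```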
Figure 4 shows the bias for precipitation for each of the four basins for the month of January. Once again, these biases are median absolute biases and are expressed as a percentage of the mean climatology. The box plots correspond to the spread from the number of closest stations (shown in parentheses) in a given basin. The median bias (estimated from the closest stations for a given basin) for all the basins is within 20%. In some cases, stations have biases greater than 20%. Of the four basins, the biases are largest for the Animas. This is probably because the Animas is the driest of the four basins, with an average January precipitation of about 1.28 mm. The temperature biases (not shown) were quite small and typically were within 0.5°C.
4.3. Spatial Correlations
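 Spatial correlations between stations were used to check how well the K-nn algorithm preserves the spatial structure of the observed fields. The Pearson correlation (hereinafter referred to as correlation) between two example stations, say, 1 and 2, was estimated as follows:
 Let $\underline{Y}_{1le}$ and $\underline{Y}_{2le}$ be the vectors of downscaled values for a given variable (e.g., precipitation) for lead time l from ensemble member e at stations 1 and 2. That is,
$$\underline{Y}_{1le} = [(Y_{1le})_1, \ldots, (Y_{1le})_{ndays}]^T, \qquad \underline{Y}_{2le} = [(Y_{2le})_1, \ldots, (Y_{2le})_{ndays}]^T \qquad (12)$$
where $(Y_{1le})_i$ and $(Y_{2le})_i$ are the downscaled variable values for lead time l, ensemble member e, and day i for stations 1 and 2, respectively; and i = 1,…, ndays. Next we calculate the correlation ($\rho_{le}$) for a given ensemble member (e) and lead time (l) using the vectors $\underline{Y}_{1le}$ and $\underline{Y}_{2le}$. That is,
$$\rho_{le} = \frac{E\left[(\underline{Y}_{1le} - E[\underline{Y}_{1le}])(\underline{Y}_{2le} - E[\underline{Y}_{2le}])\right]}{\sigma_1 \sigma_2} \qquad (13)$$
where E[ ] is the expected value and $\sigma_1$ and $\sigma_2$ are the standard deviations of $\underline{Y}_{1le}$ and $\underline{Y}_{2le}$, respectively.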
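A minimal Python sketch of this calculation is given below. The function names are hypothetical, and population moments are assumed for the expected values and standard deviations. One correlation is computed per ensemble member, giving the sample from which a box plot for a given lead time can be drawn.

```python
import statistics


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length series using population moments.
    Raises ZeroDivisionError if either series is constant."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance = statistics.fmean(
        [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
    )
    return covariance / (statistics.pstdev(x) * statistics.pstdev(y))


def member_correlations(station1_members, station2_members):
    """One correlation per ensemble member between two stations."""
    return [
        pearson_correlation(series1, series2)
        for series1, series2 in zip(station1_members, station2_members)
    ]


# Toy example: two ensemble members, three days each.
station1 = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
station2 = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
correlations = member_correlations(station1, station2)
```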
Figure 5 shows the correlation box plots (for a given l using the 105 ensemble members of ρle) over 14-day forecast lead time between two example stations in the Animas basin (CO4734 and CO1609), and Figure 6 presents similar results for two stations in the Alapaha basin (GA0140 and GA2266) for winter and summer precipitation and temperature. Since we pick the weather data for all the stations simultaneously on the selected neighbor (i.e., day, step 9), the K-nn method intrinsically preserves (1) the spatial correlation structure of the variables (precipitation and temperature) and (2) the correlation between the variables at each station.
 For precipitation, in the case of the Animas basin, which overall is a dry basin, the observed spatial correlation is about 0.2 for both January and July. These observed spatial correlations are quite small. Since the Animas basin has significant topographical variations (see Figure 1), elevation differences and measurement errors in precipitation can contribute to low observed spatial correlation values. In the case of Alapaha, which is relatively flat, and wetter, we see a high degree of spatial correlation (about 0.7) between the example stations in January. In July, the spatial correlation diminishes because the precipitation is largely generated by convection, which is generally difficult to capture by the atmospheric models.
 For temperature (see Figures 5 and 6), the box plots of downscaled values adequately bracket the observed spatial correlation. The temperature correlations among the stations are very similar for winter and summer in both the basins, and the biases are quite small in all cases. As mentioned above, the cross correlations (i.e., correlations between variables at each station) are also intrinsically preserved by this downscaling method.
4.4. Forecast Skill
 The probabilistic skill of the downscaled precipitation and temperature forecasts was assessed using the ranked probability skill score (RPSS) [Wilks, 1995]. The RPSS is based on the ranked probability score (RPS) computed for each downscaled forecast and observation pair:
$$RPS = \sum_{m=1}^{J} \left(Y_m - O_m\right)^2 \qquad (14)$$
where $Y_m$ is the cumulative probability of the forecast for category m and $O_m$ is the cumulative probability of the observation for category m. This is implemented as follows. First, the observed time series is used to distinguish J = 10 possible categories for forecasts of precipitation and temperature (i.e., the minimum value to the 10th percentile, the 10th percentile to the 20th percentile, … , the 90th percentile to the maximum value). These categories are determined separately for each month, variable, and station in the basin. Next, for each forecast-observation pair, the number of ensemble members forecast in each category is determined (out of 105 ensemble members), and their cumulative probabilities are computed. Similarly, the appropriate category for the observation is identified and the observation's cumulative probabilities are computed (i.e., all categories less than the observation's position are assigned zero and all categories equal to and greater than the observation's position are assigned 1). Now the RPS is computed as the squared difference between the observed and forecast cumulative probabilities, and the squared differences are summed over all categories (equation (14)).
 The RPSS is then computed as
$$RPSS = 1 - \frac{\overline{RPS}}{\overline{RPS}_{clim}} \qquad (15)$$
where $\overline{RPS}$ is the mean ranked probability score for all forecast-observation pairs and $\overline{RPS}_{clim}$ is the mean ranked probability score for the climatological forecast.
 For temperature, $\overline{RPS}_{clim}$ is computed using an equal probability in each of the m categories defined in equation (14) (i.e., 1/J); for precipitation, the probability for the first category (zero precipitation) is taken as the observed probability of no precipitation, and the probability for all other categories is taken as 1/(J − 1) (see equation (14)). An RPSS of 0.0 indicates no difference in skill over the reference climatological forecast ($\overline{RPS}_{clim}$), and an RPSS of 1.0 indicates a perfect forecast. Negative RPSS implies that the model performs worse than climatology. Here RPSS was estimated separately for each forecast lead time, for each month, and for each station in the basin. The median RPSS was then calculated from the station RPSS values for each of the basins.
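The RPS and RPSS calculations can be sketched in Python as follows. This is an illustration rather than the verification code used in the study: the function names are hypothetical, categories are assumed to be defined by a sorted list of interior boundaries (J − 1 values for J categories), a value equal to a boundary is assumed to fall in the upper category, and the climatological reference in the example uses equal category probabilities, as described for temperature.

```python
from bisect import bisect_right


def category_of(value, edges):
    """Index of the category containing value, given sorted interior edges."""
    return bisect_right(edges, value)


def ensemble_probabilities(members, edges):
    """Fraction of ensemble members falling in each category."""
    counts = [0] * (len(edges) + 1)
    for value in members:
        counts[category_of(value, edges)] += 1
    return [count / len(members) for count in counts]


def ranked_probability_score(forecast_probs, observed_category):
    """Sum over categories of the squared difference between the cumulative
    forecast probability and the cumulative (0/1) observation."""
    score = 0.0
    cumulative_forecast = 0.0
    for m, probability in enumerate(forecast_probs):
        cumulative_forecast += probability
        cumulative_observed = 1.0 if m >= observed_category else 0.0
        score += (cumulative_forecast - cumulative_observed) ** 2
    return score


def rpss(forecast_scores, climatology_scores):
    """Skill score from the mean RPS of the forecasts and of climatology."""
    mean_forecast = sum(forecast_scores) / len(forecast_scores)
    mean_climatology = sum(climatology_scores) / len(climatology_scores)
    return 1.0 - mean_forecast / mean_climatology


# Toy example: three categories, four ensemble members, one observation.
edges = [1.0, 2.0]
probs = ensemble_probabilities([0.5, 0.5, 1.5, 2.5], edges)
obs_cat = category_of(0.7, edges)
rps_forecast = ranked_probability_score(probs, obs_cat)
rps_clim = ranked_probability_score([1 / 3, 1 / 3, 1 / 3], obs_cat)
skill = rpss([rps_forecast], [rps_clim])
```

In practice the scores would be accumulated over all forecast-observation pairs for a given station, month, and lead time before the ratio is taken.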
Figures 7 and 8 show plots of median RPSS for precipitation and temperature, respectively. These plots show the months along the abscissa and forecast lead times along the ordinate, with darker shading representing regions of higher skill. For precipitation (Figure 7), in all the basins higher skill is obtained during the fall and winter months, and extends for only short forecast lead times (e.g., up to 3 days in the case of Cle Elum). Wintertime skill scores are around 0.4 for all of the basins. From equation (15), this corresponds to a 40% reduction in the mean ranked probability score relative to the reference climatological forecasts. In summer, the skill drops considerably even at short forecast lead times. For Cle Elum, however, we see higher skill even during the summertime. This is because the basin is driest during the summer months (see Figure 2), and higher skill arises from consistent dry forecasts from the downscaling model.
 For temperature (Figure 8) the skill is higher than for precipitation, with a maximum of around 0.5 in all the basins. Higher skill is generally observed during all the seasons, and the forecasts are valuable up to lead times of 5 days. Also, for both temperature and precipitation, the results overall are very consistent, showing skill diminishing with an increase in forecast lead time.
 Since the RPSS is only a single number, it is a useful measure for ranking competing forecasts, but it does not illuminate the underlying basis for the forecast errors. For example [Hamill, 1997], are the forecasts too specific, or biased? Are 25% of the forecasts on average below the 25th percentile of forecast distribution? Thus we need additional forecast verification measures to address such issues. The reliability diagram [Wilks, 1995] is a frequently used tool in probabilistic forecast verification and is discussed in the next section.
4.5. Forecast Reliability
 The fundamental interest in forecast verification is to analyze the joint probability distribution of forecasts and observations [Wilks, 1995]. Let $y_i$ denote discrete forecasts that can take any one of the I values $y_1, y_2, \ldots, y_I$; and let $o_j$ be the corresponding observations (discrete), which can have any of the J values $o_1, o_2, \ldots, o_J$. Then the joint probability mass function $p(y_i, o_j)$ of the forecasts and observations is given by
$$p(y_i, o_j) = p(o_j \mid y_i)\, p(y_i) \qquad (16)$$
where $p(o_j \mid y_i)$ is the conditional probability, implying how often each possible event (out of J outcomes) occurred on those occasions when the single forecast $y_i$ was issued; and $p(y_i)$ is the unconditional (marginal) distribution that specifies the relative frequencies of use of each of the forecast values $y_i$.
 The reliability diagram graphically represents the performance of probability forecasts of dichotomous events and depicts the conditional probability that an event occurred (say, o1), given the different probabilistic forecasts (yi); that is, the observed relative frequency, p(o1∣yi), as a function of the forecast probability p(yi). This was implemented as follows.
 First, the ensemble output (105 ensemble members) for a given basin is converted into probabilistic forecasts (i.e., the probability a specific event occurs). In this case, the “event” is that the day is forecasted to lie in the upper tercile of the distribution, and the probability is simply calculated as all ensemble members in the upper tercile divided by the total number of ensemble members. The upper tercile was chosen to focus attention on events such as heavy precipitation and high temperatures that can cause significant changes to streamflow. Next, the observed data are converted to a binary time series: A day is assigned “one” if the data lies in the upper tercile and “zero” if the data does not. The above steps produce a set of probabilistic forecast–observation pairs for each variable, station, month, and forecast lead time. Finally, the forecasted probabilities are classified into I categories (i.e., probabilities between 0.0 and 0.1, between 0.1 and 0.2, …, between 0.9 and 1.0, a total of 10 categories), and for each category both the average forecasted probability and the average of the observed binary data are calculated. It should also be noted that the number of categories used affects the forecast resolution (i.e., the ability to distinguish subsample forecast periods with different relative frequencies of the event). These averaged observed relative frequency and forecast probability values were then plotted to form the basic reliability diagram.
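The construction of the reliability diagram can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the upper-tercile threshold has already been computed and that a forecast probability of exactly 1.0 is placed in the highest bin.

```python
def event_probability(members, threshold):
    """Fraction of ensemble members above the event threshold
    (here, the upper-tercile boundary)."""
    return sum(1 for value in members if value > threshold) / len(members)


def reliability_points(forecast_probs, observed_events, n_bins=10):
    """Bin forecast probabilities into n_bins equal-width categories and return,
    for each non-empty bin, a tuple of (mean forecast probability,
    observed relative frequency, number of forecasts in the bin)."""
    forecast_sums = [0.0] * n_bins
    observed_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for probability, event in zip(forecast_probs, observed_events):
        bin_index = min(int(probability * n_bins), n_bins - 1)
        forecast_sums[bin_index] += probability
        observed_sums[bin_index] += event
        counts[bin_index] += 1
    return [
        (forecast_sums[b] / counts[b], observed_sums[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Toy example: four forecast-observation pairs (observations are 0 or 1).
points = reliability_points([0.05, 0.05, 0.95, 0.95], [0, 0, 1, 1])
```

The third element of each tuple gives the frequency of use of each forecast category, which is the quantity shown in an inset histogram, while the first two elements give the coordinates of a point on the reliability curve.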
 Reliability diagrams for January precipitation and maximum temperature in the four study basins at 5-day forecast lead time are shown in Figures 9 and 10, respectively. For precipitation, if there were fewer than one third of days with precipitation, a value of zero was used for the probabilities in the reliability diagrams. The 1:1 diagonal in these figures represents the perfect reliability line, and the inset histogram shows the frequency of use of each of the forecasts, p(yi). Also, to construct the reliability diagrams for each basin as a whole, the forecast-observation pairs were lumped together from all stations in that basin (see Table 1). Results show that overall, the forecasted probabilities match the observed relative frequencies remarkably well for both precipitation and temperature. For precipitation (Figure 9), in the Alapaha basin, for example, we see some tendency toward higher observed relative frequency at lower forecasted probabilities and the opposite at higher forecasted probabilities. In other words, when a low probability of the event is forecasted, the event actually occurs more often than forecasted, and vice versa. Also note that the sample size at high forecast probabilities is often very small, except in the case of the Cle Elum. This basin in the Pacific Northwest receives considerable precipitation in January, and we have enough subsamples in each of the forecasted probabilities (see inset histogram in Figure 9c), i.e., we have excellent resolution and reliability in our downscaled forecasts.
 For the case of maximum temperature (Figure 10), in general we have sharper forecasts (high resolution) at the price of some reduced reliability, in particular for the Cle Elum and East Carson, where we can see more frequent occurrences of the event when the forecast probability was slightly lower. Reliability diagrams similar to the above were also plotted for the month of July (not shown). Overall results were similar to January, but for precipitation in the East Carson, practically all the forecasted probabilities (frequency of usage) were within the lowest category (0.0–0.1) and imply the presence of rare events. Though these forecasts were reliable, they exhibit minimal resolution.
 Once again, the results are overall quite impressive and demonstrate that the proposed K-nn algorithm can be used to generate reliable forecasts with negligible conditional bias. The reliability of the forecasts was further evaluated using rank histograms.
4.6. Rank Histograms
 Rank histograms were used to evaluate the reliability of ensemble forecasts and to diagnose errors in their mean and spread. Rank histograms for a given month and forecast lead time were constructed by repeatedly tallying the ranks of the observed precipitation and temperature values relative to values from the 105 member ensemble. The process to obtain the rank histogram for precipitation is slightly different from that of temperature because of the presence of a large number of zero-precipitation days in the observed and ensemble precipitation time series. For temperature the rank histogram was implemented as follows:
 For a given forecast day (say, j) and forecast lead time (say, l), let X = (x(1),…, x(n)) be the sorted n-member ensemble (recall n = 105 in this study) and let V be the observed temperature. Then the rank of V, which can take (n + 1) possible values relative to the sorted ensemble, is obtained. Let this rank be denoted by rjl. If, say, there were 20 years of data, then for January (assuming no missing observations), there would be 620 (31 days × 20 years) time elements in this time series for a given forecast lead time. By tallying the ranks of the observations through this time series we can obtain a vector of ranks for the selected month (m) and lead time (Rml),
Rml = (r1l, r2l, …, rNl),
where N is the length of the time series (or sample size, e.g., 620). The elements of Rml are then binned into the (n + 1) possible categories for constructing the rank histogram. So the rank histogram constitutes the rank of the observed and the probability of the rank to fall in any one of the (n + 1) categories.
 In case of precipitation when there are zero precipitation days in the observed and ensemble time series, a modified rule for rank assignment was implemented [Hamill and Colucci, 1998]. If, say, there are M members tied with the verification (i.e., M ensemble members with zero precipitation), a total of (M + 1) uniform random deviates [Press et al., 1992] are generated, corresponding to the M members, and one for the observed zero precipitation. Then the rank of the deviate corresponding to the observed in the pool of (M + 1) deviates is determined. The rank histogram is then constructed in a manner similar to the one described for temperature.
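The rank assignment for both cases can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's implementation: the function names are hypothetical, NumPy's default random generator stands in for the uniform deviates, and ranks run from 1 to n + 1.

```python
import numpy as np

def observation_rank(members, obs, rng):
    """Rank (1..n+1) of obs relative to an n-member ensemble.

    Ties (e.g., zero-precipitation members on a dry day) are broken at random.
    """
    members = np.asarray(members, dtype=float)
    below = int(np.sum(members < obs))   # members strictly below the observation
    tied = int(np.sum(members == obs))   # M members tied with the observation
    if tied == 0:
        return below + 1
    # Draw M + 1 uniform deviates; index 0 stands for the observation.
    deviates = rng.random(tied + 1)
    rank_in_pool = int(np.sum(deviates < deviates[0]))  # 0..M
    return below + 1 + rank_in_pool

def rank_histogram(ens, obs, seed=0):
    """Relative frequency of each of the n + 1 possible ranks."""
    rng = np.random.default_rng(seed)
    ens = np.asarray(ens, dtype=float)
    n = ens.shape[1]
    ranks = np.array([observation_rank(e, o, rng) for e, o in zip(ens, obs)])
    counts = np.bincount(ranks, minlength=n + 2)[1:]  # drop the unused rank 0
    return counts / counts.sum()
```

When no member is tied with the observation, the function reduces to the simple rank used for temperature, so the same routine can serve both variables.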
 To interpret the rank histograms, it is assumed that the observations and the ensemble members are samples from the same probability distribution. In that case, counting the rank of the observation over several independent samples, an approximately uniform distribution should result across the possible ranks, i.e.,
where, E[ ] denotes the expected value and P is the probability. Hamill  describes the interpretation of rank histograms and provides these guidelines. When the ensemble members are from a distribution with lack of variability, a U-shaped rank histogram results. An excess of variability in the ensemble members, on the other hand, overpopulates the middle ranks, and ensemble bias (positive or negative) excessively populates the (left or right) extreme ranks.
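These guidelines can be checked with a small synthetic experiment. The sketch below assumes Gaussian observations with unit standard deviation and Gaussian ensemble members whose standard deviation is set by the hypothetical parameter `ens_sigma`; none of this uses the study's data, and the function name is made up for the example.

```python
import numpy as np

def synthetic_rank_histogram(ens_sigma, n_members=20, n_days=20000, seed=0):
    """Rank histogram when observations are N(0, 1) and members are N(0, ens_sigma)."""
    rng = np.random.default_rng(seed)
    obs = rng.normal(0.0, 1.0, size=n_days)
    ens = rng.normal(0.0, ens_sigma, size=(n_days, n_members))
    # Rank of each observation within its ensemble (ties have probability zero here).
    ranks = 1 + (ens < obs[:, None]).sum(axis=1)
    hist = np.bincount(ranks, minlength=n_members + 2)[1:] / n_days
    # Also return the fraction of observations falling in the two extreme ranks.
    return hist, hist[0] + hist[-1]

_, outer_under = synthetic_rank_histogram(0.5)  # too little spread: U shape
_, outer_same = synthetic_rank_histogram(1.0)   # same distribution: roughly flat
_, outer_over = synthetic_rank_histogram(2.0)   # too much spread: domed
```

With 20 members there are 21 ranks, so a flat histogram places about 2/21 of the observations in the two extreme ranks; the under-dispersed ensemble places far more there, and the over-dispersed ensemble far fewer.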
Figures 11 and 12 show the basin rank histograms for precipitation and maximum temperature, respectively. Basin rank histograms were constructed by pooling ranks from all stations for a given basin. The basin rank histograms are shown for January at 5-day lead time. For precipitation (Figure 11), the rank histograms are relatively flat, demonstrating that the K-nn method produces realistic ensemble spread. The noise in the rank histograms simply reflects the noisy character of the precipitation time series. For temperature (Figure 12), the basin rank histograms are largely uniform in the middle ranks, except at the extremities where we observe some bias. We see that on average, nearly 2% of the time the observed temperature is lower (greater) than the lowest (highest) ensemble member.
 In general, from all the cases (including summer) we see from the precipitation rank histograms that the ensembles are relatively flat, and for temperature there is only a small fraction of cases (∼2%) when the observed falls outside the ensemble range. We also constructed rank histograms for each of the individual stations used in the study (see Table 1) and overall found no unusual behavior in the structure of the rank histograms. The next question then is, Can we use the ensemble spread information to predict forecast skill? This topic is discussed in the next section.
4.7. Spread-Skill Relationships
 Ensemble forecasts provide an estimate of the forecast probability distribution. If the spread of this distribution varies from forecast to forecast, then the spread in the distribution may be related to the forecast skill [Kalnay and Dalcher, 1987; Whitaker and Loughe, 1998]. To analyze the spread-skill relationship, we first need to select appropriate measures to define the ensemble spread and ensemble skill. We used three measures of ensemble spread: (1) standard deviation of the ensembles, (2) interquartile range, and (3) the 95th minus the 5th quantiles. As skill measures we used (1) RPSS and (2) the absolute error of the ensemble mean (absolute difference between the observed and the ensemble mean). The utility of ensemble spread as a predictor of ensemble skill has traditionally been measured in terms of linear correlation, although Whitaker and Loughe  suggest an analysis of the joint spread-skill probability distribution.
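As a concrete illustration, the three spread measures and the absolute-error skill measure could be computed as below. This is a sketch under assumptions: the function names and the days-by-members array layout are hypothetical, the sample standard deviation (`ddof=1`) is a choice made here, and quantiles use NumPy's default linear interpolation. RPSS is omitted because it requires the climatological reference forecast.

```python
import numpy as np

def spread_measures(ens):
    """Three per-day spread measures for an (n_days, n_members) ensemble."""
    ens = np.asarray(ens, dtype=float)
    q05, q25, q75, q95 = np.percentile(ens, [5, 25, 75, 95], axis=1)
    return {
        "std": ens.std(axis=1, ddof=1),  # (1) ensemble standard deviation
        "iqr": q75 - q25,                # (2) interquartile range
        "q95_q05": q95 - q05,            # (3) 95th minus 5th quantile
    }

def abs_error_of_mean(ens, obs):
    """Absolute difference between the observation and the ensemble mean."""
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return np.abs(obs - ens.mean(axis=1))
```

The traditional linear spread-skill correlation mentioned above would then be, for example, `np.corrcoef(spread_measures(ens)["std"], abs_error_of_mean(ens, obs))[0, 1]`.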
 A contingency table of spread (ensemble standard deviation) and skill (RPSS) for 5-day forecasts of January precipitation at the example station WA0456.COOP is given in Table 2. Here we considered all days for which the observed precipitation was greater than 0.01 inch (0.3 mm). The entries in Table 2 are the joint probability of obtaining the spread and skill values in the indicated quintiles. The columns are spread quintiles, and the rows are skill quintiles. If there were no correlation between spread and skill, all entries in the table would be equal to 0.2. On the other hand, if there were a perfect linear relationship between spread and skill (correlation equal to one), all the diagonal elements would be one and the off-diagonals would be zero. Many of the entries in Table 2 are not very different from 0.2, except at the corners. For example, if the spread is in the lowest quintile, the skill is about 2.5 times more likely to be in the lowest quintile than in the highest. This observation was consistent among all stations in the study.
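A table of this kind can be built as sketched below. The sketch assumes that each column (spread quintile) is normalized to sum to one, which is what the 0.2 no-correlation baseline and the diagonal of ones for a perfect relationship imply; the function names are hypothetical and quintiles are assigned from sample ranks.

```python
import numpy as np

def quintile_index(x):
    """Quintile (0-4) of each value, based on its rank within the sample."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    return (ranks * 5) // len(x)

def spread_skill_table(spread, skill):
    """5 x 5 table: rows are skill quintiles, columns are spread quintiles.

    Each column is normalized to sum to one (an assumption of this sketch),
    so unrelated spread and skill give entries near 0.2.
    Requires at least five samples so that no column is empty.
    """
    s_q = quintile_index(spread)
    k_q = quintile_index(skill)
    table = np.zeros((5, 5))
    np.add.at(table, (k_q, s_q), 1.0)  # count days in each (skill, spread) cell
    return table / table.sum(axis=0, keepdims=True)
```

For example, `spread_skill_table(np.arange(10), np.arange(10))` returns the 5 × 5 identity matrix, the perfect-relationship case described in the text.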
Table 2. Contingency Table of Spread (Ensemble Standard Deviation) and Skill (RPSS) for 5-Day Forecasts of January Precipitation for Station WA0456.COOP When the Observed Precipitation Is Greater Than 0.3 mm
The entries are the joint probability of obtaining the spread and skill values in the indicated quintiles. The columns are spread quintiles, and the rows are skill quintiles.
 To summarize the contingency table for all stations in a basin, we constructed box plots showing the variation of the joint spread-skill probability for all spread and skill quintiles. Results are shown for January precipitation at 5-day forecast lead time for the Animas and Alapaha basins in Figures 13 and 14, respectively. In each of these figures we show three cases: (1) considering all days (left column); (2) days with precipitation within 0 mm and 0.3 mm, including the zero precipitation days (middle column); and (3) days with precipitation greater than 0.3 mm (right column). For each spread quintile, box plots are plotted showing the variation of the joint probability of spread-skill in all stations of the basin with the skill quintiles as the abscissa. The dashed horizontal line corresponds to a joint probability value of 0.2 when there is no spread-skill correlation.
 In both Figures 13 and 14 we see that when all days are considered, and also in the case where precipitation is at most 0.3 mm (with zero precipitation days included), spread and skill are negatively correlated. That is, for lower spread, there is a higher probability of greater skill. Here a large number of ensemble members with zero precipitation contributes to both a lower ensemble spread and higher skill for small precipitation amounts. Conversely, a small number of ensemble members with zero precipitation contributes to higher ensemble spread and lower skill for small precipitation amounts. Unfortunately, these results do not allow us to construct any meaningful spread-skill relationships in order to place time-variant confidence limits on precipitation forecasts.
 Similar box plots for maximum temperature (here data from all days were used) are shown in Figure 15 for the Animas (left column) and Alapaha (right column) basins. In all cases we see that the boxes are close to the dashed horizontal line (i.e., joint probability value of 0.2), which implies that there is no spread-skill correlation. Similar results for both precipitation and temperature were observed for July.
 All the results presented here used standard deviation and RPSS as the spread and skill measures, respectively. Analysis was also carried out using the other spread and skill measures, and the results were found to be robust, that is, the underlying spread-skill relationships do not change with the choice of different measures. Also, though no clear spread-skill relationships were apparent here, the K-nn method is theoretically capable of extracting the spread-skill relationship if it exists in the atmospheric model.
5. Summary and Conclusions
 A method for statistical downscaling using the K-nn algorithm in eigenspace was developed. A 20-year (1979–1998; 7305 days) data archive consisting of model outputs from the NCEP 1998 version of the operational medium-range forecast model from NOAA/CDC was used in this study. A total of 15 MRF runs (one control run plus 14 ensemble members) were available for analysis. Seven MRF model output variables going out to a lead time of 14 days were used in the downscaling algorithm. Analogs to (7305 × 14) feature days using a 14-day temporal window were subsequently identified. All data were projected onto eigenspace, and the distances between a feature day and all candidate days were calculated using a weighted Euclidean norm. The weighting considered the fractional variance explained by each principal component. The distances were then sorted in ascending order, and weights were assigned to each using the bisquare weight function. On the basis of these weights and repeatedly generated uniform random numbers, a set of seven ensemble members was created from each MRF run.
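The sequence summarized above (projection onto eigenspace, variance-weighted distances, bisquare weights, random sampling) can be sketched as follows. This is an illustrative sketch only and makes several assumptions not taken from the text: the function name and arguments are hypothetical, variables are standardized before the decomposition, the default K is the square root of the number of candidate days, and the bisquare weight is taken in its common form (1 − (d/dK)²)² with dK the distance to the K-th neighbor. The 14-day temporal window used to restrict candidate days is not modeled.

```python
import numpy as np

def knn_analog_sample(archive, feature, n_draws=7, k=None, n_pcs=None, seed=0):
    """Draw analog days for one feature vector.

    archive: (n_days, n_vars) candidate days; feature: (n_vars,) feature day.
    Returns (sampled day indices, the K nearest indices, their weights).
    """
    rng = np.random.default_rng(seed)
    archive = np.asarray(archive, dtype=float)
    feature = np.asarray(feature, dtype=float)
    # Standardize each variable (an assumption of this sketch).
    mean = archive.mean(axis=0)
    std = archive.std(axis=0)
    std[std == 0] = 1.0
    z = (archive - mean) / std
    # EOF/principal component decomposition via SVD.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)  # fractional variance of each component
    n_pcs = len(s) if n_pcs is None else n_pcs
    pcs = z @ vt[:n_pcs].T
    f_pc = ((feature - mean) / std) @ vt[:n_pcs].T
    # Euclidean distance weighted by the fractional variance explained.
    d = np.sqrt((((pcs - f_pc) ** 2) * var_frac[:n_pcs]).sum(axis=1))
    k = max(2, int(np.sqrt(len(archive)))) if k is None else k
    k = min(k, len(archive))
    order = np.argsort(d)[:k]       # K nearest days, closest first
    d_k = d[order]
    scale = d_k[-1] if d_k[-1] > 0 else 1.0
    w = (1.0 - (d_k / scale) ** 2) ** 2  # bisquare weights (assumed form)
    if w.sum() == 0:
        w = np.ones(k)              # fall back to equal weights
    w = w / w.sum()
    # Weighted random sampling of analog days from the K neighbors.
    return rng.choice(order, size=n_draws, p=w), order, w
```

Calling the function once per MRF run with `n_draws=7` would mirror the seven members per run described above; the observed station data on the sampled dates would then supply the downscaled values.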
 Results were assessed over four river basins distributed across the contiguous United States. These were the Animas (southwestern Colorado), the Alapaha (southern Georgia), the Cle Elum (central Washington), and the east fork of the Carson (California-Nevada border). The K-nn downscaling algorithm was repeated for the 15 MRF runs and for the four basins. Since seven ensemble members were generated from each MRF run, the 15 MRF runs yielded a total of 105 ensemble members for each basin. To obtain local estimates of precipitation and temperature, the COOP stations closest to the center of each basin (within a 100 km search radius) were selected, and observed data corresponding to the downscaled dates were used to obtain these estimates. The precipitation and temperature estimates from these 105 ensemble members over 20 years and 14-day forecast lead times were used to evaluate the K-nn downscaling methodology.
 The statistics included seasonal cycles, bias, spatial correlations, and a suite of forecast verification statistics. The K-nn downscaling model in all cases largely captured the seasonal variation of precipitation and temperature. Precipitation biases were generally within 20%, but in many cases (mostly for the climatologically drier Animas basin at longer lead-times) exceeded 20%. This is consistent with the noisy character of precipitation time series. Temperature biases were small and within 0.5°C. Since we used data for all stations on a given day, the K-nn method intrinsically preserves the spatial autocorrelation structure and the consistency between variables. Furthermore, since this method relies solely on the atmospheric model output and does not incorporate any joint relationship between the atmospheric and surface variables, it does not fully preserve the lag-one correlation statistics (not shown). However, postprocessing using an ensemble reordering type method [Clark et al., 2004] can be used to recover this serial correlation. Also, methods based on nonhomogeneous hidden Markov models (NHMM) [e.g., Hughes and Guttorp, 1994; Hughes et al., 1999] to downscale synoptic atmospheric patterns to local scale precipitation have been shown to preserve temporal correlation but fail to preserve spatial correlation.
 Next we evaluated the skill, reliability, and time-variant spread-skill relationships in the downscaled forecast ensembles. The ranked probability skill score (RPSS) was used to verify forecast skill. For precipitation, skill was generally higher in winter than in summer and was present only at short forecast lead times (2–3 days). Temperature RPSS values were around 0.5, and valuable skill was present even up to lead times of 5 days in all seasons. Forecast reliability (conditional bias) was evaluated using reliability diagrams, and we found that the observed relative frequencies of the event (days being in the upper tercile) matched well with forecasted probabilities, and there was very little conditional bias in the forecasts.
 Rank histograms showed that although precipitation ensembles are to an extent noisy, the ensemble spread is nevertheless meaningful. For temperature, the observed value fell outside the ensemble range in about 2% of cases. Next, we analyzed possible spread-skill relationships. We did not find a meaningful relationship that could be used to predict precipitation forecast skill. For temperature, results clearly showed that there is no relationship between the ensemble spread and skill.
 Though regression-based approaches are widely used to extract local-scale information from forecast models [e.g., Antolik, 2000], these methods are not data-driven, they need variable transformations, they do not intrinsically preserve space-time autocorrelations and cross correlations of the downscaled variables, and they cannot be utilized to investigate spread-skill relationships. We did a comparison, however, to test the skill (RPSS) of the K-nn approach with a multiple linear regression (MLR) based downscaling method (see Clark and Hay  and Clark et al.  for descriptions of the MLR method). Results of this comparison are summarized in box plots shown in Figures 16 and 17 for precipitation and temperature, respectively. Given that the K-nn algorithm does not use the joint relationship between forecast model output and station data, the results are extremely impressive. The skill obtained from the K-nn method is competitive with the skill obtained using MLR. The MLR utilizes the joint relationship between surface and atmospheric variables and needs postprocessing to reconstruct the space-time variability between the ensembles (typically the downscaling is done for each station individually). The PCs in the K-nn method also provide a consistent spatial representation, whereas the variables in case of MLR typically change from one station to the other.
 The marginally higher skill seen in the case of MLR also arises because the 15-member ensemble mean from the MRF runs is used as the predictor. Furthermore, the sum of squared errors between observed and downscaled values at each station is explicitly minimized in developing the MLR models. Finally, the K-nn method is computationally efficient and can be readily implemented. One shortcoming of K-nn type algorithms is that values not seen in the historical record are not reproduced. This could be a potential problem if the historical archive is short. Rajagopalan and Lall  describe a modified resampling strategy with perturbations of the historical data to overcome this limitation. The results described here, however, demonstrate the strength of this algorithm and show that it is a viable alternative to other downscaling methods for providing skillful and reliable downscaled forecasts.
 This research was supported by the NOAA GEWEX Americas Prediction Program (GAPP) and the NOAA Regional Integrated Science and Assessment (RISA) Program under awards NA16GP1587 and NA17RJ1229. Thanks are due to two anonymous reviewers for their comments, which improved the manuscript.