Improving the use of observations to calibrate ensemble spread



The Met Office Global and Regional Ensemble Prediction System (MOGREPS) uses an online inflation factor calculation to calibrate the spread of the ensemble in space and time and counteract the tendency of the Ensemble Transform Kalman Filter (ETKF) to underestimate analysis uncertainty. Until 2008, this calibration mechanism relied entirely on sonde and wind profiler data, and was only applied locally in the Extratropics. By producing more appropriate estimates of the error variance of ATOVS brightness temperature observations, it has become possible to include these in the inflation factor calculation. This in turn provides sufficient data to apply localisation uniformly over the globe. The new scheme improves the latitudinal distribution of spread in comparison to forecast error, especially in the Tropics. Issues remain with the vertical distribution of spread, which are addressed by work to be reported in a future paper. © 2011 Crown Copyright, the Met Office. Published by John Wiley & Sons, Ltd.

1. Introduction

A traditional deterministic weather forecast produces a single estimate of how each output will evolve as a function of time. For many practical applications, what is really needed is an estimate of the distribution of possible outcomes. This allows estimation of the probability of rain, or high winds, or a storm surge, so that decisions can be taken balancing the costs of taking protective action against the expected loss if action were not taken (Richardson, 2000). Estimates of the forecast error are also key to the process of data assimilation. It is hoped that ‘flow-dependent’ error covariances that take account of transitory features such as fronts and inversions will provide better increment structures than traditional climatological covariances (Wang et al., 2008).

Ensemble forecasting systems aim to provide this information by producing not one but several forecasts using slightly different initial states, boundary conditions and model physics, with the aim of sampling the range of forecast results consistent with the uncertainty in observations and the modelling system itself. A small ensemble cannot be expected to reproduce all features of the distribution of possible outcomes, but at a most basic level one can ask whether the typical magnitude of the perturbations (spread) matches the magnitude of the errors on the ensemble mean forecast.

In practice, ensemble systems run with finite ensemble size, limited resolution and imperfect components. These lead to systematic deficiencies in both the generation and evolution of the predicted error distributions. Whilst ensembles try to predict uncertainty by simulating its underlying sources as far as possible, these limitations make it reasonable to ask if the distribution of spread can be improved by calibration against observed forecast error. The need for such calibration is particularly clear in simple transform filters such as that employed by the UK Met Office ensemble, where spurious correlations from the background forecast cause the basic calculation to dramatically underestimate analysis uncertainty.

The spread and error characteristics of various global ensemble prediction systems have been compared by Park et al. (2008). For Northern Hemisphere forecasts of 500 hPa geopotential height, they found that the previous version of the Met Office ensemble was overspread in the short range and underspread in the medium range. The short-range conclusion may be affected by the fact that Park et al. used verification against analyses, which is likely to underestimate forecast error in the short range (Bowler, 2008). For 850 hPa temperature in the Tropics, the Met Office ensemble was underspread at all lead times, which is consistent with results obtained in section 5.1 below. Underspread in the Tropics was observed in many of the systems considered in the study. Exceptions to this rule are the ensembles run by the Japan Meteorological Agency (JMA) and the Meteorological Service of Canada (MSC). The MSC ensemble is also well-calibrated in the Northern Hemisphere Extratropics and at longer lead times, making it notable for the calibration of its spread. The tropical underspread of the European Centre for Medium-range Weather Forecasts (ECMWF) ensemble is one of the targets of current research (Buizza et al., 2008).

This paper describes recent work to improve the horizontal distribution of spread in the Met Office ensemble system. Section 2 summarises the system and the context in which the calibration is performed. The determination of the correct spread is crucially dependent on estimates of observation error, with requirements that differ from those of data assimilation. A variety of methods have been used to obtain more appropriate estimates of the error variance of brightness temperature observations from the Advanced TIROS Operational Vertical Sounder (ATOVS), as described in section 3. Section 4 describes the changes made to improve the calibration of spread as a function of latitude and longitude, with corresponding trial results in section 5. Overall conclusions and suggestions for future work are given in section 6.

2. System background

The Met Office Global and Regional Ensemble Prediction System (MOGREPS) is described by Bowler et al. (2008). Its primary focus is short-range forecasting for the UK, particularly high-impact weather relevant to the protection of life and property. At the time of the work reported here, it consisted of a regional ensemble covering a North Atlantic and European domain at 24 km resolution, nested within a global ensemble with about 90 km resolution at midlatitudes (N144). Both systems used 38 model levels covering the troposphere and stratosphere up to about 40 km. The ensembles contain 23 perturbed members and one unperturbed control forecast. The perturbed members include stochastic parametrizations to sample some of the errors associated with model evolution. The global forecasts run for 72 hours from 0000 and 1200 UTC, and the regional ensemble for 54 hours from 0600 and 1800 UTC. Initial and boundary conditions for each regional ensemble member are interpolated from the corresponding global ensemble member (Bowler and Mylne, 2009). Whilst the experiments reported here focussed on the global ensemble, this coupling should mean that the benefits carry through to the regional ensemble, although this has not been explicitly verified.

The focus on short-range forecasting demands a sampling of initial uncertainty which approximates the structure and magnitude of the error covariance from the very start of the forecast. MOGREPS uses an Ensemble Transform Kalman Filter (ETKF; Wang and Bishop, 2003), which obtains the analysis perturbations as a linear combination of the forecast perturbations from the previous cycle. The ‘transform matrix’ of combination coefficients is chosen so that the analysis perturbation covariance matches that predicted by the Kalman Filter equations, taking account of the impact of the observations supplied to the data assimilation system along with their specified observation errors. The ETKF is part of a class of ‘square root’ filters (Tippett et al., 2003), whose properties are compared to perturbed observation filters by Whitaker and Hamill (2002) and Lawson and Hansen (2004).

MOGREPS centres its perturbations about an analysis interpolated from the separate four-dimensional variational assimilation (4D-Var) system associated with the high-resolution deterministic forecast (Rawlins et al., 2007). This is expected to be superior to anything MOGREPS itself could produce without significantly increased computational cost. MOGREPS uses the spherical simplex formulation of the ETKF (Wang et al., 2004) to ensure the ensemble mean does not degrade the supplied analysis. It should be noted that the ETKF is formally predicting the error distribution for an analysis based on its own low-resolution background, with increments drawn from the background perturbations, whereas 4D-Var processes the high-resolution background with different methodology and many more potential increments. However, results such as Magnusson et al. (2008) suggest that the detailed structure of initial perturbations may have limited impact on ensemble forecast skill beyond the first few hours, provided they are dynamically balanced and have magnitude consistent with analysis errors.

In the limit of a large well-calibrated background ensemble, optimal data assimilation system, correctly specified observation errors, and Gaussian statistics, the ETKF would produce perturbations with the magnitude and structure demanded by the true analysis-error covariance matrix. In practice, the finite size of the ensemble creates spurious correlations. These lead the ETKF to vastly overestimate the impact of each observation. This will affect the structure of the resulting perturbations, but the most obvious and damaging impact is on their magnitude. With all satellite data included, the raw MOGREPS perturbations for a single global ETKF are underspread by about a factor of 70. This would clearly have a huge impact on both forecast probabilities and perturbation development if left uncorrected.

As is common with ensemble filters, the problem of spurious correlations can be mitigated through the use of horizontal localisation (Houtekamer and Mitchell, 1998). In the observation-space formulation of the ETKF, this is achieved by performing a separate ETKF calculation for each local region, using only observations within a certain distance. This amounts to neglecting all background correlations beyond that distance, on the assumption that they are most likely spurious. MOGREPS performs ETKF calculations for a mesh of 92 localisation centres approximately 2500 km apart, interpolating the resulting transform matrices to each gridpoint (Bowler et al., 2009). The original implementation of this technique used a data selection radius of 5000 km, reducing the underspread in the raw transform matrix to a factor of about 25.

To correct the first-order impact of the remaining spurious correlations on analysis ensemble spread, MOGREPS multiplies the raw ETKF perturbations by an ‘inflation factor’. As in Wang and Bishop (2003), this inflation factor is based on a comparison of the ensemble spread to the root mean square (rms) error of the ensemble mean, after subtracting an estimate of observation error. By calibrating the inflation factor, rather than the spread itself, the ETKF remains able to propagate regions of temporarily increased uncertainty from one forecast to the next, subject to the impact of observations. It is important to note that this approach can only ever deduce the inflation factor which should have been used on the previous forecast, from which an extrapolation is made to the inflation factor to be used for the new forecast. This assumes that the fractional underspread produced by the raw ETKF is approximately constant. In practice, it probably varies with the observation distribution, although this should be fairly stable from run to run.
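The essence of this calibration step can be sketched as follows. This is a simplified illustration of the approach described above, not the operational MOGREPS code; all names are hypothetical, and the operational system works with normalised quantities and localised data selection omitted here.

```python
import math

def inflation_factor(innovations, spreads, obs_error_vars, prev_factor):
    """Estimate the inflation factor for the next cycle (simplified sketch).

    innovations    : (observation - ensemble-mean forecast) values
    spreads        : ensemble standard deviation at each observation location
    obs_error_vars : 'pure' observation-error variance for each observation
    prev_factor    : inflation factor applied to the previous forecast
    """
    n = len(innovations)
    # Mean-square innovation = forecast-error variance + observation-error
    # variance, so subtracting the observation term estimates the actual
    # forecast-error variance.
    forecast_err_var = sum(d * d for d in innovations) / n \
        - sum(obs_error_vars) / n
    mean_sq_spread = sum(s * s for s in spreads) / n
    # The ratio of actual to predicted error deduces the factor that should
    # have been used on the previous forecast; extrapolating, the previous
    # factor is rescaled for the new forecast.
    ratio = math.sqrt(max(forecast_err_var, 0.0) / mean_sq_spread)
    return prev_factor * ratio
```

Note that the extrapolation step embodies the assumption stated above: the fractional underspread of the raw ETKF is taken to be approximately constant from one cycle to the next.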

Bowler et al. (2008) describe the precise method by which the inflation factor was derived in the previous MOGREPS system. Because of the requirement for accurate observation errors, the calculation was restricted to ‘sonde’ observations only, a grouping which includes both radiosonde and wind profiler data. Due to limited sonde coverage in the tropical and polar regions, the localised ETKF calculation was only performed in the Extratropics (20–70°N/S), with the global ETKF and associated global inflation factor calculation being used elsewhere. This led to substantial underspread in the Tropics (section 5.1 below), and required an arbitrary division by 2 to constrain perturbations over the South Pole. Addressing these limitations in the inflation factor calculation was the main aim of the work reported in this paper.

Although it was introduced in response to the impact of spurious correlations on the ETKF, the online nature of the inflation factor calculation allows it to compensate for a number of other deficiencies in the ensemble system. By constantly comparing ensemble spread to the actual rms error of the ensemble mean on a local basis, it will counteract overspread and underspread however they arise. In particular, it can make overall statistical allowance for forecast model deficiencies such as the failure to predict some potential developments, or the failure to correctly evolve existing perturbations. Its role is similar to that of statistical post-processing, providing a data-based correction for those errors which the dynamical ensemble is not able to represent correctly. Whilst it would be preferable to correct the underlying system so that its output did not require calibration, such approaches provide a pragmatic way to obtain better forecasts until such underlying improvements are available. Unlike a post-processing system, the inflation factor feeds back directly into the dynamic ensemble, where the calibrated perturbations can grow in the subsequent forecast and propagate into the next cycle. An online system also has the advantage over a one-off calibration that it can sense and automatically respond to changes in performance as the forecasting system evolves, and does not require extensive testing of alternative inflation factors once the basic system has been established.

The use of an online spread calibration is fairly unusual amongst current operational ensemble prediction systems. The ensemble run operationally at the US National Centers for Environmental Prediction (NCEP) uses a rescaling technique to control the size of the initial perturbations (Wei et al., 2008). The regionally-varying rescaling mask is based on an analysis-error estimate from their 3D-Var system, and the perturbations are reduced in magnitude where they exceed this value. The scaling of the initial perturbations for the ECMWF ensemble is essentially fixed using an empirical tuning factor, which is determined off-line (Leutbecher and Palmer, 2008). Although there is a very limited flow-dependence in the perturbation size for the ECMWF system, an offline tuning factor is common to all ensembles based on singular vectors. The ensemble prediction system run at MSC does not use spread calibration (Houtekamer et al., 2005). The initial condition perturbations are determined using a localised EnKF, and these are uniformly inflated before being used in the forecast.

3. Observation-error estimation

To allow localisation to be extended to the Tropics, the polar regions and the vertical dimension, the first requirement is more data suitable for use in the inflation factor calculation. Considering horizontal and vertical coverage, as well as consistency of observing platform and the associated observation errors, the ATOVS series (English et al., 2000) seemed an appropriate choice. ATOVS brightness temperatures depend predominantly on temperature and humidity, with no direct sensitivity to wind. However, the standard ETKF formulation does not include separate inflation factors for different variables, since their relationship is derived from the dynamically-balanced background perturbations. Thus, it is consistent with the basic assumptions of the ETKF to measure the spread/skill relationship for temperature and to assume that the same inflation factor applies to wind, although clearly the best overall compromise will be obtained by taking into account wind observations where these are available.

For all satellite data types, the rms ratio of innovation (observation minus background) to the standard observation errors used for Met Office data assimilation is less than 1. This would imply a forecast-error variance less than zero, which unambiguously indicates that the observation-error estimate must be reduced before these observations can be used in the inflation factor calculation.
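To see why, note that for unbiased, uncorrelated errors the mean-square innovation is the sum of the forecast- and observation-error variances. A small numeric illustration, with made-up values:

```python
# If innovations are normalised by the assumed observation error, the expected
# mean-square value is (forecast_var + obs_var) / obs_var >= 1 whenever the
# assumed obs_var is correct.  An rms ratio below 1 therefore implies the
# assumed observation error is too large.
assumed_obs_sd = 2.0          # error used by data assimilation (hypothetical)
rms_innovation_ratio = 0.8    # rms(innovation) / assumed_obs_sd (hypothetical)

implied_forecast_var = (rms_innovation_ratio * assumed_obs_sd) ** 2 \
    - assumed_obs_sd ** 2
print(implied_forecast_var)   # ~ -1.44 < 0: impossible, so the assumed
                              # observation error must be reduced
```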

3.1. Types of observation-error estimate

In a perfect world, each observation has a single true error variance which should be used by both data assimilation and the inflation factor calculation. In practice, the two systems involve different approximations, are optimised by different values of the observation error, and have different sensitivities to particular types of inaccuracy in its estimation.

Data assimilation combines the background forecast with observations based on their relative information content. This depends on the ratio of their errors, not the absolute value, and if for instance the background-error variances are overestimated, tuning will produce observation-error variances that are similarly overestimated. Although the underlying mathematics of data assimilation are formulated as a statistically optimal interpolation, the ultimate aim is to initialise a nonlinear model forecast, which responds better to some types of perturbation than others. For various reasons, optimal performance might be achieved with ‘incorrect’ observation errors, as observations which the system is less able to handle, or which measure less important parts of the atmospheric state, are downweighted by exaggerating their observation error. By contrast, overestimated observation errors supplied to the inflation factor calculation lead directly to an underspread ensemble.

Most current data assimilation systems neglect correlations in observation error. These can arise when an instrument whose error drifts over time makes a succession of measurements, when parts of a satellite instrument are shared between channels, or because of systematic variations in performance as a function of scan angle, height, etc. Observation-error correlations also arise from the mismatch between model gridboxes and the observation volume (error of representativeness), as illustrated by Liu and Rabier (2002). To avoid overestimating the impact of observations with correlated error (and reduce computational cost), satellite data are routinely thinned and the error variances exaggerated to make up for the neglected correlations. Again, the inflation factor calculation needs the true error variance, without any adjustment relating to neglected correlations.

The net result of these considerations is that the ETKF program must consider two distinct values for each observation error. The ‘Var’ error matches the value used for data assimilation, and is used in the transform matrix calculation. This mimics a data assimilation process, neglecting observation-error correlations in the same way, and needs to estimate the impact of each observation given the error estimate that was actually used by the data assimilation system. In contrast, the inflation factor calculation needs a ‘pure’ estimate of the variance between observations and gridbox-averaged truth, which must be consistent with innovation statistics. It should be noted that this still includes errors of representativeness in addition to instrument and forward-model error, which means that it can in principle vary with the forecast model resolution.

Observation error estimation is not an exact science. Even if instrument and forward-model error can be independently quantified, the complexity of atmospheric forecasting systems and the variability of weather make it difficult to estimate representativeness error without referring to forecast–observation differences. Empirical estimates based on innovations must use some characteristic difference in structure to separate the forecast- and observation-error components (Dee and Da Silva, 1999). Like data assimilation, the assumptions involved in these analysis methods are never fully satisfied, so that there may not even be a consistent correct answer (Dee, 1995), as illustrated by the Var/pure error distinction.

To provide some degree of robustness, and an indication of the level of agreement, a variety of methods, based on different assumptions, have been used to produce observation-error estimates for this project. The full set of methods and results for ATOVS data are discussed in section 3.4 below. The intervening subsections provide further details on the two most complex estimates considered: the first a novel approach based specifically on ensemble data, and the second a more established method commonly used for data assimilation. All of the results presented here are based on data for 13–31 July 2007, although data from August 2007 have also been examined. Since data have been averaged over the whole globe, there is not expected to be a strong seasonal dependence, although this has not been studied explicitly. There has also been no attempt to analyse regional variations in observation error, which might arise from systematic variations in instrument or forward-model performance with scene, or variations in representativeness error with the different types and scales of atmospheric motion.

Whilst this project has considered a range of methods for estimating observation error, there has not been time to pursue every possible approach. Desroziers et al. (2005) describe a method for iteratively estimating observation and background errors from differences between observation, background and analysis fields. This assumes the analysis is produced by assimilating observations into the background forecast, so cannot remain fully consistent in a system such as MOGREPS where the analysis is derived from a separate external forecast. Whilst the Desroziers method could be applied to the deterministic forecasting system, this would capture representativeness errors at its resolution rather than the ensemble resolution. Finally, this method seems most likely to estimate the observation-error variance needed to optimise the data assimilation system, including the effect of neglected correlations, rather than the ‘pure’ error variance needed for spread calibration.

Dee and Da Silva (1999) describe more general applications of the maximum-likelihood approach, but this could be costly and require considerable care in the definition of the observation-error model to be fitted. Dee (1995) provides interesting suggestions for the online estimation of observation errors, formulated in the maximum-likelihood framework. This would avoid relying on fixed estimates that could become out of date, but would require care to establish robustly in an operational system. It was felt that an initial offline study was more appropriate, and necessary as a prerequisite to establish the categories and contingencies that might need to be considered for an online approach.

Unless otherwise stated, the following discussion treats variances as mean square quantities, including any mean component. In this way, the innovation variance represents the total difference between forecasts and the supplied observations, forecast bias is regarded as a component of representativeness error, and the spread is only asked to reproduce the non-systematic component of forecast error.
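Under this convention, the ‘variance’ of a set of departures is simply their mean square, so any bias contributes to it; a minimal illustration:

```python
def mean_square(xs):
    """'Variance' as used here: mean square, with any mean (bias) included."""
    return sum(x * x for x in xs) / len(xs)

departures = [1.0, 3.0, 1.0, 3.0]      # bias 2, centred variance 1
assert mean_square(departures) == 5.0  # = bias**2 + centred variance = 4 + 1
```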

3.2. Spread-skill method

Estimating observation error from innovations is essentially a matter of separating the forecast- and observation-error components. With deterministic forecasts, this requires the existence and modelling of some characteristic difference in the variation of forecast and observation errors as a function of space, time, value or forecast variable. Ensemble forecasts provide a new possibility because they produce their own case-specific estimate of the expected forecast error. If the ensemble were well calibrated, subtracting the mean square spread from the innovation variance would yield the observation-error variance.

In practice, the ensemble is not perfectly calibrated; indeed the magnitude of the spread is precisely what the inflation factor is trying to tune. An alternative estimate of observation error can be obtained by binning a large number of forecast–observation pairs by the forecast spread, and finding the innovation variance in each case. This is similar to a standard verification plot used to evaluate ensemble forecasting systems (e.g. Figure 8 of Wang and Bishop, 2003). The innovation variance should be a linear function of ensemble variance (square of spread) with intercept equal to the observation-error variance. This estimate should be largely independent of the inflation factor of the underlying ensemble, since to first order (neglecting nonlinear feedbacks) that only affects the scale of the spread axis. Forecast errors which are completely uncorrelated with ensemble spread will be aliased into the intercept, which is therefore expected to provide an upper bound on observation error, to be used in conjunction with the other error estimates discussed in sections 3.3 and 3.4. Conversely, if a component of observation error were correlated with spread (perhaps through a common dependence on forecast magnitude), it would not be included in the estimate produced by this method.

The spread/skill method has been applied to the same data that go into the ETKF calculation, i.e. the range from T + 9 to T + 15 hours centred on the 12 h cycle time. T + 12 hours also seems a reasonable compromise between allowing the ensemble perturbations some time to spin up, and ensuring that observation error remains a significant fraction of the innovations. Both spread and innovation have been normalised by the standard observation error used in the Met Office data assimilation system. This reflects the operation of the ETKF code, and emphasises the ratio of the new observation-error estimates to those used for data assimilation.

To reduce disparities in statistical significance, the bins have been defined using quantiles of ensemble spread. Each bin contains about 100 observations, as a compromise between resolving variations in error as a function of spread and the accuracy with which the innovation variance is estimated within each bin. The standard error on the innovation variance of each bin is estimated as V√(2/n), where V is the sample variance and n is the number of innovations from which it was constructed. Strictly, this formula applies to variances without the mean included, but it should provide a reasonable indicative magnitude provided the mean is not too large compared with the standard deviation. It will be an underestimate to the extent that there is correlation between the innovations arising from spatial or temporal correlation in the model or observation errors, although it should be noted that the calculation uses data which have been thinned precisely to reduce such correlations. The linear regression used to estimate observation error is weighted by the estimated error on each innovation variance, with the consequence that it is drawn more strongly to the low-spread bins than the more uncertain high-variance ones.
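The binning and weighted regression described above might be implemented along the following lines. This is an illustrative sketch rather than the operational code; the quantile binning, roughly 100 observations per bin, and V√(2/n) error weighting follow the description in the text, while the function and variable names are hypothetical.

```python
import numpy as np

def spread_skill_intercept(spread, innovation, obs_per_bin=100):
    """Estimate observation-error variance (regression intercept) and the
    gradient (whose square root is the implied inflation factor) from paired
    spread/innovation data."""
    spread = np.asarray(spread)
    order = np.argsort(spread)               # quantile bins = equal counts
    s2 = spread[order] ** 2
    d2 = np.asarray(innovation)[order] ** 2
    n_bins = len(s2) // obs_per_bin
    x, y, w = [], [], []
    for b in range(n_bins):
        sl = slice(b * obs_per_bin, (b + 1) * obs_per_bin)
        v = d2[sl].mean()                    # bin innovation variance (mean square)
        x.append(s2[sl].mean())              # bin ensemble variance
        y.append(v)
        w.append(1.0 / (v * np.sqrt(2.0 / obs_per_bin)))  # 1 / standard error
    # Weighted fit: innovation variance = gradient * spread**2 + intercept
    gradient, intercept = np.polyfit(x, y, 1, w=w)
    return float(intercept), float(gradient)
```

With a well-behaved channel, the intercept recovers the observation-error variance and the gradient is near 1 when the spread is correctly calibrated.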

Figure 1 shows spread/skill results for a ‘workhorse’ Advanced Microwave Sounding Unit (AMSU-A) temperature channel peaking around 700 hPa. The relationship is reasonably linear except for very high spreads, with an intercept clearly distinct from both zero and unity. Inspection of the axis values quickly shows that the ensemble is overspread by about a factor of 2 (4 in variance terms) for this channel.

Figure 1.

Mean square innovation as a function of spread for AMSU-A channel 5 in the global ensemble from 13 to 31 July 2007. Vertical bars show the standard error on each innovation variance, whilst the solid line shows the weighted linear regression used to estimate the observation error.

Figure 2 summarises regression statistics for all of the ATOVS channels that were used by the MOGREPS ETKF in July 2007. The ATOVS package includes three separate instruments, which the Met Office combines into a single numbering system for ease of reference. Numbers 1–20 identify channels 1–20 from the High-Resolution Infrared Sounder (HIRS). Numbers 21–35 correspond to channels 1–15 from the AMSU-A microwave temperature sounder, where higher numbers correspond to weighting functions that peak higher up in the atmosphere. Finally, numbers 36–40 identify channels 1–5 from the microwave humidity sounder (AMSU-B or MHS, depending on the satellite).
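For reference, the combined numbering described above can be expressed as a small mapping; the function name is illustrative, not part of any Met Office interface.

```python
def atovs_channel(combined_number):
    """Map the Met Office combined ATOVS channel number (1-40) to the
    underlying instrument and its native channel number."""
    if 1 <= combined_number <= 20:
        return ("HIRS", combined_number)             # infrared sounder, 1-20
    if 21 <= combined_number <= 35:
        return ("AMSU-A", combined_number - 20)      # microwave temperature sounder
    if 36 <= combined_number <= 40:
        return ("AMSU-B/MHS", combined_number - 35)  # microwave humidity sounder
    raise ValueError("combined ATOVS channel numbers run from 1 to 40")
```

For example, combined channel 25 is AMSU-A channel 5, the channel shown in Figure 1.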

Figure 2.

Summary statistics for ATOVS spread/skill regression, showing (a) intercept, (b) square root of gradient, (c) correlation coefficient and (d) chi-squared normalised by expected degrees of freedom. Dotted vertical lines indicate significance bounds assuming no error correlations, as discussed in the text.

Figure 2(a) shows the estimate of observation error obtained from the intercept of each linear regression. Since the innovations have been normalised by the Var observation-error estimate, values greater/less than 1 suggest the ‘pure’ observation error should be respectively greater/smaller than the value used for data assimilation. These results suggest the observation errors for HIRS, AMSU-B/MHS and the lowest-peaking AMSU-A channels should be significantly decreased. This fits the expectation that the standard observation errors for hard-to-assimilate or more recently introduced channels will have been exaggerated to reduce their impact, an exaggeration which is not appropriate when trying to deduce the true forecast error in the inflation factor calculation.

Figure 2(b) shows the square root of the fitted gradient. This is the factor by which the spread would have to be scaled to produce a unit gradient. It is essentially the inflation factor that would be deduced given data from just this channel and the observation error implied by the intercept—with the caveat that the online inflation factor calculation operates on a single run without dividing the data into differently weighted bins. In this case, the results suggest a halving of spread for most channels, with the notable exception of the AMSU-A channels near the model top, which are significantly underspread. In practice, as will be shown later, the revised system did not produce an overall reduction in spread. This is presumably due to feedbacks changing the distribution of spread, although this has not been investigated in detail.

Figure 2(c) shows the correlation coefficient, whose square indicates the fraction of the variance of the bin innovation variances which can be explained by the linear fit. This measures the power of ensemble spread to predict the forecast-error variance, although it also includes a penalty due to the noise arising from the finite number of contributions used for each bin. The results suggest a strong relationship for most channels, including those which measure humidity, but reducing dramatically for AMSU-A channels towards the top of the model. The dotted vertical lines at the bottom of the plot show the 95th percentile of the distribution of sample correlation coefficients that would be expected for a true correlation of zero. These suggest that there is a significant relationship for all channels, though it is again subject to the caveat that the number of independent contributions will have been overestimated.

Finally, Figure 2(d) shows the chi-squared value for each regression, which compares the scale of departures from the linear fit with the estimated error of the bin innovation variances and the number of bins used. The crosses positioned above/below the horizontal dotted line indicate the 5th and 95th percentiles of the sample chi-squared values expected from an underlying linear relationship with the specified amount of uncorrelated Gaussian noise. Results outside this range suggest something other than noise is contributing to the residuals, such as nonlinear terms or additional variables. The results suggest that there are very few channels that can be completely explained by the linear model, although again the neglect of correlations will mean that the confidence interval is unduly tight. However, the results do clearly highlight channels 32–34 as particularly poorly modelled. This is not surprising, given that they have significant contributions near or above the top of the model. The simulation at these levels is not very physical, being strongly influenced by the artificial rigid lid. To provide the necessary temperature profile near and above the model top, the observation operator moves smoothly from the forecast temperatures below 30 hPa to a profile derived entirely from observations above 10 hPa. The latter of course has no spread or direct relationship to model performance lower down. This indicates that these channels should not be used for calibrating the spread of the column as a whole, since this strong inflationary signal would force overspread at lower levels.
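The normalised chi-squared diagnostic used in panel (d) can be computed as follows; this is a generic sketch, assuming a straight-line fit with two free parameters and using the bin standard errors discussed in section 3.2.

```python
import numpy as np

def normalised_chi_squared(y, y_fit, y_err):
    """Chi-squared of a fit, normalised by its expected value (the number of
    residual degrees of freedom: bins minus the two fitted parameters)."""
    residuals = (np.asarray(y) - np.asarray(y_fit)) / np.asarray(y_err)
    dof = len(residuals) - 2      # straight-line fit: gradient + intercept
    return float((residuals ** 2).sum() / dof)
```

Values near 1 are consistent with pure noise about the linear relationship; values well above 1 suggest nonlinear terms or additional variables, as found for channels 32–34.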

3.3. Innovation covariance method

The innovation covariance method (Hollingsworth and Lönnberg, 1986; Daley, 1993) is an established technique to estimate observation errors for atmospheric data assimilation, particularly for radiosonde data. In its simplest application, as used here, the forecast- and observation-error components of the innovation are separated by the assumption that observation errors are uncorrelated, whilst forecast-error correlation varies smoothly as a function of distance. In this case, the forecast-error variance can be estimated by extrapolating the innovation covariance to zero distance, whilst the remainder of the innovation variance must be due to observation error.

Daley (1993) proposes extensions to this approach to allow for long-range correlations in observation error such as might arise from satellite retrievals using climatological or forecast backgrounds. This method relies on establishing a relationship between multiple observing systems. It is not pursued here since thinned ATOVS brightness temperatures are expected to have much less severe observation-error correlations, and consistency with the other error estimation methods can be used as a sanity check. To the extent that observation-error correlations do exist, they will lead to an overestimate of forecast error and hence an underestimate of observation error. This complements the spread/skill method, which is expected to overestimate the observation error if there are forecast errors which are uncorrelated with ensemble spread. Since observation-error correlations can produce a change in the behaviour of covariances at short distances, the results presented here have been fitted by human judgement rather than attempting to apply an automatic algorithm. Clearly, this approach is only practical whilst the number of categories for which an observation error is estimated remains relatively small.

Figure 3 shows innovation covariance results for the same channel considered in Figure 2. Innovations have again been constructed with respect to the ensemble mean, as the best central forecast provided by the ensemble system. For each 12 h forecast, each distinct pair of differences between the ensemble mean forecast and observations has been assigned to one of 100 bins based on the great circle distance between them (approximating the Earth as a sphere). The plotted quantity is the bin-average product of these innovations. This includes any overall bias, which will therefore not be included in the implied observation error. It is worth noting that the innovation covariance method is dramatically more expensive than spread/skill regression, taking about a day to process half a month of data for all observation types on a 2.6 GHz processor. This is essentially because the calculation is quadratic in the number of observations in each forecast, whilst spread/skill regression has linear complexity.
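The pair-binning computation, and its quadratic cost, can be illustrated with a short Python sketch. This is a simplified illustration rather than the operational Met Office code; the maximum range and function names are hypothetical choices, though the 100-bin count matches the text:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance, approximating the Earth as a sphere (haversine)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def binned_innovation_covariance(lats, lons, innovations, n_bins=100, max_km=4000.0):
    """Bin-average products of distinct innovation pairs by separation distance.

    Extrapolating the resulting curve to zero distance separates forecast
    error (spatially correlated) from observation error (assumed
    uncorrelated).  The double loop over distinct pairs makes the cost
    quadratic in the number of observations per forecast, as noted in the
    text, unlike the linear-cost spread/skill regression.
    """
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins, dtype=int)
    n = len(innovations)
    for i in range(n):
        for j in range(i + 1, n):  # each distinct pair contributes once
            d = great_circle_km(lats[i], lons[i], lats[j], lons[j])
            k = int(d / max_km * n_bins)
            if k < n_bins:
                sums[k] += innovations[i] * innovations[j]
                counts[k] += 1
    # Empty bins are returned as NaN rather than zero covariance
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
```

Since the bin products include any overall bias, the implied observation error excludes it, consistent with the figure. The paper extrapolates the binned curve to zero distance by human judgement rather than an automatic fit.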

Figure 3.

Innovation covariance (including any bias) as a function of distance for AMSU-A channel 5 in the global ensemble from 13 to 31 July 2007. The histogram below the main plot shows the number of millions of observation pairs in each bin.

Since the observations are drawn from a 6 h window, there will in general be a non-zero time interval between the members of each pair. This will reduce the measured covariance slightly, because the temporal component of the correlation is less than unity. Without explicit data on temporal correlations, no attempt has been made to compensate for this effect, but it will tend to reduce the implied forecast error and thus slightly exaggerate the estimate of observation error.

The method seems to work well for practically all the ATOVS channels considered, with a smooth increase in covariance as distance decreases, a clear extrapolation to zero distance, and a clear division of the innovation variance. Indeed, the results appeared superior to equivalent data for sonde observations (not shown), perhaps due to greater problems of sample size, instrument heterogeneity, and representativeness error in that case. Figure 3 shows the sample covariance decreasing at distances below about 100 km. This is close to the 90 km typical grid length of the global model used for this study, so that results below this scale should probably be ignored when trying to evaluate the grid-scale observation error. It is the opposite of the increase in short-range covariances which would be expected from correlated observation errors. This may be explained by the fact that the Met Office ATOVS thinning algorithm does not normally permit observations from the same satellite closer than 154 km (Dando et al., 2007). Thus, all contributions below this distance must come from observation pairs involving two different satellites. Whatever the cause, these short-range covariances have been ignored in the extrapolation to zero distance and, as will be seen below, this produces results which are reasonably consistent with those obtained by other methods.

3.4. Collated results

Figure 4 collates the estimates of observation error for each ATOVS channel. To maximise resolution across channels with different intrinsic accuracies, each result is again normalised by the observation error used by 4D-Var. As above, the statistical data are based on the 13–31 July 2007 period, although results from August 2007 have also been considered in the interpretation. The following traces are included:

Figure 4.

Collated ATOVS observation-error estimates, normalised by the values used within 4D-Var. The individual traces are described in the text. Square brackets denote channels that predominantly sense humidity rather than temperature. Parentheses indicate channels that were later omitted from the inflation factor calculation due to significant contributions near or above the top of the 38-level model.

Innov: The rms innovation between ensemble mean and observations, including any bias. This places a hard upper limit on the observation error. The proximity of other estimates to this trace indicates how the innovation variance is split between model and observation error. Calculations are most accurate when the contributions are approximately equal. In particular, the inflation factor calculation will be very sensitive to noise and the precise value of the observation error if that quantity dominates the innovation variance, as occurs with channel 4 due to the large forecast bias.

SprdSk: The observation-error estimate derived from the intercept of spread/skill regression. As described in section 3.2, this is expected to be an overestimate to the extent that there are forecast errors uncorrelated with ensemble spread. The quality of the underlying spread/skill relationship gives some indication of the reliability of this error estimate for each individual channel. By its nature, this estimate will also include any overall bias between the forecast and observations.

ExtICov: An observation-error estimate derived from the innovation covariance residuals discussed in section 3.3. To make this comparable with the spread/skill results, the overall bias (mean innovation) has been added in quadrature. This amounts to regarding bias as a kind of representativeness error for the purposes of spread calibration, so that the spread only covers the non-systematic component of the forecast error.

ExtRaw: The sum in quadrature of instrument error as measured by built-in calibration, forward model error as measured by comparison with line-by-line models, and forecast bias as measured by mean innovation. These suggest a lower bound for each observation error, excluding the more elusive representativeness component.

Chosen: The observation error (including bias) manually chosen as a compromise between the above estimates, taking account of their different characteristics and the overall needs of the inflation factor calculation.

DbCho: The result of subtracting in quadrature the sample bias for 13–31 July 2007 from the Chosen observation error. Unlike conventional observations, these biases are a substantial fraction of the innovation for some of the temperature channels, and could conceivably vary with season, model upgrades, and other causes. At a relatively late stage, it was decided to subtract the global bias for each channel from the ATOVS innovations on a run-by-run basis, to remove the risk that changes in bias could render the observation-error estimates and derived inflation factor target incorrect. This trace shows the observation-error estimate actually used for subsequent experiments, after removal of the run- and channel-specific bias.
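The quadrature arithmetic used by several of these traces (adding forecast bias into ExtICov and ExtRaw, and subtracting the sample bias to form DbCho) can be sketched as a pair of trivial helpers; these are illustrative, not the operational code:

```python
import math

def add_in_quadrature(*components):
    """Combine independent error components, e.g. instrument error,
    forward-model error and forecast bias for the ExtRaw trace."""
    return math.sqrt(sum(c * c for c in components))

def subtract_in_quadrature(total, component):
    """Remove one component, e.g. subtracting the sample bias from the
    Chosen error to form the DbCho estimate.  Clipped at zero in case
    the component estimate exceeds the total."""
    return math.sqrt(max(total * total - component * component, 0.0))
```

Treating bias this way amounts to regarding it as an independent variance contribution, which is exact only if the bias is uncorrelated with the random error components.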

Two other error estimates were included in the original investigation, but are omitted here in the interests of brevity. The first is the observation error used for 1D-Var retrievals within the observation processing system. This might be expected to be less exaggerated than the value used for 4D-Var, since it has no need to make up for neglected horizontal correlations. The second is the ‘neutral’ observation error needed to preserve the existing spread, obtained by subtracting the mean square spread from the mean square innovation. This represents the observation error implied by meteorological consistency (as encoded in the forecast model) with the spread indicated by all the other observations used in the inflation factor calculation.

The agreement between the SprdSk and ExtICov error estimates shown in Figure 4, based on completely different assumptions and methods of calculation, enhances confidence in both methods. With the exception of channel 32, where the spread/skill relationship is particularly nonlinear, the spread/skill estimate exceeds that based on innovation covariance, as predicted. As before, the results suggest significant reductions from the 4D-Var error estimates for HIRS, AMSU-B/MHS, and the lowest-peaking AMSU-A channels. The AMSU-A values appear more consistent, except near the top of the model where model performance is poor.

The ExtRaw trace places a strong lower bound on many of the observation errors, although the fact that it exceeds the rms innovation for some AMSU-A channels shows that it may be exaggerated in some cases. As a result, several of the Chosen error estimates have been kept closer to the ExtICov–SprdSk range than would be suggested by a literal interpretation of ExtRaw. It was later discovered that the observation processing system averages multiple AMSU spots in order to present HIRS and AMSU data on the same grid. This reduces the effective AMSU instrument error compared to the underlying values used to construct the ExtRaw trace. It also emerged that the HIRS raw errors had been exaggerated due to a failure to correctly account for the natural variability of the calibration coefficients around the orbit. With hindsight, slightly lower values of observation error might therefore have been chosen for channels with large ExtRaw values, more intermediate between the spread-skill and innovation covariance results.

The data used for this investigation did not distinguish between the different ATOVS satellites, so only a single error estimate has been produced for each channel. In practice, the Var error estimates were often identical across satellites, with data from badly performing channels on particular satellites simply being excluded altogether. For the purpose of the work reported here, the new error estimates have been used to set the overall scale of observation error for each channel across all satellites, whilst preserving the ratio between error estimates used for the same channel on different satellites. The online bias removal does not distinguish between satellites since it is ostensibly only removing forecast biases, observation biases having been removed in the satellite pre-processing.

Several of the above results have highlighted high-peaking channels which the 38-level model is not able to handle properly, and whose inflationary signal could be impossible to satisfy without producing overspread in the main body of the model. After comparison of typical weighting functions with the model extent, it was decided to exclude channels 4 and 30–34 from the inflation factor calculation, although they were only removed from the transform matrix at a later date.

4. Revised spread calibration method

One consequence of using multiple observation types to calculate the inflation factor is that they can and do give different answers in some cases. For instance, they may sample different parts of the localisation region, such as fixed locations on land for sonde observations, but distributed observations over sea for ATOVS observations in the lower atmosphere. ATOVS observations are used only in cloud-free regions, whereas sonde observations are used regardless of cloud cover. The different observation types may be dominated by the spread/skill relationship at different heights, whilst it is known for instance that the ensemble is particularly underspread near the surface. Most fundamentally, they may measure different physical variables, either temperature or specific humidity for ATOVS channels, whereas sondes also measure wind. Whilst the model should represent the dynamic relationship between these variables and hence between their errors, it may not do so perfectly.

In the end, a single inflation factor has to be derived for each localisation volume. The relative weight given to different observation groups is a matter of judgement and priorities rather than statistics: for which variables and locations is it most important to get the spread correct? Whilst observation counts are relevant when combining two estimates of the same quantity, estimates of different quantities should not be combined in this manner because there is no way for observations from one group to indicate what further observations from the other group would have shown. Indeed, the answer from such a combination would spuriously fluctuate with the spatial and temporal variations in the observation counts of each group.

To avoid these issues, the new inflation factor calculation divides the observations it uses into categories, where each category is supposed to provide internally consistent sampling. Within a category, observations are accumulated within and between runs weighted by observation count, but categories are combined with fixed weights. (A later enhancement, described in a future paper, modifies this slightly for cases where one or more categories persistently lack observations over many runs, for instance in regions with no sonde observations.) Two categories are currently defined, for sonde and ATOVS data respectively. Neither of these is necessarily homogeneous in its internal statistics, but they provide a readily comprehensible initial implementation, and the subgroup proportions within each category should be reasonably consistent to the extent that each ATOVS observation includes the same set of channels and each sonde observation the same set of levels and variables.

For observations within any one category, the inflation factor which should have been applied to the previous run is estimated as

\[
I_i = I_0 \sqrt{\frac{D_i - 1}{S_i}} \qquad (1)
\]

where $I_0$ is the final inflation factor that was actually used for the localisation volume, $D_i$ is the mean square difference between the ensemble mean and each selected observation in category $i$, and $S_i$ is the mean square spread of the ensemble equivalents of each observation. This is equivalent to Wang and Bishop (2003) equations (17) and (18), and states that the ideal inflation factor would make the rms spread equal to the rms error of the ensemble mean forecast. Only the perturbed ensemble members are included in this calculation, since these are the focus of the ETKF. As noted in section 3.4, the run- and channel-specific global mean difference between the perturbed ensemble members and ATOVS observations is subtracted from the corresponding innovations before the data are passed to the inflation factor calculation, so that the spread is only asked to represent the random portion of the forecast error. As in the transform matrix calculation, the error and spread for each observation are normalised by an observation-error estimate to permit easy combination of observations measured in different units. Unlike the transform matrix calculation, the 'Chosen' pure error estimate from section 3.4 is used, so that $(D_i - 1)$ estimates the normalised error variance of the ensemble mean forecasts. In the rare cases where this comes out negative (due to a very small sample or an excessive observation-error estimate), the corresponding observations have to be ignored.
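As a concrete illustration, the per-category target can be sketched in Python. This assumes the inflation factor scales the perturbation amplitude, so that the spread variance scales with its square; the function name and array interface are hypothetical:

```python
import numpy as np

def category_target_inflation(i0, norm_innov, norm_spread):
    """Estimate the inflation factor that should have been applied.

    norm_innov and norm_spread are the innovations and ensemble spreads
    for one category, each normalised by the 'pure' observation-error
    estimate, so that D - 1 estimates the normalised forecast-error
    variance.  Returns None when D - 1 is negative, in which case the
    observations have to be ignored.
    """
    d = np.mean(np.square(norm_innov))   # mean square normalised innovation
    s = np.mean(np.square(norm_spread))  # mean square normalised spread
    if d <= 1.0:
        return None
    return i0 * np.sqrt((d - 1.0) / s)
```

The square root reflects the assumption that spread scales linearly with the inflation factor, so spread variance scales quadratically.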

To make optimal use of recent observations, a running average target inflation factor is maintained for each category and volume, together with a running average observation count. These are updated using

\[
\bar{n}_i' = b\,\bar{n}_i + n_i, \qquad
\bar{I}_i' = \frac{b\,\bar{n}_i\,\bar{I}_i + n_i I_i}{b\,\bar{n}_i + n_i} \qquad (2)
\]

where $n_i$ is the number of observations for category $i$ in the current run, overbars without primes represent the previous running averages and overbars with primes the updated values. The parameter $b$ exponentially downweights historic observations with half-life $-\ln(2)/\ln(b)$, chosen as 1.0 forecast cycles after trials discussed below demonstrated that the original estimate of 3.0 cycles was insufficiently adaptive. This approach is similar to the 'running mean' inflation factor used by Wang et al. (2008) to reduce the effects of sampling noise, replacing their equally weighted five-day average with an exponential profile that takes account of run-to-run variations in sample count.
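A minimal sketch of this update follows, taking $b = 2^{-1/h}$ for half-life $h$ (equivalent to the stated half-life of $-\ln(2)/\ln(b)$); names are illustrative:

```python
def update_running_average(avg_i, avg_n, new_i, n, half_life=1.0):
    """Exponentially weighted running average of the target inflation
    factor and observation count for one category and volume.

    Historic observations are downweighted by b per cycle, so their
    influence halves every `half_life` forecast cycles; the current run
    is weighted by its own observation count n.
    """
    b = 2.0 ** (-1.0 / half_life)
    new_avg_n = b * avg_n + n
    if new_avg_n == 0:
        return avg_i, 0.0  # no observations yet; keep the previous estimate
    new_avg_i = (b * avg_n * avg_i + n * new_i) / new_avg_n
    return new_avg_i, new_avg_n
```

Runs with no observations still decay the running count, which (as described for the final trial below) lets the system respond more quickly when observations return.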

One of the intended roles of this time smoothing is to damp short-term oscillations in the inflation factor. These can arise when a larger inflation factor is required to increase spread to the correct magnitude than to maintain the spread at that magnitude. For constant observation count, oscillations with a period of four times the half life are damped by a factor of around 1/2, so even such a short half-life achieves significant damping of short-period oscillations. The precise nature of such oscillations and the type of filter needed to damp them depends on the details of how the scaling of initial perturbations propagates to forecast perturbations, and then through the ETKF to raw analysis perturbations. This has not been investigated in detail, but timeseries of inflation factor from individual regions are generally free of noticeable oscillation. The original scheme of Bowler et al. (2008) was not retained due to concern that it may overcompensate for the impact of one forecast cycle on the next, and the difficulty of adapting it to a multicategory calculation that takes account of variations in observation count and consistently handles runs with no observations.

The category average inflation factors are combined into an overall inflation factor I for each localisation volume using a formula which minimises the weighted fractional discrepancy between the overall inflation factor and the category average results:

\[
I = \frac{\sum_i w_i / \bar{I}_i}{\sum_i w_i / \bar{I}_i^{\,2}} \qquad (3)
\]

where the weights $w_i$ represent the relative importance assigned to getting the spread correct for category $i$. In the current implementation, the sonde and ATOVS categories are both assigned equal weight. Since the fractional discrepancies measure the ratio between the predicted resulting spread and the spread desired by each category, this amounts to minimising the weighted mean square fractional overspread or underspread. One characteristic of this formulation compared to a simple arithmetic mean is that more attention is paid to categories producing low inflation factors, since a given absolute discrepancy will be a greater fraction of their desired spread. Some of the earlier trials presented below used a slightly different combination formula, due to a confusion over whether to minimise the fractional discrepancy in $I$ or $1/I$. However, trials in which only this calculation was changed demonstrate that it has limited impact on the overall distribution of spread. This seems reasonable given that the two methods produce different results only to the extent that the two categories disagree about the correct value of the inflation factor.
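The combination can be illustrated as follows, assuming the minimised quantity is the weighted sum of squared fractional discrepancies $w_i((I - \bar{I}_i)/\bar{I}_i)^2$, whose minimiser is the ratio of $\sum_i w_i/\bar{I}_i$ to $\sum_i w_i/\bar{I}_i^2$:

```python
def combine_categories(targets, weights):
    """Combine category inflation factors by minimising the weighted mean
    square fractional discrepancy between the overall inflation factor I
    and each category average target.

    Setting the derivative of sum(w * ((I - t) / t)**2) to zero gives
        I = sum(w / t) / sum(w / t**2),
    which pays more attention to categories with low inflation factors.
    """
    num = sum(w / t for w, t in zip(weights, targets))
    den = sum(w / (t * t) for w, t in zip(weights, targets))
    return num / den
```

With equal weights and targets of 1.0 and 4.0, for example, this yields about 1.18 rather than the arithmetic mean of 2.5, illustrating the pull towards the low-inflation category.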

The new inflation factor calculation retains a final scaling limiter which was present in the previous ETKF code: the inflation factor is capped if necessary to ensure that the rms length of the vectors in the final transform matrix does not exceed a predefined value, usually 1.2. This is intended as a heuristic measure of the ratio between the magnitudes of the analysis and background perturbations. In a perfect system, this ratio is expected to be less than 1, reflecting the impact of observations in reducing background uncertainty. In practice, it is allowed to slightly exceed unity to allow for cases where spread is insufficiently grown by the forecast model, or is just found to be too small in comparison to the observed forecast error. Keeping a scaling close to unity helps to limit the disturbance to model balance and avoid introducing very large perturbations which could cause the forecast model to fail. It also guards against violations of the stationarity assumption which underlies the projection of the inflation factor from one forecast cycle to the next. A classic example of this is where a major observation type, such as ATOVS, drops out for one forecast. This will tend to increase the magnitude of the raw transform matrix by more than the increase in analysis uncertainty, as the impact of the removed observations is overestimated due to spurious background-error correlations arising from small ensemble size. Blind application of the standard inflation factor in such cases would produce excessively large perturbations.
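The limiter might be sketched as below, under the simplifying (and possibly inaccurate) assumption that the rms transform-vector length scales linearly with the inflation factor; the paper does not specify the computation in this detail:

```python
def apply_scale_limit(inflation, rms_transform_length, limit=1.2):
    """Cap the inflation factor so that the rms length of the vectors in
    the final transform matrix, a heuristic proxy for the ratio of
    analysis to background perturbation magnitudes, does not exceed
    `limit` (usually 1.2).

    rms_transform_length is the rms vector length of the transform
    matrix before inflation is applied.
    """
    if inflation * rms_transform_length > limit:
        return limit / rms_transform_length
    return inflation
```

This guards against cases such as a missing observation type inflating the raw transform matrix through spurious background-error correlations, where blind application of the standard inflation factor would produce excessively large perturbations.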

5. Trial results

The revised inflation factor calculation was tested in a series of trials covering the period from 0000 UTC on 2 December 2006 to 0000 UTC on 1 January 2007. Due to time constraints, the forecasts were only run to the T + 15 hours needed to cycle the ETKF. All the statistics presented below are based on the observations assimilated by the next forecast cycle, from T + 9 to T + 15 hours, averaged over forecasts from 10 December onwards, leaving the first eight days for spin-up of the revised perturbations. This permits evaluation of how well the inflation factor calculation is achieving its target of calibrating the spread around T + 12 h in a fully cycling system. To the extent that the changes simply scale the magnitude of the perturbations rather than their structure, it might be expected that spread at longer lead times would change in a similar manner, or at least not be detrimentally affected. However, the introduction of the local ETKF to tropical and polar regions, and the reduction of the data selection radius from 5000 to 2000 km in the later trials to be discussed below, could have a more significant impact on perturbation growth. Whilst this cannot be directly examined from the trials presented here, other work provides some reassurance that the impact should not be detrimental. In the Extratropics, Bowler et al. (2009) found that their localised perturbations generally grew faster than unlocalised ones. The new choice of localisation radius is more similar to that found to be successful in systems such as the Canadian ensemble (Houtekamer et al., 2005). Further work to scale the ensemble spread as a function of height, to be presented in a future paper, has been evaluated for the full 72 h lead time of the global ensemble. Whilst the near-surface spread does not grow as rapidly as in the reference system, it remains larger at all lead times, suggesting that the impact of any dynamical imbalance is not too severe.

5.1. Spread and error

Figure 5 shows mean square spread and ensemble mean forecast error against sonde and ATOVS observations for three selected trials. As in the inflation factor calculation, each contribution has been normalised by the associated pure observation-error estimate. The mean square forecast error was then obtained by subtracting 1 from the mean square normalised innovation.

Figure 5.

Mean square spread (dotted) and ensemble mean error (solid) with respect to (a) sonde and (b) ATOVS observations. All contributions have been normalised by the associated ‘pure’ error estimate and include only those levels/channels used in the inflation factor calculation. The mean square forecast error is then obtained by subtracting 1 from the mean square normalised innovation (compare Eq. 1). The global mean (bias) for each run and channel has been subtracted from the ATOVS innovations, to replicate the input to the inflation factor calculation.

The PS16L26 trial represents the baseline system for the current work. This uses the original Bowler et al. (2008) inflation factor calculation based on sonde data alone. It is almost identical to the ‘Parallel Suite 16’ version of the global ensemble, which was operational between August and December 2007. Against sondes, it has insufficient spread around the Tropics and South Pole, together with overspread at the North Pole. These are all regions where the global transform and inflation factor have been used rather than a local calculation. The calibration with respect to ATOVS data is also poor, with spread exceeding the innovation around 50°S and the North Pole, whilst being several times too small in the Tropics.

ExtInfCa was the first trial to extend localisation to the tropical and polar regions, utilising both sonde and ATOVS data in the inflation factor calculation with a 5000 km data selection radius and a half-life of 3.0 forecast cycles (36 h). Against sondes, the tropical spread is substantially increased, but overspread remains around the North Pole and 60°S, and a new overspread has been introduced around 10°N. Against ATOVS, the spread does not successfully track the forecast error, which has itself increased markedly in the Tropics, presumably due to the excessive perturbations in that region.

It was noticed that the standard transform scale limit of 1.2 was being hit quite often in this trial. If the introduction of ATOVS had really made the inflation factor calculation more reliable, it was suggested that its results should be trusted more, so a further trial (ExtIF2) was run with a limit of 2.0. This value attempts to divide between the transform scale values produced by ExtInfCa under ‘normal’ circumstances, and the very large and spurious values produced by the single cycle within this period for which no ATOVS observations were available. This change (not shown) created several peaks of spread in excess of innovation with respect to sondes and exacerbated the problems with respect to ATOVS, confirming the existence of an underlying problem with the new inflation factor calculation which needed to be resolved.

Figure 6 shows examples of the analysis perturbations to the potential temperature field 980 m above the ground. ExtIF2 produces the largest peak perturbations, appearing within ‘patches’ about 5000 km in scale. Examination of the performance of the new inflation factor calculation for localisation centres around 10°N showed that it was doing a reasonable job of calibrating spread to error averaged over the 5000 km data selection radius. However, this can hide smaller-scale excesses and deficiencies of spread. Since the localisation centres are spaced 2000–2500 km apart, the set of observations used to derive each inflation factor is drawn from a much wider area than that inflation factor actually controls, as illustrated in Figure 7. This breaks the feedback loop needed to maintain control over the system. For instance, a stable arrangement could form in which a localisation centre with excessive spread is surrounded by centres with deficient spread, so that each appears well spread over the 5000 km data selection radius. This is quite unlike the use of localisation for data assimilation purposes, where overlap between localisation centres can be helpful to maintain consistency and minimise disruption to model balance. In the long term, it may be beneficial to use distinct data selection radii for the transform matrix calculation (governed by error correlation length-scales) and the inflation factor calculation (governed by localisation centre spacing).

Figure 6.

Member 1 ETKF perturbation to the level 7 potential temperature field (980 m above the ground) for the final forecast (0000 UTC on 1 January 2007) of the (a) ExtIF2 and (b) ExtQLIF2 trials.

Figure 7.

The MOGREPS localisation grid applied uniformly across the globe by the new system (compare Figure 1 of Bowler et al., 2009). Shading illustrates the data selection region for the localisation centre in the mid-Atlantic. The new 2000 km radius (dark grey) greatly reduces the overlap between localisation centres compared to the previous 5000 km implementation (light grey).

An isosceles triangle construction suggests that a data selection radius of about 2000 km should be sufficient to utilise all available observations for the current localisation centre spacing, whilst largely eliminating the overlap between centres. This radius is also more consistent with that typically used for ensemble data assimilation, and was adopted in all subsequent trials. This produced more uniform perturbations, as shown in Figure 6 for the later ExtQLIF2 trial. However, the first attempt with no other changes (not shown) produced peaks of spread reaching one to three times the innovation variance at T + 12 h. The main cause of this problem appears to be the rather sluggish response of the originally chosen inflation factor half-life of 3.0 forecast cycles. This receives a particularly stern test when the required value of some inflation factors changes abruptly with the reduction in data selection radius. This, ironically, demonstrates the value of an online inflation factor calculation, in that a properly responsive system can adapt to such changes. The inflation factor in most localisation regions stabilised after about ten forecast cycles. For ATOVS, this was sufficiently long for observation background checks to start rejecting the very observations needed to correct the spread. In some regions, this led to the complete rejection of all ATOVS observations after the first few forecasts.

As described in section 4, more careful consideration suggested that a half-life of 1.0 forecast cycles should be sufficient to damp undesirable oscillations whilst providing some memory in cases of low observation count. The final trial, ExtQLIF2, uses this half-life with a 2000 km data selection radius. It also includes the minor correction to the inflation factor combination formula mentioned in section 4, and a provision to include runs with no observations in the running average observation counts (causing the system to respond more quickly to future observations). This system is sufficiently responsive to avoid the previous problems with the ATOVS background check, despite the sudden change in data selection radius. The average inflation factor drops from about 25 in PS16L26 to around 10 in ExtQLIF2, supporting the idea that the reduced data selection radius lessens that component of the inflation factor needed to compensate for the impact of spurious background correlations.

As shown in Figure 5, ExtQLIF2 produces a competitive ensemble mean forecast error, and a spread which does a reasonable job of tracking the variation of forecast error with latitude. Relative to PS16L26, the tropical spread has been increased and the Southern Hemisphere and North Pole spread excesses removed. If anything, the southern Tropics seem slightly overspread, whilst the Extratropics appear underspread with respect to sondes. The latter result is partly due to a conflicting signal from ATOVS, as shown in Figure 5(b). This highlights the fact that sonde and ATOVS data can say different things, because they measure different variables and volumes of atmosphere, and because the observation-error estimates are not perfect. The revised system seems to be producing a reasonable compromise between these conflicting signals, bearing in mind that it can only see the innovation signal averaged over a 2000 km radius, and the dynamic model will only accept some types of structural change.

The operational starting point for all of these trials included some localisation centres with inflation factors appreciably less than 1. Many of these persisted through the early trials, including PS16L26. ExtQLIF2 spontaneously grew these inflation factors for all regions receiving observations, helping to normalise behaviour and improve the geographic calibration of spread. The reduced data selection radius and increased responsiveness of ExtQLIF2 are both required for the inflation factor calculation to see the necessary signal to increase spread in these regions. In some cases, the inflation factor was still stabilising at the end of the month-long trial. Further investigation showed the system was successfully calibrating the spread to the measured error, but that target itself evolved through the trial in response to the new perturbation magnitudes. Results from a follow-on trial before final operational implementation show that these inflation factors do stabilise eventually.

Figure 8 shows the calculated forecast error and spread as a function of height for ExtQLIF2. This suggests an excess of spread in the tropical upper troposphere and a deficiency near the surface. Whilst horizontal localisation has improved the column-integrated spread calibration, the model is not by itself able to produce the correct relationship as a function of height. The lack of surface spread in the northern midlatitudes is particularly important, since this is the main region in which the forecasts are used. A future paper will report further work on vertical localisation of the inflation factor aimed at improving the vertical distribution of spread.

Figure 8.

ExtQLIF2 (a) rms forecast error and (b) spread for sonde observations as a function of latitude and model level, showing the 26 levels used in the inflation factor calculation, up to about 13.5 km. Results have been normalised by the supplied observation error.

A further complication is illustrated in Figure 9, which shows the forecast error implied by sonde temperature observations. There are large regions in the Tropics and upper atmosphere where the implied error variance is less than zero, indicating that the supplied observation error is significantly overestimated. Unless there are corresponding underestimates for other sonde observations within the same localisation region, both the inflation factor calculation and verification will deduce a target spread that is too small, and that a correct model probably could not produce, given that the solution has to be dynamically consistent with other regions. Thus, the apparent tropical overspread with respect to sondes might well be spurious. Similar but less severe artefacts are seen for the wind variables, and to a limited extent for relative humidity. Whilst not as dramatic as the original problem with ATOVS error estimates, these results suggest that the system would benefit from work to more closely match the sonde observation-error estimates to the actual observation-error variance.
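The implied forecast error plotted in Figure 9 follows from the standard innovation-variance identity: if forecast and observation errors are uncorrelated, the innovation variance is the sum of the two error variances. A minimal sketch of this diagnostic (the function name is hypothetical):

```python
import statistics

def implied_forecast_variance(innovations, obs_error_std):
    """Forecast-error variance implied by a set of innovations
    (observation minus forecast), assuming forecast and observation
    errors are uncorrelated, so that
        mean(innovation^2) = var(forecast error) + var(observation error).
    A negative result indicates the supplied observation error
    overestimates the true observation-error variance."""
    mean_sq = statistics.fmean(d * d for d in innovations)
    return mean_sq - obs_error_std ** 2
```

A negative return value corresponds to the dark blue regions of Figure 9, where the supplied observation error exceeds the rms innovation.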

Figure 9.

Rms forecast error implied by the Var error estimates for sonde temperature observations. As in Figure 8, results have been normalised by the supplied observation error. Dark blue indicates regions where the observation error estimate exceeds the rms innovation. Grey indicates bins with fewer than 100 observations in the verification period.

5.2. Probability verification

To examine the impact of these changes in spread on probabilistic performance, statistics such as reliability, resolution, Relative Economic Value and Relative Operating Characteristic (ROC) have been calculated, again based on the available data centred on T + 12 h. These scores are discussed in detail in Wilks (2006). Thresholds at the 10th, 50th and 90th percentiles of local in-sample climatology have been used to provide explicit control over sampling issues, separate behaviour in common and less common situations, reduce ‘false skill’ problems, and permit effective aggregation over different regions and levels. A selection of Brier Skill Scores for different regions and variables is shown in Table I. When no attempt is made to account for the effects of observation error, almost all of the results show ExtQLIF2 to be equal or superior to PS16L26. As expected from the magnitude of the changes in spread, the strongest improvements are seen for tropical and polar regions. Southern extratropical results are more mixed, though the changes are generally small and results with respect to ATOVS are more clearly positive. The global improvement is strongest for the tenth percentile of temperature (the main variable measured by ATOVS), where the increment in Brier Skill Score corresponds to about 3.5% of the original value. In this case, the main change comes from the resolution component, perhaps as the broadened probability distributions start to permit better detection of low probability tails crossing the threshold. In the Tropics, the main improvement tends to come from reliability, as might be more naturally expected for a change in spread.

Table I. Brier Skill Scores for sonde temperature (T), relative humidity (RH), zonal wind (u) and ATOVS brightness temperature (TB) with respect to the 10th and 50th percentiles (pc) of the local in-sample climatology. Only levels and channels used in the inflation factor calculation have been included. For each threshold and region, the first value gives the score for PS16L26, and the second the increment in score to ExtQLIF2.

Threshold    | Global        | N. Extratropics | S. Extratropics | Tropics       | Poles
T < 10 pc    | 0.573 + 0.020 | 0.638 + 0.014   | 0.492 + 0.005   | 0.077 + 0.049 | 0.363 + 0.046
T ≥ 50 pc    | 0.725 + 0.007 | 0.789 + 0.003   | 0.697 − 0.005   | 0.102 + 0.055 | 0.660 + 0.014
RH < 10 pc   | 0.134 + 0.002 | 0.137 + 0.001   | 0.254 − 0.008   | 0.018 + 0.010 | 0.180 + 0.033
RH ≥ 50 pc   | 0.299 + 0.009 | 0.345 + 0.002   | 0.366 − 0.006   | 0.023 + 0.083 | 0.067 − 0.021
u < 10 pc    | 0.510 − 0.003 | 0.582 − 0.009   | 0.463 − 0.006   | 0.039 + 0.031 | 0.546 + 0.004
u ≥ 50 pc    | 0.643 + 0.000 | 0.677 − 0.004   | 0.593 − 0.005   | 0.360 + 0.036 | 0.623 + 0.019
TB < 10 pc   | 0.717 + 0.017 | 0.826 + 0.004   | 0.747 + 0.012   | 0.645 + 0.025 | 0.796 + 0.014
TB ≥ 50 pc   | 0.850 + 0.010 | 0.891 + 0.003   | 0.854 + 0.009   | 0.808 + 0.017 | 0.858 + 0.011
* T < 10 pc  | 0.595 + 0.008 | 0.659 + 0.004   | 0.521 + 0.002   | 0.114 + 0.022 | 0.394 + 0.028
* TB ≥ 50 pc | 0.858 + 0.004 | 0.896 + 0.000   | 0.858 + 0.005   | 0.826 + 0.003 | 0.859 + 0.012

1. N. Extratropics = 20–70°N; S. Extratropics = 20–70°S; Tropics = 20°S–20°N; Poles = poleward of 70°.

2. Rows prefixed with * include simulation of observation error to broaden the forecast probability distribution into a predicted innovation distribution, as discussed in the text.
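For reference, the Brier Score underlying these results is the mean squared difference between forecast probability and binary outcome, and the skill score measures improvement over a constant climatological probability forecast (0.1 for a 10th-percentile threshold, by construction of the in-sample climatology). A minimal sketch, not the operational verification code:

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and the
    binary outcome (1 if the event occurred, 0 otherwise)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def brier_skill_score(probabilities, outcomes, climatology):
    """Skill relative to always forecasting the climatological probability:
    1 is a perfect forecast, 0 matches climatology, negative is worse."""
    bs = brier_score(probabilities, outcomes)
    bs_ref = brier_score([climatology] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref
```

In this convention, an increment such as "+ 0.020" on a PS16L26 score of 0.573 is the change in skill score between trials, about 3.5% of the original value in the strongest global case quoted above.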

Ensemble forecasts aim to predict the distribution of possible outcomes. The distribution of measured values will be broader due to the impact of observation error. Verification which fails to take this into account will unduly favour slightly overspread ensembles, where the excess spread matches the impact of observation error. A second set of verification results has been produced which attempts to avoid this problem by adding independent random numbers to each ensemble member forecast of each observation, drawn from a Gaussian distribution with standard deviation equal to the estimated observation error. This procedure is similar to that used by Saetra et al. (2004).
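This dressing procedure can be sketched as follows: each member's forecast of each observation receives an independent Gaussian draw whose standard deviation equals the estimated observation error, broadening the predicted forecast distribution into a predicted innovation distribution. The function below is an illustrative sketch in the spirit of Saetra et al. (2004), with hypothetical names; the seeded default generator is only for reproducibility of the example:

```python
import random

def dress_forecasts(member_values, obs_error_std, rng=None):
    """Add an independent Gaussian perturbation, with standard deviation
    equal to the estimated observation error, to each ensemble member's
    forecast of an observation."""
    rng = random.Random(0) if rng is None else rng
    return [v + rng.gauss(0.0, obs_error_std) for v in member_values]
```

Because the draws are independent across members and observations, the dressed ensemble variance is (in expectation) the raw ensemble variance plus the observation-error variance, which is why the technique benefits an underspread system more than a well-calibrated one.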

In general, this technique removes much of the apparent difference between the trials, with PS16L26 gaining more benefit from the simulated observation error than ExtQLIF2. This is consistent with the former trial being more underspread than the latter, since the added variance has more impact on the smaller spread. The fact that the ExtQLIF2 scores still improve with observation-error simulation suggests it is not actually overspread. The one region where ExtQLIF2 retains a clear advantage is the poles, where the PS16L26 spread was particularly poor, and overspread in the case of the North Pole, a deficiency that the addition of simulated observation error can only worsen.

The observation-error simulation adds noise, which may be contributing to the apparent homogenisation of the forecasts. It is possible that taking more observation-error samples per ensemble member, or using a deterministic technique such as the dressing applied to account for harmonic tide error in Flowerdew et al. (2010), could restore some of the advantage of ExtQLIF2 over PS16L26. It has also been demonstrated above that at least some of the sonde observation errors are overestimated, leading to excessive adjustment of the ensemble spread. Provided ExtQLIF2 has not become overspread, the superiority demonstrated in threshold statistics without observation-error simulation will be genuine, and the complete suite of verification measures supports this conclusion.

6. Discussion

Ensemble forecasting systems aim to improve decision-making by predicting the distribution of possible outcomes, rather than just a single deterministic realisation. This distribution is obtained by using a dynamic model to predict the consequences of specific sampled sources of uncertainty. Since both the model and the sampling are imperfect, the resulting distribution will be imperfect, and may have systematic deficiencies at particular locations.

Whilst the full multivariate probability distribution can be complex and impossible to completely verify, spatial variations in climatological forecast error are one basic attribute whose reproduction can be evaluated and improved. This directly affects probabilistic scores and should also influence subsequent perturbation development. This article reports the extension of an online scheme for local spread calibration to uniform use across the globe.

As a prerequisite for this extension, the sonde observations previously used to calibrate the spread have been augmented by ATOVS brightness temperatures. This in turn required derivation of new estimates of the actual ATOVS observation-error variance, including the representativeness component but not exaggerations introduced to balance the data assimilation system or make up for neglected observation-error correlations. A variety of methods, including the standard innovation covariance approach and a novel ensemble-based method, have been used to produce a consensus answer and confirm that it is reasonably well constrained for channels with no significant contribution above the model top. Although the sonde observation-error estimates were not modified in the work reported here, the verification exercise highlighted some regions and variables for which the current estimates are greater than the mean square innovation. There was also some inconsistency between the inflation factor signals from sonde and ATOVS data, although this may be partly due to the different variables and heights sampled by the two observing systems. Whilst the sonde observation errors are not as problematic as the original ATOVS values, improving the sonde estimates is clearly a topic for future work.

The scheme presented here couples online measurements of innovation with offline estimates of observation error. Over time, these estimates can become inappropriate, although the largest changes will be associated with model upgrades, for which offline analysis is possible, if time-consuming. For instance, the ATOVS observation errors were recently recalculated for the increase in horizontal and vertical resolution and extent made possible by the new Met Office supercomputer. As expected, this produced some large reductions in forecast bias, and changes to the representativeness component of observation error. Online estimates of observation error could automatically adapt to system changes. The procedures used here required sufficient care and human judgement that this may not be achievable in the near future, although Li et al. (2009) have reported some success in simpler systems using a different single technique.

The revised inflation factor calculation reduces the main problems identified with the match of spread to forecast error as a function of latitude in the previous system, in particular the lack of tropical spread. To achieve this, it was necessary to ensure that the time-averaging between runs was sufficiently adaptive, and that the bulk of contributions to each inflation factor calculation come from within the region which it directly controls. All of the trials (even those with excessive spread) ran completely through the month, with few scientifically-driven forecast failures.

The improvements in spread carry through to some probabilistic scores, particularly for the temperature variable and the tropical and polar regions. Whilst the available trials only provide verification around T + 12 h, other results provide some reassurance that the increased spread can be retained. The distribution of spread as a function of height remains an issue and will be harming these scores. A future paper will report later work to vertically localise the inflation factor calculation, artificially constrain perturbations at the very top of the model for stability reasons, and deal more satisfactorily with centres that receive few or no observations.

Like post-processing systems, the online spread calibration uses observations of recent performance to reduce systematic deficiencies in the underlying dynamic prediction. Whilst this improves the forecasts, it is a rather blunt instrument that introduces spread without knowing where it comes from, or for which situations it will be more or less applicable. Examination of areas in which the calibration is making the largest adjustments could help to target future work on initial condition perturbations and stochastic physics. These should provide more precise perturbations that adapt better to different forecasting situations. As more of the needed spread is generated by explicit modelling of the relevant uncertainties and processes, the amount added by the online inflation factor calculation will automatically reduce.

Whilst spread calibration can improve predictions of the magnitude of forecast uncertainty, it does not directly affect the predicted correlation structure. This has limited impact on standard point-based verification, but will be important for downstream applications that integrate meteorological uncertainty over space and time, such as hydrological (Pappenberger et al., 2008) and storm surge (Flowerdew et al., 2010) modelling. Accurate covariance structures are also a key requirement and anticipated benefit for the use of ensemble estimates of background uncertainty in data assimilation. Diagnostics such as Perturbation versus Error Correlation Analysis (PECA; Wei and Toth, 2003), similarity index and forecast projection index (Buizza et al., 2008) may be helpful for understanding and improving the local structure of ensemble perturbations.

The work presented here already includes some elements which should help to improve the structure of perturbations in the MOGREPS ensemble. The introduction of localisation to the tropical and polar regions should make the transform used in those areas more locally appropriate. The reduction in data selection radius from 5000 to 2000 km made possible by the introduction of ATOVS observations reduces the impact of spurious correlations, as evidenced by the reduction in mean inflation factor from 25 to 10. Attempts to derive climatological background-error covariances for data assimilation from MOGREPS data suggest the perturbation length-scales may still be slightly too large. This may be helped by further reductions in localisation scale, although care will be needed to ensure that genuine correlations are retained. The adaptive localisation proposals of Bishop and Hodyss (2009) may be helpful in this regard, although it is not yet clear whether the benefit would justify their computational cost. Sharper gradients associated with tighter localisation risk increasing disruption to geostrophic balance, harming forecast development. Applying the ETKF to transformed variables such as streamfunction should avoid this problem (Kepert, 2009).

Whilst further work on horizontal localisation may be beneficial, it seems likely that most of the remaining spurious correlations occur in the vertical dimension. Vertical localisation of covariances risks disturbing hydrostatic balance and, unlike the horizontal dimension, there is no clear variable transformation by which this problem could be avoided. Campbell et al. (2010) illustrate further issues when integrated observations such as satellite radiances are localised in observation space rather than model space. However, the success of the Canadian ensemble system (where the impact of satellite observations is localised in observation space using a notional height) suggests these deficiencies are outweighed by the benefits of vertical covariance localisation. Finally, some of the impacts of spurious correlations might be counteracted by improvements to the ETKF calculation itself. The revised ‘bias amelioration’ scheme proposed in the Appendices of Wang et al. (2007) tries to correct the perceived impact of observations on each eigenvector of the background perturbations. Besides reducing the magnitude and variability of the required inflation factor, this should improve the ETKF's filtering of the eigenvalue spectrum, and thereby improve the local structure of the perturbations in ways that a regional inflation factor calculation cannot.

Acknowledgements
The authors thank Nigel Atkinson, Brett Candy, Stephen English, Fiona Hilton and Roger Saunders for discussions and data relating to the satellite aspects of this work.