Long lead rainfall forecasts are highly valuable for planning and management of water resources and agriculture. In this study, we establish multiple statistical calibration and bridging models that use general circulation model (GCM) outputs as predictors to produce monthly rainfall forecasts for Australia with lead times up to 8 months. The statistical calibration models make use of raw forecasts of rainfall from a coupled GCM, and the statistical bridging models make use of sea surface temperature (SST) forecasts of the GCM. The forecasts from the multiple models are merged through Bayesian model averaging to take advantage of the strengths of individual models. The skill of monthly rainfall forecasts is generally low. Compared to forecasting seasonal rainfall totals, it is more challenging to forecast monthly rainfall. However, there are regions and months for which forecasts are skillful. In particular, there are months of the year for which forecasts can be skillfully made at long lead times. This is most evident for the period of November and December. Using GCM forecasts of SST through bridging clearly improves monthly rainfall forecasts. For lead time 0, the improvement is particularly evident for February to March, July and October to December. For longer lead times, the benefit of bridging is more apparent. As lead time increases, bridging is able to maintain forecast skill much better than when only calibration is applied.
 Long lead rainfall forecasts have great potential to improve planning and management of water resources and agriculture [Everingham et al., 2008]. Many climate modeling centers around the world now routinely produce long lead global rainfall forecasts from coupled ocean-atmosphere general circulation models (GCMs). In Australia, the Predictive Ocean Atmosphere Model for Australia (POAMA) produces rainfall forecasts at approximately 250 km resolution. POAMA (version P2.4) is initialized at the beginning of each month and forecasts are issued for that month (lead time 0) and the next 8 months (lead times 1–8 months).
 GCM forecasts are produced as ensembles to quantify forecast uncertainty and thus enable users to make assessments of risk. However, current GCMs produce ensemble forecasts that are typically biased and underestimate forecast uncertainty [Goddard et al., 2001; Maraun et al., 2010]. The bias in GCM rainfall forecasts arises from model structural error, particularly approximations of subgrid-scale processes. The underestimation of forecast uncertainty arises from the difficulty in capturing all sources of uncertainty and the use of a small number of perturbations, leading to insufficient spread in the ensembles. Therefore, postprocessing is required before GCM forecasts are suitable for use in quantitative modeling (e.g., water allocation and crop modeling) or for planning and management purposes.
 One approach to postprocessing is to apply a statistical model to GCM forecasts to obtain unbiased forecasts that have appropriate ensemble spread [Feddersen et al., 1999; Landman and Goddard, 2002; Schepen et al., 2012a]. This dynamical-statistical modeling approach combines the strengths of dynamical and statistical models. In this study, we use a statistical method to calibrate raw GCM rainfall forecasts against observed rainfall. For the calibrated forecasts to be skillful, the GCMs would need to simulate rainfall well. In this regard, more progress is yet to be made as the resolution of many GCMs is still too coarse to resolve complex topographical effects and convective processes that directly affect rainfall. However, GCMs tend to be better at simulating large-scale processes, such as those affecting the sea surface temperature (SST) [Goddard et al., 2001; Wang et al., 2011]. As the large-scale processes account for rainfall predictability at long lead times, a statistical method can be used to relate GCM forecasts of SST and other large-scale variables to observed rainfall to produce skillful rainfall forecasts. We refer to this method as bridging.
 Both rainfall calibration and bridging have been employed to improve seasonal rainfall forecasts at relatively short lead times. Rainfall calibration models based on raw GCM rainfall have been used to downscale precipitation over the northwestern United States [Widmann et al., 2003] and the European Alps [Schmidli et al., 2006]. Bridging models have been used to forecast precipitation based on the 850 hPa geopotential height field [Landman and Goddard, 2002]. Lim et al. combined raw GCM rainfall, rainfall calibration, and bridging based on mean sea level pressure (MSLP) to forecast spring rainfall in Australia. Friederichs and Paeth compared rainfall calibration and bridging with various predictors, including SST anomalies, MSLP, stream function, and 850 hPa velocity potential, to obtain the most skillful forecasts of seasonal rainfall for Africa.
Schepen et al. [2012b] used calibration and bridging models to forecast Australian seasonal rainfall (3 month totals) at 0 and 1 month lead times. Multiple calibration and bridging models were established using a Bayesian method. Forecasts from the multiple models were merged through Bayesian model averaging (BMA) to maximize forecast skill across season, region, and lead time. This approach avoided the need for model selection and was able to combine the strengths of different models. Bridging increased the coverage of positive skill predominantly in the second half of the year, and the improvement in 1 month lead time forecasts was greater than in 0 month lead time forecasts.
 While forecasts of seasonal rainfall totals are useful, long lead monthly forecasts are needed for many applications. In this paper, we explore the potential of using multiple calibration and bridging models to produce long lead (up to 8 months) forecasts of monthly rainfall for Australia. We use a Bayesian approach to calibrate POAMA rainfall forecasts and to establish a set of bridging models with SST-based predictors. BMA is used to merge forecasts from the multiple calibration and bridging models. Forecast skills and reliability are assessed through cross-validation on historical data.
 Historical monthly rainfall data used in this study are extracted from a gridded data set produced by the Australian Water Availability Project [Jones et al., 2009]. The data are upscaled from 0.05° by 0.05° to 2.5° by 2.5° to match the POAMA grid. POAMA hindcast data are also used in this study.
 POAMA is a coupled ocean-atmosphere dynamical model that is currently run in an experimental mode by the Australian Bureau of Meteorology. The resolution of the ocean model is 2.0° longitude by 0.5° latitude at the equator, decreasing to 2.0° longitude by 2.5° latitude near the poles. The horizontal resolution of the atmospheric model is approximately 2.5° by 2.5°. There are three variants of POAMA (version P2.4) that are used in this study: P2.4a, P2.4b, and P2.4c. Compared to P2.4c, P2.4a and P2.4b have a different parameterization of atmospheric physics associated with shallow convection. P2.4b has an additional flux correction scheme to correct biases in model climatology at longer lead times. The initial conditions of the ocean model in each P2.4 variant are perturbed using PEODAS (POAMA Ensemble Ocean Data Assimilation System) to generate a 10 member forecast ensemble. Further details about P2.4 can be found in Wang et al.
 We use monthly POAMA hindcast data from 1980 to 2009. For every month from January 1980 to December 2009, forecasts for lead times of 0–8 months are produced. A lead time is defined as the period of time between a forecast issue time and the beginning of a forecast target period. For example, the forecasts generated at the start of February 1995 comprise monthly rainfall forecasts for February to October 1995 (lead times 0–8 months).
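The lead time convention can be made concrete with a short sketch. The function name `target_months` and its arguments are hypothetical and are used here only for illustration; the sketch simply enumerates the target month for each lead time from a given issue month.

```python
def target_months(issue_year, issue_month, max_lead=8):
    """Return (lead, year, month) tuples for lead times 0..max_lead.

    Lead time 0 targets the issue month itself; each further lead
    time targets the following calendar month.
    """
    targets = []
    for lead in range(max_lead + 1):
        # Count months on a single axis so that years roll over correctly.
        index = issue_year * 12 + (issue_month - 1) + lead
        targets.append((lead, index // 12, index % 12 + 1))
    return targets


# Forecasts issued at the start of February 1995.
february_1995 = target_months(1995, 2)
```

For the February 1995 example in the text, the first entry is lead time 0 for February 1995 and the last is lead time 8 for October 1995.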
 Calibration and bridging models are established for 123 grid cells that cover the Australian continent. The ensemble mean of 10 member rainfall forecasts from each of the POAMA P2.4 variants is used as a predictor in a rainfall calibration model. Nine climate indices (Table 1) are derived from the ensemble mean of monthly SST forecasts from all POAMA P2.4 variants (30 members). Each climate index is used as a predictor in a bridging model. The particular nine climate indices are chosen because of their connections with rainfall over Australia [Risbey et al., 2009; Schepen et al., 2012a].
Table 1. SST-Based Climate Indices Used as Predictors in the Bridging Models
Climate index | Definition
NINO3 | Average sea surface temperature anomaly over 150°W–90°W and 5°N–5°S
NINO34 | Average sea surface temperature anomaly over 170°W–120°W and 5°N–5°S
NINO4 | Average sea surface temperature anomaly over 160°E–150°W and 5°N–5°S
ENSO Modoki Index (EMI) | C − 0.5 (E + W), where the components are average sea surface temperature anomalies over C: 165°E–140°W and 10°N–10°S; E: 110°W–70°W and 5°N–15°S; W: 125°E–145°E and 20°N–10°S
Indian Ocean West Pole Index (WPI) | Average sea surface temperature anomaly over 50°E–70°E and 10°N–10°S
Indian Ocean East Pole Index (EPI) | Average sea surface temperature anomaly over 90°E–110°E and 0°N–10°S
Indian Ocean Dipole Mode Index (DMI) | WPI − EPI
Indonesia Index (II) | Average sea surface temperature anomaly over 120°E–130°E and 0°N–10°S
Tasman Sea Index (TSI) | Average sea surface temperature anomaly over 150°E–160°E and 30°S–40°S
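As a rough illustration of how box-average indices of this kind can be computed, the sketch below averages gridded SST anomalies over a latitude-longitude box and combines box averages into composite indices. The function names, the dictionary data layout, and the cosine-of-latitude weighting are assumptions made for the example and are not taken from the study. The difference form WPI − EPI for the DMI is the commonly used definition and is likewise assumed here.

```python
import math


def box_mean(anomalies, lat_range, lon_range):
    """Area-weighted mean SST anomaly inside a box.

    anomalies maps (lat, lon) in degrees to an anomaly, with longitudes
    in degrees east (0-360). Weights are cos(latitude), an assumption
    of this sketch.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    total = 0.0
    weight_sum = 0.0
    for (lat, lon), value in anomalies.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            weight = math.cos(math.radians(lat))
            total += weight * value
            weight_sum += weight
    return total / weight_sum


def enso_modoki_index(c, e, w):
    """EMI = C - 0.5 (E + W), from the three box averages."""
    return c - 0.5 * (e + w)


def dipole_mode_index(wpi, epi):
    """DMI as the west pole minus the east pole (assumed standard form)."""
    return wpi - epi


# Toy anomalies; 170W-120W corresponds to 190E-240E.
toy = {(0.0, 200.0): 1.0, (0.0, 220.0): 3.0, (20.0, 200.0): 10.0}
nino34 = box_mean(toy, (-5.0, 5.0), (190.0, 240.0))
```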
3.1. Rainfall Calibration Models
 We apply rainfall calibration models to raw POAMA rainfall forecasts to correct biases and improve ensemble spreads. Each calibration model is based on a Bayesian joint probability (BJP) modeling approach. A brief overview of the BJP modeling approach is provided here, while a detailed description is given by Wang et al. and Wang and Robertson.
 We establish three calibration models that correspond to the three variants of POAMA (P2.4a, P2.4b, and P2.4c) for each forecast issue month, lead time, and grid cell. In each calibration model, the predictor variable (x) is the ensemble mean of POAMA rainfall and the predictand variable (y) is the observed rainfall. The predictor and predictand are each transformed using a log-sinh transform [Wang et al., 2012b] to satisfy the modeling assumptions of normality and homogeneity of variance:

$$\hat{x} = \frac{1}{\beta_x} \ln\left[\sinh\left(\alpha_x + \beta_x x\right)\right]$$

$$\hat{y} = \frac{1}{\beta_y} \ln\left[\sinh\left(\alpha_y + \beta_y y\right)\right]$$

where $\alpha_x$, $\alpha_y$, $\beta_x$, and $\beta_y$ are parameters of the transforms.
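A minimal sketch of the log-sinh transform and its inverse follows, assuming the form z = (1/β) ln[sinh(α + βx)] given by Wang et al. [2012b]. The function names and the parameter values in the example are illustrative only.

```python
import math


def log_sinh(x, alpha, beta):
    """Log-sinh transform; requires alpha + beta * x > 0."""
    return math.log(math.sinh(alpha + beta * x)) / beta


def inv_log_sinh(z, alpha, beta):
    """Inverse transform, mapping a transformed value back to rainfall."""
    return (math.asinh(math.exp(beta * z)) - alpha) / beta


# Round trip for a 50 mm monthly total with illustrative parameters.
z = log_sinh(50.0, alpha=0.1, beta=0.01)
x_back = inv_log_sinh(z, alpha=0.1, beta=0.01)
```

The inverse is what allows samples drawn in the transformed space to be expressed as rainfall amounts.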
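 We then assume that the transformed variables follow a bivariate normal distribution:

$$\begin{pmatrix} \hat{x} \\ \hat{y} \end{pmatrix} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \begin{pmatrix} \mu_{\hat{x}} \\ \mu_{\hat{y}} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{\hat{x}}^2 & \rho\,\sigma_{\hat{x}}\sigma_{\hat{y}} \\ \rho\,\sigma_{\hat{x}}\sigma_{\hat{y}} & \sigma_{\hat{y}}^2 \end{pmatrix}$$

 The model has a total of nine parameters (denoted here as $\theta$) that consist of the transform parameters ($\alpha_x$, $\alpha_y$, $\beta_x$, and $\beta_y$), means ($\mu_{\hat{x}}$, $\mu_{\hat{y}}$), standard deviations ($\sigma_{\hat{x}}$, $\sigma_{\hat{y}}$), and the correlation coefficient ($\rho$).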
 A Bayesian inference of the model parameters (θ) is made using the POAMA hindcast data (1980–2009) and the observed rainfall data. The inference is implemented through Markov Chain Monte Carlo (MCMC) sampling based on the Metropolis algorithm. The rainfall variables (POAMA forecast and observed) are bounded by zero and treated as left censored at zero in the BJP implementation [Wang and Robertson, 2011].
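For readers unfamiliar with the Metropolis algorithm, the following is a generic random-walk sampler for a single scalar parameter. It is a textbook sketch only, not the BJP implementation, which samples all nine parameters jointly and handles censored data. The names `metropolis` and `log_post` are hypothetical.

```python
import math
import random


def metropolis(log_post, start, step, n_samples, rng):
    """Random-walk Metropolis sampler for a scalar parameter.

    log_post returns the log of the (unnormalized) posterior density.
    """
    samples = []
    current = start
    current_lp = log_post(current)
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


# Toy target: a standard normal posterior.
draws = metropolis(lambda v: -0.5 * v * v, 0.0, 1.0, 5000, random.Random(1))
sample_mean = sum(draws) / len(draws)
```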
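 Given a BJP model with parameters $\theta$, the posterior predictive density of a probabilistic forecast for a new event can be described as:

$$f(y_{new} \mid x_{new}; D) = \int p(y_{new} \mid x_{new}; \theta)\, p(\theta \mid D)\, d\theta$$

where $D$ contains the predictor and predictand data used for parameter inference.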
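In practice, a posterior predictive density of this kind can be approximated by sampling. The sketch below takes one draw per posterior parameter sample from the conditional normal distribution of the transformed predictand and back-transforms it to rainfall. It assumes the log-sinh form z = (1/β) ln[sinh(α + βx)], treats negative back-transformed values as zero rainfall, and ignores the censoring of the predictor. All function names and dictionary keys are hypothetical.

```python
import math
import random


def log_sinh(x, alpha, beta):
    return math.log(math.sinh(alpha + beta * x)) / beta


def inv_log_sinh(z, alpha, beta):
    return (math.asinh(math.exp(beta * z)) - alpha) / beta


def conditional_normal(x_hat, mu_x, mu_y, sd_x, sd_y, rho):
    """Mean and standard deviation of y_hat given x_hat (bivariate normal)."""
    mean = mu_y + rho * sd_y / sd_x * (x_hat - mu_x)
    sd = sd_y * math.sqrt(1.0 - rho ** 2)
    return mean, sd


def forecast_ensemble(x_new, theta_samples, rng):
    """One forecast member per posterior parameter sample."""
    ensemble = []
    for th in theta_samples:
        x_hat = log_sinh(x_new, th["alpha_x"], th["beta_x"])
        mean, sd = conditional_normal(
            x_hat, th["mu_x"], th["mu_y"], th["sd_x"], th["sd_y"], th["rho"]
        )
        y_hat = rng.gauss(mean, sd)
        y = inv_log_sinh(y_hat, th["alpha_y"], th["beta_y"])
        # Rainfall is bounded by zero.
        ensemble.append(max(y, 0.0))
    return ensemble


# Illustrative parameter sample repeated in place of real MCMC output.
theta = {"alpha_x": 0.1, "beta_x": 0.01, "alpha_y": 0.1, "beta_y": 0.01,
         "mu_x": -45.0, "mu_y": -45.0, "sd_x": 30.0, "sd_y": 30.0, "rho": 0.4}
members = forecast_ensemble(80.0, [theta] * 200, random.Random(0))
```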
3.2. Bridging Models
 We develop bridging models to take advantage of the ability of POAMA in forecasting large-scale SST patterns, such as those associated with El Niño–Southern Oscillation (ENSO). To demonstrate this ability, Figure 1 shows persistently strong correlation between the POAMA NINO34 index (see Table 1 for definition) and the observed NINO34 index for November at lead times of 0–8 months.
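 We establish nine bridging models that correspond to nine climate indices for each forecast issue month, lead time, and grid cell. In each bridging model, the predictor variable (x) is a climate index and the predictand variable (y) is the observed rainfall. Again, BJP models are used and therefore the details are the same as for calibration models, except that the predictor x is transformed with a Yeo-Johnson transform [Yeo and Johnson, 2000] to allow for negative values:

$$\hat{x} = \begin{cases} \dfrac{(x + 1)^{\lambda_x} - 1}{\lambda_x}, & x \ge 0,\ \lambda_x \ne 0 \\ \ln(x + 1), & x \ge 0,\ \lambda_x = 0 \\ -\dfrac{(-x + 1)^{2 - \lambda_x} - 1}{2 - \lambda_x}, & x < 0,\ \lambda_x \ne 2 \\ -\ln(-x + 1), & x < 0,\ \lambda_x = 2 \end{cases}$$

where $\lambda_x$ is the parameter of the transform.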
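The Yeo-Johnson transform can be sketched as below, following the standard piecewise definition of Yeo and Johnson [2000]. The function name is illustrative.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform, defined for any real x."""
    if x >= 0.0:
        if lam == 0.0:
            return math.log(x + 1.0)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2.0:
        return -math.log(-x + 1.0)
    return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# With lam = 1 the transform is the identity, for either sign of x.
warm_anomaly = yeo_johnson(1.5, 1.0)
cool_anomaly = yeo_johnson(-1.5, 1.0)
```

Unlike the log-sinh transform, this transform accepts negative inputs, which is why it suits SST anomaly indices.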
 The transforms of the predictor and predictand variables and the bivariate distributions of the transformed variables define the calibration and bridging models. The validity of the model assumptions can be checked by examining the fitted marginal distributions of, and the correlation between, the predictor and predictand variables in relation to data, as described in Wang et al. and Wang and Robertson. In this study, checks are made on both the calibration and bridging models for a number of representative grid cells, forecast issue months, and target months. The model assumptions are found to be reasonable (results not shown here).
3.3. Merging Forecasts From Calibration and Bridging Models
 There are three calibration and nine bridging models for each forecast issue month, lead time, and grid cell. We apply a Bayesian model averaging (BMA) method to merge the forecasts from the twelve models. The details of the BMA method are presented in Wang et al. [2012a]. Here, we provide a brief summary of the method. The predictive density of a merged probabilistic forecast from K models is given as:

$$f(y \mid x_1, \ldots, x_K) = \sum_{k=1}^{K} w_k\, f_k(y \mid x_k)$$

where $y$ is the predictand, $x_k$ is the predictor used in model $k$, $f_k(y \mid x_k)$ is the predictive density of model $k$, and $w_k$ is the BMA weight assigned to model $k$.
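A merged forecast of this kind is a weighted mixture of the individual model forecasts. The sketch below evaluates the mixture density at a point and draws a merged ensemble by first choosing a model with probability equal to its weight and then choosing one of that model's members. The function names are hypothetical, and resampling from stored ensembles is an assumption of the example rather than a description of the study's implementation.

```python
import random


def merged_density(densities, weights):
    """Mixture density: sum over models of w_k * f_k(y | x_k)."""
    return sum(w * f for w, f in zip(weights, densities))


def merge_ensembles(ensembles, weights, size, rng):
    """Draw a merged ensemble from per-model ensembles."""
    merged = []
    for _ in range(size):
        # Pick a model in proportion to its BMA weight.
        k = rng.choices(range(len(ensembles)), weights=weights)[0]
        merged.append(rng.choice(ensembles[k]))
    return merged


rng = random.Random(0)
only_first = merge_ensembles([[1.0, 2.0], [99.0]], [1.0, 0.0], 50, rng)
```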
 The BMA weights can be estimated by using a finite mixture model approach based on a direct evaluation of the performance of the merged forecasts [Raftery et al., 2005]. Following the method of Wang et al. [2012a], we apply a symmetric Dirichlet prior for the weights:

$$p(w_1, \ldots, w_K) \propto \prod_{k=1}^{K} w_k^{\alpha - 1}$$

where $\alpha$ is a concentration parameter. When $\alpha > 1$, more evenly distributed weights among the models are encouraged. In this study, we set $\alpha = 1.0 + \alpha_0 / K$, with $\alpha_0 = 1.0$. The use of this prior helps stabilize the weights when they are subject to large uncertainty due to sampling variability and, more importantly, when some of the models are related to each other because the predictors used in the different models are correlated.
 Following Wang et al. [2012a], we use a cross-validation likelihood function instead of the classical likelihood function. Thus, the weights are assigned according to the model predictive abilities rather than fitting abilities. The resulting distribution of the weights can be shown to be proportional to:

$$A = \prod_{k=1}^{K} w_k^{\alpha - 1} \prod_{t=1}^{T} \sum_{k=1}^{K} w_k\, f_k^{(t)}(y_t \mid x_{kt})$$

where $f_k^{(t)}(y_t \mid x_{kt})$ is the cross-validation predictive density of model $k$ for the forecasted event $t$, $t = 1, 2, \ldots, T$. $A$ is maximized using an expectation-maximization (EM) algorithm to find a point estimate of the weights. The weights are set as equal in the initial step. The EM algorithm is iterated until convergence is achieved, defined as the change in $\ln(A)$ being smaller than 0.0001.
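The EM iteration can be sketched as follows. The sketch assumes the standard maximum a posteriori EM updates for mixture weights under a symmetric Dirichlet prior, with the objective ln(A) equal to the log prior plus the cross-validation log likelihood. It takes a precomputed table of cross-validation predictive densities as input. The function names and the toy densities are illustrative, and the details in Wang et al. [2012a] may differ.

```python
import math


def log_objective(weights, dens, alpha):
    """ln(A): log Dirichlet prior kernel plus cross-validation log likelihood."""
    prior = (alpha - 1.0) * sum(math.log(w) for w in weights)
    lik = sum(
        math.log(sum(w * f for w, f in zip(weights, row))) for row in dens
    )
    return prior + lik


def bma_weights(dens, alpha0=1.0, tol=1e-4, max_iter=1000):
    """dens[t][k] is the cross-validation density of model k for event t."""
    n_events = len(dens)
    n_models = len(dens[0])
    alpha = 1.0 + alpha0 / n_models
    weights = [1.0 / n_models] * n_models  # equal weights initially
    previous = log_objective(weights, dens, alpha)
    for _ in range(max_iter):
        # E-step: accumulate each model's responsibility over events.
        resp = [0.0] * n_models
        for row in dens:
            mix = sum(w * f for w, f in zip(weights, row))
            for k in range(n_models):
                resp[k] += weights[k] * row[k] / mix
        # M-step: posterior-mode update under the Dirichlet prior.
        denom = n_events + n_models * (alpha - 1.0)
        weights = [(r + alpha - 1.0) / denom for r in resp]
        current = log_objective(weights, dens, alpha)
        if abs(current - previous) < tol:
            break
        previous = current
    return weights


# Toy case: model 0 consistently assigns higher density to the observations.
toy_weights = bma_weights([[0.9, 0.1]] * 20)
```

Because α > 1, every weight stays strictly positive, so no model is discarded outright even when its predictive densities are small.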
 We merge forecasts from the three calibration and nine bridging models for each combination of forecast month, lead time, and grid cell. We also merge forecasts from only the calibration models to assess the benefit of bridging.
 As an illustration, Figure 2 shows November rainfall forecasts (lead time 0) after merging the rainfall calibration and bridging models. The forecasts are cross-validation forecasts for a grid cell centered at [11.2°S, 132.5°E] in northern Australia. The figure shows the forecast ensemble spreads for different events and compares them against the observed rainfall data.
3.4. Forecast Assessment
 To get an indication of forecast performance for future events, we produce and assess forecasts for historical events in the period 1980–2009 using leave-one-year-out cross-validation. When inferring the parameters of individual models, the predictor and predictand data for the event to be forecasted are excluded. When inferring the BMA model weights, the cross-validation predictive densities for the event to be forecasted are excluded.
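The leave-one-year-out procedure can be sketched generically. In the example below, the `fit` and `predict` callables stand in for any forecasting model. The toy model is a climatology median that ignores the predictor, chosen only because it is easy to verify by hand. All names are hypothetical.

```python
from statistics import median


def leave_one_year_out(data, fit, predict):
    """data maps year -> (predictor, observation).

    For each year, the model is fitted without that year's data and
    then used to forecast the held-out year.
    """
    forecasts = {}
    for held_out in data:
        training = {yr: v for yr, v in data.items() if yr != held_out}
        model = fit(training)
        forecasts[held_out] = predict(model, data[held_out][0])
    return forecasts


def fit_climatology(training):
    """Toy model: median of the training observations."""
    return median(obs for _, obs in training.values())


def predict_climatology(model, predictor):
    """The toy model ignores the predictor."""
    return model


toy = {1980: (0.1, 10.0), 1981: (0.2, 20.0), 1982: (0.3, 90.0)}
cv_forecasts = leave_one_year_out(toy, fit_climatology, predict_climatology)
```

The same exclusion is applied a second time when the BMA weights are inferred, so that the weights for a given event never depend on that event's cross-validation predictive densities.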
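 In evaluating the overall forecast accuracy, we use the root mean square error in probability (RMSEP) [Wang and Robertson, 2011] to summarize the error between forecast medians and observed values over all years. RMSEP is defined as

$$\mathrm{RMSEP} = \left[ \frac{1}{T} \sum_{t=1}^{T} \left( F_{CLI}\left(y_{MED}^{t}\right) - F_{CLI}\left(y_{OBS}^{t}\right) \right)^{2} \right]^{1/2}$$

where $F_{CLI}$ is the cumulative distribution function of the climatological data, and $y_{MED}^{t}$ and $y_{OBS}^{t}$ are the forecast median and observed value for event $t$, $t = 1, 2, \ldots, T$. To help assess forecast skill, we calculate the RMSEP skill score:

$$SS_{\mathrm{RMSEP}} = \frac{\mathrm{RMSEP}_{REF} - \mathrm{RMSEP}}{\mathrm{RMSEP}_{REF}} \times 100$$

where $\mathrm{RMSEP}_{REF}$ is the RMSEP value calculated for reference forecasts, which are taken as the medians of cross-validated climatology distributions.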
 The RMSEP skill score defines the improvement of the merged forecasts over reference forecasts. A skill score of 100 indicates perfect forecasts. A skill score of 0 indicates the forecasts are not better than the reference forecasts (no skill). A negative skill score means the forecasts are inferior to the reference forecasts.
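A small numerical sketch of RMSEP and its skill score follows. It assumes RMSEP is the root mean square difference between the climatological cumulative probabilities of the forecast medians and of the observations, and that the skill score is the percentage reduction in RMSEP relative to the reference. An empirical distribution function is used here for simplicity and is an assumption of the example. Function names are illustrative.

```python
import math


def empirical_cdf(sample):
    """Return a function giving the fraction of sample values <= v."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(v):
        return sum(1 for s in ordered if s <= v) / n

    return cdf


def rmsep(medians, observations, cdf):
    """Root mean square error in probability."""
    squared = [(cdf(m) - cdf(o)) ** 2 for m, o in zip(medians, observations)]
    return math.sqrt(sum(squared) / len(squared))


def skill_score(rmsep_forecast, rmsep_reference):
    """Percentage reduction in RMSEP relative to the reference forecasts."""
    return (rmsep_reference - rmsep_forecast) / rmsep_reference * 100.0


climatology_cdf = empirical_cdf([1.0, 2.0, 3.0, 4.0])
error = rmsep([2.0], [4.0], climatology_cdf)
```

Working in probability space means that an error of a given size in millimetres counts for more where the climatological distribution is dense than in its tails.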
 We also assess forecast accuracy in terms of the continuous ranked probability score (CRPS) of the forecast distributions [Wang and Robertson, 2011]. Conclusions drawn from the CRPS-based skill score are consistent with those drawn from the RMSEP skill score, and the CRPS results are therefore not presented in this paper.
 To assess forecast reliability, we plot a reliability diagram to show how well the forecast probabilities correspond to their observed frequencies [Wilks, 2011]. In this study, we examine the reliability of forecast probabilities of events not exceeding 33.3, 50.0, and 66.7 percentiles of climatology. Again, the percentiles are determined from cross-validated climatology distributions.
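The points of a reliability diagram can be computed as in the sketch below. It assumes the forecast probability of not exceeding a threshold is estimated as the fraction of ensemble members at or below it, and that forecasts are grouped into equal-width probability bins. Both choices, and the function names, are assumptions for illustration.

```python
def non_exceedance_probability(ensemble, threshold):
    """Fraction of ensemble members at or below the threshold."""
    return sum(1 for member in ensemble if member <= threshold) / len(ensemble)


def reliability_points(probabilities, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count) per bin.

    outcomes[i] is True when the event occurred (observation did not
    exceed the threshold). Empty bins are omitted.
    """
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, occurred in zip(probabilities, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[b][0] += p
        bins[b][1] += 1 if occurred else 0
        bins[b][2] += 1
    return [(total / n, hits / n, n) for total, hits, n in bins if n > 0]


points = reliability_points([0.1, 0.1, 0.9, 0.9], [False, False, True, True], 2)
```

For reliable forecasts, the observed frequency in each bin lies close to the mean forecast probability, which places the points near the 1:1 line.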
4.1. Skill at Lead Time 0
 It is reasonable to expect that forecasts at lead time 0 (issued at the beginning of the month for that month) are more skillful than forecasts for the same month at longer lead times. Figure 3 shows the skill of the merged calibration and bridging forecasts at lead time 0. The RMSEP skill score varies with location and forecast target month. In general, forecasts for January to April and October to December have the highest spatial coverage of positive skill. Locations that have skillful forecasts during those periods include north-western, northern, eastern, and southern Australia. There is little skill in the May forecasts.
 Figure 4 shows the proportion of grid cells with RMSEP skill score equal to or greater than 5% for each month. We compare the skill of the merged calibration and bridging forecasts with the skill of merged calibration forecasts. This enables an assessment of the effect of merging the bridging models with the calibration models. The inclusion of bridging models leads to a greater number of grid cells with skill score greater than 5%, improving the spatial coverage of positive skill. The benefit is particularly evident for February to March, July and October to December.
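The spatial coverage statistic used here is simply the share of grid cells whose skill score meets the threshold. A minimal sketch, with a hypothetical function name and toy scores:

```python
def skill_coverage(scores, threshold=5.0):
    """Percentage of grid cells with skill score >= threshold."""
    return 100.0 * sum(1 for s in scores if s >= threshold) / len(scores)


# Four toy grid cells; two of them reach the 5% threshold.
coverage = skill_coverage([10.0, 5.0, 0.0, -3.0])
```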
4.2. Skill of November Forecasts at Lead Times up to 8 Months
 November is often close to the peak of an ENSO event. The merged calibration and bridging forecasts for November have already been shown to be skillful at lead time 0 (see Figure 3), particularly in south-eastern, north-eastern, and central-western Australia. We therefore assess the merged calibration and bridging forecasts for November at various lead times.
 Figure 5 shows that the positive skill of the merged calibration and bridging forecasts for November also extends to longer lead times in these regions. Figure 6 plots the proportion of grid cells with RMSEP skill score equal to or greater than 5% for each lead time, therefore summarizing the spatial coverage of skill for each lead time. At lead times of 0 and 1 month, there is approximately 37% coverage of RMSEP skill score equal to or greater than 5%. The coverage of skill decreases at 2 month lead time, after which the coverage of skill is maintained to 8 month lead time.
 Figure 6 also shows that the merged calibration forecasts have less coverage of positive skill than the merged calibration and bridging forecasts across all lead times. The skill of the merged calibration forecasts decreases sharply from 0 lead time to 1 month lead time, with approximately 15% coverage of RMSEP skill score equal to or greater than 5% at 1 month lead time. Beyond a lead time of 4 months, the merged calibration forecasts have limited coverage of skill.
4.3. Skill for Selected Grids and Skill for Selected Lead Times
 As it is not possible to show forecast skill for all months, grid cells, and lead times, we choose a number of grid cells around the coast of the Australian continent to demonstrate forecast skill at different lead times. Figure 7 shows the locations of the selected grid cells and the RMSEP skill scores of merged calibration and bridging forecasts for all months and lead times. In general, the skill of monthly rainfall forecasts is low. For some grid cells, however, there are months of the year for which forecasts can be skillfully produced even at long lead times. The most obvious are forecasts for October to December for grid cells B and C and November at grid cell H. Forecasts for April to June also show positive skill at long lead times in some grid cells. At grid cell A, skill is observed for most months at lead time 0, but there is little skill beyond lead time of 2 or 3 months.
 We also assess forecast skill at lead times of 4 and 8 months for all grid cells (Figures 8 and 9). Overall, positive skill is seen in only a small fraction of cells at these lead times, despite the use of bridging models. For November and December, however, the positive skill seen at lead time 0 (Figure 3) persists to a lead time of 4 months and even to a lead time of 8 months (Figures 8 and 9).
4.4. Reliability of Merged Forecasts
 We pool all the lead time 0 forecasts from merging the calibration and bridging models for all the grid cells and months to construct a reliability diagram (Figure 10). The observed relative frequencies are shown to be consistent with the forecast probabilities (close to the 1:1 line) of events not exceeding the 33.3, 50.0, or 66.7 percentiles of historically observed monthly rainfall. Therefore, the forecasts have appropriate uncertainty spread and are considered reliable. This conclusion is also valid for other lead times (results not shown).
 The monthly rainfall forecasts in this study have lower skill than the seasonal rainfall forecasts reported by Schepen et al. [2012b], despite using the same techniques for establishing and merging calibration and bridging models. This demonstrates that forecasting rainfall at monthly temporal resolution is more difficult than at seasonal resolution. Monthly rainfall totals are more influenced by weather (e.g., low-pressure systems and fronts over eastern Australia) and faster climate processes (e.g., the Madden-Julian Oscillation over northern Australia). Therefore, monthly rainfall has more noise than seasonal rainfall, leading to less predictability.
 An interesting result has emerged from this study. There are months of the year for which forecasts can be skillfully made at long lead times for some regions in Australia. This phenomenon is most evident for November and December. At this time of the year, SST patterns have a strong relationship with Australian rainfall. As shown in Figure 1, POAMA is skillful at predicting SST patterns during the peak of ENSO in November out to a lead time of 8 months (correlation > 0.8). This helps to explain some of the long lead predictability of rainfall found in this study.
 In this study, we establish multiple rainfall calibration and bridging models to produce monthly rainfall forecasts for Australia with lead times up to 8 months. The models make use of raw forecasts of rainfall and SST from POAMA. The forecasts from the multiple models are merged through BMA to take advantage of the strengths of individual models.
 The skill of monthly rainfall forecasts is generally low. However, there are months of the year for which forecasts can be skillfully made at very long lead times. Using POAMA forecasts of SST through bridging clearly improves monthly rainfall forecasts. For lead time 0, the improvement is particularly evident for February to March, July and October to December. For longer lead times, the benefit of bridging is even more apparent.
 The calibration, bridging, and BMA techniques used in this paper can also be useful for downscaling POAMA outputs to produce monthly rainfall forecasts at finer spatial scales than the grid scale used in this study. As an example, forecast ensembles may be generated for a catchment. The ensembles for different lead times (months) can be sequenced together to give rainfall forecast traces, which can then be used as forcings of hydrological models to produce streamflow forecasts to long lead times. When rainfall forecasts have little skill, the ensembles are essentially stochastic scenarios.
 This work was completed as part of the Water Information Research and Development Alliance (WIRADA) between CSIRO and the Bureau of Meteorology to facilitate the transfer of research to operation. We would like to thank Eun-Pa Lim from the Australian Bureau of Meteorology for valuable discussions and James Bennett for assisting and facilitating the data generation. We also thank the three anonymous reviewers for their comments that have helped improve this paper.