Novel evaluation of sub‐seasonal precipitation ensemble forecasts for identifying high‐impact weather events associated with the Indian monsoon

We assess the skill of the fully coupled lagged ensemble forecasts from GloSea5‐GC2, for the sub‐seasonal to seasonal (S2S) timescale up to 4 weeks, with the aim of understanding how these forecasts might be used in a Ready‐Set‐Go style decision‐making framework. Integrated Multi‐satellite Retrievals for Global Precipitation Measurement (IMERG‐GPM) are used to seamlessly verify these ensemble forecasts up to monthly timescales whereby forecast and observed precipitation fields are summed over a sequence of increasing lead time accumulation windows (LTAWs), from 1d1d up to 2w2w. Results show that model biases grow with increasing LTAW and with ensemble member age. The S2S model exhibits both wet and dry biases across different parts of the Indian domain. The S2S model error grows from around 10 mm for a 24‐h accumulation to 50 mm for the 2‐week LTAWs. The actual skill and potential skill of the ensemble forecasts reveal that the potential skill is not always greater than actual skill everywhere. The sensitivity to the number and age of ensemble members was tested, with potential skill showing more impact from the exclusion of older members at all LTAWs. We conclude that the older lagged members do not necessarily add value by being included in the short to medium range or even for the extended range forecasts. GloSea5‐GC2 shows some skill in detecting large accumulations, which are not always tied to locations where they are climatologically frequent.


| INTRODUCTION
Precipitation is a diagnosed cumulative by-product of many atmospheric physics processes, and is often used to assess the skill of a modelling system because of its importance to human activity. Heavy rainfall can result in flooding and landslides, while a rainfall deficit can lead to drought, both of which can have a devastating impact on daily life. Nowhere is this more relevant than over the Indian subcontinent. The Indian summer monsoon (ISM), from June to September, is the major rainy season for the Indian subcontinent, contributing around 78% of the total annual rainfall. It varies on subseasonal, intraseasonal, interannual and decadal timescales and beyond (e.g., Saha et al., 2021; Schneider et al., 2014; Turner & Annamalai, 2012; Webster et al., 1998). The total rainfall received during the monsoon period strongly affects the prevalence of natural hazards (e.g., flooding, landslides), food grain and energy production and the gross domestic product (GDP) (Gadgil & Kumar, 2006), and can affect the availability of drinking water, the energy sector, and the livelihoods of millions of people and their livestock. It is therefore crucial to understand the predictability, accuracy and skill of ISM rainfall (ISMR) forecasts across a range of lead times, to enable forecasts to be appropriately used to reduce and manage rainfall-related impacts.
Monsoon predictability is strongly influenced by boundary forcings, such as sea surface temperature (SST), soil moisture and snow cover (Charney & Shukla, 1981; Webster et al., 1998). From a modelling perspective, the accurate simulation and prediction of ISMR remain challenging (Hazra et al., 2017; Kolusu et al., 2014; Pattnayak et al., 2016). Dynamical global atmospheric models are still not able to fully simulate the mean and inter-annual variability of the ISM (Kang et al., 2002; Kolusu et al., 2014; Pattnayak et al., 2016), while sub-seasonal variability also remains challenging (Saha et al., 2021). Subseasonal to seasonal (S2S) prediction of precipitation over India using coupled ocean-atmosphere models is an active area of research that aims to close the recognized gap in pull-through of S2S science into user-orientated forecast applications (Robbins et al., 2019; White et al., 2017). Limited pull-through is often cited as a consequence of the poor skill of such models; however, studies that assessed the predictability of ISMR using different coupled dynamical seasonal prediction systems (Nageswararao et al., 2022; Ramu et al., 2016, 2017) showed moderate skill, with skill improving when the horizontal resolution is increased. This suggests that S2S models could provide useful information for user-orientated forecast applications if applied appropriately. For S2S ensemble forecasts to have value, they need to provide information relevant to the decisions being taken by users at this time range. The information needs to be temporally, quantitatively and locationally specific. Brunet et al. (2010) indicate that this requires S2S predictions to realistically reflect day-to-day weather and extreme events. Currently, the majority of extended range and seasonal forecasts are provided in terms of anomalies, rather than absolute rain volumes, which can be challenging for users to interpret and act on.
Approaches to effectively utilize S2S forecasts within a seamless 'weather-to-climate' approach have been proposed by the Red Cross Red Crescent Climate Centre and the International Research Institute (IRI). The 'Ready-Set-Go!' approach (Goddard et al., 2014) aims to make use of forecasts at different time horizons, aligning them with different user actions to optimize the use of skilful forecasts for preparedness and to reduce losses from extreme weather events. This approach also aligns with forecast-based financing methods that are being increasingly adopted to help the humanitarian sector prepare for and reduce the impacts of severe weather (Coughlan de Perez et al., 2015). In both cases, longer range forecasts provide context for an upcoming season and can support contingency planning activities. As the lead time up to the event reduces, forecasts providing more detail (e.g., magnitude, extent, timing) enable decisions and actions to become more refined and targeted as uncertainties reduce. Of course, for such approaches to be effective, it is important to know the skill of the different forecasts used and the style of forecast output that is most appropriate at each stage (e.g., anomalies or rainfall accumulations), and to associate appropriate actions with each forecast time horizon. In this article, we focus on understanding the skill of quantitative rainfall accumulation forecasts for a selection of predefined temporal windows derived from an S2S model to inform how we might utilize such forecasts in a Ready-Set-Go style framework.
To do this, we adopt a similar approach to that described by Wheeler et al. (2017). Seasonal prediction 'skill' is typically quantified using a correlation coefficient. When computed against observations, it is referred to as 'actual' skill. When model skill is computed against itself it is referred to as 'potential' skill. Previously, Wheeler et al. (2017) and Zhu et al. (2014) examined the actual and potential skill of precipitation forecasts from two global coupled models: the Predictive Ocean-Atmosphere Model for Australia (POAMA) and the 2011 version of the European Centre for Medium-range Weather Forecasts (ECMWF) monthly system across a seamless range of timescales (1 day-4 weeks), the focus being the skill of the ensemble mean. Several recent studies have found that in tropical regions errors start to develop very quickly in the early days of model simulation and persist to seasonal, decadal and climate timescales (Martin et al., 2021 and references therein). Martin et al. (2021) used model sensitivity simulations using the Met Office Unified Model (UM) over the Asian monsoon domain and found that both the Maritime Continent and the oceans around the Philippines play a role in the development of East Asian summer monsoon errors on the short and seasonal timescales, with the ISM region providing an additional contribution. The errors over the ISM region itself appear to arise locally. Therefore, it is imperative to quantify the errors in the models at S2S timescale. This work investigates the precipitation actual and potential pattern skill of the UK Met Office Global Seasonal forecast system version 5 (GloSea5-GC2) over India for up to 4 weeks ahead. This is then extended to also consider more quantitative metrics of accuracy and probabilistic skill to understand what value there might be in forecasts of, for example, 2-week accumulations 3-4 weeks ahead.
We have chosen the 2019 Indian monsoon season (June-September) for testing the model skill at different lead times up to 1 month ahead. The 2019 season yielded 110% of its long period average (LPA) of 880 mm and was the heaviest over India in 25 years. The season can be considered extreme in terms of the amount of rainfall recorded, with peculiar intraseasonal variations. The early season was dry and the end was very wet (Yadav et al., 2020). This was due to the strongest positive Indian Ocean Dipole (IOD) on record (Ratna et al., 2020). It is important to compare and understand the skill of the operational precipitation forecasts during such an extreme monsoon season. In this study, we address the following research questions: (i) What useful skill does the S2S forecast system provide for detecting high-impact weather events? (ii) What is the impact of the number of ensemble members on metrics? These questions will be addressed primarily through an assessment of the skill in the rainfall pattern. The accuracy of rain volumes is also considered without any form of bias adjustment or hindcasts, to establish what the raw system could provide if it were used to drive downstream applications (e.g., integrating S2S rainfall forecasts with hydrological models; White et al., 2015), which require essentially unprocessed model data. The article is organized as follows: data, model and methods are presented in Section 2; Section 3 presents results; finally, discussion and conclusions are presented in Section 4.

| Observation data
The Global Precipitation Measurement (Huffman et al., 2020) Integrated Multi-satellitE Retrievals for GPM (Hou, 2014; Skofronick-Jackson, 2017) final precipitation products are used in this study. The IMERG-GPM data are available at a horizontal resolution of 0.1° × 0.1° (or roughly 10 km × 10 km) with a 30-min temporal resolution. GPM precipitation accumulations are not without errors, but no measurements of precipitation (direct or indirect) are error-free or without uncertainties. One of the biggest issues is spatial representativeness, which is especially acute when comparing a model grid square value to a point measurement. This error is reduced when using a gridded precipitation observation product. There have been a number of regional comparisons between GPM and gauges, for example, Xu et al. (2017), Zhang et al. (2018), Sungmin et al. (2017), Saikrishna et al. (2021). The main issues related to the use of GPM include detection of light precipitation from warm shallow clouds (<5 mm/day); loss of detection over increasingly high mountainous regions (especially above 4500 m); lack of detection in semi-arid regions; and timing differences when using the half-hourly product for sub-daily studies. Despite acknowledged shortcomings, GPM provides a consistent dataset to compare the forecasts and can be aggregated easily to support the methodology used in this analysis. To compare with the model, the GPM data were interpolated using a nearest-neighbour regridding method to the GloSea5-GC2 grid (~93 km at the Equator).

| The GloSea5-GC2 forecast system
GloSea5-GC2 provided operational sub-seasonal and seasonal forecasts from July 2013 until January 2021 (MacLachlan et al., 2015; Scaife et al., 2014). The model uses the MetUM GC2.0 configuration and is a coupled atmosphere-land-ocean-sea-ice ensemble forecast system with a horizontal resolution of 0.8° in latitude and 0.5° in longitude, which translates to ~60 km in the mid-latitudes and ~95 km at the Equator, with 85 vertical levels. The Global Atmosphere component (GA6.0; Walters et al., 2017) is coupled to the Joint UK Land Environment Simulator (JULES; Best et al., 2011) and is forced with data from the Japanese re-analysis (JRA-55), NEMO (Nucleus for European Modelling of the Ocean; Madec, 2008) and the Los Alamos Sea Ice Model (CICE) (Hunke & Lipscomb, 2010). A stochastic kinetic energy backscatter scheme (SKEB2; Bowler et al., 2009) is used to introduce small grid-level perturbations throughout the integrations to enhance ensemble spread. The forecasting system comprises three parts: (i) sub-seasonal, (ii) seasonal and (iii) hindcast, also called historical re-forecast. GloSea5-GC2 has four members initialized at 00 UTC every day. Two members have a forecast horizon of 64 days (sub-seasonal forecast system), while another two members have a forecast horizon of 216 days (seasonal forecast system). These four members are used to generate a 40-member ensemble forecast with 10 days of lag time. We use the 40 'time-lagged' members in this analysis. For week one, forecast skill is primarily derived from the initial conditions, whereas at the longer range (e.g., seasonal scale), forecast skill is derived from external forcing (such as SSTs). It is also worth noting that the GloSea5-GC2 model is fundamentally designed for seasonal forecasting, and for this purpose, the 10-day lagging is considered to be appropriate. Forecasts from this system are unlikely to be used for the short or medium range, and the purpose of this paper is not to advocate such use.
However, to understand the utility of such models in a seamless decision-making framework, it is worth understanding the boundaries (in terms of skilful time horizons) of the GloSea5-GC2 model by assessing its skill from near-initialization, over a range of timescales, from daily to monthly, and identifying where useful information might be available and potentially useful to users. A detailed description of model configuration is presented in Table 1.
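The construction of the 40-member time-lagged ensemble described above (four members per day, 10 days of lag) can be illustrated with a minimal sketch. The function name, the tuple layout and the newest-to-oldest ordering are assumptions made for illustration only; the operational system's own bookkeeping is not described in the text.

```python
from datetime import date, timedelta

MEMBERS_PER_DAY = 4  # from the text: four members initialized at 00 UTC each day
LAG_DAYS = 10        # from the text: 10 days of lag, giving 40 members

def lagged_ensemble(latest_init):
    """Return (init_date, member_index, age_days) for every lagged member.

    Ordering (newest first) is an assumption made for this sketch.
    """
    members = []
    for age in range(LAG_DAYS):
        init = latest_init - timedelta(days=age)
        for index in range(MEMBERS_PER_DAY):
            members.append((init, index, age))
    return members

members = lagged_ensemble(date(2019, 6, 10))
newest_four = members[:4]  # subset of the most recently initialized members
print(len(members), newest_four[0], members[-1])
```

With this ordering, retaining only the first 4, 8, 12 or 23 entries corresponds to discarding progressively fewer of the oldest members.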

| Methods
Precipitation is a cumulative field driven by many large-scale and local-scale atmospheric processes. As a result, most operational institutes assess the skill of their precipitation forecasts across a range of accumulation windows and forecast lead times, from sub-daily to seasonal. In this study, we employ a similar method to that used by Wheeler et al. (2017) and Zhu et al. (2014) to understand the skill of rainfall forecasts across increasing lead times and accumulation windows. The properties of the forecasts are assessed to determine whether the model can provide useful information on potentially impactful weather events more than 1-2 weeks in advance, here without the use of any bias correction. A schematic of the different lead times and accumulation windows used is shown in Figure 1. The abscissa shows the forecast lead time. Each arrow represents a different lead time/accumulation window (LTAW), each with a unique identifier, for example, '1d1d'. The first two characters provide the lead time, the second two provide the accumulation window or length going forward in time from the initiation time (see Figure 1). Note that '1d1d' implies a forecast with 1-day lead time (i.e., from t + 24 h onwards) and 1-day (24 h) accumulation, that is, the accumulation between t + 24 h and t + 48 h, and it is referred to as 'day2'. Similarly, '2d2d' implies a forecast with a 2-day lead time (i.e., from t + 48 h onwards) and 2 days (48 h) accumulation, that is, the accumulation between t + 48 h and t + 96 h. The longest forecast lead and accumulation window considered was '2w2w', which is approximately a monthly forecast. These LTAW combinations provide a seamless transition from weather to climate timescales. The methods used in this study are applicable to the identification and verification of high-impact weather events.
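The LTAW naming convention can be made concrete with a short sketch that converts an identifier into the start and end of its accumulation window in hours after initialization. The function name `ltaw_window` and the regular expression are illustrative assumptions; only the meaning of the labels comes from the text.

```python
import re

_UNIT_HOURS = {"d": 24, "w": 7 * 24}

def ltaw_window(label):
    """Convert an LTAW label such as '1d1d' to (start_hour, end_hour)."""
    match = re.fullmatch(r"(\d+)([dw])(\d+)([dw])", label)
    if match is None:
        raise ValueError(f"not an LTAW label: {label!r}")
    lead_n, lead_unit, acc_n, acc_unit = match.groups()
    start = int(lead_n) * _UNIT_HOURS[lead_unit]      # lead time
    end = start + int(acc_n) * _UNIT_HOURS[acc_unit]  # lead time + window
    return start, end

for label in ("1d1d", "2d2d", "3d3d", "4d4d", "1w1w", "2w2w"):
    print(label, ltaw_window(label))
```

For '1d1d' this gives the t + 24 h to t + 48 h window and for '2d2d' the t + 48 h to t + 96 h window, matching the definitions above; '2w2w' ends 28 days after initialization.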
For each LTAW combination (1d1d up to 2w2w), the actual and potential skill of the precipitation forecasts is calculated in addition to domain averages, mean error, root mean squared error (RMSE) and ensemble metrics. Actual skill is calculated using the correlation of the ensemble mean rainfall with the observed GPM-IMERG at each grid point (hereafter referred to as CORa) and for all LTAWs. Similarly, the potential (or 'perfect predictability') skill is derived by systematically taking a single ensemble member (i.e., replacing the GPM-IMERG observations), designating this as the reference, and correlating it with the ensemble mean computed from the remaining members. This calculation is repeated for all other members (e.g., 40 times for GloSea5-GC2), following a similar approach to previous studies (e.g., Becker et al., 2013; Boer et al., 2013; Buizza, 1997; Holland et al., 2013; Rodwell & Doblas-Reyes, 2006). These different assessment techniques are applied over the Indian domain and the five homogeneous climatic regions defined by the India Meteorological Department (IMD): Central India (CEI), North East India (NEI), North West India (NWI), South Peninsula India (SPI) and West Central India (WCI) (Figure 2). The climatic regions are used to calculate the regional skill of the model forecasts and are similar to those of Ramu et al. (2017). Climatologically, the coastal parts of SPI and WCI (lying along the windward side of the Western Ghats) and some parts of NEI receive high annual rainfall, whereas NWI and the foothills of the Himalayas receive lower amounts of rainfall.
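The actual and potential skill calculations described above can be sketched with NumPy as follows. This is an illustration only, not the authors' code: it assumes forecasts stored as an array of shape (member, time, y, x) and observations as (time, y, x), with the correlation taken over the time axis at each grid point; the function names are hypothetical.

```python
import numpy as np

def corr_over_time(a, b):
    """Pearson correlation along axis 0 (time) at every grid point."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return (a * b).sum(axis=0) / denom

def actual_skill(fcst, obs):
    """Correlate the ensemble mean with the observations (CORa)."""
    return corr_over_time(fcst.mean(axis=0), obs)

def potential_skill(fcst):
    """Leave-one-out: each member is the reference in turn and is correlated
    with the mean of the remaining members; the correlations are averaged."""
    n_members = fcst.shape[0]
    total = fcst.sum(axis=0)
    scores = []
    for k in range(n_members):
        rest_mean = (total - fcst[k]) / (n_members - 1)
        scores.append(corr_over_time(rest_mean, fcst[k]))
    return np.mean(scores, axis=0)

# Synthetic demonstration: a shared signal plus member-specific noise.
rng = np.random.default_rng(0)
signal = rng.gamma(2.0, 4.0, size=(60, 5, 6))
fcst = signal + rng.normal(0.0, 3.0, size=(8, 60, 5, 6))
obs = signal + rng.normal(0.0, 3.0, size=(60, 5, 6))
print(actual_skill(fcst, obs).mean(), potential_skill(fcst).mean())
```

Grid points where either series is constant in time would give a zero denominator; a production version would need to mask those points.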

| RESULTS
First, the entire India domain averages (as covered in Figure 2) are explored to consider the bulk properties of the forecast over the region. This is followed by inspecting the biases (mean error) and RMSE of the ensemble mean to provide information on the ensemble characteristics as a function of forecast lead time, before examining the pattern skill.

TABLE 1 Detailed description of the GloSea5-GC2 modelling system configuration differences used in this study.
Model: GloSea5-GC2
Description: UK Met Office Global Seasonal forecast system version 5
Convection scheme: Mass flux convection scheme

| Domain averages
A simple but effective way of exploring the large-scale biases in the coupled modelling system is to compare the domain averages for each of the LTAWs ( Figure 3). To compare the different accumulation windows, the windows longer than 24 h are converted into an equivalent daily value to match 1d1d magnitudes, that is, for 2d2d, the daily domain average is divided by 2, for 3d3d by 3, for 4d4d by 4, etc. Figure 3 shows the daily accumulation values for both the GloSea5-GC2 forecasts and GPM observations reducing with increasing LTAWs. As might be expected, the profile of daily accumulations is smoother for longer LTAWs than shorter LTAWs. Figure 3 suggests that GloSea5-GC2 forecasts can capture the southwest monsoon period at longer lead times (1w1w and 2w2w), as there are similarities between the GloSea5-GC2 forecasts and GPM observations. The GloSea5-GC2 rainfall forecasts are notably less temporally variable compared with GPM for 1d1d and 2d2d, due to our comparison using the ensemble mean rather than individual members, and also the inclusion of lagged members used in the ensemble mean computation.
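The scaling to an equivalent daily value and the correlation between the forecast and observed domain-average series can be sketched as below. The unweighted spatial mean and the function name are assumptions for illustration; the text does not state whether any area weighting was applied.

```python
import numpy as np

def domain_mean_daily(window_totals, window_days):
    """window_totals: (time, y, x) accumulations in mm over one window.

    Returns a (time,) series of domain means scaled to mm/day.
    Assumption: simple unweighted mean over the grid points.
    """
    totals = np.asarray(window_totals, dtype=float)
    return totals.mean(axis=(1, 2)) / window_days

# Synthetic 2-week totals for forecast and observations.
rng = np.random.default_rng(0)
obs_totals = rng.gamma(2.0, 50.0, size=(40, 4, 5))
fc_totals = obs_totals + rng.normal(0.0, 20.0, size=obs_totals.shape)

obs_series = domain_mean_daily(obs_totals, window_days=14)
fc_series = domain_mean_daily(fc_totals, window_days=14)
r = np.corrcoef(fc_series, obs_series)[0, 1]  # temporal correlation
print(round(float(r), 2))
```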
However, the domain average temporal correlations for 2d2d are higher (0.73) than other lead windows and lower for 2w2w (0.36). GloSea5-GC2 appears to exhibit a lag such that while the peaks observed in the GPM are present in the forecast, they appear to be slightly delayed, particularly at the longer LTAWs. This is because any timing errors become accentuated as the accumulation window length increases. As is often the case, the model appears to perform less well at simulating the rainfall variability during the onset and end of the monsoon period at all the LTAWs. The biases and magnitude of the errors of the ensemble mean are explored further in the following section.

| Accuracy of the ensemble mean
The mean error or bias (Model-GPM) of 2019 June to September precipitation for the 1d1d LTAW is shown in Figure 4. To highlight the difference in ensemble construction, the biases are shown for each ensemble member. GloSea5-GC2 shows a wet bias over portions of the Indian land mass (North Eastern Himalaya; West coast; East-central India), the Southern Arabian Sea and the Southern Bay of Bengal (BoB). These get more pronounced for the other LTAWs (not shown). This wet bias is present for the most recently initialized (day 1) GloSea5-GC2 members (see the first set of 4 plots on the top row) and becomes progressively larger and drier as the age of the ensemble members increases from top left to bottom right. The wet bias to the south of India is very clear across the GloSea5-GC2 plots, as is the growing dry bias west and east of the Indian subcontinent. Most of the wet bias over land disappears for GloSea5-GC2 as the ensemble members get older. The dry biases are observed over the Northern BoB and Arabian Sea. These dry biases may be due to the reduced formation and progression, or development, of synoptic systems in the model. The biases observed at 1d1d tend to become more pronounced for the other LTAWs, with the error patterns growing in magnitude and spatial extent with increasing LTAWs (not shown).

FIGURE 1 [...] 2017) but focused on the short to medium range, with the accumulation windows increasing in length with increasing lead time. Therefore '1d1d' implies a forecast with 1-day lead time (i.e., from t + 24 h onwards) and 1-day (24 h) accumulation, that is, the accumulation between t + 24 h and t + 48 h.
In order to quantify the precipitation volume errors in the lagged ensembles (to complement the mean error), we calculated the LTAW RMSE for each ensemble member over all model grid points in the Core Monsoon Zone (CMZ) region (18°-28°N; 65°-88°E) for the 2019 June-September season, as shown in Figure 5. The magnitude of RMSE values for individual ensemble members grows with LTAW as longer accumulation windows can contain larger rain volumes, and larger rain volumes can produce larger errors. The mean GPM rainfall over the CMZ at all LTAWs is 8.5 mm/day. Similarly, the GloSea5-GC2 mean rainfall over the CMZ region across the 40 ensemble members ranges between 9 and 6 mm/day, 8.8-5.3 mm/day, 8-4.7 mm/day, 7.5-4.4 mm/day, 6-4 mm/day and 4-3.5 mm/day for 1d1d, 2d2d, 3d3d, 4d4d, 1w1w and 2w2w, respectively. Figure 5 shows higher RMSE variation across the ensemble members at shorter LTAWs (i.e., 1d1d, 2d2d), with an increased RMSE for the larger LTAWs, though there is a distinct increasing trend in RMSE for 3d3d and especially 4d4d towards the older ensemble members due to differences in initialization. The oldest members in the GloSea5-GC2 ensemble from 3d3d onwards have significantly larger RMSE, and this larger error will translate to any ensemble metrics, but also the ensemble mean, and consequently any quantitative skill in the ensemble mean (which is examined in the following section). The RMSE of GloSea5-GC2 across all ensemble members for 1w1w and 2w2w is flatter, with higher mean errors compared with other LTAWs. Furthermore, it is hypothesized that the skill of the GloSea5-GC2 ensemble mean could potentially be improved by excluding some of the oldest members. This will be explored in subsequent sections, where only the first 23, 12, 8 and 4 members are retained. The reduction in RMSE variability with increasing LTAW can be attributed to the increased smoothing of any day-to-day variability in the 1w1w and 2w2w accumulations.
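A per-member RMSE over the CMZ box can be sketched as follows. The array layout (member, time, lat, lon), the pooling of squared errors over both time and grid points, and the function name are assumptions made for illustration; only the box limits are taken from the text.

```python
import numpy as np

# Core Monsoon Zone limits from the text (degrees north / east).
CMZ = {"lat_min": 18.0, "lat_max": 28.0, "lon_min": 65.0, "lon_max": 88.0}

def member_rmse(fcst, obs, lat, lon, box=CMZ):
    """RMSE of each member against observations inside a lat/lon box.

    fcst: (member, time, lat, lon); obs: (time, lat, lon);
    lat, lon: 1-D coordinate arrays. Returns an array of shape (member,).
    """
    lat_sel = (lat >= box["lat_min"]) & (lat <= box["lat_max"])
    lon_sel = (lon >= box["lon_min"]) & (lon <= box["lon_max"])
    f = fcst[:, :, lat_sel][:, :, :, lon_sel]
    o = obs[:, lat_sel][:, :, lon_sel]
    err = f - o  # observations broadcast across the member axis
    return np.sqrt((err ** 2).mean(axis=(1, 2, 3)))

# Synthetic check: member k is the observations plus a constant offset k.
lat = np.arange(5.0, 40.0, 1.0)
lon = np.arange(60.0, 100.0, 1.0)
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 4.0, size=(10, lat.size, lon.size))
fcst = np.stack([obs + k for k in range(3)])
print(member_rmse(fcst, obs, lat, lon))
```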

| Pattern skill of the ensemble mean
In order to understand model forecast errors, in this section, we measure the skill/accuracy using the pattern correlation of the ensemble mean precipitation forecast with GPM data at all LTAWs.

| Actual skill compared with GPM
The actual skill of the ensemble mean as determined by the correlation coefficient (CORa) with GPM measures the pattern correspondence. CORa is insensitive to the intensity bias but could be sensitive to the impact of any bias if this affects the spatial characteristics of the field. Maps of computed CORa for a selection of LTAWs are presented in Figure 6. The spatial maps of actual skill based on only the newest four members are shown for the 1d1d, 4d4d, 1w1w and 2w2w LTAWs in Figure 6a,c,e,g and illustrate that the actual skill of GloSea5-GC2 decreases with increasing LTAWs. Figure 6b,d,f,h shows the CORa maps for the full 40 ensemble members for the same LTAWs. Green contours indicate regions where the CORa exceeds 0.7. For 1d1d LTAW, the four-member correlations are somewhat higher than those obtained using 40 members. For the 4d4d LTAW, the results begin to look very similar, and beyond this, the 40-member results suggest high levels of skill, with some larger areas of spatially consolidated skill above 0.7, though the regions where the ensemble mean and the observations are anti-correlated (negative) increase with increasing LTAW.

| Potential skill comparison
While actual skill is computed against a physical observation or an estimate thereof (noting that GPM is an estimate of precipitation not a measurement), potential skill is defined as designating an ensemble member as truth and computing an ensemble mean from the remaining ensemble members. This is repeated for the entire ensemble, and at the end, an 'average' correlation is computed, which is called 'potential skill'. Generally, potential skill is greater than actual skill, but not always (Kumar et al., 2014). Furthermore, it would seem unlikely for ensemble members to be anti-correlated with each other.

FIGURE 3 Time series of ensemble mean domain averages for the different lead time/accumulation windows (LTAWs) in panels (a-f). The LTAWs are scaled to be a daily value to facilitate comparison. The value r is the correlation between GloSea5-GC2 and GPM.
The potential skill provides some information on the spread of the ensemble members about the ensemble mean, providing a measure of how similar, or dissimilar, the members are to each other and to the ensemble mean. Therefore, higher potential skill could indicate that a model has higher predictive skill in the ensemble mean (at predicting itself) but could also indicate low ensemble spread. The potential skill could be considered an upper limit of predictive skill, and the ratio of the actual-to-potential skill could conceivably provide an estimate of how much of the skill in the ensemble mean is realized in reality. Figure 7 provides the same panels as Figure 6 but for potential skill. The green contours once again show regions where correlations exceed 0.7. The spatial pattern of potential skill of GloSea5-GC2 is much smoother and more uniform, with higher values than for actual skill. This applies to all LTAWs and arises because the model is seemingly able to predict itself much better than reality. Figure 7 also shows the impact of the number of ensemble members. For 1d1d, the four members initialized on the same day are unsurprisingly highly correlated. The potential skill for the four members shows a strong decrease with increasing LTAWs such that the 2w2w potential skill looks very similar to that calculated using 40 members (Figure 7g,h). The 40-member results, on the other hand, show much less change with LTAW, with correlations widely in excess of 0.5. Of course, potential skill is strongly impacted by the lagged ensemble members of GloSea5-GC2, with members that are initialized close together being much more correlated than when older members are also used. Therefore, it is not surprising that the potential skill of four members is higher than for 40 members. What is somewhat surprising is that the potential skill for the 2w2w LTAWs of the 4 and 40 members is so similar.
To explore the relationship between the number of ensemble members and lead time, the skill differentials and the field sum differences were computed in two ways: first, the number of ensemble members was kept constant to look at the impact of the LTAW, and second, the LTAW was kept constant and the number of ensemble members allowed to vary between 4 and 40 members. These results are summarized in Table 2. Note that 23 members happens to be the size of the numerical weather prediction (NWP) ensemble that is run over this region, and so gives a comparator based on the number of members typically provided for the short to medium range. Most of the field sum differences are positive (where fewer members or shorter lead times have higher correlations). The exceptions are for 2w2w, where the 4-8 member difference is negative, and for 40 members, where the 1d1d-2d2d, 2d2d-3d3d and 1w1w-2w2w differences are also negative. In most instances, the size of the difference as a function of lead time (coloured numbers) is dwarfed by the size of the differences as a function of the number of members (numbers in coloured circles). Only for the 4 and 8 members are the 1w1w-2w2w differences larger than or comparable to the number-of-member differences (in the circles). Computing potential skill involves large amounts of averaging and smoothing, which increases with the number of ensemble members. This smoothing offsets the differences that are introduced by using the older members. The final outcome is therefore a trade-off between the smoothing and the spread in the ensemble members. We can conclude that the differences between the members do win for a time, in that, with lagging, the correlations go down as the number of members increases. The fact that the 2w2w field sum is greater than 1w1w for 40 members would seem to suggest that some maximum in spread may have been reached around 14 days.
The impact of the number of ensemble members and LTAWs on the five homogeneous climatic regions of India shown in Figure 2 is summarized in Figures 8 and 9. For all, 95% CIs are computed using Fisher's z-transform. For actual skill, the regional correlations for GloSea5-GC2 are remarkably similar in shape for 4 and 40 members, and this holds for all LTAWs except for the NWI region. It is interesting to note that the CORa values for SPI (Southern Peninsula) are higher than those for the other homogeneous regions, as this region's rainfall is mostly driven by large-scale precipitation processes. More members appear to be beneficial for NEI, particularly for the 1w1w and 2w2w LTAWs, while there appears to be little benefit for CEI.
FIGURE 6 Actual skill as a function of the number of ensemble members for four different lead time/accumulation windows. Green contour denotes a correlation of 0.7.
FIGURE 7 Same as for Figure 6 but for potential skill.
FIGURE 8 GloSea5-GC2 actual skill with 95% confidence intervals for the five Indian homogeneous precipitation regions for all lead time/accumulation windows as a function of the number of ensemble members.
TABLE 2 The differences in potential skill are shown in two ways. Note: Firstly, by keeping the lead time/accumulation window (LTAW) constant and comparing the differences in potential skill between the number of ensemble members (in the coloured circles). Secondly, the members are kept the same and we compute the difference between different LTAWs (plain text). Red differences show where the fewer members/shorter LTAWs have higher values (correlations).
By contrast, potential skill over the regions (Figure 9) shows a much stronger impact from the change in ensemble members. GloSea5-GC2 potential skill is higher when older members are removed (Figure 9a), suggesting that the older members add some variability, which acts to reduce the potential skill, and this is true for all the regions and all LTAWs (Figure 9). The CIs are very small. As also summarized in Table 2, the potential skill varies much less with LTAW when using 40 members, while there is a strong decrease with LTAW for 4 members, eventually reaching levels similar to those seen for 40 members at 2w2w. As with the actual skill, the potential skill for SPI is also larger than for the other homogeneous regions. Clearly, GloSea5-GC2 would not be used to produce forecasts in the short to medium range, but understanding this behaviour is of value when wishing to evaluate forecasts seamlessly. It would also suggest that if forecasts from GloSea5-GC2 were to be used, it would be better to use far fewer members, with the potential skill for 4 members much more representative.
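The Fisher z-transform intervals quoted for the regional correlations can be illustrated with a minimal sketch. The function name `fisher_ci` and the sample size `n` are assumptions for illustration only; in particular, `n` is taken to be the number of independent forecast-observation pairs, whereas spatial or temporal autocorrelation would reduce the effective sample size and widen the interval.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation r from n paired samples.

    Hypothetical helper: n is assumed to count independent pairs.
    """
    if n <= 3:
        raise ValueError("Fisher's z-transform needs more than 3 samples")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                      # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)            # standard error in z-space
    zcrit = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided critical value
    # Back-transform the interval limits from z-space to correlation space.
    return math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)


lower, upper = fisher_ci(0.7, 103)
print(f"r = 0.70, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the standard error scales as 1/sqrt(n - 3), the large number of grid points and days entering a regional correlation is consistent with the very small intervals noted above.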
Finally, the ratio of the actual-to-potential skill in Figure 10 offers a way of gauging how much of the pattern skill is translated locally into reality, and it can be used to compare the impact of the number of members. The zero contour line is shown in green; the ratio can be negative when the actual skill is negative (i.e., the ensemble and the observations are essentially anticorrelated). As mentioned earlier, actual skill is generally lower than potential skill, but this is not always the case. The ratio can therefore be greater than 1, that is, actual skill is higher than potential skill. It can also be less than −1, that is, where the model is effectively strongly anticorrelated with reality. It is important to note that a ratio greater than 1 does not necessarily mean that the actual skill is good. It is far more likely due to the potential skill being low or too low, noting that potential skill is itself a fairly strange construct. As Figure 10 shows for 4 members, this ratio remains below 1 over the land regions, with the 2w2w LTAW ratios being the highest and the 1d1d LTAW ratios the lowest. Regions where the ratios are greater than 1 or less than −1 appear mostly over the BoB but also over the drier west and over high ground, noting that GPM over the high ground may not be very reliable and so the actual skill values should be interpreted with some care. For 40 members, the ratios are much higher, with a larger proportion of locations where the ratio is greater than 1, which increases in occurrence with increasing LTAW, so that at 2w2w there are more instances where locally the model is correlating better with reality than with itself. It is interesting to note that the ratio values are considerably higher over southern central India at the 2w2w LTAW. In this context, the 40 members show value at longer lead times as well, though regions with negative ratios also increase with increasing LTAWs.
The ratio is highest for 2d2d irrespective of the number of ensemble members, with 3d3d the lowest (not shown).
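The behaviour of the actual-to-potential ratio can be made concrete with a small sketch. It assumes one common perfect-model definition of potential skill, in which each member in turn is treated as the truth and correlated with the mean of the remaining members; the exact definition used for the figures may differ. All function names are hypothetical, and each series is simply a list of paired values at one location.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ensemble_mean(members):
    """Mean across members at each sample."""
    return [sum(vals) / len(vals) for vals in zip(*members)]


def actual_skill(members, obs):
    """Correlation of the ensemble mean with the observations."""
    return pearson(ensemble_mean(members), obs)


def potential_skill(members):
    """Assumed definition: mean correlation of each member with the mean of the others."""
    scores = []
    for i, truth in enumerate(members):
        rest = members[:i] + members[i + 1:]
        scores.append(pearson(ensemble_mean(rest), truth))
    return sum(scores) / len(scores)


def skill_ratio(members, obs):
    """Actual-to-potential ratio; NaN when potential skill is exactly zero."""
    pot = potential_skill(members)
    return float("nan") if pot == 0 else actual_skill(members, obs) / pot


members = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]]  # toy accumulations
obs = [1, 2, 3, 4]
print(skill_ratio(members, obs))
```

Under this definition, a small denominator inflates the ratio regardless of how good the numerator is, which is why ratios above 1 say more about low potential skill than about high actual skill.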

| Ensemble and probabilistic performance
Thus far, the evaluation has been based on bulk quantities and the ensemble mean. The impact of the number of ensemble members on pattern skill has also been explored, and this showed that for the longer lead LTAWs, the number of members does add to the spread and skill of the ensemble. In this section, we present some results for the ensemble as a whole, considering elements of spread (using the rank histogram) and an ensemble metric that quantifies accuracy, the Continuous Ranked Probability Score (CRPS), and its decomposition (Hersbach, 2000). It is worth noting that the CRPS is sensitive to the bias. The lower the value, the better. We decompose the CRPS into two components. The first is the CRPS reliability (CRPSReli) term, which is related to the rank histogram. CRPSReli is not dimensionless but has units of millimetres. A smaller CRPSReli is desirable. The other component is CRPSpot (not shown), which is the uncertainty minus the resolution and represents the value one would obtain if the system were perfectly reliable, that is, CRPSReli = 0.
FIGURE 9 GloSea5-GC2 potential skill with 95% confidence intervals for the five Indian homogeneous precipitation regions for all lead time/accumulation windows as a function of the number of ensemble members.
Here only the performance of the 40-member ensemble is shown, though the performance was assessed using the 4, 8 and 23 most recent members as well. One might hypothesize that for the shorter LTAWs, the probabilistic performance is degraded by the presence of the older lagged members, but the results suggest that the 40-member ensemble verifies better despite the inclusion of the older members.
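To illustrate why the CRPS carries the units of the variable and responds to bias, the following sketch evaluates the CRPS of a single ensemble forecast using the standard kernel (energy) form. This is not the Hersbach (2000) decomposition used here, only the total score for one forecast-observation pair; the function name `crps_ensemble` and the toy values (in mm) are assumptions for illustration.

```python
def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast against one observation (kernel form).

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|, in the units of the inputs.
    """
    m = len(members)
    # Mean absolute error of the members against the observation.
    term_error = sum(abs(x - obs) for x in members) / m
    # Half the mean absolute difference between all member pairs (spread term).
    term_spread = sum(abs(xi - xj) for xi in members for xj in members) / (2.0 * m * m)
    return term_error - term_spread


unbiased = crps_ensemble([0.0, 10.0], 5.0)   # members straddle the observation
biased = crps_ensemble([10.0, 20.0], 5.0)    # same spread, shifted by +10 mm
print(unbiased, biased)
```

The two toy ensembles have identical spread, so the larger score for the shifted ensemble comes entirely from the bias, which is the sensitivity noted above. For a single member the score reduces to the absolute error.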
The CRPS in Figure 11a shows that the differences between the CRPS as a whole for 1d1d and 2d2d are small but the differences do increase with LTAW, alongside the bias (as the CRPS is sensitive to the bias), with 95% bootstrap confidence intervals showing the relatively small spread in scores for the season and region. The CRPS values across the whole domain are smaller than those for most of the homogeneous regions, on par with SPI. Only NWI errors are smaller. It is noted that the CRPS values for 2w2w for all the homogeneous regions and the domain as a whole increase by about 40 mm. WCI has the largest errors while the domain-wide error is the lowest, suggesting the strong influence of the sea areas on predictability and skill for weeks 3-4. Although the CRPSReli values are not close to zero, the most interesting result from Figure 11b is that GloSea5-GC2 shows reasonable reliability up to the 4d4d LTAW, after which it deteriorates rapidly, except for NEI where the reliability improves (CRPSReli decreases towards zero). This could be due to the model providing better simulation of large-scale precipitation, which is the dominant mechanism for NEI. Put succinctly, the ensemble has an error of the order of 1 mm to 9 mm for daily accumulations with a forecast horizon of 2 days over the entire domain and homogeneous regions, and this error at least doubles for a 4-day accumulation ending on day 8. These numbers are comparable to the evaluation of the ensemble mean RMSE shown in Figure 5. This gives some hope that the bias could be corrected using NWP-like bias correction techniques, which do not specifically rely on the availability of hindcasts but can be based on observational records (Mittermaier et al., 2022, in preparation). Figure 11c shows the rank histograms for the GloSea5-GC2 ensemble forecast system by region for the June-September 2019 period. Only the 1d1d and 2w2w LTAWs are shown.
While the rank histograms for the whole domain show U-shaped distributions for 1d1d to 2w2w LTAWs, there is a lot more weight in the higher ranked bins for the 2w2w LTAW, and it is more symmetrical than the 1d1d rank histogram. Despite the larger number of grid points used for the whole domain and hence the larger counts per bin (rank), the rank histograms based on the regions for 1d1d retain a more symmetrical U-shape, while the 2w2w rank histograms show much stronger evidence of bias, with a marked reduction in the counts populating the lower tail. There is a systematic shift towards the higher ranks, which is indicative of the bias. Notably, the NEI region again shows a remarkably flat rank histogram for 2w2w, suggesting that for this region, the 2w2w LTAW ensemble forecast spread is fairly good.
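The construction of a rank histogram can be sketched as follows. The function name `rank_histogram` and the random tie-breaking are assumptions for illustration; ties are common for precipitation because several members and the observation can all be zero, and how they were handled here is not stated.

```python
import random


def rank_histogram(forecasts, observations, seed=0):
    """Count how often the observation falls in each rank of the sorted ensemble.

    forecasts: list of ensembles, each a list of m member values.
    observations: list of observed values, one per ensemble.
    Returns m + 1 counts; rank 0 means the observation is below all members.
    """
    rng = random.Random(seed)              # fixed seed for reproducible tie-breaking
    m = len(forecasts[0])
    counts = [0] * (m + 1)
    for members, obs in zip(forecasts, observations):
        below = sum(1 for x in members if x < obs)
        ties = sum(1 for x in members if x == obs)
        # Assumed convention: place the observation at random among tied members.
        rank = below + rng.randint(0, ties)
        counts[rank] += 1
    return counts


forecasts = [[1.0, 2.0, 3.0]] * 4
observations = [0.5, 1.5, 2.5, 3.5]
print(rank_histogram(forecasts, observations))
```

With this convention, a flat histogram indicates that the observation is equally likely to fall in any rank, a U-shape indicates that it falls outside the ensemble range too often, and a pile-up in the highest ranks means the observation exceeds most members.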
FIGURE 10 Actual-to-potential skill ratio for GloSea5-GC2 for the 1d1d, 4d4d, 1w1w and 2w2w lead time/accumulation windows as a function of the number of ensemble members. The 0 ratio contour is indicated in green. Actual skill can be negative such that the forecasts and observations are anti-correlated.

| Ensemble and probabilistic performance with respect to climatology
Beyond the daily timescale (for which some critical thresholds have been defined for issuing warnings), there are no user-defined thresholds available to test the skill of the model forecasts for the LTAWs that are more than 1 day in length. The most widely used reference for computing the skill of a probabilistic forecast is the Brier score with respect to climatology. This can be a sample climatology or a long-term climatology. Using a climatological reference is an appealing strategy because it provides an indication of whether the model forecasts offer something better than having no forecast. Here the sample climatology is used to compute centiles of occurrence for the 2-, 3-, 4-, 7- and 14-day accumulations. The sample climatology probabilities for exceeding the specified threshold at each grid point (based on GPM on the GloSea5-GC2 grid) for June-September 2019 are given for each of the LTAWs. These are presented in Figure 12, with the 0.1 probability contoured in orange. The threshold choice was a pragmatic decision driven by a desire to assess the different LTAWs with a similar observed frequency of occurrence. The starting point was the thresholds used for issuing warnings for daily precipitation (the only thresholds available), which were then scaled up to the longer LTAWs. The maps show that, though it is possible to get large totals anywhere, climatologically speaking the kinds of rain amounts used to create these maps are locally unlikely in most places and locally highly likely in a few places, for example, the Western Ghats and a few locations in northeast India. For aggregation purposes, the homogeneous regions shown in Figure 2 are perhaps not a good delineator of this pattern, but then the 2019 season may also not be typical of the long-term climatology. The local climatology affects the ability of an ensemble forecast to add value.
If an event is climatologically likely, then the forecast probability has to add value over simply forecasting the local climate. This means that in regions such as the Western Ghats it would appear to be difficult for the model to add value. By contrast, over regions where such accumulations are climatologically unlikely, the ensemble forecast could potentially add a lot of value if it could correctly detect an event. Figure 13 shows Brier skill scores relative to the sample climatology for different LTAWs using the thresholds shown in Figure 12. The 0.1 observed frequency contour is overlaid. Grey indicates where the score is negative, and therefore worse than forecasting the climatological frequency in Figure 12. White indicates locations where the score could not be computed. The results suggest that GloSea5-GC2 has fairly low skill for these high thresholds, though clusters of skill exist over the Arabian Sea and BoB, and over NW and NE India where the terrain adds predictability. The Western Ghats shows a surprising amount of skill too up to 1w1w, with hints of some positive skill for 2w2w. These results are sensitive to the choice of thresholds for each LTAW.
FIGURE 12 Observed frequencies of occurrence (base rate) for a selection of thresholds, one for each lead time/accumulation window, based on the sample climatology for June-September 2019. Orange contour indicates regions with frequencies greater than 0.1.
FIGURE 13 Brier skill score relative to sample climatology. One threshold per lead time/accumulation window (LTAW) is shown. These were pragmatically chosen (see previous figure) to reflect similar observed frequencies of occurrence across all LTAWs. A base rate (observed frequency of occurrence) of 0.1 is shown as the contour in all panels. Negative scores are shown in grey where the forecast performs worse than the sample climatology.
It is encouraging to see that GloSea5-GC2 shows greater skill for regions where the rainfall amounts are not climatologically common, for example, over the northwest coast and central India from the 1d1d to the 4d4d LTAW. Historical time series should be considered (e.g., Pai et al., 2017) and any future evaluation of skill should use variable thresholds to reflect the large variations in climatology. It would also be worth bias-correcting the model forecasts and then re-testing this skill.
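The Brier skill score relative to the sample climatology can be sketched for a single grid point as follows. The function name `brier_skill_score`, the use of the fraction of members exceeding the threshold as the forecast probability, and the toy values (in mm) are assumptions for illustration; the thresholds actually used are those in Figure 12.

```python
def brier_skill_score(ens_forecasts, observations, threshold):
    """Brier skill score at one location, with the sample base rate as reference.

    ens_forecasts: list of ensembles (lists of member accumulations).
    observations: list of observed accumulations, one per ensemble.
    Returns None when the event never or always occurs in the sample.
    """
    # Forecast probability: fraction of members exceeding the threshold (assumed).
    probs = [sum(1 for x in members if x > threshold) / len(members)
             for members in ens_forecasts]
    events = [1.0 if obs > threshold else 0.0 for obs in observations]
    n = len(events)
    base_rate = sum(events) / n            # sample climatological frequency
    bs = sum((p - e) ** 2 for p, e in zip(probs, events)) / n
    bs_ref = sum((base_rate - e) ** 2 for e in events) / n
    if bs_ref == 0.0:
        return None                        # reference score is zero: skill undefined
    return 1.0 - bs / bs_ref


forecasts = [[20.0, 30.0], [0.0, 1.0], [25.0, 40.0], [2.0, 3.0]]
observed = [22.0, 0.0, 30.0, 1.0]
print(brier_skill_score(forecasts, observed, threshold=10.0))
```

A score of 1 corresponds to perfect probabilities, 0 to no improvement over always forecasting the base rate, and negative values to forecasts that are worse than the sample climatology.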

| DISCUSSION AND CONCLUSIONS
Decision-makers across multiple sectors require accurate weather forecasts with sufficient notice (lead time) to make planning decisions and take action to reduce the impact of weather- and climate-related events. National Meteorological and Hydrological Services (NMHSs) and the humanitarian sector are identifying ways to effectively utilize the available skill of forecasts, across a range of timescales, to enhance preparedness ahead of impactful weather events. In support of this, this work has examined the skill of the GloSea5-GC2 global ensemble forecast system, which provides forecasts of precipitation amount across S2S timescales. Here, the focus was on the early to extended range (up to 4 weeks). A method originally developed by Wheeler et al. (2017) is applied to GloSea5-GC2 forecasts to understand their relative skill for different LTAWs from 1d1d up to 2w2w, for the extreme Indian monsoon season of 2019. Our LTAWs are slightly different from their windows. Two questions were posed in this study: what useful skill does the forecast system provide for detecting high-impact weather events, and what is the impact of the number of ensemble members on the metrics? The latter question has been addressed primarily through an assessment of the skill in the rainfall pattern. The results have demonstrated that the number of ensemble members has a larger impact on the correlations than the forecast lead time, suggesting that more members for the LTAWs may be detrimental, adding too much spread. However, when considering the ensemble as a whole, not just the ensemble mean, having 40 members would appear to still be beneficial.
In terms of all-India domain averages, GloSea5-GC2 appears to begin by over-estimating daily-equivalent domain averages; this over-estimation is reduced with increasing LTAWs but with an increasing offset in the timing of peak rainfall. The model better estimates precipitation phase and precipitation totals at longer LTAWs (1w1w and 2w2w) in the middle of the monsoon months, when rainfall is mostly dominated by the large-scale features of the monsoon.
The mean error and RMSE of the individual ensemble members showed that GloSea5-GC2 ensemble members have a systematic behaviour in RMSE and mean error, with errors increasing with ensemble member age (as there is a 10-day difference between the oldest and most recent lagged ensemble members). The actual (pattern) skill (based on the correlation coefficient) of the ensemble mean showed that GloSea5-GC2 has reasonable skill over the Indian homogeneous climatic regions at most of the LTAWs, but the actual skill decreases with longer LTAWs. It is worth mentioning that GloSea5-GC2 actual skill increases when only 4 members are used (i.e., the oldest members are excluded), but for longer lead times and some regions (e.g., NEI) the 40 members do add value.
The potential skill over the regions tends to be higher than the actual skill, but this is not true for all LTAWs. The interpretation of this could go two ways: (1) the ensemble spread in GloSea5-GC2 is low, or (2) there is still some room for improvement in the actual skill. One notable reversal in behaviour is that when the older members are excluded the potential skill is generally higher, such that it would appear that the members that are initialized together are more similar to each other. Therefore a reduction in potential skill with increased lagged members is probably desirable as it ensures that the ensemble members are less like each other. In order to gain sufficient ensemble spread, it is necessary to time lag ensemble members from recent runs. This is particularly important for probabilistic forecasts at the S2S timescale. Locally (i.e., for the homogeneous regions), there is a split between regions where the potential skill is higher than the actual skill and regions where it is lower, and this is seen for all LTAWs.
On the other hand, trimming the members does not appear to have a detrimental impact on actual skill, except for the NEI region at the 1w1w and 2w2w LTAWs. Thus, it would appear as though these additional members do not add value by being included in the short to medium range or even for the extended range forecast. This has potential implications for how such forecasts might be applied in downstream applications. For example, in instances where individual ensemble members are fed into downstream hydrological models or impact models, it may be important to consider the age and number of the members being used in a seamless decision-making framework (e.g., Ready-Set-Go). Higher resolution NWP forecasts, by contrast, are likely to be used closer to the time of the high-impact weather event. Finally, the relationship between the actual and potential skill can be explored by taking the ratio of the two correlations. When viewed on the grid, the actual-to-potential ratios can be locally greater than 1 and more likely so for 40 members. This could be considered unusual but it suggests that the model is actually better correlated with the observed state than the ensemble members are with each other. Crucially, ratios greater than 1 do not necessarily happen because actual skill is good. It is far more likely due to the potential skill being poor. There are also locations where the ratios are less than −1, indicating where the ensemble is strongly anti-correlated with the observations, with a magnitude greater than that of the correlation among the ensemble members. This behaviour is more likely to be an artefact of the observations and is primarily restricted to grid points over the BoB, though it is possible that these areas are associated with systematic biases where the pattern is by definition displaced in the model compared with reality.
A more quantitative assessment of rain volume as a function of LTAW shows that GloSea5-GC2 has modest skill for a small subset of locations, and these are not confined to locations where such rainfall amounts are climatologically common. This strengthens the usefulness of the model output, particularly when viewed across increasing LTAWs. Multi-day accumulations and probabilities of exceedance have been largely unexploited in the S2S arena, yet it is often the steady build-up of substantial accumulations over many days that eventually leads to flooding and other impacts. Here we show the utility of such accumulations and suggest that a review of alternative, user- or sector-specific rainfall accumulation thresholds would be additionally valuable to consider.
Overall, the results show that there is value in utilizing the seasonal forecasts from GloSea5-GC2, even without extensive use of hindcasts. It shows that strong signals could potentially be identified and that the forecasts should be exploited more like standard NWP ensemble output to extract more of this useful information. The overall skill of such products may be further enhanced with some observation-based post-processing, something that is currently being explored in follow-on work. The study does not advocate using output from the monthly forecast system for the short to medium range, but the results have shown how the monthly forecast performance evolves in a seamless way.