Validation of GPM IMERG Extreme Precipitation in the Maritime Continent by Station and Radar Data

The Maritime Continent (MC) is a region subject to high impact weather (HIW) events, which are still poorly predicted by numerical weather prediction (NWP) models. To improve predictability of such events, NWP needs to be evaluated against accurate measures of extreme precipitation across the whole MC. With its global spatial coverage at high spatio‐temporal resolution, the Global Precipitation Measurement (GPM) data set is a suitable candidate. Here we evaluate extreme precipitation in the Integrated Multi‐Satellite Retrieval for GPM (IMERG) V06B product against station data from the Global Historical Climatology Network in Malaysia and the Philippines. We find that the high intragrid spatial variability of precipitation extremes results in large spatial sampling errors when each IMERG grid box is compared with individual co‐located precipitation measurements, a result that may explain discrepancies found in earlier studies in the MC. Overall, IMERG daily precipitation is similar to station precipitation between the 85th and 95th percentile, but tends to overestimate above the 95th. IMERG data were also compared with radar data in western Peninsular Malaysia for sub‐daily timescales. Allowing for uncertainties in radar data, the analysis suggests that the 95th percentile is still suitable for NWP evaluation of extreme sub‐daily precipitation, but that the rainfall rates diverge at higher percentiles. Hence, our overall recommendation is that the 95th percentile be used to evaluate NWP forecasts of HIW on daily and sub‐daily time scales against IMERG data, but that higher percentiles (i.e., more extreme precipitation) be treated with caution.

the local population (Abd Majid et al., 2019; Cabrera & Lee, 2020; Karki, 2019; Takama et al., 2017), can lead to severe consequences. An accurate prediction of extreme precipitation in the MC is therefore of crucial importance for society. Numerical weather prediction (NWP) models still struggle to correctly predict such extreme events in the MC. Progress in the prediction of extreme precipitation needs accurate evaluations of NWP, which in turn require an accurate observation system of actual precipitation.
Current observations of precipitation are made with station gauge networks, ground-based radars, and satellite measurements. While prone to errors due to evaporation and wind effects (Du et al., 2018; Lorenz & Kunstmann, 2012; Maggioni et al., 2016), gauge measurements are expected to be more accurate as they provide a direct measure of precipitation (Sun et al., 2018). However, gauge measurements are limited by their localized (point) spatial nature (Kidd et al., 2017), which results in sampling errors when interpolated onto larger areas (Lorenz & Kunstmann, 2012; Rana et al., 2015). Ground-based radars can significantly increase the extent of precipitation observations while retaining a high spatial resolution. However, because they measure precipitation indirectly, ground-based radars are affected by errors from contamination, signal attenuation, and the uncertainty associated with the reflectivity-rain-rate (Z-R) relationship (Berne & Krajewski, 2013; Iguchi et al., 2009; Maggioni et al., 2016). Furthermore, the MC is poorly covered by ground-based measurements of precipitation (Kidd et al., 2017). Hence, NWP evaluation in the MC relies particularly on satellite precipitation measurements, with their potentially global spatial coverage. Although errors in estimation methods remain (Camici et al., 2018; Derin et al., 2016), the use of precipitation data from satellites has increased and has enabled new applications (Kucera et al., 2013).
To benefit from the advantages of both satellite (higher spatial coverage) and gauge measurements (higher accuracy), considerable effort has been invested in the development of mixed gauge-satellite precipitation data sets (Adler et al., 2018; Huffman et al., 1995, 2007, 2019; Xie & Arkin, 1997). The Global Precipitation Measurement (GPM) Integrated Multi-Satellite Retrievals for GPM (IMERG) is one such data set. The IMERG precipitation data set was built with the use of over 10 satellites, including the GPM Core Observatory satellite launched in 2014. The Core Observatory carries the Ku- and Ka-band Dual-frequency Precipitation Radar (DPR) and the GPM Microwave Imager (GMI) sensors, two of the most sophisticated satellite precipitation sensors currently in space (Skofronick-Jackson et al., 2018). These instruments are complemented by both passive microwave (PMW) and infrared (IR) sensors on board the IMERG satellite constellation.
The IMERG product has been evaluated in many locations globally (Dezfuli et al., 2017; Fang et al., 2019; Kim et al., 2017; Mayor et al., 2017; Navarro et al., 2019; Omranian & Sharif, 2018; Prakash et al., 2016; Sharifi et al., 2016), and is generally an improvement with respect to its predecessors. Thus, IMERG is a suitable candidate for the systematic evaluation of NWP extreme precipitation in the MC. However, IMERG is not exempt from errors, some of which are already well documented (O & Kirstetter, 2018; O et al., 2017; Oliveira et al., 2016; J. Tan et al., 2016). The IMERG precipitation estimates were shown to better match gauge data at the monthly timescale than at the daily/sub-daily timescales (M. L. Tan & Duan, 2017; Yuda et al., 2020).
Although accurate at measuring mean precipitation rates, such global satellite precipitation products often show deficiencies in their representation of extreme precipitation, and their accuracy may be regionally and climatically dependent (Rajulapati et al., 2020). The IMERG product does not seem to be an exception; it underestimates extreme precipitation over Mexico (Mayor et al., 2017), the eastern coast of the United States (J. Tan et al., 2016), Singapore (M. L. Tan & Duan, 2017), and Austria (O et al., 2017), and overestimates extreme precipitation in the central Amazon (Oliveira et al., 2016), the Tibetan Plateau, and the Netherlands (Gaona et al., 2016). Previous analyses of IMERG performance over the MC (Liu et al., 2020; M. L. Tan & Duan, 2017; M. L. Tan & Santo, 2018; Yuda et al., 2020) found that IMERG underestimates extreme precipitation and performs better during the wettest season. However, these studies were subject to potentially large spatial sampling errors, that is, errors incurred when interpolating gauge precipitation data onto the IMERG grid. By degrading the same precipitation product onto different spatio-temporal resolutions, Behrangi and Wen (2017) showed that these errors can be large, especially over land areas. Similarly, Tian et al. (2018) and Tang et al. (2018) found that rain gauge density has a large impact on IMERG skill metrics over China.
Previous IMERG evaluation studies in the MC were done over relatively short periods of 1-2 years. By definition, extreme precipitation is very infrequent; hence, small sample sizes may have a detrimental effect here. Consequently, these studies do not provide a practical range of precipitation from which IMERG can be used with the aim of evaluating extreme precipitation events simulated by NWP in the MC.
Therefore, the objective of the present study is to reassess the performance of IMERG in the detection of extreme precipitation over the MC, with an estimation of spatial sampling error, and to provide practical information for use in NWP evaluation. For this purpose, the IMERG V06B data set is evaluated against the Global Historical Climatology Network (GHCN) gauge data set over Malaysia and the Philippines, and against a ground-based weather radar data set from western Peninsular Malaysia. Section 2 describes the precipitation data sets used in this study. Section 3 presents an evaluation of IMERG in the MC. Finally, Section 4 describes key findings and practical guidance for the use of IMERG in NWP evaluation.

IMERG Data
The main analysis in this study is based on the IMERG product, version V06B, from the GPM project. This product is based on measurements from a constellation of satellites, equipped with PMW and geo-IR sensors. The PMW measurements give more accurate direct estimations of precipitation rate but have limited spatial and temporal coverage. Meanwhile, the IR measurements only measure precipitation indirectly, but have almost complete spatial and temporal coverage.
The PMW precipitation estimates are first converted from brightness temperature to precipitation rate following the Goddard profiling algorithm (GPROF) (Kummerow et al., 2015) or the Precipitation Retrieval and Profiling Scheme. Among PMW satellites, the GPM Core Observatory is considered to carry the most advanced instruments for precipitation detection (Skofronick-Jackson et al., 2018). It was launched in February 2014 and is the successor to the Tropical Rainfall Measuring Mission (TRMM, Huffman et al., 2007) satellite, which was launched in 1997. As well as providing accurate precipitation measurements for the IMERG product, the TRMM satellite and the GPM Core Observatory serve for the intercalibration of the whole IMERG PMW satellite constellation, in their respective eras. Several studies have identified improvements of precipitation estimates by IMERG relative to its predecessors in Southeast Asia (Kim et al., 2017; Prakash et al., 2016; M. L. Tan & Duan, 2017; F. Xu et al., 2019).
Prior to intercalibration, the TRMM and GPM Core Observatory estimates are seasonally corrected over land areas by the climatological values from the Global Precipitation Climatology Project satellite-gauge product (Adler et al., 2018). The PMW intercalibration is achieved through quantile matching, using a method similar to Miller (1972) and Krajewski and Smith (1991). The IR data, which essentially measure cloud top features rather than precipitation directly, are trained and calibrated against the PMW estimates using an artificial neural network cloud classification system (PERSIANN-CCS; Nguyen et al., 2018).
All precipitation estimates are gridded onto a 0.1° × 0.1° longitude-latitude spatial grid. A Kalman smoother is then used to combine all precipitation estimates into a single half-hourly estimate (Joyce & Xie, 2011). In this step, the closest PMW estimates forward and backward in time from the analysis time of the half-hourly window are propagated to the analysis time using precipitable water vapor motion vectors from the Goddard Earth Observing System Forward Processing (IMERG early and late runs; Keller et al., 2021) or the Modern-Era Retrospective Analysis for Research and Applications, version 2 (IMERG final run; MERRA-2; Gelaro et al., 2017). A weighted average of the two resultant estimates is then performed. The IR data are used only if the nearest PMW measurement is more than 30 min from the target time. In this case, the IR estimates are incorporated into a Kalman filter in the form of an observation correcting the PMW "forecast." The resulting half-hourly estimates over land are then multiplied by the ratio of the Global Precipitation Climatology Centre (GPCC) (Schneider et al., 2008) monthly gauge estimates to the monthly sum of half-hourly estimates derived in the earlier steps of the IMERG algorithm. This step is only performed in the final version of the product, which is used in the present study. The IMERG product is thus a multi-satellite-gauge precipitation data set for which data are provided at a 30-min time interval on a global 0.1° × 0.1° grid.
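The final gauge adjustment described above amounts to a monthly rescaling. A minimal sketch for a single grid box is given here, assuming hypothetical function and variable names; the operational IMERG algorithm is considerably more involved and is not reproduced.

```python
def gauge_adjust(half_hourly_mm, gauge_monthly_total_mm):
    """Rescale half-hourly satellite estimates (mm per half hour) so that
    their monthly sum matches a monthly gauge total (mm).

    Illustrative only: one grid box, one month, no spatial weighting.
    """
    satellite_total = sum(half_hourly_mm)
    if satellite_total == 0:
        # No satellite rain in the month: a ratio is undefined, so leave as is.
        return list(half_hourly_mm)
    ratio = gauge_monthly_total_mm / satellite_total
    return [value * ratio for value in half_hourly_mm]


adjusted = gauge_adjust([1.0, 3.0, 0.0], 8.0)
print(adjusted)
```

Because every half-hourly value in the month is multiplied by the same ratio, the adjustment changes magnitudes but not the timing of precipitation within the month.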
The diurnal cycle of precipitation is reasonably well captured by IMERG when compared to rain gauge (Mayor et al., 2017; O & Kirstetter, 2018; Tang et al., 2016; Zhang et al., 2018) or ground-based radar precipitation estimates (Oliveira et al., 2016), although a phase delay of about 40 min was found in the presence of frozen hydrometeors aloft (O & Kirstetter, 2018; O et al., 2017; You et al., 2019). Potential sources of IMERG errors were attributed to the precision of the instruments on board the satellite constellation (J. Tan et al., 2016). IMERG retrievals that only used IR measurements were found to be the least accurate, because precipitation is measured indirectly from cloud top brightness temperatures. However, PMW sensors tend to underestimate warm cloud precipitation (Dinku et al., 2007; Shige et al., 2013), which can affect the performance of IMERG (O & Kirstetter, 2018). The IMERG algorithm itself was sometimes identified as a source of error, notably through its morphing and GPROF precipitation retrieval schemes (Oliveira et al., 2016; J. Tan et al., 2016).
In this study, 19 years of the IMERG precipitation data set from January 1, 2001 to December 31, 2019 over Malaysia and the Philippines (Figure 1) were used. When IMERG data were compared to radar data, IMERG accumulations were calculated only using data from times at which radar data were also available.

GHCN Station Data
The GHCN data set comprises several meteorological variables measured by surface weather stations across the Earth (Menne, Durre, Korzeniewski, et al., 2012; Menne, Durre, Vose, et al., 2012). Data are available at daily (UTC) time resolution, and have undergone a common suite of quality assurance reviews (Durre et al., 2010). In the present study, only the daily mean precipitation data from Malaysia and the Philippines were used to evaluate the IMERG data. First, the gauge time series were truncated to the IMERG period examined (2001-2019) to ensure time coherence between both data sets. Then, only GHCN stations having at least 1,000 days of data within this period were selected for analysis. The GHCN data set also included weather station time series from Indonesia, but the lengths of these time series did not satisfy this criterion. The exact locations of the gauges used are shown in Figure 1. The gauges are spread over large areas with different climate characteristics. Previous studies found that IMERG may have variable skill, depending on regional characteristics within the MC (M. L. Tan & Santo, 2018). Hence, six groups of weather stations were defined in the following regions (red markers in Figure 1): Western Peninsular Malaysia (5 stations); Eastern Peninsular Malaysia (3 stations); Northwest Borneo (6 stations); Western Philippines (except mountain regions, 6 stations); Eastern Philippines (11 stations); Philippines mountain region (1 station).

Radar Data
Data from an S-band Doppler weather radar at Subang, Kuala Lumpur (101.559°E, 3.145°N), operated by MetMalaysia, were also used to evaluate the IMERG data. There are 89 days of radar data in a period spanning 94 days, from January 11 to April 15, 2019. The radar measurements were calibrated first using a relative calibration against clutter points and second using the DPR aboard GPM, following Warren et al. (2018) and Louf et al. (2019). Following calibration, the radar data were interpolated onto a Cartesian grid at 2-km height above the radar location, from which precipitation values were retrieved using the Weather Surveillance Radar (WSR) Z-R relationship (Fulton et al., 1998). The WSR Z-R relationship is known to give reliable estimates for convective precipitation. The Marshall-Palmer (Marshall et al., 1947) and the Rosenfeld (Rosenfeld et al., 1993) Z-R relationships, which perform well for stratiform and tropical precipitation (respectively), were also tested and taken into account in the study in the form of uncertainties. Instantaneous precipitation values are provided every 10 min, at 0000, 0010, 0020, …, 2350, each day. The spatial resolution of the radar data is 0.0045°, or approximately 400 m. A spatial bilinear interpolation was performed on the radar data, to map it from its original grid to the 0.1° IMERG grid, for comparison. Both the 0.0045° and the 0.1° radar data were used in this study, the 0.0045° radar data being used as an estimate of pointwise precipitation in order to quantify the spatial sampling error.
The Subang radar is located on the coastal plain of western Peninsular Malaysia, with the prominent Titiwangsa mountain range to the east (Figure 2). The mountains clearly block the radar signal to the east, as evidenced by the near-zero accumulations in this region. Hence, all radar grid points over and to the east of the Titiwangsa mountains were removed from the analysis.
The IMERG data are available every 30 min, at 0015, 0045, 0115, …, 2345, each day. When there is no PMW measurement in the corresponding 30-min window, the IMERG values are calculated as an average of the closest previous PMW measurement advected forward in time by MERRA-2 motion vectors, and the closest following PMW measurement advected backward in time by MERRA-2 motion vectors. IR precipitation data are also incorporated in the calculation when no PMW measurements are available within ±30 min of the time window. This effectively gives an approximately 25-min mean precipitation value (O et al., 2017). Hence, for direct comparison of "instantaneous" radar and IMERG data, the two closest instantaneous radar values (backward and forward in time) from the IMERG output time were averaged. For example, the IMERG precipitation value at 1415 was compared with the average of the instantaneous 1410 and 1420 radar precipitation values. For the sake of simplicity, this average is still referred to as "instantaneous" in this study. While such an averaging procedure is the best estimate of precipitation intensity between two radar output time steps, it tends to underestimate extremes of instantaneous precipitation (and conversely, overestimate low precipitation). This averaging procedure was only carried out for the comparison of "instantaneous" precipitation values.
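The pairing of IMERG output times with radar scans can be sketched as follows. This is an illustration only: times are expressed as minutes since 0000 UTC, and the function name and dictionary layout are assumptions rather than the processing code used in the study.

```python
def radar_pair_mean(radar_rates, imerg_minute):
    """Average the radar scans 5 min before and 5 min after an IMERG time.

    radar_rates: dict mapping minutes since 0000 UTC (multiples of 10)
                 to a rain rate in mm/h.
    imerg_minute: IMERG nominal time in minutes since 0000 UTC (15, 45, ...).
    Returns None when either neighbouring scan is missing.
    """
    before = imerg_minute - 5
    after = imerg_minute + 5
    if before not in radar_rates or after not in radar_rates:
        return None
    return 0.5 * (radar_rates[before] + radar_rates[after])


# The 1415 IMERG value is compared with the mean of the 1410 and 1420 scans.
radar = {14 * 60 + 10: 12.0, 14 * 60 + 20: 4.0}
paired = radar_pair_mean(radar, 14 * 60 + 15)
print(paired)
```

In this toy case the paired value is lower than the larger of the two scans, which illustrates why the averaging damps instantaneous extremes.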
Rainfall accumulations were also calculated from the 10-min instantaneous radar data, for periods of 30 min, and 1, 3, 6, 12, and 24 h. A weighted average was calculated from all instantaneous precipitation measures within the period. Each 10-min instantaneous radar scan was interpreted as the representation of averaged precipitation over a 10-min window centered on the nominal time and the weightings were chosen accordingly.
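One way to realize this weighting is to give each scan a weight equal to the overlap between its 10-min window and the accumulation period. The sketch below follows that reading; the function name, the use of minutes since 0000 UTC, and rates in mm/h are assumptions made for illustration.

```python
def accumulate(scans, start, end, step=10.0):
    """Accumulate rain (mm) over [start, end], with times in minutes.

    scans: dict mapping scan time (minutes) to rain rate (mm/h).
    Each scan is taken to represent the mean rate over a window of
    `step` minutes centred on its nominal time, so its weight is the
    overlap of that window with the accumulation period.
    """
    total = 0.0
    for minute, rate in scans.items():
        window_start = max(start, minute - step / 2.0)
        window_end = min(end, minute + step / 2.0)
        if window_end > window_start:
            total += rate * (window_end - window_start) / 60.0
    return total


# A constant 6 mm/h over 30 min should give 3 mm.
steady = {0: 6.0, 10: 6.0, 20: 6.0, 30: 6.0}
print(accumulate(steady, 0, 30))
```

With this choice the scans at the two ends of the period receive half the weight of the interior scans, because only half of their windows fall inside the period.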
There was a significant fraction of missing radar data (13%). Gaps in the radar time series were filled using linear time interpolation before the accumulations were calculated. To reduce potential errors from this interpolation, all accumulation periods for which more than half of the data were missing were discarded from the analysis. This restriction does not completely avoid errors, especially for the longest accumulation periods. A discussion of these errors is provided in Section 3.
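The gap filling and the rejection rule can be sketched as below. This is a minimal illustration on a regularly spaced series in which missing scans are marked with None; the helper names are hypothetical, and leaving gaps at the ends of the series unfilled is an assumption of the sketch.

```python
def fill_gaps(series):
    """Linearly interpolate None values between known neighbours.

    Gaps at the very start or end of the series have no neighbour on
    one side and are left as None.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = (1.0 - weight) * series[left] + weight * series[right]
    return filled


def is_usable(window):
    """Keep an accumulation period unless more than half of it is missing."""
    missing = sum(value is None for value in window)
    return missing <= len(window) / 2.0


raw = [0.0, None, 4.0, None, None, 10.0]
print(fill_gaps(raw))
print(is_usable(raw))
```

The usability check is applied to the raw series, before interpolation, so that heavily interpolated periods are discarded rather than silently trusted.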

Topography Data
The General Bathymetric Chart of the Oceans topography data set was used to distinguish between sea, lowland, and mountain regions. It was regridded from its native 30 arc-second resolution to the coarser 0.1° × 0.1° longitude-latitude IMERG grid (Figure 1).
Eastern Philippines, and a high elevation (mountain) station located in the Western Philippines. The correlation coefficient, root mean square error (RMSE), and relative bias were calculated for daily, weekly, and monthly precipitation accumulations (Table 1). For the relative bias, we first calculated the bias and then divided it by the total accumulated precipitation over the time period (thus this metric does not vary with time scale). All of these statistics were initially calculated for each station (using the time series of IMERG precipitation from the nearest grid point, on the 0.1° × 0.1° IMERG grid) and then averaged over the region.
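The three statistics can be written compactly as follows. This is a sketch under stated assumptions: the bias is taken as IMERG minus gauge, and the relative bias is normalized by the gauge total, which the text does not state explicitly. Function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(estimate, reference):
    """Root mean square error of the estimate against the reference."""
    squared = [(a - b) ** 2 for a, b in zip(estimate, reference)]
    return math.sqrt(sum(squared) / len(squared))


def relative_bias(estimate, reference):
    """Total bias divided by the total reference precipitation (assumed)."""
    return (sum(estimate) - sum(reference)) / sum(reference)


imerg = [2.0, 4.0, 6.0]   # toy values in mm/day
gauge = [1.0, 2.0, 3.0]
print(pearson(imerg, gauge), rmse(imerg, gauge), relative_bias(imerg, gauge))
```

Because the relative bias is a ratio of totals, it is unchanged when the same series are summed to weekly or monthly values, in line with the remark that this metric does not vary with time scale.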

Comparison of IMERG With GHCN Station Data
Correlation coefficients of daily precipitation values range from 0.5 in Western Peninsular Malaysia to 0.74 in Eastern Peninsular Malaysia, while correlation coefficients of monthly precipitation values are typically above 0.8. In each region, the correlation coefficient increases with increasing accumulation time, and RMSE decreases with increasing accumulation time. This increase in performance of IMERG at the seasonal time scale compared with the daily time scale was also observed in Singapore (M. L. Tan & Duan, 2017), Bali (Yuda et al., 2020), and the USA (J. . Our analysis of daily correlation coefficients and RMSEs in Malaysia confirms and extends the results of M. L. Tan and Santo (2018) who used an earlier version of IMERG and a shorter time period.
Although the daily correlation coefficient values reflect a moderate-to-good representation of IMERG in capturing the day-to-day variability of precipitation, the high daily RMSE values in every location emphasize the magnitude of errors in IMERG precipitation intensity, ranging from 13.6 mm day⁻¹ in Western Peninsular Malaysia up to 33.2 mm day⁻¹ at the sole mountain station in the Western Philippines. The relative bias tends to be positive for low-level land locations, but IMERG displays a substantial negative bias of −28% at the sole mountain station. With only one mountain station we cannot conclude that this bias is a consistent feature, but this result is consistent with previous findings that PMW sensors may underestimate warm orographic rain because they use ice loads for their detection of precipitation (Derin et al., 2016; Dinku et al., 2007; Kim et al., 2017; Navarro et al., 2019; O & Kirstetter, 2018; R. Xu et al., 2017). It is also worth noting that IMERG does not explicitly account for orographic enhancement, unlike Global Satellite Mapping of Precipitation, which should have an improved representation of precipitation over mountainous regions (Yamamoto & Shige, 2015).
These statistics were calculated from the comparison of time series of local GHCN gauge measurements with time series of 0.1° gridded IMERG precipitation (Section 2). We expect that the pointwise precipitation measurements will not be representative of the average precipitation over the relatively large 0.1° × 0.1° (approximately 120 km²) area covered by the IMERG nearest grid point. This discrepancy is referred to as the spatial sampling error, and is examined quantitatively below.
Here, the spatial sampling error is estimated by comparing the native resolution Subang radar precipitation (on a 0.0045° grid) against itself, but regridded onto the coarser 0.1° IMERG grid. The "Radar" row in Table 1 shows the daily correlation coefficient, RMSE, and relative bias from these calculations. These statistics were initially calculated for each radar grid point at native resolution and the nearest 0.1° neighbor, and subsequently averaged over all lowland grid points (delimited by the green lines in Figure 2).
As the same product is being compared at two different spatial resolutions, the calculated values of correlation coefficient, RMSE and relative bias can be interpreted as the optimum values attainable, given the spatial sampling error between a 0.1° area-averaged precipitation data set and a (nearly) pointwise precipitation data set in Western Peninsular Malaysia. The daily radar-radar correlation coefficient is only 0.72, that is, significantly less than the maximum theoretical value of 1. This is a similar value to that of Tang et al. (2018), who used a high-density gauge network in the Ganjiang River basin (South China) to assess the expected sampling error. It shows that the sampling error contributes substantially to reducing the correlation coefficient for the IMERG-GHCN comparison, which has a value of 0.5.
A similar conclusion can be drawn for the RMSE, which is 9.2 mm day⁻¹ for the radar-radar comparison. Contributions to mean square error (MSE) can be added linearly, whereas those to RMSE cannot. With this in mind, the radar-radar MSE has a value that is 45% of the value of the IMERG-GHCN MSE. Hence, approximately 45% of the IMERG-GHCN MSE can be attributed to the spatial sampling error, with the remainder being a "genuine" physical error between the two systems. Finally, the radar-radar relative bias is +11.4%, compared with +15.9% for the IMERG-GHCN comparison. Hence, roughly 70% of the IMERG-GHCN relative bias can be accounted for by spatial sampling error, the remainder being again a "genuine" bias between the two different data sets.
It is likely that precipitation extremes contribute disproportionately to the high RMSE values observed in all the regions. We define extreme precipitation days as those on which the precipitation rate exceeds 20 mm day⁻¹, in either IMERG or GHCN (or both). Retaining only extreme precipitation days, we were able to retrieve 86% of the MSE in Western Peninsular Malaysia, confirming that high RMSE values are almost entirely due to discrepancies between IMERG and GHCN measurements on extreme precipitation days.
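The arithmetic behind these shares can be checked directly from the quoted values. The short sketch below uses the daily RMSE of 13.6 mm/day reported for the IMERG-GHCN comparison in Western Peninsular Malaysia; variable and function names are illustrative.

```python
rmse_radar_radar = 9.2   # mm/day, radar at 0.0045 deg vs radar at 0.1 deg
rmse_imerg_ghcn = 13.6   # mm/day, IMERG vs GHCN, Western Peninsular Malaysia
bias_radar_radar = 11.4  # percent
bias_imerg_ghcn = 15.9   # percent


def mse_share(rmse_part, rmse_total):
    """Share of the total MSE; RMSE values must be squared before comparing."""
    return (rmse_part / rmse_total) ** 2


sampling_mse_share = mse_share(rmse_radar_radar, rmse_imerg_ghcn)
sampling_bias_share = bias_radar_radar / bias_imerg_ghcn
print(f"MSE share:  {sampling_mse_share:.2f}")
print(f"Bias share: {sampling_bias_share:.2f}")
```

Comparing the RMSE values without squaring would give a ratio of about 0.68, which overstates the share of the error attributable to spatial sampling.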
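The share of the MSE carried by extreme days can be computed as sketched below. This assumes paired daily series and uses the 20 mm/day threshold defined above; the function name and the toy data are illustrative and are not values from the study.

```python
def extreme_mse_share(imerg, gauge, threshold=20.0):
    """Fraction of the total squared error that comes from days on which
    either product exceeds the threshold (mm/day)."""
    squared = [(a - b) ** 2 for a, b in zip(imerg, gauge)]
    total = sum(squared)
    if total == 0:
        return 0.0
    extreme = sum(
        err for err, a, b in zip(squared, imerg, gauge)
        if a > threshold or b > threshold
    )
    return extreme / total


# Toy series: one extreme day dominates the squared error.
imerg_daily = [0.0, 30.0, 5.0]
gauge_daily = [1.0, 10.0, 5.0]
share = extreme_mse_share(imerg_daily, gauge_daily)
print(round(share, 4))
```

Since the denominator is the sum over all days, the share is the same whether it is expressed in terms of the summed squared error or the MSE.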
To investigate the distribution of errors for such events, the probability density function (PDF) of daily precipitation differences between IMERG and the three nearest GHCN stations in the Subang area was calculated, following the method of Holloway et al. (2012). Precipitation bins were defined following a regular logarithmic increase in magnitude from 0.5 to 100 mm day⁻¹ for both positive and negative differences. The PDF at bin i was calculated using the following formula:
$$\mathrm{PDF}_i = \frac{n(\Delta pr_i < \Delta Pr < \Delta pr_{i+1})}{N},$$
where n(Δpr_i < ΔPr < Δpr_{i+1}) designates the number of extreme precipitation days (as defined above) for which the precipitation difference (ΔPr) is within the bin limits set by Δpr_i and Δpr_{i+1}, and N is the total number of extreme precipitation days.
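A binned distribution of this kind can be sketched as follows. The sketch assumes that the PDF value in each bin is the fraction of extreme days falling in that bin, as the definitions of n and N suggest; the number of bins and the function names are assumptions, since the text gives only the range of 0.5 to 100 mm/day.

```python
def log_edges(low=0.5, high=100.0, n_bins=10):
    """Bin edges with a constant ratio between successive edges."""
    ratio = (high / low) ** (1.0 / n_bins)
    return [low * ratio ** k for k in range(n_bins + 1)]


def difference_pdf(differences, edges):
    """Fraction of days per magnitude bin, separately for negative and
    positive differences. Differences outside the edges are not binned
    but still count towards the total N."""
    n_total = len(differences)
    negative = [0] * (len(edges) - 1)
    positive = [0] * (len(edges) - 1)
    for diff in differences:
        magnitude = abs(diff)
        for i in range(len(edges) - 1):
            if edges[i] <= magnitude < edges[i + 1]:
                (positive if diff > 0 else negative)[i] += 1
                break
    return ([c / n_total for c in negative],
            [c / n_total for c in positive])


# Toy differences (IMERG minus gauge, mm/day) on four extreme days.
neg_pdf, pos_pdf = difference_pdf([-20.0, 20.0, 20.0, 1.0], [0.5, 5.0, 50.0])
print(neg_pdf, pos_pdf)
```

Handling the two signs with mirrored bins keeps the logarithmic spacing on both sides of zero, which a single set of log-spaced edges could not do.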
The resulting distribution of IMERG versus GHCN daily extreme precipitation differences is bimodal with one local maximum near −20 mm day⁻¹ and another one near +20 mm day⁻¹ (solid line in Figure 3). The maximum near +20 mm day⁻¹ mostly reflects precipitation events that occurred in the IMERG data but did not occur in the GHCN stations, and vice versa for the maximum near −20 mm day⁻¹. Notably, such discrepancies are more frequent (note the logarithmic vertical axis in Figure 3) than events where the difference in precipitation intensity is less than 20 mm day⁻¹. There is also a non-negligible frequency of events for which the differences between IMERG and GHCN daily precipitation are much higher, above 50 mm day⁻¹. These events contribute the most to the RMSE. These observations are not reassuring for the use of IMERG in evaluating NWP of extreme precipitation, unless they are the consequence of the spatial sampling error.
To ascertain whether the very large IMERG-GHCN precipitation differences can be attributed to the spatial sampling error, we examine the equivalent PDF for differences between the two different spatial resolutions of the Subang radar data. Each 0.0045° radar daily precipitation data point was subtracted from the daily precipitation estimate of its nearest 0.1° grid point equivalent. The PDF of the radar data (dashed line in Figure 3) was constructed, retaining only the lowland radar grid points for a better comparison with the IMERG-GHCN distribution. The two distributions are very similar. The radar-radar distribution also displays a bimodal shape with local maxima at ±20 mm day⁻¹ and a local minimum at 0 mm day⁻¹ of the same amplitude as the IMERG-GHCN distribution. This again highlights the large contribution of the spatial sampling error in explaining the large RMSE values, especially for extreme precipitation. This error cannot be ignored for a correct validation of IMERG extreme precipitation in the MC, which in turn will serve for NWP evaluation.

Evaluation of IMERG Reliability for Extreme Precipitation Thresholds
Extreme precipitation is often defined in relative terms by using the local statistical distribution of precipitation to calculate a threshold such as the 95th percentile of precipitation over a given accumulation period. In this context, it is useful to know for which percentiles IMERG gives reliable estimates and those that should be avoided when using IMERG for NWP evaluation.

Subang Region of Western Peninsular Malaysia
To evaluate the reliability of IMERG at various percentile thresholds we examine a quantile-quantile plot of IMERG versus GHCN precipitation for the three Malaysian stations closest to the Subang radar for northern winter (October-March; blue line in Figure 4). The uncertainty of the percentile values is shown by error bars that cover the 95% confidence interval. If there was a perfect correspondence, the blue line would follow the black 1:1 control line.
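A quantile-quantile comparison of this kind pairs the same percentile from two samples. The sketch below is a minimal illustration that uses linear interpolation between order statistics; the percentile method used in the study is not stated, so this choice and the function names are assumptions.

```python
def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    data = sorted(values)
    position = (len(data) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] * (1.0 - fraction) + data[upper] * fraction


def qq_pairs(x, y, levels):
    """Pairs (x percentile, y percentile) to plot against the 1:1 line."""
    return [(percentile(x, q), percentile(y, q)) for q in levels]


# Toy daily series in mm/day: a point gauge and a smoother gridded product.
gauge_toy = [0.0, 0.0, 1.0, 5.0, 60.0]
grid_toy = [0.5, 1.0, 2.0, 6.0, 30.0]
pairs = qq_pairs(gauge_toy, grid_toy, [50, 95])
print(pairs)
```

Points below the 1:1 line at high percentiles indicate that the second sample has weaker extremes than the first, and points above it at low percentiles indicate more frequent light precipitation.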
However, in practice, there will be errors due to spatial sampling (Section 3.2) and other sources. The spatial sampling error can be accounted for by the use of radar data at both the 0.1° and native (0.0045°) resolution, giving an expected theoretical quantile-quantile relationship due to spatial sampling alone (solid green control R-R line in Figure 4). The solid green spatial sampling line does not follow the black 1:1 line. In particular, for extreme precipitation (95th percentile and higher), the green line is below the 1:1 line, indicating that, for example, the 95th percentile of radar precipitation on the native high-resolution grid is larger than the 95th percentile of radar precipitation on the coarser IMERG grid. This neatly illustrates that the effect of spatial averaging is to reduce extremes. This effect works in the opposite sense at the lower percentiles. Here, the green line is above the 1:1 line. Hence, a very low rainfall rate (of a given value, e.g., 0.5 mm day⁻¹) is more likely to be observed in low spatial resolution data than in high-resolution data, due to spatial aggregation. In summary, we would not expect the IMERG-GHCN quantile-quantile line to follow the black 1:1 line, because of the spatial sampling effect. We might expect it to follow the green R-R control line, however.
Figure 3 (caption, beginning missing). The PDF of the difference between daily land precipitation from the Subang radar on its native grid and the radar precipitation averaged over the nearest IMERG grid box (dashed line) is also shown for ease of comparison. Both PDFs are conditioned on extreme daily precipitation, defined as days for which at least one of the products exhibits daily precipitation above 20 mm day⁻¹.
The control R-R quantile-quantile (solid green) line was calculated using the radar data with time interpolation to fill the missing values. For a rough estimation of the interpolation uncertainty, the R-R quantile-quantile line was recalculated by substituting missing values with zero (green dashed line in Figure 4). This lies below the original control R-R line for the whole range of precipitation percentiles with a difference of about 25%.
The radar precipitation product itself presents multiple uncertainties that need to be taken into account in the analysis. In particular, the reflectivity-rainfall (Z-R) relationship is a substantial source of uncertainty. These uncertainties were taken into account in our study by the use of three different Z-R relationships: Marshall-Palmer (Marshall et al., 1947), Rosenfeld (Rosenfeld et al., 1993), and WSR (Fulton et al., 1998). The Marshall-Palmer relationship resulted generally in the weakest rainfall rates, while the Rosenfeld relationship produced the highest rainfall rates, and the WSR relationship led to rainfall rates in between. Solid particles such as hail can also alter the radar signal by amplifying it. The uncertainty related to that was estimated by capping extreme reflectivities at 53 dB. The uncertainty linked to potential hail contamination is non-negligible, although weaker than that linked to the Z-R relationship (not shown). In the following, we use the WSR Z-R relationship without capping as default. The total radar uncertainties were calculated using the minimum and maximum values of the 6 radar estimates emanating from the 3 different radar Z-R relationships with and without cap. The union of the 95% confidence intervals of these minimum and maximum values was taken to account for the percentile uncertainty. The resulted intervals are represented by a shaded gray area and the IMERG 95% confidence intervals are represented by errors bars in Figure 4).
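The construction of the six radar estimates can be sketched with power-law Z-R relationships of the form Z = aR^b. The coefficients below are the commonly quoted ones for each relationship (Marshall-Palmer a = 200, b = 1.6; WSR default a = 300, b = 1.4; Rosenfeld tropical a = 250, b = 1.2); that the study used exactly these values, and a cap expressed in dBZ, are assumptions of this sketch.

```python
# Commonly quoted (a, b) coefficients for Z = a * R**b (assumed here).
ZR_COEFFS = {
    "marshall_palmer": (200.0, 1.6),
    "wsr": (300.0, 1.4),
    "rosenfeld": (250.0, 1.2),
}


def rain_rate(dbz, relation="wsr", cap_dbz=None):
    """Rain rate (mm/h) from reflectivity (dBZ), with an optional hail cap."""
    if cap_dbz is not None:
        dbz = min(dbz, cap_dbz)
    a, b = ZR_COEFFS[relation]
    z_linear = 10.0 ** (dbz / 10.0)   # dBZ to linear units (mm^6 m^-3)
    return (z_linear / a) ** (1.0 / b)


def rate_envelope(dbz, cap_dbz=53.0):
    """Minimum and maximum of the six estimates (3 relations, capped or not)."""
    estimates = [
        rain_rate(dbz, name, cap)
        for name in ZR_COEFFS
        for cap in (None, cap_dbz)
    ]
    return min(estimates), max(estimates)


for name in ZR_COEFFS:
    print(name, round(rain_rate(40.0, name), 1))
print(rate_envelope(60.0))
```

With these coefficients the ordering at moderate to high reflectivity is Marshall-Palmer lowest, WSR intermediate, and Rosenfeld highest, in line with the behaviour described above; the cap only affects reflectivities above 53 dBZ.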
The blue IMERG-GHCN quantile-quantile line remains within the two green control R-R lines from the 60th (∼1.5 mm day⁻¹) to the 95th percentile (∼35 mm day⁻¹), thus displaying a high fidelity in estimating this range of precipitation values. In particular, the 95th percentile is consistent with the control R-R line (solid green line, using interpolation for missing values) with a relatively low uncertainty of about 20%. The 95th percentile thus appears to be a reliable choice for evaluation of extreme precipitation in NWP against IMERG.
For percentiles above the 95th, IMERG remains close to GHCN (i.e., close to the black 1:1 control line), but increasingly deviates above the solid green R-R control line at higher percentiles. Indeed, the 99th percentile of IMERG is approximately 70 mm day⁻¹ against an expected value of about 50 mm day⁻¹ (from the green R-R lines). The 99th percentile of IMERG lies beyond the R-R uncertainty envelope, which means that the overestimation is significant. This reflects a tendency for IMERG to overestimate very extreme precipitation and reach values that tend to be higher than expected for its resolution. It should be noted that IMERG values are corrected by GPCC monthly accumulations (Section 2.1). Given that only one GPCC station was used to make this correction in Malaysia (M. L. Tan & Santo, 2018), it may not be surprising that IMERG precipitation extremes have the same magnitude as station precipitation extremes, and thus overestimate area-averaged precipitation extremes. The fact that IMERG remains close to GHCN for these extreme percentiles can be useful for estimating the potential values that extreme precipitation could reach in local areas. However, these high percentiles are not recommended for NWP evaluations against IMERG, since NWP models produce gridded output that usually does not represent such local point measures of precipitation.
Quantiles are calculated at 5% intervals from the 50th to the 95th percentile, then at the 97.5th, 99th, and 99.9th percentiles. The red markers highlight the 50th (square), 95th (diamond), and 99th (asterisk) percentiles. Error bars show the 95% confidence interval. The black line shows the 1:1 control line. To account for spatial sampling error, the green lines represent the quantile-quantile diagram of Subang radar daily precipitation in low-land areas versus the corresponding (nearest neighbor) daily precipitation of the Subang radar averaged on the IMERG grid, with temporal interpolation over missing values (solid green line; control R-R), and by substituting each instantaneous missing value by zero (green dashed line). The gray shading corresponds to the merged 95% confidence intervals of the green lines.
IMERG tends to overestimate the number of low precipitation rate days (<1.5 mm day⁻¹, or the 60th percentile), compared to the solid green R-R line. The overestimation is significant for precipitation below 0.9 mm day⁻¹, where the IMERG line lies above the R-R uncertainty envelope. It should be noted that percentiles below the 50th were not represented in Figure 4 because they are all equal to 0 mm day⁻¹ for GHCN, and thus do not fit a log-log representation. The number of dry days is lower for IMERG than for GHCN (not shown). Non-meteorological targets such as insects affect the radar retrievals, making it impossible to detect dry days and thus to evaluate more accurately whether IMERG detects fewer dry days than it should at its resolution.
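The quantile-quantile comparison used in this section can be sketched as follows, using the quantile levels described for Figure 4. The data are synthetic, and the percentile bootstrap used for the 95% confidence interval is an assumption, since the method used for the confidence intervals is not described in this section. It also ignores temporal autocorrelation of daily precipitation. The names `LEVELS`, `qq_pairs`, and `bootstrap_ci` are hypothetical.

```python
import numpy as np

# 50th to 95th percentile in 5% steps, then 97.5th, 99th and 99.9th.
LEVELS = np.concatenate([np.arange(50, 100, 5), [97.5, 99.0, 99.9]])

def qq_pairs(reference, estimate, levels=LEVELS):
    """Return matching percentiles of two samples, ignoring missing values."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    ref_q = np.percentile(ref[~np.isnan(ref)], levels)
    est_q = np.percentile(est[~np.isnan(est)], levels)
    return ref_q, est_q

def bootstrap_ci(values, level, n_boot=1000, seed=0):
    """95% percentile-bootstrap interval for one percentile (assumed method)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot = np.percentile(samples, level, axis=1)
    return np.percentile(boot, [2.5, 97.5])

# Synthetic daily precipitation (mm/day) standing in for station and gridded data.
rng = np.random.default_rng(1)
station = rng.gamma(shape=0.4, scale=20.0, size=2000)
gridded = 0.8 * station + rng.gamma(shape=0.4, scale=4.0, size=2000)

ref_q, est_q = qq_pairs(station, gridded)
ci_low, ci_high = bootstrap_ci(gridded, 95)
print(np.round(ref_q, 1))
print(np.round(est_q, 1))
print(ci_low, ci_high)
```

Plotting `est_q` against `ref_q` on log-log axes, with the 1:1 line for reference, gives a diagram of the same type as Figure 4.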

Other Regions in the MC
We now investigate whether these conclusions hold for areas outside of the Subang area (Western Peninsular Malaysia) and for seasons other than northern winter, using six selected areas in Malaysia and in the Philippines (Figure 5). The absence of a high-resolution data set equivalent to the radar in Subang makes it difficult to precisely determine IMERG performance against the location-specific spatial sampling error in these regions. However, in most regions, the percentile relationships between IMERG and GHCN are very similar to the one observed in Subang: IMERG displays higher precipitation rates than GHCN for percentiles below the 90th percentile and is similar to GHCN for percentiles above the 90th percentile. This is the case in Western Peninsular Malaysia, Eastern Peninsular Malaysia, Northwest Borneo, and Western Philippines during northern summer and Eastern Philippines during northern winter. While the optimal percentile cannot be precisely determined for these regions, the similarity with Subang suggests that the IMERG 95th percentile is also likely to be a suitable reference for evaluating NWP extreme precipitation in these regions. Conversely, higher percentiles are not recommended for NWP evaluation, as they will tend to overestimate area-averaged precipitation.
The performance of IMERG also shows seasonal dependence (Oliveira et al., 2016; M. L. Tan & Santo, 2018). This is particularly true in both the Western and Eastern Philippines (Figures 5d and 5e). Indeed, IMERG displays higher precipitation rates than GHCN for every precipitation percentile during northern winter in the Western Philippines, whereas this is only the case for the lowest precipitation during northern summer (Figure 5d). Thus, the positive bias for IMERG extreme precipitation is stronger during northern winter in the Western Philippines. This stronger overestimation might be explained by enhanced errors in the IMERG morphing scheme in this region, which is subject to easterlies during the northern winter, such that most of the precipitating systems (including tropical cyclones) come from the east and cross the Cordillera Central mountain range. The propagation of precipitation in IMERG is based on the motion of total precipitable water vapor fields of the MERRA-2 reanalysis, which may underestimate the mountain blocking effect on precipitation due to its relatively coarse spatial resolution. The use of IMERG for NWP evaluation of extreme precipitation in this region during northern winter should therefore be approached with caution.
In the Eastern Philippines, the weak precipitation is underestimated by IMERG during northern winter but overestimated in northern summer (Figure 5e); the rainfall matches GHCN station data above the 90th percentile for both seasons, suggesting that the 95th percentile choice for evaluating extreme precipitation also holds during the northern winter in this region.
The case of the mountain Philippines station (Figure 5f) remains undetermined because of the use of only one GHCN station, on the western side of the Cordillera Central mountain range. In mountain regions, the statistical distribution of precipitation extrema will vary spatially within a single IMERG grid box (∼11 km) due to topographic effects largely absent in coastal land areas. Indeed, precipitation will tend to be systematically heavier at high altitude than low altitude or on the windward side compared to the leeward side of individual mountains. These patterns of precipitation will persist between events, in contrast to the more random spatial distribution of rainfall over flat topography. These topographic controls will lead to spatial biases even in perfect observations. Overall, the 95th percentile appears to be a suitable choice for evaluating NWP daily precipitation in most of the regions evaluated here. However, this choice of percentile may not necessarily be appropriate for sub-daily precipitation extremes, which are examined in Section 3.4.

Evaluation of Sub-Daily IMERG Precipitation Accumulation Against Radar
The Subang radar makes it possible to evaluate IMERG precipitation on sub-daily time scales. By comparing the IMERG data to the radar data gridded onto the same 0.1° IMERG grid, the spatial sampling error disappears. The uncertainties related to the Z-R relationship and potential hail contamination are evaluated in a similar way as in the previous section. The resultant intervals, as well as the IMERG 95% confidence intervals, are represented by error bars in Figure 6. The uncertainties are far larger for the radar data than the IMERG data (Figure 6), mainly associated with the choice of the Z-R relationship.
Sub-daily rainfall accumulations in IMERG were evaluated against radar data by constructing quantile-quantile diagrams of IMERG accumulated precipitation against 0.1° gridded radar accumulated precipitation, for various accumulation times (from instantaneous to daily), for lowland and sea grid points separately (Figure 6). Despite the uncertainties, the comparison over land (Figure 6a) shows that IMERG overestimates the lowest precipitation amounts compared to the radar, for all accumulation time scales from instantaneous to daily. This overestimation is consistent with the previous daily comparison with GHCN station data. For higher percentiles, IMERG tends to underestimate extreme precipitation for sub-hourly timescales compared with radar. Note that this underestimation only holds for the highest percentile used here, that is, the 99.9th percentile, thus corresponding to a very small number of cases.
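Accumulating half-hourly rain rates over longer windows, before building the quantile-quantile diagrams, might look like the sketch below. It assumes rates in mm/h on a regular 30-min time step and non-overlapping windows. Whether the study used non-overlapping or running windows is not stated in this section, and `accumulate` is a hypothetical helper.

```python
import numpy as np

def accumulate(rate_mm_per_h, steps, dt_hours=0.5):
    """Sum rain rates over non-overlapping windows of `steps` time steps.

    Returns accumulations in mm; an incomplete trailing window is dropped.
    """
    rate = np.asarray(rate_mm_per_h, dtype=float)
    n_full = (rate.size // steps) * steps
    return rate[:n_full].reshape(-1, steps).sum(axis=1) * dt_hours

# Invented half-hourly rates (mm/h) covering four hours.
rate = np.array([0.0, 4.0, 2.0, 0.0, 0.0, 6.0, 0.0, 0.0])

hourly = accumulate(rate, steps=2)        # 1-h accumulations
three_hourly = accumulate(rate, steps=6)  # 3-h accumulations
print(hourly, three_hourly)
```

The same function applied to the IMERG series and to the radar series on the 0.1° grid yields paired accumulations whose percentiles can then be compared for each accumulation time.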
Overall, the results for sea grid points are qualitatively similar to those for the land grid points (Figures 6c and 6d). The overestimation of IMERG at low precipitation intensities is similar to the land case. The underestimation of IMERG sub-hourly extreme precipitation is less pronounced and no more robust than over land. As for the land regions, the temporal interpolation error does not significantly affect the quantile-quantile relationship between IMERG and radar in the sea areas around Subang (Figure 6d).
In contrast to the IMERG-GHCN comparison, we do not find any overestimation of daily IMERG precipitation at percentiles above the 95th percentile and there are no robust differences between IMERG and radar percentiles for longer accumulation times. In addition to the aforementioned radar uncertainties, there are several possible explanations for this. Temporal interpolation was necessary to fill gaps in the radar data, which may have induced errors; we estimate the potential impact of these by drawing a similar quantile-quantile diagram retaining only periods without any missing values (Figure 6b). While this subsetting induces a significant decrease in the number of events (from 89 to 10 days), the qualitative findings remain the same and they are also replicated over the sea (Figures 6c and 6d). We therefore conclude that our findings are not dependent on the temporal interpolation method. Another potential reason for the apparent discrepancy between the radar and GHCN comparisons is the difference in the period considered in each comparison. The IMERG versus GHCN comparison was done using nearly 20 years of data between 2001 and 2019 (without removing missing values), whereas the IMERG versus radar comparison was done with spatially aggregated data from January 11 to April 15, 2019. The 95% confidence interval error bars drawn in the IMERG-GHCN comparison account for the uncertainty linked to the representativeness of the chosen period for the distribution of precipitation. However, the same error bars in the IMERG-radar comparison mostly account for the spatial representativeness rather than the temporal representativeness, since time series from many grid points (86) were aggregated in this case compared to 3 for the IMERG-GHCN comparison. Consequently, qualitative differences between the comparisons can be observed without contradiction. This suggests that although IMERG tends to overestimate the very high percentiles of daily precipitation, this overestimation is not necessarily present for all heavy precipitation events.

Representation of the Diurnal Cycle by IMERG
One of the major challenges for NWP is to correctly represent the diurnal cycle of precipitation. This is especially important for precipitation extremes, which often result from a complex interaction between the diurnal cycle and large-scale, slowly evolving forcings. With its 30-min output frequency, the IMERG product appears to be a good candidate to evaluate the diurnal cycle in NWP models. In this section, we use the Subang radar to assess the fidelity of IMERG in capturing the diurnal cycle of precipitation. Figure 7 shows the 90th, 95th, and 99th percentiles and the mean of instantaneous precipitation as a function of the time of day, for both the Subang radar and IMERG in both lowland and sea grid points. Despite the large uncertainties, IMERG agrees with the radar data with regard to the mean precipitation peak time in both lowland and sea areas. Mean precipitation peaks at about 6 UTC+8 over the sea and at 17 UTC+8 over the lowland areas for both IMERG and radar (Figure 7a). For most times, the mean precipitation intensities are not significantly different between IMERG and radar, although the uncertainty in the radar data is very large.
This good agreement of mean precipitation hides some disparities in the statistical distribution of instantaneous precipitation, as seen previously in the quantile-quantile diagrams (Figure 6). At the 90th percentile, IMERG consistently overestimates precipitation compared with the radar, especially for the peaks. The 95th percentile of IMERG precipitation remains quite close to the radar 95th percentile of precipitation, especially over the sea. In the lowland areas, the IMERG 95th percentile precipitation peak is still stronger than the radar one, but the differences are generally not significant with respect to the Z-R relationship uncertainty. However, the 99th percentile of precipitation tends to be underestimated by IMERG compared with the radar at the precipitation peak times in both land and sea regions. Despite these deficiencies in the amplitude of the diurnal cycle of extreme precipitation, the diurnal phase of extreme precipitation (the 90th, 95th, and 99th percentiles) is reasonably well captured by IMERG.
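A composite diurnal cycle of the mean and of selected percentiles, of the kind shown in Figure 7, can be sketched as follows. The data are synthetic with an imposed late-afternoon peak, hourly bins are used instead of the 30-min IMERG time step for brevity, and `diurnal_stats` is a hypothetical helper.

```python
import numpy as np

def diurnal_stats(rate, utc_hour, utc_offset=8, percentiles=(90, 95, 99)):
    """Mean and percentiles of rain rate for each local hour of the day."""
    rate = np.asarray(rate, dtype=float)
    local_hour = (np.asarray(utc_hour) + utc_offset) % 24
    hours = np.arange(24)
    mean = np.array([rate[local_hour == h].mean() for h in hours])
    pct = np.array([np.percentile(rate[local_hour == h], percentiles) for h in hours])
    return hours, mean, pct

# Synthetic hourly rates over 365 days with a peak at 17 local time (UTC+8).
rng = np.random.default_rng(2)
utc_hour = np.tile(np.arange(24), 365)
local = (utc_hour + 8) % 24
profile = 1.0 + 4.0 * np.exp(-0.5 * (local - 17.0) ** 2)
rate = rng.gamma(shape=2.0, scale=1.0, size=utc_hour.size) * profile

hours, mean, pct = diurnal_stats(rate, utc_hour)
print("peak local hour of the mean:", hours[np.argmax(mean)])
print("peak local hour of the 95th percentile:", hours[np.argmax(pct[:, 1])])
```

Applying the same function to IMERG and to the gridded radar data allows the phase (hour of the peak) and the amplitude of each percentile to be compared separately.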

Conclusion
Precipitation extremes have dramatic impacts on the population of the MC. Improved predictions of such events can help to mitigate their negative effects. The evaluation of NWP models against reliable observation data sets is essential in order to understand model deficiencies. In this study, we evaluated the ability of the IMERG satellite product to detect extreme precipitation with the purpose of assessing its suitability for use in NWP model evaluations in the MC.
We evaluated the overall skill of IMERG with respect to the GHCN weather station data set in Malaysia and in the Philippines. Our findings are similar to previous comparisons of IMERG with station data, with the best performance for longer accumulation times. However, we showed that the comparison of 0.1° gridded versus pointwise precipitation is subject to a spatial sampling error. Using the high-resolution radar at Subang, we were able to estimate this spatial sampling error in western Peninsular Malaysia. We found that the sampling error may represent around 45% of the MSE of daily precipitation between the GHCN weather station data and IMERG. This suggests that the skill of IMERG in detecting daily precipitation may have been underestimated in previous studies in this area and likely in the whole MC.
When the spatial sampling error described above is taken into account, IMERG was found to overestimate low intensity daily precipitation. The overestimation of low precipitation may be due to erroneous detection of precipitation by IR sensors, as suggested by previous studies. Meanwhile, for very extreme precipitation over the 95th percentile, the IMERG precipitation coincides with the GHCN measurements in most regions. Given the identified spatial sampling error, this implies that IMERG is overestimating very extreme daily precipitation compared to the true area-averaged daily precipitation. This coincidence of both IMERG and GHCN extreme daily precipitation percentiles may be related to the use of only one gauge per grid point in the GPCC gauge-analysis product (which serves for the calibration of IMERG), as individual gauges unavoidably have higher extreme values than a grid average.
The use of radar data in western Peninsular Malaysia makes it possible to estimate more precisely the ideal choice of percentile to evaluate NWP extreme daily precipitation against IMERG. Our analysis shows that it is preferable to use the 95th percentile rather than the 99th percentile of daily precipitation to evaluate NWP against IMERG in western Peninsular Malaysia. We estimated that the IMERG 95th percentile is accurate with less than 20% potential error. Therefore, a 20% difference between NWP and IMERG is the minimum threshold for identification of model deficiencies, at least for the case of daily extreme precipitation at 0.1° horizontal resolution.
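As an illustration of how the 20% threshold might be applied in practice, the sketch below compares the 95th percentile of a model sample with that of IMERG and flags differences exceeding the threshold. The data are invented, both samples are assumed to be daily precipitation on the same 0.1° grid, and the function names are hypothetical.

```python
import numpy as np

def p95_relative_difference(model, imerg):
    """Relative difference of the model 95th percentile with respect to IMERG."""
    model_p95 = np.percentile(model, 95)
    imerg_p95 = np.percentile(imerg, 95)
    return (model_p95 - imerg_p95) / imerg_p95

def exceeds_imerg_uncertainty(model, imerg, threshold=0.20):
    """True if the difference is larger than the estimated IMERG uncertainty."""
    return bool(abs(p95_relative_difference(model, imerg)) > threshold)

# Invented daily precipitation samples (mm/day).
imerg = np.arange(1.0, 101.0)
model_close = 1.1 * imerg  # 10% wetter than IMERG at every percentile
model_far = 1.5 * imerg    # 50% wetter than IMERG at every percentile

print(exceeds_imerg_uncertainty(model_close, imerg))
print(exceeds_imerg_uncertainty(model_far, imerg))
```

A difference below the threshold cannot be attributed to the model with confidence, since it lies within the estimated uncertainty of the IMERG 95th percentile itself.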
The lack of other very high-resolution observational data sets in the MC prevented us from performing the analysis with the same degree of confidence in the other selected areas. However, it was found that IMERG daily extreme percentiles match those of GHCN in (the whole of) western Peninsular Malaysia, Eastern Peninsular Malaysia, Northwest Borneo, western Philippines during northern summer, and in eastern Philippines. Assuming that the 0.1° spatial variability of daily extreme precipitation does not vary much between regions, this implies that the findings for western Peninsular Malaysia are applicable across all these regions and likely across the whole MC. Therefore, it is not recommended to use very extreme percentiles for NWP evaluation against IMERG in these regions.
We found robust overestimation of low-intensity sub-daily IMERG precipitation when compared against Subang radar data. This overestimation was found for percentiles up to the 99th percentile for sub-hourly precipitation. However, very extreme (above the 99th percentile) sub-hourly precipitation was found to be robustly underestimated by IMERG compared to the radar in lowland areas. The differences in extreme precipitation at longer accumulation times were not significant at the 95% confidence level when considering the uncertainties linked to the radar Z-R relationship and potential hail contamination of radar reflectivities. Further work aimed at reducing these uncertainties could help in diagnosing more precisely the behavior of IMERG, which would in turn improve the evaluation of NWP forecasts of extreme precipitation across the MC.
The mean diurnal cycle of precipitation is fairly well reproduced by IMERG both in timing and intensity when compared with radar data. However, the peaks of precipitation remain either overestimated for percentiles below the 95th percentile or underestimated for percentiles above the 95th. This suggests that the 95th percentile of sub-hourly precipitation would also be preferable to higher percentiles for evaluation of NWP diurnal peak precipitation against IMERG. Finally, there was no obvious decrease in IMERG performance over the sea despite the absence of gauges.
In conclusion, we find that the spatial sampling error of precipitation cannot be neglected when comparing IMERG against pointwise observations, particularly for extreme precipitation. Taking this into account, the combined evaluation of station and radar data supports the key finding that IMERG data are reliable for use in evaluating NWP simulations of extreme precipitation at the 95th percentile, with lower reliability at both higher and lower percentiles.