General Circulation Model Selection Technique for Downscaling: Exemplary Application to East Africa

Downscaling is widely used in studies of local and/or regional climate, as it yields a greater spatial resolution than general circulation models (GCMs) can provide. It utilizes GCM output or reanalysis data, which is either transformed using mathematical relationships or used to force the lateral boundaries of a regional climate model. However, there is no established selection technique for determining which GCM realization(s) to employ. Here, a comprehensive yet easily applicable model selection technique for studies requiring GCM data as a constraint was developed. The technique evaluates, with respect to a reanalysis product and/or observational data, the ability of GCM realizations to reconstruct the mean state of the climate and the space-time climatic anomalies of the atmospheric state variables at three distinct pressure levels. It was applied to the region of East Africa, where GISS-E2-H r6i1p3 was found to perform the strongest. The top ranked realizations were also found to better capture regional processes, as evaluated for the example of the Indian Ocean Dipole. Furthermore, the surface air temperature and precipitation from three 10-year regional climate model simulations, one forced by the Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2) reanalysis, one forced by the top ranked GCM, and one by the lowest ranked GCM, were compared to gridded observations. Results show that using a top ranked GCM for the boundary conditions leads to a better dynamical downscaling simulation than a low-ranked GCM, suggesting the potential of the proposed technique for future downscaling studies.

To address this issue, we will focus on model selection for a case study of East Africa and the high-elevation climate on Kilimanjaro, Tanzania. This area is of interest as African climate suffers from a lack of research attention (Washington et al., 2006) and is highly vulnerable to climate change (Boko et al., 2007; Niang et al., 2014). Furthermore, high-altitude regions like Kilimanjaro are topographically complex and particularly sensitive to climate change (e.g., Ohmura, 2012; Pepin et al., 2015). The Kilimanjaro glaciers provide a fitting case study as they allow for the capture of large-scale and local-scale climate change dynamics and examination of how these scales are linked by the mesoscale atmospheric circulation over the mountain (Mölg & Kaser, 2011; Mölg et al., 2006, 2009, 2012). They have retreated dramatically over the twentieth century (N. Cullen et al., 2013) and have been extensively studied on a local scale (e.g., N. J. Cullen et al., 2007; Mölg et al., 2008, 2009) and using regional atmospheric modeling (e.g., Collier et al., 2018; Mölg & Kaser, 2011; Mölg et al., 2012). Their large-scale influences have also been studied (Collier et al., 2018; Endris et al., 2016, 2019; Mölg et al., 2006; Thielke & Mölg, 2019), and multiannual automated weather station (AWS) records are available for model evaluation. With the local, regional, and macroscales in the area all being evidently affected by climate change, it provides an ideal location for downscaling.
In this study, we use past performance model evaluation to determine how well GCMs capture the local and regional climate of East Africa and the midtroposphere layers around Kilimanjaro. The goal is to provide a useful selection criterion that can be used for downscaling studies and that is, moreover, easily applicable to other regions. In Section 2, we describe the study area, the GCM data, and the regional and local data used for model evaluation. In Sections 3 and 4, we present the model selection procedure and its results. In Sections 5 and 6, we discuss the potential benefits of the model rankings and provide concluding remarks and a future outlook.

Kilimanjaro's summit area includes the peaks Mawenzi (5,140 MSL) and Kibo (5,895 MSL). Modern glaciation is confined to Kibo, where a total ice cover loss of 85% occurred from 1912 to 2011 (N. Cullen et al., 2013). These glaciers and their dependencies on, and signals of, mesoscale atmospheric and large-scale climate dynamics have been extensively studied for more than a decade (e.g., Collier et al., 2018; N. J. Cullen et al., 2007; Kaser et al., 2004; Mölg & Kaser, 2011; Mölg et al., 2008, 2009). Four AWSs have been installed on the peak. AWS 1 (3.067°S, 37.350°E) is located on the Northern Ice Field at 5,794 MSL and has been operational since March 2000 (Hardy, 2011). In February 2005, AWS 2 and AWS 3 were installed. AWS 2 (3.060°S, 37.349°E) is near the base of a vertical ice cliff face on the Northern Ice Field at 5,763 MSL (Winkler et al., 2010). AWS 3 (3.078°S, 37.354°E) is located on the upper slopes of Kersten Glacier in the Southern Ice Field at 5,873 MSL and has been the subject of significant research (e.g., Collier et al., 2018; Mölg & Kaser, 2011; Mölg et al., 2008, 2009, 2012). AWS 4 (3.081°S, 37.352°E) was installed in January 2009 on the midslopes of the Kersten Glacier at 5,603 MSL (Mölg et al., 2020).
For this study, only AWS 1 and AWS 3 are suitable, since their records are sufficiently long and they are not influenced by a special (ice cliff) microclimate like AWS 2. Consideration of AWS 1 and AWS 3 data in the present GCM context is only possible due to their location on the peak of an isolated mountain at almost exactly 500 hPa, where local conditions (of daily or longer means) mostly reflect the large-scale properties of the tropical midtroposphere (Mölg et al., 2009; Pepin et al., 2010). Records for AWS 1 and AWS 3 span March 2000 to September 2017 and February 2005 to December 2013, respectively. The monthly means of the air temperature, specific humidity, and zonal and meridional wind measurements were considered. Details concerning the sensors and data processing can be found in Mölg et al. (2008, 2009) and Hardy (2011).
GCMs were accessed from the CMIP5 repository (https://esgf-node.llnl.gov/projects/esgf-llnl/) in September 2018 (Taylor et al., 2012). For GCMs with more than one initial-condition ensemble member (realization), each realization was analyzed independently, as each represents a possible outcome with slightly differing initial conditions and timing of events. The atmospheric state variables (U and V winds, specific humidity, and temperature) were the focus of the selection procedure due to their importance for downscaling studies. Models were culled depending on the availability of the monthly means of these variables at 200, 500, and 850 hPa for the historical simulation, a simulation of the recent past that is consistent with the instrumental era and includes natural and anthropogenic forcings. This left 50 models and 213 realizations. Moreover, since the aim is to provide a selection technique for downscaling, models were also culled depending on whether they provided the boundary conditions required for dynamical downscaling (6 hourly) for the historical simulation and at least one representative concentration pathway (RCP). This led to the elimination of 24 GCMs and a total of 180 GCM realizations. The remaining 26 GCMs, with their 33 realizations in total, were subjected to the model selection procedure and can be found in Table 1.

Model Selection
Building upon the work of McSweeney et al. (2012, 2015) and Overland et al. (2011), a three-tier evaluation strategy was applied to the realizations in Table 1: (1) whether the global warming signal was captured, (2) how well the tropospheric circulation of the African domain was simulated, and (3) how well the climate of the air layers sampled at the summit of Kilimanjaro was modeled. The first tier is a precondition that evaluates the robustness of the GCM on the global scale, while the following two tiers analyze its ability to reproduce the climate of the region. A schematic can be found in Figure 2 and is explained in Sections 3.1-3.3.

Global Warming Signal
First, as a precondition to the selection procedure, the robustness of each GCM realization was evaluated by examining its ability to reconstruct the global warming signal; a model that cannot capture this global signal can hardly be considered robust. Studies of global temperature evolution over 1979-2005 using observational data sets and global reanalysis products have shown a warming trend (Foster & Rahmstorf, 2011; Simmons et al., 2017). Using the historical simulation, the ability of the realizations to capture this tendency was tested by determining their global temperature trend over 1979-2005. If a realization simulates a positive temperature trend over this period, it is deemed to meet the global warming criterion and continues on to the selection procedure.
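As a minimal sketch (not the authors' code), this precondition reduces to a least-squares trend fit of the global, area-weighted annual mean temperature; the function name and the synthetic series below are purely illustrative.

```python
# Hedged sketch: does a realization's global-mean temperature series
# show a positive least-squares trend over 1979-2005?
import numpy as np

def passes_warming_criterion(annual_global_mean_t):
    """annual_global_mean_t: 1-D array of area-weighted global annual
    mean temperatures (K), e.g., 27 values for 1979-2005. Returns True
    if the linear trend is positive."""
    years = np.arange(len(annual_global_mean_t))
    slope, _intercept = np.polyfit(years, annual_global_mean_t, 1)
    return slope > 0

# Synthetic example: a 0.02 K/yr warming signal
series = 287.0 + 0.02 * np.arange(27)
print(passes_warming_criterion(series))  # → True
```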

African Domain
Past performance of the GCM realizations for the African domain was evaluated seasonally (dry: January-February [JF] and June-September [JJAS]; wet: March-May [MAM] and October-December [OND]) by comparison to MERRA-2 reanalysis data for the region. Performance was assessed for the historical simulation over the period 1980-2012. The historical simulation ends in 2005 and was extended to 2012 using the historicalExt simulation, a continuation of the historical simulation. If the historicalExt simulation was not available, the historical simulation was extended using RCP 4.5, as it was the sole RCP available for all the GCM realizations being subjected to the selection procedure (Table 1). Furthermore, it should be noted that all RCP scenarios are nearly identical over this period and only begin to diverge around 2025; the choice of RCP scenario used to extend the simulations should therefore not influence the results (Maule et al., 2017). Evaluation of each of the atmospheric state variables (temperature, specific humidity, and U and V winds) was undertaken at 200, 500, and 850 hPa. These variables represent the driving data for dynamical downscaling, and studying them at 200, 500, and 850 hPa allows for examination of model performance throughout the troposphere. They were regridded using bilinear interpolation to match the resolution of the MERRA-2 reanalysis data (0.5° latitude × 0.625° longitude), and time means were taken over each season for each year of the 33-year study period. Analysis of past performance focused on the dominant spatial patterns of variability for each season.
Empirical orthogonal function (EOF) analysis was used to extract these spatial patterns, as this approach is well established (Hannachi et al., 2007; Wilks, 2006). Degenerate EOFs were eliminated using North's rule of thumb (North et al., 1982), since nonunique spatial patterns cannot be compared. The nondegenerate EOFs of MERRA-2 and their variance fraction explained, with its 95% confidence interval, were calculated and can be found in Table A1. For the 33 retained GCM historical simulations, the EOFs and their variance fraction explained with its 95% confidence interval corresponding to each of the MERRA-2 nondegenerate EOFs were calculated. For each variable, EOF 1 of MERRA-2 and its variance fraction explained, if nondegenerate, was compared with that of the GCM realization, and so on, until all nondegenerate EOFs of MERRA-2 were analyzed. This strict criterion was applied even though a realization may have inverted the dominant spatial patterns (e.g., the dominant spatial pattern of MERRA-2 is represented by EOF 2 in the GCM realization), as such an inversion reflects a shortcoming of the realization in reproducing the dominant spatial pattern in the first place. To evaluate how well the realization captured the variance fraction explained, the relative error with respect to MERRA-2 was calculated (see Section 3.4). Furthermore, the dominant spatial patterns of variability found in the nondegenerate EOFs from the MERRA-2 reanalysis data were correlated with the respective EOF patterns for each realization using the Pearson correlation. The relative error of these correlations was determined and then combined with that from the associated variance fraction explained, giving each equal weighting (50%), providing one metric evaluating how well the tropospheric circulation of the African domain was simulated.
These metrics were combined as we want to determine how well the realization captures the spatial pattern and its associated variance fraction explained for the nondegenerate EOF of interest. This means that we do not want to rank a realization higher if it attributes the correct variance fraction explained but for an incorrect spatial pattern.
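The EOF extraction, the North's-rule screening, and the equally weighted combination of pattern correlation and variance fraction described above can be sketched as follows. This is a minimal reading of Sections 3.2 and 3.4, not the authors' code; the array shapes, the SVD-based EOF computation, and the exact overlap criterion for degeneracy are our assumptions.

```python
# Hedged sketch of the EOF-based African-domain metric.
import numpy as np

def eofs_with_north_test(field, n_keep=4):
    """field: (n_years, n_points) seasonal means on a common grid.
    Returns EOF patterns (rows of vt), variance fractions, and indices of
    EOFs deemed nondegenerate by North's rule of thumb."""
    anom = field - field.mean(axis=0)
    u, s, vt = np.linalg.svd(anom, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    # North et al. (1982) sampling error of the eigenvalues
    err = var_frac * np.sqrt(2.0 / field.shape[0])
    nondegenerate = []
    for k in range(min(n_keep, len(s))):
        neighbors = [j for j in (k - 1, k + 1) if 0 <= j < len(s)]
        overlap = any(abs(var_frac[k] - var_frac[j]) < err[k] + err[j]
                      for j in neighbors)
        if not overlap:
            nondegenerate.append(k)
    return vt, var_frac, nondegenerate

def combined_eof_metric(ref_pattern, ref_vf, mod_pattern, mod_vf):
    """Equal (50/50) weighting of spatial-pattern agreement and variance
    fraction agreement, each expressed as an error in [0, 1]."""
    r = np.corrcoef(ref_pattern, mod_pattern)[0, 1]
    pattern_err = 1.0 - abs(r)                    # EOF sign is arbitrary
    vf_err = min(abs(mod_vf - ref_vf) / ref_vf, 1.0)
    return 0.5 * pattern_err + 0.5 * vf_err
```

A perfect match (identical pattern and variance fraction) yields a combined error of 0, matching the 0-is-best convention used throughout the ranking.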

Kilimanjaro
The 33 GCM historical simulations were also compared with observations from AWS 1 and AWS 3 at the summit of Kilimanjaro to determine their ability to reconstruct the large-scale climatic properties of the tropical midtroposphere. Evaluation focused on reconstructing the mean state of the climate system at this location, since models typically have more problems capturing the mean state than the anomalies thereof (Bindoff et al., 2013). With respect to all the model resolutions, AWS 1 and AWS 3 are located in the same grid cell at 500 hPa (the mean air pressure at AWS 3, e.g., is 502 hPa). Hence, their measurements were averaged together, leading to a monthly observational record spanning March 2000 to September 2017. Using this as the assessment period, the realizations were extended to September 2017 using the historicalExt simulation, when available (Table 1), and were individually evaluated against the observational record for three distinct metrics. First, their ability to reproduce the overall mean state of temperature, specific humidity, and U and V winds was tested. The mean state of the observational record and realizations was calculated, along with the standard deviation at 2σ, for each variable over the entire comparison period; this standard deviation will be referred to as the 95% confidence interval. The absolute mean difference between the realizations and observations was used to analyze how well the mean state was reproduced. Second, the nonparametric Kolmogorov-Smirnov (KS) test was run for each variable and realization to determine whether they came from the same distribution as the observations. Here, realizations are evaluated with respect to their KS statistic (D value) and the null hypothesis that the observations and GCM realization are drawn from the same source. Third, the mean annual cycle of the observations and realizations was compared by calculating the Pearson correlation and the root mean square error (RMSE).
To further quantify and rank how each realization performed during these tests, the relative error was calculated.
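The three point-scale tests can be sketched as below. This is a hedged illustration that assumes whole calendar years of time-aligned monthly data (the actual record spans March 2000 to September 2017); the function and dictionary key names are ours.

```python
# Hedged sketch of the three Kilimanjaro grid-cell metrics (Section 3.3).
import numpy as np
from scipy import stats

def point_metrics(obs, model):
    """obs, model: time-aligned monthly series covering whole calendar
    years (Jan-Dec), a simplification of the actual assessment period."""
    metrics = {}
    # (1) Absolute mean difference: agreement of the overall mean state
    metrics["abs_mean_diff"] = abs(model.mean() - obs.mean())
    # (2) KS test: D value against the same-distribution null hypothesis
    metrics["ks_D"], metrics["ks_p"] = stats.ks_2samp(obs, model)
    # (3) Mean annual cycle: per-calendar-month average over all years,
    # compared via Pearson correlation (shape) and RMSE (magnitude)
    obs_cycle = obs.reshape(-1, 12).mean(axis=0)
    mod_cycle = model.reshape(-1, 12).mean(axis=0)
    metrics["cycle_r"] = np.corrcoef(obs_cycle, mod_cycle)[0, 1]
    metrics["cycle_rmse"] = np.sqrt(np.mean((mod_cycle - obs_cycle) ** 2))
    return metrics
```

A model series with a constant +0.5 offset from the observations, for example, would score an absolute mean difference and cycle RMSE of 0.5 but a perfect cycle correlation of 1, illustrating why shape and magnitude are assessed separately.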

Testing Synopsis
The tests performed for model selection evaluated how well the realizations captured the mean climatic state and simulated the anomalies of the space-time patterns. Apart from meeting the precondition of capturing the global warming signal, no one metric was weighted above another as each examines how a realization simulates the climate of the African domain and the large-scale climatic properties of the tropical midtroposphere in the region. Furthermore, to maintain the transferability of the procedure, no preset thresholds or cutoffs were used. These can be selected by the user since they are dependent on the problem being addressed.
The relative error for each metric provides a good tool to evaluate realization performance but is not indicative of performance in absolute terms. It was calculated using the technique outlined in Rupp et al. (2013) and results in a normalized value ranging between 0 and 1, where 0 represents the best fit and 1 the worst. For each realization, the root mean square (RMS) of the relative errors from the individual tests was calculated:

RMS = sqrt((1/n) Σ_{i=1}^{n} modelRE_i²),  (1)

where modelRE_i is the relative error of the model for metric i (a value between 0 and 1) and n is the number of metrics (73). It should be noted that, in our case, RMS and RMSE are equivalent, as the best-case relative error is always 0. This ranking method highlights which GCM realization performs best on average rather than simply which GCM has the lowest overall relative error.
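A minimal sketch of this normalization and ranking step follows; it reflects our reading of the Rupp et al. (2013) technique as a per-metric min-max normalization across realizations (an assumption), with the RMS of Equation 1 taken per realization.

```python
# Hedged sketch of the relative-error normalization and RMS ranking.
import numpy as np

def relative_errors(raw_errors):
    """raw_errors: (n_realizations, n_metrics) of raw error scores
    (lower = better). Column-wise min-max normalization so that, per
    metric, the best realization scores 0 and the worst scores 1."""
    lo = raw_errors.min(axis=0)
    hi = raw_errors.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (raw_errors - lo) / span

def rank_by_rms(raw_errors):
    """Equation 1: RMS over the n metrics; the lowest RMS ranks first."""
    re = relative_errors(raw_errors)
    rms = np.sqrt(np.mean(re**2, axis=1))
    return np.argsort(rms), rms
```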

Model Selection Results
All realizations were found to capture the global warming signal and were thus subjected to the model selection procedure. This resulted in the realization rankings presented in Table 2 and Figure 3. No GCM realization performed markedly best in every test. However, GISS-E2-H r6i1p3 performs the strongest, showing the lowest total relative error and RMS. The two realizations of GISS-E2-R (r6i1p1, r6i1p3), which differ from GISS-E2-H in their ocean model, also rank highly, in tenth and twelfth place. In the following subsections, how well each realization performed for each variable and metric is examined. Due to the large quantity of data, only the main characteristics are highlighted.

Temperature
Some distinct tendencies are observed in the model selection results for temperature. GCM realizations performed well when examining the variance fraction metric for EOF 1. For all pressure levels and seasons, the vast majority (≥95%) of the variance fractions explained by the realizations are found within the confidence interval of the respective MERRA-2 variance fraction, meaning that the 95% confidence interval of the GCM realization and that of MERRA-2 overlap. This can be observed for JJAS in Figure 4. The picture changes when the correlation between the dominant spatial patterns of variability found in each realization and the respective nondegenerate EOF from the MERRA-2 reanalysis data is examined: realizations are able to attribute the correct variance fraction explained to an EOF but encounter difficulties reconstructing the spatial pattern. This could point toward the spatial patterns being inverted in the realizations (EOF 1 occurs as EOF 2 and vice versa). An example of the spatial patterns observed can be found in Figure 5 for EOF 1 in JJAS. The spatial correlation observed between the realizations and MERRA-2 is strongest at 200 hPa and decreases toward 850 hPa (Figures 4 and 5). This could be related to the increasing complexity of the spatial patterns toward higher pressure levels.
For this metric, no realization clearly performs strongest. When the variance fraction explained and the spatial patterns are combined into a single metric, the RMS of the relative errors ranges from 0.34 to 0.66 (average: 0.47) ( Table 2). Here, the lower ranked realizations display greater RMSs than the higher ones, with the top 10 realizations having an average RMS of 0.40 and the bottom 10 having an average of 0.57.
Upon examination of the absolute mean difference metric at 500 hPa in the Kilimanjaro grid cell, all realizations perform very well, except FGOALS-g2 r1i1p1, which does not lie within the confidence interval of the observational record (see Figure 3: the only yellow pixel in the bottom line). For the mean annual cycle, the majority of realizations reproduce its shape and magnitude (RMSE ≤ 1 K). The KS test concludes that, for all realizations, the null hypothesis that the realization and observations are drawn from the same distribution cannot be rejected.

Specific Humidity
The tendencies observed for temperature are not as clear in the specific humidity evaluation. With regard to the variance fraction explained metric associated with all the EOFs, the majority of realizations are within the confidence interval of the respective MERRA-2 value. EOF 1 at 500 hPa is the exception, where 27 realizations lie outside the confidence interval. The majority of realizations (≥97%) also have difficulties capturing the dominant spatial patterns. They typically explain less than 50% of the variance of MERRA-2. When combining these two metrics into one, the RMS of the relative errors ranges from 0.32 to 0.66, for a total of 13 tests (Table 2). Unlike temperature, the higher ranked models do not have smaller RMSs than the lower ranked ones, with results being fairly similar across all realizations.
All the realizations are a good fit with regard to the mean specific humidity metric at 500 hPa over Kilimanjaro and are all found within the confidence interval of the observational record. For the mean annual cycle metric, the majority of realizations capture its shape and magnitude (average Pearson correlation coefficient: 0.8, RMSE: 0.73 g/kg). It should be noted that, unlike for temperature, FGOALS-g2 r1i1p1 is a very good fit for the mean annual cycle (Figure 3). In the KS test, GFDL-CM3 r3i1p1 performs best but, as with temperature, the test's null hypothesis cannot be rejected for any realization.

Winds
The majority of realizations (≥79%) are able to capture the variance fraction explained for U and V winds. The exception lies in EOF 1 at 850 hPa in OND, where only about half of the realizations capture the variance fraction explained. Similarly to specific humidity, no clear tendencies are observed in the correlation of the dominant spatial patterns. The majority of realizations explain less than 50% of the variance of MERRA-2 for U and V winds. However, for the U winds of EOF 1 at 850 hPa in JF, 22 realizations have a Pearson correlation greater than 0.7, with IPSL-CM5A-MR r1i1p1 explaining ∼74% of the variance of MERRA-2. As with specific humidity, the RMSs for U and V winds are fairly similar for all realizations (Table 2 and Figure 3).
The mean winds of the realizations at 500 hPa are all within the confidence interval of the observational record from the summit. However, V winds are on average underestimated. Realizations encounter difficulties reconstructing the shape of the mean annual cycle for V winds but are able to reproduce its magnitude. Only CanESM2, CSIRO-Mk3.6.0, and all IPSL realizations correctly reproduce the shape of the mean annual cycle (Figure 3). Realizations, however, perform better for the mean annual cycle of the U winds. The majority of realizations are able to reconstruct its shape, with the exception of IPSL-CM5B-LR r1i1p1 and MRI-CGCM3 r1i1p1 (Pearson correlation coefficient ≤ 0.2), and the KS test concludes that the null hypothesis can only be rejected for GFDL-ESM2G r1i1p1. On the other hand, for V winds, the null hypothesis that the observations and the realization come from the same distribution can be rejected for 15 realizations (all ACCESS realizations, BNU-ESM r1i1p1, CanESM2 r1i1p1, CCSM4 r6i1p1, CMCC-CM r1i1p1, GFDL-CM3 r1i1p1, GISS-E2-R r6i1p3, all HadGEM2-ES realizations, IPSL-CM5A-LR r2i1p1, IPSL-CM5B-LR r1i1p1, MIROC5 r1i1p1, and NorESM1-M r1i1p1).

Processes Analysis
The selection procedure cannot explicitly tell whether the realizations were able to represent the key processes of the region. This means that, solely from the statistics of the selection results, we cannot conclude that the highest ranking models were best at capturing the most important processes of East Africa. One of the processes playing a major role in the region's climate is the Indian Ocean Dipole (IOD): on interannual time scales through its influence on rainfall (Black, 2005; Clark et al., 2003), and on longer time scales by affecting climate change in the region (Marchant et al., 2007). Other processes influence East African climate but, for brevity, we focused on how well the realizations represent the IOD using the method outlined in Thielke and Mölg (2019); a brief description can be found in Appendix B. The results of the associated three tests can be found in Figure 6, where it can be observed that a larger percentage of higher ranked models are able to capture the IOD than lower ranked ones. There were 10 failed tests in the upper third of the models, with CCSM4 r6i1p1 accounting for three of these. In the middle third, there were 13 failed tests, and 18 failed tests in the bottom third. This agrees with Thielke and Mölg (2019), who found 12 realizations of GISS-E2-H to pass all tests. This points to the ability of the selection procedure to rank models that capture one of the key processes in East Africa higher than those that do not. However, these tests, and others characterizing additional processes in the region, are not incorporated into the model selection procedure in order to maintain its generic application potential. They should, thus, be used alongside the model selection procedure as an independent evaluation to ensure that GCM realizations capture the key processes of the region of interest.

Reanalysis-Based Climate of the African Domain
The majority of the tests in the model selection procedure relate to how well the GCM realizations simulate the climate of the African domain. These involve evaluating the space-time climate anomalies with respect to MERRA-2, chosen as it was found to be a suitable data set for representing important climatic features of the African domain and East Africa (Collier et al., 2018). As the top ranked GCM realization, GISS-E2-H r6i1p3, and its high ranking counterparts, GISS-E2-R r6i1p1 and r6i1p3, were produced at the same institution as MERRA-2, questions arise concerning the sensitivity of the selection procedure to the reanalysis product used. To evaluate this, ERA5 (Hersbach et al., 2020), produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), was used in place of MERRA-2 in the model selection procedure. First, the nondegenerate EOFs of ERA5 for each season and atmospheric state variable at 200, 500, and 850 hPa were calculated for 1980-2012 in the African domain (Table C1). Using the nondegenerate EOFs of ERA5 and following the procedure outlined in Section 3.2, the metrics for how well the GCM realizations simulate the climate of the African domain were calculated. These were combined with those from the 12 tests addressing how well the climate of the air layers sampled at the summit of Kilimanjaro was reconstructed. The procedure from Section 3.4 was then used to produce the model ranking found in Figure 7. It should be noted that the 12 tests evaluating the climate of the air layers sampled at the summit of Kilimanjaro are not affected by the switch from MERRA-2 to ERA5, as they involve comparison with AWS data.
The ranking obtained using MERRA-2 (Figure 3) and that using ERA5 (Figure 7) are rather similar. The majority of the top 10 and bottom 10 ranked GCM realizations coincide between the two rankings. The similarity between the rankings of the GISS realizations indicates that the choice of reanalysis product does not introduce a genealogical bias. This point is further illustrated upon closer examination of the ERA5-based ranking, which does not favor European models that use components of the ECMWF IFS and thus share an institutional lineage with ERA5.

Number of Metrics
To scrutinize the rather large number of metrics (73), testing was undertaken in which the number of metrics associated with the space-time climatic anomalies in the African domain was varied. These metrics were chosen as they represent the vast majority of the metrics utilized. Two different approaches were employed: (1) eliminating metrics associated with an EOF whose variance fraction lies below a certain threshold and (2) considering only certain EOFs. Each approach comprises a number of scenarios (Table 3). It should be noted that the scenario using a variance fraction threshold of 20% and that where only EOF 1 is considered are equivalent.
Using the technique elaborated in Section 3.4, the results obtained from each of the scenarios were used to produce a respective model ranking. To evaluate whether these rankings were better at capturing key processes of the region, the results of the IOD analysis (Section 5.1) were ordered with respect to them. This showed that as the number of metrics decreased, so did the ability of the higher ranked models to better simulate the IOD than lower ranked ones. This is clear when examining the number of failed IOD tests associated with the upper third, middle third, and bottom third of models for each scenario (Figure 8). These results highlight the importance of the lesser dominant EOFs (EOF 2-EOF 4), which explain a smaller fraction of the variance, and the value of the metrics included within the selection procedure.

Ranking Method
As elaborated in Section 3.4, the GCM realizations were ranked by calculating the RMS of their relative errors, resulting in the ranking found in Figure 3. Using this technique means that the model that performs best on average ranks highest. Such a model might perform poorly in an individual test, as the top ranked GISS-E2-H r6i1p3 does in the KS test for temperature (large D value) and the mean annual cycle test for V winds, but still ranks high as it performs best on average, that is, it has the lowest RMS. To evaluate the robustness of this ranking method, the GCM realizations were also ranked according to another method, the lowest cumulative relative error (Equation 2):

Sum = Σ_{i=1}^{n} modelRE_i,  (2)

where Sum represents the cumulative relative error of a GCM realization, n is the number of metrics (73), and modelRE_i is the relative error of the GCM realization for the respective metric. The difference between the two techniques is minimal, as the rankings remain fairly similar (Table D1), which is consistent with the findings of Rupp et al. (2013). This illustrates that, in most cases, the model with the lowest cumulative relative error is also the model with the lowest RMS.
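The cumulative ranking of Equation 2 is a one-line change from the RMS ranking; the sketch below (with illustrative relative-error values, not results from the paper) shows how the two orderings can be compared.

```python
# Hedged sketch: ranking by cumulative relative error (Equation 2)
# versus ranking by RMS (Equation 1).
import numpy as np

def rank_by_cumulative(rel_err):
    """rel_err: (n_realizations, n_metrics), each entry in [0, 1].
    Equation 2: Sum = Σ_i modelRE_i per realization; lowest Sum ranks first."""
    total = rel_err.sum(axis=1)
    return np.argsort(total), total

# Illustrative relative errors for three realizations and three metrics
re = np.array([[0.1, 0.2, 0.1],
               [0.4, 0.1, 0.5],
               [0.6, 0.7, 0.6]])
order_sum, _ = rank_by_cumulative(re)
order_rms = np.argsort(np.sqrt(np.mean(re**2, axis=1)))
print(np.array_equal(order_sum, order_rms))  # → True for this example
```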

Dynamical Downscaling
Furthermore, our concept and motivation imply that higher ranked realizations should result in a better downscaling product. To gain a first sense of this hypothesis, we undertook three dynamical downscaling simulations for East Africa: one with MERRA-2, one with a "good," and one with a "bad" boundary condition, the latter two taken from the highest ranked (GISS-E2-H r6i1p3) and lowest ranked (IPSL-CM5A-LR r4i1p1) GCM realizations, respectively. The three resultant downscaled climates can then be evaluated by comparing them to the observational data set CRU TS 4.04 (Harris et al., 2020). We specified a 10-year period and a 50-km target resolution centered over East Africa and Kilimanjaro to keep the computational expense within limits.
In this regard, output from each GCM realization and from MERRA-2 was used to drive the lateral boundaries of the Weather Research and Forecasting (WRF) model, version 4.1.2, over a domain typically used for regional atmospheric and climatic modeling of East Africa (33°N-37°S, 1°-75°E) from January 1996 to December 2005 (Figure 1b). Note that the WRF domain is slightly larger than that used for the model selection procedure, as we strictly follow the setup of the most recent WRF study of Collier et al. (2019). More details and an overview of the configuration, which was taken from published WRF studies on East Africa (Collier et al., 2018, 2019), can be found in Appendix E. Over equatorial East Africa (5°N-5°S, 30°-40°E), monthly downscaled precipitation and surface air temperature from the WRF simulations using MERRA-2 and the "good" and "bad" GCM realizations as boundary conditions were evaluated spatially and temporally with respect to the monthly precipitation and surface air temperature from the observational data set CRU TS 4.04. The downscaled climates were evaluated in two ways: (1) ability to simulate the mean annual cycle and (2) ability to capture the seasonal spatial patterns. For the mean annual cycle analysis, the spatial averages of surface air temperature and precipitation were taken, individually, for the simulations driven by the two GCM realizations and MERRA-2. The mean annual cycle was then calculated and compared in terms of its magnitude, shape, and bias with that obtained from CRU TS 4.04. The magnitude analysis involved taking the RMSE, while the shape comparison employed the Pearson correlation. For the seasonal spatial pattern evaluation, the study period was split up seasonally (JF, MAM, JJAS, and OND) and the seasonal average was taken temporally over the entire period.
The seasonal spatial patterns from the simulations driven by the realizations and MERRA-2 were correlated to those from CRU TS 4.04 using the Pearson correlation. The results of these analyses can be found in Figure 9.
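The two-part evaluation described above can be sketched in code. The function and variable names below are illustrative (not from the study), and the data loading and regridding steps are omitted:

```python
import numpy as np
from scipy.stats import pearsonr

def mean_annual_cycle(monthly, n_years):
    """Average each calendar month over all years (series of length 12 * n_years)."""
    return np.asarray(monthly, dtype=float).reshape(n_years, 12).mean(axis=0)

def evaluate_annual_cycle(sim_monthly, obs_monthly, n_years=10):
    """Compare simulated and observed mean annual cycles:
    RMSE for magnitude, Pearson r for shape, and the per-month bias."""
    sim_cyc = mean_annual_cycle(sim_monthly, n_years)
    obs_cyc = mean_annual_cycle(obs_monthly, n_years)
    rmse = float(np.sqrt(np.mean((sim_cyc - obs_cyc) ** 2)))
    r, _ = pearsonr(sim_cyc, obs_cyc)
    return rmse, r, sim_cyc - obs_cyc

def seasonal_pattern_correlation(sim_field, obs_field):
    """Pearson correlation between two seasonal-mean spatial fields (2-D lat x lon),
    flattened so each grid cell is one sample."""
    r, _ = pearsonr(np.ravel(sim_field), np.ravel(obs_field))
    return r
```

In this sketch the spatial averaging over equatorial East Africa is assumed to have already been applied to `sim_monthly` and `obs_monthly`, and the seasonal means (JF, MAM, JJAS, OND) to the 2-D fields.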
PICKLER AND MÖLG 10.1029/2020JD033033 16 of 28

Figure 7. GCM realization ranking using ERA5, presented from best to worst (left to right) as determined by the relative error of the realizations for each metric, where darker is a better fit. EOF represents the metric evaluating how well the GCM reconstructed the dominant spatial pattern and its variance fraction explained. The number at the end of these labels refers to which nondegenerate EOF is being analyzed, and the following letter and number represent the atmospheric state variable (hus, specific humidity; T, temperature; U, U winds; and V, V winds), the atmospheric pressure level (200, 500, or 850 hPa), and the season (JF, MAM, JJAS, OND).

For temperature, using the top ranked GCM, GISS-E2-H r6i1p3, and/or MERRA-2 leads to a better downscaling result than IPSL-CM5A-LR r4i1p1. This is apparent when examining the mean annual cycle, where the simulations driven by GISS-E2-H r6i1p3 and MERRA-2 better capture its magnitude and shape and, for almost all 12 months, have a lower bias with respect to the surface air temperature from CRU TS 4.04. The exception lies in May, where IPSL-CM5A-LR r4i1p1 has a slightly lower bias than MERRA-2. For the seasonal spatial patterns of temperature, GISS-E2-H r6i1p3 and MERRA-2 perform clearly better than IPSL-CM5A-LR r4i1p1 over all four seasons (Figure 9).
This also holds true for the seasonal patterns of precipitation. However, for JJAS, the three perform similarly, with IPSL-CM5A-LR r4i1p1 doing better than MERRA-2 but worse than GISS-E2-H r6i1p3. For the mean annual cycle, MERRA-2 is best at reproducing the shape, while it encounters difficulty with the magnitude. GISS-E2-H r6i1p3 is better able to capture the magnitude than IPSL-CM5A-LR r4i1p1 and MERRA-2. Examination of its mean annual cycle bias shows that GISS-E2-H r6i1p3 has a lower bias over 7 of the 12 months. It should also be noted that IPSL-CM5A-LR r4i1p1 is better able to capture the magnitude of the mean annual cycle than MERRA-2. All three WRF simulations encounter difficulties reconstructing the OND wet season, where they significantly overestimate precipitation. This could be associated with the cumulus parametrization (Kain-Fritsch) used in the WRF simulations (Otieno et al., 2020). It could also be related to a known issue CMIP5 GCMs have in reconstructing the mean annual cycle for precipitation in East Africa, whereby they underestimate the long rains and overestimate the short ones. A too weak Atlantic meridional overturning circulation in the GCMs, leading to a sea surface temperature (SST) bias over the Indian Ocean, is postulated to cause the difficulties in capturing the rainy seasons (Yang et al., 2015).
Overall, the dynamical downscaling results show that using a top ranked GCM for the boundary conditions leads to a better dynamical downscaling result over equatorial East Africa than a low-ranked GCM, a result that is most obvious for surface air temperature, where a clear distinction between the top ranked realizations and the lower ones exists in the GCM selection results (Table 2). However, the benefit of GCM boundary selection is also evident for precipitation.

Conclusions
We presented a model selection procedure to determine the suitability of GCM realizations for downscaling studies. Our results indicated that no single realization performed markedly best in all the tests, but the procedure ranked models that captured a key process in the region higher, and those models produced a better downscaling product, which is a promising prospect.
In evaluation or selection of CMIP5 models, only one realization per model is typically analyzed (e.g., McSweeney et al., 2012; Ongoma et al., 2019). However, studies of climate extremes indicate the importance of including all realizations, due to internal climate variability amplifying or obscuring extreme events (Fischer et al., 2014; Perkins & Fischer, 2013). In Figure 3, it can be observed that most realizations from the same GCM perform similarly in the model selection procedure, as expected, since the key difference between realizations is internal climate variability. At first glance, this does not appear to hold true for the GFDL-CM3 realizations, which rank 28th (r1i1p1), 7th (r3i1p1), and 19th (r5i1p1). However, upon closer examination of their RMSs (Table 2), it can be seen that they are rather similar, with the main difference being that GFDL-CM3 r3i1p1 is better able to capture (smaller RMS) the dominant spatial patterns and their variance fraction explained for U winds over the African domain.
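The ranking step behind these RMS values can be illustrated with a small sketch. The normalization below, (E − E_min)/E_min per metric, is a plausible convention chosen for illustration only; the study's exact definition of the relative error may differ, and raw error scores are assumed to be positive:

```python
import numpy as np

def relative_errors(errors):
    """Per-metric relative error (E - E_min) / E_min: zero for the best
    realization on that metric, positive otherwise. This normalization is
    an assumption for illustration; the study's definition may differ.
    errors: (n_realizations, n_metrics) array of positive raw error scores."""
    errors = np.asarray(errors, dtype=float)
    emin = errors.min(axis=0)
    return (errors - emin) / emin

def rank_by_rms(errors):
    """Rank realizations by the RMS of their relative errors across all
    metrics (rank 1 = lowest RMS = best)."""
    rms = np.sqrt(np.mean(relative_errors(errors) ** 2, axis=1))
    order = np.argsort(rms)
    ranks = np.empty(len(rms), dtype=int)
    ranks[order] = np.arange(1, len(rms) + 1)
    return rms, ranks
```

Under such a scheme, a realization like GFDL-CM3 r3i1p1 climbs in rank when its relative errors on a subset of metrics (here, the U-wind EOF tests) are distinctly smaller than those of its sibling realizations.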
The selection procedure focused on analysis of the atmospheric state variables at varying atmospheric pressure levels. These variables are not usually the focus of model selection procedures but are important, as they are used to drive the lateral boundaries of RCMs in dynamical downscaling studies, and their evaluation on at least one atmospheric pressure level is required to adequately select a GCM for this purpose (Jury et al., 2015). They also tend to provide more reliable seasonal forecasts than SSTs and teleconnections (Nicholson, 2017). In particular, the ability to reproduce the mean climatic state is also analyzed with respect to observations at a fixed atmospheric pressure level. This step is important and usually overlooked, as models have a harder time capturing the mean climatic state than the space-time anomalies (Bindoff et al., 2013). Including it provides a thorough examination of the ability of a realization to reproduce the climate of a specific region of interest. None of these analyses are computationally expensive, yet together they yield a comprehensive overview of the climate for the region of interest at different atmospheric pressure levels and time scales, and for various aspects of the climatic state.
The model selection procedure shed light on some of the strengths and weaknesses of the GCM realizations. In all metrics, the realizations performed strongest for temperature. This has also been observed in studies of the surface temperature of the region (Aloysius et al., 2016; Ongoma et al., 2018). For specific humidity, they were able to capture the mean climatic state but encountered difficulties reconstructing the dominant spatial patterns of variability. This could be related to the issues CMIP5 GCMs have in simulating precipitation over the region (Aloysius et al., 2016; Yang et al., 2015). For these two variables, the realizations encountered fewer difficulties simulating the variance fraction explained by each nondegenerate EOF than reconstructing their spatial patterns. This points to the possibility that the spatial patterns are inverted in the realizations. Realizations had the most issues when attempting to reconstruct the tendencies associated with winds, particularly V winds. Furthermore, the GCM realizations were better able to reproduce the variance fraction explained and spatial patterns for EOF 1 than for EOF 2-EOF 4. This indicates that the realizations display greater skill in reconstructing the dominant spatial pattern (EOF 1, explaining the largest variance fraction). It also points to the importance of evaluating the less dominant spatial patterns (EOF 2-EOF 4, explaining a smaller variance fraction), as they distinguish which realizations are better able to simulate the climate of the region of interest. This is further highlighted when examining the number of EOF metrics utilized. All these findings highlight areas of interest and possible improvement in climate models. Overall, the results clearly indicate that realizations perform better for some metrics than others, but it is important not to weight any of these tests more heavily, as all atmospheric state variables and climate tendencies are important to the climate of the region.
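The remark that spatial patterns may be "inverted" follows from the sign ambiguity of EOFs: an EOF and its negative explain exactly the same variance. A minimal EOF sketch (via SVD of the anomaly matrix) with a sign-insensitive pattern comparison is given below; the function names are illustrative, not the study's code:

```python
import numpy as np

def eof_analysis(field, n_modes=4):
    """EOFs of a (time, space) field via SVD of the time anomalies.
    Returns the leading spatial patterns (n_modes, space) and the
    variance fraction explained by each mode."""
    anom = field - field.mean(axis=0)            # remove the time mean
    _, s, vt = np.linalg.svd(anom, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    return vt[:n_modes], var_frac[:n_modes]

def pattern_match(p_model, p_ref):
    """Centered pattern correlation, taken as an absolute value because the
    sign of an EOF is arbitrary: a perfectly 'inverted' pattern still matches."""
    pm = p_model - p_model.mean()
    pr = p_ref - p_ref.mean()
    r = np.dot(pm, pr) / (np.linalg.norm(pm) * np.linalg.norm(pr))
    return abs(r)
```

Whether an inverted pattern should count as a match is a methodological choice; taking the absolute correlation, as sketched here, treats it as one.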
The rankings produced by the model selection procedure do not explicitly tell whether a realization was able to reconstruct the key processes of the region of interest. To evaluate this for the case study of East Africa, their ability to reconstruct the IOD was analyzed and it was found that a larger percentage of higher ranked models were better able to capture the IOD than lower ranked ones. This illustrates the ability of the selection procedure to rank realizations that can capture a key process of the region of interest higher than those that cannot. However, to maintain the generic potential of the selection procedure, this process analysis step was not incorporated within the testing scheme. It should, nevertheless, be used in conjunction as an independent indicator of model skill.
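For reference, the IOD is commonly quantified by the Dipole Mode Index (DMI) of Saji et al. (1999): the difference in box-averaged SST anomalies between a western (conventionally 50°-70°E, 10°S-10°N) and an eastern (90°-110°E, 10°S-0°) tropical Indian Ocean pole. A minimal sketch of computing such an index from already box-averaged monthly SST series follows; the pole definitions named here are the conventional ones, not necessarily the exact boxes used in this study:

```python
import numpy as np

def monthly_anomalies(sst, n_years):
    """Subtract the mean annual cycle from a monthly series of length 12 * n_years."""
    sst = np.asarray(sst, dtype=float)
    clim = sst.reshape(n_years, 12).mean(axis=0)   # 12-value climatology
    return sst - np.tile(clim, n_years)

def dipole_mode_index(sst_west, sst_east, n_years):
    """DMI = western-pole minus eastern-pole SST anomaly (after Saji et al., 1999).
    Inputs are box-averaged monthly SST series for the two poles."""
    return monthly_anomalies(sst_west, n_years) - monthly_anomalies(sst_east, n_years)
```

A realization's IOD skill can then be assessed by correlating its DMI series with one derived from observed SSTs, as the process analysis in this study does with respect to observational SST data sets.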
Furthermore, the assumption and motivation behind the selection procedure was that higher ranked realizations would produce a better downscaling product than lower ranked ones. This was confirmed by running three WRF simulations forced respectively by MERRA-2, the top ranking realization (GISS-E2-H r6i1p3), and the lowest ranked one (IPSL-CM5A-LR r4i1p1) over East Africa for the period of 1996-2005 and comparing the simulated monthly surface temperature and precipitation over equatorial East Africa to CRU TS 4.04. From the examination of the mean annual cycle and seasonal spatial patterns, one can see that the top ranked realization led to a better downscaling product than the lower ranked one, a result more evident for temperature than precipitation.
The selection procedure is, however, biased toward the group of space-time anomaly tests. This is due to the number of nondegenerate EOFs from the seasonal analysis at 200, 500, and 850 hPa: the analyses of their dominant spatial patterns of variability and their importance resulted in 61 tests, a significantly higher number than the 12 for the mean climatic state. This could possibly be compensated for by weighting, but that adds a further degree of subjectivity to the selection procedure.
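One simple form such a compensation could take, averaging within each test group before averaging across groups so that the 61 anomaly tests and the 12 mean-state tests carry equal total weight, is sketched below. This is purely illustrative and is not part of the procedure, which deliberately avoids weighting:

```python
import numpy as np

def group_balanced_score(scores_by_group):
    """Average scores within each test group first, then across groups,
    so a group with many tests (e.g., 61 EOF tests) does not dominate
    one with few (e.g., 12 mean-state tests).
    scores_by_group: dict mapping group name -> array of that group's scores."""
    return float(np.mean([np.mean(v) for v in scores_by_group.values()]))

def pooled_score(scores_by_group):
    """Unweighted pooling: every test counts equally, so large groups dominate."""
    all_scores = np.concatenate([np.asarray(v, dtype=float)
                                 for v in scores_by_group.values()])
    return float(all_scores.mean())
```

Comparing the two scores for the same set of tests makes the size of the bias explicit, at the cost of the subjectivity noted above.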
The 12 mean climatic state tests were undertaken with respect to an observational data set that spans March 2000 to September 2017, leading to a small overlap with the historical simulations. This data set was utilized as decadal AWS records at high altitude are scarce, and its unique location on the summit of Kilimanjaro allowed for the capture of the large-scale climatic properties of the tropical midtroposphere and comparison to CMIP5 output. For analysis, the historical simulations had to be extended using RCP 4.5 and/or the historicalExt simulation. However, the small overlap period means that from 2006 onward these tests evaluate how well RCP 4.5 and/or historicalExt simulated the mean climatic state, rather than how well the historical simulation did. While similar results are expected if the historical simulation were available for this period, this should be noted. Due to the bias toward the space-time anomaly tests, the relatively short overlap period should not significantly impact the model selection results.
While this methodology was presented for the case study of East Africa, it is comprehensive and easily applicable. Nevertheless, its skill has only been evaluated over East Africa and should be thoroughly examined for other regions. The evaluation of the space-time anomalies only requires selection of a suitable reanalysis product for the study area. The mean climatic state analysis presented was based on observations of the tropical midtroposphere from the summit of Kilimanjaro. This is rather case-study specific but can easily be adjusted, depending on the problem being addressed, by using observational data, radiosondes, or satellite data for the target region and considering the scale they represent. Moreover, we focused the discussion on a regional atmospheric modeling experiment to demonstrate the added value of the methodology for dynamical downscaling. However, it should also be explored in statistical downscaling, which is known to benefit from physically sound data inputs (e.g., Hewitson & Crane, 2006), as captured by the four state variables: temperature, specific humidity, and U and V winds.

Acknowledgments

We acknowledge the compute resources and support provided by the High Performance Computing Centre (HPC) at the University of Erlangen-Nürnberg's Regional Computation Center (RRZE). We are also grateful to the climate modeling groups, the World Climate Research Programme's (WCRP) Working Group on Coupled Modelling (WGCM), and the Earth System Grid Federation (ESGF) for producing, coordinating, and providing access to all the CMIP5 data used in this study. The authors thank Douglas R. Hardy for supplying the data for AWS 1, which can be found in Hardy (2011), and John C. H. Chiang for helpful discussions. Finally, editor F. Giorgi and two anonymous reviewers deserve special acknowledgment for their very thorough and constructive comments. Open access funding enabled and organized by Projekt DEAL.

10°N-15°S). Realizations with a p-value greater than 0.05 did not meet this criterion. Second, the ability to reconstruct the IOD magnitude was examined using the probability density functions for the western and eastern poles of the Indian Ocean SST anomalies. Realizations not within one standard deviation of the mean of three observational SST data sets (versions 3b and 4 of the Extended Reconstructed SST [ERSST] data set [Huang et al., 2015a, 2015b; Liu et al., 2015; Smith et al., 2008; Xue et al., 2003] and COBE-SST2 [Hirahara et al., 2019]) are deemed to fail at reconstructing the magnitude of the IOD. Finally, the representation of East African precipitation was evaluated by examining, with respect to the monthly land surface precipitation from CRU TS 4.03 (Harris et al., 2020), how the realization captures the magnitude and shape of the mean annual cycle. Realizations with a RMSE above the standard deviation of the observational data set and a Pearson correlation with a p-value greater than 0.05 failed this test.

Note. If multiple nondegenerate EOFs are present, the first value in the variance explained will correspond to the first nondegenerate EOF, the second will correspond to the second EOF, and so on.

Table D1. GCM Realization Rankings Produced Using the RMS of the Relative Errors and the Sum of the Relative Errors, RE_tot

Note. Top and bottom ranked GCM realizations are presented in bold.
