Performance Evaluation of CMIP5 and CMIP6 Models on Heatwaves in Korea and Associated Teleconnection Patterns

This study assesses the performance of the Coupled Model Intercomparison Project (CMIP) models for simulating summer heatwaves in Korea during a historical simulation period (1979–2014) using four diagnostic indices that represent the teleconnection mechanism of summer heatwaves in Korea. Four skill metrics are used for the model evaluation, namely, relative error (RE), interannual variability skill‐score (IVS), correlation coefficient (CC), and total ranking (TR) based on daily maximum temperature (TMAX) in Korea and the four diagnostic indices. The results show that the REs of CMIP5 models do not differ significantly from those of the CMIP6 models while the IVSs in the CMIP6 models are significantly improved compared with the CMIP5 models. Observations show that the heatwave circulation index (HWCI) contributes more to the interannual variability in TMAX in Korea than that of the Indian Monsoon Rainfall Index (IMRI), indicating that the teleconnection from the northwestern Pacific is more important than that from northwestern India. Interestingly, the CMIP6 models simulate this property better than the CMIP5 ensemble. The higher TR of CMIP6 models than CMIP5 supports that CMIP6 models are better overall in simulating heatwaves in Korea and the associated diagnostic indices.


Introduction
In recent years, extreme climatic phenomena, such as heatwaves, floods, and typhoons, have occurred frequently around the world, and these phenomena greatly affect human society and ecosystems (IPCC, 2012(IPCC, , 2013Sun et al., 2014). Thus, it is important to understand extreme phenomena in terms of mechanisms, and accurately predict them by global climate models to minimize the damage caused by extreme phenomena (Cherchi et al., 2014;Enomoto et al., 2009; H. J. Lee et al., 2019;Pezza et al., 2012).
An increase in anthropogenic greenhouse gases has increased heatwaves in East Asia including Korea (Im et al., 2019;D. Lee et al., 2016;Min et al., 2014;Sun et al., 2014). However, the simulated performance of heatwaves during historical experiments shows many differences between models (Guo et al., 2017;Seo et al., 2018). Therefore, it is necessary to assess whether climate models can reproduce actual heatwaves and the associated mechanisms found in the observations. There have been limited efforts to evaluate the mechanism of heatwaves in Korea in the CMIP model, especially their performances in CMIP5 and CMIP6.
In this study we focused on the evaluation of the performance of both CMIP5 and CMIP6 in terms of heatwaves in Korea, as well as the associated mechanism. Thus, we first defined appropriate diagnostic indices that determine heatwaves in Korea and associated mechanism in section 2, and then, we examined how well the CMIP5 and CMIP6 models reproduce heatwaves in Korea, as well as the associated diagnostic indices in section 3. Section 4 consists of the summary and conclusions.

Data
This study used the daily maximum temperature (TMAX) data of 59 stations from the Korea Meteorological Administration (KMA) for 36 years from 1979 to 2014. The TMAX in the observations was calculated by averaging the TMAX at 59 in situ observations in Korea for July-August (JA). High-resolution Infrared Radiation Sounder-Outgoing Longwave Radiation (HIRS-OLR) data and monthly precipitation data from the Global Precipitation Climatology Project (GPCP) were used to analyze convective activity and precipitation associated with heatwaves in Korea. The horizontal resolutions of these data are both 2.5°× 2.5°. This study also used the geopotential height of the National Centers for Environmental Prediction/National Centers for Atmospheric Research (NCEP/NCAR) reanalysis 2 (R2) data to examine the association with atmospheric circulation patterns. The horizontal resolutions of these data are 2.5°× 2.5°.
CMIP data are available through the Earth System Grid Federation (ESGF) which is one of the most complex big data systems (Williams et al., 2016). The data from the historical simulation produced by 29 CMIP5 participating models and 25 CMIP6 participating models have been used in this study, as shown in Tables 1 and 2 Tables 1 and 2. We converted CMIP model data to a common 2.5°× 2.5°grid using a bilinear interpolation scheme to facilitate comparison between the models and observations.
Notably, the observation, reanalysis, and CMIP model data presented above were analyzed on average from JA to analyze summer heat waves.

Heatwave in Korea and Associated Diagnostic Indices
In this study we defined a heatwave as when TMAX exceeds 33°C more than two consecutive days, which corresponds to the KMA criteria. The 33°C threshold is the 90th percentile of the summer TMAX in Korea . We also have used the number of heatwave days (NHD) to represent the frequency of heatwaves. The NHD is calculated by the sum of the day of the TMAXs above 33°C for each station during Note. We used ensemble r1i1p1f1 for all the models except UKESM1-0-LL for which we used the ensemble r15i1p1f2 data provided by NIMS.
JA, divided by the total number of stations. The intensity of heatwave is defined as the sum of (TMAX minus 33°C) for each station, divided by the total number of stations. The monthly mean TMAX showed a high correlation with NHD, with a correlation coefficient (CC) of 0.886 during JA, as shown in Kim et al. (2019), indicating that the NHD increases as the intensity of summer heatwaves in Korea increases. Sun et al. (2014) also showed that the CC between two time series is approximately 0.74 in Eastern China. Thus, the magnitude of TMAX can explain both the intensity and frequency of heatwaves in Korea under KMA's criteria. If such a relationship does not appear well in the model, it is difficult to say that the model has excellent performance. Although the CC is low in some CMIP models, most models show strong correlations similar to the observed ( Table 3), indicating that monthly TMAX is appropriate to investigate the heatwave in Korea. Figure 1a represents the correlation pattern between TMAX in the observation and the geopotential height at 500 hPa obtained from the reanalysis. The positive correlation pattern centering on the Korean Peninsula and the negative correlation pattern centering on the subtropics and equator show the zonal structure. Specifically, the wave pattern from the Northwest Pacific to the high latitudes through the Korean Peninsula is a typical pattern of the north-south teleconnection. Figure 1b shows the correlation pattern between TMAX in the observation and OLR. A strong positive correlation extends east-west from the Tibetan Plateau to the eastern part of Japan, centering on the Korean Peninsula, with strong negative correlations appearing in the Northwest Pacific and northwestern India. These results correspond with those of  and Kim et al. (2019). Furthermore, Figure 2 also shows that the correlation pattern with NHD, which represents the frequency of heatwaves in Korea, shows a similar pattern as that shown in Figure 1. The two figures indicate that the upward movement enhanced by the strong convection activity in the Northwest Pacific induces downward movement in East Asia through the north-south teleconnection, resulting in a high-pressure anomaly favorable for heatwaves in Korea. The convective activity near the South China Sea and the Philippine Sea (black box in Figures 1b and 2b) is specifically important in terms of the interannual variabilities in TMAX and NHD, which is consistent with the Pacific-Japan (PJ) teleconnection (Nitta, 1987) and previous studies Kosaka & Nakamura, 2006;Yeh et al., 2018;Yeo et al., 2019), indicating that the OLR over the region can be one of diagnostic indices explaining heatwaves in Korea.
Notably, another strong convective activity related to heatwaves in Korea is located over northwestern India (Figures 1b and 2b). According to recent studies, the Circumglobal Teleconnection (CGT), a representative pattern of midlatitude waves, induces atmospheric circulation patterns and affects heatwaves in Korea through Rossby wave propagation along the jet stream guide (Enomoto et al., 2003;Kim et al., 2019Kim et al., , 2020Min et al., 2020;Park & Schubert, 1997;Yeo et al., 2019). The increase in sensible heat in the Tibetan Plateau and the strengthening of convective activity in northwestern India specifically induce CGT-like patterns, leading to a high-pressure anomaly in favor of heatwaves over the Korean Peninsula (Ding & Wang, 2005;Kim et al., 2019;Lau et al., 2006;Wu et al., 2016). Thus, this result is consistent with previous studies, indicating that the OLR over northwestern India can be one of the indices explaining heatwaves in Korea.
To assess heatwaves in Korea and the associated mechanism, we defined four diagnostic indices explaining teleconnection with heatwaves in Korea (Table 4). Table 4 shows the definition of each index. When a heatwave occurs over the Korean Peninsula, a high-pressure anomaly over the Korean Peninsula and a low-pressure anomaly in the northwestern Pacific are evident in the middle and lower layers Yeo et al., 2019). Thus, we have defined the heatwave circulation index (HWCI) using the 500 hPa geopotential height difference between the Korean Peninsula and the Northwest Pacific based on Note. Correlation in observation is 0.886. See Table 4 for the domain.  Table 4. We have also defined OLRI1 and OLRI2 using OLRs in the Northwest Pacific region and northwestern India, which are closely related to heatwaves in Korea (Table 4), as shown in Figures 1b and 2b.

Model Evaluation Metrics
In this study we used some metrics to evaluate the model performance on TMAX in Korea and associated diagnostic indices for 36 years from 1979 to 2014. All indices each JA are calculated by the area average of grid points in the given domain (Table 4), and metrics are calculated using the JA mean value of the index. In situ observations in Korea and reanalysis data are used for model evaluation of TMAX and diagnostic indices, respectively.
First, relative error (RE) was used to check the mean error for each model (Equation 1). Furthermore, the interannual variability skill score (IVS) (Chen et al., 2011;Jiang et al., 2015) was used to verify that the models simulate the interannual variability in the observations well (Equation 2):  Table 4. TMAX, GPH, and OLR are obtained from the observations, reanalysis, and NOAA, respectively.
where X and STD indicate the climatological mean of X and standard deviation of X, respectively. Subscripts m and o represent the model and observation, respectively. The values of RE and IVS closer to zero indicate that the model simulates similarly to the observations.  Table 4. TMAX, GPH, and OLR are obtained from the observations, reanalysis, and NOAA, respectively. The RE and IVS can explain the similarity of the observed index to the index simulated by the model. However, they do not explain the association with heatwaves in Korea. Therefore, we used a multiple regression to find a combination of diagnostic indices that could explain the interannual variability in heatwaves in Korea using observational data. In this process, the leave-one-out cross validation was performed using the 36 years of data for the verification of the multiple regression equation. The verification results showed that the combination of diagnostic indices that best explained the variability in TMAX in Korea was the linear combination of HWCI and IMRI, which accounted for approximately 61.4% of the total variability during the verification period. Furthermore, TMAX was highly correlated with HWCI and IMRI, with CCs of 0.70 and 0.45 in the verification, respectively. Therefore, we used the CC between the diagnostic indices and TMAX together with RE and IVS as a significant metric for model evaluation. In other words, even if the RE and IVS are similar to the observations, if the CC is low, the simulation performance of the Korean Peninsula heatwave is not considered to be good.
The RE, IVS, and CC provide us with model performance in terms of climatological mean, interannual variability, and teleconnection with diagnostic indices. Therefore, summarizing the rankings of TMAX and associated indices for each metric is also useful for model evaluation.
The comprehensive rating metric (MR) of the models for each metrics was calculated by Equation 3 according to the method of Jiang et al. (2015).
where n and m indicate the number of the index and the number of models, respectively. rank i represents the ranking of the target model for index i. From Equation 3, we can obtain the MRs of the CMIP models for RE, IVS, and CC. Here the MRs give us the ranking information of the models for each metrics.
In addition, summarizing all the rankings should be useful in evaluating the CMIP models. Therefore, the total ranking (TR) metrics was defined in Equation 4 by applying Equation 3 to all ranking as follows: where M represents the number of metrics. rank (i,j) indicates the ranking of index i for metric j. Thus, TR provides us with the total ranking information of each model as only a single value regardless of index and metric. Note that MR and TR values closer to 1 indicate better model performance. The TR can be calculated for each CMIP group. However, the objective way to compare TRs of two CMIP groups is to calculate TRs using all the models regardless of the CMIP phase, which will be shown later. Figure 3a shows the RE for the diagnostic indices as shading, with the number in each box representing the RE value. The MME located at the bottom represents the multimodel ensemble. Note that positive, zero, and negative values of the RE indicate overestimation, perfect, and

Journal of Geophysical Research: Atmospheres
underestimation, respectively. The MMEs show that the difference between CMIP groups is not large regardless of TMAX and associated indices. IMRI, TMAX, and HWCI in CMIP 5 (6) are underestimated with −40.3% (−44.0%), −16.1% (−15.7%), and −7.8% (−6.3%) in MME, respectively. This result means that two teleconnections forced from the northwestern Pacific and Northwestern India in the CMIP models are weaker than the observed one. As a result, TMAX is also lower than that observed due to the impact of

10.1029/2020JD032583
Journal of Geophysical Research: Atmospheres weak teleconnections. In the case of OLR, OLRI1 simulates observations relatively well, while OLRI2 is overestimated in most models. This result means that convective activity in northwestern India is weaker than that observed, and the influence of east-west teleconnection on TMAX in Korea is weaker than that observed. The overestimation of OLRI2 is consistent with the underestimation of IMRI, which means weak convective activity. Note that while other indices are similar in the sign of RE between models, HWCI significantly differs in the sign of RE between models. The RE of HWCI in CMIP5 (6) is the least consistent between the models, with a minimum of −60.2% (−82.8%) and a maximum of 102.4% (58.8%). This means that the MME of the models simulates convective activities in the Northwest Pacific well on average, but there are significant differences between the models in the north-south teleconnection to heatwaves in Korea. In contrast, the RE of TMAX in CMIP 5 (6) is underestimated in all models compared with the observations with range from −34.4% (−30.9%) to −2.5% (−3.3%).
To summarize, the CMIP participating models simulate convective activity near the northwestern Pacific well but there is a limitation in simulating convective activity in the northwestern India. Thus, the impact of the Indian monsoon on heatwaves in Korea as a remote forcing may not work effectively in CMIP models compared with the observations. We will discuss this part in the next section.
IVS is a metrics for measuring model performance regarding interannual variability. Thus, the IVS is independent of the RE measuring mean bias. Figure 4 shows that the IVS of the CMIP6 models is smaller overall than that of the CMIP5 models, indicating that the interannual variability in TMAX and associated indices is improved in the CMIP6 models compared with the CMIP5 models. In particular, the IVS of IMRI in CMIP6 is significantly improved with ensemble mean value of 0.91 compared with 3.56 in CMIP5. The large IVS of IMRI in CMIP5 is partly due to IVS values over 10 in some models, as shown in Figure 4a.

Relationship Between TMAX in Korea and Associated Diagnostic Indices
As explained in section 2.3, we have used a multiple regression model in the observations to find a combination of diagnostic indices that best explains the interannual variability in TMAX. The results show that the HWCI and IMRI indices are major indices used to explain TMAX in Korea. Thus, the two indices were used to evaluate the simulation performance of the model ( Figure 5). In Figure 5, the horizontal axis represents the CC between TMAX and IMRI, and the vertical axis represents the CC between TMAX and HWCI. Note that the size of a circle represents the coefficient of determination in the verification and model performance in CC is determined by the distance from the observed CC in Figure 5. Therefore, the models closest to the observed point have the best performance.
In the observations, a linear combination of HWCI and IMRI accounted for 61.4% of the interannual variability in the TMAX during the verification period. In CMIP5, the CC values are quite scattered (Figure 5a). However, CMCC-CM, CMCC-CESM, and IPSL-CM5A-LR are in a similar position as the observation. In the case of the INMCM4 and CSIRO-Mk3-6-0 models, the correlation between TMAX and HWCI was slightly higher than that in the observation, indicating that the two models simulate the link between heatwaves in Korea and convective activity over the northwestern Pacific. However, there is a limitation in simulating the teleconnection between heatwaves in Korea and the Indian summer monsoon in the CSIRO-Mk3-6-0 model. Figure 7. Total ranking (TR) distributions for the CMIP5 (left) and CMIP6 models (right). Models are arranged near the TR value in the TR order on the y axis. Refer to Table 5 for the TR values and detailed rankings.

Journal of Geophysical Research: Atmospheres
In CMIP6, the scattering of the CC points is greatly reduced compared with CMIP5 ( Figure 5b). Interestingly, CCs between HWCI and TMAX in nine of the CMIP6 models are comparable with the observations. Notably, in most models, the CC of the HWCI is greater than the CC of the IMRI, indicating that heatwaves in Korea are more strongly related to teleconnection from the northwestern Pacific than from northwestern India. The best five models based on CC are GFDL-ESM 4, INM-CM5-0, EC-Earth3, INM-CM4-8, and KACE-1-0-G.

Overall Evaluation in TR
In this section we have assessed model performance using RE, IVS, and CC. For the overall evaluation, the ranking of the models for the three metrics was calculated for each index including TMAX. The overall ranking was determined using the average of the rankings and represented by the portrait diagram ( Figure 6). Note that in Figure 6, the rankings of RE, IVS, and CC for each index are placed on the left, center, and right, respectively. Note that the red line is superior to the purple line.
In CMIP5, the CMCC-CM, MIROC5, HadGEM2-ES, HadGEM2-AO, and GFDL-ESM 2G models were ranked in the top five (Figure 6a). The IPSL-CM5B-LR, GISS-E2-R, and GISS-E2-H models, however, do not perform well, at least for the Korean Peninsula. Notably, the top five models do not have good rankings for all indices and all metrics. For example, the rankings of the REs for three of the indices except for OLRI1 are lower than the IVSs and CCs in the CMCC-CM model. However, the purple line shows a lower ranking in most indices and metrics. Satisfactory model performance appears in the GFDL-ESM 2G and CanESM2 models for the HWCI, MPI-ESM-LR and MIROC5 models for OLRI1, the HadGEM2-CC and CMCC-CM models for OLRI2, and the MIROC-ESM and HadGEM2-ES models for IMRI. However, TMAX showed the highest ranking in MIROC-ESM-CHEM and MPI-ESM-MR.
In CMIP6, EC-Earth3, EC-Earth3-Veg, SAM0-UNICON, KACE-1-0-G, and UKESM-1-0-LL were ranked in the top five (Figure 6b). However, the GISS-E2-1-H, FGOALS-g3, and CanESM5 models do not show a Journal of Geophysical Research: Atmospheres satisfactory simulation performance. By analyzing the model ranking for each diagnostic index, it was found that the EC-Earth3-Veg and EC-Earth3 models simulated HWCI. The IMRI performance is satisfactory in GFDL-ESM 4 and EC-Earth3-Veg. However, TMAX ranked higher in the MIROC6 and UKESM-1-0-LL models. Interestingly, the red line appears more frequently in CMIP6 than in CMIP5, especially in the top ranking, indicating that the top models in CMIP6 show good performance regardless of indices and metrics.
Summarizing all the rankings is useful in evaluating the CMIP models. The TR represents the total ranking of the CMIP participating models. Figure 7 shows the TR distribution for the CMIP participating models. Interestingly, the TR values in CMIP6 are larger than those in CMIP5, indicating that CMIP6 models showed overall improvement in simulating heatwaves in Korea, as well as the associated diagnostic indices. The best two models are EC-Earth3 and EC-Earth3-Veg, with TR values of 0.687 and 0.687, respectively, which participate in CMIP6 (Table 5). The third ranking model is CMCC-CM with TR value of 0.627, which participates in CMIP5. The fourth and fifth models are also participating in CMIP6, indicating that four of the top five models are participating in CMIP6. Thus, the model performance in CMIP6 is superior overall to that in CMIP5. Refer to Table 5 for the detailed ranking.
As can be seen from observations (Figure 1), both OLRI1 and OLRI2 have a significant correlation with TMAX. This means that they are good diagnostic indices to represent remote forcing. In this respect, all diagnostic indices including OLR1 and OLR2 were included when calculating TR. In consideration of the readers' interests, we have tested the sensitivity of TR results to the exclusion of the two OLR indices, i.e., using only HWCI, IMRI, and TMAX (Table 6). There are some noticeable differences in TR of CMIP5 models, but CMIP6 model ranks remain largely unaffected. This supports that the TR obtained from four diagnostic indices reliably represent the model's overall performance in CMIP6 models. TR distributions in CMIP5 models are more sensitive to OLR than in CMIP6 models. The best two models are EC-Earth3 and  Figure 8 shows the correlation pattern of TMAX in Korea with (a) GPH at 500 hPa and (b) OLR during JA for 36 years from 1979 to 2014. The patterns are obtained from the ensemble mean of the best six models (BMME6) in CMIP6. The correlation pattern with GPH at 500 hPa is stronger in the positive direction than in the observation, which is mainly due to the high positive correlation across a wide area in the east-west direction around the Korean Peninsula and zonal direction (Figure 8a). The negative correlation over the subtropics is weaker than the observation. However, the relative pattern in BMME6 is similar to the observations. The correlation pattern with OLR is similar overall to that of the observations, but the negative correlation over subtropics is much weaker than that for the observations, especially over the northwestern Pacific (Figure 8b). In the observations, a remote forcing from the northwestern Pacific on the heatwaves in Korea is more important than that in northwestern India. The CMIP6 results show that the mechanism found in the observations works in the northwestern region of India rather than the northwestern Pacific, resulting in the underestimation of TMAX by the weak remote forcing over the northwestern Pacific.

Summary and Conclusions
In this study we evaluated the performance of the CMIP participating models for heatwaves in Korea during a historical simulation (1979-2014) using four diagnostic indices that explain heatwaves in Korea. Two  Table 4. The patterns are obtained from the ensemble mean of the best six models in CMIP6.
indices, i.e., HWCI and OLRI1, represent the north-south teleconnection, while other indices, i.e., OLRI2 and IMRI, represent the east-west teleconnection. Four metrics of RE, IVS, CC, and TR were used for model evaluations of each index. The main results are summarized as follows: 1. By analyzing the REs of CMIP participating models, TMAX and IMRI tended to underestimate by 15-20% and 40-50%, respectively, and OLRI2 overestimated by 7-9% ( Figure 3). This result means that the simulated convection over northwestern India is not enough to provide proper teleconnection effects for heatwaves in Korea. In addition, there is no clear difference in REs between CMIP5 and CMIP6. 2. By analyzing the IVSs of the CMIP models, the CMIP6 models showed significantly improved IVS than the CMIP5 model ( Figure 4). Especially in IMRI, the IVS is decreased by 74% in CMIP6 compared with CMIP5. 3. In the observations the interannual variability in TMAX in Korea can be explained best by the linear combination of HWCI and IMRI ( Figure 5). Notably, HWCI mostly contributed to the interannual variability in TMAX, greater than that of IMRI, indicating that remote forcing from the northwestern Pacific is more important in modulating heatwaves in Korea than northwestern India. The CMIP6 models simulate this property better than CMIP5. In CMIP5, the CMCC-CM, CMCC-CESM, and IPSL-CM5A-LR models showed the best performances. In CMIP6, the best five models based on CCs are GFDL-ESM 4, INM-CM5-0, EC-Earth3, INM-CM4-8, and KACE-1-0-G. 4. To summarize, CMIP6 is better than CMIP5 in terms of TR, indicating that CMIP6 models have been improved overall in simulating heatwaves in Korea, as well as the associated diagnostic indices. The best five models are EC-Earth3, EC-Earth3-Veg, SAM0-UNICON, KACE-1-0-G, and GFDL-ESM 4 which participate in CMIP6 (Table 6). Thus, the model performance in CMIP6 is overall superior to that in CMIP5.
This study quantitatively evaluated the simulation performance of the CMIP model using four diagnostic indices to explain heatwaves in Korea at the interannual time scale. The long-term change and variability in heatwaves in Korea could be affected by other factors than those suggested in this study (Min et al., 2020). Further studies on the mechanism of heatwaves will require further development of indices, and further collection of CMIP6 participation models will lead to more comprehensive performance assessments. Notably, significant scientific progress has been made by CMIP, but there is still room for improvement in the GCMs on local, regional, and global scales, especially concerning extreme events.