CEDAR Electrodynamics Thermosphere Ionosphere (ETI) Challenge for systematic assessment of ionosphere/thermosphere models: Electron density, neutral density, NmF2, and hmF2 using space based observations
Corresponding author: J. S. Shim, Goddard Planetary Heliophysics Institute, University of Maryland Baltimore County, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA. (email@example.com)
 In an effort to quantitatively assess the current capabilities of Ionosphere/Thermosphere (IT) models, an IT model validation study using metrics was performed. This study is a main part of the CEDAR Electrodynamics Thermosphere Ionosphere (ETI) Challenge, which was initiated at the CEDAR workshop in 2009 to better comprehend strengths and weaknesses of models in predicting the IT system, and to trace improvements in ionospheric/thermospheric specification and forecast. For the challenge, two strong geomagnetic storms, four moderate storms, and three quiet time intervals were selected. For the selected events, we obtained four scores (i.e., RMS error, prediction efficiency, ratio of the maximum change in amplitudes, and ratio of the maximum amplitudes) to compare the performance of models in reproducing the selected physical parameters such as vertical drifts, electron and neutral densities, NmF2, and hmF2. In this paper, we present the results from comparing modeled values against space-based measurements including NmF2 and hmF2 from the CHAMP and COSMIC satellites, and electron and neutral densities at the CHAMP satellite locations. It is found that the accuracy of models varies with the metrics used, latitude and geomagnetic activity level.
 There have been noticeable developments of many ionosphere/thermosphere (IT) models over the last 30 years [Schunk et al., 2002; American Institute of Aeronautics and Astronautics, 2010] that deepen our understanding of the ionosphere/thermosphere (IT) system. All of the models, however, have errors associated with their predictions of climate and weather of the ionosphere and thermosphere. Therefore, it is important to assess the IT models quantitatively in order to not only understand the strong and weak features in their prediction capabilities but also make improvements accordingly.
 The CEDAR (Coupling, Energetics, and Dynamics of Atmospheric Regions) community initiated the Electrodynamics Thermosphere Ionosphere (ETI) Challenge in 2009 to assess accuracy of a variety of IT models in predicting ionospheric-thermospheric parameters against measurements.
 The results of the CEDAR ETI Challenge using ground-based observations, such as vertical drift at Jicamarca and NmF2 and hmF2 from incoherent scatter radars (ISRs), were presented by Shim et al. [2011]. Simulations from up to 10 models were compared with the measurements during nine time intervals (two strong and four moderate geomagnetic storm events, and three quiet periods). That validation study was the first to quantitatively evaluate a wide variety of IT models, ranging from empirical to physics-based, coupled IT, and data assimilation models. However, it focused only on ionospheric parameters and was limited in latitude coverage. In this work, we present the Challenge results obtained using space-based observations: the ionospheric parameters NmF2 and hmF2 derived from radio occultation measurements by Low Earth Orbit (LEO) satellites (CHAMP and COSMIC), electron density at the CHAMP (CHAllenging Minisatellite Payload) locations, and one thermospheric parameter, neutral density at the CHAMP locations. The high-inclination orbits of CHAMP and COSMIC give greater latitude coverage for the measurements. We therefore investigated the dependence of model performance on latitude by calculating the four skill scores for three geographic latitude regions (low, middle, and high) during the nine selected time intervals.
 The CEDAR ETI Challenge is supported by the Community Coordinated Modeling Center (CCMC), which develops and provides tools used for a large number of model validation studies. All measurements and model simulation results used are available on the CCMC website for use by the space science communities.
2. Setup of the Challenge
 We chose nine events, categorized into three levels of geomagnetic activity, to investigate the effect of geomagnetic activity on IT model performance. Three of the time intervals were selected from the GEM (Geospace Environment Modeling) Challenge [Pulkkinen et al., 2010, 2011; Rastätter et al., 2011] (see Table 1). The three geomagnetic activity levels, strong storms (Kp_max ≥ 7), moderate storms (4 ≤ Kp_max < 7), and quiet periods (Kp_max < 4), were defined by the Kp index. The three GEM events comprise one moderate storm (E.2001.243) and two strong storms (E.2005.243 and E.2006.348). All events except the moderate 2001 GEM event occurred under low solar flux conditions (F10.7 < 100). Kp values for selected events (the GEM events, one quiet period, and one moderate storm) are shown in Figure 1.
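The Kp thresholds above map directly onto a small classifier. The following is a minimal sketch; the function name `classify_event` and the label strings are illustrative choices and are not part of the Challenge tooling.

```python
def classify_event(kp_max: float) -> str:
    """Bin an event by its maximum Kp, using the thresholds quoted in the text."""
    if kp_max >= 7:
        return "strong storm"    # Kp_max >= 7
    if kp_max >= 4:
        return "moderate storm"  # 4 <= Kp_max < 7
    return "quiet"               # Kp_max < 4
```

For example, `classify_event(6.7)` falls in the moderate bin, while a value of exactly 7 is counted as a strong storm.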
Table 1. Events Studied in the CEDAR ETI Challenge
Date (DOY) and Time
2006/12/14 (doy 348) 12:00 UT-12/16 (doy 350) 00:00 UT
2001/08/31 (doy 243) 00:00 UT-09/01 (doy 244) 00:00 UT
2005/08/31 (doy 243) 10:00 UT-09/01 (doy 244) 12:00 UT
2007/04/01 (doy 091) 00:00 UT-04/02 (doy 092) 12:00 UT
2007/05/22 (doy 142) 12:00 UT-05/25 (doy 145) 00:00 UT
2008/02/28 (doy 059) 12:00 UT-03/01 (doy 061) 12:00 UT
2007/03/20 (doy 079) 00:00 UT-03/22 (doy 081) 00:00 UT
2007/07/09 (doy 190) 00:00 UT-07/10 (doy 191) 00:00 UT
2007/12/07 (doy 341) 00:00 UT-12/09 (doy 343) 00:00 UT
 We compared modeled values with observed values for (1) neutral and (2) electron densities at the CHAMP locations, and (3) NmF2 and (4) hmF2 derived from radio occultation measurements from the CHAMP and COSMIC (Constellation Observing System for Meteorology, Ionosphere, and Climate) satellites during each event.
3. Submissions of Model Simulations
 For the comparison of neutral density at the CHAMP orbits, eight submissions of model simulations from five models (JB2008, NRLMSISE-00, CTIPe, GITM and TIE-GCM) were used. Nine submissions from eight models (IRI, SAMI3, USU-IFM, CTIPe, GITM, TIE-GCM, JPL-GAIM and USU-GAIM) were compared with NmF2, hmF2 and electron density measurements. The model outputs used for the study were either provided by modelers (via the submission interface at the CCMC website developed for a number of model validation studies) or simulated by the IT models hosted at the CCMC [Webb et al., 2009]. A unique model identifier was used to distinguish multiple submissions from the same model that were driven by different boundary conditions and/or inputs (e.g., 1_JB2008 and 2_JB2008) (see Table 2). In Table 2, model results generated by the CCMC are identified by a superscript “a.”
Table 2. Models Submitted for the CEDAR Challenge
Model Setting ID
Grid (lat × lon × alt)
The model results marked with a superscript “a” were submitted by the CCMC using models hosted at the CCMC. Different model setups are assigned different model setting identification numbers.
USU-GAIM23 with GPS TEC observations from up to 400 ground stations (−60° < lat < 60°)
44 × 24 × 83 (90 km < alt < 1,400 km)
 We briefly described the models and their submissions in Shim et al. [2011] (see also the references therein), except for the two empirical thermosphere models (JB2008 and NRLMSISE-00). In the following sections, we provide short descriptions of these two models and their three submissions.
3.1. JB2008 (1_JB2008 and 2_JB2008)
 Jacchia-Bowman 2008 (JB2008) is an empirical thermospheric density model that is developed as an improved revision to the Jacchia-Bowman 2006 model based on Jacchia's diffusion equations [Jacchia, 1965, 1971; Bowman et al., 2008a, 2008b, 2008c]. New exospheric temperature and semiannual density equations are employed to represent the major thermospheric density variations. In addition to the F10.7 index for radio flux, JB2008 and its predecessor use EUV flux from the SOHO satellite, and solar middle ultraviolet (MUV) flux from NOAA satellites. Daily and 81-day centered averages of these solar indices are used in a formula to derive the “global nighttime minimum exospheric temperature,” from which the neutral densities are calculated from the date, time, location, and altitude. JB2008 uses an additional correction to the exospheric temperature, due to geomagnetic activity, that is derived from the Dst index and based on results by Burke . Another means of calculating the correction to the average exospheric temperature was shown by Weimer et al. . This alternative method uses the total Poynting fluxes into the polar regions, calculated from empirical models that use the solar wind and IMF measured by the ACE satellite for input values [Weimer, 2005].
 Two submissions, 1_JB2008 and 2_JB2008, for the neutral density at the CHAMP orbits using JB2008 were used (see Table 2). 1_JB2008 is JB2008 run with corrections to global nighttime minimum exospheric temperature due to auroral heating computed from the Dst index, while 2_JB2008 is JB2008 run with the temperature corrections derived from the total Poynting fluxes [Weimer, 2005; Weimer et al., 2011].
3.2. NRLMSISE-00 (1_MSIS)
 NRLMSISE-00 (Naval Research Laboratory Mass Spectrometer and Incoherent Scatter Extended) is an empirical model based on the earlier models MSIS-86 [Hedin, 1987] and MSISE-90 [Hedin, 1991]. NRLMSISE-00 and the associated NRLMSIS database include composition and temperature measurements by satellite, rocket, and incoherent scatter radar. They also include total mass density from satellite accelerometers and from orbit determination. The model covers the altitude range from the ground to the exobase (<1400 km), and provides altitude profiles of temperature, number densities of species (He, O, N2, O2, Ar, H, and N), total mass density, and the number density of a high-altitude anomalous oxygen component of total mass density [Picone et al., 2002]. NRLMSISE-00 accounts for the main drivers of the upper atmosphere: the solar EUV flux and geomagnetic heating. The 10.7-cm solar radio flux (F10.7) is the standard proxy for the solar EUV, while the daily Ap and the 3-hourly ap geomagnetic indices measure the geomagnetic component of space weather.
 We used the total neutral mass densities inferred from accelerometer measurements on the CHAMP satellite, which are available at http://sisko.colorado.edu/sutton/data.html [Sutton et al., 2005]. CHAMP orbited the Earth at an inclination of 87.3° and took measurements for ten years after its launch on 15 July 2000 [Reigber et al., 2002]. Owing to the high inclination, CHAMP measurements cover almost all latitudes (see bottom plots in Figures 2a and 2b), while all local times are sampled roughly once every four months. The neutral densities are 3-degree latitude averages with a cadence of about 45 s. The absolute uncertainty of the CHAMP neutral densities derived from drag-based measurements is 10–15% [Bruinsma et al., 2004] and includes the combined effects of several error sources, such as accelerometer noise, accelerometer calibration, the drag coefficient, the solar radiation pressure model, and neutral winds. The average error in the neutral density for the nine selected events ranges from about 1.5 × 10⁻¹³ to 3 × 10⁻¹³ kg/m³, which corresponds to about 6 to 14%, at the altitudes of about 340–390 km sampled in most of this study. For the study, 1-min averages of the observed electron and neutral densities were compared with modeled values at 1-min intervals.
 The in situ electron densities from the PLP (Planar Langmuir Probe) onboard the CHAMP satellite were used as ground truth. The PLP takes measurements of electron density at the satellite position every 15 s. The accuracy of the PLP measurements is within 10% [Liu et al., 2007]. The CHAMP PLP data were provided by the Information System and Data Center (ISDC, http://isdc.gfz-potsdam.de/).
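The 1-min averaging applied to the CHAMP densities (sampled about every 45 s for neutral density and every 15 s for the PLP) can be sketched as follows. This is an illustration only: the text does not say whether the averages were taken in fixed minute bins or as running means, so fixed bins are an assumption here, and the function name `one_minute_averages` is hypothetical.

```python
from collections import defaultdict

def one_minute_averages(times_s, values):
    """Average samples into fixed 1-min bins.

    times_s: sample times in seconds from the start of the interval.
    Returns {minute_index: mean value}; fixed bins are an assumption.
    """
    bins = defaultdict(list)
    for t, v in zip(times_s, values):
        bins[int(t // 60)].append(v)
    return {minute: sum(vs) / len(vs) for minute, vs in sorted(bins.items())}
```

With a 15 s cadence each bin holds four samples; with a 45 s cadence a bin holds one or two, so the averaged series is only slightly smoother than the raw one.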
 We also used NmF2 and hmF2 obtained from the electron density profiles (EDPs) retrieved from the CHAMP and COSMIC GPS radio occultation (RO) measurements provided by the University Corporation for Atmospheric Research (UCAR) COSMIC Data Analysis and Archival Center (CDAAC) (http://cosmic-io.cosmic.ucar.edu/cdaac) [Schreiner et al., 2002]. The CDAAC radio occultation EDPs retrieved by Abel inversion carry uncertainty due to assumptions and approximations used in the inversion method [Schreiner et al., 1999; Lei et al., 2007; Wu et al., 2009]. NmF2 and hmF2 obtained from RO measurements have been found to agree with ground-based measurements to within about 10–30% [Hajj and Romans, 1998; Schreiner et al., 1999; Chu et al., 2010]. The Abel retrieval method has also been found to perform well in general, except at lower altitudes such as the E and F1 layers [Yue et al., 2010, 2011]. For the E.2001.243 event, no data were available, and for the E.2005.243 event, only CHAMP measurements were used, since the COSMIC constellation of six satellites was launched on 14 April 2006 [Anthes et al., 2008]. The hmF2 values obtained from the CHAMP/COSMIC RO measurements at tangent points, which lie on the line of sight between the LEO and GPS satellites, range between about 200 and 350 km in most of the selected events, except during the main phase of the E.2006.348 storm (not shown here). Therefore, the electron densities at the CHAMP height (340–390 km), which is above the CHAMP/COSMIC hmF2, are smaller than the NmF2 RO measurements in most cases. There is another group of hmF2 values near 100 km (not shown here). For the study, hmF2 values less than 140 km, which correspond to peak heights in the E region, were excluded. Likewise, the modeled hmF2 values were obtained from the modeled EDPs by interpolation, and only those larger than 140 km were used.
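The 140 km screening of hmF2 values can be illustrated with a short filter. This is a sketch under stated assumptions: the text gives the cutoff for observed and modeled values separately, and dropping a whole observation-model pair when either value fails is an assumption, as is the name `filter_f2_peaks`.

```python
E_REGION_CUTOFF_KM = 140.0  # cutoff quoted in the text

def filter_f2_peaks(pairs, cutoff_km=E_REGION_CUTOFF_KM):
    """Keep (observed, modeled) hmF2 pairs that pass the E-region screening.

    Observed values below the cutoff are excluded; modeled values must be
    larger than the cutoff. Pairwise rejection is an assumption.
    """
    return [(obs, mod) for obs, mod in pairs if obs >= cutoff_km and mod > cutoff_km]
```

A pair such as `(105.0, 240.0)` is rejected because the observed peak lies in the E region, even though the modeled value is a plausible F2 peak height.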
 We used four metrics to quantify the model accuracy. Here, a “metric” is a function that gives one real number (a skill score) for one set of modeled and observed data.
5.1. Root-Mean Square (RMS) Difference
 For quantitative model assessment, the root-mean-square (RMS) difference is widely used to measure the difference between observed and modeled values. It is defined as
$$\mathrm{RMS} = \sqrt{\frac{\sum (x_{obs} - x_{mod})^2}{N}}$$
where xobs and xmod are values obtained from observation and model prediction, respectively, and N is the number of observation-model pairs. An RMS error of 0 indicates perfect agreement of modeled values with observations; hence, the RMS error approaches 0 as the model prediction improves. Note that the unit of the RMS error is the same as the unit of the observed and modeled values.
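As an illustration, the RMS difference can be computed as below. This sketch assumes the standard definition, the square root of the mean squared difference between paired observed and modeled values; the function name `rms_difference` is hypothetical.

```python
import math

def rms_difference(obs, mod):
    """Root-mean-square difference between paired observed and modeled values."""
    if len(obs) != len(mod) or len(obs) == 0:
        raise ValueError("obs and mod must be non-empty and of equal length")
    squared = sum((o - m) ** 2 for o, m in zip(obs, mod))
    return math.sqrt(squared / len(obs))
```

The result carries the units of the inputs, so scores for neutral density (kg/m³) and electron density (/cm³) cannot be compared with each other directly.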
5.2. Prediction Efficiency (PE)
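 Prediction efficiency against the mean observed value is also used for assessment of models:
$$\mathrm{PE} = 1 - \frac{\mathrm{RMS}_{mod}}{\mathrm{RMS}_{ref}} = 1 - \sqrt{\frac{\sum (x_{obs} - x_{mod})^2}{\sum (x_{obs} - \langle x_{obs} \rangle)^2}}$$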
where xobs and xmod are again values from observation and model prediction, respectively, and 〈xobs〉 is the mean value of the observed data. In this study, we used the observed mean value 〈xobs〉 as a reference model rather than any empirical model. PE can vary from negative infinity to 1 (PE = 1 means perfect prediction). A value of 0 means that the model accuracy is comparable to the variation of the measurements about their mean in an aggregate sense. Negative values indicate that model error in an RMS sense is larger than the variation of the observations about their mean and imply that the observed mean is a better predictor of the observations than the model. Local time dependence of the selected physical parameters (electron and neutral densities, NmF2 and hmF2) was considered by calculating PE for the daytime (06:00–18:00 LT) and the nighttime (18:00–06:00 LT) separately using daytime and nighttime mean values of observed data. Physical conditions corresponding to less IT variability (e.g., quiet time) require corresponding increases in model accuracy to achieve comparable skill scores to more variable times (e.g., storms).
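The PE calculation with the day/night split can be sketched as follows. Two assumptions are made and should be checked against the original definition: PE is taken as one minus the ratio of the model RMS error to the RMS deviation of the observations about their mean, and the ranking score is the simple average of the daytime and nighttime PE. The function names are hypothetical.

```python
import math

def prediction_efficiency(obs, mod):
    """PE = 1 - RMS(model error) / RMS(observations about their mean) (assumed form)."""
    mean_obs = sum(obs) / len(obs)
    err = sum((o - m) ** 2 for o, m in zip(obs, mod))
    ref = sum((o - mean_obs) ** 2 for o in obs)
    if ref == 0.0:
        raise ValueError("observations have no variance; PE is undefined")
    return 1.0 - math.sqrt(err / ref)

def day_night_pe(obs, mod, local_time_h):
    """Return (daytime PE, nighttime PE, their average).

    Daytime is 06:00-18:00 LT; each half uses its own observed mean.
    Both halves are assumed to contain data.
    """
    rows = list(zip(obs, mod, local_time_h))
    day = [(o, m) for o, m, lt in rows if 6.0 <= lt < 18.0]
    night = [(o, m) for o, m, lt in rows if not 6.0 <= lt < 18.0]
    pe_day = prediction_efficiency(*zip(*day))
    pe_night = prediction_efficiency(*zip(*night))
    return pe_day, pe_night, 0.5 * (pe_day + pe_night)
```

A model that always returns the observed mean scores PE = 0 under this definition, which matches the interpretation of 0 given in the text.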
5.3. Ratios of the Maximum Change in Amplitudes and Maximum Amplitudes
 We also used ratio-based metrics to measure the capability of models to predict maximum amplitudes or short-term temporal changes during a certain time interval, even when the accuracy of the models is low with respect to the RMS error and/or PE, which measure how well modeled values agree with observed values. Two ratios were considered: the ratio of the maximum change (max − min) and the ratio of the maximum (max) of modeled values to observed values:
$$\mathrm{ratio(max - min)} = \frac{(x_{mod})_{max} - (x_{mod})_{min}}{(x_{obs})_{max} - (x_{obs})_{min}}, \qquad \mathrm{ratio(max)} = \frac{(x_{mod})_{max}}{(x_{obs})_{max}}$$
where (xobs)max and (xmod)max are the observed and modeled maximum amplitudes for a certain time window. A ratio of 1 indicates perfect model prediction, while the ratio(max − min) and the ratio(max) greater than 1 suggest overestimation of maximum variations and maximum values by models.
 In Shim et al. [2011], it was shown that selecting an appropriate time window length is crucial in calculating the two ratios, which depend on the time window length. In that study, a suitable time window length was found to be 1 h, compared to 4 and 7 h, for calculating the ratios of the vertical drifts at Jicamarca. The ratio(max − min) of the model that performed best in predicting the vertical drift variability moves away from 1 as the time window length increases from 1 to 7 h, while the opposite holds for the models that performed worse.
 For this study, we selected a 90-min time window length that is close to the period of the CHAMP (about 94 min) and COSMIC (about 100 min) satellites. Due to the daytime and nighttime alternation during the 90 min, the ratio(max − min) represents the ratio of the diurnal variation (difference between daytime maxima and nighttime minima).
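The two ratios over consecutive 90-min windows can be sketched as below. The text does not state how the windows are aligned or how per-window ratios are combined into one score, so non-overlapping windows starting at the first sample are an assumption, and the function returns the per-window values without aggregating them. The name `window_ratios` is hypothetical.

```python
def window_ratios(times_min, obs, mod, window_min=90.0):
    """Per-window (ratio(max - min), ratio(max)) for paired time series.

    times_min: sample times in minutes. Windows are non-overlapping and start
    at the first sample (an assumption). Windows where the observed range or
    the observed maximum is zero are skipped.
    """
    if not times_min:
        return []
    t0 = times_min[0]
    windows = {}
    for t, o, m in zip(times_min, obs, mod):
        windows.setdefault(int((t - t0) // window_min), []).append((o, m))
    results = []
    for index in sorted(windows):
        o_vals = [o for o, _ in windows[index]]
        m_vals = [m for _, m in windows[index]]
        obs_range = max(o_vals) - min(o_vals)
        if obs_range == 0 or max(o_vals) == 0:
            continue
        results.append(((max(m_vals) - min(m_vals)) / obs_range,
                        max(m_vals) / max(o_vals)))
    return results
```

Values above 1 in either position indicate that the model overestimates the variation or the peak within that window, and values below 1 indicate underestimation.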
6.1. Neutral Density
Figure 2 displays the observed and modeled neutral mass densities along the CHAMP track during the first eight hours for the E.2006.348 strong storm event (Figure 2a) and for the E.2007.079 quiet time event (Figure 2b). Top plots in Figures 2a and 2b use black and colored curves to show observation data and modeled values. In the bottom plots, the CHAMP orbit track is shown as a function of local time (dashed lines) and latitude (solid lines).
 In Figure 2, during the E.2007.079 quiet time event and before storm onset for the E.2006.348 event, most models produce diurnal and latitudinal variations of neutral mass density similar to those observed; however, differences in model performance are clearly seen. For the E.2007.079 event, the two JB2008 empirical model results, 1_JB2008 and 2_JB2008, are almost identical and agree with the measurements better than the results from the other models, which tend to underestimate (e.g., CTIPe) or overestimate (e.g., NRLMSISE-00, GITM, and TIE-GCM) the neutral density at the CHAMP orbit. During the E.2006.348 event, none of the models succeeds in reproducing the observed abrupt increases in neutral density in the morning sector at high southern latitudes, although both GITM submissions show some increase. However, only limited qualitative conclusions can be drawn from Figure 2.
 The model performances were quantified by using metrics to make explicit comparisons. Figure 3 shows the ranking of eight model simulations using four different metrics for neutral density along the CHAMP track; RMS error, PE, ratio(max − min), and ratio(max) (from top to bottom). To find out model performance dependency on latitude, the skill scores using the four metrics were calculated for three latitude regions, which are low (|lat| < 25°), middle (25° < |lat| < 50°), and high (|lat| > 50°) geographic latitudes (from left to right in Figure 3). In Figure 3, squares, circles, and triangles indicate the average values for strong storms, moderate storms, and quiet periods, respectively. Ranking of the model performance is based on the multievent average (denoted by crosses) of the three geomagnetic activity levels. The model that performs best is placed in the extreme left. In the bottom two panels, the models nearer the thin black horizontal line (ratio = 1) perform better than the others located farther above (below) the line that overestimate (underestimate) the maximum changes and/or maximum values.
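The latitude binning used for the skill scores can be sketched as follows. The text leaves the boundaries at exactly 25° and 50° unassigned; placing them in the higher-latitude bin is an assumption here. The helper names are hypothetical, and `metric` stands for any score function that takes paired observed and modeled sequences.

```python
def latitude_region(lat_deg):
    """Map geographic latitude to 'low', 'middle', or 'high'."""
    abs_lat = abs(lat_deg)
    if abs_lat < 25.0:
        return "low"
    if abs_lat < 50.0:
        return "middle"  # exactly 25 deg lands here (assumption)
    return "high"        # exactly 50 deg lands here (assumption)

def scores_by_region(lats, obs, mod, metric):
    """Apply metric(obs, mod) separately to the samples in each latitude region."""
    groups = {"low": ([], []), "middle": ([], []), "high": ([], [])}
    for lat, o, m in zip(lats, obs, mod):
        region = latitude_region(lat)
        groups[region][0].append(o)
        groups[region][1].append(m)
    return {region: metric(o, m) for region, (o, m) in groups.items() if o}
```

Regions with no samples are omitted from the result rather than given a placeholder score.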
 The RMS differences for the neutral densities tend to grow with increasing geomagnetic activity. All models show their largest RMS errors during the strong storms. Highly ranked models, including 1_JB2008, 2_JB2008, 1_MSIS, and the TIE-GCMs (1_TIE-GCM and 2_TIE-GCM), show similar differences in RMS error between strong storms and quiet periods, about 1 × 10⁻¹² to 2 × 10⁻¹² kg/m³. However, relatively low ranked models, such as 1_CTIPE and the GITMs (1_GITM and 3_GITM), show larger differences of up to about 6 × 10⁻¹² kg/m³. PE also depends on geomagnetic activity: most models tend to have better PE during storms than during quiet times at all latitudes.
 The two submissions, 1_JB2008 and 2_JB2008, rank at the top and are followed by another empirical model 1_MSIS. The model rankings based on the RMS error and PE are similar but not the same (see first and second rows in Figure 3) due to the fact that the ranking in terms of PE is obtained by using the average of daytime and nighttime PE, while the RMS error does not depend on local time. In addition, PE is normalized by the standard deviation of the observations, whereas the RMS error is not normalized.
 Compared to the three empirical model results, the TIE-GCMs perform somewhat worse overall, although their RMS errors are comparable. 1_CTIPE produces worse RMS differences and PE than the TIE-GCMs for the storm events, although both scores of 1_CTIPE are better than those of the TIE-GCMs for the quiet events. The GITMs perform worse than the other models, especially during the strong storm events: 3_GITM shows the largest RMS errors and negative PE during the strong storms. However, for moderate storms, the GITMs perform comparably to the TIE-GCMs and better than 1_CTIPE.
 1_MSIS shows the best ratio(max − min) in low and middle latitudes for all geomagnetic levels. 2_JB2008 and 1_JB2008 show the best ratio(max) in low and middle/high latitudes, respectively, although 1_TIE-GCM has an average ratio(max) closest to 1 because its overestimation during quiet periods offsets its underestimation during storm events. For the ratios, therefore, we focus in this paper on the model ranking for each geomagnetic condition rather than on the ranking based on the average over all three geomagnetic conditions.
 The two JB2008s, for the quiet periods in the three latitude regions, produce the ratio(max) close to 1 and the ratio(max − min) smaller than 1. This indicates that the JB2008s predict well daytime maximum neutral density, but overestimate nighttime minimum values. The two GITMs, for moderate storms in all latitudes, show the same features. In terms of ratios, during the strong storms, among the physics-based coupled IT models, TIE-GCMs, which tend to underestimate ratios, are better than 1_CTIPE and GITMs, which overestimate ratios. However, during the moderate storm events, the coupled models produce similar ratios (less than 1).
6.2. Electron Density
Figure 4 displays the observed and modeled electron densities at the CHAMP orbit during the first five hours for the E.2005.243 strong storm event (Figure 4a) and for the E.2007.079 quiet time event (Figure 4b). In the top plots in Figures 4a and 4b, the black and colored curves represent values from observations and models, respectively. The bottom plots in Figures 4a and 4b show the CHAMP orbit track as a function of local time (dashed lines) and latitude (solid lines). During the E.2007.079 quiet time event, 1_GITM tends to overestimate electron densities for almost all latitudes and all local times, while most of the other models show a similar tendency to overestimate with a small peak at low latitudes in the morning sector (Figure 4b). Some models produce the ionospheric equatorial anomaly better than the others (not shown clearly here). In higher latitudes, the observed values show better agreements with the results of the empirical model 1_IRI and physics-based coupled model such as 1_CTIPE and 2_TIE-GCM than the results of the physics-based ionosphere models and the data assimilation models, which have limited latitude coverage. It is also shown that, in middle latitude regions, most models produce relatively well the electron densities during the quiet time, while the simulation results of most models noticeably differ from the measurements during the storm.
Figure 5 shows the ranking of the models based on the four skill scores for electron density at the CHAMP orbit. Figure 5 is the same as Figure 3 but for predicting electron density. It should be noted that 1_SAMI3_HWM93 data at high latitudes were excluded because they are not reliable there: SAMI3 does not include high-latitude driving forces (e.g., auroral precipitation and the convection electric field pattern). In addition, 1_GITM has data for only one strong storm (E.2005.243), two moderate storms (E.2007.091 and E.2007.142), and two quiet periods (E.2007.079 and E.2007.190). Therefore, comparisons of 1_GITM with other models based on event-averaged performance should be made with caution.
 All models, except the GITMs in low latitudes, show RMS errors that increase as geomagnetic activity rises from low (quiet period) to medium (moderate storm) and to high (strong storm) levels. For example, the RMS increase reaches up to about 2.5 × 10⁵/cm³. However, a change in geomagnetic activity from medium to high appears to produce smaller increases in the RMS error than the increases mentioned above, especially in high latitudes. All models, except the GITMs in low latitudes, perform best during the quiet periods in terms of RMS error. Although PE and the ratios show a less systematic dependence on geomagnetic activity than the RMS error does, most models tend to have better PE and ratios during storms than during quiet times in low and middle latitudes.
 In terms of both RMS error and PE, 1_JPL-GAIM, 1_USU-GAIM, and 1_IRI rank at the top in low, middle, and high latitudes, respectively, followed by the coupled and physics-based models. In low and middle latitudes, most models produce better PE (near 0) during the storms than during quiet periods, as mentioned above. However, at high latitudes, all models have better (worse) PE than in low and middle latitudes during the quiet periods (storms), which reduces the differences in PE between geomagnetic activity levels. For the data assimilation models, 1_JPL-GAIM and 1_USU-GAIM, the inconsistency in ranking across latitudes (near the middle at high latitudes and higher at lower latitudes) is possibly due to the limited latitude coverage of the assimilated data. For this study, the 1_JPL-GAIM and 1_USU-GAIM submissions were obtained using ground-based GPS TEC data between ±55° geomagnetic latitude and between ±60° geographic latitude, respectively.
 In terms of ratios, in low latitudes, 1_JPL-GAIM and the TIE-GCMs show good agreement with observations for all three geomagnetic conditions. In middle latitudes, 1_TIE-GCM shows the best performance in reproducing diurnal variations and maximum values of electron density for the quiet periods, while 1_USU-GAIM ranks at the top for storm cases. At high latitudes, 1_IRI and 1_CTIPe produce better ratios for quiet and moderate conditions, whereas the TIE-GCMs, 1_JPL-GAIM, and 1_USU-GAIM show better ratios during the strong storms than the others.
 For quiet conditions, most models tend to overestimate diurnal variations and the daytime maximum of electron density in all three latitude regions, except for TIE-GCMs in high latitudes. In low and middle latitudes, 1_GITM, 3_GITM, 1_CTIPE and 1_USU-IFM show relatively larger differences in the ratios between the three levels of geomagnetic activity than the others.
 For predicting electron density, data assimilation models and the IRI empirical model show better scores than physics-based ionosphere and coupled IT models in low and middle latitudes especially in terms of RMS difference and PE. In high latitudes, the IRI empirical model and physics-based coupled IT models rank higher than the others.
 1_SAMI3_HWM93 and 1_USU-IFM, which are physics-based ionospheric models, perform comparably in terms of RMS error and PE. 1_USU-IFM produces slightly smaller RMS errors during the storms, whereas 1_SAMI3_HWM93 agrees relatively better with observations during the quiet periods and produces better diurnal variations and maximum values of electron density than 1_USU-IFM in most cases.
 TIE-GCMs and 1_CTIPE among five coupled model submissions show similar performance in producing electron density along the CHAMP tracks, and they show better performance than the other two, GITMs, for all cases. 2_TIE-GCM is slightly better for the storms and worse for the quiet events than 1_TIE-GCM. 1_CTIPE performs worse than TIE-GCMs in low latitudes, while 1_CTIPE is better for moderate storms in middle latitudes and for the quiet periods in high latitudes.
 Differences between performances of the two data assimilation models, 1_JPL-GAIM and 1_USU-GAIM, are hardly seen except for larger RMS error of 1_USU-GAIM for strong storm in low latitudes and larger negative PE of 1_JPL-GAIM for quiet events in low and middle latitudes.
6.3. NmF2 and hmF2
Figures 6 and 7 show the model performance in predicting NmF2 and hmF2, respectively. The model scores are averages over the CHAMP and six COSMIC satellite measurements. In Figure 6, as for electron density and neutral density, most models produce increasing RMS errors as geomagnetic activity grows, with a few exceptions (e.g., 1_JPL-GAIM and 1_CTIPE for NmF2 at high latitudes). Differences in the RMS error of NmF2 between geomagnetic activity levels decrease as latitude increases. For example, the RMS error difference between low and high geomagnetic activity levels for 2_TIE-GCM is about 4 × 10⁵/cm³ at low latitudes and decreases to about 1.5 × 10⁵/cm³ at high latitudes. In terms of RMS error and PE, 1_JPL-GAIM and 1_IRI rank highly in all three latitude regions and for all three geomagnetic activity levels. 1_JPL-GAIM is better than 1_USU-GAIM in predicting NmF2 at low and high latitudes, while the two models show similar performance at middle latitudes based on RMS error and PE. 1_JPL-GAIM also shows better agreement with NmF2 observations than 1_USU-GAIM in terms of ratios. 1_USU-GAIM has ratios greater than 1 for all cases, while 1_JPL-GAIM tends to underestimate the ratios at high latitudes. 1_SAMI3_HWM93 and 1_USU-IFM produce similar scores, but 1_SAMI3_HWM93 is slightly better than 1_USU-IFM during the quiet periods in terms of RMS error and PE. Among the coupled physics-based IT models, in terms of RMS error and PE, the two TIE-GCMs are better than 3_GITM and 1_CTIPE, although 2_TIE-GCM shows the worst performance in predicting NmF2 at low latitudes during the strong storms, while 3_GITM is better than 1_CTIPE in low and middle latitudes. In terms of ratios, for quiet conditions, 1_CTIPE shows better ratios than the others in most cases; however, during strong storms it has the worst ratios (less than 1) in middle latitudes.
During strong storms, 2_TIE-GCM produces better ratios in middle and high latitudes than 1_TIE-GCM, while the opposite is true in low latitudes. 3_GITM shows the best ratio(max) in middle latitudes during storms.
 As shown in Figure 7, RMS errors in predicting hmF2 tend to increase as geomagnetic activity increases, like the RMS errors in predicting NmF2. All models produce their largest RMS errors during the strong storms in all three latitude regions; however, the errors in middle latitudes are smaller than those in low and high latitudes. RMS error differences between quiet and moderate storm conditions are smaller than those between moderate and strong storms. Most models produce similar average RMS errors, except for 3_GITM, which has the largest RMS errors in most cases. Most models produce better PE during moderate storms than during quiet periods and strong storms, although the differences are not significant. Highly ranked models in terms of the ratio(max − min) in low and middle latitudes show similar performance for all geomagnetic conditions. Models show good agreement with observations in terms of ratio(max) for all cases. 1_SAMI3_HWM93 and 1_USU-IFM show hardly any differences in the skill scores; however, 1_USU-IFM shows a better ratio(max − min) in low latitudes. 1_JPL-GAIM appears slightly better (worse) than 1_USU-GAIM in predicting hmF2 at low (middle) latitudes, while they show similar performance at high latitudes based on RMS error and PE. 1_JPL-GAIM also shows better agreement with hmF2 observations at low latitudes than 1_USU-GAIM in terms of ratios; however, 1_USU-GAIM has a better ratio(max − min) in middle and high latitudes. In most cases, the two TIE-GCMs and 1_CTIPE perform similarly and better than 3_GITM, although 2_TIE-GCM shows a worse ratio(max − min) at high latitudes than 1_CTIPe and 3_GITM.
7. Discussion and Conclusions
 We quantified the accuracy of various Ionosphere/Thermosphere (IT) models in predicting electron density, neutral density, NmF2 and hmF2 against space-based measurements using four different metrics, which are RMS error, prediction efficiency (PE), ratio of the maximum change in amplitudes, and ratio of maximum amplitudes within a 90 min time window. In addition, dependence of the model performance on geomagnetic activity and latitude was investigated by calculating the four skill scores for three latitude regions, low, middle, and high geographic latitudes, during the selected nine time intervals. The nine events were binned into three geomagnetic levels by maximum value of Kp during the time interval. Measurements used as ground truth are electron and neutral densities obtained from CHAMP, and NmF2 and hmF2 derived from radio occultation measurements by the CHAMP and COSMIC satellites. The average value of skill scores over all nine events was used to rank the model performance.
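The four skill scores named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the processing code used for the challenge: the function names are hypothetical, PE is taken in its common form (one minus the ratio of the squared model error to the squared deviation of the observations from their mean), the two ratios are taken as model over observation, and the time windowing applied in the study is omitted, so each score is computed over the whole series passed in. The sample values are made up.

```python
import math


def rms_error(obs, mod):
    """Root-mean-square difference between modeled and observed values."""
    return math.sqrt(sum((m - o) ** 2 for o, m in zip(obs, mod)) / len(obs))


def prediction_efficiency(obs, mod):
    """Assumed form: PE = 1 - sum((obs - mod)^2) / sum((obs - mean(obs))^2).

    PE = 1 is a perfect prediction; PE = 0 is no better than the observed mean;
    PE < 0 is worse than the observed mean.
    """
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - m) ** 2 for o, m in zip(obs, mod))
    obs_spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_spread


def ratio_max_min(obs, mod):
    """Ratio of the maximum change in amplitude (assumed: model over observation)."""
    return (max(mod) - min(mod)) / (max(obs) - min(obs))


def ratio_max(obs, mod):
    """Ratio of the maximum amplitudes (assumed: model over observation)."""
    return max(mod) / max(obs)


# Made-up sample series, for illustration only (arbitrary units).
obs = [2.0, 4.0, 6.0, 8.0]
mod = [3.0, 4.0, 5.0, 9.0]

print("RMS error:       ", rms_error(obs, mod))
print("PE:              ", prediction_efficiency(obs, mod))
print("ratio(max - min):", ratio_max_min(obs, mod))
print("ratio(max):      ", ratio_max(obs, mod))
```

With these definitions, a ratio close to 1 indicates that the model reproduces the observed peak or peak-to-trough range, while values above or below 1 indicate overestimation or underestimation.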
 Our study indicates that the model accuracy depends on geomagnetic activity. The RMS errors increase with increasing geomagnetic activity for most cases, although for a few models the RMS errors during quiet periods are slightly larger than or similar to those during storms. This nonlinearity in the exceptional cases was also found in our earlier paper on IT model evaluation using ground-based observations [Shim et al., 2011]. Possible causes of the nonlinearity are the simultaneous changes in the external driving forces (for example, neutral wind, composition and temperature, and electric fields), which vary with each storm depending on the magnetospheric energy input into the ionosphere. The prediction efficiency and ratios also depend on geomagnetic activity; however, their dependence is less systematic than that of the RMS error, and in some cases opposite to it. For predicting electron and neutral densities at the CHAMP locations, most models tend to have better PE and ratios during storms than during quiet times in low and middle latitudes. This is likely due to the increased observed variability during storms, which tends to produce a higher PE skill score. The same feature is seen for hmF2 predictions in middle and high latitudes. More than half of the models tend to produce better ratios of electron density during the storm intervals than during the quiet intervals, and the ratio(max) of hmF2 hardly depends on geomagnetic activity in any of the three latitude regions. Most of the higher-ranked models tend to produce smaller differences in the skill scores among the different geomagnetic conditions.
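The link between observed variability and PE mentioned above can be illustrated with a small synthetic example. The sketch below assumes the common definition of PE (one minus the squared model error divided by the squared deviation of the observations from their mean); the series are invented for illustration and are not data from the study. The same absolute model errors are applied to a low-variability "quiet" series and a high-variability "storm" series.

```python
def prediction_efficiency(obs, mod):
    """Assumed form: PE = 1 - sum((obs - mod)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - m) ** 2 for o, m in zip(obs, mod))
    obs_spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_spread


# Identical model errors in both cases (arbitrary units), so the RMS error is the same.
errors = [0.5, -0.5, 0.5, -0.5]

# Synthetic observations: small spread for "quiet", large spread for "storm".
quiet_obs = [4.0, 5.0, 4.0, 5.0]
storm_obs = [2.0, 7.0, 3.0, 8.0]

quiet_mod = [o + e for o, e in zip(quiet_obs, errors)]
storm_mod = [o + e for o, e in zip(storm_obs, errors)]

quiet_pe = prediction_efficiency(quiet_obs, quiet_mod)
storm_pe = prediction_efficiency(storm_obs, storm_mod)

print("quiet PE:", quiet_pe)            # 0.0
print("storm PE:", round(storm_pe, 3))  # 0.962
```

Because the denominator of PE grows with the spread of the observations, an unchanged absolute error yields a higher PE when the observed variability is larger, which is consistent with PE improving during storms even when the RMS error does not.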
 The results of the study also indicate that model accuracy varies with the type of metric and with latitude. For example, in the predictions of electron density along the CHAMP track, the ionospheric empirical model 1_IRI and the data assimilation models 1_JPL-GAIM and 1_USU-GAIM rank at or near the top in terms of RMS error, while data assimilation models and coupled IT models rank higher in terms of ratio(max − min), especially during storms. The data assimilation models rank at the top with respect to RMS error in low and middle latitudes. However, they perform worse than 1_IRI in high latitudes, probably because of the limited latitude coverage of the simulation results and of the observations used for the data assimilation models. The 1_JPL-GAIM submission for this study was generated by assimilating ground-based GPS TEC measurements from about 200 stations located between ±55° geomagnetic latitude, although COSMIC TEC measurements were also assimilated. The 1_USU-GAIM used for the study assimilates only GPS TEC measurements between ±60° geographic latitude; thus, its high-latitude electron densities are the same as those from the physics-based model USU-IFM, which is used as the background, without assimilating any data. For reproducing hmF2 from CHAMP and COSMIC, two physics-based models, 1_USU-IFM and 1_SAMI3_HWM93, and 1_IRI performed better than the others in terms of RMS error and PE in low and middle latitudes, although the performance differences among the models are not significant. For thermosphere neutral density predictions, the three empirical model submissions, 1_JB2008, 2_JB2008, and 1_MSIS, rank higher for almost all cases, even during the storms. 1_JB2008 and 2_JB2008, however, have better RMS error and PE, and worse ratio(max − min), than 1_MSIS.
 In addition, by comparing models of the same type, we find that the two physics-based ionospheric models, 1_SAMI3_HWM93 and 1_USU-IFM, perform similarly in general, although 1_USU-IFM performs slightly better in predicting electron density during the storms in terms of RMS error, and 1_SAMI3_HWM93 produces relatively better diurnal variations (max − min) and maximum values of electron density. Among the five physics-based coupled IT models, it appears that, for most cases, the performances of 2_TIE-GCM, 1_TIE-GCM, and 1_CTIPE are similar to each other and better than those of the GITMs, except that the GITM models were the only models to come close to reproducing the enhanced neutral densities in storms at high southern latitudes (see Figure 2a). In reproducing electron density at the CHAMP orbit, 1_CTIPE performs worse than the TIE-GCMs (1_TIE-GCM and 2_TIE-GCM) in low latitudes, while 1_CTIPE is better in middle latitudes for moderate conditions and in high latitudes for quiet conditions. In reproducing the neutral density at the CHAMP orbit, 1_CTIPE produces worse (better) RMS error and PE than the TIE-GCMs for the storm events (quiet events). 3_GITM shows the largest RMS error and negative PE during the strong storms. However, for moderate storms, the GITMs show performance comparable to the TIE-GCMs and perform better than 1_CTIPE. 2_TIE-GCM shows the worst performance in predicting NmF2 at low latitudes during the strong storms in terms of RMS error and PE. In general, the two data assimilation models, 1_JPL-GAIM and 1_USU-GAIM, show similar performance. However, for electron density prediction, 1_USU-GAIM has larger RMS errors for strong storms in low latitudes, and 1_JPL-GAIM has larger negative PE for the quiet events in middle latitudes. 1_JPL-GAIM appears slightly better (worse) than 1_USU-GAIM in predicting hmF2 at low (middle) latitudes. However, at high latitudes, 1_USU-GAIM has better ratio(max − min). The two submissions using the thermospheric empirical model JB2008, 1_JB2008 and 2_JB2008, are slightly better than 1_MSIS in terms of RMS error and PE. 1_MSIS shows the best ratio(max − min) in low and middle latitudes for all geomagnetic levels. 2_JB2008 and 1_JB2008 show the best ratio(max) in low and middle/high latitudes, respectively.
 It is worth pointing out improvements in model performance caused by enhanced and/or more complex input drivers. 2_TIE-GCM (driven by the Weimer high-latitude electric potential with dynamic critical crossover latitudes) is better than 1_TIE-GCM (driven by the Heelis high-latitude electric potential with constant critical crossover latitudes) in predicting electron and neutral densities during the storms at all latitudes. Systematic improvements in 2_TIE-GCM, however, are not seen for quiet and moderate storm conditions. The improvement of 2_TIE-GCM in predicting ionospheric parameters during the strong storm was also shown in Shim et al. . From the comparison of 1_GITM and 3_GITM, it is found that 3_GITM performs better for electron density, whereas 1_GITM performs better for neutral density. The differences in input and boundary conditions between 1_GITM and 3_GITM (see Table 2) lead to the different performance of the two GITM runs. The two JB2008 runs for neutral density, 1_JB2008 (with exospheric temperature corrections derived from the Dst index) and 2_JB2008 (with the temperature corrections derived from Weimer  total Poynting fluxes), show slight differences in skill scores. 2_JB2008 shows better scores during moderate storm events for most cases, while 1_JB2008 is better for the strong storms. Although not every enhanced and/or more complex input driver yields a systematic improvement in performance, the results of the comparison will help model improvement.
 Furthermore, the results of this systematic assessment of IT models provide a baseline for future validation studies using new models and improved models. Such assessments also provide a basis for understanding the role of data to improve assimilative models, and can suggest what observational systems are most useful for improving IT specifications and forecasts. All measurements and model simulation results used for the challenge are available on the CCMC website (http://ccmc.gsfc.nasa.gov) for use by the space science communities.
 The CHAMP neutral density data used in this study are obtained from http://sisko.colorado.edu/sutton/data.html. Portions of this research were performed at the Jet Propulsion Laboratory, California Institute of Technology under contract with the National Aeronautics and Space Administration.