Machine‐Learned HASDM Thermospheric Mass Density Model With Uncertainty Quantification

A thermospheric neutral mass density model with robust and reliable uncertainty estimates is developed based on the Space Environment Technologies (SET) High Accuracy Satellite Drag Model (HASDM) density database. This database, created by SET, contains 20 years of outputs from the U.S. Space Force's HASDM, which currently represents the state of the art for density and drag modeling. We utilize principal component analysis for dimensionality reduction, which creates the coefficients upon which nonlinear machine‐learned (ML) regression models are trained. These models use three unique loss functions: Mean square error (MSE), negative logarithm of predictive density (NLPD), and continuous ranked probability score. Three input sets are also tested, showing improved performance when introducing time histories for geomagnetic indices. These models leverage Monte Carlo dropout to provide uncertainty estimates, and the use of the NLPD loss function results in well‐calibrated uncertainty estimates while only increasing error by 0.25% (<10% mean absolute error) relative to MSE. By comparing the best HASDM‐ML model to the HASDM database along satellite orbits, we found that the model provides robust and reliable density uncertainties over diverse space weather conditions. A storm‐time comparison shows that HASDM‐ML also supplies meaningful uncertainty estimates during extreme geomagnetic events.

Imperfect geometry and gas-surface interaction modeling currently contribute to inconsistencies between in situ density data sets (March et al., 2021).
The thermosphere is the neutral region of the upper atmosphere, ranging from ∼90 km up to 500-1,000 km depending on space weather conditions, and it is influenced by complex internal and external factors. Extreme ultraviolet (EUV) emissions from the Sun heat the thermosphere, forcing expansion and increasing density at a given location. Effects from solar emissions can be represented by different solar indices and proxies (e.g., F10.7), which serve as reliable model drivers. The Sun continuously emits particles in the form of the solar wind, which interacts with and disrupts the Earth's magnetosphere. During eruptive events, such as coronal mass ejections, strong interactions between these particles and the magnetosphere result in energy being deposited into the thermosphere through Joule heating and particle precipitation at high latitudes (Knipp et al., 2004). This energy enhancement causes large increases in density and is difficult to model (e.g., Oliveira et al., 2017). The planetary amplitude index (ap) has 28 discrete values that represent the level of global geomagnetic activity. During geomagnetic storms, the storm-time disturbance index (Dst) can be used to improve density modeling.
These modeling limitations (both driver forecasting and underlying model uncertainty) put a stress on Space Domain Awareness (SDA), which shifts the focus from catalog maintenance to threat assessment (Holzinger & Jah, 2018). In SDA, decisions are made based on uncertainty information, when available. Uncertainties in density modeling can result in large discrepancies between expected and observed satellite positions (Bussy-Virat et al., 2018). The most recent Drag Temperature Model (DTM-2020) became the first prominent thermosphere model to provide uncertainty estimates, determined as 1-σ uncertainties as a function of the model inputs. Another major improvement in density modeling capabilities came with the introduction of real-time data assimilation. The High Accuracy Satellite Drag Model (HASDM; Storz et al., 2005) is an assimilative model that leverages an empirical model (JB2008) and Dynamic Calibration of the Atmosphere (DCA) to correct the density nowcast with satellite observations. The updated nowcast can then be propagated forward in time. This model/technique arguably represents the current state of the art in our ability to accurately predict atmospheric density. HASDM outputs had not been publicly available until the recent release of the SET HASDM density database.
Continued commercial satellite launches are creating a need to manage the increasing traffic in orbit. A model that balances prediction accuracy and robustness of uncertainty estimates will be a critical tool given the rising importance of Space Traffic Management (Muelhaupt et al., 2019). In this work, we investigate different modeling techniques using machine learning (ML) to create a surrogate for the database, which allows us to extrapolate beyond the limits of the existing database. A straightforward way to accomplish this is to fit a nonlinear regression model on the full density grid. However, this creates limitations in the context of uncertainty quantification (UQ). To combat this, we focus on the development of nonlinear reduced order models (ROMs). For dimensionality reduction, a linear technique called principal component analysis (PCA) is utilized. PCA has previously been applied to this HASDM data set for scientific investigation but is used here as part of the modeling framework. The predictive models are standard feed-forward neural networks, or artificial neural networks (ANNs), that employ nonlinear activation functions. We explore various custom loss functions in training in order to obtain reliable, or calibrated, uncertainty estimates that are crucial for satellite collision avoidance operations.
The paper is organized as follows. We first explain the data (mass density outputs from the HASDM database) and the various techniques used for model development. For clarification, the density space refers to the full 3D density grid (or full state), while the latent space refers to the PCA coefficients (or reduced state). The ML models operate in the latent space. Then, we show the performance and UQ capabilities using a standard mean square error (MSE) loss function. We present a method to obtain calibrated uncertainty estimates of the ML model outputs using the negative logarithm of predictive density (NLPD) and the continuous ranked probability score (CRPS) loss functions, which are based on the probability and cumulative distribution functions (CDFs) of the normal distribution. The best model (called HASDM-ML), in terms of accuracy and reliability, is then compared to the HASDM database in the density space.

Data
HASDM is an assimilative framework using the Jacchia-Bowman 2008 Empirical Thermospheric Density Model (JB2008; Bowman et al., 2008) as a background density model. HASDM improves upon the density correction work of Marcos et al. (1998) and Nazarenko et al. (1998) to modify 13 global temperature correction coefficients with its DCA algorithm. HASDM uses observations of more than 70 carefully chosen calibration satellites to estimate local density values. The satellite orbits span an altitude range of 190-900 km, although a majority are between 300 and 600 km (Bowman & Storz, 2003). The HASDM algorithm uses a prediction filter that employs wavelet and Fourier analysis for the correction coefficients (Storz et al., 2005). Another highlight of HASDM's novel framework is its segmented solution for the ballistic coefficient, which allows the ballistic coefficient estimate to deviate over the fitting period for the satellite trajectory estimation.
SET validates the HASDM output each week and archives the results. The archived values from 2000 to 2020 make up the SET HASDM density database, upon which this work is based. The database contains density outputs with 15° longitude, 10° latitude, and 25 km altitude increments spanning from 175 to 825 km. This results in 12,312 grid points for every three hours from the start of 2000 to the end of 2019. For further details on HASDM, the reader is referred to Storz et al. (2005); for details on SET's validation process and on the database, the reader is referred to Tobiska et al. (2021). Solar and geomagnetic indices/proxies are the key drivers for JB2008, and subsequently HASDM. F10.7 (referred to as F10 in this work) is a solar proxy used widely in thermospheric density models. It represents 10.7 cm solar radio flux, which does not directly interact with the thermosphere but is a reliable proxy for solar EUV heating. The S10 index is a measure of the integrated 26-34 nm solar EUV emission, which is absorbed by atomic oxygen in the middle thermosphere. M10 is a proxy that represents the Mg II core-to-wing ratio and is a surrogate for far ultraviolet photospheric 160-nm Schumann-Runge Continuum emissions that cause molecular oxygen dissociation in the lower thermosphere. The last solar index used by JB2008 is Y10, a hybrid measure of solar coronal X-ray emissions and Lyman-alpha. S10, M10, and Y10 have no relation to the 10.7 cm wavelength but are converted to solar flux units (sfu) through linear regression with F10. The 81-day centered averages of these four drivers are also used as inputs to JB2008 and are denoted by the "81c" subscript, resulting in a total of eight drivers for solar activity.
For temperature equations corresponding to geomagnetic activity, JB2008 utilizes ap and Dst. The 3-hr ap is indicative of overall geomagnetic activity and is only measured at low to mid-latitudes. It takes on 1 of 28 values on an uneven discrete grid between 0 and 400. JB2008 uses Dst during geomagnetic storms if the minimum Dst < −75 nT, and it is the primary parameter for energy being deposited into the thermosphere during storms. For more information on the JB2008 model, the reader is referred to Bowman et al. (2008). Information on how to access and use JB2008 can be found at the link in the Data Statement. The data are split into training, validation, and test sets using 60%, 20%, and 20% of the data, respectively, as shown in Figure 1. The data splitting scheme was chosen to include a full range of conditions in the training set while leaving enough diverse data for the validation and independent test sets. Table 1 shows the number of time steps in the SET HASDM density database across various space weather conditions. The cutoff values for F10 and ap are taken from previous work, where they were used to bin space weather driver forecasts.
Table 1 shows a clear underrepresentation of geomagnetic storms in this vast data set: over 98% of the data corresponds to ap ≤ 50, which can limit model development. Hierarchical modeling could be used for data of this nature, but we proceed with the development of a single comprehensive model, both because of the limited data for high geomagnetic activity and for simplicity in model development.

Principal Component Analysis
PCA is an eigendecomposition technique that determines uncorrelated linear combinations of the data that maximize variance (Hotelling, 1933; Pearson, 1901). PCA has been previously used to analyze atmospheric density models and accelerometer-derived density sets. Some researchers have applied PCA to CHAllenging Minisatellite Payload (CHAMP) and Gravity Recovery and Climate Experiment (GRACE) accelerometer-derived density data in order to analyze the dominant modes of variation and identify phenomena encountered by the satellites (Calabia & Jin, 2016; Lei et al., 2012; Matsuo & Forbes, 2010). PCA has also been leveraged for dimensionality reduction to create linear dynamic ROMs (Mehta & Linares, 2017, 2020). PCA was also recently applied to the orbit-position-derived SET HASDM density database to identify the dominant modes of variability and to compare the database to JB2008.
The SET HASDM data set is a prime candidate for PCA with the goal of predictive modeling due to its high dimensionality; the 12,312 HASDM grid locations would inhibit further UQ work due to computer memory constraints. The three spatial dimensions are first flattened (reshaped into a column of 12,312 components) to make the data set two dimensional (time × space). Then a common logarithm (log10) of the density values is taken in order to reduce the variance of the data set. Next, we subtract the temporal mean over all 20 years for each cell to center the data and save the mean for later use. Finally, we perform PCA using the svds function in MATLAB. PCA decomposes the data and separates spatial and temporal variations as in Equation 1,

y(s, t) = ȳ(s) + ỹ(s, t) ≈ ȳ(s) + Σ_{i=1}^{r} α_i(t) U_i(s),  (1)

where y ∈ ℝ^n is the log-transformed model output state (HASDM density at all n = 12,312 grid locations), ȳ is the mean, ỹ is the deviation from the mean, s represents the spatial dimension, r is the choice of order truncation (here r = 10), α_i(t) are the temporal coefficients, and U_i are orthogonal modes or basis functions. The value of r is chosen because it captures approximately 90% of the variance and has been shown to be an optimal choice for observability and accuracy for data assimilation (Mehta & Linares, 2017). Equation 2 shows how to derive these modes through the singular value decomposition:

X = UΣV^T,  X ∈ ℝ^{n×m}.  (2)

In Equation 2, m represents the ensemble size (two solar cycles with HASDM, or m = 58,440 time steps) and X represents the log-transformed, mean-centered data whose columns are the deviations ỹ(s, t) from Equation 1. The left unitary matrix, U, is made of orthogonal vectors that represent the modes of variation, Σ is a diagonal matrix of singular values corresponding to the vectors in U, and V is a matrix composed of the right singular vectors of X. We encode the data (X) into the temporal coefficients (α_i) by performing the matrix multiplication ΣV^T. To decode back to density, the coefficients are multiplied by U and the temporal mean is added at each cell.
Taking the antilogarithm (10^y) of the resulting values completes the back transformation. Figure 2 shows the first 10 PCA coefficients generated using the methods described above, which will be used to train the model (described in Sections 2.4 and 2.5). The first five PCA coefficients have been discussed in detail in previous analysis of this database, which noted strong correlation between α1 and solar activity and between α2 and annual variations. Beyond that, the coefficients were correlated with different space weather or temporal inputs with varying strength depending on the phase of the solar cycle.
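The encode/decode pipeline above can be sketched in a few lines. The following is a minimal numpy illustration with synthetic data and shrunken shapes (the authors used MATLAB's svds on the full 12,312 × 58,440 data set); all variable names are ours:

```python
import numpy as np

# Shrunken stand-in shapes; the full data set is 12,312 x 58,440 with r = 10.
n_space, n_time, r = 200, 500, 10

rng = np.random.default_rng(0)
rho = rng.lognormal(mean=-28.0, sigma=1.0, size=(n_space, n_time))  # synthetic density

# 1) log-transform, then mean-center each grid cell over time
Y = np.log10(rho)
y_bar = Y.mean(axis=1, keepdims=True)      # temporal mean, saved for decoding
X = Y - y_bar

# 2) SVD: X = U @ diag(s) @ Vt, truncated to the first r modes
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

# 3) encode: temporal coefficients alpha (r x n_time) = Sigma @ V^T
alpha = np.diag(s_r) @ Vt_r

# 4) decode: project back, add the mean, and take the antilogarithm
rho_hat = 10.0 ** (U_r @ alpha + y_bar)
```

Because the columns of U are orthonormal, encoding is equivalently `U_r.T @ X`, which is how a new snapshot would be projected onto the stored modes.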

Hyperparameter Tuning
In ML, developing an accurate model previously required a manual architecture selection process that did not guarantee optimal performance for a given application (Elsken et al., 2019). Recent developments in hyperparameter tuning make this process much quicker and more thorough, as significantly less user intervention is required and hence less time is needed to run the process. In this work, we use KerasTuner, which allows the user to define the ranges and choices of any hyperparameter and to choose a search algorithm (Abdelminaam et al., 2021; Haegel & Husa, 2020; Licata, Mehta, Weimer et al., 2021; O'Malley et al., 2019; Rogachev & Melikhova, 2020).
All subsequent mentions of a tuner refer to the use of the KerasTuner package. We run all tuners for 100 trials (or architectures), with the first 25 randomly sampled from our hyperparameter space while the last 75 trials take advantage of the tuner's Bayesian optimization scheme. We allow the tuner to choose an optimizer, the number of hidden layers, the number of neurons in each hidden layer, and activation functions. The tuner can choose unique neuron counts and activation functions within each layer. The numbers of total trials and initial/random trials were chosen through early experimentation; we found that increasing the number of trials simply increased the model development time without any noteworthy performance improvements.
The tuner trained three identical models for each trial (repeat weight initialization and training) and returned the model with the lowest validation loss after 100 training iterations or epochs. The Bayesian optimization scheme chooses future architectures based on the validation losses resulting from previous trials. Once completed, the tuner returns the best 10 models, which we can evaluate on the training and validation sets to find the most accurate and generalized model. The metrics used to evaluate the resulting models are mean absolute percent error (for accuracy) and the calibration error score (see Section 2.5.1).
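As a rough illustration of the search loop the tuner automates (a toy stand-in, not the KerasTuner API), the following pure-Python sketch samples hypothetical architectures, "trains" each one three times via a stub objective, and keeps the ten best by validation loss. KerasTuner additionally switches from random sampling to Bayesian optimization after the initial trials:

```python
import random

# Hypothetical search space; the real tuner also varies per-layer choices.
SPACE = {
    "optimizer": ["adam", "rmsprop"],
    "layers": [1, 2, 3, 4],
    "units": [128, 256, 512],
    "activation": ["relu", "tanh", "sigmoid"],
}

def evaluate(arch, seed):
    # Stub for "train for 100 epochs and return the validation loss".
    return random.Random(str(sorted(arch.items())) + str(seed)).random()

def run_trials(n_trials=100, repeats=3):
    trials = []
    for _ in range(n_trials):
        arch = {k: random.choice(v) for k, v in SPACE.items()}
        # Repeated weight initialization/training; keep the lowest val loss.
        loss = min(evaluate(arch, s) for s in range(repeats))
        trials.append((loss, arch))
    trials.sort(key=lambda t: t[0])
    return trials[:10]                 # ten best models for final vetting

best_trials = run_trials()
```

The returned candidates would then be re-evaluated on the training and validation sets, as described above, to pick the most accurate and generalized model.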

Regression Modeling
A traditional approach to regression modeling with UQ is to develop Gaussian process (GP) models (Chandorkar et al., 2017; Rasmussen, 2004). In the early stages of model development, we attempted this approach, training GP regression models on each of the PCA coefficients. However, limited by computational resources (random access memory), we could only use 10 years of data for training, which resulted in higher predictive error. In addition, the resulting models (10 GP regression models, each trained on a single PCA coefficient) were 6.83 GB each. This makes subsequent work cumbersome, as we would need to load 68.3 GB of models to make evaluations. Therefore, the results in this work only pertain to feed-forward neural networks with MC dropout.
There are three methods we explore in developing HASDM-ML. First, we create a ROM using PCA for dimensionality reduction and a nonlinear ANN for prediction with the MSE loss function. Note: the term "prediction" here refers to model outputs, not forecasts. We then modify this technique using a custom loss function in an attempt to obtain a model with calibrated uncertainty estimates. We test two loss functions to achieve this, NLPD and CRPS, details of which are given in the following section. Loss functions inform the model of the objective so that the weights can be modified to minimize the losses. ANNs have two key components: features (or inputs) and labels (or outputs). We try three separate input sets for our regression models; the first two are listed in Table 2 (List of Inputs in the Two Sets Used for Model Development), each comprising solar, geomagnetic, and temporal drivers. The first set is the JB2008 input set, referred to as JB. Previous analysis showed that the SET HASDM density database contained evidence of poststorm cooling mechanisms, which cannot be captured solely by geomagnetic indices at epoch. Therefore, we introduce a second set that is similar to the first but adds a time history of the geomagnetic indices (e.g., Dst 12, Dst 15, Dst 18, Dst 21). We refer to this as JB H (Historical JB2008). Unlike the actual JB2008 inputs, all of our input sets contain sinusoidal transformations of the day of year (doy) and universal time (UT) inputs. The four resulting temporal inputs (Equation 3) represent annual (t1 and t2) and diurnal (t3 and t4) variations:

t1 = sin(2π doy/365.25), t2 = cos(2π doy/365.25), t3 = sin(2π UT/24), t4 = cos(2π UT/24).  (3)

The generic form (2πx/y) allows the input (x) to oscillate between −1 and 1 over the period (y). The use of cyclical functions of time as inputs has been previously demonstrated (Licata, Mehta, Weimer et al., 2021; Weimer et al., 2020).
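A minimal numpy sketch of these cyclical time transformations, assuming an annual period of 365.25 days and with hypothetical function and variable names:

```python
import numpy as np

def temporal_inputs(doy, ut):
    # sin/cos pairs of the generic form sin or cos(2*pi*x/y)
    t1 = np.sin(2 * np.pi * doy / 365.25)   # annual, sine
    t2 = np.cos(2 * np.pi * doy / 365.25)   # annual, cosine
    t3 = np.sin(2 * np.pi * ut / 24.0)      # diurnal, sine
    t4 = np.cos(2 * np.pi * ut / 24.0)      # diurnal, cosine
    return np.stack([t1, t2, t3, t4])

# Two example epochs: Jan 1 at 00 UT and Jul 2 at 12 UT
feats = temporal_inputs(np.array([1.0, 183.0]), np.array([0.0, 12.0]))
```

The sin/cos pairing removes the artificial discontinuity at year and day boundaries (doy 365 → 1, UT 24 → 0) that a raw time input would introduce.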
In the JB H set, the geomagnetic indices are extensive in an effort to improve storm-time and poststorm modeling.
The "A" subscript (seen in the geomagnetic column of Table 2) refers to the daily average and the numerical subscripts refer to the value of the index that many hours prior to the epoch. The combination of two numbers references the number of previous hours the index is being averaged over (e.g., ap 12-33 refers to the average ap value between 12 and 33 hr prior to the prediction epoch). The ap time series is the same one used in the Naval Research Laboratory Mass Spectrometer Incoherent Scatter Radar Extended model (NRLMSISE-00; Picone et al., 2002). The authors found that using different time histories of ap and Dst (shown in Table 2) resulted in generally more calibrated models on independent data (see Section 3). For completeness, the results will also be shown using an input set that adopts the same time history for Dst as the ap time history in Table 2, both geomagnetic indices using the NRLMSISE-00 time series. This input set will be referred to as JB H0 .

Uncertainty Quantification
Dropout is a regularization tool often used in ML to prevent the model from overfitting to the training data (Srivastava et al., 2014). In standard feed-forward neural networks, each layer sends outputs from all nodes to those in the subsequent layer where they are introduced to weights and biases. Deep neural networks can have millions of parameters and thus are prone to overfitting. This causes undesired performance when interpolating or extrapolating.
Dropout layers use Bernoulli distributions (a binomial distribution with one trial), one for each node, with probability P. This makes the model probabilistic since the distributions are sampled each time that a set of inputs are given to the model. If a sampled Bernoulli is a "1", the node's output is nullified and the output of the layer is scaled based on the number of nullified outputs. Dropout is believed to make each node independently sufficient and not reliant on the outputs of other nodes in the layer (Alake, 2020).
In traditional use, dropout layers are only activated during training to uphold the deterministic nature of the model. However, measures can be taken to keep this feature activated during prediction, making the model probabilistic. When passing the same input set to the model a significant number of times (e.g., 1,000), there is a distribution of model outputs for each unique input. This process is referred to as Monte Carlo (MC) dropout. Essentially, every time the model is presented with a set of inputs, random nodes are dropped out, providing a different functional representation of the model. Gal and Ghahramani (2016) show that MC dropout is a Bayesian approximation of a GP.
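The mechanics of MC dropout can be illustrated with a minimal numpy sketch. The weights below are random stand-ins for a trained network (in practice the dropout layers of the trained Keras model are simply kept active at prediction time), and the rescaling shown is standard inverted dropout:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 8)), np.zeros(32)    # one hidden layer
W2, b2 = rng.normal(size=(10, 32)), np.zeros(10)   # 10 PCA coefficients out
P = 0.2                                            # dropout probability

def predict_once(x):
    h = np.tanh(W1 @ x + b1)
    mask = rng.random(h.shape) >= P      # Bernoulli keep-mask, one per node
    h = h * mask / (1.0 - P)             # rescale to preserve expected output
    return W2 @ h + b2

x = rng.normal(size=8)
samples = np.stack([predict_once(x) for _ in range(1000)])   # MC dropout runs
mu, sigma = samples.mean(axis=0), samples.std(axis=0)        # per-output UQ
```

Each pass samples a different sub-network, so the spread of the 1,000 outputs serves as the uncertainty estimate for that input.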
In this case, the model uses the input set to predict the 10 PCA coefficients. Using the MC samples, we estimate the sample mean and variance for each of the predicted coefficients/outputs (Efron & Tibshirani, 1993). The loss is computed for each output separately. Each unique input can be passed to the model k times, producing k × 10 outputs. The mean and variance are computed from those outputs with respect to the repeated axis, k. The two loss functions used to improve uncertainty estimation (in addition to MSE) are NLPD and CRPS. NLPD is based on the logarithm of the probability distribution function (pdf) of the Gaussian distribution and is shown in Equation 4 (Camporeale & Care, 2021; Matheson & Winkler, 1976):

NLPD(y, μ, σ) = (1/2) log(2πσ²) + (y − μ)²/(2σ²).  (4)
In Equation 4, y is the ground truth (α_i), μ is the sample mean of the prediction, and σ is the sample standard deviation of the prediction, each being computed for all 10 outputs. For clarity, the log used in the NLPD loss function is the natural logarithm. The second loss function for uncertainty estimation is CRPS, shown analytically in Equation 5 (Gneiting et al., 2005):

CRPS(y, μ, σ) = σ { z [2Φ(z) − 1] + 2φ(z) − 1/√π },  z = (y − μ)/σ,  (5)

where Φ and φ are the CDF and pdf of the standard normal distribution. The main difference between NLPD and CRPS is that CRPS is also based on the CDF of the Gaussian distribution as opposed to only the pdf.
An important aspect of using the loss functions described in Equations 4 and 5 is the preparation of the training data. The data are traditionally in the following form: the features are set up as the number of samples (n) by the number of inputs (n_I), giving the input shape (n × n_I), and the labels are set up as the number of samples by the number of outputs (n_o), giving the output shape (n × n_o). To implement these loss functions, we stack each input and output by the number of MC samples, k. This is a repeated axis, meaning all entries along the k axis are identical while the samples along n are not. The resulting shapes of the features and labels are (n × k × n_I) and (n × k × n_o), respectively. This allows us to determine the mean and standard deviation for each sample in the batch within the loss function.
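Under the Gaussian assumption, both losses have closed forms. The following numpy versions (with illustrative array names, not the authors' training code) show them applied to the stacked (n × k × n_o) shape, with the statistics taken over the repeated k axis:

```python
import numpy as np
from scipy.stats import norm

def nlpd(y, mu, sigma):
    # Negative natural log of the Gaussian pdf evaluated at the truth y
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

def crps(y, mu, sigma):
    # Closed-form CRPS for a Gaussian predictive distribution
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

n, k, n_o = 4, 100, 10                    # samples, MC repeats, outputs
pred = np.random.default_rng(1).normal(size=(n, k, n_o))   # stand-in MC outputs
mu, sigma = pred.mean(axis=1), pred.std(axis=1)            # stats over k axis
y = np.zeros((n, n_o))                    # stand-in ground-truth coefficients
mean_crps = crps(y, mu, sigma).mean()
```

In actual training these expressions would be written in the ML framework's tensor operations so they remain differentiable with respect to the network weights.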

Latent Space UQ
Since there are multiple models and loss functions to compare, we implement a metric to judge each model's ability to provide reliable uncertainty estimates. To do so, we modified a calibration error equation (Equation 4 in Anderson et al., 2020), shown as

calibration error score = (100 / (m r)) Σ_{j=1}^{r} Σ_{i=1}^{m} | p(α_i) − p̂(α_i) |_j.  (6)

In Equation 6, m is the number of prediction intervals investigated, r is the choice of order of truncation for PCA (the number of model outputs), p(α) is the expected cumulative probability (or prediction interval), and p̂(α) is the observed cumulative probability. The prediction intervals of interest in this work span from 5% to 99% with 5% increments: (0.05, 0.10, 0.15, …, 0.90, 0.95, 0.99). p̂(α) is computed empirically as shown in Equation 7,

p̂(α) = (1/n) Σ_{i=1}^{n} I(L ≤ α_i ≤ U),  (7)

where n is the number of samples, I is the indicator function, L is the lower bound of the prediction interval, U is the upper bound of the prediction interval, and α_i is the sample. The indicator function returns a 1 if the inequality is true and a 0 otherwise. To compute the bounds, we use the pair of equations given in Equation 8 (Davison et al., 1997; Pevec & Kononenko, 2014):

L = μ − zσ,  U = μ + zσ,  (8)
where z is the critical value used for the prediction interval. This is calculated from the inverse Gaussian CDF,

z = Φ⁻¹((1 + PI)/2),  (9)

where PI is the prediction interval of interest (e.g., 95% or 0.95). Comparing the expected (p(α)) and observed (p̂(α)) cumulative probabilities is done qualitatively by plotting the calibration curves: p̂(α) versus p(α). The curves show how well calibrated the uncertainty estimates are at capturing the appropriate percentage of true samples. A perfectly calibrated model will have a straight 45° calibration curve. If a calibration curve is above or below the 45° reference line, the model is over- or underestimating the uncertainty, respectively. The calibration error score (Equation 6) is a quantitative measure of the average deviation from perfect calibration in the latent space, averaged across each output. In this work, we measure robustness and reliability through the calibration error score and the calibration curves. Since this refers to latent space calibration, we use α in Equations 6-9, but for density (Section 2.5.2), α would be replaced with ρ.
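The calibration procedure of Equations 6-8 can be sketched for a single output as follows (names are ours, and the full score additionally averages over the r outputs); a perfectly specified Gaussian model should score near zero:

```python
import numpy as np
from scipy.stats import norm

# Prediction intervals: 5% to 95% in 5% steps, plus 99% (20 total)
PIs = np.append(np.arange(0.05, 1.0, 0.05), 0.99)

def observed_probability(y, mu, sigma, pi):
    z = norm.ppf((1 + pi) / 2)                    # two-sided critical value
    lower, upper = mu - z * sigma, mu + z * sigma # interval bounds
    return np.mean((y >= lower) & (y <= upper))   # empirical coverage

def calibration_error(y, mu, sigma):
    obs = np.array([observed_probability(y, mu, sigma, pi) for pi in PIs])
    return 100 * np.mean(np.abs(obs - PIs))       # mean |deviation| in percent

# Sanity check: truth drawn from the predicted Gaussian itself
rng = np.random.default_rng(2)
mu, sigma = np.zeros(100_000), np.ones(100_000)
y = rng.normal(mu, sigma)
score = calibration_error(y, mu, sigma)
```

Plotting `obs` against `PIs` yields the calibration curve described above.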

Density UQ
While latent space calibration is important because the model is trained on PCA coefficients, determining the reliability of the model's predictions on the resulting density is the ultimate goal. To examine this, we look at the orbits of CHAMP (Luhr et al., 2002) and GRACE (Bettadpur, 2012). These satellites had onboard accelerometers from which mass density has been estimated (Bruinsma & Biancale, 2003; Calabia & Jin, 2016; Doornbos, 2012; Liu et al., 2005; March et al., 2019; Sutton, 2008). The SET HASDM density database and JB2008 were recently compared to both CHAMP and GRACE-A density estimates over the lifetime of their measurements. We use the satellite positions for density calibration assessment between HASDM and HASDM-ML, because it is computationally inefficient to compute the calibration curve across all 719,513,280 density values in the HASDM database (58,440 × 12,312). Performing that assessment would also be difficult in terms of visualization.
We evaluate HASDM-ML 1,000 times every 3 hr across the entire availability of CHAMP (2002-2010) and GRACE (2002-2011) position data and interpolate the outputs to the satellite locations using trilinear interpolation. For clarification, only the satellite positions are considered, not the density estimates. This model evaluation and interpolation allows us to compute the observed cumulative probability of HASDM-ML relative to the HASDM database in terms of density. Due to the redundancy and computational expense, we only interpolate the model and database density every 500 samples (5,000 and 2,500 s for CHAMP and GRACE-A, respectively). The CHAMP orbit comparison uses 23,795 HASDM prediction epochs (40.7% of the total available HASDM data), with the density being interpolated to at least two satellite positions per prediction due to the cadence of this study. The GRACE orbit comparison uses 24,602 HASDM prediction epochs (42.1% of the total available HASDM data), with the density being interpolated to at least four satellite positions per prediction. The number of satellite positions per prediction comes from the number of positions used every 3 hours (the HASDM cadence). This provides a wide view of the model's UQ capabilities considering the wide array of positions and conditions covered. To perform these simulations, the model had to be evaluated 23,795,000 and 24,602,000 times for CHAMP and GRACE, respectively (the number of HASDM prediction epochs times the number of MC runs, 1,000). HASDM-ML can perform these predictions in 17.27 s for CHAMP and 17.54 s for GRACE on a laptop with an NVIDIA GeForce GTX 1070 Mobile graphics card. Using the central processing unit, the model takes 143 s for CHAMP and 152 s for GRACE.
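A minimal sketch of the trilinear interpolation step, using a synthetic grid with the database's spacing (24 × 19 × 27 = 12,312 points); a real implementation would also handle the longitude wrap-around at 360°:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Grid axes mirroring the database spacing: 15 deg lon, 10 deg lat, 25 km alt
lon = np.arange(0.0, 360.0, 15.0)            # 24 longitudes
lat = np.arange(-90.0, 90.1, 10.0)           # 19 latitudes
alt = np.arange(175.0, 825.1, 25.0)          # 27 altitudes (175-825 km)

rng = np.random.default_rng(3)
grid = rng.lognormal(mean=-28.0, sigma=1.0,
                     size=(lon.size, lat.size, alt.size))  # synthetic density

# "linear" on a 3-D regular grid is trilinear interpolation
interp = RegularGridInterpolator((lon, lat, alt), grid, method="linear")
rho_sat = interp([[127.5, 33.3, 410.0]])[0]  # density at one satellite position
```

In the comparison described above, one such interpolant per 3-hr density grid (for both HASDM and each HASDM-ML MC variation) is queried at the satellite positions.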
Geomagnetic storms are particularly difficult conditions to model accurately. Therefore, we look at four geomagnetic storms from 2002 to 2004 where we can evaluate HASDM-ML's reliability across unique events. Two of the events take place in 2002, which is outside the training data set, while the other two are from 2003 to 2004 and are seen in training. Information on these storms can be found in Table 3. To conduct this assessment, we employ the methodology from a previous storm-time case study. The model is evaluated over a 6-day period encompassing each storm and then interpolated to the CHAMP locations (10 s cadence). Again, the interpolation to satellite positions is conducted to assess and visualize the implications of density UQ along a satellite orbit. During each three-hour prediction period, the density grids remain constant. All 1,000 HASDM-ML density variations are then averaged across each orbit. We consider the average altitude for each 6-day period to estimate the orbital period. The mean and 95% prediction interval bounds are computed, compared to the corresponding HASDM densities, and shown in Figure 7 in orbit-averaged form. We also show the orbit-averaged JB2008 predictions for comparison. In total, the six days amount to 48 model prediction epochs that result in 51,840 interpolated densities (1,000 MC runs) from which we compute the observed cumulative probabilities.
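The orbit-averaged comparison can be sketched as follows; forming the 95% bounds from the 2.5th and 97.5th percentiles of the MC samples is our assumption (a Gaussian μ ± 1.96σ would be an alternative), and the density values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
n_mc, n_times = 1000, 540              # 1,000 MC runs, ~90-min orbit at 10-s cadence
densities = rng.lognormal(mean=-28.0, sigma=0.1,
                          size=(n_mc, n_times))  # interpolated along one orbit

orbit_avg = densities.mean(axis=1)     # one orbit-averaged value per MC run
mean_density = orbit_avg.mean()
lo, hi = np.percentile(orbit_avg, [2.5, 97.5])   # 95% prediction interval
```

Repeating this for each orbit over the 6-day window yields the orbit-averaged mean and bounds plotted against the corresponding HASDM and JB2008 densities.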

Results and Discussion
Upon running each input set with all three loss functions through individual tuners, the best 10 models from each tuner (in terms of training and validation metrics) are evaluated on the entire training, validation, and test sets 1,000 times. The mean absolute error results from the best of the 10 models for each input set/loss function are shown in Table 4. The mean absolute error is computed for the model prediction (mean coefficient predictions converted to density through PCA) relative to the HASDM database.
The addition of historical geomagnetic indices clearly improves the model performance, with error reductions ranging from 0.72% to 2.09% (comparing the JB columns to the JB H and JB H0 columns). As mentioned in Section 2.4, the motivation for using the time series geomagnetic indices was to improve storm-time and poststorm performance (the four storms used in the calibration analysis are listed in Table 3). However, Table 1 shows that these conditions account for a small subset of the data, so the notable improvement with the JB H and JB H0 input sets suggests that the time histories improve the predictions across all conditions. In general, the CRPS models have the lowest error with respect to the loss functions, and the JB H0 models have the lowest error with respect to the input sets. Table 5 shows the calibration error score for the same models, this time using both the mean and standard deviation of the coefficient predictions (refer to Equation 6).
The incorporation of the custom loss functions reduces the calibration error score by an order of magnitude relative to models trained with MSE, which tend to underestimate the uncertainty. The best performing loss function, in regard to calibration, is NLPD. To choose the best overall model, we focus on the test performance as those data are completely independent from the training process. We weigh the calibration performance more heavily than the prediction error as reliable uncertainty estimates are the most valuable asset for a thermospheric density model. The JB H + NLPD model is within 1% of the error of all better-performing models (Table 4), and it has the lowest test calibration error score with scores within 0.30% of all more calibrated models on the training and validation data. As the calibration error score is a composite of the scores from each PCA coefficient, we show the calibration curves of all coefficients on the independent test set for the best JB H + NLPD model, in panel (b), alongside the best JB H + MSE model, in panel (a), for comparison in Figure 3.
The calibration curves in panel (b) for all PCA coefficients roughly follow the perfectly calibrated 45° line, with α 5 being the only coefficient that prominently underestimates uncertainty. However, coefficients beyond the first few contribute minimally to the full state (density), so this should not greatly impact the resulting density. PCA orders the coefficients from most to least captured variance, so α 1 has significantly more impact on the reconstruction of the data than, for example, α 10 . In sharp contrast to the JB H + NLPD results, panel (a) shows that the model trained with the MSE loss function is far from calibrated, as is also evident in Table 5. The uncertainty is significantly underestimated at all cumulative PIs because the model is not trained with any term for its variance.
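A calibration curve of the kind shown in Figure 3 can be sketched as follows, assuming (as an illustration) a Gaussian predictive distribution per sample: for each expected cumulative probability, count how often the truth falls inside the corresponding central prediction interval. The function names and the mean-absolute-gap error score are our illustrative choices, not necessarily the paper's exact Equation 6.

```python
import numpy as np
from scipy.stats import norm

def calibration_curve(y, mu, sigma, levels):
    """Observed coverage of central Gaussian prediction intervals
    at each expected cumulative probability in `levels`."""
    z = norm.ppf(0.5 + np.asarray(levels) / 2.0)   # interval half-widths in sigmas
    resid = np.abs(y - mu)[:, None]
    return (resid <= z[None, :] * sigma[:, None]).mean(axis=0)

def calibration_error(y, mu, sigma, levels):
    """Mean absolute gap between observed and expected coverage."""
    return np.abs(calibration_curve(y, mu, sigma, levels) - levels).mean()

# Synthetic check: data drawn from the predictive distribution itself
# should sit on the 45-degree line (near-zero calibration error).
rng = np.random.default_rng(0)
mu = np.zeros(100_000)
sigma = np.ones(100_000)
y = rng.normal(mu, sigma)
levels = np.linspace(0.05, 0.95, 19)
score = calibration_error(y, mu, sigma, levels)
```

An underconfident or overconfident model shifts this curve off the diagonal, which is exactly the behavior the MSE-trained model exhibits in panel (a).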
The JB H + NLPD model shown in Figure 3 will be the focus of all subsequent analyses and will be referred to as HASDM-ML. To investigate the model's reliability on density in an operational context, we look at the orbits of CHAMP and GRACE-A, each over an 8-year period with a cumulative altitude range of 300-530 km. HASDM-ML was evaluated in three-hour increments from 2002 to 2011 and interpolated to the satellite positions at all epochs, as discussed in Section 2.5.2. The results for the CHAMP orbit are displayed in Figure 4. Figure 4 panel (a) shows the density ratios of HASDM-ML and JB2008 relative to HASDM. The HASDM-ML ratios have much lower variance than the JB2008 ratios. The mean ratios for both models are 1.03. However, 95% of the HASDM-ML ratios are between 0.75 and 1.25, compared to 86% for JB2008. The surrogate ML model is imperfect in its mean prediction, as seen in Table 4, but panels (b) and (d) show that the density uncertainty is reliable. The calibration curve is exceptional, with the observed cumulative probability being within 1% of the expected value for all 20 cumulative PIs tested. Figure 5 shows the same analysis along the GRACE-A orbit. Figure 5 panel (a) again shows that the HASDM-ML density ratios have much less variance than JB2008. For these GRACE-A positions, the mean density ratios are 1.05 and 1.07 for HASDM-ML and JB2008, respectively. 86% of the HASDM-ML ratios are between 0.75 and 1.25, compared to 72% of JB2008 ratios. Panels (b) and (d) also demonstrate that although the model densities are not identical to HASDM's, HASDM-ML provides reliable uncertainty estimates. Panel (d) reveals that at higher GRACE altitudes there is slightly less agreement between the expected and observed cumulative probabilities, with the largest discrepancy being just over 1%. Scatter plots comparing HASDM-ML and JB2008 to HASDM densities for both satellite orbits are displayed in Figure 6.
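The orbit statistics quoted above (mean density ratio and the fraction of ratios within 0.75-1.25) are straightforward to reproduce; a minimal sketch, where `rho_model` and `rho_hasdm` stand in for hypothetical density arrays sampled along the satellite positions:

```python
import numpy as np

def ratio_stats(rho_model, rho_hasdm, lo=0.75, hi=1.25):
    """Mean model-to-reference density ratio and the fraction
    of ratios falling within [lo, hi]."""
    r = np.asarray(rho_model) / np.asarray(rho_hasdm)
    return r.mean(), np.mean((r >= lo) & (r <= hi))

# Toy example: two of three samples fall within the band.
mean_ratio, frac_in_band = ratio_stats(
    rho_model=np.array([1.0, 1.1, 0.5]),
    rho_hasdm=np.ones(3))
```

The same two numbers summarize both panels (a) in Figures 4 and 5: a mean ratio near 1 indicates a small overall bias, while the in-band fraction captures the spread of the ratios.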
Figure 6 highlights the prediction accuracy of HASDM-ML compared to JB2008. Both models are well centered on the perfect-prediction line (in black), but as seen in Figures 4 and 5, HASDM-ML has a tighter spread about this line. To clarify, this scatter plot represents prediction accuracy and is not the same as the calibration curves seen in other figures. The coefficient of determination (R 2 ) is higher for HASDM-ML along both satellite orbits, and R 2 is higher for both models along the GRACE-A orbit. Figure 7 shows HASDM and HASDM-ML orbit-averaged densities during four geomagnetic storms with prediction intervals and the associated calibration curves.
Across all of the storms investigated, the mean prediction of HASDM-ML follows the trend of HASDM density. Even when the model deviates, HASDM densities are mostly captured by the uncertainty bounds (computed using Equation 8). Panels (a) and (b) represent storms not contained in the training data set, which shows that HASDM-ML generalizes well, even during these highly nonlinear events. In panel (a), HASDM-ML and JB2008 overestimate the peak density, but HASDM-ML better captures the timing. For this storm, JB2008 predicts a delayed and longer impact of the geomagnetic storm. The mean absolute errors for HASDM-ML and JB2008 relative to HASDM are 11.91% and 13.03%, respectively. In panel (b), both models have similar predictions to HASDM for the first peak (day 2), but JB2008 has an elongated period of density overprediction from days 4-6. The mean absolute errors for HASDM-ML and JB2008 relative to HASDM for this storm are 9.86% and 14.37%, respectively. The storm in panel (c), while in the training set, highlights the improved performance of HASDM-ML. After the storm, JB2008 predicts much higher densities than both HASDM and HASDM-ML. Licata, Mehta, Tobiska et al. (2021) compared the orbit-averaged densities of HASDM and JB2008 to CHAMP and GRACE-A during this storm and found that the low poststorm densities predicted by HASDM were similar to the density estimates from both satellites, which HASDM-ML also reproduces. The errors for this storm are 8.46% for HASDM-ML and 25.29% for JB2008. For the last storm, panel (d), all three models predict similar trends in density. JB2008 deviates the most, particularly between the two main phases of the storm and in the last 36 hr. The mean absolute errors are 12.64% and 19.81% for HASDM-ML and JB2008, respectively.
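The uncertainty bounds used here come from Monte Carlo dropout: dropout is left active at inference, the network is run many times, and the spread of the stochastic predictions yields an interval. The following is a toy single-layer sketch of that procedure; the layer, weights, and the mean ± z·std interval form are illustrative assumptions, not the paper's exact Equation 8 or architecture.

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_forward(x, W, p_drop=0.2, rng=rng):
    """One forward pass of a toy one-layer ReLU network with
    dropout kept active at inference (Monte Carlo dropout)."""
    mask = rng.random(W.shape[1]) >= p_drop          # randomly drop units
    h = np.maximum(x @ (W * mask) / (1.0 - p_drop), 0.0)
    return h.sum(axis=1)

def mc_dropout_interval(x, W, n_samples=200, z=1.96):
    """Mean prediction and an approximate 95% interval (mean ± z*std)
    from repeated stochastic passes."""
    preds = np.stack([stochastic_forward(x, W) for _ in range(n_samples)])
    mu, sd = preds.mean(axis=0), preds.std(axis=0)
    return mu, mu - z * sd, mu + z * sd

# Toy inputs/weights purely for illustration.
x = np.ones((3, 4))
W = np.linspace(-1.0, 1.0, 20).reshape(4, 5)
mu, lower, upper = mc_dropout_interval(x, W)
```

Each evaluation of HASDM-ML produces such an ensemble of stochastic passes, and the resulting bounds are what envelope the HASDM densities in Figure 7.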
The calibration curves corresponding to each event show the robust nature of HASDM-ML's uncertainty estimates. None of the calibration curves, at any of the 20 cumulative PIs tested, deviates from perfect calibration by more than 10.7%. We show the combination of all four calibration curves (averaged) to give a broad sense of the calibration across the storms. This curve is well calibrated and does not deviate from perfect calibration by more than 3.7%. Note: perfect calibration here is the 45° line in panel (e) and the line y = 0 in panel (f). While the observed cumulative probability values deviate from the expected values (particularly for the individual storms), these are highly nonlinear periods where density models tend to be unreliable.

HASDM-ML Performance Metrics
To assess the conditions in which HASDM-ML can improve, the global mean absolute errors relative to HASDM are computed as a function of space weather conditions. The results are shown in Table 6 and the number of samples in each bin can be found in Table 1. The bottom half of the table contains the global mean absolute errors of JB2008 relative to HASDM for comparison.
The results from Table 6 show that HASDM-ML is robust to different F 10 and ap ranges when ap ≤ 50, since these errors do not vary by more than 2%. The only conditions in which the mean absolute error exceeds 11% are when ap > 50, which accounts for only 1.80% of the samples. This suggests that more samples may be required for this specific condition in both the training and evaluation phases. The last row contains the errors only as a function of F 10 , which shows that across all four solar activity levels the error deviates by only 1.24%. The bottom-right cell shows that the error across all 20 years of available data is only 9.71%. JB2008 densities are much less similar to HASDM. HASDM-ML is more accurate over all 20 space weather conditions considered, and the improvement ranges from 3.75% to 9.16%. As a function of F 10 , HASDM-ML has the most significant improvement for low solar activity (F 10 ≤ 75 sfu). As a function of ap, HASDM-ML has the largest improvement for high geomagnetic activity (ap > 50). Across all the available data from the SET HASDM density database, HASDM-ML has 5.65% lower error than JB2008.
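A table of condition-binned errors like Table 6 can be assembled by digitizing the drivers and averaging the absolute percent error within each bin. The sketch below assumes illustrative F 10 bin edges (75, 150, 190 sfu, consistent with the "F 10 ≤ 75 sfu" level mentioned above) and the ap > 50 storm threshold; the function name and exact edges are our assumptions.

```python
import numpy as np

def binned_mae(rho_ml, rho_ref, f10, ap,
               f10_edges=(75.0, 150.0, 190.0), ap_edge=50.0):
    """Mean absolute percent error of rho_ml vs. rho_ref, binned by
    F10 level (4 bins) and by quiet/storm ap (2 bins). Empty bins
    are reported as NaN."""
    err = 100.0 * np.abs(rho_ml - rho_ref) / rho_ref
    f10_bin = np.digitize(f10, f10_edges)        # 0..3 solar-activity levels
    storm = ap > ap_edge
    table = np.full((4, 2), np.nan)
    for i in range(4):
        for j, sel_ap in enumerate((~storm, storm)):
            sel = (f10_bin == i) & sel_ap
            if sel.any():
                table[i, j] = err[sel].mean()
    return table

# Toy data: a uniform 10% overestimate across four conditions.
rho_ref = np.ones(4)
rho_ml = 1.1 * rho_ref
f10 = np.array([50.0, 100.0, 160.0, 200.0])
ap = np.array([10.0, 10.0, 60.0, 60.0])
table = binned_mae(rho_ml, rho_ref, f10, ap)
```

Reporting NaN for unpopulated bins also makes the sparse ap > 50 conditions, which drive the largest errors, easy to spot.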

Summary
In this work, we developed a surrogate ML model for the SET HASDM density database with robust and reliable uncertainty estimation capabilities (Figures 4, 5, and 7). PCA was selected for dimensionality reduction due to its wide use in the field. A Bayesian search algorithm was leveraged in an attempt to identify the optimal architecture for each input set and loss function tested. We found that of the nine input-loss function combinations explored, the combination of a JB2008 input set with historical geomagnetic drivers (JB H ) and the NLPD loss function resulted in the most comprehensive model. This model, HASDM-ML, has 9.07% error across the 12-year training set and an average 10.69% error over the combined 8-year validation/test sets (Table 4). It also had the lowest calibration error score on the test set, meaning it was the most calibrated to independent data (Table 5). We also compared its calibration curves for each output across the test set to those of the MSE model with the same inputs. This showed that the MSE model considerably underestimated the uncertainty while the NLPD model was well calibrated across the 10 outputs. Upon selecting HASDM-ML, we evaluated its uncertainty capabilities across the entire orbits of CHAMP (Figure 4) and GRACE-A (Figure 5), each covering almost half of the time span of the HASDM data. This assessment showed that the mean prediction at the satellite locations closely matched that of the HASDM data set.
Across all 20 prediction intervals tested over this period (2002-2011), the model provided an observed cumulative probability that never deviated by more than 1% from the expected value for CHAMP's orbit and by more than 1.15% for GRACE-A's orbit. A separate storm-time evaluation revealed that across four storms, HASDM-ML provides similar density to HASDM and its uncertainty estimates remain robust and reliable (Figure 7). The results from the density calibration tests are significant because thermospheric density models providing uncertainty estimates are still a novel concept. Additionally, uncertainty estimates themselves are not meaningful unless they are well calibrated, and HASDM-ML is able to provide that. HASDM-ML was also more accurate than JB2008 relative to HASDM for all four storms and across the 20 space weather conditions considered (Table 6).

Limitations and Future Work
To further improve HASDM-ML, additional data can be introduced, particularly in the coming solar maximum. As seen in Section 3.1, periods of ap > 50 were the source of the highest overall error and the introduction of additional storms could reduce that error. Hierarchical modeling is another approach to potentially combat this limitation, as previously mentioned. We hypothesize that a nonlinear dimensionality reduction method could also improve performance, especially in modeling the highly nonlinear storm response (Turner et al., 2020). The developed model can be used in research (e.g., to study historical storms where HASDM outputs are unavailable) as well as in an operational setting. Another area of future work could be to develop a nonlinear dynamic ROM based on the SET HASDM density database. While the background model is static, the assimilative framework represents the dynamic thermosphere and a dynamic ROM could better model certain phenomena dealing with the dynamic evolution of the system. The dynamic ROM can also provide a framework for further data assimilation .

Data Availability Statement
Requests can be submitted for full access to the Space Environment Technologies (SET) High Accuracy Satellite Drag Model density database at https://spacewx.com/hasdm/ and all reasonable requests for scientific research are accepted as explained in the rules of road document on the website. The historical space weather indices used in this study can be found at the following links: F 10.7 : https://www.spaceweather.gc.ca/forecast-prevision/solarsolaire/solarflux/sx-en.php, planetary amplitude index (ap): https://doi.org/10.5880/Kp.0001, and storm-time disturbance index (Dst): https://doi.org/10.5880/Kp.0001. The remaining solar indices and proxies can be found at https://spacewx.com/jb2008/ in the SOLFSMY.TXT file. The primary source for nowcasts and forecasts of the solar drivers is provided by SET at the JB2008 link or on the Unified Data Library (UDL) at https://unifieddatalibrary.com/. Nowcasts and forecasts for ap are provided by the National Oceanic and Atmospheric Administration Space Weather Prediction Center and SET at the same links. The nowcasts and forecasts for Dst are provided by SET on UDL or at https://spacewx.com/new_sam_ops/. The forecasting models for these drivers have been benchmarked by . Free and one-time-only registration is required to access the nowcasts and forecasts. CHAllenging Minisatellite Payload and Gravity Recovery and Climate Experiment position data were obtained from the measurements presented by  at http://tinyurl.com/densitysets.