Improving Thermospheric Density Predictions in Low‐Earth Orbit With Machine Learning

Thermospheric density is one of the main sources of uncertainty in the estimation of satellites' position and velocity in low‐Earth orbit. This has negative consequences in several space domains, including space traffic management, collision avoidance, re‐entry predictions, orbital lifetime analysis, and space object cataloging. In this paper, we investigate the prediction accuracy of empirical density models (e.g., NRLMSISE‐00 and JB‐08) against black‐box machine learning (ML) models trained on precise orbit determination‐derived thermospheric density data (from CHAMP, GOCE, GRACE, SWARM‐A/B satellites). We show that by using the same inputs, the ML models we designed are capable of consistently improving the predictions with respect to state‐of‐the‐art empirical models by reducing the mean absolute percentage error (MAPE) in the thermospheric density estimation from the range of 40%–60% to approximately 20%. As a result of this work, we introduce Karman: an open‐source Python software package developed during this study. Karman provides functionalities to ingest and preprocess thermospheric density, solar irradiance, and geomagnetic input data for ML readiness. Additionally, it facilitates developing and training ML models on the aforementioned data and benchmarking their performance at different altitudes, geographic locations, times, and solar activity conditions. Through this contribution, we offer the scientific community a comprehensive tool for comparing and enhancing thermospheric density models using ML techniques.

many space applications, including but not limited to collision avoidance, re-entry, orbital lifetime estimation, orbit propagation, and tracking (Vallado, 2001).
The most common models used to forecast thermospheric density from solar irradiance and geomagnetic drivers are either physics-based or empirical models.The most widespread empirical models used in the operational community are NRLMSISE-00 (Picone et al., 2002), JB-08 (Bowman et al., 2008), and HASDM (Storz et al., 2005).The first two are publicly accessible, while the last model, maintained by Air Force Space, is not available open-source.Among the most used and publicly available ones, JB-08 is considered among the most accurate ones (Marcos et al., 2006).It is well known that these models present biases and do not always perform accurately, especially in periods of high and low solar activity (He et al., 2018;Licata et al., 2021).On the other hand, physics-based models also exist, which attempt to accurately model the underlying physics phenomena behind the Sun-Earth interaction in the atmosphere.These models, such as TIE-GCM (Richmond et al., 1992), were demonstrated to be more accurate than empirical models during geomagnetic storms (S.Bruinsma et al., 2021).However, results across all geomagnetic storm conditions, altitude ranges and irradiance conditions, do not always confirm their superiority with respect to empirical models.Moreover, their complexity and heavy computational burden prevent their use for operational applications (e.g., collision avoidance).
In addition, to improve the thermospheric density estimation problem, several researchers have discussed and investigated the inclusion of more geomagnetic and/or solar irradiance data in both physics-based and empirical models, for better modeling the Sun-Earth interaction, and provide better models.While some have proposed new solar irradiance and geomagnetic proxies (Bowman et al., 2008), others have advocated the necessity of including EUV irradiance inputs (instead of proxies) in the prediction models (Vourlidas & Bruinsma, 2018).
While semi-empirical approaches combine empirical mathematical equations with fitting data, machine learning (ML) models use very few prior assumptions about data to form relationships between variables.The hypothesis is that ML models, given enough data, can learn a representation of the thermospheric system without any physics-based assumptions and might be able to outperform semi-empirical existing models.Starting from a preliminary study performed by the authors (Bonasera et al., 2021), and with the aim to test this hypothesis, this paper develops ML models that can accurately predict thermospheric density variations.Furthermore, it also proposes and releases an open-source framework to train and analyze the performances of ML-based thermospheric density models and benchmark them against empirical models.The main findings and contributions of this study are: 1. ML models show a remarkable 60% reduction in mean absolute percentage error (MAPE) compared to existing semi-empirical models, showcasing their superior performance despite the absence of physics-based assumptions; 2. the release of Karman-ML Thermospheric Neutral Density Model (KML) (Acciarini, Brown, & Baydin, 2023).
This model uses the same inputs as the JB08 model but with a significant performance improvement; combining multiple inputs into ML-ready data (or similar to this) 3. the release of a pipeline for combining multiple inputs from different sources, producing ML-ready data encompassing thermospheric density inputs, solar irradiance, and geomagnetic inputs.Through our work, researchers and operators now have a framework to handle solar irradiance and geomagnetic data (both raw measurements and proxies/indices), as well as empirical model predictions and precise orbit determination (POD)-derived thermospheric density from SWARM-A, SWARM-B, GRACE, GOCE, CHAMP satellites.This framework empowers the research community to develop new models by leveraging the same software infrastructure (as we showed with our proposed KML).A schematic illustration of our proposed workflow is shown in Figure 1.Furthermore, we have designed this framework in a modular fashion to encourage expansion and improvements by the open-source community, facilitating advancements through the inclusion of additional data sources; 4. the release of a benchmarking framework, where users can test their own models, as well as the provided ones, against POD-derived and empirical models of thermospheric density.This can be done at different geomagnetic storm conditions and altitudes, facilitating in-depth analysis of the results.
The concept of benchmarking physics-based, empirical, and ML models is not new and has been employed in several works (S.Bruinsma, 2015;S. L. Bruinsma et al., 2012;Jackson et al., 2020;Oliveira et al., 2020;Tobiska et al., 2021).However, none of these studies has focused on building an ML representation of the thermospheric density in low-Earth orbit from only accelerometer-derived density data, without the use of empirical In terms of ML models applied to the thermospheric density prediction task, several works have been published in recent years.Some have used artificial neural networks for predicting long-term thermospheric density trends (Weng et al., 2020), while others have produced ML models that use principal component analysis or reduced order modeling, together with density data from HASDM and JB-08 empirical models to produce an ML surrogate model for the thermospheric density (Licata & Mehta, 2022;Licata, Mehta, Tobiska, & Huzurbazar, 2022;Licata, Mehta, Weimer, et al., 2022;Mehta et al., 2018;Mehta & Linares, 2017;Turner et al., 2020).These works were not the only ones: there have been several attempts in both predicting short and long-term variations using ML (Chen et al., 2014;W. Li et al., 2023;Pérez & Bevilacqua, 2015;Pérez et al., 2014;Y. Wang & Bai, 2023;P. Wang et al., 2022P. Wang et al., , 2023;;Zhang et al., 2021).All these works, however, have never included direct solar irradiance measurements (instead of proxies) and geomagnetic activity measurements besides the Ap index, Dst, and SYM-H: their only considered inputs were either proxies and/or indices, as also happens in empirical density models.Instead, it is believed that more information on the Sun and geomagnetic activity, from actual measurements of satellites and ground stations, could enhance our understanding and prediction of the Sun's influence on the thermosphere (Emmert, 2015;Vourlidas & Bruinsma, 2018).
One of the goals of this work is to propose a software framework that can help bridge the discussed gap and enable the community to devise ML-based thermospheric density models using geomagnetic and EUV irradiance measurements.This paper is organized as follows: in Section 2, we introduce the data pipeline and the supported data sources, which constitute the backbone on which ML models can be trained and evaluated, as well as the ML methodology and benchmarking framework.While in Section 3, we discuss the results of our analysis both globally, and at different altitude and geomagnetic storm conditions.Finally, in Section 4, we discuss the conclusions of our work.

Data Pipeline
One of the outcomes of this work is a pipeline to produce a unified data set that supports data from several different sources.These include high frequency (every 30 or 10 s) POD-derived thermospheric density data from five space missions (CHAMP, GOCE, GRACE, SWARM-A, SWARM-B), processed by the TU Delft precise orbit determination group (Doornbos, 2012;Siemes et al., 2016Siemes et al., , 2023)), as well as other data related to satellites' state and time (i.e., latitude, longitude, date).For more details on the thermospheric density data used in this work we refer the interested reader to the other publication of the authors (Bonasera et al., 2021).Empirical density data from NRLMSISE-00 and JB-08, as well as the needed inputs to run these models (e.g., sun declination and ascension, sidereal time, geomagnetic and irradiance indices).Geomagnetic inputs both in the form of measurements ).In particular, two different solar irradiance data products are supported.First, the daily (with 1-day cadence) and flare (with 1-min cadence) "stan bands" from FISM2 empirical model: this is an empirical model of the solar spectral irradiance from 0.01 to 190 nm at 0.1 nm spectral bins.It is derived from SORCE XPS L4, SDO-EVE, and SORCE SOLSTICE data, but it allows a continuous time and spectral coverage of solar irradiance data, which is very appealing for thermospheric density modeling purposes (Chamberlin et al., 2020).The "stan bands" are a set of 23 bands derived with a wavelength binning scheme covering the UV from 0.01 to 121 nm, that are used as inputs into Earth ionospheric and thermospheric models (Solomon & Qian, 2005).Second, a set of solar irradiance proxies that are commonly used in empirical thermospheric density models: these include F10.7, M10.7, S10.7, and Y10.7 (Tobiska et al., 2008).
Upon inputting the data into the aforementioned pipeline, a unified data set is generated, allowing for targeted queries at specific time points.This data set encompasses all or specific components of the aforementioned inputs at the desired time, accompanied by the corresponding POD-derived thermospheric density, serving as ground truth data.Additionally, users also have the option to access the temporal history of FISM2 and OMNIWeb data, at any given date (if these data are provided).In this way, temporal relationships between space weather and thermospheric density variations can be studied.In our experiments, in order to train our ML model, we used a data set of approximately 38 million POD-derived thermospheric density data points, where we only trained using the same inputs as the empirical models, in order to have a fair comparison.In Figure 2, we show an example of how four solar indices (F10.7,S10.7, M10.7, Y10.7) and two geomagnetic inputs (dDst/dT, Ap) varied during the 2015 St. Patrick Storm, and the resulting thermospheric density behavior, derived from SWARM-B satellite (shown in the fourth row in the figure).As can be seen, the severe geomagnetic storm (i.e., G4 class) that happened between 17 and 19 March 2015 has caused a steep increase in the thermospheric density.As it can be observed, while both geomagnetic indices reached a maximum in that period, however, not all the solar irradiance proxies have reflected this behavior (since they capture different regions of the EUV spectrum).

Machine Learning Architecture
The study utilizes two distinct types of neural network models: NRLMSISE/JB08 inputs and JB08 + OMNI inputs.The NRLMSISE/JB08 input model comprises four layers, each containing 500 nodes.It takes empirical model inputs as its input and maps them to the outputted density.In particular, we use the following inputs for the NRLMSISE-00 ML model: day of the year, year, seconds in a day, right ascension and declination of the Sun, sidereal time, altitude, longitude, latitude, local solar time, Ap daily index, Ap 3-hourly index, F10.7 observed and F10.7 averaged.While for the ML model using the same inputs as JB-08, the same inputs as NRLMSISE-00 are used but instead of Ap indices, dDst/dT values (i.e., variations of the Dst index as a function of temperature variations) are used, and the solar proxies are F10.7,S10.7, M10.7 and Y10.7 (both observed and averaged).In The JB08 + OMNI input model processes the NRLMSISE/JB08 inputs through a sequence of four layers with 500, 500, 500, and 175 nodes, respectively, resulting in 175 outputted features.In a similar manner, the OMNI inputs are also processed through four layers with 500, 500, 500, and 175 nodes, generating 175 additional outputted features.These 350 features (175 + 175) are concatenated and passed through a final processing four-layer network consisting of 500 nodes.This network maps all the inputs to the outputted density.
The NRLMSISE/JB08 inputs model architecture was chosen due to its ability to maintain a rather simple network, without sacrificing too much accuracy.As we will discuss in Section 3, we have compared this architecture with simpler ones (e.g., without hidden layers) and established the most performing one that was still maintaining a low level of complexity.This study is more concerned with demonstrating that a large volume of thermospheric density data allows data-driven methods to build accurate models of thermospheric density.The accurate study of the best architecture to perform thermospheric density prediction was not the main goal of this study, and it is therefore postponed for future studies.Second, concerning the JB08 + OMNI model, the reason why the abovementioned architecture was chosen is modularity.This study looks at the effect of including extra input features, OMNI data, into the model to see how much model performance can be improved.The method of creating a separate network to process and encode different data sources was therefore employed where the processed features are simply concatenated and further processed in a final network.This allows the modularity to add new data sources without significantly increasing the model complexity.
Concerning the ML model training, we use mean squared error as a loss function, normalizing the output (i.e., target density) by taking the natural logarithm of its value multiplied by 10 12 , which results in the following expression where y is the neural network output, while ρ t is the target thermospheric density value.To transform the neural network output into the desired predicted density, one can simply do: (2) This is helpful during training because it has been observed that the logarithm of the empirical density and true density ratio is typically Gaussian (Bezděk, 2007;Doornbos, 2012).Hence, in our case, we are defining the loss as a function of the logarithm of the ratio of the predicted and true density (as it can be easily demonstrated via the properties of the logarithm): this gives the neural network an easier time learning the thermospheric density.
Adam is used as the stochastic gradient descent optimization algorithm (Kingma & Ba, 2014), and Pytorch as ML framework (Paszke et al., 2019).We also performed Bayesian parameter search for the weight decay and learning rate and established a value of 1.37 × 10 −8 for the weight decay parameter, and 5 × 10 −5 for the learning rate parameter.Furthermore, due to the inherent randomness of the stochastic gradient descent algorithm, each run of the training process produces different weight initializations, leading to the optimization algorithm reaching different points of minimum loss in the neural network.To address this, we conducted each experiment five times, with five different seeds, and calculated the average of the metric score over those runs.Therefore, the reported model performance is always based on the average MAPE across these five runs.

Data Set Partitioning, Metrics, and Benchmarking Framework
The data set was divided into training, validation, and test sets.The partitioning algorithm organized the data set into months and allocated them to the respective sets.Each year was arranged into months, with the test month cyclically permuting by 2 for each year.The same process was applied to the validation set.The remaining months were added to the training set.This ensured that the ML models were trained on non-consecutive data, which would highly bias the model, and that the months used for training were not always the same, as this might cause the model to overlook seasonality events.
In terms of metrics, due to variations of thermospheric density values in the low-Earth orbit environment (which ranges from values of about 10 −12 kg/m 3 to values of about 10 −16 kg/m 3 ), reporting root mean squared error does not provide a useful metric to evaluate the models across different altitude ranges.Instead, we use the MAPE: this not only gives a more intuitive interpretation of the loss (since it represents the average percentage difference between predicted and actual values), but it is also a relative error metric, which makes it a scale-invariant.This latter property is very desirable for data sets like the thermospheric density one, where the magnitude of the predicted values can vary by several orders of magnitude.
To compare thermospheric density models, the proposed evaluation includes the following categories.Due to the significant variation in density at different altitudes, the data set is divided into 50 km increments, enabling a fair comparison between models.Additionally, the density can be greatly influenced by different geomagnetic conditions.Consequently, the overall performance of the model across the entire data set can vary based on these conditions.Hence, the model's performance is reported across various subcategories of the test data set.These subcategories are determined by altitude ranges: 200-250 km, 250-300 km, 300-350 km, 350-400 km, 400-450 km, 450-500 km, and 500-550 km.They are also categorized based on geomagnetic conditions, measured by the Ap index: Quiet (Ap 0-15), Mild (Ap 15-30), Minor storm (Ap 30-50), and Major storm (Ap 50+).

Results
Table 1 shows the MAPE from several density models for different subcategories of the test set, split by altitude and storm conditions based on the Ap index.The significance of the results is described below.
• NRLMSISE.The MAPE for the NRLMSISE model.
• JB08.The MAPE for the JB08 model.Notably, overall there is a significant reduction in MAPE versus the NRLMSISE model (40.71 vs. 63.18).This is not always the case, as the NRLMSISE model outperforms at the lowest altitude bracket for example, (200-250 km).• ML(NRLMSISE).This is a two-hidden layer neural network, with 500 nodes for each hidden layer, that is trained on exactly the same inputs as NRLMSISE.This model significantly outperforms the NRLMSISE model.Overall, the MAPE improves by 61% (24.63 compared to 63.18).This massive reduction in error is the first important result of the study.It shows that data-driven ML models can widely outperform semi-empirical physics-based model, even on exactly the same inputs.The drag on satellites is proportional to the density of the thermosphere, through which they travel.Therefore an improvement of 61% in the MAPE represents a 61% improvement in the calculation of the drag on the satellite, which is important for satellite operations.The only case where the NRLMSISE model outperforms the ML(NRLMSISE) model is in the Major storm category (Ap > 50).This is likely due to the relatively small amount of extreme storm data points.• ML(JB08 inputs, single linear layer).This model is a simple model for a point of comparison.It is essentially a linear regression with linear combinations of the JB08 inputs.Unsurprisingly the model does not outperform the JB08 model.• ML(JB08, 1 hidden layer).This model includes a hidden layer of 500 nodes with a rectilinear activation for each transformation except the final layer.This simple model is able to outperform the JB08 model in every single sub-category.Again to reiterate-this is on exactly the same inputs as the JB08 model.It outperforms the ML(JB08 inputs, single linear layer) model significantly.This is because it has access to second order polynomial combinations of the JB08 inputs and so is much better able to model the density.• ML(JB08 inputs, 2 hidden layers).This model is a two-hidden layer model, of 500 nodes each, with rectilinear activations.It is trained on exactly the same inputs as JB08.Importantly, it significantly outperforms the JB08 model at all subcategories, with a reduction in overall MAPE of 39% (24.81 compared to 40.71).Similarly to the ML(NRLMSISE) analysis, this drop in the error on the density has significant implications for users of the JB08 model, because this model can widely outperform it on exactly the same inputs.Notably, it is slightly outperformed by the NRLMSISE model overall, but it has much lower errors in the higher storm categories.Compared to the previous two ML(JB08) models, it improves in all subcategories.Despite still being a fairly basic model, this shows that the mapping from the JB08 inputs to density is made more accurate by including increasingly non-linear transformations.It was noted during training that this improvement disappears once moving to 3 hidden layers or more.It is fully expected that there are better modeling choices for this task, but this is reserved for future studies.• ML(JB08 inputs + OMNI).The architecture for this model is described in Section 2.2.1.The idea is to add more measurements related to geomagnetic activity to see the improvement in model performance using this fuller picture.The inclusion of 48 hr (at 1-hr resolution) of these values results in further accuracy across all sub-categories compared to the ML(JB08 inputs, 2 hidden layers) model.This is an important result since it motivates the use of these measurements in future models.However, the overhead from having to use a significantly larger input data size might make this particular model less appealing.Obviously, learning exactly which extra geomagnetic parameters and at what time window back in time improves the density modeling will be a significant result, but this is saved for future studies.
An alternative way of comparing models is by looking at how the ratio between the observed and model density is distributed.As was observed in past works, this has typically a log-normal distribution, with values generally ranging from 0.5 to 2.0, and with varying variance, depending on the error made by the model (Doornbos, 2012).For a perfect model, one would expect this distribution to have zero variance and to be centered in one.However, this will not be the case, and any empirical, ML, or physics-based model will deviate from this behavior.In Figure 3, we display the distribution for the ML-based Karman model (KML), against NRLMSISE-00 and JB-08 (on the test set).As for the previous experiments, we ran the ML models with five different seeds and constructed a model ensemble, which we refer to as KML.The quantitative results of this model compared to NRLMSISE-00 and JB-08 are shown in Table 1.As can be observed, the ML model has the closest mean and mode to one, and it also has the smallest variance: both these characteristics confirm that the model is the most accurate and precise to predict the ground truth thermospheric density.Furthermore, between the empirical models, we see that JB-08 has a smaller variance than NRLMSISE-00, and it also more closely captures the average ground truth density, compared to the other empirical model.We report all quantitative data of the mean and standard deviation of these models for different altitude bands and geomagnetic storm conditions in Table 2.
Due to the significant results in terms of performance improvements over NRLMSISE-00 and JB-08 empirical models, as an outcome of this work, we release the KML model trained using JB-08 inputs.Furthermore, we also release Karman (Acciarini, Brown, & Baydin, 2023;Acciarini, Brown, Bridges, et al., 2023): an open-source Python package for data-driven thermospheric density modeling.This has been used to collect the data, train the ML models, and benchmark them against empirical models: our hope is that the framework can become a test bench for data-driven thermospheric density modeling, with shared data, metrics, and models.In Figure 4, we show a typical example of an output from the software.As we see from the plots of the first row, the model can return both the mean and standard deviation of the thermospheric density value, thereby providing a tool for quantifying the uncertainty of the prediction.This is achieved by combining an ensemble of neural networks (in our case five), trained with different weight initialization: more details on uncertainty quantification using an ensemble of models and their comparison with Monte Carlo dropout techniques were discussed by the authors in a previous publication (Bonasera et al., 2021).On the left plots, we display the thermospheric density data associated with CHAMP satellite, while on the right with GOCE.As we observe, three models are compared against the POD-derived thermospheric density: JB-08, NRLMSISE-00, and KML model with JB-08 inputs.As displayed in the plots in the second row, the KML model is able to maintain the lowest MAPE over the tested period of time.

Conclusions and Future Work
Thermospheric density variations are the main source of uncertainty for the determination of resident space objects' position and velocity in low Earth orbit.Since the Sun is the main driver of changes in the atmosphere's upper levels, it is important to better model the Sun-Earth system to aid this issue.Traditionally, models based on empirical laws and fitted through data have been used in operational settings.Empirical models use solar proxies and geomagnetic indices to model the Sun's influence, due to the difficulties of obtaining EUV measurements across the spectral bands that affect the atmosphere, which can only be captured from orbit.However, it is well known that these proxies do not fully capture solar activity (Vourlidas & Bruinsma, 2018).In the last 20 years, there has been a growing amount of space missions that have gathered the Sun's EUV irradiance data from orbit (e.g., TIME/d, SOHO, SDO, GOES, etc.).Moreover, advances in ML have allowed the development of neural network models that can very well capture the nonlinear relationships between data, in a black-box and data-driven fashion.
Motivated by the above-mentioned aspects, in this work we develop Karman: an open-source ML and benchmarking framework for data-driven thermospheric density modeling.In our proposed software package, we enable users to develop and train thermospheric density ML models and benchmark their outputs against empirical models (e.g., JB-08 and NRLMSISE-00).In particular, we devise the library in order to support the ingestion of data from disparate sources, ranging from geomagnetic data of NASA OMNIWeb high-resolution service, to irradiance data from solar proxies (e.g., F10.7, M10.7, S10.7, Y10.7) and EUV irradiance empirical models (e.g., FISM2).The data can be queried with an associated temporal history and can be fed into ML models for forecasting the corresponding thermospheric density variations, where the POD-derived density from several satellites (e.g., CHAMP, SWARM-A, SWARM-B, GOCE, GRACE) is used as a target.We show that simple ML models can easily outperform empirical models, using the same inputs, at different altitude ranges and geomagnetic storm conditions.However, it is crucial to ensure that the statistical properties of the training data closely resemble those of the validation and test data for the ML model to generalize successfully.In instances where this condition is not met, our analysis in the Appendix reveals that the ML method may experience a decline in performance.We also release one of these models, which uses the same inputs as JB-08 and nearly halves its accuracy, in terms of MAPE.Moreover, we also highlight that the ingestion of further data (in our case OMNIWeb data) can further reduce the error by a few percentage points, corroborating that a modular framework to support data from different sources such as the proposed one, can become an essential tool for the community.
In the future, we plan to use Karman for the direct ingestion of EUV spectral irradiance data and for investigating whether these data can significantly benefit the neutral density prediction task at high altitudes.Moreover, we also plan to use the framework to investigate the temporal relationship between the Sun activity and the Earth's thermosphere, to better clarify the open questions on how far back in time each phenomenon from the Sun affects the Earth's thermosphere.

Appendix A: Model Inference on Unseen Satellite Data
During the development of neural network models, it is crucial to have a train/validation/test split that avoids the model simplify learns to interpolate data, and to overfit to them.In order to avoid that the model simply learns to interpolate the underlying data (which is acquired along satellites' orbit), in our work we used held-out months of data as test and validation.At the same time, we also permuted the held-out months across different years, to make sure that seasonal effects are well represented during training.In this appendix, the objective is to explore a different train/validation/test split, where the model is trained on four satellites, and then tested in a different and unseen satellite.While this is an interesting case to study, we believe that it is not ideal (and hence its inclusion in the appendix), since it will test the model on a set of data that might be sampled from a different underlying statistical distribution, and neural networks are well known to perform poorly in out-of-distribution data.In fact, the held-out satellite might be at a completely different altitude regime, or it might observe the solar cycle in a different period or during a particular geomagnetic storm, which is not well or at all represented during training.Nonetheless, we decided to perform this experiment, carefully choosing the held-out satellite to avoid out-of-distribution testing: thus, we used SWARM-A data as a test satellite.The choice of SWARM-A was mostly driven by the fact that its density observations are at an altitude range that is covered by the other satellite data.
We employ a similar training setup as the one used for the KML model.In Table A1 we show the MAPE results for the SWARM-A satellite, while Figure A1 shows a normalized histogram for the ratio between the observed density and the model density for the ML model (yellow), the JB08 model (blue) and their NRLMSISE-00 model (red).Furthermore, in Table A2, we also present the quantitative data for the density ratio results, in terms of both overall performance, as well as by different altitude and geomagnetic storm bins, reporting mean and standard deviation of the results.The ML model used was using the same inputs as JB-08.We display in the figure the distribution of the ratio of the modeled and observed data (on the test set).As it can be confirmed both qualitatively and quantitatively, the learned model still manages to outperform both empirical models.The overall MAPE achieved was about 31%, against 38% of JB-08 and 61.89% of NRLMSISE-00.This confirms that even in this train, validation, and test split setup, the ML-based model still manages to have an edge over the empirical models, even when the train, validation, and test split is made using a held-out satellite.
data, thereby providing an alternative model that can outperform state-of-the-art techniques.Furthermore, none has provided an open-source platform or shared data pipeline that facilitates the comparison of different models on the same data sets.Additionally, it remains hard to investigate the potential performance improvements that can be achieved by incorporating additional data into these models, due to the lack of a shared and open-source framework.As a result, progress in this field has been fragmented and lacks a unified approach.Our framework aims to address this gap by providing a centralized platform, enabling the research community to align their efforts and work toward common objectives.

Figure 1 .
Figure 1.Schematic illustration of Karman machine learning workflow.Supported inputs and the expected output are shown.
(from NASA's OmniWeb data service(King & Papitashvili, 2005;NASA, 2023), and indices [from Celestrack(Kelso & Sean, 2010)] and Space Environment Technologies, SET (Kent, 2023)).Some examples of geomagnetic measurements are interplanetary magnetic field magnitude, X, Y, Z-components of the magnetic field vector in GSM and GSE coordinates, SYM/D, SYM/H, ASY/D, ASY/H indices, also including some derived values such as the X, Y, Z-components of the bow shock nose, and many others (a full description of all the supported inputs can be found in the OMNI website).Solar irradiance inputs both in the form of measurements (from the Laboratory of Atmospheric and Space Physics, LASP, of CU Boulder (LASP, 2023)) and indices (from both the LASP datacenter as well as Space Environment Technologies[SET]

Table 1
Final Benchmarking Results, Expressed as Average Mean Absolute Percentage Error Over Five Runs

Table 2
Figure 3. Ratio of Observed Density to Model Density for the KML, NRLMSISE-00, and JB08 thermospheric neutral density models.Mean and Standard Deviation of Ratio Between Observed and Model Densities

Table A1 Mean
Absolute Percentage Error Results for the NRLMSISE, JB08, and ML(JB08) Models While Using SWARM A Density Observations as a Held-Out Test Set Figure A1.Ratio of Observed Density to Model Density for an machine learning (ML) model trained on JB08 inputs, NRLMSISE-00, and JB08 thermospheric neutral density models.

Table A2
Mean and Standard Deviation of Ratio Between Observed and Model Densities