Atmospheric chemistry surrogate modeling with sparse identification of nonlinear dynamics

Modeling atmospheric chemistry is computationally expensive and limits the widespread use of atmospheric chemical transport models. This computational cost arises from solving high-dimensional systems of stiff differential equations. Previous work has demonstrated the promise of machine learning (ML) to accelerate air quality model simulations but has suffered from numerical instability during long-term simulations. This may be because previous ML-based efforts have relied on explicit Euler time integration -- which is known to be unstable for stiff systems -- and have used neural networks which are prone to overfitting. We hypothesize that the creation of parsimonious models combined with modern numerical integration techniques can overcome this limitation. Using a small-scale photochemical mechanism to explore the potential of these methods, we have created a machine-learned surrogate by (1) reducing dimensionality using singular value decomposition to create an interpretably-compressed low-dimensional latent space, and (2) using Sparse Identification of Nonlinear Dynamics (SINDy) to create a differential-equation-based representation of the underlying chemical dynamics in the compressed latent space with reduced numerical stiffness. The root mean square error of the ML model prediction for ozone concentration over nine days is 37.8% of the root mean concentration across all simulations in our testing dataset. The surrogate model is 11$\times$ faster with 12$\times$ fewer integration timesteps compared to the reference model and is numerically stable in all tested simulations. Overall, we find that SINDy can be used to create fast, stable, and accurate surrogates of a simple photochemical mechanism. In future work, we will explore the application of this method to more detailed mechanisms and their use in large-scale simulations.


Introduction
Modeling atmospheric chemistry is computationally expensive, which limits its widespread practice and also limits its integration into Earth System Models (ESMs) (Brasseur & Jacob, 2017).Typical atmospheric chemical mechanisms include reactions that are highly nonlinear and have characteristic time scales varying across several orders of magnitude (Logan et al., 1981), resulting in a numerically stiff system of ordinary differential equations (ODEs) (Verwer & Simpson, 1995).The Master Chemical Mechanism (MCM) (Jenkin et al., 1997;Saunders et al., 2003)-the most explicit of such models-contains 5832 species and 17,224 reactions, whereas models used in large-scale simulations contain more simplified mechanisms where chemical species are lumped by chemical bonds (Gery et al., 1989;Zaveri & Peters, 1999) or physical and chemical properties (Stockwell et al., 1997).In all cases above, the resulting system suffers from high dimensionality, meaning it has many state variables that must be repeatedly computed.It also suffers from high stiffness, meaning that the characteristic time scales of different processes vary by orders of magnitude, requiring the system state to be computed at a large number of intermediate time steps to simulate the time-evolution of the system.These two phenomena combine to cause the high computational cost of these models.Here, we will explore methods to reduce the computational cost of an atmospheric chemistry model by reducing its dimensionality and its stiffness.
Previous works have demonstrated the promise of machine learning (ML) for accelerating air quality models.Neural networks (NN) have been proven as a universal function approximator (Hornik et al., 1989) and applied in air quality prediction (Gardner & Dorling, 1998).Boznar et al. (1993) created a multi-layer NN to make short-term (30 minute) predictions for sulfur dioxide (SO 2 ) concentrations around a thermal power plant.Viotti et al. (2002) trained a general form of a three-layer NN for one-hour air pollutant prediction in an urban area.Sousa et al. (2007) combined principle component analysis (PCA) with a three-layer NN to make predictions for next-day hourly ozone concentration.
Despite the promising results in short-term prediction for air pollutants, challenges remain in long-term recurrent time series prediction, including the compounding accumulation of error (Sorjamaa et al., 2007).Increasing modeling capacity-for example by creating deeper neural networks-might increase long-term performance but increases challenges related to overfitting and exploding and vanishing gradients (Kolen & Kremer, 2001;Srivastava et al., 2014).An alternative approach relies on combining theorydriven and and data-driven modeling techniques, which are often considered two distinct fields but can in fact be greatly synergistic and complementary (Reichstein et al., 2019).A growing field, physics-informed machine learning (PIML), aims to use prior knowledge obtained from observational, empirical, physical, or mathematical insights about the world to improve the performance of an ML algorithm (Karniadakis et al., 2021).Knowledge about the source and nature of the dataset, encoded as an inductive bias (Baxter, 2000), can provide a useful constraint on ML models.Keller and Evans (2019) created a random forest-based surrogate model for gas-phase chemistry in the GEOS-Chem.They separated long-lived from short-lived species and imposed conservation of atoms for the NO x family.Despite the success in short-term prediction, exponential error growth occurred for simulation times longer than a few weeks.Kelp et al. (2020) used a recurrent training regime to make predictions on the box model CBM-Z (Zaveri & Peters, 1999), using an autoencoder (Kramer, 1991) to reduce the dimensionality from 101 chemical species into 16 features without significant loss in performance and reached two orders of magnitude speedup.However, this model also suffered from exponential error accumulation for simulations longer than a few weeks.
Variants of NN have been developed to address the limitations with sequential data, such as recurrent neural network (RNN) (Elman, 1990), long-short term memory (LSTM) (Hochreiter & Schmidhuber, 1997), and neural ordinary differential equations (NODE) (R. T. Q. Chen et al., 2019); however, they still share some common shortcomings in that they are uninterpretable "black-box" models, are highly data-intensive (Cabaneros et al., 2019), need tedious trial-and-error experiments on model structure and size (Zhang et al., 1998), and are unable to extract interpretable information and knowledge from the data (Karniadakis et al., 2021).These variants require a hidden state which generally has a much larger dimension than the input, rendering them compatible with inclusion as an operator in a gridded transport model.In addition, training and running a NODE-based model requires iterative numerical integration of the neural network which is prohibitively expensive in large-scale systems (Djeumou et al., 2022).
Alternatives to "black-box" methods include parsimonious "white-box" modeling to describe experimental data (Purnomo & Hayashibe, 2023), as parsimonious models aim to balance predictive accuracy with model complexity, thus preventing overfitting and promoting interpretability and generalizability (Brunton et al., 2020).Schmidt and Lipson (2009) developed a genetic programming-based approach for detecting conservation laws from experimentally collected data without any prior knowledge.Quade et al. (2016) demonstrated how symbolic regression can be applied to identify the future state of dynamic systems.Finally, Brunton et al. (2016) developed a sparse regressionbased method to identify a nonlinear dynamic system, which is the approach we will use here owing to its successful previous application to a wide range of problems (Lai & Na-garajaiah, 2019;Hoffmann et al., 2019;Wang et al., 2021;Jiang et al., 2021;Pasquato et al., 2022).
In this work, we create a machine-learned surrogate model to make fast, stable, and accurate predictions of air pollutant concentrations without the exponential error growth observed in previous work.We achieve this by (1) reducing the dimensionality through singular value decomposition to create a compressed low-dimensional latent space consisting of latent species with high interpretability and (2) using sparse identification of nonlinear dynamics (SINDy) to train a model to represent the underlying chemical dynamics in the compressed latent space (Figure 1).To our knowledge, previous works have either reported numerical instabilities or have not included experiments that quantify numerical stability, and we are the first to report numerical stability over long simulation periods under a wide range of conditions. 2 Materials and Methods

Dataset Preparation
We aim to emulate a simple photochemical mechanism (Sturm & Wexler, 2020, 2022) which contains 11 chemical species and 10 essential reactions related to ozone (O 3 ) formation: nitrogen oxides (NO x ) chemistry, volatile organic compound (VOC) chemistry, formation of a peroxy radical from VOC chemistry, and further reaction of the peroxy radical with nitrogen monoxide (NO) to form nitrogen dioxide (NO 2 ) and hydroxyl radical (OH) (Figure 2; Table S1).
We use Sobol sampling to generate 3750 sets of random initial conditions including temperature, pressure, emission strength, and time phase for time-varying emissions, and a starting time for calculating the evolution of the solar zenith angle.We randomly sample the initial concentrations of the chemical species following Sturm and Wexler (2020), as shown in Table S2.For each emitted chemical species, the emission flux is set as the  S1.
sum of two components: the first component is a constant background emission, and the second component is a sine function with a period of 24 hours to give the emissions a diurnal pattern (Equation 1): where E(i) is the total emission strength of species i, e(i) the randomly-sampled emission strength of species i, t 0 (i) is the starting time (or phase shift) of the emission of species i, and t is time in hours.To prevent the chemical species from accumulating in the model box, we remove O 3 , HNO 3 , HO 2 , NO 2 , CO, and H 2 from the model box at a constant rate (Table S2).Consistent with our goal of qualitatively representing atmospheric chemical dynamics so that we can study methods for surrogate modeling, we choose the removal rates empirically to prevent accumulation, rather than measuring or theoretically deriving them.
We partition the 3750 sets of initial conditions into training, validation, and testing sets with a ratio of 8:1:1, using 3000 of the randomly sampled initial conditions to conduct simulations over 72 hours to create a training dataset of 3000 cases and using the remaining two sets of initial conditions (each containing 375 cases) to conduct simulations over 240 hours to create a validation and a testing dataset.(To test generality and numerical stability, our test simulations are for longer periods than our training simulations.)Solutions are saved every 60 minutes of simulation to create our training and testing time series.To avoid initial transient conditions and unrealistic system behavior (Jacobson, 1997;Glynn, 2005), we discard the first 24-hours of data from the training, validation, and testing datasets.

Dimensionality Reduction
As discussed in the introduction, a main cause of the computational intensity of an atmospheric chemistry model is its high dimensionality-the large number of state variables that must be computed at each time step.Therefore, one way to reduce the computational cost of our model is to reduce its dimensionality.There is also a secondary benefit of dimensionality reduction, which is that lower-dimensional systems are easier to fit machine-learned models to (Bellman, 1966;Ayesha et al., 2020;Vong et al., 2019).
Here we use an explicit and automatic approach based on principal component analysis (PCA) (Jolliffe & Cadima, 2016) to reduce the dataset dimensionality.We standardize the dataset by subtracting its mean value over the simulation period across all cases and perform singular value decomposition on the standardized data.We calculate the explained variance ratio (Maćkiewicz & Ratajczak, 1993) which quantifies how much of the total variance in a dataset is accounted for by each singular value decomposition (SVD) mode.In our case, each SVD mode is a linear combination of chemical species, so we refer to our SVD modes as "latent chemical species", which are groups of chemical species lumped automatically during the dimensionality reduction.Each latent species represents a combination of the original species that is uncorrelated with all other latent species, with each latent species sequentially representing as much variance within the training dataset as possible.Therefore, to reduce the dimensionality of our model, we can choose a subset of the latent chemical species that explains a sufficient fraction of the overall variance in the training dataset and discard the remaining latent species.
We select the number of the dimensions to which the dataset is compressed, take the same number of columns, or singular vectors, from the left singular matrix, and multiply the dataset array by the selected singular vectors.The resulting product array is the representation of the system state in our latent space, where each dimension records the evolution of a latent species that is a linear combination of original chemical species.To reconstruct the original species concentrations from the latent representation, we conduct the reverse transformation by multiplying the latent space array with the transpose of the same singular vectors.Reconstruction loss inevitably occurs during such reverse transformation, but it can be negligible if the latent species explain a sufficiently high fraction of the overall variance in the dataset.
For public health and regulatory purposes, a specific group of air pollutants including ground level ozone is most important (Weschler, 2006), while other species such as the intermediate radicals are largely included in models to help to improve the prediction of the important pollutants (Kelp et al., 2020).Therefore, to "focus" our model on predictions of ozone, we increase the importance of ozone in our latent species by multiplying its concentration by a predetermined coefficient (β) before singular value decomposition to artificially enlarge the fraction of overall variance explained by ozone.After compressing the original dataset into a latent space, latent species containing ozone will be prioritized by a factor of β.To reconstruct the ozone concentration, we divide it by the coefficient β after the reverse transformation.The optimal value of β is determined through hyperparameter optimization as described in Section 2.5.

Stiffness Reduction
The definition of ODE stiffness is ambiguous but can be characterized by dynamics with widely separated time scales (Kim et al., 2021).In atmospheric chemical mechanisms, it means some chemical species evolve much more rapidly-or some chemical processes are much faster-than others.Adaptive timestep numerical methods have been developed to capture both fast and slow dynamics, but choosing proper timesteps is computationally expensive as it includes heavy computational tasks such as calculating the Jacobian matrix (Huang & Seinfeld, 2022).Here we encourage our surrogate model to be less stiff than the original by training it on datasets of one-hour timesteps.Such selection of time step prioritizes the model's identification of relatively slow, extended chemical processes by smoothing out the fast, instantaneous ones related to species of short lifetimes (such as intermediate radicals).Prioritizing non-stiff dynamics during training results in surrogate models that are less computationally expensive.

Sparse Identification of Nonlinear Dynamics
We create a surrogate model based on the Sparse Identification of Nonlinear Dynamics (SINDy) (Brunton et al., 2016).SINDy is a data-driven method for identifying the governing equations of a dynamical system based on observations of the system's dynamics.It is based on the assumption that the dynamics of a given system can be represented as a linear combination of a small number of candidate function terms, each multiplied by a coefficient that represents the strength of that term in the dynamics.In the present case, the kinetic dynamics of an atmospheric chemical mechanism can be expressed in the form of a set of ordinary differential equations (ODEs): where ċi is the time derivative of the concentration of chemical species i (its net reaction rate) and f (c i (t)) is the right-hand side (RHS) of the ODE which is defined by the law of mass action.We assume the underlying dynamics in the ODE can be represented by: where C = [c 1 , c 2 , . . ., c n ] is the vector of the states, Ċ is the vector of time derivatives of the states, Θ(C) = [θ 1 (C), θ 2 (C), . . .θ n (C)] is composed of column vectors of candidate function terms that can define the dynamic system, and Ξ = [ξ 1 , ξ 2 , . . .ξ n ] is a sparse matrix composed of column vectors of coefficients representing the strength of the active candidate function terms.
A sparsifying algorithm is used to solve the regression problem Ċ = Θ (C) Ξ, where Ċ is a matrix of training data, to obtain the sparse coefficient matrix Ξ.In this problem, each row of Ċ is an observation of a rate of change and each column is a state variable of the system of interest.Each row of Θ (C) corresponds to observation in Ċ and each column represents a candidate function term.Each row of Ξ corresponds to a candidate function term and each column represents a state variable.(Figure 1 contains a graphical overview.)In SINDy, sequentially thresholded least squares iteration (STLSQ) (Brunton et al., 2016;Martensen et al., 2023) is implemented by iteratively solving a sparse regression problem on the training data and the candidate function library, and coefficients less than a chosen threshold are set to zero, resulting in a sparse coefficient matrix Ξ.A common goal of SINDy is to discover equations governing a data-generating process.Here, we instead use it to discover a system of equations that is less computationally expensive than the reference ODEs we use to generate the training data.

Training Procedure
We use SINDy to train the surrogate model to emulate the latent variables compressed from the reference model.We define and posit a library of candidate function terms that could appear on the right-hand-side (RHS) of the governing equations, including chemical species emission rates, removal rates, polynomials of latent species concentrations, solar zenith angle, pressure, and temperature (Table S3).We use the sequentially thresholded least squares method implemented in the Julia (Bezanson et al., 2017) library DataDrivenDiffEq.jl(Martensen et al., 2023).
SINDy attempts to find a model that explains rates of change in the training dataset, and does not guarantee that the resulting model will be numerically stable when used in a time-series simulation.In practice, we find that our SINDy models do sometimes become numerically unstable, but we are able to remedy this by adding a buffer terma higher-order polynomial term with a small, negative weight coefficient (ϵ)-to each equation in our SINDy-discovered model as suggested by Hirsh et al. (2022).If the library of candidate functions includes polynomial terms up to order n, we add a term −ϵx n+1 i if n is even, or −ϵx n+2 i if n is odd (Figure 4).Moreover, we construct our model in a manner that allows us to use modern numerical integration methods (Section 2.6) rather than  (Kelp et al., 2020).
To improve the model performance and generalizability, we optimize the hyperparameters using random search (Bergstra & Bengio, 2012) in Hyperopt.jl(Bagge Carlson, 2018).Hyperparameters include the threshold of the least square iteration (λ), the weight coefficient of ozone (β), and the weight coefficient of the buffer term (ϵ) (Table 1).We set the optimization metric as the root mean square error (RMSE) between the true value and model prediction on the validation dataset.After finishing hyperparameter optimization, we repeat the whole training procedure using the selected set of hyperparameters and evaluate the model performance on the separate testing dataset.

Computational Speed
We compare the speed of the surrogate model and the reference photochemical mechanism for simulating 3000 cases on randomly sampled initial conditions over 48 hours.We calculate the total simulation time and the time required per integration time step for each model.
We create a symbolic reaction network system for the reference mechanism using the Julia package Catalyst.jl(Loman et al., 2022) and ModelingToolkit.jl(Ma et al., 2022).
After training the surrogate model, we rebuild it using the ModelingToolkit.jl(Ma et al., 2022).We convert the reference model into an ODE system and solve them using the Rosenbrock23 (order 2/3 L-Stable Rosenbrock-W, an ODE solver for stiff problems) solver, and convert the surrogate models into ODE systems and solve them using Tsit5 (Tsitouras 5/4 Runge-Kutta method, an ODE solver for non-stiff problems) solver in Dif-ferentialEquations.jl (Shampine & Reichelt, 1997;Tsitouras, 2011;Rackauckas & Nie, 2017, 2019).We measure and evaluate the model performance using BenchmarkTools.jl(J.Chen & Revels, 2016).We run all simulations on a single thread of an Intel Xeon Gold 6248 CPU core.

Latent Space Interpretation
Here we explore the interpretability of our latent chemical species.The original training dataset has a dimension of 11 chemical species.We calculate the fraction of total variance explained as the function of the number of latent species (Figure 3A).Starting from one latent species, the fraction of total variance increases as additional latent species are added and reaches around 100% at four latent species.Latent species after the fourth one can be considered redundant as the fraction of total variance explained by those species is near zero.From the perspective of keeping the number of variables low without losing chemical details, we consider three latent species to be sufficient to represent the dataset generated from the reference chemical system, as three principal components can represent over 85% of the total variance.In the reference chemical mechanism, HO 2 , H 2 O 2 , O 3 , and NO 2 correspond to the four most important latent species, as they outweigh the other chemical species in the singular vectors (Figure 3B).CO and HCHO are allocated across latent species 6 and 7, while HO, NO, O, and H 2 are mapped to latent species 8, 9, 10, and 11, respectively.In the mechanism, O 3 is the species we give extra weight to, and NO 2 is an emitted species and plays a significant role in both the production and destruction of O 3 .H 2 O 2 and HCHO are also emitted and products of their decomposition reaction participate in the interrelated processes of ozone formation.Species such as CO, HNO 3 , and H 2 have low importance because they do not react once being produced (they just accumulate in the model box) and they have a high correlation with those important reactant species.Therefore, as four latent species account for almost 100% of the total variance explained, our latent space representation suggests that HO 2 , H 2 O 2 , O 3 , and NO 2 can represent the reference chemical system while other species such as H 2 can be treated as redundant for the purpose of explaining the variance in the system.

Model Interpretation
Following the above-mentioned training procedure, we obtain interpretable surrogate ODEs with sparse representation (Figure 4).The concentration of the first latent species (c 1 ) is approximately reversely mapped to HO 2 .The product of pressure and temperature, emission of HCHO account for 46.6%, 41.5% of the total variance in the time derivative of c 1 , respectively.The second latent species (c 2 ) can be considered as a linear combination of H 2 O 2 and O 3 .The solar zenith angle, emission of H 2 O 2 , and emission of NO 2 account the most of the total variance (40.2%, 29.8%, and 10.5%) in its derivative.Similarly, the third latent species (c 3 ) is dominated by O 3 and H 2 O 2 , and its time derivative is highly dependent on the product of pressure and temperature (73.1%) and the product of pressure, temperature, and the concentration of the first latent species (17.7%).

Surrogate Model Accuracy
In this work, we focus on creating a chemical mechanism surrogate model that can make accurate predictions over a nine-day long period without exponential error accumulation.Here we choose to focus on predicting ozone concentration, given its impor-   S4) for each latent species.ci is the concentration of latent species i, c(name) is the concentration of chemical species name, P is pressure in atm, T is temperature in K, and S is solar zenith angle.
tance in relation to public health and regulation and the limited set of chemical species in our reference model, but the methods here could be used to prioritize a different pollutant or set of pollutants just as easily.We optimize the ozone weight coefficient (β) by random search as described in Section 2.5.
The overall RMSE of the ML model prediction on the testing dataset for O 3 concentration over nine days is 0.034 ppm, as compared to a 0.090 ppm root mean O 3 concentration across the simulations.We then sort the 375 testing cases by RMSE between the reference model prediction and the surrogate model prediction for O 3 and visualize the three each of the best, median, and worst cases as representatives to illustrate model performance (Figure 5A).The model also makes simultaneous predictions for the other species (Figure S1 to S10).The model prediction trajectories for the best three simulations follow the diurnal cycle correctly and have RMSEs for O 3 of 0.0078, 0.0095, and 0.0099 ppm, which are 8.67%, 10.5%, and 11.0% of the RMS O 3 concentration (0.090 ppm) in the testing dataset.Similarly, the RMSEs in the three median cases are all 0.026 ppm.In the worst testing cases, the trajectories are periodic, but amplitudes briefly spike to values much larger than the true function.Although the highest RMSE is 0.075 ppm, the trajectories of the worst cases do not suffer from exploding gradients and the error does not accumulate during the simulation period (Figure 5B), which represents an improvement over previously published results (Kelp et al., 2018(Kelp et al., , 2020)).
Considering the error percentiles for the testing cases (Figure 5B), most testing cases (90%) have errors less than 0.05 ppm averaged over the simulation time steps.The remaining cases, as illustrated in the 90% to 100% distribution, have maximum errors larger than 0.3 ppm at the time steps after the first day.Despite the larger absolute error of these minority cases, the surrogate model remains numerically stable in all the testing cases.As the model is only trained on a small training dataset of two-day simulations and tested on a testing dataset of nine-day simulations, we conclude that the surrogate model captures the underlying system dynamics in most cases over a long period, but it might not generalize to conditions that were not included in the training data.

Computational Speed
We calculate the computational time using the Rosenbrock23 ODE solver for 3000 simulations over 48 hours of simulation time for both the reference chemical mechanism and the SINDy-based surrogate model.The total elapsed time for running 3000 simulations on the reference model is 83.7 s, and the time spent on the surrogate model with one, two, three, and four latent species are 2.9 s, 6.6 s, 10.4 s, and 27.0 s (reaching a speedup of 29×, 13×, 8×, and 3×, respectively).The surrogate models take fewer integration time steps.The reference model requires 976 steps on average for the ODE system integration, while the surrogate model with one, two, three, and four latent species requires 149, 339, 386, and 465 steps on average.
As discussed in Section 2.3, our surrogate model is expected to be less stiff than the reference model, thereby allowing us to use solvers designed for non-stiff problems.Therefore, we also calculate the computational time using the Tsit5 ODE solver for 3000 simulations over 48 hours of simulation time for the surrogate model.For surrogate models with one, two, three, and four latent species, the total computational time are 2.2 s, 4.7 s, 7.5 s, and 20.3 s (reaching a speedup of 38×, 18×, 11×, and 4×, respectively), and the average numbers of integration timesteps are 31, 71, 82, and 106.Computational speed ratios for the reference model vs. ML models are shown in Figure 6.While surrogate models with smaller latent spaces are faster, there is a balance between the computational speed and the model performance (Figure 6).Starting from one latent species, the RMSE decreases with more latent species, while the computational time increases approximately proportionally to the number of terms on the RHS of the surrogate ODE(s), which increases nearly quadratically with the number of latent species (Figure S11 to S14).Interestingly, adding the fourth latent species worsens the prediction of O 3 , probably due to the extra stiffness and redundancy that it adds, as the fourth latent species is not highly correlated to O 3 (Figure 3B).To meet the criteria of creating the fastest and most accurate model, we select the model with three latent species as a trade-off that is 11× faster and has 12× fewer integration timesteps than the reference model.

Discussion and Conclusion
In this work we have explored the feasibility of the SINDy approach to create surrogate models for atmospheric chemical mechanisms.Our SINDy-based model is numerically stable over the nine-day testing period under various environmental conditions.To our knowledge, previous works have either reported numerical instabilities or have not included experiments that quantify numerical stability, and we are the first to report numerical stability over long simulation periods under a wide range of conditions.Our surrogate model achieves a 11× speedup compared to the reference mechanism in terms of the total integration time and requires 12× fewer integration timesteps when using the fastest ODE solver for each system, and each timestep is 3.2× faster when both models use the same ODE solver.The surrogate model achieves such speedup because (1) the ODE system in the SINDy-based model has its dimensionality reduced to three state variables, and fewer state variables mean fewer computations; (2) we select a small and concise candidate function library based on prior understanding of the reference model so that the surrogate ODEs are highly simplified; and (3) because we use training data saved at one-hour intervals, the dynamics of the latent state variables all have similar characteristic time scales, allowing the surrogate model to be less numerically stiff than the reference model and to be solved by faster non-stiff ODE solvers such as Tsit5 (Section 2.6).
We reduce the dimensionality of the dataset from 11 chemical species into an interpretable latent space with three latent species.These latent species are linear combinations of chemical species that are either critical in the O 3 formation and destruction cycle or related to the pollutant emissions.The surrogate model representation is also a human-interpretable system of differential equations.As discussed above, the time derivatives of the latent species are highly dependent on the meteorological conditions (atmospheric pressure, temperature, and solar zenith angle) and emissions, indicating its potential ability for simulations under various environmental conditions.
The results and findings herein represent certain step forward from previous literature; however, there is still more work to be done to achieve the full potential of this approach.First, the selection of terms for inclusion in the candidate function library is important as the performance of the SINDy-based surrogate model is sensitive to the choice of terms.This requires a deep prior understanding of the underlying dynamics in the reference chemical mechanism-a slight difference in the selected library terms might lead to substantially different surrogate dynamics which suffer from instabilities and failures.This difficulty may be exacerbated for larger-scale models, which contain more complex sub-mechanisms or multi-phase mechanisms such as inorganic and particulate matter physics and chemistry (Riemer et al., 2009), leading to more complicated dynamics than in the reference model used here.However, the difficulty in choosing candidate terms may be mitigated by applying more robust variants or extensions of SINDy that are less sensitive to the selection of candidate functions such as E-SINDy (Fasel et al., 2022), which uses bootstrapping techniques among the constructed library terms.Second, some cases in the surrogate model simulations have occasional concentration spikes.We are able to prevent these from becoming exploding errors by using buffer terms as explained in Section 2.5, but we are not yet able to prevent the spikes completely.This could be mitigated by using SINDy variants such as weak-SINDy (Messenger & Bortz, 2021) which is trained on time-series integrations rather than individual time steps.
In future work, we will scale up from the simplified mechanism used here to create compressed versions of the chemical mechanisms typically used in atmospheric chemical transport modeling, such as CB-6 (Yarwood et al., 2010), SAPRC (Carter, 1990), the GEOS-Chem mechanism (Bey et al., 2001) and even the Master Chemical Mechanism (Jenkin et al., 1997;Saunders et al., 2003).We will also expand our surrogates to cover aerosol dynamics and other atmospheric phenomena.By implementing the aforementioned methodology, we expect to obtain a larger speedup factor when creating surrogate models for a larger chemical mechanism as we hypothesize that larger chemical mechanisms include variables which are more strongly correlated with each other than exist in the small-scale mechanism used here.

Open Research
The source codes for dataset generation, model training, and plots are available through Zenodo (https://zenodo.org/records/10465785).

Figure 1 .
Figure 1.Overall schematic for creating SINDy-based ML surrogate models of an atmospheric chemical mechanism: (1) Run reference photochemical mechanism to generate reference data (Section 2.1); (2) Reduce the dimensionality by compressing data into a latent space to reduce the computational cost of the surrogate model (Section 2.2); (3-4) Create a surrogate model based on the Sparse Identification of Nonlinear Dynamics (SINDy) (Sections 2.4 and 2.5); (5) Transform data from latent variables back to original variables; (6) Test performance (Section 3.3).

Figure 2 .
Figure 2. Reference photochemical mechanism used for surrogate training.Black arrows from species to reactions (brown dots) indicate reactants, and are labeled with their input stoichiometry.Black arrows from reactions to species indicate products, and are labeled with their output stoichiometry.Reactions are also shown in TableS1.

Figure 3 .
Figure 3. Latent space variance and species mapping.(A) shows the total variance explained by latent species, and (B) shows the mapping relations between chemical species and latent species. where

Figure 4 .
Figure 4. Surrogate model equations.Each term on the RHS is ordered and colored according to the fraction of variance it explains in training dataset (TableS4) for each latent

Figure 5 .
Figure 5. Surrogate model performance.(A) Reference (black solid line) and the surrogate (red dashed line) trajectories for three each of cases with lowest ("best"), median, and highest ("worst") RSME respectively.(B) Absolute error percentiles for surrogate model testing simulations where the shaded area is the fraction of simulations that has a lower absolute error than the value in the legend.

Figure 6 .
Figure 6.ML model computational speed and test set RMSE for ozone as functions of the number of latent species.The red line compares the speed of the reference and ML models using the Rosenbrock23 solver; the blue line compares the reference model with the Rosenbrock23 solver to the ML models with the Tsit5 solver.(The reference model is too stiff for use with the Tsit5 solver.)

Table 1 .
Sampling space of hyperparameter optimization