Convergence in simulating global soil organic carbon by structurally different models after data assimilation

Current biogeochemical models produce carbon–climate feedback projections with large uncertainties, often attributed to their structural differences when simulating soil organic carbon (SOC) dynamics worldwide. However, choices of model parameter values that quantify the strength and represent properties of different soil carbon cycle processes could also contribute to model simulation uncertainties. Here, we demonstrate the critical role of using common observational data in reducing model uncertainty in estimates of global SOC storage. Two structurally different models featuring distinctive carbon pools, decomposition kinetics, and carbon transfer pathways simulate opposite global SOC distributions with their customary parameter values yet converge to similar results after being informed by the same global SOC database using a data assimilation approach. The converged spatial SOC simulations result from similar simulations in key model components such as carbon transfer efficiency, baseline decomposition rate, and environmental effects on carbon fluxes by these two models after data assimilation. Moreover, data assimilation results suggest equally effective simulations of SOC using models following either first‐order or Michaelis–Menten kinetics at the global scale. Nevertheless, a wider range of data with high‐quality control and assurance are needed to further constrain SOC dynamics simulations and reduce unconstrained parameters. New sets of data, such as microbial genomics‐function relationships, may also suggest novel structures to account for in future model development. Overall, our results highlight the importance of observational data in informing model development and constraining model predictions.


| INTRODUC TI ON
Soils store more carbon than the atmosphere and vegetation combined (Ciais et al., 2014;Jackson et al., 2017).A small change in soil carbon storage can significantly impact the atmospheric carbon dioxide concentration and the future trajectory of climate.
Substantial research has been conducted to understand the factors underlying the formation of soil organic carbon (SOC) and its persistence.While there is a general agreement that the SOC balance depends on plant carbon input as the source of SOC and organic matter decomposition as the main SOC loss pathway, there are two contrasting paradigms on the regulation of decomposition.The conventional paradigm focuses on chemical recalcitrance and physical protection as the key factors controlling decomposition and, thus, CO 2 emissions back to the atmosphere (Schmidt et al., 2011).A more recent paradigm focuses instead on soil microorganisms and soil carbon stabilization as the key determinants in partitioning carbon inputs between accumulation and loss (Bradford et al., 2016;Cotrufo et al., 2013Cotrufo et al., , 2015;;Tao et al., 2023).These two paradigms are the conceptual foundation of two classes of process-based models used to simulate global SOC dynamics (Table 1).Because these model classes have distinctive structures that reflect different underlying theories and assumptions on soil carbon dynamics (Chandel et al., 2023), large differences in the simulated SOC emerge among models, leading to highly uncertain predictions (Wieder et al., 2018).Diverging simulations of SOC storage and its spatial distributions across the globe hinder a better understanding of the soil carbon cycle and its feedback to climate change (Ciais et al., 2014;Luo et al., 2016;Todd-Brown et al., 2013).
In simulating SOC dynamics, state-of-the-art process-based models following the two paradigms differ structurally regarding soil carbon pool classification, SOC decomposition kinetics, and representation of carbon transfer processes (Table 1).Soil organic carbon can be separated into conceptual pools with different turnover rates that reflect heterogeneity in their decomposition rates.For example, models derived from the Century model (Parton et al., 1987) that center their simulations around the "pool turnover" paradigm (Luo, 2022;Schimel, 2023) differentiate substrates according to turnover times, with labile substrates that cycle rapidly (i.e., active SOC) and chemically or physically protected pools that cycle slowly (i.e., slow and passive SOC).In contrast, recently formulated process-based models that highlight the role of microbial processes define carbon pools as measurable entities that can be validated with field observations (Abramoff et al., 2022)-for example, microbial biomass, dissolved organic carbon, particulate organic carbon, and mineral-associated organic carbon (Table 1).
In representing SOC decomposition, a theory developed back in the 1940s (Jenny, 1941) and consolidated in the 1980s (Parton et al., 1988) portends that organic matter decay in soils follows firstorder kinetics: dSOC dt ∝ − k × SOC, where the loss rate of SOC (i.e., k) is independent of its pool size (i.e., SOC).Therefore, with this formulation, the SOC storage changes over time is proportional to its pool size (Forney & Rothman, 2012).With increasing evidence pointing to soil microorganisms as a key factor in soil carbon dynamics, a newer generation of models has explored the possibility of nonlinearity in SOC decomposition (Allison et al., 2010;Georgiou et al., 2017;Schimel & Weintraub, 2003;Wang et al., 2021) (Table 1).Among various nonlinear functions that can be used to describe decomposition, the Michaelis-Menten kinetics (i.e., dSOC dt ∝ − v ENZ × SOC K + SOC ) considers the interplay between the substrate (i.e., SOC) and the extracellular enzymes (i.e., ENZ) that catalyze the decomposition of organic matter.While not new (Briggs & Haldane, 1925), this formulation is now being frequently used in soil carbon cycle models (Schimel & Weintraub, 2003;Wilson & Gerber, 2021).Specifically, parameter v specifies the maximum SOC decomposition rate at its saturated content for a given enzyme content.The inverse of the Michaelis-Menten constant (K) specifies the enzyme's affinity for its substrate in a catalyzed reaction.
Process-based models also differ in allocating the decomposed carbon to other carbon pools or heterotrophic respiration as CO 2 (Table 1).While soil microbes mineralize SOC into CO 2 through their metabolism, transfers of decomposed carbon from one pool to another could result from either an exclusive effect of microbial processes or an integrative effect of biological, chemical, and physical reactions (i.e., including both microbial and non-microbial transfer).Specifically, when a model explicitly defines a microbial biomass carbon pool, carbon received by this pool is partitioned according to the microbial carbon use efficiency (CUE)-that is, the ratio of carbon assimilated in new biomass over the total substrate carbon uptake (Geyer et al., 2016;Manzoni et al., 2018;Tao et al., 2023).Correspondingly, carbon transfers among different soil compartments that happen without microbial carbon assimilation can be interpreted as results from other biochemical processes (e.g., microbial exudation and mortality) or organo-mineral interactions (Tao et al., 2023).In contrast, for models without explicit representation of microbial biomass and assimilation processes, carbon transfer implicitly integrates the effects of both microbial physiology and other chemical or physical reactions.Depending on the model structure, a range of relations between long-term SOC and microbial traits, such as CUE or carbon inputs to soils, emerge (Georgiou et al., 2017;He et al., 2023;Wutzler & Reichstein, 2008).
In addition to structural differences among varieties of processbased models, parameter values that quantify the strength and represent properties of different processes in the soil carbon cycle also contribute to the uncertainty of model simulations (Luo & Schuur, 2020), especially when they are not well constrained by observations.Most current Earth system models adopt the Centurytype model structure using first-order SOC decomposition kinetics.
Notwithstanding their structural similarity, varying parameter values among different models contribute to the divergent estimates of SOC storage both at the site level and across the globe (Luo et al., 2015;Todd-Brown et al., 2013).Moreover, the same model with different choices of parameter values (i.e., parameterization) could also generate varying patterns between SOC and key model components, such as microbial CUE (Tao et al., 2023) and plant carbon input (Tao et al., 2024).However, choices of parameter values and model structure are not fully independent in affecting model simulation: Different model structures can, in some cases, converge to similar results in the long term via parameter adjustments.For example, the Michaelis-Menten kinetics, when the affinity of the enzyme for its substrate is extremely low, such that the Michaelis-Menten constant is much higher than the substrate concentration ([K]SOC), the nonlinear decomposition kinetics will converge to linear kinetics with respect to the substrate (Lasaga, 1998;Wilson & Gerber, 2021).

TA B L E 1
Major differences in simulating soil carbon cycle among process-based models following two paradigms for SOC loss pathways (see also Figure 1).

Pool turnover-centered paradigm
Microbe-centered paradigm

Carbon pool classification
Carbon pools are conceptually defined turnover time (i.e., average time a carbon compound stays in the soil).These models usually do not explicitly define microbe-related carbon pools such as microbial biomass, dissolved organic carbon, and enzyme Carbon pools are defined by their functions in the soil carbon cycle.These models usually explicitly define microbe-related carbon pools such as microbial biomass, dissolved organic carbon, and enzyme by representing specific microbial processes such as assimilation, catabolism, mortality, and enzymatic reactions

Decomposition kinetics
First-order kinetics.Decomposition rate is only dependent on the donor pool size (i.e., the amount of substrate being decomposed) Microbial explicit kinetics, such as Monod, Michaelis-Menten, reverse Michaelis-Menten, and logistic type kinetics.Decomposition rate is a function of both donor pool size and catalysts

Carbon transfer scheme
Organic carbon is transferred among conceptual pools, and CO 2 is emitted whenever a transfer happens.
Organic carbon is transferred among functionally explicit pools, and CO 2 is emitted only when microorganisms assimilate carbon from substrates in metabolism.

Model example used in this contribution
Community Land Model version 5 (CLM5) (Lawrence et al., 2019) CarbOn cycle and Microbial PArtitioning Soil model (COMPAS) (Tao et al., 2023) F I G U R E 1 Distinctive model structures of CLM5 (a) and COMPAS (b).CWD, coarse wood debris; SOC, soil organic carbon.(Luo et al., 2011).Conventional data assimilation techniques such as the Bayesian inference-based Markov Chain Monte Carlo (MCMC) method have been used at the site level to tune process-based models for better performance in simulating soil carbon cycle (Li et al., 2016;Xu et al., 2006).Recently, the newly developed PROcess-guided deep learning and Data-driven modeling (PRODA) approach (Tao & Luo, 2022)

| Global vertical soil organic carbon profiles
We obtained SOC data in globally distributed soil profiles from the World Soil Information Service (WoSIS) and other data sources.WoSIS compiled soil data, after quality assessment, from soil profiles distributed across 173 countries (Batjes et al., 2020).The 2019 snapshot of the WoSIS dataset consists of 111,380 soil profiles with SOC content information (unit: g C kg −1 soil).We estimated the SOC stock (g C m −3 ) by SOC Stock = SOC Content × BD (Yigini et al., 2018), where BD is the bulk density of soil (g m −3 ).Note that SOC stock was multiplied by 1 − G 100 to account for the volumetric coarse fragment fraction (G, unit: %) at each grid of the global map (data source: SoilGrids, https:// soilg rids.org).When the measured bulk density was absent in the dataset, we used a pedo-transfer function to estimate it (Grigal et al., 1989;Yigini et al., 2018): BD = + × exp(− × OM), where OM is organic matter, calculated as SOC × 1.724, with SOC content in percent (%); α, , and γ are fitting parameters.After fitting data of WoSIS (i.e., 78,913 layers from 16,248 profiles that simultaneously recorded bulk density and SOC content) to this equation, we obtained that α = 0.32,  = 1.30, and γ = 0.0089.The pedo-transfer function explained 55% of the variation in the bulk density.Using the pedo-transfer function does not introduce substantial extra uncertainties in the SOC stock database.At those 16,248 soil sampling sites that recorded bulk density and, thus, SOC stocks, we compared the field measurements with their corresponding values estimated from the pedo-transfer function.The pedo-transferred estimates explained 68% of variation in field-measured SOC stocks.We conducted a t-test to quantify whether the difference between fieldmeasured and pedo-transferred SOC stocks (i.e., pedo-transferred estimates minus field measurements) differ from 0. The results suggested that the mean difference is −0.05 kg C m −3 , but such a small bias was not significantly different from 0 (p-value = .30,df = 78,192, t = −1.03).
In addition, we obtained an additional dataset of SOC stock in permafrost regions, which combined the data from a previous study (Mishra et al., 2020) and the Northern Circumpolar Soil Carbon Database (NCSCD) (Hugelius et al., 2013).This dataset contained 2546 soil profiles with SOC stock (g C m −3 ) information for permafrost regions in North America, northern Eurasia, and Qinghai-Tibet Plateau.Combining this dataset with the WoSIS dataset, in total, we obtained data from 113,926 soil profiles as the raw data.The geographical distributions of all soil profiles are shown in Figure S1.
Not all the soil profiles are used in this study.We pre-processed the 113,926 SOC profiles to ensure the quality of the data before we conducted our analysis.We first excluded SOC profiles with no more than two observation layers or the maximum observation depths of no deeper than 50 cm from this study as such data do not provide enough information on key processes underlying SOC storage.After this screening, we retained 72,377 profiles.
To further examine the suitability of the data for model optimization, we conducted data assimilation for each of the 72,377 SOC vertical profiles with both the Community Land Model version 5 (CLM5) and the CarbOn cycle and Microbial PArtitioning Soil model (COMPAS) using the Markov Chain Monte Carlo (MCMC) method.Model structures of CLM5 and COMPAS are described in Sections 2.2 and 2.3, respectively.The method of data assimilation is briefly described in Section 2.4 below and in detail by Tao et al. (2020).
We used two statistics, that is, Gelman-Rubin (G-R) statistic and Nash-Sutcliffe modeling efficiency (NSE) coefficient, to ensure the quality of model calibration against SOC data along the vertical profiles.We calculated the G-R value (Gelman et al., 2014) for each of the SOC profiles to test the convergence of the site-level data assimilation results after running three independent series of MCMC simulations (see Section 2.6 for details of MCMC).A G-R value approaching 1.0 suggests well-converged data assimilation results.A large G-R value, in contrast, indicates inconsistent data assimilation results from these independent MCMC simulations, and such results may not be trusted.Therefore, we set a threshold of G-R = 1.05 and excluded SOC profiles with G-R > 1.05, with 66,935 profiles remained for CLM5 and 59,476 remained for COMPAS to be included in further analyses.We found that it was more difficult for the independent MCMC simulations to converge when using COMPAS model than using CLM5 in data assimilation, probably because of the nonlinearity and a lack of vertical transport for the mineral-soil carbon part in COMPAS (see Section 2.3).Thus, the final adopted profiles for COMPAS are fewer than those for CLM5.
We used the NSE coefficient (Janssen & Heuberger, 1995) (NSE) to evaluate the effectiveness of retrieving information from observations by process-based models.NSE is expressed as: At the site-level data assimilation, the summation in Equation 1extends to all sampling depths at a given site.A value of NSE close to 1 indicates that SOC distributions with depth can be well captured by process-based models so that information contained in the observations can be retrieved to evaluate processes underlying SOC storage.In contrast, a small value of NSE indicates that the model cannot capture the variability in the data, suggesting that such SOC vertical profiles may not offer enough information on the investigated processes underlying SOC storage.While it is possible that the negative NSE values could also result from the fact that process-based models are still not sophisticated enough to capture extreme irregularities in observations, we set the threshold as NSE = 0.0 to include as many profiles as possible in the analysis.Moreover, the soil profiles included in this study are inclusive to diverse vertical shapes in SOC.For example, for the COMPAS model, 66.2% of the 57,267 profiles show monotonically decreasing SOC stocks with soil depths, 4.4% of them record the highest SOC stock at the middle of the soil depths and 29.4% of them show zigzagged SOC stock with increasing soil depths (Tao et al., 2023).Eventually, only 4% (2209 out of 59,476) and 6% (4004 out of 66,935) of the profiles for CLM5 and COMPAS, respectively, were excluded due to negative NSE values.Moreover, we randomly selected a subset of these excluded SOC profiles to visually cross-check their shapes.We found that the thresholds are effective for controlling the suitability of data.
After all the data pre-processing procedures, we obtained data assimilation results from 62,931 soil profiles for CLM5 and 57,267 soil profiles for COMPAS with which we estimated global SOC storage and its components.Our data pre-processing criteria did not cause significant discrimination against profiles belonging to specific soil orders or ecosystems or different vertical shapes (Tao et al., 2023).
Meanwhile, the coverages of selected soil profiles across multidimensional covariate spaces do not differ much between CLM5 and COMPAS (Figure S2).Thus, the main conclusions drawn from this study are unlikely influenced by our data pre-processing criteria.

CLM5 is the latest version of the land model of the Community Earth
System Model version 2 (CESM2) (Lawrence et al., 2018(Lawrence et al., , 2019)).The soil carbon part of CLM5 centers its simulations around the pool turnover paradigm (Table 1).Similar structures have been widely used in most of the state-of-the-art Earth system models.CLM5 uses conceptual soil carbon pools (i.e., active, slow, and passive SOC), and thus, microbial processes are only implicitly represented in the model structure.Meanwhile, CLM5 adopts first-order kinetics in simulating SOC decomposition.SOC dynamics in CLM5 can be expressed in a uniform matrix equation (Huang et al., 2018;Lu et al., 2020;Luo et al., 2022): This matrix equation has six components (Table S1), including plant carbon inputs (I(t)), carbon input allocation to different pools and depths (B), substrate decomposability (or baseline decomposition rates) (K), carbon transfer efficiency (A), environmental modifier ( (t)), and vertical transport (V(t)).
CLM5 describes seven carbon pools in the soil, including four litter pools (i.e., coarse woody debris (indicated by subscript CWD), metabolic litter (ML), cellulose litter (CL), and lignin litter (LL)) and three soil organic carbon pools (i.e., active (aSOC), slow (sSOC), and passive (pSOC) soil organic carbon pools).Each of the carbon pools is simulated in 20 layers to a maximum depth of 8.4 m.The state of different carbon pools (i.e., carbon stocks) can be expressed as: where each of the seven block elements (i.e., x i (t)) of X(t) has 20 elements to represent the 20 soil layers.In total, CLM5 simulates carbon transfer among 140 pools.Consequently, there are 140 dimensions for vector B of carbon input allocation, matrix K of substrate decomposability, matrix A of carbon transfer from one carbon pool to another, matrix (t) of environmental modifiers, and matrix V(t) of vertical transport.Plant carbon input (I(t)) is a scalar.In this study, parameters (Table S1) that generate the above elements in the matrix equation will be optimized by the PRODA approach.
Specifically, I(t) is allocated to different litter pools in different layers along the soil profile via the allocation vector B. Organic carbon in pool vector X(t) is decomposed following first-order kinetics as described by matrix K: (2) (3) x aSOC (t) x sSOC (t) where k i is independent from the state of its corresponding substrate x i (t).Moreover, we used the environmental modifier (i.e., (t)) to account for the effects of environmental conditions on the decomposition processes.(t) is calculated from functions of soil temperature ( T ), soil water potential ( W ), nitrogen and oxygen availability ( N−O ), and soil depth ( D ).
Organic carbon from any carbon pool is further partitioned by either microbial or non-microbial processes between a receiver carbon pool and CO 2 released to the atmosphere.All these processes can be summarized in the A matrix: where all the block elements in the A matrix (a i,j ) are diagonal matrices with the dimension of 20. a ij represents the carbon transfer fraction from the donor (j) pool to the recipient (i) pool (see carbon transfer flows in Figure 1).Note that CLM5 does not differentiate carbon transfers mediated by microbial processes from those mediated by non-microbial processes (e.g., organo-mineral interactions).Thus, a i,j in Equation 5are integrative values reflecting carbon transfers contributed by both microbial and non-microbial processes.
The transport matrix V of CLM5 is a tridiagonal matrix that describes vertical carbon movement between adjacent soil layers within the same carbon pool via bioturbation and cryoturbation.
At steady state, the analytical solution of SOC stock by CLM5 was calculated as , where the overbars indicate the mean values of related matrices ( (t) and V(t)) and scalar (I(t)) over the period of forcing data.The matrix representation for process-based soil carbon cycle models has been described in detail by Huang et al. (2018), Lu et al. (2020), andLuo et al. (2022).

| Structure of COMPAS model
COMPAS explicitly represents the microbial-driven carbon partitions in soil carbon cycle simulations.In addition to applying Michaelis-Menten kinetics in representing organic matter assimilation and decomposition, COMPAS differentiates soil organic carbon into field-measurable components, such as microbial biomass, extracellular enzyme, dissolved organic carbon, and mineral-associated organic carbon.Thus, we choose COMPAS as the representative model based on the microbe-centered paradigm.Specifically, COMPAS follows the same structure proposed by Allison et al. (2010) for SOC dynamics, which is further embedded within the structure for 20-layered vertical soil profiles.The description of vertical layers was adopted from CLM5.Organic carbon dynamics represented by COMPAS can be expressed by the same matrix framework as shown in Equation 2(Table S2).Yet, COMPAS structurally differs from CLM5 in classifying soil carbon pools, expressing substrate decomposition, and explicitly describing microbial partitioning processes in carbon transfer (Table 1 and Figure 1).Equation 2describes COMPAS with 160 dimensions to represent eight pools in each of the 20 soil layers.Vector X(t) has eight block elements to represent four litter carbon pools (indicated by subscripts CWD, ML, CL, and LL) and four soil organic carbon pools (i.e., dissolved organic carbon (DOC), mineral-associated soil organic carbon (mSOC), microbial biomass (MIC), and extracellular enzymes (ENZ)): Each of the eight block elements (i.e., x i (t)) of X(t) has 20 elements to represent the 20 soil layers.Similarly, there are 160 dimensions for vector B, matrix K, matrix A, matrix (t), and matrix V(t).Plant carbon input (I(t)) is still a scalar as in CLM5.Parameters (Table S2) that generate the above elements in the matrix equation will be optimized by the PRODA approach.
Different from CLM5, organic carbon pools in vector X(t) of COMPAS can be transferred to recipient pools either through microbial-or enzyme-mediated kinetics, or without going through microbial metabolism.These transfers are described by the baseline decomposition matrix K: While all the litter organic carbon pools and two mineral-soil organic carbon pools (i.e., MIC and ENZ) are decomposed following first-order kinetics with constant baseline decomposition rates, the baseline decomposition rates of DOC and mSOC are functions of carbon pool states.Specifically, the baseline decomposition rate of DOC (i.e., the baseline rate of microbial assimilation of DOC) is: . Parameters v max,assim and v max,decom represent the maximum DOC assimilation and mSOC (5) decomposition rates, respectively.K m,assim and K m,decom are the Michaelis constants for DOC assimilation and mSOC decomposition, respectively.
The COMPAS model also explicitly differentiates carbon transfers by microbial processes from those in non-microbial processes.The decomposed organic carbon is either partitioned by microorganisms to microbial biomass growth versus respiration (i.e., according to the microbial CUE), or alternatively, transferred to other carbon pools with a fraction that is not mediated by microbial processes (i.e., non-microbial carbon transfer).All these processes are summarized in the A matrix: Because DOC is always assimilated by the microbes with release of CO 2 (Figure 1), the microbial CUE for DOC ( DOC ) equals a MIC,DOC .In contrast, organic carbon in the metabolic, cellulose, and lignin litter pools is decomposed by microbes following firstorder kinetics to generate CO 2 and grow biomass while a fraction of litter organic carbon is broken down without going through microbial metabolism and, thus, directly transferred to DOC or mSOC.In this case, the microbial CUE for the three litter carbon pools can still be expressed as: , and , respectively.
COMPAS applies the same approach to simulate carbon input allocation (B), environmental modifier (i.e., (t)) and transport matrix V as that used in CLM5.It should be noted that while COMPAS and CLM5 use the same scheme to simulate B, (t), and V, parameter values (Tables S1 and S2) that were used to calculate the above elements in the matrix equation were estimated independently by the PRODA approach.
In calculating the steady state of different carbon pools by COMPAS, Equation 2 can be separated into two equations: one for litter carbon cycle and transport, and the other for mineralsoil SOC cycle, because there is no carbon transfer from mineralsoil carbon pools to litter carbon pools (i.e., a litter pool,soil pool = 0 in the A matrix).Since A, K, (t), and V are all independent from litter carbon pool states (i.e., X), the analytical solution of litter carbon stock at the steady state (SS) can be calculated as For the soil organic carbon pools, the related K matrix is carbon pool state-dependent (see Equation 7).We assumed that there is no vertical transport for mineral-soil organic carbon pools such that litter is added to different soil layers and transported vertically, and then, it is transferred to soil pools that are immobile in that layer.According to a method reported by Georgiou et al. (2017), the steady-state solutions for soil organic carbon pools are: where u S i is the carbon input from litter pools (L j ) to a mineral-soil carbon pool (S i , see Extended Data Figure 3 for corresponding carbon flows for each mineral-soil carbon pool) and is expressed as Note that all the elements with bold font indicate vectors of the corresponding variables or parameters for the 20 soil layers.All the multiplications shown in Equation 9are element-wise operations.

| Inputs and environmental conditions
For both CLM5 and COMPAS, the carbon input for the litter carbon pools (i.e., net primary productivity, NPP) and environmental forcings (e.g., soil temperature and moisture) are from 20 years of monthly model outputs (Table S3) by CLM5 at the steady state using a preindustrial forcing (i.e., I1850Clm50Bgc, from year 1901 to 1920) at 0.5-degree resolution.
We used the 20-year annual mean values of different components in Equation 2 to calculate total soil organic carbon stock at steady state.

| Customary parameter values for model simulations
We compared the model simulation results of CLM5 and COMPAS by (1) applying customary parameter values and (2) the parameter values optimized by the PRODA approach.For CLM5, we applied the parameter values used in its current version (Lawrence et al., 2019).
In the default scheme, most of the selected 21 parameters of CLM5 are constants across the globe, except two carbon transfers that depend on sand content and the parameter controlling plant carbon input allocation that depends on plant functional types (Table S1).
For COMPAS, it is a newly constructed model and thus does not have well-tuned parametrization for global simulation.We applied the global mean values of the selected 23 parameters after site-level data assimilation as the customary parameter values for COMPAS to drive the global simulation.

| PROcess-guided deep learning and DAta-driven modeling (PRODA)
The PRODA approach integrates big data with Bayesian data assimilation and deep learning to optimize soil carbon cycle simulation with process-based models (Tao & Luo, 2022).We used the PRODA approach to optimize both CLM5 and COMPAS at the global scale.Data (8) We conducted Bayesian data assimilation by using the MCMC method for each of the SOC profiles to estimate the parameter values of the process-based models that best-fit model simulations with SOC observations.Because the soil profile data collected from field measurements of soil organic carbon include all components of organic matter (e.g., microbial biomass carbon), we used the sum of modeled mineral-soil carbon pools classified in CLM5 and COMPAS for each layer to be compared with soil profile data at the corresponding sampling layer.
Specifically, at site-level data assimilation, for each SOC profile, we applied an adaptive Metropolis algorithm (Haario et al., 2001) to generate the posterior distributions of a total of 21 parameters of CLM5 (Table S1) and 23 parameters of COMPAS (Table S2) related to six model components with two phases of simulations (i.e., a test run and a formal run).We first conducted a test run assuming uniform distributions for each of the preselected parameters as the proposal distributions (i.e., prior knowledge).The prior ranges of the uniform distributions for each parameter are shown in Tables S1 and S2.The proposal distributions continuously generated a set of parameter values for the process-based models to simulate SOC storage.We then evaluated whether the proposed parameter values should be accepted or not by comparing their model simulation results with SOC observations.In the formal run, we used the accepted sets of parameter values obtained in the test run as the proposal distributions and assumed that all the target parameters are multivariate Gaussian distributed.
We proposed new sets of parameter values and evaluated them to be accepted or not following the same rule in the test run.Unlike the test run, the proposal distributions in the formal run were continuously adjusted according to the newly accepted sets of parameters.
We set 20,000 iterations for the test run and 50,000 iterations for the formal run.Eventually, we controlled the acceptance ratio (i.e., the ratio of accepted sets of parameters out of the total number of iterations) of the formal run between 10% and 50%.We set the burn-in coefficient as 50%, where the first half of the accepted parameter values in the formal run was discarded, and the second half was used to generate the posterior distributions of parameters.
We calculated the mean values of the posterior distributions of parameters as the final estimates of parameter values.We ran three independent series of MCMC for each SOC profile and calculated the G-R statistic to test the convergence of data assimilation results.
The mean G-R values of the target parameters were further calculated as the holistic performance of MCMC for each SOC profile.
The mathematical foundations of Bayesian data assimilation and technical details of the MCMC method were documented by Tao et al. (2020).
It should be noted that the data assimilation was conducted under the assumption that SOC profiles are at steady state (i.e., dX(t) dt = 0).This assumption makes data assimilation computationally more feasible than that under non-steady state (see the non-steady-state data assimilation in Zhou et al. (2013) and Zhou et al. (2015)).While soil carbon stocks in some ecosystems (e.g., agricultural soils) may not be at the steady state because of the concurrent climate change and human activities, previous research demonstrated that such disequilibrium component of the transient carbon cycle dynamics, especially in SOC pools, is minor in comparison with the amount of SOC storage that was developed over thousands of years (Lu et al., 2018).
We included parameters related to both non-microbial and microbial processes (Tables S1 and S2) in the site-level data assimilation and the following global optimization with the PRODA approach.
While we acknowledge that biological processes (and thus their related parameter values) may change in response to external disturbance, in this study, we focus on the long-term spatial patterns of vertically distributed SOC under the steady-state assumption.We used multi-year mean values of a preindustrial forcing (no climate change happened yet) to simulate SOC storage.Therefore, the optimized parameter values should be regarded as long-term averages.
Moreover, we designed a parameter recovery experiment to confirm whether parameters related to microbial processes (e.g., the Michaelis-Menten constants) can be recovered from data assimilation under the steady-state assumption.We randomly chose 200 sites across the world for COMPAS and used prescribed parameter values with different across-site variability to generate a set of synthetic SOC data.The synthetic vertical SOC profile (20 datapoints at the 20 prescribed soil layers in COMPAS) was further used in the MCMC data assimilation to retrieve optimized parameter values.We found a satisfactory agreement between the retrieved parameter values and their prescribed values (e.g., "mm_const_assim" and "mm_const_decom" in Figure S3).For parameters whose prescribed values did not show much across-site variability (e.g., "tau4s1" and "pl1s1" in Figure S3), MCMC method also refrained from assigning them extra variation across sites.The results of the recovery experiment supported the efficacy of using the MCMC method to retrieve optimized parameter values from observations under the steady-state assumption.
We trained a fully connected multilayer neural network to predict the site-level parameter values estimated from data assimilation with a suite of 60 environmental variables (Table S4).We chose variables that represent the climatic, vegetation, edaphic, and geographic conditions at different sites because they are conventionally regarded as the driving factors that regulate the formation and stabilization of SOC (Jackson et al., 2017).Parameters in process-based models quantify the strength of different soil carbon cycle processes and therefore should also have relationships with these environmental variables (Luo & Schuur, 2020).To achieve a better training effectiveness, we first normalized all the environmental variables and parameters to the interval of [0, 1] according to their maximum and minimum values.We then conducted a set of pre-experiments to determine the best configuration of the neural network.The neural network used in the final training consisted of four hidden layers.The node numbers for each hidden layer were 256, 512, 512, and 256, respectively.We used a rectified linear unit (ReLU) as the activation function and a gradient descent optimization algorithm (adadelta) as the optimizer.
The loss function was designed as the multiplication of L1 (i.e., ratio loss (RL): ) and L2 (i.e., mean squared error (MSE): we set a drop-out ratio of 20% for each of the hidden layers.

| Global maps of SOC, residence time, and related model components
Global maps of parameters predicted by the best-guess neural network using the gridded environmental variables were applied to the two process-based models to generate global maps of SOC storage and its related components (i.e., 57,267 sets of site-level data assimilation results for COMPAS and 62,931 for CLM5).In addition, we conducted bootstrapping experiments to quantify the simulation uncertainties of CLM5 and COMPAS after being optimized by the PRODA approach.The original SOC database used by CLM5 and COMPAS was sampled with replacement 200 times and was used to train and validate the neural network.Following a common practice in neural network training, for each bootstrapping, 90% of the data were used as training data, and the remaining 10% were used for validation.The predicted parameter values after neural network training were then applied to CLM5 and COMPAS to simulate SOC stock and its related model components.The uncertainty maps of SOC storage and its related components are shown in Figures S4 and S5.
It should be noted that uncertainties shown in the global generalization by the PRODA approach only quantify the variation of trained neural networks in predicting site-level data assimilation results (i.e., the mean value of parameters' posterior distribution).
Limited by its optimization algorithm (Tao & Luo, 2022), PRODA is not able to consider propagating the uncertainties in parameters' posterior distribution in the site-level data assimilation to the global scale.Developing the next-generation data assimilation approach that can directly integrate process-based models into deep learning algorithms will be the solution to retrieve process understanding and simultaneously address parameter uncertainties in optimization.
We retrieved the system-level carbon transfer efficiency (CTE), plant carbon inputs, allocation of input carbon to different soil layers, substrate decomposability, environmental modifications, and vertical transport from the optimized parameters of COMPAS and CLM5 (Tables S1 and S2) via the PRODA approach.All the six model components investigated in this study are ensembles of processes that were represented by different parameters in the process-based model.Note that all the system-level components discussed in this study are for the soil system that integrates both litter organic carbon and mineral-soil organic carbon.Specifically, we calculated the system-level carbon transfer efficiency as the sum of carbon transfer coefficients along each carbon transformation pathway (i.e., a ij in Equations 5 and 8) weighted by the carbon fluxes over all the pathways in the soil system: where a ij represents the carbon transfer fraction from the donor pool (j) to the recipient pool (i); x j,z is the carbon pool size at depth z (g C m −3 ); k j is the depth-independent baseline decomposition rate (yr −1 ) of the corresponding carbon pool; z represents the environmental modifier at depth z; and Δz is the thickness of zth soil layer.Note that CTE along the carbon transfer pathway from donor pool j to recipient pool i (i.e., a ij ) is weighted by the flux size from donor pool j (i.e., ∑ z x j,z k j z Δz ), which measures the amount of decomposed carbon along the j to i transfer pathway, normalized by the total flux in the soil system (i.e., ∑ j ∑ z x j,z k j z Δz).

A higher CTE value indicates a larger amount of carbon remained
in the recipient soil pool after organic carbon is decomposed or transformed by biological and/or chemical and physical reactions, which, by definition, also associates with less CO 2 released back to the atmosphere.It should be noted that this weighted average transfer efficiency is defined differently from the system CUE in Tao et al. (2023), which was instead calculated as ratio between the sum of carbon fluxes entering the microbial pool and the sum of carbon fluxes leaving the donor pools.
The baseline decomposition rate (unit: year −1 ) expresses the rate of organic carbon decomposition at optimal soil temperature and water conditions.We calculated the system-level baseline decomposition rate (K system , unit: year −1 ) by weighting the baseline decomposition rate of SOC pools by their carbon pool sizes: Similarly, we weighted the vertical transport rate (year −1 ) and environmental modifiers (unitless) at different soil depths (z) by their corresponding sizes of SOC stock (i.e., x z , with unit of g C m −2 ): (10) Carbon input is distributed vertically according to the distribution of root biomass at different soil depths (Jackson et al., 1996).
Therefore, to quantify how effectively the input allocation process distributes litterfall and root exudation to different soil depths, we calculated the fraction of carbon input allocated to soil layers below 5 cm as the system-level index for plant carbon input allocation: where Y z is the cumulative fraction of input carbon at soil depth of D z ; n is the number of soil layers.A larger system-level input allocation index indicates that more carbon from litterfall and root exudation will be allocated to deeper soils.This index differs between models because the parameters describing the vertical distribution of carbon inputs are optimized independently in the two models, even if we used the simulated total litterfall (equivalent to NPP) in CLM5 as the plant carbon input for both models.

| RE SULTS
Process-based models with different structures and customary pa- The two structurally different models simulate similar SOC storage and spatial patterns after being constrained by the same SOC data using the PRODA approach.At the site level, we found that posterior distributions of selected parameters after data assimilation could differ greatly from their customary values (Figure S6) and from site to site.We further used PRODA to generalize the emerging spatial heterogeneity of optimized parameter values in site-level data assimilation to the global scale and found similar SOC simulations by CLM5 and COMPAS.Based on the best-guess neural network predictions that were trained by all available site-level data assimilation results (see Section 2.7 for details), PRODA-optimized CLM5 explains 57% (median 56%, one-sigma confidence interval 53%-57% in 200-time bootstrapping) of the spatial variations in SOC at measured soil depths across the globe (Figure S7a).The predictive performance of COMPAS after PRODA optimization is similar to that of CLM5, explaining 55% (median 53%, one-sigma confidence interval 52.5%-54% in 200-time bootstrapping) of the spatial variations in global SOC observations (Figure S7b).
In simulating global SOC patterns, CLM5 continues to simulate higher SOC storage in the boreal regions than in the tropics.In addition to higher SOC in East Siberian and Alaska, PRODA-optimized CLM5 also identifies western Siberian lowlands as areas holding high SOC storage (Figure 3a,c).Meanwhile, after being constrained by observations, the simulated SOC storage in tropical regions increased to an average value of more than 10 kg C m −2 (Figure 3b,c).
Simulation results by COMPAS after PRODA optimization now fol- Simulations of key components related to SOC storage also converge after the two structurally different models are constrained by the same set of SOC data (Figure 4).We assessed the spatial patterns of six components simulated by the two models: carbon transfer efficiency, baseline decomposition, environmental modifier, carbon input allocation, vertical transport rate, and plant carbon input.
The carbon transfer efficiency quantifies the ratio of decomposed carbon being transferred from one carbon pool to another.
CLM5 and COMPAS represent the carbon transfer efficiency differently (Figure 1).COMPAS explicitly describes microbial CUE that partitions the metabolized organic carbon into microbial biomass accumulation versus respiration and the non-microbial carbon transfer related to the transformation of carbon from one carbon pool to another via organo-mineral interactions (Figure 1b).
In contrast, CLM5 fuses microbial CUE and non-microbial carbon transfer in its structure, such that the related parameters do not differentiate these two processes but integrate their effects in simulations (Figure 1a).Thus, it is not surprising that the values of the carbon transfer efficiencies are in general different between the two models, with higher values predicted by CLM5 compared with COMPAS (Figure 4c).Yet, despite the difference in structure,  10), (d-f) baseline decomposition (K system , see Equation 11), (g-i) environmental modifier ( system , see Equation 13), (j-l) carbon input allocation (B system , see Equation 14), (m-o) vertical transport rate (V system , see Equation 12), and (p-r) plant carbon input (same for both models).Uncertainty maps of these components with CLM5 and COMPAS in a ies have demonstrated that models sharing the same first-order kinetics for SOC decomposition estimated contrasting soil carbon residence time (Wei et al., 2022;Zhou et al., 2018) and age (He et al., 2016;Shi et al., 2020) due to their different choices of parameter values.These differences resulted in large uncertainties in simulating global SOC storage (Todd-Brown et al., 2013).While all these simulations are, to some degree, plausible under given assumptions and theories, we need to identify the most probable F I G U R E 5 Relationship between Michaelis-Menten constants and their corresponding substrate content in COMPAS after being constrained by observational SOC profiles.For decomposition, "Substrate" is mineral-associated organic carbon (mSOC) and K m = K m,decom .For assimilation, "Substrate" is dissolved organic carbon (DOC) and K m = K m,assim .The converged simulations of SOC and its related components demonstrate the fact that although it is impossible to include all the processes in the soil carbon cycle into one process-based model, unresolved processes can be well accounted for in model parameter values at resolved scales after data assimilation (Luo & Schuur, 2020).In this study, COMPAS explicitly describes the microbial CUE that represents the carbon partitioning process in microbial physiology and nonmicrobial carbon transfer that relates to other biological, chemical, and physical reactions driving organic matter transformations in soils.CLM5, however, does not differentiate these two processes in its structure but represents them through aggregated carbon transfer coefficients (see Methods).After being optimized by the PRODA approach, CLM5 simulates similar spatial patterns of the carbon transfer index with COMPAS (Figure 4).Similarly, a previous study reported that a process-based model that does not explicitly couple nitrogen-related processes with the soil carbon cycle can still well represent nitrogen limitation after its parameters were constrained by data (Wang et al., 2022).et al., 2018;Luo, 2022;Schädel et al., 2014;Xu et al., 2016;Zhang et al., 2008).

| Data assimilation identifies most probable decomposition kinetics at global scale
Microorganism-centric kinetics (e.g., Michaelis-Menten kinetics) that considers enzymatic depolymerization has been advocated in recent years to account for the nonlinearity in organic carbon decomposition such that the decomposition rate is a function of both the substrate and the enzyme concentrations.Nonlinear kinetics can help capture the spatial variability of soil carbon dynamics (Wieder et al., 2013) and is necessary for understanding lignin decomposition (Liao et al., 2022) and priming effects (Wutzler & Reichstein, 2008).In this study, our data assimilation results show that, at the global scale, nonlinearity in COMPAS does not necessarily lead to more accurate quantification of SOC storage than CLM5.In fact, after being informed by data constraints, the Michaelis constants in COMPAS were much larger than their corresponding substrate concentrations (Figure 5).
In such a case, the Michaelis-Menten kinetics can be mathematically approximated by a linear structure with respect to its corresponding substrate, but also including a first-order effect of the receiver pool, resulting in a multiplicative kinetics.
It should be noted that diversity in model structures is still necessary for a better understanding of the soil carbon cycle at different spatial and temporal scales.Microbial models with nonlinear structures can be useful for studying complex carbon dynamics at small scales that linear models cannot explain (Liao et al., 2022;Manzoni & Porporato, 2007).Meanwhile, microbial responses to environmental fluctuations are highly nonlinear and can be captured only by modeling specific microbial processes (Brangarí et al., 2020).Moreover, models simulating SOC storage with different structures can perform differently across subregions, suggesting that some structures are more suitable for certain pedoclimatic conditions.For example, we have detected different pat-  data worldwide, the dataset used in this study still has substantial measurement uncertainty in SOC content data (Batjes et al., 2020).
The absence of SOC content information at deeper soils and irregularities of vertical SOC profiles resulting from measurement errors could cause difficulties in data assimilation convergence and parameter optimization to simulate SOC storage accurately (see descriptions in Section 2.1).Thus, higher oversight of quality control and quality assurance is critical to improving prediction and understanding of SOC storage across scales.
Moreover, beyond SOC content data, an array of measurements could be used in the PRODA approach to further improve model predictive ability and inform model development.
Measured carbon pools with clear physical meanings, such as particulate and mineral-associated organic carbon, can help constrain their conceptual counterparts in models (Abramoff et al., 2022;Guo et al., 2022).Meanwhile, time series flux data for the decomposition of different soil carbon pools and isotopes could help better understand decomposition kinetics and varying nutrient limitation mechanisms (Manzoni et al., 2021).In addition to carbon pool and flux data, microbial trait data can inform some model parameters or offer avenues for testing emerging properties such as CUE.For example, data related to microbial carbon use efficiency could constrain carbon transfer-related parameters, but only if measurements represent in situ conditions (e.g., using the 18 O incorporation method instead of adding labile 13 C sources) (Geyer et al., 2019).Moreover, including observations related to vegetation and hydrology dynamics in data assimilation may be more effective in understanding the spatial patterns of carbon input allocation and vertical transport.

| CON CLUS ION
This study highlights the importance of high-quality field-measured data in informing model development and constraining simulations.
⎦, assimilation was first applied at each SOC profile to estimate parameter values.Twenty-one parameters for CLM5 and 23 parameters for COMPAS were optimized for each SOC profile so that the processbased model simulations can best fit local observations.Because we conducted data assimilation independently at each observation site, optimized values of the same parameter vary across space.We further used a neural network to generalize those estimated parameter values after the site-level data assimilation to the global scale.The global parameter maps predicted by the neural network were then used in the process-based models to simulate global SOC storage and retrieve the spatial patterns of related model components across the globe.

TAO
et al.
where para i,true is the ith parameter value optimized in the site-level data assimilation, para i,pred is the ith parameter predicted by the neural network, and N is the total number of parameters of the process-based models to be predicted by the neural network (N = training size × 23 for COMPAS and training size × 21 for CLM5).While both L1 and L2 are additive loss functions, we decided to use their multiplicative composite (i.e., L1 × L2) as the loss function because training with either L1 or L2 loss alone did not yield sufficient prediction accuracy.The batch size for each iteration of optimization was 32.We set a maximum of 6000 epochs to train the neural network and selected the model with the lowest validation loss as the final training result.To avoid overfitting in training the neural network, values show diverging results in representing global SOC storage and spatial patterns.With its customary parameter values, CLM5 simulates much more SOC in the boreal regions than in the tropics.In East Siberia and Alaska, SOC storage is more than 50 kg C m −2 for the first meter, whereas in the Amazon and Congo basins and Indonesia, the average SOC storage is less than 10 kg C m −2 (Figure 2a,c).As COMPAS does not have well-tuned parameter values at the global scale, we used the global mean values of the selected parameters after site-level data assimilation as the customary parameter values.With such customary parameterization, COMPAS simulates distinctively different SOC patterns from CLM5 across latitudes.Tropical regions with the highest carbon input are simulated to store the largest amount of SOC.The average SOC storage declines from more than 20 kg C m −2 in Amazon, Congo, and Indonesia to less than 5 kg C m −2 in boreal regions (Figure 2b,c).The correlation between the simulated spatial patterns of SOC by CLM5 and COMPAS is −0.026 (logarithmically transformed SOC values, df = 45,213, p < .0001).Despite the contrasting spatial patterns, both models reasonably estimate the total global SOC storage with their customary parameter values.CLM5 and COMPAS simulate 1281 Pg C and 1308 Pg C preserved as SOC for the first-meter soils across the globe, respectively.For comparison, as two commonly used observation-based statistical products, HWSD (FAO/IIASA/ ISRIC/ISSCAS/JRC, 2012) and WISE (Batjes, 2016) estimate 1260 Pg C and 1408 Pg C for the global first-meter SOC storage, respectively.
Diverging SOC simulation by structurally different models with customary parameter values.(a) SOC estimated by CLM model, (b) SOC estimated by COMPAS, (c) latitudinal variation in estimated SOC by the two models.
low a pattern similar to that by CLM5.The correlation between simulations by COMPAS and CLM5 is 0.51 (logarithmically transformed SOC values, df = 45,213, p < .0001).Notably, differences still exist in simulating sub-continental patterns by these two models.While both models simulate the highest SOC storage in western Siberian lowlands, Alaska, and Canadian Shield, COMPAS simulates more SOC in the tropics but less SOC in East Siberian than CLM5.The total SOC storage simulated by COMPAS is slightly higher than that by CLM5.Globally, the total SOC storages in the top 1 m of soil estimated by PRODA-optimized CLM5 and COMPAS are 1469 Pg C and 1507 Pg C, respectively.
CLM5 and COMPAS simulate similar spatial patterns of systemlevel carbon transfer efficiency (Figure 4c, Pearson correlation coefficient = 0.52, df = 45,228, p < .001)after being constrained by the same observed SOC dataset.Both models show higher carbon transfer efficiency in boreal regions than in the tropics (Figure 4a,b), which indicates that in boreal regions, more carbon is maintained in the soil system after SOC is decomposed or transformed by biological and/or chemical and physical reactions instead of being released back to the atmosphere as CO 2 .The rate of SOC decomposition is determined by the substrate decomposability (as indicated by the baseline decomposition) and modified by surrounding environmental factors (i.e., soil temperature and moisture).A high baseline decomposition rate indicates the organic substrate is chemically and physically more accessible to soil microorganisms (e.g., simpler chemical compounds or weaker interactions with the soil mineral matrix).In contrast, a lower environmental modifier value indicates that SOC decomposition is more restricted by either low temperature or too much or little soil water.CLM5 and COMPAS assume first-order and Michaelis-Menten kinetics in representing SOC decomposition, respectively.Notwithstanding their different assumptions on the decomposition kinetics, PRODA-optimized CLM5 and COMPAS agree on the highest baseline decomposition rates and the lowest environmental modifier values in boreal regions across the globe (Figure 4d-i).The correlation coefficients between the simulations by the two models are 0.55 (df = 45,228, p < .001)for F I G U R E 3 Converging SOC simulation by structurally different models after data-model fusion by the PRODA approach.(a) SOC estimated by CLM model, (b) SOC estimated by COMPAS, (c) latitudinal variation in estimated SOC by the two models.Uncertainty maps of SOC storage simulations with CLM5 and COMPAS in a 200-time bootstrapping experiment are shown in Figures S4 and S5.Spatial patterns of different model components retrieved by CLM (left column) and COMPAS (central column) models using the PRODA approach.The right column shows comparisons between the model components retrieved from the two models.The model components were: (a-c) carbon transfer efficiency (CTE system , see Equation

|
200-time bootstrapping experiment are shown in Figures S4 and S5.| 13 of 19 TAO et al.baseline decomposition and 0.80(df = 45,228, p < .001)for the environmental modifier.However, not all components we investigated show convergence after data assimilation.Vertical transport quantifies the rate of organic carbon moving from the surface to deeper soil layers.The plant carbon allocation represents the vertical distribution of carbon inputs.While CLM5 and COMPAS adopt identical mathematical functions to describe these two processes (except vertical transport of mineral-soil carbon), no agreement was reached on simulated spatial patterns after the related parameters of the two models were optimized by the PRODA approach (Figure4j-o).Moreover, it should be noted that the retrieved model components using CLM5 and COMPAS are usually far from 1:1 lines even when they are well correlated.While the two models agree well on the magnitude of the simulated environmental modifier (Figure4i), the linear CLM5 simulates higher carbon transfer efficiency values (Figure4c) but lower baseline decomposition rates (Figure4f) than the nonlinear COMPAS.This pattern may occur because parameters related to carbon transfer efficiency and baseline decomposition compensate each other in CLM5 and COMPAS for a similar SOC storage simulation.Even though we used the same plant carbon input (i.e., the total amount of carbon from plant to litter) from CESM2 outputs in simulating SOC storage by the two models (Figure4p-r), COMPAS and CLM5 simulated differently how carbon transfers from litter to mineral soils (Figure1), as quantified by the ratio between the amount of carbon transferred from litter to mineral soils and the total carbon input.COMPAS simulates larger amounts of litter carbon to be transferred to mineral soils than CLM5 (FigureS8), which requires higher baseline decomposition rates in COMAS than CLM5 to reach similar simulated SOC storage, as shown in Figure4d-f.The nonlinear decomposition kinetics in COMPAS can be approximated as first-order kinetics with respect to both donor and receiver carbon pools after being constrained by observed SOC data.Compared with the linear first-order kinetics used in CLM5, COMPAS specifies SOC decomposition and DOC assimilation as nonlinear Michaelis-Menten kinetics.Thus, both the catalyst (i.e., microbes for DOC assimilation and enzyme for mSOC decomposition) and the substrate concentration (i.e., DOC for DOC assimilation and mSOC for mSOC decomposition) regulate substrate decomposition.Mathematically, when the Michaelis constants (i.e., K m,decom and K m,assim ) are much larger (e.g., 100 times larger) than their corresponding substrate concentrations and the catalyst (i.e., DOC in assimilation and MIC in decomposition) concentrations remain stable, the Michaelis-Menten kinetics can be approximated by first-order kinetics with respect to DOC in assimilation and mSOC in decomposition.After data assimilation at each SOC profile using COMPAS, we found that both K m,decom and K m,assim in the Michaelis-Menten equation are more than 100 times that of their substrate concentrations (i.e., SOC and DOC concentrations) for most of the soil profiles (Figure5).Thus, the nonlinear kinetics for enzyme-based mSOC decomposition and microbe-based DOC assimilation can be approximated by first-order kinetics with respect to mSOC and DOC after COMPAS is constrained by globally distributed SOC vertical profiles.While losing the nonlinear character of the donor pool effect, these kinetics laws still retain the effect of microbial biomass or enzyme carbon, resulting in multiplicative kinetics.Data assimilation enables converged SOC simulations by structurally different modelsThe divergent simulations by process-based models with different structures and customary parameter values reflect large uncertainties in the current understanding of soil carbon dynamics with different theories and assumptions.In this study, CLM5and COMPAS structurally differ in classifying soil carbon pools, quantifying SOC decomposition kinetics, and representing carbon transfer processes.The structural differences between these two models contributed to the contrasting SOC spatial patterns across the globe (Figure2).Uncertainties arise also from poorly constrained parameters.Model parameters quantify the strength or represent the properties of different processes in regulating the soil carbon cycle(Luo & Schuur, 2020).When they are not well constrained, differences in parameter values across models can cause additional large simulation uncertainty.Previous stud- understand how the soil carbon cycle responds to a changing climate.Our results show that the vast inter-model uncertainty in simulating global SOC storage is mainly due to the lack of common observational data constraints in major processes.Regardless of their difference in structure, our results show well-converged global SOC simulations by CLM5 and COMPAS after being optimized by the PRODA approach with the same soil carbon observations.The convergence in SOC simulations arises from the fact that the PRODA approach effectively constrains the spatial patterns of parameters of process-based models by the common observational data.Parameters in CLM5 and COMPAS are both conceptually and functionally different from each other due to their structural dissimilarity (e.g., the turnover time values for conceptually different carbon pools and the carbon transfer coefficients in CLM5 and COMPAS; see Figure1and Methods for details).However, the spatial distributions of parameters aggregate into six model components defined in the same way, which exhibit some agreement between the models.Carbon transfer efficiency, baseline decomposition rate, and environmental modifiers have been identified as determinants in explaining the spatial patterns of global SOC storage by process-based models(Tao et al., 2023) (see also FigureS9).In this study, these components show converged spatial patterns despite structurally different models after being informed by observations.In contrast, other model components that are less important for determining global SOC storage (e.g., carbon input allocation and vertical transport) did not converge in the simulations by CLM5 and COMPAS.This difference is probably caused by insufficient information in the data to constrain parameters underlying these specific components (more discussion on this issue in Section 4.3).
terns of SOC storage simulated by CLM5 and COMPAS in boreal (e.g., East Siberia) and tropical regions (e.g., Amazon and Congo Basins), even though the common observational SOC data constrained both models.The Michaelis-Menten kinetics investigated in this study is only one possibility from an array of theories.How other nonlinear kinetics, such as reverse Michaelis-Menten kinetics(Tang & Riley, 2019), perform in simulating SOC at different scales in comparison with linear models requires more studies in the future.

4. 3 |
More and high-quality data required to diminish prediction uncertaintyUncertainty still exists in predicting SOC storage by structurally different models after PRODA optimization (FigureS6).The PRODA approach used in this study reveals the spatial heterogeneity of model parameters after site-level data assimilation.Thus, at the global scale, PRODA optimizes about 1.41 million parameter values (21 selected parameters for each of the 66,935 vertical SOC profiles) for CLM5 and 1.37 million parameter values (23 selected parameters for each of the 59,476 vertical SOC profiles) for COMPAS across observational sites.The posterior distributions of different parameters showed substantial uncertainties after data assimilation at the site level.In an example of data assimilation at one site (FigureS6), while a few parameters can be well constrained by vertical SOC profile data, resulting in narrower posterior distributions than the priors, more than half of the selected parameters had weak identifiability to the observations such that their posterior distribution showed flat shapes within the prior ranges.The identifiability of different parameters is associated with the convergence of their corresponding model components by structurally different models and further affects the final global SOC simulations(Luo et al., 2009).For parameters well constrained by vertical SOC profiles in data assimilation, their corresponding model components (e.g., carbon transfer efficiency, baseline decomposition, and environmental modifiers) also showed similar spatial patterns between CLM5 and COMPAS despite differences in model structures.The revealed spatial patterns of these model components further presented high explanatory power to predict model-simulated SOC spatial patterns across the globe(Tao et al., 2023) (FigureS9).In contrast, for parameters that are less identifiable after data assimilation, different choices of optimized parameter value could lead to similar simulation of SOC storage, causing the so-called equifinality problem.Even simple models constrained by detailed data face this problem(Marschmann et al., 2019).Thus, the spatial pattern of their corresponding components, such as vertical transport and carbon input allocation, did not agree well between CLM5 and COMPAS after data assimilation in different models.Their spatial variability was also less responsible for the predictive accuracy of global SOC simulations (FigureS9).In the future, improved performance of process-based models in simulating the global patterns of SOC storage relies on a better understanding of those key components (e.g., carbon transfer, baseline decomposition, and environmental modifier) and their underlying mechanisms (e.g., microbial carbon use efficiency and organo-mineral interactions).The equifinality problem (or weak identifiability of parameters) imposes challenges to using the optimized models to predict future SOC changes under climate change.In this study, we found that the spatial patterns of vertical transport and carbon input allocation may be less consequential to simulating steady-state SOC storage at the global scale.However, both these processes can influence the physical disconnection of SOC from decomposers, so they could regulate the transient dynamics of SOC in response to climate change, warranting further investigations.Moreover, despite reasonable correlations between results retrieved from the two structurally different models, carbon transfer efficiency and baseline decomposition simulated by CLM5 and COMPAS are numerically different (i.e., not on the 1:1 line in Figure 4).Whether structurally different models after PRODA optimization can also predict converged SOC changes at different temporal scales is still an open question.Higher oversight of data quality control and broader inclusion of other types of observational data related to soil carbon cycle at different spatial-temporal scales are the keys to resolving the equifinality problem and better predictions of SOC dynamics.Our results demonstrated that applying the PRODA approach with observational constraints can effectively realize converged simulations of SOC storage by structurally different models, even if they could generate contrasting simulation results before PRODA optimization.While providing comprehensive and quality-controlled soil at the global scale, namely a linear first-order kinetic model in CLM5 and a nonlinear Michaelis-Menten kinetic model in COMPAS.Our data assimilation results suggest that firstorder kinetics may be the simplest and effective mechanism in ex- Representations of organic carbon decomposition in soils has been debated for decades.In this study, we compared two possible SOC decomposition kinetics