Unlocking the power of machine learning for Earth system modeling: A game‐changing breakthrough

Artificial intelligence (AI) technology has been rapidly reshaping all aspects of our lives since the 1990s. The recent release of ChatGPT in November 2022 represents one of the biggest advancements in AI since AlphaGo won the first-ever game against a human professional Go player in 2015. Machine learning (ML), one of the AI tools, is able to solve complex relationships in a system, handle big data, improve its own efficiency and predictability with more data, learn new knowledge (i.e., unknowns) through deep learning (DL), and evolve with open-access algorithms and data from diverse disciplines (LeCun et al., 2015). These capabilities, coupled with rapid advancements in computing power, have been recognized in all fields of science and engineering, as evidenced by the cascading escalation of their applications. One major reason for the widespread and speedy adoption of ML technology is that the human brain has a limited capacity for comprehending large, complex systems. Scholars and practitioners in the natural sciences have adopted various ML methods to address basic and applied challenges, such as modeling complex causes, processes, and consequences in the Earth system at multiple temporal and spatial scales (Reichstein et al., 2019).
The Earth system is traditionally examined through the construction of simulation models, also known as Earth system models (ESMs; sometimes called terrestrial biosphere models; Sun et al., 2023), built to estimate past, current, and future conditions. ESMs have evolved from suites of simple algorithms (Chen, 2021) to approximations of the complex interactions among components of the Earth system (Fisher & Koven, 2020; Gettelman et al., 2022). ESMs are capable of integrating critical processes from atmospheric science, biogeochemistry, biological systems, human influences, and ecosystem processes to meet the diverse needs of multiple disciplines. Yet, modern ESMs pose unprecedented challenges: the sheer number and types of ESMs (30+ in CMIP6 of the IPCC; Smith et al., 2020), the large number of potential parameters (hundreds to thousands), complex architecture and structure, the difficulty of obtaining accurate parameter values (i.e., parameterizations), large discrepancies among the models and high uncertainty in their predictions, and applications at high spatial and temporal resolutions (Schaefer et al., 2012). An even greater challenge arises from computing time, which hinders their practical use, especially at landscape to regional scales.
Sun and colleagues recently examined whether ML tools could be good alternatives to conventional ESMs for predicting the functions of terrestrial ecosystems (Sun et al., 2023). In their pioneering work, they applied the same 27 input variables to three versions of ORCHIDEE (a major ESM) and to bagging decision trees (an ML tool) to predict terrestrial carbon, nitrogen, and phosphorus at the global scale. They demonstrated that ML reduced the computing demand by 78–80% while delivering predictions of similar or even better accuracy than those of the ORCHIDEE versions. Such reductions are substantial, though computing power may soon no longer be a bottleneck for ESM runs given rapid advancements in computing technology. An equally important finding of the Sun et al. study concerns the contributions of the input variables to predictions of carbon, nitrogen, and phosphorus production. It was especially interesting that ML required only a small number of input variables (20–25), similar to the number ORCHIDEE needed, to reach better prediction accuracy for different components of carbon, nitrogen, and phosphorus (figures 2–4, respectively, in Sun et al., 2023).
Over the past century, quantitative models in climatology, forestry, agronomy, ecology, and other fields have evolved from a few algorithms with a few input variables for predicting system functions to hundreds of equations and thousands of variables so that all processes of interest are modeled (Chen, 2021). Consequently, the number of input variables for ESMs continues to increase despite difficulties in generating reasonable values at the pixel level (a.k.a. tiles).
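As a rough illustration of the surrogate approach, the sketch below trains a bagging ensemble of decision trees to emulate a single model output from a set of input variables. The data, target function, and hyperparameters are hypothetical stand-ins, not the configuration used by Sun et al. (2023); the sketch assumes scikit-learn and NumPy are available.

```python
# Minimal sketch of an ML surrogate for an ESM output, assuming
# synthetic data in place of real ORCHIDEE inputs and outputs.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_inputs = 2000, 27          # 27 input variables, as in Sun et al.
X = rng.normal(size=(n_samples, n_inputs))
# Hypothetical stand-in for an ESM-simulated target (e.g., a carbon flux):
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Bagging fits many decision trees on bootstrap resamples and averages them,
# which reduces the variance of a single tree.
surrogate = BaggingRegressor(n_estimators=50, random_state=0)
surrogate.fit(X_tr, y_tr)
r2 = surrogate.score(X_te, y_te)        # R^2 on held-out samples
print(f"Held-out R^2: {r2:.2f}")
```

Once trained, evaluating such a surrogate is a fast tree traversal rather than a full process-model integration, which is the source of the computing savings discussed above.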
For example, we are often faced with a lack of accurate and dynamic global land cover maps, as well as associated land surface properties, for parameterizing ESMs. Even more challenging is deriving accurate values for all the intermediate variables and their relationships for a specified time and location. Understanding and revising model structure, logical connections, and parameterizations are the major focuses of modelers. While the inclusion of more processes (i.e., more algorithms) has the benefit of integrating more knowledge into the models and meeting the needs of diverse disciplines, there is also value in reducing the number of parameters and algorithms so that an ESM can be applied for various purposes without the need to understand the details of all mechanisms and without access to supercomputers.
We can anticipate an increase in the development of ESMs that can be parameterized and run by non-modelers and applied more broadly, including to ecosystem, landscape, and regional studies where spatial resolutions of 10–25 km and temporal resolutions of less than a month (e.g., daily or hourly) are often needed for location-specific applications (e.g., Zou et al., 2022). Were ML tools applied with a small number of input variables, without the need for in-depth knowledge of detailed processes and algorithms, and with accurate predictions, as demonstrated by Sun et al. (2023) and others (e.g., Pal & Sharma, 2021), our knowledge, resource management, adaptation strategies, and policy making would improve significantly and promptly.
I am particularly excited to see that predicted C, N, and P from MLs do not always match well with those of ORCHIDEE (figure 5 in Sun et al., 2023) and that the ranks of important forcing factors differ among plant functional types (PFTs; figure 6). Among the 15 PFTs, not only do the lists of significant drivers differ, but their ranks in importance also vary greatly. While these results are expected and have been reported in other recent publications (e.g., Irvin et al., 2021), a major takeaway is that our efforts to estimate the values of input variables should be weighted by PFT, prediction purpose (e.g., carbon vs. nitrogen), location, and so forth. These lessons have direct implications for MLs used as alternatives to ESMs.
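The PFT-dependent ranking of drivers can be illustrated with a small synthetic experiment: fit a separate random forest per group and compare feature importances. The two PFTs, three drivers, and responses below are invented purely for illustration and assume scikit-learn is available.

```python
# Sketch: why driver-importance ranks can differ among PFTs.
# Two hypothetical PFTs respond to different (synthetic) drivers.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
drivers = ["temperature", "precipitation", "radiation"]
X = rng.normal(size=(1500, 3))
pft = rng.integers(0, 2, size=1500)      # two hypothetical PFT labels
# PFT 0 responds mainly to temperature; PFT 1 mainly to precipitation.
y = np.where(pft == 0, 3 * X[:, 0], 3 * X[:, 1]) + 0.1 * rng.normal(size=1500)

ranks = {}
for p in (0, 1):
    mask = pft == p
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[mask], y[mask])
    order = np.argsort(rf.feature_importances_)[::-1]  # most important first
    ranks[p] = [drivers[i] for i in order]
print(ranks)
```

Fitting per group recovers a different top-ranked driver for each PFT, mirroring the group-dependent importance rankings reported in figure 6 of Sun et al. (2023).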
In sum, Sun and colleagues provide a unique and promising exploration of the application of ML to the complex, contingent regulation of biogeochemical processes that is the focus of ESMs.
Based on a recent review of the literature, Pal and Sharma (2021) concluded that ML-based techniques can enhance performance, reduce uncertainties, improve parameter optimization, and improve predictions. The need to incorporate MLs into conventional process-based modeling frameworks, such as by improving parameterization or replacing a less-constrained or semi-empirical sub-model, has also been highlighted (Reichstein et al., 2019).
However, there are thousands of ML algorithms, with new ones emerging daily. Traditional ML learns and predicts from passive observations (e.g., random forests, boosted decision trees), whereas DL, particularly in its reinforcement-learning form, focuses on an agent interacting with an environment to learn and take actions that maximize its chance of achieving its goals (Alpaydin, 2020). For example, both recurrent neural networks (RNNs) and graph neural networks (GNNs) have recently gained the ability to grasp temporal and spatial relationships and variable-length data (Hinton et al., 2012; Reed et al., 2021), exactly the cases in ESMs where pixels are spatially and temporally correlated. Here, RNNs and GNNs enable us to connect nodes as a directed graph along a temporal or spatial sequence, thereby addressing the temporal and spatial dependencies of input, intermediate, and output variables in ESMs. Because conventional ML assumes that data are independent and identically distributed, it cannot take advantage of the information available in temporal correlations, nor can it address the consequences of rare, abrupt changes in climate forcings (e.g., extremes) or human disturbances on variables in ESMs. Future efforts will likely expand on Sun et al. (2023) to explore DL tools that may offer more computationally efficient, easier-to-use, and more accurate projections than ESMs.
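To make the contrast with i.i.d.-based ML concrete, the sketch below implements a minimal, untrained Elman-style recurrent cell in NumPy: the hidden state carries information from earlier time steps forward, which is the mechanism that lets RNNs exploit temporal correlations. The dimensions, weights, and inputs are arbitrary illustrations, not a fitted model.

```python
# Minimal Elman-style RNN cell: each hidden state depends on the
# current input AND the previous hidden state, so early inputs
# influence later predictions (temporal memory).
import numpy as np

rng = np.random.default_rng(42)
n_in, n_hidden = 4, 8                    # e.g., 4 forcing variables per step
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

def rnn_forward(sequence):
    """Return hidden states for a (T, n_in) sequence of forcings."""
    h = np.zeros(n_hidden)
    states = []
    for x_t in sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)  # recurrence over time
        states.append(h)
    return np.stack(states)

seq = rng.normal(size=(12, n_in))        # e.g., 12 monthly time steps
states = rnn_forward(seq)
# Perturbing the first time step changes the final hidden state:
seq2 = seq.copy()
seq2[0] += 1.0
states2 = rnn_forward(seq2)
print(np.abs(states2[-1] - states[-1]).max())
```

A conventional regressor applied to each time step independently would, by construction, give identical outputs for the unperturbed later steps; the recurrence is what propagates the early disturbance forward.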

DATA AVAILABILITY STATEMENT
No data were used for this commentary.