The Terrestrial Biosphere Model Farm

Abstract

Model Intercomparison Projects (MIPs) are fundamental to our understanding of how the land surface responds to changes in climate. However, MIPs are challenging to conduct, requiring the organization of multiple, decentralized modeling teams throughout the world running common protocols. We explored centralizing these models on a single supercomputing system. We ran nine offline terrestrial biosphere models through the Terrestrial Biosphere Model Farm: CABLE, CENTURY, HyLand, ISAM, JULES, LPJ‐GUESS, ORCHIDEE, SiB‐3, and SiB‐CASA. All models were wrapped in a software framework driven with common forcing data, spin‐up, and run protocols specified by the Multi‐scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP) for years 1901–2100. We ran more than a dozen model experiments. We identify three major benefits and three major challenges. The benefits include: (a) processing multiple models through a MIP is relatively straightforward, (b) MIP protocols are run consistently across models, which may reduce some model output variability, and (c) unique multimodel experiments can provide novel output for analysis. The challenges are: (a) technological demand is large, particularly for data and output storage and transfer; (b) model versions lag those from the core model development teams; and (c) there is still a need for intellectual input from the core model development teams for insight into model results. A merger with the open‐source, cloud‐based Predictive Ecosystem Analyzer (PEcAn) ecoinformatics system may be a path forward to overcoming these challenges.

Model Intercomparison Projects (MIPs) are that "working together." MIPs are designed to hear different perspectives to sample the population of possible ways the world works. In theory, if that sample is sufficiently robust, then one might converge toward a central tendency of likely truth. Ensemble model means typically perform better than most individual models (Schwalm et al., 2015). The distribution of multimodel variability could be used to construct "emergent constraints" as benchmarks on which to base future projections (Bretherton & Caldwell, 2020; Brient, 2020; Cox et al., 2018; Eyring et al., 2019; Hall et al., 2019; Nijsse et al., 2020; Tokarska et al., 2020). In practice, however, that sampling is often not robust and likely skewed toward models that are derivatives of one another, thereby compounding local biases (Caldwell et al., 2018; Huntzinger et al., 2013, 2017; Sanderson et al., 2015, 2021). Still, MIPs are a necessary starting point for a conversation; otherwise, it is more like a lot of voices shouting in dissonance. Due to numerous user decisions, the same model code run 10 times can have 10 different outputs. If our interest is in understanding variability in biophysical processes (Bonan & Doney, 2018), then this is a problem, as there should, in theory, be no process representation variability for a single model (with notable exceptions, e.g., Lawrence et al., 2019; Niu et al., 2011). What could potentially cause different outputs from the same model, and can we control for it? Starting with the starting point, or spin-up: a model should move forward from the same initial conditions every time, but these conditions can easily vary depending on how one chooses to create the world (e.g., Shi et al., 2013). The forcing data used could also vary, for example, from one meteorological data set to another, even if the model stays the same (Badgley et al., 2015; Lawrence et al., 2019).
How a model/modeler uses the same forcing data could also vary, for example, in the mapping of model plant functional types (PFTs) onto a land cover/land use change (LCLUC) data set (e.g., "mixed forest," different crops). A model might have functions that can be turned on or off and may be reliant on the availability of certain forcing data (e.g., a nitrogen cycle). Finally, the output may all be identical, but the interpretation of the output can vary from person to person or be altered in transfer or analysis. Now, compound all this with multiple different models, and we can see how comparing models for variability in process representation can quickly become muddled by other factors that influence model output. MIPs attempt to control for many of these factors, but struggle to ensure that their protocols are rigorously met. For terrestrial biosphere models (TBMs) alone, there have been many MIPs (global, regional, site; offline, coupled), and there will be more (e.g., PILPS).
As a means to address some of these outstanding issues in MIPs, we sought to develop a prototype system whereby models are run with common operating software on the same hardware systems (Fer et al., 2021; Kumar et al., 2006). This approach is similar to multimodel systems such as the Land Information System (LIS; Kumar et al., 2006), the Terrestrial Observation and Prediction System (TOPS; Nemani et al., 2002), and the Predictive Ecosystem Analyzer (PEcAn; Dietze et al., 2013). Such a system ensures that spin-up protocols are run consistently across models, that user decisions on forcing data such as LCLUC-to-PFT mapping are systematic, that there is no tuning, and that output is saved, transferred, and analyzed uniformly. Further, logistical coordination among multiple teams can be streamlined. We focused on global TBMs and used the relatively strict MsTMIP protocol (Huntzinger et al., 2013). Overcoming two major challenges was required for this to be successful: (a) development of a software system that could interface consistently across disparate models as well as supercomputing infrastructure and (b) compilation and setup of multiple models. Here, we describe this system, called the Terrestrial Biosphere Model Farm, named in tribute to a similar initial effort by others (Denning et al., 2009). The Model Farm conceptually builds on efforts such as LIS, with its hydrological strengths, but here with more of an emphasis on terrestrial carbon cycling. We show an example output for illustration, but the focus of this paper is descriptive rather than analytical. Finally, we provide experiential lessons learned and recommended paths forward.

Data and Models
We primarily used the MsTMIP data and simulation protocols (Huntzinger et al., 2013; Wei et al., 2014). Phase I consisted of simulations from years 1801 to 2010 and Phase II from 2011 to 2100, all globally at 0.5° × 0.5° resolution. Phase I forcing data included climate, LCLUC, atmospheric CO2, and nitrogen deposition. These drivers were turned on/off (i.e., variable or constant) semi-factorially such that the influence of each driver can be isolated in analysis (Huntzinger et al., 2013). Spin-up to initial conditions was dictated by steady-state criteria (Huntzinger et al., 2013). Climate variables were given at a 6-hourly temporal resolution derived from a CRU-NCEP merged product (Harris et al., 2014; Kalnay et al., 1996). LCLUC was derived from a merged LUH-SYNMAP product, resulting in 47 classes annually (Hurtt et al., 2011; Jung et al., 2006). Monthly atmospheric CO2 was constructed for consistency with GLOBALVIEW among other databases (Wei et al., 2014). Finally, annual nitrogen deposition was constructed based on Dentener et al. (2006).
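The semi-factorial design above can be sketched as a small configuration generator: drivers are enabled cumulatively, so differencing consecutive runs isolates each driver's effect. The simulation names and driver ordering below are illustrative assumptions, not the exact MsTMIP labels.

```python
# Sketch of a semi-factorial driver design (illustrative names/order).
DRIVERS = ["climate", "lcluc", "co2", "ndep"]

def build_experiments(drivers):
    """Return run configurations where drivers are turned on cumulatively.
    The reference run holds all drivers constant; each subsequent run
    additionally lets one driver vary, so run N minus run N-1 isolates
    the Nth driver's influence."""
    experiments = {"RG1": {d: "constant" for d in drivers}}  # reference run
    cfg = experiments["RG1"]
    for i, d in enumerate(drivers, start=1):
        cfg = dict(cfg)           # copy previous configuration
        cfg[d] = "varying"        # enable one additional driver
        experiments[f"SG{i}"] = cfg
    return experiments

exps = build_experiments(DRIVERS)
```

For example, `exps["SG2"]` would have climate and LCLUC varying while CO2 and nitrogen deposition stay constant, so `SG2 − SG1` isolates the LCLUC effect.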

Model Farm
The Terrestrial Biosphere Model Farm was designed to keep individual TBM codes intact and treated as modular or compartmentalized, which can also facilitate the replacement of models by new versions. Each TBM is wrapped in the Farm software infrastructure, interfacing with the original model codes without modification, enabling different forcing data sets to be used, and unifying model output into defined formats.
The Terrestrial Biosphere Model Farm consists of five stages, from distribution of forcing data to collection of model output (Table 2).

Stage 1: Data Distribution
Data Distribution restructures all input files into the supercomputing environment. This is the most time-consuming of all the stages, but it is done only once per forcing data set and used for all models. Here, the global forcing data are read, split into "patches" (i.e., groups) of cells, and distributed onto supercomputing nodes for parallel processing such that each patch has an approximately similar computational load. This split can be done because there is no lateral/horizontal cell-to-cell transport (e.g., water, carbon, and energy) for these models. Some patches may contain more grid cells than others due to empty ocean cells (i.e., land-only data). The global grid from MsTMIP was a 1-degree spatial resolution at 181 × 360 latitude and longitude, or 65,160 cells (Wei et al., 2014). We distributed these data into 2,000 patches, each with ∼32 cells (or "points") per patch, on 2,000 matching supercomputing processors/nodes based on the memory and storage capacities of the nodes (with modifications as we migrated supercomputers). An index file assigns a patch and point number for each grid cell along with a time dimension, also forcing each model to operate with the same spatial resolution. This index file may be matched to index files from new data sets that need to align in space and time, for example, connecting historical data (1901-2010; Phase I) to future projections (2011-2100; Phase II), which may have disparate grid sizes and north/south orientations. MD5 checksums ensure that no data are corrupted in transfer from one computer to another.
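The patch split and checksum steps above can be sketched as follows. This is a minimal illustration assuming a simple round-robin assignment over land-only cells; the actual Farm balances patches against node memory and storage capacities.

```python
import hashlib
import numpy as np

def distribute_patches(land_mask, n_patches=2000):
    """Split land-only grid cells into patches of roughly equal size and
    build an index mapping each (lat, lon) cell to a (patch, point) pair.
    land_mask: 2-D boolean array where True marks a land cell."""
    lat_idx, lon_idx = np.nonzero(land_mask)   # land cells only
    cells = np.arange(lat_idx.size)
    patch = cells % n_patches                  # round-robin: sizes differ by at most 1
    point = cells // n_patches                 # position of the cell within its patch
    return {(int(la), int(lo)): (int(p), int(pt))
            for la, lo, p, pt in zip(lat_idx, lon_idx, patch, point)}

def md5sum(path):
    """MD5 checksum used to verify that no data were corrupted in transfer."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

The returned index plays the role of the index file described above: any later stage can map a global cell to its patch/point location, and new data sets aligned to the same grid inherit the same mapping.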
The time required to complete Data Distribution varies depending on hardware infrastructure. Initially, the Data Distribution stage took 3-4 months to complete, but it eventually came down to 3-4 days per data set as we migrated across supercomputers. The bottleneck in the first supercomputer, Zodiac, was file transfer Input/Output (I/O) capacity and file count capacity, which were later expanded in the subsequent supercomputers, Halo and Gattaca. While I/O was expanded sufficiently in Halo, our project storage was reduced (from 30 TB in Zodiac to 10 TB in Halo), which required additional time to manage. MsTMIP forcing data required >500 GB (split roughly equally between Phases I and II). Finally, in Gattaca, the I/O and storage were sufficient for efficient Data Distribution.

Stage 2: Data Conversion
Data Conversion restructures the input forcing data to meet the unique requirements of individual models. Stage 2 is applied to each model after the model is successfully compiled and tested with provided test data or single-point data on the supercomputer. Stage 2 ensures that there are no changes to the model code within the Farm. Specifically, forcing data may be converted to different: (a) formats, such as NetCDF 3 or 4, ascii, *.mat, etc.; (b) temporal resolutions (e.g., hourly and daily); and (c) units, depending on the model. LCLUC is assigned to model PFTs. Configuration parameters are generated for each model to specify which data sets to access. This control module can be updated to swap out different forcing data sets for different domains and experiments. Spin-up data may be reorganized for length of runs. Time required for data conversion was approximately 6 hr per data set transformation. Additional storage was required for the converted data (e.g., >2 TB for SiB3). After Stage 2 is complete, a model may be ready to spin up and run various global simulations.
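As an illustration of the conversions above, a per-model transformation of one forcing variable might look like the following sketch. The specific unit and time-step conversions (Kelvin to Celsius, 6-hourly to daily) are examples only; each model's wrapper declares its own requirements.

```python
import numpy as np

def convert_forcing(temp_6hourly_K, out_units="degC", out_step="daily"):
    """Convert a 6-hourly temperature series (in Kelvin) for one patch
    into a model's expected units and time step. Illustrative example of
    a Stage 2 transformation; real conversions also cover file formats
    (NetCDF 3/4, ascii, *.mat) and other variables."""
    x = np.asarray(temp_6hourly_K, dtype=float)
    if out_units == "degC":
        x = x - 273.15                       # Kelvin -> Celsius
    if out_step == "daily":
        assert x.size % 4 == 0               # four 6-hourly steps per day
        x = x.reshape(-1, 4).mean(axis=1)    # daily mean from 6-hourly values
    return x
```

In the Farm, the choice of `out_units`/`out_step` would come from the model-specific configuration parameters generated in this stage.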

Stage 3: Spin and Run Model
Spin and Run Model runs the model to meet defined steady-state initial conditions and/or executes a desired transient model experiment. Here, independent instances of the model for each of the 2,000 patches are scheduled for parallel execution. In theory, Stage 3 should run smoothly; in practice, however, several issues must be managed for complete success. In spin-up, individual patches may fail to meet spin-up requirements within the run time allotted and must be rerun (Shi et al., 2013). Incorrect or incongruent LCLUC assignments may cause a model to crash a patch due to unspecified model parameters for the assignment. Forcing data sets may not align with each other (e.g., an LCLUC assignment, but no climate). An automated failed-patch detector was written to rerun patches with longer run times to handle some spin-up issues. Other failed patches had to be identified visually, for example, as stripes or unexpectedly high/low values in a global map or as spikes in diagnostic zonal/latitudinal line plots. These patches were then gathered by the Farm operator for rerun and/or debugging. For efficiency, the Model Farm framework allowed individual patches to be rerun without requiring the entire global data set to be rerun. Spin-up required 1-2 days for most patches to complete, depending on model and supercomputer. Transient run time also required 1-2 days, depending on model, supercomputer, and simulation (e.g., 100 years historical-only vs. 200 years historical plus future).
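The failed-patch detection described above can be sketched as follows. The inputs (a per-patch completion status and per-patch mean values) and the spike threshold are illustrative assumptions; the actual detector also reruns patches with longer wall-clock limits.

```python
def find_failed_patches(patch_status, patch_means, spike_factor=5.0):
    """Return patches to rerun: those that did not finish within their
    allotted run time, plus those whose mean output deviates far from the
    median across all patches (the kind of stripe or spike visible in a
    global map or zonal line plot). Threshold is illustrative."""
    failed = {p for p, ok in patch_status.items() if not ok}
    vals = sorted(patch_means.values())
    median = vals[len(vals) // 2]
    for p, v in patch_means.items():
        if median and abs(v) > spike_factor * abs(median):
            failed.add(p)          # statistical outlier: flag for inspection/rerun
    return sorted(failed)
```

Because patches are independent, only the returned patches need to be resubmitted, which is what makes targeted reruns cheap relative to redoing the whole global run.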

Stage 4: Compact Output
Compact Output converts output from individual models into common formats and units for analysis. Model output from Stage 3 is produced in individual model-dependent formats, units, variables, and time steps. Stage 4 unifies model output across models. Here, we convert all model outputs into NetCDF, common units (e.g., as defined by MsTMIP), standardized NaN/water/ice values, and monthly time steps. The monthly time step conversion reduces storage and transfer time demand, as most models operate and output at sub-monthly time steps (e.g., daily for HyLand and 6-hourly for CABLE), which gives Stage 4 its name. For example, pre-Stage 4, HyLand used 1 TB per simulation, whereas CABLE used nearly 4 TB per simulation. Monthly compaction requires approximately half a day (e.g., HyLand) to a day (e.g., CABLE) per simulation per model, depending on the Stage 3 output. Converting output file formats (e.g., *.out files from CENTURY to NetCDF) requires approximately another day.
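A minimal sketch of the monthly compaction for one variable on one patch, assuming daily model output on a standard calendar (sub-daily output would first be averaged to daily in the same fashion):

```python
import calendar
import numpy as np

def compact_to_monthly(daily_values, year):
    """Aggregate one year of daily output (365 or 366 values) into 12
    monthly means: the storage-reducing compaction applied in Stage 4."""
    lengths = [calendar.monthrange(year, m)[1] for m in range(1, 13)]
    assert len(daily_values) == sum(lengths)   # full calendar year expected
    monthly, i = [], 0
    for n in lengths:
        monthly.append(float(np.mean(daily_values[i:i + n])))
        i += n
    return monthly
```

For mean-state variables this simple average suffices; accumulated fluxes would instead be summed, a per-variable choice the unit conventions (e.g., as defined by MsTMIP) determine.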

Stage 5: Collect Output
Collect Output converts all the individual patches into a common global grid. The final stage in the Model Farm collects all the outputs, which are still partitioned into patches on individual supercomputer nodes, and pieces the patches back together into a cohesive and standardized global grid (still at the monthly time step from Stage 4). Metadata are added at this stage, including, for example, model source code, model version, and contact information. Model output files are organized as one variable per file per model. Stage 5 collection requires less than half a day per simulation, depending on the number of variables saved.
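The collection step can be sketched as a scatter of patch values back onto the global grid using the Stage 1 index; the function and data layout here are illustrative (one time step of one variable), with ocean cells left as a fill value.

```python
import numpy as np

def collect_global(patch_outputs, index, shape=(181, 360), fill=np.nan):
    """Reassemble per-patch output into one global grid for one time step.
    `index` maps (lat, lon) -> (patch, point), as built in Stage 1;
    `patch_outputs[patch][point]` holds that cell's compacted value.
    Cells absent from the index (e.g., ocean) keep the fill value."""
    grid = np.full(shape, fill)
    for (la, lo), (p, pt) in index.items():
        grid[la, lo] = patch_outputs[p][pt]
    return grid
```

Iterating this over months and variables, then writing each variable to its own NetCDF file with the Stage 5 metadata, yields the one-variable-per-file-per-model layout described above.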

Results
A set of diagnostic visualization packages compares each model's values with those of comparison models. A specific model of interest is given a thicker line than the other models for visualization (Purdy et al., 2019). Axis and map ranges are automatically set to the minimum and maximum extent of the data visualized. The user must develop their own customized packages to answer specific scientific questions and to alter and improve esthetic visualization of the model output. The Experiment package produces inter-simulation comparisons. The Diagnostics package requires a few hours per variable to produce the plots for the global MsTMIP-structured output.
An example of the Single Variable package visualization is shown in Figure 3. Here, we show Net Primary Productivity (NPP) for CENTURY as the highlighted model for the MsTMIP Phase II simulation (2011-2100) with CESM1-CAM5 climate and the RCP4.5 CO2 scenario. The top-left map displays the 2011-2100 annual NPP.
The top-middle map shows the mean NPP from the MsTMIP Phase I models, SG3 simulation (CO2, climate, and LCLUC all "on"); we note that the 1901-2010 time period for SG3 is not an equivalent time comparison to the 2011-2100 runs, but it provides a first-order check. The top-right map shows the difference between the first two maps.
An additional set of diagnostic visualizations is produced from the Experiment package to compare among different model experiments for a single given model (Figures 4 and 5). This ensures that there was indeed a difference among experiments, which may not occur if there was a problem in setting up the experiment. It also ensures that the difference between model experiments coincides with theoretical expectations, for example, increasing NPP with increasing CO2 (Schimel et al., 2015; Walker et al., 2021). Figures 4 and 5 are separated because the magnitude difference between the Phase I experiments (Figure 4) and the output from Phase II (Figure 5) is so large that the Phase I differences would be too small to see on the raw Phase II scale. For the inter-simulation runs, we find that line plots are more useful for diagnosis than maps, as differences and trends are more readily noticeable to the eye.

Discussion
The Terrestrial Biosphere Model Farm aims to reduce the time and skill required to execute a model experiment. An undergraduate student with no modeling experience and moderate programming skills can integrate a new TBM, spin-up the TBM, and execute at least one experiment with a 10-week summer internship. Indeed, many of the coauthors on this paper were those undergraduate interns. The core technical team would already have the TBM available and compiled with supported libraries installed and debugged before those internships began. Interns would spend the first few weeks learning about TBMs in general, studying the model code, and mapping the model I/O structure. They would then spend at least a couple of weeks mapping the LCLUC forcing data onto the TBM PFT structure. Another week or two would be spent connecting the model to the Farm and running tests. The next couple of weeks would be spent on spin-up and debugging. The final couple of weeks could allow the first MsTMIP simulation, assuming all else was successful to that point (which was not always). Some interns continued working on their model into their school year or returned for a following summer for more model runs and experiments. These examples illustrate the relative ease and challenges that the Farm afforded for running TBMs in contrast to, for example, traditional model development centers.
Generally, the Terrestrial Biosphere Model Farm provided several complements to traditional MIPs. Once the models were properly set up, it was relatively fast to run multiple experiments across models. Moreover, a team organizing a MIP might be interested in many more types of experiments than they would be comfortable specifying, not wanting to burden so many people and risk incomplete runs. With a centralized Farm, once the forcing data and experiments are set up, multiple models can be run with relative ease. One example of the different types of experiments that the Farm supported was on LCLUC data set development for NASA's Carbon Monitoring System (CMS) (Kennedy et al., 2018; Neeti & Kennedy, 2016; Y. Zhou et al., 2021). Here, the team focused on new and different types of LCLUC data products and their impacts on TBMs. We were able to ingest these different LCLUC data sets and provide the corresponding variability in TBM output. Ultimately, future MIPs should expand to include not only multiple models, but also multiple forcing data sets (Dietze et al., 2014). Additionally, the Farm can be used as a "sandbox" with which to test and debug any problems with new experiments. Once those are worked out, the MIP can be extended to individual modeling teams so that each team does not have to waste time working out those bugs.
For each model, there were a number of user decisions that had to be made in setup, for example, how to map the LCLUC forcing data onto the model PFTs. For instance, a pixel may be classified as "mixed forest" in the forcing data, but a model may have multiple forest PFTs (Poulter et al., 2011). By erring consistently toward a single definition of mixed forest, for better or for worse, we could at least constrain differences in model output due to those types of user decisions. A similar process may be considered for classifying different crop types. Still, LCLUC mapping and debugging was a very time-consuming process. Likewise, while MIPs may specify spin-up requirements and forcing data, it is challenging to know whether the modeling teams adhered completely to those specifications. In the Farm, all models were run uniformly across those specifications, even if models had different ways of going about spin-up, time steps, and use of data. However, while these aspects potentially reduce spread among model outputs, we also saw ways in which one could tune models to match benchmark data sets (e.g., scalars for stocks and fluxes). Being model and output agnostic, we did not tune. But, as noted earlier, models may converge toward each other in traditional MIPs simply because of tuning (Dommenget & Rezny, 2018; Fer et al., 2021; Hourdin et al., 2017; Notz, 2015; Räisänen, 2007; Scheiter et al., 2013).
Nonetheless, there were challenges in running the Farm and using its output. While "supercomputers" may be super at fast processing, they can be bottlenecked by file storage and transfer (Xie et al., 2012). Multiple models and multiple scenarios for hundreds of years for thousands of pixels result in a lot of data. Supercomputers are not data archive centers, and so a conflict arises when these requirements are levied upon them at least in the relative short term. There needs to be a place to move intermediary model output before they are finalized for storage at a data archive, allowing time for debugging and output assessment. Moreover, supercomputers are a shared resource, and one cannot monopolize time on them for extended periods. Patches that require extra spin-up time to reach equilibrium run into that conflict, requiring disproportionate labor to complete a small number of pixels. Additionally, supercomputers are continually evolving, updating, and being replaced (Lim et al., 2015). Migration and re-setup of the Farm are very time-consuming for each new system.
Beyond the technical challenges, there are bounds to the scientific utility of the Model Farm output. Like supercomputers, individual models are continuously evolving and updated by the respective modeling teams (e.g., Bonan, 1998; Bonan & Levis, 2006; Bonan et al., 2002; Lawrence et al., 2011, 2019). It requires significant labor to repeatedly replace Farm models with their most recent versions; inevitably, the models in the Farm may not be up to date. As such, examination of Model Farm output may not necessarily reflect the most recent model versions. This issue is common for traditional MIPs too, though modelers sometimes update their results later for the MIPs as their models change, and some MIPs manage to conduct regular updates. On the other hand, models in the Farm might be considered more stable than newer model versions that may still have bugs to be discovered. Further, when trying to publish results from most MIPs, reviewers typically hit those papers with, "Why did the models behave the way they did?" This is challenging for any individual modeler to answer about their own model, let alone a centralized system that runs other people's models as effective "black boxes." So, how do we overcome these challenges, yet still advance the needs for robust MIPs? In Fer et al. (2021), we presented a roadmap to community cyberinfrastructure for ecological data-model integration. Five community tools are needed to accelerate the merging of models and data: (a) shared workflows, (b) scalable data ingest, (c) calibration, (d) benchmarking, and (e) iterative data assimilation. While the Farm addresses (a), (b), and some of (d), all five of these tools are implemented within a multimodel workflow management system called PEcAn (LeBauer et al., 2013). PEcAn, like the Model Farm, is a tool to run multiple TBMs on a centralized system.
However, unlike the Model Farm, PEcAn is an open-source utility that is available to anyone (pecanproject.org), can be installed on anything from a laptop to a supercomputer, or can be run in the cloud. As such, PEcAn operates as a distributed network of nodes rather than a single centralized server, which allows MIPs either to be fully centralized on one node, analogous to the Farm, or to be run by individual teams and synced via the PEcAn API, which may reduce the "other people's models" challenge mentioned earlier. Because PEcAn is containerized in Docker, it also greatly reduces the challenges of getting new models compiled and running on new servers, as well as migrating the system when systems are upgraded.
PEcAn provides a number of language-agnostic API tools to assist with model calibration, data fusion, analysis, and provenance tracking (Fer et al., 2018; LeBauer et al., 2013; Shiklomanov et al., 2020). As part of its data assimilation workflow, PEcAn can quantify, propagate, and analyze uncertainties associated with parameter, driver, and initial condition uncertainty distributions (Dietze et al., 2014). Further capabilities are in progress that are not too dissimilar from a community model, including, for example, links to benchmarking and validation efforts (Abramowitz, 2012; Blyth et al., 2011; Collier et al., 2016; Kelley et al., 2013; Luo et al., 2012; Randerson et al., 2009; Schwalm et al., 2010). PEcAn's initial development has focused on site- to continental-scale research, whereas the Model Farm has focused on global runs, requiring the distribution and gathering of inputs and outputs. Still, like the Farm, PEcAn requires significant modeler time to develop couplers for inputs, runs, and updates. Nonetheless, PEcAn provides an excellent platform for unifying models for MIPs and ultimately saves researchers' time on developing the informatics and analysis tools built into PEcAn; we highly recommend that modelers begin to explore it. A blend of a centralized Farm team with the open platform of PEcAn could be a beneficial combination. The next frontier is a merger of the strengths of the Model Farm and those of PEcAn, which can create a whole that is larger than the sum of its parts.

Conclusion
Earth system modelers are like the blind men and the elephant, each trying to seek the truth in front of them (the future of the Earth), and each saying something different, yet similar. By working together through MIPs, we may be able to piece that picture together (Figure 6). Yet, MIPs are challenging, and we need to ensure that each of those voices accurately depicts a defined perspective. We attempted to systematically unify and control for a number of "free parameters," or degrees of freedom, in MIPs through a centralized Model Farm. While solving these problems raised other challenges, we did bring something novel to the conversation of how MIPs are operated and interpreted. As we collectively continue to strive toward understanding the future, we must not only continue working together, but we must do so with improved consistency, transparency, and efficiency. And we must do so quickly, as time is of the essence: the elephant may be gone by the time we figure out what was in front of us.

Acknowledgments
This paper is dedicated to Gary Block, founding software developer of the Model Farm, who was taken from us by COVID-19 during the writing of the paper. The authors thank C. Prentice, C. Miller, G. Stephens, R. Kennedy, P. Ciais, F. Maignan, K. Wong, Y. Elshorbany, D. Rivera, N. Perakalapudi, E. Villanueva, and V. Kantchev. The authors thank the JAMES editor and reviewers for constructive comments. The research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004). California Institute of Technology. Government sponsorship acknowledged. Support was provided in part by NASA programs: CMS, CARBON, TE, ABoVE, and CARVE; support was also provided by JPL RTD and JPL SRTD. Copyright 2021. All rights reserved. Funding for the Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsTMIP; https://nacp.ornl.gov/MsTMIP.shtml) activity was provided through NASA ROSES Grants #NNX10AG01A and NNX14AI54G. Data management support for preparing, documenting, and distributing model driver and output data was performed by the Modeling and Synthesis Thematic Data Center at Oak Ridge National Laboratory (MAST-DC; https://nacp.ornl.gov) with funding through NASA ROSES Grant #NNH10AN681. Finalized MsTMIP data products are archived at the ORNL DAAC (https://daac.ornl.gov). The authors also acknowledge the modeling groups that provided results to MsTMIP.