Performance and Accuracy Implications of Parallel Split Physics‐Dynamics Coupling in the Energy Exascale Earth System Atmosphere Model

Simultaneous calculation of atmospheric processes is faster than calculating processes one at a time. This type of parallelism is beneficial, or perhaps even necessary, to provide good performance on modern supercomputers, which achieve faster performance through increased processor count rather than improved clock speed. The scalability of the Energy Exascale Earth System Model (E3SM) Atmosphere Model (EAM) is limited by the fluid dynamics, which scales up to the number of mesh cells in the global mesh. In contrast, the suite of physics parameterizations in EAM is scalable up to the total number of physics columns, which is an order of magnitude greater than the number of mesh cells. A proposed solution for unlocking the greater potential performance of the physics suite is to solve the physics and dynamics in parallel. This work represents a first attempt at parallel splitting of the grid-scale fluid dynamics model and the subgrid-scale physics parameterizations in a global atmosphere model. We will demonstrate that switching to parallel physics-dynamics coupling extends the scalability of EAM to 3 times the previous peak scalability limit and is up to 20% faster than the sequentially split coupling at the highest core counts and the same time step. Decadal simulations with both coupling approaches show very little impact on the model climate. This improved performance does not come without drawbacks, however. Parallel splitting requires a shorter time step and other modifications which largely offset the performance gains. A mass fixer is required for conservation. Techniques for mitigating these issues are also discussed.


Introduction
The next generation of high-performance computing systems will be faster due to greater parallelism rather than faster chips. Another trend has been the shift toward hybrid systems that employ a combination of central processing units (CPUs) and graphics processing units (GPUs) (Lapillonne & Fuhrer, 2014). GPUs show promise in that they provide an order of magnitude more processing for the same relative power consumption. Their downside is that they use simpler processors, and porting code from CPUs to GPUs presents its own challenges (Bertagna et al., 2019; Michalakes & Vachharajani, 2008). In order to continue solving bigger and bigger problems, the global climate modeling community must embrace these new systems, which often have hundreds of thousands of processing units.
In the past, when the next generation of high-performance computing meant faster processors, a significant speedup in the time to solution could be obtained simply by moving to a new system. In particular, models had already been developed to scale up to the available number of computational cores. Next-generation machines, however, require more parallelism than current models can provide. This requires a revisiting of model implementation with a focus on how to produce code that can take advantage of more and more processing units. In other words, global climate models must explore ways to generate more parallelism.
Global atmosphere models solve a complex global system that spans a large range of temporal and spatial scales. To handle this, the atmosphere is separated into two parts: the resolved scale fluid dynamics and the subgrid-scale physical parameterizations. Physics and dynamics are typically coupled together using a sequential splitting (SS) (also known as time splitting) approach. As described in more detail in Section 2, in SS each process is computed sequentially and its effect is felt by subsequent processes. An alternative approach which has not yet found widespread use but has recently attracted attention is parallel splitting (PS), whereby the tendency for each process is calculated independently and all process tendencies are combined at the end of each time step. Historically, evaluation of parallel splitting in climate models has focused on accuracy and climate impact rather than computational efficiency (Barrett et al., 2019; Beljaars, 1991; Williamson, 2013). One exception is Balaji et al. (2016), which showed that running radiation in parallel with other physics processes enabled more frequent radiation calls without a penalty in time to solution.
Here we focus on applying the physics suite in parallel with dynamics because, at least for the Energy Exascale Earth System Model (E3SM), physics and dynamics are the two largest contributors to atmosphere runtime. Thus, parallelizing them provides the biggest opportunity for potential speedup.
The paper is organized as follows. Section 2 will introduce the time-splitting techniques. Section 3 will discuss the E3SM Atmosphere Model (EAM) and our parallel splitting implementation. Section 4 highlights a series of complications that arise when physics and dynamics are parallel split. Section 5 will present the results of the parallel split simulations. Finally, Section 6 will present a summary of the results and potential future directions. Appendix A illustrates the differences between parallel and SS by analytically examining a simple test problem, and Appendix B provides the processor counts for the simulations used in this study.

Time Splitting in Global Atmosphere Models
As mentioned in Section 1, global atmosphere models must treat a complex global system with a range of temporal and spatial scales. The typical spatial resolution for a global climate simulation ranges from 25 km to 100 km or more. Subgrid-scale processes are represented by a set of physics parameterizations. These provide a tendency which is then applied in the resolved-scale fluid dynamics model, typically following an SS coupling implementation.
There are a handful of variations of the SS approach. Sequential update splitting (SUS) updates the model state after each process in the sequence. Thus, each process is passed an updated model state which reflects the impact of all processes that have preceded it. Sequential tendency splitting (STS) uses the same base state (typically the state at the beginning of the time step) for each process and incorporates the tendencies from previous processes as a constant forcing term. This is particularly useful in cases where one process is substepped at a shorter time step than the other processes, in which case the tendencies can be applied partially at the substep time step. Each of these approaches is used in EAM: SUS for the physics parameterization computations and a variant of STS to incorporate the impact of physics in the dynamics calculation. In actuality, physics/dynamics coupling in E3SMv1 follows a hybrid sequential splitting (HSS) approach which will be discussed in greater detail in Section 3. All of these approaches are formally first order. Higher-order symmetric update splitting methods can also be derived. These involve applying all processes at half a time step in one order and then applying them again in the reverse order for another half time step. These methods are not used in EAM and thus will not be discussed in this work. There are many references that explore these approaches in idealized settings and in more depth (see Dubal et al., 2004; Sportisse, 2000; Staniforth et al., 2002a, 2002b). The parallel split (PS) approach specifies that the individual tendencies for each process be calculated independently and then combined at the end of the time step, the main distinction being that neither set of processes is aware of what the other is doing until both have completed. This is currently the approach used in E3SM to handle subgrid microphysical tendencies.
In contrast to the SS approach, PS does not incur a splitting error for single-step methods (e.g., forward Euler). A further advantage of this approach is that if more computational resources are available, they can be fully utilized by handling each process simultaneously. Despite the promise of greater utilization of computer resources, this coupling mechanism has not found widespread use in the weather and climate community. One reason for this is that until recently distributing work over the spatial degrees of freedom has provided more than enough work for even the biggest supercomputers. In addition, Beljaars and colleagues found that STS yielded more skillful forecasts (Beljaars, 1991; Beljaars et al., 2004). We find in this paper a third reason: Parallel splitting may require a shorter time step for stability.
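The absence of splitting error for a single forward Euler step can be seen with a toy problem. In the sketch below (illustrative Python, not EAM code), two hypothetical linear process rates A and B act on a scalar state: the PS update matches the unsplit forward Euler step exactly, while SUS picks up an O(Δt²) cross term.

```python
# Toy comparison of coupling strategies for dy/dt = (A + B)*y with forward Euler.
# A and B are hypothetical process rates, not physical values.
A, B = -1.0, -2.0
dt, y0 = 0.1, 1.0

# Unsplit forward Euler step
unsplit = y0 + dt * (A + B) * y0

# Sequential update splitting (SUS): process B sees the state already updated by A
y1 = y0 + dt * A * y0
sus = y1 + dt * B * y1               # = (1 + dt*A)(1 + dt*B)*y0

# Parallel splitting (PS): both tendencies are computed from the same base state
# and combined at the end of the step
ps = y0 + dt * (A * y0 + B * y0)

print(ps == unsplit)                 # True: no splitting error for a single step
print(abs(sus - unsplit))            # ≈ 0.02, the dt²*A*B*y0 cross term
```

This holds for a single step only; over many steps the two approaches diverge from the exact solution in different ways, which Appendix A examines analytically.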

Model Description
This study is based on the atmosphere component of Version 1 of the E3SM. The EAM (Rasch et al., 2019; Xie et al., 2018) is broken into four main sections, which are run sequentially (see Figure 1, left panel): (1) the SE-dycore (Dynamics), which solves the equations of state (Dennis et al., 2012; Taylor et al., 2007); (2) a set of Before-Coupling Physics (BC Physics) parameterizations which handle cloud physics and radiation; (3) the surface model contributions, which account for the other components in the E3SM suite (land, ocean, and sea ice); and (4) a set of After-Coupling Physics (AC Physics) parameterizations which handle the remaining physics parameterizations, as well as apply an energy fixer. The BC Physics set is the larger of the two and includes (1) the Zhang-McFarlane deep convection scheme (Neale et al., 2008; Richter & Rasch, 2008; Zhang & McFarlane, 1995), followed by (2) the Cloud Layers Unified By Binormals (CLUBB) scheme (Bogenschutz et al., 2013; Golaz et al., 2002), which resolves shallow convection, turbulence, and cloud macrophysics; (3) aerosol activation (Abdul-Razzak & Ghan, 2000); (4) the two-moment Morrison-Gettelman cloud microphysics scheme; (5) aerosol wet deposition (Liu et al., 2016; Wang et al., 2013); and (6) Rapid Radiative Transfer Model for GCMs (RRTMG) radiation (Iacono et al., 2008; Mlawer et al., 1997). The AC Physics suite includes (1) the dry deposition of aerosols (Liu et al., 2012) and (2) gravity wave calculations (Richter et al., 2010). In both BC and AC Physics the set of physics parameterizations is solved using SUS splitting. As the names would suggest, BC and AC Physics occur before and after EAM is coupled with the remaining surface components in the full E3SM suite.
The spectral element dycore solves the global set of equations using a spectral element cubed-sphere mesh (Dennis et al., 2012). Each cube face contains an ne × ne grid of spectral elements. The spectral element basis functions are defined as the set of Lagrangian interpolation functions at the Gauss-Lobatto-Legendre (GLL) points, typically referred to as Legendre-Gauss-Lobatto (LGL) points outside of the climate modeling community. There are np × np GLL points per element. Each GLL point is used as a column for physics calculations. There is no horizontal communication in the physics. For a domain with ne spectral elements per cube-face edge and order np there are 6ne² total elements and, once overlapping GLL points between elements are accounted for, 6(np − 1)²ne² + 2 unique GLL points (= number of physics columns). The default spectral resolution is np = 4; thus, there are approximately 9 times more columns than elements in a given domain.
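These counts are easy to check numerically. The snippet below evaluates the two formulas above; ne = 30 is assumed as a representative 1° E3SM configuration.

```python
# Element and unique-GLL-point (physics column) counts on the cubed sphere.
def counts(ne, np_=4):
    elements = 6 * ne**2                    # 6 faces of ne x ne elements
    columns = 6 * (np_ - 1)**2 * ne**2 + 2  # unique GLL points after merging duplicates
    return elements, columns

elems, cols = counts(30)                    # ne = 30 assumed for the 1° mesh
print(elems, cols, cols / elems)            # 5400 48602 -> roughly 9x more columns
```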
As mentioned in Section 2 and illustrated in Figure 2, EAM uses an HSS approach to handle the coupling of physics and dynamics. In HSS the physics provides an updated model state for conserved quantities (water vapor, liquid cloud mass, etc.) and model tendencies for nonconserved quantities (temperature and velocity) to the dynamics. The physics contribution to the conserved quantities such as water species and aerosol tracers is applied following a SUS approach, while the physics tendencies for the nonconserved quantities such as temperature and velocity are applied using an STS approach with their effect "dribbled in" over a set of remapping substeps. EAM adopted this HSS strategy to mitigate issues that arose when either SUS or STS alone was used to couple all physics with dynamics. The problem with STS is that conserved quantities became overdepleted and were subsequently corrected in a way which violated energy conservation. SUS is not desirable because the rigid, sudden adjustment to the model state every physics time step produced spurious gravity waves that degraded the solution fidelity. The application of HSS addressed these two concerns and produced a good solution. A detailed description of these problems and the solution is provided in Zhang et al. (2018). All default model simulations employ this coupling mechanism.

Default Model Performance
As noted earlier the SE-dycore scales up to the total number of elements in the dynamics mesh while physics scale to the total number of columns (which is 9 times larger). The default implementation allows the user to request as many cores as there are columns, but only as many cores as there are elements will be used for dynamics; all excess cores will sit idle during dynamics. This means that peak scalability is limited by the dynamics.

Implementation of Parallel Splitting in EAM
The goal of this study is to separate the computation of the physics and the dynamics onto separate computational cores. This task is made difficult by the four components of the atmosphere model described in Section 3.1. In particular, there are a series of MPI communication barriers between these sections of code which limit when dynamics can be run in parallel with the physics. These barriers and our parallelization strategy are mapped out in Figure 1, right panel. The most critical use of MPI barriers is related to the surface coupling: At this point all computational cores must be reassigned to surface processes. With the exception of the ocean model, the surface models must be run sequentially with EAM. As a result it is not possible to run the atmospheric dynamics concurrently with the full suite of physics parameterizations without moving away from a four-stage atmospheric software architecture, which would require massive infrastructure investment. Keeping the current software architecture, we could run dynamics concurrently with either the BC or AC physics parameterizations but not both. Given that BC physics is the more expensive of the two, it was chosen. A complete restructuring of EAM would allow for more complete parallelization. In that sense, the results shown here provide a lower bound on the benefits of PS. That said, AC physics only accounts for ≈10-15% of the overall physics run time, so the potential for further gains is fairly modest. We are currently rewriting the EAM in a way that will enable full physics-dynamics parallelism as part of a larger long-term effort.
Mechanically, the changes needed to run dynamics and physics in parallel were relatively minor and focused on two parts of the EAM infrastructure. First, the initialization of physics and dynamics domain decompositions had to be adjusted so that a computational core could only be assigned to the dynamics or the physics but not both. Second, in the standard SS configuration the physics produces tendencies for the prognostic variables which are passed to dynamics. These tendencies are then applied in dynamics following one of the approaches described in Section 2. At the end of the dynamics step the model state is passed back to physics to initiate the next time step. In the PS configuration the transfer and subsequent application of the physics tendencies was delayed until after both physics and dynamics had finished their respective calculations. As will be discussed in the following section, the final application of vertical remapping and horizontal hyperviscosity was also delayed until after the physics tendencies could be applied to the updated dynamics state; see Figure 3.
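The restructured time loop can be sketched as follows. This is a schematic only: Python threads stand in for the MPI core partition, and every function is a toy stand-in (a single scalar "temperature" state) rather than an EAM routine.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the model components.
def run_dynamics(state, dt):       # dynamics step, with the final remap/HV withheld
    return {"T": state["T"] - 0.125 * dt * state["T"]}

def run_bc_physics(state, dt):     # BC physics tendencies from the same base state
    return {"T": 0.5}

def apply_tendencies(dyn_state, tend, dt):
    return {"T": dyn_state["T"] + dt * tend["T"]}

def vertical_remap(state):         # deferred: must act on the combined state
    return state

def hyperviscosity(state):         # likewise deferred until after coupling
    return state

def step_parallel_split(state, dt):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both start from the same state; neither sees the other's tendencies.
        dyn = pool.submit(run_dynamics, state, dt)
        phys = pool.submit(run_bc_physics, state, dt)
        dyn_state, phys_tend = dyn.result(), phys.result()
    # Deferred coupling: apply physics tendencies to the updated dynamics state,
    # then remap and hyperviscosity act on the combined result.
    state = apply_tendencies(dyn_state, phys_tend, dt)
    return hyperviscosity(vertical_remap(state))

print(step_parallel_split({"T": 300.0}, 1.0))   # {'T': 263.0}
```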

Experimental Design
In order to assess the impact of shifting to a parallel split physics/dynamics coupling paradigm, a series of simulations were conducted using the default sequentially split version of E3SMv1 and the parallel split version of E3SMv1. Parallel split and sequentially split simulations were conducted with a variety of total core counts to assess performance impact. All simulations were conducted with all forcings and sea surface temperatures (SSTs) set at annually repeating year 2000 conditions. All simulations use 72 vertical levels in the atmosphere. Ten year simulations at 1° resolution were used to evaluate the impact of parallel split versus sequentially split coupling on model climate. Computational performance was tested with 5 day simulations on the 7.5°, 2.7°, 1.9°, and 1.0° meshes without writing any output; see Table 1 for mesh details. As explained further in Section 4.1, the parallel split simulations required a reduced Δt for numeric stability; see Table 2. The simulations using the sequentially split configuration were run with the default Δt as well as the reduced Δt. Scalability was tested on two computational platforms: a conventional CPU machine called Quartz at Lawrence Livermore National Laboratory (running 2.1 GHz Intel Xeon E5-2695 chips with 36 cores/node) and the Knights Landing partition of Cori (Cori-KNL) at the National Energy Research Scientific Computing Center.

Stability Due to Time Step
The time step limitations for SS can differ greatly from those for PS; see Appendix A.1 for greater detail. Thus, many of the assumptions that control the time step limitations for an SS model must be reconsidered when switching to a PS implementation.
As evident in Table 2, the time step for a simulation in EAM is actually a collection of time steps that are determined for each individual process in the model. The main time step in the model can be referred to as the physics or coupling time step and represents the frequency of physics and dynamics coupling. Within both dynamics and physics there are a set of processes which are substepped at a smaller time step. Dynamics is controlled by the CFL condition, which typically requires a shorter dynamics time step than physics. Additionally, the vertical pressure level thickness is allowed to expand and contract due to the vertical Lagrangian transport of mass. This requires a remapping time step at which the vertical pressure levels are remapped onto a standard set of pressure levels to avoid the complete collapse of any layers. See Dennis et al. (2012) for more details of the vertical pressure coordinate system. The remap time step is typically shorter than the physics time step but larger than the dynamics time step. In physics, it was determined empirically that the CLUBB macrophysics and MG2 microphysics must be tightly coupled and run at a reduced cloud time step. Conversely, the relatively slow process of radiation can be run at a much larger time step and is thus typically superstepped in the model. In the case of the default 1° EAM configuration the physics time step is 30 min. In dynamics the solution is remapped every 15 min, or twice per full time step, and the dynamics itself is run at a 5 min time step, that is, 3 times per remap step and 6 substeps total. The cloud time step was empirically determined to be optimal at 5 min, so CLUBB and MG2 are substepped 6 times per physics step. Finally, radiation is run hourly and thus is only updated every other time step. Similar configurations are set for other mesh resolutions.
Empirically, we determined that stability for the PS implementation requires the physics time step to be reduced to roughly the remapping time step. For a typical 1° simulation this meant a reduction of the physics time step to 15 min. This is a problem because it eats away at the efficiency gains we get by parallelizing physics and dynamics. The slowdown is not as detrimental as it sounds, however, because the dynamics, macrophysics, microphysics, and radiation time steps remain unchanged. Thus, in order to maintain stability in the model, dynamics, macrophysics, and microphysics only need to be substepped 3 times instead of 6, and the radiation can still be superstepped hourly. As a result the reduction in time step only produces about a 20% performance penalty in the cost of the atmosphere component. Note that for the 1.9° mesh we found that there is no need to reduce the time step: the physics time step is equal to the remapping time step by default, and for this mesh simulations remained stable at the default time step. The impact on the full model was not examined in this study; the increased coupling frequency between the atmosphere and surface is expected to incur an extra performance cost.
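The substep bookkeeping described above can be summarized numerically. The helper below is simple integer arithmetic over the 1° time steps quoted in the text (in minutes); it is illustrative, not model code.

```python
# Substep counts implied by a set of nested time steps (all in minutes).
def substeps(dt_phys, dt_remap, dt_dyn, dt_cloud, dt_rad):
    return {
        "remaps_per_phys": dt_phys // dt_remap,
        "dyn_per_phys": dt_phys // dt_dyn,
        "cloud_per_phys": dt_phys // dt_cloud,
        "phys_per_rad": dt_rad // dt_phys,
    }

hss = substeps(dt_phys=30, dt_remap=15, dt_dyn=5, dt_cloud=5, dt_rad=60)
ps = substeps(dt_phys=15, dt_remap=15, dt_dyn=5, dt_cloud=5, dt_rad=60)
print(hss)  # 2 remaps, 6 dynamics and 6 cloud substeps, radiation every 2nd step
print(ps)   # 1 remap, 3 dynamics and 3 cloud substeps, radiation every 4th step
```

The per-substep costs are unchanged between the two columns; only the number of substeps per coupling step drops, which is why halving the coupling step costs roughly 20% rather than doubling the run time.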

Stabilization Mechanisms
Most global atmosphere models apply some form of diffusion to limit the propagation of high-frequency modes. Formally, the SE dycore has three steps: (1) application of the physics tendencies, (2) solution of the conservation of momentum and mass from the equations of state, and (3) application of horizontal hyperviscosity to remove high-frequency modes in the solution (see Dennis et al., 2012). In the default SS configuration the hyperviscosity is applied following every dynamics step. Because the physics tendencies are coupled before dynamics, the hyperviscosity smooths any high-frequency noise introduced by the physics. In a PS configuration the hyperviscosity becomes decoupled from the physics tendencies, allowing high-frequency waves to propagate through the simulation. In order to ensure that hyperviscosity is able to act on the solution after physics, the application of hyperviscosity after the final dynamics step is delayed until after the physics and dynamics impacts have been combined.
Journal of Advances in Modeling Earth Systems 10.1029/2020MS002080

As water vapor mass is moved during the dynamics, pressure layers can collapse, leading to numerical inconsistencies. To mitigate this issue, the spectral element dycore employs a vertical remapper which projects the current solution onto a set of regular pressure-based vertical coordinates. The remapper also has a slight smoothing effect on the vertical profiles of the solution. Remapping before physics tendencies are applied can introduce an inconsistency, as the physics tendencies are defined on the set of pressure layers from the beginning of the time step. To ensure consistency, the remapping step is similarly delayed until after physics and dynamics have been combined; see Figure 3.
This application of hyperviscosity and vertical remapping occurs once the physics is finished and is part of the nonparallel portion of the code. Since they are somewhat expensive, this also limits the benefit of parallel splitting. In our experiments the extra cost incurred by this step was equivalent to half the cost of the rest of dynamics. Models with more numerical diffusion and without a floating vertical grid may be able to reap more benefits from parallelization by not having to employ these steps.

Mass Conservation
A major concern when applying the tendencies of two different processes in parallel is the risk of mass conservation violations. This issue arises from the fact that neither process is aware of the other; thus, if both processes sufficiently deplete a resource, their combined tendency can cause that resource to become negative. This is already a risk for STS and is one of the main reasons that EAM switched to the HSS.
There are a number of approaches to mitigate mass conservation violations. The simplest solution is mass clipping, by which any negative mass is "clipped" to zero. This has the effect of increasing the local mass to accommodate the strong negative tendencies. The risk of this approach is that globally you are increasing the mass, which is nonphysical and could lead to unrealistic feedbacks. Despite these drawbacks, this is the approach employed by default in any case where mass violations might occur in EAM. More sophisticated methods "borrow" the mass from neighboring columns. This can be done locally over an element or a vertical column. Redistributing mass to account for mass conservation violations helps preserve global mass. However, this redistribution of mass in the system could lead to unrealistic states or tendencies. Additionally, there may not be a sufficient amount of mass in the neighboring columns or levels to account for the negative mass that must be fixed, so conservation may still not be entirely assured.
Global mass redistribution could potentially fix the issue by borrowing a very small amount of mass everywhere to fill in all holes. The two main benefits of this approach are that global mass will be conserved and that the amount of mass that must be redistributed is relatively small. Thus, we may expect the nonphysical effects of the movement of mass to be minor. This approach has not been widely adopted for two reasons. One is that instantaneously transporting even a small amount of mass over long distances could introduce perturbations that propagate errors and degrade the fidelity of the long simulations typically run with GCMs. A second concern is that global operations require interprocessor communication, which can greatly slow down simulations.
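The trade-off between the two simplest fixers can be illustrated on a toy column of water masses (illustrative Python, not the EAM implementation).

```python
# Combined PS tendencies have driven two cells negative.
q = [3.0, -0.5, 2.0, -0.5, 5.0]   # toy masses; total = 9.0

def clip_negatives(q):
    # Mass clipping: zero out negative cells; global mass grows (nonphysical).
    return [max(v, 0.0) for v in q]

def global_fix(q):
    # Global redistribution: scale down every positive cell to pay for the
    # holes, preserving the total (assumes enough positive mass exists).
    deficit = -sum(v for v in q if v < 0.0)
    positive = sum(v for v in q if v > 0.0)
    scale = (positive - deficit) / positive
    return [v * scale if v > 0.0 else 0.0 for v in q]

print(sum(clip_negatives(q)))   # 10.0: clipping added 1.0 of spurious mass
print(sum(global_fix(q)))       # 9.0: the original total is conserved
```

The redistribution variant conserves mass but, as noted above, moves it instantaneously between distant cells and requires a global reduction, which is the communication cost the text warns about.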
We explored each of these approaches to address mass conservation violations in the PS version of EAM. There was little difference observed in the climate of the different simulations over our 10 year simulation horizon, so all simulations in this study use the mass-clipping approach. Figure 4 contrasts the total water mass (the sum of vertically integrated vapor, liquid, and ice) and precipitation from the standard model's HSS approach and from PS with mass clipping. For ease of comparison, the climatological annual cycle is removed from both panels. This graphic gives no indication that clipping negative water leads to growth in total mass in the system. In fact, the least squares regression for PS over this 10 year period (dashed red line near zero) has a lower slope than that for HSS. In addition to long-term drift, it is possible that clipping could affect individual events by changing water availability. We will show in Section 5.1 that such changes have no discernible impact on model climate over 10 years. We see in Figure 4 that there is a slight tendency (by ∼10%) for PS to have larger global average anomalies than HSS. This may or may not be related to clipping. The main way that clipping might affect extremes is through increased precipitation, so we include in the lower panel the time series of global average precipitation (with the annual cycle removed). Precipitation does have a nonzero trend over the 10 year simulation because the model has not yet come into equilibrium. The long-term slope of the PS run is, however, lower than that of HSS. This lays to rest the concern that clipping is supplying extraneous precipitation to PS. The PS standard deviation is ∼3% smaller than that of HSS. In short, we find no clear indication that clipping in PS is having an appreciable impact on model behavior. More testing (particularly of the impact on local events and trends over centennial scales) would be necessary before PS could be adopted operationally.

Climatological Impact
Climatologies for a set of 10 year sequentially split and parallel split simulations were evaluated using the E3SM-Diags toolkit (Zhang et al., 2017), which is similar to the AMWG Diagnostics package (https://www.cesm.ucar.edu/working_groups/Atmosphere/amwg-diagnostics-package). The physics time step for both simulations followed the reduced time step criterion required for stability in the PS version of the model (Δt = 15 min). See Table 2 for a full list of model time steps used. Figure 5 shows a set of Taylor diagrams (Taylor, 2001) comparing the parallel split simulation with the sequentially split simulation for the four seasons. It is clear from the figure that the two simulations are remarkably similar. The worst case is over the Northern Hemisphere winter months of December, January, and February, but even here the spatial correlation for all variables is greater than 0.97. The main takeaway from these results is that switching from sequentially split to parallel split coupling has very little climatological impact on the solution. Running longer and including an interactive ocean may change this result, but that is outside the scope of this initial performance-focused exploratory study.

Performance Impact
Our main motivation behind adopting a parallel split paradigm for coupling physics and dynamics is to improve the computational efficiency of the model at high core counts. In principle both of these components are perfectly scalable. As noted earlier, the dycore scales to the total number of elements and the physics scales to the total number of columns, which is 9 times larger for typical E3SM configurations. In the default SS configuration, the overall scalability of the model is limited to the least scalable of these two components, in this case the total number of mesh cells. A PS configuration scales to roughly the sum of the maximum core counts for both physics and dynamics. Thus, there is the potential to extend the scalability of the model by a full order of magnitude using parallel split physics/dynamics coupling.
In reality, the time to solve the dynamics on one mesh cell is larger than the time to solve one physics column. Thus, in PS with one dynamics core per element and one column per physics core, the cores assigned to physics can sit idle for some time. Optimum performance can be gained by properly load balancing the physics and dynamics cores such that the approximate times to complete one dynamics step and one physics step are equal. For all simulations here a reasonable effort was made to optimize the physics/dynamics load balancing on a case-by-case basis to provide the best estimates of PS performance; see Appendix B for details. The ratio of computation time between a mesh cell and a physics column varied by mesh resolution. At the lowest resolution, 7.5°, an individual physics column was on average more computationally expensive than a mesh cell, with an average element:column cost ratio of 0.53. This is most likely related to the increased frequency of radiation, which has the same time step as the physics. For the simulations at higher resolution, the average cost of a mesh cell varied depending on whether radiation was called during the time step. For time steps when radiation was not active, the average cost of a mesh cell was greater than that of a physics column, with ratios of 2.17, 1.52, and 2.51 for resolutions of 2.7°, 1.9°, and 1°, respectively. These ratios dropped to 1.34, 1.11, and 1.51 when considering time steps where radiation was active. A more comprehensive study focusing on load balancing could potentially increase the performance gains obtained in this study.
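The load-balancing condition, equal wall time for one dynamics step and one physics step, fixes the core split once the element:column cost ratio r is known. The arithmetic below is a back-of-envelope sketch, not EAM's actual decomposition logic; it assumes ne = 30 (5,400 elements and 48,602 columns for the 1° mesh) and uses the radiation-active ratio of 1.51 quoted above.

```python
# Balance r*n_elem/c_dyn (dynamics time) against n_col/c_phys (physics time).
def split_cores(total_cores, n_elem, n_col, r):
    # Setting the two times equal gives c_dyn/c_phys = r*n_elem/n_col.
    c_dyn = round(total_cores * r * n_elem / (r * n_elem + n_col))
    return c_dyn, total_cores - c_dyn

print(split_cores(10000, 5400, 48602, 1.51))   # (1437, 8563)
```

Consistent with the text, far more cores go to physics than to dynamics at balance, which is why idle physics cores during the final hyperviscosity and remap steps are costly (Section "Impact of Stability Fix on Performance").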
The timing results are shown in Figure 6. On the positive side, the parallel split simulations scale to roughly triple the number of cores that the sequentially split simulations do. Unfortunately, the parallel split simulations are much slower overall at most core counts. It is not until we reach the highest possible core counts of one or two columns per core that the parallel split runs surpass the performance of the sequentially split solution. At the highest core counts, the parallel split solution is roughly 15-20% faster than the sequentially split solution with the same time step and only slightly faster than the sequentially split solution at the default time step for the 1.9° and 1° resolutions. This pattern is observed on both sets of computational architectures, suggesting that the scalability and performance improvements of the parallel split solution are independent of computational platform. Given that we are, in theory, unlocking ≈10 times more parallelism and running the two most expensive pieces of the atmosphere simultaneously, this result is a bit disappointing. Better results may be possible with other models or more extensive infrastructure changes. Here we see that PS is consistently more expensive in terms of overall computational hours for a similar time to solution when compared with HSS at both the default and reduced time steps. It is only for the fastest times to solution that PS is cheaper than HSS, and only marginally so, and only at the same time step. This reflects the relatively high base computational cost of PS, as measured by the static computational cost of the slowest simulations. This floor in computational cost for PS is ∼56% higher than for the HSS (Δt_reduced) simulations and ∼84% higher than for the default HSS simulations. The general trend of PS achieving a shorter time to solution than HSS at the same time step shows that there is room for improvement, provided the base computational cost of PS can be significantly reduced.

Figure 6. EAM performance in simulated years per compute day (SYPD) for four meshes at multiple core counts for sequential split (solid lines) and parallel split (dashed lines) at the reduced Δt and at the default Δt (dotted line). The vertical dashed lines show the peak scalability for sequentially split. Recall that for the 1.9° resolution there is no reduced Δt simulation. Cori-KNL results are shown in blue; quartz results are shown in red.

Journal of Advances in Modeling Earth Systems
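The distinction between time to solution (SYPD) and overall computational cost can be made concrete with a small conversion. The numbers below are made up for illustration, not measured E3SM results:

```python
# Convert a (core count, SYPD) pair into core-hours per simulated year,
# the cost metric used in the comparison above.

def core_hours_per_simulated_year(cores, sypd):
    """Wall-clock core-hours consumed to simulate one model year."""
    return cores * 24.0 / sypd  # 24 wall-clock hours per compute day

# A PS configuration can be faster (higher SYPD) yet more expensive overall
# because it uses many more cores (hypothetical numbers):
hss_cost = core_hours_per_simulated_year(cores=5_400, sypd=10.0)
ps_cost = core_hours_per_simulated_year(cores=16_200, sypd=12.0)
```

This is why PS can win on time to solution at the scalability limit while still sitting well above the HSS cost floor.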

Impact of Stability Fix on Performance
The most striking observation from the performance results is that the parallel split model underperforms at low core counts. This can be explained by considering the steps taken to stabilize the model after the tendencies from both physics and dynamics are combined at the end of the time step. The hyperviscosity and vertical remapping steps are expensive. As a result, the individual dynamics and physics solvers see a significant speedup in actual time to solution, but a large fraction of the time step is spent in the hyperviscosity and remapping steps applied as the last action in a time step. The hyperviscosity and remapping infrastructure is contained inside the dynamics solver; thus, only the cores assigned a section of the dynamics domain can contribute to the hyperviscosity and remapping steps. As a result, the cores assigned to physics must sit idle waiting for the final stabilization step to be applied. In a simulation with optimal load balancing, the number of cores assigned to physics is significantly greater than the number assigned to dynamics; thus, having physics cores sit idle during hyperviscosity and remapping is inefficient.
Recall that the final application of hyperviscosity and vertical remapping has a cost equivalent to 50% of the dynamics step. In a case where physics and dynamics are properly load balanced, an improvement in the PS efficiency by roughly 50% would be possible without this extra stabilization step. Instead, it is necessary to maintain significant resources for dynamics to minimize the cost of this step. We were able to run the SE dycore stably without moving the final hyperviscosity step, but doing so required a significantly reduced model time step of Δt = 5 min.
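The bottleneck described above can be captured in a toy timing model. This is a sketch under the stated assumptions (stabilization costs 50% of a dynamics step and runs only on the dynamics cores), not a measurement:

```python
# Toy wall-clock model for one coupled time step.

def ps_step_time(t_dyn, t_phys, hv_fraction=0.5):
    """Parallel split: physics and dynamics run concurrently, then the
    hyperviscosity/remapping step (hv_fraction of a dynamics step) runs
    serially on the dynamics cores while physics cores idle."""
    return max(t_dyn, t_phys) + hv_fraction * t_dyn

def ss_step_time(t_dyn, t_phys):
    """Sequential split: one solve after the other, no extra stabilization."""
    return t_dyn + t_phys

# Perfectly balanced case (t_dyn == t_phys == 1): PS takes 1.5 instead of the
# ideal 1.0, so the stabilization step forfeits much of the potential 2x
# speedup over the sequential step time of 2.0.
balanced_ps = ps_step_time(1.0, 1.0)
balanced_ss = ss_step_time(1.0, 1.0)
```

Removing the extra stabilization step, or distributing it over all cores, would close most of this gap, which is the roughly 50% efficiency improvement estimated above.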
Recent work by Herrington et al. (2019), which explores running the physics parameterizations on a coarser finite volume mesh, may obviate the need to apply hyperviscosity after physics. Preliminary parallel split results using a PG2 physics domain have shown that smoothing from mapping between SE and FV meshes is enough to retain stability at the reduced time step.

Conclusions
A parallel split algorithm for coupling physics and dynamics in an atmosphere model has been successfully implemented. Parallel splitting did not show a significant change to model climate. There was a marginal improvement to model performance at the highest core counts. It was necessary to reduce the time step of the simulation to maintain stability; in practice the time step matched the remapping time step. Additionally, the internal diffusion mechanisms of hyperviscosity and vertical remapping needed to be applied after the coupling of physics and dynamics tendencies to keep the model stable. These two requirements incurred an extra computational cost on the parallel split solution that largely countered the performance gains from running dynamics and physics in parallel on separate cores. Note, however, that using a shorter physics/dynamics coupling time step has benefit in itself because it reduces time truncation error. PS may be viewed as an efficient way to accomplish this. At the highest possible core counts a performance improvement of 10-20% over the default model was observed.
It is important to consider that this study represents in some sense a lower bound on the performance improvements that can be gained from parallel splitting. Software infrastructure issues in the E3SM model limited the degree of parallelism we could obtain without a significant overhaul of the code. A more comprehensive rewrite of the atmosphere model could unlock greater performance gains from parallel splitting, as evidenced by the improved scalability of the parallel split simulations. Most of the performance loss is due to the extra stabilization step that is taken after the physics and dynamics tendencies are coupled together. If hyperviscosity and vertical remapping could be (a) made computationally more efficient or (b) be distributed over more cores, this would decrease their performance impact. Similarly, in models where the physics calculations are considerably more expensive, the benefit of being able to apply more cores to physics through parallel splitting will improve the relative performance. Examples of these sorts of models would be bin microphysics or cloud resolving models (CRMs).
Based on these results, parallel splitting of physics and dynamics does offer some potential for improved performance, especially at the highest core counts. However, a great deal more work on code infrastructure and on efficient stabilization mechanisms would be needed before it becomes a viable approach for the domains and machines on which global atmosphere models currently run.
where β and G are both constant and represent fast and slow processes, respectively. The term G can be thought of as a subgrid-scale forcing term which is constant over the time step, while βF = (a + bi)F may represent some combination of damping (real part) and oscillation (imaginary part). We will use time splitting to apply different time steps to each process in the sequence. Suppose dF/dt = −βF is substepped n times at a time step of Δt_β, and dF/dt = G is integrated with a long time step of Δt_G = nΔt_β. If we consider the fully explicit SUS solution where the slow process is applied first in the sequence, we get

Step 1: F* = F(t) + nΔt_β G,

Step 2: F(t + nΔt_β) = (1 − βΔt_β)^n F*.

Note that this has an advantage in global atmosphere models, where the physics parameterizations can use much larger time steps than the dynamics.
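The SUS update above can be exercised numerically. The sketch below uses a real β and illustrative parameter values, and is not taken from the paper:

```python
# One SUS (slow-then-fast) step for the model problem dF/dt = -beta*F + G.

def sus_step(F, beta, G, dt_beta, n):
    F = F + n * dt_beta * G          # Step 1: slow forcing, one long step
    for _ in range(n):               # Step 2: fast process, n substeps
        F = (1.0 - beta * dt_beta) * F
    return F

# Integrate with a stable choice of parameters.
F = 1.0
for _ in range(100):
    F = sus_step(F, beta=2.0, G=1.0, dt_beta=0.1, n=4)
# F settles to the discrete scheme's fixed point, which differs from the
# continuous steady state G/beta by the splitting error.
```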

A1. Time Step Constraints
To illustrate the time step constraint in SUS, we apply the condition |F(t + nΔt_β)/F(t)| ≤ 1 to Equation A3, which yields

(1 − βΔt_β)^n (1 + nΔt_β G/F) ≤ 1,

which implies a recursive formula for the upper bound, where the ∓ is determined by the sign of the ratio F/G; that is, if F/G > 0, then (−), and if F/G < 0, then (+). An obvious difficulty with this analysis is that we must recursively find the largest n such that this relationship still holds. Another issue is the dependence on F for determining Δt_β and n, although some knowledge regarding the behavior of G and/or limits on F can mitigate these issues.
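The dependence of the SUS bound on the ratio G/F can be checked directly by evaluating the amplification factor above. Parameter values here are illustrative, with β taken real for simplicity:

```python
# SUS amplification factor: stable when its magnitude is <= 1.

def sus_amplification(beta, G_over_F, dt_beta, n):
    return (1.0 - beta * dt_beta) ** n * (1.0 + n * dt_beta * G_over_F)

# Same beta, dt_beta, and n, but different forcing-to-state ratios:
a1 = sus_amplification(beta=2.0, G_over_F=1.0, dt_beta=0.1, n=4)
a2 = sus_amplification(beta=2.0, G_over_F=10.0, dt_beta=0.1, n=4)
# a1 is within the stable range; the larger G/F in a2 pushes the factor
# past 1, illustrating why the bound depends on the current value of F.
```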
If we conduct a similar analysis for the PS approach, we get

(1 − βΔt_β)^n + nΔt_β G/F ≤ 1, (A9)

which leads to an upper stability bound where again the sign chosen for ∓ is determined by the sign of the ratio F/G, as with SUS.
Given Equations A7 and A9 and taking Δt_β → 1/Re(β), which is the largest time step that will not change the sign of F, we can clearly see the difference between the two methods. For the SUS approach there is no bound on n. In the PS approach the time step is bounded by the ratio of F and G. In fact, when F and G have the same sign, SUS will always have a larger upper bound on n. When the signs of F and G are opposite, the PS approach only has a higher upper bound on n when Δt_β > 1. Thus, we can conclude that SUS and PS have different time step criteria, and only SUS allows for an unbounded n.
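The contrast between the two bounds can be seen by evaluating both amplification factors at the limiting time step Δt_β = 1/Re(β). Again β is taken real and the values are illustrative:

```python
# Compare SUS and PS amplification factors at the limiting fast time step.

def sus_amp(beta, G_over_F, dt, n):
    return (1.0 - beta * dt) ** n * (1.0 + n * dt * G_over_F)

def ps_amp(beta, G_over_F, dt, n):
    return (1.0 - beta * dt) ** n + n * dt * G_over_F

beta = 2.0
dt = 1.0 / beta  # largest step that does not change the sign of F

# With F and G of the same sign, the damped factor multiplies the forcing
# term in SUS, so it stays bounded for any n; in PS the forcing term is
# added outside the damped factor and grows linearly with n.
sus_vals = [sus_amp(beta, 1.0, dt, n) for n in (1, 4, 16)]
ps_vals = [ps_amp(beta, 1.0, dt, n) for n in (1, 4, 16)]
```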