A suite of standard ocean hydrographic and circulation metrics is applied to the equilibrium physical solutions from 13 global carbon models participating in phase 2 of the Ocean Carbon-cycle Model Intercomparison Project (OCMIP-2). Model-data comparisons are presented for sea surface temperature and salinity, seasonal mixed layer depth, meridional heat and freshwater transport, 3-D hydrographic fields, and meridional overturning. Considerable variation exists among the OCMIP-2 simulations, with some of the solutions falling noticeably outside available observational constraints. For some cases, model-model and model-data differences can be related to variations in surface forcing, subgrid-scale parameterizations, and model architecture. These errors in the physical metrics point to significant problems in the underlying model representations of ocean transport and dynamics, problems that directly affect the OCMIP predicted ocean tracer and carbon cycle variables (e.g., air-sea CO2 flux, chlorofluorocarbon and anthropogenic CO2 uptake, and export production). A substantial fraction of the large model-model ranges in OCMIP-2 biogeochemical fields (±25–40%) represents the propagation of known errors in model physics. Therefore the model-model spread likely overstates the uncertainty in our current understanding of the ocean carbon system, particularly for transport-dominated fields such as the historical uptake of anthropogenic CO2. A full error assessment, however, would need to account for additional sources of uncertainty such as more complex biological-chemical-physical interactions, biases arising from poorly resolved or neglected physical processes, and climate change.
 At the core of the international Ocean Carbon-cycle Model Intercomparison Project (OCMIP) [Orr et al., 2001; Orr, 2002] is a suite of coarse-resolution, global physical models that simulate the large-scale patterns of ocean circulation and water mass properties. The 13 models participating in Phase 2 of the project (OCMIP-2) vary considerably in their architecture, physical parameterizations, surface forcing, and numerical methods, resulting in often widely divergent advective and diffusive fields acting on the dynamical (i.e., temperature, salinity, and density) and passive tracers. An extensive set of hydrographic and circulation metrics has been developed within the physical oceanographic community for the evaluation of large-scale ocean general circulation models. These physical diagnostics provide strong, observation-based constraints on the skill of the simulated model transport that complement the transient tracer assessment approaches emphasized in OCMIP and are directly relevant to the evaluation of ocean carbon cycle simulations.
 The main objectives of this paper are threefold: to quantify the skill of the OCMIP-2 model solutions relative to observations using standard physical measures; to relate identified model-model and model-data differences to underlying model structure or physics; and to characterize the impact of these model physical errors on the OCMIP-2 carbon cycle tracer results. The selection of models in OCMIP-2 includes most of the current main branches of global ocean climate model development. As such, this exercise also acts as a general survey of the field and as a precursor for a comparable, more formal international intercomparison of ocean circulation models.
 Numerical ocean modeling has a long and rich history [Haidvogel and Beckmann, 1999], and our goal here is not to choose one model or set of models over the others participating in OCMIP-2. First, the question of model skill depends greatly on how one judges the solution; a model may perform quite well by some metrics but poorly by others. Second, the results of any particular simulation depend in a complex manner on a variety of factors other than simply the base model code or architecture. As shown below, some of the largest variations among simulated ocean circulations occur for solutions from the same class of numerical models, the major differences being the surface boundary forcing and subgrid-scale physics. Third, the process of conducting the intercomparison exercise stimulated considerable development efforts by a number of the OCMIP modeling groups, and thus the Phase-2 submitted models often do not reflect the current state of those models.
 Given the number of models, architectures, and possible choices of model-data comparisons, an exhaustive analysis of the OCMIP-2 physical simulations is beyond the scope of a single paper. The reader is referred to the references listed in Table 1, which lead into the specific literature on the individual models; short descriptions of the European models are given by Orr. Here we instead focus on a limited set of diagnostics and integrative measures that are both relevant to the large-scale carbon cycle and applicable to most of the models in OCMIP-2. In particular, we present results on surface temperature and salinity, surface mixed layer depth, meridional heat and freshwater transport, the 3-D temperature and salinity fields, and meridional overturning circulation. Some of these physical diagnostics were not included in the original OCMIP-2 specifications and thus are not available for all models. A list of the 13 OCMIP-2 models, the abbreviations used to refer to the simulations in the text and figures, and some of their main features (architecture, resolution, forcing, parameterizations) are presented in Table 1.
Table 1. Physical Model Descriptions for OCMIP-2 Simulations (a)
(a) EBM: Energy Balance Model; HOR, ISOP, GM: horizontal, isopycnal, and Gent and McWilliams parameterizations; TKE: turbulent kinetic energy closure; KPP: K-Profile Parameterization (nonlocal boundary layer parameterization); KT: Kraus and Turner; N: Brunt-Väisälä frequency.
 The OCMIP-2 tracer experiments can be roughly grouped into those associated with the pathways and timescales of ocean circulation, the physical and biological controls on the natural ocean carbon system, and the historical and future oceanic uptake of anthropogenic carbon. Simulations of the transient penetration of chlorofluorocarbons and bomb-radiocarbon into the thermocline and near deep-water formation sites were conducted to assess ocean ventilation over decadal timescales. Equilibrium natural radiocarbon is used in a similar exercise to constrain centennial to millennial circulation in intermediate and deep interior waters. A pair of equilibrium experiments was carried out simulating the large-scale distribution of ocean dissolved inorganic carbon (DIC) under pre-industrial atmospheric carbon dioxide levels. The abiotic carbon case incorporates only thermodynamic and air-sea gas exchange forcing on DIC and thus captures the so-called solubility carbon pump. The biotic carbon case adds an active biogeochemical component with organic and inorganic carbon particle export, subsurface remineralization, and nutrient and oxygen cycling. Starting from the equilibrium abiotic carbon solution, transient anthropogenic carbon uptake experiments were completed in which atmospheric carbon dioxide levels were increased over time from the mid-1700s to the present (1990s) using a historical reconstruction and into the future using two different IPCC atmospheric scenarios, IS92a (until 2100) and S650 (until 2300).
 The predicted OCMIP-2 values for key integrated global biogeochemical quantities cover a significant range: export production (±40%), chlorofluorocarbon uptake (±30%), modern (1990s) anthropogenic CO2 uptake (±25%), and future (2100) anthropogenic CO2 uptake (±30%), with the most pronounced differences among the models occurring in the Southern Ocean. Because of the manner in which the intercomparison was conducted, any model-model variations simply reflect differences in the simulated physical circulation. Our overall finding is that a significant fraction of the large model-model ranges in predicted tracer and carbon fields and integrated properties over the OCMIP-2 simulations results from the inclusion of physically implausible solutions. Therefore the model-model spread likely overstates the uncertainty in our current understanding of the ocean carbon system, particularly for transport-dominated fields such as the historical uptake of anthropogenic CO2. Similar general results are found by Matsumoto et al. [2004], who report that only about a quarter of the OCMIP-2 model suite is consistent with data-based metrics using chlorofluorocarbon and radiocarbon observations. However, these conclusions need to be tempered by the fact that matching physical and tracer metrics alone is a necessary but insufficient condition for accurately predicting ocean anthropogenic carbon uptake. Further, the range of OCMIP-2 model values does not account for persistent physical biases arising from the exclusion of key physical processes (e.g., mesoscale eddies, coastal dynamics) and from errors in surface forcing parameterizations in this class of global, coarse spatial resolution ocean models.
2. Models and Simulations
 Ocean models can be broadly categorized based on vertical discretization [Griffies et al., 2000]. The majority of the OCMIP-2 simulations are z-coordinate, where vertical grid levels are aligned with the local geopotential. This coordinate system has been used since the earliest global ocean models, with many of the OCMIP-2 models tracing their origin back to work at the Geophysical Fluid Dynamics Laboratory (GFDL) in the late 1960s and early 1970s [Bryan, 1969; Cox, 1984]. An advantage of the z-coordinate is that the horizontal pressure gradient driving geostrophic flow can be computed in a straightforward manner with a small discretization error. A disadvantage is the occurrence of spurious diapycnal mixing and upwelling in the well-stratified, adiabatic ocean interior in regions where neutral (approximately isopycnal) surfaces are not oriented horizontally [Gent and McWilliams, 1990; Danabasoglu et al., 1994]. The artifactual upwelling is most apparent along western boundaries such as the Gulf Stream. Problems can also occur in representing the interaction of the flow field with topography, a relevant example being the bottom boundary layers associated with dense overflows [Beckmann and Döscher, 1997; Doney and Hecht, 2002].
 Isopycnal models form a second class of ocean simulations and are represented in OCMIP-2 by a single entry, the Norwegian NERSC model based on MICOM [Bleck et al., 1992]. In an isopycnal model, the vertical coordinate is divided into a series of discrete layers of uniform density. The depth and thickness of these layers are allowed to evolve with time in the simulation. Spurious diapycnal mixing is eliminated in isopycnal models by construction, but other issues arise such as (1) how to connect the discretized, adiabatic interior to the diabatic surface layer where density varies continuously, (2) how to calculate horizontal pressure gradients across sloping model layers, and (3) how to treat nonlinearities of the seawater equation of state and the choice of density coordinate. Global isopycnal models also can have difficulties in high-latitude, deep-convection zones where a density-based discretization results in very low vertical resolution of homogeneous water columns. Other formulations not represented in OCMIP-2 include terrain-following, sigma-coordinate models popular for coastal and upper ocean applications [Haidvogel et al., 1991] and hybrid models that attempt to combine the best aspects of two (or more) coordinate systems. A good example of the latter is HYCOM, a merger of a z-coordinate upper ocean model to represent the surface boundary layer with an isopycnal interior [Bleck, 2002].
 Surface forcing is another major difference among the OCMIP-2 simulations. The OCMIP protocols (http://www.ipsl.jussieu.fr/OCMIP) did not specify a particular physical surface forcing, allowing the groups to use preexisting methods and data sets. While this can be a hindrance in attributing differences to different model architectures or physical parameterizations, the more flexible approach allowed a larger number of groups to participate. Some caution is warranted in comparing model results, however, because even within a single model, large variations in the simulated hydrographic fields can be generated using what is considered a reasonable range of heat and freshwater boundary conditions [Large et al., 1997]. Similar issues may also arise for the wind-driven circulation fields in OCMIP-2 because of the variations in momentum forcing.
 The surface heat and freshwater forcing in OCMIP-2 spans the gamut from simple surface restoring of temperature and salinity (CSIRO, IGCR, MPIM) to coupling with an atmospheric energy balance model (PIUB). Simple temperature and salinity restoring techniques alone do not allow for an accurate estimation of both the surface property and the flux, since the flux is defined to go to zero as the model field approaches the observations. A number of the models (IPSL, MIT, PRINCE, SOC) avoid this inconsistency by also specifying net surface heat and freshwater fluxes from climatology in addition to temperature and salinity restoring [Barnier et al., 1995]. Others (LLNL, NERSC, NCAR, UL) prescribe some fluxes (e.g., solar insolation, precipitation) while computing the turbulent heat and freshwater fluxes using empirical bulk flux formulae, which for a fixed atmospheric state results in a large effective restoring term on temperature [Doney et al., 1998]. Some form of weak restoring is often required on surface salinity in these formulations. The AWI, IGCR, and PIUB models use annual mean rather than seasonal forcing. Four of the OCMIP-2 models (LLNL, MPIM, PIUB, and UL) have active sea-ice models, whereas the NCAR model reverts to strong temperature and salinity restoring under sea ice. A number of recent papers have shown that simulated deep-water properties, thermohaline circulation, and transient tracer fields can be quite sensitive to the under-ice boundary conditions [Duffy and Caldeira, 1997; Doney and Hecht, 2002].
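 The inconsistency of pure restoring can be illustrated with a minimal sketch. The flux form used here, Q = ρ c_p Δz (T_obs − T_model)/τ, is a common way of writing a restoring term, but the function names, the 50 m layer thickness, the 30-day timescale, and the physical constants are all assumptions chosen for illustration; they are not taken from the OCMIP-2 protocols or from any participating model.

```python
RHO0 = 1025.0  # reference seawater density, kg m-3 (assumed value)
CP = 3990.0    # specific heat of seawater, J kg-1 K-1 (assumed value)


def restoring_heat_flux(t_model, t_obs, dz=50.0, tau_days=30.0):
    """Heat flux (W m-2, positive into the ocean) implied by restoring a
    surface layer of thickness dz (m) toward t_obs with timescale tau_days.

    dz and tau_days are hypothetical defaults, not OCMIP-2 settings.
    """
    tau_seconds = tau_days * 86400.0
    return RHO0 * CP * dz * (t_obs - t_model) / tau_seconds


def combined_heat_flux(t_model, t_obs, q_clim, dz=50.0, tau_days=30.0):
    """Prescribed climatological flux q_clim plus a restoring correction.

    With this form the net flux stays at q_clim, rather than zero,
    when the model SST matches the observations.
    """
    return q_clim + restoring_heat_flux(t_model, t_obs, dz, tau_days)
```

 With pure restoring, a model that reproduces the observed SST exactly receives no heat flux at all, whereas the combined form retains the climatological flux in that limit.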
 The OCMIP-2 models have typical horizontal resolutions of 2°–5° and thus do not resolve mesoscale eddies. The lateral subgrid-scale mixing in four of the models (IGCR, MPIM, PIUB, and UL) is parameterized using horizontal tracer diffusion, while most of the remaining models use either isopycnal mixing or a combination of isopycnal mixing and bolus velocity following Gent and McWilliams [1990] and Gent et al. In many of the models, enhanced vertical (diapycnal) mixing occurs via convective instability; others incorporate more sophisticated surface boundary layer models and subsurface diapycnal mixing schemes, including the Kraus and Turner bulk mixed layer model coupled with the Pacanowski and Philander mixing scheme (SOC), turbulent kinetic energy (TKE) models [Gaspar, 1988; Gaspar et al., 1990; Kantha and Clayson, 1994] (IPSL, NERSC, and UL), and the K-Profile Parameterization (KPP) model [Large et al., 1994] (NCAR). A bottom boundary layer scheme [Campin and Goosse, 1999] is included in the UL model to better simulate the flow of dense water down topographic features.
 Most of the OCMIP-2 models are primitive equation models, solving prognostically for the evolution in three dimensions of velocity, temperature, and salinity (or in the case of the NERSC isopycnal model, layer thickness and salinity). The IPSL model [Aumont et al., 1999] includes subsurface restoring terms for temperature and salinity (1-year time constant) that assure a close match to the observations over the full water column but introduce artificial diabatic terms. The MPIM model is based on the Large-Scale Geostrophic model with fully implicit time stepping allowing for long (1 month) time steps. In the PIUB model [Stocker et al., 1992], the primitive equations are zonally averaged and solved for Atlantic, Pacific, and Indian basins as well as a connecting Southern Ocean. The AWI model [Schlitzer, 2000, 2002] is an irregular grid, box inverse model whose circulation, surface forcing, and biogeochemical parameters are iteratively adjusted to minimize the model-data misfit for temperature, salinity, oxygen, nutrients, and inorganic carbon using adjoint techniques. All of the OCMIP-2 physical solutions except for the NERSC model have been run to an approximate equilibrium (several thousand or more deep-water years), in some cases using numerical acceleration techniques [Bryan, 1984; Danabasoglu et al., 1996]. Though no equilibrium measures were specified in OCMIP-2 for physical quantities, sufficiently strict criteria were set for the abiotic DIC simulation (globally integrated air-sea flux less than 0.01 Pg C yr−1) and for natural radiocarbon (98% of the ocean volume should have a drift of less than 0.001 permil yr−1, equivalent in radiocarbon age to a change of 8.27 years per 1000 years of simulation).
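 The correspondence between the radiocarbon drift criterion and the quoted age change can be checked with a short calculation. The sketch below assumes the usual relation age = −τ ln(1 + Δ14C/1000), linearized about Δ14C = 0, with a mean life τ derived from a 5730-year half-life; that half-life, the function names, and the per-cell list layout are assumptions made here, not part of the OCMIP-2 specification.

```python
import math

# 14C mean life in years, assuming a 5730-year half-life (about 8267 yr).
MEAN_LIFE_YR = 5730.0 / math.log(2.0)


def age_drift(drift_permil_per_yr, years=1000.0, delta14c_permil=0.0):
    """Radiocarbon age change (yr) accumulated over `years` of simulation
    for a given Delta-14C drift, linearized about delta14c_permil."""
    d_delta = drift_permil_per_yr * years
    return MEAN_LIFE_YR * d_delta / (1000.0 + delta14c_permil)


def meets_equilibrium_criteria(air_sea_flux_pgc_yr, drift_permil_yr, cell_volume):
    """Apply the two thresholds quoted in the text.

    drift_permil_yr and cell_volume are per-grid-cell lists (hypothetical
    layout); the flux is the globally integrated abiotic air-sea CO2 flux.
    """
    settled = sum(v for d, v in zip(drift_permil_yr, cell_volume)
                  if abs(d) < 0.001)
    volume_fraction = settled / sum(cell_volume)
    return abs(air_sea_flux_pgc_yr) < 0.01 and volume_fraction >= 0.98
```

 Under these assumptions, a drift of 0.001 permil yr−1 sustained for 1000 years corresponds to an age change of about 8.27 years, consistent with the figure quoted above.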
 A series of transient tracer and carbon cycle simulations were conducted by the various OCMIP-2 modeling groups using identical chemical and biological parameterizations and forcing [Orr, 2002]. For each individual model, all of the tracer simulations use the same base, steady state model physics. Our focus here will be on characterizing the skill of OCMIP-2 model physics and exploring the implications of physical model-data errors on those tracer simulations for which we have solutions from all or nearly all thirteen modeling groups: chlorofluorocarbon (CFC-11 and CFC-12) [Dutay et al., 2002], natural and bomb-radiocarbon, equilibrium abiotic carbon (solubility pump) and biotic carbon (biological and solubility pumps), and historical and future anthropogenic carbon. Mantle 3He is an important diagnostic of model deep-water circulation but has only been simulated in a subset of the OCMIP-2 models and thus will not be discussed further here [Dutay et al., 2004]. The formulations of the specific OCMIP-2 tracer and carbon simulations are described in detail in a series of “how to” documents on the OCMIP web page.
 The following analysis has been conducted on the submitted archive of OCMIP-2 model results. Monthly (annual) averaged fields are used from each model integration taken as representative of equilibrium conditions. We follow a uniform convention for displaying the results: multipanel color contour plots are used for two-dimensional fields with the model panels in alphabetical order and the observations, where available, in the upper right-hand corner. In many cases, model-data difference plots (model minus data) are used. Zonal averages are shown as line plots with a consistent color and symbol scheme.
3.1. Annual Mean Sea Surface Temperature and Salinity
 The skill of the models in replicating observed sea surface temperature (SST) and salinity (SSS) fields has implications for whether the models can form water masses with the correct properties and local vertical stratification. Errors in SST and SSS will also impact model surface values of dissolved inorganic carbon (DIC), alkalinity, and the seawater partial pressure of CO2 (pCO2) and thus the magnitude and pattern of air-sea CO2 flux in the abiotic and biotic carbon simulations. The annual average maps of model-data SST and SSS difference between the OCMIP-2 simulations and the Levitus World Ocean Atlas climatology [Levitus et al., 1994; Conkright et al., 1998] are presented in Figures 1 and 2. Because almost all of the models include some form of either explicit or implicit surface restoring terms, the model SST and SSS fields and surface heat and freshwater fluxes should be considered in conjunction. Accurate simulation of water mass formation rates requires skillful prediction of both surface properties and surface fluxes.
 The large-scale zonal patterns in SST are captured by all of the models, but there are some significant temperature biases in some of the simulations. For example, the IGCR, MPIM, NERSC, and PIUB simulations are substantially warmer than observations in the Southern Ocean. In the northern North Atlantic, the CSIRO, NERSC, and PIUB models stand out as having considerably higher mean temperatures while the AWI, IGCR, and MPIM models are too low; for the IGCR simulation, this is likely due to the additional cooling applied to that model in this region. Nearly all of the models display a cold bias of varying magnitude relative to the Levitus data in the equatorial Pacific and in some cases the equatorial Atlantic as well. Such error patterns can reflect problems with excessive equatorial upwelling or an overly shallow and cold thermocline, both of which may be related to the prescribed wind stress fields or poor vertical resolution.
 Equatorial upwelling zones are regions of substantial anthropogenic CO2 uptake because of the constant resupply of older thermocline waters to the surface [Sarmiento et al., 1992]. Excessive model upwelling rates, as suggested by the cold SST biases, would tend to bias model historical and future anthropogenic CO2 estimates toward higher values. The colder, subsurface waters are also nutrient rich. Therefore the excessive upwelling might also lead to overly high biological productivity in these models. The cold bias may also be related to the equatorial nutrient trapping common in coarse resolution models [Najjar et al., 1992], which can result when the equatorial undercurrent and upper ocean vertical structure are not well resolved [Aumont et al., 1999].
 Regional temperature errors of ±2°C are commonly found in the model poleward flowing western boundary currents associated with biases in the simulated current positions and with overly broad structure, a common problem in coarse-resolution ocean models [Large et al., 1997]. Surface temperature errors of similar magnitude are seen in the Antarctic Circumpolar Current (ACC) region, often appearing as dipole patterns associated with a mismatch between the model and observed flow path. These variations in the simulated ACC trajectory, along with differences in model basin-to-basin upwelling velocities, mixed layer depth (see below), and the lateral pathways for mode and intermediate water ventilation of SAMW and AAIW, are also likely to contribute to the large Southern Ocean regional variations seen in the simulated air-sea CO2 flux across the OCMIP-2 solutions [Orr, 2002]. In a recent summary, Sabine et al. show that about 40% of the global anthropogenic CO2 uptake is associated with SAMW and AAIW. The AWI model has large small-scale temperature errors relative to Levitus, but this is expected given that the particular model is formulated with data from individual hydrographic sections that often deviate significantly from the Levitus climatology. Furthermore, the AWI model contains no explicit diffusion term.
 The model-data error in SSS appears to be closely tied to the strength of the applied surface salinity restoring term. Generally small model-data errors are found in the open-ocean domains for a large group of models (CSIRO, IGCR, IPSL, LLNL, MPIM, PRINCE, PIUB, and SOC). Exceptions are in the Arctic and in other areas near major river discharges, where differences in the choice of observational SSS climatology likely have a large effect. Model-data errors also occur off eastern North America, where the southward flow of low-salinity Labrador Sea Water along the coast is likely underestimated in these coarse-resolution solutions, in part due to problems with the Gulf Stream trajectory [Doney et al., 2003b]. The NCAR simulation has somewhat larger SSS model-data errors but uses relatively weak salinity restoring in addition to prescribed precipitation and an evaporation term that depends upon model SST. Systematic SSS errors are found in the NERSC and MIT models, including excess salinity in the intermediate and deep water formation regions of the North Atlantic (Irminger and Labrador Seas), the North Pacific, and the Southern Ocean. In the MIT solution, the latitudinal gradients in SSS between the salty subtropics and fresher equatorial and subpolar zones are in general greatly damped compared to data. The weak surface salinity gradients and freshwater transport estimates (see later) in this MIT configuration are attributable to the surface boundary condition for salinity, which consisted exclusively of a weak restoring toward Levitus sea surface salinities with a damping timescale of 2 years.
3.2. SST and SSS Annual Cycles
 The annual cycle in SST is a good measure of the seasonal dynamics, surface forcing, and boundary layer parameterizations of large-scale ocean models. We compute the amplitude of the annual SST cycle as one half the maximum minus minimum monthly SST at each grid point. In the Levitus World Ocean Atlas climatology (Figure 3), the SST amplitude exceeds 2°–3°C in the midlatitudes of both hemispheres, reaching as high as 5°–6°C in the western North Pacific and North Atlantic. Similar patterns and slightly weaker magnitudes are found in a number of the seasonally forced models (IPSL, LLNL, MIT, MPIM, NCAR, NERSC). For example, the NCAR model is driven by a bulk flux scheme that closely tracks observed SST [Doney et al., 1998, 2003b]. The Northern Hemisphere SST cycles in the PRINCE and UL models are somewhat larger than observed, and the UL solution also shows a second band of high seasonal SST variability in the Southern Ocean near the Antarctic not found in the other models. The SST amplitudes in the CSIRO model, which uses temperature and salinity restoring, are systematically reduced relative to observations and show some phase decorrelation with observations (correlation coefficient R∼0.8; all grid points and all months). For the remainder of the seasonal models, the phase agreement of the simulated seasonal SST cycle with observations (not shown) is generally reasonably good, with correlation coefficients exceeding 0.9.
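 The amplitude definition above can be written as a minimal sketch. The nested-list layout sst[month][j][i] and the function names are assumptions for illustration; land masking and missing values are not handled.

```python
def annual_amplitude(monthly_values):
    """One half of (maximum minus minimum) of the monthly values
    at a single grid point."""
    return 0.5 * (max(monthly_values) - min(monthly_values))


def amplitude_map(sst):
    """Apply annual_amplitude at every grid point.

    sst is indexed as sst[month][j][i] (hypothetical layout); the result
    is indexed as amp[j][i].
    """
    n_months = len(sst)
    n_lat = len(sst[0])
    n_lon = len(sst[0][0])
    return [
        [annual_amplitude([sst[m][j][i] for m in range(n_months)])
         for i in range(n_lon)]
        for j in range(n_lat)
    ]
```

 The same function applied to the model and to the climatology gives directly comparable amplitude maps, provided both are first placed on a common grid.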
 The seasonal SSS cycle (not shown) is not captured as well as that of SST in any of the OCMIP-2 models. The models tend to have significantly smaller standard deviations of the monthly values from the annual mean relative to the World Ocean Atlas data, ranging from about 10% (MIT, MPIM) to 70% (NCAR) of observed. The correlation coefficients are also low, 0.3–0.4.
 Thermal effects play a considerable role in driving the seasonal variation in surface pCO2, especially in the subtropics [Takahashi et al., 2002]. Warming a surface water parcel by 1°C, without any other changes in salinity, DIC, or alkalinity levels, results in an increase in pCO2 of ∼4% or roughly 14 μatm for water in equilibrium with a present-day atmosphere of about 370 μatm [Takahashi et al., 1993]. Model-data differences of 1–2°C in the seasonal SST amplitude are common in the midlatitudes, with even larger basin-wide errors for CSIRO and MIT, leading to significant differences in the thermally driven pCO2 and thus seasonal air-sea CO2 flux. Errors in the annual SST amplitude will also degrade model estimates for the seasonal net out-gassing of oxygen from the upper ocean [Najjar and Keeling, 1997].
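 The size of the thermal effect can be estimated with a short calculation. The sketch below assumes an exponential, isochemical temperature dependence, pCO2(T2) = pCO2(T1) exp[γ (T2 − T1)], and takes the ∼4% per °C sensitivity quoted above as the default γ; the function names are hypothetical, and the slightly higher coefficient of 0.0423 °C−1 often attributed to Takahashi et al. [1993] can be passed in instead.

```python
import math


def pco2_at_temperature(pco2_ref, t_ref, t_new, gamma=0.04):
    """Seawater pCO2 (same units as pco2_ref) after a temperature change
    at constant salinity, DIC, and alkalinity.

    gamma is the fractional sensitivity per degree C; the default follows
    the ~4% figure in the text and is an assumption, not a fitted value.
    """
    return pco2_ref * math.exp(gamma * (t_new - t_ref))


def thermal_pco2_change(pco2_ref, delta_t, gamma=0.04):
    """Change in pCO2 caused by a temperature change delta_t (deg C)."""
    return pco2_ref * (math.exp(gamma * delta_t) - 1.0)
```

 For a 1°C warming of water at 370 μatm this gives a change of roughly 15 μatm, the same order as the estimate in the text, so a 1–2°C error in seasonal SST amplitude maps onto a thermally driven pCO2 error of a few tens of μatm.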
3.3. Mixed Layer Depth
 Winter mixed layer depths are indicative of the seasonal supply of nutrient and dissolved inorganic carbon (DIC) rich subsurface waters to the surface ocean and of the regional patterns and magnitudes of mode, intermediate, and deep-water formation. The average mixed layer depths for the boreal and austral winter seasons are shown in Figures 4a–4b for the OCMIP-2 simulations and the Levitus World Ocean Atlas climatology [Levitus et al., 1994; Conkright et al., 1998]. The mixed layer depth is diagnosed in a consistent fashion across all of the models using a density criterion of Δσθ = 0.125 kg m−3 relative to the surface. For the models without seasonal forcing, the same annual average mixed layer field is shown in both the boreal winter and austral winter figure panels, and the model fields are cropped to exclude the middle to high latitudes of the summer hemisphere.
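 A density-threshold diagnosis of this kind can be sketched as follows. The linear interpolation between levels, the fallback to the deepest level when the threshold is never reached, and the function name are assumptions made here; the text specifies only the 0.125 criterion relative to the surface.

```python
def mixed_layer_depth(depth, sigma_theta, delta_sigma=0.125):
    """Depth (m) at which potential density first exceeds the surface
    value by delta_sigma (kg m-3).

    depth and sigma_theta are profiles ordered from the surface downward.
    The crossing depth is linearly interpolated between levels; if the
    threshold is never reached, the deepest level is returned.
    """
    target = sigma_theta[0] + delta_sigma
    for k in range(1, len(depth)):
        if sigma_theta[k] >= target:
            s_above, s_below = sigma_theta[k - 1], sigma_theta[k]
            frac = (target - s_above) / (s_below - s_above)
            return depth[k - 1] + frac * (depth[k] - depth[k - 1])
    return depth[-1]
```

 Applying the same function to every model and to the climatological profiles keeps the comparison independent of each model's internal boundary layer scheme.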
 The Levitus data indicate winter mixed layer depths of 150–250 m in the Kuroshio extension, western Bering Sea, and Gulf Stream regions, with significantly deeper mixing (500 to >1500 m) in the eastern subpolar North Atlantic, Irminger, Labrador, and Nordic Seas (Figure 4a). The models are in general agreement with this pattern, but with substantial variations. For example, a number of the models locate the maximum, subpolar North Atlantic mixed layer depth either in the Irminger Sea (IGCR, LLNL, NCAR, PRINCE) or equally in the Irminger and Labrador Sea (MIT, NERSC). Experiments with the NCAR model show that subpolar convection location is resolution dependent, with Labrador Sea convection developing only at higher resolutions [Doney et al., 2003b]. Unlike most of the simulations and the Levitus data, the AWI model concentrates the deep winter mixing in the western North Atlantic and somewhat farther south.
 In the North Pacific, the AWI and IGCR models appear to overestimate the mixing depths east of Japan in the Kuroshio. Many of the models exhibit patches of deep winter mixed layers (>400 m) in the Sea of Japan, Sea of Okhotsk, and/or western Bering Sea regions (AWI, CSIRO, IGCR, LLNL, MIT, NCAR, NERSC, PRINCE, SOC, UL), somewhat analogous to observations of deep ventilation in the Sea of Okhotsk that then leads to North Pacific intermediate water formation through internal mixing processes [Talley, 1993]. On the basis of the OCMIP-2 CFC-11 results [Dutay et al., 2002], three of the models (MIT, NERSC, UL) produce too much North Pacific intermediate water.
 Deep winter mixed layers in the Southern Ocean are commonly observed in the band of 40°S–55°S associated with convective mixing and Subantarctic Mode Water (SAMW) formation and subduction and then again farther south along the continent in regional Antarctic Bottom Water (AABW) production zones. The winter convection patterns in the open Southern Ocean are quite similar in several of the z-coordinate models with isopycnal mixing and the Gent-McWilliams scheme (CSIRO, IPSL, LLNL, NCAR, SOC) and compare well against the Levitus climatology and chlorofluorocarbon-based SAMW ventilation estimates [Dutay et al., 2002]. A band of intermediate mixed layer depths (400–600 m) extends along the path of the Antarctic Circumpolar Current (ACC) in the Indian and Pacific basins with weaker mixing in the ACC in the Atlantic. An exception is the MIT model, which shows deeper mixing near 50°S in the Atlantic. In the PRINCE model the Pacific winter mixed layer depths are shallower than the data, leading to a smaller than observed CFC uptake. Consistent with previous studies [Danabasoglu et al., 1994], two of the z-coordinate models with horizontal mixing (IGCR and UL) significantly overestimate midlatitude convection, which is also reflected in their unrealistic ventilation patterns based on simulated chlorofluorocarbons [Dutay et al., 2002].
 On the basis of the winter mixed layer depths, bottom waters form to a greater or lesser extent in the Weddell Sea in all of the simulations and in the Ross Sea with the exception of the PRINCE model. In agreement with recent bottom-water observations [Orsi et al., 1999], deep winter mixing is found along the continent in the Indian Ocean sector in a handful of models: the AWI inverse calculation, the two 3-D simulations with active sea-ice (LLNL and UL), and the IGCR and NERSC solutions, which tend to have overly broad regions of deep mixing across much of the Southern Ocean. These differences in the spatial patterns of bottom-water production are also expressed in the simulated deep-water CFC distributions [Dutay et al., 2002]. This suggests that, for these coarse-resolution ocean models, deep convection is a useful proxy for simulated bottom-water formation even though the mechanisms in high-resolution models and the real ocean are considerably more complex and involve topographic bottom boundary currents. As discussed by Doney and Hecht [2002], observational undersampling of winter conditions in the Indian Ocean sector may contribute to the poor performance of models using surface salinity restoring.
 Summer mixed layer depths in the North Pacific and Atlantic approach 25–50 m or less in most models, though values of 50–100 m are observed in the CSIRO, SOC, and NERSC solutions. The CSIRO model also predicts deep, vertically homogeneous water columns (>300 m) over the summer in the Nordic Seas. Similar mixing patterns are observed in the Southern Ocean austral summer, with significant mixed layer depths (100–300 m) observed in the CSIRO, NERSC, and SOC solutions but not in the other models. The deeper summer mixing in the NERSC and SOC simulations could be attributed to the presence of a bulk surface boundary layer model (TKE and Kraus and Turner, respectively). However, that does not explain the deeper mixing in the CSIRO model or the shallow mixing in the other models with active surface boundary layer parameterizations, including some also with TKE (UL and IPSL).
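 The mixed layer depths compared above depend on the criterion used to diagnose them from a density profile. The following is a minimal sketch of one common approach, a density-threshold criterion; the threshold value (0.125 kg m⁻³ relative to the shallowest level), the function name, and the sample profiles are illustrative assumptions and are not taken from the OCMIP-2 protocol.

```python
import numpy as np

def mixed_layer_depth(depth_m, sigma, threshold=0.125):
    """Depth at which potential density first exceeds the shallowest value by `threshold`.

    depth_m   : 1-D depths (m), increasing downward
    sigma     : 1-D potential density anomaly (kg m^-3) at those depths
    threshold : assumed density criterion (kg m^-3); must be positive
    Returns the deepest sampled depth if the threshold is never reached.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    depth = np.asarray(depth_m, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    target = sigma[0] + threshold
    exceeds = np.nonzero(sigma >= target)[0]
    if exceeds.size == 0:
        return float(depth[-1])  # column is mixed to the bottom of the profile
    k = exceeds[0]  # k >= 1 because sigma[0] < target
    # Linear interpolation between the bracketing levels.
    frac = (target - sigma[k - 1]) / (sigma[k] - sigma[k - 1])
    return float(depth[k - 1] + frac * (depth[k] - depth[k - 1]))

# Hypothetical winter (weakly stratified) and summer (strongly stratified) profiles.
depth = np.array([5.0, 25.0, 50.0, 100.0, 200.0])
sigma_winter = np.array([26.00, 26.00, 26.01, 26.05, 26.30])
sigma_summer = np.array([24.00, 24.50, 25.50, 26.00, 26.30])
mld_winter = mixed_layer_depth(depth, sigma_winter)
mld_summer = mixed_layer_depth(depth, sigma_summer)
```

 Because the diagnosed depth is sensitive to both the threshold and the vertical grid, model and climatological fields would need to be processed with the same criterion for a like-for-like comparison.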
3.4. Surface Heat and Freshwater Fluxes and Meridional Transports
 Surface net heat and freshwater flux fields can provide important information on water mass formation processes in ocean climate models [Doney et al., 1998]. However, the uncertainties in observation-based surface heat and net freshwater flux estimates are relatively large [Large and Nurser, 2001], and thus it is more difficult to assess model skill against them. The large-scale regional patterns of net heat and water fluxes for the OCMIP-2 models are broadly similar across all of the simulations and in general agreement with observational data sets. However, there is often considerable local variation among the models in the magnitude of the fluxes.
 Somewhat more reliable are estimates of the meridional heat and freshwater transports, which can be compared against data from hydrographic sections and atmospheric analyses. These transports are closely related to the surface fluxes since at equilibrium the zonally integrated net fluxes are simply equal to the divergence of the transports. The global and Atlantic basin transports are shown for heat (Figure 5) and freshwater (Figure 6) together with a variety of observational estimates from ocean sections, ocean inverse models, air-sea flux estimates and atmospheric residual calculations.
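 The equilibrium relation between surface fluxes and transports can be illustrated with a short calculation. The sketch below integrates a zonally averaged net surface heat flux northward from the southern boundary to obtain the implied meridional heat transport. The function names, the idealized flux profile, and the sign convention (flux positive into the ocean, transport positive northward) are assumptions chosen for illustration and do not reproduce the diagnostics used for Figures 5 and 6.

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean Earth radius (m)

def band_areas(lat_edges_deg, ocean_fraction=1.0):
    """Ocean area (m^2) of each latitude band bounded by consecutive edges."""
    phi = np.deg2rad(np.asarray(lat_edges_deg, dtype=float))
    return 2.0 * np.pi * EARTH_RADIUS_M**2 * np.diff(np.sin(phi)) * ocean_fraction

def implied_northward_transport_pw(lat_edges_deg, net_flux_wm2, ocean_fraction=1.0):
    """Northward heat transport (PW) at each band edge implied by the surface flux.

    At equilibrium the transport divergence balances the zonally integrated
    surface flux, so the transport is the running integral of the flux from
    the southern boundary, where the transport is taken to be zero.
    """
    areas = band_areas(lat_edges_deg, ocean_fraction)
    band_input_w = np.asarray(net_flux_wm2, dtype=float) * areas
    transport_w = np.concatenate(([0.0], np.cumsum(band_input_w)))
    return transport_w / 1.0e15

# Idealized flux: tropical heating and high-latitude cooling (W m^-2),
# with the area-weighted mean removed so that the global budget closes.
lat_edges = np.linspace(-90.0, 90.0, 37)
lat_mid = 0.5 * (lat_edges[:-1] + lat_edges[1:])
areas = band_areas(lat_edges)
flux = 50.0 * np.cos(2.0 * np.deg2rad(lat_mid))
flux -= np.sum(flux * areas) / np.sum(areas)
transport = implied_northward_transport_pw(lat_edges, flux)
```

 In this idealized case the transport is poleward in both hemispheres and returns to zero at the northern boundary. With a real flux product, any residual at the northern boundary measures the global imbalance of the flux estimate, which is one reason the implied transports are sensitive to flux uncertainties.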
 In the Northern Hemisphere, the global heat transports are broadly similar across the analyzed models (CSIRO, IGCR, MIT, MPIM, NCAR, NERSC, PIUB, and PRINCE), with peak ocean northward heat transports of 1.2–1.7 PW near 20°N. There are, however, considerable differences among the models in where the simulations place the bulk of the Northern Hemisphere cooling. A group of simulations with the GM isopycnal mixing parameterization (CSIRO, MIT, PRINCE) tends to have greater heat transport divergence in the subtropics relative to the z-coordinate models with horizontal mixing (IGCR, MPIM, and PIUB). In the latter models, more heat is transported into the temperate and subpolar latitudes. The NCAR solution is intermediate between these two bounds. With the exception of the MPIM and PIUB results, the global northward heat transport from all of the models falls within the error bars of the two Northern Hemisphere Macdonald and Wunsch estimates but is lower in the subtropical gyre and higher in the subpolar gyre than the Trenberth and Caron curve (diagnosed from atmospheric residuals). The model transports are also lower than the combined 24°N section values from Hall and Bryden and Bryden et al.
 The Northern Hemisphere heat transport is partitioned differently between basins across the models. Excepting the zonal PIUB model, the NCAR solution stands out with about 0.15–0.25 PW higher Atlantic heat transport. This difference is likely due to differences in the surface forcing and North Atlantic Deep Water formation rates, rather than to the use of the Gent-McWilliams isopycnal scheme, because several of the other models also incorporate the scheme.
 A larger spread of model heat transport results is found in the Southern Hemisphere, where model-model differences can be as large as 1 to 1.3 PW. The data constraints on heat transport in the Southern Hemisphere are weaker than for the Northern Hemisphere. Even so, a number of models (IGCR, MIT, PIUB) fall significantly below the 30°S constraint of Macdonald and Wunsch and the Trenberth and Caron curve. Of particular note are the larger poleward heat fluxes in the IGCR, MPIM, NERSC, and PIUB simulations between 40°S and 60°S that support correspondingly larger net heat losses and water mass transformation rates (intermediate and deep water formation) in the high-latitude Southern Ocean than in the other models. The Trenberth and Caron atmospheric residual curve suggests a modest net heating of the Southern Ocean between 40°S and 55°S, a result supported by the flattening or reversal of the gradient of the poleward heat flux for this region in a number of the model solutions (CSIRO, PRINCE, NCAR, NERSC). A sign reversal of the transport gradient is also found in the MPIM heat transport, but the magnitude is significantly larger than in the other models or in the results of Trenberth and Caron.
 The meridional transport of dissolved inorganic carbon (DIC) due to the so-called “solubility pump” is roughly proportional to the negative of meridional heat transport. This relationship arises because of the temperature solubility effect (cold water holds more DIC at saturation) with some divergence expected regionally because of the finite air-sea gas exchange timescale for CO2 (∼1 year) and nonlinearities in carbonate system thermodynamics [Murnane et al., 1999; Sarmiento et al., 2000]. Thus the divergence in the model carbon transport estimates in the OCMIP-2 abiotic carbon simulations is a reflection of differences in the underlying physical heat transport.
 A comparable model-data evaluation can be conducted for meridional freshwater transport, in this case using the estimate of Wijffels et al. as the observational metric (Figure 6). The large-scale pattern of the maxima and minima in the global freshwater transport curves is similar among the models and the Wijffels et al. estimate. This is equivalent to the statement that the transitions between net positive and negative zonally averaged precipitation minus evaporation (and in polar regions sea-ice melt and production) occur at about the same latitudes in all of the solutions. However, the magnitude of the freshwater transport (surface input/removal rates) varies considerably across the models. The differences in the Southern Ocean are notable, in part because of the great surface area. The weak freshwater forcing of the MIT model and strong freshwater forcing in the IGCR model stand out as they did in the previous comparison of SSS.
3.5. Subsurface Temperature and Salinity Fields
 The simulated subsurface temperature (T) and salinity (S) fields in global ocean models reflect the often complex interactions of surface heat and freshwater forcing, subgrid-scale parameterizations (e.g., diapycnal diffusivities), and large-scale circulation [Large et al., 1997]. For ocean biogeochemical simulations, errors in the subsurface hydrographic properties may be indicative of underlying circulation problems relevant to the simulated tracer and carbon system fields of interest to OCMIP-2 (e.g., CFCs, radiocarbon, anthropogenic carbon). The T and S model-data differences between the OCMIP-2 simulations and the Levitus World Ocean Atlas climatology are shown as contour plots of the zonal average versus depth (Figures 7a–7b and 8a–8b) and global root mean square errors (Table 2).
Table 2. Model-Data Root Mean Square (RMS) Difference Between the OCMIP-2 Simulations and the Levitus World Ocean Atlas Climatology [Conkright et al., 1998] for Full Depth, Annual Mean Potential Temperature and Salinity
Columns: Temperature RMS (°C); Salinity RMS (PSS)
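 As an illustration of the kind of metric summarized in Table 2, the sketch below computes a global RMS difference between a model field and a climatology on a common grid. Weighting each grid cell by its volume and masking missing values are assumptions made here for illustration; the text does not state the exact weighting used for Table 2, and the function name and sample values are hypothetical.

```python
import numpy as np

def weighted_rms_difference(model, observed, cell_volume):
    """Volume-weighted RMS of (model - observed) over cells valid in both fields.

    model, observed : arrays of identical shape (e.g., potential temperature in deg C)
    cell_volume     : array of the same shape (m^3); land cells may be 0 or NaN
    """
    model = np.asarray(model, dtype=float)
    observed = np.asarray(observed, dtype=float)
    volume = np.asarray(cell_volume, dtype=float)
    valid = np.isfinite(model) & np.isfinite(observed) & np.isfinite(volume)
    valid &= np.where(np.isfinite(volume), volume, 0.0) > 0.0
    if not np.any(valid):
        raise ValueError("no overlapping valid cells")
    weights = volume[valid]
    diff = model[valid] - observed[valid]
    return float(np.sqrt(np.sum(weights * diff**2) / np.sum(weights)))

# Hypothetical four-cell example; the third cell is missing in the model.
temp_model = np.array([11.0, 4.0, np.nan, 2.0])
temp_clim = np.array([10.0, 4.0, 3.0, 4.0])
volume = np.array([4.0, 3.0, 5.0, 1.0])
rms_temp = weighted_rms_difference(temp_model, temp_clim, volume)
```

 A single global number of this kind hides the spatial structure of the errors, which is why the zonal-average difference sections are examined alongside it.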
 By construction, the T and S errors in the AWI inverse model and IPSL simulation with deep restoring are relatively small. A series of models (CSIRO, LLNL, MIT, MPIM, NCAR, PRINCE, and UL) have intermediate magnitude RMS temperature errors, ∼1°C. Some commonalities are observed in the spatial patterns of the zonally averaged model-data mismatch in some (though not all) of the models in this group: positive temperature biases in the upper thermocline associated with overly diffuse thermoclines (and perhaps too large effective diapycnal diffusivities); warm model temperature anomalies at 100–400 m depth under the equator and thus weak meridional thermal gradients and tropical current speeds; high model temperatures in the subpolar Atlantic and deep Arctic; and overly warm Antarctic Bottom Water. Similar temperature bias patterns are found in the IGCR and PIUB models but are of larger amplitude. The model-data errors in the NERSC model are inverted relative to those in most of the other models, with cold biases in the main thermocline (overly sharp thermocline) and warm biases in the deep water.
 The model salinity error patterns vary more across the OCMIP-2 solutions, with smaller spatial scales. The near-surface tropical salinities are often too fresh in the models, and a number of simulations (MIT, MPIM, NERSC, PRINCE, and UL) are too salty in the middle to deep thermocline. Significant salt biases (±0.1) of both positive (MIT, NERSC) and negative (CSIRO, IGCR, LLNL, PIUB, SOC) sign are found in deep waters.
3.6. Meridional Overturning
 The meridional overturning circulations for the global ocean and Atlantic basin were available for most of the OCMIP-2 simulations (Figures 9a–9b). The figure shows, for each model, the overturning circulation applied to passive tracers, which can differ depending upon the lateral mixing parameterization. In models with horizontal or simple isopycnal mixing, tracer distributions respond to the Eulerian mean stream function, which is shown for these simulations. However, for models with the GM bolus velocity parameterization (MIT, NCAR, PRINCE, SOC), tracer fields are governed by the residual mean, the sum of the Eulerian and bolus velocities, which is shown instead for these simulations. The spatial patterns of the zonally averaged overturning circulation are broadly similar across the OCMIP-2 models and generally consistent with previous studies [Large et al., 1997]. Features in common in the global average include near-surface tropical wind-driven circulation cells in both hemispheres, the formation and mid-depth southward export of North Atlantic Deep Water (NADW), and a deeper northward flowing cell of Antarctic Bottom Water (AABW).
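 The distinction between the Eulerian mean and residual mean stream functions can be made concrete with a small calculation. The sketch below forms a meridional overturning stream function by zonally integrating the meridional velocity and accumulating the layer transports downward from the surface; for a GM-type model the bolus velocity is added to the Eulerian velocity first. The array layout, function names, sign convention (positive for northward flow in the upper layers), and the two-layer test values are assumptions for illustration only.

```python
import numpy as np

def overturning_sv(v, dx, dz):
    """Meridional overturning stream function (Sv) at layer interfaces.

    v  : meridional velocity (m/s), shape (nz, ny, nx); land may be 0 or NaN
    dx : zonal cell width (m), shape (ny, nx)
    dz : layer thickness (m), shape (nz,)
    Returns an array of shape (nz + 1, ny): zero at the surface, then the
    transport accumulated over all layers above each interface.
    """
    v = np.nan_to_num(np.asarray(v, dtype=float))
    dx = np.asarray(dx, dtype=float)
    dz = np.asarray(dz, dtype=float)
    layer_transport = np.sum(v * dx[None, :, :], axis=2) * dz[:, None]  # m^3/s
    surface = np.zeros((1, v.shape[1]))
    psi = np.concatenate((surface, np.cumsum(layer_transport, axis=0)), axis=0)
    return psi / 1.0e6

def tracer_overturning_sv(v_eulerian, dx, dz, v_bolus=None):
    """Overturning seen by passive tracers: Eulerian mean, plus bolus if supplied."""
    v_total = np.asarray(v_eulerian, dtype=float)
    if v_bolus is not None:
        v_total = v_total + np.asarray(v_bolus, dtype=float)
    return overturning_sv(v_total, dx, dz)

# Hypothetical two-layer, single-column section with an opposing bolus flow.
dx = np.array([[1.0e6]])
dz = np.array([1000.0, 1000.0])
v_eul = np.array([0.01, -0.01]).reshape(2, 1, 1)
v_bol = np.array([-0.004, 0.004]).reshape(2, 1, 1)
psi_eulerian = tracer_overturning_sv(v_eul, dx, dz)
psi_residual = tracer_overturning_sv(v_eul, dx, dz, v_bolus=v_bol)
```

 In this toy case the bolus flow opposes the Eulerian cell and reduces its strength from 10 Sv to 6 Sv, the same kind of partial cancellation that distinguishes the residual mean from the Eulerian mean in the GM solutions.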
 The predicted maximum overturning in the North Atlantic across the OCMIP-2 model solutions ranges from a low of 14–16 Sv (PRINCE, SOC) to a high of 27 Sv (UL), with a large group at an intermediate value of ∼24 Sv (IGCR, MIT, MPIM, NCAR, NERSC, PIUB). There are also significant differences among the models in the depth of the southward NADW outflow. The NADW outflows in the z-coordinate models are too shallow, with most of the flow restricted to depths of less than 2500–3000 m, compared with data-based estimates (note that the meridional overturning stream function computed from the models is not directly measurable from field data but can be reconstructed using inverse models from observation-based circulation fields). This leads in the NCAR case to excessive fresh, older AABW inflow at depth and a large vertical dipole error pattern in the subsurface salinity and radiocarbon in the North Atlantic [Large et al., 1997]. A common feature in a number of large-scale ocean models, these problems are associated with errors in both the formation of the correct water masses in the Nordic Seas behind the Greenland-Iceland-Scotland ridge and excess mixing of the subsequent deep-water overflow plume in the z-coordinate frame [Beckmann and Döscher, 1997]. In contrast, the NADW outflow in the isopycnal NERSC simulation may overshoot the observations, penetrating all the way to the bottom in the zonal average with most of the NADW outflow deeper than 4000 m and almost no expression of a northward AABW cell.
 The Southern Ocean is the other region where major differences arise among the different model overturning circulations. In the Antarctic Circumpolar Current (ACC) region, roughly 50°S, the NERSC, PIUB, and UL (and to some degree PRINCE) simulations all indicate a substantial wind-driven overturning cell (18–30 Sv) often referred to as the Deacon cell. The feature is much reduced or absent in the MIT, NCAR, and SOC solutions, which were generated using the Gent-McWilliams (GM) isopycnal scheme, and in other PIUB simulations utilizing GM [Knutti et al., 2000]. As documented by Danabasoglu et al. [1994], there is often a substantial cancellation between the Eulerian and bolus velocity terms in the Lagrangian tracer velocity (displayed for all GM solutions) for the ACC region when using the GM scheme. The presence of a large Deacon cell has been tied to unrealistic deep convection in the ACC region. Another overturning circulation difference occurs near the Antarctic coast, where the IGCR, MPIM, and UL simulations show very large downwelling (6–20 Sv) along the topography. This is related to too much sea-ice driven AABW production in the UL simulation (and some of the other OCMIP-2 solutions not shown here for overturning circulation), leading to substantial excess simulated CFCs at depth in the model [Dutay et al., 2002; Doney and Hecht, 2002].
 Ocean carbon models are fundamentally dependent on the skill of the underlying simulation of physical circulation [Doney, 1999]. While advances have been made, serious issues remain in modeling the large-scale ocean general circulation using global coarse-resolution models. Outstanding modeling questions involve the treatment of surface forcing, subgrid-scale parameterizations, and model architecture [McWilliams, 1996; Griffies et al., 2000]. Different choices result in considerable variation among the OCMIP-2 models in standard physical metrics (e.g., 3-D hydrographic fields, seasonal SST and mixed layer, heat and freshwater transports, and meridional overturning circulation), with some of the solutions falling well outside available observational constraints.
 These errors in the physical metrics point to significant problems in the underlying model representations of ocean transport and dynamics, problems that directly affect the OCMIP predicted ocean carbon cycle variables. For example, oceanic uptake and penetration of anthropogenic CO2 depends primarily on the physical ventilation of the thermocline, intermediate, and deep waters over decadal to centennial timescales. The subtropical and tropical thermoclines are overly diffuse in a number of the OCMIP-2 z-coordinate models, which would suggest an overestimate in the anthropogenic CO2 uptake. However, this effect is likely minimal compared to other errors because the simulated tropical and subtropical chlorofluorocarbon inventories and vertical penetration depths are similar to or somewhat low relative to observations [Dutay et al., 2002; Matsumoto et al., 2004]. Many of the models also exhibit possibly too high upwelling of cold, subsurface water in the equatorial Pacific, another potential positive bias for anthropogenic CO2 uptake. However, the largest common model errors with respect to anthropogenic CO2 are associated with excessive Southern Ocean SAMW and AABW formation and overly shallow NADW outflow. These problems are apparent in physical diagnostics (maximum winter mixed layer depth, meridional overturning, meridional heat transport, and subsurface hydrography; shown here) and tracer fields (chlorofluorocarbons, bomb and natural radiocarbon [Dutay et al., 2002; Matsumoto et al., 2004]).
 Models of the natural ocean carbon system (e.g., OCMIP-2 equilibrium abiotic and biotic carbon experiments) are sensitive to the same circulation problems as well as an expanded set of physical factors encompassing timescales from the seasonal cycle to millennia. Biases in simulated export production result from problems in the near-surface physical circulation and numerical errors arising from tracer advection schemes. Again, the issue of excessive equatorial upwelling arises as it impacts nutrient and DIC supply and export production. Model-data comparisons of surface pCO2 and air-sea CO2 flux are complicated by the large regional error patterns in simulated mean SSTs, the seasonal cycle, and convection. A number of models exhibit excess poleward heat transport in the Southern Hemisphere, which will distort the ocean carbon solubility pump and estimates of the lateral ocean inorganic carbon transport. Subsurface biogeochemical fields are determined by the complicated interplay among surface particle production, subsurface remineralization, and physical circulation. For the middle to deep ocean interiors, where ventilation timescales are centuries to millennia, a combination of physical (hydrography, meridional overturning by basin, potential vorticity) and tracer (natural radiocarbon) metrics is likely required.
 The substantial model-model ranges in OCMIP-2 predicted integrated quantities, such as historical and future anthropogenic CO2 uptake and export production, therefore likely overestimate the uncertainties in ocean carbon cycle dynamics due to large-scale physical circulation. One interim solution is to refocus the OCMIP-2 analysis only on that subset of models deemed acceptable by some set of joint physical and transient tracer criteria based on observable fields (e.g., CFC, radiocarbon). Certainly, skillful simulation of ocean physical fields alone is not a sufficient test of a model to be used for predicting quantities such as anthropogenic carbon uptake, as similar temperature and salinity fields can be constructed for simulations with substantially different circulation and tracer behavior [Gnanadesikan et al., 2002]. Conversely, the credibility of even the best transient tracer solution should be questioned if it poorly represents the dynamical fields, particularly when the goal is to project into the future under changing climate and ocean circulation conditions. For some quantities, for example, anthropogenic CO2 uptake, stricter criteria would exclude outliers, leading to an overall reduction in the estimated model-based uncertainties [Matsumoto et al., 2004], a valuable exercise given the importance of better ocean carbon constraints [Prentice et al., 2001]. However, none of the current generation of ocean general circulation models is ideal, and in the long term, our analysis clearly highlights the need to improve the representation of ocean physical transport in global ocean carbon models.
 Certain lessons can be drawn from the current suite of solutions. Models with only horizontal lateral subgrid-scale mixing (IGCR, MPIM, UL) show excess deep convection and other problems in the Southern Ocean [Danabasoglu et al., 1994; Dutay et al., 2002]. A cluster of the z-coordinate models with isopycnal lateral mixing parameterizations (CSIRO, LLNL, MIT, NCAR, PRINCE, SOC) display relatively similar physical behavior and generally good agreement with traditional physical metrics. However, even within this group of models, the resulting transient tracer and carbon solutions span, for a number of predicted quantities such as deep ocean ventilation and anthropogenic tracer uptake, nearly the entire range found within the full set of OCMIP-2 simulations.
 Sensitivity experiments conducted in the PRINCE model demonstrate that similar large-scale temperature and salinity solutions can be produced by a range of different isopycnal and diapycnal diffusivities if the parameters are covaried; however, widely divergent tracer ventilation patterns result [Gnanadesikan et al., 2002]. Thus time-dependent tracers continue to be invaluable in diagnosing errors in large-scale ocean circulation models [England and Maier-Reimer, 2001]. However, research to improve simulated tracer fields should be balanced by a comparable effort on meeting physical criteria.
 Some persistent biases exist in most or all of the OCMIP-2 simulations and may reflect inherent limitations of the current surface forcing, physical subgrid-scale parameterizations, and coarse spatial discretizations imposed at present by the computational constraints for multicentury, global calculations. Specific examples include cold equatorial SSTs; overly broad western boundary and Antarctic Circumpolar currents that also have displaced trajectories relative to observations; overly diffuse thermoclines; and substantial errors in the patterns, rates, and mechanisms of intermediate and deep-water formation. In this sense, the range of model values from the OCMIP-2 suite may provide false confidence about our ability to quantify key biogeochemical measures. This is particularly true when one considers potential responses and feedbacks to the time-evolving ocean circulation under future climate change.
 By design, the coarse-resolution ocean models used in OCMIP-2 neglect almost entirely a variety of smaller-scale physical phenomena such as mesoscale eddies and coastal–open ocean exchange. Recent results clearly demonstrate that rectification of open-ocean mesoscale variability has significant, quantitative impacts on basin-scale physical [Smith et al., 2000] and biogeochemical [McGillicuddy et al., 2003] dynamics. Similarly, there is renewed debate on the role of the coastal ocean and continental margins on the large-scale ocean carbon system [Tsunogai et al., 1999]. On a related issue, coarse-resolution models generate significant amounts of shelf-driven or overflow controlled deep and bottom water (e.g., AABW, NADW), but often via incorrect, open-ocean convective mechanisms [Doney and Hecht, 2002]. There is a clear need for targeted, regional high-resolution numerical models to develop improved subgrid-scale biogeochemical parameterizations as well as multiscale nested simulations [Doney, 1999].
 Several other recommendations arise from the OCMIP-2 physics analysis. Careful attention should be paid to surface forcing, particularly in polar water mass formation regions and in maintaining the magnitude and phasing of the seasonal cycle. Well-known biases (e.g., overly shallow NADW in many z-coordinate models) also need to be resolved, and the causes of major circulation differences among otherwise quite similarly constructed models need to be identified. The best venue for such work is likely the framework of a single, well-characterized model where a thorough suite of sensitivity experiments can be conducted for specific model parameterizations and parameter values. The OCMIP tracer and carbon experiments should be carried out in carefully constructed sensitivity studies within the framework of individual models, to remove some of the ambiguities with regard to differences in surface forcing and other factors, and on a larger number of models with different vertical coordinate systems to better explore the impact of model architecture. Finally, the AWI model and other inverse and data assimilation approaches should be further pursued, including examining the impact of assimilating seasonal and transient tracer data on the physical circulation.
 S. Doney and K. Lindsay acknowledge support from NASA through the U.S. OCMIP program and the U.S. JGOFS Synthesis and Modeling Project (NASA grant W-19,274). The National Center for Atmospheric Research is sponsored by the National Science Foundation. N. Gruber acknowledges support from NASA grant OCEAN-0250-0231. F. Joos and G.-K. Plattner acknowledge support from the Swiss National Science Foundation and the Swiss Federal Office of Science and Education through the EU projects GOSAC and MilECLim and benefited from scientific advice from T. F. Stocker, G. Delaygue, R. Knutti, and O. Marchal. European model contributions were supported by the EU GOSAC project (contract ENV4-CT97-0495). We also acknowledge support from IGBP/GAIM to maintain the OCMIP project. This is Woods Hole Oceanographic contribution number 11186 and U.S. JGOFS contribution number 1049.