Variability-based constraint on ocean primary production models

Primary production by plankton is essential to ocean ecology and biogeochemistry, so satellite models that estimate primary production from remote sensing data are indispensable in numerous scienti ﬁ c applications. However, because the corresponding in situ data are rarely measured when a satellite passes overhead, are measured on a much smaller spatial scale, and are highly variable, it is very dif ﬁ cult to compare different satellite models, evaluate how accurate they are, or to constrain their parameters. Here, a different approach to evaluating, comparing, and constraining these models is described, which accounts for or avoids all of these issues with the standard approach. This approach ﬁ nds excellent agreement overall with a recent satellite model, while also identifying room for improvement. Abstract Primary production (PP) is fundamental to ocean biogeochemistry, but challengingly variable. Satellite models are unique tools for investigating PP, but are dif ﬁ cult to compare and validate because of the scale separation between in situ and remote measurements, which also are rarely coincident. Here, I argue that satellite estimates should be log-skew-normally distributed, because of this scale separation and because PP measurements are log-normally distributed. Whether they conform to this distributional shape is therefore a powerful variability-based constraint on such models. Satellite models that do follow a log-skew-normal may then also be concisely characterized by three parameters (log-mean, log-standard deviation, and log-skewness). I show that the output from a recent satellite model (CAFE) over 2019 agrees excellently with the log-skew-normal, globally and for most spatiotemporal subsets investigated here. The exception is the Northern Hemisphere winter, which may suggest future model improvements. PP by plankton is essential to ocean ecology and biogeochem-istry, so satellite models that estimate PP from

In the open ocean, primary production (PP) by phytoplankton forms the basis of the food web. As such, PP controls the flow of carbon, nutrients, and energy into ocean ecosystems. Phytoplankton are responsible for~60 Pg C yr À1 of net PP (NPP = gross PP minus autotrophic respiration) (Buitenhuis et al. 2013) and therefore have an appreciable effect on global biogeochemical cycles and thereby climate, including by regulating a large carbon flux to the deep ocean via the biological pump. While total NPP is fairly well-constrained, understanding the variability and spatial and temporal patterns in NPP is a key oceanographic pursuit, with many applications and implications.
Satellite-based models of NPP that utilize remote sensing data to estimate NPP from space have revolutionized the study of ocean productivity thanks to their synoptic coverage. Such models have been refined and employed to myriad ends for decades (Behrenfeld and Falkowski 1997). Dozens of satellite NPP models have been proposed (Carr et al. 2006), whose structural differences evince the complexity of depthintegrated NPP and the challenge of estimating it from radiance. Improvements over the satellite ocean color era have been substantive (Silsbe et al. 2016) but there is still a great deal of room for improvement in terms of constructing, selecting, and tuning a NPP model that yields credible and accurate NPP patterns and variations.
The continued uncertainty in satellite-derived NPP is due in no small part to the challenge of comparing and validating NPP models with in situ measurements. While these measurements come with their own uncertainties (Marra 2009) and are biased in space and time, one should still in principle prefer the satellite model that corresponds best to in situ data. The problem is that there are few colocated in situ and satellite measurements for NPP. It is challenging enough to assemble a sufficient database of matched-up instantaneous concentration measurements, but making a depth-integrated rate measurement is a much more complex endeavor. This lack of in situ data has hampered model comparison, despite substantial efforts (Carr et al. 2006 and references therein). (Note here that while co-located in situ data are sparse, tens of thousands of NPP measurements have been made and assembled in the global ocean; Buitenhuis et al. 2013.) Where comparisons have been possible, such as at subtropical time series stations, those models that do capture the mean state do not capture the variability around that state (Siegel et al. 2001).
This last point has been used as a criticism of NPP models, but it is not necessarily reasonable to expect that satellite models reproduce the variability of in situ measurements, or even agree well pointwise, given that they consider very different spatial scales. In situ measurements are made from bottles with volumes on the order of liters, whereas satellites sample pixels that are several square kilometers large. Directly comparing these assumes that bottle measurements have a spatial footprint, that is, are representative, of a comparable size to a satellite pixel. Extensive recent research into submesoscale (0.1-10 km) dynamics suggests the opposite. It has been argued instead that the large nutrient-transporting vertical velocities caused by motions at these spatial scales and on phytoplankton growth timescales are particularly important for phytoplankton productivity, creating localized productivityenhancing environments and enormous small-scale heterogeneity (Mahadevan 2016). This may help explain why in situ NPP measurements are so variable (Cael et al. 2018), but also suggests that satellite NPP models integrating over these scales should have dampened variability and not match individual in situ NPP measurements exactly. It would be desirable to compare satellite NPP models and in situ NPP data in a way that (1) accounts for this scale separation, (2) does not penalize models for not matching individual in situ measurements, and ideally (3) leverages the many in situ measurements that have been taken at times and places other than when and where a satellite is passing overhead with a clear sky.
One way to achieve all of the above is to evaluate the probability distribution of outputs from a NPP model. As satellite models do not use space or time information explicitly, and as the probability distribution of a variable contains all of the non-spatiotemporal information about that variable, the strictest possible non-spatiotemporal criterion by which one can evaluate a satellite model is by evaluating the probability distributions it generates (Cael et al. 2018). If a NPP model is expected to generate output values that are distributed according to some probability density function p(NPP) over some region and interval, then the mismatch between p(NPP) and the actual model output's distribution is a powerful means of evaluating that model's performance. In other words, the fit of a log-skew-normal (LSN) distribution constitutes a strict test of satellite NPP models' variability. Departures of a given NPP model's distribution from LSN (at places and times where in situ NPP measurements are log-normal) are indicative of failings of that model to accurately represent the drivers of NPP. This does not require pairwise agreement between colocated in situ measurements and satellite pixels, addressing (2) and (3) above and the hypothesized distribution p(NPP) can incorporate information about the scale separation of satellite and in situ measurements, addressing (1). The purpose of this manuscript is to illustrate such an approach to evaluating NPP models.

Theory: LSN
In Cael et al. (2018), we showed that in situ NPP measurements were robustly well-described by a log-normal distribution, globally and over large regions in space and time. This observation builds upon the classic observation that chlorophyll measurements are also log-normally distributed (Campbell 1995). We hypothesized that this is because carbon fixation by phytoplankton occurs as a result of many distinct requirements (i.e., certain light, nutrient, temperature, absence-of-predation, etc., conditions must be met). One might naively assume that satellite model NPP should therefore be log-normal as well-but as I will explain below, scale separation changes the expected distributional form. Figure 1 shows very clearly that the NPP model used here, CAFE (Silsbe et al. 2016), is indeed on the whole far from log-normal: a log-normal variable has zero log-skewness (the skewness, i.e., third standardized moment, of the logarithm of that variable, for which I use the symbol θ) whereas CAFE's NPP estimates range from positively log-skewed with up to θ~1 to negatively log-skewed with as low as θ~À1. (I return to these spatial patterns in θ below.) What distributional form should we expect for NPP models? If in situ measurements are log-normally distributed, and satellite pixels measure spatial scales larger than the footprint of in situ measurements, then satellite NPP should be an average of log-normals. In other words, if a single satellite pixel is N times larger than the characteristic footprint of a log-normally distributed in situ NPP measurement, then a satellite pixel is measuring the average of log-normal variables, that is, NPP sat ¼ 1 N P N i¼1 NPP in situ with each NPP in situ being lognormal. Satellite NPP therefore should be proportional to the sum of log-normal variables. Note that this does not require N to be uniform in space or time, only that N > 1. (Also, here I focus on horizontal scale separation, but equivalently there is a vertical scale separation by virtue of the fact that satellite NPP models estimate depth-integrated NPP.) Sums of log-normals arise in wireless communications problems when one is interested in total interference. In this case, it has been shown that sums of log-normal random variables, even correlated ones and even for small values of N, are excellently approximated by a LSN distribution (Hcine and Bouallegue 2015): where ln is the natural logarithm, p is a probability density function, and ϕ and φ respectively are the probability and cumulative density functions of a standard Gaussian random variable. This is a generalization of the log-normal distribution where the relationship between the log-skewness θ of the distribution is confined to the range (À1, 1). The location, scale, and shape parameters (ξ, ω, α) are related to the log-mean, logstandard deviation, and log-skewness (μ, σ, θ) of a LSN random variable according to the equations: The log-skewness θ (or equivalently the shape parameter α; Fig. 2 illustrates how the shape of the distribution changes with θ) is influenced by the number of log-normals being summed over, and hence the spatial footprint of NPP measurements in this context, as well as the correlation between these summed log-normals (Hcine and Bouallegue 2015). The problem here is analogous to that in the communications literature: if in situ NPP measurements are log-normal, then NPP models using data from large satellite pixels should be LSN. This becomes a powerful test of these models, but also a concise way of characterizing the variability in NPP over space and time in terms of only three parameters.
Here, I demonstrate the application of the LSN to satellite NPP model output, using daily 4 km data from 2019 for the recent Carbon, Absorption, and Fluorescence Euphoticresolving (CAFE) model (Silsbe et al. 2016). Overall CAFE matches the LSN distribution extremely well, and also does robustly over various subsets in space and/or time. The one exception to this is the Northern Hemisphere winter (NHW), which is caused by a heavy tail of low productivity values. The results here are intended to be illustrative-there are numerous NPP models that could be compared in this way, and a challenging inverse problem of relating the LSN parameters to those of the underlying log-normal distributions that needs addressing before the estimated LSN parameters can be leveraged. These results nonetheless give strong support to the CAFE model's ability to resolve a sensible distribution for NPP variability over much of the ocean, and indicate a potential

Cael
Variability constraint on ocean PP models direction for model improvement, while also evincing the utility of the LSN for characterizing NPP variability on kilometer spatial scales.

Methods
The CAFE model is a mechanistic carbon-based approach to NPP that uses inherent optical properties derived from ocean color measurements and improves upon previous approaches by addressing key physiological and ecological attributes of phytoplankton. CAFE has been shown to explain the greatest variance and have the lowest bias out of a suite of 22 NPP models (Silsbe et al. 2016). Here, I analyze daily output at a nominal 4 km spatial resolution from CAFE over the course of 2019, based on MODIS data inputs (https://doi.org/ 10.7910/DVN/DQ6K7R). (N.B. These MODIS inputs are kriged to fill gaps in measurements before the CAFE model is run on its input data; this could potentially confound the probability distribution p(NPP) if the complete dataset was used, but in this case as I analyze sparse random samples of the model field, it is less of a concern. Regardless, the LSN test is agnostic of how a given NPP field is generated.) To test for robustness of the LSN in space and/or time, I also analyze both the fullyear global dataset, as well as five spatiotemporal subsets: The 4 km daily resolution is intractably large to analyze full datasets of. I instead randomly sample 100,000 values from each space-time subset (including the global ocean) and analyze these. I repeat this process 100 times to estimate the uncertainty associated with the random sampling. A 100 iteration is more than sufficient to estimate this uncertainty in this case (using 1000 iterations in some cases yielded identical results). The choice of 100,000 random samples also is inconsequential: using 10,000 or 1,000,000 random samples yielded similar results, with larger sample sizes (encouragingly) corresponding to slightly better correspondences between model output and the hypothesized distribution.
I fit a LSN distribution to these modeled NPP distributions by finding the parameter combinations that minimize Kuiper's statistic V (Kuiper 1960): P Fig. 2. Different log-skew normal distributions for a variable x with logmean 0, log-standard deviation 1, and varying log-skewnesses θ. The limit case of θ = AE1 is the half-log-normal: just the right or left half of the θ = 0 case. (c) Difference between model and log-skew-normal cumulative distribution functions (i.e., departure of data from the theorized distribution, or equivalently the difference between lines in b). Range is over 100 iterations of 100,000 randomly chosen locations and times. NPP* is the standardized logarithm of net primary production.
where F is the cumulative distribution function (CDF) of the CAFE output and F h is a hypothesized distribution such as a LSN, hence smaller Vs indicate better fits. Kuiper's statistic V is generally preferred to the more standard Kolmogorov-Smirnov statistic as it weighs all the data equally, rather that weighting toward the median (Press 1992). The fitting was performed on standardized logtransformed NPP values, NPP*: for computational efficiency (this narrows the size of the parameter space that must be sampled). Note that standardizing a variable by definition does not change the skewness of its distribution, the skewness being the third standardized moment. The code used in these analyses is available at (manuscript submitted). Figure 3 shows the global CAFE distribution and the closest LSN. The agreement is striking, with V = 0.0109 AE 0.0009 and the CDFs for any of the iterations never differ by more than 0.009 (n.b. this is also true using 1000 iterations). The estimated log-skewness is θ = 0.12 AE 0.04, substantially larger than zero. Globally, then, CAFE output appears to be excellently characterized by a LSN distribution. (Note that while the LSN can be an extremely heavy-tailed distribution, the global distribution's small value for σ = 0.325 AE 0.001 means that total productivity is not dominated not by extreme tail values and is instead determined by the probability distribution overall.) Figure 4 shows the value of Kuiper's statistic V and logskewness θ for the spatial and temporal subsets considered (I focus on these because the mean state and variance characteristics of CAFE are described in Silsbe et al. 2016). In general, the spatial and temporal subsets are almost all well-described as LSN-distributed. It is challenging to ascribe p-values to these fits because of the large sample sizes (Stephens 1974) but all of these in Panels 3b-3d are have V values corresponding to unambiguously good fits. Figure 5 gives spatial and temporal examples with high and low V and θ values; here one can see that the Indian Ocean 40 Â 40 box, the worst-fitting of the large spatial regions, with V = 0.028 AE 0.002, is visually a comparably good fit to the global case with V = 0.0109 AE 0.0009the difference is an almost indistinguishable underestimation of the CDF around NPP*~À1 and NPP*~1.5. The large spatial regions all have V < 0.03 and θ > 0, the latter likely related to times of year when blooms occur. The smaller spatial regions behave similarly, with V < 0.03 and θ > 0, indicating that the LSN description is not only applicable at large basin-type scales but also at smaller regional scales. Interestingly, when individual days are considered, the θ~0 and V decreasing monotonically with time from V ≈ 0.03 to V ≈ 0.01. This is curious but may just as likely be a coincidence for the year in question as a meaningful trend. The Southern Hemisphere seasons all have small V values and slightly positive θ values. The space-andtime subset of the North Atlantic Spring behaves similarly to these spatial or temporal subsets, with V = 0.0337 AE 0.0006 and θ = 0.27 AE 0.01.

Results
The exception in Figs. 4 and 5 is the NHW (and to a lesser extent the spring, presumably for similar reasons). In this case, θ is close to its minimum for a LSN variable at θ~À1, and V is more than three times larger than for any case shown in Panels 3b-3d. These CAFE output values are clearly not LSN-distributed. From the CAFE and LSN CDF's for the NHW (Fig. 5), it is clear that this poor fit is driven by their being a very heavy tail on low end of the CAFE NPP distribution in addition to relatively heavy tail on the high end, whereas the

Cael
Variability constraint on ocean PP models LSN corresponds to either zero (θ~0) or one heavy tail (θAE 1). This poor fit is not due to values at particular times of year or latitudes: isolating tropical or subpolar regions or particular months yielded similarly poor fits. This mismatch can either be interpreted as a failing of the LSN to capture a real phenomenon in the NHW, or a failing of CAFE model to accurately model NPP in the NHW. Because the log-normal distribution holds for in situ measurements in the NHW (Cael et al. 2018 Fig. 5) and this mismatch is primarily caused by the very heavy low tail, it seems most likely that the smallest 5% of CAFE NHW NPP estimates are artificially low. Note that when this lower tail is ignored, the remaining distribution conforms to a LSN similarly to the other subsets considered here, for example, by discarding all NPP values < 200 mg C m À2 d À1 , the good-ness-of-fit V is similar to other spatiotemporal subsets considered here (V = 0.032 AE 0.003) and θ > 0. These low productivity values may be due to the sensitivity of CAFE (like other NPP models) to mixed layer depth; erroneously overestimated mixed layer depths in the NHW impose photoacclimation in such models that artificially lowers NPP (Milutinovi c et al. 2009). Another possibility is that these low productivity values are due to optical constituents such as colored dissolved organic matter not being accurately accounted for, so that the phytoplankton properties used in the CAFE model are inaccurately estimated, leading to underestimated NPP (Siegel et al. 2005;Nelson and Siegel 2013). That this mismatch is driven by such a small fraction of NPP estimates further underscores the power of this distributional test for evaluating NPP models, which in turn underscores that CAFE appears to reproduce NPP variability excellently on satellite-relevant spatial scales. Resolving these NHW issues should improve not only NPP estimates not just in the NHW, but globally.
Altogether, it appears that the LSN distribution is an excellent and robust description of variability in a state-of-the-art satellite-based net primary productivity model. LSN is derived from the underlying log-normality of in situ measurements and the scale separation of in situ vs. satellite measurements. This in turn helps explain why satellite NPP measurements fail to capture the variability of in situ NPP measurements. The fit of a LSN distribution constitutes a strict test of satellite net PP models' variability, in this case identifying a potential issue of artificially low estimates in the NHW. It would be instructive to compare how LSN the outputs of different net PP models are. Note also that the same LSN test should apply equally to other bio-optical variables whose in situ measurements are log-normal, particularly chlorophyll (Campbell 1995), and that even if more matched-up measurements were available for these variables (as is the case with chlorophyll), the advantages of such a test-accounting for scale separation, not penalizing models for mismatches in individual measurements, and leveraging of more measurements than matchups alone-still apply. The log-skewness of the NPP model analyzed here also has interesting spatial patterns, with a clear hemispheric asymmetry and some additional second-order features. These patterns merit further investigation, for which the LSN is a useful lens as it condenses the variations in net PP into three parameters. In principle, these three parameter values can be used to estimate not only the two parameters of the underlying lognormal distribution that in situ measurements conform to, but also the spatial footprint of these bottle measurements. However, this inverse problem is complicated by the potential correlations between nearby measurements due to mesoscale and submesoscale phenomena. Resolving this would make it possible to utilize satellite measurements to subpixel-scale information about NPP variability, a problem of great interest in the submesoscale dynamics community. Critically, better understanding these parameters and using independently estimated values for them, instead of allowing them to be free fitting parameters as I have done here, would also be a much stricter and therefore preferable version of this distributional test.