Comparison of algorithms for estimating ocean primary production from surface chlorophyll, temperature, and irradiance



[1] Results of a single-blind round-robin comparison of satellite primary productivity algorithms are presented. The goal of the round-robin exercise was to determine the accuracy of the algorithms in predicting depth-integrated primary production from information amenable to remote sensing. Twelve algorithms, developed by 10 teams, were evaluated by comparing their ability to estimate depth-integrated daily production (IP, mg C m−2) at 89 stations in geographically diverse provinces. Algorithms were furnished information about the surface chlorophyll concentration, temperature, photosynthetic available radiation, latitude, longitude, and day of the year. Algorithm results were then compared with IP estimates derived from 14C uptake measurements at the same stations. Estimates from the best-performing algorithms were generally within a factor of 2 of the 14C-derived estimates. Many algorithms had systematic biases that can possibly be eliminated by reparameterizing underlying relationships. The performance of the algorithms and degree of correlation with each other were independent of the algorithms’ complexity.

1. Introduction

[2] Global maps of the upper-ocean chlorophyll concentration are now being generated routinely by satellite ocean color sensors. These multispectral sensors are able to map the chlorophyll concentration, a measure of phytoplankton biomass, by detecting spectral shifts in upwelling radiance. As the chlorophyll concentration increases, blue light is increasingly absorbed, and thus less is scattered back into space. Although global coverage can nominally be achieved every 1–2 days, the actual temporal resolution is reduced to ∼5–10 days because of cloud cover. Nevertheless, the coverage afforded by satellite remote sensing is vastly greater than that obtainable by any other means.

[3] A principal use of the global ocean chlorophyll data is to estimate oceanic primary production [Behrenfeld et al., 2001]. The mathematical models or procedures for estimating primary production from satellite data are known as primary productivity algorithms. In the early days of the Coastal Zone Color Scanner (CZCS), simple statistical relationships were proposed for calculating primary production from the surface chlorophyll concentration [e.g., Smith and Baker, 1978; Eppley et al., 1985]. Such empirically derived algorithms are still considered useful when applied to annually averaged data [Iverson et al., 2000], but they are not sufficiently accurate to estimate production at seasonal timescales. The surface chlorophyll concentration explains only ∼30% of the variance in primary production at the scale of a single station [Balch et al., 1992; Campbell and O’Reilly, 1988].

[4] Over the past 2 decades, scientists have sought to improve algorithms by combining the satellite-derived chlorophyll data with other remotely sensed fields, such as sea surface temperature (SST) and photosynthetic available radiation (PAR). These algorithms incorporate models of the photosynthetic response of phytoplankton to light, temperature, and other environmental variables, and some also incorporate models of the vertical distribution of these properties within the euphotic zone [Balch et al., 1989; Morel, 1991; Platt and Sathyendranath, 1993; Howard, 1995; Antoine and Morel, 1996a; Behrenfeld and Falkowski, 1997a; Ondrusek et al., 2001]. Algorithms have been used to estimate global oceanic primary production from CZCS data [Antoine and Morel, 1996b; Longhurst et al., 1995; Behrenfeld and Falkowski, 1997a; Howard and Yoder, 1997], and more recently from Sea-viewing Wide Field-of-view Sensor (SeaWiFS) data [Behrenfeld et al., 2001]. Global maps of the average daily primary production for varying periods (weeks, months, and years) are now being produced from Moderate Resolution Imaging Spectroradiometer (MODIS) data.

[5] While many of the photosynthetic responses (to light, temperature, etc.) are commonly represented, model-based algorithms differ with respect to structure and computational complexity [Behrenfeld and Falkowski, 1997b]. Models may be similar in structure but require different parameters depending on whether they describe daily, hourly, or instantaneous production, and even where these aspects are similar, algorithms often yield different results because of differences in their parameterization. Balch et al. [1992] evaluated a variety of algorithms (both empirical and model based), using in situ productivity measurements from a large globally distributed data set, and found that they generally accounted for <50% of the variance in primary production.

[6] In January 1994 the National Aeronautics and Space Administration (NASA) convened an Ocean Primary Productivity Working Group with the goal of developing one or more “consensus” algorithms to be applied to satellite ocean color data. The working group initiated a series of round-robin experiments to evaluate and compare primary productivity algorithms. The approach was to use in situ data to test the ability of algorithms to predict depth-integrated daily production (IP, mg C m−2) based on information amenable to remote sensing. It was decided to compare algorithm performances with one another and with estimates based on 14C incubations.

[7] Our understanding of primary productivity in the ocean is largely based on the assimilation of inorganic carbon from 14C techniques [Longhurst et al., 1995], and thus it was considered appropriate to compare the algorithm estimates with 14C-based estimates. However, it was recognized that the 14C-based estimates are themselves subject to error [Peterson, 1980; Fitzwater et al., 1982; Richardson, 1991]. The 14C incubation technique measures photosynthetic carbon fixation within a confined volume of seawater, and there are no methods for absolute calibration of bottle incubations [Balch, 1997]. Furthermore, there is no universally accepted method for measuring and verifying vertically integrated production derived from discrete bottle measurements. Despite this fact, here we treat the 14C-based estimates as “truth” and refer to the differences between algorithm-derived and 14C-derived estimates as “errors.” In all statistical analyses, however, the two are recognized as being subject to error.

[8] Participation in the round robin was solicited through a widely distributed “Dear Colleague” letter. A central ground rule was that the algorithms tested would be identified only by code numbers. The first round-robin experiment involved data from only 25 stations and was thus limited in scope. It was decided that a more comprehensive second round was needed. In this paper, we present results of the second round-robin experiment involving data from 89 stations with wide geographic coverage. Round two was open to all participants of round one, as well as to others who had responded positively to the initial invitation.

[9] The following questions were addressed: (1) How do algorithm estimates of primary production derived strictly from surface information compare with estimates derived from 14C incubation methods? (2) How does the error in satellite-derived chlorophyll concentration affect the accuracy of the primary productivity algorithms? (3) Are there regional differences in the performance of algorithms? (4) How do algorithms compare with each other in terms of complexity vis-a-vis performance?

2. Methods

[10] A subcommittee of NASA’s Ocean Primary Productivity Working Group was formed to administer the round-robin experiments (Table 1), and there were 10 participant teams (Table 2) who volunteered to test their algorithms. A test data set was assembled to be used for evaluating algorithms. Algorithm developers were provided with only the information accessible to spaceborne sensors, and they subsequently returned predictions of integral production at each station. Results were compared with estimates derived from the 14C incubations and with the results of other algorithms.

Table 1. Algorithm Testing Subcommittee of NASA's Ocean Primary Productivity Working Group^a

| Name | Affiliation |
|---|---|
| Robert Armstrong | Stony Brook University |
| Richard T. Barber | Duke University |
| James Bishop | Lawrence Berkeley National Laboratory |
| Janet W. Campbell | University of New Hampshire |
| Mary-Elena Carr | Jet Propulsion Laboratory |
| Wayne E. Esaias | NASA Goddard Space Flight Center |
| Richard Iverson | Florida State University |
| Charles S. Yentsch | Bigelow Laboratory for Ocean Sciences |

^a These individuals were responsible for conducting the primary productivity algorithm round-robin experiment. They agreed not to participate by testing algorithms of their own.

Table 2. Participant Teams Whose Algorithms Were Tested in Round Robin

| Participants | Affiliation |
|---|---|
| David Antoine and Andre Morel | Laboratoire de Physique et Chimie Marines |
| William Balch and Bruce Bowler | Bigelow Laboratory for Ocean Sciences |
| Michael Behrenfeld and Paul Falkowski | NASA Goddard Space Flight Center/Rutgers University |
| Nicolas Hoepffner | Joint Research Centre of the European Commission |
| Dale Kiefer | University of Southern California |
| Steven Lohrenz | University of Southern Mississippi |
| John Marra | Lamont-Doherty Earth Observatory |
| Vladimir Vedernikov | P.P. Shirshov Institute of Oceanology |
| Kirk Waters and Bob Bidigare | NOAA Coastal Services Center/University of Hawaii |
| James Yoder and John Ryan | University of Rhode Island/Monterey Bay Aquarium Research Institute |

2.1. Test Data

[11] Data from 89 stations were obtained from nine sources (Table 3), representing diverse geographic regions and a variety of measurement techniques. The acquired station data included the downwelling photosynthetic available radiation incident on the water surface (daily PAR between 400 and 700 nm, in mol photons m−2) and measurements of the chlorophyll concentration, temperature, and PAR at discrete depths in the upper water column. From the profile data, we determined surface chlorophyll (Bsfc, mg Chl m−3), sea surface temperature (SST, °C), and the euphotic depth or 1% light level (Zm, m).

Table 3. Data Sets Used to Test Algorithms^a

| Data Set | Region | n | IP | Latitude | Longitude | Months | Years |
|---|---|---|---|---|---|---|---|
| AMERIEZ | Antarctica | 7 | 315 | −66 to −58 | −51 to −38 | March–Nov. | 1983–1986 |
| SUPER | North Pacific | 10 | 688 | 50 to 53 | −145 to −145 | June–Aug. | 1987–1988 |
| EqPac nonequator | Tropical Pacific | 13 | 724 | −12 to 12 | −140 to −140 | Feb.–Sept. | 1996–1996 |
| NABE | Northeast Atlantic | 12 | 1027 | 46 to 47 | −20 to −19 | April–May | 1989–1989 |
| EqPac equator | Equatorial Pacific | 8 | 1170 | 0 to 0 | −140 to −140 | Feb.–Oct. | 1996–1996 |
| Arabian Sea | Arabian Sea | 12 | 1192 | 10 to 19 | 57 to 67 | March–Aug. | 1995–1995 |
| PROBES | Bering Sea | 9 | 1203 | 55 to 58 | −167 to −164 | April–June | 1979–1981 |
| MARMAP | Northwest Atlantic | 8 | 1560 | 40 to 43 | −71 to −67 | Aug.–Sept. | 1981–1981 |
| Palmer LTER | Antarctica | 10 | 1795 | −65 to −65 | −64 to −64 | Feb.–Dec. | 1991–1992 |

^a Columns are the region, number of stations (n), average depth-integrated daily production (IP, mg C m−2), and ranges in latitude, longitude, months, and years.

[12] In addition, we were provided 14C-based estimates of the daily primary production (Pi, mg C m−3) at discrete depths Zi ranging from the surface to the 1% light depth. “Measured” integral production, IPmeas, was computed for each station by trapezoidal integration, using the formula

$$\mathit{IP}_{\mathrm{meas}} = \sum_{i=1}^{m-1} \frac{P_i + P_{i+1}}{2}\,\left(Z_{i+1} - Z_i\right) \tag{1}$$

where the number of depths (m) varied among stations. Integral chlorophyll (IB, mg Chl m−2) was also computed over the same layer by a similar formula. The surface information provided to the algorithm developers and other information not provided (e.g., IPmeas, IB, and Zm) are listed in Table 4.
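As a minimal sketch of the trapezoidal integration described above, the following Python function sums the layer contributions between successive sampling depths. The function name and the sample profile are hypothetical and are not taken from the test data.

```python
def integrate_trapezoid(depths, rates):
    """Trapezoidal integral of a discrete vertical profile.

    depths: sampling depths Z_i (m), increasing from the surface
    rates:  daily production P_i at those depths (mg C m^-3)
    Returns depth-integrated production (mg C m^-2).
    """
    if len(depths) != len(rates) or len(depths) < 2:
        raise ValueError("need at least two matching depth/rate pairs")
    total = 0.0
    for i in range(len(depths) - 1):
        thickness = depths[i + 1] - depths[i]
        # Mean rate in the layer times the layer thickness
        total += 0.5 * (rates[i] + rates[i + 1]) * thickness
    return total


# Hypothetical four-depth profile (illustrative values only)
depths = [0.0, 10.0, 30.0, 60.0]
rates = [20.0, 16.0, 8.0, 1.0]
ip_meas = integrate_trapezoid(depths, rates)
```

The same function could be applied to a chlorophyll profile to obtain an integral chlorophyll value over the same layer.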

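Table 4. In Situ Data Used to Test Primary Productivity Algorithms^a

| Station | Region | Lat. | Long. | Date | SST | PAR | Sfc. Chl | Zm | IBmeas | IPmeas |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EqPac nonequator | 12 | −140 | 5 Feb. 1996 | 25.9 | 25.2 | 0.114 | 85 | 14.2 | 309 |
| 2 | | 5 | −140 | 13 Feb. 1996 | 28.5 | 30.7 | 0.179 | 71 | 14.0 | 427 |
| 3 | | 3 | −140 | 15 Feb. 1996 | 28.4 | 36.2 | 0.210 | 71 | 16.1 | 670 |
| 4 | | −2 | −140 | 1 March 1996 | 28.6 | 27.0 | 0.132 | 106 | 16.5 | 586 |
| 5 | | −5 | −140 | 4 March 1996 | 28.7 | 36.4 | 0.170 | 106 | 19.9 | 675 |
| 6 | | 12 | −140 | 11 Aug. 1996 | 28.4 | 41.6 | 0.063 | 79 | 10.0 | 319 |
| 7 | | 5 | −140 | 19 Aug. 1996 | 27.5 | 33.7 | 0.284 | 79 | 22.1 | 561 |
| 8 | | 3 | −140 | 22 Aug. 1996 | 27.0 | 36.9 | 0.230 | 79 | 17.3 | 637 |
| 9 | | 2 | −140 | 24 Aug. 1996 | 23.8 | 38.9 | 0.355 | 79 | 27.7 | 1630 |
| 10 | | −2 | −140 | 3 Sept. 1996 | 25.5 | 35.2 | 0.223 | 79 | 20.4 | 1047 |
| 11 | | −3 | −140 | 6 Sept. 1996 | 25.3 | 36.1 | 0.274 | 79 | 26.2 | 1362 |
| 12 | | −5 | −140 | 8 Sept. 1996 | 25.9 | 30.8 | 0.225 | 79 | 21.7 | 861 |
| 13 | | −12 | −140 | 13 Sept. 1996 | 26.4 | 30.9 | 0.135 | 79 | 14.6 | 323 |
| 14 | EqPac equator | 0 | −140 | 23 Feb. 1996 | 28.3 | 14.6 | 0.227 | 71 | 18.9 | 513 |
| 15 | | 0 | −140 | 24 Feb. 1996 | 28.3 | 43.8 | 0.247 | 71 | 17.6 | 867 |
| 16 | | 0 | −140 | 29 Aug. 1996 | 24.8 | 36.6 | 0.372 | 79 | 28.8 | 1399 |
| 17 | | 0 | −140 | 30 Aug. 1996 | 24.9 | 33.2 | 0.257 | 79 | 25.9 | 1041 |
| 18 | | 0 | −140 | 14 Oct. 1996 | 25.0 | 33.5 | 0.240 | 82 | 27.9 | 1573 |
| 19 | | 0 | −140 | 16 Oct. 1996 | 25.3 | 34.6 | 0.218 | 82 | 25.3 | 1373 |
| 20 | | 0 | −140 | 18 Oct. 1996 | 25.2 | 32.7 | 0.247 | 82 | 27.5 | 1432 |
| 21 | | 0 | −140 | 20 Oct. 1996 | 25.1 | 32.7 | 0.242 | 82 | 23.1 | 1163 |
| 22 | PROBES | 55 | −165 | 17 April 1979 | 3.5 | 26.3 | 2.21 | 32 | 78.7 | 1090 |
| 23 | | 56 | −167 | 18 April 1979 | 4.0 | 28.8 | 1.92 | 35 | 63.2 | 891 |
| 24 | | 57 | −166 | 22 April 1979 | 3.4 | 32.4 | 7.04 | 27 | 134.1 | 1576 |
| 25 | | 55 | −165 | 6 May 1979 | 5.5 | 27.3 | 8.86 | 14 | 207.8 | 2306 |
| 26 | | 55 | −166 | 20 May 1979 | 5.0 | 33.4 | 4.05 | 28 | 113.1 | 2022 |
| 27 | | 58 | −164 | 8 June 1979 | 8.4 | 50.1 | 1.06 | 32 | 111.9 | 1309 |
| 28 | | 55 | −167 | 15 April 1981 | 3.9 | 28.2 | 0.96 | 33 | 29.5 | 364 |
| 29 | | 55 | −167 | 1 June 1981 | 6.9 | 36.9 | 4.34 | 18 | 82.1 | 1053 |
| 30 | | 56 | −166 | 5 June 1981 | 7.5 | 34.3 | 0.68 | 28 | 33.9 | 214 |
| 31 | SUPER | 50 | −145 | 3 June 1987 | 7.2 | 36.7 | 0.282 | 59 | 21.2 | 595 |
| 32 | | 53 | −145 | 9 June 1987 | 6.9 | 26.4 | 0.659 | 57 | 36.8 | 671 |
| 33 | | 50 | −145 | 15 June 1987 | 7.7 | 48.2 | 0.344 | 60 | 24.4 | 913 |
| 34 | | 50 | −145 | 18 June 1987 | 7.4 | 21.5 | 0.531 | 60 | 27.0 | 1541 |
| 35 | | 50 | −145 | 20 Sept. 1987 | 11.7 | 35.5 | 0.402 | 56 | 18.9 | 887 |
| 36 | | 50 | −145 | 8 May 1988 | 5.6 | 29.1 | 0.270 | 89 | 26.5 | 366 |
| 37 | | 50 | −145 | 27 May 1988 | 7.0 | 36.5 | 0.166 | 69 | 12.1 | 446 |
| 38 | | 53 | −145 | 5 Aug. 1988 | 11.5 | 23.5 | 0.129 | 81 | 15.8 | 360 |
| 39 | | 53 | −145 | 19 Aug. 1988 | 11.8 | 20.2 | 0.191 | 68 | 14.8 | 479 |
| 40 | | 50 | −145 | 25 Aug. 1988 | 12.0 | 39.8 | 0.297 | 60 | 16.0 | 621 |
| 41 | AMERIEZ | −58 | −38 | 18 Nov. 1983 | − | | | | | |
| 42 | | −60 | −38 | 21 Nov. 1983 | −1.4 | 40.2 | 0.43 | 65 | 29.5 | 457 |
| 43 | | −60 | −38 | 23 Nov. 1983 | −1.3 | 50.4 | 0.41 | 63 | 44.7 | 273 |
| 44 | | −60 | −40 | 27 Nov. 1983 | −0.7 | 13.9 | 4.70 | 22 | 123.1 | 633 |
| 45 | | −65 | −48 | 11 March 1986 | − | | | | | |
| 46 | | −66 | −49 | 16 March 1986 | − | | | | | |
| 47 | | −65 | −51 | 23 March 1986 | − | | | | | |
| 48 | Palmer LTER | −65 | −64 | 10 Dec. 1991 | −0.4 | 45.2 | 0.72 | 40 | 55.3 | 1259 |
| 49 | | −65 | −64 | 16 Dec. 1991 | − | | | | | |
| 50 | | −65 | −64 | 28 Dec. 1991 | 2.2 | 67.1 | 2.52 | 19 | 184.8 | 6308 |
| 51 | | −65 | −64 | 4 Jan. 1992 | 0.7 | 42.9 | 11.59 | 12 | 157.2 | 3894 |
| 52 | | −65 | −64 | 16 Jan. 1992 | 0.5 | 33.1 | 0.84 | 35 | 51.5 | 994 |
| 53 | | −65 | −64 | 24 Jan. 1992 | − | | | | | |
| 54 | | −65 | −64 | 3 Feb. 1992 | 0.4 | 35.5 | 1.06 | 26 | 20.2 | 220 |
| 55 | | −65 | −64 | 10 Feb. 1992 | 0.3 | 24.1 | 0.69 | 47 | 34.0 | 673 |
| 56 | | −65 | −64 | 17 Feb. 1992 | 0.4 | 17.7 | 3.45 | 69 | 63.2 | 340 |
| 57 | | −65 | −64 | 27 Feb. 1992 | 0.2 | 40.8 | 2.43 | 35 | 58.8 | 868 |
| 58 | Arabian Sea | 19 | 67 | 19 March 1995 | 25.5 | 53.4 | 0.541 | 40 | 20.1 | 1327 |
| 59 | | 10 | 65 | 24 March 1995 | 29.0 | 51.4 | 0.077 | 73 | 11.1 | 602 |
| 60 | | 14 | 65 | 27 March 1995 | 27.8 | 52.3 | 0.078 | 64 | 10.3 | 679 |
| 61 | | 16 | 62 | 31 March 1995 | 27.1 | 51.2 | 0.188 | 40 | 14.6 | 841 |
| 62 | | 17 | 60 | 3 April 1995 | 26.9 | 56.3 | 0.218 | 39 | 21.8 | 1145 |
| 63 | | 18 | 58 | 6 April 1995 | 26.9 | 57.3 | 0.164 | 38 | 10.8 | 651 |
| 64 | | 19 | 67 | 22 July 1995 | 28.0 | 40.2 | 0.308 | 61 | 20.8 | 886 |
| 65 | | 14 | 65 | 28 July 1995 | 27.4 | 52.5 | 0.521 | 46 | 23.8 | 1455 |
| 66 | | 10 | 65 | 31 July 1995 | 28.0 | 48.8 | 0.583 | 48 | 25.7 | 1542 |
| 67 | | 16 | 62 | 4 Aug. 1995 | 25.8 | 53.7 | 0.436 | 48 | 21.1 | 1522 |
| 68 | | 18 | 58 | 11 Aug. 1995 | 23.2 | 49.6 | 1.358 | 27 | 26.0 | 2141 |
| 69 | | 18 | 57 | 12 Aug. 1995 | 20.7 | 51.4 | 0.569 | 27 | 17.8 | 1518 |
| 70 | NABE | 47 | −20 | 25 April 1989 | 12.6 | 49.9 | 0.565 | 59 | 34.4 | 944 |
| 71 | | 47 | −20 | 26 April 1989 | 12.6 | 19.6 | 0.908 | 61 | 57.9 | 876 |
| 72 | | 47 | −19 | 27 April 1989 | 12.6 | 13.8 | 0.748 | 65 | 52.3 | 682 |
| 73 | | 47 | −20 | 29 April 1989 | 12.6 | 30.4 | 1.061 | 53 | 47.3 | 910 |
| 74 | | 46 | −20 | 30 April 1989 | 12.7 | 45.9 | 0.807 | 55 | 45.1 | 1286 |
| 75 | | 46 | −20 | 1 May 1989 | 12.7 | 13.9 | 0.879 | 59 | 36.5 | 781 |
| 76 | | 47 | −20 | 2 May 1989 | 12.5 | 53.6 | 1.066 | 50 | 48.6 | 1387 |
| 77 | | 47 | −20 | 3 May 1989 | 12.7 | 24.8 | 1.135 | 49 | 32.9 | 915 |
| 78 | | 47 | −20 | 4 May 1989 | 12.5 | 13.7 | 1.107 | 50 | 64.2 | 852 |
| 79 | | 47 | −20 | 5 May 1989 | 12.4 | 50.3 | 1.274 | 49 | 61.6 | 1402 |
| 80 | | 46 | −19 | 6 May 1989 | 13.1 | 27.1 | 0.710 | 49 | 83.8 | 1031 |
| 81 | | 46 | −19 | 8 May 1989 | 13.0 | 21.8 | 1.724 | 40 | 82.2 | 1253 |
| 82 | MARMAP | 41 | −71 | 27 Aug. 1981 | 19.6 | 35.0 | 3.62 | 15 | 47.4 | 1716 |
| 83 | | 43 | −71 | 28 Aug. 1981 | 13.9 | 48.6 | 7.23 | 14 | 68.5 | 3482 |
| 84 | | 43 | −70 | 28 Aug. 1981 | 14.9 | 48.6 | 1.72 | 21 | 51.5 | 1161 |
| 85 | | 42 | −67 | 29 Aug. 1981 | 16.1 | 40.6 | 1.29 | 25 | 37.5 | 691 |
| 86 | | 40 | −69 | 31 Aug. 1981 | 19.4 | 33.2 | 0.52 | 37 | 51.4 | 1248 |
| 87 | | 41 | −70 | 2 Sept. 1981 | 14.4 | 40.3 | 3.76 | 22 | 63.0 | 1864 |
| 88 | | 41 | −70 | 2 Sept. 1981 | 18.7 | 40.3 | 0.64 | 40 | 32.7 | 1412 |
| 89 | | 41 | −71 | 3 Sept. 1981 | 18.0 | 21.1 | 0.59 | 40 | 40.7 | 904 |

^a Surface and column-integrated data for the 89 stations that were used to test algorithms are given. Values are listed with the number of significant figures provided in the original data sets. Units are SST, °C; PAR, mol photons m−2; Surface (Sfc.) Chl, mg Chl m−3; 1% light level Zm, m; integral measured chlorophyll IBmeas, mg Chl m−2; and depth-integrated daily production IPmeas, mg C m−2.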

[13] The measurement methods were consistent within each data set but differed between data sets. The equatorial Pacific (EqPac [Barber et al., 1996]), North Atlantic (NABE [Ducklow and Harris, 1993]), and Arabian Sea [Barber et al., 2001] data were from the Joint Global Ocean Flux Study (JGOFS) [Knudson et al., 1989; Chipman et al., 1993] process studies. Primary production measurements from these campaigns were based on 24-hour, in situ incubations, in accordance with JGOFS protocols. The SUPER data set [Welschmeyer et al., 1993] also used 24-hour, in situ incubations. Simulated in situ incubations were used to produce the Antarctic Marine Ecosystem Research at the Ice Edge Zone (AMERIEZ) data (24-hour incubations [Smith and Nelson, 1990]), the PROBES data (dawn-to-dusk incubations [Codispoti et al., 1982]), and the Marine Resources Monitoring, Assessment, and Prediction (MARMAP) data (6-hour incubations scaled by daily PAR measurements [O’Reilly et al., 1987]). The Palmer data from the Long-Term Ecological Research (LTER) site [Moline and Prezelin, 1997] were based on 90-min incubations in photosynthetrons [e.g., Prezelin and Glover, 1991] that were then scaled to estimate daily rates. This methodological diversity introduced a source of variance in the test data, and consequently in the algorithm performance statistics, that was largely confounded with regional effects. However, we accepted the diversity under the premise that a similar diversity might have existed in the data sets used to parameterize algorithms (see section 5).

[14] Integral primary production and surface chlorophyll spanned 2 orders of magnitude in the test data set, while SST and PAR varied over the wide ranges found globally (Figure 1). The symbols shown in Figure 1 denote the various data sets and regions, and these will be used consistently in subsequent figures. The widest ranges in production, biomass, and irradiance and the lowest temperatures were found in the Antarctic data (Palmer LTER and AMERIEZ). There was a general positive correspondence between IP and Bsfc, and between IP and PAR, but there were no simple empirical relationships useful for algorithm purposes. There was no apparent relationship between IP and SST at temperatures below 20°C, but in the equatorial Pacific and Arabian Sea, where surface temperatures were above 20°C, production decreased with increasing surface temperature.

Figure 1.

Relationships found in in situ data between daily depth-integrated primary production (IP, mg C m−2) and properties amenable to remote sensing. (a) Surface chlorophyll concentration (Bsfc, mg Chl m−3). (b) Sea surface temperature (SST, °C). (c) Above-water daily photosynthetic available radiation (PAR, mol photons m−2). Symbols shown here will be used consistently in all figures.

2.2. Evaluation Procedures

[15] The data analysis and evaluation of algorithm results were carried out at the University of New Hampshire (UNH) under the direction of the first author with input from other members of the algorithm testing subcommittee (ATS) (Table 1). The complete test data set was assembled and resident on computers at UNH, but algorithm codes were not exchanged. That is, each participant team was responsible for running its own algorithm code based on input data furnished by the ATS.

[16] The information provided for each station included (1) latitude and longitude to the nearest 0.1°, (2) day of the year, (3) incident daily PAR (mol photons m−2), (4) SST (°C), and (5) two values for the surface chlorophyll concentration (mg Chl m−3). One of the chlorophyll values (randomly assigned) was the measured surface chlorophyll (Bsfc), and the other was a simulated satellite-derived chlorophyll (Bsat) computed as

$$\log\left(B_{\mathrm{sat}}\right) = \log\left(B_{\mathrm{sfc}}\right) + \Delta B \tag{2}$$

where ΔB was a pseudorandom normal (Gaussian) error with zero mean and standard deviation equal to 0.3 in log10 units. This error represents a factor-of-2 uncertainty (10^0.3 ≈ 2) in satellite-derived chlorophyll that has been reported for open ocean (Case 1) waters [O’Reilly et al., 1998; Gordon et al., 1985]. The 89 values of ΔB were statistically independent.
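The corruption step can be sketched as follows, assuming base-10 logarithms and a seeded generator from the Python standard library. The function name, the seed, and the sample chlorophyll values are hypothetical; the generator actually used in the round robin is not specified in the text.

```python
import math
import random


def simulate_bsat(bsfc_values, sigma=0.3, seed=0):
    """Return (Bsat, delta_B) pairs for a list of measured Bsfc values.

    delta_B is an independent normal error (zero mean, standard
    deviation sigma, in log10 units) applied multiplicatively,
    so that log10(Bsat) = log10(Bsfc) + delta_B.
    """
    rng = random.Random(seed)  # seed is an illustrative choice
    pairs = []
    for bsfc in bsfc_values:
        delta_b = rng.gauss(0.0, sigma)
        pairs.append((bsfc * 10.0 ** delta_b, delta_b))
    return pairs


# Illustrative surface chlorophyll values (mg Chl m^-3)
bsfc_values = [0.1, 0.5, 2.0]
pairs = simulate_bsat(bsfc_values)
```

Because the error is additive in log space, one standard deviation corresponds to multiplying or dividing the measured chlorophyll by about 2.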

[17] Participants did not know which chlorophyll was the measured surface chlorophyll, Bsfc, and which was the corrupted “satellite” chlorophyll, Bsat. They were asked to return two algorithm estimates of integral production for each station, one for each chlorophyll value. Their results were then “unscrambled” to identify the integral production estimate based on measured chlorophyll, IPalg, and that based on Bsat, IPsat.

2.2.1. Performance Indices

[18] The performance of each algorithm was based on a log-difference error (Δ) defined as

$$\Delta = \log\left(\mathit{IP}_{\mathrm{alg}}\right) - \log\left(\mathit{IP}_{\mathrm{meas}}\right) \tag{3}$$

which is a measure of relative error. Performance indices were the mean (M), standard deviation (S), and root-mean-square (RMS) of the 89 log-difference errors. Since the units of these indices are decades of log, and not easily translated into absolute terms, we also present three inverse-transformed values:

$$F_{\mathrm{med}} = 10^{M} \tag{4}$$
$$F_{\mathrm{min}} = 10^{M-S}, \qquad F_{\mathrm{max}} = 10^{M+S} \tag{5}$$

[19] Log-difference errors (Δ) for each algorithm tended to be symmetrically distributed about their mean and approximately normally distributed. Assuming an underlying normal distribution for Δ, Fmed would be the median value of the ratio

$$F = \frac{\mathit{IP}_{\mathrm{alg}}}{\mathit{IP}_{\mathrm{meas}}} \tag{6}$$

and 68% of the F values would lie within the “one-sigma” range (Fmin to Fmax).
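The performance indices can be computed with a few lines of Python. This is a sketch under two assumptions not stated in the text: logarithms are base 10, and S is the population standard deviation (divisor n). The function name and the three-station example are hypothetical.

```python
import math


def performance_indices(ip_alg, ip_meas):
    """Compute M, S, RMS and the inverse-transformed F values."""
    deltas = [math.log10(a / m) for a, m in zip(ip_alg, ip_meas)]
    n = len(deltas)
    mean = sum(deltas) / n
    # Population standard deviation (divisor n): an assumption
    std = math.sqrt(sum((d - mean) ** 2 for d in deltas) / n)
    rms = math.sqrt(sum(d * d for d in deltas) / n)
    return {
        "M": mean,
        "S": std,
        "RMS": rms,
        "Fmed": 10.0 ** mean,
        "Fmin": 10.0 ** (mean - std),
        "Fmax": 10.0 ** (mean + std),
    }


# Hypothetical example: one estimate low by a factor of 2,
# one exact, and one high by a factor of 2
idx = performance_indices([50.0, 100.0, 200.0], [100.0, 100.0, 100.0])
```

In this example the low and high errors cancel in log space, so M is zero and Fmed is 1 even though individual estimates are off by a factor of 2.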

2.2.2. Effect of Errors in the Satellite Chlorophyll

[20] The IPsat estimate based on the simulated satellite chlorophyll, Bsat, was subject to two errors: the relative error Δ defined in equation (3) and an additional error, Δsat, caused by the satellite chlorophyll error, ΔB. This additional error is defined as

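$$\Delta_{\mathrm{sat}} = \log\left(\mathit{IP}_{\mathrm{sat}}\right) - \log\left(\mathit{IP}_{\mathrm{alg}}\right) \tag{7}$$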

[21] To investigate the effect of errors in the satellite chlorophyll, Δsat was regressed against ΔB. The slope of this regression yields information about the sensitivity of the IP algorithm to errors in the satellite chlorophyll retrieval. A slope of 1 would indicate that the resulting error in IP is directly proportional to the error in Bsat, whereas a slope less (greater) than 1 shows less (greater) sensitivity.
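A minimal sketch of this sensitivity regression, using ordinary least squares on paired values of ΔB and Δsat. The function name and the synthetic data are hypothetical: the example assumes an algorithm whose output scales as the square root of chlorophyll, so the expected slope is 0.5.

```python
def regress(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic chlorophyll errors (log10 units) and the resulting
# production errors for a hypothetical square-root algorithm
delta_b = [-0.45, -0.20, 0.05, 0.30, 0.60]
delta_sat = [0.5 * d for d in delta_b]
slope, intercept = regress(delta_b, delta_sat)
```

A slope below 1, as here, means that a given relative error in chlorophyll produces a smaller relative error in the production estimate.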

2.2.3. Regional Analyses

[22] To investigate regional differences, performance indices were computed for each data set separately. Although there were methodological differences between data sets, we treated the different data sets as “regions” for the purpose of this analysis. A two-way analysis of variance (ANOVA) was performed on the Δ data to determine whether there were significant differences in algorithms, regions, and “interactions” between algorithms and regions.

2.2.4. Comparing Algorithms

[23] To compare algorithms, a correlation coefficient (r) was calculated from the log-transformed results, log(IPalg), for each pair of algorithms. We did not use correlations (or r2 as a measure of “percent variance explained”) to measure the performance of the algorithms themselves, because high r2 is not a sufficient condition for good agreement. That is, high linear correlation can exist despite systematic errors. In the case of two algorithms, however, we computed correlations and average ratios, and we also examined all pairwise plots to determine the degree of agreement. Results were considered in the context of differences in algorithm structure and complexity.
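The point that high correlation can coexist with systematic error is easy to demonstrate. In the sketch below, two hypothetical algorithms differ by a constant factor of 2: their log-transformed outputs are perfectly correlated, yet their average ratio is 0.5. All names and values are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical outputs (mg C m^-2); algorithm B is always half of A
ip_a = [300.0, 600.0, 1200.0, 2400.0]
ip_b = [0.5 * v for v in ip_a]

log_a = [math.log10(v) for v in ip_a]
log_b = [math.log10(v) for v in ip_b]
r = pearson_r(log_a, log_b)

# Geometric-mean ratio of B to A, computed in log space
mean_log_ratio = sum(b - a for a, b in zip(log_a, log_b)) / len(log_a)
average_ratio = 10.0 ** mean_log_ratio
```

This is why the correlation is reported together with an average ratio when two algorithms are compared.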

3. Algorithms

[24] Twelve algorithms by 10 teams were tested in the round robin. Each participant team was assigned a code number to identify its algorithm's results in subsequent comparisons. Code numbers ranged from 1 to 12. (Teams 2 and 10 dropped out after receiving code numbers.) Team 4 submitted results for three algorithms, which were identified by letters (4a, 4b, and 4c).

[25] Although the identity of the algorithm developers is not revealed, in accordance with ground rules, paragraphs describing each of the algorithms are provided in Appendix A. Participation in the round robin was voluntary, and thus the nature of the algorithms tested was purely fortuitous. As it turned out, the algorithms belonged to three of the four major categories of complexity described by Behrenfeld and Falkowski [1997b]. The category not represented was that of wavelength-integrated models (WIMs). In this category, time and depth are resolved, but light is not spectrally resolved.

[26] Five algorithms (numbers 1, 6, 8, 9, and 12) were from the wavelength-resolved model (WRM) class, which is the most detailed and highly resolved of all algorithm types. In algorithms of this category, the photosynthetic rate is computed at each depth and at various times throughout the day, based on a spectrally resolved underwater light field. In some cases the vertical profile of chlorophyll was modeled (1, 8, and 9), and in others it was assumed to be uniform (6 and 12). Four of the algorithms (1, 8, 9, and 12) applied a photosynthesis-irradiance relationship to calculate the chlorophyll-specific productivity. They required parameterizations of the maximum light-saturated rate of photosynthesis (PBmax) and photosynthetic efficiency (α, the slope of the PB versus E curve in low light). An alternative approach, used by algorithm 6, was to calculate the radiant energy absorbed by the phytoplankton and then apply a quantum yield (φ, moles carbon fixed per mole photons absorbed) to derive productivity. Temperature is generally used in the parameterization of PBmax or φmax.

[27] Five algorithms (4a, 4b, 4c, 5, and 11) belonged to the class of time-integrated models (TIMs), in which depth is resolved but both time and wavelength are integrated. These algorithms employed models of the daily production normalized to chlorophyll (Pz/Bz) as functions of the daily irradiance Ez at depth Z. Such models might resemble the photosynthesis-irradiance models (PB versus E), as was the case for the 4a–4c algorithms, or they might be based on a quantum yield approach (5 and 11). Algorithms 5 and 11 modeled the vertical distribution of chlorophyll, whereas the 4a–4c algorithms assumed a uniform chlorophyll profile. Chlorophyll was multiplied by the modeled Pz/Bz and then integrated over the water column to estimate depth-integrated production.
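To make the depth-resolved, time-integrated structure concrete, here is a minimal sketch of a generic model with a uniform chlorophyll profile. It is not any of the tested algorithms. The exponential light attenuation, the hyperbolic tangent form of the production-irradiance curve, and every parameter name and value are assumptions chosen for illustration.

```python
import math


def depth_integrated_production(b_sfc, e0, k_par, pb_opt, e_k, dz=1.0):
    """Generic uniform-profile, time-integrated sketch (hypothetical).

    b_sfc:  chlorophyll, assumed uniform with depth (mg Chl m^-3)
    e0:     daily PAR just below the surface (mol photons m^-2)
    k_par:  diffuse attenuation coefficient for PAR (m^-1)
    pb_opt: maximum daily chlorophyll-specific production
            (mg C (mg Chl)^-1 d^-1)
    e_k:    daily irradiance at which production saturates
            (mol photons m^-2)
    Returns depth-integrated daily production (mg C m^-2).
    """
    z_eu = math.log(100.0) / k_par  # depth of the 1% light level
    n_layers = max(1, int(round(z_eu / dz)))
    dz = z_eu / n_layers
    total = 0.0
    for i in range(n_layers):
        z_mid = (i + 0.5) * dz  # midpoint of the layer
        e_z = e0 * math.exp(-k_par * z_mid)
        # Assumed saturating response of P/B to daily irradiance
        p_over_b = pb_opt * math.tanh(e_z / e_k)
        total += b_sfc * p_over_b * dz
    return total


# Illustrative parameter values only
ip_low = depth_integrated_production(0.2, 35.0, 0.06, 60.0, 10.0)
ip_high = depth_integrated_production(0.4, 35.0, 0.06, 60.0, 10.0)
```

In this simplified sketch the attenuation coefficient is held fixed, so production is strictly proportional to chlorophyll. In practice, attenuation increases with chlorophyll, which shortens the depth of integration and makes the response less than proportional.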

[28] Two algorithms (3 and 7) belonged to the depth-integrated model (DIM) category. In these models, there was no vertical resolution of chlorophyll, light, or other properties; instead, IP was derived from integrated (IB) or average euphotic zone chlorophyll, surface PAR, and SST. Details of the individual algorithms are provided in Appendix A.

4. Results

4.1. Comparisons With 14C-Based Estimates

[29] The 12 algorithms varied widely in performance (Figure 2 and Table 5). Estimates falling within a factor of 2 of the 14C-based estimates are points bounded by the dashed lines in Figure 2. Many of the estimates fell within this factor-of-2 range, with the most notable exceptions occurring at Antarctic stations (open and solid triangles) and at PROBES stations in the Bering Sea (solid diamonds).

Figure 2.

Scatterplots of algorithm-derived primary production (IPalg, mg C m−2) versus production measured in situ (IPmeas, mg C m−2) for the 12 algorithms tested. The solid line represents perfect agreement, and the dashed lines represent factor-of-2 relative errors. The algorithm category [Behrenfeld and Falkowski, 1997b] is shown in the upper left corner of each plot.


Table 5. Performance Indices for Relative Errors in Algorithms as Compared With Measured IP^a

^a Columns are the mean (M), standard deviation (S), and root mean square (RMS) of the log-difference error (Δ). The geometric mean and one-sigma range of the ratio (F = IPalg/IPmeas) are given by Fmed, Fmin, and Fmax, respectively. The algorithm type is based on the categories defined by Behrenfeld and Falkowski [1997b].


[30] Performance indices are listed in Table 5. As a benchmark, RMS values of <0.3 indicate agreement within a factor of 2. The RMS values comprise a random error (indexed by S) and a systematic error or bias, M. Most algorithms exhibited large biases as indicated by nonzero values of M, which translated to median ratios, Fmed, ranging from 0.42 (algorithm 11) to 1.4 (algorithm 8). If the biases could be eliminated, the RMS error would equal S, in which case 10 of the 12 algorithms would be within a factor of 2 (S < 0.3). It may be possible to eliminate biases by reparameterizing the underlying relationships between production, chlorophyll, and light. The sensitivity of the algorithms to model parameterization may be seen by comparing results for algorithms 4a, 4b, and 4c, which differed only in their parameterization of PBopt (the TIM equivalent of PBmax).
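The relationships used in this paragraph can be written out explicitly. The decomposition below holds exactly when S is computed with divisor n (an assumption about how S was calculated), and the numerical values are simple conversions of figures quoted in the text.

$$\mathrm{RMS}^2 = \frac{1}{n}\sum_{j=1}^{n} \Delta_j^2 = M^2 + S^2, \qquad\text{so}\qquad \mathrm{RMS} = S \ \text{when}\ M = 0.$$

$$10^{0.3} \approx 2.0, \qquad \log(0.42) \approx -0.38, \qquad \log(1.4) \approx 0.15.$$

An RMS of 0.3 thus corresponds to a factor of 2, and the reported range of Fmed from 0.42 to 1.4 corresponds to biases M between about −0.38 and 0.15.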

4.2. Effect of Bsat Error

[31] Errors in the satellite chlorophyll algorithm were simulated using a random number generator that introduced a factor-of-2 error in Bsat. Considering the magnitude of the Bsat errors, the resulting increases in IPsat errors (Table 6) were remarkably small. Numbers in parentheses are the relative changes in algorithm performance compared with their corresponding values in Table 5. The RMS differences increased between 3 and 27%, chiefly due to increases in S, as reflected by the expanded one-sigma range of F. Algorithm 6 seemed the least sensitive to the chlorophyll errors, while algorithms 3 and 5 showed the greatest sensitivity.

Table 6. Performance Indices When Algorithms Used “Corrupted” Satellite Chlorophyll (Bsat) Instead of Measured Chlorophyll^a

| Algorithm | RMS (% change) | Fmin (% change) | Fmax (% change) | Correlation | Slope | Intercept |
|---|---|---|---|---|---|---|
| 1 | 0.31 (+17) | 0.37 (−13) | 1.16 (+12) | 0.99 | 0.50 | −0.003 |
| 3 | 0.28 (+27) | 0.50 (−15) | 1.80 (+11) | 0.98 | 0.62 | −0.008 |
| 4a | 0.30 (+17) | 0.41 (−11) | 1.57 (+10) | 0.99 | 0.57 | −0.001 |
| 4b | 0.29 (+21) | 0.42 (−12) | 1.41 (+12) | 0.99 | 0.56 | −0.001 |
| 4c | 0.35 (+10) | 0.48 (−8) | 2.40 (+7) | 0.99 | 0.56 | 0.000 |
| 5 | 0.35 (+22) | 0.49 (−16) | 2.42 (+14) | 0.99 | 0.76 | −0.003 |
| 6 | 0.28 (+3) | 0.74 (−5) | 2.35 (+1) | 0.94 | 0.30 | −0.006 |
| 7 | 0.47 (+2) | 0.22 (−4) | 0.95 (+13) | 0.85 | 0.37 | 0.018 |
| 8 | 0.30 (+11) | 0.74 (−10) | 2.56 (+7) | 0.98 | 0.56 | −0.004 |
| 9 | 0.39 (+12) | 0.28 (−13) | 0.96 (+11) | 0.99 | 0.58 | −0.005 |
| 11 | 0.51 (+2) | 0.19 (−3) | 0.96 (+4) | 0.96 | 0.42 | 0.004 |
| 12 | 0.49 (+13) | 0.23 (−10) | 0.82 (+9) | 0.97 | 0.53 | −0.001 |

^a Columns are the resulting values of RMS, Fmin, and Fmax, and (in parentheses) the percentage change relative to the values in Table 5. The correlation, slope, and intercept are based on regressions of Δsat versus ΔB.

[32] Regressions of Δsat = log(IPsat/IPalg) versus ΔB = log(Bsat/Bsfc) provided additional insight concerning the sensitivity of IP estimates to errors in Bsat. The results for algorithm 4b shown in Figure 3 are typical. In all but three cases, correlations between Δsat and ΔB were >0.97; all were >0.85. These high correlations reflect the deterministic nature of the relationships between IPalg and Bsfc. For any algorithm, if IPalg were directly proportional to Bsfc, then the regression of Δsat versus ΔB would have a slope of 1. Instead, the slopes ranged from 0.30 to 0.76, indicating that errors in Bsfc produced less-than-proportionate errors in IP. This is due, in part, to the nonlinearity of IP with respect to chlorophyll, which affects the depth of integration as well as the light-harvesting capacity of the phytoplankton. Most slopes fell between 0.5 and 0.6, which is consistent with several studies [Eppley et al., 1985; Morel and Berthon, 1989; Morel, 1988], which found that IP varies approximately as $B_{\mathrm{sfc}}^{0.5}$. The algorithm having the smallest slope (algorithm 6) was the one least affected by errors in Bsfc based on changes in its performance indices. Likewise, the two highest slopes correspond to the two algorithms (3 and 5) that showed the greatest sensitivity.
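The link between the regression slope and a power-law dependence on chlorophyll can be made explicit. Assuming, for illustration only, that an algorithm behaves as a pure power law with exponent k and constant c:

$$\mathit{IP} = c\,B^{k} \quad\Longrightarrow\quad \Delta_{\mathrm{sat}} = \log\!\left(\frac{\mathit{IP}_{\mathrm{sat}}}{\mathit{IP}_{\mathrm{alg}}}\right) = k\,\log\!\left(\frac{B_{\mathrm{sat}}}{B_{\mathrm{sfc}}}\right) = k\,\Delta B.$$

Under this assumption, a factor-of-2 error in chlorophyll translates into a factor of $2^{k}$ in production:

$$2^{0.30} \approx 1.23, \qquad 2^{0.5} \approx 1.41, \qquad 2^{0.76} \approx 1.69.$$

The observed range of slopes therefore corresponds to production errors of roughly 23% to 69% for a doubling of chlorophyll, with about 41% for a square-root dependence.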

Figure 3.

Error in satellite-derived production [Δsat = log(IPsat/IPalg)] associated with a simulated error in satellite chlorophyll [ΔB = log(Bsat/Bsfc)] for algorithm 4b. This is typical of relationships seen for other algorithms (Table 6).

4.3. Regional Comparisons

[33] Performance indices for pooled results from the nine regions are listed in Table 7. The PROBES (Bering Sea) region was the only case where the algorithms, on average, overestimated the 14C-based estimates (by 14%). In all other regions the algorithms underestimated the 14C-based estimates (between 5 and 52%). The PROBES (Bering Sea) region had the lowest pooled RMS (0.23), whereas the EqPac equator had the highest value (0.50), largely because of a high negative bias (M = −0.32). No region was uniformly better or worse for all algorithms. Individual algorithm RMS values (not shown) ranged from 0.19 to 0.47 in the PROBES region and from 0.13 to 0.79 in the EqPac equator region. The Arabian Sea had the three lowest RMS values (0.07, 0.07, and 0.06, which were for algorithms 6, 7, and 8, respectively).

Table 7. Performance Indices for Pooled Data Within Each Region (Data Set)
EqPac nonequator   13   −
EqPac equator       8   −0.32   0.38   0.50   0.48   0.49   1.06
Arabian Sea        12   −
Palmer LTER        10   −
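The pooled indices quoted above can be related to percentage biases with a short calculation. The sketch below assumes that the indices are computed from base-10 log differences between algorithm and 14C-based estimates, with M the mean and RMS the root-mean-square of those differences; the function names and sample values are hypothetical.

```python
import math

def performance_indices(ip_alg, ip_insitu):
    """Return (M, S, RMS) of the log10 differences between paired estimates."""
    deltas = [math.log10(a / o) for a, o in zip(ip_alg, ip_insitu)]
    n = len(deltas)
    m = sum(deltas) / n                                    # mean log difference (bias)
    s = math.sqrt(sum((d - m) ** 2 for d in deltas) / n)   # population standard deviation
    rms = math.sqrt(sum(d * d for d in deltas) / n)        # root-mean-square difference
    return m, s, rms

def bias_as_percent(m):
    """Convert a log10 bias M into a percentage over- or underestimate."""
    return (10.0 ** m - 1.0) * 100.0

# Hypothetical paired estimates (mg C m-2), for illustration only.
ip_alg = [300.0, 450.0, 900.0, 250.0]
ip_insitu = [500.0, 800.0, 1000.0, 600.0]
m, s, rms = performance_indices(ip_alg, ip_insitu)

print(f"M = {m:.2f}, S = {s:.2f}, RMS = {rms:.2f}")
print(f"M = -0.32 corresponds to {bias_as_percent(-0.32):.0f}%")
```

Under these assumptions, M = −0.32 corresponds to a factor of about 0.48, or roughly a 52% underestimate, which matches the upper end of the range quoted in the text. With a population standard deviation, RMS² = M² + S², so a large bias alone can dominate the RMS.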

[34] The two Antarctic regions (Palmer and AMERIEZ) appeared to have the worst results, judging from outliers in Figure 2, and yet pooled errors from these regions did not have especially high values of RMS or M. These regions included the lowest and highest values of primary production, as well as extremes in other variables. The apparent poor performance might actually be a result of the range in the Antarctic data causing high and low values to be more conspicuous.

[35] The two-way (regions and algorithms) ANOVA confirmed what was obvious from Figure 2, namely, that there were highly significant differences among regions [F(8,960) = 30.4; p ≪ 0.0005] and among algorithms [F(11,960) = 80.9; p ≪ 0.0005]. The algorithm-region interaction was also significant [F(88,960) = 5.7; p ≪ 0.0005], indicating that algorithm performances were region dependent.
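The degrees of freedom in these F statistics can be checked with simple bookkeeping. The sketch below assumes that every one of the 12 algorithms supplied an estimate at each of the 89 stations and that each estimate was treated as one observation in a two-way layout with interaction; the function name is hypothetical.

```python
def two_way_anova_df(n_regions, n_algorithms, n_stations):
    """Degrees of freedom for a two-way ANOVA with interaction."""
    n_obs = n_stations * n_algorithms            # one error value per station and algorithm
    df_regions = n_regions - 1
    df_algorithms = n_algorithms - 1
    df_interaction = df_regions * df_algorithms
    df_error = n_obs - n_regions * n_algorithms  # observations minus number of cells
    return df_regions, df_algorithms, df_interaction, df_error

print(two_way_anova_df(n_regions=9, n_algorithms=12, n_stations=89))
# (8, 11, 88, 960)
```

These values agree with the reported F(8,960), F(11,960), and F(88,960), which supports the assumption that all 89 × 12 = 1068 errors entered the analysis.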

4.4. Algorithm Comparisons

[36] Although there were significant differences among algorithms, pairwise comparisons revealed high correlations (>0.9) in many cases. In general, the degree of correlation was unrelated to the algorithm complexity or category [Behrenfeld and Falkowski, 1997b]. Examples of several highly correlated pairs are illustrated in Figure 4. The two algorithms most highly correlated were 9 (WRM) and 4b (TIM). The best-performing algorithms of each type [1 (WRM), 3 (DIM), and 4b (TIM)] were strongly correlated with one another. These results indicate that the structure or complexity of an algorithm has no evident relationship to its performance.

Figure 4.

Scatterplots comparing algorithms 1 (WRM), 3 (DIM), 4a (TIM), 4b (TIM), and 9 (WRM).

5. Discussion

[37] Some of the variance in performance is likely due to methodological differences within the test data set itself, particularly in the diversity of 14C incubation methods. Short-term incubations (e.g., the Palmer LTER data) generally approximate gross primary production, whereas longer-term (e.g., 24-hour) incubations more closely approximate net primary production. If the algorithms were parameterized to yield net primary production, then they should consistently underestimate the Palmer LTER data. From Table 7, we see that the algorithms did, in fact, underestimate Palmer data, but to a lesser extent than the JGOFS data sets (EqPac and Arabian Sea), which were 24-hour incubations. There appeared to be no trends relative to the type of incubation (whether in situ, simulated in situ, or in a photosynthetron).

[38] The data furnished to participant teams did not include the year. In retrospect, we think this might have been a mistake, particularly in areas such as the equatorial Pacific where interannual variability is high and well understood. Another reason the year is important is that the “clean techniques” used for the past 2 decades generally produce higher estimates of primary production [Fitzwater et al., 1982]. These considerations would only apply if algorithms somehow adjusted for interannual variability or the technique used. Since the year of the measurement was not provided, the algorithms made no adjustment for these factors. It is interesting that the oldest data set (PROBES, 1979–1981) was the only one in which algorithms overestimated the 14C-based production, a result that could be explained if the algorithms had been parameterized for clean techniques. The highest negative bias was in the equatorial Pacific, where, on average, algorithm predictions were only half the measured IP values, but these data (EqPac equator, February–October 1996) were from a “normal” year, before the onset of the 1997–1998 El Niño. This does not explain why the algorithms would be too low, assuming they were also parameterized for normal conditions.

[39] A comparison of algorithm predictions with measurements made at discrete locations does not account for the real value of the remote sensing measurement, namely, the improved spatial and temporal coverage afforded by satellite observations. The coverage and long-term consistency afforded by remote sensing can compensate to some degree for its lack of accuracy. Ideally, both ship-based and satellite measurements will be used to monitor for changes in primary production at large scales. A robust integrative model, assimilating both in situ and remotely sensed data, will likely be required. A relatively simple technique has been demonstrated with CZCS data whereby global satellite maps are adjusted by blending them with in situ measurements [Gregg and Conkright, 2001]. The result is to remove biases found in the satellite products. We observed significant biases relative to the 14C data and also when comparing algorithms with one another. We strongly urge that additional effort be invested to understand why algorithms differ systematically from one another and from the 14C data.

[40] The round-robin experiments elicited some debate as to whether computationally complex algorithms are worthwhile. The fact that simpler algorithms (DIMs and TIMs) performed as well as or better than complex algorithms (WRMs) suggests that the computational complexity may be unnecessary and, in fact, may be ill advised given concerns about scaling (see below). However, several participants argued in favor of the highly resolved models. They maintained that computational complexity is not an issue, because of the speed of modern computers, and that the advantage of such models is that they link algorithms to the experimental methods, carried out at the same scales, which inform our understanding of photosynthesis in the ocean [Kirk, 1994; Falkowski and Raven, 1997]. Such detailed algorithms have a greater opportunity to incorporate future advances in remote sensing that might provide information on accessory pigments, absorption by dissolved organic matter, or fluorescence yield. So far, however, there has not been a clear demonstration that additional complexity improves the performance of algorithms.

[41] Scaling issues are potentially an important concern that should be addressed with more rigor [Bidigare et al., 1992; Campbell et al., 1995]. Satellite-derived fields represent mean properties at much larger scales than the in situ data used to parameterize the algorithms. Typically, satellite-derived primary production represents the mean production over an area of at least 1 km2 (often much larger) and over timescales of a week or longer. Many of the algorithms employ nonlinear relationships that were derived from measurements made at the spatial scale of an incubation bottle and at the timescale of hours. When the same models are applied to chlorophyll, light, and temperature averaged over the satellite scales, the result is not necessarily the mean primary production at the larger scale. Variance existing within the larger “averaging bin” affects the mean IP at the larger scale, but this variance is generally not incorporated into model parameterizations (for a good discussion of this issue see Trela et al. [1995]). This points to the importance of matching the scales at which the models are parameterized to the scales of the satellite products.
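The effect of within-bin variance on a nonlinear model can be shown with a small numerical example. The sketch below assumes a hypothetical square-root dependence of IP on chlorophyll and an invented set of sub-pixel chlorophyll values; neither is taken from the tested algorithms or the test data.

```python
def toy_ip(b, exponent=0.5):
    # Hypothetical nonlinear (concave) model: IP proportional to B**exponent.
    return b ** exponent

# Hypothetical chlorophyll values (mg m-3) inside one satellite averaging bin.
sub_pixel_chl = [0.1, 0.1, 0.1, 3.7]

mean_chl = sum(sub_pixel_chl) / len(sub_pixel_chl)

# What the satellite-scale calculation does: apply the model to the mean.
ip_of_mean = toy_ip(mean_chl)

# What the bin actually produces: the mean of the model over the sub-pixels.
mean_of_ip = sum(toy_ip(b) for b in sub_pixel_chl) / len(sub_pixel_chl)

print(f"model applied to mean chlorophyll: {ip_of_mean:.2f}")
print(f"mean of model over sub-pixels:     {mean_of_ip:.2f}")
```

In this contrived case the model applied to the bin-mean chlorophyll gives about 1.00, whereas the true bin-mean production is about 0.72. For a concave relationship the first quantity is never smaller than the second, so the size of the discrepancy depends on how much variance the averaging bin contains.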

[42] Satellite-derived primary production is much more difficult to “validate” than many of the other derived properties such as chlorophyll or SST. The latter can be validated by obtaining in situ measurements at the time of a satellite overpass. Although the spatial scale of the in situ measurement would not match that of the pixel (1 km2), at least the two would be simultaneous. Because incubations take several hours, the in situ primary production measurement will never match the timescale of a polar-orbiting satellite. At best, one can compare the estimate for a particular day and pixel. A more elaborate validation effort would be to observe diurnal changes in chlorophyll and light and then consider how this variability affects the satellite IP estimate.

[43] Knowledge of the vertical distribution of chlorophyll and light should improve primary productivity algorithms. The vertical distribution of light was represented in all algorithms except the DIMs (3 and 7), but only five algorithms modeled the vertical chlorophyll structure (1, 5, 8, 9, and 11). There was no evidence, however, that modeling the vertical structure was advantageous. One of the DIMs (algorithm 3), with no vertical resolution, did as well as or better than the algorithms with depth-resolved properties.

[44] Behrenfeld and Falkowski [1997a] demonstrated that the single most important parameter needed to improve algorithms is information on the maximum light-saturated rate of photosynthesis, PBmax (or PBopt). In many of the tested algorithms, temperature was used to derive this parameter, but the lack of consistency among available models suggests that temperature alone is not enough [Behrenfeld and Falkowski, 1997b]. Recently, a new PBmax model has been developed [Behrenfeld et al., 2002] that accounts for the effects of nutrient availability and photoacclimation. For this to be applicable to remote sensing, this model still requires the development of methods to assess the nutrient status and the physical structure, but results are promising. An avenue of current research along these lines involves the use of the natural (solar-stimulated) chlorophyll fluorescence, which can be remotely sensed by a sensor with sufficient spectral and radiometric sensitivity [Letelier and Abbott, 1996]. The MODIS instrument [Esaias et al., 1998] is currently making such measurements. The fluorescence yield may be inversely related to the quantum yield of photosynthesis [Falkowski and Kiefer, 1985; Kiefer and Reynolds, 1992], and thus if reliable measures of chlorophyll, PAR, and chlorophyll fluorescence can be made, these may be used to parameterize PBmax. A combination of satellite and in situ measurements will be needed to address these issues.

6. Conclusions

[45] Conclusions related to the four questions addressed by this study are summarized as follows:

[46] 1. How do algorithm estimates of primary production derived strictly from surface information compare with estimates derived from 14C incubation methods? The 12 algorithms tested varied widely in performance. The best-performing algorithms agreed with the 14C-based estimates within a factor of 2. Two of these algorithms have been adapted by NASA for producing primary productivity maps with MODIS data. Most of the algorithms had significant biases causing them to differ systematically from the in situ data. A concerted effort should be made to understand the cause of the biases and to eliminate them if possible.

[47] 2. How does the error in satellite-derived chlorophyll concentration affect the accuracy of the primary productivity algorithms? The relative errors in primary productivity (Δsat) resulting from the simulated errors in surface chlorophyll concentration (ΔB) were highly correlated with ΔB. This fact reflects the deterministic relationship between production and chlorophyll in the underlying models. The slopes of the regressions (Δsat versus ΔB) ranged between 0.3 and 0.8, indicating that errors in surface chlorophyll produce less-than-proportionate errors in IP.

[48] 3. Are there regional differences in the performance of algorithms? There were significant regional differences, as well as algorithm-region interactions, indicated by the ANOVA results. No one region was uniformly better or worse for all algorithms. The region with the most serious biases was the equatorial Pacific, where algorithms underestimated in situ measurements by a factor of 2.

[49] 4. How do algorithms compare with each other in terms of complexity vis-a-vis performance? Many of the algorithms were highly correlated with one another. This was not surprising, since several are based on the same models, but what was surprising was that the level of agreement had no apparent relationship to the mathematical structure or complexity of the algorithms. In some cases, complex algorithms based on depth-, time- and wavelength-resolved models were highly correlated with simpler algorithms that were time and/or depth integrated. There were distinct systematic differences between algorithms. A future effort to understand systematic differences is strongly recommended.

7. Future Considerations

[50] Four of the algorithms tested are now being applied operationally to satellite data, or are planned for use with near-future missions. A third round-robin exercise is currently underway. In the third round robin, algorithms are given global fields of satellite-derived chlorophyll, SST, and PAR, and a detailed comparison of the algorithms is being conducted to determine how and where they differ.

[51] In accordance with our recommendation, future round robins will not be blind. The anonymous nature of the results presented here seriously diminishes their usefulness beyond the participants themselves. A more open approach would have facilitated detailed comparisons between algorithms to investigate, for example, why there were systematic differences (e.g., Figure 4). The only way this could have been done under the ground rules of a blind comparison would have been if the ATS ran the codes instead of the development teams. The level of effort involved on the part of the ATS was not feasible at the time this exercise was conducted.

[52] Comparisons with in situ data are also being made in the ongoing round robin. Algorithms will be compared with over 1,000 in situ measurements, all made according to JGOFS protocols. The number of stations (89) used for evaluating algorithms in the second round robin was much too small to adequately characterize the performance of algorithms. The goal of the algorithms should be net primary production, because that is what both land and ocean satellite products are intended to represent [Behrenfeld et al., 2001]. Thus 24-hour in situ incubations are the preferred method.

Appendix A: Algorithms

A1. Algorithm 1

[53] This algorithm employed a photosynthesis-irradiance relationship with physiological P versus E parameters (α and Pmax) taken from the literature. The relationship of Eppley [1972] was used to compute Pmax as a function of temperature. The spectral downwelling irradiance incident at the surface was estimated based on the 5S code [Tanré et al., 1990] and on cloudiness determined as the ratio of the given PAR to clear-sky PAR estimated from the model. Chlorophyll profiles were based on statistical models that were selected based on the upper (surface) chlorophyll concentration as an index of the “trophic level” [Morel and Berthon, 1989; Berthon and Morel, 1992]. A bio-optical model, based on optical measurements made at sea, was used to propagate the radiative field through the water column. The shapes of the algal absorption spectra were derived from in vitro experiments. The magnitude of the spectra employed a statistical analysis of chlorophyll-specific absorption of algae as a function of the trophic level [Bricaud et al., 1995] and the Wozniak et al. [1992] results concerning variations in the quantum yield with trophic level [see Morel et al., 1996].

A2. Algorithm 3

[54] Chlorophyll concentration was assumed to be uniform over the euphotic layer, and IP was computed as IPalg = PB · Bsfc · Zm, where PB is the daily primary production rate per mg chlorophyll and Zm is the depth of the 1% light level. A simple PB versus E model was used to compute PB as a function of the average PAR within the euphotic layer, E = E0/4.6. The PB versus E model used a constant value of α = 0.11 mg C (mg Chl)−1 h−1 (W m−2)−1 from Platt et al. [1991] for Atlantic noncoastal waters and used a relationship in which PBmax depends on SST [Eppley, 1972].
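One plausible reading of the factor 4.6 is that it approximates ln(100), the optical depth of the 1% light level. The sketch below assumes PAR decays exponentially with a constant attenuation coefficient k (a hypothetical input) and computes the mean irradiance over the euphotic layer analytically; the function names and the values of k, PB, and Bsfc are illustrative, and the source does not state how the factor was derived.

```python
import math

def euphotic_mean_irradiance(e0, k):
    """Depth of the 1% light level and mean PAR above it, for E(z) = e0*exp(-k*z)."""
    zm = math.log(100.0) / k                              # exp(-k*zm) = 0.01
    mean_e = e0 * (1.0 - math.exp(-k * zm)) / (k * zm)    # analytic layer average
    return zm, mean_e

def depth_integrated_production(pb, b_sfc, zm):
    """Product form described for algorithm 3: IP = PB * Bsfc * Zm."""
    return pb * b_sfc * zm

e0 = 100.0   # surface PAR, arbitrary units
k = 0.05     # hypothetical attenuation coefficient, m-1
zm, mean_e = euphotic_mean_irradiance(e0, k)

print(f"Zm = {zm:.1f} m, mean E / E0 = {mean_e / e0:.3f}, 1/4.6 = {1 / 4.6:.3f}")

# Hypothetical PB (mg C (mg Chl)-1 d-1) and Bsfc (mg Chl m-3), for illustration.
ip = depth_integrated_production(pb=20.0, b_sfc=0.5, zm=zm)
```

The analytic average is E0 × 0.99/ln(100), or about E0/4.65, which is close to the E0/4.6 quoted above regardless of the value chosen for k.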

A3. Algorithms 4a–4c

[55] In these algorithms the relationship between daily carbon fixation and daily average PAR at each depth was calculated using a constant slope for the light-limited region of the water column and using various models for the maximum photosynthetic rate (PBopt). The three versions differed with respect to the models used for PBopt. Algorithm 4a employed a seventh-order polynomial fit to empirical data as described by Behrenfeld and Falkowski [1997a]; algorithm 4b used a modification of the Eppley model as described by Antoine and Morel [1996a]; and algorithm 4c assumed a constant value of PBopt equal to 4.6 mg C (mg Chl)−1 h−1. The chlorophyll profile was assumed to be constant and equal to the surface value.

A4. Algorithm 5

[56] In this algorithm a chlorophyll-specific absorption coefficient for PAR was modeled as a function of time of year, ranging from 0.006 to 0.015 m2 (mg Chl)−1, with the maximum occurring in the summer months. The total attenuation coefficient for PAR included phytoplankton absorption, together with water and detrital attenuation, and then PAR irradiance profiles, Epar(z), were derived according to Beer's Law. Chlorophyll was assumed to have a Gaussian-shaped subsurface chlorophyll maximum for surface values <0.4 mg m−3 and was assumed constant with depth, otherwise. Production as a function of depth was then calculated using an irradiance-dependent formulation for quantum yield together with phytoplankton absorption and Epar(z), and production was then integrated over depth.
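The Beer's Law step can be sketched for the vertically uniform chlorophyll case. The example below assumes that the total attenuation coefficient is the sum of a chlorophyll-dependent phytoplankton term and a single lumped term for water and detritus; the function name, the lumped coefficient, and the input values are hypothetical, and only the range of the chlorophyll-specific absorption coefficient comes from the description above.

```python
import math

def par_profile(e0, chl, a_star, k_other, depths):
    """PAR at each depth from Beer's Law with a constant attenuation coefficient.

    e0      : PAR just below the surface (any irradiance unit)
    chl     : vertically uniform chlorophyll, mg m-3
    a_star  : chlorophyll-specific absorption coefficient for PAR, m2 (mg Chl)-1
    k_other : lumped water + detrital attenuation, m-1 (hypothetical value)
    depths  : iterable of depths, m
    """
    k_total = a_star * chl + k_other
    return [e0 * math.exp(-k_total * z) for z in depths]

depths = [0.0, 10.0, 20.0, 40.0, 80.0]
# a_star is chosen inside the 0.006-0.015 range quoted for algorithm 5.
profile = par_profile(e0=40.0, chl=0.5, a_star=0.010, k_other=0.04, depths=depths)

for z, e in zip(depths, profile):
    print(f"z = {z:5.1f} m   Epar = {e:6.2f}")
```

A subsurface chlorophyll maximum would make the attenuation coefficient depth dependent, in which case the exponent would have to be accumulated layer by layer rather than computed from a single coefficient.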

A5. Algorithm 6

[57] This algorithm calculates the spectral radiation absorbed by phytoplankton and multiplies that by a quantum yield to compute the hourly rate of primary production at each depth. The solar irradiance is split into spectral components via the 5S radiative transfer code [Tanré et al., 1990], and the spectral light field is propagated through the ocean using very simple two-stream approximations. The absorption and scattering coefficients required for this were obtained from the literature [Smith and Baker, 1981; Pope and Fry, 1997; Bricaud et al., 1995; Gordon and Morel, 1983; Petzold, 1972]. All absorption calculations were carried out spectrally and then integrated (400–700 nm). Quantum yield was calculated using a parameterization based on maximum quantum yield of 0.03 mol C (mol quanta)−1 and a light-dependent term. The chlorophyll profile was assumed to be vertically uniform.

A6. Algorithm 7

[58] This algorithm is based on empirical relationships developed by the author from data obtained on many expeditions in tropical, temperate, and polar regions. The primary production data were from half-day in situ incubations, and chlorophyll was measured by spectrophotometric methods without applying a correction for phaeopigments. Estimation of the daily primary production was obtained using a “psi-based” formulation: IP = ψ · E0 · DL · IB, where ψ (“psi”) was empirically modeled as a function of temperature for three trophic zones determined by the surface chlorophyll level. E0 was the daily incident radiation, DL was the hours of daylight, and IB was empirically modeled from the author's own data, where different models were applied depending on the surface chlorophyll level and the zone (tropical, temperate, or polar).

A7. Algorithm 8

[59] This algorithm employed a photosynthesis-irradiance relationship whose parameters were determined statistically for the biogeochemical province in which the station is located [Longhurst et al., 1995]. Similarly, the vertical chlorophyll profile was based on statistical models of profiles for each province. Surface incident irradiance was determined based on cloudiness (in a manner similar to that used by algorithm 1). A full radiative transfer code was then used to propagate spectral irradiance downward through the water column.

A8. Algorithm 9

[60] This algorithm is similar to that described by Morel [1991]; solar spectral irradiance was estimated using the Gregg and Carder [1990] model with a wind speed of 4 m s−1, water vapor of 2 cm, and visibility of 23 km. Clear-sky surface spectral irradiance was scaled to the measured surface PAR. The diffuse downwelling attenuation coefficient was estimated as the sum of the total absorption coefficient plus backscattering coefficient divided by the average cosine. Methods for estimating total absorption and backscattering are from Morel [1988, 1991]. The vertical profile of chlorophyll was simulated using the models of Morel and Berthon [1989], and the temperature dependence of PBmax was based on an Eppley model as modified by Antoine and Morel [1996a]. A constant value of 0.033 m2 (mg Chl)−1 was used for the chlorophyll-specific absorption coefficient at 440 nm. Daily primary production was determined by trapezoidal integration in hourly time steps over the photoperiod and at 0.5-m depth intervals.
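The final integration step can be sketched as a nested trapezoidal rule on a regular grid. The example below assumes a photoperiod and maximum depth that are whole multiples of the step sizes; the function names and the toy production-rate function are hypothetical and stand in for the full spectral model described above.

```python
import math

def trapezoid(values, step):
    """Trapezoidal rule for equally spaced samples."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

def daily_integrated_production(rate, daylength_h, z_max_m, dt_h=1.0, dz_m=0.5):
    """Integrate rate(z, t) over depth, then over the photoperiod.

    rate(z, t) returns production in mg C m-3 h-1 at depth z (m) and time t (h).
    """
    n_t = int(round(daylength_h / dt_h)) + 1
    n_z = int(round(z_max_m / dz_m)) + 1
    times = [i * dt_h for i in range(n_t)]
    depths = [j * dz_m for j in range(n_z)]
    # Depth integral at each hourly time step (mg C m-2 h-1).
    column_rates = [trapezoid([rate(z, t) for z in depths], dz_m) for t in times]
    # Time integral over the photoperiod (mg C m-2).
    return trapezoid(column_rates, dt_h)

def toy_rate(z, t, daylength_h=12.0):
    # Hypothetical rate: sinusoidal over the day, exponential decay with depth.
    return 2.0 * math.sin(math.pi * t / daylength_h) * math.exp(-0.05 * z)

ip = daily_integrated_production(toy_rate, daylength_h=12.0, z_max_m=100.0)
print(f"IP = {ip:.0f} mg C m-2")
```

With hourly steps the time integral is resolved by only about a dozen samples, so the step sizes trade accuracy against the cost of evaluating the rate at every depth and hour.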

A9. Algorithm 11

[61] This algorithm used input data on surface chlorophyll, temperature, and light and estimated vertical profiles of these three properties over the euphotic zone. The depth distributions of chlorophyll and temperature were estimated using empirical relationships derived from a large globally distributed data set. The daily production at each depth was calculated as the product of chlorophyll, the daily PAR (Epar), and the chlorophyll-specific light utilization efficiency (ψ). The chlorophyll-specific light utilization efficiency, ψ, was constrained not to exceed a theoretical maximum based on the ambient temperature.

A10. Algorithm 12

[62] In this algorithm the light field was spectrally and vertically resolved, but a uniform vertical distribution of chlorophyll was assumed. The algorithm calculation requires knowledge of surface chlorophyll concentration, surface light, temperature, mixed layer depth, and the concentration of a limiting nutrient. From this information, estimates of the P versus E parameters were made, and thus P was determined at each depth and integrated to estimate IP.


[63] Support for NASA’s Ocean Primary Productivity Working Group was provided by NASA. The first author was funded by NASA’s Earth Science Enterprise as a member of the MODIS Instrument Team (NAS5-96063) and SeaWiFS Science Team (NSG5-6289). The other U.S. participants were funded by NASA’s Earth Science Enterprise through grants to the various individuals (too numerous to mention by number). The authors wish to thank Seung-Hyun Son, for assistance with the graphics, and Mark Dowell, Timothy Moore, and three anonymous reviewers whose helpful suggestions substantially improved the paper. In addition, we are grateful to the organizers of JGOFS, for open access to their data, and to colleagues who furnished data, including Barbara Prezelin and Mark Moline (Palmer LTER), Nick Welschmeyer (SUPER), Walker Smith (AMERIEZ), Jay O’Reilly (MARMAP), and Lou Codispoti (PROBES).