Criteria are proposed for evaluating sea surface temperature (SST) retrieved from satellite infra-red imagery: bias should be small on regional scales; sensitivity to atmospheric humidity should be small; and sensitivity of retrieved SST to surface temperature should be close to 1 K K−1. Their application is illustrated for non-linear sea surface temperature (NLSST) estimates. 233929 observations from the Advanced Very High Resolution Radiometer (AVHRR) on Metop-A are matched with in situ data and numerical weather prediction (NWP) fields. NLSST coefficients derived from these matches have regional biases from −0.5 to +0.3 K. Using radiative transfer modelling we find that a 10% increase in humidity alone can change the retrieved NLSST by between −0.5 K and +0.1 K. A 1 K increase in SST changes NLSST by <0.5 K in extreme cases. The validity of estimates of sensitivity by radiative transfer modelling is confirmed empirically.
Anding and Kauth proposed that sea surface temperature (SST) could be determined from radiometric temperatures observed at two wavelengths within the “window” between 10 and 13 μm. Differential transmission allows the effect of the atmosphere to be (approximately) eliminated by linear combination of brightness temperatures (BTs), such that the SST can be estimated.
 Anding and Kauth's estimate of the likely accuracy of this “split-window” approach was 0.15 K. This was optimistic, given the radiometric noise and calibration of sensors flown, the variability of water vapour and aerosols in the atmosphere, and the difficulty of effective cloud screening. Nonetheless, split window algorithms have underpinned routine estimation of global SSTs with accuracy ∼0.5 K. A widely utilized time-series comes from the Pathfinder project for SST from the Advanced Very High Resolution Radiometer (AVHRR) series [Kilpatrick et al., 2001].
 As users become more demanding, exploring the limitations of SST retrieval techniques is increasingly necessary. This paper contributes some new approaches to this, highlighting characteristics of SSTs not thoroughly discussed in existing literature. We illustrate these using SSTs obtained with the Pathfinder methodology to make the results widely pertinent, but the approach is valid for all SSTs based on coefficients. These characteristics are regional bias, sensitivity to water vapour and imperfect sensitivity to SST. Such features are important for applications of SSTs in numerical weather prediction (NWP), operational oceanography and climate.
2. Criteria for Retrieved SST
 SSTs from satellites are usually validated against in situ observations, most often from drifting buoys [e.g., Brisson et al., 2002]. A typical approach is to find the global mean and standard deviation of satellite-drifter differences. If the retrieval estimates the temperature of the ocean's skin layer [Saunders, 1967], which is usually cooler than the water below, assessment of “bias” relative to drifters should account for this [Merchant and Le Borgne, 2004]. The standard deviation should be as small as possible. The error in quality-controlled drifters when representing spatial scales corresponding to satellite observations is ∼0.2 K [O'Carroll et al., 2008].
 We propose three further desirable characteristics for SST retrievals: 1. On scales >1000 km (“regional scales”) bias should be small (say, <0.1 K). 2. Sensitivity to atmospheric water vapour should be negligible (close to zero). 3. Sensitivity to true changes in skin SST should be close to 1.
 Below, we present a straightforward method of assessing the inherent bias on regional scales, and a means of assessing sensitivity based on radiative transfer modelling.
3. Methods and Data
 We use AVHRR and NWP forecast data for the Metop-A polar-orbiter, following Merchant et al. Matches were extracted operationally by the Ocean and Sea Ice Satellite Application Facility (OSI-SAF) of the European Organisation for the Exploitation of Meteorological Satellites. The matches span April 2007 to March 2008, and comprise night-time co-incidences in space of single AVHRR pixels with drifter SSTs within 3 hours (mean absolute time difference is 1 h 20 min). Only matches with low Saharan Dust Index (SDI < 0.2 [Merchant et al., 2006]) are used, to focus on clean-atmosphere effects. There are 233929 matches, well distributed geographically (Figure S1 of the auxiliary material).
 NWP forecast fields within 3 hours of each match are obtained from Météo-France's ARPEGE system, following Merchant et al., including adjustment of forecast humidity. The NWP data comprise profiles of atmospheric temperature and humidity, and surface temperature at latitude-longitude resolution of 0.5°. The nearest profile is associated with each match. Brightness temperatures (BTs) are calculated from each NWP profile using RTTOV8.7 [Saunders et al., 2002]. We also find using RTTOV the partial derivative of the BT at each channel, λ, with respect to surface temperature ($\partial y_\lambda/\partial x$, where y indicates BT and x is surface temperature) and with respect to total column water vapour ($\partial y_\lambda/\partial w$, where w is total column water vapour, TCWV). The latter estimate is formed by perturbing the water vapour profile at all levels by a fixed proportion.
 The Pathfinder procedure for coefficient derivation is followed (http://www.rsmas.miami.edu/groups/rrsl/pathfinder/Algorithm/algo_index.html), the only distinction being the independent collection of matches, described above. The Pathfinder procedure calculates coefficients for a given month using a 5 month rolling window weighted most heavily to the central month. Within the window, observed BTs matched to drifting buoy observations are used to define the SST retrieval relationship by weighted least squares regression. An initial regression fit is used to identify outliers, which are down-weighted in the final regression that generates coefficients. The non-linear SST algorithm of Walton et al. is applied separately on two ranges of difference between 11 and 12 μm BTs, namely, greater and less than 0.7 K. When used for retrieval, these two NLSST equations are linearly interpolated between BT differences of 0.5 and 0.9 K. Since, in this study, the coefficients are applied to the matches from which they are derived, the coefficients are ‘ideal’ within the limits of the NLSST formalism and derivation procedure.
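 The interpolation between the two NLSST equations can be sketched as follows. This is only a minimal illustration, assuming a simple linear weight on the 11 − 12 μm BT difference; the function names and the example SST values are hypothetical and are not taken from the Pathfinder code.

```python
def blend_weight(bt_diff, lower=0.5, upper=0.9):
    """Weight given to the high-difference equation.

    Returns 0 at or below `lower`, 1 at or above `upper`, and a linear ramp
    in between (thresholds in K, as quoted in the text).
    """
    if bt_diff <= lower:
        return 0.0
    if bt_diff >= upper:
        return 1.0
    return (bt_diff - lower) / (upper - lower)


def blended_sst(sst_low, sst_high, bt_diff):
    """Linearly interpolate between the two NLSST estimates.

    sst_low  : SST from coefficients fitted for BT differences < 0.7 K
    sst_high : SST from coefficients fitted for BT differences > 0.7 K
    """
    w = blend_weight(bt_diff)
    return (1.0 - w) * sst_low + w * sst_high


if __name__ == "__main__":
    # Hypothetical values: the two equations disagree by 1 K.
    print(blended_sst(290.0, 291.0, 0.7))
```

 Outside the 0.5 to 0.9 K interval, a single equation is used unchanged, so the blend only affects matches near the 0.7 K boundary.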
 The Pathfinder NLSST equation is:
$$\hat{x} = a_0 + a_1 y_{11} + a_2 x_b (y_{11} - y_{12}) + a_3 S (y_{11} - y_{12}) \qquad (1)$$
where $\hat{x}$ is the retrieved SST, $a_0$ is an offset coefficient, coefficients $a_1$ to $a_3$ weight the observed brightness temperatures at 11 and 12 μm (respectively $y_{11}$ and $y_{12}$), $S = \sec(\theta) - 1.0$ where θ is the satellite zenith angle, and $x_b$ is prior SST, restricted to the range −2° to 28°C. The in situ observations are used for $x_b$.
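 As an illustration, the NLSST evaluation can be sketched in a few lines. The sketch assumes the commonly published NLSST form, in which the retrieved SST is a0 + a1·y11 + a2·xb·(y11 − y12) + a3·S·(y11 − y12); the coefficient values in the example are hypothetical placeholders, not Pathfinder coefficients, and the function name is our own.

```python
import math


def nlsst(y11, y12, zenith_deg, xb_celsius, coeffs):
    """Evaluate the assumed NLSST form for one observation.

    y11, y12   : brightness temperatures at 11 and 12 um
    zenith_deg : satellite zenith angle in degrees
    xb_celsius : prior SST in degC (clamped to -2..28 degC as in the text)
    coeffs     : (a0, a1, a2, a3), hypothetical values supplied by the caller
    """
    a0, a1, a2, a3 = coeffs
    s = 1.0 / math.cos(math.radians(zenith_deg)) - 1.0  # S = sec(theta) - 1
    xb = min(max(xb_celsius, -2.0), 28.0)               # restrict the prior SST
    diff = y11 - y12
    return a0 + a1 * y11 + a2 * xb * diff + a3 * s * diff


if __name__ == "__main__":
    # Placeholder coefficients chosen only to exercise the function.
    print(nlsst(290.0, 289.0, 30.0, 20.0, (0.0, 1.0, 0.1, 1.0)))
```

 Clamping the prior SST means that, above 28°C or below −2°C, changes in the prior no longer alter the retrieval.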
4. Characteristics of Pathfinder NLSST
 The regression procedure to generate coefficients in equation (1) implicitly involves retrieved SSTs whose weighted mean square residual compared to drifters is minimized by the regression. Geographical variations in the bias intrinsic to the Pathfinder algorithm can be assessed by mapping the cell-averaged difference of the retrieved and the drifter SSTs (Figure 1a). These biases are “intrinsic” in that they arise despite applying the NLSST algorithm with optimum coefficients derived from the matches themselves. They show that the true BT-SST relationships are not accurately captured by the NLSST formalism. A coherent negative bias of −0.2 to −0.4 K is present in the equatorial Atlantic Ocean. There is a tendency to positive biases between +0.2 and +0.4 K in the Southern Ocean, east Indian Ocean, and Caribbean Sea. Other regions have biases within −0.2 to +0.2 K. For analysis of SST, it is preferable to reduce biases in the satellite SST product, rather than rely on bias correction in the analysis system, because any mismatch of scales between the bias correction scheme and the bias pattern may lead to elements of the original bias persisting in the analysis (R. Reynolds, personal communication, 2009).
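 The cell averaging behind such a bias map can be sketched as below. The 5° cell size, the tuple layout of a match and the function name are assumptions made for illustration; the cell size used for Figure 1a is not stated here.

```python
import math
from collections import defaultdict


def cell_mean_bias(matches, cell_deg=5.0):
    """Average (satellite - drifter) SST difference per latitude-longitude cell.

    matches : iterable of (lat, lon, sst_satellite, sst_drifter) tuples
    cell_deg: cell size in degrees (an assumed value, not from the paper)
    Returns a dict mapping (lat_index, lon_index) -> mean difference in K.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, sst_sat, sst_drifter in matches:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        sums[cell] += sst_sat - sst_drifter
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


if __name__ == "__main__":
    # Three made-up matches: two share a cell, one lies elsewhere.
    demo = [
        (1.0, 2.0, 300.5, 300.0),
        (3.0, 4.0, 300.1, 300.0),
        (-12.0, 100.0, 299.0, 299.5),
    ]
    for cell, bias in sorted(cell_mean_bias(demo).items()):
        print(cell, round(bias, 3))
```

 In practice a minimum number of matches per cell would probably be required before a cell mean is interpreted as a regional bias.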
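 To assess the sensitivity of the NLSST to atmospheric water vapour, we estimate the partial derivative of the retrieved NLSST, $\hat{x}$, with respect to TCWV, $w$:
$$\frac{\partial \hat{x}}{\partial w} = (a_1 + a_2 x_b + a_3 S)\frac{\partial y_{11}}{\partial w} - (a_2 x_b + a_3 S)\frac{\partial y_{12}}{\partial w} \qquad (2)$$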
We would ideally like this sensitivity to be zero, but the coefficients have not been selected for this. In general, equation (2) does not evaluate to zero, in agreement with empirical observations [e.g., Kumar et al., 2003]. We estimate the partial derivatives from BT changes caused by a 5% negative perturbation of the absolute humidity at all levels in the atmosphere, with SST held constant. Figure 1b shows the TCWV sensitivity expressed as the change in NLSST for a +10% change in absolute humidity. Synoptic scale fluctuations ∼10% of TCWV are plausible throughout the marine atmosphere. In this data set, the average variability of TCWV is 30%, with variability being 10% to 15% in the tropics.
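 The calculation can be sketched as a finite difference followed by the chain rule. The sketch assumes the NLSST form a0 + a1·y11 + a2·xb·(y11 − y12) + a3·S·(y11 − y12), expresses BT derivatives per unit fractional change in humidity, and scales linearly to a +10% change; the function names and all numerical values are hypothetical.

```python
def bt_derivative_wrt_humidity(bt_reference, bt_perturbed, fraction=-0.05):
    """Finite-difference BT change per unit fractional humidity change (K).

    fraction = -0.05 corresponds to the 5% negative perturbation in the text.
    """
    return (bt_perturbed - bt_reference) / fraction


def nlsst_change_for_humidity(dy11, dy12, a1, a2, a3, xb, s, humidity_change=0.10):
    """Change in NLSST (K) for a fractional humidity change, via the chain rule.

    dy11, dy12: BT derivatives per unit fractional humidity change.
    Linear scaling from the derivative to a 10% change is an assumption.
    """
    gain11 = a1 + a2 * xb + a3 * s    # d(NLSST)/d(y11)
    gain12 = -(a2 * xb + a3 * s)      # d(NLSST)/d(y12)
    return humidity_change * (gain11 * dy11 + gain12 * dy12)


if __name__ == "__main__":
    # Made-up BTs: drying the column by 5% warms both channels.
    dy11 = bt_derivative_wrt_humidity(290.0, 290.1)
    dy12 = bt_derivative_wrt_humidity(288.0, 288.125)
    print(nlsst_change_for_humidity(dy11, dy12, 1.0, 0.08, 0.8, 25.0, 0.0))
```

 The two channel terms have opposite signs, so the net sensitivity is a difference between two larger numbers and vanishes only for particular combinations of coefficients and atmospheric state.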
 Thus, overall drying or moistening of the atmospheric column can cause spurious fluctuations in retrieved NLSST wherever significant sensitivity exists. In validation, these fluctuations contribute apparent “noise” in the SST retrieval, increasing the standard deviation between satellite and drifter SSTs. This “noise” has the auto-correlation structure of the water vapour field. Mostly the sensitivity of the NLSST to w is modest. In some regions, positive sensitivity >0.1 K for a 10% increase in w is found. Negative sensitivity is more usual, with a value around −0.5 K for a 10% increase in w in some tropical regions. The highest sensitivities tend to be where water vapour loading is greatest, but sensitivity is not simply proportional to TCWV.
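 The third desirable characteristic proposed is that the sensitivity of the retrieved NLSST, $\hat{x}$, with respect to true SST changes, i.e.,
$$\frac{\partial \hat{x}}{\partial x} = (a_1 + a_2 x_b + a_3 S)\frac{\partial y_{11}}{\partial x} - (a_2 x_b + a_3 S)\frac{\partial y_{12}}{\partial x} \qquad (3)$$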
be very close to 1.0. The mapped estimate for the NLSST is shown as Figure 1c. The sensitivity is mostly less than 1.0, the average of the cells being 0.93. Minimum sensitivity occurs in the areas of high TCWV in the equatorial Atlantic and Pacific Oceans, especially the Tropical Warm Pool, where sensitivity dips below 0.5.
 Where sensitivity to SST is <1.0, SST frontal gradients and diurnal fluctuations will be under-estimated by the NLSST. Users of NLSST interested in quantifying these phenomena should take this into account.
 Sensitivity with respect to true SST being not 1.0 has an implication for the use of NLSSTs as a climate record: it indicates that prior information is embedded within the retrieved SST [Rodgers, 2000]. A fraction $1 - \partial\hat{x}/\partial x$ of the information determining the SST (with $\hat{x}$ the retrieved and $x$ the true SST) is prior information, i.e., is not contributed by the observed BTs. This prior information partly arises from the “guess SST”, $x_b$. Figure 1d shows the map of mean $\partial\hat{x}/\partial x_b$, showing to what degree the explicit prior is influential. Generally, $1 - \partial\hat{x}/\partial x$ is about twice $\partial\hat{x}/\partial x_b$, so a comparable component of the prior information in the retrieval is not explicit. In the tropical regions where the atmosphere is least transmitting, BTs are least responsive to changes in SST and are strongly determined by the temperature of lower tropospheric water vapour; the NLSST retrieval then relies heavily on co-variability of atmospheric and surface temperature. Real variations in SST that are not associated with the climatological correlation between SST and lower tropospheric temperature are, in this situation, not fully reflected in NLSST.
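 The split between explicit and implicit prior information can be sketched as follows. The sketch assumes the NLSST form a0 + a1·y11 + a2·xb·(y11 − y12) + a3·S·(y11 − y12) with the prior SST inside its permitted range, so that the response to the prior is a2·(y11 − y12); treating the implicit part as the remainder is our reading of the text, and all names and numbers are hypothetical.

```python
def nlsst_sst_sensitivity(dy11_dx, dy12_dx, a1, a2, a3, xb, s):
    """d(NLSST)/d(true SST), with the prior SST xb held fixed."""
    return (a1 + a2 * xb + a3 * s) * dy11_dx - (a2 * xb + a3 * s) * dy12_dx


def explicit_prior_sensitivity(y11, y12, a2):
    """d(NLSST)/d(xb): response of the retrieval to the prior ("guess") SST."""
    return a2 * (y11 - y12)


def prior_fractions(sensitivity, explicit):
    """Split the prior contribution 1 - d(NLSST)/dx into explicit and other parts."""
    total_prior = 1.0 - sensitivity
    return {
        "total": total_prior,
        "explicit": explicit,
        "implicit": total_prior - explicit,
    }


if __name__ == "__main__":
    # Made-up humid-atmosphere case: BTs respond weakly to the surface.
    sens = nlsst_sst_sensitivity(0.5, 0.4, 1.0, 0.08, 0.8, 25.0, 0.0)
    expl = explicit_prior_sensitivity(292.0, 290.0, 0.08)
    print(sens, prior_fractions(sens, expl))
```

 In this made-up case the explicit and implicit parts are of similar size, which is the situation described above for the mapped means.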
 These results suggest that NLSSTs do not closely meet the three criteria for SST retrieval proposed in section 2. In the final section of this paper, we discuss the implications of this further. Before that, we briefly digress to establish, as far as possible, that the temperature sensitivity estimates we have formed are credible.
5. Verifying Sensitivity
 It is not possible to verify the SST sensitivity, $\partial\hat{x}/\partial x$, directly against the drifter data available, since over the time and space scales of change in drifter SST the atmospheric state cannot be taken as constant, as necessary to evaluate this partial derivative. However, independently of the drifter observations, we can verify the existence of differential sensitivity to SST between different types of retrieval, and show that the differences are consistent with simulation-based estimates. Some matches are obtained in areas of significant variability in SST. For an ideal SST retrieval, the variability of the retrieved SST is equal to the variability of the true SST over that area. However, if the sensitivity to true SST variations differs from 1.0, the apparent variability will be different in that same proportion. A limitation is that where the true variability of SST is small, radiometric noise will tend to dominate the apparent variability.
 Here we use the BTs for the 21-by-21-pixel box centred on the drifter location, an area over which column atmospheric variability is small if skies are clear. Using this box we can find directly from the data an estimate of the relative sensitivity of two different SST retrieval algorithms. The first is again Pathfinder-style NLSST. The second is a “triple” retrieval (SST3) used operationally at Météo-France [OSI-SAF, 2009], with radiative-transfer-based coefficients that are constant in time. SST3 exploits the 3.7 μm channel in addition to the split window, and is applicable only for night-time observations. Since SST3 weights most heavily the 3.7 μm channel, it gives a good contrast to NLSST.
 It is essential to minimize cloud contamination for this purpose. We select cases with >80% of the box flagged to be clear sky with the highest (“excellent”) level of confidence [OSI-SAF, 2009]. Only “excellent” pixels that are adjacent only to other “excellent” pixels are used. Atmospheric correction averaging [Harris and Saunders, 1996] over a 7 by 7 pixel moving window is applied to minimize radiometric noise in the retrieved SST. For this, the 11 μm channel is used as the reference for NLSST and the 3.7 μm channel is used for SST3, ensuring near-independence in the radiometric error for the two types of retrieval. Since there must be sufficient true SST variability to give a signal, we use only boxes with standard deviation in SST3 exceeding 0.25 K. True SST variability will give highly correlated NLSST and SST3, whereas cloud or aerosol contamination will introduce different relationships, so only SST boxes with a correlation coefficient between NLSST and SST3 > 0.97 are retained. 264 matches meeting these criteria are identified. The SST3 and NLSST estimates across the box for three examples are shown in the auxiliary material (Figure S2): the differences in the apparent contrast across frontal features are plain.
 For each case, we find the slope of the least squares fit of SST3 versus NLSST across the SST box. This slope is an empirical estimate of $(\partial\hat{x}_{\mathrm{SST3}}/\partial x)/(\partial\hat{x}_{\mathrm{NLSST}}/\partial x)$, the ratio of the sensitivity for the two types of SST retrieval. We also calculate this ratio from radiative transfer modelling using equation (3) for the NLSST term, and the equivalent for SST3.
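 The empirical ratio for one box can be sketched as below, combining the slope with the two screening thresholds quoted earlier (standard deviation of SST3 above 0.25 K and correlation above 0.97). This is a minimal sketch using an ordinary least squares fit of SST3 on NLSST; the function names and sample values are hypothetical, and cloud screening and atmospheric correction averaging are assumed to have been applied already.

```python
import math


def box_statistics(nlsst_values, sst3_values):
    """Return (slope of SST3 against NLSST, correlation, std of SST3) for one box."""
    n = len(nlsst_values)
    mean_x = sum(nlsst_values) / n
    mean_y = sum(sst3_values) / n
    sxx = sum((x - mean_x) ** 2 for x in nlsst_values)
    syy = sum((y - mean_y) ** 2 for y in sst3_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(nlsst_values, sst3_values))
    slope = sxy / sxx                        # least squares slope, SST3 vs NLSST
    correlation = sxy / math.sqrt(sxx * syy)
    std_sst3 = math.sqrt(syy / (n - 1))      # sample standard deviation
    return slope, correlation, std_sst3


def empirical_sensitivity_ratio(nlsst_values, sst3_values,
                                min_std=0.25, min_corr=0.97):
    """Slope of SST3 vs NLSST, or None if the box fails the screening."""
    slope, correlation, std_sst3 = box_statistics(nlsst_values, sst3_values)
    if std_sst3 <= min_std or correlation <= min_corr:
        return None
    return slope


if __name__ == "__main__":
    # Made-up frontal box in which SST3 shows 20% more contrast than NLSST.
    nl = [290.0, 290.5, 291.0, 291.5, 292.0]
    s3 = [290.0, 290.6, 291.2, 291.8, 292.4]
    print(empirical_sensitivity_ratio(nl, s3))
```

 Boxes that fail the screening return no estimate, so weak SST signals or poorly correlated retrievals do not contribute to the comparison.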
 The calculated and empirical sensitivity ratios are compared in Figure 2. SST3 usually has a greater sensitivity to true changes in SST than NLSST, and the ratios are greater than 1. SST3 also tends to have lower regional biases and less sensitivity to water vapour (Figure S3). Estimates of derivatives and slopes tend to be noisy, and a ratio of such quantities is noisier still. Nonetheless, the correlation coefficient between the calculated and empirical ratios is 0.61 and is highly statistically significant (p < 0.001). The median proportion of the calculated ratio to empirical ratio is 1.002:1. This demonstrates that the relative sensitivity of different types of retrieval can be usefully estimated by radiative transfer modelling.
6. Discussion and Conclusions
 Linear regression ensures that regression-based SSTs are zero-bias and minimum variance for the domain spanned by the training data. This does not imply zero bias and minimum variance regionally [e.g., Minnett, 1990]. Here, we present a global assessment of annual-mean regional bias relative to drifting buoys for Pathfinder-style NLSSTs. Intrinsic biases at scales of >1000 km are identified.
 In addition to regional bias, we identify two further characteristics of any SST retrieval system that are useful to evaluate: sensitivity to water vapour and sensitivity to true SST. These are more difficult to assess, since they depend on radiative transfer modelling of the response of the retrieval algorithm to changes in column water vapour and surface temperature. However, such analysis will be valuable to some SST users, since there are significant deviations from the ideals of negligible sensitivity to w and unit sensitivity to x. One example could be study of air-sea interaction across fronts associated with the western boundary current of the North Atlantic along the coast of Florida. Figure 1c implies that the magnitude of SST change across such fronts is underestimated by >25% in a typical NLSST image for the region. This could affect the interpretation of air-sea coupling. A second example might be assessing diurnal variability. Infra-red SST retrievals in the tropical Atlantic, for example, under-estimate diurnal changes by more than a third, according to Figure 1c. A third example is that TCWV sensitivity may lead to exaggerated synoptic-scale variability in NLSST in humid regions. Further study is needed to assess these implications in detail.
 Variability in atmospheric humidity can occur in modes more complex than the general moistening or drying analyzed here. For example, drying of particular layers in the atmosphere is also able to induce errors in retrieved SST [e.g., Minnett, 1986]. By adapting the approach taken here, the sensitivity of retrieval to different local modes of variability could be assessed.
 Much SST variability occurs over time-scales comparable to or longer than the five month rolling window used to derive Pathfinder-style retrieval coefficients. There may also be an atmospheric change correlated to large SST anomalies. The degree to which NLSST coefficients successfully adjust to such cases is not captured by the results presented here. In future work, our approach will be adapted to explore such questions.
 SST is deemed an essential climate variable [Global Climate Observing System, 2003], of which Pathfinder offers a long and consistently derived record. Our analysis has identified an implicit dependence on prior information in NLSST, in addition to the explicit dependence on the guess SST. Particularly in tropical regions, NLSST is significantly dependent (up to ∼20% of information) on the climatology implicit in the set of matches from which the coefficients are derived. There has been a radical evolution of drifting-buoy coverage from the 1980s to the present day, and this implicit climatology may not have been consistent through the period. This is relevant to use of Pathfinder SSTs for evaluating climatic trends and variability.
 We have shown that it is useful to augment the traditional database of matched in situ and satellite observations with matched radiative transfer simulations of BTs and of their partial derivatives with respect to SST and atmospheric state. This enables new characteristics of SST retrievals to be assessed and refined. Forward simulations of BTs and their derivatives are also crucial to optimal estimation of SST [Merchant et al., 2008] and may be used for probabilistic cloud detection [Merchant et al., 2005]. While there is a significant effort involved to maintain an operational capacity for forward simulation in real time, there are also clear benefits, as illustrated here for NLSST, to understanding the error properties of the retrieved sea surface temperatures.
 This study was partly undertaken within the framework of the UK Natural Environment Research Council's National Centre for Earth Observation and was partly supported by EUMETSAT's OSI-SAF Visiting Scientist programme.