This study investigates how the quality of sea surface temperature (SST) observations made by drifting buoys (drifters) and ships for 1996–2010 can be improved through retrospective quality control (QC) against a reference field. The observations used are a blend of delayed mode data taken from the International Comprehensive Ocean-Atmosphere Data Set (versions 2.0 and 2.5) and real time data obtained from the Global Telecommunication System. A comparison of drifter and ship measurements on a platform-by-platform basis to high-quality SST estimates from the Along Track Scanning Radiometer (ATSR) Reprocessing for Climate (ARC) project reveals drifter observations are generally of good quality but frequently suffer from gross errors, whilst ship observations are generally of worse quality and show a diverse range of measurement errors. QC procedures are developed which similarly assess drifter and ship SST observations through comparison with the Met Office Operational SST and Sea Ice Analysis (OSTIA). These procedures make use of seasonal background error variance estimates now available for OSTIA. Drifter observations displaying some commonly observed gross errors are flagged and ship callsigns whose observations are deemed unreliable are blacklisted. Validation of the QC outcomes against ARC and Argo demonstrates that this retrospective QC improves the quality of drifter and ship observations, though some limitations are discussed.
 Knowledge of sea surface temperature (SST) is an essential part of climate monitoring and operational forecasting activities. Observations of SST made in situ are used in many applications such as for monitoring Earth's global surface temperature, calibrating and validating satellite SST estimates, providing boundary conditions for atmospheric and oceanic analyses and reanalyses, and assessing the quality of forecast and hindcast SST fields. The treatment of in situ SST observations varies from one application to the next, but in many cases the observations are treated as “truth,” assumed unbiased or assumed free of gross or systematic errors. This is not correct. For example, investigations into the precision and accuracy of in situ SST observations reveal differences between measurement platform types (e.g., drifting buoys, moored buoys, and ships) and from one platform to the next [e.g., Kennedy et al., 2011c]. Also, as part of operational forecasting activities, each month the Met Office (the UK's national weather service) produces blacklists of in situ platforms whose observations are frequently in gross error or excessively biased or noisy; these are then excluded from future forecast assimilation (Met Office Observation Monitoring: http://research.metoffice.gov.uk/research/nwp/observations/monitoring/index.html). In each of these two cases, the quality of in situ SST observations is assessed through comparison to some other source of SST data (satellite data and an SST analysis based on satellite and in situ data, respectively). Such an approach has also been adopted in other studies [e.g., Castro et al., 2012; Xu and Ignatov, 2010; O'Carroll et al., 2008; Kent and Berry, 2008] or operational systems (e.g., NOAA iQuam: http://www.star.nesdis.noaa.gov/sod/sst/iquam or Météo-France marine observation monitoring: http://www.meteo.shom.fr/qctools) that seek to assess the precision, accuracy, or stability of in situ SST observations.
 In this study, a similar approach is adopted to investigate how in situ SST observations from drifting buoys (hereon referred to as drifters) and voluntary observing and research ships (hereon referred to as ships) might be flagged for quality, chiefly through comparison with other sources of SST data. This is done by developing a new set of quality control (QC) procedures based mostly on “tracking” the quality of SST observations throughout the observing history of individual drifters and ships against some other reference SST and flagging observations that are of persistently poor quality. Although this is somewhat similar to the operational blacklisting described above, the tracking QC described here is applied retrospectively to SST observations from 1996 to 2010. Therefore, a more consistent SST reference can be used for assessing observation quality over time and the design of QC checks need not be constrained by their monthly application, allowing better temporal resolution and longer-term instrumental biases or random measurement errors to be considered. The QC procedures that are developed here use the Met Office Operational SST and Sea Ice Analysis (OSTIA) as a reference for assessing in situ SST observations. The 1996–2010 period was chosen to also make use of high-quality satellite SST estimates from the Along Track Scanning Radiometer (ATSR) series of instruments (ATSR1, ATSR2, and AATSR) produced by the ATSR Reprocessing for Climate (ARC) project [Merchant et al., 2012]. Note that although 20 years of ARC data are available, ATSR1 data are not used here as they are less stable than the data from ATSR2 and AATSR. The ARC SST estimates provide a valuable tool for verifying QC flags produced through comparison with OSTIA.
 The aim of this work is to produce a high quality set of drifter and ship observations from 1996 to 2010 for use in applications such as those listed at the start of this section. In producing QC flags through comparison with a reference field, it is extremely important not to systematically reject observations in regions where the reference field is uncertain, as these in turn tend to be regions where in situ observations are useful for validating other estimates. QC checks which make use of a reference field, such as the “tracking” quality checks described here, or those conducted as part of other systems (e.g., OSTIA [Donlon et al., 2011]; National Environmental Satellite, Data and Information Service [Xu and Ignatov, 2010]), are vulnerable to circularity, reducing independence between data sets when seeking to remove error. This is undesirable when constraining uncertainty as it can yield a false sense of agreement between different data sets [e.g., Thorne et al., 2005]. However, such QC checks can be useful and careful iterations of the process should result in improvements to the quality of the set of observations provided suitable care is taken. This study makes use of new seasonal background uncertainty estimates for OSTIA SSTs to try and mitigate against such problems.
 The study is structured as follows: section 'In Situ SST Data' describes the drifter and ship data used in this study, section 'Quality Control Procedures' describes the OSTIA and ARC data used as a reference for the QC checks and the design of the QC procedures developed for the in situ data, section 'Results' describes, assesses, and validates the results of the QC checks, whilst section 'Discussion and Summary' discusses and summarizes the work presented.
2. In Situ SST Data
 To assess the quality of drifter and ship SST observations over the period 1996–2010, data from July 1995 to June 2011 are used. The extra six months of data before and after the 1996–2010 period ensure that all platforms have a sufficiently long record available for tracking their observation quality.
 From July 1995 to December 1997, drifter data are taken from release 2.0 of the International Comprehensive Ocean-Atmosphere Data Set (ICOADS 2.0), which provides delayed mode surface marine data (including SST) from 1784 to 1997 [Worley et al., 2005]. From January 1998 to February 2011, drifter observations from the National Centers for Environmental Prediction Near Real Time (NCEP NRT) monthly product are used, obtained from the National Oceanic and Atmospheric Administration Earth System Research Laboratory (NOAA ESRL). This combined ICOADS 2.0 and NCEP NRT data set is used in the Met Office Hadley Centre gridded SST analysis HadSST2 [Rayner et al., 2006] and is chosen here because it is the data set used by Embury et al. [2012] to generate in situ-ARC SST comparisons, which this study makes use of (section 'ARC'). NCEP NRT data are not available from NOAA ESRL after February 2011 so from March 2011 drifter observations are instead taken from the ICOADS Real Time (ICOADS RT) monthly product [Woodruff et al., 2011]. Both the NCEP NRT and ICOADS RT products are based on the same NCEP Binary Universal Form for the Representation of Meteorological Data (BUFR) product, which NCEP translates from original Global Telecommunication System (GTS) strings. The treatment of NCEP BUFR data does, however, differ for NCEP NRT and ICOADS RT data, with ICOADS RT having a more complete form (ICOADS Release 2.1 Real-Time archive overview: http://icoads.noaa.gov/rt.html). This must be borne in mind where the switch in data source occurs. In section 'Discussion and Summary', drifter QC outcomes for the ICOADS 2.5 data set [Woodruff et al., 2011] are briefly discussed. ICOADS 2.5 provides delayed mode surface marine data from 1662 to 2007 and is extended in near real time using ICOADS RT data. Both ICOADS 2.5 and ICOADS RT data are used in section 'Discussion and Summary' to determine the impact of our QC on delayed mode data for 1996–2010.
 The drifter QC procedures described in section 'Drifter Quality Control Procedures' rely on tracking individual drifters using their World Meteorological Organization Identifier (WMO ID). A difficulty when tracking drifters, however, is the possible reuse of a given WMO ID for a new deployment once the existing drifter reaches the end of its operational lifetime. Whilst guidance from the Data Buoy Cooperation Panel (DBCP; a joint body of the WMO and the Intergovernmental Oceanographic Commission (IOC)) states that at least 3 months should be allowed prior to reuse of a drifter WMO ID (WMO; Rules for allocating WMO numbers: http://www.wmo.int/pages/prog/amp/mmop/wmo-number-rules.html), in practice a shortage of 5-digit WMO IDs in certain ocean regions has meant that the period for reuse was often shorter (David Meldrum, personal communication). The recent introduction of 7-digit WMO IDs has largely circumvented this problem. Histograms of the time gaps between consecutive reports by the same WMO ID (not shown) show evidence of two overlapping populations separated by a minimum at around 90 days. The population beyond the minimum is consistent with the deferred reuse of the WMO ID for a new deployment, as stated in DBCP guidelines. Unfortunately, it is not possible to fully separate these overlapping populations, so, where a period of more than 90 days occurs between consecutive drifter SST observations (that pass initial date and time base QC checks, see below) with the same WMO ID, this is treated as a new drifter deployment.
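The 90 day rule above can be sketched as follows. This is a minimal illustration, not the original processing code (function and variable names are assumptions), splitting one WMO ID's time-sorted reports into separate deployments:

```python
from datetime import datetime, timedelta

# Maximum gap between consecutive reports before the WMO ID is assumed
# to have been reused for a new deployment (per the 90 day minimum
# separating the two populations in the gap histograms).
REUSE_GAP = timedelta(days=90)

def split_deployments(times):
    """Given a time-sorted list of observation datetimes sharing one
    WMO ID, return a list of deployments (each a list of datetimes)."""
    deployments = []
    current = []
    for t in times:
        if current and (t - current[-1]) > REUSE_GAP:
            deployments.append(current)  # gap > 90 days: treat as ID reuse
            current = []
        current.append(t)
    if current:
        deployments.append(current)
    return deployments
```

Each resulting deployment would then be tracked and quality controlled as an independent drifter record.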
 From July 1995 to November 2007, ship data are taken from the same source as drifter data (ICOADS 2.0 and NCEP NRT). From December 2007, because of security concerns, all ship observations in the NCEP NRT and ICOADS RT product are masked using a generic callsign. Because the ship QC procedures described in section 'Ship Quality Control Procedures' rely on tracking individual ships using their unique callsigns, an unmasked set of ship observations is required from December 2007 onward. For climate monitoring purposes, the Met Office Hadley Centre produces a version of ICOADS RT data where this generic callsign has been replaced where possible, so from December 2007 ship data are instead taken from this source. It should be noted that whilst this unmasking procedure replaces the generic masked callsigns found in ICOADS RT data with ship callsigns from the Met Office Hadley Centre's marine observation database, these themselves are often nongeneric masked callsigns, which first came into use in 2006 (as per WMO EC-LIX Resolution 27). Whilst these nongeneric masked callsigns are unique to individual ships and so facilitate tracking, when a ship begins using a nongeneric masked callsign in place of its original unique callsign, these will be treated as two separate ship records. It is also noted that some ship data in this data set still possess generic masked callsigns such as “SHIP” or “MASK” and so these are not assessed by the ship QC procedures. In general, less than 5% of observations in the data set used possess generic masked callsigns. This rises to around 10% from December 2007 onward because in some cases it has not been possible to replace the generic masked callsign applied to all NCEP NRT and ICOADS RT data. Section 'Discussion and Summary' briefly discusses ship QC outcomes for the ICOADS 2.5 data set for 1996–2010. Here ICOADS 2.5 ship data are used up to November 2007, and unmasked ICOADS RT data are used from December 2007 onward.
 At present, Met Office Hadley Centre processing classifies any observation not from a drifter or moored buoy as a ship. This results in observations from the NOAA National Data Buoy Center Coastal-Marine Automated Network (C-MAN) being treated as ship data. To mitigate against this, a list of “ship-callsigns-that-are-not-ships” is maintained (Met Office Hadley Centre HadSST3 data set: www.metoffice.gov.uk/hadobs/hadsst3), which is used here to help ensure that only ship data are run through the ship QC procedures (∼5–20% of “ship” observations are removed by this step). From December 2007 onward, only ship observations are extracted for unmasking and this is no longer a problem.
 SST observations from drifters, ships, and other platforms that are received by the Met Office Hadley Centre are subjected to a variety of automated QC checks (described by Rayner et al. [2006]), referred to here as base QC. These are a mix of basic checks of observation form (checking that observations possess a meaningful date, time, and location, that an SST value exists, and that it is above the freezing point of seawater), a positional track check (to identify observations that may be mislocated) and gross error checks (SST observations are assessed against climatology and “buddy checked” against surrounding observations to check an observation is plausible). Met Office Hadley Centre SST data sets such as HadSST3 [Kennedy et al., 2011a, 2011b] also use robust statistics to improve their resistance to poor quality observations that pass these checks. The new additional platform-by-platform QC procedures described here are applied to SST observations that have already passed the base QC checks. Note that the SST base QC checks described above work on an observation-by-observation basis and differ from, but are complementary to, those described in section 'Quality Control Procedures', which rely on “tracking” observations against a reference SST.
3. Quality Control Procedures
3.1. SST Reference
3.1.1. OSTIA
 To assess the quality of drifter and ship SST observations, they are compared to a reference SST product. The reference chosen is the Met Office OSTIA product, which provides globally complete, daily estimates of foundation SST on a 1/20° (∼6 km) grid based on in situ and satellite observations [Donlon et al., 2011; Roberts-Jones et al., 2012]. The foundation SST is the temperature free of diurnal temperature variability; this variability typically occurs in the upper meters of the water column (see, for example, https://www.ghrsst.org/ghrsst-science/sst-definitions/). OSTIA only assimilates SST observations that are not contaminated by diurnal warming and so can be considered to represent the foundation SST; these are nighttime observations and observations made in the daytime during windy conditions when the water column becomes well mixed. OSTIA includes ATSR SST retrievals, measurements made in situ, and retrievals from other satellite instruments (bias corrected against ATSR and in situ observations). Infrared satellite sensors such as ATSR measure the radiation emitted from the upper tens of micrometers (the skin) of the ocean surface. The type of SST observation obtained will depend upon the method used in the satellite SST retrieval. The majority of satellite observations used in the OSTIA system use retrievals obtained via regression to drifters, which provides SST observations at the depths typically observed by drifters (∼0.2 m). By contrast, the ATSR data use a physical retrieval methodology to provide observations of the skin SST, so prior to assimilation into OSTIA a bias adjustment is made to account for the relative coolness of the skin temperature. The interested reader is referred to Donlon et al. [2011] and Roberts-Jones et al. [2012] for further information.
Validation of the OSTIA SST analysis using assimilated in situ observation-minus-background statistics shows that the global root-mean-square difference is approximately 0.50 K by 2007 [Roberts-Jones et al., 2012]. As can be seen, for example, in Figures 3 and 5, this magnitude of uncertainty is suitably small for detecting gross errors in the measurements from individual in situ platforms.
 OSTIA is available as a reanalysis from 1985 to 2007 [Roberts-Jones et al., 2012] for use in climate studies, and as an operational analysis from 2006 to present [Donlon et al., 2011] for use in short-range forecasting. The reanalysis product uses ICOADS 2.1 ship and buoy observations (updated from 1998 onward by the Met Office Hadley Centre) and ATSR and Advanced Very High Resolution Radiometer (AVHRR) satellite records (reprocessed for homogeneity by the ESA/NEODC and NOAA/NASA Pathfinder projects, respectively; see Roberts-Jones et al. [2012]) to produce a stable SST product. To improve observational coverage, the operational analysis product used (during the period of study) also incorporates satellite data from several other infrared (MetOp AVHRR, AVHRR North Atlantic Area, geostationary Meteosat Second Generation Spinning Enhanced Visible and Infrared Imager) and microwave (Aqua Advanced Microwave Scanning Radiometer for the Earth Observing System and Tropical Rainfall Measuring Mission Microwave Imager) instruments. Both the reanalysis and the operational analysis are needed here to assess the quality of drifter and ship observations from 1996 to 2010. The switch between the two products is not found to be a problem for this study.
 A matchup database (MD) is created between drifter and ship observations and OSTIA from July 1995 to June 2011. Where available, the OSTIA reanalysis is preferred. In situ observations are matched to OSTIA SSTs from the nearest (nonland) grid point and from the previous day. When using the operational analysis, matching with the previous day's OSTIA field ensures the in situ observations have not been assimilated into the reference OSTIA field and that the two are independent. However, strict independence cannot always be assumed because some in situ measurement errors are persistent in space and time (e.g., a biased slowly drifting buoy). In the production of the reanalysis, a 3 day window centered on midday of the day of interest was used for assimilation of observations (albeit with greater weighting given to observations nearer the center of the window). Our reanalysis matchups are not, therefore, independent of the drifter and ship observations they are matched with. Validation statistics calculated for OSTIA using observation-minus-background fields have been found to compare well with equivalent observation-minus-analysis validation statistics calculated using independent Argo float observations [Roberts-Jones et al., 2012]. It is, therefore, reasonable to assume that OSTIA fields are largely independent of the following day's observations. Whilst this is true for global statistics, this may not be the case locally where OSTIA may become more dependent on in situ observations (e.g., in cloudy conditions), hence in section 'Results', QC flags derived using OSTIA are validated against other sources of SST.
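The matchup step above can be sketched minimally as follows, under assumed grid conventions: the `ostia_fields` dict, grid origin, and spacing are illustrative stand-ins for the real OSTIA product, with NaN marking land points.

```python
import numpy as np
from datetime import date, timedelta

# Sketch: pair each in situ observation with the SST from the nearest
# (nonland) grid point of the PREVIOUS day's OSTIA field, so that the
# observation has not been assimilated into the reference it is
# compared against. Grid layout defaults assume a regular 1/20 degree
# global grid; both are assumptions for illustration.

def nearest_index(value, start, step, n):
    """Index of the nearest grid point, clipped to the array bounds."""
    return int(np.clip(round((value - start) / step), 0, n - 1))

def match_to_previous_day(obs_lat, obs_lon, obs_date, ostia_fields,
                          lat0=-90.0, lon0=-180.0, step=0.05):
    """Return the previous day's OSTIA SST at the nearest ocean grid
    point, or None if no field or only land is available."""
    field = ostia_fields.get(obs_date - timedelta(days=1))
    if field is None:
        return None  # no background field available for that day
    i = nearest_index(obs_lat, lat0, step, field.shape[0])
    j = nearest_index(obs_lon, lon0, step, field.shape[1])
    sst = field[i, j]
    return None if np.isnan(sst) else float(sst)  # NaN marks land
```

A fuller implementation would also search neighboring grid points when the nearest one is land, and record the matched background error variance alongside the SST.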
 Although OSTIA is used here as a reference for assessing the quality of drifter and ship SST observations, the OSTIA SSTs are themselves also uncertain. When comparing in situ observations to OSTIA, it is important to allow for this uncertainty to avoid mistakenly attributing errors in OSTIA to the observations themselves. Because in situ observations are being compared to the OSTIA field from the previous day, it is the background uncertainty that is needed. The best estimate of OSTIA background uncertainty presently available is a set of four climatological seasonal fields (DJF, MAM, JJA, and SON). These were produced by matching the OSTIA reanalysis with AATSR SSTs from 2002 to 2007 (with adjustments applied to AATSR to account for the relatively cool skin and observations contaminated by diurnal effects omitted, as described above), and using the Hollingsworth and Lonnberg method to parameterize the background error covariances into an associated error variance and error correlation length scale [Roberts-Jones et al., 2013]. These error variances were calculated on a 1° grid and underwent smoothing (Figure 1). Only the background error variance is of interest to this study, which provides the best available estimate of the uncertainty associated with comparing an in situ SST observation to the SST from the OSTIA analysis on the previous day. For each matchup, the background error variance from the nearest 1° grid point of the appropriate seasonal field is, therefore, also used by the QC procedures (sections 'Drifter Quality Control Procedures' and 'Ship Quality Control Procedures'). Note that following Donlon et al. [2011], the representativity error (the error associated with comparing a point in situ observation with an OSTIA grid box average) is assumed negligible relative to the observational measurement errors being considered.
3.1.2. ARC
 In addition to OSTIA, drifter and ship SST measurements are compared to SST estimates from ARC. ARC provides a climate quality, high spatial resolution, stable record of SST, which is independent of in situ observations [Merchant et al., 2012] and is useful for assessing our QC flags produced using OSTIA. This study makes use of an existing MD, where in situ SST observations from ICOADS 2.0 and NCEP NRT (see section 'In Situ SST Data') have been matched with ARC SSTs up to and including 2009 [Embury et al., 2012]. The ARC MD makes use of skin to subskin, subskin to depth, and time adjustments to reduce the systematic discrepancy between ATSR skin SSTs and SSTs measured in situ in the near surface, associated with diurnal temperature variability. This results in discrepancies with robust standard deviations of order 0.2 K between buoy and ARC SSTs [Embury et al., 2012], which is suitably small for detecting gross errors in individual in situ platforms. Unmasked ship callsigns for measurements from December 2007 were not available for NCEP NRT when the ARC MD was created, so from then on all ship callsigns in the MD are masked (see section 'In Situ SST Data'). Only matches with ATSR2 and AATSR are used here (available from July 1995 to December 2009). For further information about the ARC MD and its use as a reference, see Appendix The ARC MD.
 The main disadvantage of using the ARC MD to assess the quality of drifter and ship SST observations is the sparse spatiotemporal coverage of ARC SST estimates, owing to the narrow swath width of the [A]ATSR instruments [Merchant et al., 2012]. An investigation of the proportion of in situ platforms' lifetimes that are sampled by ARC reveals that only ∼15% of ∼15,500 drifter records and ∼3% of ∼41,000 ship records are fully sampled by ARC for the period 1996–2010. Here “fully sampled” refers to platforms that have at least one ARC matchup for each month they observe. The records that are fully sampled tend to be short records with relatively few observations and only ∼3% of drifters (∼400 records) are found to be fully sampled over their lifetime by ARC and to have >40 matchups, whilst ∼0% of ships (∼50 records) are found to be fully sampled over their lifetime by ARC and to have >10 matchups. This approach for estimating how well a platform's lifetime is sampled by ARC is not ideal as the probability of getting an ARC matchup in some month will vary from month-to-month and from platform-to-platform such that it is not always reasonable to expect ARC to sample every month of a platform's lifetime. However, it does illustrate that the ARC MD is of limited use for the QC of the drifter and ship population. Nonetheless, QC procedures and flags are developed for the subset of ∼400 drifters that are fully sampled by ARC and possess >40 matchups, as the high quality of ARC SSTs and their independence from drifter observations renders this a useful subset against which OSTIA QC can be tuned and verified. Furthermore, the ARC MD is very useful for assessing the performance of OSTIA as a reference field (section 'Drifter Quality Control Flags') and for initial assessments of the types of errors that exist in the in situ data.
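The "fully sampled" criterion above can be illustrated with a small sketch; the month-tuple representation is an assumption for illustration, not the format of the actual matchup database:

```python
# Sketch: a platform counts as "fully sampled" by ARC if every calendar
# month in which it reports at least one observation also has at least
# one ARC matchup. Months are represented here as (year, month) tuples.

def fully_sampled(obs_months, arc_matchup_months):
    """True if every month with an observation also has an ARC matchup."""
    return set(obs_months) <= set(arc_matchup_months)
```

Applying this test per platform, together with a minimum matchup count (>40 for drifters, >10 for ships), yields the small subsets quoted in the text.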
3.2. Drifter Quality Control Procedures
 The drifter QC procedures described below work by “tracking” an individual drifter using its unique WMO ID (see section 'In Situ SST Data') and assessing the discrepancy between its SST record and matched OSTIA SSTs and uncertainty estimates. An initial visual investigation of drifter-ARC matchups revealed that, in general, the drifter and ARC SSTs are in reasonable agreement. However, clear gross errors in the drifter data do exist. Whilst these drifter errors can take a variety of forms, some of the most common appear to be poor quality data at the start and end of a drifter record (hereon referred to as “tails”) and persistently biased or noisy observations. The QC procedures developed here for drifters are, therefore, designed to identify and flag observations showing these types of gross error.
 When assessing the discrepancy between drifter and OSTIA SSTs, only nighttime drifter observations are used (nighttime is defined here as when the solar zenith angle exceeds 92.5°). This is because OSTIA is a foundation SST product which excludes the effects of diurnal warming (section 'OSTIA'). This is not necessary for the ARC MD, which includes skin to depth SST adjustments that account for diurnal warming (section 'ARC' and Appendix The ARC MD).
 In addition to QC checks based on SST observations, a further two gross error checks are also developed based on an individual drifter's movements. Drifters commonly cease transmitting because they have run aground or been picked up [Lumpkin et al., 2012], events often associated with a period of poor quality observations. These additional checks are designed to help identify spurious drifter movements so that any associated poor quality SST observations can be flagged. Because these checks do not rely on drifter-OSTIA discrepancies, both daytime and nighttime observations are used.
 The following subsections describe each drifter QC check in detail. A flow chart illustrating the drifter (and ship) QC procedures is shown in Figure 2; this includes the pre-QC data preparation steps described in sections 'In Situ SST Data' and 'OSTIA'. Examples of records triggering the flags produced by the drifter QC checks are shown in Figures 3 and 4.
3.2.1. Tail Check
 The tail check aims to identify and flag observations that are biased or noisy, occurring at the start or end of a drifter record. It has two steps: the first step detects longer-lived tails at the start or end of a drifter record; the second is similar but detects shorter-lived tails (the Long and Short Tail QC checks in Figure 2). Both steps are based on a moving window approach. To detect a start tail, an n-observation-wide window is moved forward (in time) over a record and the observations within each window are checked. When a window passes the check, this is deemed the end of the start tail and the observations within it are no longer flagged. To detect an end tail, a similar approach is taken, with the window moving backward over a record.
 For the first step, a 120-observation-wide window is used (Figure 2; Long Tail QC check). For each set of windowed observations, the drifter-OSTIA SST discrepancies and the associated OSTIA background error uncertainty estimates are used to detect observations that are biased or noisy. To detect biased observations: (i) the mean SST discrepancy is calculated; (ii) the OSTIA background error variances are square rooted to obtain standard deviations of the background error (σ_n) and their mean is calculated,

(1/N) Σ_{n=1}^{N} σ_n,  (1)

(iii) windowed observations are deemed to be biased if the magnitude of the mean SST discrepancy exceeds three times the mean of the background error standard deviations (equation (1)). To detect noisy observations: (i) the standard deviation of the SST discrepancy is calculated; (ii) the mean of the OSTIA background error variances (σ_n²) is calculated and square rooted to give the mean standard deviation of the background error,

√((1/N) Σ_{n=1}^{N} σ_n²),  (2)

(iii) windowed observations are deemed to be noisy if the standard deviation of the SST discrepancy exceeds three times this mean standard deviation of the background error (equation (2)).
 At first glance, the designs of the bias and noise checks described above appear similar; however, they are in fact subtly different because of an imperfect knowledge of the uncertainties in the OSTIA SSTs. Ideally, the background error variance and spatial and temporal covariance for each day's OSTIA SST field would be known. The tail check could then properly allow for the uncertainty introduced by OSTIA when calculating the bias and noise for drifter observations. However, because at present the climatological seasonal background uncertainty estimates are the best uncertainty fields available for OSTIA (section 'OSTIA'), the daily correlation and magnitude of the error in the OSTIA fields is not precisely known. When determining whether or not a set of drifter observations is biased, the check above makes the assumption that the background error is entirely correlated (the mean of the background error standard deviations is used, equation (1)) so as to allow for the maximum possible bias that could be introduced by background error correlation. When determining whether a set of drifter observations is noisy, the check above instead assumes the background error is entirely uncorrelated (the square root of the mean error variance is used, equation (2)), so as to allow for the maximum possible noise that could be introduced by background error. In regions with strong SST gradients such as the Gulf Stream, the uncertainty in OSTIA is enhanced where OSTIA does not fully resolve fine scale ocean features. This uncertainty will vary in time, dependent on the exact position of such features. Because of the climatological nature of the background error variances used, at some moment in time the background error variance may be underestimated, so the noise check uses three times the seasonal background error standard deviation as a threshold for flagging observations.
The bias check, however, treats the seasonal background error variance as a reasonable estimate for testing whether observations are biased relative to OSTIA. This “tuning” of the tail check parameters is statistically undesirable but is required to maximize the number of gross errors detected in regions where OSTIA is less uncertain, whilst minimizing the number of false rejections in regions where OSTIA is more uncertain.
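The first tail-check step can be sketched as follows, assuming arrays of drifter-minus-OSTIA discrepancies and matched seasonal background error variances. The function names, and showing only the forward (start tail) direction, are illustrative simplifications:

```python
import numpy as np

# Sketch of the long tail check: a 120-observation window moves forward
# over a drifter record. A window fails if the mean discrepancy exceeds
# 3x the mean background error standard deviation (bias check, fully
# correlated error assumption, equation (1)) or if the discrepancy
# standard deviation exceeds 3x the square root of the mean background
# error variance (noise check, uncorrelated assumption, equation (2)).

WINDOW, K = 120, 3.0

def window_fails(discrepancy, bg_var):
    """Bias/noise test on one window of discrepancies and matched
    background error variances (1-D arrays of equal length)."""
    biased = abs(np.mean(discrepancy)) > K * np.mean(np.sqrt(bg_var))
    noisy = np.std(discrepancy) > K * np.sqrt(np.mean(bg_var))
    return biased or noisy

def start_tail_length(discrepancy, bg_var):
    """Number of leading observations flagged as a start tail: the
    window advances until it first passes both checks."""
    n = len(discrepancy)
    for i in range(0, n - WINDOW + 1):
        if not window_fails(discrepancy[i:i + WINDOW], bg_var[i:i + WINDOW]):
            return i  # window passes: end of the start tail
    return n  # no window passes: whole record flagged
```

The end tail check would apply the same test with the window moving backward from the end of the record.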
 Following the removal of observations failing the first step, for the second step a 30-observation-wide window is used to detect any remaining gross start or end tail errors (Figure 2; Short Tail QC check). For each set of windowed observations, if one or more drifter-OSTIA SST discrepancies exceed nine times the background error standard deviation for that matchup, the window fails QC. This gross error detection is less sensitive than that of the first step because it is effectively trying to detect errors in individual observations. However, the advantage of this step is that a narrow window can be used, so that short-lived tails at the start and end of a drifter record can be detected.
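The window test for this second step is simple enough to sketch directly (names are illustrative); a window fails if it contains any single gross outlier relative to its matched background error standard deviation:

```python
import numpy as np

# Sketch of the short tail check window test: with a 30-observation
# window, the window fails QC if any one drifter-minus-OSTIA
# discrepancy exceeds nine times the background error standard
# deviation for that matchup.

def short_window_fails(discrepancy, bg_var, k=9.0):
    """True if any discrepancy in the window is a gross outlier
    relative to its own background error standard deviation."""
    return bool(np.any(np.abs(discrepancy) > k * np.sqrt(bg_var)))
```

As in the first step, this window would be moved forward from the start and backward from the end of the record, flagging observations until the first passing window.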
3.2.2. Bias and Noise Checks
 The bias and noise checks aim to identify and flag drifters that are persistently biased or noisy for the portion of their record which passes the tail checks (Figure 2; Biased and Noisy Record QC checks). We apply these checks only to drifter records with at least 30 observations remaining following the tail checks; otherwise, the “short” record check is used instead (section 'Short Drifter Record Checks').
 For each drifter record, the mean and standard deviation of the drifter-OSTIA SST discrepancies are first calculated. As for the tail checks, because the correlation and magnitude of the OSTIA background error are not precisely known, some assumptions about the uncertainty in OSTIA SSTs are necessary. For the bias and noise checks, when considering an entire drifter record, the background error is assumed largely uncorrelated and the magnitude of the seasonal error variances is considered a reasonable estimate of the uncertainty in OSTIA. For the bias check, the assumption of no error correlation means that we believe OSTIA introduces relatively little bias uncertainty. This assumption seems reasonable as, for whole drifter records, mean OSTIA-drifter discrepancies are typically found to differ from mean ARC-drifter discrepancies by no more than ±0.2°C. For the noise check, the mean of the OSTIA background error variances (σ²n) is removed from the variance of the OSTIA-drifter discrepancies (σ²ob-ref) and the result square rooted to give an estimate of the standard deviation of the drifter SST uncertainty (or random measurement error, σob-error):

σob-error = √(σ²ob-ref − σ²n).   (3)
 Once estimates of the biases and random measurement errors have been obtained for all drifters from 1996 to 2010, we choose flagging thresholds of ±1.5°C and 1.5°C for the bias and noise checks, respectively. Beyond these thresholds, drifter bias and noise are deemed outlying (and in gross error) relative to their main distributions. In these cases, the whole drifter record is flagged.
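The record-level bias and noise estimates and thresholds described above might be sketched as follows. This is a simplified rendering: the guard against a negative variance difference is our addition, and the background error variances are assumed matched per observation:

```python
import numpy as np

def record_bias_and_noise(discrepancy, bg_err_var):
    """Whole-record statistics: bias is the mean drifter-OSTIA
    discrepancy (the background is assumed to add little bias);
    noise removes the mean background variance from the discrepancy
    variance and square roots the result."""
    d = np.asarray(discrepancy, dtype=float)
    bias = d.mean()
    excess = d.var() - np.mean(bg_err_var)
    noise = np.sqrt(max(excess, 0.0))  # guard: clip negative estimates to zero
    return bias, noise

def flag_record(bias, noise, bias_thresh=1.5, noise_thresh=1.5):
    """Flag the whole record if |bias| or noise exceeds 1.5 degC."""
    return abs(bias) > bias_thresh or noise > noise_thresh
```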
3.2.3. Short Drifter Record Checks
 For drifter records with fewer than 30 observations (<1% of all drifters, Figure 2; Short Record QC check), the record is flagged if >10% of its OSTIA-drifter SST discrepancies exceed nine times the local background error uncertainty. This is similar in concept to the second step of the tail check.
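This short record check reduces to a simple fraction test, sketched below under the same assumption of matched per-observation background error standard deviations (the function name is illustrative):

```python
import numpy as np

def short_record_fails(discrepancy, bg_err_std, factor=9.0, frac=0.10):
    """Short (<30 observations) record check: flag the record if more
    than 10% of OSTIA-drifter discrepancies exceed factor * local
    background error standard deviation."""
    d = np.abs(np.asarray(discrepancy, dtype=float))
    gross = d > factor * np.asarray(bg_err_std, dtype=float)
    return gross.mean() > frac
```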
3.2.4. Movement Checks
 The movement checks aim to identify poor quality SST observations associated with drifters that have run aground or have been picked up (Figure 2; Aground and Picked-Up QC checks).
 For the “aground” check, a 21 day window is moved over a drifter record. If a drifter does not move for the duration of the window it is deemed to have run aground and the remainder of the record is flagged. A window is only evaluated if there is at least one drifter observation per day to ensure that the motion of a drifter is properly assessed.
 For the check for drifters that have been “picked up,” a 7 day window is moved over a drifter record. Where mean drifter velocity exceeds 2 m s−1 for the duration of the window, observations within that window are flagged. A sustained speed of 2 m s−1 is used by the NOAA Global Drifter Program Data Assembly Center (NOAA GDP DAC) to identify suspicious drifter movement [Lumpkin et al., 2012] and is an order of magnitude (or more) larger than typical surface ocean velocities (though maximum velocities in currents such as the Gulf Stream can be comparable). A window is only evaluated if observations cover a period of at least five days, so that velocity can be evaluated over a reasonable period. Note that the remainder of a drifter record is not flagged following a window failure, as cases are found where drifters are returned to the ocean following a period apparently aboard a vessel (e.g., Figure 4, bottom row). Prior to the velocity check, an initial despiking of a drifter record is carried out, whereby observations are omitted from QC when a velocity >20 m s−1 is found between consecutive observations. Such velocities are frequently associated with spurious position reports.
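The despiking step, and the great-circle distance underlying the velocity checks, can be illustrated as below. This is a simplified sketch (function names ours) that omits the windowed 21 day aground and 7 day picked-up logic and assumes positions in degrees and times in seconds:

```python
import numpy as np

EARTH_R = 6.371e6  # mean Earth radius, metres

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_R * np.arcsin(np.sqrt(a))

def despike(times_s, lats, lons, vmax=20.0):
    """Return indices of observations kept after despiking: a report
    implying >20 m/s from the last kept report is omitted, as such
    speeds are frequently associated with spurious positions."""
    keep = [0]
    for i in range(1, len(times_s)):
        j = keep[-1]
        dt = times_s[i] - times_s[j]
        if dt > 0 and haversine(lats[j], lons[j], lats[i], lons[i]) / dt <= vmax:
            keep.append(i)
    return keep
```

In the picked-up check proper, mean velocity over each 7 day window (rather than between consecutive reports) would then be compared against the 2 m s−1 threshold.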
3.3. Ship Quality Control Procedures
 The ship QC procedures work by “tracking” an individual ship using its unique callsign (see section 'In Situ SST Data') and assessing the discrepancy between its SST record and matched OSTIA SSTs and uncertainty estimates (see Figure 2 for a flow chart illustrating the ship QC procedures). An initial visual investigation of ship-ARC matchups revealed that, in general, the ship and ARC SSTs are in worse agreement than drifter and ARC SSTs, consistent with other studies which show greater measurement errors in ship data [e.g., Kennedy et al., 2011c; Xu and Ignatov, 2010; Kent and Berry, 2008; Emery et al., 2001; see also Figure 12]. Furthermore, whilst drifter SST observations tend to be mostly of good quality with some common gross errors, the quality of observations made by ships appears highly variable, with measurement errors in ship data both frequent and quite heterogeneous in nature. The variable quality of ship data is not surprising given the range of sensors, data collection practices, and human impacts on ship data, and has been discussed in other studies [e.g., Kent and Challenor, 2006; Kent and Taylor, 2006; Brasnett, 2008; Kent and Berry, 2008]. Whilst it may be possible to devise a suite of QC checks to clean up the ship data, here a blacklist approach was taken. Hence, if a ship displays a period of poor quality observations, this is taken as evidence that the observations from that ship are generally of low quality and not to be trusted, and the ship record is flagged in its entirety. To some extent, this approach is wasteful in that many good observations are potentially flagged (see section 'Discussion and Summary'), for example, if ships reporting poor quality observations subsequently have their instruments recalibrated or observing practices improved. Nevertheless, it does ensure that only good quality ship observations are retained. Examples of records assessed by the ship QC check can be seen in Figure 5.
 When assessing the discrepancy between ship and OSTIA SSTs, both daytime and nighttime observations are used because ∼56% more ships can then be assessed by QC as “long” records (>120 observations; 8000 versus 5135 if only nighttime observations were used). The predominance of ship engine-intake and hull sensor SST observations over this period [Kennedy et al., 2011b], which typically measure water temperature at depths of several meters [Kent and Taylor, 2006], should help minimize diurnal effects for the majority of ships. For drifters, the use of daytime and nighttime data was found to introduce a warm bias of order 0.05°C in drifter-OSTIA SST discrepancies relative to the nighttime-only case. For ships, the use of daytime and nighttime observations also appears to introduce a similar bias, of order 0.06°C relative to nighttime observations only. To test whether such a bias substantially affects the ship QC flags, a version of the QC described below was run in which 0.06°C was removed from all observations prior to blacklisting. No major changes were seen in the QC flags (in general, 0–2% fewer observations were flagged each month when the bias was removed). For the ship blacklist QC described below, trimmed means and variances are used when calculating window properties to help mitigate the effect of outliers or large diurnal effects (the three smallest and largest values are removed from the window data prior to calculation).
 The ship blacklist QC works as follows. For each ship record, a 120-observation-wide window is moved over the data and the ship record is blacklisted if the observations within any window fail QC. The QC checks applied to each window are designed to test whether the ship observations are biased or unacceptably noisy. As for the drifter checks, OSTIA will introduce some uncertainty into the estimation of ship SST properties and our knowledge of OSTIA uncertainties is imperfect. To detect biased ship observations: (i) the trimmed mean SST discrepancy is calculated; (ii) the OSTIA background error variances are square rooted to obtain standard deviations of the background error and their mean is calculated (equation (1)); and (iii) windowed ship observations are deemed biased if their mean SST discrepancy exceeds three times the mean of the background error standard deviations. For the noise check, the mean of the OSTIA background error variances is removed from the trimmed variance of the OSTIA-ship discrepancy and the result square rooted to give an estimate of the standard deviation of the ship SST error (or random measurement error, equation (3)). Where random measurement error exceeds 1.5°C, ship observations are deemed unacceptably noisy. The 1.5°C threshold is a compromise between ensuring a reasonable quality of ship observations and not reducing the number of ship observations too severely. As for drifters, the correlation of the background error is not known so a conservative approach is taken whereby this is assumed entirely correlated for the bias check, and entirely uncorrelated for the noise check (section 'Tail Check'). However, unlike in the drifter tail checks, uncertainty in the magnitude of the error variance, which tends to be most prominent in regions with strong SST gradients (e.g., Gulf Stream, Kuroshio Current), is not taken into account in the ship QC (i.e., one standard deviation of the background error is used). 
This is because the majority of the ocean surface does not possess such strong gradients and because ships move relatively quickly and sample relatively infrequently. It is, therefore, unlikely for any set of windowed ship data that many observations will be influenced by such features (unlike drifters which advect slowly with the ocean currents and sample more frequently).
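The per-window blacklist test can be sketched as follows, assuming a 120-observation window of ship-OSTIA discrepancies with matched background error variances (function names ours; the trimming convention follows the three-smallest/three-largest rule described above):

```python
import numpy as np

def trimmed(values, n_trim=3):
    """Drop the three smallest and three largest values before
    computing window statistics (mitigates outliers and diurnal effects)."""
    v = np.sort(np.asarray(values, dtype=float))
    return v[n_trim:len(v) - n_trim]

def ship_window_fails(discrepancy, bg_err_var, noise_thresh=1.5):
    """One 120-observation window: biased if the |trimmed mean
    discrepancy| exceeds 3 * mean background error standard deviation;
    noisy if sqrt(trimmed variance - mean background variance)
    exceeds 1.5 degC."""
    d = trimmed(discrepancy)
    var = np.asarray(bg_err_var, dtype=float)
    bias_allow = 3.0 * np.mean(np.sqrt(var))  # fully correlated assumption
    biased = abs(d.mean()) > bias_allow
    noise = np.sqrt(max(d.var() - var.mean(), 0.0))  # uncorrelated assumption
    return biased or noise > noise_thresh
```

A ship record would be blacklisted in its entirety if any such window fails.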
 As a large number of ships are found to have made relatively few observations (∼81% of distinct ship records comprise <120 observations, but represent just ∼2% of all ship observations) a QC check for “short” ship records is also required (Figure 2). This is similar to the short drifter record checks described in section 'Short Drifter Record Checks', whereby a ship record is blacklisted if >10% of its OSTIA-ship SST discrepancies exceed nine times the local background error uncertainty.
4.1. Drifter Quality Control Flags
4.1.1. A Comparison of OSTIA and ARC QC Flags
 As discussed in section 'ARC', ARC provides a high-quality reference SST against which OSTIA QC flags and the performance of OSTIA as a reference field can be assessed. For the ∼400 drifters that are fully sampled by ARC (see section 'ARC'), QC procedures were developed that test for the same gross errors as those described in sections 'Tail Check' and 'Bias and Noise Checks' for OSTIA. The ARC-referenced QC procedures (hereon referred to as ARC QC) are not described in detail here but are necessarily different from the OSTIA-referenced QC procedures (hereon referred to as OSTIA QC) because of the relatively small number of matchups achieved by ARC throughout a drifter record and the absence of ARC uncertainty estimates in the ARC MD used. During the development of the OSTIA QC procedures, ARC QC and OSTIA QC flags were intercompared for the subset of drifters fully sampled by ARC, to help tune the OSTIA QC parameters (note that ARC is used differently in section 'Drifter QC Validation Against ARC and Argo'; there OSTIA QC flags for all ∼15,500 drifter records are applied to the ARC MD to assess the impact they have on the matchup statistics). As expected, some differences persist between the OSTIA QC and ARC QC flags because of the differing design of the QC checks, although other differences are apparent. A major advantage of the OSTIA QC is that its superior matchup frequency allows short-lived features such as tails at the end of the data to be more readily detected than for ARC. A disadvantage of the OSTIA QC is that it is less able to detect observational errors that are small in magnitude because of the greater magnitudes of drifter-OSTIA discrepancies, though the worst errors are still identified. The most significant problem for the OSTIA QC is the large uncertainty of OSTIA SSTs in temporally variable frontal regions which is not adequately captured by the climatological seasonal background uncertainty fields used (Figure 1).
 Two examples of this latter problem are illustrated in Figure 6 (top two rows). The first example (top row) shows a relatively short drifter record (∼1.5 months) whose life is spent in the sub-Antarctic fronts of the Southern Ocean. It is clear that the few ARC SST estimates that are available compare well with the drifter SST observations, whilst the OSTIA SSTs can be in error by several degrees Celsius for weeks at a time, with the discrepancy frequently exceeding three standard deviations of background error. In this case, the drifter falsely fails both the tail and biased record checks. The second example (second row) shows a similar problem in the Gulf Stream, where the underestimation of OSTIA uncertainty in this case leads to the false rejection of over a month's data at the start of the drifter record. A full explanation for the underestimation of OSTIA uncertainty in frontal regions is not given here, but it is likely to be at least partly a consequence of the seasonal resolution of the background error fields and the variability of frontal gradient magnitudes and positions at nonseasonal timescales.
 As discussed in section 'Tail Check', this imperfect knowledge of OSTIA SST uncertainties will inevitably lead to false rejections of drifter (and ship) data but to some extent the QC parameters have been tuned to try and minimize the number of false rejections in frontal regions, whilst maximizing the number of correct rejections elsewhere in the global oceans. The validation of QC flags presented in section 'Drifter QC Validation Against ARC and Argo' (and section 'Ship QC Validation Against ARC and Argo') suggests that, in terms of global statistics, application of the QC improves observation quality. However, a user interested in using the QC flags for regional studies in frontal (or high latitude) locations should treat the results with caution. Note, however, that by no means do the QC flags lead to a loss of all data in frontal regions (Figure 8).
4.1.2. Drifter Quality Control Results
 The results of the drifter QC for 1996–2010 are presented in Figure 7 as monthly time series (note that results for the ICOADS 2.5 dataset are also included in red, see section 'Discussion and Summary') and in Figure 8 as global maps. The number of drifter observations has increased markedly over the period 1996–2010 (Figure 7, bottom right) and is highly variable in space (Figure 8, bottom right). Therefore, the results are standardized to show the proportion of drifter observations failing QC checks. The number of drifters failing any of the QC checks is 2443 out of a total of 15,494 drifters. These failures can be broken down by individual checks thus: start tail = 383, end tail = 1510, biased record = 174, noisy record = 100, short record = 128, aground = 351, and picked-up = 283. By far the most common gross error exhibited by drifters is bad data at the end of a drifter record, though the proportion of individual observations flagged by this check is comparable to that flagged by other QC checks (Figure 7) because the number of observations per drifter associated with each end tail failure is relatively small. Note that the total number of drifters failing the individual checks is greater than the number of drifters failing any QC checks; this is because a drifter reporting gross errors often fails more than one check (e.g., a drifter found to have run aground may also fail the end tail check).
 Figure 7 shows that the proportion of drifter observations failing gross error checks in any month is relatively small, generally <5%, and has reduced over time from typically 2–4% to nearer 1%. This reduction occurs chiefly over the period 2002–2004 and is mainly attributable to a reduction in the number of drifter observations failing the end tail, aground, biased record, and noisy record checks. This reduction in drifting buoy gross errors over time seems reasonable given the commitment to improving drifter observations by bodies such as the DBCP, which provides international coordination for data buoy efforts. An improvement in the quality of drifting buoy observations does not necessarily represent only improvements to the buoys themselves, but also improvements to procedures for monitoring the quality of buoy observations and the timely removal of data from bad platforms from the GTS. The reduction in the proportion of drifter observations failing the end tail and aground checks may well represent such an improvement in monitoring procedures, whereby the onset of bad observations by a drifter (due to running aground, instrument failure, etc.) is more promptly detected. The NOAA GDP DAC is responsible for having drifters with bad SST removed from the GTS stream. In 2002–2004, the drifter DAC changed their procedures so that this was done more frequently (from once per week to every few days) and began comparing the drifter values against concurrent microwave and infrared satellite SST products to identify gross errors (previously, only an SST climatology had been used) (Rick Lumpkin and Mayra Pazos, personal communication). This could potentially explain the reduction in drifting buoy gross errors observed around this time.
 It is also possible that the reduction in the proportion of drifters failing quality checks represents an improvement in the quality of the OSTIA SSTs against which drifters are assessed. From mid-2002, the OSTIA reanalysis began assimilating AATSR SST estimates and using AATSR for bias correction. AATSR SST estimates are more accurate and precise than their ATSR2 predecessors [e.g., Merchant et al., 2012] and lead to a small improvement in the OSTIA reanalysis global observation-background statistics (notably a 0.04 K drop in bias relative to drifting buoys [Roberts-Jones et al., 2011]). This small improvement in OSTIA SSTs is not, however, expected to impact the detection of gross errors, and it is notable that no reduction is seen in the proportion of drifter observations failing the start tail check from 2002 to 2004 (Figure 7). An apparent improvement in drifter observation quality is also seen by Xu and Ignatov [2010] around 2001–2002 relative to an SST analysis which assimilates only AVHRR satellite data (their Figure 5). It seems likely, therefore, that the reduction in drifter gross errors reported here is a feature of the drifter observations. A brief investigation (not shown) of drifter-Argo SST matchups (Argo observations are independent of OSTIA, see section 'Drifter QC Validation Against ARC and Argo') also shows a gradual improvement in matchup statistics from 2000 to 2004, which points toward an improvement in drifter observations over time. It is noted, however, that the coverage, and perhaps the quality, of Argo observations has also evolved over time. As noted in section 'OSTIA', the switch to the use of the OSTIA operational analysis in 2008 has no obvious effect on the drifter QC flags.
 Whilst improvements to the design or monitoring of drifters may translate to basin or global scale improvements in the quality of drifter data, some of the features of the QC failure time series shown in Figure 7 can be linked to drifter errors on a more regional scale. The peak in the proportion of drifters failing the biased record check circa 1999 seems to be partly linked to a batch of biased drifters that were deployed around Japan (an example of this is shown in Figure 4, top row). The peak in the proportion of drifter observations failing the start tail check in 2005 seems to be linked to a batch of initially biased drifter records in the eastern equatorial Pacific. These regional features are prominent in the QC failure maps shown in Figure 8. The end tail check, however, produces a more global pattern of failures which, owing to its predominance as a gross error type in drifter data, is also reflected in the map showing all QC failures. Of note is the distribution of failures seen for the aground check, some of which are far from the nearest land. This is, at least in part, the result of the recycling of drifter WMO IDs within 90 days, which can lead to two separate drifter records being treated as a single record (see section 'In Situ SST Data'; an example is also shown in Figure 6, third row). If the first drifter is found to have run aground, the second drifter's record will then be flagged in its entirety. This is discussed further in section 'Discussion and Summary'. In some cases, aground check failures in the open ocean are found to be associated with nonmoving platforms, which are likely moored buoys misclassified as drifters.
 For the noisy record check, the distribution of failures shown in Figure 8 is more worrying. It is clear that a large number of drifter records are flagged in the vicinity of the Gulf Stream which, as discussed in section 'A Comparison of OSTIA and ARC QC Flags', is due to an underestimation of OSTIA uncertainty in frontal regions. Whilst the QC procedures have been tuned to reduce this effect, this proves particularly difficult for the noisy record check, reflecting the difficulty of separating signal and error in regions of high SST variability. This is particularly problematic for drifters that spend most of their life in frontal regions (e.g., Figure 6, top row), which can break the assumptions made when designing the QC check (section 'Bias and Noise Checks'). As discussed further in section 'Discussion and Summary', this cannot be fully remedied without better uncertainty estimates for OSTIA. We show the noisy record check flags here because for some applications these flags may still be worth considering. An example is shown in Figure 4 (second row).
4.1.3. Drifter QC Validation Against ARC and Argo
 As discussed in sections 'Introduction' and 'A Comparison of OSTIA and ARC QC Flags', it is important to ensure that the QC flags generated here are an improvement to the in situ data and not the result of uncertainties in the reference field used. The QC flags are validated in two ways: the first is by applying the QC to drifter observations in the ARC MD and assessing the impact on matchup statistics; the second is a similar exercise applying the QC to drifter-Argo matchups. Because an earlier version of the [A]ATSR record is assimilated into OSTIA, the ARC MD does not strictly provide an independent assessment of the drifter QC flags (and in designing the OSTIA QC, a small subset of drifter QC flags are also tuned to ARC QC outcomes, see section 'A Comparison of OSTIA and ARC QC Flags'), so Argo (which is not assimilated by OSTIA and is entirely independent of the QC flags) is also used here for validation purposes. The drifter-Argo matchups are produced using quality controlled Argo observations from the Met Office EN4 database (S. A. Good, M. J. Martin, and N. A. Rayner, EN4: Quality controlled ocean temperature and salinity profiles and monthly objective analyses with uncertainty estimates, submitted to Journal of Geophysical Research, 2013). The criteria used for matchups are that observations must fall within 50 km and 3 h of each other. For each Argo profile, the shallowest observation in the 4–6 m depth range is selected (the shallowest depth range over which Argo floats commonly sample) and matched to the nearest drifter observation in space which satisfies the matchup criteria.
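The matchup criteria can be sketched as a simple nearest-in-space search, illustrated below. This is our own rendering (the paper does not give an algorithm), and for brevity it uses an equirectangular distance approximation rather than a full great-circle calculation:

```python
import numpy as np

def match_drifter_to_argo(argo_time_s, argo_lat, argo_lon,
                          dr_times_s, dr_lats, dr_lons,
                          max_km=50.0, max_hr=3.0):
    """Return the index of the nearest-in-space drifter observation
    within 50 km and 3 h of an Argo profile, or None if no drifter
    observation satisfies the matchup criteria."""
    best, best_km = None, max_km
    for i, (t, la, lo) in enumerate(zip(dr_times_s, dr_lats, dr_lons)):
        if abs(t - argo_time_s) > max_hr * 3600:
            continue  # outside the 3 h time criterion
        dlat = la - argo_lat
        dlon = (lo - argo_lon) * np.cos(np.radians(argo_lat))
        km = 111.19 * np.hypot(dlat, dlon)  # ~km per degree at this latitude
        if km <= best_km:
            best, best_km = i, km
    return best
```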
 Table 1 shows the impact that the various QC checks have on ARC-drifter matchup statistics for 1996–2010. Each individual QC check improves the matchup statistics relative to the case where no QC is applied. The tail checks and biased record check make the greatest improvements to the statistics, whilst the picked up and short record checks, which flag the fewest observations, have the smallest impact. Both the noisy record and aground checks have a relatively small impact on the ARC MD statistics despite the relatively large number of matchups they flag. As discussed in section 'Drifter Quality Control Results', this is likely a consequence of the false flagging of drifter observations in some circumstances, however, in the case of the aground check a lack of ARC-drifter coastal matches is also likely to be a factor.
Table 1. The Impact of Applying Drifter QC Flags to ARC MD Matchup Statisticsᵃ
Average Discrepancy (Standard Error) [drifter-ref, °C]
Standard Deviation (Variance) of Discrepancy [°C (°C²)]
Number of Matches
ᵃARC MD data from 1996–2009 are used. Average (and standard error), standard deviation (and variance), number of matchups, and number of matchups rejected are shown for matchups with no QC flags applied (No QC), with all QC flags applied (All QC), with all QC flags applied except the aground and picked-up checks (Ref. QC), and for the individual QC checks.
 Figure 9 (top row) shows the impact of applying the drifter QC flags to the ARC MD as zonally averaged and monthly statistics. In general, the drifter QC flags improve the quality of ARC matchups at all latitudes except in the high latitudes, where relatively few drifter observations occur. Relatively little impact is seen around 45°N, likely a combination of the reduced capacity of the drifter QC checks to detect bad observations in the Gulf Stream and Kuroshio regions where OSTIA is more uncertain, and the occurrence of false rejections in these regions (Figure 6). Whilst the aground and picked up checks have a relatively small impact on matchup statistics as a whole (Table 1), spatially their impact is more pronounced in certain latitude bands (e.g., 10–20°N). With drifter QC flags applied, the ARC-drifter matchup monthly time series is less variable, particularly prior to 2002–2003 when the drifter observations were found to possess a greater proportion of gross errors (section 'Drifter Quality Control Results') and shows an improved agreement for the whole 1996–2010 period.
 For the drifter-Argo matchups, there are not enough matches to break down the statistics by check, month, or latitude. Instead, Table 2 shows only the impact of applying the QC checks to all the drifter-Argo matchups from 2002 to 2009, for comparison with drifter-ARC matchups over a similar period. For the drifter-Argo matchups, applying the drifter QC results in a rejection of 0.7% of matchups and a reduction in matchup variance of 0.05°C². For the drifter-ARC matchups, applying the drifter QC results in a rejection of 0.9% of matchups and a variance reduction of 0.06°C². It is noted that the aground and picked up flags are not used here as these are largely coastal in their coverage and are expected to have a lesser impact on drifter-Argo matchups, which will be largely open ocean. Both the ARC and Argo matchups, therefore, seem to validate the drifter QC checks.
Table 2. The Impact of Applying Drifter QC Flags (Except the Aground and Picked-Up Checks) to ARC MD Matchup Statistics and Drifter-Argo Matchup Statisticsᵃ
Average Discrepancy (Standard Error) [drifter-ref, °C]
Standard Deviation (Variance) of Discrepancy [°C (°C²)]
Number of Matches
ᵃFor ARC MD matchup statistics, AATSR matchups from 2002 to 2009 only are used. Drifter-Argo matchups from 2002 to 2009 are used. The matchup criteria for Argo matchups are observations falling within 3 h and 50 km (see text).
Argo No QC
Argo Ref. QC
ARC No QC
ARC Ref. QC
4.2. Ship Quality Control Flags
4.2.1. Ship Quality Control Results
 The results of the ship QC for 1996–2010 are presented in Figures 10 and 11. Figure 10 shows monthly time series of ship observations failing either QC check (both the blacklist and short record checks) and just the short record check (note that results for the ICOADS 2.5 data set are also included in red, see section 'Discussion and Summary'). The time series of observations failing either QC check primarily reflects the blacklist QC check as the number of observations failing the short QC check is relatively small. As for the drifter QC results, the proportion of ship observations failing QC is shown because the number of ship observations varies in time (Figure 10, right) and space (Figure 11, upper right). Whilst for drifters the change in observation number over time is a result of the growth of the drifter array, for ships two steps are evident in January 1998 and December 2007 associated with changes in the source of ship data (section 'In Situ SST Data'). This has a clear influence on the proportion of ships failing either QC check, with a lower proportion of ship failures prior to 1998 (ICOADS 2.0 source) and from December 2007 onward (ICOADS RT source) relative to the period in between (NCEP NRT source). The reason for these steps is not fully known, but their existence is not surprising given differences in data set production, the use of delayed mode and real-time data, the omission of C-MAN callsigns from some data, and the introduction of ship callsign masking (section 'In Situ SST Data'). From October 1999 to May 2000, a drop in the proportion of ships failing QC is attributable to an unexplained increase in the proportion of observations with a generic callsign (e.g., “SHIP”), from less than 5% to 10–20%, which cannot be tracked by QC. The proportion of ship observations flagged by the blacklist check is high, between 45 and 70%, consistent with the design of the check which aims to restrict the ship data to a high-quality subset.
 In Figure 11, the global distribution of ship observations failing the QC checks for 1996–2010 is shown (upper left). The proportion of ship observations failing QC globally is spatially very variable, with some shipping lanes showing relatively few observation failures (e.g., in the North Atlantic), whilst others show a high percentage of failures (e.g., between Indonesia and Japan or Hawaii and mainland North America). Some basin-scale structure is evident, which is particularly apparent when the locations of only the windowed observations that fail the blacklist QC check are shown (Figure 11, bottom left, see also section 'Discussion and Summary' for further discussion), with observation quality higher overall in the North Atlantic, and lower in the North Pacific and parts of the southern hemisphere oceans. In the eastern equatorial Pacific, a region of high-quality ship observations is found, though the reason for this is unknown. The patchy structure of ship failures in the Southern Ocean and Arctic Ocean is a consequence of the relatively few ships that sample in these regions. Using ICOADS observations and WMO metadata from 1970 to 1997, Kent and Challenor [2006] described a similar global pattern of ship SST errors, with lower error estimates in the Tropics than the midlatitudes, and (for 1990–1994) greater error estimates for ships from Japan and the USA than Northern Europe. Figure 11 (lower right) shows the proportion of ship observations remaining after the QC flags have been applied (a change in color scheme is used to allow better contrast at low values) and it is clear that in parts of the polar oceans no ship observations remain after QC. To help mitigate this, it is planned to use OSTIA ice concentrations to create an “ice-flag”; this will help identify ships whose blacklisting may be the result of OSTIA SSTs which are influenced by ice and, therefore, particularly uncertain (e.g., around the Arctic ice edge in summer, Figure 1).
As noted above, the proportion of ship observations remaining following QC is spatially highly variable but, with the exception of the high latitudes, some ship observations are retained in all regions of the global ocean.
 Figure 12 shows the impact of the blacklist QC checks on probability distributions of the mean (bias) and standard deviation (random measurement error) of ship-OSTIA discrepancies calculated for each ship record with more than 120 observations. For comparison, probability distributions are also shown for drifter records passing the drifter QC checks. For the distribution of ship biases (left), the use of the ship QC checks removes the worst outlying ships in the distribution, and removes a slight warm skew from the ship observations (the standard deviation of the ship biases decreases from 0.67°C to 0.37°C and the mean ship bias decreases from 0.23°C to 0.13°C). This results in a more peaked (leptokurtic) and slightly negatively skewed distribution, though a small warm bias in the modal value of the distribution remains. For the distribution of ship random measurement errors (right), the use of the ship QC checks removes the worst outlying ships, resulting in a more peaked and less skewed distribution (the mean value of the ship random measurement errors reduces from 1.07°C to 0.92°C). It is clear that increasingly severe blacklisting procedures would produce only a small subset of ships with an observation quality comparable to drifters.
4.2.2. Ship QC Validation Against ARC and Argo
 As for drifters (section 'Drifter QC Validation Against ARC and Argo'), the impact of the ship QC checks on ship-ARC and ship-Argo statistics is presented in Tables 3 and 4. For the comparison of ship-Argo and ship-ARC statistics (Table 4), only AATSR data from 2002 to 2007 are used, as ship callsigns are not available unmasked (see section 'In Situ SST Data') in the ARC MD used. For the ship-Argo matchups, applying the ship QC rejects 55% of matchups and reduces the matchup variance by 0.63°C². For the ship-ARC matchups, applying the ship QC rejects 62% of matchups and reduces the variance by 0.45°C². Given the differing coverage of ships, Argo floats, and ARC, a difference in the variance reduction for the two sets of statistics is not surprising; crucially, however, it shows that the QC checks do not have a greater impact on the ship-ARC matchups, which are expected to be less independent of OSTIA. The ARC and Argo matchups, therefore, appear to validate the ship QC checks.
Table 3. The Impact of Applying Ship QC Flags to ARC MD Matchup Statistics
Average Discrepancy (Standard Error) [ship-ref, °C]
Std. Dev. (Variance) of Discrepancy [°C (°C²)]
Number of Matches
ARC MD data from 1996 to 2007 are used, avoiding masked ship callsigns; see text. Average (and standard error), standard deviation (and variance), number of matchups, and number of matchups rejected are shown for matchups with no QC flags applied (No QC), with all QC flags applied (All QC), for only the windowed observations failing QC checks (see text), and for the individual blacklist and short record QC checks.
Table 4. The Impact of Applying Ship QC Flags to ARC MD Matchup Statistics and Ship-Argo Matchup Statistics
Average Discrepancy (Standard Error) [ship-ref, °C]
Std. Dev. (Variance) of Discrepancy [°C (°C²)]
Number of Matches
For ARC MD matchup statistics, AATSR matchups from 2002 to 2007 only are used, avoiding masked ship callsigns; see text. For ship-Argo matchup statistics, matchups from 2002 to 2007 are used. The matchup criteria for Argo matchups are observations falling within 3 h and 50 km (see text).
Argo No QC
Argo All QC
ARC No QC
ARC All QC
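The ship-Argo matchup criteria noted in the Table 4 footnote (observations falling within 3 h and 50 km) can be expressed as a simple pairing test. The sketch below is illustrative only, written under those stated criteria; the function names and the dict-based data layout are assumptions, not the study's code.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (degrees in, km out)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_argo_matchup(ship_obs, argo_obs, max_hours=3.0, max_km=50.0):
    """True if a ship and an Argo observation fall within the 3 h / 50 km
    window; each observation is a dict with 'time' (datetime), 'lat', 'lon'."""
    dt_hours = abs((ship_obs["time"] - argo_obs["time"]).total_seconds()) / 3600.0
    if dt_hours > max_hours:
        return False
    return haversine_km(ship_obs["lat"], ship_obs["lon"],
                        argo_obs["lat"], argo_obs["lon"]) <= max_km
```

Looping such a test over all ship-Argo pairs (with a spatial index for efficiency) yields the matchup populations whose statistics are compared before and after QC.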
 Figure 9 (bottom row) shows the impact of applying the ship QC checks to the standard deviation of ARC-ship matchups, both as a monthly time series and as a zonal average. For most latitudes, the ship QC flags reduce the discrepancy between ARC and ship SSTs, except in the high latitudes where there are relatively few ship observations and the zonal estimates become noisy. For the monthly time series, although the ship QC generally reduces the discrepancy between ARC and ship SSTs, it has the effect of enhancing the seasonality seen in the ARC-ship discrepancy (which peaks in summer) prior to circa 2004. One candidate explanation is a seasonality of the QC flags themselves, related to the seasonality of the OSTIA background error variance fields. Seasonality of the OSTIA background error variance is particularly pronounced in the midlatitudes, with an increase in the error associated with synoptic atmospheric scales in the summer hemisphere [Roberts-Jones et al., 2013]. Because ship sampling is concentrated in the North Atlantic and North Pacific (Figure 11), this could impact the ARC-ship matchups. However, a repeat of the ship QC using only the winter seasonal uncertainty field reveals a similar enhanced seasonality in the ARC-ship matchup time series. It is inferred that this feature of the QC may instead result from the changing coverage or changing composition of the ship observations following application of the QC flags, for example, giving greater weight to bucket observations, which are of better quality than engine-intake observations but can suffer from seasonal biases [Kent and Challenor, 2006; Kent and Taylor, 2006].
5. Discussion and Summary
 This study investigates whether the quality of SST observations made by drifters and ships can be improved through retrospective QC against a reference field. The observations assessed so far have been a blend of delayed mode data taken from ICOADS 2.0 and real time data obtained from the GTS. A comparison of drifter and ship measurements (from 1996 to 2010) with ARC SST estimates on a platform-by-platform basis reveals drifter observations are generally of good quality but frequently suffer from gross errors, whilst ship observations are generally of worse quality than drifters and show a diverse range of measurement errors. Note, however, that not all ship data are of a lower quality than drifter data, for example, ship data from the Integrated Marine Observing System have been shown to have comparable uncertainty to those from data buoys [Beggs et al., 2012]. Furthermore, initiatives such as the Voluntary Observing Ships for Climate (VOSClim) project aim to provide a high-quality subset of VOS data, supplemented by extensive metadata, to support global climate studies and research (Joint WMO-IOC Technical Commission for Oceanography and Marine Meteorology VOS website: http://www.bom.gov.au/jcomm/vos/vosclim.html).
 QC procedures are developed which assess the quality of drifter and ship SST observations on a platform-by-platform basis through comparison with spatially complete, daily SST analyses from OSTIA. Drifter observations displaying some commonly observed gross errors (bad observations at the start or end of a drifter record, records that are biased or noisy as a whole, and drifters that have run aground or been picked up) are flagged. Typically, 2–4% of drifter observations are flagged, diminishing to nearer 1% by the mid-2000s. Ship callsigns whose observations are deemed unreliable (where a ship record contains a period of biased or noisy observations) are blacklisted. Typically, 45–70% of ship observations are blacklisted, with the proportion varying depending on the data source. Validation of the QC outcomes against SSTs from ARC and Argo demonstrates that globally this retrospective QC has improved the quality of drifter and ship observations.
 The data used in this study have already undergone basic QC; this includes checks of observation form (e.g., checking that an observation is located over the ocean) and an assessment against climatology and nearby in situ observations to identify gross errors (outliers). The additional QC procedures developed in this study have been used to further improve the quality of these data. Although this study makes use of a background field for QC, only groups of observations showing significant error have been flagged, and the focus has been on producing an improved data set suitable for a range of applications. Only one possible approach to improving the data has been demonstrated here, however, and in part this work has shown that plenty of scope remains for improving the quality of in situ SST observations.
 Although the QC procedures developed here successfully improve the quality of drifter and ship observations (see Tables 2 and 4), the QC outcomes still need to be treated with some caution. However much care is taken, any QC procedure that uses a reference as "truth" will inherently be limited by errors in the reference itself. This study attempted to mitigate errors in the reference field by making use of the OSTIA seasonal background error variance fields. Whilst this enabled an improvement in the QC outcomes, it should be remembered that the outcomes are still inherently limited, now by the uncertainty in the use of the background error variance fields. In dynamic frontal regions where drifter, OSTIA, and ARC SSTs can be compared (e.g., Figure 6, top two rows), the magnitude of the OSTIA background error variance appears to be underestimated, and assumptions made about the correlation of this error variance between drifter observations may be invalidated. QC outcomes in such locations should be used with care and, dependent on application (e.g., in local studies), users may choose to discount those from frontal (or ice-affected) regions.
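One way to see how a background error variance field enters such a comparison is to normalize each in situ minus OSTIA discrepancy by the expected spread. The sketch below illustrates the idea only; the function names are assumptions and the rejection threshold is hypothetical, not the value used in this study.

```python
import math

def normalised_discrepancy(obs_sst, background_sst, bg_error_var, obs_error_var=0.0):
    """In situ minus OSTIA discrepancy (degC) scaled by the expected spread.

    bg_error_var is the seasonal OSTIA background error variance (degC^2)
    at the observation location: where the analysis is more uncertain
    (e.g., frontal or ice-affected regions), the same raw discrepancy
    gives a smaller normalised value, reducing false rejections there.
    """
    return (obs_sst - background_sst) / math.sqrt(bg_error_var + obs_error_var)

def fails_check(obs_sst, background_sst, bg_error_var, threshold=3.0):
    """Flag an observation with an implausibly large normalised discrepancy.
    The threshold of 3 is purely illustrative, not the study's value."""
    return abs(normalised_discrepancy(obs_sst, background_sst, bg_error_var)) > threshold
```

This framing also makes the limitation noted above explicit: if the background error variance is underestimated in frontal regions, the normalized discrepancy is inflated there and good observations may be rejected.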
 Moving forward, the use of background error covariance estimates for each analysis day, instead of seasonal climatological fields, would improve the OSTIA SST analysis and significantly aid studies such as that presented here. At the time of writing, daily analysis error fields are calculated as part of the OSTIA system [Donlon et al., 2011] and are available for both the reanalysis and operational analysis data. In theory, these should provide a better representation of the uncertainties in the OSTIA data, accounting for their temporal variability. However, the applicability of these daily analysis error fields to studies such as this is limited because they do not account for the errors associated with using yesterday's SST field today, and they were produced using an inferior parameterization of the background error covariances to that presented in section 'OSTIA'. The implementation of the new seasonal background error estimates within the OSTIA systems should improve both the OSTIA SST fields and the analysis error estimates generated in the future, the use of which might improve the QC outcomes. Ultimately, however, the availability of background error estimates for each analysis day ("flow-dependent" error) is the optimum solution.
 An alternative approach to the QC might be to vary the treatment of the existing OSTIA background error variance estimates from one region to the next (e.g., between frontal and nonfrontal regions), or to mask observations in regions with large SST variability, but this has not been explored here. The uncertainty estimates appear less problematic for the assessment of ships, which are less likely to spend extended periods in frontal regions, though some exceptions may exist (e.g., research ships) that would require further study. The use of a reference in QC work is not, however, discouraged. The advantage of using the OSTIA analysis is that it spreads information from multiple observations and observation types, providing a useful estimate of "truth" for identifying measurement errors in individual in situ observations. As long as uncertainties in the use of such a reference for QC are carefully treated and acknowledged, a useful improvement to the quality of in situ data can be made.
 Users interested in diurnal variability should also treat the QC outcomes discussed here with caution. Because OSTIA is a foundation SST product, some of the discrepancies between in situ and OSTIA SSTs may arise from diurnal heating, which can reach several degrees Celsius in magnitude, comparable to the sorts of errors being detected. This is particularly relevant to the ship QC, where it was decided to use both daytime and nighttime observations. Although the spatial distributions of failures shown in Figures 8 and 11 do not obviously reflect the spatial variability of diurnal warming magnitude (which is greatest in the subtropics and western Pacific warm pool; see, for example, Kennedy et al.), an impact on the QC outcomes cannot be discounted. Future QC work might benefit from the incorporation of a model of diurnal warming, as is the case for matchups in the ARC MD, but this has not been explored here. An alternative is to use a reference data set that captures the diurnal signal, such as the UK Met Office Diurnal Analysis, which is presently under development.
 A further impact of the QC outcomes that users should be aware of is the reduced bias of the ship observations passing QC. As illustrated in Figure 12 and described in several other studies [e.g., Kent and Taylor, 2006; Kennedy et al., 2011b], many ships have a tendency to make observations that are biased warm. Because the QC procedures described here are designed to extract a high-quality subset of less-biased ship observations, after QC this warm bias is less evident. Validation of the ship QC against ARC suggests that globally a significant SST bias of order 0.1°C is removed and ships are left less biased relative to ARC (Tables 3 and 4). For climate applications that adjust for platform specific (and platform-by-platform) biases, for example, the HadSST3 data set [Kennedy et al., 2011a, 2011b], the QC outcomes must be treated carefully.
 A consideration for climate monitoring activities is the loss of generally 50% or more of ship observations following QC. Whilst the remaining ship population will have a lower measurement error than the ship population prior to QC, the dramatic reduction in ship numbers is likely to lead to an increase in coverage and sampling error (i.e., errors associated with how completely the oceans are sampled). Whilst this is less likely to be an issue from the mid-2000s, when the drifter array grew dramatically (Figure 7), the sampling pattern of ship observations is significantly different from that of drifters (see Figures 8 and 11), and studies such as that of Kennedy et al. [2011c], which assess the adequacy of the in situ observing system, would be useful for informing how to optimize the QC of ships for SST monitoring. It may be that bias correcting ship data where possible [e.g., Brasnett, 2008] is a superior approach to ship blacklisting.
 At present, the ship QC procedures flag all observations from a ship whose record is found to possess a period of biased or noisy observations (i.e., the ship is blacklisted). The justification for this is that for some ship records, a period of poor quality observations is indicative of poor quality observations in general (e.g., Figure 5, third row, where the ship observations become gradually more, then less, biased over time). Table 3 shows, however, that if only the periods of poor quality observations are rejected (Window QC), the improvement to the ARC matchup statistics is similar to that obtained if all blacklisted observations are rejected (Blacklist QC). This suggests that a ship found to report some poor quality observations does not necessarily produce observations of limited use at other times, and it may be that only a smaller proportion of ship observations warrant rejection by QC; this, however, will require further study. Figure 11 shows that whilst the Window QC (lower left) flags a substantially smaller proportion of observations than all QC checks (upper left), regional variations in the Window QC flags also exist, implying that blacklisting may be more appropriate in some locations than others, for example, in the North Pacific, where the proportion of flagged observations is relatively large compared to the North Atlantic.
 For the drifter QC procedures, some improvements in the QC flags could be achieved through the use of higher precision positional data. This is particularly true for the aground check, where the use here of low precision (0.1°) positions requires a relatively long time window to establish that a drifter is no longer moving. The use of a 21 day window is a compromise, resulting in some shorter-lived drifter groundings remaining undetected and the occasional rejection of drifter observations in regions where surface velocities are small. Instead of assessing drifter movement, a more effective approach to the drifter aground and picked-up QC checks might be to flag observations showing anomalously large diurnal SST variability. This appears indicative of atmospheric temperature measurements (see, e.g., Figure 4, bottom row, and Figure 6, bottom row) and is used by the NOAA GDP DAC to help identify drifters that have been picked up [Lumpkin et al., 2012]. An automated diurnal temperature QC check is not, however, necessarily trivial to implement, as in low wind, high-insolation conditions, oceanic diurnal temperature variability can reach several degrees Celsius; distinguishing this from atmospheric temperature measurements would require further investigation. The drifter QC would also benefit from better separation of drifter records into separate deployments where drifter WMO IDs have been reused (section 'In Situ SST Data'), but this would require several thousand records to be manually assessed.
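The aground check described above (low precision positions, 21 day window) can be caricatured as follows. This is an illustrative sketch only: the function name and the daily-position data layout are assumptions, while the 0.1° precision and 21 day window come from the text.

```python
def appears_aground(lats, lons, position_precision=0.1, window=21):
    """Crude aground test on a drifter's daily positions (oldest first).

    With positions reported to only ~0.1 degrees, genuine drift can be
    hidden by quantisation, so a long window (21 days in the text) is
    needed before concluding the drifter has stopped moving. Returns True
    if the most recent `window` positions all quantise to the same
    0.1-degree cell.
    """
    if len(lats) < window:
        return False
    recent = list(zip(lats, lons))[-window:]
    cells = {(round(lat / position_precision), round(lon / position_precision))
             for lat, lon in recent}
    return len(cells) == 1
```

The compromise discussed above is visible here: a slowly moving drifter in a low-velocity region can sit in one 0.1° cell for 21 days and be wrongly flagged, while a grounding shorter than the window goes undetected.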
 It is planned to extend the drifter and ship QC outcomes to ICOADS 2.5 data for 1986–2010, and some improvements may be incorporated at this stage. Although the ICOADS 2.5 data set includes improved delayed mode data up to 2007, preliminary results for ICOADS 2.5 data from 1996 to 2010 using the QC procedures documented here suggest that the QC flags are also useful for drifter and ship data in the ICOADS 2.5 data set throughout this period (Figures 7 and 10, respectively; red lines). For drifters, the use of delayed mode ICOADS 2.5 data from 1996 to 2007 results in a reduction in the proportion of drifter observations failing the end tail, aground, and picked-up checks relative to the ICOADS 2.0 and NCEP NRT data (Figure 7); however, many gross drifter errors are still detected (verified by an investigation of individual records as in Figures 3 and 4, not shown). After 2007, where the ICOADS RT product is used, the proportion of drifters failing QC increases in line with that in NCEP NRT. Although a full exploration of the ICOADS 2.5 results is beyond the scope of this study, it is clear that the drifter (and ship) QC outcomes can help homogenize the quality of data sets constructed from different sources and improve the quality of ICOADS delayed mode data. Delayed mode data for GDP buoys (a large proportion of the global drifter array, e.g., see http://www.jcommops.org/dbcp/network/dbcpmaps.html) can be obtained from the NOAA GDP DAC (http://www.aoml.noaa.gov/phod/dac/dacdata.php), who also aim to remove some of the gross drifter errors addressed by the QC procedures described here. ICOADS 2.5 uses delayed mode drifter data provided by the Canadian Integrated Science Data Management (ISDM) [Woodruff et al., 2011], who archive all GTS drifter data on behalf of the DBCP. Some of the approaches described in this study (such as the detection of sustained drifter offsets) could potentially be migrated to the GDP DAC or ISDM. 
There remains significant scope to improve the quality of in situ observations by inter-comparing and refining the differing approaches used to track observation quality on a platform-by-platform basis.
Appendix A: The ARC MD
 In the ARC MD [see also Embury et al., 2012], a colocation between satellite swath and in situ observations is performed to the nearest 1 km pixel (though the low precision of in situ positions means the colocation is only accurate to within ∼5 km), with a 5 × 5 block of pixels centered on each in situ observation extracted for the MD and clear-sky pixels averaged to calculate SSTs. The time window for colocations in the ARC MD is ±3 h from a satellite orbit (which lasts ∼1.5 h). Matches with time differences >3 h are excluded to avoid a time window which varies with position in the satellite orbit. Modeled skin to subskin and subskin to depth adjustments, which are included in the ARC MD, are used to reduce the differences between ARC skin SST estimates and in situ depth SST measurements, which arise from the effects of thermal stratification combined with the differing depths of in situ and satellite observations; in the ARC MD, drifters and ships are nominally assumed to sample at 0.2 m and 1.0 m depths, respectively. Matchups are discarded where the skin to subskin or subskin to 1.0 m depth adjustments have extreme values, over 0.4 K and 0.2 K, respectively, corresponding to 1–2% of cases (Owen Embury, personal communication). A time adjustment (based on an empirical model of heating rate as a function of solar zenith angle and wind speed) is applied to the in situ observations to reduce differences between ARC and in situ SSTs arising from diurnal variability (described by Embury et al.). The ARC D2 (dual view, two channel, day and night) product is used here to increase the number of ARC matchups with in situ observations. Twilight observations (defined here as those where the solar zenith angle is in the range 87.5–92.5°) are excluded to avoid difficulties with cloud masking at this time of day. If both AATSR and ATSR2 match to an in situ observation (because of an overlap between mission periods and similarity in satellite orbits), only the AATSR matchup is retained. 
If the same sensor matches to an in situ observation twice (where successive swaths overlap in high latitudes), only the closest matchup in time is retained.
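The matchup selection rules above (twilight exclusion, AATSR preference, and closest-in-time for duplicate matches from one sensor) can be sketched as a simple filter. The dict-based data layout and function name are assumptions for illustration, not the ARC MD implementation.

```python
def select_matchup(candidates):
    """Pick at most one satellite matchup per in situ observation following
    the ARC MD rules described above. Each candidate is a dict with
    'sensor', 'solar_zenith' (degrees), and 'dt_hours' (absolute time
    difference from the in situ observation)."""
    # Exclude twilight matchups (solar zenith angle 87.5-92.5 degrees).
    usable = [c for c in candidates if not (87.5 <= c["solar_zenith"] <= 92.5)]
    if not usable:
        return None
    # Where both AATSR and ATSR2 match the same observation, prefer AATSR.
    aatsr = [c for c in usable if c["sensor"] == "AATSR"]
    pool = aatsr if aatsr else usable
    # For duplicate matches from one sensor (overlapping high-latitude
    # swaths), keep the matchup closest in time.
    return min(pool, key=lambda c: c["dt_hours"])
```

Note that in this sketch the AATSR preference is applied before the time criterion, so an AATSR match is retained even when an ATSR2 match is closer in time, consistent with the rule stated above.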
 C.P.A., N.A.R., and R.O.S. were supported by the Joint DECC/Defra Met Office Hadley Centre Climate Programme (GA01101). This work is a contribution toward ERA-CLIM, a Collaborative Project (2011–2013) funded by the European Union under the 7th Framework Programme. The OSTIA background error variances were estimated as part of the ESA SST CCI project. The OSTIA reanalysis was funded through the European Community's Seventh Framework Program FP7/2007–2013 under Grant Agreement 218812 (MyOcean). The ICOADS 2.0 and NCEP NRT databases were provided by the NOAA-CIRES Climate Diagnostics Centre, Boulder Colorado. ICOADS RT and ICOADS 2.5 data are from the Research Data Archive maintained by the Computational and Information Systems Laboratory at the National Centre for Atmospheric Research. Formatted, quality controlled ICOADS and NCEP NRT data, including unmasked ship callsigns, were provided by Michael Saunby. The ARC MD and helpful advice were kindly provided by Owen Embury. EN4 Argo data were provided by Simon Good. Helpful comments were provided throughout by John Kennedy. This manuscript has benefitted from the comments of two anonymous reviewers.