Assessing potential of sparse-input reanalyses for centennial-scale land surface air temperature homogenisation

Observations from the historical meteorological observing network contain many artefacts of non-climatic origin which must be accounted for prior to using these data in climate applications. State-of-the-art homogenisation approaches use various flavours of pairwise comparison between a target station and candidate neighbour station series. Such approaches require an ade-quate number of neighbours of sufficient quality and comparability – a condition that is met for most station series since the mid-20th Century. How-ever, pairwise approaches have challenges where suitable neighbouring stations are sparse, as remains the case in vast regions of the globe and is common almost everywhere prior to the early 20th Century. Modern sparse-input centennial reanalysis products continue to improve and offer a potential alternative to pairwise comparison, particularly where and when observations are sparse. They do not directly ingest or use land-based surface temperature observations, so they are a formally independent estimate. This may be particularly helpful in cases where structurally similar changes exist across broad networks, which challenges current techniques in the absence of metadata. They also potentially offer a valuable methodologically distinct method, which would help explore structural uncertainty in homogenisation techniques. The present study compares the potential of spatially-interpolated sparse-input reanalysis products to neighbour-based approaches to perform homogenisation of global monthly land surface air temperature records back to 1850 based upon the statistical properties of station-minus-reanalysis

K E Y W O R D S homogeneity, reanalyses, surface temperatures

| INTRODUCTION
Scientists have been collecting and analysing global land surface air temperature records for a very long time. Callender put together the first truly global collection of temperature estimates in 1938, collating by hand a number of global station records and concluding that carbon dioxide from the burning of fossil fuels was responsible for the warming of the climate that he had assessed over the previous 50 years (Callender, 1938;Hawkins and Jones, 2013). Since that pioneering work, there have been significant advances and very many global, regional, and national surface temperature datasets have been created, curated, and analysed. These have become progressively more complete, and the methods used in their creation have become more advanced. Notable current global and regional works include, but are not limited to, CRUTEM now at version 5 (Osborn et al., submitted), GHCN-M now at version 4 (Menne et al., 2018), E-OBS (Cornes et al., 2018), and the Berkely Earth dataset (Rohde et al., 2013). Although these datasets are in broad agreement in terms of global mean behaviour, even at these scales potentially important divergences occur prior to the mid-20th Century (Sanchez-Lugo et al., 2019).
Despite numerous advances, creation of long-term climate data records remains a challenging proposition.
Meteorological observations were generally taken to observe and predict local and regional weather and not to monitor long-term climate change. Change in the records has been ubiquitous and has often been beneficial. The original 'raw' data are very often biased as a result of a wide range of factors well-reviewed in the literature. These include station moves, urbanization effects, instrument changes, land cover changes, and observation practice changes amongst others (Parker, 1993;Peterson et al., 1998;Changnon and Kunkel, 2005;Trewin, 2010). The degree to which these biases do not represent the true climate evolution complicates attempts to quantify climate variability and change unless adequately identified and adjusted for (Willett et al., 2014).
Compounding this, long-term data are only available for certain locations, with relatively few meteorological measurements having been performed quasicontinuously since the 19th century (Bronnimann et al., 2013;Rennie et al., 2014). These locations are not distributed equitably across the global land surface and are concentrated in Eurasia, North America, and parts of Australasia. It is, therefore, a challenge to infer truly global estimates of long-term change.
Recently, the International Surface Temperature Initiative (ISTI) has undertaken an open and transparent effort to recover, combine, and create a database of 'original' (raw) monthly land surface air temperatures from historical observational records, with an emphasis on provenance and completeness (Rennie et al., 2014). This database in its current iteration contains more than 36,000 individual station records (although many are short-period records). It is the most extensive global collection of original instrumental land surface air temperature series produced thus far. It increases by approximately three-fold the number of station series that were available to researchers prior to its assembly, with improved spatial completeness back to at least 1850 (Rennie et al., 2014). To date, it has been homogenized to create both a new version of the Global Historical Climatology Network Monthly product -GHCNv4 (Menne et al., 2018), and an estimate of Diurnal Temperature Range changes (Thorne et al., 2016). Both of these have utilized the operational version of NOAA NCEI's Pairwise Homogenisation Algorithm (Menne and Williams, 2009) to create bias-adjusted station series. To better quantify the uncertainty in homogenized data products arising from the ISTI databank, it is imperative that a broader range of methodological approaches be explored to probe the structural uncertainty in surface temperature records derived from these holdings (Thorne et al., 2005).
Such broadened approaches could include using climate reanalysis products. Over recent decades these products have been generated starting with the NCEP/ NCAR reanalysis (Kalnay et al., 1995;Kistler et al., 2001), with several groups developing full-input reanalysis products with the most recent versions being the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 (Hersbach et al., 2020), the Japanese Meteorological Agency's JRA-55 (Kobayashi et al., 2015), and NASA's Modern-ERA Retrospective Analysis for Research (MERRA-2) (Gelaro et al., 2017). ECMWF reanalyses have been successfully used, instead of neighbour-based approaches, to homogenize radiosonde temperatures (Haimberger et al., 2012).
More recently, surface-only sparse-input reanalysis products that extend back to the 19th Century have been produced (Compo et al., 2011;Poli et al., 2016;Laloyaux et al., 2018;Slivinski et al., 2019). Most specify fields of homogenized sea surface temperatures and sea ice concentration as a lower boundary condition (Rayner et al., 2005;Titchner and Rayner, 2014). All assimilate only surface pressure or surface winds and pressure as a dynamical constraint to reconstruct the full atmospheric state over the globe. They are thus formally and fully independent of land surface air temperature observations and any time averages derived from them. A number of precursor comparisons of these products to meteorological observations of land surface air temperature (Ferguson and Villarini, 2012;Jones et al., 2012;Compo et al., 2013;Parker, 2016;Wang et al., 2018) imply close correspondence, at least over certain regions and periods, but with potential caveats. For example, Ferguson and Villarini (2012) highlighted good correspondence from the mid-20th Century onwards but with the potential for a spurious break over the Central United States around the mid-20th Century in the NOAA-20CR version they analysed.
As with traditional full-input reanalysis products, successive generations of sparse-input reanalysis products show improved quality as we learn from previous efforts and as data assimilation techniques and model skill improves (Slivinski et al., 2019). Sparse-input reanalyses depend on the quality of the reconstruction of Sea Surface Temperatures as well as Sea Ice extent. These SST fields were carefully developed but may nevertheless contain remaining inhomogeneities, particularly near the ice edge before 1950. However, it is reasonable to assume that these inhomogeneities are smaller than, and that any SST inhomogeneities are independent of, inhomogeneities occurring in land temperature records. This paper sets out to assess whether using the latest generation of sparse-input reanalysis products may plausibly constitute an alternative approach to homogenize the 'raw' ISTI monthly databank holdings. This could provide a valuable methodologically-independent estimate of the necessary adjustments to these fundamental data holdings. The present analysis is a necessary precursor to such a homogenisation effort by evaluating critically whether the primary building block of the new method, sparse-input reanalysis fields, can provide suitable comparator-series for the homogenisation of land surface air temperature series.
Having outlined the context, the remainder of the paper is structured as follows. Section 2 considers the options of constructing a comparator series for homogenisation and outlines the role that sparse-input reanalysis products could play. Section 3 highlights work to postprocess the ISTI databank to remove identified duplicates. Section 4 details the interpolation method employed to arrive at a reanalysis-based comparator record. Section 5 examines the relative performance of state-of-the-art pairwise comparison versus sparse-input reanalysis as a tool for homogenisation based upon the statistical properties of station-minus-neighbour and station-minus-reanalysis timeseries. Section 6 discusses the key remaining issues. Finally, conclusions are given in Section 7.

| POSSIBLE APPROACHES TO CONSTRUCTING COMPARATOR SERIES
Homogenisation of station time series to remove nonclimatic influences from the record is essential to estimate the underlying climate record. The goal is to remove artificial non-stationarities in a series ('breaks') while retaining any real trends (e.g., Menne and Williams, 2009;Venema et al., 2012). Homogenisation of a candidate station record thus requires some comparator series. Acquiring a suitable and robust series is a challenge. The series must contain a reasonable approximation to the real geophysical variations experienced at the candidate station to avoid misappropriating real climate variability and trends as arising from data artefacts. Fundamentally, a comparator series needs to be as highly correlated with the target series, and with as low noise, as possible. The higher the correlation and lower the noise the smaller the breaks in the candidate series that can be detected and the lower the propensity to falsely identify breaks (Menne and Williams, 2009;Williams et al., 2012).
In a perfect world scenario, consulting the station's comprehensive metadata would be the solution to breakpoint detection, identifying where shifts or discontinuities may be expected (Trewin, 2010). In such a scenario, whenever an instrument had been changed or a station moved there would have been a period of parallel measurements undertaken and these series would also be available. Furthermore, all sites would have been well maintained and all siting would follow stipulated criteria that ensure representativeness. There would also exist a backbone of high-quality traceable reference stations . Sadly, in the real world, very often metadata are incomplete or missing, parallel measurements are rarely made and even more rarely openly shared, many sites are sub-optimal, and there exists, at least historically, no absolutely traceable reference network. Thus those interested in creating data records must confront the challenge of working with data series that are poorly documented and highly likely to contain unresolved issues arising at unknown times.
Some researchers have used sections of the record of the station under examination that they have high E3002 confidence in to homogenize suspect sections of the same record (Peterson et al., 1998;Mamara et al., 2012). But most employ the use of several nearby stations in the same region (Peterson and Easterling, 1994). Early neighbour-based techniques used some form of neighbour averaging (or compositing), but a growing recognition that quasi-contemporaneous or large breaks in neighbours might lead to their misattribution has led to most modern techniques using some form of multiple pairwise comparison techniques (Venema et al., 2012). These start by finding all potential breaks by comparing, in turn, each station to every other station within a given set of stations and then proceed via logical elimination to ascertain whether detected breaks most likely exist in a candidate series or in individual neighbours.
For homogenisation, the individual station records must also be both of sufficient length and overlap substantially. The ISTI databank consists of station records of varying duration, period of observations, and completeness such that it is very much the exception rather than the norm to have a 1:1 correspondence in data availability between any pair of stations. This means that any particular comparison can typically only elucidate potential data issues in a subset of the candidate station series under consideration ( Figure 1).
As a novel alternative, series from reanalysis products offer the potential to circumvent many of these issues. Sparse-input reanalysis products (Compo et al., 2011;Poli et al., 2016;Laloyaux et al., 2018;Slivinski et al., 2019) extend back to the mid-19th Century and include surface temperature estimates consistent with the prior forecast field, assimilated meteorological measurements (which exclude the land surface temperatures), and any specified boundary conditions. Reanalyses will thus always have a corresponding value to every station observation over the common period of record and are substantively independent. The use of full-input modern period reanalysis products that assimilate considerable additional data from radiosondes, aircraft, satellites, etc. to homogenize radiosonde records has proven effective (Haimberger et al., 2012). The question remains whether this is more broadly the case and, specifically, whether the centennial-scale sparse-input reanalysis products can perform a similar function for land surface air temperatures.

| REMOVAL OF DUPLICATED DATA FROM THE ISTI DATABANK HOLDINGS
The ISTI databank consists of a hierarchical merge of records of original monthly-averaged temperature observations from more than 70 underlying data sources (Rennie et al., 2014) of hugely varying volume (both station number and period of record) and provenance. Many stations have been shared broadly such that they exist in many of the sources used to create the merged holdings. To further confound matters, data from different sources may differ in coordinate precision, station naming, application of Quality Control, and even in a small number of cases homogenisation. Furthermore, some sources may have performed merges which will have been invisible to the databank creators. The ISTI databank construction process is automated and uses a mix of geographical metadata and data similarity to make a decision to either: (a) merge series; (b) create a new series; or (c) withhold the series. Despite significant efforts, it is recognized that there will have been inevitable incorrect choices made by the algorithm.
For this study, several steps have been taken to address potential issues in the holdings of the ISTI F I G U R E 1 Summary of neighbour station data availability for De Bilt since 1850 (the series extends to the 1700s but for the present study the interest is in the period since the 1850s driven by availability of sparse-input reanalysis products and globally representative observations). This series is a centennial station series with almost continuous availability (bottom black) since the 1850s, although prior to 1897 data arise from Utrecht and then several additional sources: http://projects.knmi.nl/klimatologie/ daggegevens/antieke_wrn/index.html). Within the ISTI databank data 1901 to date arises from the KNMI hosted E-OBS. Data prior to 1901 arises from GHCNMv2 collection which appears to arise directly from KNMI. The 25 nearest neighbours (other colours) are shorter with no suitable neighbour amongst them to use for homogenisation in the 1850-1900 period. There is one potential neighbour for the period of 1900-1945, after which there are several possible neighbours for pairwise homogenisation. Effectively pairwise homogenisation techniques are not possible for the period of 1850-1945 without expanding the neighbour search radius due to a lack of suitable neighbours databank. NCEI has undertaken their own blacklisting of issues they found in the ISTI databank in GHCNMv4 construction, (Rennie, pers. comm), and these have been applied herein as a first step. Following identification of additional potential issues, two distinct analysis steps were undertaken herein to detect and remedy apparent residual duplication of records in the database prior to using it. Between them, these additional steps removed about 3.5% of the remaining stations in entirety and, in addition, removed questionable sources from 340 series leading to a reduction in their station series length.
The first issue pertains almost exclusively to data rich regions and is most prevalent for long-term meteorological series, which have been shared widely over the decades, leading to their presence in multiple source data decks that underly the ISTI databank. In different sources, these may have been merged, quality controlled, and/or homogenized. The end result is that for several stations, either exact matches or near exact matches that, that is, repeat a seasonal cycle, exist across multiple nearby locations in the ISTI databank. To identify such cases, each station of more than 10 years record duration was compared with its 25 nearest neighbours. Cases with either strings of zero differences or annually repeating differences were identified. All such cases with greater than 10% prevalence were then examined manually to confirm the presence of duplication, and the longest available series was retained with all other series discarded ( Figure 2). A total of 973 such cases of erroneous data duplications have been identified. In many cases, these had multiple duplications that necessitated dropping more than one series to be satisfactorily resolved.
The second issue arises when considering neighbour distances. There are just over 300 cases where 2 or more stations in the ISTI databank have exactly matched geographical coordinates recorded. Such a result is possible, particularly where the coordinate resolution is coarse (e.g., a 0.1 resolution coordinate, which is not uncommon in many of the ISTI databank sources, has a c. 10 km radius in which the true location may exist). The issue is most prevalent in Canada and the United States, where the ISTI databank is most dense, accounting for in excess of 75% of the identified cases. Out of an abundance of caution, a single series is retained at a given unique coordinate. Where records overlap, only the longest record has been retained. Where stations did not overlap in time, they have been merged.
Between the two steps outlined above, a total of 1,467 additional station series have been removed (sometimes by merging), beyond the NCEI originated blacklist. A map of these removed stations is given in Figure 3. All the decisions have been communicated back to NOAA NCEI, who have integrated this additional blacklist into their operational GHCNMv4 processing (Rennie, pers. comm), and are reported in Data S1. It should be noted that the ISTI databank shall in the medium-term be superseded by efforts to create an integrated set of holdings across variables and timescales from synoptic to monthly (Thorne et al., 2017).

| INTERPOLATION OF REANALYSIS GRIDDED SERIES TO STATION LOCATIONS
Sparse-input (or 20th Century) reanalyses are relatively recent additions to the family of reanalysis products. Pioneered by NOAA and the University of Colorado, they have now been produced also by ECMWF and are under preparation elsewhere. Recourse is made to four versions of these reanalysis products arising from ECMWF, NOAA and the University of Colorado: neighbouring stations records exhibit substantial spurious periods of data overlap due to ghosting of a single long station into the regional network on multiple occasions. Each series shows the candidate station minus the neighbour (offset by 2C and coloured distinctly for clarity). In reality, there is one very long-term record in Helsinki, but it has been shared on multiple occasions. Further analysis shows that the issue for this station arises in large part due to upstream source merge decisions, but there are also some cases where the ISTI databank merge procedures erroneously merged this single long series from several sources into nearby stations. Mis-merged series contain both repeat zero differences and annually repeating values implying one or more of the mis-merged series has been homogenized, further complicating matters 1. The NOAA-CIRES Twentieth Century Reanalysis version 2c (20CRv2c) provides 2 by 2 resolution estimates over 1851-2012 generated with an Ensemble Kalman Filter (EnKF) algorithm. Use is made of both the ensemble mean product and the underlying 56 ensemble members (Compo et al., 2011;Giese et al., 2016). 2. The ECMWF ERA-20C reanalysis, produced under the EU funded ERA-CLIM project, provides a deterministic estimate (single analysis with no uncertainty) on a 1 by 1 grid from 1900 to 2010 using a 4D-Var algorithm (Poli et al., 2016).

The
NOAA-CIRES-DOE Twentieth Century Reanalysis version 3 (20CRv3) is a comprehensive update of previous versions of 20CR. It has an improved resolution of approximately 0.7 by 0.7 covering the period from 1836 to 2015 and an ensemble of 80 members (Slivinski et al., 2019). Solely the ensemble mean is considered herein as the full ensemble of 80 members was released only after the present analysis was completed. 20CRv3 benefits from an upgraded EnKF data assimilation algorithm and an improved NOAA atmospheric model. The observational constraint benefits from an enhanced observational database version 4.7 of the International Surface Pressure Databank (Cram et al., 2015) from data rescue efforts, a new variational quality control algorithm, a new bias correction for marine observations before 1871, and an updated bias correction algorithm for all station data over land (Slivinski et al., 2019). 4. The ECMWF CERA-20C product is a coupled reanalysis product with a 1 by 1 resolution extending from 1900 to 2010 with a 10-member ensemble (Laloyaux et al., 2018).
All of the sparse-input reanalyses used here are available upon a regular grid. To construct a comparator series of monthly 2-m air temperature estimates using reanalysis for each target station it is thus necessary to interpolate the gridded estimates to the station locations. Several possible interpolation methods exist of varying complexity. Interpolation by inverse distance weighting (IDW) is a popular method which is computationally efficient and considered to be relatively accurate (Willmott and Robeson, 1995). IDW is strongly recommended where the points to be interpolated are dense enough to capture local variation (Childs, 2004) and reduces any concern about topographic complexity that may generate micro-environments impacting on climatic values (Vicente-Serrano et al., 2003).
For each station in turn, a three by three grid of the nine nearest reanalysis grid points is used to interpolate to provide a station temperature estimate. A station located at the equator would have a maximum diagonal grid distance of 630 km for 20CRv2c decreasing significantly as it nears the poles. The other reanalysis products (ERA-20C, 20CRv3 and CERA-20C) are under half these distances. Two methods of inverse distance weighting were considered: (a) inverse distance weighting and (b) inverse distance squared weighting. Weighting was applied to absolute values and the resulting series was then anomalised after matching to target station data availability. Using a station period of record climatology at this stage maximizes the retained pool of station records. Results were compared individually in anomaly space for a selection of global stations and evaluated using Pearson correlations and standard deviations. On an individual station basis, there were only negligible differences (Table 1) with, in general, a very marginal performance advantage using inverse squared distance weightings. Given the very marginal differences, subsequent sections consider only the inverse distance squared weighting approach.
The 20CRv2c data set comes as an ensemble mean and 56 underlying ensemble members. To determine whether the ensemble members may provide additional valuable context, the similarity to the target station series F I G U R E 3 Stations removed from the ISTI databank in this work, in addition to the NOAA NCEI provided blacklist (black boxes). The stations in red diamonds, predominantly in Europe, are stations that had exactly duplicated data or seasonally repeating offsets. In all such cases, only the longest record was retained. Stations marked in blue triangles, predominantly in North America, are where two or more stations had the same coordinates but not necessarily the same data. If two or more stations had the same coordinates but different data, the station with the longest time series was retained and the shorter time series was deleted. If the two stations had the same coordinates with time series over distinct periods a merge was undertaken of estimates derived from individual ensemble members was compared with that from the ensemble mean for a selected set of stations (Section 5.1). Consistent with Kalman Filter theory, each ensemble member contains the analysed geophysical signal plus noise whose standard deviation is the uncertainty of the ensemble mean. Therefore it is expected that the ensemble mean reanalysis values should yield an estimate that may be better correlated and have lower noise than the individual members. For the selected stations analysed, this holds true in the majority of cases (e.g., Figure 4). Nevertheless, the improvement is not ubiquitous and there may still be value in using the ensemble members for homogenisation, or indeed other applications, in future work. The 20CRv3 product comes as an 80-member ensemble but at the time of instigation of the analysis T A B L E 1 Comparison of interpolated reanalysis minus station difference series correlation and standard deviation using the 20CRv2c ensemble mean product between inverse linear distance and inverse linear squared distance for interpolation for selected stations (Section 5. The difference between the methods on both an individual basis and an aggregate basis is small with a slight overall improvement when using the inverse squared distance approach. Given that this product is the coarsest resolution reanalysis, differences are smaller for other reanalysis products considered (not shown).
only the ensemble mean product was available. Initial inspection confirms that the 20CRv2C ensemble member versus ensemble average behaviour found herein holds also for 20CRv3, as would be expected for the reasons discussed above.

| RESULTS
An evaluation of the applicability of sparse-input reanalysis products to the assessment of homogeneity of individual station series requires an assessment of both individual station correspondence and aggregated spatial differences, under the assumption that after sufficient aggregation station data artefacts, even though individually systematic, become pseudo-random. First, 29 selected station series are considered in-depth. Then, areaaggregated series are examined using Giorgi regions (Giorgi and Francisco, 2000) to subdivide into regionallyaggregated series. Finally, the relative performance is studied in densely-sampled and sparsely-sampled regions.

| Case study stations analysis
A set of 29 carefully-chosen stations was selected in an attempt to ensure a representative sampling that considers the inclusion of urban stations, rural stations, coastal stations, desert stations, high altitude stations, island stations, and densely and sparsely sampled regions. Tropical, mid-latitude, and near-polar regions were approximately equally represented in the selection. The case study stations and a subset of key characteristics are summarized in Table 2 and their locations are shown in Figure 5. To retain the maximum number of stations for use in this part of the analysis, the climatology of the full period of station availability over 1850-2014 was used and not normalized to  or any other rigid 30-year climatology. In the literature, there are several different ways of selecting candidates as a reference series for pairwise homogenisation (Menne and Williams, 2009;Mamara et al., 2012;Wang et al., 2018). These methods often select stations based upon correlation, spatial representativeness, or both. Two selection choices are used here. Firstly, for simplicity, the 25 nearest neighbours are used herein to compare relative performance of reanalysisand neighbour-based approaches (see Figure 5). While less optimized than many state-of-the-art techniques, it mitigates against a metric-based selection of a neighbour F I G U R E 4 An example analysis (Adams central, West Sackets Harour New York state at 43.9 N, 76.0667 W) of correlation (r) and standard deviation (sigma) of the 56 ensemble members of 20CRv2c to determine if the ensemble mean or individual ensemble members are most suitable for further comparison to pairwise homogenisation. The correlation and standard deviation are plotted for the 25 nearest neighbours the three reanalysis products, and the 56 20CRv2c ensemble members. Values closer to [1,0] would constitute increasingly valuable comparators. The study station is in the densely sampled US. The 20CRv2c ensemble mean (black asterisk) is clearly preferable to the underlying ensemble members (red triangles) in this case, and the 20CRv3 ensemble mean (yellow x) is a significant improvement on 20CRv2c mean. For this station, its nearby neighbours still represent a generally preferable estimator based on these similarity metrics F I G U R E 5 Locations of the 29 stations chosen for the case study (red crosses). For pairwise comparison of these stations, their 25 nearest neighbour stations were also selected (blue stars). Note that for some case study stations, particularly in the Southern Hemisphere, there are common neighbours T A B L E 2 A summary of the initial subset of 29 case study stations used herein including their identifier, name, country, number of valid observations, geolocation, regional characteristics and local environment Crop/Grass E3008 station-set that may inadvertently contain data issues of similar structure to those present in the candidate station. Secondly, for completeness, the 25 closest neighbours with at least a 50% data overlap were also retained. Including the 50% data overlap criteria results in an increase in distance between the candidate station and the overall pool of selected neighbours that becomes larger for the longest station records. This increase in the distance has a knock-on effect in decreasing the correlation between the pairs and an increase in the standard deviation in the difference series. In well sampled regions such as North America, these effects are small. However, in the poorly sampled regions of the globe, the increased separation between the pairs (Figure 6) results in a significant cost to the correlation and standard deviation of the difference series, such that the selection of neighbours based on the 50% observation overlap criterion disadvantages pairwise comparison compared to homogenisation by sparse-input reanalysis (compare Tables 3 and 4).
For each of the 29 case-study stations, the performance of the reanalysis-estimate and neighbour-based difference series was compared. Figure 7 shows an example from De Bilt in the Netherlands (the headquarters of KNMI). This series has been well maintained, extends back to prior to the mid-nineteenth century, and is available quasi-continuously through to the present. The difference series to reanalysis (Figure 7 top panel) shows a marked break in the series at around the turn of the twentieth Century relative to both 20CRv2 and 20CRv3. This corresponds to a change in input source in the ISTI databank, although both arise ultimately from KNMI as far as can be ascertained (Rennie, pers. comm.). Detailed KNMI metadata shows that this station was moved 4-5 km from Utretch to De Bilt in 1897. This move is coincident with the break that would be identified as a 'potential' break if only reanalysis data were available. ERA-20C and CERA 20C do not extend further back further than 1900 and therefore cannot identify this break. Only one of the 25 nearest neighbours extends back to earlier than 1950, meaning that using solely the 25 nearest neighbours (Figure 7 middle panel) neighbour-based comparisons are effectively able to elucidate only the latter third of the station series. Even this one series stops prior to the 1897 break. Extending the neighbour search to include stations with >50% overlap (Figure 7 bottom panel) permits pairwise comparisons all the way back to at least 1850. These comparisons support the reanalysis-based estimation of a significant data issue arising around the turn of the twentieth Century in the ISTI databank version of this series. The comparisons using these more distant neighbours, however, show greater variability than using the 25 geographically nearest neighbours (compare the variability around the mean offsets per station in the middle and bottom panels over their common periods) highlighting the inherent trade-off in pairwise neighbour approaches over selecting proximal versus sufficiently overlapping neighbours.
The results shown in Figure 7 and confirmed with KNMI metadata are indicative of a broader issue with neighbour-based homogenisation approaches in that contiguous pairwise comparisons for the whole period of record are rare. Across all 29 case study stations, only two stations had neighbours within their closest 25 with paired comparisons exceeding 1800 months in length. The shortest overlapping record was 5 months between the candidate and a neighbour. This becomes particularly problematic for longer-term analyses as the ISTI databank has relatively few centennial scale station records. In such cases the current ERA-20C and CERA-20C reanalysis products which start at the beginning of the 20th Century may be of lower utility compared to 20CRv2c and 20CRv3 which extend back to the early to mid-19th Century. The use of reanalysis fields has a clear benefit as there is an estimate for each and every time there is an observation over the reanalysis period of record. However, it is not simply data availability that defines the quality of a comparator series for homogenisation. It also matters how well the comparator is correlated with the target station series and what are the standard deviation and autocorrelation of their difference series. These properties will collectively determine the likelihood of being able to detect and adjust for breakpoints in the series (Williams et al., 2012).  Individual correlations between the case study stations and their nearest neighbours vary from near 1 to values of less than 0.1 (Table 4). Neighbour-based pairwise comparison for the case study stations situated in those areas of the globe that are densely sampled generally exhibit high correlations. For example, for USC00300047 (Albany, USA) the correlation between the candidate and each of the 25 nearest neighbours ranges from a high of 0.96 to a low of 0.75. However, the distances between the neighbours and the candidate station are small ranging from 11.5 to 43 km. Conversely, the remote case study stations have correlations that are considerably lower, particularly for island stations in the Pacific Ocean. For example, the nearest neighbouring station to Easter Island is over 2000 km away on French Polynesia and is effectively uncorrelated. The station with the best correlation is Juan Ferandez Island at 3000 km distance and with a correlation value of just 0.25.
Similarly, in densely sampled regions, neighbour difference series generally have low standard deviations. In the more sparsely sampled regions, the standard deviations grow markedly for neighbour-based approaches. This is returned to in Section 5.3.
To assess the potential performance of sparse-input reanalysis datasets against the neighbour-based homogenisation techniques, the correlation and standard deviation of the difference series using the 25 nearest neighbours is compared in Table 3 under the assumption that the median neighbour is a reasonable indicator of the overall performance of the neighbours to detect breaks. This avoids negatively biasing the apparent performance of neighbour difference approaches overall if the neighbour set includes a small number of outlier series. Overall, the case study station analysis shown in Table 3 highlights that at an individual station level across most of the globe, sparse-input reanalysis-based estimates are now broadly comparable in the key metrics of correlation and standard deviation to neighbour-difference series approaches. Table 3 also demonstrates a marked difference between the performance of the ERA-20C and CERA-20C reanalysis. CERA-20C consistently shows a markedly lower apparent agreement with the case study stations. It is beyond the limit of the current analysis to postulate why this may be so, but given the similar version of the ECMWF IFS model versions used it presumably arises from the coupling of the atmospheric model to the ocean reanalysis in some manner. Because of this apparent degradation in performance, ERA-20C was carried forward and the coupled reanalysis of CERA-20C was not used in the remainder of the present analysis.

| Regionally aggregated analyses
While breaks in individual stations will be systematic, when averaged over a sufficient sample size they should become increasingly pseudo-random in nature. Conversely, systematic issues in the reanalyses will tend not to cancel when similarly averaged. Thus an aggregated analysis was performed to elucidate any likely data issues in the sparse-input reanalysis products. This analysis uses the Giorgi regions (Giorgi and Francisco, 2000) and an additional class, Not In Giorgi (NIG), to capture a suite of remote locales. Giorgi regions divide the global land surface area into 21 regions, excluding Antarctica. Figure 8 illustrates the global distribution of ISTI databank stations with greater than 120 months of observations into these regions. This illustrates the uneven distribution of meteorological stations, with fully 68.2% of the ISTI databank stations located in Europe (including the Giorgi Mediterranean region) and the lower 48 states of the USA. Thus 68.2% of the global station network covers only 7.5% of the global land surface.
Regionally aggregated series analyses highlight a shift in 20CRv2c around the early-mid 1940s in N. America (in agreement with Ferguson and Villarini (2012)) and also in many other regions (Figure 9). This abrupt shift is much reduced in both 20CRv3 ( Figure 10) and ERA-20C ( Figure 11). Overall, 20CRv3 shows the best agreement with aggregated station series across most regions of the globe ( Figure 12) and this performance extends far further back in time compared to the prior generation of sparse-input reanalysis products (compare Figure 10 to Figures 9 and 11). This is consistent with what has been observed for more traditional full-input reanalysis products whereby newer versions, learning from prior iterations and benefitting from innovations in data assimilation techniques and improved models, have markedly improved in various metrics relative to previous generations (e.g., Simmons et al., 2017). For 20CRv2c there would be plausible questions about its application for homogeneity assessment prior to the mid-20th Century. In comparison, ERA-20C shows useful performance. However, it is time limited to 1900. In contrast, series from 20CRv3, at least in most regions of the globe, can likely be applied until much earlier and likely to at least 1850 or the instigation of measurements (whichever is the later date).

| Comparison between densely and sparsely sampled regions
The Giorgi region analysis in Section 5.2 highlighted the fact that the vast majority of available monthly-mean T A B L E 4 Comparison of standard deviation and correlation (as in Table 1) based upon the selection of neighbours based on two criteria (1) the selection by distance only, and (2) distance with 50% overlap in observations, which will expand the search region and include, on average, stations further away 25 nearest neighbours by distance in KM only 25 nearest neighbours with (1) at least 50% of time series overlap (2)  station records in the ISTI databank sample only a small percentage of the globe. This is of particular concern because global mean surface temperatures are calculated by area-weighting regional temperature records from a combination of land and marine sources. The influence of an individual station in a sparsely sampled area thus far exceeds the influence of individual stations in richly sampled areas in the calculation of the global mean (Cowtan et al., 2018). It is therefore of great importance that high-quality homogenisation of the stations in the sparsely sampled regions is undertaken. Analyses in the two preceding sub-sections imply that reanalysis-based approaches may have advantages here.
To investigate this further, we have randomly selected 100 stations from those Giorgi regions that can be considered sparsely sampled and 100 stations from those regions that can be considered densely sampled for comparison. The mean distance between the selected stations and their neighbours in richly sampled regions is 79.6 km with a standard deviation of 37.6 km. For poorly sampled regions, the mean distance between a station for homogenisation and its neighbours is 567 km with a standard deviation of 356 km. Prior work has shown that inter-station correlation decreases roughly exponentially with distance with correlation halved on monthly timescales typically within 500 km distance (Hansen and Lebedeff, 1987;New et al., 1999).
We quantify how 20CRv3, which the Giorgi region analysis highlighted constituted the best potential reanalysis product for this task, compares to neighbour-based approaches based upon network sparsity ( Figure 13). Given that the station series are the basic 'raw' ISTI databank monthly-mean data without having had homogenisation or quality control applied, some proportion of this spread will inevitably arise from data issues in the candidate and/or neighbour series. Using the median neighbour distance to indicate network sparsity, there is far less of a marked drop in correlation/increase in standard deviation when using 20CRv3 as the estimator than when using neighbours. In dense regions, it is clear that neighbour-based approaches will tend to have more power (higher correlation, lower standard deviation). Conversely, in sparse regions, the 20CRv3 estimates likely have more power. The cross-over between the two occurs somewhere around the 600-800 km distance to the median neighbour. Furthermore, there is a much reduced gradient in these diagnostics with network density when using 20CRv3, implying that any analysis is likely to be more globally homogeneous in its application using 20CRv3, even if this came at the expense of reduced breakpoint detection power in data dense regions of the globe. Note: The comparison of the differences in sigma ( C), correlation, and distance between the two selection methods are shown in the final three columns.

| DISCUSSION
The global land surface records of monthly mean temperature are not evenly distributed in either space or time.
Much of the ISTI databank of 'raw' data is made up of short-term records, with the majority of stations not extending back before the 1950s. This uneven spatiotemporal distribution creates challenges, not least in the homogenisation of the longest records. This is particularly acute in sparsely sampled regions. All current stateof-the-art homogenisation techniques use some form of a neighbour-based approach. However, these approaches work best in densely sampled regions and for periods where a sufficient number of physically correlated series are available as comparators. Hence, neighbour-based approaches will work best in the recent past and in areas such as Europe, North America, and Japan where a highdensity network of meteorological measurement stations is available.
Herein we have shown that perhaps for the first time, the most recent generation of sparse-input reanalysis products, represented by the NOAA-CIRES-DOE 20CRv3 data set, likely has broadly comparable power to neighbour-based approaches based upon individual station comparisons and regionally aggregated characteristics. The 20CRv3 achieves this while being independent of observational temperature records over land. This independence will be an aid in cases where changes that are broadly consistent in nature apply quasicontemporaneously across broad regions. An example of such a change is the transition from cotton region shelters to maximum-minimum temperature systems (MMTS) sensors across the United States cooperative observer network that occurred over a decade or so in the late 1980s to early 1990s (Quayle et al., 1991). However, the lack of direct use of surface temperature observations means that care is needed to firstly ascertain the quality of the sparse-input reanalysis data. This analysis, building upon precursor analyses (Compo et al., 2013;Parker, 2016;Simmons et al., 2017;Zhou et al., 2018), provides the evidence basis that the quality of 20CRv3 is likely sufficient.
A substantial further advantage in the use of these reanalyses spanning more than a century is the F I G U R E 7 Top panel: Anomaly difference series between the long-running De Bilt series in the Netherlands (although note caveats identified in Figure 1) for the sub-period of record since 1850 and the sparse-input reanalysis-based estimates. Middle panel: anomaly difference series using De Bilt's 25 nearest neighbours. Bottom panel: anomaly difference series using De Bilt's 25 nearest neighbours with a minimum 50% data overlap. Comparisons are now available for the entire post-1850 portion of the De-Bilt data record, but at a cost to correlation and the standard deviation of the difference series (Table 4). Each neighbour difference series is a different colour for illustrative purposes in the middle and lower panels E3014 availability of a comparator series at each and every month for which a temperature value is available in the target station series. Conversely, Figure 14 shows that this is a major challenge for fixed neighbour constellations. For a neighbour set consisting solely of the 25 nearest neighbours, the most frequent number of neighbour observations at any given timestep is zero. The least frequent occurrence is to have all 25 neighbours available.
However, an advantage of neighbour approaches is that multiple independent assessments are possible meaning that if any single comparison is compromised F I G U R E 8 The 27,639 long-term stations in the ISTI dataset, following removal of questionable series as detailed in Section 3, split out into Giorgi region groupings. Note the extra grouping of 'Not in Giorgi' which captures the Antarctic, remote island, and some Arctic sites not included in the original 21 Giorgi regions F I G U R E 9 The median value (50th percentile) of the regionally-aggregated differences series between 20CRv2c ensemble mean and the station anomalies at each timestep aggregated over the Giorgi regions. Each series is vertically offset for clarity. There is a marked degradation in apparent performance over many regions in the mid-20th century. For region definitions see main text F I G U R E 1 0 As Figure 9 but for 20CRv3. The 20CRv3 product shows better performance than either ERA-20C or 20CRv2c across all regions with stable behaviour back to at least 1900 across all regions. The mid-20th century is much more stable than either of the other sparse-input reanalysis products F I G U R E 1 1 As Figure 9 but for ERA-20C which starts in only 1900. This is a clear limitation on the use of ERA-20C compared to the two NOAA sparse-input reanalysis products. Although ERA-20C contains apparent decadal variations in the mid-20th century, the degradation in this case is much less marked than for 20CRv2c in most regions (c.f. Figure 9) by a poor comparator series other independent comparisons can rectify the issue. If, on the other hand, the reanalysis contains a systematic artefact, it is harder to identify and remedy. To try to ascertain the risk of this, robust regionally-aggregated analyses were performed. These assume that station issues will become pseudorandom when averaged over a sufficient sample of stations leaving behind an indicator of regional issues in the reanalysis fields. Such comparisons point to issues in the previous generation of reanalysis products, in agreement with prior analyses (e.g., Ferguson and Villarini, 2012), which are much less evident in the newest 20CRv3 product.
It is also possible to use the ensemble products from the reanalyses which give a population of estimates, although questions as to their dispersiveness may remain. Our analysis of the 20CRv2c ensemble indicates that, as expected from Ensemble Kalman Filter theory (Compo et al., 2011) the ensemble mean tends to be a somewhat better estimate of the station series than the individual ensemble members. This is likely to hold for 20CRv3 which has a larger ensemble size that is designed to be more dispersive to more reliably capture the true climate state. This latter ensemble was not available at the time of the 20CRv2c analysis being performed herein.
This has been the first analysis to directly compare the quality of 20CRv3 to earlier generation products for the ability to estimate observed land surface air temperature series. The interpolated-to-station estimates show improved correlations and reduced standard deviations of station-minus-reanalysis difference series. When aggregated over broad regions, 20CRv3 shows marked improvements in its ability to reproduce regional series behaviour prior to the mid-twentieth F I G U R E 1 3 Neighbour comparisons show a clear tendency for the correlation to decrease and standard deviation to increase with distance, at least over the first 100 km or so between stations, although with noticeable spread. Top Panel is a pooled comparison of the correlations (r) (Blue x's) between each station and its 25 nearest neighbours across both the 100 densely-sampled and sparsely-sampled stations (200 times 25 independent values). Overplotted are correlations between the 20CRv3 product and the candidate series (Red Diamonds). These are each displaced in the xaxis by the distance to the median neighbour such that for stations in densely sampled regions the reanalysis is closer to x = 0 and for progressively sparser station locations the reanalysis estimate is further displaced from x = 0. The bottom panel is the same comparison as in the top panel, but for the standard deviation of the difference series. Neighbour-based pairwise comparisons are likely better when the distance from a test station to its neighbours is less than 350 km and, conversely 20CRv3 reanalysis performs better when the distances are c. 700 km or greater Century, addressing previously stated concerns (Ferguson and Villarini, 2012).
New sparse-input reanalysis products are planned which, as has been the case with full-input reanalysis products (Simmons et al., 2017), will likely yield further substantial improvements. As new generations of sparseinput reanalysis data sets become available, it is thus increasingly probable that they will become an attractive proposition for homogenisation activities of surface air temperatures and potentially other surface meteorological series.
Our analysis points to 20CRv3 becoming potentially advantageous compared to neighbour-based approaches when stations are separated by 700 km or more, while neighbour-based approaches are better at separation distances less than 350 km. This means that 20CRv3 is preferable for about 700 stations and competitive with neighbour-based approaches for a further c. 3,000 out of a total of 28,000+ stations of at least a decade duration in the ISTI databank. However, it is not the number of stations that matters, it is their spatial distribution. Figure 15 illustrates how those stations where 20CRv3 is likely preferable to or competitive with neighbour-based approaches account for the majority of the global land domain. There is clear potential value in using the 20CRv3 data set for homogenisation if the target is a global estimate of changes. Ongoing work is assessing the potential impact upon existing global surface temperature products by applying homogenisation approaches building upon Haimberger et al. (2012) using these series.

| CONCLUSION
Homogenisation of long-term records of land surface temperature is a challenging proposition. A wellcharacterized estimate of the true underlying climate signal is required to separate real geophysical effects from non-climatic artefacts. The current state-of-the-art techniques utilize nearby station neighbours. While the majority of stations are in densely sampled regions where such techniques have proven effective, the vast majority of the global land surface is poorly sampled. Sparse-input F I G U R E 1 4 Histogram of the occurrence of the frequency of overlap between each station and its 25 nearest neighbours aggregated over the poorly sampled candidate station-neighbour pairs top panel and well sampled regions bottom panel and from 1850 to 2012. The most frequent occurrence is for no overlap occurs for both regions and the median value is six out of 25 comparisons being possible at any given timestep in well sampled regions, falling to three in poorly sampled regions F I G U R E 1 5 Stations with 25 or more stations within 350 km radius (yellow) for which pairwise approaches may be preferable. Stations with the 25 nearest neighbours within 700 km (blue) in which pairwise and 20CRv3 based approaches may be of comparable power according to the present analysis. Stations in more data sparse regions (red) which likely will be more amenable to homogenisation using 20CRv3. As successive sparse-input reanalysis products improve over time progressively more points may become blue or red in similar future maps centennial reanalysis products, which are independent of the surface temperature observations, offer an opportunity to address both this issue and the paucity of early records available as comparator series to assess early instrumental series homogeneity. Once interpolated to the point observation, we find that the best current such reanalysis data set -NOAA-CIRES-DOE 20CRv3 has clear potential value whereas earlier products had substantial potential limitations. Use of sparse-input reanalysis products would also offer a valuable methodologically distinct approach which would allow improved exploration of structural uncertainty in reconstructions of global land surface air temperatures.