4.1. Correcting the HadAT0 Stations
 A seasonal mean difference series was calculated for each station series at each level: station time series minus neighbor time series. If the target station series is a realistic representation of the true climate evolution, and the neighbor series is similarly free of systematic biases, this difference series will be indistinguishable from white noise with a zero mean. This is the basic assumption of all climate anomaly homogeneity approaches [Conrad and Pollak, 1962]. The main interest is in long-term trends, so the primary aim was to identify and adjust for systematic changes. A nonparametric Kolmogorov-Smirnov test [Press et al., 1992] (KS-test) was passed through the difference series to identify suspected breakpoints. This test can be interpreted as returning the probability that two populations arise from the same distribution. The KS-test was applied to each time series with a 15-season window either side of the current point. Cases with a probability of 0.1 (the 10% level) or lower were highlighted as suspected breakpoints (Figures 2 and 3). Note that a nonparametric test is weaker (will yield fewer suspected breakpoints) than a parametric test, e.g., a Student's t-test.
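The windowed test described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the significance uses the asymptotic series form given in Press et al., and missing seasons are assumed to have been removed or filled beforehand.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past every value equal to x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_probability(d, n1, n2):
    """Approximate probability that both samples share one distribution."""
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0  # the series converges poorly here; the probability is ~1
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-10:
            break
    return min(max(total, 0.0), 1.0)

def sliding_ks(diff, window=15):
    """KS probability at each time step, comparing `window` seasons either side.

    Points without a full window on both sides are left as None (assumption;
    the paper instead treats the record ends more conservatively).
    """
    probs = [None] * len(diff)
    for t in range(window, len(diff) - window + 1):
        d = ks_statistic(diff[t - window:t], diff[t:t + window])
        probs[t] = ks_probability(d, window, window)
    return probs

def suspected_breakpoints(probs, level=0.1):
    """Indices whose probability is at or below the chosen level."""
    return [t for t, p in enumerate(probs) if p is not None and p <= level]
```

Applied to a difference series with a clear step, `suspected_breakpoints(sliding_ks(diff))` returns a cluster of indices around the step rather than a single point, which is one reason the procedure relies on expert judgment and metadata to fix the exact date.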
Figure 2. Time series plots for station 8495 (Gibraltar) (left) before and (right) following the QC procedure. Plots are for 9 levels (30 hPa to 850 hPa). Each plot shows station time series (blue), neighbor series (green), and difference series (black). All time series have had a simple seven-point filter applied. For levels above 300 hPa the y axis range is −4 to 4 K, and below is −2 to 2 K. Superimposed on each plot are static metadata events (black crosses). The KS-test statistic results are denoted by vertical bars for differing probabilities below 0.1 (<0.01 red, >0.01 and <0.05 orange, >0.05 and <0.1 yellow). The metadata and KS-test indicators taken together with the time series characteristics were used to guide expert judgment as to the locations of breakpoints. Figure 2 (right) additionally shows blue crosses where deletions were implemented and vertical blue bars where adjustments were applied. Note that several iterations of the procedure were performed and at these intermediate steps additional breakpoints may have been identified as the station and neighbors series were made more homogeneous.
 For each station, including LKS and GUAN stations, a plot similar to that in Figures 2 (left) and 3 (left) was produced. Figures 2 and 3 are for two stations randomly chosen to illustrate our procedures. On the basis of these plots PWT identified times where the KS-test identified a vertically coherent jump point in the difference series. The station series and neighbor series helped in deciding whether a break point resulted from problems in the station or the neighbors. Only in a handful of cases were the neighbors deemed to be the most likely cause. Having identified suspected breakpoints, recourse was made to available metadata (Gaffen  and updates) to try to determine an exact date. This was limited to static metadata change point events, i.e., those given a definite timing. If PWT decided there was sufficient evidence for a break in the station time series then a breakpoint was assigned and adjustments implemented as well as noting the metadata event, if any. Inevitably this step required subjective judgment. As it is informed by quantitative measures and knowledge of metadata events (where available) and factors which might impact the difference series (ENSO, explosive volcanic eruptions, etc.), it need not add any significant overall bias.
 A bootstrap type approach was used to estimate the required adjustment factor at each breakpoint. Adjustments at each level were defined as the change in the mean of the difference series between the ten years before and after, or a shortened period so as not to overlap with the next breakpoint. To verify this adjustment factor 1000 additional estimates were created. A random number generator was used to define what proportion, up to 40%, of values to omit from the neighbor difference series. This proportion was calculated independently either side of the break point, e.g., 5% could be dropped from one side and 25% from the other for a given estimate. A second random number generator provided an index of times to be dropped. These subsampled series were used to create an estimate of the required adjustment factor. By randomly dropping values, bimodal or multimodal distributions result if there are dubious value(s) present as these bias the solutions only when they are included.
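The subsampling scheme above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, a single seeded generator stands in for the two random number generators mentioned in the text, the shift is taken as the mean after the break minus the mean before it, and the inputs are the difference-series values in the (up to ten-year) windows either side of the breakpoint.

```python
import random
import statistics

def mean_shift(before, after):
    """Change in the mean of the difference series across the breakpoint."""
    return statistics.fmean(after) - statistics.fmean(before)

def subsample(values, rng, max_drop=0.4):
    """Drop a random proportion (up to max_drop) of values at random times."""
    drop_fraction = rng.uniform(0.0, max_drop)
    n_drop = int(drop_fraction * len(values))
    dropped = set(rng.sample(range(len(values)), n_drop))
    return [v for i, v in enumerate(values) if i not in dropped]

def adjustment_estimates(before, after, n_estimates=1000, seed=0):
    """Initial estimate plus a population of subsampled estimates.

    The proportion dropped is drawn independently for each side of the break.
    """
    rng = random.Random(seed)
    initial = mean_shift(before, after)
    estimates = [
        mean_shift(subsample(before, rng), subsample(after, rng))
        for _ in range(n_estimates)
    ]
    return initial, estimates
```

A single dubious value in either window shifts only those estimates whose subsample retains it, which is what produces the bimodal or multimodal populations described in the text.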
 A number of checks were performed on the population of adjustment factor estimates:
 1. The first check was to ensure that the adjustment factor is significantly nonzero: Are the 5th and 95th percentiles of the adjustment estimates distribution of the same sign?
 2. The second check was to test whether the population of estimates is normally distributed: (1) Is the 1st (99th) percentile within 1.5 ± 0.4 times the 5th (95th) percentile distance from the median? (2) Are the 5th and 95th percentiles approximately equidistant from the median value? (3) Are the initial estimate and the median of the population within 0.03 K or 25% of the absolute value of the median adjustment?
 3. The third check was for grossly erroneous values: Are all absolute seasonal difference values <4 K?
 If all three tests passed then the median value was used as the best guess adjustment factor.
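The three checks can be sketched as a single function. This is a hedged illustration: the function and key names are hypothetical, the "approximately equidistant" tolerance (`symmetry_tol`) is not given in the text and is an arbitrary assumption, and the "0.03 K or 25%" criterion is read here as passing if either bound is met.

```python
import statistics

def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def check_estimates(initial, estimates, diff_values, symmetry_tol=0.25):
    """Return the outcome of each check on the adjustment estimates."""
    s = sorted(estimates)
    p1, p5, p95, p99 = (percentile(s, q) for q in (1, 5, 95, 99))
    med = statistics.median(s)
    lower, upper = med - p5, p95 - med
    spread_ok = lower > 0 and upper > 0

    # Check 1: 5th and 95th percentiles share a sign (significantly nonzero).
    nonzero = p5 * p95 > 0
    # Check 2(1): 1st/99th percentile distances are 1.5 +/- 0.4 times the
    # 5th/95th percentile distances from the median.
    tails = (spread_ok
             and 1.1 <= (med - p1) / lower <= 1.9
             and 1.1 <= (p99 - med) / upper <= 1.9)
    # Check 2(2): 5th and 95th percentiles roughly equidistant from the
    # median (tolerance is an assumption, not taken from the paper).
    symmetric = spread_ok and abs(lower - upper) <= symmetry_tol * max(lower, upper)
    # Check 2(3): initial estimate close to the median of the population.
    consistent = abs(initial - med) <= max(0.03, 0.25 * abs(med))
    # Check 3: no grossly erroneous seasonal difference values (all < 4 K).
    no_gross = all(abs(v) < 4.0 for v in diff_values)

    passed = nonzero and tails and symmetric and consistent and no_gross
    return {"nonzero": nonzero, "tails": tails, "symmetric": symmetric,
            "consistent": consistent, "no_gross": no_gross,
            "passed": passed, "median": med}
```

When `passed` is true, `median` would serve as the best guess adjustment factor; otherwise the difference series would go back for manual inspection as described in the following paragraph.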
 If any of the tests failed then any values deemed by PWT to be obviously dubious in the context of the rest of the difference series were deleted. If values were deleted then the adjustment calculation procedure was repeated. In total, of order 1–2% of seasonal values were deleted. Soviet data until the mid-1960s were found to be highly suspect in the winter season at all heights, but particularly in the stratosphere (Figure 4, left). The absolute differences from the neighbor composite series were often >10 K (the time series shown are temporally smoothed), whereas subsequently they were generally within the range ±2 K. A number of stations from developing countries were also particularly poor. Conversely, relatively few deletions were made for U.S., Canadian, Australian, Japanese, and NW European series.
 Only significantly nonzero change points were adjusted. Implementing small and insignificant adjustments could artificially redden the spectrum by adding spurious step changes to the time series. Adjustments were applied as seasonally invariant changes to all points in a station time series before the break point.
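Applying an adjustment in this way amounts to a constant offset on the earlier segment. The sketch below is illustrative only: the function name is hypothetical, missing seasons are represented as `None`, and the sign convention (the shift is added to points before the break so that they line up with the later record) is an assumption.

```python
def apply_adjustment(series, break_index, shift):
    """Add a seasonally invariant shift to every point before the breakpoint.

    Missing values (None) are left untouched; points at or after
    break_index are unchanged.
    """
    adjusted = []
    for i, value in enumerate(series):
        if value is not None and i < break_index:
            adjusted.append(value + shift)
        else:
            adjusted.append(value)
    return adjusted
```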
 Once decisions for all HadAT0 stations regarding adjustments/deletions had been made, they were implemented and the seasonal climatologies recalculated. The adjusted series were then used to create new neighbor composite and difference series and the quality control procedure repeated. On the first iteration, only breakpoints which PWT assessed as very definite breaks in the station data were adjusted, to minimize the chances of aliasing spurious neighbor series trends into the adjustments. In subsequent iterations all suspected breakpoints were considered and, where significant, adjusted. Once a station had no adjustment or deletion applied on a given iteration of the procedure it was considered homogeneous and no longer a candidate for future adjustment. This prevented the procedure from forcing each station series to become identical by iterating indefinitely. The entire QC procedure was carried out a total of five times, after which PWT decided that convergence had been attained. We caution that another expert or group of experts (e.g., the LKS approach) may have reached different decisions in performing this QC so there are questions as to repeatability.
 Following QC a final check for outliers was performed removing all values greater than 3.5σ in the homogenized difference series from the target station series. This led to the further removal of 0.05% of points. Some of these values might be real extreme events. However, the primary interest is in characterizing the long-term behavior of upper air temperatures. Hence it is more important to remove erroneously large anomalies which could have a disproportionate influence. The approach may artificially reduce the interannual variability.
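The final outlier screen can be sketched as below. This is a minimal sketch under assumptions: the function name is hypothetical, σ is taken as the population standard deviation of the valid difference values, departures are measured from the mean of the difference series, and removed station values are set to `None`.

```python
import statistics

def remove_outliers(station, diff, n_sigma=3.5):
    """Blank station values whose difference-series value exceeds n_sigma * sigma.

    Returns the cleaned station series and the indices that were removed.
    """
    valid = [d for d in diff if d is not None]
    mean = statistics.fmean(valid)
    sigma = statistics.pstdev(valid)
    cleaned, removed = [], []
    for i, (s, d) in enumerate(zip(station, diff)):
        if d is not None and abs(d - mean) > n_sigma * sigma:
            cleaned.append(None)
            removed.append(i)
        else:
            cleaned.append(s)
    return cleaned, removed
```

Because σ is computed with the outliers still included, a very large anomaly inflates the threshold; a robust spread estimate would be a stricter alternative, though the text does not say that one was used.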
 For HadAT0 stations at all pressure levels the final difference series is closer to random noise around zero than the initial version (e.g., Figures 2, 3, and 4). The homogenized time series yield KS-test results that are approximately normally distributed, whereas the raw data KS-test results are highly negatively skewed implying the presence of discontinuities in these data (Figure 5).
Figure 5. Summary of KS-test results before the first iteration and following completion of HadAT1. For each station at each time step the results for all levels have been multiplied together and then renormalized by taking the power 1/n, where n is the number of pressure levels with a KS-test result. Probabilities are of the truth of the null hypothesis of no breakpoint, so that low probabilities suggest a discontinuity. The statistic is constrained to lie between 0 and 1 and would have a mean of 0.5 for simple white noise. Taking the geometric mean reduces this expectation to c. 0.4 if there are nine levels with data upon which the test is performed.
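The combined statistic in the caption is a geometric mean of the per-level probabilities. The sketch below uses hypothetical function names and, for the expectation, assumes that the per-level probabilities are independent and uniformly distributed on [0, 1] under the null hypothesis. In that case E[U^(1/n)] = n/(n+1) for each level, so the expected geometric mean is (n/(n+1))^n, which for n = 9 is about 0.39, consistent with the c. 0.4 quoted.

```python
import math

def combined_probability(level_probs):
    """Geometric mean of the KS probabilities over levels that have a result."""
    valid = [p for p in level_probs if p is not None]
    if not valid:
        return None
    return math.prod(valid) ** (1.0 / len(valid))

def expected_under_white_noise(n_levels):
    """Expected geometric mean of n independent uniform(0, 1) probabilities."""
    return (n_levels / (n_levels + 1)) ** n_levels

print(expected_under_white_noise(1))  # 0.5 for a single level
print(expected_under_white_noise(9))  # about 0.387 for nine levels
```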
 As the iterations proceeded fewer breakpoints were identified (and slightly fewer levels were adjusted per breakpoint) and the magnitudes of the adjustments decreased (Table 2). The distribution of absolute adjustment factors is highly positively skewed: there were a large number of relatively small adjustments and a small number of very large adjustments, especially in the initial iteration. There is little indication of a systematic sign of the adjustments; given the methodological approach this is not surprising. Although there are large variations between stations, it is striking how invariant the average number of breakpoints identified per station is across WMO regions (Table 3). For any station PWT identified on average 6 breakpoints over the 45 year period (1.3 per decade, although many stations are incomplete). The mean and median absolute adjustment factors applied are also similar except for North America and the Pacific region where they are lower, reflecting the traditionally higher quality stewardship by U.S. operators. The frequency of breakpoints identified is reduced at the ends of the record as a result of the reduced power of the KS-test and more conservative breakpoint identification undertaken by PWT when fewer than 15 seasons are available before or after the time step. Over the rest of the record there is little variation by decade. Station practices have been forecasting- rather than climate-driven and numerous changes to procedure have been and continue to be made.
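Table 3. Summary Statistics From Our QC of HadAT1 Stations
|WMO Region||Number of Stations in HadAT1||Breakpoints Identified, Period 1 (Average Per Station)||Breakpoints Identified, Period 2 (Average Per Station)||Breakpoints Identified, Period 3 (Average Per Station)||Breakpoints Identified, Period 4 (Average Per Station)||Breakpoints Identified, Period 5 (Average Per Station)||Breakpoints Identified, Total (Average Per Station)||Mean of All Absolute Adjustment Factors, K||Median of All Absolute Adjustment Factors, K||Standard Deviation of All Absolute Adjustment Factors, K|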
|Europe (01–19)||70||86 (1.2)||106 (1.5)||125 (1.8)||100 (1.4)||15 (0.2)||432 (6.2)||0.553||0.444||0.430|
|Russia (20–39)||142||164 (1.2)||249 (1.8)||269 (1.9)||204 (1.4)||22 (0.2)||908 (6.4)||0.555||0.468||0.368|
|Asia (40–49)||67||73 (1.1)||111 (1.7)||101 (1.5)||106 (1.6)||26 (0.4)||417 (6.2)||0.510||0.371||0.468|
|Africa||14||6 (0.4)||29 (2.1)||26 (1.9)||22 (1.6)||1 (0.1)||84 (6.0)||0.575||0.421||0.464|
|North America||121||162 (1.3)||199 (1.6)||191 (1.6)||176 (1.5)||25 (0.2)||753 (6.2)||0.396||0.319||0.390|
|South America||21||14 (0.7)||43 (2.0)||37 (1.8)||25 (1.2)||1 (0.0)||120 (5.7)||0.510||0.400||0.392|
|Pacific area||42||45 (1.1)||81 (1.9)||79 (1.9)||59 (1.4)||7 (0.2)||271 (6.5)||0.446||0.352||0.347|
 Particularly outside of developed nations there were few metadata, so most breakpoints identified (c. 70%) had no accompanying metadata (Table 4). This is a major impediment to the unambiguous identification and removal of nonclimatic influences. A subset of stations with seemingly complete metadata (the exception rather than the rule) from a range of countries yields an average of 13 metadata events per station over the HadAT period. So the average number of adjustments applied here per station may be an underestimate of the pervasiveness of nonclimatic influences and the series may retain heterogeneities. Alternatively, many metadata events may lead to no discernible influence on long-term continuity of the station records. Most metadata associated with the breakpoints were documented as either a change to the basic sonde model (or one or more of its components) or a change in the calculation methods, primarily how radiation effects were removed. The resulting data set is HadAT1.
Table 4. Summary of Metadata Events Associated With Breakpoints Adjusted in the HadAT1 Station Set
|Metadata Event Type Associated With Breakpoint||Number of Breakpoints||Percentage of Total|
|No known event||2088||69.95|
|Radiosonde model change||522||17.49|
|Humidity sensor change||116||3.89|
|Ground equipment replacement||53||1.78|
|Radiation corrections applied changed||38||1.27|
|Cutoffs for data changed||23||0.77|
|Cord length change (Japan only)||16||0.54|
|Observations time change||6||0.20|
|Wind speed measurements||5||0.17|
|Station operator change||1||0.03|
4.2. Expanding the Station Network
 Having homogenized HadAT0 stations to form HadAT1, those stations which were initially deemed to be insufficiently similar to the LKS/GUAN network were reconsidered. Adjusted HadAT1 stations were used to create neighbor composites for these stations, relaxing the stratospheric requirement so that all HadAT1 stations, which were now homogenized, contributed. Hence the neighbor series were not updated upon the completion of each iteration of the QC procedure. In all other respects the methodology was identical to that employed for the HadAT1 stations.
 Of the remaining stations for which it was possible to calculate a climatology, 199 were adjusted. The rest were either deemed by PWT to be too heterogeneous, without sufficient neighbors, or contained limited data for two levels at most. A total of four iterations were required to homogenize these series. The homogenized series pooled with the HadAT1 station series produce HadAT2.
 Previous investigations by LKS and Parker et al.  concluded that Indian station data are highly dubious. However, 15 Indian stations qualified for HadAT2 (Figure 1). These series did indeed exhibit large heterogeneities, having on a national average the largest discrepancies vis-à-vis the neighbor composites. Nevertheless, it was relatively simple to identify breakpoints, many of which correlated with the available metadata. We see no compelling reason why the adjusted Indian data should not reflect the true long-term behavior, so long as the HadAT approach is sufficiently powerful and unbiased. Figure 6 gives temperature time series before and after adjustments for an example Indian station (cf. Figures 2, 3, and 4), showing that the most pervasive breakpoints have seemingly been removed.