When dealing with real paleointensity data, parameters such as m, s, and the SE can be estimated from the data. Only in recent times, with the use of DGRF data [Maus et al., 2005], can we obtain values for μ, but values for σ remain unobtainable. In the following discussion we will look at historical data sets where μ can be obtained from DGRF data and make use of the criteria outlined above.
4.1. How Are Real Data Distributed?
 A key issue is how well the considered distributions represent real paleointensity data. The descriptive statistics of a number of historical paleointensity data sets from a range of localities, methods and materials are summarized in Table 1. Biggin et al.  used the Anderson-Darling (AD) test [Anderson and Darling, 1952; Stephens, 1986] to show that three historical data sets could not be distinguished from a normal distribution at the 0.05 significance level. We expand on this approach by considering additional data sets and testing for lognormality (Table 2). In addition, we have used the AD test to calculate the probability that the data sets are normally distributed with m = μ, or that they are lognormally distributed with γ = ln μ (which assumes that the true mean is the median value of the lognormal distribution, which greatly simplifies calculations for γ and θ).
 For all but one data set (the Parícutin data set of A. R. Muxworthy et al. (A Preisach methodology to determining absolute paleointensities: 2. Field testing, submitted to Journal of Geophysical Research, 2010)) the AD test cannot reject the null hypothesis that the data sets have been sampled from continuous lognormal distributions at the 0.05 significance level. With the exception of four data sets (Pick and Tauxe , the Thellier data from Yamamoto et al. , and both data sets from Muxworthy et al. (submitted manuscript, 2010)), all data sets could also be sampled from continuous normal distributions. Considering the probabilities that the data sets are distributed around the expected values (P* values in Table 2), we observe that the data from Hill and Shaw , and the Thellier data from Yamamoto et al.  and Mochizuki et al.  are not normally or lognormally distributed. Two of these data sets are from the 1960 lava flow on Hawaii, which has been noted for yielding absolute paleointensity results that are inconsistent with the expected value [Tanaka and Kono, 1991; Tsunakawa and Shaw, 1994; Hill and Shaw, 2000; Yamamoto et al., 2003]. This may be the result of bias due to the presence of chemical or thermochemical remanent magnetizations [e.g., Tsunakawa and Shaw, 1994; Hill and Shaw, 2000; Yamamoto, 2006; Fabian, 2009]. Mochizuki et al.  noted that their Thellier data are systematically higher than expected and suggested that an inherent rock magnetic property or thermal alteration due to laboratory heating has caused this bias.
 It is worth considering the statistical power of the AD test with respect to the data being analyzed. In general, goodness-of-fit tests lose accuracy with decreasing n. The AD test is no exception. Given the small size of some of the data sets here, some of the probabilities should be viewed with caution. P values (Table 2) were calculated using the asymptotically derived analytical solution for the AD test [Stephens, 1986]. However, no analytical solution is currently available for the P* probabilities, which were therefore estimated using a Monte Carlo approximation with 10⁷ simulations [e.g., Stephens, 1974, 1979]. The effect is that the P* probabilities are poorly constrained close to the tails of the distribution (i.e., P* ≈ 0.05 and P* ≈ 0.95). This is of most concern for us when P* ≈ 0.05, which means that about four of the P* probabilities (representing three data sets) are poorly constrained. Another consideration is the sensitivity of the goodness-of-fit test. The AD test is sensitive to deviations from normality at the tails of the distribution. That is to say, a small number of large outliers can dramatically reduce the calculated probability that the data are normally distributed. Given the nature of paleointensity data, where nonideal behavior can be difficult to exclude from data sets, this is a possibility. On the other hand, the Kolmogorov-Smirnov (KS) test is more sensitive to deviations close to the median value of the distribution (i.e., large numbers of data that deviate from normality close to the mean will reduce the calculated probability). The one-sample KS test for normality and lognormality returns probabilities ≥0.138, using the estimated mean and estimated standard deviation. This provides additional evidence that the data sets could be sampled from either a normal or lognormal distribution at the 0.05 significance level.
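 As an illustration of how a P*-type probability might be approximated by Monte Carlo simulation, the following Python sketch computes the AD statistic for a normal null hypothesis with the mean fixed at the expected value and compares it with simulated values. It is not the procedure used to produce Table 2: the function names are hypothetical, the standard deviation is assumed to be estimated about the fixed mean, and the default number of simulations is far smaller than that quoted above.

```python
import numpy as np
from scipy import stats


def ad_statistic(x, mu):
    """Anderson-Darling A^2 for a normal null with the mean fixed at mu.

    Assumption: sigma is unknown and is estimated about the fixed mean.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    sigma = np.sqrt(np.mean((x - mu) ** 2))
    # Log CDF and log survival function avoid log(0) for extreme points.
    logcdf = stats.norm.logcdf(x, loc=mu, scale=sigma)
    logsf = stats.norm.logsf(x, loc=mu, scale=sigma)
    i = np.arange(1, n + 1)
    # A^2 = -n - (1/n) * sum (2i - 1) [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
    return -n - np.mean((2 * i - 1) * (logcdf + logsf[::-1]))


def ad_pvalue_fixed_mean(x, mu, n_sim=10_000, seed=0):
    """Monte Carlo probability of an A^2 at least as large as that observed."""
    rng = np.random.default_rng(seed)
    a2_obs = ad_statistic(x, mu)
    # A^2 is location and scale invariant here, so standard normal
    # samples with a known mean of zero represent the null distribution.
    sims = rng.standard_normal((n_sim, len(x)))
    a2_sim = np.array([ad_statistic(row, 0.0) for row in sims])
    return float(np.mean(a2_sim >= a2_obs))
```

 Under the lognormal assumption with γ = ln μ, the same functions could be applied to the logarithms of the data, e.g., `ad_pvalue_fixed_mean(np.log(x), np.log(mu))`.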
 For scalar paleointensities, given that the intensity must be >0 for all practical purposes, the distributions must be non-Gaussian. In general, paleointensity data sets could be lognormally distributed (Table 2). However, most data sets cannot be distinguished from a normal distribution. Our simulations indicate that treating lognormal data as if they were normally distributed (i.e., using the arithmetic mean and the standard deviation, equations (1) and (2), respectively) produces statistics that behave in an approximately normal fashion. Importantly, these statistics and probabilities represent best-case scenarios, and in reality the confidence levels of these statistics will be lower. In addition, large deviations or systematic biases due to nonideal paleointensity behavior cannot be identified with these methods, and all statistics of paleointensity data rely on the assumption that such biases can be successfully identified and excluded from final data sets.
4.2. Implications for the Paleointensity Database
 While the SE provides a better estimate of the confidence interval around an estimated mean, the estimated standard deviation, s, remains useful for paleointensity studies. In one respect, s can be viewed as a measure of the fidelity of a paleomagnetic recorder, by accounting for natural (or laboratory induced) variability of paleointensity results from a group of specimens. It should therefore retain its role as a paleointensity data selection criterion. However, additional considerations are necessary if s is to be used in this way.
 The known distribution of sample variances for normally distributed data allows quantification of a confidence interval around s:
$$ s\sqrt{\frac{n-1}{\chi^2_{1-\alpha/2;\,n-1}}} \;\le\; \sigma \;\le\; s\sqrt{\frac{n-1}{\chi^2_{\alpha/2;\,n-1}}} $$
where $\chi^2_{1-\alpha/2;\,n-1}$ and $\chi^2_{\alpha/2;\,n-1}$ are the two-tailed χ2 critical values with (n − 1) degrees of freedom at the (1 − α/2)th and (α/2)th percentiles. As illustrated by Figure 7a, the confidence intervals are large for small n and decrease as n increases. For n = 2, the 95% confidence interval is 0.4s ≤ σ ≤ 31.9s, but for n = 30 the interval is only 0.8s ≤ σ ≤ 1.3s. This quantifies the intuitive notion that s is poorly constrained for small n, for normally distributed data. If we wish to use s as a selection criterion for paleointensity analysis, we need to take into account the high degree of variability of s for small n. That is, criteria, such as δB (%), must have a sample size dependence, the necessity of which can be seen in Figure 6. If a static δB (%) criterion were to be used, as is the case with most previous studies, a data set with n = 2 and s = 15% would be accepted for further analysis along with a data set with n = 30 and s = 15%. In reality for the former data set, at the 95% confidence level, σ could range from 6% to 479%, while for the latter data set, σ will lie within the range 12%–20%. Clearly, the n = 30 data set is more reliable. If we impose δB (%) ≤ 25%, both data sets would be deemed acceptable.
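 The multipliers of s quoted above follow from the χ2 distribution of the sample variance for normally distributed data, with bounds s√((n − 1)/χ2) evaluated at the upper and lower two-tailed critical values. A minimal Python sketch, assuming SciPy is available (the function name is hypothetical), is:

```python
import numpy as np
from scipy import stats


def sigma_confidence_factors(n, alpha=0.05):
    """Factors (lower, upper) such that lower*s <= sigma <= upper*s.

    Assumes normally distributed data, so that (n - 1) s^2 / sigma^2
    follows a chi-square distribution with n - 1 degrees of freedom.
    """
    df = n - 1
    chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)  # (1 - alpha/2)th percentile
    chi2_lower = stats.chi2.ppf(alpha / 2, df)      # (alpha/2)th percentile
    return np.sqrt(df / chi2_upper), np.sqrt(df / chi2_lower)


for n in (2, 30):
    lower, upper = sigma_confidence_factors(n)
    print(f"n = {n}: {lower:.2f}s <= sigma <= {upper:.2f}s")
```

 Multiplying these factors by s = 15% gives approximately the 6%–479% and 12%–20% ranges discussed in the text.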
Figure 7. (a) Upper and lower 95% confidence limits for the estimated standard deviation as a function of n. These limits assume normally distributed data. (b) Sample size–dependent within-site consistency (δBn (%)) threshold values that ensure that the maximum acceptable within-site scatter is ≤25% at the 95% confidence level.
 The ratio √n m/s can be shown to follow a noncentral t distribution with (n − 1) degrees of freedom and noncentrality parameter ϕ = √n μ/σ (Appendix C). This allows a sample size–dependent within-site criterion (δBn (%)) to be defined:
$$ R_{max} = \frac{100\sqrt{n}}{t_{\alpha;\,n-1;\,\phi}}, \qquad \phi = \frac{100\sqrt{n}}{\delta B_n(\%)} $$
where $t_{\alpha;\,n-1;\,\phi}$ is the one-tailed noncentral t critical value at the αth percentile with (n − 1) degrees of freedom and noncentrality parameter ϕ, and where Rmax is the desired maximum acceptable within-site consistency (e.g., the commonly used threshold of ≤25%). This formulation exactly corresponds to the confidence level contours for the normal distribution shown in Figure 6. Because δBn (%) appears within the noncentrality parameter, no unique analytical solution can be derived; however, accurate solutions can be rapidly obtained using a numerical approach. The cutoff values that give 95% confidence that s/m ≤ 25% for normally and lognormally distributed data are shown in Figure 7b. Table 3 provides δBn (%) values for various maximum values of s/m and n, assuming normally distributed data. Implementing a sample size–dependent within-site consistency criterion ensures a consistent confidence level (e.g., 95%) in all selected data. Assuming normality, and choosing a maximum within-site consistency of 25%, this approach gives a cutoff value for n = 2 of δBn (%) ≤ 12.56%, and for n = 30, δBn (%) ≤ 20.43% (Figure 7b and Table 3).
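 One possible numerical approach is sketched below in Python. It assumes that the criterion can be written as Rmax = 100√n/t, where t is the lower αth percentile of a noncentral t distribution with (n − 1) degrees of freedom and noncentrality parameter 100√n/δBn (%). This reading reproduces the n = 2 and n = 30 cutoffs quoted above, but it remains a reconstruction rather than the authors' code. The function name and the root-finding bracket are likewise assumptions.

```python
import numpy as np
from scipy import optimize, stats


def delta_bn_threshold(n, r_max=25.0, alpha=0.05):
    """Sample size-dependent cutoff delta_Bn (%) for normally distributed data.

    Solves F(t_target; n - 1, phi) = alpha for delta_Bn, where F is the
    noncentral t CDF, t_target = 100 sqrt(n) / r_max, and
    phi = 100 sqrt(n) / delta_Bn. This is equivalent to requiring that the
    alpha-th percentile of the noncentral t distribution equals t_target.
    """
    df = n - 1
    t_target = 100.0 * np.sqrt(n) / r_max

    def excess(delta_bn):
        phi = 100.0 * np.sqrt(n) / delta_bn
        return stats.nct.cdf(t_target, df, phi) - alpha

    # Assumed bracket: the cutoff lies between 0.4 * r_max and r_max.
    return optimize.brentq(excess, 0.4 * r_max, r_max)


for n in (2, 5, 10, 30):
    print(f"n = {n:2d}: delta_Bn <= {delta_bn_threshold(n):.2f}%")
```

 The CDF is used instead of the percentile function because only a single root search is then required; both formulations describe the same condition.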
Table 3. Threshold Values for δBn That Ensure a 95% Confidence Level That the Estimated Standard Deviation Is Less Than a Specified Maximum Percentage of the Estimated Mean
 The PINT08 paleointensity database [Biggin et al., 2009] contains 3576 data entries. For the purposes of analyzing long-term global paleointensity variations it is necessary to compare intensities in the form of virtual (axial) dipole moments (V(A)DM). Currently, only 3049 of the PINT08 entries report a V(A)DM. Using only these entries and excluding data entries with n = 1 and data with no reported n or s, 2173 entries remain. If we apply δB (%) ≤ 25%, 1936 entries remain. This is, generally speaking, the extent to which most database analyses go, although some analyses impose restrictions on the paleointensity method used. If we apply the above-described sample size–dependent within-site criterion, δBn (%), 1560 data entries are left, which represents ∼44% of all available data. This is a further reduction of ∼12% when compared to using the δB (%) criterion. The result of this pruning of the database, however, is that we have a consistent confidence in the remaining data, despite having variable n. The application of this new criterion does not greatly change the general long-term trends in geomagnetic field intensity variation (Figure 8a). It does, however, exacerbate the problem of scarce data in certain time periods: no data are available in the Middle to Upper Triassic (244–202 Ma) and only two data points pass the δBn (%) criterion from the Lower Devonian to the end of the Proterozoic Eon, from ∼524–407 Ma. A more detailed view of the number of data accepted before and after applying the δBn (%) criterion is shown in Figure 8b.
Figure 8. (a) Average V(A)DM during the Phanerozoic determined from the PINT08 data set (green), after application of δB (%) ≤ 25% (blue), and after application of δBn (%) ≤ 25% (red). Average V(A)DMs are calculated for 5 Myr bins. Lines indicate consecutive bin averages. (b) Number of data points per bin. The scale has been truncated at 160 points per bin for clarity. This excludes only the first bin (0–5 Ma) which includes 1324 points from PINT08, 785 after applying the δB (%) criterion, and 621 after applying the δBn (%) criterion.
4.3. How Many Samples Are Enough?
 Determining the optimal number of samples for a paleointensity study is often a subjective decision that depends on the degree of confidence required for the study in question. As outlined above, as many as 24 samples would be the optimal minimum number, but this is rarely achievable. When only one data point is available, no information can be obtained to quantify the uncertainty. Therefore, a minimum of n = 2 should be used. This at least allows calculation of s and quantification of a confidence interval, despite this interval being large. However, investigators should aim to maximize the number of successful results by collecting as many paleomagnetic samples as possible per unit investigated. Studies that collect only a few paleomagnetic samples per unit (i.e., 10 or fewer) are most likely to produce data sets that have large or unquantifiable confidence intervals. Given that paleointensity studies can have high failure rates, as many as 30–40 paleomagnetic samples should be collected per unit.
4.4. Comparison of Confidence Intervals
 When applied to real data sets, how well do the confidence intervals defined by the SE compare to other methods of estimating confidence intervals? The uncertainty interval defined by the estimated standard deviation, and the confidence intervals defined by the standard error (t × SE) and estimated by a nonparametric statistical bootstrap for the data sets in Table 1 are summarized in Table 4. Both t × SE and the bootstrapped confidence limits reflect the 95% confidence level, while the uncertainty interval of the standard deviation, under ideal circumstances, reflects ∼68% coverage (i.e., ∼68% of the data will fall within ±1s of the estimated mean). Two standard deviations, which should represent ∼95% coverage, are also included in Table 4; however, 2s is rarely used in paleointensity studies. The uncertainty intervals defined by the estimated standard deviation and the confidence interval defined by t × SE involve the assumption that the data sets are normally distributed. The bootstrapped confidence intervals involve no assumptions about the distribution of the data sets.
 Uncertainty intervals defined by the estimated standard deviation include the true mean for 12 of the 18 data sets investigated. This uncertainty interval fails when there is a bias in the data [e.g., Hill and Shaw, 2000] or when the data set contains few values [e.g., Michalk et al., 2008]. The 2s uncertainty intervals include μ in all cases, but in some instances 2s defines a range of ±50 μT (e.g., the Vesuvius data of Muxworthy et al. (submitted manuscript, 2010)). In addition, it is unlikely that the estimated standard deviation will represent a consistent confidence level for data sets with n < 7 (Figure 4). Therefore, for at least six data sets the estimated standard deviation does not provide 95% coverage (Table 4). The t × SE confidence intervals include the true mean for 13 of the data sets and include the true mean when n is small. Four of the five data sets for which the t × SE confidence interval does not include μ are rejected by the AD test as being normally or lognormally distributed about the expected means at the 0.05 significance level. This suggests that there may be a bias in the data sets as noted by the authors [Hill and Shaw, 2000; Yamamoto et al., 2003; Mochizuki et al., 2004]. For these data sets, ±1s also fails to include the true mean. Rolph  noted that the paleointensity results from the 1971 lava flow from Mt. Etna may be affected by chemical remanent magnetization. Despite having relatively large n (≥7), these five data sets yield inaccurate results (intensity error fraction, ∣IEF∣ ≥ 10.7% (Table 1)).
 The statistical bootstrap confidence intervals were determined using a bias-corrected accelerated bootstrap method [Manly, 2007] with 10⁶ repeat samplings to define the 95% confidence interval around the mean (Table 4). The bootstrap method consistently fails to yield confidence intervals that include the true mean. It has been noted by others that the bootstrap method can underestimate the uncertainties of data sets with few values [e.g., Schenker, 1985]. A comparison between bootstrap and t × SE confidence intervals from a Monte Carlo analysis of a normal distribution suggests that 20 point values are required for the bootstrap confidence interval to be within 10% of that defined by t × SE, and as many as 40 point values are needed to reduce this to within 5%. This makes bootstrapped confidence intervals unsuitable for most paleointensity data sets.
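 For readers wishing to make a similar comparison, the following Python sketch computes a t × SE confidence interval and a bias-corrected accelerated (BCa) bootstrap interval for the mean of a single data set. It relies on `scipy.stats.bootstrap`, which may differ in detail from the implementation used for Table 4. The function names are hypothetical and the default number of resamplings is much smaller than that used above.

```python
import numpy as np
from scipy import stats


def t_se_interval(x, confidence=0.95):
    """Confidence interval m +/- t * SE, assuming normally distributed data."""
    x = np.asarray(x, dtype=float)
    n = x.size
    m = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2, n - 1)  # two-tailed critical value
    return m - t_crit * se, m + t_crit * se


def bca_interval(x, confidence=0.95, n_resamples=9999):
    """BCa bootstrap confidence interval for the mean (no distribution assumed)."""
    res = stats.bootstrap(
        (np.asarray(x, dtype=float),),
        np.mean,
        confidence_level=confidence,
        n_resamples=n_resamples,
        method="BCa",
    )
    return res.confidence_interval.low, res.confidence_interval.high
```

 Applying both functions to repeated draws from a normal distribution for a range of n would allow the relative widths of the two intervals to be compared, in the manner of the Monte Carlo analysis described above.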