The methods used by this study explicitly acknowledge the limitations of analysing percentage-frequency data using traditional multivariate statistical techniques. An alternative approach is offered by a suite of techniques known collectively as ‘compositional data analysis’. Compositional data analysis has been applied to particle-size data for over 30 years, yet remains a comparatively unknown approach within the field of sedimentology (Aitchison, 1982, 1986, 1999, 2003; von Eynatten, 2004; Weltje & Prins, 2003; Jonkers et al., 2009; Tolosana-Delgado & von Eynatten, 2009, 2010; Weltje & Roberson, 2012). Many other examples of compositional data analysis exist in geoscience, including: mineral compositions of rocks (Thomas & Aitchison, 2006; Weltje, 2002, 2006), pollutant profiles (Howel, 2007), pollen populations (Jackson, 1997) and trace element compositions (von Eynatten et al., 2003). Initially, it is helpful to provide a brief introduction to compositional data analysis and its relevance to inter-instrument comparison of particle-size analysers using a short example.
The mathematical properties of particle-size distributions
To appreciate the best way to analyse particle-size distributions, expressed as discrete particle-size percentage values, it is necessary to highlight some of their fundamental (often obvious) mathematical properties.
- Particle-size distributions contain relative information about the proportions of different particle sizes in a sediment sample.
- Particle-size categories are expressed as percentages, so are greater than or equal to zero and less than or equal to 100%.
- Particle-size distributions sum to 100%.
These properties are very useful in that they normalize measurements of particle mass or particle volume, allowing for widespread data comparison; however, they also represent serious obstacles to multivariate statistical analysis and subsequent data interpretation. A data set of 100 simulated size-frequency distributions is presented as: (i) a percentage-frequency plot (Fig. 1); and (ii) ternary diagrams (Fig. 2A and C). These data are used below to illustrate that these mathematical properties act as constraints, limiting the extent to which standard statistical tests can be applied to particle-size distributions.
Figure 1. A size-frequency biplot of 100 simulated particle-size distributions (grey). The mean and upper and lower 95% confidence levels are plotted respectively as solid, dot-dashed and dashed lines. These levels have been calculated using percentage-frequency data (red) and log-ratio data (black). The upper and lower 95% confidence levels calculated from percentage data violate the constant-sum constraint, summing respectively to 197·5% and 2·5%. The lower confidence intervals calculated from percentage data are negative, falling outside the parameter space. Confidence intervals calculated from log-ratio values obey both constant-sum and parameter space constraints. Both confidence intervals are predictions of feasible particle-size distributions.
Figure 2. Simulated particle-size distributions (n = 100) summarized by their relative proportions of sand, silt and clay, plotted as: (A) A ternary diagram with hexagons of variation delineating confidence regions of the population at 90%, 95% and 99% confidence levels. The red cross indicates the population mean. (B) A three-dimensional scatterplot of centred log-ratio transformed values showing confidence regions of the population at 90%, 95% and 99% confidence levels. Confidence regions for three orthogonal axes are calculated as a volume, but are shown here as two-dimensional areas for graphic simplicity. (C) A ternary diagram with log-ratio confidence regions of the population at 90%, 95% and 99% confidence levels [plotted in (B)] transformed back into percentage-frequency values. The red cross indicates the population mean. In contrast to (A), all areas of the confidence region fall within the parameter space (0 ≤ x ≤ 100).
The first constraint is that each part of a particle-size distribution must be considered in relation to all its other parts; this is because a change in one part of a distribution automatically results in an inverse change in all the other parts of the distribution. For example, if there is an absolute increase in the mass of silt in one sample compared to another sample, the relative proportions of sand and clay will decrease, even if the mass of both of these size fractions remains constant. The lengthy description of the relative proportions of each part of a particle-size distribution can be avoided if log-normal distribution coefficients are used instead (Folk & Ward, 1957; Inman, 1952; Evans & Benn, 2004). Comparisons between particle-size distributions are quantified most commonly using mean, standard deviation, skewness and kurtosis statistics. The limitations of this approach have been documented by several authors (Bagnold & Barndorff-Nielsen, 1980; Fredlund et al., 2000; Fieller et al., 1992; Beierle et al., 2002; Friedman, 1962), the most fundamental of which is that particle-size distributions are frequently not log-normal. Figure 3 illustrates how a series of markedly different multimodal distributions can have identical log-normal distribution coefficients (mean and standard deviation). Quantifying the analytical precision of an instrument is clearly problematic if the statistics used are unable to differentiate between particle-size distributions that are evidently dissimilar. Alternative probability distribution functions (for example, log-hyperbolic and skew-Laplace) have been suggested to circumvent this issue (Bagnold & Barndorff-Nielsen, 1980; Fieller et al., 1992), but the limitations of non-uniqueness are still applicable. Moreover, all distribution function statistics mask potentially important variations in empirical data.
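The closure effect in the silt example above can be illustrated with a minimal sketch. The masses are hypothetical values chosen purely for illustration, and `close` is an illustrative helper name rather than a function from any particular package.

```python
def close(parts, total=100.0):
    """Rescale non-negative parts so that they sum to `total` (closure)."""
    s = sum(parts)
    return [total * x / s for x in parts]

# Hypothetical masses in grams (sand, silt, clay); illustrative values only.
sample_1 = [40.0, 35.0, 25.0]
sample_2 = [40.0, 70.0, 25.0]  # silt mass doubled; sand and clay unchanged

pct_1 = close(sample_1)
pct_2 = close(sample_2)

# The ratio of sand to clay is unaffected by the change in silt.
ratio_1 = pct_1[0] / pct_1[2]
ratio_2 = pct_2[0] / pct_2[2]

print(pct_1)
print(pct_2)
print(ratio_1, ratio_2)
```

In this sketch the sand percentage falls from 40% to about 29·6% and the clay percentage from 25% to about 18·5%, although neither mass has changed, whereas the sand-to-clay ratio is the same in both samples. This is the sense in which ratios between parts carry the information that percentages obscure.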
One means of avoiding the pitfalls of distribution function statistics has been to compare complete particle-size distributions using factor analysis (Syvitski, 1991; Stauble & Cialone, 1996). Unfortunately, in most cases this approach is not valid because percentage values in particle-size distributions are subject to bias (Falco et al., 2003). The source of this bias is detailed by the second constraint acting on particle-size distributions and is described below.
Figure 3. Semi-log plots of randomly simulated size-frequency data. Each subplot contains a unimodal, bimodal, trimodal and quadrimodal frequency distribution with identical log-normal mean and standard deviation values. These plots illustrate one of the potential problems involved with using log-normal distribution coefficients to describe complex multimodal particle-size distributions.
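The non-uniqueness of the mean and standard deviation can be reproduced with a small numerical sketch. The two distributions below are hypothetical and are not the simulated data of Fig. 3; the statistics are computed by the method of moments on an assumed phi-scale class grid, not by graphical measures.

```python
import math

def moments(phi, pct):
    """Mean and standard deviation (method of moments) on the phi scale."""
    total = sum(pct)
    mean = sum(x * w for x, w in zip(phi, pct)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(phi, pct)) / total
    return mean, math.sqrt(var)

# Assumed class mid-points (phi units) and two hypothetical distributions (%).
phi = [1.0, 2.0, 3.0, 4.0, 5.0]
unimodal = [12.0, 22.0, 32.0, 22.0, 12.0]  # single mode at 3 phi
bimodal = [10.0, 30.0, 20.0, 30.0, 10.0]   # modes at 2 phi and 4 phi

mean_u, sd_u = moments(phi, unimodal)
mean_b, sd_b = moments(phi, bimodal)

print(mean_u, sd_u)
print(mean_b, sd_b)
```

Both distributions return the same mean and standard deviation despite having clearly different shapes, so these two statistics alone cannot distinguish between them.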
The second constraint is that percentage-frequency data occupy a mathematically limited space, 0 ≤ x ≤ 100. This has important implications for applying regression models to particle-size distributions, used for example when making adjustments to homogenize data sets analysed by a range of different instruments (Konert & Vandenberghe, 1997; Beuselinck et al., 1998; Buurman et al., 2001; Eshel et al., 2004). Figure 4A, B and C show the simulated data set (‘Data A’) of percentage-frequency values plotted against another synthetic data set (‘Data B’). Data set ‘B’ has been simulated to replicate the underestimation of clay particles by a laser diffraction instrument. A least-squares linear regression model has been calculated for each size fraction (solid red line) with model coefficients and R² correlation coefficients given for each. R² statistics are close to one for both sand and silt, indicating a good agreement between the two data sets. Close inspection of the regression models reveals that for the silt fraction (Fig. 4B) percentage values greater than 100 are predicted for the upper range of data, indicated by the horizontal dashed line. The inverse case is also true for the clay fraction (Fig. 4C), where the regression model predicts negative percentage values for data set A.
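The way in which a linear model fitted to percentage data can leave the parameter space may be sketched as follows. The clay percentages are hypothetical and are not data sets ‘A’ and ‘B’; `ols` is an illustrative helper implementing ordinary least squares for a single predictor.

```python
def ols(x, y):
    """Ordinary least-squares intercept (b0) and slope (b1) for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical clay percentages from two instruments; the second instrument
# reads systematically lower than the first (illustrative values only).
clay_first = [5.0, 10.0, 20.0, 30.0]
clay_second = [1.0, 5.0, 13.0, 21.0]

b0, b1 = ols(clay_first, clay_second)

# Predict the second instrument's value for a sample with 2% clay.
predicted = b0 + b1 * 2.0
print(b0, b1, predicted)
```

The fitted intercept is negative, so the model predicts a negative clay percentage for a sample with low clay content. Nothing in the least-squares procedure prevents such a prediction, because the model is unaware of the limits of the parameter space.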
Figure 4. Biplots of the simulated data set (n = 100) summarized as the relative proportions of sand, silt and clay plotted against another simulated data set, replicating the underestimation of clay particles by a laser-diffraction-based instrument. (A), (B) and (C) Percentage data with ordinary least-squares linear regression model (red line), model coefficients (β0, β1) and root mean squared error (R). Regression models constructed using percentage data are capable of predicting values outside the mathematical parameter space. These predictions, if corrected post hoc, have implications for the constant-sum constraint. (D), (E) and (F) Log-ratio data with ordinary least-squares linear regression model (red line), model coefficients (β0, β1) and root mean squared error (R). Log-ratio data are free to range between −∞ and ∞, avoiding the prediction of invalid values. The constant-sum constraint is ensured by the back transformation of predicted values (Eq. (3)). Note the bias inherent in calculating correlation coefficients (R) of percentage-frequency data, which are consistently higher than for log-ratio data because all the data are positive.
Calculating confidence regions around populations of particle-size distributions is important if differences between populations are to be determined reliably. Confidence regions for ternary diagrams have traditionally been calculated using hexagonal fields of variation (Stevens et al., 1956; Weltje, 2002). These are plotted for the simulated data set in Fig. 2 as confidence regions of the population at 90%, 95% and 99% confidence limits. Confidence intervals calculated using percentage-frequency values have also been plotted in Fig. 1. The red-dashed line in Fig. 1 indicates the lower 95% confidence limit of the population. In both figures, some of the values within the confidence regions fall outside the range of zero to 100, which is clearly impossible.
The third constraint operating on particle-size distributions is that they must sum to 100. This restriction applies equally to confidence intervals and to particle-size distributions modelled using regression functions. In the former case, the upper and lower 95% confidence levels of the synthetic data calculated using percentage-frequency values plotted in Fig. 1 sum respectively to 197·5% and 2·5%. Were these estimates taken as correct, they would violate principles of conservation of mass, implying that mass has the potential to both leave and enter the system. In the latter case, there is no way to guarantee that distributions predicted from percentage data will sum to 100. Post hoc adjustments performed to ensure constant sum, unless done with extreme care, are liable to violate the first constraint by changing the relative proportions of the distribution. These three mathematical constraints impose serious limitations on how well particle-size data can be described and compared and, consequently, the reliability with which the performance of particle-size analysers can be defined and compared. The simulated data sets ‘A’ and ‘B’ are available online in Appendix S3.
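The violation of the constant-sum constraint by confidence limits computed on percentages can be shown with a short sketch. The five compositions are hypothetical, and the limits are taken, as an assumption made here for simplicity, to be the mean plus or minus 1·96 standard deviations of each size fraction treated independently.

```python
import statistics

# Hypothetical (sand, silt, clay) percentages; each row sums to 100.
samples = [
    [60.0, 30.0, 10.0],
    [50.0, 35.0, 15.0],
    [70.0, 25.0, 5.0],
    [55.0, 30.0, 15.0],
    [65.0, 33.0, 2.0],
]

columns = list(zip(*samples))  # one tuple of values per size fraction
mean = [statistics.mean(c) for c in columns]
sd = [statistics.stdev(c) for c in columns]

# Assumed per-fraction limits: mean +/- 1.96 standard deviations.
upper = [m + 1.96 * s for m, s in zip(mean, sd)]
lower = [m - 1.96 * s for m, s in zip(mean, sd)]

print(sum(mean), sum(upper), sum(lower))
print(lower)
```

The mean composition sums to 100, but the upper limits sum to more than 100 and the lower limits to less than 100. The lower limit for clay is also negative, which repeats the parameter-space problem described under the second constraint.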
To overcome the constraints detailed above, the principles of compositional data and the log-ratio transformation must be introduced. Data characterized mathematically as positive constant-sum vectors are known as compositional data (Aitchison, 1986). A single compositional data point, for example a particle-size distribution, is known as a composition. Aitchison (1986) recognized that, because each part of a composition must be considered relative to all its other parts, compositions are best treated as ratios. Ratios are mathematically awkward, so Aitchison (1986) logically extended the transformation to derive the logarithm of the ratios, log-ratios. There are a number of different ways to calculate the log-ratio of compositions. For particle-size distributions composed of more than three parts, the centred-log-ratio transform (clr) is generally the most useful:

$$q_i = \operatorname{clr}(p)_i = \ln\!\left(\frac{p_i}{g(p)}\right), \qquad g(p) = \left(\prod_{j=1}^{D} p_j\right)^{1/D}, \qquad i = 1, \dots, D \tag{1}$$
where q is the log-ratio transform of p, a particle-size distribution with D particle-size categories, i is the ith category and g(p) is the geometric mean of the particle-size distribution p. For a three-part particle-size distribution [0·4 0·35 0·25] the log-ratio of the first category is:

$$q_1 = \ln\!\left(\frac{0.4}{(0.4 \times 0.35 \times 0.25)^{1/3}}\right) \approx \ln\!\left(\frac{0.4}{0.3271}\right) \approx 0.201$$
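The centred log-ratio transform can be sketched in a few lines. Natural logarithms are assumed here, and `clr` is an illustrative function name rather than a reference to a specific software package. Each part must be strictly positive, because the logarithm of zero is undefined.

```python
import math

def clr(p):
    """Centred log-ratio transform: log of each part over the geometric mean."""
    logs = [math.log(x) for x in p]   # requires every part to be > 0
    mean_log = sum(logs) / len(logs)  # log of the geometric mean g(p)
    return [v - mean_log for v in logs]

p = [0.4, 0.35, 0.25]
q = clr(p)
print(q)

# The transform depends only on ratios, so percentages give the same result.
q_pct = clr([40.0, 35.0, 25.0])
```

The transformed values sum to zero by construction, and the same values are obtained whether the distribution is expressed as proportions or as percentages, because only the ratios between parts enter the calculation.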
The log-ratio transformation of compositional data moves them from closed space to real space, also referred to as co-ordinate space. The transformation into co-ordinate space is best visualized by comparing Fig. 2A to Fig. 2B. In the latter three-dimensional scatterplot, the log-ratios of the sand, silt and clay fractions each form a dimension in co-ordinate space, analogous to (x, y, z) Cartesian co-ordinates. In co-ordinate space, where there are no parameter limits (data values range from −∞ to ∞), these data can be analysed using standard multivariate statistics without any of the restrictions described above.
The application of linear regression models to particle-size distributions in co-ordinate space is illustrated by Fig. 4D to F. In co-ordinate space, linear regression models are incapable of predicting ‘impossible’ values. Equation (1) also removes the influence of bias on root mean squared error (RMSE) calculations because the data are both positive and negative. With the removal of this bias, correlation coefficients calculated using log-ratio data are notably lower than for the equivalent percentage-frequency data in Fig. 4A to C. Standard goodness-of-fit statistics can also be applied to data in co-ordinate space without the need to resort to moment-measure statistics. The root mean squared error statistic applied in this case is identical to the Euclidean distance normalized by the degrees of freedom. This statistic can be used to compare the output of predictive models and to quantify instrument precision from replicate sample measurements.
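A root mean squared error in co-ordinate space might be computed as in the sketch below. The replicate measurements are hypothetical, and normalizing by the number of parts D is an assumption made for this illustration, since the text does not specify how the degrees of freedom are counted.

```python
import math

def clr(p):
    """Centred log-ratio transform (natural logarithms, all parts > 0)."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def rmse_clr(p1, p2):
    """Euclidean distance between clr vectors, normalized by the part count.

    Dividing by D (the number of parts) is an assumption of this sketch.
    """
    q1, q2 = clr(p1), clr(p2)
    squared = sum((a - b) ** 2 for a, b in zip(q1, q2))
    return math.sqrt(squared / len(q1))

# Hypothetical replicate measurements of one sample (sand, silt, clay; %).
replicate_1 = [40.0, 35.0, 25.0]
replicate_2 = [42.0, 34.0, 24.0]

error = rmse_clr(replicate_1, replicate_2)
print(error)
```

Because the statistic is computed on log-ratios, it is unchanged if both replicates are rescaled, for example from percentages to proportions, and it is zero only when the two distributions have identical relative proportions.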
Working with log-ratio data allows population confidence regions to be predicted in a straightforward manner (van den Boogaart & Tolosana-Delgado, 2008). The red ellipses plotted in Fig. 2B delineate confidence regions of the synthetic data population at 90%, 95% and 99% confidence limits. These are actually calculated as volumes using a trivariate probability model, but are plotted as orthogonal ellipses for graphical simplicity. To turn these confidence regions, along with modelled distributions, into meaningful information they must be converted back to percentage-frequency values. The transformation of log-ratio data back into constrained space is performed using the inverse log-ratio transform (clr⁻¹):

$$p^{*}_i = \operatorname{clr}^{-1}(q)_i = \exp(q_i), \qquad i = 1, \dots, D \tag{2}$$
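where q_i is the ith part of a log-ratio particle-size distribution q. To arrive at percentage-frequency values, p* must be further adjusted using the closure operation C so that its component parts sum to 100%:

$$p_i = C(p^{*})_i = 100 \cdot \frac{p^{*}_i}{\sum_{j=1}^{D} p^{*}_j}, \qquad i = 1, \dots, D \tag{3}$$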
The back-transformed confidence regions are plotted as a ternary diagram in Fig. 2C, and the back-transformed confidence intervals are plotted in Fig. 1. All of the values within these confidence regions fall within the parameter space and all sum to 100%. Modelled particle-size distributions predicted using log-ratio regression models also obey closed-data restrictions following back transformation by Eq. (3). The example given above demonstrates how the log-ratio transformation (Eq. (1)) can be used to overcome the restrictions of closed space and apply regression analysis and concepts of statistical confidence to comparing particle-size analysers.
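The full round trip, from percentages to co-ordinate space and back, can be sketched as follows. The compositions are hypothetical, and the limits are computed for each co-ordinate separately as the mean plus or minus 1·96 standard deviations; this is a simplification adopted for illustration and not the multivariate confidence regions plotted in Fig. 2. The function names are illustrative.

```python
import math
import statistics

def clr(p):
    """Centred log-ratio transform (natural logarithms, all parts > 0)."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inverse(q, total=100.0):
    """Exponentiate each co-ordinate, then close so the parts sum to `total`."""
    p_star = [math.exp(v) for v in q]
    s = sum(p_star)
    return [total * v / s for v in p_star]

# Hypothetical (sand, silt, clay) percentages; each row sums to 100.
samples = [
    [60.0, 30.0, 10.0],
    [50.0, 35.0, 15.0],
    [70.0, 25.0, 5.0],
    [55.0, 30.0, 15.0],
    [65.0, 33.0, 2.0],
]

columns = list(zip(*[clr(s) for s in samples]))
mean_q = [statistics.mean(c) for c in columns]
sd_q = [statistics.stdev(c) for c in columns]

# Simplified per-co-ordinate limits. These vectors need not sum to zero,
# but the closure step restores the constant sum after exponentiation.
upper_q = [m + 1.96 * s for m, s in zip(mean_q, sd_q)]
lower_q = [m - 1.96 * s for m, s in zip(mean_q, sd_q)]

centre = clr_inverse(mean_q)
upper = clr_inverse(upper_q)
lower = clr_inverse(lower_q)
print(centre, upper, lower)
```

Whatever values are chosen in co-ordinate space, the exponential guarantees positive parts and the closure guarantees a sum of 100, so every back-transformed distribution is a feasible particle-size distribution.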