The methods used by this study explicitly acknowledge the limitations of analysing percentage-frequency data using traditional multivariate statistical techniques. An alternative approach is offered by a suite of techniques known collectively as ‘compositional data analysis’. Compositional data analysis has been applied to particle-size data for over 30 years, yet remains a comparatively unknown approach within the field of sedimentology (Aitchison, 1982, 1986, 1999, 2003; von Eynatten, 2004; Weltje & Prins, 2003; Jonkers *et al*., 2009; Tolosana-Delgado & von Eynatten, 2009, 2010; Weltje & Roberson, 2012). Many other examples of compositional data analysis exist in geoscience, including: mineral compositions of rocks (Thomas & Aitchison, 2006; Weltje, 2002, 2006), pollutant profiles (Howel, 2007), pollen populations (Jackson, 1997) and trace element compositions (von Eynatten *et al*., 2003). Initially, it is helpful to provide a brief introduction to compositional data analysis and its relevance to inter-instrument comparison of particle-size analysers using a short example.

#### The mathematical properties of particle-size distributions

To appreciate the best way to analyse particle-size distributions, expressed as discrete particle-size percentage values, it is necessary to highlight some of their fundamental (often obvious) mathematical properties.

- Particle-size distributions contain relative information about the proportions of different particle sizes in a sediment sample.
- Particle-size categories are expressed as percentages, so each is greater than or equal to zero and less than or equal to 100%.
- Particle-size distributions sum to 100%.

These properties are very useful in that they normalize measurements of particle mass or particle volume, allowing for widespread data comparison; however, they also represent serious obstacles to multivariate statistical analysis and subsequent data interpretation. A data set of 100 simulated size-frequency distributions is presented as: (i) a percentage-frequency plot (Fig. 1); and (ii) ternary diagrams (Fig. 2A and C). These data are used below to illustrate that these mathematical properties act as constraints, limiting the extent to which standard statistical tests can be applied to particle-size distributions.

The first constraint is that each part of a particle-size distribution must be considered in relation to all its other parts, because a change in one part of a distribution automatically produces an opposing change in the other parts of the distribution. For example, if there is an absolute increase in the mass of silt in one sample compared to another sample, the relative proportions of sand and clay will decrease, even if the mass of both of these size fractions remains constant. The lengthy description of the relative proportions of each part of a particle-size distribution can be avoided if log-normal distribution coefficients are used instead (Folk & Ward, 1957; Inman, 1952; Evans & Benn, 2004). Comparisons between particle-size distributions are quantified most commonly using mean, standard deviation, skewness and kurtosis statistics. The limitations of this approach have been documented by several authors (Bagnold & Barndorff-Nielsen, 1980; Fredlund *et al*., 2000; Fieller *et al*., 1992; Beierle *et al*., 2002; Friedman, 1962), the most fundamental of which is that particle-size distributions are frequently not log-normal. Figure 3 illustrates how a series of markedly different multimodal distributions can have identical log-normal distribution coefficients (mean and standard deviation). Quantifying the analytical precision of an instrument is clearly problematic if the statistics used are unable to differentiate between particle-size distributions that are evidently dissimilar. Alternative probability distribution functions (for example, log-hyperbolic and skew-Laplace) have been suggested to circumvent this issue (Bagnold & Barndorff-Nielsen, 1980; Fieller *et al*., 1992), but the limitations of non-uniqueness still apply. Moreover, all distribution function statistics mask potentially important variations in empirical data.
One means of avoiding the pitfalls of distribution function statistics has been to compare complete particle-size distributions using factor analysis (Syvitski, 1991; Stauble & Cialone, 1996). Unfortunately, in most cases this approach is not valid because percentage values in particle-size distributions are subject to bias (Falco *et al*., 2003). The source of this bias is detailed by the second constraint acting on particle-size distributions and is described below.
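
The closure effect described under the first constraint can be checked with a few lines of arithmetic. The sketch below is illustrative only: the masses are hypothetical values rather than data from this study, and `close_to_percent` is a helper name introduced here.

```python
def close_to_percent(masses):
    """Convert absolute masses to percentages that sum to 100 (closure)."""
    total = sum(masses)
    return [100.0 * m / total for m in masses]

# Hypothetical masses in grams, ordered sand, silt, clay.
sample_1 = [40.0, 40.0, 20.0]
sample_2 = [40.0, 80.0, 20.0]  # only the silt mass differs from sample_1

pct_1 = close_to_percent(sample_1)
pct_2 = close_to_percent(sample_2)

for name, pct in (("sample 1", pct_1), ("sample 2", pct_2)):
    print(name, [round(v, 1) for v in pct])

# The sand and clay percentages fall although their masses are unchanged,
# whereas the sand:clay ratio is the same in both samples.
print(round(pct_1[0] / pct_1[2], 6), round(pct_2[0] / pct_2[2], 6))
```

The unchanged sand:clay ratio hints at why ratios, rather than raw percentages, carry the information that is stable under closure.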

The second constraint is that percentage-frequency data occupy a mathematically limited space, 0 < x < 100. This has important implications for applying regression models to particle-size distributions, used for example when making adjustments to homogenize data sets analysed by a range of different instruments (Konert & Vandenberghe, 1997; Beuselinck *et al*., 1998; Buurman *et al*., 2001; Eshel *et al*., 2004). Figures 4A, B and C show the simulated data set (‘Data A’) of percentage-frequency values plotted against another synthetic data set (‘Data B’). Data set ‘B’ has been simulated to replicate the underestimation of clay particles by a laser diffraction instrument. A least-squares linear regression model has been calculated for each size fraction (solid red line), with model coefficients and *R*^{2} values given for each. *R*^{2} statistics are close to one for both sand and silt, indicating good agreement between the two data sets. Close inspection of the regression models reveals that, for the silt fraction (Fig. 4B), percentage values greater than 100 are predicted for the upper range of the data, indicated by the horizontal dashed line. The inverse case holds for the clay fraction (Fig. 4C), where the regression model predicts negative percentage values for data set A.
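
The way a least-squares line can leave the permitted range is easy to reproduce. The following is a minimal sketch using hypothetical silt percentages (not the simulated ‘Data A’ and ‘Data B’ of this study); `fit_line` is a helper written for this illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical silt percentages for the same samples from two instruments.
silt_a = [20.0, 40.0, 60.0, 80.0, 95.0]
silt_b = [26.0, 50.0, 72.0, 93.0, 99.5]

intercept, slope = fit_line(silt_a, silt_b)
predicted = [intercept + slope * x for x in silt_a]

# Fitted values that fall outside the closed interval 0 to 100%.
impossible = [round(v, 1) for v in predicted if v < 0.0 or v > 100.0]
print("fitted values outside 0 to 100%:", impossible)
```

Because the straight line is unaware of the upper bound, the fitted value for the most silt-rich sample exceeds 100%, even though every observation lies within the permitted range.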

Calculating confidence regions around populations of particle-size distributions is important if differences between populations are to be determined reliably. Confidence regions for ternary diagrams have traditionally been calculated using hexagonal fields of variation (Stevens *et al*., 1956; Weltje, 2002). These are plotted for the simulated data set in Fig. 2 as confidence regions of the population at 90%, 95% and 99% confidence limits. Confidence intervals calculated using percentage-frequency values have also been plotted in Fig. 1. The red-dashed line in Fig. 1 indicates the lower 95% confidence limit of the population. In both figures, some of the values within the confidence regions fall outside the range of zero to 100, which is clearly impossible.

The third constraint operating on particle-size distributions is that they must sum to 100. This restriction applies equally to confidence intervals and to particle-size distributions modelled using regression functions. In the former case, the upper and lower 95% confidence levels of the synthetic data calculated using percentage-frequency values plotted in Fig. 1 sum respectively to 197·5% and 2·5%. Were these estimates taken as correct, they would violate principles of conservation of mass, implying that mass has the potential to both leave and enter the system. In the latter case, there is no way to guarantee that distributions predicted from percentage data will sum to 100. *Post hoc* adjustments performed to ensure constant sum, unless done with extreme care, are liable to violate the first constraint by changing the relative proportions of the distribution. These three mathematical constraints impose serious limitations on how well particle-size data can be described and compared and, consequently, on the reliability with which the performance of particle-size analysers can be defined and compared. The simulated data sets ‘A’ and ‘B’ are available online in Appendix S3.
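
The constant-sum problem for confidence limits can be sketched numerically. The example below does not reproduce the calculation behind Fig. 1: the five samples are hypothetical, and the limits assume, purely for illustration, a normal model applied to each size fraction independently (mean ± 1·96 standard deviations).

```python
import statistics

# Hypothetical sand, silt, clay percentages for five samples (each sums to 100).
samples = [
    [55.0, 35.0, 10.0],
    [40.0, 45.0, 15.0],
    [30.0, 50.0, 20.0],
    [60.0, 30.0, 10.0],
    [45.0, 40.0, 15.0],
]

parts = list(zip(*samples))  # one tuple of values per size fraction
means = [statistics.mean(col) for col in parts]
sds = [statistics.stdev(col) for col in parts]

# Approximate 95% limits computed separately for each fraction
# (illustrative assumption: normal model, mean +/- 1.96 sd).
upper = [m + 1.96 * s for m, s in zip(means, sds)]
lower = [m - 1.96 * s for m, s in zip(means, sds)]

print("sum of means:", round(sum(means), 1))
print("sum of upper limits:", round(sum(upper), 1))
print("sum of lower limits:", round(sum(lower), 1))
```

The means sum to 100, but the part-wise upper limits sum to more than 100 and the lower limits to less, so neither limit is itself a valid particle-size distribution.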

#### Compositional data

To overcome the constraints detailed above, the principles of compositional data and the log-ratio transformation must be introduced. Data characterized mathematically as positive constant-sum vectors are known as compositional data (Aitchison, 1986). A single compositional data point, for example a particle-size distribution, is known as a composition. Aitchison (1986) recognized that, because each part of a composition must be considered relative to all its other parts, the parts are best treated as ratios. Ratios are mathematically awkward, so Aitchison (1986) logically extended the transformation to derive the logarithm of the ratios, *log-ratios*. There are a number of different ways to calculate the log-ratio of compositions. For particle-size distributions composed of more than three parts, the *centred-log-ratio* transform (*clr*) is generally the most useful:

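$$\mathbf{q} = \operatorname{clr}(\mathbf{p}) = \left[\ln\frac{p_1}{g(\mathbf{p})}, \ldots, \ln\frac{p_i}{g(\mathbf{p})}, \ldots, \ln\frac{p_D}{g(\mathbf{p})}\right], \qquad g(\mathbf{p}) = \left(\prod_{j=1}^{D} p_j\right)^{1/D} \tag{1}$$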

where **q** is the log-ratio transform of **p**, a particle-size distribution with *D* particle-size categories, *i* indexes the *i*th category and *g*(**p**) is the geometric mean of the particle-size distribution **p**. For a three-part particle-size distribution [0·4 0·35 0·25] the log-ratio of the first category is:

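$$q_1 = \ln\frac{0{\cdot}4}{\left(0{\cdot}4 \times 0{\cdot}35 \times 0{\cdot}25\right)^{1/3}} \approx 0{\cdot}201 \tag{2}$$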
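
The centred-log-ratio transform is short enough to write out directly. The sketch below assumes natural logarithms, as is conventional for log-ratio analysis; the function name `clr` is chosen here for illustration and is not taken from any particular software package.

```python
import math

def clr(p):
    """Centred log-ratio transform of a composition with strictly positive parts."""
    if any(x <= 0 for x in p):
        raise ValueError("clr is undefined for zero or negative parts")
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)  # natural log of the geometric mean g(p)
    return [v - mean_log for v in logs]

# The three-part distribution used in the worked example above.
q = clr([0.4, 0.35, 0.25])
print([round(v, 3) for v in q])  # [0.201, 0.068, -0.269]
```

The clr values of a composition always sum to zero, and they are unchanged if the composition is expressed as percentages instead of proportions, because the scaling factor cancels in every ratio.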

The log-ratio transformation moves compositional data from closed space to real space, also referred to as co-ordinate space. The transformation into co-ordinate space is best visualized by comparing Fig. 2A with Fig. 2B. In the latter three-dimensional scatterplot, the log-ratios of the sand, silt and clay fractions each form a dimension in co-ordinate space, analogous to (*x*, *y*, *z*) Cartesian co-ordinates. In co-ordinate space, where there are no parameter limits (data values range from −∞ to ∞), these data can be analysed using standard multivariate statistics without any of the restrictions described above.

The application of linear regression models to particle-size distributions in co-ordinate space is illustrated by Fig. 4D to F. In co-ordinate space, linear regression models are incapable of predicting ‘impossible’ values. Equation (1) also removes the influence of bias on root mean squared error (RMSE) calculations because the data are both positive *and* negative. With the removal of this bias, correlation coefficients calculated using log-ratio data are notably lower than for the equivalent percentage-frequency data in Fig. 4A to C. Standard goodness-of-fit statistics can also be applied to data in co-ordinate space without the need to resort to moment statistics. The root mean squared error statistic applied in this case is identical to the Euclidean distance normalized by degrees of freedom. This can be applied to compare the output of predictive models and to quantify instrument precision using replicate sample measurements.
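
A goodness-of-fit statistic of this kind can be sketched as follows. The replicate values are hypothetical, and dividing the squared distance by the number of parts *D* is an assumption made for this illustration; the text states only that the distance is normalized by degrees of freedom.

```python
import math

def clr(p):
    """Centred log-ratio transform (natural logarithms assumed)."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def rmse_clr(p_a, p_b):
    """RMSE between two compositions in clr co-ordinate space.

    Equals the Euclidean distance between the two clr vectors divided by
    sqrt(D). The divisor D is an assumption of this sketch.
    """
    q_a, q_b = clr(p_a), clr(p_b)
    squared = sum((a - b) ** 2 for a, b in zip(q_a, q_b))
    return math.sqrt(squared / len(q_a))

# Hypothetical replicate measurements of one sample (sand, silt, clay in %).
replicate_1 = [40.0, 35.0, 25.0]
replicate_2 = [42.0, 34.0, 24.0]

print(round(rmse_clr(replicate_1, replicate_2), 4))
```

Because the statistic is computed on log-ratios, it gives the same result whether the replicates are supplied as percentages or as proportions.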

Working with log-ratio data allows population confidence regions to be predicted in a straightforward manner (van den Boogaart & Tolosana-Delgado, 2008). The red ellipses plotted in Fig. 2B delineate confidence regions of the synthetic data population at 90%, 95% and 99% confidence limits. These are actually calculated as volumes using a trivariate probability model, but are plotted as orthogonal ellipses for graphical simplicity. To turn these confidence regions, along with modelled distributions, into meaningful information they must be converted back to percentage-frequency values. The transformation of log-ratio data back into constrained space is performed using the inverse log-ratio transform (*clr*^{−1}):

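$$\mathbf{p}^{*} = \operatorname{clr}^{-1}(\mathbf{q}) = \left[\exp(q_1), \ldots, \exp(q_i), \ldots, \exp(q_D)\right] \tag{3}$$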

where *q*_{i} is the *i*th part of a log-ratio particle-size distribution **q**. To arrive at percentage-frequency values, **p**^{*} must be further adjusted using the closure operation *C* so that its component parts sum to 100%:

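$$\mathbf{p} = C(\mathbf{p}^{*}) = \left[\frac{100\,p^{*}_{1}}{\sum_{j=1}^{D} p^{*}_{j}}, \ldots, \frac{100\,p^{*}_{i}}{\sum_{j=1}^{D} p^{*}_{j}}, \ldots, \frac{100\,p^{*}_{D}}{\sum_{j=1}^{D} p^{*}_{j}}\right] \tag{4}$$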
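
The back-transformation can be sketched in two steps, assuming the standard form of the inverse clr (element-wise exponential) followed by closure to 100%. The function names `clr_inv` and `closure` are illustrative.

```python
import math

def clr(p):
    """Centred log-ratio transform (natural logarithms assumed)."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inv(q):
    """Inverse clr before closure: exponentiate each log-ratio."""
    return [math.exp(v) for v in q]

def closure(p_star, total=100.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(p_star)
    return [total * v / s for v in p_star]

# Round trip: percentages -> clr -> back to percentages.
p = [40.0, 35.0, 25.0]
p_back = closure(clr_inv(clr(p)))
print([round(v, 6) for v in p_back])

# Any real-valued vector maps back to a valid composition.
arbitrary = closure(clr_inv([2.5, -4.0, 0.3]))
print([round(v, 2) for v in arbitrary], round(sum(arbitrary), 6))
```

The second call shows why results computed in co-ordinate space stay within bounds: the exponential makes every part positive and closure fixes the sum, so no back-transformed value can be negative or exceed 100%.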

The inverted confidence intervals are plotted in Fig. 1 and as a ternary diagram in Fig. 2C. All of the values within these confidence regions lie within the permitted parameter space and all sum to 100%. Modelled particle-size distributions predicted using log-ratio regression models also obey closed-data restrictions following back-transformation by Eqs (3) and (4). The example given above demonstrates how the log-ratio transformation (Eq. (1)) can be used to overcome the restrictions of closed space, allowing regression analysis and concepts of statistical confidence to be applied when comparing particle-size analysers.
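
The full workflow for a regression model (transform, fit, predict, back-transform) can be sketched as below. This is not the model fitted in Fig. 4: the paired compositions are hypothetical, one simple least-squares line is fitted per clr co-ordinate as an illustrative choice, and all function names are introduced here.

```python
import math

def clr(p):
    """Centred log-ratio transform (natural logarithms assumed)."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inv_closed(q, total=100.0):
    """Exponentiate each log-ratio, then rescale so the parts sum to `total`."""
    expd = [math.exp(v) for v in q]
    s = sum(expd)
    return [total * v / s for v in expd]

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical paired measurements (sand, silt, clay in %) of five samples.
data_a = [[70, 20, 10], [50, 40, 10], [30, 60, 10], [15, 80, 5], [3, 95, 2]]
data_b = [[72, 22, 6], [52, 43, 5], [31, 63.5, 5.5], [16, 81.5, 2.5], [3.5, 95.6, 0.9]]

clr_a = [clr(p) for p in data_a]
clr_b = [clr(p) for p in data_b]

# One least-squares model per clr co-ordinate.
models = [
    fit_line([q[i] for q in clr_a], [q[i] for q in clr_b])
    for i in range(3)
]

# Predict instrument-B percentages for a new, very silt-rich sample.
new_sample_a = [2.0, 97.0, 1.0]
q_pred = [a + b * x for (a, b), x in zip(models, clr(new_sample_a))]
p_pred = clr_inv_closed(q_pred)
print([round(v, 2) for v in p_pred], round(sum(p_pred), 6))
```

Even for an extreme input composition, the back-transformed prediction is strictly positive in every part and sums to 100%, in contrast to a regression fitted directly to percentage values.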