Phytoplankton are microscopic, photosynthetic organisms floating in the sea. They are essential to marine systems and, consequently, much effort and expense have been invested in the study of their distribution and abundance. The development of analytical flow cytometry (AFC) in biomedical research, and its subsequent introduction to the aquatic sciences, has provided significant impetus to these studies. AFC is an efficient technique for the rapid and accurate characterization and sorting of particles from heterogeneous populations (1). The AFC processing of marine samples on board ship (2–4) has improved considerably the speed and accuracy of data collection but has, in turn, provoked a commensurate demand for development of effective statistical methods of analysis.

One problem is that the speed at which the AFC measures single cells is such that very large numbers of phytoplankton can be processed in short time periods, yielding very large data sets. The main drawback, however, is that phytoplankton populations contain many more cell types than are found in biomedical applications. Although the latter populations can usually be discriminated readily using two or three flow cytometric variables, discrimination of phytoplankton populations has proved to be much more complicated. Early attempts (5) used descriptive techniques such as histograms and bivariate scatterplots, which failed to exploit the multivariate nature of the data. Ignoring the a priori group structure enables the use of principal component analysis for dimensionality reduction to be followed by such techniques as clustering and mixture modeling (4, 6), but this does not address the population discrimination question. More appropriate methods such as back-propagation neural networks for supervised classification (7) or canonical variate analysis and quadratic discriminant analysis (QDA; 8) have shown some success, but have also highlighted sufficient limitations (e.g., inappropriate assumptions of normality and common dispersion matrices) to encourage further investigation.

To summarize the position following the work to date, there are three main problem areas that need to be addressed. First, the assumption of normal distributions that underlies the application of linear or quadratic discriminant functions is questionable in some phytoplankton populations, with a number of them exhibiting marked departures from normality including bimodality. One of the recommendations made by Carr et al. (8) was that nonparametric methods should be tried, but results using either *k*-nearest neighbors or classification and regression trees have proved to be relatively impractical (7). Second, marine samples will typically contain a relatively small selection of populations taken (randomly) from a very large set, whereas most traditional applications of discriminant analysis are concerned with a small number of fixed and specified populations. An additional property of phytoplankton populations is that they fall into a natural hierarchy comprising species within taxonomic groups. Special methods may be needed to cope with these aspects. Third, marine samples may also contain a “new” species for which no prior data are available. It is important to recognize these individuals as distinct from the rest, and not to misclassify them into existing species.

This study addresses all three areas, but places greatest emphasis on the first problem area. We tackle nonparametric modeling of the data and consider both kernel and wavelet methods for estimating the densities necessary in the construction of discriminant functions. There are arguments for and against each approach. AFC data contain a lot of noise due to cell debris, dead cells, and bacteria, which suggests that wavelet-based density function estimation might have advantages over kernel-based methods. On the other hand, the relatively low dimensionality of AFC data might reverse the position on computational grounds. The computational burden is not too great when obtaining kernel density estimates and discriminant functions can be built directly in this case, but wavelet density estimation is computationally much more demanding, particularly when the dimensionality rises beyond two or three. Current computing power precludes direct treatment of dimensionalities as high as five. Therefore, we examine ideas such as binning and projection pursuit for producing tractable discriminant functions with the wavelet approach. As far as we can tell, none of these ideas has been applied yet to the analysis of AFC data. In view of the technicalities involved, we first review some of the underlying theory.

### NONPARAMETRIC DISCRIMINATION

Denote by *x* the set of flow cytometric values for an individual (cell) that is to be classified, and assume that this individual belongs to one particular group out of a set of *g* groups *A*_{j} (*j* = 1,…,*g*). The usual approach to classification (9, 10) requires computation of the posterior probabilities *p*(*A*_{j} | *x*) of membership of group *A*_{j} by individual *x* for *j* = 1,…,*g*, followed by allocation of *x* to the group whose posterior probability is highest. The posterior probabilities depend in turn on the prior probabilities *π*_{j} of occurrence of group *A*_{j} and on the probability densities *f*(*x* | *A*_{j}) of observing *x* in each group, through Bayes's theorem as

$$p(A_j \mid x) = \frac{\pi_j \, f(x \mid A_j)}{\sum_{k=1}^{g} \pi_k \, f(x \mid A_k)}, \qquad j = 1, \ldots, g.$$

To follow this procedure in practice, we need to estimate the *p*(*A*_{j} | *x*) from training data, which in turn requires estimates *π̂*_{j} and *f̂*(*x* | *A*_{j}) of the *π*_{j} and *f*(*x* | *A*_{j}), respectively. A parametric approach assumes a parametric form for these densities, with parameters estimated from the training data. However, as noted earlier (8), assuming normal distributions (which leads to the familiar linear and quadratic discriminant functions) may not be suitable for AFC data, so we examine nonparametric methods. The *π*_{j} may be estimated from proportions observed in the training data, taken from prior knowledge, or all set equal and hence removed from consideration because they cancel in Bayes's theorem. The sampling mechanism generated 10,000 observations from each population in the present study (see below), so we used the last-named approach in all our analyses.
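The allocation rule can be illustrated with a minimal sketch. It assumes that the estimated densities *f̂*(*x* | *A*_{j}) have already been evaluated at a single *x*; the function names `posteriors` and `allocate` are hypothetical and are introduced here only for illustration.

```python
def posteriors(densities, priors=None):
    """Posterior probabilities p(A_j | x) from group densities f(x | A_j).

    densities: estimated density values at one x, one value per group.
    priors: prior probabilities pi_j; equal priors are assumed when omitted.
    """
    g = len(densities)
    if priors is None:
        priors = [1.0 / g] * g
    weighted = [p * f for p, f in zip(priors, densities)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("x has zero estimated density in every group")
    return [w / total for w in weighted]


def allocate(densities, priors=None):
    """Index of the group whose posterior probability is highest."""
    post = posteriors(densities, priors)
    return max(range(len(post)), key=post.__getitem__)
```

With equal priors the posterior ordering is simply the ordering of the densities, which is why the priors can be dropped from consideration in that case.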

Kernel methods provide perhaps the most popular approach to estimation of *f*(*x* | *A*_{j}). First introduced by Fix and Hodges (11), the methods are now well established and thoroughly researched. A large number of variants can be found in the literature; for a general overview see Silverman (12) and for details of the multivariate case see Scott (13).

For implementation on the AFC data, we used a variety of estimators. The basic one was the product Gaussian kernel with fixed but separate bandwidths for each variable (13). Choice of bandwidth can be made in various ways (14, 15). However, with samples as large as the ones in this study, none seemed to offer any material advantage over the universal rule-of-thumb value *σ*_{i}*n*^{−1/(*d* + 4)}, where *d* is the number of variables, *n* is the sample size, and *σ*_{i} is the SD of the variable under consideration (13). This rule provided a simple but effective density estimate. To check that we were not missing important improvements, we also implemented more sophisticated methods using variable bandwidths (16) and balloon estimators (17), but no material improvements were obtained. For conciseness, we report the results for the basic method.
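A minimal sketch of the basic estimator follows, assuming the training data for one group are held as a list of equal-length numeric rows. The function names are hypothetical, and the plain-Python loops are chosen for readability rather than speed.

```python
import math
import statistics


def rule_of_thumb_bandwidths(sample):
    """h_i = sigma_i * n ** (-1 / (d + 4)) for each of the d variables."""
    n, d = len(sample), len(sample[0])
    factor = n ** (-1.0 / (d + 4))
    return [statistics.stdev([row[i] for row in sample]) * factor
            for i in range(d)]


def product_gaussian_kde(x, sample, bandwidths):
    """Product Gaussian kernel density estimate at the point x."""
    norm = 1.0
    for h in bandwidths:
        norm *= h * math.sqrt(2.0 * math.pi)  # normalizing constant per variable
    total = 0.0
    for row in sample:
        q = sum(((xi - yi) / h) ** 2
                for xi, yi, h in zip(x, row, bandwidths))
        total += math.exp(-0.5 * q)
    return total / (len(sample) * norm)
```

Evaluating `product_gaussian_kde` once per group at the same *x* gives the density values needed for the allocation rule.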

A very different approach to nonparametric density estimation was introduced by Cencov (18), who proposed approximating the unknown density by a convergent orthogonal series expansion and estimating the coefficients of the orthonormal functions from the available data. This idea was revived sporadically over the following three decades, but has suddenly come into its own as a consequence of the recent interest shown in a particular family of orthonormal basis functions known as wavelets. The application of wavelet methods to the estimation of a probability density was first proposed by Doukhan and Leon (19). Theoretical developments were reported by Johnstone et al. (20) and Donoho et al. (21), practical aspects were discussed by Tribouley (22) and Vannucci (23), and Abramovich et al. (24) provided a general overview.

The starting point for wavelet methods is the choice of two related, mutually orthonormal, functions: the scaling function (sometimes called the father wavelet), *ϑ*, and the mother wavelet, *ψ*. A variety of pairs of functions are now available (24); we have generally used those provided by Daubechies (25). Phytoplankton data exhibit no irregular behavior such as discontinuities or high-frequency oscillations, so we implemented the simplified approach known as the linear estimator using the computational method described by Pinheiro and Vidakovic (26).
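The linear estimator keeps only the scaling-function terms of the expansion at a chosen resolution level. The sketch below illustrates the idea in one dimension under simplifying assumptions that differ from our analyses: it uses the Haar scaling function rather than a Daubechies function, and it assumes the data have been rescaled to the interval [0, 1). The function name and the `level` parameter are hypothetical.

```python
import math
from collections import Counter


def haar_linear_density(sample, level):
    """Return a function x -> f_hat(x) for data rescaled to [0, 1).

    The Haar scaling function at this level is
    phi_k(x) = 2 ** (level / 2) on [k / 2**level, (k + 1) / 2**level).
    """
    n = len(sample)
    scale = 2 ** level
    # Empirical coefficients: c_k = (1 / n) * sum_i phi_k(X_i).
    counts = Counter(math.floor(scale * x) for x in sample)
    coeffs = {k: math.sqrt(scale) * c / n for k, c in counts.items()}

    def f_hat(x):
        # Only one Haar scaling function is nonzero at any given x.
        k = math.floor(scale * x)
        return coeffs.get(k, 0.0) * math.sqrt(scale)

    return f_hat
```

With the Haar function the estimate reduces to a histogram with bin width 2^{−level}; smoother scaling functions replace the indicator by overlapping basis functions but leave the coefficient formula unchanged.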

Unfortunately, the speed of the flow cytometer process means that AFC data sets are very large, which places an enormous computational burden on the estimation of densities once the dimensionality rises beyond 3. Similarly, large data sets are encountered in other areas of statistics (e.g., data mining), so it is worth seeking an approximation to the standard methods that is computationally efficient. Encouraged by the success of discretization, or binning, in kernel density estimation (27, 28), we implemented a binned wavelet density estimator (29). The range of each variable is divided into a number of (equally spaced) intervals, the conjunction of which defines a finite set of bins, and each individual is allocated uniquely into one of the bins. Computation now requires consideration of only this finite set of bins, a number usually much smaller than the size of the sample, and is therefore hugely reduced. Even so, the computational burden remained high. As a result, we also considered projection pursuit methodology for dimension reduction.
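The binning step itself is simple, as the following sketch shows. It assumes equally spaced intervals between user-supplied lower and upper limits for each variable; the function name and arguments are hypothetical.

```python
from collections import Counter


def bin_counts(sample, lows, highs, bins_per_axis):
    """Allocate each observation to one bin; return {bin index tuple: count}."""
    counts = Counter()
    for row in sample:
        index = []
        for value, lo, hi in zip(row, lows, highs):
            k = int((value - lo) / (hi - lo) * bins_per_axis)
            # Clamp so that values on or beyond the limits fall in an end bin.
            index.append(min(max(k, 0), bins_per_axis - 1))
        counts[tuple(index)] += 1
    return counts
```

Subsequent computation can then loop over the occupied bins, weighting each by its count, instead of looping over every observation.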

The idea here (13) is to find an optimal projection of the data from the original number of dimensions to a smaller number, and then to carry out the nonparametric discriminant analysis in this smaller dimensionality where computation is much easier. We decided to look for two-dimensional solutions, as computing times are fast for bivariate data. We believe that much of the between-species differences of the original data will reside in suitably chosen two-dimensional subspaces.

A simple procedure to effect the process is as follows. Suppose first that the *n* available observations are divided into a training and a test set of *na* and *nb* observations, respectively, and that *p* cytometric variables have been measured. Let *Y* denote the *na* × *p* training data matrix, and write *A* as an arbitrary *p* × 2 projection matrix with elements *a*_{ij}. For a particular set of values of the *a*_{ij}, the matrix *Z* = *YA* contains a corresponding projection of the data in two dimensions. This matrix can be divided randomly into two portions; wavelet density estimation plus associated calculation of a classification rule can be conducted on the first portion and an overall classification error rate can be determined from the second portion. This error rate is the “output” corresponding to the *a*_{ij} values that were the “input”. A standard optimization algorithm such as the Nelder and Mead simplex method (30) can then be used to find the *a*_{ij} values (i.e., projection) and resultant classification rule that minimize the error rate. Finally, applying this rule to the original test set of *nb* observations gives an unbiased assessment of its performance. The wavelet density estimation at the heart of this scheme can use either the raw or the binned data. We implemented both varieties. We report results using this method alongside the other methods below. The above methodology involves extensive theoretical and computational development, which is not described here (Collins, unpublished results).
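The input-output loop can be sketched as follows. This is an illustration of the structure only, with two deliberate substitutions: a majority vote within cells of a two-dimensional grid stands in for the wavelet-based classification rule, and a simple random perturbation search stands in for the Nelder and Mead simplex method. All function names, the cell `width`, and the step size are hypothetical.

```python
import random
from collections import Counter, defaultdict


def project(row, a):
    """Project a p-vector to 2 dimensions with a p x 2 matrix stored row-wise."""
    return (sum(v * a[i][0] for i, v in enumerate(row)),
            sum(v * a[i][1] for i, v in enumerate(row)))


def holdout_error(a, fit, check, width=1.0):
    """Error rate on `check` of a rule built on `fit`, both projected by `a`.

    fit and check are lists of (row, label) pairs. Stand-in rule: each grid
    cell predicts the label that is most frequent among its `fit` points.
    """
    def cell(z):
        return (int(z[0] // width), int(z[1] // width))

    votes = defaultdict(Counter)
    for row, label in fit:
        votes[cell(project(row, a))][label] += 1
    wrong = 0
    for row, label in check:
        tally = votes.get(cell(project(row, a)))
        predicted = tally.most_common(1)[0][0] if tally else None
        wrong += predicted != label
    return wrong / len(check)


def pursue(fit, check, p, steps=200, seed=0):
    """Search for a p x 2 projection with a low holdout error rate."""
    rng = random.Random(seed)
    best = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(p)]
    best_err = holdout_error(best, fit, check)
    for _ in range(steps):
        trial = [[v + rng.gauss(0, 0.3) for v in r] for r in best]
        err = holdout_error(trial, fit, check)
        if err < best_err:  # keep a perturbation only if it reduces the error
            best, best_err = trial, err
    return best, best_err
```

The error rate returned by `pursue` is optimistic because it was minimized over the projection; the rule should therefore be assessed on observations that played no part in the search.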

The first special problem associated with AFC data is the large number of groups. Our basic data set contains 65 groups, all of which potentially need to be discriminated from each other. Including them all produces overload and runs the risk of substantially inflating error rates. One way of solving this is to exploit the features of the phytoplankton themselves. We suggest two possible approaches.

There are usually about five or six phytoplankton species in any one natural sample. The organisms populate the seas in large groups rather than singly. Therefore, if one species is observed, then we would expect more of the same species to be in the immediate surrounding area. This is especially true for one particular taxon, the Diatoms, which drift in the sea as chains, single-celled species interlocking with each other. Consequently, if only a small number of test observations are classified into a certain group, we need to question the allocation procedure and remodel the situation.

A suggested modification of the earlier discriminant rules follows. We term this process the “leave-*k*-in” approach. First, construct the group-conditional density estimates *f̂*(*x* | *A*_{j}), obtain the posterior probabilities of group membership *p*(*A*_{j} | *x*), and allocate all the test observations accordingly. Second, eliminate from the training data any groups that have no classified observations. Third, for the remaining groups, pick out the *k* largest groups in terms of classified observations and remove groups that have small numbers of classified test observations. Finally, reclassify the test observations that were classified previously into the groups that have now been removed.

The value of *k* is chosen most conveniently by a threshold rule. For example, in the analyses reported below, we removed any groups that did not achieve at least 1% of allocated individuals.
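The procedure with a threshold rule can be sketched as below. The sketch assumes a user-supplied function `classify(x, groups)` that returns the label of the group, among those listed, with the highest posterior probability for *x*; this function and the names used are hypothetical.

```python
from collections import Counter


def leave_k_in(test_points, groups, classify, threshold=0.01):
    """Two-pass allocation that removes sparsely populated groups.

    Returns the final labels and the list of groups that were retained.
    """
    # Pass 1: allocate every test observation among all groups.
    first = [classify(x, groups) for x in test_points]
    counts = Counter(first)
    # Retain groups that received at least `threshold` of the allocations;
    # groups with no allocations are removed by the same comparison.
    cutoff = threshold * len(test_points)
    keep = [g for g in groups if counts[g] >= cutoff]
    # Pass 2: reallocate only observations whose group was removed.
    final = [label if label in keep else classify(x, keep)
             for x, label in zip(test_points, first)]
    return final, keep
```

Observations already allocated to a retained group keep their label, so only the small reallocated subset requires further computation.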

Given the hierarchical structure of species within taxonomic groups, an alternative approach is to adopt a two-stage discriminant procedure. The first stage classifies an organism into a taxonomic group and the second stage classifies it into a species within the selected taxonomic group. Any misclassification at the first stage is irredeemable so this method relies on good separability of taxa.
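A minimal sketch of the two-stage rule follows, assuming equal priors at both stages and assuming that density estimates are available as callables for each taxonomic group and for each species within a group. The dictionary layout and function name are hypothetical.

```python
def two_stage_classify(x, taxon_density, species_density):
    """Allocate x to a taxonomic group, then to a species within that group.

    taxon_density: {taxon: callable returning the estimated density at x}.
    species_density: {taxon: {species: callable returning the density at x}}.
    """
    # Stage 1: choose the taxonomic group with the highest density at x.
    taxon = max(taxon_density, key=lambda t: taxon_density[t](x))
    # Stage 2: compare only the species that belong to the chosen group.
    members = species_density[taxon]
    species = max(members, key=lambda s: members[s](x))
    return taxon, species
```

Because stage 2 never looks outside the chosen group, a wrong choice at stage 1 cannot be corrected later.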

The second special problem area is that of encountering “new” species. It is possible that any given specimen for allocation does not belong to one of the groups in the training data and we must avoid misallocating it to one of these groups. Hermans et al. (31) suggested computing *t*_{j} = −2log *f̂*(*x* | *A*_{j}) as a measure of typicality of *x* for group *A*_{j}. *f̂*(*x* | *A*_{j}) lies between 0 and 1; it takes a very small value when *x* is “untypical” of *A*_{j} and a much larger value when *x* is in the “center” of *A*_{j}. Thus, *t*_{j} is very large in the former case and much smaller in the latter case. Comparison of the value of this quantity for all training and test elements gives an indication of which individuals might be outlying. To effect this comparison, we suggest the following simple statistic. Let mean(−2log *f̂*(*y* | *y* ∈ *A*_{j})) denote the mean of all *t*_{j} values computed for the training set individuals in group *A*_{j}. Then let

If *x* is in *A*_{j}, then −2log *f̂*(*x* | *A*_{j}) will be close to mean(−2log *f̂*(*y* | *y* ∈ *A*_{j})). If *x* is far from *A*_{j}, the former will be much larger than the latter. Individuals *x* whose *a*_{j}(*x*) value is close to 1 can be viewed as being “typical” of group *A*_{j}, whereas those with values close to 0 are highly “atypical” of this group.
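The sketch below shows one way such an index could be computed. It assumes that *a*_{j}(*x*) is the ratio of the mean training-set value of *t*_{j} to the value of *t*_{j} at *x*. That ratio form is an assumption, chosen only because it reproduces the behavior described (values near 1 for typical individuals, values near 0 for atypical ones). The sketch also assumes density values below 1, so that every *t*_{j} is positive; the function names are hypothetical.

```python
import math


def typicality_scores(density, training_points):
    """t values, -2 log f_hat(y | A_j), for the training individuals of a group."""
    return [-2.0 * math.log(density(y)) for y in training_points]


def typicality_index(x, density, training_points):
    """ASSUMED ratio form: mean training t value divided by the t value of x.

    Requires 0 < density(.) < 1 so that all t values are positive.
    """
    scores = typicality_scores(density, training_points)
    mean_t = sum(scores) / len(scores)
    return mean_t / (-2.0 * math.log(density(x)))
```

Under this assumed form, an individual far from the group has a very large *t*_{j} and hence an index close to 0.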

### MATERIALS AND METHODS

We used the FACSort flow cytometer (Becton Dickinson, San Jose, CA). Particle characterization is given by a selection of measurements from a restricted list of possibilities (32). Our data comprised the following five variables: forward light scatter (FSC), 90° light scatter, depolarized/horizontally polarized light scatter, and orange and red fluorescence. FSC is used as a relative particle sizing parameter; its signal is affected by shape, density, pigmentation, granularity, and refractive index. Ninety-degree light scatter is used as a measure of particle refractive index or internal granularity. Depolarized light scatter is an indicator of a particular group of phytoplankton, the Coccolithophores. Orange fluorescence measures cellular accessory pigments, and is likewise an indicator of a particular group of phytoplankton, the Cryptomonads. Finally, red fluorescence measures the fluorescence of cellular chlorophyll, the dominant photosynthetic pigment.

Ten thousand observations were acquired for each of 65 species of phytoplankton. The 65 species were grouped into the following eight taxonomic classes, thereby producing the hierarchical population structure: Cryptomonadida (12 species), Prasinomonadida (11 species), Dinoflagellida (14 species), Prymnesiida (12 species), Bacillariophyceae (5 species), Volvocida (6 species), Chrysomonadida (3 species), and Rhodomonadida (2 species).

The benefit of having such large data sets is that they make the assessment of discriminant rules fairly straightforward. With 10,000 individuals in each sample, it suffices to split all samples randomly into two sets. One half of the data (the training set) is used for developing the rule and the other half (the test set) is used for assessing its performance by finding the proportions misclassified from each population. The large training and test sets ensure both a stable classification rule and an accurate estimate of misclassification probabilities. It is therefore not necessary to incorporate computer-intensive procedures (e.g., bootstrapping, jackknifing, or leave-one-out cross-validation) into the assessment process.
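The assessment scheme can be sketched as follows, assuming each population's observations are held in a list and that a fitted rule is available as a function `classify(x)` returning a label. The function names and the `seed` argument are hypothetical.

```python
import random


def split_half(sample, seed=0):
    """Randomly split a sample into training and test halves."""
    rng = random.Random(seed)
    shuffled = list(sample)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def misclassification_rates(test_sets, classify):
    """Proportion misclassified for each population.

    test_sets: {true label: list of test observations from that population}.
    """
    return {label: sum(classify(x) != label for x in rows) / len(rows)
            for label, rows in test_sets.items()}
```

With 10,000 observations per population, `split_half` yields 5,000 training and 5,000 test observations for each.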

We compared the performances of the different methods either mentioned or described above when applied to relatively small subsets of the AFC data. Exploratory univariate and bivariate analyses confirmed the nonnormality of many of the species. For example, for the particular species *Prorocentrum nanum*, the FSC measurements exhibit multimodality, both 90° light scatter and depolarized light scatter exhibit differing degrees of skewness, orange fluorescence is bimodal with an extended left-hand tail, and red fluorescence has a long right-hand tail. Nevertheless, we decided to include linear discriminant analysis (LDA) and QDA as benchmarks for the nonparametric methods. The five methods compared are wavelet A (two-dimensional projection pursuit using wavelets), wavelet B (two-dimensional projection pursuit using binned wavelets), kernel (five-dimensional fixed kernel), LDA, and QDA.

To subject these methods to a comprehensive investigation, we applied them to a large number of different subsets of the full data set. First, we focused on subsets containing two, three, four, five, and six species. For each case, we chose three sets of species, distinguished from each other by the degree of overlap among the species (as measured by the error rates incurred using LDA). The first set exhibited good separation between species (LDA error rates around 5%), the second had moderate overlap (LDA error rates around 20%), and the third had substantial overlap between species (LDA error rates around 35%). Given the large number of species in the full data set, it was relatively easy to identify sets of species that satisfied these requirements. We then conducted discriminant analyses for the 15 different situations, applying each of the five discrimination methods to each data set.

Next, we examined situations that reflect more closely the structure of phytoplankton data, viz, the presence of large numbers of hierarchically structured groups. To investigate the performance of the two approaches suggested above, we performed two experiments. The training data in each case represented a large number of groups spread across many of the taxonomic classes specified above (to provide diversity at both levels of the hierarchy). In experiment 1, we chose 1,000 observations from each of 28 species from seven taxonomic classes. In experiment 2, we chose 5,000 observations from each of 57 species from all eight taxonomic classes. Six test sets were chosen for classification in each experiment, in such a way as to represent different degrees of diversity of species across taxonomic classes. These test sets had similar structure in both experiments: set 1 (five species from the same taxonomic class, 1,000 recordings per species); set 2 (six species from the same taxonomic class, 1,000 recordings per species); set 3 (three pairs of species, with both species from the same taxonomic class in two pairs but each species from different taxonomic classes in the third pair, either 1,000 or 2,000 recordings per species); set 4 (four randomly chosen species, i.e., unpredictable taxonomic classes, with, respectively, 9,000, 4,000, 2,000, and 500 recordings); set 5 (five randomly chosen species, 1,000 recordings per species); and set 6 (six randomly chosen species with, respectively, 9,000, 5,000, 3,000, 2,000, 1,000, and 500 recordings).

To provide an easily computable baseline that did not take hierarchical structure into account, we conducted a straightforward five-dimensional kernel discriminant analysis on the training data and subsequently classified the observations in each test set for each experiment. We then implemented, in turn, the leave-*k*-in method and the two-stage hierarchical method, using two-dimensional projection-pursuit binned wavelets for the classification of the test sets. Finally, we conducted some simple experiments to investigate the utility of the proposed typicality measure.