Keywords:

  • atypicality;
  • binning;
  • classification;
  • density estimation;
  • projection pursuit;
  • wavelets

Abstract

  1. Top of page
  2. Abstract
  3. NONPARAMETRIC DISCRIMINATION
  4. MATERIALS AND METHODS
  5. RESULTS AND DISCUSSION
  6. CONCLUSION
  7. Acknowledgements
  8. LITERATURE CITED

Background

Analytical flow cytometry (AFC) provides rapid and accurate measurement of particles from heterogeneous populations. AFC has been used to classify and identify phytoplankton species, but most methods of discriminant analysis of resulting data have depended on normality assumptions and outcomes have been disappointing.

Methods and Results

In this study, we consider nonparametric methods based on density estimation. In addition to the familiar kernel method, methods based on wavelets are also implemented. Full five-dimensional wavelet estimation proves to be computationally prohibitive with current workstation power, so we employ projection pursuit for reduction of dimensionality. AFC typically produces very large samples, so we also investigate data simplification through binning. Further modifications to the discrimination strategy are suggested by specific features of phytoplankton data, namely, a hierarchical group structure, the possible presence of many groups, and the likelihood of encountering an aberrant group in a test sample.

Conclusions

We apply all the resultant procedures to appropriate subsets of a very large data set, demonstrate their efficacy, and compare their error rates with those of more conventional methods. We further show that incorporation of the specific features of phytoplankton data into the analysis leads to improved results and provides a general framework for analysis of such data. Cytometry 48:26–33, 2002. © 2002 Wiley-Liss, Inc.

Phytoplankton are microscopic, photosynthetic organisms floating in the sea. They are essential to marine systems and, consequently, much effort and expense have been invested in the study of their distribution and abundance. The development of analytical flow cytometry (AFC) in biomedical research, and its subsequent introduction to the aquatic sciences, has provided significant impetus to these studies. AFC is an efficient technique for the rapid and accurate characterization and sorting of particles from heterogeneous populations (1). The AFC processing of marine samples on board ship (2–4) has improved considerably the speed and accuracy of data collection but has, in turn, provoked a commensurate demand for development of effective statistical methods of analysis.

One problem is that the speed at which the AFC measures single cells is such that very large numbers of phytoplankton can be processed in short time periods, yielding very large data sets. The main drawback, however, is that phytoplankton populations contain many more cell types than are found in biomedical applications. Although the latter populations can usually be discriminated readily using two or three flow cytometric variables, discrimination of phytoplankton populations has proved to be much more complicated. Early attempts (5) used descriptive techniques such as histograms and bivariate scatterplots, which failed to exploit the multivariate nature of the data. Ignoring the a priori group structure enables the use of principal component analysis for dimensionality reduction to be followed by such techniques as clustering and mixture modeling (4, 6), but this does not address the population discrimination question. More appropriate methods such as back-propagation neural networks for supervised classification (7) or canonical variate analysis and quadratic discriminant analysis (QDA; 8) have shown some success, but have also highlighted sufficient limitations (e.g., inappropriate assumptions of normality and common dispersion matrices) to encourage further investigation.

To summarize the position following the work to date, there are three main problem areas that need to be addressed. First, the assumption of normal distributions that underlies the application of linear or quadratic discriminant functions is questionable in some phytoplankton populations, with a number of them exhibiting marked departures from normality including bimodality. One of the recommendations made by Carr et al. (8) was that nonparametric methods should be tried, but results using either k-nearest neighbors or classification and regression trees have proved to be relatively impractical (7). Second, marine samples will typically contain a relatively small selection of populations taken (randomly) from a very large set, whereas most traditional applications of discriminant analysis are concerned with a small number of fixed and specified populations. An additional property of phytoplankton populations is that they fall into a natural hierarchy comprising species within taxonomic groups. Special methods may be needed to cope with these aspects. Third, marine samples may also contain a “new” species for which no prior data are available. It is important to recognize these individuals as distinct from the rest, and not to misclassify them into existing species.

This study addresses all three areas, but places greatest emphasis on the first problem area. We tackle nonparametric modeling of the data and consider both kernel and wavelet methods for estimating the densities necessary in the construction of discriminant functions. There are arguments for and against each approach. AFC data contain a lot of noise due to cell debris, dead cells, and bacteria, which suggests that wavelet-based density function estimation might have advantages over kernel-based methods. On the other hand, the relatively low dimensionality of AFC data might reverse the position on computational grounds. The computational burden is not too great when obtaining kernel density estimates and discriminant functions can be built directly in this case, but wavelet density estimation is computationally much more demanding, particularly when the dimensionality rises beyond two or three. Current computing power precludes direct treatment of dimensionalities as high as five. Therefore, we examine ideas such as binning and projection pursuit for producing tractable discriminant functions with the wavelet approach. As far as we can tell, none of these ideas has been applied yet to the analysis of AFC data. In view of the technicalities involved, we first review some of the underlying theory.

NONPARAMETRIC DISCRIMINATION

Denote by x the set of flow cytometric values for an individual (cell) that is to be classified, and assume that this individual belongs to one particular group out of a set of g groups Aj (j = 1, …, g). The usual approach to classification (9, 10) requires computation of the posterior probabilities p(Aj | x) of membership of group Aj by individual x for j = 1, …, g, followed by allocation of x to the group whose posterior probability is highest. The posterior probabilities depend in turn on the prior probabilities πj of occurrence of group Aj and on the probability densities f(x | Aj) of observing x in each group, through Bayes's theorem as

  $$p(A_j \mid x) = \frac{\pi_j \, f(x \mid A_j)}{\sum_{k=1}^{g} \pi_k \, f(x \mid A_k)}, \qquad j = 1, \ldots, g.$$

To follow this procedure in practice, we need to estimate the p(Aj | x) from training data, which in turn requires estimates π̂j and f̂(x | Aj) of the πj and f(x | Aj), respectively. A parametric approach requires assumption of a parametric form for these densities, with parameters estimated from the training data. However, as noted earlier (8), assuming normal distributions (which leads to the familiar linear and quadratic discriminant functions) may not be suitable for AFC data, so we examine nonparametric methods. The πj may be estimated from proportions observed in the training data, taken from prior knowledge, or all set equal, in which case they cancel in the above equation and can be removed from consideration. The sampling mechanism generated 10,000 observations from each population in the present study (see below), so we used the last-named approach in all our analyses.
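As a minimal sketch of the allocation rule just described, the following Python fragment turns a list of group-conditional density values f̂(x | Aj) into posterior probabilities and picks the group with the highest posterior. The function names `posteriors` and `allocate` are illustrative choices, not names from the study, and the density values are assumed to have been computed already by whichever estimator is in use.

```python
def posteriors(densities, priors=None):
    """Posterior probabilities p(A_j | x) from the densities f(x | A_j).

    densities: one density value per group, all evaluated at the same x.
    priors:    prior probabilities pi_j; None means equal priors, in which
               case the priors cancel in Bayes's theorem.
    """
    g = len(densities)
    if priors is None:
        priors = [1.0 / g] * g
    weighted = [p * f for p, f in zip(priors, densities)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("x has zero estimated density under every group")
    return [w / total for w in weighted]


def allocate(densities, priors=None):
    """Index of the group whose posterior probability is highest."""
    post = posteriors(densities, priors)
    return max(range(len(post)), key=post.__getitem__)
```

With equal priors the allocation reduces to choosing the group with the largest density estimate at x, so the normalizing sum matters only when the posterior values themselves are wanted.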

Kernel methods provide perhaps the most popular approach to estimation of f(x | Aj). First introduced by Fix and Hodges (11), the methods are now well established and thoroughly researched. A large number of variants can be found in the literature; for a general overview see Silverman (12) and for details of the multivariate case see Scott (13).

For implementation on the AFC data, we used a variety of estimators. The basic one was the product Gaussian kernel with fixed but separate bandwidths for each variable (13). Choice of bandwidth can be made in various ways (14, 15). However, with samples as large as the ones in this study, none seemed to offer any material advantage over the universal rule-of-thumb value σi n^(−1/(d + 4)), where d is the number of variables, n is the sample size, and σi is the SD of the variable under consideration (13). The latter approach provided a simple but effective density estimate. To check that we were not missing important improvements, we also implemented more sophisticated methods using variable bandwidths (16) and balloon estimators (17), but no material improvements were obtained. For conciseness, we report the results for the basic method.
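The basic estimator can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, the SD is computed with the n − 1 divisor (the text does not say which divisor was used), and the sample is a plain list of equal-length rows.

```python
import math


def rule_of_thumb_bandwidths(sample):
    """Bandwidth h_i = sigma_i * n**(-1/(d + 4)) for each variable i."""
    n, d = len(sample), len(sample[0])
    bandwidths = []
    for i in range(d):
        column = [row[i] for row in sample]
        mean = sum(column) / n
        # Assumption: sample SD with the n - 1 divisor.
        variance = sum((v - mean) ** 2 for v in column) / (n - 1)
        bandwidths.append(math.sqrt(variance) * n ** (-1.0 / (d + 4)))
    return bandwidths


def product_gaussian_kde(x, sample, bandwidths):
    """Product Gaussian kernel density estimate at the point x."""
    # Normalizing constant: product over variables of h_i * sqrt(2 * pi).
    norm = 1.0
    for h in bandwidths:
        norm *= h * math.sqrt(2.0 * math.pi)
    total = 0.0
    for row in sample:
        q = sum(((xi - ri) / h) ** 2 for xi, ri, h in zip(x, row, bandwidths))
        total += math.exp(-0.5 * q)
    return total / (len(sample) * norm)
```

Each evaluation visits every training observation, so the cost per classified cell grows linearly with the size of the training sample.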

A very different approach to nonparametric density estimation was introduced by Cencov (18), who proposed approximating the unknown density by a convergent orthogonal series expansion and estimating the coefficients of the orthonormal functions from the available data. This idea was revived sporadically over the following three decades, but has suddenly come into its own as a consequence of the recent interest shown in a particular family of orthonormal basis functions known as wavelets. The application of wavelet methods to the estimation of a probability density was first proposed by Doukhan and Leon (19). Theoretical developments were reported by Johnstone et al. (20) and Donoho et al. (21), practical aspects were discussed by Tribouley (22) and Vannucci (23), and Abramovich et al. (24) provided a general overview.

The starting point for wavelet methods is the choice of two related, mutually orthonormal, functions: the scaling function (sometimes called the father wavelet), ϑ, and the mother wavelet, ψ. A variety of pairs of functions are now available (24); we have generally used those provided by Daubechies (25). Phytoplankton data exhibit no irregular behavior such as discontinuities or high-frequency oscillations, so we implemented the simplified approach known as the linear estimator using the computational method described by Pinheiro and Vidakovic (26).

Unfortunately, the speed of the flow cytometer process means that AFC data sets are very large, which places an enormous computational burden on the estimation of densities once the dimensionality rises beyond 3. Similarly, large data sets are encountered in other areas of statistics (e.g., data mining), so it is worth seeking an approximation to the standard methods that is computationally efficient. Encouraged by the success of discretization, or binning, in kernel density estimation (27, 28), we implemented a binned wavelet density estimator (29). The range of each variable is divided into a number of (equally spaced) intervals, the conjunction of which defines a finite set of bins, and each individual is allocated uniquely into one of the bins. Computation now requires consideration of only this finite set of bins, a number usually much smaller than the size of the sample, and is therefore hugely reduced. Even so, the computational burden remained high. As a result, we also considered projection pursuit methodology for dimension reduction.
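The binning step alone can be illustrated as follows. This sketch covers only the allocation of individuals to bins and the resulting bin counts; the binned wavelet estimator itself (29) is not reproduced. The function name `bin_data` and the choice to clamp values on the upper boundary into the last bin are assumptions made for the example.

```python
from collections import Counter


def bin_data(sample, lows, highs, n_bins):
    """Allocate each observation to one bin and count bin occupancy.

    Each variable's range [low, high] is divided into n_bins equally
    spaced intervals; a bin is identified by its tuple of interval indices.
    """
    counts = Counter()
    for row in sample:
        index = []
        for value, low, high in zip(row, lows, highs):
            k = int((value - low) / (high - low) * n_bins)
            # Assumption: values on or beyond the edges go to the end bins.
            index.append(min(max(k, 0), n_bins - 1))
        counts[tuple(index)] += 1
    return counts
```

Subsequent computation can then loop over the occupied bins and their counts instead of over every observation, which is where the saving comes from.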

The idea here (13) is to find an optimal projection of the data from the original number of dimensions to a smaller number, and then to carry out the nonparametric discriminant analysis in this smaller dimensionality where computation is much easier. We decided to look for two-dimensional solutions, as computing times are fast for bivariate data. We believe that much of the between-species differences of the original data will reside in suitably chosen two-dimensional subspaces.

A simple procedure to effect the process is as follows. Suppose first that the n available observations are divided into a training and a test set of na and nb observations, respectively, and that p cytometric variables have been measured. Let Y denote the na × p training data matrix, and let A be an arbitrary p × 2 projection matrix with elements aij. For a particular set of values of the aij, the matrix Z = YA contains the corresponding projection of the data in two dimensions. This matrix can be divided randomly into two portions; wavelet density estimation plus associated calculation of a classification rule can be conducted on the first portion and an overall classification error rate can be determined from the second portion. This error rate is the “output” corresponding to the aij values that were the “input”. A standard optimization algorithm such as the Nelder and Mead simplex method (30) can then be used to find the aij values (i.e., projection) and resultant classification rule that minimize the error rate. Finally, applying this rule to the original test set of nb observations gives an unbiased assessment of its performance. The wavelet density estimation at the heart of this scheme can use either the raw or the binned data. We implemented both varieties. We report results using this method alongside the other methods below. The above methodology involves extensive theoretical and computational development, which is not described here (Collins, unpublished results).
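The objective function at the centre of this scheme can be sketched as below. Two simplifications are assumed and should be read as such: a bivariate Gaussian kernel estimate stands in for the wavelet estimator, and the optimizer is left out, so the code only maps a set of aij values (the “input”) to an error rate (the “output”). All function names, the bandwidth `h`, and the fixed seed are hypothetical choices for the illustration.

```python
import math
import random


def project(Y, a):
    """Z = Y A, where the flat list a holds the p x 2 matrix A row by row."""
    return [(sum(y[i] * a[2 * i] for i in range(len(y))),
             sum(y[i] * a[2 * i + 1] for i in range(len(y)))) for y in Y]


def kde2(z, points, h):
    """Bivariate Gaussian kernel estimate (stand-in for the wavelet estimator)."""
    if not points:
        return 0.0
    s = sum(math.exp(-0.5 * (((z[0] - p[0]) / h) ** 2 + ((z[1] - p[1]) / h) ** 2))
            for p in points)
    return s / (len(points) * 2.0 * math.pi * h * h)


def error_rate(a, Y, labels, h=0.5, seed=0):
    """Classification error rate obtained with the projection defined by a."""
    Z = project(Y, a)
    order = list(range(len(Z)))
    random.Random(seed).shuffle(order)   # fixed seed: the same split for every a
    half = len(order) // 2
    fit, check = order[:half], order[half:]
    groups = sorted(set(labels))
    # Density estimation uses the first portion only.
    points = {g: [Z[i] for i in fit if labels[i] == g] for g in groups}
    wrong = 0
    for i in check:
        # Equal priors: allocate to the group with the largest density.
        guess = max(groups, key=lambda g: kde2(Z[i], points[g], h))
        wrong += guess != labels[i]
    return wrong / len(check)
```

Any derivative-free minimizer can then be applied to `error_rate` as a function of `a`; one option, if SciPy is available, is `scipy.optimize.minimize(error_rate, a0, args=(Y, labels), method="Nelder-Mead")` with a starting vector `a0` of length 2p. Because the error rate is piecewise constant in `a`, several starting points may be needed in practice.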

The first special problem associated with AFC data is the large number of groups. Our basic data set contains 65 groups, all of which potentially need to be discriminated from each other. Including them all produces overload and runs the risk of substantially inflating error rates. One way of solving this is to exploit the features of the phytoplankton themselves. We suggest two possible approaches.

There are usually about five or six phytoplankton species in any one natural sample. The organisms populate the seas in large groups rather than singly. Therefore, if one species is observed, then we would expect more of the same species to be in the immediate surrounding area. This is especially true for one particular taxon, the Diatoms, which drift in the sea as chains, single-celled species interlocking with each other. Consequently, if only a small number of test observations are classified into a certain group, we need to question the allocation procedure and remodel the situation.

A suggested modification of the earlier discriminant rules follows. We term this process the “leave-k-in” approach. First, construct the group-conditional density estimates f̂(x | Aj), obtain the posterior probabilities of group membership p(Aj | x), and allocate all the test observations accordingly. Second, eliminate from the training data any groups that have no classified observations. Third, for the remaining groups, pick out the k largest groups in terms of classified observations and remove groups that have small numbers of classified test observations. Finally, reclassify the test observations that were classified previously into the groups that have now been removed.

The value of k is chosen most conveniently by a threshold rule. For example, in the analyses reported below, we removed any groups that did not achieve at least 1% of allocated individuals.
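The leave-k-in steps with a threshold rule can be sketched as follows. The callable `classify(x, candidate_groups)` is hypothetical: it stands for whatever posterior-probability rule is in use, restricted to the listed groups. The default threshold of 0.01 corresponds to the 1% rule mentioned above.

```python
from collections import Counter


def leave_k_in(test_x, groups, classify, threshold=0.01):
    """Classify, drop sparsely populated groups, reclassify the displaced.

    Returns the final allocations and the list of retained groups.
    """
    # Step 1: allocate every test observation using all groups.
    first = [classify(x, groups) for x in test_x]
    counts = Counter(first)
    # Steps 2 and 3: keep only groups reaching the threshold share; groups
    # with no classified observations have a count of 0 and drop out too.
    keep = [g for g in groups if counts[g] >= threshold * len(test_x)]
    # Step 4: reclassify observations whose group has been removed.
    final = [label if label in keep else classify(x, keep)
             for x, label in zip(test_x, first)]
    return final, keep
```

The number of retained groups, `len(keep)`, plays the role of k; it is an outcome of the threshold rather than a value fixed in advance.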

Given the hierarchical structure of species within taxonomic groups, an alternative approach is to adopt a two-stage discriminant procedure. The first stage classifies an organism into a taxonomic group and the second stage classifies it into a species within the selected taxonomic group. Any misclassification at the first stage is irredeemable so this method relies on good separability of taxa.
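A minimal sketch of the two-stage rule, assuming equal priors at both stages so that the highest posterior coincides with the highest density. The callables `taxon_density(x, taxon)` and `species_density(x, species)` are hypothetical placeholders for the fitted density estimates.

```python
def two_stage_classify(x, taxa, taxon_density, species_density):
    """Allocate x to a taxonomic group, then to a species within it.

    taxa: dict mapping each taxonomic group to its list of species.
    """
    # Stage 1: choose the taxonomic group.
    taxon = max(taxa, key=lambda t: taxon_density(x, t))
    # Stage 2: only species of the chosen group are considered, so an
    # error at stage 1 cannot be corrected here.
    species = max(taxa[taxon], key=lambda s: species_density(x, s))
    return taxon, species
```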

The second special problem area is that of encountering “new” species. It is possible that any given specimen for allocation does not belong to one of the groups in the training data and we must avoid misallocating it to one of these groups. Hermans et al. (31) suggested computing tj = −2log f̂(x | Aj) as a measure of typicality of x for group Aj. f̂(x | Aj) lies between 0 and 1; it takes a very small value when x is “untypical” of Aj and a much larger value when x is in the “center” of Aj. Thus, tj is very large in the former case and much smaller in the latter case. Comparison of the value of this quantity for all training and test elements gives an indication of which individuals might be outlying. To effect this comparison, we suggest the following simple statistic. Let mean(−2log f̂(y | y ∈ Aj)) denote the mean of all tj values computed for the training set individuals in group Aj. Then let

  • equation image

If x is in Aj, then −2log f̂(x | Aj) will be close to mean(−2log f̂(y | y ∈ Aj)). If x is far from Aj, the former will be much larger than the latter. Individuals x whose aj(x) value is close to 1 can be viewed as being “typical” of group Aj, whereas those with values close to 0 are highly “atypical” of this group.
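The two ingredients of the comparison, tj and its mean over the training individuals of a group, can be computed as in the sketch below. The sketch deliberately stops short of the index aj(x) itself and only shows the quantities being compared. The function names are hypothetical, and the density estimates are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def t_value(density_at_x):
    """t_j = -2 log f_hat(x | A_j); requires a strictly positive density."""
    return -2.0 * math.log(density_at_x)


def mean_training_t(training_densities):
    """Mean of the t_j values over the training individuals of group A_j.

    training_densities: f_hat(y | A_j) evaluated at each training y in A_j.
    """
    return sum(t_value(f) for f in training_densities) / len(training_densities)
```

A new individual whose `t_value` lies far above `mean_training_t` for every group in the training data is a candidate member of a “new” species.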

MATERIALS AND METHODS

We used the FACSort flow cytometer (Becton Dickinson, San Jose, CA). Particle characterization is given by a selection of measurements from a restricted list of possibilities (32). Our data comprised the following five variables: forward light scatter (FSC), 90° light scatter, depolarized/horizontally polarized light scatter, and orange and red fluorescence. FSC is used as a relative particle sizing parameter; its signal is affected by shape, density, pigmentation, granularity, and refractive index. Ninety-degree light scatter is used as a measure of particle refractive index or internal granularity. Depolarized light scatter is an indicator of a particular group of phytoplankton, the Coccolithophores. Orange fluorescence measures cellular accessory pigments, and is likewise an indicator of a particular group of phytoplankton, the Cryptomonads. Finally, red fluorescence measures the fluorescence of cellular chlorophyll, the dominant photosynthetic pigment.

Ten thousand observations were acquired for each of 65 species of phytoplankton. The 65 species were grouped into the following eight taxonomic classes, thereby producing the hierarchical population structure: Cryptomonadida (12 species), Prasinomonadida (11 species), Dinoflagellida (14 species), Prymnesiida (12 species), Bacillariophyceae (5 species), Volvocida (6 species), Chrysomonadida (3 species), and Rhodomonadida (2 species).

The benefit of having such large data sets is that they make the assessment of discriminant rules fairly straightforward. With 10,000 individuals in each sample, it suffices to split all samples randomly into two sets. One half of the data (the training set) is used for developing the rule and the other half (the test set) is used for assessing its performance by finding the proportions misclassified from each population. The large training and test sets ensure both a stable classification rule and an accurate estimate of misclassification probabilities. It is therefore not necessary to incorporate computer-intensive procedures (e.g., bootstrapping, jackknifing, or leave-one-out cross-validation) into the assessment process.
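The assessment scheme amounts to a random half split followed by a per-population tally of errors. A minimal sketch, with hypothetical function names and a fixed seed chosen only to make the example reproducible:

```python
import random


def split_half(sample, seed=0):
    """Randomly split one population's sample into training and test halves."""
    rows = list(sample)
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    return rows[:half], rows[half:]


def misclassification_rates(true_labels, predicted_labels):
    """Proportion of test individuals misclassified from each population."""
    totals, wrong = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        wrong[truth] = wrong.get(truth, 0) + (truth != guess)
    return {g: wrong[g] / totals[g] for g in totals}
```

With 10,000 individuals per species, `split_half` gives 5,000 for training and 5,000 for testing in each population.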

We compared the performances of the different methods either mentioned or described above when applied to relatively small subsets of the AFC data. Exploratory univariate and bivariate analyses confirmed the nonnormality of many of the species. For example, for the particular species Prorocentrum nanum, the FSC measurements exhibit multimodality, both 90° light scatter and depolarized light scatter exhibit differing degrees of skewness, orange fluorescence is bimodal with an extended left-hand tail, and red fluorescence has a long right-hand tail. Nevertheless, we decided to include linear discriminant analysis (LDA) and QDA as benchmarks for the nonparametric methods. The five methods compared are wavelet A (two-dimensional projection pursuit using wavelets), wavelet B (two-dimensional projection pursuit using binned wavelets), kernel (five-dimensional fixed kernel), LDA, and QDA.

To subject these methods to a comprehensive investigation, we applied them to a large number of different subsets of the full data set. First, we focused on subsets containing two, three, four, five, and six species. For each case, we chose three sets of species, distinguished from each other by the degree of overlap among the species (as measured by the error rates incurred using LDA). The first set exhibited good separation between species (LDA error rates around 5%), the second had moderate overlap (LDA error rates around 20%), and the third had substantial overlap between species (LDA error rates around 35%). Given the large number of species in the full data set, it was relatively easy to identify sets of species that satisfied these requirements. We then conducted discriminant analyses for the 15 different situations, applying each of the five discrimination methods to each data set.

Next, we examined situations that reflect more closely the structure of phytoplankton data, viz, the presence of large numbers of hierarchically structured groups. To investigate the performance of the two approaches suggested above, we performed two experiments. The training data in each case represented a large number of groups spread across many of the taxonomic classes specified above (to provide diversity at both levels of the hierarchy). In experiment 1, we chose 1,000 observations from each of 28 species from seven taxonomic classes. In experiment 2, we chose 5,000 observations from each of 57 species from all eight taxonomic classes. Six test sets were chosen for classification in each experiment, in such a way as to represent different degrees of diversity of species across taxonomic classes. These test sets had similar structure in both experiments: set 1 (five species from the same taxonomic class, 1,000 recordings per species); set 2 (six species from the same taxonomic class, 1,000 recordings per species); set 3 (three pairs of species, with both species from the same taxonomic class in two pairs but each species from different taxonomic classes in the third pair, either 1,000 or 2,000 recordings per species); set 4 (four randomly chosen species, i.e., unpredictable taxonomic classes, with, respectively, 9,000, 4,000, 2,000, and 500 recordings); set 5 (five randomly chosen species, 1,000 recordings per species); and set 6 (six randomly chosen species with, respectively, 9,000, 5,000, 3,000, 2,000, 1,000, and 500 recordings).

To provide an easily computable baseline that did not take hierarchical structure into account, we conducted a straightforward five-dimensional kernel discriminant analysis on the training data and subsequently classified the observations in each test set for each experiment. We then implemented, in turn, the leave-k-in method and the two-stage hierarchical method, using two-dimensional projection-pursuit binned wavelets for the classification of the test sets. Finally, we conducted some simple experiments to investigate the utility of the proposed typicality measure.

RESULTS AND DISCUSSION

Group Discriminability: Small Number of Groups

Results of the first investigation are shown in Table 1. Percentage error rates over all groups of the test data are quoted for each condition and each method. In agreement with results from other discrimination studies in the literature, these error rates have to be viewed as descriptive measures of success of a method. There is no experimental basis for establishing significance of either absolute values or differences between pairs of values.

Table 1. Percentage Error Rates on Test Data Sets

Number of groups   Method      Small overlap   Medium overlap   Large overlap
2                  Wavelet A    0.48           16.09            19.60
                   Wavelet B    0.30           15.75            18.72
                   Kernel       0.31           14.72            16.81
                   LDA          0.87           18.80            35.20
                   QDA          0.84           21.50            42.80
3                  Wavelet A    5.32           11.42            25.62
                   Wavelet B    4.16           10.98            24.40
                   Kernel       6.26            7.94            19.90
                   LDA          5.80           18.28            42.28
                   QDA          4.86           16.14            48.60
4                  Wavelet A    0.46           20.38            24.93
                   Wavelet B    0.45           19.30            24.04
                   Kernel       2.35           18.14            23.60
                   LDA          3.30           19.24            34.36
                   QDA          3.10           21.16            37.88
5                  Wavelet A    3.26           11.76            28.08
                   Wavelet B    3.20           11.52            27.24
                   Kernel       1.04           11.08            25.32
                   LDA          2.72           22.32            32.40
                   QDA          0.96           12.36            34.72
6                  Wavelet A    1.86           15.34            31.03
                   Wavelet B    1.28           14.95            30.54
                   Kernel       1.47           11.23            27.45
                   LDA          5.52           17.08            32.80
                   QDA          2.00           12.66            32.33

The major differences occur among the three categories of species overlap, but there is no discernible systematic trend among discrimination methods within these categories. It is noticeable that the nonparametric methods outperform those based on normality assumptions in most situations, supporting the conclusions of Carr et al. (8) that led to the present investigation. Interestingly, however, QDA is very competitive for the larger numbers of groups, i.e., five and six, when they are either well or moderately well separated. LDA, on the other hand, performs relatively poorly always and sometimes extremely poorly.

Of the nonparametric methods, the full-dimensional kernel approach is clearly the one to be favored in general. However, the low-dimensional projection pursuit methods based on wavelets perform extremely creditably, and prove to be the winners when groups are well separated. This is presumably because the projection pursuit aspect is successful in identifying the low-dimensional subspace primarily responsible for the group separation. No doubt a combination of projection pursuit and kernel density estimation would also be effective, but was not implemented given the other results. It is also noticeable that for such large amounts of data, the methods that employ wavelet binning are uniformly more successful than the ones that use the raw data.

Group Discriminability: Large Numbers of Groups and Hierarchical Group Structure

Taking a 1% threshold for exclusion of groups in the first method gave typical k values of 8 or 9. There are far too many details relating to the results for reporting here (such as precisely which species are removed during the first method and misclassification rates at separate stages of the hierarchy in the second method; Collins, unpublished results). For this study, we report the overall misclassification rates for each test set for each experiment (Table 2).

Table 2. Percentage Error Rates on Six Test Data Sets in Each of Two Experiments

Test set   Kernel   Leave-k-in   Hierarchical
Experiment 1
1          11.02    15.18         9.36
2          19.73    16.33        15.50
3          16.41    12.34        12.62
4          16.55    15.23        11.80
5          16.50    14.48        11.94
6           9.76    17.52         8.72
Experiment 2
1          23.58    21.81        21.34
2          25.18    24.60        25.95
3          21.73    21.04        20.65
4          19.20    17.37        16.74
5          28.02    24.44        26.31
6          14.02    13.82        14.87

The results testify to the efficacy of the two approaches. The hierarchical approach achieved the lowest error rates in experiment 1 on five of six occasions, the leave-k-in approach producing the lowest error rate in the other case. In experiment 2, the two methods enjoy equal success, both producing the lowest error rate in three of six cases. We conclude that there is much to be gained from implementing a strategy that takes into consideration prior knowledge about hierarchical structure and typical species presence in samples. Moreover, the good performance of the hierarchical approach demonstrates the separability of phytoplankton taxa, as high misclassification at the first stage of the hierarchy would feed through into the overall error rates. Interestingly, Boddy et al. (33) did not have this same success with a hierarchical classification, which is perhaps a reflection on the separability of taxa in that study.
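The counts quoted in this paragraph can be checked directly against the error rates in Table 2. The short Python tally below copies those rates and counts, for each experiment, how often each method gives the lowest error rate; the variable and function names are illustrative only.

```python
METHODS = ("Kernel", "Leave-k-in", "Hierarchical")

# Percentage error rates from Table 2, one row per test set (1 to 6).
TABLE_2 = {
    "Experiment 1": [
        (11.02, 15.18, 9.36), (19.73, 16.33, 15.50), (16.41, 12.34, 12.62),
        (16.55, 15.23, 11.80), (16.50, 14.48, 11.94), (9.76, 17.52, 8.72),
    ],
    "Experiment 2": [
        (23.58, 21.81, 21.34), (25.18, 24.60, 25.95), (21.73, 21.04, 20.65),
        (19.20, 17.37, 16.74), (28.02, 24.44, 26.31), (14.02, 13.82, 14.87),
    ],
}


def count_wins(rows):
    """Number of test sets on which each method has the lowest error rate."""
    tally = {m: 0 for m in METHODS}
    for row in rows:
        tally[METHODS[row.index(min(row))]] += 1
    return tally


for name, rows in TABLE_2.items():
    print(name, count_wins(rows))
```

The tally gives five wins for the hierarchical method and one for leave-k-in in experiment 1, and three each in experiment 2; the full five-dimensional kernel baseline is never the best of the three.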

Typicality

To illustrate the calculation of the index aj(x) given earlier, we took a training set of individuals from species Gymnodinium veneficum, a second training set from species Gymnodinium vitilis, and a test set that contained a mixture of individuals from the three species G. veneficum, G. vitilis, and Prorocentrum balticum.

Figure 1 shows histograms of the aj(x) values when the individuals in each training set are assessed against the other individuals in the same set. Figure 2 shows the corresponding histograms of the aj(x) values when the individuals in each training set are assessed against the individuals in the other training set. These histograms demonstrate internal cohesion of the two species (the histograms in Fig. 1 are highly skewed toward the value 1, showing individuals in each group to be very typical of that group) and good separability between them (the histograms in Fig. 2 are highly skewed toward the value 0, showing individuals of each group to be very atypical of the other group). Figure 3 shows the histograms of the aj(x) values when the individuals of the test set are assessed against the individuals of each training set. Some individuals in the test set are typical of each training set, but many (the members of Prorocentrum balticum) are atypical of both training sets with values close to 0.

Figure 1. Histograms of the typicality values when the individuals in each training set are assessed against the other individuals in the same set.

Figure 2. Histograms of the typicality values when the individuals in each training set are assessed against the individuals in the other training set.

Figure 3. Histograms of the typicality values when the individuals of the test set are assessed against the individuals of each training set.

CONCLUSION

Our results can be compared with various competitor methods of pattern recognition, and demonstrate some definite advances in the classification of phytoplankton using AFC signatures. We focused on a nonparametric approach using density estimation. We have shown that substantial improvement can be achieved over parametric methods such as LDA and QDA that are based on assumptions of normality. Projection pursuit is effective in overcoming computational burdens associated with wavelet implementations. The incorporation of prior information has led to the development of various strategies (hierarchical classification, leave-k-in) that not only improve further the classification rates, but make the practical task of species identification much more realistic and realizable. The incorporation of atypicality measures also assists the task of marine classification substantially, and increases the measure of confidence that can be ascribed to the results of analysis.

From a broader view, a number of other nonparametric methods have been applied to classification of AFC data. Classification and regression trees (CART) have been used in a number of studies (34), as have k-nearest neighbors (KNN; 35) and various forms of artificial neural networks (ANNs; 36). A full comparison of all these methods was beyond the scope of our investigation, but we can offer some comments based on the results in the literature. Some criticisms of both CART and KNN are that the methods are slow and unwieldy and they cannot cope easily with identification of “new” species (7). Our work on binning and projection pursuit has shown how the former problems can be overcome, whereas the typicality index gives a way forward for the latter drawback. ANNs also show considerable promise for cytometric data, and some direct comparisons of them with our methods would be of interest. For the present, we note that our error rates are not out of line with those reported in other ANN studies (36). We contend that nonparametric discrimination using either wavelets or kernel density estimates is a valuable addition to the toolkit for classification of AFC data.

Acknowledgements

This study formed part of the PRIME project supported by the Natural Environment Research Council, which also funded the research studentship held by GSC. We gratefully acknowledge the involvement of Drs. Peter Burkill and Glen Tarran of the Plymouth Marine Laboratory, who supplied the data, contributed to many useful discussions, and provided expertise in AFC technology. We also thank four anonymous referees for their careful reading and for the many comments and suggestions that helped to improve the clarity of our presentation.

LITERATURE CITED

  • 1
    Shapiro HM. Practical flow cytometry, 3rd edition. New York: Wiley-Liss; 1994.
  • 2
    Chisholm SW, Olson RJ, Zettler ER, Goericke R, Waterbury J, Welschmeyer NA. A novel free-living prochlorophyte abundant in the oceanic euphotic zone. Nature 1988; 334: 340–343.
  • 3
    Olson RJ, Chisholm SW, Zettler ER, Armbrust EV. Pigments, size and distribution of Synechococcus in the North Atlantic and Pacific Oceans. Limnol Oceanography 1990; 35: 45–58.
  • 4
    Li WK, Dickie PM, Irwin BD, Wood AM. Biomass of bacteria, cyanobacteria, prochlorophytes and photosynthetic eukaryotes in the Sargasso Sea. Deep-Sea Res 1992; 39: 501–519.
  • 5
    Olson RJ, Zettler ER, Anderson OK. Discrimination of eukaryotic phytoplankton cell types from light scatter and autofluorescence properties measured by flow cytometry. Cytometry 1989; 10: 636–646.
  • 6
    Demers S, Kim J, Legendre P, Legendre L. Analysing multivariate flow cytometric data in aquatic sciences. Cytometry 1992; 13: 291–299.
  • 7
    Boddy L, Wilkins MF, Morris CW. Pattern recognition in flow cytometry. Cytometry 2001; 44: 195–209.
  • 8
    Carr MR, Tarran GA, Burkill PH. Discrimination of marine phytoplankton species through the statistical analysis of their flow cytometric signatures. J Plankton Res 1996; 18: 1225–1238.
  • 9
    Tou JT, Gonzalez RC. Pattern recognition principles. London: Addison-Wesley; 1974.
  • 10
    McLachlan GJ. Discriminant analysis and statistical pattern recognition. New York: Wiley; 1992.
  • 11
    Fix E, Hodges JL. Discriminatory analysis, nonparametric estimation: consistency properties. Technical Report No. 4, Project no. 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas; 1951. (Reprinted in Int Stat Rev 1989;57:238–247.)
  • 12
    Silverman BW. Density estimation for statistics and data analysis. New York: Chapman and Hall; 1986.
  • 13
    Scott DW. Multivariate density estimation. New York: Wiley; 1992.
  • 14
    Bowman AW. An alternative method of cross-validation for the smoothing of density estimates. Biometrika 1984; 71: 353–360.
  • 15
    Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. J R Stat Soc [B] 1991; 53: 683–690.
  • 16
    Brewer MJ. A Bayesian model for local smoothing in kernel density estimation. Stat Comput 2000; 10: 295–306.
  • 17
    Terrell GR, Scott DW. Variable kernel density estimation. Ann Stat 1992; 20: 1236–1265.
  • 18
    Cencov NN. Evaluation of an unknown distribution density from observations. Doklady 1962; 3: 1559–1562.
  • 19
    Doukhan P, Leon J. Squared error of density estimators using orthogonal projection. Comptes Rendus Acad Paris [A] 1990; 310: 424–430.
  • 20
    Johnstone IM, Kerkyacharian G, Picard D. Estimation of a probability density using the wavelet method. Comptes Rendus Acad Paris [A] 1992; 315: 211–216.
  • 21
    Donoho DL, Johnstone IM, Kerkyacharian G, Picard D. Density estimation by wavelet thresholding. Ann Stat 1996; 24: 508–539.
  • 22
    Tribouley K. Practical estimation of multivariate densities using wavelet methods. Stat Neerlandica 1995; 49: 41–62.
  • 23
    Vannucci M. Nonparametric density estimation using wavelets. Technical Report DP95-26, Duke University. Revised 1998.
  • 24
    Abramovich F, Bailey TC, Sapatinas T. Wavelet analysis and its statistical applications. J R Stat Soc [D] 2000; 49: 1–29.
  • 25
    Daubechies I. Orthogonal basis of compactly supported wavelets. Commun Pure Appl Math 1988; 41: 909–996.
  • 26
    Pinheiro A, Vidakovic B. Estimating the square root of a density via compactly supported wavelets. Comput Stat Data Anal 1997; 25: 399–415.
  • 27
    Scott DW. Using computer-binned data for density estimation. In: Eddy WF, editor. Computer science and statistics: proceedings of the 13th symposium on the interface. New York: Springer-Verlag; 1997. p 292–294.
  • 28
    Jones MC. Discretized and interpolated kernel density estimates. J Am Stat Assoc 1989; 84: 733–741.
  • 29
    Antoniadis A, Gregoire G, Nason GP. Density and hazard rate estimation for right censored data using wavelet methods. J R Stat Soc [B] 1999; 61: 63–84.
  • 30
    Nelder JA, Mead R. A simplex method for function minimisation. Comput J 1965; 7: 308–313.
  • 31
    Hermans J, Habbema JDF, Kasanmoentalib TKD, Raatgever JW. Manual for the ALLOC80 discriminant analysis program. Department of Medical Statistics, University of Leiden; 1982.
  • 32
    Burkill PH. Analytical flow cytometry and its application to marine microbial ecology. In: Sleigh MA, editor. Microbes and the sea. Chichester: Ellis Horwood; 1987. p 139–166.
  • 33
    Boddy L, Morris CW, Wilkins MF, Tarran GA, Burkill PH. Neural network analysis of flow cytometric data for 40 marine phytoplankton. Cytometry 1994; 15: 283–293.
  • 34
    Beckman RJ, Salzman GC, Stewart CC. Classification and regression trees for bone marrow immunophenotyping. Cytometry 1995; 20: 210–217.
  • 35
    Wilkins MF, Boddy L, Morris CW, Jonker RR. A comparison of some neural and non-neural methods for identification of phytoplankton from flow cytometry data. CABIOS 1996; 12: 9–18.
  • 36
    Boddy L, Morris CW, Wilkins MF, Al-Haddad L, Tarran GA, Jonker RR, Burkill PH. Identification of 72 phytoplankton species by radial basis function neural network analysis of flow cytometric data. Marine Ecol Series 2000; 195: 47–59.