Automated classification of bacterial particles in flow by multiangle scatter measurement and support vector machine classifier


  • Bartek Rajwa,

    Corresponding author
    1. Purdue University Cytometry Laboratories, Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907
    • Purdue University Cytometry Laboratories, Bindley Bioscience Center, Purdue University, 1203 W. State Street, West Lafayette, IN 47907, USA
  • Murugesan Venkatapathi,

    1. Purdue University Cytometry Laboratories, Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907
    2. School of Mechanical Engineering, Purdue University, West Lafayette, Indiana 47907
  • Kathy Ragheb,

    1. Purdue University Cytometry Laboratories, Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907
  • Padmapriya P. Banada,

    1. Molecular Food Microbiology Laboratory, Department of Food Science, Purdue University, West Lafayette, Indiana 47907
  • E. Daniel Hirleman,

    1. School of Mechanical Engineering, Purdue University, West Lafayette, Indiana 47907
  • Todd Lary,

    1. Cellular Analysis Technology Center, Beckman-Coulter, Inc., Miami, Florida 33196
  • J. Paul Robinson

    1. Purdue University Cytometry Laboratories, Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907


Biological microparticles, including bacteria, scatter light in all directions when illuminated. The complex scatter pattern is dependent on particle size, shape, refractive index, density, and morphology. Commercial flow cytometers allow measurement of scattered light intensity at forward and perpendicular (side) angles (2° ≤ θ1 ≤ 20° and 70° ≤ θ2 ≤ 110°, respectively) at rates varying from 10 to 10,000 particles per second. The choice of angle is dictated by the fact that scattered light in the forward region is primarily dependent on cell size and refractive index, whereas side-scatter intensity is dependent on the granularity of cellular structures. However, these two-parameter measurements cannot be used to separate populations of cells of similar shape, size, or structure. Hence, there have been several attempts in flow cytometry to measure the entire scatter patterns. The published concepts require the use of unique custom-built flow cytometers and cannot be applied to existing instruments. It was also not clear how much information about patterns is really necessary to separate various populations of cells present in a given sample. The presented work demonstrates the application of pattern-recognition techniques to classify particles on the basis of their discrete scatter patterns collected at just five different angles and accompanied by the measurement of axial light loss. The proposed approach can potentially be used with existing instruments because it requires only the addition of a compact enhanced scatter detector. An analytical model of scatter of laser beams by individual bacterial cells suspended in a fluid was used to determine the location of scatter sensors. Experimental results were used to train the support vector machine-based pattern recognition system. It has been shown that information provided by just five angles of scatter and axial light loss can be sufficient to recognize various bacteria with a 68–99% success rate.
© 2007 International Society for Analytical Cytology

Light-scatter signal detection has been employed in flow cytometry almost from the moment the method was introduced to practical use. Initially, the scatter signal was utilized to synchronize the fluorescence detectors with the flow of particles through the flow chamber. Very soon it was demonstrated that information about forward (2° ≤ θ1 ≤ 20°) and side (70° ≤ θ2 ≤ 110°) scatter can be used to identify a number of subpopulations of cells without the use of any additional information provided by fluorescence stains (1–3). This was possible owing to the fact that forward-scattered light in the small-angle region (θ ≤ 2°) is primarily dependent on the cell size, and is mostly independent of particle refractive index or shape (4–6), whereas perpendicular light scatter is sensitive to small internal structures in cells and to refractive index changes. In the early days of flow cytometry some reports were also published on the use of axial light loss, which was employed for cell sizing (7).

Flow cytometrists agree that full scatter patterns of bioparticles may contain much more information than what forward scatter, perpendicular (side) scatter, and extinction can reveal. When a cell passing through the flow chamber is illuminated, a very complex spatial pattern is formed that is dependent on cell size, shape, refractive index, density, morphology, and orientation of the cell relative to the direction of the incident beam. Therefore, many researchers investigated the possibility of collecting more than just two angles of scatter simultaneously. Meyer et al. (8) postulated that single cells could be comprehensively characterized in flow systems using multiangle scatter detectors utilizing 32 channels in a fashion similar to the observation of scatter patterns of single cells in microscopy-based systems.

Despite technical difficulties there were indeed several reports published in the 1970s and early 1980s on complex applications of scatter detection in flow, such as label-free detection of morphological changes inside the cell. Some of the reported systems involved detection of the full 180° or 360° scatter patterns from single biological cells (1, 9–11).

The largest body of work on scatter in flow cytometry was performed in the 1970s by a group of researchers at Los Alamos National Labs (12–14). Their custom-built flow cytometers capable of multiangle scatter measurement were interfaced to DEC minicomputers for data processing. Several of these instruments were delivered to the NIH, but sadly none of the multiangle scatter detectors designed in Los Alamos found its way to commercial systems.

Owing to the immense progress in fluorescent label development, multiparameter flow cytometry moved in the last decade in the direction of adding more fluorescence detectors, rather than enhancing scatter measurements. Twelve-color machines are currently commercially available, and reports have been published on 16-color flow cytometry analysis. Multiangle scatter systems designed in the 1970s and 1980s failed to make a substantial impact on the field. Currently, only a few research groups still actively investigate applications of multiangle light-scatter analysis in flow (15–19). However, the published concepts usually require sophisticated, unique custom-built flow cytometers and cannot be applied to the existing instruments.

The presented work demonstrates the application of pattern recognition techniques to classification of microbial particles on the basis of their discrete scatter patterns collected at five selected angles and accompanied by the measurement of axial light loss. Our approach differs from previous reports by the use of a discrete-dipole approximation (DDA)-based analytical model of laser-beam scatter by individual bacterial cells suspended in a fluid to determine the optimal location of scatter sensors. Most of the available cytometry literature related to light-scatter measurements employed generalized Lorenz–Mie theory to perform optical particle sizing using a model of light scattering by a sphere irradiated by a laser beam having a Gaussian intensity distribution. In contrast, the DDA method can be applied to inhomogeneous particles of arbitrary geometry, which makes it especially well suited for modeling the scatter response of nonspherical cells, such as rod-shaped bacteria (20–22).

Our approach can be used with existing instruments and requires only minor modifications and the addition of a compact custom-built scatter detector. In contrast to other reports that describe direct use of collected scatter signals to characterize bioparticles, our method works in concert with a machine learning system. The experimental results obtained from the known samples are used to train the support vector machine (SVM), and subsequently the samples containing a mixture of unknown particles are classified by the trained algorithm. This report shows that information provided by just six scatter-related parameters is sufficient for a trained system to recognize various bacteria with a 69–99% success rate.


Flow Cytometry

All the analyses were performed with a Cytomics FC500 flow cytometer (Beckman-Coulter, Miami, FL) equipped with a 488-nm air-cooled argon laser. A prototype of an enhanced scatter detection system (courtesy of Beckman-Coulter) capable of measuring forward-scatter signals at four different angles was added to the above flow cytometer, replacing the traditional forward scatter detector. This scatter measurement system consists of four ring detectors and an axial light-loss (extinction cross section) detector that can be moved toward or away from the laser beam-particle intersection point to change the angles of measurement (Fig. 1). Therefore, the four angles of detection in each experiment cannot be chosen independently because the four rings in the detector are fixed with respect to each other. The number of the uniformly spaced optical fibers in each ring varies linearly with its radius (12–34 per ring) to correct for variation of solid angle. The scatter measurements from each ring detector are amplified by different sets of pre-amplifiers and amplifiers to collect the 10-bit linear data. Discrimination of doublets from single particles is achieved by plotting forward-angle light-scatter integral versus peak intensity and gating on single-particle signals. The CXP software (Beckman-Coulter, Miami, FL) was used to acquire the data on the flow cytometer.

Figure 1.

A simple schematic of the optical setup. [Color figure can be viewed in the online issue, which is available at]

Bacterial Cultures

Four different nonpathogenic bacterial cultures of varying size and shape were selected for the experiments: Escherichia coli K12, Listeria innocua F4248, Bacillus subtilis ATCC 6633, and Enterococcus faecalis CG110. The cultures were grown in brain heart infusion (BHI) broth for 16–18 h at 37°C, 140 rpm in a shaker incubator. The cultures were washed once by centrifuging 5 min at 3,000 rpm and re-suspended in sterile phosphate buffered saline (PBS), pH 7.6, before analysis. All the bacterial cultures were obtained from the Purdue Department of Food Science culture collection.

Analytical Model of Scatter

The mathematical model of scatter used in this work assumes that the particles are in isolation in the sheath fluid (n = 1.33), and the angular scatter distribution is calculated and integrated over the area of the forward-scatter detector placed outside the sheath fluid. This assumption is valid if the particles are much smaller than the channel and if the laser beam (10 μm × 80 μm Gaussian) inside the channel is considerably larger than the particle. These assumptions are indeed valid in the analyzed case. Because a flow channel with square cross section was used and the dimension of the laser beam was smaller than the width of the channel (250 μm), the changes in dimension and intensity (<4%) of the laser beam due to refraction at the surface of the channel are negligible (unlike the case in cylindrical channels). The bacterial cells were modeled as homogeneous particles using the DDA method (20–22).

The DDA method was first formulated by Purcell and Pennypacker (23), who used it to study interstellar dust grains, and later extended by other researchers such as Draine, and Taubenblatt and Tran (24, 25). In DDA an arbitrarily shaped particle is treated as a three-dimensional assembly of dipoles ($j = 1, \ldots, N$) on a cubic grid, located at positions $\mathbf{r}_j$ (26). Each dipole is assigned a complex polarizability $\alpha_j$, which can be computed from the complex refractive index of the bulk material and the number of dipoles in a unit volume. The dipole moment or polarization at each dipole is related to the electric field by $\mathbf{P}_j = \alpha_j \mathbf{E}_{\mathrm{tot},j}$, where $\mathbf{P}_j$ is the dipole moment at the dipole $j$, and $\mathbf{E}_{\mathrm{tot},j}$ is the total electric field at dipole $j$, at $\mathbf{r}_j$.

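Following the notation of Ref. (27), the field $\mathbf{E}_{\mathrm{tot},j}$ at each dipole can be decomposed into the electric field incident upon the features and the electric field contribution from the other interacting dipoles. Hence, the electric field can be represented as $\mathbf{E}_{\mathrm{tot},j} = \mathbf{E}_{\mathrm{inc},j} + \mathbf{E}_{\mathrm{dipole},j}$, where $\mathbf{E}_{\mathrm{dipole},j}$ is the electric field contribution from the other $N-1$ dipoles, and $\mathbf{E}_{\mathrm{inc},j}$ is the known incident field $\mathbf{E}_0 \exp(i\mathbf{k}\cdot\mathbf{r}_j - i\omega t)$. Therefore, $\mathbf{E}_{\mathrm{dipole},j}$ can be expressed as:

$$\mathbf{E}_{\mathrm{dipole},j} = -\sum_{k \neq j} \mathbf{A}_{jk}\mathbf{P}_k \tag{1}$$

where $-\mathbf{A}_{jk}\mathbf{P}_k$ is the electric field at $\mathbf{r}_j$ due to dipole $\mathbf{P}_k$. Each element $\mathbf{A}_{jk}$ is a 3 × 3 matrix:

$$\mathbf{A}_{jk} = \frac{\exp(ikr_{jk})}{r_{jk}} \left[ k^2\left(\hat{\mathbf{r}}_{jk}\hat{\mathbf{r}}_{jk} - \mathbf{1}_3\right) + \frac{ikr_{jk} - 1}{r_{jk}^2}\left(3\hat{\mathbf{r}}_{jk}\hat{\mathbf{r}}_{jk} - \mathbf{1}_3\right) \right], \quad j \neq k \tag{2}$$

where $k \equiv \omega/c$, $r_{jk} = |\mathbf{r}_j - \mathbf{r}_k|$, $\hat{\mathbf{r}}_{jk} \equiv (\mathbf{r}_j - \mathbf{r}_k)/r_{jk}$, and $\mathbf{1}_3$ is a 3 × 3 identity matrix. With $\mathbf{A}_{jj} \equiv \alpha_j^{-1}$, the scattering problem is reduced to finding polarizations $\mathbf{P}_j$ that satisfy a system of equations:

$$\sum_{k=1}^{N} \mathbf{A}_{jk}\mathbf{P}_k = \mathbf{E}_{\mathrm{inc},j} \tag{3}$$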

These equations can be solved by iterations. By introducing the Green function, the method produces reliable results for extremely rough discretization grids such as 2.22 meshes per wavelength (26). In the presented study the quasi-minimal residual method has been used to solve the problem. Owing to the characteristics of the coefficient matrix, the convergence towards an accurate answer is dependent on scattering feature size and refractive index.
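The structure of this linear system can be illustrated with a minimal Python sketch. It is not the solver used in the study: instead of the quasi-minimal residual method it uses a plain fixed-point (Jacobi-style) iteration, $\mathbf{P}_j \leftarrow \alpha_j(\mathbf{E}_{\mathrm{inc},j} - \sum_{k \neq j}\mathbf{A}_{jk}\mathbf{P}_k)$, which converges only when the dipoles are weakly coupled. The five-dipole rod, the arbitrary length units, the single polarizability value `alpha`, and all function names are assumptions made for illustration.

```python
import cmath
import math

def interaction_block(r_j, r_k, k):
    """Return the 3x3 block A_jk (j != k) of Eq. 2 as nested lists."""
    diff = [a - b for a, b in zip(r_j, r_k)]
    dist = math.sqrt(sum(c * c for c in diff))
    unit = [c / dist for c in diff]                   # unit vector from r_k to r_j
    prefactor = cmath.exp(1j * k * dist) / dist
    near = (1j * k * dist - 1.0) / dist ** 2          # weight of (3 r^r^ - 1)
    block = []
    for a in range(3):
        row = []
        for b in range(3):
            outer = unit[a] * unit[b]
            delta = 1.0 if a == b else 0.0
            row.append(prefactor * (k * k * (outer - delta)
                                    + near * (3.0 * outer - delta)))
        block.append(row)
    return block

def coupled_field(j, polarizations, blocks, e_inc):
    """E_tot,j = E_inc,j - sum over k != j of A_jk P_k (incident field plus Eq. 1)."""
    field = list(e_inc[j])
    for m, p_m in enumerate(polarizations):
        if m == j:
            continue
        block = blocks[(j, m)]
        for a in range(3):
            field[a] -= sum(block[a][b] * p_m[b] for b in range(3))
    return field

def solve_polarizations(positions, alpha, k, e0, k_dir, sweeps=60):
    """Fixed-point iteration P_j <- alpha * E_tot,j (weak coupling assumed)."""
    count = len(positions)
    e_inc = []
    for r in positions:
        # Plane wave E0 * exp(i k.r); the common time factor is dropped.
        phase = cmath.exp(1j * k * sum(d * x for d, x in zip(k_dir, r)))
        e_inc.append([component * phase for component in e0])
    blocks = {(j, m): interaction_block(positions[j], positions[m], k)
              for j in range(count) for m in range(count) if j != m}
    polarizations = [[alpha * e for e in field] for field in e_inc]
    for _ in range(sweeps):
        polarizations = [
            [alpha * e for e in coupled_field(j, polarizations, blocks, e_inc)]
            for j in range(count)
        ]
    return polarizations, blocks, e_inc

def max_residual(polarizations, blocks, e_inc, alpha):
    """Largest component of |P_j/alpha + sum_{k != j} A_jk P_k - E_inc,j| (Eq. 3)."""
    worst = 0.0
    for j, p_j in enumerate(polarizations):
        total = coupled_field(j, polarizations, blocks, e_inc)
        for a in range(3):
            worst = max(worst, abs(p_j[a] / alpha - total[a]))
    return worst

# Toy rod: five dipoles spaced 0.1 wavelength apart along x, beam along z,
# field polarized along x. alpha is an assumed small value, not a fitted one.
wavelength = 1.0
k = 2.0 * math.pi / wavelength
rod = [(0.1 * i, 0.0, 0.0) for i in range(5)]
p, blocks, e_inc = solve_polarizations(rod, alpha=1e-5, k=k,
                                       e0=(1.0, 0.0, 0.0), k_dir=(0.0, 0.0, 1.0))
print("residual of Eq. 3:", max_residual(p, blocks, e_inc, 1e-5))
```

For realistic particles the number of dipoles is large and the coupling is stronger, which is why Krylov-type solvers such as the quasi-minimal residual method are preferred over this simple iteration.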

Light-scatter signal from four different bacteria species was modeled: E. coli, L. innocua, B. subtilis, and E. faecalis. E. coli, L. innocua, and B. subtilis are rod-shaped bacteria, whereas E. faecalis appears as cocci in chains. The size of E. coli depends on the growth phase, and the nutrients available in the medium. E. coli bacilli can be up to 1.5 μm wide and 2.0–6.0 μm long (28). For the purpose of modeling we assumed that E. coli cells are typically 2 μm in length and about 1 μm in diameter. L. innocua cells were modeled as rods, ∼2 μm in length, and ∼0.6 μm in width (29). B. subtilis forms long rods with oval endospores. The dimensions for our model were based on direct observation under a phase-contrast light microscope (Leica Microsystems, Bannockburn, IL): the average size of the vegetative cell was 4.3 μm × 0.54 μm and the endospores measured about 0.8 μm × 0.5 μm. Typically, the volume of the B. subtilis cells increases by as much as 4% when spores are formed, whereas the refractive index decreases from 1.51 to 1.39 (30). E. faecalis forms oval cocci elongated in the direction of the chain, mostly in pairs and short chains, with each coccus measuring about 1.38 μm long. The refractive indices of vegetative cells of all these bacteria vary from 1.4 to 1.5 (30, 31).

The obtained results are valid for forward angles. The effects of internal nonhomogeneity that affect the light scatter at large angles can be ignored in this study. Because the cells are much smaller than the incident laser beam, an incident uniform plane wave is assumed. The numerical model described in this report assumes a nominal effective refractive index of 1.394. The angular variation of scatter has been corrected for refraction of the scattered partial waves across the flow cell on the way to the detectors, and the longer axes of the bacteria were assumed to be aligned with the axis of flow owing to the hydrodynamic forces in a flow cytometer. Because polarization changes the scattering cross section noticeably, especially for long rod-like particles, the employed model takes into account incident laser beam polarization (Ex/Ey = 0.33).

Machine Learning Tools

Among various machine learning tools tested for classification of scatter features of individual bacteria, SVM-based algorithms were especially promising (32–34). SVM algorithms allow for nonlinear decision boundaries in the input space. SVMs are based on the concept of decision hyperplanes that define decision boundaries. A decision hyperplane is one that separates a set of objects having different class memberships. SVMs are able to construct hyperplanes in a multidimensional space that separate cases of different class labels. An optimal decision hyperplane is here defined as the linear decision function with maximal margin between the vectors of the two classes. It has been demonstrated that to construct such hyperplanes one has to take into account only a small amount of the training data, the so-called support vectors, which determine this margin (33). For the optimal hyperplane $\mathbf{w}_0 \cdot \mathbf{z} + b_0 = 0$, with $\mathbf{w} \in \mathbb{R}^N$ and $b \in \mathbb{R}$, it has been shown that the weights $\mathbf{w}_0$ can be expressed as a linear combination of support vectors:

$$\mathbf{w}_0 = \sum_{\text{support vectors}} \alpha_i \mathbf{z}_i \tag{4}$$

Therefore, the linear decision function $I(\mathbf{z})$ will be in the form of

$$I(\mathbf{z}) = \operatorname{sign}\left( \sum_{\text{support vectors}} \alpha_i \, \mathbf{z}_i \cdot \mathbf{z} + b_0 \right) \tag{5}$$

where $\alpha_i$ are the weights of the support vectors and $\mathbf{z}_i \cdot \mathbf{z}$ is the dot-product between support vectors $\mathbf{z}_i$ and vector $\mathbf{z}$ in feature space. SVM is a linear classifier in the parameter space, but it is easily extended to a nonlinear classifier by mapping the space $S = \{x\}$ of the input data into a high-dimensional (possibly infinite-dimensional) feature space $F = \{\varphi(x)\}$ (see Fig. 2). If one chooses an adequate mapping $\varphi$, the data points become linearly separable or mostly linearly separable in the high-dimensional space, so that one can easily apply structural risk minimization (35). To avoid working in the potentially high-dimensional space $F$, one tries to pick a feature space in which the dot product can be evaluated directly using a nonlinear function in input space, i.e., by means of the kernel trick: $\kappa(x_1, x_2) = \langle \varphi(x_1), \varphi(x_2) \rangle$. Therefore, instead of making a nonlinear transformation of the input vectors followed by dot-products with support vectors in feature space, one can first compare two vectors in input space, and then make a nonlinear transformation of the value of the result (33). A kernel can also be understood as a similarity measure between two observations. A large value for $\kappa(x_1, x_2)$ indicates similar points, whereas smaller values indicate dissimilar points. Typical kernels include the linear kernel, $\kappa(x_1, x_2) = x_1^{T} x_2$, the polynomial kernel, $\kappa(x_1, x_2) = (x_1^{T} x_2 + 1)^d$, and the RBF kernel, $\kappa(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$. It has been shown that all these kernels are functions of dot products (36).
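The three kernels and the kernelized form of the decision function can be written out in a few lines of Python. This is a sketch for illustration only: the support vectors, their signed weights, and the bias are chosen by hand rather than obtained by training, and the function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2):
    return (dot(x, y) + 1.0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def decide(x, support_vectors, weights, bias, kernel):
    """Kernelized decision function: sign(sum_i w_i * kernel(sv_i, x) + bias)."""
    score = sum(w * kernel(sv, x) for sv, w in zip(support_vectors, weights)) + bias
    return 1 if score >= 0 else -1

# Hypothetical support vectors and signed weights, chosen by hand for illustration.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
weights = [1.0, -1.0]
print(decide((0.2, 0.1), support_vectors, weights, 0.0, rbf_kernel))
print(decide((1.9, 2.2), support_vectors, weights, 0.0, rbf_kernel))
```

With the RBF kernel each support vector contributes most strongly to points close to it, so a query point is assigned the sign of the nearest heavily weighted support vector.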

Figure 2.

The toy XOR problem demonstrates the concept of mapping the features to higher dimensionality to find a linear separation. The red points (class 1) and green points (class 2) cannot be separated by a linear function in the feature space (left plot). However, a simple mapping to a higher dimension allows linear separation (right plot). The classes can be mapped to a six-dimensional space: 1, √2X, √2Y, √2XY, X², Y², where the optimal separation hyperplane is XY = 0. [Color figure can be viewed in the online issue, which is available at]
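The mapping in the caption can be checked numerically. The sketch below assumes a particular XOR layout (class +1 where X and Y share a sign, class −1 otherwise), which the caption does not specify, and verifies two things: the hyperplane XY = 0 separates the mapped points, and the degree-2 polynomial kernel equals the dot product in the six-dimensional space.

```python
import math

def feature_map(point):
    """Map (X, Y) to (1, sqrt2*X, sqrt2*Y, sqrt2*X*Y, X^2, Y^2)."""
    x, y = point
    root2 = math.sqrt(2.0)
    return (1.0, root2 * x, root2 * y, root2 * x * y, x * x, y * y)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed XOR layout: class +1 where X and Y share a sign, class -1 otherwise.
xor_points = {(1.0, 1.0): 1, (-1.0, -1.0): 1, (1.0, -1.0): -1, (-1.0, 1.0): -1}

# In the mapped space the fourth coordinate (sqrt2 * X * Y) alone separates the
# classes, so the hyperplane "fourth coordinate = 0" is linear there.
normal = (0.0, 0.0, 0.0, 1.0, 0.0, 0.0)
for point, label in xor_points.items():
    side = 1 if dot(normal, feature_map(point)) > 0 else -1
    print(point, "label", label, "side of hyperplane", side)

# Kernel trick check: (a . b + 1)^2 equals the dot product of the mapped points.
a, b = (0.5, -2.0), (1.5, 0.25)
print((dot(a, b) + 1.0) ** 2, dot(feature_map(a), feature_map(b)))
```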

Supervised classification performed in this report used an implementation of the SVM-based algorithm by Chih-Chung Chang and Chih-Jen Lin (37–39). All the plots, including the examples of the SVM decision boundaries, were prepared using R, a free software environment for statistical computing and graphics (40).


Calculation of Distinguishability Factor

Light-scatter patterns created by cells belonging to four different bacterial species, L. innocua, B. subtilis, E. coli, and E. faecalis, were modeled using the DDA method. Flow cytometry measurements of traditional forward- and side-scatter signals, as well as multiangle scatter measurements, have been performed using actual samples of E. coli K12, L. innocua F4248, B. subtilis ATCC 6633, and E. faecalis CG110.

The employed model of scattering by bacteria with a nominal refractive index predicts scatter intensities at angles varying from 0 to 30°. The predictions of angular scatter intensity are shown in Figure 3 (averaged over all Φ for each ring). The plotted differential scattering cross section is independent of the distance between the detector and the sample, and is used to select the angles for maximum discrimination of the bacteria.

Figure 3.

The nominal variation of differential scattering cross section (dCsc/dω) with forward angle for the four bacteria species. Set I—7.8, 11.3, 17, and 22.5°; Set II—4, 5.8, 8.7, and 11.5°. [Color figure can be viewed in the online issue, which is available at]

The analytical model of the bacteria with a nominal refractive index and size allowed us to find the optimal location for the scatter detectors for effective classification. The proper placement of the detectors was determined by the value of distinguishability factor D, defined as the ratio of the difference in scattering cross section to the sum of scattering cross section of two different bacteria:

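$$D_{ij} = \sum_{\theta} \frac{\left| C_i(\theta) - C_j(\theta) \right|}{C_i(\theta) + C_j(\theta)} \tag{6}$$

where $i, j$ represent different bacterial species, $\theta$ is the angle of light scatter, $C_i(\theta)$ is the differential scattering cross section of species $i$ at angle $\theta$, and the sum runs over the four detection angles of a set.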
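As a worked illustration, the distinguishability factor can be computed as follows. The sketch assumes that the per-angle ratio |C_i − C_j| / (C_i + C_j) is summed over the four detection angles, an interpretation suggested by the fact that the tabulated values exceed 1; the cross-section values used here are made up and are not model output.

```python
def distinguishability(cross_a, cross_b):
    """Sum over angles of |a - b| / (a + b) for two species' cross sections."""
    if len(cross_a) != len(cross_b):
        raise ValueError("both species need one value per detection angle")
    return sum(abs(a - b) / (a + b) for a, b in zip(cross_a, cross_b))

# Made-up differential cross sections at four angles (arbitrary units),
# NOT values from the paper.
species_one = [9.0, 4.0, 1.0, 0.5]
species_two = [3.0, 4.0, 3.0, 0.1]
print(distinguishability(species_one, species_two))
```

Under this assumed form each angle contributes between 0 (identical scatter) and 1 (one species does not scatter at all), so with four angles D ranges from 0 to 4.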

The idea of calculating the D factor can be easily derived from an analysis of Figure 3. One can easily see that the angles represented by light-gray lines (nominal 7.8, 11.3, 17, and 22.5°—Set I) are better than the angles highlighted by dark-gray lines (4, 5.8, 8.7, and 11.5°—Set II) for distinguishing all the analyzed two-component mixtures of bacteria. These sets of angles can in practice be translated to different positions of the multi-angle scatter detector. The distinguishabilities for the two analyzed sets are presented in Table 1.
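The link between detector position and collected angles can be sketched with simple ray geometry. The Python example below assumes that a ring of fixed radius at distance z from the interrogation point collects light at arctan(radius / z), ignores refraction at the flow-cell walls, and uses a hypothetical doubling of the detector distance; none of these details are taken from the instrument specification.

```python
import math

def angles_after_move(angles_deg, distance_ratio):
    """Rescale ring angles when the detector distance changes by distance_ratio.

    Assumes angle = arctan(ring_radius / distance), i.e. straight rays and no
    refraction at the flow-cell walls (a simplification, not the authors' model).
    """
    moved = []
    for angle in angles_deg:
        tangent = math.tan(math.radians(angle))      # ring_radius / old distance
        moved.append(math.degrees(math.atan(tangent / distance_ratio)))
    return moved

set_one = [7.8, 11.3, 17.0, 22.5]                    # nominal Set I angles
# Hypothetical move: detector placed twice as far from the interrogation point.
for before, after in zip(set_one, angles_after_move(set_one, 2.0)):
    print(f"{before:5.1f} deg -> {after:5.2f} deg")
```

Under these assumptions the rescaled angles land within a few tenths of a degree of the nominal Set II values, which illustrates why the four angles cannot be chosen independently: one detector position fixes all of them at once.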

Table 1. Distinguishabilities calculated for scatter measurement at nominal 7.8, 11.3, 17, and 22.5° (Set I), and 4, 5.8, 8.7, and 11.5° (Set II). Each cell lists Set I / Set II.
              E. coli        L. innocua     B. subtilis    E. faecalis
E. coli       -              1.88 / 2.82    2.15 / 1.49    1.34 / 0.67
L. innocua    1.88 / 2.82    -              2.11 / 1.79    2.42 / 3.12
B. subtilis   2.15 / 1.49    2.11 / 1.79    -              2.25 / 2.01
E. faecalis   1.34 / 0.67    2.42 / 3.12    2.25 / 2.01    -

The data showed that the ability to distinguish between the analyzed bacterial species decreased when the multiangle detector was placed too far from the flow chamber, effectively collecting signals from smaller angles of scatter. L. innocua was an exception to this finding.

Although this estimate ignores the variation of scatter signals within each population due to intra-species differences in size and refractive indices, it gave a qualitative approximation of the expected outcome of the alternative measurement scenarios.

The difference in refractive indices and cell sizes results in dispersion of each bacterial species population in the light-scatter measurement space. Therefore, the experimental data have not been directly classified using the output of a model, but rather processed employing a machine learning system.

Predicted Classification Success

To predict the feasible classification success, the variability in scatter signal owing to differences in refractive index and sizes of individual particles had to be considered. Hence, a normal distribution of sizes and refractive indices was employed in the enhanced model calculated for three species of bacteria (L. innocua, E. faecalis, and E. coli). The 1/e width of the normal distribution of refractive index was 0.033 with a mean (μ) of 1.394. The standard deviation of the refractive index was assumed to be approximately 2%. Similarly, the standard deviation in volume of the bacteria was assumed to be 5%. This increase in volume was modeled by corresponding isotropic changes in the dimensions of the bacteria. The modeled populations of each bacterial species were divided into 91 subgroups (13 different values of refractive index by 7 values of volume, both varying from μ − 3σ to μ + 3σ) and mapped onto the four-dimensional measurement space of the two investigated angle sets. The modeled bacterial populations were used to calculate theoretical scatter signals using the DDA approach as described before. Once scatter intensities of each subgroup were computed, the results were weighted by the population density (w) using the equation

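$$w_{ij} = \frac{1}{2\pi\sigma_n\sigma_V} \exp\left[ -\frac{(n_i - n_0)^2}{2\sigma_n^2} - \frac{(V_j - V_0)^2}{2\sigma_V^2} \right] \tag{7}$$

where $n_i$ and $V_j$ are the refractive index and volume of the $ij$-th subgroup, $n_0$ and $\sigma_n$ are the mean and the standard deviation of refractive index, and $V_0$ and $\sigma_V$ are the mean and the standard deviation of bacterial cell volume.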
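The subgroup weighting can be sketched in Python as below. The sketch assumes independent normal distributions for refractive index and volume, expresses volume in units of its mean, and normalizes the 13 × 7 weights so that they sum to 1; the normalization and the function names are choices made for this illustration.

```python
import math

def grid(mean, sigma, count):
    """Evenly spaced values from mean - 3*sigma to mean + 3*sigma."""
    step = 6.0 * sigma / (count - 1)
    return [mean - 3.0 * sigma + step * i for i in range(count)]

def population_weights(n_values, v_values, n_mean, n_sigma, v_mean, v_sigma):
    """Gaussian weight for every (refractive index, volume) subgroup, summing to 1."""
    raw = [[math.exp(-0.5 * ((n - n_mean) / n_sigma) ** 2
                     - 0.5 * ((v - v_mean) / v_sigma) ** 2)
            for v in v_values] for n in n_values]
    total = sum(sum(row) for row in raw)
    return [[w / total for w in row] for row in raw]

n_mean, n_sigma = 1.394, 0.02 * 1.394     # 2% standard deviation, as in the text
v_mean, v_sigma = 1.0, 0.05               # volume in units of the mean; 5% spread
weights = population_weights(grid(n_mean, n_sigma, 13), grid(v_mean, v_sigma, 7),
                             n_mean, n_sigma, v_mean, v_sigma)
print(len(weights), "x", len(weights[0]), "subgroups; central weight", weights[6][3])
```

The weighted scatter signal of a population is then the sum of the subgroup signals multiplied by these weights, so subgroups near the mean dominate and those near μ ± 3σ contribute very little.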

The overlap of weighted resultant values calculated for every pair of bacterial species in the measurement space was used as an estimate of the possible classification error, for every two-class case. Because the model did not take the instrument noise into account, these values were expected to give an approximate upper bound of the feasible classification success.

In Silico Analysis of Flow Cytometry Experiments

Samples containing the pure bacterial suspensions were run in sequence but separately on the modified Beckman-Coulter FC500 flow cytometer. Subsequently, the datasets were electronically mixed, and a parameter representing the ground truth was added to the dataset. This parameter was used to verify the results of automated classification.
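Electronic mixing amounts to pooling the events from the separate runs and attaching the species of origin to each event. A minimal Python sketch, with made-up two-parameter events and a hypothetical function name, could look as follows.

```python
import random

def mix_with_ground_truth(samples_by_species, seed=0):
    """Pool events from pure-culture runs and keep the species as a label."""
    pooled = []
    for species, events in samples_by_species.items():
        for event in events:
            pooled.append((tuple(event), species))
    random.Random(seed).shuffle(pooled)      # interleave so order carries no information
    features = [event for event, _ in pooled]
    labels = [species for _, species in pooled]
    return features, labels

# Made-up events (two parameters each) standing in for acquired list-mode data.
runs = {
    "E. coli": [(120, 340), (118, 352), (125, 330)],
    "B. subtilis": [(210, 400), (205, 415)],
}
features, labels = mix_with_ground_truth(runs)
print(len(features), "events;", labels.count("E. coli"), "labelled E. coli")
```

The label column is withheld from the classifier during prediction and used only afterwards to score the result.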

The collected scatter signals at the four forward angles established by the numerical analysis study, a parameter representing sum of all the forward-scatter intensities, side-scatter, and axial light loss measures for each particle from every group of bacteria formed multidimensional data vectors describing the analyzed bioparticles. Visual examination of plots representing measurements of forward- and side-scatter signals could not distinguish between the microbial particles of E. coli K12, L. innocua F4248, B. subtilis ATCC 6633, and E. faecalis CG110 in any of the experiments. An example is demonstrated in Figure 4, where samples containing particles of E. faecalis and B. subtilis are represented on a scatter-plot matrix. Partial or complete overlap of the two populations in the parametric space is evident.

Figure 4.

Matrix of scatter plots representing multiangle scatter measurement of B. subtilis and E. faecalis samples. For clarity, only 1,000 events were plotted. FS1–4, four forward-scatter measurements; LL, axial light loss; SS, side scatter. B. subtilis, orange dots; E. faecalis, blue dots. [Color figure can be viewed in the online issue, which is available at]

The classification problem was then to determine the type (species) of every analyzed particle on the basis of its multidimensional data vector. The unsupervised dimensionality reduction approaches, namely linear and kernel principal component analysis (PCA and kPCA, the latter using a radial basis function), independent component analysis (ICA), and factor analysis (FA), did not result in separable populations (see an example in Fig. 5). Attempted supervised classification using linear discriminant analysis (LDA) also failed, producing results with error rates above 35% (Fig. 6) in all the cases tested.

Figure 5.

Example of dimensionality reduction techniques applied to the multiangle scatter data. (A) Principal component analysis, (B) independent component analysis, (C) kernelized version of principal component analysis, (D) factor analysis. B. subtilis, orange dots; E. faecalis, blue dots. [Color figure can be viewed in the online issue, which is available at]

Figure 6.

Linear discriminant score plot illustrating inability to achieve separation between data vectors describing B. subtilis (orange dots) and E. faecalis (blue dots). Events indexed from 1 to 500 should be placed above the y = 0 function, whereas all the events from 501 to 1000 should score below 0. However, owing to misclassification a large portion of dots representing B. subtilis is placed above the y = 0 discriminant function. Conversely, a large group of E. faecalis cells was misclassified as B. subtilis. [Color figure can be viewed in the online issue, which is available at]

In contrast to LDA, SVM is usually capable of solving complex classification problems that do not have a simple linear (or quadratic) solution in the parametric space (Fig. 2). Therefore, supervised classification was performed using an SVM-based approach. A radial-basis function kernel $\kappa(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$ was used for all classifications. The optimal type of kernel was established experimentally. The SVM complexity parameter and the γ kernel parameter were found by an extensive grid search evaluating every pair of parameters by re-training and cross-validation.

The 5 × 2 cross-validation and bootstrap algorithms were used to determine the classification success of the optimal SVM. The accuracy of classification computed using cross-validation is summarized in Table 2. An example of a complicated decision boundary (a hyperplane in n dimensions, where n is the number of parameters) determined by a typical SVM training applied to a scattered-light dataset is illustrated in Figure 7.
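The combination of grid search and 5 × 2 cross-validation can be sketched as follows. The sketch is a skeleton only: the parameter grids are hypothetical, and `toy_score` is a placeholder for training an RBF-kernel SVM on the training half and returning its accuracy on the test half, which in practice would call an SVM library.

```python
import random

def five_by_two_splits(n_events, seed=0):
    """Yield (train, test) index lists: 5 random halvings, each half tested once."""
    rng = random.Random(seed)
    for _ in range(5):
        order = list(range(n_events))
        rng.shuffle(order)
        half = n_events // 2
        first, second = order[:half], order[half:]
        yield first, second
        yield second, first

def grid_search(n_events, costs, gammas, score_fn, seed=0):
    """Return (best mean score, best C, best gamma) over every parameter pair."""
    splits = list(five_by_two_splits(n_events, seed))   # same folds for every pair
    best = None
    for cost in costs:
        for gamma in gammas:
            scores = [score_fn(train, test, cost, gamma) for train, test in splits]
            mean_score = sum(scores) / len(scores)
            if best is None or mean_score > best[0]:
                best = (mean_score, cost, gamma)
    return best

def toy_score(train, test, cost, gamma):
    """Placeholder for 'train an RBF SVM on train, return accuracy on test'."""
    return 1.0 / (1.0 + abs(cost - 10.0) + abs(gamma - 0.1))

print(grid_search(1000, costs=[0.1, 1.0, 10.0, 100.0],
                  gammas=[0.01, 0.1, 1.0], score_fn=toy_score))
```

Reusing the same ten train/test splits for every parameter pair keeps the comparison between pairs fair, since differences in score then reflect the parameters rather than the random partition.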

Figure 7.

Examples of cross sections through decision boundaries of SVM-based pattern-recognition system. Filled points represent regular data vectors, empty points represent support vectors. Values of the variables not represented on the 2D plots were set to their medians. B. subtilis—orange points, E. faecalis—blue points. [Color figure can be viewed in the online issue, which is available at]

Table 2. Average classification success rates for 6-parameter (7.8, 11.3, 17.7, 22.5, 90°, and axial light loss) scatter system employing SVM classifier. Each cell lists R (%) / E (%).
              E. coli         L. innocua      B. subtilis     E. faecalis
E. coli       -               86.30 / 95.8    99.10 / 100     68.70 / 77.1
L. innocua    86.30 / 95.8    -               99.60 / 98      81.60 / 95.6
B. subtilis   99.10 / 100     99.60 / 98      -               98.50 / 100
E. faecalis   68.70 / 77.1    81.60 / 95.6    98.50 / 100     -
R, real (measured) classification accuracy; E, estimated classification accuracy.


Although light-scatter signatures of cells have been utilized in microbiological applications of flow cytometry, the role of scattered light was secondary at best. It was the growing availability of fluorescence-labeled antibodies to specific antigens that made possible the use of flow cytometry to directly detect the presence of pathogens. Cytometry-based methods have been employed to detect surface antigens in Haemophilus (41), Salmonella (42, 43), Mycobacterium (44), Brucella (45), Branhamella catarrhalis (46), Mycoplasma fermentans (47), Pseudomonas aeruginosa (48), Bacteroides fragilis (49, 50), Legionella (51), and other microorganisms (52). The main disadvantages of label-dependent detection are the limited availability of antibodies directed against certain microorganisms, and problems with fluorescence detection multiplexing. Although scatter signals have been routinely collected during flow cytometry measurements of bacterial populations, classification of live microorganisms on the basis of scatter signal alone measured in a commercially available flow system has not been reported.

Since the pioneering experiments by Salzmann et al. (9), it has been demonstrated in numerous reports that label-free measurements and classification of bioparticles in flow cytometry are feasible. The major obstacles for the wider implementation of the multiangle scatter systems were the complexity of the design and the lack of easy-to-use tools for data analysis. Although the system reported in this manuscript uses an enhanced detector, the number of simultaneously measured angles is relatively low, and detector installation does not require extensive modifications to the flow cytometry hardware. Instead of focusing on increasing the number of scatter angles, the proposed approach requires pre-selecting angles that are likely to offer high distinguishability for the bioparticles of interest. This is a strength of our design, but also a weakness, since the system can be optimally set up only for a given (and known) type of bioparticle. Consequently, the system as proposed cannot be used for purely exploratory flow cytometry, in which the characteristics of the analyzed bioparticles are completely unknown. However, if control samples are available, and particles whose presence has to be determined (or which have to be enumerated) can be characterized in terms of their scatter properties, there is a good chance that a system can be tuned to accommodate such a specialized measurement. Alternatively, one may locate the optimal position of the detector (and consequently, the collected scatter angles) simply by trial and error, where results obtained from controls are electronically mixed and classified with a machine-learning system, using cross-validation to determine the optimal angles. The current prototype used for demonstration of the proof of concept allows for only two positions of the detectors, but there is no technical reason why multiple positions along the z-axis, and consequently multiple sets of angles, could not be made available.

Comparison of the classification success obtained experimentally to the distinguishabilities estimated from the simple scatter model shows a high level of agreement except for two of the six classification cases for each set of angles. This is encouraging considering that the intra-population variance in size and refractive indices has not been accounted for in the first model. It should also be noted that the predicted high classification rates for the L. innocua and E. faecalis mixture measured with the second configuration (4°, 5.8°, 8.7°, and 11.5°) do not match the experimental results (Table 3). We suspect that the reason for this discrepancy is the high variance in dimensions and refractive index of bacteria.

Table 3. Average classification success rates for 6-parameter (4°, 5.8°, 8.7°, 11.5°, and axial light loss) scatter system employing SVM classifier

              E. coli          L. innocua       B. subtilis      E. faecalis
              R (%)   E (%)    R (%)   E (%)    R (%)   E (%)    R (%)   E (%)
E. coli                        74.8    79.0     69.9    81.0     57.9    61.0
L. innocua    74.8    79.0                      71.7    74.0     77.0    79.0
B. subtilis   69.9    81.0     71.7    74.0                      70.2    87.0
E. faecalis   57.9    61.

R, real (measured) classification accuracy; E, estimated classification accuracy.
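
The comparison between measured and estimated accuracy can be made explicit with a short calculation. The sketch below only transcribes the six R/E pairs listed in Table 3 and computes the difference E − R for each pair of species; the variable names are illustrative and nothing beyond the tabulated values is assumed.

```python
# (species pair) -> (R, E) in percent, transcribed from Table 3.
table3 = {
    ("E. coli", "L. innocua"): (74.8, 79.0),
    ("E. coli", "B. subtilis"): (69.9, 81.0),
    ("E. coli", "E. faecalis"): (57.9, 61.0),
    ("L. innocua", "B. subtilis"): (71.7, 74.0),
    ("L. innocua", "E. faecalis"): (77.0, 79.0),
    ("B. subtilis", "E. faecalis"): (70.2, 87.0),
}

# Gap between estimated and measured accuracy, in percentage points.
gaps = {pair: round(e - r, 1) for pair, (r, e) in table3.items()}

for pair, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{pair[0]} vs. {pair[1]}: E - R = {gap:.1f} percentage points")

largest_gap_pair = max(gaps, key=gaps.get)
```

For the tabulated values the estimate is never below the measured accuracy, and the gaps range from 2.0 to 16.8 percentage points.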

The upper bound of classification success estimated with the help of the enhanced model turned out to be valid, although over-optimistic. The real classification success differed from the estimate by 1–10%. However, we still consider the model to have high predictive power since only in one of the analyzed cases (B. subtilis vs. L. innocua) was the predicted classification rate lower than the real accuracy (Table 2). This shows that the simulated upper bound can be used to determine a priori whether a certain type of analysis and classification is feasible in the given system. For instance, if the simulated upper bound is on the level of 65–70%, any attempt at successful classification will most likely be futile regardless of the quality of the sample and stability of the lasers.

In the presented report, scatter simulation employing state-of-the-art techniques such as DDA allowed us to make optimal use of the multiangle detector. However, owing to the high biological variability of real samples containing microorganisms, we have not employed scatter simulation for the actual particle classification. Instead, a machine-learning system was used.

Machine-learning and pattern recognition systems have been applied to flow cytometry by a number of researchers in fields such as marine biology (53, 54), hematology and immunology (55, 56), and microbiology (57, 58). Among the proposed methods were LDA, neural networks, and SVM (59). However, we are not aware of any application of these techniques to label-free bacteria classification. Another aspect of this work is the combination of simulation-based pre-selection of features with a machine-learning system. The premise of this approach is two-fold. Firstly, the smaller number of parameters to collect simplifies the design of the detector, so that it can be fitted to older hardware. Secondly, overwhelming a machine-learning system with irrelevant features may degrade the performance of classifiers. Naturally, employing a feature selection procedure is one answer to the problem of a huge number of features. However, this solution comes at significant computational cost, and ultimately it may be difficult to implement, especially if real-time or otherwise fast analysis and classification is desirable.

Use of SVM allowed for high classification accuracy and eliminated the need for gating. Manual gating can be performed easily if analyzed parameters are orthogonal. However, intensity of light collected by scatter detectors cannot be orthogonalized via compensation as in the case of fluorescence measurements. Therefore, other methods of classification had to be explored. PCA, kPCA, ICA, and FA failed to separate the analyzed populations. A simple linear discrimination approach was also unable to perform in a satisfactory manner. However, supervised classification employing a kernel approach, such as SVM, produced a remarkably high success rate (Table 2). Unfortunately, SVM results cannot be easily interpreted if the dimensionality of the problem is higher than 2 (compare Fig. 2 with Fig. 7). This may be a serious problem for many practitioners in the field who expect that despite the growth in the number of available variables, some simple graphical model of data analysis would still be employed. Therefore, one of the most important aspects of multiangle scatter studies should be a search for innovative data visualization tools, allowing for meaningful dimensionality reduction and easy exploratory gating.
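
The difference between a linear discriminant and a kernel-based classifier can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the analysis performed in this work: the two concentric rings of points are synthetic (they are not scatter data), and a kernel perceptron is used as a simple stand-in for an SVM because it needs only the Python standard library. With a linear kernel the two classes cannot be separated, whereas with a Gaussian (RBF) kernel the same training rule separates them completely.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative choice."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def linear_kernel(a, b):
    """Dot product plus a constant, which plays the role of a bias term."""
    return sum(x * y for x, y in zip(a, b)) + 1.0

def train_kernel_perceptron(points, labels, kernel, max_epochs=200):
    """Kernel perceptron: count, per training point, how often it was misclassified."""
    n = len(points)
    gram = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: training has converged
            break
    return alpha

def training_accuracy(points, labels, kernel, alpha):
    """Fraction of training points placed strictly on the correct side."""
    hits = 0
    for x, y in zip(points, labels):
        score = sum(a * l * kernel(p, x) for a, l, p in zip(alpha, labels, points))
        if y * score > 0:
            hits += 1
    return hits / len(points)

def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Synthetic toy data: class +1 on an inner ring, class -1 on an outer ring.
points = ring(1.0, 12) + ring(3.0, 12)
labels = [1] * 12 + [-1] * 12

linear_acc = training_accuracy(points, labels, linear_kernel,
                               train_kernel_perceptron(points, labels, linear_kernel))
rbf_acc = training_accuracy(points, labels, rbf_kernel,
                            train_kernel_perceptron(points, labels, rbf_kernel))
print(f"linear kernel: {linear_acc:.2f}, RBF kernel: {rbf_acc:.2f}")
```

The toy example also shows the interpretability issue raised above: the RBF decision rule is a weighted sum over training points rather than a single line or gate that can be drawn on a two-parameter plot.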