Commentary
Automating flow cytometry
Article first published online: 28 DEC 2011
DOI: 10.1002/cyto.a.22007
Copyright © 2011 International Society for Advancement of Cytometry
Additional Information
How to Cite
Pedreira, C. E. (2012), Automating flow cytometry. Cytometry, 81A: 110–111. doi: 10.1002/cyto.a.22007
Publication History
- Issue published online: 23 JAN 2012
- Article first published online: 28 DEC 2011
- Manuscript Accepted: 28 NOV 2011
- Manuscript Revised: 22 NOV 2011
- Manuscript Received: 7 NOV 2011
Flow cytometry data analysis has traditionally relied on identifying cell populations through gates defined in bi-dimensional plots, where an experienced operator selects the subpopulation(s) of interest. However, recent advances, namely the growing multiparameter capabilities of flow cytometers and the introduction of sophisticated strategies (1) that allow for virtually infinite-color flow cytometry, have increased the motivation for and interest in new strategies for the automated identification of cell populations. Automation schemes for flow cytometry data were approached in Ref. 2 and, more recently, in Refs. 3–7.
The article by Stuchlý et al. in this issue (page 120) introduces an interesting procedure aimed at automating the analysis of proteomic data generated by flow cytometry. It is important to emphasize the relevance of this contribution, as it brings in a methodology that can potentially be used in a broad range of proteomic problems. Stuchlý et al. provide a tool that automatically finds a set of clusters, formed by color-coded microspheres, in a sequence of bi-dimensional spaces.
Cluster analysis (8) is a collection of techniques that has been extensively and successfully used in engineering problems and, especially in the last decade, is increasingly becoming a key tool in many medical applications. The main goal of cluster analysis is to identify underlying structures present in the data. Accordingly, the aim is to track down groups of multidimensional points (in the case of Stuchlý et al. in this issue, groups of color-coded microspheres) whose members are similar to one another. More than that, one pursues groups of points that are internally compact and well separated from each other.
There are many well-established clustering methods in the literature. Stuchlý et al. report that they experimented with some of these methods, among them the classical k-means and hierarchical clustering (8), and finally settled on an approach from the k-medoids family, specifically partitioning around medoids (PAM) (9). At this point, it is worthwhile to look more closely at these choices and their potential applications in flow cytometry automation.
The k-means, probably the most popular approach to clustering, is an easy-to-use, intuitive method. It groups the data by associating a representative, called a centroid, with each of the clusters. The initial centroids are chosen arbitrarily (typically, although not necessarily, at random) in the data-point space. Accordingly, if one aims to partition the data into k clusters, one should initially choose (or draw) k centroids. The next step is to identify, for each centroid, the subset of data-points that are closer to it than to any other centroid. One can then calculate the mean of each subset and take these means as the new centroids. These steps are repeated in a loop until none of the centroids changes its location.
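The loop just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the cited works: the function names, the squared Euclidean distance, the choice of drawing the initial centroids among the data-points, and the optional `init` argument are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Arbitrary initialization (assumption): k distinct data-points are drawn
    # at random unless the caller supplies starting centroids through `init`.
    if init is not None:
        centroids = [tuple(c) for c in init]
    else:
        centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: the mean of each subset becomes the new centroid.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: stop
            break
        centroids = new_centroids
    return centroids, clusters
```

Calling `kmeans(points, 2)` with different `seed` values, or with different `init` centroids, is a simple way to observe how the result may depend on the starting configuration.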
The k-medoids approach also has to be initialized arbitrarily, but each cluster representative (medoid) is chosen to be one of the data-points, instead of the mean of the data-points belonging to a given cluster, as in the k-means. The aim is then to find k medoids, each of which minimizes a measure of dissimilarity to all the data-points of the cluster it represents. The implications of this apparently small change are twofold: (i) first, the method becomes less sensitive to data-points that are far away from their cluster representative, whereas in the k-means approach occasional far-away data-points could severely change the value of the means used as cluster representatives; (ii) second, the downside is that this procedure is much more demanding computationally, in both processing and memory, and this computational burden grows quite fast (with the square) as the number of data-points increases.
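To make the contrast with the k-means concrete, the sketch below implements a simplified k-medoids iteration. It is an assumption-laden illustration: it alternates between assignment and medoid update, which is not the full swap-based PAM procedure of Ref. 9, and the function names, the Manhattan dissimilarity and the `init` argument are hypothetical choices made for the example. The full matrix `d` of pairwise dissimilarities is stored explicitly, which is where the quadratic memory cost comes from.

```python
import random


def manhattan(a, b):
    """Manhattan dissimilarity between two points given as tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, k, dist=manhattan, init=None, seed=0, max_iter=100):
    """Return (medoid points, labels); `init` holds indices into `points`."""
    n = len(points)
    # All n x n pairwise dissimilarities: memory grows with the square of n.
    d = [[dist(p, q) for q in points] for p in points]
    rng = random.Random(seed)
    # Arbitrary initialization: k distinct data-point indices.
    medoids = list(init) if init is not None else rng.sample(range(n), k)

    def assign(meds):
        # Each point is labeled with the position of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # can only happen with duplicated points
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total
            # dissimilarity to the other members of its cluster.
            best = min(members, key=lambda m: sum(d[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:  # no medoid changed: stop
            break
        medoids = new_medoids
    return [points[m] for m in medoids], assign(medoids)
```

For the points (0, 0), (1, 0), (2, 0), (3, 0) and (100, 0) treated as a single cluster, the medoid is (2, 0), whereas the mean has first coordinate 21.2: the single far-away point drags the mean but not the medoid.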
Some applications of hierarchical clustering (8) in flow cytometry can be found in the literature, e.g., in Refs. 10–12. In contrast with the k-means and k-medoids approaches, hierarchical clustering does not require an arbitrary initialization step. Also, the number of clusters need not be set beforehand. On the other hand, it requires the specification of a measure of pairwise dissimilarity between clusters, and this measure directly influences the final clustering result. The clusters are created, at each hierarchical level, by merging some of the clusters existing at the previous, lower level. At the beginning (lowest level), each cluster is formed by a single data-point, i.e., one has as many clusters as data-points. The algorithm proceeds by merging, at each subsequent level, the pair of clusters with the smallest intergroup dissimilarity. This is known as the agglomerative approach; its counterpart is the divisive approach, in which one starts with just one cluster containing all the data-points and proceeds by splitting, at each level, one of the existing clusters. Hierarchical clustering is known to be computationally heavy, especially for a large number of data-points, since the dissimilarity measures have to be calculated and ordered (as one wishes to merge the pair of clusters with the smallest dissimilarity) for all possible pairs of clusters (data-points at the first step).
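A naive version of the agglomerative procedure can be sketched as follows. The intergroup dissimilarity used here is single linkage (the smallest distance between any two members), which is only one of several possible choices and is an assumption of this example, as are the function name and the `stop_at` parameter. The exhaustive search over all pairs of clusters at every level makes the computational burden mentioned above visible.

```python
def agglomerate(points, dist, stop_at=1):
    """Merge clusters bottom-up until `stop_at` clusters remain.

    Returns (clusters, merges). Clusters are lists of indices into `points`;
    `merges` records (cluster_a, cluster_b, dissimilarity) at each level,
    i.e. the information from which a dendrogram would be drawn.
    """
    # Lowest level: every data-point is its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > stop_at:
        best = None
        # Examine every pair of clusters (the expensive part).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumption): smallest point-to-point distance.
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges
```

Running the procedure down to a single cluster (`stop_at=1`) yields the complete merge history; choosing the number of clusters then amounts to deciding at which level to cut it, which can be done after the fact.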
Although intuitive, simple and not very heavy from the computational point of view, the k-means is quite sensitive to the initialization of the algorithm. This means that, depending on the initial arbitrary choice of the centroids, one may get better or worse clustering results, which is obviously an undesirable feature. At this point, it is crucial to draw attention to the possibility of incorporating previous knowledge about the experiments, and to underline that automation should by no means be confused with working blindly. Adding previously known information about the problem is, in many cases, a good approach. For instance, in Ref. 13, in the context of automated analysis of peripheral blood lymphocytes, the use of biological knowledge is proposed to manually locate prototypes and also to determine a sequence of feature (marker) spaces in which the cell subsets are to be identified. In Stuchlý et al., an interesting strategy, taking advantage of information about the experiment, was used to make the PAM algorithm numerically feasible. The bi-dimensional Ax488 vs. Ax647 space was partitioned into bins and the denser bins were selected. Next, a number of events in each bin were randomly drawn, and PAM was run only on this fraction of representative events. In this way, the algorithm requires much less computational effort, in terms of both memory and processing. It is important to remember that memory (RAM) in particular is a key limitation for k-medoids algorithms, as the complexity increases proportionally to the square of the number of events, around 50,000 in Stuchlý et al. It is worthwhile to stress that, after undersampling the data in this way, a k-means algorithm (e.g., initialized with the means of the representative data-points) would most probably have reached similar results.
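The undersampling idea can be illustrated with a short sketch. The details below are assumptions made for the example and are not taken from Stuchlý et al.: the square grid, the `bin_width`, `min_count` and `per_bin` parameters, and the function names are all hypothetical. A second helper estimates the size of a full pairwise dissimilarity matrix, assuming 8 bytes per entry.

```python
import random
from collections import defaultdict


def subsample_dense_bins(events, bin_width, min_count, per_bin, seed=0):
    """Keep at most `per_bin` randomly drawn events from each dense bin.

    `events` is a list of (x, y) pairs; a bin counts as dense when it holds
    at least `min_count` events (a hypothetical density criterion).
    """
    bins = defaultdict(list)
    for x, y in events:
        # Square grid over the bi-dimensional space.
        key = (int(x // bin_width), int(y // bin_width))
        bins[key].append((x, y))
    rng = random.Random(seed)
    sample = []
    for key in sorted(bins):  # fixed order keeps the draw reproducible
        members = bins[key]
        if len(members) >= min_count:
            sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


def pairwise_matrix_bytes(n_events, bytes_per_entry=8):
    """Memory for a full n x n dissimilarity matrix (8-byte entries assumed)."""
    return n_events * n_events * bytes_per_entry
```

Under the 8-byte assumption, `pairwise_matrix_bytes(50000)` gives 20,000,000,000 bytes (about 20 GB) for a full matrix over 50,000 events, while a representative sample of 2,000 events would need 32,000,000 bytes (about 32 MB). The reduced sample can then be handed to a k-medoids or k-means routine.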
Hierarchical clustering carries a powerful advantage in its very name: the possibility of having hierarchical procedures is no doubt attractive. On the other hand, the hierarchical tree, the dendrogram, is built directly by the procedure, taking the dissimilarities into account. Stuchlý et al. once again made appropriate use of knowledge about the experiment by hierarchically choosing a suitable sequence of bi-dimensional plots, e.g., FSC versus Ax750. Note that the problem solved in Stuchlý et al. is in fact a higher-dimensional problem that was correctly approached as a sequence of bi-dimensional ones.
Concerning the practical applicability of the proposed framework, on the one hand, it is true that processing huge protein arrays requires cytometers with at least six fluorescence detectors and that instruments like the modified FACS LSR II used by Stuchlý et al. are not yet available in many laboratories. On the other hand, it is also true that access to cytometers with eight or more detectors is gradually becoming more widespread. The strategy introduced by Stuchlý et al. in this issue constitutes an important contribution, since it provides a quite general tool. Automation plays a key role in the proposed framework, allowing for fast and reliable processing of a large amount of information related to protein profile changes.
LITERATURE CITED
- 1. Generation of flow cytometry data files with a potentially infinite number of dimensions. Cytometry Part A 2008; 73A: 834–846.
- 2. A new automated flow cytometry data analysis approach for the diagnostic screening of neoplastic B-cell disorders in peripheral blood samples with absolute lymphocytosis. Leukemia 2006; 20: 1221–1240.
- 3. Automated high-dimensional flow cytometric data analysis. PNAS 2009; 106: 8519–8524.
- 4. Automated pattern-guided principal component analysis versus expert-based immunophenotypic classification of hematological malignancies. Leukemia 2010; 24: 1927–1933.
- 5. Rapid cell population identification in flow cytometry data. Cytometry Part A 2011; 79A: 6–13.
- 6. Analysis of flow cytometry data using. Cytometry Part A 2008; 73A: 857–867.
- 7. Automation in high-content flow cytometry screening. Cytometry Part A 2009; 75A: 789–797.
- 8. Pattern Classification, Chapter 10, 2nd ed. New York: Wiley, 2001.
- 9. Finding Groups in Data, Chapter 2. New Jersey: Wiley, 1989.
- 10. Profiling of polychromatic flow cytometry data on B-cells reveals patients' clusters in common variable immunodeficiency. Cytometry Part A 2009; 75A: 902–909.
- 11. Combination of automated high throughput platforms, flow cytometry, and hierarchical clustering to detect cell state. Cytometry Part A 2007; 71A: 16–27.
- 12. Detection and monitoring of normal and leukemic cell populations with hierarchical clustering of flow cytometry data. Cytometry Part A 2012; 81A: 25–34.
- 13. A multidimensional classification approach for the automated analysis of flow cytometry data. IEEE Trans Biomed Eng 2008; 55: 1155–1162.

