Traditionally, flow cytometry (FCM) has been a tube-based technique limited to small-scale laboratory and clinical studies. High throughput methods for FCM have recently been developed for drug discovery and advanced research (1–4). As an example, flow cytometry high content screening (FC-HCS) can process up to a thousand samples daily at a single workstation, with results equivalent or superior to those of traditional manual multiparameter staining and analysis techniques. The amount of information generated by high throughput technologies, such as FC-HCS, needs to be transformed into executive summaries that are brief enough for creative study by a human researcher (5). Quality control and quality assessment are crucial steps in the development and use of new high throughput technologies and their associated information services (5–7). Quality control in clinical cell analysis by FCM has been considered previously (8, 9). As an example, Edwards et al. (9) proposed quality scores for monitoring the immunophenotyping process (e.g., blood acquisition, cell preparation, lymphocyte staining). They showed that a low degree of temporal parameter variation exists within individuals, whereas significant variation can exist between donors with respect to the parameter monitored. However, little has been done for high throughput FCM. For example, quality control of FCM experiments should include the assessment of instrument parameters that affect the accuracy and precision of the data. In that respect, Gratama et al. (10) have proposed guidelines such as monitoring fluorescence measurements by computing calibration plots for each fluorescence parameter. However, such procedures are not yet systematically applied, and data quality assessment is often needed to compensate for a lack of data quality control. The aim of data quality assessment is to detect whether any measurements of any samples are substantially different from the others, in ways that are not likely to be biologically motivated. The rationale is that such samples should be identified, investigated, and potentially removed from downstream analyses. Quality control, on the other hand, measures such quantities during the assaying procedure and can alert the user to problems at a time when they can still be corrected.
Data quality assessment in high throughput FCM experiments is complicated by the volume of data involved and by the many processing steps required to produce those data. Each instrument manufacturer has created software to drive the data acquisition process of its cytometers (e.g., CellQuest Pro by BD Biosciences, San Jose, CA; Summit by DakoCytomation, Fort Collins, CO; or Expo32 by Beckman Coulter, Fullerton, CA). These tools are primarily designed for their proprietary instrument interface and offer few, or no, data quality assessment functions. Third party analysis and management tools, such as FlowJo (Tree Star, Ashland, OR), WinList (Verity Software House, Topsham, ME), or FCSExpress (Denovo Software, Thornhill, Canada), provide researchers with more capable “offline” analysis tools but remain limited in terms of data quality assessment.
We propose a number of one- and two-dimensional graphical methods for exploring the data, in the hope that they will be of use to investigators. The basis of our approach is that, given a cell line, or a single sample, divided into several aliquots, the distribution of the same physical or chemical characteristic (e.g., side light scatter (SSC) or forward light scatter (FSC)) should be similar between aliquots. To test this hypothesis, we made use of graphical exploratory data analysis (EDA). Five distinct visualization methods were implemented to explore the distributions and densities of ungated FCM data: empirical cumulative distribution function (ECDF) plots, histograms, boxplots, and two types of bivariate plots. These different graphical methods provide investigators with different views of the data. ECDF plots have been widely used in the analysis of microarray data, where they help to detect defective print tips or plates of reagents that have not been well handled (11). These plots can quickly reveal differences in distributions, but are not particularly useful for understanding the shape of a distribution. Histograms help to visualize the shape of the distribution and can reveal structure, such as the mode. Boxplots summarize the location of the distribution and can reveal asymmetry, but are mainly applicable to unimodal distributions. Boxplots are also commonly used in the processing of microarray data, where they help to identify hybridization artifacts and assess the need for between-array normalization to deal with scale differences among arrays (11). Finally, we use bivariate plots in two different ways. In some cases, when comparing two samples, we found two-dimensional displays more informative: two-dimensional summaries can show differences between samples even when the one-dimensional summaries mentioned earlier are similar. One common use of bivariate plots in FCM experiments is to display the joint distribution of two continuous variables as dot plots (e.g., FSC versus SSC). However, analyzing such dot plots can be challenging, as the high density of plotted points (an average of 10,000 data points per sample) can merge into a single dense mass in which the frequency of observations is not easily appreciated. To overcome this issue we propose the use of contour plots, where contour lines can be interpreted as the frequency of observations over the x–y plane. The second use of bivariate plots, for high throughput FCM data, is to render per-well summary statistics for a particular plate in the format of a scatterplot. In this view each point represents a single well, and the x and y values are chosen from among various summary statistics.
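The following is a minimal sketch in R, using only base graphics and simulated FSC-like values in place of real measurements, of how each of the five plot types could be produced; all object names and distribution parameters are illustrative and not taken from either dataset. With real data, the per-aliquot vectors would instead come from the ungated FCS files.

```r
## Simulate 8 aliquots of one sample; each is a two-component mixture,
## mimicking the multimodality typical of ungated FCM data.
set.seed(1)
aliquots <- replicate(8,
                      c(rnorm(7000, 200, 30),   # first subpopulation
                        rnorm(3000, 450, 50)),  # second subpopulation
                      simplify = FALSE)

## 1. ECDF plots: overlay all aliquots; distributional shifts stand out.
plot(ecdf(aliquots[[1]]), main = "FSC ECDFs by aliquot", xlab = "FSC")
for (i in 2:8) lines(ecdf(aliquots[[i]]), col = i)

## 2. Histogram: reveals the shape of the distribution (here, two modes).
hist(aliquots[[1]], breaks = 100, freq = FALSE, main = "", xlab = "FSC")

## 3. Boxplots: one box per aliquot; location and spread differences show.
boxplot(aliquots, names = paste0("A", 1:8), ylab = "FSC")

## 4. Contour plot of the joint FSC/SSC density for a single aliquot.
library(MASS)  # for the 2D kernel density estimate kde2d()
fsc <- aliquots[[1]]
ssc <- 0.5 * fsc + rnorm(length(fsc), 0, 40)  # fake correlated SSC
contour(kde2d(fsc, ssc, n = 100), xlab = "FSC", ylab = "SSC")

## 5. Scatterplot of per-aliquot summary statistics: one point per sample.
plot(sapply(aliquots, median), sapply(aliquots, IQR),
     xlab = "median FSC", ylab = "IQR of FSC")
```

An outlying aliquot would appear as a curve, box, or point that separates from the others in any of these displays.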
We illustrate the need for, and usefulness of, these visualization tools in assessing FCM data quality through the examination of two FC-HCS datasets. Our results demonstrate that applying these graphical analysis methods to ungated FCM data provides a systematic and efficient method of data quality assessment, preventing time-consuming gating and further analysis of unreliable samples. Although the methods we propose are primarily aimed at the discovery of data quality problems, they may detect differences that are biologically motivated. Hence, we discourage the automatic removal of aberrant samples and emphasize the need to check whether such underlying biological causes are present.
DISCUSSION AND CONCLUSION
In this article we have concentrated on one- and two-dimensional views to summarize ungated FCM data. Table 1 summarizes their different properties and limitations. We believe these plots are an intuitive and effective first approach for quality assessment of high throughput FCM data. We also understand that some of the samples identified by our proposed procedures will be anomalous for biological reasons, as in Figure 6. However, the main thrust of this article is the identification of problems in data quality. Any sample identified is worth studying further, and a determination should be made as to whether it presents a data quality issue or instead reflects real biological differences.
When assessing the quality of any high throughput FCM experiment, one-dimensional summaries should come first. They allow several samples to be displayed together in one graphic, facilitating comparison between groups. For example, Figure 1 presents ECDF plots of 8 FSC measurements at 12 different time points for one patient. Any sample that is unusual in any one of these plots should be investigated further. We also demonstrated that boxplots can help identify plate-specific biases (Fig. 4). The spacings between the different parts of the box indicate variance and skew, and can identify outliers. However, we note that in the context of ungated FCM data the use of summary statistics (summary scatterplots or boxplots) can be problematic, since such data are often mixtures of populations and thus give rise to multimodal distributions, as revealed by the density plots (Fig. 3). In such cases the interpretation of most summary statistics is problematic: the statistics will change with the mixture proportions, even when each subpopulation is unchanged, and conversely they may change little even when the corresponding subpopulations change substantially. Further research into developing good summary statistics for such data is needed.
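To make this caveat concrete, the following sketch (simulated data, illustrative parameters) builds two samples from the same two subpopulations in different proportions: the per-sample means differ noticeably even though neither subpopulation has changed, which is exactly what a density plot makes obvious.

```r
## Two samples sharing identical subpopulations, mixed 70/30 vs 30/70.
set.seed(2)
pop_lo <- function(n) rnorm(n, 200, 30)
pop_hi <- function(n) rnorm(n, 450, 50)
s1 <- c(pop_lo(7000), pop_hi(3000))
s2 <- c(pop_lo(3000), pop_hi(7000))
c(mean(s1), mean(s2))  # means differ although the subpopulations do not

## The density plots reveal the real situation immediately.
plot(density(s1), main = "Same subpopulations, different proportions")
lines(density(s2), lty = 2)
```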
While analyzing samples stored in 96-well plates, one should look for row, column, and plate effects (Figs. 6 and 7). On one hand, samples that are divided into aliquots should be compared in any and all sensible ways, e.g., the ECDF plots of FSC values across all replicates in Figure 1. On the other hand, two-dimensional scatterplots should be used for comparisons across samples, as shown in the scatterplot of summary statistics in Figure 6. Two- or higher-dimensional summaries are usually more complex to visualize, but they can reveal unusual samples that are not anomalous in any one-dimensional projection (and the same holds in higher dimensions). Contour plots or other enhanced high density scatterplots can be used to further investigate a particular sample, as seen in Figure 7. However, they typically require separate plots for each sample and are thus more difficult to use and interpret.
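A hedged sketch of such a per-well summary scatterplot is given below, again on simulated data; the injected column effect and the flagging cutoff are purely illustrative.

```r
## One simulated 96-well plate with a deliberate shift in column 12.
set.seed(3)
wells <- expand.grid(row = LETTERS[1:8], col = 1:12)
wells$medFSC <- rnorm(96, 300, 10) + ifelse(wells$col == 12, 60, 0)
wells$madFSC <- abs(rnorm(96, 25, 4))

## One point per well; affected wells separate from the main cloud.
plot(wells$medFSC, wells$madFSC, xlab = "median FSC", ylab = "MAD of FSC")
flag <- wells$medFSC > 340  # crude, purely illustrative cutoff
text(wells$medFSC[flag], wells$madFSC[flag],
     labels = paste0(wells$row, wells$col)[flag], pos = 2, cex = 0.7)
```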
In conclusion, artifacts from sample preparation, sample handling, variations in instrument parameters, or other factors may confound experimental measurements and lead to erroneous conclusions. Methods designed to detect problematic samples are an important, yet difficult, application. The role of EDA and quality assessment is to provide assurance that the data are as anticipated. We found that coupling graphical representations provides a useful approach for assessing data quality in high throughput experiments, as each plot highlights different characteristics of the data (Fig. 5 and Table 1) and can reveal substantial nonbiological differences between samples. Such screening tools are particularly valuable for high throughput technologies, as they allow rapid evaluation of a large number of samples. We therefore propose that the described visualizations be used as quality assessment tools and, where possible, incorporated into quality control procedures. It is likely that some summaries can be developed and specialized to particular situations and settings using interactive visualization tools such as GGobi (16), which has an elegant interface to R via the Rggobi package.