Analyses of large flow cytometry datasets


  • Jan Stuchlý,

    1. Department of Pediatric Hematology/Oncology, 2nd Faculty of Medicine, Charles University Prague, Czech Republic
  • Tomáš Kalina

    Corresponding author
    1. Department of Pediatric Hematology/Oncology, 2nd Faculty of Medicine, Charles University Prague, Czech Republic
    • Correspondence to: Tomáš Kalina, Department of Pediatric Hematology/Oncology, 2nd Faculty of Medicine, Charles University Prague, Czech Republic. E-mail:


Flow cytometry has become an essential tool for clinical diagnosis and monitoring in immunology and hematology. Flow cytometry instrumentation is widely available, and flow cytometry assays are relatively simple and fast to perform. For these reasons, and because of the wealth of single-cell information and the digital nature of the generated data, the use of flow cytometry in clinical trials and in other large studies that involve patient cohorts is expanding. Clinical trials typically run for years, involve many subjects at multiple centers, and may require hundreds to thousands of measurements. Analyses of large data sets from clinical trials should strive for maximal objectivity. This goal has traditionally been pursued by having a single expert analyze the raw flow cytometry data [1]. In such a setting, the data analysis is performed file by file by an expert analyst (“complete manual analysis”), and the gating strategy is decided and applied by the analyst using software supplied by the flow cytometry instrument manufacturer or a third party. The strength of this approach clearly lies in the involvement of an expert who can address non-standard situations and provide immediate feedback to the laboratory. At the same time, the expert is potentially a weak point because the quality of an expert is difficult to define and maintain, and it is often difficult for several experts to reach a consensus. Moreover, current in vitro diagnostics (IVD) instruments offer 8–10 fluorescence parameters, which makes complete manual gating impractical and hard to standardize, especially if the analyses are performed over several months or years. The reproducibility of the analyses can be improved by using a predetermined set of sequential gates (“template gating”) that is applied uniformly to the complete dataset [2]. However, this approach may be negatively influenced by inter-sample variation.
In most projects, template gating is manually adjusted by an expert where needed. This approach is faster, more reproducible, and easier to supervise than complete manual analysis. At present, fully computational analyses of data (“algorithmic analysis”) that can be applied to an entire patient cohort are under experimental development. For example, we used “template gating” to define B-lymphocytes from a large cohort of patients with common variable immunodeficiency and then subjected them to an algorithmic analysis (probability binning algorithm) to search for B-cell phenotypic patterns that showed similarities among the patients [3]. Another general approach to algorithmic analysis is to identify clusters of events (cell populations [4] or distinct color-coded beads as a part of an analytical pipeline [5]). Various algorithmic approaches can be applied to flow cytometry data; they have been reviewed by Lugli et al. [6] and more recently by Pedreira et al. [7]. Diverse algorithmic analyses were critically tested and compared in the FlowCAP challenges [8]. Algorithmic dataset analysis can be directly joined with testing for correlations with clinical outcome (RchyOptimyx), as exemplified by Aghaeepour et al. [9]. Fiser et al. [10] have developed an algorithm to ascertain the relationships of individual cells within an FCS file by hierarchical clustering.
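To make the idea of probability binning more concrete, the following is a minimal univariate sketch in Python. It is not the pipeline used in [3]: the function names, the number of bins, and the simulated data are all assumptions made for illustration. Bin edges are placed at quantiles of a control sample so that each bin holds roughly the same number of control events; a test sample is then counted in the same bins, and the two sets of bin fractions are compared with a chi-square-like sum.

```python
import bisect
import random


def quantile_edges(control, n_bins):
    """Bin edges at control quantiles, so each bin holds ~equal control counts."""
    ordered = sorted(control)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def bin_fractions(values, edges):
    """Fraction of events falling into each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]


def pb_statistic(control, test, n_bins=20):
    """Chi-square-like distance between binned control and test distributions."""
    edges = quantile_edges(control, n_bins)
    c = bin_fractions(control, edges)
    t = bin_fractions(test, edges)
    return sum((ci - ti) ** 2 / (ci + ti) for ci, ti in zip(c, t) if ci + ti > 0)


# Simulated (hypothetical) fluorescence values on an arbitrary scale.
random.seed(0)
control = [random.gauss(0.0, 1.0) for _ in range(5000)]
similar = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(0.5, 1.0) for _ in range(5000)]

print("control vs similar:", pb_statistic(control, similar))
print("control vs shifted:", pb_statistic(control, shifted))
```

In this toy setting the statistic is small for two samples drawn from the same distribution and larger for the shifted sample. Real applications bin several parameters at once and calibrate the statistic against replicate controls, which this sketch does not attempt.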

Template gating and algorithmic analyses require highly uniform data, that is, data in which variability introduced by non-biological factors is limited. Unless corrected by an analyst, technical variability, such as shifts in fluorescence caused by the instrument or arising during the staining procedure, may be misinterpreted as biological.

The article by Finak et al. in this issue (page 277) introduces a data normalization algorithm that can achieve robust uniformity of the data prior to analysis. The authors use a priori knowledge of the structure of the data, namely the hierarchy of the main subsets, and normalize the data only at the level of the template-gated subsets. Use of the template gating approach leads to the deconvolution of the data and to more precise automated detection of the peaks (so-called “landmarks”). This method therefore results in a more accurate alignment of the peaks across the dataset. Like the eye of an experienced analyst, the algorithm is more efficient and reliable at finding well-separated single peaks than at interpreting several overlapping peaks.
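The principle of landmark alignment can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of Finak et al.: here the landmark is taken as the center of the most populated histogram bin, and each sample is corrected by a constant shift toward the mean landmark, whereas the published method is more elaborate. The function names, the bin count, and the simulated data are assumptions made for illustration.

```python
import random
from statistics import mean


def find_landmark(values, bins=50):
    """Crude peak estimate: center of the most populated histogram bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = counts.index(max(counts))
    return lo + (peak + 0.5) * width


def align_to_common_landmark(samples):
    """Shift every sample so that its landmark sits at the mean landmark."""
    landmarks = [find_landmark(s) for s in samples]
    target = mean(landmarks)
    return [[v - lm + target for v in s] for s, lm in zip(samples, landmarks)]


# Three simulated samples of one gated subset, each with a single peak
# displaced by a different (hypothetical) technical shift.
random.seed(0)
samples = [[random.gauss(center, 0.2) for _ in range(5000)]
           for center in (2.0, 2.4, 1.7)]

aligned = align_to_common_landmark(samples)
before = [find_landmark(s) for s in samples]
after = [find_landmark(s) for s in aligned]
print("landmarks before:", before)
print("landmarks after: ", after)
```

The sketch works only because each simulated subset has one well-separated peak, which is the situation that gating to a subset before normalization is meant to create. With several overlapping peaks, a single-mode estimate of this kind would not be reliable.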

When employing a normalization tool, one should always observe the common saying: “No fancy analysis can make good data out of bad data.” All possible means should be taken to ensure that the data are of high quality. Protocols for the standardization of instrument setup and sample preparation, as exemplified by the EuroFlow consortium [11], should be in place from the outset to ensure that the data are as robust as possible. Experimental design is crucial to limit fluorochrome interference (compensation-induced data spread) and to provide adequate resolution of the signal over the background. The training of technicians and operators is also of utmost importance and should be accompanied by internal and external quality assessment. Important technical aspects of the use of flow cytometry in clinical trials have been summarized previously by Maecker et al. [12]. Although normalization can adjust for symmetrical shifts of signals, such as those caused by decreased laser power or decreased antibody staining, it will not correct random errors (e.g., an omitted antibody). Care must also be taken not to apply the presented normalization in cases where changes in fluorescence intensity are the desired read-out.

We tested how well the algorithm could correct the frequent finding of inappropriate compensation. We used one FCS data file and saved it four times with alterations of the CD3 APC to CD45RA APC-Alexa750 channel compensation (correct, overcompensated by 30%, undercompensated by 30%, uncompensated). We applied the “warpSet” function from the “flowStats” R package (version 3.19.9) to the four FCS files. The normalization was applied to the gated subset (human CD3+ T-cell subset). As proposed by Finak et al., the algorithm can correct for mild errors that are introduced by compensation (see Fig. 1), provided that the data do not lose or gain discrete peaks (so-called “landmarks”) and that the landmarks are not shifted too far or confused with neighboring landmarks (Fig. 1d, uncompensated). Normalized data show much smaller variations in the percentage of positive cells (Fig. 2b, overcompensated; Fig. 2c, undercompensated).
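The effect of a compensation error on a fixed gate can be sketched with a two-channel toy model in Python. This is not the data or procedure behind Figures 1 and 2. The spillover fraction, the signal levels, the gate threshold, and the reading of “overcompensated by 30%” as a coefficient 1.3 times the true spillover are all assumptions made for illustration.

```python
import random


def compensate(primary, secondary_observed, coefficient):
    """Subtract coefficient * primary from the observed secondary signal."""
    return [s - coefficient * p for p, s in zip(primary, secondary_observed)]


def percent_positive(values, threshold):
    """Percentage of events above a fixed template gate."""
    return 100.0 * sum(v > threshold for v in values) / len(values)


random.seed(1)
TRUE_SPILL = 0.10   # hypothetical spillover of the primary dye into the second detector
THRESHOLD = 75.0    # hypothetical fixed gate on the secondary channel
N = 10000

# Bright primary signal on every event (a CD3-like marker on gated T cells).
primary = [random.gauss(1000.0, 100.0) for _ in range(N)]
# True secondary signal: half of the events positive, half negative.
true_secondary = [random.gauss(150.0 if i % 2 == 0 else 0.0, 30.0)
                  for i in range(N)]
# What the detector sees before compensation.
observed = [t + TRUE_SPILL * p for t, p in zip(true_secondary, primary)]

# Compensation coefficient expressed as a multiple of the true spillover.
settings = {"correct": 1.0, "over": 1.3, "under": 0.7, "uncompensated": 0.0}
results = {
    name: percent_positive(
        compensate(primary, observed, factor * TRUE_SPILL), THRESHOLD)
    for name, factor in settings.items()
}
for name, pct in results.items():
    print(f"{name:>14}: {pct:.1f}% positive")
```

In this model, a wrong coefficient leaves a residual proportional to the primary signal. The residual shifts both secondary populations, so a gate fixed on the correctly compensated data reports a biased percentage. Mild errors move the peaks by a small amount that a landmark-based shift could plausibly absorb, while the uncompensated case moves them much further.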

Figure 1.

Correction of the CD3 APC to CD45RA APC-Alexa750 compensation artifact (original files in the left panels, normalized data in the right panels), gated on CD3 APC-positive cells only: (a) correctly compensated, (b) overcompensated by 30%, (c) undercompensated by 30%, (d) uncompensated. Landmarks used by the algorithm for normalization are highlighted.

Figure 2.

A single template gate (CD45RA+) applied to CD3 APC-positive cells before (left) and after (right) normalization shows correction of the percentage of CD45RA-positive events in mildly over- (b) and undercompensated (c) data, but not in uncompensated data (d).

In summary, there is certainly a place for automated algorithmic analyses, such as the one presented by Finak et al. (in this issue, page 277). However, these analyses should never be applied blindly. The biologist who interprets the data needs an extensive understanding of the algorithm, and the bioinformaticians need a fair understanding of the biological problem. In fact, the presented approach directly calls for the involvement of a biologist to provide a sound gating template for major subsets. The effects of normalization should always be studied by comparing the results of the analysis before and after normalization. To achieve practical usefulness, new tools, such as those developed in “R,” need to be well integrated into the “old school” manual software. At the same time, more education in bioinformatics and its concepts is clearly needed within the flow cytometry community to make full use of the rich datasets that we are now able to generate.