Specification of Logicle Functions
By considering these criteria and examining the behavior of a number of functions, we concluded that particular generalizations of the hyperbolic sine function (sinh), which we came to call Logicle functions, can best meet the criteria. The hyperbolic sine function itself has the desirable properties of being essentially linear near zero, becoming exponential for large values (leading to a logarithmic display scale there), and making a very smooth transition between these regions (i.e., it is continuous in all derivatives), but it does not provide enough flexibility to meet the display needs encountered in flow cytometry (1).
The hyperbolic sine function itself is given as follows:

sinh(x) = (e^x − e^(−x)) / 2    (1)
This can be generalized to what we call biexponential functions,

B(x) = a e^(bx) − c e^(−dx) + f    (2)
Interpreting the condition of maximal linearity around data value zero to mean that the second derivative of the function should be zero therein, we identified a subset of biexponential functions with this property and call them Logicle scaling functions.
Besides the constraint just specified, there are four further choices that need to be made to fix the five parameters in Eq. (2) (a, b, c, d, and f), and thereby define a specific display. How these choices appear in an actual Logicle display is illustrated in Figure 2. The parameters described later and in Figure 2 are not simply a, b, c, d, and f, but, once specified, they uniquely determine the function in Eq. (2). The first choice is the maximum data value in the displayed scale (T). The second is the range of the display in relation to the width of high data value decades (M or m in decade or natural log formulations, respectively). If this is held constant among plots optimized to different data sets, the nearly logarithmic area at the upper end of each display will be essentially the same, while the region near data zero is adjusted to optimize for different data sets. We have found that a total plot width of 4.5 “decades” is usually a good choice for displaying flow cytometry data.
The third choice is the strength and range of linearization around zero (W or w). The linear slope at zero (in, for example, data units per pixel or data units per mm in a printout) and the range of data values in the nearly linear zone are determined by this selection. In displaying a particular data set, the linearized range must be adequate to cover broad population distributions that do not display well on log scales. This, in particular, is the selection that is critical in matching displays to particular data sets and in ensuring that the linearized zone covers the range of statistical spread in the data. If the transition toward log behavior occurs at data values that are too low, the artifacts seen in logarithmic displays will not be suppressed.
The fourth choice is to specify the range of negative values to be included in the display (which also defines the position of the data zero in the plot). This range must be great enough to avoid truncating populations of interest. In practice, as shown in Figure 2, we find that it is desirable to link the third and fourth choices as a single value. This assures that the lowest negative data values in view correspond to the approximate edge of the linearized zone. As discussed earlier under Statistical Uncertainties, negative values should occur only as a result of statistical spreading, and, therefore, they should be displayed within the near-linear zone.
Assuming that the top-of-scale value and the nominal “decade” width of the display have been selected, linking the third and fourth choices results in a family of functions with only one parameter to be adjusted to match the particular data set being displayed.
Using natural log units, an expression for the Logicle scaling function that embodies all of the constraints and choices described earlier is given as follows:

S(x; w) = T e^(−(m − w)) (e^(x − w) − p^2 e^(−(x − w)/p) + p^2 − 1)    (3)
In Eq. (3), T is the top of scale data value (e.g., 10,000 for common 4 decade data or 262,144 for an 18 bit data range).
w = 2p ln(p)/(p + 1) is the width of the negative data range and the range of linearized data in natural log units. p is introduced for compactness in presenting the Logicle function, but p and w together represent a single adjustable parameter.
m is the breadth of the display in natural log units. For a 4.5 decade display range, m = 4.5 ln(10) ≈ 10.36.
The display is defined for x in the range from 0 to m. Negative data values appear in the space from x = 0 to x = w, and positive data values are plotted between x = w and x = m (where the top data value T occurs). The form shown as Eq. (3) is for the positive data zone, where x ≥ w. For the negative zone where x < w, we enforce symmetry by computing the Logicle function for the corresponding positive value (w − x) and changing the sign. The data zero at x = w is where the second derivative is zero, i.e., the most linear area.
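The construction just described can be sketched in code. The following Python sketch (an illustration only; the names p_from_w and logicle are ours, not from any reference implementation) solves w = 2p ln(p)/(p + 1) for p by bisection, evaluates Eq. (3) in the positive zone, and applies the mirror rule for the negative zone:

```python
import math

def p_from_w(w):
    """Invert w = 2 p ln(p)/(p + 1) for p >= 1 by bisection."""
    if w <= 0:
        return 1.0
    lo, hi = 1.0, 2.0
    # grow the upper end of the bracket until it spans the root
    while 2 * hi * math.log(hi) / (hi + 1) < w:
        hi *= 2
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 2 * mid * math.log(mid) / (mid + 1) < w:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def logicle(x, T=262144.0, m=4.5 * math.log(10), w=1.0 * math.log(10)):
    """Map display position x in [0, m] to a data value; data zero is at x = w."""
    p = p_from_w(w)

    def pos(y):
        # Eq. (3), valid in the positive zone y >= w
        return T * math.exp(-(m - w)) * (
            math.exp(y - w) - p * p * math.exp(-(y - w) / p) + p * p - 1.0
        )

    # Negative zone: mirror about x = w, i.e., evaluate at 2w - x and negate,
    # which replaces (x - w) by (w - x) in Eq. (3).
    return pos(x) if x >= w else -pos(2.0 * w - x)
```

With these defaults (W = 1 on a 4.5-decade scale), logicle(w) is exactly zero, negative display positions mirror the positive ones, and logicle(m) comes out close to T.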
To select an appropriate value for w to generate a good display for a particular data set, we obtain a reference value marking the low end of the distribution to be displayed. As described later, we typically select the data value at the fifth percentile of all events that are below zero as this reference value. Designating this (negative) value as “r,” and using its absolute value abs(r), w is computed as follows:

w = (m − ln(T / abs(r))) / 2    (4)
Equations (3) and (4) can be rewritten using base 10 representation in order to express the parameters in terms of “decades” of signal level or display:

S(X; W) = T · 10^(−(M − W)) (10^(X − W) − p^2 · 10^(−(X − W)/p) + p^2 − 1)    (5)
In Eq. (5), W = 2p log(p)/(p + 1) is the width of the negative data range and the range of linearized data in “decades,” and M is the breadth of the display in “decades.” For a 4.5 decade display range, M = 4.5.
We obtain W from the negative range reference value “r” as follows:

W = (M − log(T / abs(r))) / 2    (6)
Figure 2 illustrates the relationship between these parameters and the resulting Logicle display.
Specifying a logarithmic display requires two values corresponding to T and M, and the scaling near the upper end of a Logicle plot approximates that of a logarithmic display with the same values of T and M. The additional linearization width, W, adapts the Logicle scale to the characteristics of different data sets.
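Rendering data on any of these scales requires the inverse mapping, from data value to display position. Since each scaling function is strictly increasing, a generic bisection suffices; the sketch below (illustrative only) inverts an arbitrary monotone function, shown here with sinh standing in for a Logicle S:

```python
import math

def invert_monotone(S, value, lo, hi, iters=200):
    """Find x in [lo, hi] with S(x) == value, for strictly increasing S, by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if S(mid) < value:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: recover the display position of a known data value on a sinh scale.
x = invert_monotone(math.sinh, math.sinh(2.0), 0.0, 5.0)
```

The same call works unchanged with a Logicle scaling function in place of math.sinh, with [0, m] as the bracketing interval.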
Logicle functions with different values of W are plotted in Figure 3 along with the linear and exponential functions that match them around data zero and at high data values, respectively. Use of an exponential function for scaling is what results in a logarithmic scale. Note that each Logicle curve closely follows its matched linear function at low signal values, confirming good linearity in the region around data zero. At middle signal values that vary depending on the value of W, the Logicle functions depart from linearity and move smoothly toward the exponential line. At high signal levels, the Logicle curves become indistinguishable from the exponential line.
Figure 3. Logicle, linear, and exponential scaling functions. The Logicle functions are plotted for W = 0, W = 0.5, W = 1.0, and W = 1.5. The display range covers 4.5 “decades,” and the signal level scale is logarithmic, so only the positive data values can be represented. The black diagonal line in each panel is a pure exponential, i.e., the scaling function for a standard logarithmic display. The green broken lines are pure linear functions with zero crossings and slopes matched to the corresponding Logicle curves (red).
Figure 4 shows a Logicle curve for W = 1.0 and its matched linear and exponential curves displayed with a linear signal level scale. The signal level scale is expanded (top of scale is 300 rather than 10,000) to show in detail the matching of the Logicle and linear curves at low signal levels and the divergence of the Logicle curve at higher levels and the beginning of its approach to the exponential curve.
Figure 4. Logicle, linear, and exponential scaling functions. The same Logicle function, shown in the W = 1.0 panel of Figure 3, is presented with a linear signal level scale. To visualize the relationships between the different functions clearly, the signal level is shown only from −100 to +300, and the display range is shown only from 0 to 3.
Strategy for Selecting the Width Parameter
As we have discussed, proper estimates of dye signals using measurements on individual cells may be negative, but actual negative dye amounts are impossible. Therefore, any negative values present in the compensated data must be due to purely statistical effects. This is true despite the presence of essentially arbitrary positive staining distributions. Thus, for a population with near zero mean and significant statistical spread, the most negative values indicate the necessary range of the negative part of the scale, and they also indicate the range of linearization needed to ensure that the population will be displayed in a compact and unimodal form. The positive part of the population is less helpful, since it may overlap with other populations in the data set and may not provide a clear upper end with which to define a suitable range for linearization.
A simple strategy of choosing the fifth percentile of the negative data values to set this scale seems to work well and combines adequate sensitivity to extreme values with reasonable sampling stability. Using this strategy (the one currently implemented in FlowJo and illustrated in Figure 2 based on the leftmost of the four data distributions), the visible negative data range extends somewhat below the fifth percentile of negatives reference data value, so that almost all the negative data (out to roughly 1.5 times the negative reference data value) is actually seen in the plot.
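The selection rule just described can be sketched as follows. This is an illustration only: the conversion from the reference value r to the width W assumes, consistent with linking the negative range to the linearization width, that r is placed at the bottom of the scale, i.e., W = (M − log10(T/abs(r))) / 2 in “decades.”

```python
import math

def logicle_width(values, T=262144.0, M=4.5):
    """Choose the Logicle width W from the 5th percentile of negative events (sketch)."""
    negatives = sorted(v for v in values if v < 0)
    if not negatives:
        return 0.0  # no negative events: fall back to a minimal width
    # ~5th percentile from the most negative end of the sorted negatives
    r = negatives[int(0.05 * (len(negatives) - 1))]
    # Assumed placement rule: reference value r lands at the bottom of scale.
    return max(0.0, (M - math.log10(T / abs(r))) / 2.0)
```

Applied to a data set whose negative tail reaches about −100 on a 262,144-unit scale, this yields a width of roughly half a decade, which grows as the negative spread grows.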
In cases where no negative data values occur or the negative values are all close to zero, our experience indicates that a minimal Logicle scale sufficient to linearize data in the range of cell autofluorescence provides a more readily interpreted view of the data than does a purely logarithmic scale.
In some data sets, there are few negative data values, but some aberrant events yielding extreme negative values also occur. In such cases, the fifth percentile of negatives may lead to a value of W that is too high for optimal display of the main data set. Gating out the unrepresentative negative data points and reapplying the automatic scale selection to the gated data cures this problem.
To achieve consistency in data display when analyzing experiments that include a number of samples to be compared, it is appropriate to fix the Logicle scale (for each dimension) based on the most extreme sample present (usually one with the maximum number of labels in use) and use these fixed scales to analyze all similarly stained samples in the experiment. The current implementation in FlowJo bases the scale selection on a single user-specified (gated) data set. A simple and probably desirable variant of this method which has not yet been implemented in user software would operate on a group of data sets designated to be analyzed together. The Logicle width parameter would be evaluated for each dimension in each data set, and the largest resulting width in each dimension would be selected for the common displays. In general, when there are multiple populations in a single sample or multiple samples to be viewed on the same display scale, the population or sample with the greatest negative extent should drive the selection of W.
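The not-yet-implemented multi-sample variant reduces to a per-dimension maximum over the per-sample widths. A minimal sketch, assuming the widths have already been evaluated and are held in a nested list indexed as widths[sample][dimension]:

```python
def common_widths(per_sample_widths):
    """Given widths[sample][dimension], return the per-dimension maximum,
    so the sample with the greatest negative extent drives each scale."""
    return [max(ws) for ws in zip(*per_sample_widths)]
```

For example, with two samples measured in two dimensions, the common display scales take the larger width found in each dimension.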
The method we have chosen for defining the negative end of the display scale in relation to the linearization width makes it possible to evaluate the appropriateness of a particular scaling for a specific data set, by examining the negative data region. If a substantial fraction of the negative data values pile up at the low end of the scale, the value of W is too low to properly display this data, and a higher value of W should be used. If there is a lot of empty negative data space below the lowest population of interest, the linearized region around zero is more compressed than necessary. The population will be properly compact and unimodal, but it would be advantageous to lower W and obtain a more expanded view.
The Effective Dynamic Range of a Logicle Display
We can give a precise expression for the range of variation in scale across a Logicle plot in a form analogous to the “dynamic range” of a logarithmic plot. An ordinary logarithmic scale is often characterized by the number of “decades”, i.e., by the common logarithm of the ratio of the maximum to the minimum data values. Clearly, with Logicle scales that extend through zero, such a formula cannot work. However, if we consider the variation in the number of data units corresponding to a given width on the display, we get a relevant and useful ratio corresponding to the range of expansion or compression of the data across the plot. Mathematically, this is the ratio of the highest and lowest values of the slope or derivative of the scale function within the plot. For an ordinary logarithmic scale, this method yields exactly the same results as the usual procedure, i.e., the common logarithm of this ratio of slopes is the same as the number of decades, as defined earlier. For a Logicle scale, the ratio of maximum to minimum derivatives (at the top of scale and data zero, respectively) varies as a function of the linearization width W.
Working from the expression in Eq. (3), the derivative is given as follows:

S′(x; w) = T e^(−(m − w)) (e^(x − w) + p e^(−(x − w)/p))    (7)
The effective dynamic range discussed earlier is based on S′(m; w)/S′(w; w), i.e., the ratio of derivatives at x = m and x = w; expressed in decades, it is the common logarithm of this ratio.
For the Logicle curves illustrated in Figure 3 with M = 4.5 decades, the effective dynamic ranges are 4.2, 3.5, 2.8, and 2.1 decades for width values W = 0.0, 0.5, 1.0, and 1.5, respectively. (The dynamic range of the logarithmic plot with comparable scaling in the upper range would be 4.5 decades.)
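These figures can be checked numerically. The sketch below (illustrative; function names are ours) recomputes the derivative ratio using the Logicle derivative up to its constant prefactor, e^(x − w) + p e^(−(x − w)/p), since the prefactor cancels in the ratio:

```python
import math

def p_from_w(w):
    """Invert w = 2 p ln(p)/(p + 1) for p >= 1 by bisection."""
    if w <= 0:
        return 1.0
    lo, hi = 1.0, 2.0
    while 2 * hi * math.log(hi) / (hi + 1) < w:
        hi *= 2
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 2 * mid * math.log(mid) / (mid + 1) < w:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def effective_dynamic_range(W, M=4.5):
    """Decades of scale variation across the plot: log10 of S'(m)/S'(w).

    The common prefactor T e^(-(m-w)) cancels in the derivative ratio."""
    m, w = M * math.log(10), W * math.log(10)
    p = p_from_w(w)
    ratio = (math.exp(m - w) + p * math.exp(-(m - w) / p)) / (1.0 + p)
    return math.log10(ratio)
```

For M = 4.5, this reproduces the values quoted above: approximately 4.2, 3.5, 2.8, and 2.1 decades for W = 0, 0.5, 1.0, and 1.5.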
Illustrations and Interpretation of Logicle Displays
Figure 5 shows a comparison of logarithmic and Logicle displays of four signal level distributions that have different means but the same real width of about 2,000 signal level units. Note that the two higher level curves look essentially the same in the two displays, since they occur at signal levels where the Logicle scale is nearly logarithmic. However, the lowest curve is shown very differently in the two graphs. In the Logicle plot, the mean data value occurs at the visual center of the peak and very few data events (less than 1%) fall at the low edge of the scale. In contrast, the logarithmic display for this data set fails to convey an accurate view of the data, in that the mean of the data appears in a highly counter-intuitive location far from the apparent peak of the plot. Also, of course, 49% of very low and negative data values are piled up in an uninterpretable spike at the left edge of the display. This kind of behavior constitutes what may be referred to as a “log artifact” or, more colorfully, the “valley of death.” The second curve from the bottom is intermediate in that it is well represented in the Logicle display, but shows a moderate amount of “log artifact” in the logarithmic display.
Figure 5. Logarithmic and Logicle presentations of four data distributions whose means are different but whose real widths are the same. The distributions were generated by applying different “compensation” amounts to a distribution for single test particles. The highest curve at about 30,000 units represents zero compensation.
Figure 6 illustrates the value of Logicle displays for intuitive and accurate interpretation of fluorescence compensated data and their particular value in the analysis of data acquired in high resolution linear data systems. A mixture of unlabeled microspheres and antibody capture microspheres (BD Biosciences) loaded with FITC antibody was analyzed on a FACSAria cytometer (BD Biosciences), which produces floating point data with values up to 2^18 (262,144) and may include (background subtracted) data values below zero. In the upper panels, uncompensated data is shown in Logicle, 4-decade log, 5.5-decade log pseudocolor dot, and 5.5-decade log contour displays. Computed compensation based partly on this sample itself leads to the matching compensated data set shown in the lower panels. In the uncompensated Logicle display, the population of unlabeled particles forms a compact two-dimensional peak centered near zero and includes some negative values for events whose measured signal was below the average background. The 4-decade log display piles up all data values below 26.2 (= 262,144/10^4) at that level, and the 5.5-decade displays pile up zero and negative data values at 1. The arrow in the 5.5-decade log pseudocolor dot display points out the distracting but otherwise harmless “picket fencing” in the low region, where display pixels are denser than actual data values.
Figure 6. Comparison of Logicle, 4-decade log, and 5.5-decade log displays for uncompensated and compensated versions of a single stain compensation control sample. The sample consists of a mixture of unlabeled microspheres and reagent capture microspheres (BD Biosciences) loaded with FITC antibody. The arrow in the third upper panel points out “picket fencing” in the 5.5-decade log display. The 80% and 64% “on baselines” designations include events on the lower part of either the horizontal or the vertical axis.
In Logicle displays of compensation control samples, it is easy to confirm that compensation is correct. The compensated Logicle display (lower left) shows clearly that, as expected for an FITC compensation control, the centers of the distributions for unlabeled particles and FITC-labeled particles match at a value near zero in the <PE-A> dimension. It is obvious that the FITC high population has greater spread in the <PE-A> dimension (as would be expected from the discussion earlier under Statistical Uncertainties in Dye Estimates) and that the threshold amount of real PE needed for identification of PE positive events would be greater on the FITC high population than on the FITC negative population. In the logarithmic displays of the compensated data, the apparent center of the FITC labeled population looks higher in the <PE-A> dimension than does the center of the unlabeled population. This is another manifestation of the “log artifact.” In fact, the PE dimension medians of the two populations are equal. Adjusting compensation by eye using logarithmic displays is unlikely to lead to correct compensation.
An appropriate selection of the Logicle width parameter assures that almost all data events will be displayed on scale. Less than 1% of the events in the compensated Logicle display in Figure 6 fall on the baselines. In the logarithmic displays, 45–80% of the events in the two populations fall on the baselines, where their frequencies and actual measurement values cannot be interpreted visually. The events piled up on the low margins of the logarithmic color dot plots are almost invisible, while the pileup contours on the margins of the logarithmic contour plot make it look as though there may be separate populations there.