Graphical considerations for presenting data


There is more than one way to make a plot. When presenting data in a scientific report or journal, the objective should be to use plots that make interpretation easy, without distortion. “Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency. (It conveys to) the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” (Tufte 2001). The ideas in this article are based on general principles found in Tufte (2001) and Cleveland (1993). The figures were included in my poster presented at the Society of Environmental Toxicology and Chemistry North America (SETAC NA) 2012 Annual Meeting in Long Beach, California (poster MP023, copy available on request).

What is the best way to summarize data results from scientific studies? The choice of plot will depend on the type and complexity of the data and the nature of the summary or relationship to be portrayed. A plot that works for one data set may be a poor choice for another data set. The overall goals should be to “show the data, avoid distortion, encourage comparisons and explain statistical conclusions” (Tufte 2001). Scientists spend much time and money on research but often spend too little time considering the best way to convey the findings or conclusions of their efforts. Many resort to using the same type of graphic for all presentations, an easy tool to apply to all situations. At SETAC meetings, I see many presentations where all results are presented in bar plots when another plot style could be more informative. Before choosing bar plots out of habit, consider the following comments and examples.

Bar plots present simple results clearly, but multiple displays or complicated displays are often needed for data that include multiple experimental settings or categories. Bar plots are not generally useful for large amounts of structured information. Figure 1A (top panel) illustrates the point with data from 3 categories. It summarizes “Cigarette smoking among adults 18 years and over, by family income, sex, and race in the United States in 1995.” The bar plot on the left is taken from Figure 36 of Pamuk et al. (1998). It is difficult to discern differences between men and women and patterns associated with family income using the bar plot. The line plot on the right provides a better summary of the data. It shows clearly that black and white males have the highest smoking rates, whereas Hispanic males and females have the lowest rates. It is also clear that the change in smoking frequency with income is highest among black and white males.

Figure 1.

Examples of graphical presentations. (A) Smoking status by 3 categories. (B) Group comparison. (C) US population growth.

The plots in Figure 1B (center panel) summarize results from 2 groups (data constructed for illustration). The bar plots on the left display the group means clearly; however, it is questionable whether a figure with just 2 bars contains enough information to justify a plot. The box plots beside the bar plots provide more summary information, including the median, quartiles, and minimum and maximum values. This display overlays the individual data points (x values) on the box plot summary, giving an indication of group sample sizes. Box plots are a good presentation tool for statistical 2-sample comparisons, such as comparing an exposed site to a reference site. The mean and confidence interval (CI) plot on the right is more informative than box plots if the intent of the research is to compare the means of treatment groups. The dots (group means) carry the same information as the bars, and the 95% CI estimates encourage a probability-based statistical comparison of the treatment means. Nonoverlapping CIs indicate that the group means are statistically significantly different, although overlapping CIs do not by themselves rule out a significant difference. Simply adding an extension to the top of each bar to depict variability, either the standard deviation of the group or the CI on the mean, is not as effective. The information conveyed by the extension, and with it the ease of comparing group variability, is masked or diminished by the dominant bars.
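The numbers behind a mean and CI plot such as the one described for Figure 1B can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two samples are made up for the example (they are not the data behind Figure 1B), the function names are hypothetical, and the interval is the usual t-based 95% CI for a single mean, with a normal approximation used when the sample is larger than the small lookup table covers.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for 1 to 10 degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_ci95(values):
    """Return (mean, lower, upper) for a t-based 95% CI on the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    t_crit = T_CRIT_95.get(n - 1)
    if t_crit is None:
        # Assumption: beyond the table, fall back to the normal quantile,
        # which is only an approximation to the t critical value.
        t_crit = statistics.NormalDist().inv_cdf(0.975)
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


def intervals_overlap(ci_a, ci_b):
    """True if two (mean, lower, upper) intervals share any values."""
    return ci_a[1] <= ci_b[2] and ci_b[1] <= ci_a[2]


# Hypothetical measurements, invented for this sketch only.
reference = [4.1, 5.0, 4.6, 5.3, 4.8, 4.4]
exposed = [6.2, 7.1, 6.6, 5.9, 7.4, 6.8]

ref_ci = mean_ci95(reference)
exp_ci = mean_ci95(exposed)

for label, (mean, low, high) in (("reference", ref_ci), ("exposed", exp_ci)):
    print(f"{label}: mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
print("CIs overlap:", intervals_overlap(ref_ci, exp_ci))
```

Plotting each mean as a dot with its interval as a vertical line reproduces the style of display discussed above; the overlap check corresponds to the visual comparison a reader makes.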

Graphical presentation of temporal data is challenging because there is frequently both cross-sectional and longitudinal information. Figure 1C (bottom panel) uses box plots above and line plots below to summarize the population of the United States from 1800 to 2000 in the 9 Census Bureau regions (US Census Bureau 2011). The box plots give a good idea of the population growth over those 200 years and of the increasing variation in regional populations. However, used alone, box plots are not ideal for this application because they essentially describe only cross-sectional variability. The line plots identify the regions and link them across time. Now it is possible to tell where the largest population growth has taken place in the United States over the 200 years. The longitudinal analysis indicates the explosive growth of the Pacific and South Atlantic regions (the southern, ocean-bordering states). The fan-shaped pattern of lines across time provides almost the same cross-sectional information as the box plots and, in addition, shows the relative ranking of the regions from one 20-year period to another. Color, rather than black and white, would make some lines easier to follow if that level of detail were needed. The line plot could instead have been constructed with a distinct symbol for each region and a legend in the margin. That version would have been informative but harder to read, because the viewer must glance back and forth between the plot and the legend.

When there is more than one source of variation it is important to identify those sources. The challenge is to display multiple sources of variability. In Figure 1C, the variability in any 1 year is the cross-sectional variability or variability between regions. Both the box plots and line plots in Figure 1C provide good measures of that variability. The slope of a line for a region is a measure of within-region variability. The line plots display the longitudinal or within-region variability. Bar plots would be a poor choice to present either type of variation.

Choosing the best way to display data can be a challenge. Many excellent reference books and software manuals are available describing alternatives. Keep in mind that bar plots are popular but rarely the most appropriate way to present data. Bar plots use a great deal of ink, yet the only purpose is to indicate height. The arrangement of bars frequently ignores or complicates the understanding of the underlying data structure from multiple categories. Always try to find a substitute for a bar plot.