Pollen is among the most ubiquitous of terrestrial fossils, preserving an extended record of vegetation change. However, this temporal continuity comes with a taxonomic tradeoff. Analytical methods that improve the taxonomic precision of pollen identifications would expand the research questions that could be addressed by pollen, in fields such as paleoecology, paleoclimatology, biostratigraphy, melissopalynology, and forensics.
We developed a supervised, layered, instance-based machine-learning classification system that uses leave-one-out bias optimization and discriminates among small variations in pollen shape, size, and texture. We tested our system on black and white spruce, two paleoclimatically significant taxa in the North American Quaternary.
We achieved > 93% grain-to-grain classification accuracies in a series of experiments with both fossil and reference material. More significantly, when applied to Quaternary samples, the learning system was able to replicate the count proportions of a human expert (R² = 0.78, P = 0.007), with one key difference – the machine achieved these ratios by including larger numbers of grains with low-confidence identifications.
Our results demonstrate the capability of machine-learning systems to solve the most challenging palynological classification problem, the discrimination of congeneric species, extending the capabilities of the pollen analyst and improving the taxonomic resolution of the palynological record.
Fossil pollen and spores form an extraordinarily rich record that has been used to test hypotheses across a broad cross-section of the biological and geological sciences: from paleoecological and paleoclimatological investigations across hundreds to millions of years; to the identification of plant speciation and extinction events; to the correlation and biostratigraphic dating of rock sequences; to studies of long-term anthropogenic impacts on plant communities; and to the study of plant–pollinator relationships. The consistency of identifications, from sample to sample and from analyst to analyst, is the foundation of all palynological research. This emphasis on repeatability means that identifications are by necessity restricted to taxonomic categories that can be reliably classified by multiple analysts – most often the genus. However, palynologists have long recognized that species identifications are critical to paleoclimatic and paleoecological interpretations, as generic classifications potentially mask radically different environmental adaptations (Erdtman, 1931; Cain, 1948; Birks & Birks, 2000). This tension between the consistency of classification and taxonomic resolution is a fundamental and unresolved problem in palynology.
A classic paleoecological example of this tension is the morphological discrimination of black and white spruce pollen (Picea mariana and Picea glauca, respectively). The problem of species identification in spruce has been a recurring issue in Late Quaternary pollen analysis since the beginnings of North American palynology (Erdtman, 1931; Wilson, 1938; Wilson & Kosanke, 1940; Wilson & Webster, 1942; Cain, 1948). For over eight decades, palynologists have sought to differentiate these species because of their different autecological characteristics; white spruce tends to occupy well-drained upland soils, while black spruce often populates poorly drained lowlands and peatlands (Brubaker et al., 1987). Consequently, the changing abundance of black and white spruce has paleoclimatic significance. The frequency of these two species in North American sediment cores, in addition to their substantial ecological differences, makes Picea an important taxon in Late Quaternary paleoclimatic reconstructions and the assessment of regional and global climate models (Bartlein et al., 1998; Jackson et al., 2000).
One promising solution to the challenges of species identification is the application of image analysis and machine learning to pollen analysis. Computer-based analyses of morphological difference permit palynologists to move beyond the qualitative visual identifications that dominate pollen and spore identifications and allow the consistency and confidence of classifications to be measured. However, previous research applying machine learning to pollen identification (Langford et al., 1990; Li & Flenley, 1999; France et al., 2000; Ronneberger et al., 2002; Li et al., 2004; Treloar et al., 2004; Zhang et al., 2004; Chen et al., 2006; Dell'Anna et al., 2009; Landsmeer et al., 2009; Holt et al., 2011) has largely avoided the hard machine-learning problem of species-level classification, the most challenging classification problem in palynology (one exception is Rodriguez-Damian et al. (2006), who focused on the morphological discrimination of three Urticaceae species). Consequently, no previous study has focused on the discrimination of paleoecologically and paleoclimatically significant taxa, where the identification of species is significant to the reconstruction of paleoenvironments.
We extend the known capabilities of machine-based classifications by developing a machine-learning system capable of discriminating between the morphologically similar pollen of black and white spruce. By extending the range of machine-based identifications to the problem of spruce classification, our results establish that automated learning systems can be used not only to automate the identification of morphologically distinct taxa, as has been shown by previous studies, but also to solve paleoecologically critical classification problems for which there is limited expertise. Our experiments include both modern reference and fossil material, demonstrating that the learning system we have designed is not limited to pristine samples and can work with material that has been altered through taphonomic processes, such as compression or tearing.
Materials and Methods
A layered learning system
Our machine-classification experiments were completed using a nearest-neighbor instance-based supervised layered learning system based on kernel density estimation (named ARLO, Automated Recognition and Layered Optimization). ARLO uses bias optimization to find the most effective combinations of experimental parameters (Tcheng et al., 1989, 1991). At the highest level, bias optimization uses an optimizer to maximize the performance of a learning system by manipulating control parameters of the learning system – the bias space (Fig. 1). Bias space is a formal, parameterized space representing all decisions that determine model performance. Bias space includes, but is not limited to, sample preparation, choice of imaging technology, image resolution, choice of image features, image weighting metrics, choice of training examples, the use of a single learning algorithm or a set of algorithms, and how multiple algorithms are combined. A given point in bias space defines all the variables that control the learning process, specifying both the example representation and the learning algorithm to use.
The optimizer searches bias space by trying a series of candidate points, keeping those that maximize the performance metrics. In this study, we had two distinct performance metrics of interest: grain-to-grain classification accuracy (for both our modern and fossil pollen experiments) and whole slide pollen ratio accuracy (for our fossil pollen experiments). We used a stochastic hill-climbing optimizer to find points in bias space resulting in the best performance. Automating the search in bias space makes the system robust and adaptable to different image recognition problems. Bias search automation also removes reliance on a human experimenter to operate the classification system.
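The search described above can be illustrated with a minimal sketch of stochastic hill climbing. Everything here is an assumption made for illustration: the three-parameter `BIAS_SPACE` is a drastically reduced stand-in for the real bias space, `toy_score` replaces a cross-validated performance metric, and the acceptance rule (keep any candidate that does not lower the score) is one common variant rather than a description of ARLO's optimizer.

```python
import random

# Hypothetical, much-reduced bias space: each control parameter has a short
# list of allowed values (the space described in the text is far larger).
BIAS_SPACE = {
    "quantile_bins": list(range(2, 41)),        # intensity-histogram resolution
    "projection_size": list(range(1, 12)),      # gross-shape resolution (n x n)
    "texture_line_length": list(range(1, 14)),  # texture line length in pixels
}


def toy_score(bias):
    """Stand-in for a cross-validated accuracy; peaks at an arbitrary point."""
    target = {"quantile_bins": 12, "projection_size": 7, "texture_line_length": 5}
    distance = sum(abs(bias[name] - target[name]) for name in bias)
    return 1.0 / (1.0 + distance)


def hill_climb(score, space, steps=2000, seed=0):
    """Stochastic hill climbing: mutate one parameter, keep non-worsening moves."""
    rng = random.Random(seed)
    current = {name: rng.choice(values) for name, values in space.items()}
    current_score = score(current)
    for _ in range(steps):
        candidate = dict(current)
        name = rng.choice(sorted(space))          # pick one parameter to change
        candidate[name] = rng.choice(space[name])  # try a random allowed value
        candidate_score = score(candidate)
        if candidate_score >= current_score:
            current, current_score = candidate, candidate_score
    return current, current_score


best_bias, best_score = hill_climb(toy_score, BIAS_SPACE)
print(best_bias, round(best_score, 3))
```

In a real experiment the scoring function would train and test the classifier at each candidate point, which is why the search dominates the computational cost.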
Data sources
We tested our system against two separate data sources: modern reference material derived from vouchered herbarium specimens, where the taxonomic identity of the material was known (Supporting Information, Table S1), and fossil material from Nelson Lake, Illinois, where taxonomic classifications were based on our expert identifications (Table S2). For the modern pollen analysis, in addition to samples of black and white spruce, P. mariana (Mill.) Britton, Sterns & Poggenb. and P. glauca (Moench) Voss, we included two outgroup genera (Abies and Pinus) that share a similar morphology, as well as a third North American spruce species, P. rubens. For the fossil pollen experiments, we focused our efforts on discriminating the two species that were most abundant in our Nelson Lake samples: black and white spruce. Both sets of pollen material had been prepared following standard protocols (Faegri et al., 1989), with silicone or balsam oil as the mounting medium.
Modern reference pollen samples were from reference collections at the University of Illinois and the Illinois State Museum, isolated from University of Minnesota herbarium specimens (Table S1). Up to 100 grains were imaged from each slide, which represented a single individual tree. An uneven number of representatives were available for each modern class, with an especially limited amount of material available for P. rubens. Entire grains, with minimal damage, were chosen for imaging in this analysis of modern material, following the procedure described in the Pollen imaging section.
Fossil spruce examples were from duplicate slides of sediment residues from a published study on Nelson Lake, Illinois (Curry et al., 2007) (Table S2). Ten samples were analyzed, from depths corresponding to high black spruce concentrations, high white spruce concentrations, and roughly equal proportions of the two species. All samples were from fine clay deposits. The preservational quality of these samples was comparable, although more grain damage was observed in the deeper material.
Grains were chosen for imaging for the fossil analysis using a semirandomized method. Student researchers scanned the slides following parallel transects and electronically marked the XYZ location of all saccate grains. Approximately 100 of all the marked grains were then randomly chosen and imaged following the procedure described in the Pollen imaging section. As a result, damaged grains were included, if they were recognizable as a saccate grain by a non-expert. This includes grains that were mechanically or physically changed through tearing, corrosion, or folding.
Each imaged fossil grain was also manually classified by a pollen expert as one of three classes: black spruce, white spruce, or other saccate grain (Pinus/Abies) (Table S2); 896 of the 1014 imaged grains were identified as spruce. Identifications were made using the original sample slides, with images as reference to verify that the same grain was being observed. We used a number of morphological features to determine classification manually: grain size, width of saccus at point of attachment, saccus height, angle of saccus attachment, degree of constriction of saccus at point of attachment, saccus shape, endoreticulate pattern of sacci, and relative size of sacci to corpus (Fig. 2; Birks & Peglar, 1980; Hansen & Engstrom, 1985; Lindbladh et al., 2002).
We included a qualitative assessment of expert confidence with each classification in order to record the difficulty of each classification and the certainty of the expert identification. For black and white spruce, these included: 50% (recognized as spruce, but species uncertain), 60% (few key features representative of the species), 70% (several key features representative of the species), 80% (most features representative of the species), 90% (almost all features representative of the species), ≥ 95% (all features representative of the species). These numbers capture the self-reported confidence of the human expert for a given classification and so are, by necessity, approximate.
Pollen imaging
We used structured illumination (a Zeiss Apotome fluorescence microscope; Weigel et al., 2009) to produce high-resolution, three-dimensional images. Because of the relatively thin pollen wall of these saccate grains, structured illumination allowed us to capture grain shape and volume in addition to detailed surface images. Images were acquired following a standard manual protocol to minimize variation that would lead to imaging artifacts and potential misdirection of the machine results. Images were taken by multiple researchers, with no one researcher responsible for a single species. Pollen grains were photographed as image stacks using autofluorescence (563 nm excitation wavelength (green), 581 nm emission wavelength (red)), at 400× magnification (40× EC Plan Neofluor objective, NA 0.75). The shape and depth of the grain were captured as multiple z-focal planes at intervals of half the Nyquist frequency (0.69 μm for this objective; Fig. 3). A typical grain was represented by c. 50 focal slices. Each individual image pixel measured 0.0256 μm². Grains were cropped manually, using a bounding box that spanned the maximum width of the grain along the x-axis and the maximum length of the grain along the y-axis. The z-stack was limited to the planes between the uppermost and lowermost in-focus planes of the grain.
Example representation and classification
Each image within the z-stack was reduced to a vector of image features. For each new pollen classification problem, the bias optimizer determined the appropriate weights and resolutions for each feature (Fig. 1). These measurements do not directly relate to morphological characters that palynological experts would use, but do describe a large range of morphological variation, totaling > 16 000 dimensions of morphological space. The optimized image features can be categorized into three broad categories:
Intensity distribution. A representation of the probability distribution function of pixel intensity values with a variable number of equally populated bins (quantile values). We used probability distributions with an optimized resolution, from two to 40 quantile bins.
Gross shape. To make gross image comparisons, we compared low-resolution projections of our high-resolution images. These projections captured overall shape and the degree of image coarseness was optimized. We used resolutions as low as 1 × 1 pixels to resolutions as ‘high’ as 11 × 11 pixels.
Texture. We approximated texture as the change in sign of the first derivative of pixel intensities along a series of horizontal, vertical, or diagonal lines. The line length was optimized, and ranged from 1 to 13 pixels. The texture features were applied to the original high-resolution image as well as down-sampled versions of the image at varying scales.
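The three feature families above can be sketched in a few lines of code. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, images are plain lists of pixel rows, quantiles use a nearest-rank rule, the low-resolution projection is a block average, and the texture measure is computed on horizontal segments only. How the original system segments lines and handles diagonals is not specified in the text, so those choices are assumptions.

```python
def quantile_features(image, n_bins):
    """Intensity distribution: boundaries of n_bins equally populated bins."""
    pixels = sorted(value for row in image for value in row)
    last = len(pixels) - 1
    # n_bins bins have n_bins - 1 interior boundaries (nearest-rank quantiles).
    return [pixels[round(last * k / n_bins)] for k in range(1, n_bins)]


def low_res_projection(image, size):
    """Gross shape: block-average the image down to a size x size grid.

    Assumes the image is at least size pixels tall and wide.
    """
    height, width = len(image), len(image[0])
    grid = []
    for i in range(size):
        rows = image[i * height // size:(i + 1) * height // size]
        out_row = []
        for j in range(size):
            block = [v for row in rows
                     for v in row[j * width // size:(j + 1) * width // size]]
            out_row.append(sum(block) / len(block))
        grid.append(out_row)
    return grid


def sign_changes(values):
    """Count sign changes in the first differences of a 1-D intensity profile."""
    signs = [1 if b > a else -1 for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def horizontal_texture(image, line_length):
    """Mean sign-change count over horizontal segments of line_length pixels."""
    counts = []
    for row in image:
        for start in range(0, len(row) - line_length + 1, line_length):
            counts.append(sign_changes(row[start:start + line_length]))
    return sum(counts) / len(counts) if counts else 0.0


image = [
    [0, 2, 1, 3, 2, 4],
    [1, 1, 5, 1, 1, 5],
    [9, 8, 7, 6, 5, 4],
    [0, 0, 0, 9, 9, 9],
]
print(quantile_features(image, 4))
print(low_res_projection(image, 2))
print(horizontal_texture(image, 6))
```

Concatenating such measurements across resolutions, line lengths, and image scales is how a feature vector can grow to thousands of dimensions per slice.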
In addition to these optimized feature groups, we also used three fixed measurements in our classification: image area (the area of an image slice); image aspect (the height-to-width aspect ratio of an image slice); and image depth (the depth of a grain measured as the number of image slices). Grains were classified using a weighted contribution of each image slice. Weighting was by the confidence of image classification, raised to an optimized power. Confidence was measured as the number of nearest neighboring training examples that share the majority classification.
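The slice-weighting rule can be sketched as follows. The sketch assumes a plain k-nearest-neighbor vote with Euclidean distance on toy one-dimensional features; `knn_vote`, `classify_grain`, and the training values are hypothetical and stand in for the kernel-density-based system described earlier. Confidence is taken, as in the text, to be the number of nearest neighbors sharing the majority label, and each slice contributes that confidence raised to a tunable power.

```python
import math
from collections import Counter


def knn_vote(query, training, k):
    """Return (majority label, number of the k neighbours that share it)."""
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    labels = Counter(label for _, label in ranked[:k])
    return labels.most_common(1)[0]


def classify_grain(slices, training, k=3, power=2.0):
    """Combine slice-level votes, weighting each slice by confidence ** power."""
    totals = Counter()
    for features in slices:
        label, confidence = knn_vote(features, training, k)
        totals[label] += confidence ** power
    return totals.most_common(1)[0][0]


# Toy training set: one feature per slice (values are invented).
training = [
    ((1.0,), "black"), ((2.0,), "black"), ((3.0,), "black"), ((5.0,), "black"),
    ((4.0,), "white"), ((6.0,), "white"), ((7.0,), "white"), ((8.0,), "white"),
]
# One grain imaged as three slices: one unambiguous, two borderline.
grain_slices = [(1.5,), (4.6,), (6.2,)]

print(classify_grain(grain_slices, training, power=0.0))  # unweighted vote
print(classify_grain(grain_slices, training, power=2.0))  # confident slices dominate
```

With a power of zero every slice counts equally, so the two borderline slices outvote the single unambiguous one; raising the power lets the unambiguous slice dominate. This is why the exponent is a natural parameter to hand to the optimizer.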
Traditional classification criteria for spruce include a mix of manually assessed quantitative and qualitative characters (Table S3). Researchers consistently use five key features: size (white spruce generally having larger grains); angle of the sacci attachment to pollen body (white spruce being more acute); texture of the pollen body (white spruce being finer); the internal reticulate structure of the saccus (white spruce having larger, more circular lumina); and the saccus shape (white spruce having blunter sacci) (Fig. 2; Hansen & Engstrom, 1985; Lindbladh et al., 2002). The features we employed in our study coarsely capture these shape, size, and texture characters, but unlike current methods, can be used irrespective of grain orientation.
Learning system evaluation
System accuracy was measured by repeatedly dividing our image data into training and testing sets (Fig. 1). We formed our classification models on a set of training examples and evaluated performance on a separate, randomized set of testing examples. The learning algorithm used the training examples to form the prediction model and applied it to the test examples while measuring its performance in terms of accuracy, speed, and model size. We repeated the experiment with different random partitions and averaged the results. Without bias optimization (i.e. if we had only run a single iteration of the experiment), this accuracy measure would have been an unbiased predictor of future performance. Since we use bias optimization, we know our accuracy is likely optimistic, or overfit. Overfitting refers to the difference between the system's predicted accuracy and its measured accuracy when the system is applied to new data. The greater the optimization, the greater the overfit. We addressed the problem of overfitting in part by using large training samples and through experimental repetition and performance averaging.
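A minimal sketch of this evaluation loop is given below, together with the leave-one-out variant that the experiments report as full cross-validation. A 1-nearest-neighbor rule stands in for the full classifier, and the function names, the 25% test fraction, and the synthetic two-class data are assumptions made for illustration.

```python
import math
import random


def nearest_label(query, training):
    """1-nearest-neighbour prediction (a stand-in for the full classifier)."""
    return min(training, key=lambda item: math.dist(query, item[0]))[1]


def repeated_holdout_accuracy(examples, test_fraction=0.25, repeats=20, seed=0):
    """Average test accuracy over several random train/test partitions."""
    rng = random.Random(seed)
    n_test = max(1, int(len(examples) * test_fraction))
    accuracies = []
    for _ in range(repeats):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        hits = sum(nearest_label(x, train) == y for x, y in test)
        accuracies.append(hits / n_test)
    return sum(accuracies) / repeats


def leave_one_out_accuracy(examples):
    """n-fold cross-validation: each example is tested against all the others."""
    hits = 0
    for i, (x, y) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]
        hits += nearest_label(x, train) == y
    return hits / len(examples)


# Synthetic, well-separated classes; real grains overlap far more than this.
examples = ([((0.1 * i, 0.0), "black") for i in range(20)]
            + [((10.0 + 0.1 * i, 0.0), "white") for i in range(20)])

print(repeated_holdout_accuracy(examples))
print(leave_one_out_accuracy(examples))
```

If either function is called repeatedly inside an optimizer and only the best score is reported, that score inherits the optimistic bias discussed above.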
Experiments with modern pollen
Our first classification experiments were with modern reference material and involved the identification of three saccate pollen genera (spruce, Abies, and Pinus) – a three-class problem. The identification of these genera is considered consistent among palynologists, since there are clear differences in grain size, with Pinus on average smaller than spruce, and Abies on average larger. There are also differences in the relative size of the sacci to pollen body (as defined in Fig. 2), with Abies having a smaller ratio of saccus to corpus size, and Pinus having the largest. Fir was represented by one species (Abies balsamea (n = 96)), pine by four species (Pinus banksiana, Pinus strobus, Pinus resinosa, and Pinus rigida (n = 103)), and spruce by three species (P. mariana, P. glauca, and P. rubens) (n = 442; Table S1).
For this modern three-class problem, we achieved 95.2% classification accuracy based on a full 641-fold leave-one-out cross-validation. The features with the most weight in the optimized bias were texture-related (Table S4). All identified Abies grains were Abies, and 99% of grains identified as Pinus were Pinus. However, 6% of fir and pine were misclassified as spruce. This higher number of misclassifications was not unexpected; with 4.5 times more spruce grain examples, a given grain was 4.5 times more likely to be classified as spruce than any other taxon.
We next tested the five categories that would be of interest to a palynologist. These classes were the three spruce species, black spruce (n = 200), white spruce (n = 193), and P. rubens (n = 49), and our two outgroup genera, Abies and Pinus. Our final average grain accuracy was 93.3% with a full 641-fold cross-validation (Table S5). Variables with the highest weights were size (measured as image area), followed by texture. P. rubens was the most consistently identified of the five classes, with 100% of the grains accurately identified. However, this extremely high classification accuracy suggests that imaging artifacts, potentially resulting from low sample size, may have biased the results.
Smaller white spruce and larger Pinus grains were misclassified as black spruce, and larger black spruce and smaller Abies grains were misclassified as white spruce. This suggests that, despite overlapping ranges, there is also discriminatory information in size. Finally, the relative importance of texture in the final bias optimization is in agreement with previous work on texture-based machine classifications (Langford et al., 1990; Li & Flenley, 1999; Li et al., 2004; Treloar et al., 2004; Zhang et al., 2004) and suggests that texture, a feature that is often unchanged by moderate taphonomic damage such as compression and tearing, may be an important variable in species discrimination.
Experiments with fossil pollen
Our fossil experiments were with the Quaternary pollen samples from Nelson Lake, Illinois (Curry et al., 2007), corresponding to depths with high black spruce concentrations, high white spruce concentrations, and roughly equal proportions of the two species (Table S2). Ten samples were analyzed, with c. 100 saccate grains imaged and manually identified from each sample, as described in the Data sources section. Each identification included an assessment of self-reported expert confidence, ranging from 50% (where only the genus was certain) to ≥ 95% (where the species was certain).
We first ran a series of experiments to investigate the relationship between expert confidence and learning system performance. Predictably, when the system was trained and tested on high-quality examples (examples classified with high confidence by the expert), it performed better than when trained on low-quality examples (Fig. 4). There was a clear correlation between human and machine uncertainty, verifying that our human expert assessment of the difficulty of the identification problem meaningfully reflected the true problem difficulty. Poorly preserved, damaged, or ambiguous grains were always challenging – to both the expert and to the system. When spruce grains classified with ≥ 95% expert confidence (n = 264) were used as the basis for training and testing, our classification accuracy was 94.2% based on full cross-validation of all grains (Table S6).
We next conducted a ‘slide-level’ leave-one-out cross-validation experiment, where all the pollen grains of a given slide were used for either training or testing, but never both. Accuracy was expected to drop, since unknown grains were never compared with training examples from the same slide. Our final accuracy, however, remained stable at 93.8% (Table S7), demonstrating that the learning system is generalizable and can be applied to new slides drawn from sediment samples with similar taphonomic conditions with very high accuracy. The consistency in our reported accuracy is also our strongest evidence that our learning system is not victim to overly optimistic accuracies based on problem overfitting.
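The slide-level design can be sketched as a grouped cross-validation in which the slide, not the grain, is the unit held out. The sketch assumes a 1-nearest-neighbor stand-in classifier and invented one-dimensional features; `leave_one_slide_out` and the slide identifiers are hypothetical.

```python
import math


def nearest_label(query, training):
    """1-nearest-neighbour prediction (a stand-in for the full classifier)."""
    return min(training, key=lambda item: math.dist(query, item[0]))[1]


def leave_one_slide_out(grains):
    """grains: list of (slide_id, features, label). Returns accuracy per slide."""
    results = {}
    for held_out in sorted({slide for slide, _, _ in grains}):
        # Every grain from the held-out slide is excluded from training.
        train = [(f, y) for s, f, y in grains if s != held_out]
        test = [(f, y) for s, f, y in grains if s == held_out]
        hits = sum(nearest_label(f, train) == y for f, y in test)
        results[held_out] = hits / len(test)
    return results


# Invented data: slide C contains an atypically small white spruce grain.
grains = [
    ("A", (1.0,), "black"), ("A", (5.0,), "white"),
    ("B", (1.2,), "black"), ("B", (5.3,), "white"),
    ("C", (0.9,), "black"), ("C", (3.0,), "white"),
]
print(leave_one_slide_out(grains))
```

Because no training example shares a slide with a test grain, slide-specific preparation or preservation effects cannot inflate the score, which is what makes a stable accuracy under this design informative.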
Reconstructing fossil ratios
Although analysts are trained and tested using grain-to-grain measures of accuracy, consistency of slide-level ratios is the more significant goal. Pollen ratios, as either percentages or pollen densities, must be comparable if palynological data are to be used in any comparative or aggregative study. Our final fossil experiments tested whether we could train the machine classification system to reconstruct slide-level ratios of black and white spruce. We used the difference between human and machine counts, expressed as fractions, as our error metric for optimization and learning.
The final results were striking. We were able to replicate our expert proportions for black and white spruce (R² = 0.78, P = 0.007; Fig. 5) by optimizing the absolute standard error between machine and expert and allowing the system to ‘choose’ its training examples using bias optimization. Notably, the system chose to use all spruce examples that had been marked with a species classification with an expert confidence of 69% or more (with 50% the lowest possible assigned value). As a result, the learning system's grain-to-grain accuracy fell to 77.5% (Table S8). However, since more grains were included in the final counts by the machine than by the expert, the overall slide-level accuracy was maximized. The high correlation of slide-level ratios was achieved with lower grain-to-grain accuracy, demonstrating that the computer could arrive at similar proportions to human expert counts by using larger numbers of lower-quality data.
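The slide-level comparison can be sketched with two small helper functions: one that turns per-grain labels into a black spruce fraction, and one that computes the squared Pearson correlation between expert and machine fractions. The mean absolute difference used here is an assumed form of the error metric, the function names are hypothetical, and the fractions are invented for illustration; they are not the study's data.

```python
def black_fraction(labels):
    """Fraction of spruce grains on one slide labelled black spruce."""
    return sum(label == "black" for label in labels) / len(labels)


def mean_absolute_error(expert, machine):
    """Mean absolute difference between paired slide-level fractions."""
    return sum(abs(e - m) for e, m in zip(expert, machine)) / len(expert)


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Invented slide-level black spruce fractions for four hypothetical slides.
expert = [0.2, 0.4, 0.6, 0.8]
machine = [0.25, 0.35, 0.65, 0.75]

print(black_fraction(["black", "white", "black", "black"]))
print(mean_absolute_error(expert, machine))
print(r_squared(expert, machine))
```

Misclassifications in opposite directions cancel when grains are pooled into a fraction, which is how slide-level agreement can stay high while grain-to-grain accuracy falls.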
Morphological differences among congeneric species are often only differences in degree, and the qualitative vocabulary of palynology often cannot convey subtle differences in morphology observed by the analyst. Machine learning circumvents these limitations, as the quantitative measurements that a computer can make, and the differences a computer can record, are greater than those of the human expert (MacLeod et al., 2010). While the human analyst focuses on recognizing discrete characters, our learning models define a probabilistic feature space. It is the relative placement of an unknown grain in this n-dimensional space of morphological variation that is the basis of classification. These measured morphological values do not have to have any biological significance or homology; they need only to define the range of morphological variation observed.
Our machine-based classification system was capable of discriminating the morphologically similar pollen of black and white spruce, from both modern and fossil samples. Our approach is not specific to the features of spruce pollen, and so could potentially be adapted for use with other taxonomic groups. Quantitative machine-based approaches have the additional advantage of objectively quantifying the certainty of identifications, an asset to meta-analyses of palynological data. Since classifications are based on the similarity of an unknown grain to known examples, the strength of this classification is inherently a measure of system certainty. Automation of pollen identification also has uses that extend beyond the North American Quaternary. Besides application to the paleoecological analysis of other locales and other time periods, pollen identification plays a significant role in biostratigraphy, melissopalynology, aerobiology, air quality monitoring, and forensic science. All these fields stand to benefit from the improved ease and accuracy of pollen identification.
The next steps in this research are clear. Comparisons among human experts are needed to place system failure rates in the appropriate context. There is little documentation of consistency in expert identifications of morphologically similar pollen taxa, so the limits of supervised learning systems are still unknown. Additionally, implementing a two-tier cross-validation system that not only optimizes inductive bias but also quantifies the degree of overfitting caused by bias optimization would produce more realistic measurements of system accuracy and results that are less prone to overfitting.
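One way to picture the two-tier scheme is as nested cross-validation: an inner loop chooses the bias using training data only, and an outer loop scores that choice on grains the inner loop never saw. The sketch below is an assumption about how such a system might look, not a description of planned work. The bias is reduced to a single parameter (the neighbor count `k`), and the function names and synthetic data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, training, k):
    """Majority label among the k nearest training examples."""
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def loo_accuracy(examples, k):
    """Inner tier: leave-one-out accuracy for one candidate bias (here, k)."""
    hits = 0
    for i, (x, y) in enumerate(examples):
        hits += knn_predict(x, examples[:i] + examples[i + 1:], k) == y
    return hits / len(examples)


def two_tier_cv(examples, k_values, folds=4):
    """Outer tier: score the inner tier's chosen bias on untouched test folds."""
    inner_scores, outer_scores = [], []
    for fold in range(folds):
        test = examples[fold::folds]
        train = [e for i, e in enumerate(examples) if i % folds != fold]
        best_k = max(k_values, key=lambda k: loo_accuracy(train, k))
        inner_scores.append(loo_accuracy(train, best_k))
        hits = sum(knn_predict(x, train, best_k) == y for x, y in test)
        outer_scores.append(hits / len(test))
    inner = sum(inner_scores) / folds
    outer = sum(outer_scores) / folds
    return inner, outer, inner - outer  # the gap estimates the overfit


# Synthetic, well-separated classes, so the gap here is zero by construction.
examples = [((0.1 * i,), "black") if i % 2 == 0 else ((10.0 + 0.1 * i,), "white")
            for i in range(16)]
print(two_tier_cv(examples, k_values=[1, 3]))
```

On real, overlapping data the inner score would typically exceed the outer score, and the size of that gap is the quantity the text proposes to measure.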
Finally, our results demonstrate that machine-based classifications serve as a means of extending the capabilities of the pollen analyst, improving the taxonomic resolution of the palynological record and delivering the consistent and repeatable species identifications that have largely eluded palynologists for the last century. Machine-based pollen identifications will improve the quality of palynological analysis and, as a result, ultimately expand the climatological and ecological hypotheses that can be addressed by the palynological record. As advances in microscopy and digital imaging increase the speed and affordability of microscope-based data collection, machine classifications and similar image-based technologies will play a crucial role in managing the coming information explosion and transformation of palynological research.
Pollen imaging was completed by a team of University of Illinois undergraduates: Stephanie Chang, Caroline Martorano, Christopher Singh-Holmes, Ava Holz, Eric Noorts, and Ashwin Nayak. Vishnu Nair and Huiguang Yang assisted with running the learning system. Eric Grim, Feng Sheng Hu, and Luke Mander provided pollen samples and feedback. This study was partially supported by the National Center for Supercomputing Applications (NCSA) and utilized NCSA's Ember computing system. Funding was also provided by the University of Illinois Campus Research Board (#10253 to S.W.P.) and the US National Science Foundation (DBI-1052997 to S.W.P.). Author contributions: the manuscript was written by S.W.P. and D.K.T. S.W.P. and D.K.T. designed the study. D.K.T. developed and implemented the machine-learning system described and produced the results presented. C.W. was responsible for imaging, image consistency, and overseeing the team of students collecting pollen images. P.G.M. served as our spruce expert, classifying the grains from the 10 Nelson Lake samples.