To Err is Human


P. Forster

Molecular Genetics Laboratory, The McDonald Institute for Archaeological Research, University of Cambridge, Downing Street, Cambridge CB2 3ER, UK

About ten years ago, while studying biochemistry in Hamburg, I heard of the plans of Bryan Sykes to identify the genetic origins of the English. Reasoning that a comparative sample of North Germans and Danes would be useful for this project, I plucked hairs from a couple of hundred of my bemused colleagues, professors, and friends in Denmark and Germany, and sent them to England for mitochondrial DNA (mtDNA) sequencing. However, the first sequences contained errors that only became apparent in phylogenetic network analyses (Bandelt et al. 1995) and motivated us to take great pains in proofreading before publication. For evermore, this impressed upon us a healthy scepticism towards sequence data in general, an attitude which incidentally has not always endeared us to fellow geneticists.

It is an irony of fate that errors in these very same German and Danish sequences have now tripped up researchers of deCODE Genetics in Iceland, as described by Arnason on page 5 of this issue. In a number of papers, the deCODE team had previously argued that mtDNA in Iceland is relatively homogeneous when compared to other European regions, notably the northern European source areas of the Vikings, who colonised Iceland after AD 874. Arnason, however, arrives at the opposite result, namely that Icelanders are rather diverse when compared to the rest of Europe. What had happened? The deCODE geneticists had taken their sequences not directly from the original publications or their corresponding GenBank/EMBL depositions, but rather from the conveniently compiled Leipzig public database, HvrBase. As Arnason demonstrates, this database is riddled with copying errors in the case of the Danes and Germans, resulting in a qualitatively different ranking of Icelandic genetic diversity.
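To see why copying errors in a comparison sample can reverse a diversity ranking, consider the following minimal sketch. It assumes that diversity is measured as Nei's gene (haplotype) diversity, h = n/(n-1) * (1 - sum of squared haplotype frequencies); the statistics actually used by deCODE and by Arnason are not specified here, and the haplotype labels and counts are invented purely for illustration.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Invented toy sample of ten sequences falling into three haplotypes.
original = ["A"] * 6 + ["B"] * 3 + ["C"]

# The same sample after two hypothetical copying errors, each of which
# turns a copy of haplotype A into a spurious, apparently new haplotype.
miscopied = ["A"] * 4 + ["A_err1", "A_err2"] + ["B"] * 3 + ["C"]

h_original = haplotype_diversity(original)
h_miscopied = haplotype_diversity(miscopied)

print(f"diversity of original sample:  {h_original:.2f}")
print(f"diversity of miscopied sample: {h_miscopied:.2f}")
```

In this toy case, two miscopied entries out of ten are enough to raise the apparent diversity of the comparison sample, which in turn makes any other population look more homogeneous by comparison.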

Nevertheless, it would be unfair to restrict criticism either to Leipzig or to Reykjavik concerning the accuracy of primary mtDNA sequence data. According to Table 1, which I derived from an update of our own database (Röhl et al. 2001) of published mtDNA sequences, more than half of the mtDNA sequencing studies ever published contain obvious errors which should have been caught by the authors. These errors range from simple mistakes in sample description and misread nucleotides to wholesale rearrangements in columns and rows of the sequence tables. Moreover, Table 1 shows only the tip of the iceberg, namely those errors which are confirmed to exist (i.e., confirmed either by inconsistencies within the publication itself, or by inconsistencies between the paper and the deposited sequences, or indeed by feedback from the authors). Phylogenetic analyses make it clear that error-free publications on mtDNA sequences are extremely rare (Bandelt et al. 2001; 2002).

Table 1. Published mtDNA sequence errors 1981–2002

Acta Crim Japon                     1
Acta Paediatr                       1
Adv Forensic Haemogenet             1
Am J Hum Biol                 1     1
Am J Hum Genet               10    25
Am J Phys Anthropol           6     5
Ann Hum Genet                 4     6
Biochem Biophys Res Comm      1     2
Biochem Int                   1
Biol Chem                           1
Curr Biol                           1
DNA Fingerprinting                  1
Evolution of Life                   1
Eur J Hum Genet                     3
FEBS Lett                     1
Forensic Sci Int              1     2
Gene                                1
Genome Res                          1
Hereditas                           1
Hum Biol                      3     5
Hum Hered                           1
Hum Immunol                         1
Hum Mol Genet                 2     1
Int J Legal Med               8     8
J Forensic Sci                1
J Mol Evol                    1
Legal Med                           1
Mol Biol Evol                 3     1
Nat Genet                     2     1
Nature                              1
Proc Natl Acad Sci USA        5     2
Proc R Soc Lond B                   3
Russ J Genet                  1

Percentage erroneous: 58.4

Does it matter? Unfortunately, in many cases it does. Not only the Icelandic work but also other fundamental research papers, such as those claiming a recent African origin for mankind (Cann et al. 1987; Vigilant et al. 1991), a Neolithic origin of Europeans (Simoni et al. 2000), or recombination of mtDNA (Hagelberg, 1999; Eyre-Walker, 1999), have been criticised and rejected, respectively, due to the extent of primary data errors, as outlined by Forster et al. (2001); Röhl et al. (2001); Torroni et al. (2000); Kivisild & Villems (2000) and Macaulay et al. (1999). In forensics, accurate comparative mtDNA databases are needed to assess the probability that an mtDNA profile from a crime stain derives from a suspect rather than from any other member of the population, so the number of errors in forensic journals listed in Table 1 does not engender confidence, either. I hasten to add that the forensic journals or the American Journal of Human Genetics, which feature prominently in the table, are possibly neither more nor less inaccurate than other journals, whatever the table may seem to suggest. There is more scope for errors in journals which publish large datasets and long mtDNA sequences than in other journals, such as Nature and Nature Genetics, which on occasion play safe by not divulging any primary sequence data.

One solution may be for journals to impose more rigorous checks that would discourage hasty submission of manuscripts without adequate proofreading, for example by informing all submitting authors that sequence electropherograms will routinely be checked in the course of the reviewing process. But ultimately, of course, it is up to the authors to ensure the accuracy of their data, and the Icelandic example provides a warning that more care is needed than has been practised in the past. Because some errors will slip through anyway and appear in print, it would also help to follow the lead of Richards and colleagues (Richards et al. 2000) and provide websites which allow readers to enter their comments and corrections without the cost, effort and fanfare of a formal erratum. There is no reason to suppose that DNA sequencing errors are restricted to mtDNA. In fact, it is mainly because mtDNA is a non-recombining genetic unit that many errors are easily identified by phylogenetic analysis; errors in nuclear loci or in rapidly mutating loci such as short tandem repeats will be much harder to detect. The postgenomic age promises to become a proofreading age.
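The reason a non-recombining locus lends itself to this kind of checking can be illustrated with a simple pairwise site compatibility check (the four-gamete test). This is a much cruder stand-in for the phylogenetic network analyses cited above, not a reproduction of them: if each site had mutated only once on a single tree, no pair of two-state sites could show all four combinations of states, so a pair that does show all four points to either a recurrent mutation or a possible error. The function name and the toy alignment below are invented for illustration.

```python
from itertools import combinations


def incompatible_site_pairs(sequences):
    """Return pairs of alignment columns at which all four two-site
    combinations of states occur (four-gamete test)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    pairs = []
    for i, j in combinations(range(length), 2):
        states_i = {s[i] for s in sequences}
        states_j = {s[j] for s in sequences}
        if len(states_i) != 2 or len(states_j) != 2:
            continue  # consider only variable sites with exactly two states
        if len({(s[i], s[j]) for s in sequences}) == 4:
            pairs.append((i, j))
    return pairs


# Invented toy alignment that fits a single tree without conflict.
clean = ["AAAA", "GAAA", "GGAA", "AAAG"]

# The same alignment with one hypothetical misread nucleotide:
# column 1 of the last sequence is recorded as G instead of A.
with_error = ["AAAA", "GAAA", "GGAA", "AGAG"]

print(incompatible_site_pairs(clean))       # no conflicting site pairs
print(incompatible_site_pairs(with_error))  # columns 0 and 1 now conflict
```

A flagged pair is only a candidate for rechecking against the raw data, since real mtDNA, and the control region in particular, also contains genuine recurrent mutations. In a recombining nuclear locus the same pattern arises naturally, which is one way to see why such errors are harder to spot there.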