Answering biological questions by querying k-mer databases


Correspondence to: Paul Greenfield, CSIRO Life Sciences, 11 Julius Ave, North Ryde, Sydney, NSW 2113, Australia.

E-mail: paul.greenfield@csiro.au


This paper describes a k-mer approach to analysing DNA data and quickly answering certain types of ad hoc biological questions. The k-mers (short DNA strings) are stored in a conventional relational database and indexed to support efficient exact-match operations. We show that k-mers of around 20–25 bases have interesting and useful uniqueness properties that can be used to compute a ‘relatedness’ metric, and that also allow k-mers to serve as ‘unique enough’ tags for identifying organisms and genes. The relatedness metric is used in SQL queries that directly answer questions such as how two related species differ and which genes are unique to an organism. The k-mer tags have proven useful in applications, largely metagenomic ones, that can quickly process large volumes of sequencing data to indicate which organisms and genes might be present in an environmental sample. All of this work is based on simple and fast exact matches of k-mer strings using a database, rather than on conventional alignment based on inexact matches of much longer strings. These k-mer tools provide ways of rapidly exploring large genome spaces and handling large volumes of sequence data, and they complement rather than replace existing alignment and assembly tools. Copyright © 2012 John Wiley & Sons, Ltd.
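As an illustration of the general idea, the following minimal sketch stores k-mers in an indexed relational table and answers two questions with exact-match SQL: what fraction of one organism's k-mers also occur in another, and which k-mers occur in only one organism. It is not the schema or the relatedness metric used in the paper, which the abstract does not specify. The table name `kmer_index`, the function names, the use of SQLite, and the shared-fraction measure are all assumptions made for this example, and reverse complements are ignored for brevity.

```python
import sqlite3

K = 20  # assumed default, within the 20-25 base range discussed above


def kmers(seq, k=K):
    """Yield every overlapping k-mer of length k in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_db(genomes, k=K):
    """Load {organism_name: sequence} into an in-memory k-mer table.

    The composite primary key doubles as the index that makes
    exact-match lookups on the k-mer string efficient.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kmer_index ("
        " kmer TEXT NOT NULL,"
        " organism TEXT NOT NULL,"
        " PRIMARY KEY (kmer, organism))"
    )
    for name, seq in genomes.items():
        # OR IGNORE keeps one row per distinct (kmer, organism) pair.
        conn.executemany(
            "INSERT OR IGNORE INTO kmer_index (kmer, organism) VALUES (?, ?)",
            ((km, name) for km in kmers(seq, k)),
        )
    conn.commit()
    return conn


def shared_fraction(conn, a, b):
    """Fraction of a's distinct k-mers that also occur in b.

    Illustrative stand-in for a relatedness measure (an assumption).
    """
    total = conn.execute(
        "SELECT COUNT(*) FROM kmer_index WHERE organism = ?", (a,)
    ).fetchone()[0]
    shared = conn.execute(
        "SELECT COUNT(*) FROM kmer_index x"
        " JOIN kmer_index y ON x.kmer = y.kmer"
        " WHERE x.organism = ? AND y.organism = ?",
        (a, b),
    ).fetchone()[0]
    return shared / total if total else 0.0


def unique_kmers(conn, a):
    """Return the k-mers found in organism a and in no other organism."""
    rows = conn.execute(
        "SELECT x.kmer FROM kmer_index x"
        " WHERE x.organism = ?"
        " AND NOT EXISTS ("
        "  SELECT 1 FROM kmer_index y"
        "  WHERE y.kmer = x.kmer AND y.organism <> x.organism)",
        (a,),
    ).fetchall()
    return [r[0] for r in rows]
```

With real genomes, `build_db` would be called with the default `k`; a much smaller `k` is convenient only for checking the queries by hand on toy sequences. Both queries rely solely on equality comparisons of k-mer strings, which is what lets an ordinary database index do the work instead of an alignment tool.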