A web resource for the UK's long-term individual-based time-series (LITS) data

Authors

  • Owen R. Jones*
    NERC Centre for Population Biology, Division of Biology, Imperial College London, Silwood Park Campus, Ascot, Berks. SL5 7PY, UK
  • Tim Clutton-Brock
    Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, UK
  • Tim Coulson
    NERC Centre for Population Biology, Division of Biology, Imperial College London, Silwood Park Campus, Ascot, Berks. SL5 7PY, UK
  • H. Charles J. Godfray
    NERC Centre for Population Biology, Division of Biology, Imperial College London, Silwood Park Campus, Ascot, Berks. SL5 7PY, UK
    Present address: Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK.


*Correspondence author. E-mail: owen.jones@imperial.ac.uk

Introduction

Online data resources are becoming increasingly important for ecologists. However, compared with fields such as molecular biology and physical environmental science, there are few resources containing raw ecological data that can be used for further research. Reasons for this include the greater heterogeneity of most ecological data compared with, say, nucleic acid sequences or remote-sensing data, and the fact that many data are collected by individual research laboratories rather than by a community effort. Examples of web-based ecological data resources are the Long-term Ecological Research (LTER) catalogue of the data collected across the US network of environmental monitoring sites (http://www.lternet.edu/), and the Global Population Dynamics Database (GPDD) (NERC Centre for Population Biology at Imperial College, 1999), which makes available 5000 time-series of species population densities, each of more than 10 years' duration. The Ecological Society of America established the Ecological Archives (Peet 1998) as a depositary for electronic information. Fields that overlap with ecology are perhaps richer in web-based data resources: Treebase (http://www.treebase.org/treebase/) is a database of phylogenies, while an increasing number of taxonomic catalogues are now web-based. Much biodiversity information, including distributional and ecological data, is being collected digitally or is being digitized, and the goal of the Global Biodiversity Information Facility (GBIF) is to provide a seamless portal for access to this distributed resource.

The purpose of this article is to describe a web-based resource for the UK's long-term individual-based time-series (LITS) data. In the next section we explain what we mean by LITS data, introduce the LITS project, and then describe the web resource. In the concluding part we discuss several issues the project raises that may influence future moves to make more ecological information available on the web.

Long-term individual-based time series (LITS)

What distinguishes LITS data

We define LITS data as a type of time-series data that record not only the numbers of a species at a particular site but also ancillary information on the state of individuals. This information may be time-invariant attributes like sex or genotype; single events like date of birth or death; or features that change with time such as weight, spatial location, clutch or litter size, and parasite load. LITS data often also include information about links between individuals, for example parentage, identity of mates, or position within a dominance hierarchy. The biological data may be associated with information about the physical environment, for example local climatic and weather fluctuations or spatial patterns in topography or soil factors. The precise distinction between what are, and what are not, LITS data is of course somewhat fuzzy. In practice we take long-term to mean more than 5 years or generations, and we have concentrated on datasets with the richest individual-based data.
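The kinds of information listed above can be sketched as a small data structure. This is only an illustration of the definition, not the schema of any LITS dataset: the class names, field names, trait label and all values below are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class Observation:
    """One time-varying measurement on an individual (e.g. weight, litter size)."""
    when: date
    trait: str
    value: float


@dataclass
class Individual:
    """Hypothetical record for one marked animal in a LITS-style study."""
    ident: str
    sex: Optional[str] = None       # time-invariant attribute
    born: Optional[date] = None     # single event
    died: Optional[date] = None     # single event
    mother: Optional[str] = None    # link to another individual
    father: Optional[str] = None    # link to another individual
    observations: List[Observation] = field(default_factory=list)

    def series(self, trait: str) -> List[Tuple[date, float]]:
        """Return the measurements of one trait in date order."""
        rows = [(o.when, o.value) for o in self.observations if o.trait == trait]
        return sorted(rows)


def offspring_of(population: Dict[str, Individual], parent_id: str) -> List[str]:
    """List identifiers of individuals whose recorded mother or father is parent_id."""
    return sorted(
        ind.ident for ind in population.values()
        if parent_id in (ind.mother, ind.father)
    )


# Invented example values, for illustration only.
ewe = Individual("F001", sex="F", born=date(1990, 4, 20))
ewe.observations.append(Observation(date(1992, 8, 1), "weight_kg", 21.5))
ewe.observations.append(Observation(date(1991, 8, 1), "weight_kg", 18.0))
lamb = Individual("F002", sex="F", born=date(1992, 4, 18), mother="F001")
population = {ind.ident: ind for ind in (ewe, lamb)}
```

Keeping parentage as identifiers rather than nested objects is one possible design choice; it lets a pedigree be rebuilt from flat records, which is how such data are commonly stored in relational tables.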

The efficient storage, preservation and dissemination of LITS data are important for two reasons. First, by their nature LITS data take a long time to collect and, especially in studies that record many attributes of different individuals, they require many person-hours and much expense. LITS data are often organizationally complex, and their interpretation requires detailed documentation which, if lost, may render the time-series worthless or of reduced value. Second, LITS data are often extremely valuable for exploring ecological and evolutionary hypotheses. We pick three examples illustrating this importance from the many that we could quote.

Moyes et al. (2006) used LITS data from a population of red deer (Cervus elaphus) to examine survival costs of reproduction. They report that current reproduction reduces the survival probability over the coming year, especially in young and old individuals. In contrast, an increase in cumulative reproductive effort, defined as the proportion of years in which a female has bred since sexual maturity, increases the probability of survival. This positive effect on survival of cumulative reproduction is generated by differences in individual fitness and such effects can only be identified with data on complete life histories (i.e. reproductive histories and mortality information) from a large number of individual females.

A second example comes from the Galapagos. The study of Darwin's finches (Geospiza spp.) on the small island of Daphne Major by Peter and Rosemary Grant and colleagues provides one of the most important LITS datasets ever collected. Among other projects, the Grants used this dataset to show how the ecological and evolutionary dynamics of one species, G. fortis, were influenced by the colonization of Daphne Major by a competitor species, G. magnirostris. The arrival of G. magnirostris caused a decline in the population density of G. fortis as interspecific competition for food increased, which also generated a response to selection on its beak shape (Grant & Grant 2006). Identifying these ecological and evolutionary consequences required detailed information on food availability, population size, individual phenotypic characteristics and mother-offspring links collected over three decades. Without these data such insights could not have been achieved.

Our final example shows how the analysis of multigenerational pedigrees from wild populations has provided insights into the consequences of human actions for ecological and evolutionary change. Once again, this understanding could not have been realized without LITS data on individual mortality, morphometric measurements, and a pedigree also built on individual-level genetic data. The bighorn sheep (Ovis canadensis) population from Ram Mountain, Alberta, Canada, has been studied at the individual level for three decades. Trophy hunting of large-horned males occurs, leading to selection against large horns. This selection has led to an evolutionary response towards smaller-horned animals, which survive longer and now achieve many matings because of the absence of large-horned competitors (Coltman et al. 2003).

The LITS project

The United Kingdom has a long history of collecting LITS data, and this has involved ecologists ranging from large-scale multi-institutional teams such as the Soay sheep (Ovis aries) project (Clutton-Brock & Pemberton 2004) through to individuals, such as Tim Healing's work on the Skomer Island bank vole (Clethrionomys glareolus) (Healing et al. 1983). LITS data are a significant national resource for understanding the local response to environmental change, as well as a major contribution of the UK to international science. The aim of the LITS project was three-fold.

1. To survey the LITS datasets concerning terrestrial vertebrates that had been collected in the UK, and by UK-based researchers conducting work overseas.

2. To assess their archival status – whether they were in digital or non-digital form, stored securely, and sufficiently documented.

3. To produce a web resource that would include a catalogue of datasets with associated metadata (defined below) and, where requested, complete datasets.

Many of the UK LITS datasets (see Appendix S1 in Supplementary material) are among the best-known ecological studies. To survey the UK LITS corpus we sent the holders of the datasets known to us a provisional list and asked for information about any we had overlooked. We believe the list of 71 datasets we identified is reasonably comprehensive, but on the website we request information about new projects and any we have overlooked.

The second aim of the project was to assess the archival status of the different datasets, and any threats to their continued availability. Risk of loss is difficult to assess quantitatively, but we used criteria such as the current research status of the data owner (active, retired or changed career, deceased, respectively, suggesting increasing risks), the number of recent analyses using the dataset (more recent activity indicating less risk), and the method of data storage (relational databases, spreadsheets, paper files, field notebooks, respectively, correlated with increasing risk). Based on this, datasets were classified as low, medium or high risk. The distribution of risk amongst the 71 datasets indicated that although most of the UK's holdings were at low or medium risk of loss (58% and 22%, respectively), approximately 20% were categorized as being at high risk. Note that although we have informed data owners about our assessment of the archival status of their datasets, we are aware that this may be sensitive information and it is not provided on the website. Within the LITS project we had funds for data rescue and used the risk assessment to prioritize how these were spent.

Information that provides contextual value to data and assists users in its accurate interpretation is termed metadata. Metadata are diverse and may not only describe things that have been measured or recorded but also document the way in which data are stored, relationships amongst the data, manipulations that have been performed on them, as well as stewardship information. For the LITS catalogue we requested summary metadata (Table 1) from data owners. An advantage of collecting this information in a common format is that it facilitates searching and the comparison of different data resources. Where actual datasets have been mounted on the website, more specific information about each data field and the relationships between fields is provided.

Table 1. Summary metadata stored on UK LITS datasets

Data                    Description
Unique ID               A unique identification number
Title                   A title that provides a description of the dataset that should distinguish it from other projects
Abstract                A description of the dataset that will usually describe the objectives, key aspects, design and general methodology of the study
Key-words               Key-words to facilitate easy searching
Data owner              Name and contact details of the data owner(s)
Data contact            Name and contact details of the first contact to deal with questions regarding the use or interpretation of the data
Associated parties      Name and contact details of people who have intellectual property rights on the data. They may have assisted in collection, maintenance or documentation of the data
Usage rights            The access to the data permitted by the data owner
Geographical coverage   The geographical location of the study site
Temporal coverage       The time span over which data were collected
Taxonomic coverage      The taxonomic identity of the species studied
Methods and sampling    Description of sampling methodologies

The LITS web resource

The outcome of the LITS project is available at http://www.imperial.ac.uk/litsproject/ and consists of a catalogue of LITS datasets with their associated metadata. The data depository is an implementation of the Metacat metadata catalogue system developed by a consortium of computer scientists, biologists and environmental scientists (Jones et al. 2001). This system is a flexible database specifically designed for metadata storage, query, and access as part of a distributed network of ecological research databases. It can store a diverse range of metadata and data and has a simple user interface allowing flexible search and query. The database uses ecological metadata language (EML; Jones et al. 2001) to store the data. EML is the emerging standard for describing ecological data and is a derivative of extensible mark-up language (XML: Bray, Paoli & Sperberg-McQueen 2000).
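To make the idea of XML-based metadata concrete, the following minimal sketch serializes a few of the summary fields from Table 1 as XML and runs a key-word search over the result. It uses only the Python standard library. The tag names are loosely modelled on the Table 1 field names and are assumptions; the fragment is not validated against the EML schema, it does not use the Metacat interface, and all values are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: keys mirror Table 1, values are invented placeholders.
record = {
    "uniqueID": "lits.0000",
    "title": "Example individual-based study",
    "abstract": "Placeholder abstract describing objectives and methods.",
    "keywords": ["survival", "breeding success"],
    "dataOwner": "A. N. Owner",
    "usageRights": "password-restricted",
    "geographicCoverage": "Example site, UK",
    "temporalCoverage": ("1980", "1995"),
    "taxonomicCoverage": "Falco tinnunculus",
}


def to_xml(rec: dict) -> ET.Element:
    """Build a simple XML element tree from one summary-metadata record."""
    root = ET.Element("dataset", attrib={"id": rec["uniqueID"]})
    for tag in ("title", "abstract", "dataOwner", "usageRights",
                "geographicCoverage", "taxonomicCoverage"):
        ET.SubElement(root, tag).text = rec[tag]
    keyword_set = ET.SubElement(root, "keywordSet")
    for kw in rec["keywords"]:
        ET.SubElement(keyword_set, "keyword").text = kw
    span = ET.SubElement(root, "temporalCoverage")
    ET.SubElement(span, "begin").text = rec["temporalCoverage"][0]
    ET.SubElement(span, "end").text = rec["temporalCoverage"][1]
    return root


def matches(root: ET.Element, keyword: str) -> bool:
    """True if the record lists the key-word (case-insensitive comparison)."""
    return any(k.text.lower() == keyword.lower() for k in root.iter("keyword"))


doc = to_xml(record)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Because every record shares the same element names, a search such as `matches(doc, "survival")` can be applied uniformly across a catalogue, which is the practical benefit of collecting metadata in a common format.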

Currently, two full datasets have been mounted on the website and a third is being made ready. The first is a 15-year dataset of populations of kestrels (Falco tinnunculus) in England and Scotland (Village 1990). The study includes information on the breeding success, diet, sightings, distribution and morphometrics of approximately 1000 marked birds. The second dataset is on a population of rooks (Corvus frugilegus) (Patterson, Dunnet & Goodbody 1988) monitored for 11 years. This dataset holds information on sightings, breeding success, morphometrics, and mortality of almost 7000 birds. The third dataset is a 28-year study of a population of sparrowhawks (Accipiter nisus) (Newton & Rothery 1997) and holds data on breeding success and survival for 829 individuals. Additionally, data on eggshell thickness and pesticide residue are included.

Unrestricted access has been granted by the data owners to the kestrel and rook datasets, while the sparrowhawk data are embargoed until 2009 to allow the data owner to complete some analyses. The data owner can allow unrestricted access, no access, or password-restricted access. Different parts of the dataset can be accorded different access levels. The three datasets so far have been constructed in Microsoft Access because of its widespread availability, though any format can be used. We hope that further full datasets will be added to the website in time. The website is currently maintained by the NERC Centre for Population Biology, and will ultimately be transferred to the NERC Centre for Ecology and Hydrology designated data centre.

Issues raised

Fair use of data

During the project we had many discussions with data owners and other ecologists about what constitutes fair use of LITS-type data. Responses varied from the argument that data collected or part-collected using public funds should be immediately available to anyone for analysis (as are sequence data, for example), to the position that access should be completely at the discretion of the data owner indefinitely.

We, and many people we consulted, believe that LITS data are different from sequence data and that any fair-use policy should take this into account. The key difference is the length of time required to collect LITS data, and the difficulty and expense of maintaining long-term field studies. Many of the studies catalogued on our website represent a sizeable fraction of the career of the owners, and are responsible for the majority of their published output. It seems to us fair that they maintain control over access rights to their data whilst still professionally active. But more importantly, we are also concerned that insisting on immediate data release would be a significant deterrent to individuals contemplating starting or continuing LITS projects, to the enormous detriment of ecological science. LITS data are typically collected by individuals, laboratories or small consortia of laboratories, often supported by individual grants, sometimes providing their own funds to bridge gaps in support. Thus, funding bodies often only contribute a fraction of the costs involved with the project.

This type of data collection is different from other ecological projects such as distribution mapping involving large networks of volunteer recorders coordinated by organizations with core funding where the argument for immediate data release seems far stronger.

Based on our consultation we thus put forward for further discussion a suggested fair-use policy for LITS data collected using public funding.

1. While a long-term LITS project is active (data still being collected) the owner should be allowed to determine who has access to the data.
2. The research funder should have the right to ascertain that the data are being stored in a sustainable way with appropriate metadata and documentation, including requesting that an embargoed copy of the database is held at a designated data depositary. This right should be coupled with the provision of adequate funding to cover the costs of data organization (database construction, data entry, etc.) and storage.
3. At the termination of a project the owner retains exclusive rights for a further period of time (this could be a flat period, for example 5 years, or a fraction of the length of the time series, e.g. one-fifth the length in years).
4. All leaders of LITS projects should be encouraged as best practice to collaborate with other groups with complementary skills, and in particular to make data available to groups who want to conduct different types of analyses to those planned by the data collectors.
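The two options for the post-project exclusive period in the third point can be compared with a short calculation. This is a sketch for discussion only: the function name and the `rule` argument are hypothetical, the defaults simply encode the two examples given in the text (a flat 5 years, or one-fifth of the series length), and the series lengths in the loop are arbitrary.

```python
from fractions import Fraction


def exclusive_period(series_years: int, rule: str = "flat",
                     flat_years: int = 5,
                     fraction: Fraction = Fraction(1, 5)) -> Fraction:
    """Years of exclusive access after a project ends, under one of two rules.

    rule="flat"     -> a fixed period, regardless of study length
    rule="fraction" -> a fixed fraction of the length of the time series
    """
    if series_years <= 0:
        raise ValueError("series_years must be positive")
    if rule == "flat":
        return Fraction(flat_years)
    if rule == "fraction":
        return series_years * fraction
    raise ValueError(f"unknown rule: {rule!r}")


# Compare the two rules for a few arbitrary study lengths (in years).
for length in (10, 25, 30):
    print(length,
          exclusive_period(length, "flat"),
          exclusive_period(length, "fraction"))
```

With these defaults the two rules coincide for a 25-year series; the fractional rule gives a shorter exclusive period for shorter studies (2 years for a 10-year series) and a longer one for longer studies (6 years for a 30-year series).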

Separate from this, we would expect journals to insist that data used to derive and support results in a publication should be available for the analysis to be repeated, but that this would not need to include the entire study database.

Web-based ecological resources

While it seems self-evident that providing internet resources cataloguing ecological data is beneficial, such resources require financing to establish and support to maintain. Money spent on database construction and on the provision and maintenance of internet sites is money not spent on original research. We would defend the investment in a LITS database and website on the grounds of the special importance of this type of data, as well as the UK's particular strength in this type of research. But we also wonder whether the very much larger sums of money devoted to some web-based biodiversity projects will repay their capital outlay in terms of the pure and applied science they will enable. We suggest that a discussion is needed within the ecological community about the best ways to invest a limited budget in different eco-informatic projects.

National or international initiatives

The LITS project was funded by the UK's Natural Environment Research Council and dealt only with data owned or collected by UK-based researchers. It is clearly desirable that projects such as LITS should be international, and that different data catalogues should not develop independently in different regions, or that if they do they use compatible data structures and communicate with each other. It is partly for this reason that we used the EML protocols developed for ecological data by the National Centre for Ecological Analysis and Synthesis (NCEAS) in the United States to describe our datasets. We suggest that debate is needed about how best to co-ordinate web-based ecological resources. One model is that a diverse range of initiatives should be encouraged and that different web sites and resources are informally linked by higher level co-ordinating sites, much as GBIF is planning for biodiversity data. A second model is that different countries should take the lead in providing portals for different types of data. Much as the first option seems democratic and non-bureaucratic, our experience with LITS tells us that without significant investment it will be hard to build high quality ecological resources, and that an international division of labour may be the most efficient way to provide the field with web-based support.

Acknowledgements

We thank the researchers who contributed both data and metadata to our databases and Ian Stevenson (Sunadal Data Solutions) for constructing the Microsoft Access databases. We are indebted to the team behind the Metacat database system and in particular Jing Tao and Matt Jones for assistance in building our website. We thank Rob Anderson for IT help at Imperial College, and Atle Mysterud and an anonymous referee for constructive comments. Finally, we thank the Natural Environment Research Council for funding, and its chief data officer, Mark Thorley, for helpful discussion.
