Volume 39, Issue 4
Software note

helminthR: an R interface to the London Natural History Museum's Host–Parasite Database

Tad Dallas

E-mail address: tdallas@uga.edu

Univ. of Georgia, Odum School of Ecology, Athens, GA 30602 USA

First published: 01 February 2016

Abstract

The understanding of the diversity and distribution of helminth parasites is currently constrained by the limited number of host–parasite interaction databases, and the difficulty in accessing existing data. The London Natural History Museum's Host–Parasite Database represents one such underutilized database, containing over a quarter million helminth parasite occurrence records, accessible through a web interface. To enable users to programmatically search and manipulate data from this database, I developed an R package called helminthR. Here, I introduce the core functions of the package, and detail how helminthR can be used to obtain host–parasite interaction records, citations for interactions, and host taxonomic data.

Helminth parasites are among the most common infectious agents of humans (Stoll 1947, De Silva et al. 2003, Hotez et al. 2008), wild animals (Poulin and Valtonen 2002, Jolles et al. 2008), and livestock (Over et al. 1992, Morgan et al. 2013). Limitations in data availability have hampered our understanding of the spatial distribution of helminth parasites, and of associations between helminth parasites and both human and wildlife hosts. Further, there is a need for basic scientific research into the community ecology and macroecology of host–helminth associations (Rohde 2002). Such efforts could test principles from community ecology and examine macroecological patterns in parasites.

To address these research concerns, data on host–helminth associations across broad spatial scales are needed. Efforts to document known host–parasite associations in large databases are fairly recent, and represent valuable resources for researchers (Gibson et al. 2005, Nunn and Altizer 2005, Strona et al. 2013). However, some of these databases are not openly accessible, requiring users to contact database administrators or to copy data from web interfaces. These methods of accessing databases may lead to transcription errors and duplicated efforts among labs, and they create static copies of the data that are difficult to update if and when new data are added. Making host–parasite databases open and easy to access may promote open and reproducible science, and could aid the discovery of ‘general laws’ in parasite ecology (Poulin 2007).

To this end, I have developed an R package capable of extracting information from a large global database of host–helminth parasite occurrence records maintained by the London Natural History Museum (NHM; Gibson et al. 2005). This curated database includes more than 250 000 host–helminth records from over 28 000 published peer‐reviewed articles. However, the web interface of the database makes data analysis difficult, which subsequently limits the use of this data resource by researchers (but see Strona and Fattorini (2014) and Wells et al. (2015)). The goal of the helminthR package is to make all the data contained in the London Natural History Museum's database accessible from R, a commonly used open source statistical programming environment (R Core Team).

Core package functionality

Here, I explore the core functions of the helminthR package, and then demonstrate the utility of helminthR for creating host–parasite interaction networks. helminthR relies on several packages that interface with HTML and XML, including rvest (Wickham 2015a) and xml2 (Wickham 2015b). Currently, helminthR is available on GitHub, and is hosted by the rOpenSci collective, a group of scientists and developers committed to creating packages that promote open science, including packages to access online data sources. The package can be installed with the devtools package using the following R code.
[R code listing; not preserved in this copy]
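A minimal sketch of the installation step. It assumes that the package lives in the ropensci/helminthR repository on GitHub (the repository named in the issue tracker link given in the Conclusions) and that devtools can be installed from CRAN. It is an illustration and not necessarily the listing originally printed here.

```r
# One-time setup: devtools provides install_github().
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install helminthR from the rOpenSci GitHub organisation, then load it.
# Assumption: the repository path is "ropensci/helminthR".
devtools::install_github("ropensci/helminthR")
library(helminthR)
```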
Downloading and using this package does not require the user to have a GitHub account, unless they would like to actively contribute to package functionality or file an issue.

Querying the database

Host–parasite records in the NHM database contain information on host and parasite species, one or more citations for the host–parasite association, and the location of the interaction georeferenced to the country, state (for the United States), or water body (e.g. Lake Erie) level. Queries can be made to find all interactions of a known host species (findHost), all interactions of a known parasite species (findParasite), or all interactions at a specific geographic location (findLocation). Links to citations for a given helminth record can be obtained from any of the functions listed above by setting the citation argument to TRUE.

When querying the database for known hosts or helminths, the user can supply a genus name, a species name, or both, in order to query different taxonomic levels of host or parasite. Further, findParasite can find host–helminth records given a parasite group (Cestodes, Acanthocephalans, Monogeneans, Nematodes, Trematodes, or Turbellarians) or subgroup. The following example code would find all interactions of nematodes in the genus Strongyloides.
[R code listing; not preserved in this copy]
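A sketch of what such a query might look like. The function name findParasite and the object name strongHosts come from the text; the argument name genus is an assumption and should be checked against the package documentation. The call needs the package to be installed and the data source to be reachable.

```r
library(helminthR)

# Assumption: the parasite genus is passed through an argument named `genus`.
strongHosts <- findParasite(genus = "Strongyloides")

# Inspect the first few host-parasite records and the size of the table.
head(strongHosts)
dim(strongHosts)
```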
The resulting object, strongHosts, is a host–parasite association list in the form of a three (or four) column data.frame containing host and parasite names, parasite full name, and citation (if the citation argument is set to TRUE). The argument validateHosts provides taxonomic information on hosts from the Catalogue of Life (Roskov et al. 2015). When validateHosts = TRUE, the query is slightly slower, but questionable hosts are removed and species names are validated; the function then returns a list object containing the data.frame described above and the taxonomic information for all hosts. This structure is maintained when querying using any of the ‘find*’ functions, including findHost, findParasite, and findLocation. The following code uses the findHost function to find helminth occurrence records in wild individuals of Gorilla gorilla (using the hostState argument). The user can also query captive hosts, domesticated hosts, or hosts used in commercial applications.
[R code listing; not preserved in this copy]
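A hedged sketch of the Gorilla gorilla query. The function findHost and the argument hostState are named in the text, and hostState = 1 is described later in this note as meaning ‘in the wild’. The argument names genus and species and the object name gorillaParasites are assumptions made for illustration.

```r
library(helminthR)

# Assumption: genus and species are passed as `genus` and `species`.
# hostState = 1 restricts the query to hosts recorded in the wild.
gorillaParasites <- findHost(genus = "Gorilla", species = "gorilla",
                             hostState = 1)

# Each row is one helminth occurrence record for the host.
head(gorillaParasites)
```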
The final core function in the helminthR package queries all host–parasite interactions for a given geographic location. A list of locations that can be queried is provided by the listLocations function, and a cached copy of these data is provided as a data object (using the command data(locations)). Georeferencing of these data is performed using the geocode function in the ggmap package (Kahle and Wickham 2013). The user is responsible for ensuring the accuracy of the provided latitude and longitude coordinates. Further care should be taken when searching by location, as some locations may be nested within others (e.g. ‘South America’ is a valid location query, but many countries in South America are also valid queries). Below, I demonstrate the functionality by finding all host–parasite associations recorded in France where the host was ‘in the wild’ (i.e. hostState = 1). Setting the argument speciesOnly to TRUE removes occurrence records where the host or parasite name contains parentheses (e.g. ‘(freshwater_fish)’) or is identified only to the genus level (e.g. ‘Sanguinicola spp.’). The result is a host–parasite association list containing information on host–helminth associations, including links to the original citations. It is important to note that not all interactions will be unique, so the user may wish to use the unique function on the Host and Parasite columns of the output data.frame.
[R code listings; not preserved in this copy]
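A sketch of the France query described above, followed by the suggested de-duplication step. The function findLocation, the data object locations, the arguments citation, hostState and speciesOnly, and the column names Host and Parasite are taken from the text. The argument name location and the object names franceRecords and francePairs are assumptions.

```r
library(helminthR)

# Cached table of locations that can be queried, shipped with the package.
data(locations)
head(locations)

# Assumption: the location name is passed through an argument named `location`.
franceRecords <- findLocation(location = "France", citation = TRUE,
                              hostState = 1, speciesOnly = TRUE)

# Collapse repeated records to one row per host-parasite pair.
francePairs <- unique(franceRecords[, c("Host", "Parasite")])
nrow(francePairs)
```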

Visualizing host–parasite networks

The above code demonstrates the functionality of the helminthR package for querying host–parasite interactions by host and parasite genus and/or species, and also for locating all host–parasite interactions in a given country or locality. Using the findLocation function, I queried the database for all host–parasite interactions occurring within Lake Erie, one of the US Great Lakes, and visualized the resulting host–parasite interaction network (Fig. 1) using the igraph R package (Csardi and Nepusz 2006). Detailed code to create this type of visualization is provided in the Supplementary material Appendix 1.
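The general approach can be sketched on a small hand-made table. The records below are hypothetical stand-ins for a findLocation result (only the column names Host and Parasite follow the text), and the plotting choices are assumptions meant to resemble the figure, not the code from the Supplementary material. The igraph step runs only if igraph is installed.

```r
# Hypothetical records standing in for a findLocation() result.
records <- data.frame(
  Host     = c("Host A", "Host A", "Host B", "Host A"),
  Parasite = c("Parasite A", "Parasite B", "Parasite A", "Parasite A"),
  stringsAsFactors = FALSE
)

# Keep one row per host-parasite pair, since records can repeat.
edges <- unique(records[, c("Host", "Parasite")])

# Host-by-parasite incidence matrix: 1 marks a recorded association.
incidence <- unclass(table(edges$Host, edges$Parasite))

# Optional: draw the bipartite network with igraph.
if (requireNamespace("igraph", quietly = TRUE)) {
  net <- igraph::graph_from_data_frame(edges, directed = FALSE)
  # Hosts are drawn as larger blue dots, parasites as smaller black dots.
  isHost <- igraph::V(net)$name %in% edges$Host
  plot(net,
       vertex.size  = ifelse(isHost, 8, 4),
       vertex.color = ifelse(isHost, "blue", "black"),
       vertex.label = NA,
       edge.color   = "grey")
}
```

Working from the de-duplicated edge list keeps each association as a single link; the incidence matrix is a convenient alternative input for network metrics that expect a host-by-parasite table.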

Figure 1. The host–parasite association network for Lake Erie, one of the Great Lakes located in the northern United States. Grey lines represent interactions between hosts (larger blue dots) and helminth parasites (smaller black dots).

Data limitations

The data contained in the London Natural History Museum's Host–Parasite Database represent a valuable resource, but are not without limitations. First, the data are from studies published any time after 1922, and the data owners themselves accept no responsibility for data accuracy. Second, the data are only georeferenced to the country level in most cases, which limits their application. However, citations are given for each host–parasite association, and an attempt has been made to obtain latitude and longitude values for the centroids of countries (available using the command data(locations)). While it may be time consuming, examining the original references would help assure data quality and provide finer georeferencing. Nevertheless, the data can still be used to address many macroecological patterns in their current form. For example, data on aquatic and marine parasites are georeferenced to coastal areas (e.g. ‘Coast of New Guinea’) or larger bodies of water (e.g. ‘Aral sea’), providing a way to apply macroecological theory to largely unexplored questions related to the diversity and distribution of marine parasites (Rohde 2002, 2010).

Conclusions

In this paper I have shown how the R package helminthR permits programmatic access to the Natural History Museum Host–Parasite Database, making it easy to generate host–parasite networks at geographical scales spanning from local to global. This database represents one of the most complete aquatic host–parasite databases (but see Strona et al. 2013), providing data on parasite occurrences for both terrestrial and aquatic hosts. With any luck, helminthR will promote the application of concepts from community ecology and macroecology to parasite communities at a broader spatial scale. This project is hosted on GitHub, and uses TravisCI for continuous integration of the package on different R versions. Issues or improvements can be suggested at <https://github.com/ropensci/helminthR/issues>.

To cite helminthR or acknowledge its use, cite this Software note as follows, substituting the version of the application that you used for ‘version 0’:

Dallas, T. 2016. helminthR: an R interface to the London Natural History Museum's Host–Parasite Database. – Ecography 39: 391–393 (ver. 0).

Acknowledgements

I thank the London Natural History Museum, and specifically D. A. Baylis, the original curator, and the current curation team (D. Gibson, R. Bray, and E. Harris). Colin Carlson, Kevin Burgio, Alexa McKay, Maxwell Joseph, and Giovanni Strona provided thoughtful comments on earlier drafts. I thank my advisor, John Drake, and the developers at rOpenSci for their guidance, support, and general views on open science.

    Supplementary material (Appendix ECOG‐02131 at < www.ecography.org/appendix/ecog‐02131 >). Appendix 1.
