Article condensation
The ‘database for preterm birth’ is an aggregation tool to organize the publications, genes, genetic variations and pathways related to preterm birth for use by clinicians and basic scientists.
James Padbury, MD, Department of Pediatrics, Women and Infants Hospital-Warren Alpert Medical School of Brown University, 101 Dudley Street, Providence 02905, Rhode Island.
A vast body of literature has suggested genetic programming of preterm birth. However, there is a complete lack of an organized analysis and stratification of genetic variants that may indeed be involved in the pathogenesis of preterm birth. We developed a novel bioinformatics approach to identify the nominal genetic variants associated with preterm birth. We used semantic data mining to extract all published articles related to preterm birth. Genes identified from public databases and archives of expression arrays were aggregated with genes curated from the literature. Pathway analysis was used to impute genes from pathways identified in the curations. The curated articles and collected genetic information are available in a web-based tool, the database for preterm birth (dbPTB) that forms a unique resource for investigators interested in preterm birth.
Preterm birth (PTB) is an important, poorly understood clinical problem. It imposes enormous clinical, economic and psychological burdens on society. While recent theories underscore the role of inflammation in preterm labor, simple explanations, single pathways and simple patterns of inheritance are inadequate to explain the pathogenesis of this enigmatic pregnancy complication. The pathogenesis of PTB could be better investigated if it were considered a complex, polygenic disorder that entails activation or suppression of a host of genes. We hypothesized that polymorphic changes in the genes that contribute to the risk of preterm birth could be identified using new bioinformatics approaches coupled with high-throughput technologies applied to appropriate cohorts of patients. This should lead to previously unrecognized insights into the relative contributions of the genetic and environmental factors that underlie preterm birth.
We developed an alternative approach to identify a more manageable set of candidate genes, which nonetheless incorporates some elements of genome-wide investigation. Our approach combined information from published literature with data from expression databases, linkage data and pathway analyses to identify biologically relevant genes for testing in an association study of genetic variants and preterm birth. These genes, their genomic location, the single nucleotide polymorphisms contained therein and any associated copy number variations are presented in a publicly available, searchable database, http://ptbdb.cs.brown.edu/dbPTBv1.php.
Knowledge-based computational biology and bioinformatics approach
We developed a web-based, semantic data mining and aggregation tool to ‘filter’ published literature for evidence of association of preterm birth with genes, genetic variants, single nucleotide polymorphisms (SNPs) or changes in gene expression. dbPTB used SciMiner™ to extract gene and protein information from published articles specific to preterm birth. More than 30,000 articles related to PTB potentially included relevant information on genes, SNPs or genetic variations. Using semantic language processing, we identified 980 articles with information about genes and genetic variants. We began with queries built from common, well-known keywords for PTB and genetics, for example, ‘preterm birth and genes’. After the extracted articles were accepted, all the MeSH (Medical Subject Headings) terms associated with these papers were used to create new search queries.
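The query expansion step described above can be illustrated with a minimal sketch. It assumes each accepted article is a record carrying a list of MeSH terms, that a term is reused only if it appears in at least `min_articles` accepted papers, and that new queries follow the same ‘preterm birth and …’ pattern as the seed query. The record layout, the threshold and the function name are hypothetical and are not taken from dbPTB or SciMiner.

```python
from collections import Counter


def expand_queries(accepted_articles, seed_queries, min_articles=2):
    """Turn MeSH terms of accepted articles into new search queries.

    accepted_articles: list of dicts with a "mesh_terms" list (hypothetical shape).
    seed_queries: queries that were already run and should not be repeated.
    """
    # Count in how many accepted articles each MeSH term occurs.
    term_counts = Counter()
    for article in accepted_articles:
        term_counts.update(set(article["mesh_terms"]))

    seen = {query.lower() for query in seed_queries}
    expanded = []
    # Most frequent terms first; ties are broken alphabetically.
    for term, n in sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n < min_articles:
            continue
        query = f"preterm birth and {term.lower()}"
        if query not in seen:
            seen.add(query)
            expanded.append(query)
    return expanded


# Toy records with placeholder identifiers, for illustration only.
accepted = [
    {"pmid": "hypothetical-1",
     "mesh_terms": ["Polymorphism, Single Nucleotide", "Interleukin-6"]},
    {"pmid": "hypothetical-2",
     "mesh_terms": ["Polymorphism, Single Nucleotide", "Genes"]},
    {"pmid": "hypothetical-3",
     "mesh_terms": ["Interleukin-6", "Genes"]},
]
new_queries = expand_queries(accepted, ["preterm birth and genes"])
print(new_queries)
```

In this toy run the term ‘Genes’ is skipped because it reproduces the seed query, so only the two remaining terms generate new queries.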
Curation is the process in which the literature is searched and reviewed by several junior and senior members of a biomedical research team. Our curation team consisted of researchers and medical students formally trained in the molecular and cell biology of preterm birth. Each article was carefully read with attention to study design, and relevant articles were deposited into the database with their unique PMID. We entered the genes, genetic variants, SNPs, rs numbers and annotations describing gene–gene interactions. We accepted the authors' criteria for statistical significance. All genes and genetic variants were entered into the database using their unique HUGO Gene Nomenclature Committee (HGNC) identifiers. SNPs were recorded with their appropriate rs number using HapMap Data Release 27. Where specific haplotypes were shown to confer significant risk for preterm birth, all the individual SNPs within the haplotype were entered into the database. Inter-rater reliability was assessed, and kappa scores were measured after training.[3, 4] Articles that were accepted for PTB immediately became accessible to dbPTB queries along with all the relevant genetic data (Fig. 1).
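As an illustration of how a kappa score for inter-rater reliability can be computed, the sketch below implements Cohen's kappa for two curators who each label the same articles as accepted or rejected. The text does not state which kappa variant was used or how many raters were compared, so the two-rater setup, the labels and the example decisions are all assumptions.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)

    # Fraction of items on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance from each rater's label frequencies.
    p_expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )

    if p_expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical accept/reject decisions by two curators on eight articles.
curator_1 = ["accept", "accept", "reject", "reject",
             "accept", "reject", "reject", "reject"]
curator_2 = ["accept", "reject", "reject", "reject",
             "accept", "reject", "accept", "reject"]
kappa = cohen_kappa(curator_1, curator_2)
print(round(kappa, 3))
```

Here the curators agree on six of eight articles (0.75), chance agreement is 34/64, and kappa is therefore 7/15, or about 0.467.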
Query development and data integration
High-dimension databases of expression data, data from linkage analyses, databases of results from SNP arrays and data from proteomic platforms were searched for genes, genetic variants and proteins related to preterm birth or showing differential association with preterm birth. We also searched for articles that provided information on analyses of proteins in body fluids or compartments that were analyzed using contemporary proteomic techniques, for example, mass spectrometry. We also searched the repositories of the National Heart, Lung, and Blood Institute (NHLBI) and the National Human Genome Research Institute (NHGRI), the Human Gene Mutation Database and the Catalogue of Published Genome-Wide Association Studies hosted by the NHGRI.
For each deposited gene, we included SNP data and tag SNPs from 5 kb upstream to 5 kb downstream of the genomic sequence from HapMap (release number 27[2]). SNP information was taken from NCBI dbSNP Build 126. For each article, the abstract and related information such as PMID, journal name, authors' names and title were also stored in dbPTB.
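The 5 kb flanking rule can be expressed as a simple interval filter. The sketch below is a minimal illustration, assuming 1-based inclusive coordinates on a single genome build and in-memory records; the `Gene` and `Snp` classes, the gene symbol, the coordinates and the rs-style identifiers are placeholders and do not come from HapMap or dbSNP.

```python
from dataclasses import dataclass

FLANK_BP = 5_000  # 5 kb on each side of the gene, as described in the text


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int


@dataclass(frozen=True)
class Snp:
    rs_id: str
    chrom: str
    pos: int


def snps_in_window(gene, snps, flank=FLANK_BP):
    """Return SNPs lying within the gene body plus `flank` bp on each side.

    The window is symmetric, so the result does not depend on which strand
    the gene is on.
    """
    lo = max(1, gene.start - flank)
    hi = gene.end + flank
    return [s for s in snps if s.chrom == gene.chrom and lo <= s.pos <= hi]


# Placeholder gene and SNP records, for illustration only.
gene = Gene("GENE_A", "chr1", 100_000, 120_000)
snps = [
    Snp("rs_hyp1", "chr1", 95_000),   # exactly 5 kb from the gene start
    Snp("rs_hyp2", "chr1", 94_999),   # one base outside the window
    Snp("rs_hyp3", "chr1", 125_000),  # exactly 5 kb past the gene end
    Snp("rs_hyp4", "chr2", 110_000),  # different chromosome
    Snp("rs_hyp5", "chr1", 110_000),  # inside the gene body
]
selected = [s.rs_id for s in snps_in_window(gene, snps)]
print(selected)
```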
We used Ingenuity Pathway Analysis (IPA, Ingenuity® Systems, www.ingenuity.com) to identify pathways and networks involving the genes we identified with significant evidence for their roles in preterm birth. We included the genes and genetic variants identified by curation and in public databases, largely transcriptome-wide array data sets[5, 6] and some proteomic analyses related to preterm birth. The genes identified by Ingenuity Pathway Analysis were entered into the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.
Insights from database for preterm birth
We extracted 31,018 articles dealing with PTB from PubMed using SciMiner. The ‘filtered set’ included 980 articles with likely information on 1200 genes. We ‘accepted’ 142 articles described by a total of 960 unique MeSH terms. These articles provided associations of 186 genes with preterm birth that were accepted as statistically valid by the authors and the curation team. We next imported 215 genes from both published and public databases containing array data and data from other proteomic analyses. Lastly, we identified and included an additional 216 genes based on interpolation from pathway analysis. These genes were contained in 173 unique pathways. The workflow supporting retrieval of genes from the literature and public databases and gene interpolation from pathway analysis is shown in Fig. 1. These results are all retrievable from the publicly available database for preterm birth, http://ptbdb.cs.brown.edu/dbPTBv1.php. We have also included the 156,963 SNPs contained within the genomic and flanking regions of each gene in dbPTB. We physically mapped the genomic location of the genes in dbPTB. The chromosomes and the number of genes mapped to each are shown in Fig. 2.
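The three-stage aggregation can be sketched as a merge that records the first source in which each gene appears. This is an illustration under stated assumptions, not the dbPTB implementation: the source labels, the precedence order (literature, then public databases, then pathway interpolation) and the toy gene symbols are hypothetical. Only the three reported counts are taken from the text.

```python
def aggregate_genes(sources):
    """Merge gene lists, keeping the first source that reported each gene.

    sources: ordered list of (source_name, iterable of gene symbols).
    Returns a dict mapping upper-cased gene symbol to source name.
    """
    provenance = {}
    for source_name, genes in sources:
        for gene in genes:
            # setdefault keeps the earlier source if the gene was already seen.
            provenance.setdefault(gene.upper(), source_name)
    return provenance


def counts_by_source(provenance):
    """Count how many genes are attributed to each source."""
    counts = {}
    for source_name in provenance.values():
        counts[source_name] = counts.get(source_name, 0) + 1
    return counts


# Toy gene symbols; the real lists are held in dbPTB.
toy_sources = [
    ("literature", ["GENE_A", "GENE_B"]),
    ("public_databases", ["GENE_B", "GENE_C"]),
    ("pathway_interpolation", ["gene_c", "GENE_D", "GENE_E"]),
]
toy_counts = counts_by_source(aggregate_genes(toy_sources))
print(toy_counts)

# Counts reported in the text for each stage of the workflow.
reported = {
    "literature": 186,
    "public_databases": 215,
    "pathway_interpolation": 216,
}
total_genes = sum(reported.values())
print(total_genes)
```

With the reported counts, the three stages together contribute 186 + 215 + 216 = 617 genes, which is consistent with each gene being attributed to a single source.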
We identified a total of 25 networks. Several networks including ‘Inflammatory Response, Small Molecule Biochemistry, Cellular Development, Hematological System Development and Function, Cellular Function and Maintenance, Cardiovascular Disease, Connective Tissue Development and Function, Drug Metabolism, Genetic Disorder’ represented the largest portion of interaction domains among the major networks detected.
The database for preterm birth allows investigators to pursue several query strategies: they can search for related articles, genes, SNPs, chromosomes or keywords against the MeSH terms and abstracts of the curated articles. Search results include the authors, the title of the article, the name of the journal and a link to the original source. There are links to Online Mendelian Inheritance in Man (OMIM), UCSC Genome Bioinformatics and HGNC. Under the same search option, users are able to see all related SNP data for each gene.
Recent studies have focused on genomic and proteomic approaches to diagnosing and determining the mechanism(s) of preterm labor. Polymorphic changes in the protein coding regions of specific genes and in regulatory and intronic sequences have been described. In most of the studies reported to date, candidate genes or proteins involved in inflammatory reactivity or uterine contractility have been investigated.[8-26] Summaries of these observations and candidate genes have been reported. Most of the studies reported to date have involved modest-sized patient cohorts and polymorphisms from genes involved in infection/inflammation. The results suggest that alteration in the structure and/or expression of these proteins interacts with infection and/or other environmental influences and is associated with preterm birth. The results generally, however, do not provide insight into the causes of prematurity in the absence of inflammation. They also do not demonstrate whether the observed associations are reflective of genetic mechanism(s) and/or gene–environmental interactions.
The promises of the genomic era have been presented eloquently.[27-29] The genome-wide association study (GWAS) approach queries the genome in a hypothesis-free unbiased approach, with the potential for identifying novel genetic variants. However, while there have been a number of important ‘hits’ (e.g., macular degeneration, obesity), there are many ‘misses’ and failures to replicate findings even from large-scale studies.[30-32] Moreover, the GWAS-based interrogation of large numbers of anonymous SNPs or CNVs severely limits power and makes it difficult computationally to examine combinatorial gene–gene interactions.[33-35]
We created a more manageable set of genes and genetic variants for which there is a priori evidence for involvement in preterm delivery. dbPTB was developed to create, aggregate and store this unique and specialized combination of information on preterm birth. We believe this smaller set of genes may allow important but otherwise difficult computational approaches to the examination of gene–gene interactions in combinatorial or higher order fashion. As the first basis for population of this database, we used published literature. One hundred and eighty-six genes were identified by literature-based curation, 215 genes came from publicly available databases and an additional 216 genes came from pathway-based interpolation. This total of 617 genes represents a parsimonious but robust set of genes for which there is good a priori biological evidence for involvement in preterm birth. These genes and genetic variants can now be used in case–control studies comparing genetic variants, SNPs or copy number variations for their relationship to PTB.
This work was supported by the National Foundation March of Dimes Prematurity Initiative # 21-FY08-563, and National Institutes of Health Grants NIH-5T35HL094308-02 and NIH-NCRR P20 RR018728.