Using bootstrapping to identify protein locations

Authors

  • Catherine Blake

    1. Center for Informatics Research in Science and Scholarship (CIRSS), Graduate School of Library & Information Science, University of Illinois at Urbana Champaign, 501 E Daniel St, Champaign, Illinois 61820-6211
  • Wu Zheng

    1. Center for Informatics Research in Science and Scholarship (CIRSS), Graduate School of Library & Information Science, University of Illinois at Urbana Champaign, 501 E Daniel St, Champaign, Illinois 61820-6211

Abstract

Automated methods that leverage both large quantities of text and knowledge resources are increasingly being used to identify relations from the web. One of the challenges with these approaches is that the quantity of relations identified makes it difficult to evaluate system performance. Our first goal in this paper is to demonstrate how bootstrapping can be used to identify subcellular location relations of proteins, which are critically important to biologists seeking to understand protein function. Specifically, we use protein-location pairs in the UniProt knowledge base and dependency paths from a collection of text to infer new proteins, new locations and new protein-location pairs. Our second goal is to conduct a detailed manual analysis of the first iteration of the bootstrapping process. Such an analysis reveals pitfalls of this approach and enables us to conclude with specific recommendations to improve bootstrapping performance in subsequent iterations.

INTRODUCTION

One important annotation activity in biology is identifying the subcellular location of a protein, because such location information can shed light on the protein's function. The long-term goal of this work is to identify new proteins, new locations and new protein-location relationships directly from full-text scientific articles. We envision an augmented intelligence scenario in which the system provides a manual annotator with a protein, a location, a protein-location pair and supporting evidence from a full-text scientific article. The annotator would then accept or correct the proposed location. This combination of system and human annotations reduces the time required, because the annotator can select yes, select no, or correct the proposed solution rather than having to read the entire document, while still maintaining accuracy. Moreover, the manual annotation efforts would provide ongoing feedback to improve system performance.

The location relation is an interesting case, not only in biology, but because it is a primitive instance-level relation (Smith et al., 2005) and because location relationships occur in multiple genres. For example, a news story may include the geographic location of an organization's head office, or a novel might include the location of a city within a country. In this paper, we infer only proteins and their subcellular locations from full-text scientific articles. As with many ontological relations, location relations can be described as a binary predicate comprising two arguments. Specifically, LOCATION (X, Y) indicates that X is located in Y. For example, the system should instantiate the location relation as LOCATION (ClC-5, luminal membrane) for the sentence: ClC-5 specific signal also appeared to be localized close to the luminal membrane of the intestinal crypt.
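As a minimal sketch of the binary predicate described above, the relation can be held in a two-field record. The class name `Location` and its field names are illustrative assumptions, not part of the original system.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Binary predicate LOCATION (X, Y): X is located in Y.

    The class and field names are hypothetical, chosen for illustration.
    """
    protein: str   # argument X
    location: str  # argument Y

    def __str__(self) -> str:
        return f"LOCATION ({self.protein}, {self.location})"


# Instantiation for the example sentence about ClC-5.
relation = Location(protein="ClC-5", location="luminal membrane")
print(relation)
```

Because the record is an immutable tuple, instances can be stored in sets and used as dictionary keys, which is convenient when comparing proposed pairs with known ones.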

APPROACH

The proposed bootstrapping algorithm, depicted in Figure 1, builds on the distributional hypothesis, which suggests that words used in the same context have similar meanings (Harris, 1968). In our case, the words are proteins and protein locations, and the context is the lexico-syntactic structure that connects a given protein to a given protein location. In the experiments reported here, the bootstrapping algorithm was provided with a set of known protein-location pairs from the UniProt knowledge base and 16,600 articles comprising 3.89 million sentences from the Genomics TREC collection (Hersh & Voorhees, 2009), which had been processed using the Stanford Dependency Parser (Klein & Manning, 2003, version 1.6.4). The algorithm identifies frequent lexico-syntactic paths that connect each protein-location pair in the sample and then proposes new pairs that share the same lexico-syntactic patterns. The system-proposed proteins (i.e., those that were not in the original set of seed terms) were compared with the entries in UniProt and with online resources.
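The two steps described above (collect frequent paths from seed pairs, then propose new pairs that share those paths) can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: it assumes the dependency parsing has already been reduced to (term, path, term) triples, and the function name `bootstrap_once`, the `min_count` threshold, the path labels and the toy data are all hypothetical.

```python
from collections import Counter


def bootstrap_once(seed_pairs, triples, min_count=2):
    """Run one simplified bootstrapping iteration.

    `triples` stands in for dependency-parsed sentences: each item is
    (left_term, path, right_term), where `path` is a label for the
    lexico-syntactic path that connects the two terms.
    """
    # Step 1: count the paths that connect known protein-location pairs.
    path_counts = Counter(
        path for x, path, y in triples if (x, y) in seed_pairs
    )
    frequent_paths = {p for p, n in path_counts.items() if n >= min_count}

    # Step 2: propose unseen pairs that are linked by a frequent path.
    proposed = Counter(
        (x, y)
        for x, path, y in triples
        if path in frequent_paths and (x, y) not in seed_pairs
    )
    return frequent_paths, proposed


# Toy data for illustration only; the path labels are made up.
seeds = {("BRCA1", "nucleus"), ("Rac1", "cytoplasm")}
triples = [
    ("BRCA1", "accumulate_in", "nucleus"),
    ("Rac1", "accumulate_in", "cytoplasm"),
    ("TTP", "accumulate_in", "nucleus"),
    ("Alix", "localize_in", "cytoplasm"),
    ("BRCA1", "bind", "p53"),
]

frequent_paths, proposed = bootstrap_once(seeds, triples)
print(frequent_paths)
print(dict(proposed))
```

In this toy run only the path shared by both seed pairs passes the threshold, so the one unseen pair connected by that path is proposed. A further iteration would add accepted proposals to the seed set and repeat.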

RESULTS AND CONCLUSIONS

The system identified 792 proteins after the first iteration. All but three of the top 20 proteins were in UniProt, giving a precision of 0.85 (17 of 20) for the top 20. Two of the three false positives were caused by preprocessing errors. With respect to the test set, the system identified all 33 proteins in the first iteration; however, some of the proteins occurred infrequently, which suggests that further iterations of the bootstrapping algorithm should be explored.
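The reported precision follows directly from the counts in the paragraph above. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Counts taken from the text: top 20 proposed proteins, 3 not in UniProt.
top_k = 20
false_positives = 3

true_positives = top_k - false_positives
precision = true_positives / top_k

print(f"precision at top {top_k}: {precision:.2f}")
```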

Figure 1.

The bootstrapping approach used to generate new proteins, subcellular locations and protein-location pairs. Inferred proteins and locations are depicted with a dashed line.

The system identified just over 1,200 new protein-location pairs after the first iteration. Two of the twenty most frequent pairs contained the incorrect proteins mentioned earlier. Thirteen of the remaining eighteen protein-location pairs were in UniProt, and the other five were supported by online sources and by sentences within the TREC collection. These results suggest that the initial seeds from UniProt are well suited to finding new protein locations.

The system identified 493 new locations after the second bootstrapping step. Three of the 20 most frequent locations (ER, er membrane, apical membrane) are included in the definitions in UniProt, but are not listed as synonyms.

The following sentence fragments correspond to the ten most frequent lexico-syntactic patterns (shown in italics). Sentence 5 does not describe a protein-location pair (underlined). Sentence 7 is a valid protein location because the Gene Ontology [GO:0005576] defines secreted as an extracellular region category, even though secreted is not a location in the general sense of the term.

  • 1. Both were found with Rac1 in the cytoplasm and vesicles
  • 2. … some reports concentrated on the structural requirements for the nuclear translocation of STAT3.
  • 3. Loss of functional p53 correlated with nuclear accumulation of BRCA1 in the nucleus of MCF7/E6 cells even after DNA damage.
  • 4. We found that nuclear export of BRCA1 in response to DNA damage occurred only in cells with functional p53.
  • 5. Whereas there is a striking increase in detection of Smad in the nucleus of TGF-beta1 treated cells….
  • 6. The efficiency of nuclear localization of full-length Met30p was equivalent to that of a fusion containing….
  • 7. The secreted PAI-1 has very similar or identical physicochemical and functional properties compared …
  • 8. The cells were then incubated in the presence of cycloheximide (CHX) (20 ng/ml) for the indicated times, and membrane ABCA1 was analyzed by immunoblotting
  • 9. As demonstrated above, mutant TTP and CMG1 proteins in which the NES is disrupted (either by deletion or amino acid substitution) accumulate in the nucleus, …
  • 10. Overexpressed Alix (A) and Alix-NT (B) both localize in the cytoplasm

These results suggest that the bootstrapping algorithm may provide an effective way to leverage both human-curated ontologies and large quantities of text. After only one iteration, the proposed approach identified three locations that were in the UniProt descriptions but missing from the ontology, which demonstrates the urgent need to augment human annotation efforts.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1115774. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
