Application of a computational membrane organization prediction pipeline, MemO, identified putative type II membrane proteins as proteins predicted to encode a single alpha-helical transmembrane domain (TMD) and no signal peptide. MemO was applied to RIKEN's mouse isoform protein set to identify 1436 non-overlapping genomic regions, or transcriptional units (TUs), which encode exclusively type II membrane proteins. Proteins with overlapping predicted InterPro domains and TMDs were reviewed to discard false-positive predictions, resulting in a dataset comprising 1831 transcripts in 1408 TUs. This dataset was used to develop a systematic protocol to document subcellular localization of type II membrane proteins. This approach combines mining of published literature to identify subcellular localization data and a high-throughput, polymerase chain reaction (PCR)-based approach to experimentally characterize subcellular localization. These approaches have provided localization data for 244 and 169 proteins, respectively. Type II membrane proteins are localized to all major organelle compartments; however, some biases were observed towards the early secretory pathway and punctate structures. Collectively, this study reports the subcellular localization of 26% of the defined dataset. All reported localization data are presented in the LOCATE database (http://www.locate.imb.uq.edu.au).
Organelles serve to isolate biological pathways and cellular functions to specific regions of the cell. Each individual organelle contains a characteristic set of resident proteins that carry out organelle-specific functions. In contrast, numerous proteins transiently move through individual organelles where they are not considered to be residents. These include newly synthesized proteins being transported to their site of function, proteins directly involved in protein trafficking and proteins that can be induced to move from one organelle to another in response to a stimulus (e.g. cell surface receptors). The continuous exchange of lipid and proteins between organelles creates a dynamic environment in which the cell must concentrate and maintain resident proteins within individual organelles.
Recent large-scale sequencing of full-length mRNA transcripts in mouse (1–5) and human (1,2,4,6) has resulted in the identification of numerous hypothetical proteins that have no inferred biological function based on homology to other proteins. The task now is to accurately ascribe biological function to these ‘novel’ proteins. A major determinant of a protein's function is its subcellular localization throughout the various organelles of the cell. Traditionally, cell biology has characterized individual proteins to varying depths. This directed approach allows analysis of the function of each protein and yields an array of experimental data that are subject to experimental variations, such as the cell type analyzed, method of protein detection, and type and position of protein tag. This approach provides vital information about the specifics of individual proteins; however, it fails to provide a global understanding of protein sorting because the extent of experimental variation prevents the direct comparison of results (7).
The importance of subcellular localization in determining a novel protein's biological function highlights the need in the genomic era for a rapid, high-throughput assay to determine protein localization. Proteomic approaches have been previously used to attempt to characterize the protein constituents of various organelles [reviewed in (8)]. This methodology has aided in the rapid characterization of the protein complement of numerous organelles but is unable to identify proteins expressed at low levels and is susceptible to various degrees of contamination from other organelles.
Another approach to determining subcellular localizations is the generation of fusion proteins with a detectable protein tag such as green fluorescent protein (GFP). Such high-content assays have already been performed successfully in the yeast models, Saccharomyces cerevisiae (9) and Schizosaccharomyces pombe (10), and in the plant models, Arabidopsis thaliana (11) and Nicotiana benthamiana (12). Within the mammalian context, Simpson et al. (13) have systematically GFP-tagged full-length human open reading frames (ORFs), at both the N- and C-terminus, using the Gateway® cloning system to report the subcellular localization (see http://www.gfp-cdna.embl.de/) of over 900 novel human proteins. That study did not attempt to identify any features, such as the presence of transmembrane domains, signal peptides or targeting motifs, that may be affected by the addition of a tag. Collectively, these approaches allow the direct comparison of experimental results and provide a more global view of protein sorting mechanisms used by cells. Additionally, these systematic methods provide the first functional insight into novel proteins by associating them with specific subcellular compartments.
In contrast to Simpson et al. (2000), this study aims to determine the localization of putative type II membrane proteins present in the mouse proteome. Type II membrane proteins are classified as proteins that encode a single alpha-helical transmembrane domain and that lack an endoplasmic reticulum (ER)-targeting signal peptide at their N-terminus. These proteins are predicted to have a distinct topology in the membrane with their N-terminus in the cytoplasm and their C-terminus in the extracellular or lumenal region (14). By focussing on a specific class of membrane proteins, systematic approaches to the engineering of the epitope-tagged reporter constructs can be implemented to minimize the disruption of protein targeting signals by the addition of a protein tag.
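The classification rule stated above (exactly one predicted alpha-helical TMD and no predicted signal peptide) can be illustrated with a minimal sketch. This is not the MemO implementation; the class and field names are hypothetical, and the predictions themselves are assumed to come from upstream tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedFeatures:
    """Per-protein predictions; field names are hypothetical, not MemO's."""
    tmd_count: int            # number of predicted alpha-helical TMDs
    has_signal_peptide: bool  # predicted N-terminal ER-targeting signal peptide


def is_putative_type_ii(features: PredictedFeatures) -> bool:
    """Apply the stated definition: a single TMD and no signal peptide."""
    return features.tmd_count == 1 and not features.has_signal_peptide


# Illustrative inputs only; these are not proteins from the dataset.
single_tmd_no_sp = PredictedFeatures(tmd_count=1, has_signal_peptide=False)
single_tmd_with_sp = PredictedFeatures(tmd_count=1, has_signal_peptide=True)
```

Under this rule, a protein with one TMD and a signal peptide is excluded, as is any protein with zero or several predicted TMDs.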
During the FANTOM3 annotation (5), distinct categories of transcripts encoding type II membrane proteins were specifically targeted for critical review in the systematic annotation process to discard any false transcripts from the dataset. This involved evaluation of each individual transcript and its computationally predicted protein CDS. First, putative protein CDSs less than 150 amino acids in length were reviewed because these potentially represent predicted CDSs in immature transcripts or truncated transcripts that do not overlap with a CDS with stronger supporting evidence. Second, transcripts with greater than 20% of the CDS covered by DNA repeats, as detected by RepeatMasker (http://www.repeatmasker.org/) (29), were also reviewed as these may represent inaccurate protein open reading frames or retroviral CDSs, which may not be translated. Third, single exon transcripts with 3′ ends near an adjacent A-rich region in the genome were reviewed as potential oligo dT primed artefacts. Such transcripts may be generated as a result of the oligo dT annealing to regions within an immature pre-mRNA transcript or directly to genomic DNA, resulting in the generation of false, truncated cDNAs (5). The predicted CDSs within all individual transcripts with any of the above properties were systematically reviewed by hand in the annotation process to exclude CDSs with limited supporting evidence. Typically, in the evaluation of a putative CDS, the presence of peptide sequences predicted to form domains (InterPro) or protein folds was considered sufficient to support a CDS. In contrast, the prediction of signal peptides or transmembrane domains within a CDS was not considered sufficient support for a CDS in isolation. This is because of the low complexity of these protein features and previous observations that numerous translated DNA repeats can be predicted as signal peptides (30).
The reviewing process identified and discarded numerous transcripts that represented false transcripts resulting in the higher quality type II membrane protein dataset used in this study.
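The three review triggers described above can be sketched as a simple flagging function. This is an illustrative sketch rather than the FANTOM3 annotation code: the `Transcript` fields and reason labels are hypothetical, and only the two thresholds (150 amino acids, 20% repeat coverage) come from the text. Flagged transcripts were reviewed manually, so the function only reports reasons for review and does not discard anything.

```python
from dataclasses import dataclass
from typing import List

MIN_CDS_LENGTH_AA = 150     # CDSs shorter than this were reviewed
MAX_REPEAT_FRACTION = 0.20  # CDSs with more repeat coverage than this were reviewed


@dataclass(frozen=True)
class Transcript:
    """Hypothetical per-transcript summary used for illustration."""
    cds_length_aa: int             # length of the predicted CDS in amino acids
    repeat_fraction: float         # fraction of the CDS covered by DNA repeats
    exon_count: int                # number of exons in the transcript
    three_prime_near_a_rich: bool  # 3' end lies next to a genomic A-rich region


def review_reasons(t: Transcript) -> List[str]:
    """Return the reasons, if any, for sending a transcript to manual review."""
    reasons = []
    if t.cds_length_aa < MIN_CDS_LENGTH_AA:
        reasons.append("short_cds")
    if t.repeat_fraction > MAX_REPEAT_FRACTION:
        reasons.append("repeat_rich_cds")
    if t.exon_count == 1 and t.three_prime_near_a_rich:
        reasons.append("possible_oligo_dt_artefact")
    return reasons
```

A transcript can carry several flags at once; an empty list means none of the three criteria applied.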
To identify what proportion of proteins had subcellular localization information, MGI was searched to identify proteins with associated GO terms. This showed that less than 15% of the dataset is annotated with a cellular component GO term associating the protein with an organelle (31). This study reports the development and validation of a methodology to identify and characterize the subcellular localization of 26% of the type II membrane proteins in the dataset by combining computational predictions with experimental validation. A high-throughput, overlapping PCR-based approach has been used to experimentally determine the subcellular localization of 169 type II membrane proteins. These data have been supplemented with previously published experimental evidence mined from the literature describing the localization of 244 type II membrane proteins. Collectively, this represents localization data for 368 TUs within the entire dataset. This effort represents the first directed, high-throughput approach to determining the subcellular localization of a specific class of membrane protein and has resulted in the development of a pipeline approach to identify putative type II membrane proteins and characterize their subcellular localization. Localization data generated by this directed approach can provide insights into the biological function of novel proteins.
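As a quick check of the reported coverage, the 26% figure can be reproduced from the counts given in the text, assuming the denominator is the 1408 TUs of the reviewed dataset (the helper name is ours, for illustration only).

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


TUS_WITH_LOCALIZATION = 368  # TUs with literature or experimental localization data
TUS_IN_DATASET = 1408        # TUs in the reviewed dataset (assumed denominator)

coverage = percent(TUS_WITH_LOCALIZATION, TUS_IN_DATASET)
print(f"{coverage:.1f}% of TUs have localization data")
```

With these counts the ratio is 368/1408, which rounds to the 26% stated in the text.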
The primary concern when designing the tagging approach used in this study was to ensure that the addition of an epitope tag does not disrupt position-dependent sorting signals that are essential for the correct targeting of a protein. Unlike type I membrane proteins, type II membrane proteins do not encode an N-terminal signal peptide. Furthermore, some type II membrane proteins such as tumour necrosis factor-α (TNF-α) are proteolytically processed to yield a C-terminal domain, which is secreted from the cell (32). Therefore, we epitope-tagged the N-terminus (cytoplasmic face) as described by Suzuki et al. (2001). This approach will, however, disrupt the diarginine motif that is required for ER retention and is located at the N-terminus of some type II membrane proteins (33). Analysis of the dataset identified 60 proteins that encoded such a motif, and this property was considered within the analytical pipeline. Finally, the nine amino acid myc-epitope (EQLISEEDL) was chosen in preference to the 236 amino acid protein, GFP, in order to avoid steric interference; the myc-epitope also allows for rapid characterization of protein localization in fixed samples.
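A screen for N-terminal diarginine motifs, such as the one that identified the 60 proteins above, could look like the following minimal sketch. The exact motif definition used in the study is not given here, so the rule below (two consecutive arginines within the first `window` residues, with a default window of 5) is an assumption made for illustration, and the function name is hypothetical.

```python
def has_n_terminal_diarginine(protein: str, window: int = 5) -> bool:
    """Return True if 'RR' occurs within the first `window` residues.

    Assumption: the motif is two consecutive arginines near the N-terminus;
    the window size is illustrative and not taken from the study.
    """
    return "RR" in protein[:window].upper()


# Made-up example sequences, not proteins from the dataset.
print(has_n_terminal_diarginine("MRRSLAVLLG"))  # RR at positions 2-3
print(has_n_terminal_diarginine("MSLAAVRRLG"))  # RR lies outside the window
```

Proteins flagged this way would be the ones whose ER retention might be affected by an N-terminal tag.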
Proteins exhibiting solely nuclear and/or cytoplasmic localizations were observed in both the experimental (7.10%) and literature (9.84%) datasets. These represent unexpected localizations for type II membrane proteins because they encode a transmembrane alpha helix that traverses the lipid bilayer and should therefore associate with membrane compartments. These nuclear and cytoplasmic localizations may be due to numerous reasons. First, the putative type II dataset will contain a number of false-positive predictions made by MemO. The estimated false-positive prediction error rate is 4.9% for the TMD prediction component of MemO (5, Melissa J. Davis, Fasheng Zhang, Zheng Yuan and Rohan D. Teasdale, manuscript in preparation). Second, the N-terminal tagging system may disrupt integration or translocation of the protein into the membrane. Overall, the proportion of proteins exhibiting nuclear and cytoplasmic localizations provides an estimate of the combined error rate of this study. Alternatively, full-length type II membrane proteins can be proteolytically processed to release a soluble N-terminal polypeptide from a membrane precursor in a process termed regulated intramembrane proteolysis (34), which may result in the generation of a cytoplasmic peptide containing the N-terminus.
A comparison of the literature-based localization data and the experimentally observed localization data identified differences in the distribution of proteins across various cellular compartments. First, a significantly elevated presence of ER-localized proteins (ER-like and nuclear envelope proteins) was observed in the experimental data. This could represent novel ER-localized proteins, misfolded proteins or proteins that are retained within the ER because they represent subunits of larger macromolecular complexes, which are not normally expressed in HeLa cells. Second, an elevated number of proteins that localize to the Golgi apparatus was observed in the literature. This includes a large number of proteins from the glycosyltransferase protein family that have been intensely studied in the literature and therefore were not considered for experimental characterization. Third, a higher proportion of cell surface proteins was also observed within the literature dataset. This may be due to the methodology used to characterize their subcellular localization. Many proteins described in the literature as being localized to the plasma membrane have been analyzed using flow cytometry methods, which do not take into consideration intracellular populations of a protein. Fourth, the experimental dataset displayed an elevated number of proteins found in punctate structures. These punctate structures may represent aggresomes [reviewed in (35)] consisting of misfolded protein caused by the expression of potentially false CDSs. Alternatively, they may be a direct result of the addition of the myc-epitope tag or be due to the expression of the protein in an inappropriate cell type, which can cause the protein to misfold due to the absence of appropriate protein folding and processing machinery.
Clearly, the description of punctate structure is the least informative description as the protein could be localized to numerous organelles including peroxisomes, cytoplasmic vesicles, endosomes, lysosomes, Golgi, subcompartments of the ER or mitochondria. For the majority of the 80 type II membrane proteins with punctate structures, a more definitive description was observed concurrently (24 plasma membrane, 20 reticular, 9 Golgi-like). While the elucidation of the exact nature of the punctate structures will require further co-localization experiments, these proteins have folded sufficiently to be localized to other compartments. Furthermore, 3 of the 27 proteins with only punctate subcellular descriptions have literature data consistent with the observed subcellular localization.
Forty-two proteins with supporting literature evidence for subcellular localization were chosen for experimental characterization using the systematic high-throughput approach used in this pipeline. Thirty-three of these proteins demonstrate agreement between the two localization descriptions. However, nine of these proteins show distinct discrepancies between the two subcellular localizations. Two of these nine discrepancies occurred in proteins that appeared to be trapped in the ER during the experimental assay. Dopamine β-hydroxylase (AAB24330) is reported in the literature to be localized to chromaffin granules in adrenal medullary cells (36) and neurosecretory vesicles of noradrenergic neurons (37,38). It is reported to exist as both a soluble and a membrane-bound protein. Variation in the cleavage of the signal peptide is reported to result in these different localizations, highlighting the importance of the N-terminus (39). The ER-like localization observed in this study is likely to be due to the N-terminal tagging method. This protein represents a potential false-positive type II prediction made by MemO. The second protein, glycoprotein galactosyltransferase α-1,3 (I420001D20), has been demonstrated to localize to the Golgi apparatus; however, it is believed that the cytoplasmic tail of this protein is responsible for its localization (40). Addition of a myc-epitope or FLAG epitope tag to the N-terminus of α2,6-sialyltransferase and N-acetylglucosaminyltransferase I disrupts Golgi localization, resulting in a granular cytoplasmic staining (41), similar to the patchy, reticular ER-like staining observed in this study.
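The agreement between literature and experimental localizations reported above can be expressed as a concordance rate. The short calculation below uses only the counts stated in the text (42 proteins compared, 33 in agreement, 9 discrepant); the variable names are ours.

```python
COMPARED = 42    # proteins with both literature and experimental data
CONCORDANT = 33  # proteins whose two localization descriptions agree
DISCREPANT = 9   # proteins whose two localization descriptions differ

# The two outcome classes should account for every compared protein.
assert CONCORDANT + DISCREPANT == COMPARED

concordance_rate = 100.0 * CONCORDANT / COMPARED
discrepancy_rate = 100.0 * DISCREPANT / COMPARED
print(f"concordant: {concordance_rate:.1f}%, discrepant: {discrepancy_rate:.1f}%")
```

On these counts roughly four in five of the compared proteins agree between the two data sources.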
Six of the nine discrepancies between the literature and the experimental datasets were of proteins reported in the literature to reside at the plasma membrane; however, they were localized to punctate structures by the experimental protocol. This can be attributed to a large proportion of literature evidence reporting subcellular localization using flow cytometry as the only method of detection. In these instances, the relative distribution of individual proteins between the plasma membrane and intracellular compartments is not considered. Indeed, three of the discrepant proteins, B-cell Rag-associated protein (D030054H07) (42), natural killer cell receptor (NKR-P1B) (AAK39099) (43) and killer cell lectin-like receptor, subfamily A, member 2 (AAH64711) (44,45), are all demonstrated to localize to the plasma membrane by flow cytometry. The B-cell Rag-associated protein is also reported to have an uncharacterized intracellular localization (42). Furthermore, all three of these proteins have been studied in specialized immune cells, and it must be noted that different cell types may express these proteins at the cell surface at different levels, retaining internal stores of the protein in storage compartments. Thus, the detection of cell surface proteins via flow cytometry may contribute to the elevated levels of plasma membrane proteins reported in the literature data when compared with the experimental data generated in this study.
Some proteins form macromolecular complexes and require assembly with other subunits within the ER prior to correct targeting to their appropriate destinations (46,47). If complementary subunits are not present in the cell type in which such proteins are expressed, then proteins are likely to exhibit an ER-like localization (quality control) or a punctate localization, which may represent misfolded protein targeted for degradation. The Na+/K+ ATPase transporting β1 polypeptide (2410046B18) demonstrates some ER-like staining accompanied by punctate cytoplasmic staining. It has been demonstrated in the literature that an ortholog of this protein in Xenopus laevis oocytes requires assembly with an α-subunit at the ER before export to the plasma membrane (46). Similarly, potassium voltage-gated channel Isk-related subfamily gene 3 (4922505J16) also forms a macromolecular complex, the potassium channel (48), and demonstrated a punctate localization experimentally. Alternatively, it is also known that cell surface proteins have dynamic expression at the cell surface. This may result in a variable ratio of the amount of protein at the cell surface and in intracellular stores, depending on factors such as cell type and environmental stimuli. Alternatively, these punctate structures may also represent aggresomes, as is the likely case with the OASIS protein (BAA75760), which is reported in the literature to localize to the ER and to translocate to the nucleus upon ER stress, yet exhibits a punctate localization in this experiment.
This pilot study has generated descriptions of the subcellular localization of 368 TUs within the computationally defined set of type II membrane proteins. Numerous mouse proteins identified within the FANTOM projects have no previously reported subcellular localization data in the literature. Further analysis of these highlights examples of proteins that are demonstrated to have been correctly folded and targeted to organelles within the experimental dataset. For example, 18 proteins have plasma membrane-like localizations, six proteins have Golgi-like localizations and nine proteins have mitochondrial-like localization patterns. This information represents the first insight into the biological function of many of these novel proteins, for which no function can be proposed based on predicted InterPro domains or homology to known proteins. Additionally, 37 proteins were localized to ER-like structures, and 36 proteins were localized to punctate, cytoplasmic structures. However, some caution must be exercised when interpreting the localization of these proteins because they may represent misfolded proteins, or proteins that form macromolecular complexes, that are retained in the ER, or misfolded proteins that form cytoplasmic aggregates. Further experimental evidence will be needed to validate the subcellular compartments in which these proteins reside.
In summary, the collective subcellular localization of approximately 26% of the computationally defined set of type II membrane proteins is reported. These data are incorporated within the LOCATE database (http://www.locate.imb.uq.edu.au) (49) with predicted protein features such as structural and functional domains as well as membrane organization predictions. The LOCATE database also provides links to other major databases [e.g. expression data from the GNF mouse GeneAtlas (50)], which, in conjunction with subcellular localization data, can yield insight into biological function. Furthermore, characterization of the subcellular localization of individual proteins enhances our understanding of the protein complement of various organelles and the cellular phenotypes associated with them. Finally, these data also present a training set for the development of computational algorithms to predict subcellular localization.