In this column, I briefly reflect on how automated analysis of the subcellular distribution of proteins (location proteomics) is relevant to the field of cytomics. Definitions of cytomics vary slightly in emphasis. Fundamentally, however, it is the systematic, comprehensive study of at least one cytome, where a cytome is the collection of cell states exhibited by a tissue or organism. We define a cell state as a unique combination of all observable cell behaviors or phenotypes. Different cell types represent different cell states, of course, but the same cell type can exist in more than one state (e.g., activated and quiescent). Clearly, a cell's state is influenced by and reflected in the set of proteins that it expresses.
However, simply knowing how much of a protein is expressed is not sufficient for understanding its contribution to the cell state. It is particularly important to also know its subcellular location, because changes in protein subcellular location can have dramatic effects on cell behavior. Perhaps the most thoroughly studied example of this phenomenon is the set of changes in protein location associated with apoptosis (1). Changes in location within a cell type may also cause or result from disease, as illustrated by the suspected involvement of the Wnt pathway and β-catenin in a number of cancers (2).
Based on the success of the various genome projects, the feasibility and desirability of undertaking projects to study a single aspect of gene or protein structure or function have become accepted. Many such projects have been initiated, including projects to determine or predict all protein structures and to measure gene and protein expression levels in many cell types and under many conditions. However, subcellular location has received less attention than many other aspects of gene and protein behavior. The major exception is in yeast, in which almost all proteins have been assigned to a set of major subcellular structures (3, 4) using fusion of cDNAs with the coding sequence of fluorescent proteins such as the green fluorescent protein. For example, Huh et al. (4) used green fluorescent protein tagging of cDNAs and visual examination to assign proteins to 12 categories: cell periphery, bud, bud neck, cytoskeleton, microtubule, cytoplasm, nucleus, mitochondrion, endoplasmic reticulum, vacuole, vacuolar membrane, and punctate. They then used colocalization with red fluorescent protein markers to divide the cytoskeleton class into two classes, actin cytoskeleton and spindle pole, and to add nine new categories: nucleolus, nuclear periphery, Golgi apparatus, three types of transport vesicles, endosome, peroxisome, and lipid particle. In all, 4,156 proteins were assigned to these 22 categories in their study.
Pilot projects in mammalian cells have also been described. For example, Simpson et al. (5) used cDNA tagging to localize approximately 100 proteins in a human cell line, and Jarvik et al. (6) used a clever genomic-tagging approach (termed CD-tagging) to localize a similar number of proteins in mouse 3T3 cells. As with the yeast studies, analysis was restricted to assignment of proteins to one of a limited number of major locations.
These results, although valuable and illustrative, do not provide location information with sufficient resolution for understanding and modeling cell behavior. The same limited resolution applies to systems that have been designed for predicting subcellular location from protein sequence. Further, there is an implicit assumption in many prediction schemes and curated protein databases that proteins have a single location regardless of cell type or condition. In reality, location is not necessarily the same in different cell types, as illustrated by the differences in subcellular location of viral glycoproteins between cell types that correlate with viral susceptibility (7).
The analysis above demonstrates the need for high-resolution, comprehensive analysis of the subcellular location of proteins in many or all cell types. This demands high-throughput methods for imaging tagged proteins and automated methods for analyzing the resulting images. To meet the latter need, my colleagues and I began applying machine learning methods to subcellular pattern analysis a number of years ago. We initially demonstrated the feasibility of automated classification of subcellular patterns (8) and have extended and refined these results to the point that all major subcellular patterns can be recognized in two- and three-dimensional images of single cultured cells with very high accuracy (9). An important conclusion from this work is that automated classifiers not only can be trained for this task but can also perform better than visual examination (10). More recently, the combination of classification methods with an automated imaging system has been described (11).
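To make the idea of supervised pattern classification concrete, the following is a minimal sketch only. It assumes that each cell image has already been reduced to a short vector of numerical features, and it uses a nearest-centroid rule as a simple stand-in for the classifiers used in the cited work. The two features (fraction of fluorescence overlapping the nucleus, fraction of fluorescence in small objects), the class names, and all numerical values are hypothetical and chosen purely for illustration.

```python
import math
from collections import defaultdict

# Hypothetical training set: one feature vector per cell image.
# Feature 0: fraction of fluorescence overlapping the nucleus (assumed).
# Feature 1: fraction of fluorescence in small objects (assumed).
TRAIN_FEATURES = [
    [0.92, 0.05], [0.88, 0.10],   # labeled "nuclear"
    [0.08, 0.85], [0.12, 0.90],   # labeled "punctate"
    [0.15, 0.10], [0.20, 0.15],   # labeled "cytoplasmic"
]
TRAIN_LABELS = [
    "nuclear", "nuclear",
    "punctate", "punctate",
    "cytoplasmic", "cytoplasmic",
]


def train_centroids(features, labels):
    """Return {label: mean feature vector} computed from the training set."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def classify(vec, centroids):
    """Assign vec to the trained class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


centroids = train_centroids(TRAIN_FEATURES, TRAIN_LABELS)
print(classify([0.85, 0.12], centroids))
print(classify([0.10, 0.80], centroids))
print(classify([0.25, 0.20], centroids))
```

Note that `classify` can only ever return one of the labels present in the training set, whatever pattern the new image actually shows.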
Although automated, these classification approaches still have the same limitation as the visual and prediction approaches: they can recognize only the major patterns on which they have been trained. An important alternative, therefore, is to use unsupervised machine learning (cluster analysis) to group proteins by their high-resolution patterns. We have coined the term “location proteomics” (12) to describe the combination of large-scale protein tagging, high-resolution imaging, and clustering by subcellular pattern. The most extensive results of this type described to date are for 90 tagged 3T3 clones that were demonstrated to contain 17 distinct location patterns (Fig. 1) (13). A similar clustering approach has been taken to group drugs by their effects on subcellular patterns (14).
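The clustering idea can be sketched in a few lines, with the caveat that this is an illustration and not the procedure used in (13). The sketch assumes that each tagged clone is summarized by one feature vector, and it applies average-linkage agglomerative clustering with a distance cutoff, so that the number of groups is an output rather than an input. The clone names, feature values, and cutoff are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical per-clone feature vectors (illustrative values only).
CLONE_NAMES = ["clone_A", "clone_B", "clone_C", "clone_D", "clone_E"]
CLONE_FEATURES = [
    [0.90, 0.05],
    [0.88, 0.08],
    [0.10, 0.85],
    [0.12, 0.90],
    [0.50, 0.50],
]


def average_linkage(a, b, vectors):
    """Mean pairwise Euclidean distance between two clusters of indices."""
    total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def cluster_by_pattern(vectors, cutoff):
    """Merge the two closest clusters until no pair is within `cutoff`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        (a, b), dist = min(
            (((x, y), average_linkage(clusters[x], clusters[y], vectors))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > cutoff:
            break
        # combinations() guarantees a < b, so deleting b leaves a in place.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


groups = cluster_by_pattern(CLONE_FEATURES, cutoff=0.2)
for group in groups:
    print([CLONE_NAMES[i] for i in group])
```

With these toy values, no class labels are supplied at any point; clones are grouped only by the similarity of their feature vectors, and a clone unlike all others remains in a group of its own.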
In addition to being critical for bottom-up systems biology efforts to model cell behavior, the information that will become available from location proteomics over the next decade can provide important clues about which proteins reflect abnormal cell states. These proteins can then be used with the same automated pattern analysis methods to detect disease or monitor therapy.