Systematic investigation of protein–small molecule interactions



Cell signaling is extensively wired between cellular components to sustain cell proliferation, differentiation, and adaptation. This interaction network is often manifested in how protein function is regulated through interaction with other cellular components, including small-molecule metabolites. While many biochemical interactions have been established as reactions between enzymes and their substrates and products, much less is known at the system level about how small metabolites regulate protein functions through allosteric binding. Over the past decade, the study of protein–small molecule interactions has lagged behind that of other types of interactions. Recent technological advances have produced several high-throughput platforms that reveal many “unexpected” protein–small molecule interactions, which could have a profound impact on our understanding of cell signaling. These interactions will help bridge gaps in existing regulatory loops of cell signaling and serve as new targets for medical intervention. In this review, we summarize recent advances in the systematic investigation of protein–metabolite/small molecule interactions and discuss the potential impact of such studies on both biological research and medicine. © 2012 IUBMB Life, 65(1):2–8, 2013


From a classic biochemical view, most, if not all, biological systems operate through specific complements of protein functions. Understanding how thousands of protein functions are orchestrated to sustain life has been a great challenge but is of vast importance in life science research. Recent breakthroughs in technology, such as DNA and protein microarrays, deep sequencing, and mass spectrometry, have allowed investigation of global changes in many biological processes between and within the various phases of the cell cycle. Extensive function networks have been built to decipher the operating mechanisms as interactions and associations between proteins and various molecular species, which could range from 3.5 to 43.7 million possible interactions in yeast alone (Fig. 1A), and presumably many more in higher organisms with more complex growth and development plans. These include protein–DNA interactions that regulate gene expression, mapped in the human and model organism Encyclopedia of DNA Elements projects (1, 2); protein–RNA interactions that regulate transcription and RNA processing (3); and protein–protein interactions in cellular response and development pathways, mapped using microarrays (4), immunoprecipitation-coupled mass spectrometry (5), or informatics-based predictions (6). These networks have provided unprecedented insights into biology at the molecular level, such as how the genetic machinery programs individual development and responds to stimuli by producing proteins with temporal and spatial accuracy at the cellular or organism level.

Figure 1.

Molecular interactions between proteins and different types of biomolecules as the chemical foundation for all biological processes. A: All complexes are from PDB protein structure entries. Protein is shown in gray except in the protein–protein complex. DNA, RNA, and small molecules (cGMP) are shown in purple, green, and red, respectively. Structures were processed in Jmol v13.0.2. The numbers under the interaction types are estimates of the maximum theoretical number of interactions in yeast, based on data from yeast genome and metabolite databases. B: Summary of systematic research on each interaction group published over the last 10 years. Inset shows the number of all publications on each type of molecular interaction over the same period. The same line color indicates the same type of interaction in both plots.

However, while the other three types of interactions have been well documented over the past decade, the study of protein–small molecule interactions still lags far behind in both total and systematic research, as measured by the number of publications (Fig. 1B). Progress on the other three types can be attributed to extensive genome sequencing projects and technological breakthroughs such as mass spectrometry-based proteomics, microarrays, and next-generation sequencing. Thanks to the revolutionary progress in functional genomics, we now have a much better idea of how many proteins are produced in an organism, but we still know less about how many metabolites are used, and even less about how proteins interact with metabolites. Given that protein–small molecule interactions form the second largest group of all theoretical molecular interactions (Fig. 1A) and that relevant research is relatively scarce (Fig. 1B), a natural question is: how many such interactions are of biological consequence but still unknown?

In this review, we use “small molecules” for a group of molecules that include both natural compounds found in biological subjects (metabolites) and artificially introduced chemicals (such as drugs). We also use metabolome for the sum of natural molecules as a systems biology term, and metabolomics for the study of metabolome.

Small Molecules and Protein Functions

Small molecules in the cell not only serve as substrates and products in biochemical reactions that are fundamental to life, they can also regulate protein functions as hormones and ligands through allosteric binding. To name a few, operon-triggering nutrients, plant hormones, cyclic nucleotides, steroid hormones and vitamins are all small molecule metabolites with important regulatory roles.

What about other small molecules? On one hand, many cellular metabolites are structurally similar to known small-molecule regulators because they are just a few biochemical steps apart. This may be inferred from known small molecules that are co-crystallized with proteins (Fig. 2). Of the 77,958 proteins in the Protein Data Bank (PDB), 18.9% have at least one small molecule in their experimentally resolved structures. Of the 14,770 small molecules in PDB protein structures, the majority overlap with the 8,570 known human metabolites (based on the Human Metabolome Database) in a region that spans 40 to 1,000 Da and mild to intermediate hydrophobicity (logP values between 2 and 10). It is not known whether metabolites of larger size or unusual hydrophobicity/hydrophilicity, shown as the non-overlapping regions of black dots in Fig. 2C, have real interactions with proteins. Protein sampling bias in crystallography and limitations in laboratory techniques may both contribute to the apparent lack of large metabolites and of metabolites with extreme chemical properties among protein-bound compounds. However, the distribution in size and chemical property clearly aligns most artificial PDB small molecules with natural human metabolites, suggesting that a considerable number of natural metabolites may bind proteins under physiological and pathological conditions.
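The overlay described above amounts to asking what fraction of a compound set falls inside a window of molecular weight and logP. A minimal sketch of that check follows; the `Compound` record, the function names, and the toy values are hypothetical illustrations and are not taken from PDB or the Human Metabolome Database. Only the window bounds (40 to 1,000 Da, logP 2 to 10) come from the text.

```python
from dataclasses import dataclass

# Window quoted in the text: 40-1,000 Da and logP between 2 and 10.
MW_RANGE = (40.0, 1000.0)
LOGP_RANGE = (2.0, 10.0)


@dataclass(frozen=True)
class Compound:
    """Hypothetical record for one small molecule."""
    name: str
    mol_weight: float  # molecular weight in Da
    logp: float        # calculated logP (hydrophobicity)


def in_window(c, mw_range=MW_RANGE, logp_range=LOGP_RANGE):
    """Return True if the compound lies inside the size/hydrophobicity window."""
    return (mw_range[0] <= c.mol_weight <= mw_range[1]
            and logp_range[0] <= c.logp <= logp_range[1])


def fraction_in_window(compounds):
    """Fraction of compounds inside the window (0.0 for an empty list)."""
    if not compounds:
        return 0.0
    return sum(in_window(c) for c in compounds) / len(compounds)


# Toy records for illustration only; these are not database values.
toy_compounds = [
    Compound("compound_a", 350.0, 3.1),
    Compound("compound_b", 1500.0, 12.0),
    Compound("compound_c", 180.0, 4.5),
    Compound("compound_d", 30.0, 1.0),
]

print(fraction_in_window(toy_compounds))
```

Applied separately to a protein-bound set and a metabolite set, the same function would give a rough numerical counterpart to the visual overlap in Fig. 2C.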

Figure 2.

Overview of natural metabolites and known protein-bound small compounds. Each dot represents a small molecule whose molecular weight (Da, y-axis) is plotted against its calculated logP value, which stands for its chemical properties (hydrophobicity). A: All known protein-bound small compounds (in red) as reported in protein structures from PDB as of August 2012. B: Documented natural metabolites (in black) known in humans as of August 2012. C: Overlay of A and B to show that most known protein-bound compounds are limited to the center region. The numbers of compounds in each database are also indicated.

On the other hand, small molecules dispersed around a protein may bind it and alter its biological functions, such as enzyme activity and protein interactivity. These interactions provide an important means of allosteric regulation of enzymes and receptors. While small molecules may be closely related biochemically to their enzyme targets, the small molecules in signaling events can be of little biochemical relevance to their protein targets. Such “unexpected” molecular interactions between proteins and small molecules have been implicated in many important signaling events in the past, and are still bringing novel insights into cellular regulation and signaling. Aside from cyclic nucleotides (e.g., cAMP and cGMP) and inositol phosphates in some classic pathway models, recent discoveries include phosphoinositols binding protein kinases and even glycolytic enzymes (7), sterol regulation of protein kinases in yeast (8), and sphingolipid regulation of an actin-organizing protein (9). Such interactions might not have been easily captured without a systematic or “unbiased” approach. As 20% of yeast proteins were estimated to bind at least one metabolite (8), more systematic scrutiny is still needed to unveil many more such molecular interactions and their underlying regulatory mechanisms. Molecular interactions at this level can then be integrated into global interaction networks to understand how each biological process is executed at the molecular level.

Recent Advances and Methodological Challenges in Protein–Small Molecule Interactions

Systematic study of protein–small molecule interactions spans both proteomic and metabolomic approaches. While proteomics has reached a mature stage after the implementation of functional genomics, mass spectrometry, and microarrays, metabolomics is still at an early stage in both technology and knowledge base development. In recent years, significant progress has been made in the systematic investigation of interactions between proteins and small molecules owing to advances in technology, especially in small molecule detection approaches (for a method review see Ref. 10).

In the first system-scale study, a microarray of 5,800 yeast proteins (>90% of the proteome) was probed with six biotinylated phosphoinositol (PI) liposomes (7). A total of 150 proteins were found to bind PI with varied interaction specificity and strength, 98 of which have known functions. Of these 98, only 45 (46%) are known membrane-associated proteins, suggesting that the majority of protein–PI interactions discovered in this study do not occur on membranes. Interestingly, the unexpected proteins include protein kinases and glycolytic enzymes, suggesting extensive interplay between PIs and various proteins to regulate cellular signaling and metabolism.

Later, a small-scale study used a simple mixture of 12 unlabeled common metabolites to probe a dot array of 10 proteins (mostly enzymes) (11). The majority of known protein–metabolite interactions were scored in this study. More interesting, however, is the unexpected “cross-reactivity,” where a metabolite or protein interacts with multiple proteins or metabolites, such as bovine hemoglobin with AMP, ADP, and triacetylchitotriose, or NADH with yeast alcohol dehydrogenase and hen egg white lysozyme. This approach was further developed in another study, where a series of lipid-binding proteins (LBPs) were used in a similarly designed experiment (12). Proteins were immobilized through the affinity of a GST tag for a glutathione-coated solid surface, lipid metabolites were added as a mixture, and bound molecules were detected by mass spectrometry. Several expected interactions were scored as enrichment of common lipid molecules, such as retinoic acid and oleic acid, by LBPs. It was also found that a mixture of metabolites has an advantage over a pure compound in such studies because the introduction of potential competing molecules may help eliminate non-specific binding (12).

A slightly different method used a dialysis-based assay to focus on low-affinity interactions (13). Low-affinity interactions, such as those occurring at low micromolar to millimolar levels of metabolites, were detected with this method, probably because the liquid-phase interaction conditions help eliminate some of the background noise common in solid-surface interaction environments. For the five enzymes tested in this study, 29 interactions were scored by an enrichment of twofold or above, including 13 novel interactions that also showed cross-reactivity.
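Scoring by an enrichment threshold, as in the twofold cutoff mentioned above, can be sketched as follows. This is an illustrative reading rather than the procedure of Ref. 13: the function names, the pseudocount, and the toy signal values are assumptions introduced here.

```python
def enrichment(bound_signal, control_signal, pseudocount=1.0):
    """Ratio of metabolite signal with protein to signal without protein.

    The pseudocount is an assumption added here to avoid division by zero;
    it is not described in the source.
    """
    return (bound_signal + pseudocount) / (control_signal + pseudocount)


def score_interactions(measurements, threshold=2.0):
    """Keep (protein, metabolite) pairs whose enrichment meets the threshold.

    measurements maps (protein, metabolite) -> (bound_signal, control_signal).
    """
    hits = {}
    for pair, (bound, control) in measurements.items():
        ratio = enrichment(bound, control)
        if ratio >= threshold:
            hits[pair] = ratio
    return hits


# Toy signals for illustration only; names and numbers are hypothetical.
toy_measurements = {
    ("enzyme_1", "metabolite_x"): (399.0, 99.0),  # ratio 4.0, scored
    ("enzyme_1", "metabolite_y"): (119.0, 99.0),  # ratio 1.2, not scored
    ("enzyme_2", "metabolite_x"): (199.0, 99.0),  # ratio 2.0, scored
}

toy_hits = score_interactions(toy_measurements)
for pair, ratio in sorted(toy_hits.items()):
    print(pair, ratio)
```

In this toy example, `metabolite_x` is scored against two different enzymes, which is the kind of pattern the text calls cross-reactivity.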

The array idea also branched into a reciprocal probing approach, where metabolite arrays were probed with tagged proteins (9). Lipid binding was explored for 172 yeast proteins on arrays printed with 56 lipid metabolites. The majority of the 530 interactions scored in this study are novel, greatly expanding our knowledge of lipid–protein interactions. Because of the great challenge of profiling lipids in an untargeted way, this is probably the most efficient way to study lipid molecule interactions at present.

A different strategy was used to profile in vivo protein–metabolite interactions in yeast (8). By coupling affinity purification of proteins with mass spectrometry, both protein kinases and ergosterol synthesizing enzymes (124) were assayed for their interactions with metabolites. Because the interaction complexes were purified and detected as is, this assay had the advantage of scoring in vivo interactions for the first time. Both novel and known natural non-enzymatic interactions were scored in this study, and some were validated with functional assays. Not surprisingly, a degenerate interaction pattern was also reported, such as that between ergosterol and a group of protein kinases, suggesting that extensive protein–metabolite interactions may exist to directly link metabolism and protein regulation on a global scale.

Other promising technologies include nuclear magnetic resonance (NMR) and surface plasmon resonance (SPR), both of which are powerful tools for screening protein–ligand interactions and elucidating interaction dynamics (10). NMR is widely used for elucidating molecular structures. However, both its relatively low sensitivity and laborious sample handling have limited the use of NMR in interaction studies that involve complex natural samples and expensive biomolecules. Among several NMR-based screening methods, saturation transfer difference (STD)-NMR was developed to improve sensitivity in a complex background and to simplify sample handling (14). STD-NMR not only determines the fraction of protein-bound ligands, but also maps which part of the ligand molecule makes contact with the protein. This atomic resolution provides a great deal of structural detail about the binding vicinity. In addition, both strong and weak interactions may be detected by STD-NMR, owing to its capability to monitor the interaction in real time (15). For this reason, the use of STD-NMR in ligand interaction studies is rapidly expanding to various biological subjects beyond proteins (16).

SPR determines mass changes by measuring resonance energy on the surface of immobilized molecules as molecular interactions occur in real time. Traditional SPR has been limited in studying protein–small molecule interactions because the mass difference upon binding is minor relative to the size of the protein. This limitation may be overcome by a recently developed method that measures the conformational change of the immobilized protein upon binding of small molecules. Conformational change-based SPR has been used to study interaction dynamics when a protein binds to sugars, lipids, and even inorganic ions (17–19). Concentrations of small molecules ranged from nanomolar to millimolar in these studies, suggesting that interactions with a wide range of binding strengths may be studied with SPR.

Despite the recent advances, the study of protein–small molecule interactions faces great methodological challenges. In several studies, large-scale protein purification was used to fabricate arrays for probing with isotope- or fluorophore-labeled metabolites as well as label-free metabolites (7, 9, 11–13). This platform has allowed fast screening of hundreds to thousands of proteins that may interact with the small molecule probes one at a time. The binding coefficients can be deduced biochemically by altering the protein input in binding reactions. Though it can be fully exploited as a system-level approach from the proteomics perspective, this strategy may intrinsically rely on prior knowledge in choosing small molecule probes. It also assumes that protein–small molecule interactions are not disrupted by the chemical modification introduced in labeling, although in several cases the tag molecule is similar in size to the tagged molecule. This approach may also be blind to interactions that require more than one protein in vivo, such as the binding of GTP to small GTPases, which requires Guanine Nucleotide Exchange Factors to dissociate GDP from the GTPase (20). Another strategy focuses on in vivo interactions by affinity-purifying metabolite–protein complexes directly from cells for non-targeted identification of the bound metabolites in later steps (8). This platform in theory has ideal coverage of all possible interactions, but is limited by the available protein purification techniques, which must also be compatible with metabolite detection technology.
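One way to read the remark that binding coefficients can be deduced by altering protein input is as a titration fitted to a binding model. The sketch below assumes a simple single-site model with protein in excess over the small molecule, so that the fraction of small molecule bound is [P] / (Kd + [P]). The model, the function names, and the synthetic data are assumptions made here for illustration and are not drawn from the cited studies.

```python
def fraction_bound(protein_conc, kd):
    """Single-site model (assumed): fraction of small molecule bound."""
    return protein_conc / (kd + protein_conc)


def estimate_kd(protein_concs, fractions):
    """Estimate Kd by least squares on the linearized model.

    Rearranging f = P / (Kd + P) gives (1/f - 1) = Kd * (1/P),
    a straight line through the origin with slope Kd.
    """
    xs = [1.0 / p for p in protein_concs]
    ys = [1.0 / f - 1.0 for f in fractions]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Synthetic, noise-free titration generated from the model itself
# (hypothetical concentrations in micromolar, hypothetical Kd of 5).
toy_concs = [1.0, 2.0, 5.0, 10.0, 50.0]
toy_fractions = [fraction_bound(p, kd=5.0) for p in toy_concs]

toy_kd = estimate_kd(toy_concs, toy_fractions)
print(round(toy_kd, 6))
```

The linearized fit weights low-concentration points heavily, so with noisy measurements a direct nonlinear fit would likely be preferable; the linear form is used here only to keep the sketch short.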

Nevertheless, a key issue in both strategies is the efficiency and sensitivity of small molecule measurement technology. The technology must be able to detect sub-picomole amounts of unknown small molecules with diverse chemical properties and a wide range of abundance, a level of chemical complexity that is typical of most biological samples. At present, mass spectrometry may become the technology of choice in this field for its potential in handling the complex samples encountered in most biological research (21). In practice, mass spectrometry is often coupled with advanced front-end separation technologies such as liquid/gas chromatography and capillary electrophoresis.

Although international efforts have started to build experimental and knowledge reference databases for data interpretation (22, 23), extensive development of analytical technology is still much needed to improve the coverage of the metabolome against a complex chemical background. Informatics tools are also needed to process and share the huge amount of data from various instrument platforms.

Protein–Small Molecule Interactions in Global Interactomics

Integrating protein–small molecule interactions helps construct complete molecular interaction networks that cover all cellular molecules. In the cell, almost every signaling event involves a transition of protein function states in activity, stability, interactivity, and localization, which is often achieved through molecular interactions. In recent years, extensive interaction networks have been built to describe how each protein binds DNA, RNA, and protein (1–4, 6, 7). These networks have greatly expanded our knowledge of the wiring of inheritable molecules at the system level, and have also materialized, in the form of physical interactions, many signaling pathways previously known only as associations. The systemic precision and accuracy brought by these studies have not only greatly enhanced our ability to describe operating mechanisms and predict outcomes, but also revealed novel molecular entities for potential application in medicine and agriculture (24–27).

Small molecule–protein interaction is unique in its partial independence of genetics. Unlike the genome, transcriptome, or proteome, the metabolome is not strictly passed down through genetics. In other words, despite the common biochemical landscape between generations of the same species, metabolomic differences may be qualitatively or quantitatively introduced and dealt with in offspring organisms due to environmental changes, which ultimately determine fitness, adaptation, and survival. For example, galactose, tryptophan, and sterol can each switch specific sensor protein functions on or off through interactions, eliciting extensive reprogramming of metabolism in adaptive responses to available nutrients (28). Given the potentially huge number of such interactions (Fig. 1A), a layer of underlying regulatory mechanisms, apart from a handful of well-known examples, still awaits close scrutiny at the system level in the era of genomics.

The improvement in predictive power that a global interactomic map gains from incorporating small molecule–protein interactions can be profoundly important for medical applications (29). The complex nature of global interaction networks has at times embarrassed the pharmaceutical industry with undesired side effects, even after billions of dollars were spent on a government-approved drug. The switch from a “focused” research paradigm, which examines only several isolated nodes or linear pathways at a time, to a global map of molecular actions might allow one biological function to be targeted while causing minimal perturbation to the system. This could bring a huge benefit to the pharmaceutical industry in several respects: (1) it helps predict side effects based on small molecules that bind both the target protein and other proteins; (2) it may be used to enhance the efficacy of a drug by developing a cocktail medicine that targets pathways or networks instead of individual molecules; (3) in an extreme but likely case, it is even possible to develop a nutrition-based therapy or preventive measure for a patient whose genetic predisposition has been revealed by personalized omic profiling.
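Point (1) above, predicting side effects from small molecules shared between the target protein and other proteins, can be illustrated with a simple set-overlap ranking. The sketch below uses the Jaccard index as the overlap measure; that choice, the function names, and the toy binding map are assumptions made for illustration and do not represent a method from the cited work.

```python
def shared_ligand_score(ligands_a, ligands_b):
    """Jaccard index of two ligand sets (assumed overlap measure)."""
    a, b = set(ligands_a), set(ligands_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_off_targets(target, binding_map):
    """Rank other proteins by how many binding small molecules they share
    with the target; proteins with no shared ligand are dropped."""
    target_ligands = binding_map[target]
    scores = [
        (other, shared_ligand_score(target_ligands, ligands))
        for other, ligands in binding_map.items()
        if other != target
    ]
    scores = [s for s in scores if s[1] > 0]
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda s: (-s[1], s[0]))


# Hypothetical protein -> bound small molecule map, for illustration only.
toy_binding_map = {
    "target_kinase": {"m1", "m2", "m3"},
    "protein_b": {"m2", "m3"},
    "protein_c": {"m3", "m9"},
    "protein_d": {"m7"},
}

toy_ranking = rank_off_targets("target_kinase", toy_binding_map)
for name, score in toy_ranking:
    print(name, round(score, 3))
```

A real map would also need binding strengths and cellular context, which a plain set overlap ignores.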

Despite its great potential, a global interaction map may not be of any practical use unless it considers spatial and temporal dynamics in a quantitative way (30, 31). Even with current revolutionary breakthroughs in both experimental and computational technologies, building such a map will be nothing less than a painstaking endeavor that entails multidisciplinary efforts at a scale far larger than today's functional genomics studies.

Protein–Small Molecule Interactions in Next-Generation Medicine

Next generation medicine is largely personalized medicine. Traditional health management and clinical diagnosis rely mainly on each individual's clinical symptoms and signs, medical and family history, and laboratory examinations and imaging evaluation. This is often a reactive approach to treatment. For example, health management or disease treatment does not start until the signs and symptoms appear. The earlier the diagnosis is made, the more likely the disease can be cured. With recent technological advances, it is now possible to analyze thousands of molecular constituents at the same time and make the medical practice more efficient in both diagnosis and treatment based on individualized clinical information.

Next-generation genome sequencing, for one, has enabled a more detailed understanding of the impact of genetics on human diseases through high-throughput scanning of disease-associated genetic variants. This technology in its mature form will guide molecular interaction studies on the way to building a global interaction map. Disease-associated pathways will be given priority. In the past few years, genome-wide association (GWA) studies have revealed that single nucleotide polymorphisms (SNPs) could account for individual variation and also for the risk of human diseases. In 2005, a study identified an SNP associated with age-related macular degeneration (AMD), a major cause of blindness in people over 60 (32). GWA studies have also shown good potential to help elucidate the pathophysiology of disease progression and treatment. The identification of genetic variants associated with the cellular response to anti-hepatitis C virus (HCV) treatment is a good example. For the combined treatment of genotype 1 HCV infection with pegylated interferon-alpha-2a or pegylated interferon-alpha-2b and ribavirin, a GWA study has shown that SNPs near the human IL28B gene, which encodes interferon lambda 3, are associated with significant differences in response to the treatment (33). Later, the same genetic variants were found to be associated with the natural clearance of genotype 1 HCV as well (34). Furthermore, several other studies have examined the use of risk-SNP markers as a means of directly improving the accuracy of prognosis (35). This improvement is still immature, as only minor benefits were reported by another study (36). Nevertheless, future GWA studies in medicine will help to identify more SNPs associated with complex, common diseases, as well as variations that might affect a person's response to certain drugs and overall fitness. Interaction networks will then add molecular details for an in-depth understanding of disease biology and of possible preventative and therapeutic strategies.

Human health depends on both individual genetics and environmental conditions such as nutrition. Diseases are caused not just by genetic defects but also by malnutrition and unhealthy living habits. As a matter of fact, genetic and environmental factors can compensate for or aggravate each other in disease inception and progression, which could be exploited in preventative and therapeutic medicine (37). How the environment affects human health is not well understood at the molecular level, but evidence has emerged in recent GWA studies. For instance, monozygous twins share a common genotype at birth and are epigenetically indistinguishable during their early years of life. However, older monozygous twins exhibit remarkable differences in phenotype and in the genomic distribution of DNA methylation and histone acetylation, which may affect their gene-expression portrait and susceptibility to disease (38). The impact of environmental factors, such as diet, smoking, and physical activity, on phenotype discordance and disease pathology is thought to be partly mediated by epigenetics, a vague term for non-genomic heredity that may involve inheritance of molecular interaction modules. Important epigenetic mechanisms include DNA methylation and post-translational histone modification (including methylation, acetylation, ubiquitylation, and phosphorylation). These modifications are controlled by specific enzymes, which are often mis-expressed in various human diseases, including cancer.

Epigenetic study can also help identify potential therapeutic targets, formulate diagnostic methods, and develop new treatments. Whole-transcriptome analysis is a powerful way to investigate the entire epigenome. With DNA microarrays, distinct molecular subtypes of glioblastoma have been defined as critical in disease stratification, discovery, and assessment of treatment strategy (39). Current epigenetic frameworks combine global DNA methylation, RNA-Seq, and ChIP-Seq analyses to help predict disease and improve the understanding of disease mechanisms (40). Although not yet used in widespread clinical practice, some epigenetic markers, such as methylated DNA sequences, may prove useful for monitoring and guiding cancer therapy. Despite various observational reports, the mechanistic pathways by which environmental factors affect epigenotypes are still poorly understood (40). Possible mechanisms include the substrate-dependent regulation of epigenetic enzymes and the allosteric regulation of signaling pathways that results in specific chromatin changes, such as dietary intake of methyl donors for DNA or histone methyltransferases and resveratrol/valproate as modulators of histone acetyltransferase (41, 42). The increasing efforts to decipher protein–small molecule interactions at the system level will lay a conceptual framework that provides a global view of complex epigenetic regulation under various environmental inputs. The complete global interactome, including protein–small molecule interactions, will be crucial to eventually disentangle the complex relationship between genotype and phenotype and provide a much deeper understanding of health and disease (43, 44).

In the long run, a prevention-based health care model can be proposed for next generation medicine (Fig. 3). In this model, physicians will have each individual's genomic and epigenomic information from his/her early years of life on record. The individual will have his/her integrative personal omics profile (iPOP) data and nurture data collected and incorporated into the same record on a regular basis. Advice on health and disease management will be based on the combination of this information and a well-established global physiological and pathological interaction network. The integration of all omics information derived from genomics, transcriptomics, proteomics, and metabolomics, together with the knowledge derived from internal function networks built between host protein–DNA, protein–RNA, protein–protein, and protein–small molecule interactions and external function networks built between host and pathogen, will greatly advance our ability to tackle individual susceptibility and personal pathogenesis. It will greatly improve our capability in early diagnosis by defining disease subgroups, ultimately paving the road toward improved and personalized treatment of these diseases (Fig. 3). Furthermore, modification of inherited susceptibility by altering environmental exposures, such as through lifestyle changes or drug use, is likely to become an accepted part of future public-health and clinical practice.

Figure 3.

A scheme for the application of molecular interactions in health and disease management. The scheme describes a prevention-based health care model for next generation medicine. Initially, each person's genomic and epigenomic information is archived in the early years of life, which serves as the reference for personalized health care in later stages. The person will routinely have his/her iPOP data and nurture data collected and incorporated into his/her own record. Advice on health and disease management will be based on the combination of personal information and the established global interaction network. Once a pathological interaction is observed, preventative or therapeutic intervention, such as lifestyle changes and/or drug prescription, will be formulated to treat individual health problems based on individualized health information datasets.


This work was supported by research grants from the NIH and NSF to MPS. The authors declare no conflict of interest.