The challenge of constructing, classifying, and representing metabolic pathways


Correspondence: Ron Caspi, Bioinformatics Research Group, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA. Tel.: +1 650 859 5323; fax: +1 650 859 3136; e-mail: ron.caspi@sri.com

Abstract

Scientists, educators, and students benefit from having free and centralized access to the wealth of metabolic information that has been gathered over the decades. Curators of the MetaCyc database work to present this information in an easily understandable pathway-based framework. MetaCyc is used not only as an encyclopedic resource for metabolic information but also as a template for the pathway prediction software that generates pathway/genome databases for thousands of organisms with sequenced genomes (available at www.biocyc.org). Curators need to define pathway boundaries and classify pathways within a broader pathway ontology to maximize the utility of the pathways to both users and the pathway prediction software. These seemingly simple tasks pose several challenges. This review describes these challenges as well as the criteria that need to be considered, and the rules that have been developed by MetaCyc curators as they make decisions regarding the representation and classification of metabolic pathway information in MetaCyc. The functional consequences of these decisions in regard to pathway prediction in new species are also discussed.

Introduction

The accumulated knowledge of the metabolic processes employed by living organisms, including their metabolic enzymes and pathways, spans many decades of research. Individuals trying to navigate through this vast wealth of knowledge looking for specific information or seeking to perform broad analyses may be stymied when data are scattered broadly, presented without a relevant biological context, described using alternative compound or gene names, or locked up in out-of-print books, in manuscripts that are often not easily accessible, or in other hard-to-reach resources. Therefore, researchers, metabolic engineers, teachers, and students can benefit when this knowledge is presented in an easily and freely accessible, highly integrated manner. The importance of such resources has become even greater with the genomic revolution, which enables us to project computationally knowledge obtained from one organism to thousands of organisms with sequenced and annotated genomes. However, these new uses for the data present new challenges and require the development of new tools. This review describes some of the challenges that we face while curating and categorizing metabolic pathways in the MetaCyc database (Caspi et al., 2012) and while predicting the presence of these pathways in the various organisms that make up the BioCyc collection of pathway/genome databases. We summarize the guidelines and solutions we have developed to deal with these challenges.

The common definition of metabolic pathways appears misleadingly straightforward. A well-accepted definition describes a metabolic pathway as a series of enzyme-catalyzed chemical reactions occurring within an organism, in which a principal chemical is modified. Most people with some background in biology will recall some well-defined key pathways of central metabolism, such as glycolysis and the citric acid cycle. However, a close inspection of pathways described by different sources, such as the biomedical literature, textbooks, and online databases, quickly reveals that uniformity in pathway description is limited. After all, the metabolic network inside a living cell is very complex, and a pathway is a somewhat abstract concept – a simplification showing a very small subset of that network, intended to make it easier for us to focus on that part. It is up to an investigator or a curator attempting to describe a pathway to decide which network reactions or interactions should be included in the pathway and which should be omitted. Similarly, when trying to classify pathways into meaningful categories such as biosynthetic or degradative pathways, often there can be differences of opinion on the proper categorization(s) depending on which of the principal chemical(s) are valued or, at least, well recognized by the intended audience. The results of the decisions made regarding these issues affect pathway database contents, ontology design, and ease of use for different audiences. Moreover, they affect the computational inference of metabolic pathways. As a result, curators of the MetaCyc pathway database need to continually grapple with the definitions and classifications of metabolic pathways.

Pathway boundaries

The first and perhaps most controversial decision that needs to be made when attempting to describe a metabolic pathway is determining the biochemical start and end points that define the boundaries of the pathway. We use several guidelines to help us make this decision. The first guideline, which is often used in the primary literature, is to describe the pathway using the essential subset of enzymes required to achieve a particular biochemical goal. A second guideline is to start biosynthetic pathways and end degradation pathways with common intermediates of central metabolism. Thus, a pathway that describes the degradation of (R)-mevalonate by the bacterium Pseudomonas mevalonii will start with (R)-mevalonate and end with acetyl-CoA, a common intermediate of central metabolism that feeds into the citric acid cycle (TCA cycle) (Fig. 1). It should be noted that in many cases the second guideline is not applicable, because the pathway may utilize input compounds whose biosynthesis has not yet been described in the literature.

Figure 1.

A simple pathway showing the degradation of (R)-mevalonate by the bacterium Pseudomonas mevalonii. The pathway starts with the compound that is being degraded and ends with acetyl-CoA, a common intermediate of central metabolism that feeds into the citric acid cycle (TCA cycle). The other end product, acetoacetate, is linked via a pathway link to the pathway that degrades that compound further into acetyl-CoA. Compounds are shown in red, enzymes in yellow, genes in purple, EC numbers in blue, and pathway links in green.

A potential problem that arises when specifying pathways in this manner is that many of the pathways will contain large overlapping segments. For example, consider the degradation of the aromatic compounds L-tryptophan, naphthalene, and L-quinate. These compounds, along with hundreds of related compounds, can be fully degraded to acetyl-CoA, a compound of central metabolism, in degradation pathways that involve an initial conversion to either catechol or protocatechuate (both are extremely common intermediates in the degradation of aromatic compounds), followed by further degradation to acetyl-CoA via 2-oxopent-4-enoate. If we specified the full pathway from each compound to acetyl-CoA, the three reactions involved in the degradation of 2-oxopent-4-enoate to acetyl-CoA would have to be repeated over and over. Such repetition results in redundancy, a highly undesirable property in a database. To avoid redundancy, we implemented a procedure that uncouples data encoding from data display. We will explain this by continuing with the same example. Instead of repeating the last three steps in all of the pathways, we remove this segment from all pathways and curate it as the stand-alone pathway ‘2-oxopentenoate degradation’ (PWY-5162). In place of the missing last three steps, we terminate the original pathways with a pathway link leading to the 2-oxopentenoate degradation pathway. The pathway link is a simple arrow that indicates the name of the pathway(s) that continue(s) downstream. Because pathway links function as hyperlinks on a computer, clicking on them allows the reader to navigate to the next segment in the metabolic network (Fig. 1).

By replacing repeated pathway segments with pathway links, we eliminate the redundancy in encoding the data. To enable users to see the full pathway in one diagram, we introduced the concept of superpathways. A superpathway is constructed by combining an individual ‘base’ pathway (e.g. 2-oxopentenoate degradation) with one or more additional pathways and/or individual reactions, to show a larger part of the metabolic network. Because the superpathways are treated differently by the software than nonsuperpathways (a.k.a. base pathways), they do not contribute to data redundancy, and we can define as many superpathways as we find useful.

An example may help illustrate this concept. The full pathway from naphthalene to acetyl-CoA is cleaved into four base pathways in MetaCyc – naphthalene degradation (PWY-5427), salicylate degradation (PWY-6183), catechol degradation to 2-oxopentenoate (P183-PWY), and 2-oxopentenoate degradation (PWY-5162). (The values in parentheses following the pathway names are IDs. Every object in the MetaCyc database has a unique ID.) Note that these pathways were broken at salicylate, catechol, and 2-oxopent-4-enoate, all of which are branching points into which multiple pathways are known to feed, and from which multiple pathways are known to depart. The superpathway naphthalene degradation to acetyl-CoA (PWY-6956) contains these four base pathways and provides an overall view of the full pathway from naphthalene to acetyl-CoA in one diagram.
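The separation between data encoding and data display can be illustrated with a minimal Python sketch. It is only an illustration and not how MetaCyc stores its data: the pathway IDs are those cited in the text, but the reaction labels and the number of reactions in each base pathway (other than the three steps of 2-oxopentenoate degradation) are hypothetical placeholders, as are the names `BASE_PATHWAYS`, `SUPERPATHWAYS`, and `expand_superpathway`.

```python
# Illustrative sketch only. Pathway IDs come from the text; reaction labels
# and most reaction counts are hypothetical placeholders.

# Each base pathway stores its own reactions once, plus an optional
# pathway link naming the pathway that continues downstream.
BASE_PATHWAYS = {
    "PWY-5427": {"name": "naphthalene degradation",
                 "reactions": ["nap-r1", "nap-r2"],
                 "link_to": "PWY-6183"},
    "PWY-6183": {"name": "salicylate degradation",
                 "reactions": ["sal-r1"],
                 "link_to": "P183-PWY"},
    "P183-PWY": {"name": "catechol degradation to 2-oxopentenoate",
                 "reactions": ["cat-r1", "cat-r2"],
                 "link_to": "PWY-5162"},
    "PWY-5162": {"name": "2-oxopentenoate degradation",
                 "reactions": ["oxo-r1", "oxo-r2", "oxo-r3"],
                 "link_to": None},
}

# A superpathway stores only references to its components, not reactions.
SUPERPATHWAYS = {
    "PWY-6956": ["PWY-5427", "PWY-6183", "P183-PWY", "PWY-5162"],
}


def expand_superpathway(super_id):
    """Assemble the full reaction list for display from the base pathways."""
    reactions = []
    for base_id in SUPERPATHWAYS[super_id]:
        reactions.extend(BASE_PATHWAYS[base_id]["reactions"])
    return reactions


def stored_reaction_count():
    """Count reactions as encoded; superpathways add nothing to this total."""
    return sum(len(p["reactions"]) for p in BASE_PATHWAYS.values())


full_view = expand_superpathway("PWY-6956")
print(len(full_view), stored_reaction_count())
```

Because the superpathway holds only references, defining further superpathways that reuse 2-oxopentenoate degradation would not increase the number of stored reactions in this sketch.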

It should be noted that different pathway databases differ in how they define pathway boundaries. On one end of the spectrum is the KEGG database (Kanehisa, 2002), which prefers complex metabolic maps that involve all known reactions that are related to a general topic regardless of whether they occur within the same species or even the same kingdom (e.g. methane metabolism – http://133.103.100.191/kegg/pathway/map/map00680.html). On the other end is UniPathway (Morgat et al., 2012), which defines every branching point as the boundary of a ‘linear subpathway’. MetaCyc lies in between these two extremes.

Pathway variants

Another issue that pathway curators face involves pathway variants. It is well documented that different organisms often achieve the same metabolic goal by implementing different pathways. Sometimes, multiple routes for achieving the same goal are found even within the same organism.

For example, salicylate can be used by multiple organisms as the source of carbon and energy. However, different organisms degrade salicylate in different ways. The bacterium Ralstonia sp. U2 hydroxylates salicylate to gentisate in a single reaction (Fuenmayor et al., 1998) (see MetaCyc pathway salicylate degradation II, PWY-6224). Gentisate is then processed to pyruvate and fumarate (Zhou et al., 2001). The bacterium Streptomyces sp. WA46 also converts salicylate to gentisate, but does so by activation to salicylate-CoA, which is hydroxylated to gentisyl-CoA and eventually converted to gentisate (salicylate degradation IV, PWY-6640) (Ishiyama et al., 2004). The yeast Trichosporon moniliiforme decarboxylates salicylate to generate phenol, then hydroxylates the latter to catechol, which is processed to the central metabolites succinyl-CoA and acetyl-CoA (pathways salicylate degradation III, phenol degradation I (aerobic), and catechol degradation III (ortho-cleavage pathway), PWY-6636, PWY-5418, PWY-5417) (Iwasaki et al., 2010). The bacterium Pseudomonas reinekei employs a single decarboxylating hydroxylase that converts salicylate to catechol in a single step (salicylate degradation I, PWY-6183) (Camara et al., 2007). Should all of these routes be part of a single salicylate degradation pathway, or are they different pathways? Once again, different databases treat this topic differently. In KEGG, all of these routes would be combined into a single pathway. In MetaCyc, we curate a different pathway for each known route. To emphasize that these pathways relate to each other, we define them as pathway variants. They are often labeled with a Roman numeral (e.g. salicylate degradation III), and the web page of each of these pathways contains links to the other variants. 
In addition to providing a more accurate and precise representation of which pathways have been biochemically characterized in which species, the inclusion of distinct variant pathways within MetaCyc is expected to improve the quality of the pathway/genome databases that are predicted using MetaCyc as a reference. Having multiple pathway variants to choose from permits the prediction software to determine which of the variants is the best fit for the enzyme complement of a particular organism, and to incorporate only the appropriate variant(s) in the database (see Pathway Prediction below).

Chimeric and conspecific pathways

Sometimes, it is useful to combine pathways from different organisms into a single diagram that provides an overview of a metabolic field. For example, the pathway ‘superpathway of CDP-3,6-dideoxyhexose biosynthesis’ (PWY-5823) brings together several pathways found in Gram-negative bacteria that produce these unusual sugars (CDP-paratose, CDP-abequose, CDP-ascarylose, and CDP-tyvelose) as the O-antigens of their lipopolysaccharides. Note that no single organism identified to date can naturally produce all of these sugars.

Although it is simple to create such a diagram by generating a superpathway composed of the various base pathways, there is an important distinction between pathways that are expected to occur in their entirety in a single organism, and pathways that are not. In order to maintain this distinction, we have established the concepts of conspecific pathways vs. chimeric pathways. While a conspecific (meaning belonging to the same species) pathway comprises a set of reactions that are expected to be found within each organism that has the pathway, a chimeric pathway comprises reactions from multiple organisms and is not expected to occur in its entirety in a single organism. Chimeric MetaCyc pathways are clearly labeled as such to ensure that the user is aware of their special status (Fig. 2).

Figure 2.

Chimeric pathways, like this one describing the synthesis of different CDP-3,6-dideoxyhexoses, comprise reactions and enzymes from multiple organisms, and are not expected to occur in their entirety in a single organism. The title of chimeric MetaCyc pathways is labeled as such to ensure that the reader is aware of their special status. In addition, rather than a taxonomic range, the pathway comments provide a list of taxa known to possess parts of the pathway (not shown). Compounds are shown in red, enzymes in yellow, genes in purple, and EC numbers in blue.

In addition to their role as a resource for human readers, the pathways in MetaCyc are used by the Pathway Tools software as a reference for prediction of the metabolic networks of organisms with a sequenced genome, enabling the software to generate organism-specific Pathway/Genome Databases. Currently, chimeric pathways are excluded from the prediction process. However, future software will be able to construct conspecific versions of the chimeric pathways. When the software predicts that a certain part of a chimeric pathway occurs in an organism (in a connected set of reactions), it will remove the extraneous reactions from the pathway, producing a truncated conspecific version of it.
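One way to picture the truncation of a chimeric pathway is the following Python sketch. It is speculative and does not describe the Pathway Tools implementation: the function name `truncate_chimeric`, the representation of a reaction as a (substrate, product) pair, and the choice to keep the largest connected set are all assumptions made for illustration, and the compounds are abstract placeholders.

```python
def truncate_chimeric(reactions, present):
    """Keep the largest connected set of reactions predicted to be present.

    reactions: dict mapping a reaction label to a (substrate, product) pair
    present:   set of reaction labels predicted to occur in the organism
    Two reactions count as connected when they share a compound.
    """
    kept = {rid: pair for rid, pair in reactions.items() if rid in present}
    components = []
    unvisited = set(kept)
    while unvisited:
        start = min(unvisited)          # deterministic starting point
        stack = [start]
        component = set()
        while stack:
            rid = stack.pop()
            if rid in component:
                continue
            component.add(rid)
            compounds = set(kept[rid])
            for other in unvisited:
                if other not in component and compounds & set(kept[other]):
                    stack.append(other)
        unvisited -= component
        components.append(component)
    largest = max(components, key=len) if components else set()
    return sorted(largest)


# Hypothetical chimeric pathway: B is a branch point leading to C or to D/E.
chimeric = {
    "r1": ("A", "B"),
    "r2": ("B", "C"),
    "r3": ("B", "D"),
    "r4": ("D", "E"),
}
# r3 is missing, so r4 is disconnected from the rest and is dropped.
print(truncate_chimeric(chimeric, present={"r1", "r2", "r4"}))
```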

Engineered pathways

Another type of pathway that demands special presentation is an engineered pathway – these pathways are constructed artificially by modifying naturally occurring enzymes, and/or by introducing enzymes from different sources into a host organism. They share some characteristics with chimeric pathways, but have the distinction that they operate within a single organism. Engineered pathways are clearly labeled as such in MetaCyc and are excluded from the pathway prediction process as well.

Pathway prediction

A major focus of the BioCyc project is to provide high-quality predictions of which of the well-curated pathways in MetaCyc are likely to exist in a target species chosen from any kingdom of life, based primarily on its complete genomic sequence (Paley & Karp, 2002; Krummenacker et al., 2005). This task is accomplished using the PathoLogic tool of the Pathway Tools software (Karp et al., 2010). The basic procedure for predicting whether a particular pathway occurs in an organism is based on the presence of the enzymes of the pathway in that organism (usually deduced from the presence of genes predicted to encode such enzymes in the annotated genome). Because some of the enzymes may not have been properly recognized and annotated, owing to limited knowledge and variations in sequence, an arbitrary threshold is defined. For example, one can demand that 80% of the enzymes of a pathway be present in order to predict that the pathway is present in the organism. As can be expected, using such a simple rule often results in false predictions. Take, for example, the incomplete reductive TCA cycle that operates in methanogenic archaea (P42-PWY) (Shieh & Whitman, 1987). The majority of the enzymes that participate in this pathway are a subset of the enzymes of the tricarboxylic acid (TCA) cycle, a pathway that operates in aerobic organisms. However, the reductive pathway includes only a subset of the reactions of the TCA cycle and operates in the reverse direction, functioning as a carbon dioxide-assimilating mechanism. If we simply look for the presence of the enzymes, we will find the majority of them in most aerobic organisms, and thus might erroneously predict the existence of the incomplete reductive TCA cycle in aerobic organisms that possess the TCA cycle. To avoid such errors, we routinely use the following two features:

Taxonomic range

For many of the pathways, it is possible to define which taxa are likely to possess them. By adding this information to the MetaCyc pathway, it is possible to direct the PathoLogic program to avoid predicting the pathway in species outside its estimated taxonomic range. The taxonomic range is not curated for engineered pathways.

Key reactions

Many pathways require unique enzymes that do not participate in other pathways. By designating the reactions catalyzed by these enzymes as ‘key reactions’ in MetaCyc, and insisting on their presence, PathoLogic can refrain from predicting the pathway in organisms whose genomes do not seem to encode the corresponding key enzymes. For example, some methylotrophic bacteria are able to assimilate formaldehyde in a complex pathway known as the ribulose monophosphate (RuMP) cycle (PWY-1861) (Strom et al., 1974). Most of the enzymes that participate in this pathway also catalyze reactions of the pentose phosphate pathway, a central metabolic pathway that is widespread. However, the RuMP cycle requires two unique enzymes, 3-hexulose-6-phosphate synthase (EC 4.1.2.43) and 6-phospho-3-hexuloisomerase (EC 5.3.1.27). By specifying the reactions catalyzed by these enzymes as key reactions, we can prevent the erroneous prediction of the pathway in nonmethylotrophic organisms that possess the pentose phosphate pathway.
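The basic threshold rule and the two features described above can be combined roughly as in the following Python sketch. It is a simplified illustration under stated assumptions, not the PathoLogic algorithm: the `Pathway` class, the `predict_pathway` function, and the eight `shared-N` reaction labels are hypothetical, and only the pathway ID and the two EC numbers come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    pathway_id: str
    reactions: list                                       # reaction labels
    key_reactions: set = field(default_factory=set)       # must all be present
    taxonomic_range: set = field(default_factory=set)     # empty = unrestricted


def predict_pathway(pathway, catalyzed, lineage, threshold=0.8):
    """Return True if the pathway is predicted to be present.

    catalyzed: set of reactions for which the genome encodes an enzyme
    lineage:   set of taxa to which the organism belongs
    """
    # Feature 1: skip pathways whose expected taxonomic range excludes the organism.
    if pathway.taxonomic_range and not (pathway.taxonomic_range & lineage):
        return False
    # Feature 2: every key reaction must have an enzyme.
    if not pathway.key_reactions <= catalyzed:
        return False
    # Basic rule: a minimum fraction of the reactions must have an enzyme.
    present = sum(1 for r in pathway.reactions if r in catalyzed)
    return present / len(pathway.reactions) >= threshold


# Hypothetical model of the RuMP cycle: eight placeholder reactions shared
# with the pentose phosphate pathway plus the two unique key reactions.
shared = [f"shared-{i}" for i in range(1, 9)]
rump = Pathway(
    "PWY-1861",
    shared + ["EC 4.1.2.43", "EC 5.3.1.27"],
    key_reactions={"EC 4.1.2.43", "EC 5.3.1.27"},
)

non_methylotroph = set(shared)       # has only the shared enzymes
methylotroph = set(rump.reactions)   # has all enzymes

print(predict_pathway(rump, non_methylotroph, lineage=set()))  # False
print(predict_pathway(rump, methylotroph, lineage=set()))      # True
```

In this toy model the non-methylotroph has 8 of 10 reactions and would pass the 80% threshold alone; it is rejected only because the key reactions are missing.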

While the use of these features substantially reduces the number of false-positive predictions, it can also contribute to false-negative predictions, leading the PathoLogic program to reject candidate pathways that are legitimately present in a target organism. For example, a curator may set the expected taxonomic range of a plant growth regulator biosynthetic pathway to the plant kingdom (Viridiplantae). However, there are many known examples of specific fungal and bacterial pathogens producing ‘plant’ compounds to modulate the growth and/or defense responses of their plant hosts. Such a case is found in the rice pathogen Gibberella fujikuroi, which can synthesize the plant hormone gibberellin (see gibberellin biosynthesis IV, PWY-5047) (Rojas et al., 2001). To help avoid these negative results, the PathoLogic program uses a complex algorithm that employs more than 16 different heuristic rules (Dale et al., 2010), which may override the negative taxonomic range data if other strong support exists for the presence of the pathway. In addition, the taxonomic pruning option can be turned off by the user during the creation of a new database.

As discussed above, curated information, the complexity of the PathoLogic program, and user decisions can all work together to produce a more accurate final set of predicted pathways for any target species. It should be noted that some false-positive predictions are likely to be made despite all of these tools and guidelines. For example, even if an organism contains all the genes necessary for a pathway, differential regulation of these genes may prevent them from being expressed at the same time, making it unlikely that the predicted pathway would be physiologically functional (Constantinidou et al., 2006).

Classification of pathways

MetaCyc version 17.0 of March 2013 contains over 2000 base pathways. Many more will surely be studied in the years to come, so it is essential that a pathway classification system be employed to group pathways into meaningful categories that can aid researchers, educators, and students. In the absence of a universally accepted classification system for pathways, we have developed a pathway ontology in MetaCyc. The ontology is continually updated to reflect curation needs. Currently, the ontology contains six top-level categories (or classes): biosynthesis, degradation/utilization/assimilation, generation of precursor metabolites and energy, detoxification, activation/inactivation/interconversion, and metabolic clusters. The definitions of these categories are provided in Table 1. It should be noted that the inclusion of metabolic clusters is a compromise. These collections of unconnected reactions are not true pathways, and thus, one could argue that they should not be part of the ontology. However, we have chosen to include them because we find that they serve an important purpose that traditional pathways do not (see, for example, the metabolic cluster tRNA charging (TRNA-CHARGING-PWY)).

Table 1. The higher-level classes in the MetaCyc pathway ontology. The left column lists the master classes; the right column lists direct subclasses

| Top Category | Definition | Subcategories |
| --- | --- | --- |
| Generation of Precursor Metabolites and Energy (183) | This class contains the pathways of central metabolism (glycolysis, pentose phosphate pathways, and TCA cycle), which collectively produce the 13 starting materials, sometimes termed precursor metabolites, for all cellular biosyntheses. Other degradative pathways, sometimes termed feeder pathways, feed into central metabolism. This class also contains the pathways that generate energy under various conditions of growth | Fermentation (48); Respiration (28); Chemoautotrophic energy metabolism (17); Electron transfer (14); Methanogenesis (13); TCA cycle (9); Hydrogen production (9); Photosynthesis (8); Glycolysis (7); Other (5); Acetyl-CoA biosynthesis (4); Pentose phosphate pathways (4) |
| Biosynthesis (1318) | This class contains pathways that constitute a complete spectrum of the biosynthetic capacities of a cell, including the routes of synthesis of small molecules, macromolecules, and cell structure components. It does not contain the pathways that generate the 13 starting materials, sometimes termed precursor metabolites, for all cellular biosyntheses | Secondary metabolites biosynthesis (530); Cofactors, prosthetic groups, electron carriers biosynthesis (209); Fatty acids and lipids biosynthesis (140); Amino acids biosynthesis (114); Carbohydrates biosynthesis (114); Hormones biosynthesis (57); Nucleosides and nucleotides biosynthesis (48); Other biosynthesis (41); Cell structures biosynthesis (47); Amines and polyamines biosynthesis (37); Aromatic compounds biosynthesis (32); Siderophore biosynthesis (17); Metabolic regulators biosynthesis (6); Aminoacyl-tRNA Charging (4) |
| Degradation/Utilization/Assimilation (872) | This class contains pathways by which various organisms degrade substrates to serve as sources of nutrients and energy, utilize exogenous sources of essential metabolites, or assimilate certain sources of essential bioelements | Aromatic compounds degradation (182); Amino acids degradation (119); Inorganic nutrients metabolism (101); Secondary metabolites degradation (96); Carbohydrates degradation (94); Amines and polyamines degradation (54); Carboxylates degradation (44); Chlorinated compounds degradation (38); Polymeric compounds degradation (37); Nucleosides and nucleotides degradation (36); Hormones degradation (32); C1 Compounds utilization and assimilation (29); Degradation/utilization/assimilation - Other (27); Fatty acids and lipids degradation (24); Alcohols degradation (19); Aldehyde degradation (11); Cofactors, prosthetic groups, electron carriers degradation (5); Protein degradation (3) |
| Activation/Inactivation/Interconversion (33) | This class holds pathways for activation, inactivation, and interconversion of metabolic compounds. In contrast to a standard ‘biosynthesis’ pathway in which a biologically active compound is synthesized from precursor molecules, activation pathways involve relatively minor chemical modifications to existing compounds that result in a substantial increase in their biological activity. Similarly, in contrast to standard ‘degradation’ pathways in which a more complex compound is broken down into a set of simple metabolites, inactivation pathways involve relatively minor chemical modifications to existing biologically active compounds that result in a substantial decrease in their biological activity. Interconversion pathways describe the bidirectional conversion of a biomolecule to a different form. The forward and backward conversions often result in significant changes in the biological activity of the compound, thus resulting in its activation and deactivation, respectively. These modifications may be either reversible or irreversible | Interconversion (17); Activation (8); Inactivation (8) |
| Detoxification (39) | This class contains pathways by which various organisms protect themselves against the harmful effects of toxic compounds. The sole purpose of these pathways is to avoid toxicity, with no other benefit (such as energy or useful metabolites) to the organism | Methylglyoxal detoxification (8); Arsenate detoxification (4); Antibiotic resistance (7); Acid resistance (2); Cyanide detoxification (2); Mercury detoxification (1) |
| Metabolic Clusters (32) | A metabolic cluster is a set of biochemical reactions that are biologically related, but are largely unconnected and therefore do not constitute a pathway in the traditional sense of the word | |

Numbers in parentheses indicate the number of MetaCyc pathways in each class in version 17.0 of MetaCyc. Note that the sum of pathways in all subclasses of a particular top class is likely to be smaller than the total number of pathways for that class, because many pathways are listed directly under the top class. The 13 precursor metabolites mentioned in the definitions include D-glucose 6-phosphate, D-fructose 6-phosphate, D-ribose 5-phosphate, D-erythrose 4-phosphate, D-glyceraldehyde 3-phosphate, 3-phospho-D-glycerate, phosphoenolpyruvate, pyruvate, acetyl-CoA, 2-oxoglutarate, succinyl-CoA, oxaloacetate, and D-sedoheptulose 7-phosphate.

Pathways present in these different ontological categories (and increasingly granular subcategories) can be easily browsed from the ‘Search’ menu present on every page in Pathway Tools-generated databases by selecting ontologies and then pathway ontology. The same ontology can be used to initiate a search under the Pathways option on the Search menu. Moreover, users can perform very complex searches by combining selected ontological categories with additional pathway features such as desired evidence codes, key compounds, and target organisms. For example, it is possible to search for pathways for siderophore production found in γ-proteobacteria, or for pathways for the biosynthesis of secondary metabolites with the expected taxonomic range of Viridiplantae (green plants).
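The idea of combining an ontological category with an additional pathway feature can be sketched in Python as follows. This is an illustration only and not the Pathway Tools search interface: the class names follow Table 1, but the pathway records, their IDs, and the functions `in_class` and `search` are hypothetical.

```python
# Child class -> parent class, for a small slice of the ontology in Table 1.
PARENT = {
    "Siderophore biosynthesis": "Biosynthesis",
    "Secondary metabolites biosynthesis": "Biosynthesis",
    "Aromatic compounds degradation": "Degradation/Utilization/Assimilation",
}

# Hypothetical pathway records; IDs and taxonomic ranges are made up.
PATHWAYS = [
    {"id": "demo-1", "class": "Siderophore biosynthesis",
     "range": {"Gammaproteobacteria"}},
    {"id": "demo-2", "class": "Secondary metabolites biosynthesis",
     "range": {"Viridiplantae"}},
    {"id": "demo-3", "class": "Aromatic compounds degradation",
     "range": {"Gammaproteobacteria"}},
]


def in_class(cls, ancestor):
    """True if cls equals ancestor or lies below it in the ontology."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def search(pathways, ontology_class, taxon):
    """Return IDs of pathways in the class whose expected range includes taxon."""
    return [p["id"] for p in pathways
            if in_class(p["class"], ontology_class) and taxon in p["range"]]


print(search(PATHWAYS, "Siderophore biosynthesis", "Gammaproteobacteria"))
print(search(PATHWAYS, "Biosynthesis", "Viridiplantae"))
```

Walking up the parent links lets a query on a top-level class such as Biosynthesis match pathways filed under any of its subclasses.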

Further uses of pathways within pathway/genome databases

As its name implies, the Pathway Tools software includes multiple tools that can enhance the usability of metabolic pathways. One such tool (the Cellular Overview) is a metabolic map that shows all the pathways in the pathway/genome database (PGDB) in one integrated diagram. When combined with the Omics Viewer tool, users can display transcriptomic, proteomic, metabolomic, and fluxomic data values, as well as many other types of data, superimposed on the pathways in this diagram. If groups of enzymes, compounds, etc., fall into specific categories, for example, those up-regulated vs. down-regulated in a particular experimental condition, these different categories can be displayed on the diagram using a Highlight feature. Finer control is obtained by defining numerical values for each category (e.g. highlight in blue the pathway reactions catalyzed by enzymes whose genes are up-regulated more than threefold but less than 10-fold). While using Pathway Tools online, users can apply these tools to any of the 3000 PGDBs that are currently available in the BioCyc collection. Users who prefer to use the software on their own desktops, perhaps because they create their own PGDBs, are able to easily download any of the BioCyc PGDBs and install them locally using a tool called the PGDB Registry. Using the desktop version of the software allows more control, because the user is able to modify the data in any of these PGDBs. For instance, if additional steps in a metabolic pathway are uncovered, the pathway can be modified using the desktop version of the program, adding the new reactions to the pathway and assigning the enzymes to them. Because the pathway diagrams are generated automatically by the software as soon as the new reactions are added, a pathway diagram can be easily prepared for a publication describing the new enzymes.
Once the Cellular Overview is regenerated by the program, experimental data can then be visually displayed in the context of the extended pathway, which may allow coexpression patterns or metabolic bottlenecks in mutants to be more easily discerned. Additional pathway figures can be prepared, showing omics data superimposed on the new pathway diagram.

Sharing the new data is easy as well. Pathway Tools can be configured to run as a web server, allowing remote users on an intranet or the Internet to connect via their web browsers and browse the data. Pathways can also be exported to files that can be imported by collaborators who want to analyze the same pathway using their own local installation of Pathway Tools.

As discussed at the beginning of this article, while the definition and use of metabolic pathways are clearly beneficial, in reality there are no strictly defined ‘pathways’ that operate in isolation. Rather, all of the reactions performed within an organelle, cell, and/or set of cells are connected at some level. Fortunately, the pathway boundaries imposed by MetaCyc curators provide no barrier to more global analyses. A Metabolite Tracer tool (available only in the desktop version of the program) can help researchers to see the connections between pathways. The Tracer enables users to define a starting compound and track all of its metabolic routes in increments of one, two, or more reactions, no matter which predefined pathway they occur in.
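Tracing a compound across pathway boundaries amounts to a breadth-first walk over the reaction network. The Python sketch below illustrates that general idea with abstract placeholder compounds; it is not the Metabolite Tracer implementation, and the function name `trace` and the (substrate, product, pathway) record layout are assumptions.

```python
from collections import deque


def trace(reactions, start, max_steps):
    """Return {compound: number of reactions from start}, up to max_steps.

    reactions: list of (substrate, product, pathway_label) tuples.
    The pathway label is ignored, so the walk crosses pathway boundaries.
    """
    neighbors = {}
    for substrate, product, _pathway in reactions:
        neighbors.setdefault(substrate, []).append(product)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        compound = queue.popleft()
        if distance[compound] == max_steps:
            continue                      # do not expand beyond the limit
        for nxt in neighbors.get(compound, []):
            if nxt not in distance:       # first visit gives the shortest route
                distance[nxt] = distance[compound] + 1
                queue.append(nxt)
    return distance


# Hypothetical network: compound B sits on the boundary of two pathways.
network = [
    ("A", "B", "pwy-1"),
    ("B", "C", "pwy-1"),
    ("B", "D", "pwy-2"),
    ("D", "E", "pwy-2"),
]
print(trace(network, "A", 1))
print(trace(network, "A", 2))
```

Raising `max_steps` by one extends the traced neighborhood by one reaction, mirroring the stepwise increments described above.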

The new MetaFlux tool in the Pathway Tools software (also available only on the desktop version) uses the data in a PGDB to define a flux-balance analysis (FBA) model that considers the densely interconnected web of metabolites, enzymes, and reactions that exist in that organism. Researchers can also use the data files available for each PGDB to generate their own networks outside of the Pathway Tools software. While there are some ready-made tools to facilitate this process (such as the BioCyc plugin for the Cytoscape program, released in 2010), users have countless options for programmatically bringing together the reactions and compounds that MetaCyc curators have ‘separated’ into more comprehensible pathways. Regardless of the tool used, once results are obtained from these more global analyses, their biological significance can often be fruitfully examined with the help of the annotated MetaCyc pathways and their accompanying summaries and lists of references.
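For readers unfamiliar with FBA, the optimization problem is conventionally written as a linear program. The formulation below is the standard textbook form, given here as background; the symbols are not taken from this article, and it makes no claim about how MetaFlux sets up its models. Here $S$ is the stoichiometric matrix (metabolites by reactions), $v$ the vector of reaction fluxes, $c$ the weights of the objective (for example, a biomass reaction), and $l$ and $u$ the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& l \le v \le u .
\end{aligned}
```

The constraint $S\,v = 0$ expresses the steady-state assumption that each internal metabolite is produced and consumed at equal rates.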

Conclusions

Curators of metabolic pathway databases face fundamental problems pertaining to the definition, classification, and representation of metabolic pathways. In this article, we described the tools and guidelines that we use in the MetaCyc database for representing these pathways, as well as the ontology that we developed for classifying metabolic pathways. In addition, we described how the pathways in MetaCyc are used by the Pathway Tools software for prediction of the metabolic network of thousands of sequenced organisms, and some of the tools that the software offers for pathway analysis. We hope that our approach to pathway classification and representation and the tools that we developed help make pathway data analysis easier and more powerful for researchers, students, and educators.

Funding

Funding for this research was provided by the National Institute of General Medical Sciences of the National Institutes of Health (grants GM080746, GM077678, GM088849 and GM075742); the Department of Energy (bioenergy-related pathway curation, grant DE-SC0004878); and the National Science Foundation (plant pathway curation performed by Carnegie Institution for Science, grants IOS-1026003 and DBI-0640769). The content of this manuscript is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.
