Metabolomics, modelling and machine learning in systems biology – towards an understanding of the languages of cells

Delivered on 3 July 2005 at the 30th FEBS Congress and 9th IUBMB conference in Budapest

Authors

  • Douglas B. Kell

    1. School of Chemistry, Faraday Building, The University of Manchester, UK
    2. Manchester Centre for Integrative Systems Biology, Manchester Interdisciplinary Biocentre, UK

D.B. Kell, School of Chemistry, University of Manchester, Faraday Building, Sackville Street, Manchester M60 1DQ, UK
Tel: +44 161 3064492
E-mail: dbk@manchester.ac.uk
Website: http://dbk.ch.umist.ac.uk, http://www.mib.ac.uk/, http://www.mcisb.org/

Abstract

The newly emerging field of systems biology involves a judicious interplay between high-throughput ‘wet’ experimentation, computational modelling and technology development, coupled to the world of ideas and theory. This interplay involves iterative cycles, such that systems biology is not at all confined to hypothesis-dependent studies, with intelligent, principled, hypothesis-generating studies being of high importance and consequently very far from aimless fishing expeditions. I seek to illustrate each of these facets. Novel technology development in metabolomics can increase substantially the dynamic range and number of metabolites that one can detect, and these can be exploited as disease markers and in the consequent and principled generation of hypotheses that are consistent with the data, and achieve this in a value-free manner. Much of classical biochemistry and signalling pathway analysis has concentrated on the analyses of changes in the concentrations of intermediates, with ‘local’ equations − such as that of Michaelis and Menten, v = V_max·S/(S + K_m) − that describe individual steps being based solely on the instantaneous values of these concentrations. Recent work using single cells (that are not subject to the intellectually unsupportable averaging of the variables displayed by heterogeneous cells possessing nonlinear kinetics) has led to the recognition that some protein signalling pathways may encode their signals not (just) as concentrations (AM or amplitude-modulated in a radio analogy) but via changes in the dynamics of those concentrations (the signals are FM or frequency-modulated). This contributes in principle to a straightforward solution of the crosstalk problem, leads to a profound reassessment of how to understand the downstream effects of dynamic changes in the concentrations of elements in these pathways, and stresses the role of signal processing (and not merely the intermediates) in biological signalling. It is this signal processing that lies at the heart of understanding the languages of cells. The resolution of many of the modern and postgenomic problems of biochemistry requires the development of a myriad of new technologies (and maybe a new culture), and thus regular input from the physical sciences, engineering, mathematics and computer science. One solution, that we are adopting in the Manchester Interdisciplinary Biocentre (http://www.mib.ac.uk/) and the Manchester Centre for Integrative Systems Biology (http://www.mcisb.org/), is thus to colocate individuals with the necessary combinations of skills. Novel disciplines that require such an integrative approach continue to emerge. These include fields such as chemical genomics, synthetic biology, distributed computational environments for biological data and modelling, single cell diagnostics/bionanotechnology, and computational linguistics/text mining.

Abbreviations
MCA, metabolic control analysis; ODE, ordinary differential equations

The belief that an organism is ‘nothing more’ than a collection of substances, albeit a collection of very complex substances, is as widespread as it is difficult to substantiate…The problem is therefore the investigation of systems, i.e. components related or organized in a specific way. The properties of a system are, in fact, ‘more’ than (or different from) the properties of its components, a fact often overlooked in zealous attempts to demonstrate ‘additivity’ of certain phenomena. It is with the ‘systemic properties’ that we shall be mainly concerned.

H. Kacser (1957) in The Strategy of the Genes (ed. CH Waddington), pp. 191–249. Allen & Unwin, London

Progress in science depends on new techniques, new discoveries, and new ideas, probably in that order.

Sydney Brenner, Nature, June 5, 1980

Systems biology as such is not especially new [1–3], but while it is not hard to find prescient comments from Henrik Kacser and from Sydney Brenner [4], those given above might be seen as epitomizing the key features of the more recent move towards, and interest in, Systems Biology [5–14] (Fig. 1).

Figure 1.

Systems biology is usually seen as an iterative activity integrating computational work, high-throughput ‘wet’ experimentation and technology development with the world of theory and novel ideas.

Paralleling the Brenner quote, my lecture also chose to highlight three aspects of our current work with collaborators. The first involves the philosophical underpinnings of our scientific strategy and of the systems biology agenda, which can each be considered to involve an iterative interplay [15–17] between a series of linked activities. These activities include data (observations) and ideas (hypotheses); theory, computation and experiment; and the iterative assessment of the parameters and variables in such computational models and experiments. The second area relates to the actual development of technology for systems biology, specifically analytical and computational technology − especially in metabolomics − to help provide both high quality data and the concomitant modelling that relies on it. The third strand develops various ideas that emerged following our recent findings [18–20] that protein signalling pathways − specifically those involving the nuclear transcription factor NF-κB − may encode their signals not so much in terms of changes in the concentrations of the observable signalling intermediates but in terms of their frequency or dynamics. Such signals must be perceived by downstream signal processing elements that respond to their dynamics, and so to understand such pathways properly one needs to understand and focus on not only the intermediates (the medium) but also the ‘downstream’ means (‘network motifs’, see e.g. [21–23], or ‘design elements’ [24]) by which such signals are perceived (to make the message). This leads to a profoundly different view of the significance of networks in systems biology, and one that allows one a much better understanding of signalling as signal processing. Put another way, and again quoting Henrik Kacser [25,26], ‘But one thing is certain: to understand the whole one must study the whole’.

Philosophical elements of systems biology

As in Fig. 1, most commentators (summarized, e.g. in [12]), as I do [17,27], take the systems biology agenda to include pertinent technology development, theory, computational modelling and high-throughput experimentation. Hypothesis-driven science is only a partial component of this, and not the major one [16]. More specifically, in systems biology, studies are performed purposively in an iterative manner, in a way that contrasts with previous strategies. This iteration is multidimensional, and can be described or seen in various ways, including both wet (experimental) and dry (computational and theoretical), reductionist and synthetic, qualitative and quantitative, and a systems biologist would lay more stress than is conventional on the right-hand arcs of the diagrams in Fig. 2. A particular feature is the ‘vertical’ focus of systems biology in seeking to relate ‘lower’ levels of biological organization such as enzymatic properties to higher levels of biological organization, and in this sense systems biology shares the same agenda as the long-established approaches of Metabolic Control Analysis [11,26,28–32] and Biochemical Systems Theory [33,34].

Figure 2.

Some of the iterative elements of systems biology. (A) Science can be said to advance via an iterative interplay between the worlds of ideas and of experimental data. The world of ideas includes theories, hypotheses, human knowledge and any other mental constructs, while the world of data consists of experimental observations and other facts, sometimes referred to as ‘sense data’ in the philosophical literature. As an iterative process, movement between these two worlds is not simply a reversible action: analysis is not the reverse of synthesis [339]. (B) One view of systems biology, reflecting a largely bottom-up approach, as in the ‘silicon cell’ [340]. First we need what we term a ‘structural model’ (this describes the network's structure, and has nothing to do with structural biology) that defines the participants in the process of interest and the (qualitative) nature of the interactions between them; then we try to develop equations, preferably mechanistic rather than empirical, that best describe the relationships; then finally we seek to parameterize those equations (recognizing that if errors occur in the earlier phases we may need to return and correct them in the light of further knowledge). (C) The hallmark of modelling as a comparison between the mathematical models and the ‘reality’ (i.e. observed experimental data plus noise), again as an iterative process. (D) Producing and refining a model: data on kinetic parameters allow one to run a forward model. However, inferring such parameters from measured omics data (fluxes and concentrations) is referred to as an inverse or system identification problem (e.g. [86–88,90,91,341–347]) and is much harder. One strategy is to make estimates of the parameters and, on the basis of the consequent forward model, refine those estimates iteratively until some level of convergence (with statistical confidence levels) is achieved. (E) The iteration in models/mapping between levels of biological organization, e.g. in the case illustrated between the overall metabolism of an organism and its enzymatic parts.

It is a curious fact that in physics and chemistry (and indeed in economics) ‘theory’ has a status almost equal to that of experiment, and has claimed many Nobel Prizes, but in modern biology this is not the case. ‘Pure’ theoreticians do not easily make a living (and only partly for sociological reasons connected with their perceived grant-winning abilities). Equivalently, it would be laughable for an engineer not to make a mathematical model of a candidate design for a bridge or an aeroplane before trying to build one, since the chance of it ‘working’ would be remote (because it is ‘complex’, and this is because its components are many and they act in nonlinear ways). By contrast, making mathematical models of the biological systems one is investigating (and seeing how they perform in silico) is generally considered a minority sport, and one not to be indulged in by those who prefer (or who prefer their postdocs and students) to spend more time with their pipettes.

Fairly obviously, it is easy to recognize that molecular biology concentrated perhaps too heavily on parts rather than wholes in its development, or at least that it is time, now that we have the postgenomic parts list of the genes and proteins (though not yet the metabolites) of most organisms of immediate interest, for working biologists to incorporate the skills of the numerical modeller (or indeed the radio engineer [35]), just as the more successful ones needed to become acquainted with the techniques of molecular biology when they began to be developed 30 years ago. In 10 years' time the referees of grant proposals and papers will normally ask only why one did not model one's system before studying it experimentally, not why one might wish to.

This said, it is useful to rehearse the variety of reasons why one might wish to model a biological system that one is seeking to understand and study experimentally [36] (and see also [12,13,37]):

  • testing whether the model is accurate, in the sense that it reflects − or can be made to reflect − known experimental facts. This amounts to ‘simulation’;
  • analysing the model to understand which parts of the system contribute most to some desired properties of interest;
  • hypothesis generation and testing, allowing one to analyse rapidly the effects of manipulating experimental conditions in the model without having to perform complex and costly experiments (or to restrict the number that are performed);
  • testing what changes in the model would improve the consistency of its behaviour with experimental observations.

The last two points amount to ‘prediction’.

The techniques of modelling

Most strategies for creating mathematical models of biological systems recognize that the nonoptical, high-resolution experimental analysis of spatial distributions beyond macro-compartments is not yet available, and thus that it is appropriate to use ordinary differential equations (ODEs) that assume such compartments to be well-stirred and their components to be present at concentrations high enough that they are ‘homogeneous’. If the former assumption breaks down one can create subcompartments [38], while failure of the latter requires one to resort to so-called ‘stochastic’ methods [39,40].

Modern ODE solvers can deal with essentially any system, even when its ‘local’ kinetics are on very different timescales (so-called ‘stiff’ systems), and many have been devised by and for biologists, thus making them particularly easy to use. A particular trend is towards making models that are interoperable between laboratories, and the website of the Systems Biology Markup Language (http://www.sbml.org/) [41,42] lists many such tools, including Gepasi [38,43,44].
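To make the ODE formulation concrete, the following is a minimal sketch (not drawn from any of the cited models) of a two-step pathway S → I → P, each step obeying irreversible Michaelis–Menten kinetics, written as coupled ODEs and integrated with a stiff-capable solver; the rate parameters and the use of SciPy here are purely illustrative assumptions.

```python
# Minimal sketch: a two-step pathway S -> I -> P with irreversible Michaelis-Menten
# kinetics, written as coupled ODEs and integrated with a stiff-capable solver (BDF).
# All parameter values are illustrative assumptions, not taken from any cited model.
import numpy as np
from scipy.integrate import solve_ivp

VMAX1, KM1 = 10.0, 0.5    # step 1: S -> I (fast)
VMAX2, KM2 = 0.1, 0.05    # step 2: I -> P (the two steps differ ~100-fold in timescale)

def rates(t, y):
    S, I, P = y
    v1 = VMAX1 * S / (KM1 + S)   # irreversible Michaelis-Menten, step 1
    v2 = VMAX2 * I / (KM2 + I)   # irreversible Michaelis-Menten, step 2
    return [-v1, v1 - v2, v2]    # dS/dt, dI/dt, dP/dt

sol = solve_ivp(rates, t_span=(0, 100), y0=[1.0, 0.0, 0.0],
                method='BDF', dense_output=True)

for t in (0, 1, 10, 100):
    S, I, P = sol.sol(t)
    print(f"t={t:5.1f}  S={S:.4f}  I={I:.4f}  P={P:.4f}")
```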

Figure 2 shows various views of the systems biology agenda. Figure 2A stresses the importance of inductive methods of hypothesis generation; these have unaccountably had far less emphasis than they should have done because of the traditional obsession in twentieth century biology with hypothesis testing [16]. However, the search for good hypotheses can be seen as a heuristic search over a huge landscape of ‘possible’ hypotheses, of the form familiar in heuristic and combinatorial optimization problems [45–47], and the choice of where to look next − this is the ‘principled’ part − is known as ‘active learning’ [48–54]. It can be and has been automated in areas such as functional genomics [55,56], in clinical [57,58] and analytical chemistry [59], and in the coherent control of chemical reactions [60]. Principled hypothesis generation is clearly at least as important as hypothesis testing, and appropriate experimental designs, such as those used in active learning (and these go far beyond those usually described in textbooks of experimental design [61–65]), ensure that the search for good candidate data is not an aimless fishing expedition but one which is likely to find novel answers in unexpected places (e.g. [15,16,66–69]).

Figure 2B sets down the overall strategy, usually known as a ‘bottom up’ strategy, that we consider to be appropriate for most systems biology problems of interest to readers of the FEBS Journal. As whole-genome models of metabolism have become available (e.g. [70–72]), it has become evident that one can learn much merely from the structure plus constraints of a qualitative but stoichiometric model of the network (e.g. [14,73–80]). This leads one to stress the importance of first getting the structural model (the fundamental building blocks that determine and constrain the ‘language’ of cells). From the qualitative model, we then require suitable equations that can represent the quantitative nature of the interactions set down in the structural model. Such equations are preferably mechanistic, as is common in molecular enzymology [81–84], but may also be empirical if they serve to fit the data over a suitably wide range [33,34,85]. After this, one must parametrize those equations with kinetic data, as the parametrized equations (recast into the form of coupled ordinary differential equations) can then be used directly in forward models (e.g. [38,44]). Figure 2C, D and E highlight the basic and iterative relations between computational models and reality on the one hand, and between changes invoked in the model and its subsequent dynamic behaviour on the other, leading to an understanding of how events at one level (e.g. the enzymatic) can be used to gain an understanding of events at a higher level (e.g. physiology or whole-cell metabolism). As mentioned above, the goal of systems biology in integrating these different levels of organization thus shares many similarities with those of metabolic control analysis and biochemical systems theory.

A particular issue with systems biology, which is why we stress the need to measure parameters, is that it is the parameters that control the variables and not the other way round, while omics measurements usually determine only the variables (e.g. in metabolism/metabolomics the metabolic fluxes and concentrations). Going from the variables to the parameters involves solving an inverse or ‘system identification’ problem [86], and this is typically very hard [87–91] as these problems are often heavily underdetermined (many parameter combinations can give the same variables), even if the structural model is correct.
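As a toy illustration of why the inverse problem is hard, the sketch below (all values assumed) fits the two parameters of a single Michaelis–Menten step to a noisy simulated progress curve; even in this tiny case the estimates of V_max and K_m can be strongly correlated, and in realistic networks many more parameter combinations fit the measured variables equally well.

```python
# Sketch of a (deliberately small) inverse problem: estimate Vmax and Km of a single
# Michaelis-Menten step from a noisy, simulated progress curve. Real systems-biology
# inverse problems involve many more parameters and are far more underdetermined.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
TRUE = dict(vmax=1.0, km=0.3)          # 'unknown' parameters used to simulate the data
t_obs = np.linspace(0, 10, 20)

def simulate(vmax, km, s0=1.0):
    f = lambda t, s: -vmax * s / (km + s)          # dS/dt for one irreversible step
    sol = solve_ivp(f, (0, t_obs[-1]), [s0], t_eval=t_obs, method='LSODA')
    return sol.y[0]

data = simulate(**TRUE) + rng.normal(0, 0.02, t_obs.size)   # noisy measured 'variable'

def residuals(p):
    vmax, km = p
    return simulate(vmax, km) - data

fit = least_squares(residuals, x0=[0.5, 0.5], bounds=([1e-6, 1e-6], [10, 10]))
print("estimated Vmax, Km:", fit.x)    # compare with TRUE; note the near-correlated fit
```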

Metabolomics and metabolomics technology development

As enshrined in the formalism of Metabolic Control Analysis (MCA) [11,26,28–32], it has been known for over 30 years that small changes in the activities of individual enzymes lead only to small changes in metabolic fluxes but can lead to large changes in concentrations. These facts are causally related, expected and mathematically proven. Metabolomics, being downstream of transcriptomics and proteomics, thus represents a more suitable level of biological organization for analysis [92] since metabolites are both more tractable in number and are amplified relative to changes in the transcriptome, proteome or gross phenotype [93]. Although we must in due time seek to integrate all the omes, metabolomics is thus the strategy of choice for the purposes of functional genomics, biomarker development and systems biology (e.g. [94–104]).
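As a numerical illustration of the MCA formalism (using an invented two-enzyme pathway with assumed rate constants), the sketch below computes flux and concentration control coefficients by applying small fractional changes to each enzyme activity, and recovers the summation theorems: the flux control coefficients sum to 1, while the concentration control coefficients sum to 0 and can individually be large.

```python
# Sketch (illustrative parameters assumed): control coefficients for a two-enzyme
# linear pathway X0 -> S -> X1, with a reversible first step and an irreversible
# second step, both linear in their substrates. Small fractional changes in each
# enzyme activity e_i are applied, and the scaled responses of the steady-state
# flux J and intermediate concentration S are obtained by finite differences.
X0 = 1.0                      # fixed 'external' substrate concentration
k1, k2, k3 = 2.0, 1.0, 5.0    # assumed rate constants

def steady_state(e1, e2):
    # v1 = e1*(k1*X0 - k2*S), v2 = e2*k3*S; at steady state v1 = v2
    S = e1 * k1 * X0 / (e1 * k2 + e2 * k3)
    J = e2 * k3 * S
    return S, J

def control_coefficients(e1=1.0, e2=1.0, h=1e-6):
    S0, J0 = steady_state(e1, e2)
    coeffs = []
    for a, b in [(e1 * (1 + h), e2), (e1, e2 * (1 + h))]:
        S1, J1 = steady_state(a, b)
        CJ = (J1 - J0) / J0 / h       # flux control coefficient of this enzyme
        CS = (S1 - S0) / S0 / h       # concentration control coefficient
        coeffs.append((CJ, CS))
    return coeffs

(CJ1, CS1), (CJ2, CS2) = control_coefficients()
print(f"C^J_1={CJ1:.3f}  C^J_2={CJ2:.3f}  sum={CJ1 + CJ2:.3f}")   # flux summation ~ 1
print(f"C^S_1={CS1:.3f}  C^S_2={CS2:.3f}  sum={CS1 + CS2:.3f}")   # conc. summation ~ 0
```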

If we consider metabolic systems, most analysts take discrete samples and provide what we have referred to as ‘metabolic snapshots’[26]. Typical model microbes such as baker's yeast [70] contain upwards of 1000 known metabolites, and most of these have a relative molecular mass of less than 1000 [27]. Indeed, metabolomics is usually considered to mean ‘small molecule metabolomics’, even if cell wall polymers and the like are necessarily produced by metabolism.

The actual number of measurable metabolites in a given biological system is unknown, but numbers of the order of 10 000–13 000 have already been observed in mouse urine [105], albeit that some or many are of gut microbial origin [101]. Most of these have yet to be identified chemically.

The history of biomedicine as perceived via the awards of the Nobel Committee indicates the importance to our understanding of the subject of both small molecules (examples: ascorbic acid, coenzyme A, penicillin, streptomycin, cAMP, prostaglandins, dopamine, NO) and novel analytical methods (examples: paper chromatography, X-ray crystallography, the sequencing of proteins and of nucleic acids, radioimmunoassay, PCR, soft ionization MS, biological NMR). An important area of metabolomics thus consists of maximizing the number of metabolites that may be measured reliably [106–109], as a prelude to exploiting such data via a chemometric and computational pipeline [27,107,110]. As above, it transpires that optimizing scientific instrumentation is a combinatorial problem that scales exponentially with the number of experimental parameters. Thus, if there are 14 adjustable settings on an electrospray mass spectrometer, each of which can take 10 values, the number of possible combinations is 10¹⁴ [111]. Since the lifetime of the Universe is about 10¹⁷ s [112], it is obvious that trying all of these (‘exhaustive search’) is impossible. So-called heuristic methods [113–117] are thus designed to find good but not provably optimal solutions, and methods [111,118] based on evolutionary algorithms [119] have proved successful. However, they remain slow when run times are long and there is a human being in the loop, and the number of experiments that can be evaluated is correspondingly small.
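A minimal sketch of this kind of heuristic (evolutionary) search is given below. The 14 settings, each taking one of 10 discrete values, are evolved by mutation and selection against a made-up surrogate objective standing in for ‘number of peaks detected’; in the real closed-loop work each evaluation is, of course, an actual instrument run.

```python
# Minimal sketch of heuristic (evolutionary) optimization of instrument settings.
# The objective below is a made-up surrogate for 'peaks detected'; in the real
# closed-loop work each evaluation is an actual chromatography/MS run.
import random
random.seed(0)

N_SETTINGS, N_LEVELS = 14, 10          # 10**14 possible combinations in total

def peaks_detected(settings):
    # Surrogate objective (assumption): smooth-ish, with one interaction term.
    return sum(10 - abs(s - 7) for s in settings) \
           - 3 * abs(settings[0] - settings[1])

def mutate(settings, rate=0.2):
    return [random.randrange(N_LEVELS) if random.random() < rate else s
            for s in settings]

# simple (mu + lambda) evolutionary loop: mutate, evaluate, keep the best
population = [[random.randrange(N_LEVELS) for _ in range(N_SETTINGS)]
              for _ in range(20)]
for generation in range(50):
    offspring = [mutate(random.choice(population)) for _ in range(40)]
    population = sorted(population + offspring,
                        key=peaks_detected, reverse=True)[:20]

best = population[0]
print("best settings:", best, "surrogate peaks:", peaks_detected(best))
```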

As indicated above, active learning methods are attractive, and, in a manner related to the computationally driven supervised [120] and inductive [16] discovery of new biological knowledge [121], we have contributed to the Robot Scientist project [55]. This was concerned with automating principled hypothesis generation in the area of experimental design for functional genomics. In this arrangement, one seeks to optimize the order in which one performs a series of experiments, given that n possible experiments can be done serially in n! (n factorial) possible orders; for n = 15, n! ≈ 1.3 × 10¹². In the Robot Scientist paper [55] a computational system was used: (a) to hold background knowledge about a biological domain (amino acid biosynthesis, modelled as a logical graph); (b) to use that knowledge to design the ‘best’ (most discriminatory) experiment in order to find the biochemical location in that graph of a specific genetic lesion; (c) to perform that experiment using microbial growth tests, and to analyse the results; and (d) on the basis of these to design, perform and evaluate the next experiment, the whole continuing in an iterative manner (i.e. in a closed loop, without human intervention) until only one ‘possible’ hypothesis remains.
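The logic of steps (b)–(d) − choosing and performing the most discriminatory experiment and discarding inconsistent hypotheses − can be sketched as below; the hypotheses, growth experiments and prediction table are invented for illustration and bear no relation to the actual background knowledge used in [55].

```python
# Sketch of closed-loop hypothesis elimination: repeatedly pick the experiment whose
# predicted outcomes best split the surviving hypotheses, 'perform' it (here simulated
# against a hidden ground truth), and discard inconsistent hypotheses.
from collections import Counter

hypotheses = ['lesion_in_gene_A', 'lesion_in_gene_B', 'lesion_in_gene_C', 'lesion_in_gene_D']
experiments = ['grow_on_medium_1', 'grow_on_medium_2', 'grow_on_medium_3']

# predicted growth (True/False) for each (hypothesis, experiment) pair -- all assumed
predicts = {
    ('lesion_in_gene_A', 'grow_on_medium_1'): True,  ('lesion_in_gene_A', 'grow_on_medium_2'): False, ('lesion_in_gene_A', 'grow_on_medium_3'): True,
    ('lesion_in_gene_B', 'grow_on_medium_1'): True,  ('lesion_in_gene_B', 'grow_on_medium_2'): True,  ('lesion_in_gene_B', 'grow_on_medium_3'): False,
    ('lesion_in_gene_C', 'grow_on_medium_1'): False, ('lesion_in_gene_C', 'grow_on_medium_2'): True,  ('lesion_in_gene_C', 'grow_on_medium_3'): True,
    ('lesion_in_gene_D', 'grow_on_medium_1'): False, ('lesion_in_gene_D', 'grow_on_medium_2'): False, ('lesion_in_gene_D', 'grow_on_medium_3'): False,
}
truth = 'lesion_in_gene_C'                    # hidden ground truth for the simulation

def discrimination(exp, live):
    # prefer experiments whose predicted outcomes split the live hypotheses most evenly
    counts = Counter(predicts[(h, exp)] for h in live)
    return min(counts.values()) if len(counts) > 1 else 0

live = list(hypotheses)
while len(live) > 1:
    exp = max(experiments, key=lambda e: discrimination(e, live))
    outcome = predicts[(truth, exp)]          # 'perform' the chosen experiment
    live = [h for h in live if predicts[(h, exp)] == outcome]
    print(f"{exp}: observed growth={outcome}, surviving hypotheses={live}")
```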

We have now combined these ideas to use heuristic search methods in an automated closed loop (the ‘Robot Chromatographer’) to maximize simultaneously the number of peaks observed while minimizing the run time [59], and in addition maximizing a metric based on the signal-to-noise ratio. Depending on the sample (serum [107] or yeast supernatant [122–124]), this has more than trebled the number of metabolite peaks that we can reliably observe using GC-TOF MS [59] (Fig. 3), thereby allowing us to discover important new biomarkers for metabolic and other diseases including pre-eclampsia [125], based on peaks that were not observed under the original, previously optimized run conditions. The new technology thus led directly to the discovery of new biology, as in previous work in metabolomics (e.g. [67,68]). Sometimes it is a lack of unexpected differences that is the result of interest [126]. An especially useful strategy in microbiology is to study the exometabolome or ‘metabolic footprint’ [122–124,127] of metabolites excreted by cells, as this gives important clues as to their intracellular metabolism but is much easier to measure. Current work is concentrating on the optimization of 2D GC technology (GC×GC-TOF) [128–130] and ultra-performance liquid chromatography [105,124,131,132].

Figure 3.

Closed loop evolution of improved peak number in GC-MS experiments. Run time is encoded in the size of the symbols. It may be observed that this PESA-II algorithm [348] serially explores areas of the search space that can improve both the number of peaks and the run time. The size of the search space exceeded 200 000 000. Each generation contains two experiments, encoded via the two colours. Data are from the experiments described in [59].
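Multiobjective algorithms such as the PESA-II used here work with an archive of non-dominated (Pareto-optimal) solutions; the dominance bookkeeping at the heart of this can be sketched as follows, using invented (peak count, run time) pairs.

```python
# Sketch of the Pareto-dominance bookkeeping underlying multiobjective closed-loop
# optimization: keep the experiments that are not dominated on (more peaks, shorter run).
# The (peaks, run_time) pairs below are invented for illustration.
experiments = [
    {'id': 1, 'peaks': 420, 'run_time': 45.0},
    {'id': 2, 'peaks': 510, 'run_time': 60.0},
    {'id': 3, 'peaks': 480, 'run_time': 38.0},
    {'id': 4, 'peaks': 300, 'run_time': 70.0},   # dominated by several others
    {'id': 5, 'peaks': 525, 'run_time': 75.0},
]

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and better on at least one."""
    no_worse = a['peaks'] >= b['peaks'] and a['run_time'] <= b['run_time']
    better = a['peaks'] > b['peaks'] or a['run_time'] < b['run_time']
    return no_worse and better

pareto_front = [e for e in experiments
                if not any(dominates(other, e) for other in experiments)]
print("non-dominated experiments:", [e['id'] for e in pareto_front])
```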

Creating and analysing systems biology models: network motifs, sensitivity analysis, functional linkage and signal processing

As postgenomic, high-throughput methods develop, it is increasingly commonplace to have access to large datasets of variables (‘omics data) against which to test a mathematical model of the system that might generate such data. In these cases, the model will usually be an ODE model, and finding a good model is a system identification problem [44,86].

Much less frequently [133], the kinetic and binding constants are available, and a reliable ‘forward’ model can be generated directly. One such case [134] is the NF-κB signalling pathway [135–138]. NF-κB is a nuclear transcription factor that is normally held inactive in the cytoplasm by being bound to one or more isoforms of an inhibitor (IκB). When IκB is phosphorylated by a kinase (IKK) it is degraded and free NF-κB can translocate to the nucleus, where it induces the expression of genes (including those such as IκB that are involved in its own dynamics). The NF-κB system is considered to be ‘involved’ in both cell proliferation and in apoptosis, as well as in diseases such as arthritis, although how a cell ‘chooses’ which of these orthogonal processes will happen simply from the changes in the concentration of NF-κB in a particular location or compartment is neither known nor obvious. (In a sense this is the same problem as that of ‘commitment’ in developmental biology generally.) Earlier experimental measurements showed oscillations in nuclear NF-κB in single cells, though these were damped when assessed as an ensemble since individual cells were necessarily out of phase ([139], and see also [140] for a different example and [141,142] for a similar philosophy underpinning the use of single-cell measurements in flow cytometry). More recently, with improved constructs and detector technology, the oscillations could be measured accurately in individual cells [19]. This ability to effect accurate measurements in individual cells is absolutely crucial for the analysis of nonlinear dynamic systems.

Based on the model of Hoffmann and colleagues [134] (see also [143,144]), and using Gepasi [43,44], we have modelled the ‘downstream’ parts of this pathway (there are 64 reactions and 23 variables), successfully reproducing the main features of the oscillations observed experimentally in single cells (Fig. 4A and B), and have performed sensitivity analysis on the model [18]. The model itself is/will soon be available via the ‘triple-J’ website http://jjj.biochem.sun.ac.za/. Sensitivity analysis is a generalized form of MCA [30] that is arguably the starting point for the analysis of any model [36], and that is useful in many other domains (e.g. [145]). This sensitivity analysis showed that only about eight of the 64 reactions exerted any serious control over the timings and amplitudes of the oscillations in the nuclear NF-κB concentration [18], and that the nonlinearity of the model implied: (a) a differential control of the frequency and amplitude [18,19] of the first and subsequent oscillations; (b) that interactions between different elements of the model were synergistic [20] (Fig. 4D); and (c) most importantly, that it was not so much the concentration of nuclear NF-κB as its dynamics that was responsible for controlling downstream activities [19]. This leads to a profound emphasis on the role of ‘network motifs’ [21,146,147] as ‘downstream’ signal processing elements that can discriminate the dynamical properties of inputs that otherwise use the same components. Biological signalling is then best seen or understood as signal processing, a major field (mainly developed in areas such as data communications, image processing [148] and so on), in which we recognize that the structure, dynamics and performance of the receiver entirely determine which properties of the upstream signal are actually transduced into downstream (and here biological − see also [149]) events. The crucial point is that in the signal processing world these signals are separated and discriminated by their dynamical, time- and frequency-dependent properties. Normally we model enzyme kinetics on the basis of the effects of a static concentration of substrate or effector [81–84]. Thus, the irreversible Michaelis–Menten equation v = V_max·S/(S + K_m) includes only the ‘instantaneous’ concentration but not the dynamics of S. However, if detectors have frequency-sensitive properties, this allows one in principle to solve the ‘crosstalk problem’ (how do cells distinguish identical changes in the ‘static’ NF-κB concentration that might lead either to apoptosis or to proliferation, when these are in fact entirely orthogonal processes?). Although other factors can always contribute usefully (e.g. spatial segregation in microcompartments or ‘channelling’ [150–153], and/or further transcription factors that act as a logical AND, OR or NOT [154]), encoding effective signals in the frequency domain allows one to separate signals independently of their amplitudes (i.e. concentrations) while still using the same components.

Figure 4.

(A) A cartoon illustrating the characterization of oscillations in the nuclear NF-κB concentration, in terms of features such as amplitude (A1, etc.), time (T1, etc.), period (P1, etc.) and relative amplitude (RA1, etc.). (B) Time series output of a model [18,19] of the NF-κB pathway showing oscillations in the concentration of NF-κB in the nucleus (green) and of IKK (red). The model is pre-equilibrated then ‘started’ by adding IKK at 0.1 µM. As with many such systems, the mechanism underpinning the oscillations is a coupled transcription–translation system with delays. (C) Effect on IKK and on nuclear NF-κB of varying one rate constant (for reaction 28 in [18]) by two orders of magnitude either side of its basal value. Trajectories start from the right and follow fairly similar pathways for the first oscillation but then diverge considerably. (D) Synergistic effects of individual rate constants in the model [20]. The colour from red to blue shows increasing rate constant 9, while increasing symbol size reflects the increase in rate constant 52. For some values of the rate constants k9 and k52 there is no influence of either on the time to the first oscillation (T1). However, when k9 is low, increasing k52 increases T1, while when k9 is high the same increase in k52 decreases T1. Thus the effect of inhibiting a particular step can have qualitatively (directionally) different effects depending on the value of another step. This makes designing safe drugs aimed at targets in such pathways without understanding the system fully a challenging activity. This type of systemic nonlinearity can also account for the unexpected synergism often observed when different metabolic steps or drug targets are affected together, both in theory [349–352] and in practice [294,353,354].
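The full 64-reaction model is well beyond a code snippet, but the oscillation-generating core − a negative feedback loop operating through several sequential stages that behave as an effective delay − can be caricatured by a generic three-variable Goodwin-type oscillator. The equations, parameter values and variable roles below are illustrative assumptions and are not those of the published NF-κB model [18,134].

```python
# A caricature of an oscillatory negative feedback loop (a generic Goodwin-type
# oscillator), standing in for the delayed transcription-translation feedback that
# generates the NF-kB oscillations. Equations and parameters are illustrative
# assumptions, not those of the published 64-reaction model.
import numpy as np
from scipy.integrate import solve_ivp

n = 10                 # Hill coefficient of the repression (must exceed ~8 here)
k_syn, k_deg = 1.0, 0.1

def goodwin(t, state):
    m, p, r = state    # cartoon roles: mRNA, cytoplasmic protein, nuclear repressor
    dm = k_syn / (1 + r**n) - k_deg * m
    dp = m - k_deg * p
    dr = p - k_deg * r
    return [dm, dp, dr]

t_eval = np.linspace(0, 600, 6000)
sol = solve_ivp(goodwin, (0, 600), [0.1, 0.1, 0.1], t_eval=t_eval, method='LSODA')

r_late = sol.y[2][t_eval > 300]          # discard the initial transient
print(f"late-time 'nuclear' species oscillates between "
      f"min={r_late.min():.2f} and max={r_late.max():.2f}")
```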

At its most simplistic, one could imagine a structure (Fig. 5A) in which there was an input signal that could be filtered via a low-pass or high-pass filter before being passed downstream − a low-frequency signal would ‘go one way’ (i.e. be detected by only one ‘detector’ structure) and a high-frequency signal the other way. In this manner the same components can change their concentrations such that they may be at the same instantaneous levels while nevertheless having entirely different outcomes, solely because of the signal-processing (frequency response) characteristics of the detectors. Of course the real system and its signal-processing elements will be much more complex than this. We note that there is also precedent for the nonlinear and frequency-selective (bandpass) responses of individual multistate enzymes to exciting alternating electrical fields [155–159].

Figure 5.

The importance of signal dynamics and of downstream signal processing in affecting biological responses. (A) A simple system illustrating how two different frequency-selective filters can transduce different features of the identical signal into two different downstream signals, and hence two different biological responses or events. Such downstream responses might be processes as different as apoptosis and cell proliferation. (B) Simple resistor–capacitor (RC) electrical filters (above) can act as a delay line when they are concatenated in series (below). Every biological reaction can act as an RC element, and this may account in part for the use of such serial devices in biology.
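A toy numerical illustration of the idea in Fig. 5A (all time constants, frequencies and signals are assumed): the same input, the sum of a slow and a fast sinusoid, is passed through a first-order low-pass filter and its complementary high-pass filter, and the two ‘detectors’ end up responding predominantly to different frequency bands.

```python
# Toy illustration of frequency-selective 'detectors' (Fig. 5A): one first-order
# low-pass filter and its complementary high-pass filter applied to the same input,
# the sum of a slow and a fast sinusoid. All numbers are assumed for illustration.
import numpy as np

dt = 0.001
t = np.arange(0, 20, dt)
signal = np.sin(2 * np.pi * 0.05 * t) + np.sin(2 * np.pi * 5.0 * t)   # slow + fast

tau = 0.5                      # filter time constant (assumed)
low = np.zeros_like(signal)
for i in range(1, len(t)):     # dy/dt = (x - y)/tau  -> first-order low-pass
    low[i] = low[i-1] + dt * (signal[i-1] - low[i-1]) / tau
high = signal - low            # complementary high-pass output

def band_power(x, f_lo, f_hi):
    freqs = np.fft.rfftfreq(len(x), dt)
    power = np.abs(np.fft.rfft(x))**2
    return power[(freqs >= f_lo) & (freqs < f_hi)].sum()

for name, x in [('low-pass output', low), ('high-pass output', high)]:
    print(f"{name}: slow-band power={band_power(x, 0.01, 0.5):.3g}, "
          f"fast-band power={band_power(x, 2.0, 10.0):.3g}")
```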

While the recognition that electrical circuit (signal processing) elements and biological networks are fundamentally similar representations is not especially new [22,47,146,160–167], Alon [21,147,168,169], Arkin [146], Tyson [22] and Sauro and colleagues [167], among others [170], have made these ideas particularly explicit. Any element (Fig. 5B) in a metabolic or signal transduction pathway acts as a resistor–capacitor element [160] (as indeed do any ‘relaxing’ elements responding to an input, such as an alternating electrical signal [171]). A series of them acts as a delay line (Fig. 5B) [17]; see [172] or any other textbook of electrical filters, and in a biological context [173]. This ability to act as a delay element provides another possible ‘reason’, besides signal amplification, for the serial arrangements of kinases and kinase kinases (etc.) in signalling cascades, since amplification alone could (have evolved to) be effected simply by increasing the rate constants of a single kinase. Similarly, a suitably configured (‘coherent’) feedforward network serves to provide resistance to temporally small input perturbations (noise − or at least an amount of fluctuating/diffusing nutrient not worth chasing) whilst transducing longer-lasting ones of the same amplitude into output (biological effects) [174,175]. Other network structures − which like all such network structures effectively act as ‘computational’ or ‘signal processing’ elements − can exhibit robustness of their output(s) to sometimes extreme variations in parameters [22,165,176–187]. Indeed, the evolution of robustness is probably an inevitable consequence of the evolution of life in an environment that changes far more rapidly than does the genotype [179].
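To make the delay-line point of Fig. 5B concrete, the sketch below (stage count and time constant assumed) passes a step input through a chain of identical first-order, RC-like stages and reports when each successive stage reaches half-maximum; each added stage pushes the response later in time.

```python
# Toy sketch: a cascade of identical first-order (RC-like) stages acts as a delay line.
# A unit step is applied at t = 0 and the time for each successive stage to reach
# half-maximum is printed; stage count and time constant are assumed for illustration.
import numpy as np

n_stages, tau, dt = 6, 1.0, 0.001
t = np.arange(0, 30, dt)
x = np.ones_like(t)                       # unit step input
stages = np.zeros((n_stages, len(t)))

for i in range(1, len(t)):
    upstream = x[i-1]
    for s in range(n_stages):
        # each stage relaxes towards its upstream signal: dy/dt = (upstream - y)/tau
        stages[s, i] = stages[s, i-1] + dt * (upstream - stages[s, i-1]) / tau
        upstream = stages[s, i-1]

for s in range(n_stages):
    t_half = t[np.argmax(stages[s] >= 0.5)]
    print(f"stage {s+1}: reaches half-maximum at t = {t_half:.2f}")
```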

Thus the recognition that we need to concentrate more on the dynamics of signalling pathways than on the instantaneous concentrations of their components means that we need to sample very frequently − preferably effectively in real time − and to use single-cell measurements, so as to avoid oscillations and other more complex and functionally important dynamics being hidden by the combination of signals from individual, out-of-phase cells. It also means that assays for signalling activity, for instance in drug development, should not focus just on the signalling molecules themselves but on the structures that the cell uses to detect them.

A forward look

By concentrating on a restricted subset of issues within the confines of a single lecture, many topics had to be treated only superficially or implicitly, and it is appropriate to set down in slightly more detail some of the directions in which I think progress is required, important or likely.

Data standards and integration

The first is the need to integrate SBML (and other [188]) biochemical models and model representations into postgenomic databases with schemas such as those for genomics (e.g. GIMS [189]), transcriptomics (e.g. MAGE-ML [190]), protein interactions [191], proteomics (e.g. PEDRo [192] and PSI [193,194]) and metabolomics (e.g. ArMet [195] and SMRS [196]). Progress is being made (e.g. [197]), but significant problems remain before the considerable benefits [198] of extensible markup languages can be fully realized [199], and before well-structured ontologies (http://suo.ieee.org/) become the norm [200].

In a related manner, there are many things one might wish to do with an SBML or other biochemical model, including creating it, storing it, editing it, comparing it with other stored models, finding it again in a principled way, visualizing it, sharing it, running it, analysing the results of the run, comparing them with experimental data, finding models that can create a given set of data, and so on. No individual piece of software allows one to do all of these things well or even at all (for a starting point see http://dbk.ch.umist.ac.uk/sysbio.htm#links). However, plan A (start from scratch and write the software that one wished existed) would require an enormous and coherent effort involving many person-years. Consequently we are attracted by plan B. This is to create a software environment in which individual software elements appear to – and indeed do − work together transparently [201], such that ‘only’ the software ‘glue’ needs to be written, somewhat in the spirit of the Systems Biology Workbench [202] or of software Application Programming Interfaces more generally. Distributed environments using systems such as Taverna [203] or others [204–206] to enact the necessary bioinformatic workflows may well provide the best way forward, and since the difficulties of interoperability seem in fact to be much more about data structures (syntax) than about their meaning (semantics) [207], this task may turn out to be considerably easier than might have been anticipated.

Synthetic biology

Another emerging and important area is becoming known as ‘synthetic biology’[208–213] (a portal for this can be found at http://www.syntheticbiology.org/). Although this has a variety of subthreads [213], an ‘engineering’-based motivation [214–216] is the one which I regard as paramount. Here one seeks, somewhat in the manner of the ‘network motifs’ mentioned above, to develop principled strategies for determining the kind of networks and computational structures in biology that can effect specific metabolic or signal processing acts or behaviours, and to combine them effectively. Ultimately, as a refined and improved strategy for metabolic engineering [30,78,217–223] one may hope that this will give sufficient understanding to allow one to design these and more complex bioprocesses (and the organisms that perform them). Similar comments apply to the de novo design, synthesis and engineering of proteins [224–234] (where there is already progress with building blocks or elements such as foldamers [235–238]), initially as a complement to effective but more empirical strategies based on the directed evolution and selection of both proteins (e.g. [239–252]), and nucleic acid aptamers (e.g. [253–274]).

Chemical genetics and chemical genomics

The modulation by small molecules of biological activities has proven to be of immense value historically in the dissection of biological pathways (e.g. in oxidative phosphorylation [275,276]). Chemical genetics or chemical genomics (e.g. [277–292]) describes an integrated strategy for manipulating biological function using small molecules (the integration aspect specifically including cell biology-based assays and the databases necessary to systematize the knowledge and from which quantitative structure–activity relationships may be discerned [293]). This chemical manipulation is considered to be more discriminating than strategies based on knocking out genes or gene products using the methods of molecular biology, since small molecules can be selective towards individual activities that may be among several catalysed by specific gene products. Also, chemical genetics can be used to study multiple effects when the small molecules are added both singly and in combination [294], and such studies − involving only the addition of small molecules − can be performed with far more facility than those requiring complex and serial molecular biological manipulations. As with ‘biological’ genetics, it is usual to discriminate ‘forward’ and ‘reverse’ chemical genetics. In ‘forward’ chemical genetics, the logic goes: screen a library → find cellular or physiological activity → discover molecular target [295], this being somewhat akin to the ‘traditional’ (pregenomic) drug discovery process in the pharmaceutical industry. In ‘reverse’ chemical genetics we start with a purified target, then use the chemical library to look for binding activity and then test in vivo to see the physiological effects, much as is done (with decreasing success) in the more recent approaches preferred by Pharma. While these strategies should best be seen as iterative (Fig. 6), we would have some preference for the ‘forward’ chemical genetic approach as the hypothesis-generating arm.

Figure 6.

Chemical genomics as an iterative process in which molecules are screened for effects and their targets identified, thereby allowing the development of mechanistic links between individual targets and (patho-)physiological processes.

Text mining

With the scientific literature expanding by several thousand papers per week, it is obvious that no individual can read them all, and there is in addition a large historical database of facts that could be useful to systems biology. Text mining is an emerging field concerned with the process of discovering and extracting knowledge from unstructured textual data, in contrast to data mining (e.g. [296,297]), which discovers knowledge from structured data. Text mining comprises three major activities: information retrieval, to gather relevant texts; information extraction, to identify and extract a range of specific types of information from texts of interest; and data mining, to find associations among the pieces of information extracted from many different texts [298]. As phrased therein, ‘…hypothesis generation relies on background knowledge, and is crucial in scientific discovery’; the pioneering work by Swanson on hypothesis generation [299] is mainly credited with sparking interest in text mining techniques in biology. Text mining aids in the construction of hypotheses from associations derived from vast amounts of text, hypotheses that are then subjected to experimental validation by experts.
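A minimal sketch of the Swanson-style ‘ABC’ association step at the data-mining end of such a pipeline is given below; it echoes Swanson's original fish-oil/Raynaud's example, but the toy ‘abstracts’ and vocabulary are invented for illustration. Term A and term C are never mentioned together, yet are linked through shared intermediate (B) terms.

```python
# Minimal sketch of Swanson-style 'ABC' literature-based hypothesis generation:
# find intermediate terms (B) that co-occur with a start term (A) in some texts
# and with a target term (C) in others, even though A and C never co-occur directly.
# The 'abstracts' and vocabulary below are invented for illustration.
from collections import defaultdict

abstracts = [
    "dietary fish oil lowers blood viscosity in patients",
    "fish oil reduces platelet aggregation and vascular reactivity",
    "raynaud disease is associated with high blood viscosity",
    "platelet aggregation is increased in raynaud disease",
    "unrelated abstract about yeast metabolomics and GC-MS peak detection",
]
term_a, term_c = "fish oil", "raynaud disease"

vocabulary = ["fish oil", "blood viscosity", "platelet aggregation",
              "vascular reactivity", "raynaud disease", "yeast"]

def cooccurring_terms(anchor, texts):
    hits = defaultdict(int)
    for text in texts:
        if anchor in text:
            for term in vocabulary:
                if term != anchor and term in text:
                    hits[term] += 1
    return hits

b_from_a = cooccurring_terms(term_a, abstracts)
b_from_c = cooccurring_terms(term_c, abstracts)

bridges = set(b_from_a) & set(b_from_c)
direct = any(term_a in t and term_c in t for t in abstracts)
print(f"A and C co-mentioned directly: {direct}")
print(f"candidate bridging (B) terms linking '{term_a}' to '{term_c}': {sorted(bridges)}")
```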

Some portals are at http://www.ccs.neu.edu/home/futrelle/bionlp/ and http://www.cs.technion.ac.il/~gabr/resources/resources.html, and a national (UK) centre devoted to the subject is described at http://www.nactem.ac.uk. Although these are early days (e.g. [300–308]), we may one day dream of a system that will read the literature for us and produce and parameterize (with linkages, equations and parameters like rate constants) candidate models of chosen parts of biological systems.

Single cell and single molecule biology

Given the heterogeneity of almost all biological systems, and thus for reasons given above the importance of single cell studies, it is evident that we need to develop improved methods for measuring omics in individual cells, preferably noninvasively and in vivo. Buoyed by experience with the fluorescent proteins [309], and indeed with the more recent antibody-based proteomics [310] (http://www.proteinatlas.org/), it is evident that optical methods are among the most promising here, with detectors for specific metabolites [311] and transcripts (http://www.nanostring.com/) (see also [312]) that can be used in individual cells coming forward as part of the development of Bionanotechnology [313].

What is true about the heterogeneity of single cells [141,142] is also true for that of single molecules [314,315], and many assays capable of detecting the presence or behaviour of single molecules are coming forward. Thus, high-throughput screens for ligand binding [316,317] and for nucleic acid sequences [318–320] are now being performed using assays based on miniaturization and single-molecule measurements, bringing the $1000 human genome well within sight (although amplification techniques can of course also be used to advantage in nucleic acid sequencing [321,322]).

The Manchester Interdisciplinary Biocentre (MIB)

Many of the kinds of problems described above, and certainly the solutions being developed to attack them, require the input of ideas and techniques, and scientific cultures, from the physical sciences, engineering, mathematics and computer science. One solution, that we are adopting in the Manchester Interdisciplinary Biocentre (MIB: http://www.mib.ac.uk/, Fig. 7) and the Manchester Centre for Integrative Systems Biology (MCISB: http://www.mcisb.org/), is to colocate individuals with the necessary combinations of skills. Within MCISB we are seeking to develop the suite of techniques for the largely ‘bottom up’ systems biology strategies set down in Fig. 2B.

Figure 7.

The Manchester Interdisciplinary Biocentre, a physical building and intellectual environment that brings together workers from a variety of Schools at the University of Manchester focussing on Engineering and Physical Sciences, including mathematics and computing (≈60%), with those from biology and medicine (40%).

Emergence and a true systems biology

The grand problem of biology − like the related ‘inverse problem’ (Fig. 2D) of determining parametric causes from measured effects (variables) − is to understand at a lower level the time-dependent [323,324] changes of state that are commonly described at a higher level of organization, an issue often referred to using terms such as ‘self-organization’ [325], ‘emergence’ [326–328], networks [329,330] and complexity [161,165,331–333]. Modelling and sensitivity analysis (see above) can begin to deconstruct such relations, but it is in areas such as ‘causal inference’ [334–337] that we shall probably see the most focussed development of principled explanations of such causal linkages.

Coda

Having begun with a couple of quotations, and having stressed the role of technology development in science in general and in systems biology in particular, I shall end with another quotation, from the Nobelist Robert Laughlin [338]:

In physics, correct perceptions differ from mistaken ones in that they get clearer when the experimental accuracy is improved. This simple idea captures the essence of the physicist's mind and explains why they are always so obsessed with mathematics and numbers: through precision one exposes falsehood…A subtle but inevitable consequence of this attitude is that truth and measurement technology are inextricably linked.

Acknowledgements

In addition to the huge contributions of the past and present members of my research group I have enjoyed many friendships and scientific collaborations with numerous colleagues, who are listed as coauthors in the references, but I would especially like to mention Steve Oliver, Hans Westerhoff and Mike White. I also thank the BBSRC, BHF, EPSRC, MRC, NERC and the RSC for financial support.
