Structure, location and interactions of G-quadruplexes
J. L. Huppert, Cavendish Laboratory, University of Cambridge, JJ Thomson Ave, Cambridge CB3 0HE, UK
Fax: +44 1223 337000
Tel: +44 1223 337256
Four-stranded G-rich DNA structures called G-quadruplexes have been the subject of increasing interest recently. Experimental and computational techniques have been used to implicate them in important biological processes such as transcription and translation. In this minireview, I discuss how they form, what structures they adopt and with what stability. I then discuss the computational approaches used to predict them on a genomic scale and how the information derived can be combined with experiments to understand their biological functions. Other minireviews in this series deal with G-quadruplex nucleic acids and human disease [Wu Y & Brosh RM Jr (2010) FEBS J] and making sense of G-quadruplex and i-motif function in oncogene promoters [Brooks TA et al. (2010) FEBS J].
Abbreviation: FRET, fluorescence resonance energy transfer.
Nucleic acids can form a very wide range of different structures, aside from the well-known DNA double helix. This double helix is highly unusual in that the structure is largely independent of the sequence. For other structures, what form they adopt, and how stably, is controlled by their sequences, and in particular the different chemical properties of the nucleobases.
It was noted in 1910 that guanine behaved differently from all other nucleobases, in that it could spontaneously form a gel. It then took over 50 years for the structure responsible to be discovered. The core consists of square arrangements of four guanines, held together by two hydrogen bonds along each side of the square, with a monovalent cation (preferably K+) in the centre. These squares, known as G-tetrads, can then stack on each other to form higher order structures called G-quadruplexes, typically with the cations now in the interstices, each interacting with eight guanines [3,4].
Although these structures can form from individual guanine bases, in a biological context few free bases are available and the structures instead form from DNA (or RNA) sequences, with the bases held together by the backbone. These structures vary in their molecularity and may be tetramolecular, with one guanine in each square coming from a particular strand, bimolecular or unimolecular. In these latter two cases, there are loops connecting different runs of guanine and these loops play a very important role in controlling the details of the structure and stability of the resulting G-quadruplex.
For a while, G-quadruplexes were simply a structural curiosity, but recently it has become clear that they play important physiological roles. They are found in telomeres and have been implicated in regulating transcription, translation and replication. This interest has been reflected in the rate of publication, with an exponential growth in the number of articles mentioning the term G-quadruplex (or G-tetraplex, an alternative name) over the past decade.
In this minireview, I describe the detailed structure of various G-quadruplexes and the computational tools that may be used to predict their formation and stability. I then describe where they are found in the genomes of various organisms and how such information allows us to predict their functionality.
G-quadruplex structures may be considered to comprise a core G-rich component, consisting of G-tetrads stacked on top of each other, and zero or more connecting loops, which may be of variable composition.
The G-rich core typically consists of two or more stacked G-tetrads with a right-handed helical twist. The stacks are joined together by the normal sugar–phosphate backbone. The binding energy arises from three main factors: hydrogen-bonding between the guanines in a plane, π–π interactions between the guanines in adjacent planes and charge–charge interactions between the partially negative O6 of the guanines and cations that typically sit in the octahedral position between the stacks. Monovalent cations, especially K+, are particularly stabilizing. Replacing any of the core guanines with a non-G base is highly destabilizing and such mutant sequences are unlikely to form G-quadruplexes in vivo.
This G-core can make up the major part of the resulting structure, as found in short sequences such as d(TGGGGT)4, which tetramerizes to form a four-stacked G-quadruplex with trailing ends. Another example is G-wires, very long polymeric sequences of continuously stacked G-tetrads. These are currently being investigated for their interesting electronic properties.
In biological contexts, however, multimerization is relatively unlikely and most interest has focused on unimolecular G-quadruplexes. These require at least four runs of guanine to be joined together by (at least) three loops. The loops can vary in length and sequence, which controls the topology of the final structure and is related to the direction of the G-rich strands making up the core.
Loops may link positions on the top (or bottom) of the stacks, forming diagonal or lateral loops depending on which guanines are being linked. Alternatively, they may link a guanine on the top of the stack to a guanine on the bottom, resulting in a double-chain reversal loop. Further details and images are available in Phan.
The nature of the loops is also related to the directionality of the four G-rich runs making up the core of the structure. These may be parallel (requiring double-chain reversal loops), antiparallel (with two strands running in each direction and either lateral or diagonal loops) or a mixed ‘3 + 1’ hybrid, with three strands in one direction and one in the other, and a mixture of loop types. Which of these structures is preferred depends on the sequence and length of the loops; in general, shorter loops favour parallel structures.
There are also more exotic ways of arranging the loops, as found in one G-quadruplex taken from the promoter of the c-kit oncogene, whose structure involves various internal loops.
Given this complexity, there are clearly many possible structures for most G-quadruplex-forming sequences and experimental evidence suggests that in many cases they are of very similar energies. Hence, many of these structures seem to exist in vitro as a set of polymorphs and small changes in the conditions can favour one or more different detailed structures. As an example, the human telomeric repeat (see below) has been shown to form parallel, antiparallel and hybrid structures under subtly different conditions. This polymorphism is a major challenge both for structure determination and for targeting, whether pharmaceutical or natural.
Telomeres in organisms
All organisms with linear chromosomes have to mark out the ends of their chromosomes in order to distinguish them from unwanted double-strand breaks. In all species studied to date, this is achieved with repetitive sequences of variable lengths characterized by runs of guanines on one strand. In humans, there are typically around 1000 repeats of the sequence d(GGGTTA) in duplex form, followed by a single-stranded overhang of hundreds of bases. All vertebrates have the same sequence and most other organisms have very slight variants on this basic sequence, of the form d(G2–4T1–4A0–1), where the numbers are subscripts giving the allowed number of copies of each base. All of these have been shown to form stable G-quadruplexes under appropriate in vitro conditions.
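The repeat family described above lends itself to a simple pattern check. The following Python sketch encodes d(G2–4T1–4A0–1) as a regular expression and tests candidate repeat units against it. The function name `fits_telomeric_family` is hypothetical, and the sketch assumes each repeat unit is written G-first, as in d(GGGTTA); the two ciliate-style repeats in the usage lines are included only as illustrations of that phasing.

```python
import re

# d(G2-4 T1-4 A0-1): two to four G, then one to four T, then an optional A.
# Assumption: the repeat unit is phased so that it starts with the G-run.
TELOMERIC_UNIT = re.compile(r"G{2,4}T{1,4}A?")


def fits_telomeric_family(repeat_unit: str) -> bool:
    """Return True if a single repeat unit matches the G2-4 T1-4 A0-1 form."""
    return TELOMERIC_UNIT.fullmatch(repeat_unit.upper()) is not None


if __name__ == "__main__":
    for unit in ("GGGTTA", "GGGGTT", "GGGGTTTT", "GATTACA"):
        print(unit, fits_telomeric_family(unit))
```

A repeat that is conventionally written in another phase (for example, T-first) would need to be rotated before it is tested.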
This remarkable property, suggesting that all linear chromosomes have G-quadruplex caps, implies strongly that there is a function to these structures. Because telomeric regions are generally capped by a wide variety of proteins, it is not entirely clear exactly what form the DNA adopts in vivo [19,20]. By developing an antibody that bound specifically to parallel G-quadruplexes, Plückthun and co-workers demonstrated that G-quadruplexes form in vivo in the ciliate Stylonychia lemnae. Although this was an elegant experiment, it has not been replicated for other organisms and it is possible that G-quadruplex formation is induced by the presence of the antibody, although siRNA experiments suggest that this is not the case. Nonetheless, it is hard to see how G-quadruplexes could fail to play a critical role in telomere structure and function.
An active field of research at the moment is to investigate whether G-quadruplexes can form in locations in the genome other than just the telomeres, and what other functions they may have. In order to do this, however, it is necessary to have a way of predicting G-quadruplex formation for other sequences [22–24].
The simplest approach is to identify a G-quadruplex-like sequence in a region of interest and then to study it in vitro. This approach was taken in the classic work by Hurley on the c-myc oncogene [25,26]. The NHE III1 domain of the c-myc promoter was known to be a transcriptional repressor; Hurley proposed and then demonstrated that it could form a G-quadruplex and suggested that this could be responsible for the repressor activity.
However, such identifications do not lend themselves to widespread target discovery and it is necessary to develop some kind of predictive algorithm to identify possible sequences. We and others [28,29] have developed a variety of different predictive tools, which are reviewed elsewhere [22,23]. Our tool, quadparser, is freely available online at http://www.quadruplex.org/?view=quadparser and identifies sequences of the form G3+N1–7G3+N1–7G3+N1–7G3+, that is, four runs of at least three guanines separated by three loops of one to seven nucleotides each.
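As an illustration of this kind of pattern matching, the folding rule can be written as a regular expression. The following is a minimal sketch in Python, not the quadparser implementation: the function name `find_putative_quadruplexes` is hypothetical, only the strand supplied is scanned, and matches are reported without overlap, which is an assumption about how overlapping candidates should be treated.

```python
import re

# Four runs of >= 3 guanines separated by three loops of 1-7 nucleotides.
# Loops may contain any base (including G), so the regex engine decides
# where each G-run ends by backtracking.
PQS_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")


def find_putative_quadruplexes(sequence: str):
    """Return (start, end, matched_sequence) for each non-overlapping match.

    Coordinates are 0-based and end-exclusive, as in Python slicing.
    """
    sequence = sequence.upper()
    return [(m.start(), m.end(), m.group())
            for m in PQS_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # Four human telomeric G-runs embedded in flanking bases.
    example = "ATGGGTTAGGGTTAGGGTTAGGGCA"
    for start, end, motif in find_putative_quadruplexes(example):
        print(start, end, motif)
```

To scan the complementary strand as well, the same search can be repeated on the reverse complement, or equivalently with the guanines in the pattern replaced by cytosines.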
This and other tools are based on a series of biophysical experiments that have been performed, studying the thermodynamic equilibrium of G-rich oligonucleotides between a G-quadruplex form and an unstructured single strand. These experiments are typically performed using either ultraviolet melting or fluorescence melting and have led to much information associating stability with particular sequences.
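The equilibrium underlying such melting experiments can be illustrated with the standard two-state model for a unimolecular transition, in which the folded fraction depends on temperature through ΔG = ΔH − TΔS and the melting temperature is the point at which half the molecules are folded. The Python sketch below assumes this textbook model; the function names and the numerical values in the usage lines are hypothetical and are not taken from any measurement discussed here.

```python
import math

R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def fraction_folded(temp_k: float, delta_h: float, delta_s: float) -> float:
    """Folded fraction for a two-state, unimolecular folding equilibrium.

    delta_h (kJ/mol) and delta_s (kJ/(mol K)) refer to the folding reaction
    (unstructured strand -> G-quadruplex), so both are negative for a
    structure that melts on heating.
    """
    delta_g = delta_h - temp_k * delta_s
    k_fold = math.exp(-delta_g / (R * temp_k))
    return k_fold / (1.0 + k_fold)


def melting_temperature(delta_h: float, delta_s: float) -> float:
    """Temperature (K) at which delta_g = 0, i.e. half the strands are folded."""
    return delta_h / delta_s


if __name__ == "__main__":
    # Illustrative values only: choose delta_s so that Tm is 340 K.
    dh = -200.0
    ds = dh / 340.0
    for t in (310.0, 340.0, 370.0):
        print(t, round(fraction_folded(t, dh, ds), 4))
```

Fitting an experimental melting curve to this expression is one way to extract ΔH and ΔS as well as the melting temperature, provided the transition really is two-state.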
However, there are a number of problems with the results so far. First, our ability to accurately predict structure or stability for a novel sequence is very limited. We have recently taken a Bayesian inference approach to come up with the first evidence-based predictor for stability. This method uses Bayesian calculations to learn from experimentally determined data and allows predictions of thermal stability for new input sequences under various conditions. However, further data are required for it to be more accurate and it currently only predicts the melting temperature rather than more detailed thermodynamic parameters. It is freely available for use online at http://www.quadruplex.org/?view=quadpredict.
Second, the experiments largely describe only the folding of a single strand and say little about the effects of having the complementary strand present. Experiments that have been done show that, as expected, the complementary strand tends to favour the formation of duplex DNA, but it is not clear for which sequences G-quadruplexes will still form and with what stability compared with the duplex. Elegant experiments using FRET in plasmids show that G-quadruplexes can exist in a duplex context, but it is not yet possible to generalize this result. Interestingly, the complementary strand of a G-quadruplex may form an alternative C-rich structure called an i-motif, which contains hemi-protonated C≡C+ base pairs.
Lastly, all the experiments are performed in vitro, and so omit many factors that would be present in vivo. These factors include the many proteins that interact with G-quadruplexes by stabilizing, destabilizing (e.g. acting as helicases) or cleaving them. In vitro experiments also neglect the presence of nucleosomes, which stably bind duplex DNA. Supercoiling is likewise neglected, although experimental evidence for both c-kit and c-myc shows that it can play a significant role in promoting G-quadruplex activity. Finally, in vitro experiments are generally performed in dilute aqueous solution, although it has been repeatedly shown that molecular crowding (as found in vivo) can induce G-quadruplex formation [40,41].
Despite these limitations, much work has been performed developing and refining these predictors, which allows broad-scale studies of the presence of G-quadruplex-forming units in many organisms’ genomes. In humans, for example, there are ∼ 375 000 possible quadruplex-forming sequences, although it is unlikely that all of these would in fact form in vivo [27,28]. Nonetheless, it is interesting to note that this is significantly below the number that would be expected by chance, suggesting an evolutionary pressure to reduce the number.
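One simple way to make the comparison with chance concrete is to count pattern matches in a sequence and in shuffled versions of the same sequence. The Python sketch below does this with a mononucleotide shuffle; it is an illustrative null model only, chosen as an assumption, and is not the calculation behind the figures quoted above. The names `count_pqs` and `shuffled_mean_count` are hypothetical, and the pattern is the G3+N1–7 rule used by quadparser, scanned on one strand.

```python
import random
import re

# Four runs of >= 3 guanines separated by loops of 1-7 nucleotides.
PQS_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")


def count_pqs(sequence: str) -> int:
    """Count non-overlapping matches of the pattern on the given strand."""
    return sum(1 for _ in PQS_PATTERN.finditer(sequence.upper()))


def shuffled_mean_count(sequence: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Mean match count over shuffles that preserve base composition.

    Assumption: a mononucleotide shuffle is an adequate null model. It ignores
    dinucleotide biases and regional variation in composition.
    """
    rng = random.Random(seed)
    bases = list(sequence.upper())
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        total += count_pqs("".join(bases))
    return total / n_shuffles


if __name__ == "__main__":
    seq = "GGGTTAGGGTTAGGGTTAGGG" + "AT" * 50
    print("observed:", count_pqs(seq))
    print("mean over shuffles:", shuffled_mean_count(seq))
```

For a genome-scale comparison the same idea would be applied in windows, and a null model that preserves dinucleotide frequencies would usually be a safer choice.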
Predicted G-quadruplexes are not located randomly throughout the genomes and tend to cluster together in particular regions. Coupled with the depletion across the whole genome, this is highly suggestive of evolved functionality.
Aside from telomeres, one of the first regions considered for the presence of G-quadruplexes was gene promoters. Hurley’s work on c-myc established the principle that such sequences could regulate gene transcription [25,26] and other examples, such as c-kit [42,43], were then found. These are discussed in more detail in one of the other minireviews in this series.
Computational analyses of the entire human genome showed that G-quadruplexes were in fact very likely to be found just upstream of gene transcription start sites (TSS). Indeed, almost half of all known genes have a putative G-quadruplex in their promoter in a position where it could be involved in gene regulation. These G-quadruplexes tend to be more thermodynamically stable than is typical.
These genes are not random in terms of their functions. Oncogenes are more likely, and tumour suppressors less likely, to contain G-rich sequences. Detailed Gene Ontology (GO) code analysis showed in general that genes that are involved in regulation (e.g. transcription factors) are more likely to have promoter G-quadruplexes, whereas ‘housekeeping’ genes (e.g. those involved in protein biosynthesis) tend to be depleted in G-quadruplexes. Interestingly, genes involved with olfaction are extremely depleted, raising questions as to their evolutionary history and how they are regulated.
Analyses of a wide variety of other organisms have given similar results for other vertebrates and comparable results for other eukaryotes and prokaryotes.
RNA is also capable of forming G-quadruplex structures and indeed forms them even more stably than does DNA. In addition, whereas DNA in biological systems is typically double-stranded, RNA is typically single-stranded, and so G-quadruplex formation does not have to compete with duplex formation and is hence more likely. However, a wide range of alternative structures is possible in addition to G-quadruplexes.
The question therefore arises as to whether RNA G-quadruplexes could play a role in translation regulation, in a manner analogous to DNA G-quadruplexes and transcription. We identified a predicted G-quadruplex in the 5′-UTR of the N-ras oncogene, and demonstrated that it could form a G-quadruplex in vitro, and that the formation of the G-quadruplex led to a significant (fourfold) reduction in the amount of protein produced for a given amount of mRNA. This work has since been developed by others, who have investigated how the position of the G-quadruplex influences the size of the effect and demonstrated that this mechanism applies in vivo in prokaryotes and eukaryotes [49,50].
Although the algorithms used to predict G-quadruplexes have been created based on data from DNA experiments, it is still possible to adapt them to give approximate predictions of RNA G-quadruplexes, although considerable further work is needed to bring the accuracy of such predictions up to the same level as for DNA. Nonetheless, the tools available can give a clear indication of the location of G-quadruplex-forming sequences in RNA.
The first thing that is clear is that there is a distinct asymmetry between the coding strand and the template strand: there are relatively few G-quadruplex motifs in the coding strand, suggesting that they have been disfavoured in general. However, those that still exist do cluster at the 5′-end of the 5′-UTR of many thousands of genes. As with promoter G-quadruplexes, there is a clear selectivity in terms of gene functions. Interestingly, there is a localized concentration of G-quadruplexes immediately after the transcription end site of some genes, particularly those with additional genes immediately downstream. These are regions that should never be transcribed, and have been associated with regulation of transcription termination via a pausing mechanism. Studies using an antibody evolved to specifically bind G-quadruplexes showed that gene expression changed significantly for genes with G-quadruplexes in their promoters, in the 5′-UTR or around the 3′-UTR, suggesting a widespread range of functions and a complex response.
Duplex DNA in cells is not naked, but is stored as nucleosomes, wrapped around histone proteins. These structures would be expected to have the effect of stabilizing the duplex form and preventing the formation of G-quadruplexes or any other secondary structures. However, nucleosomes are not present along the entire genome and there are gaps between nucleosomes, of a size large enough to allow G-quadruplex formation. Notably, the promoter region immediately upstream of genes, where G-quadruplexes are especially common, is denuded of nucleosomes. This inverse association holds true across the human genome, with stable G-quadruplex sequences located in general in the gaps between nucleosomes. This is consistent with them having a function derived from their structure, either as a result of evolutionary pressure to move them into gaps, or because the formation of G-quadruplex structures prevents nucleosomes from forming in those locations.
As previously discussed, the presence of a complementary strand of DNA disfavours G-quadruplex formation and so we may expect that G-quadruplex-based functionality may be more likely when DNA is single-stranded. Such conditions are found as standard in telomeres, where there is a single-stranded overhang, and during replication, when the strands must be separated; G-quadruplex formation has been proposed for both of these. One other, rarer, occasion when DNA is single-stranded is a G-loop, the product of transcription of certain AG-rich sequences where the stability of the normally transient RNA/DNA hybrid outweighs the DNA duplex, resulting in the formation of a loop with the complementary DNA strand being unbound. These structures have been shown to form in immunoglobulin class switch regions, and may occur elsewhere. G-loop formation in plasmids has been shown directly and indirectly to lead to G-quadruplex formation and this may form part of the switching mechanism [54,56].
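The strand asymmetry described here can be measured by scanning a single sequence for both the G-rich pattern and its C-rich counterpart, because a C-rich match on one strand corresponds to a G-rich match on the complementary strand. The Python sketch below assumes the input is the coding-strand sequence of a gene region and uses the G3+N1–7 rule as the pattern; the function name `strand_counts` is hypothetical.

```python
import re

# G-rich pattern on the strand supplied, and the C-rich pattern that marks
# a G-rich motif on the complementary strand.
G_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")
C_PATTERN = re.compile(r"(?:C{3,}[ACGT]{1,7}){3}C{3,}")


def strand_counts(coding_sequence: str):
    """Return (coding_strand_count, template_strand_count) of putative motifs.

    Assumption: the input is the coding strand, so G-rich matches would also
    appear in the transcribed RNA, whereas C-rich matches indicate motifs on
    the template strand only.
    """
    seq = coding_sequence.upper()
    coding = sum(1 for _ in G_PATTERN.finditer(seq))
    template = sum(1 for _ in C_PATTERN.finditer(seq))
    return coding, template


if __name__ == "__main__":
    print(strand_counts("AAGGGTTAGGGTTAGGGTTAGGGAA"))
    print(strand_counts("AACCCTAACCCTAACCCTAACCCAA"))
```

Summing these two counts over all annotated transcripts would give a simple estimate of the coding-to-template ratio.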
G-quadruplex-forming sequences have been proposed to play a wide range of physiological roles, including in all the processes of the Central Dogma. Experimental studies have confirmed these functions in a number of specific genes and computational methods have given statistical evidence that these structures may be widespread in the genomes of many organisms. However, further evidence is required to demonstrate clearly exactly how many of these possible G-quadruplexes do actually form in vivo and to clarify many of the details of the mechanisms involved.
In general, G-quadruplexes seem to be disfavoured and most organisms are depleted in sequences that could form them. However, some specific locations and specific G-quadruplexes appear to be highly enriched, presumably as a result of evolutionary pressures. Given the stability of the G-quadruplex compared with the duplex, it seems unlikely that they form permanently (except at telomeres or in RNA); rather, they are likely to be a relatively minor component at equilibrium. They therefore function as switches, forming in response to stimuli such as supercoiling changes. Once formed, they are sufficiently metastable to be long-lived, especially with protein stabilization, and can therefore control other processes, whether by acting as steric blocks, occluding other active DNA sites or recruiting proteins that bind them. Once the trigger is removed, helicase activity (for example) can then return them to their original duplex state, ready for retriggering. At any given time, the number of folded G-quadruplexes in a given cell may be extremely low, although many may form during the lifetime of the cell.
This field is still relatively new, but has come a long way in terms of establishing hypotheses and providing proof of concept and statistical evidence for many of them. Over the next few years, it will become essential to investigate more of the details, and to establish exactly how important G-quadruplexes are in vivo. I am sure the field will rise to the challenge and continue to be an exciting place to work.
JLH is a Research Councils UK Academic Fellow and Member of Parliament for Cambridge. Caroline Wright is thanked for helpful discussions and Jaime Gomez Marquez is thanked for extreme patience.