Sequencing the Human Genome: Novel Insights into its Structure and Function

  1. Hildegard Kehrer-Sawatzki1,
  2. David N Cooper2

Published Online: 15 JUL 2008

DOI: 10.1002/9780470015902.a0001899.pub2



How to Cite

Kehrer-Sawatzki, H. and Cooper, D. N. 2008. Sequencing the Human Genome: Novel Insights into its Structure and Function. eLS.

Author Information

  1. University of Ulm, Institute of Human Genetics, Ulm, Germany
  2. Cardiff University, Institute of Medical Genetics, Cardiff, UK



  1. Top of page
  2. Introduction
  3. Key Findings Made by the Human Genome Project and their Impact on our Understanding of the Structure and Function of the Human Genome
  4. Conclusion
  5. References
  6. Further Reading

The Human Genome Project (HGP) was launched in 1990 with the goal of sequencing the entire human genome. This project was undertaken as a collaborative venture by some 20 groups from the United States, the United Kingdom, Japan, France, Germany and China. In 2001, the results of this huge effort, entitled ‘Initial sequencing and analysis of the human genome’, were published in Nature under the banner of the International Human Genome Sequencing Consortium (IHGSC) (Lander et al., 2001). This initial draft sequence covered approximately 90% of the human genome with a redundancy of 4- to 5-fold. In the same year, Celera Genomics also reported a draft sequence of the human genome (Venter et al., 2001). While the version of the human genome sequence produced by the IHGSC was derived from the sequencing of chromosomally mapped and ordered clones, the genome sequence published by Celera Genomics was obtained by random whole-genome shotgun sequencing. The IHGSC assembly represented a composite derived from the genomes of numerous donors, whereas the Celera version of the genome was a consensus sequence derived from only five individuals.

Both draft sequences had major shortcomings, namely >100 000 sequence gaps and incomplete coverage of the euchromatic portions of the genome. The next steps taken towards the ‘finishing’ of the human genome sequence served to increase the coverage of the euchromatic regions (containing the vast majority of the genes) and to close the gaps between contigs. These efforts, which included the integration of several reported clone contigs (McPherson et al., 2001) as well as the Celera scaffolds (Venter et al., 2001), culminated in the ‘finishing’ of the human genome sequence such that it covered more than 95% of its euchromatic portion. This was quite a challenging task because the human genome contains a multitude of dispersed repeats and large segmental duplications which greatly complicate the determination of both its structure and sequence. In 2004, the IHGSC published an improved version of the human genome sequence, which converted it from a draft into a nearly complete genome sequence with a high degree of accuracy. This version of the human genome (Build 35) contained 2.85 billion nucleotides (2850 Mb) interrupted by only 341 gaps, providing coverage of approximately 99% of the euchromatic genome with an error rate of approximately 1 event per 100 000 bases (International Human Genome Sequencing Consortium, 2004; Schmutz et al., 2004a, 2004b). Since then, significant further progress has been made towards the actual completion of the human genome sequence; indeed, the complete euchromatic sequences of all individual human chromosomes, including the annotation of genes and other features, have now been published (summarized in Table 1). Since November 2005, the NCBI (National Center for Biotechnology Information) Build 36 assembly of the human genome sequence has been available in public databases. 
The data comprise a reference assembly of the complete genome sequence plus the Celera whole genome sequence (WGS) and a number of alternative assemblies of individual haplotypic chromosomes or regions. The full list of assemblies in NCBI 36, as well as the genome sequences themselves, is available through public genome browsers.
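To give a feel for what the Build 35 figures quoted above imply, the following minimal Python sketch works through the arithmetic. It assumes a uniform error rate and that the 341 gaps are spread over 24 distinct chromosomes (22 autosomes plus X and Y); the function names are illustrative only and are not taken from any genome resource.

```python
# Figures quoted in the text for Build 35 (IHGSC, 2004).
BUILD35_BASES = 2_850_000_000   # 2.85 billion nucleotides
BASES_PER_ERROR = 100_000       # approximately 1 error event per 100 000 bases
BUILD35_GAPS = 341              # remaining gaps in the assembly
N_CHROMOSOMES = 24              # assumption: 22 autosomes + X + Y


def expected_errors(bases: int, bases_per_error: int) -> float:
    """Expected number of error events, assuming a uniform error rate."""
    return bases / bases_per_error


def mean_ungapped_stretch(bases: int, gaps: int, n_chromosomes: int) -> float:
    """Mean length of a gap-free stretch of sequence.

    Each gap splits one chromosome sequence into one additional piece,
    so the number of gap-free pieces is gaps + n_chromosomes.
    """
    return bases / (gaps + n_chromosomes)


if __name__ == "__main__":
    errors = expected_errors(BUILD35_BASES, BASES_PER_ERROR)
    stretch = mean_ungapped_stretch(BUILD35_BASES, BUILD35_GAPS, N_CHROMOSOMES)
    print(f"Expected error events: {errors:,.0f}")
    print(f"Mean gap-free stretch: {stretch / 1e6:.1f} Mb")
```

Under these assumptions, even a 'nearly complete' sequence is expected to carry several tens of thousands of residual errors, while the average uninterrupted stretch runs to several megabases.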

Table 1. Special features of human chromosomes 1–22 including respective lengths, gene number and density

| Chromosome | Chromosome length (bp) (a) | Number of known protein-coding genes per chromosome (a) | Gene density (genes/Mb) | Special features of the chromosome | Reference |
|---|---|---|---|---|---|
| 1 | 247 249 719 | 2189 | 8.85 | Largest human chromosome. Rich in disease genes. Huge (∼30 Mb) pericentromeric heterochromatic region at 1q12 spans approximately 5% of the length of the chromosome. Contains clusters of amylase genes (1p21), U1 snRNA genes (1q12–q22) and 5S RNA genes (1q) as well as multiple (∼250) tRNA genes | Gregory et al. 2006 |
| 2 | 242 951 149 | 1328 | 5.47 | Chromosome 2 (along with chromosome 4) exhibits the lowest recombination rate of all the autosomes. Contains at 2q13 an ancient telomere–telomere fusion junction at the position where two ape chromosomes once fused to give rise to this human chromosome | Hillier et al. 2005 |
| 3 | 199 501 827 | 1112 | 5.57 | Lowest rate of segmental duplication of all human chromosomes. Contains several olfactory receptor gene clusters | Muzny et al. 2006 |
| 4 | 191 273 063 | 797 | 4.17 | Chromosome 4 (along with chromosome 2) exhibits the lowest recombination rate of all the autosomes. Highest percentage of LINE among all chromosomes | Hillier et al. 2005 |
| 5 | 180 857 866 | 903 | 4.99 | Rich in intrachromosomal duplications. Contains interleukin and protocadherin gene clusters on 5q31 | Schmutz et al. 2004b |
| 6 | 170 899 992 | 1133 | 6.62 | Harbours the major histocompatibility complex and the largest tRNA gene cluster in the human genome. Contains at least 3 imprinted genes | Mungall et al. 2003 |
| 7 | 158 821 424 | 1023 | 6.44 | Contains the highest number of intrachromosomal duplications among all human chromosomes. Contains at least 6 imprinted genes | Hillier et al. 2003 and Scherer et al. 2003 |
| 8 | 146 274 826 | 747 | 5.11 | Contains a fast evolving 15 Mb region on distal 8p with genes related to the innate immunity and nervous systems that appear to have evolved under positive selection | Nusbaum et al. 2006 |
| 9 | 140 273 252 | 929 | 6.62 | Structurally highly polymorphic. Contains the large (∼14 Mb) block of pericentromeric heterochromatin. Contains large numbers of intra- and interchromosomal segmental duplications as well as the largest interferon gene cluster in the human genome (9p22) | Humphray et al. 2004 |
| 10 | 135 374 737 | 834 | 6.16 | Region of extensive segmental duplication located on 10q11 | Deloukas et al. 2004 |
| 11 | 134 452 384 | 1385 | 10.30 | Rich in both genes and disease genes. Contains 40% of all olfactory receptor gene clusters. Contains at least 9 imprinted genes | Taylor et al. 2006 |
| 12 | 132 349 534 | 1080 | 8.16 | Chromosome 12 has a unique history of evolutionary rearrangements that occurred in the rodent and primate lineages. Contains clusters of proline-rich protein and type II keratin genes at 12q13 | Scherer et al. 2006 |
| 13 | 114 142 980 | 361 | 3.16 | Low gene density in general; contains a central 38 Mb segment where the gene density drops to only 3.1 genes per Mb. This acrocentric chromosome contains ribosomal RNA genes at 13p12 and at least 1 imprinted gene | Dunham et al. 2004 |
| 14 | 106 368 585 | 669 | 6.29 | This acrocentric chromosome contains ribosomal RNA genes at 14p12. Contains two 1 Mb regions of crucial importance to the immune system (T-cell receptor and immunoglobulin heavy-chain genes). Contains serpin gene cluster at 14q32.1 and several regions with imprinted genes | Heilig et al. 2003 |
| 15 | 100 338 915 | 641 | 6.39 | This acrocentric chromosome contains ribosomal RNA genes at 15p12. Two large clusters of clinically important segmental duplications are located in the proximal and distal regions of 15q. Contains a number of imprinted genes | Zody et al. 2006a |
| 16 | 88 827 254 | 925 | 10.41 | Relatively high gene density. Contains a large number of segmental duplications | Martin et al. 2004 |
| 17 | 78 774 742 | 1236 | 15.69 | High gene density. Has undergone extensive intrachromosomal rearrangement, many of which were probably mediated by segmental duplications. High G+C content of 45% (genome average: 41%) | Zody et al. 2006b |
| 18 | 76 117 153 | 295 | 3.88 | Low gene density overall. Contains serpin gene cluster at 18q21.3 | Nusbaum et al. 2005 |
| 19 | 63 811 651 | 1443 | 22.61 | Highest gene density of all human chromosomes. One quarter of the genes on chromosome 19 belong to tandemly arranged gene families, encompassing 25% of the length of the chromosome. High G+C content of 48–49% (genome average: 41%). Repetitive sequences constitute 53–57% of the chromosome as compared with a genome average of 40–44%. Contains clusters of olfactory receptor genes and cytochrome P450 genes and multiple clusters of zinc-finger genes, and at least 2 imprinted genes | Grimwood et al. 2004 |
| 20 | 62 435 964 | 617 | 9.88 | Smallest metacentric autosome. Rich in both genes and disease genes. Contains type 2 cystatin gene cluster and at least two imprinted genes | Deloukas et al. 2001 |
| 21 | 46 944 323 | 284 | 6.05 | Smallest human chromosome with fewer genes than any other autosome. This acrocentric chromosome contains ribosomal RNA genes at 21p12 | Hattori et al. 2000 |
| 22 | 49 691 432 | 519 | 10.44 | This acrocentric chromosome contains ribosomal RNA genes at 22p12. Relatively high gene density. Clusters of segmental duplications at 22q11.2 are associated with several genomic disorders | Dunham et al. 1999 |
| X | 154 913 754 | 891 | 5.75 | Contains the pseudoautosomal regions, PAR1 and PAR2, at the tips of the short and long arms, respectively. These regions are essential for normal male meiosis and recombination. PAR1 undergoes an obligate crossover with the Y, thereby giving this region the highest recombination rate in the human genome, at least in males. One X-chromosome is subject to inactivation in females. Highly enriched in interspersed repeats and has a low G+C content of 39% (genome average: 41%) | Ross et al. 2005 |
| Y | 57 772 954 | 80 | 1.38 | Lowest gene density of all human chromosomes (contains only 82 known genes). Contains the male-specific region which is a mosaic of heterochromatin and euchromatic X-transposed, X-degenerate and ampliconic sequences, that make-up 30% of the euchromatin. PAR1 undergoes an obligate crossover with the X-chromosome. The virtual absence of homologous recombination between the X- and the Y-chromosome has led to a gradual degeneration of Y chromosomal genes over evolutionary time. However, the absence of recombination, at least within the extensive nonrecombining region of the Y, has also favoured the evolutionary accumulation of transposable elements on the Y chromosome | Skaletsky et al. 2003 |

Notes: LINE, long interspersed nuclear element.

(a) Chromosome lengths and the numbers of genes per chromosome are according to the Ensembl database, version 47.36. The chromosome length corresponds to the length of each chromosome that has been sequenced so far. The number of known protein-coding genes represents a conservative estimate of the likely total number, comprising genes which have been fully annotated.
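The gene density column of Table 1 is the number of known protein-coding genes divided by the chromosome length in megabases. The short Python sketch below recomputes it for four chromosomes using the lengths and gene counts listed in the table; the function name and the choice of chromosomes are illustrative only.

```python
def gene_density(length_bp: int, n_genes: int) -> float:
    """Return gene density in genes per megabase (1 Mb = 1 000 000 bp)."""
    return n_genes / (length_bp / 1_000_000)


# (chromosome length in bp, number of known protein-coding genes), from Table 1
table1_subset = {
    "1": (247_249_719, 2189),
    "13": (114_142_980, 361),
    "19": (63_811_651, 1443),
    "Y": (57_772_954, 80),
}

if __name__ == "__main__":
    for chromosome, (length_bp, n_genes) in table1_subset.items():
        density = gene_density(length_bp, n_genes)
        print(f"chromosome {chromosome}: {density:.2f} genes/Mb")
```

The recomputed values agree with the tabulated densities to two decimal places and make the contrast between gene-rich chromosome 19 and the gene-poor chromosomes 13 and Y easy to see.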

The results emanating from the HGP have had an enormous impact on biomedical research. Some of the most important insights obtained during the course of this project are discussed below and have previously been reviewed by Collins et al. (2003) and Little (2005).

Key Findings Made by the Human Genome Project and their Impact on our Understanding of the Structure and Function of the Human Genome

Gene number and density

Among the most publicly discussed results of the HGP has been the number of genes in the human genome. In the latest assembly of the human genome (Build 36), which covers a total of 3 253 037 807 base pairs, 23 686 known and novel protein-coding genes have been annotated (genebuild: Ensembl 2007, database version 47.36i). Gene density varies between the human chromosomes, allowing one to distinguish gene-rich and gene-poor chromosomes (Table 1). The gene distribution within chromosomes is also rather uneven. Thus, strikingly gene-poor regions have been identified (‘gene deserts’; Ovcharenko et al., 2005); these are regions that are devoid of protein-coding genes over distances of several megabases but may nevertheless contain regulatory sequences. Functional clustering of genes, and the coexpression of these genes located in distinct chromosomal domains, has also been observed (Yamashita et al., 2004; Gierman et al., 2007) and these properties have often been conserved over evolutionary time (Sémon and Duret, 2006). See also Clustering of Highly Expressed Genes in the Human Genome, Evolution of Gene Deserts in the Human Genome, Gene Clustering in Eukaryotes, and Gene Distribution in Human Chromosomes.
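The notion of a gene desert can be made concrete with a small interval scan. The Python sketch below is only an illustration under stated assumptions: genes are given as 0-based, half-open (start, end) coordinates on a single chromosome, a desert is any gene-free stretch at least `min_size` bases long, and the 1 Mb default threshold is a hypothetical choice, not the criterion used by Ovcharenko et al. (2005). Stretches at the chromosome ends are also reported, which may or may not be desirable in practice.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based and half-open


def find_gene_deserts(
    genes: List[Interval],
    chrom_length: int,
    min_size: int = 1_000_000,  # hypothetical threshold, chosen for illustration
) -> List[Interval]:
    """Return gene-free stretches of at least min_size bases."""
    deserts: List[Interval] = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        # A gap opens only if this gene starts beyond everything seen so far.
        if start - cursor >= min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping and nested genes
    # Stretch between the last gene and the end of the chromosome.
    if chrom_length - cursor >= min_size:
        deserts.append((cursor, chrom_length))
    return deserts


if __name__ == "__main__":
    toy_genes = [(100_000, 150_000), (120_000, 400_000), (3_500_000, 3_600_000)]
    print(find_gene_deserts(toy_genes, chrom_length=5_000_000, min_size=2_000_000))
```

Tracking the right-most gene end, rather than the end of the previous gene, keeps overlapping or nested gene annotations from producing spurious gaps.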

Nonprotein-coding RNAs and transcripts of unknown function

The analysis of the human genome sequence has revealed that, in addition to protein-coding genes, several thousand ribonucleic acid (RNA) genes are present. Nonprotein-coding RNAs of known function include not only structural RNAs such as transfer RNAs, ribosomal RNAs and small nuclear RNAs but also regulatory RNAs (microRNAs and small interfering RNAs (siRNAs)) which are involved in the sequence-specific transcriptional and posttranscriptional modulation of gene expression (Kapranov et al., 2007b). MicroRNA gene loci may be quite numerous: already some 5000 microRNA gene loci have been identified (miRBase, release 10.0). See also Evolutionarily Conserved Noncoding DNA, MicroRNA Evolution in the Human Genome, rRNA Genes: Evolution, The Biological Significance of Conserved Nongenic DNA, and Ultraconserved Elements (UCEs) in the Human Genome.

In addition to the unambiguously noncoding RNAs, large numbers of nonpolyadenylated and polyadenylated transcripts of unknown function (TUFs) have been identified which may have some coding potential. Since these transcripts may either be noncoding or instead encode short polypeptides, they are classified collectively as ‘TUFs’. These unannotated transcribed regions have been assigned to three different categories:

  • (i) antisense transcripts of protein-coding genes (Chen et al., 2005),
  • (ii) isoforms of protein-coding genes (Tress et al., 2007) and
  • (iii) transcripts that either overlap introns of annotated gene transcripts (on the same strand) or are derived entirely from within intergenic regions (reviewed in Gingeras, 2007; Kapranov et al., 2007a).

Both the complexity and abundance of TUFs are quite remarkable. Indeed, unannotated nonpolyadenylated transcripts originating from intergenic regions have been found to represent the major proportion of the transcriptional output of the human genome (Cheng et al., 2005).

The existence of this additional layer of transcriptional complexity is also evident from data obtained by the Encyclopedia of DNA Elements (ENCODE) project, which analysed 30 Mb of sequence from 44 genomic regions with the aim of characterizing the functional elements present in these sequences (ENCODE Project Consortium, 2007; Thomas et al., 2007). More than 65% of the approximately 400 annotated genes present in the ENCODE regions possess previously unannotated, tissue-specific transcription start sites and promoter regions that lie distal to the 5′ end of the gene, many of which form parts of TUFs (Denoeud et al., 2007). Importantly, a compilation of all previously annotated and empirically detected RNAs by the ENCODE Consortium has indicated that >90% of genomic sequence is transcribed as nuclear primary transcripts (ENCODE Project Consortium, 2007). Thus, it would appear that the majority of bases on both strands in the human genome probably play some part in encoding at least one primary transcript (Kapranov et al., 2007a).

Expressed pseudogenes may be considered a special category among TUFs since, even if some of them have lost the ability to encode a functional protein, they may still be transcribed (Zheng et al., 2007). It has been estimated that the number of pseudogenes could well exceed the number of functional protein-coding genes (International Human Genome Sequencing Consortium, 2004). It is highly likely that at least some pseudogenes have acquired a new function by encoding either novel proteins or regulatory RNAs. See also Evolutionary Emergence of Genes Through Retrotransposition, Processed Pseudogenes and Their Functional Resurrection in the Human and Mouse Genomes, and Pseudogene Evolution in the Human Genome.

Sequence elements controlling gene expression

The comparison of the human genome sequence with the orthologous sequences of mouse and rat revealed the existence of 481 ultraconserved elements (UCEs) of at least 200 bp (Bejerano et al., 2004). Most UCEs are noncoding and have been evolutionarily conserved since the divergence of the mammalian and avian lineages more than 300 million years ago. Analysis of the derived allele frequencies for the segregating polymorphisms in the human UCEs has indicated that these regions have been under negative selective constraints that are much more stringent than those normally acting on protein-coding genes (Katzman et al., 2007). This observation strongly supports the functionality of these UCEs, which may represent long-range enhancers of gene expression (Pennacchio et al., 2006).
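The kind of scan that underlies a UCE search can be sketched in a few lines of Python. The example assumes that the orthologous sequences have already been aligned to equal length with '-' marking gaps, and that a UCE is a run of at least `min_len` alignment columns that are identical in every species (perfect identity is the criterion of the original definition; the 200 bp default follows the text). The function name is hypothetical, the returned coordinates are alignment columns rather than genomic positions, and real analyses work on whole-genome alignments rather than on strings held in memory.

```python
from typing import List, Sequence, Tuple


def ultraconserved_runs(
    aligned: Sequence[str], min_len: int = 200
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges that are identical and
    gap-free in every aligned sequence and at least min_len columns long."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    runs: List[Tuple[int, int]] = []
    run_start = None  # first column of the current identical run, if any
    for i in range(length):
        column = {seq[i].upper() for seq in aligned}
        identical = len(column) == 1 and "-" not in column
        if identical:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and length - run_start >= min_len:
        runs.append((run_start, length))
    return runs


if __name__ == "__main__":
    human = "ACGTACGTAA"
    mouse = "ACGTACCTAA"
    rat = "ACGTACGTA-"
    # Toy threshold of 4 columns instead of 200, purely for demonstration.
    print(ultraconserved_runs([human, mouse, rat], min_len=4))
```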

The availability of the increasingly well-annotated human genome sequence has to some extent obviated the need for the individualized experimental identification of the promoter regions and cis-acting elements that regulate gene expression. However, to understand the complexity of gene expression regulation and to identify the full spectrum of the different deoxyribonucleic acid (DNA) sequences involved, the ENCODE project (Thomas et al., 2007) was conceived; this project has attempted to make the leap from structural to functional analysis by examining more than 200 experimental datasets from studies which have interrogated the 44 ENCODE regions (together representing approximately 1% of the human genome). The conclusions from this collaborative approach should be seen as second only in importance to the HGP itself in terms of our effort to characterize the human genome.

Revisiting our definition of the ‘gene’

The ENCODE project has succeeded in doing something that the HGP could not, namely to change the way in which we think about genes. The complexity exemplified by gene regulatory elements which are often quite distant from the genes they regulate, the existence of trans as well as cis regulatory elements, the quite unanticipated extent of transcription in the genome, the abundance of noncoding RNA genes and the presence of evolutionarily conserved noncoding regions have together challenged current notions of the gene. Gerstein et al. (2007) have proposed an updated definition of a gene as ‘a union of genomic sequences encoding a coherent set of potentially overlapping functional products’. This definition deftly avoids the complexities of regulation and transcription by removing the former altogether from the definition of a gene. Instead, this definition argues that it is the final functional gene products (rather than any intermediate transcripts) that should be used to group together the various entities that may be associated with a single gene.

High-copy repeat sequences and segmental duplications

The HGP revealed that repeat sequences account for at least 50% of the human genome sequence. These repeats may be classified as

  • (i) transposon-derived repeats,
  • (ii) partially retroposed copies of genes (referred to as processed pseudogenes),
  • (iii) simple sequence repeats,
  • (iv) blocks of tandemly repeated sequences at centromeres, telomeres and the short arms of acrocentric chromosomes and
  • (v) segmental duplications (SDs) or low-copy repeats.

The number and wide distribution of SDs, which account for about 5% of the human genome, were most surprising. SDs represent extensive inter- and intrachromosomal duplications of genomic regions that contain genes as well as intergenic sequences (Lander et al., 2001; Venter et al., 2001). She et al. (2004) extended the initial analyses of these low-copy repeats/segmental duplications and initiated the characterization of the duplicational landscape of the human genome. SDs may be viewed as mutational hotspots since they are prone to aberrant recombination events between highly homologous paralogous SDs; such events give rise to large deletions or duplications of the intervening sequences and thereby to human genomic disorders (Shaw and Lupski, 2004). However, the rapid expansion and fixation of some intrachromosomal SDs during hominoid evolution may have contributed to the emergence of ‘new’ genes and transcripts embedded within these SDs, thereby conferring some selective advantage in the process (Jiang et al., 2007). SDs have also been shown to represent frequent sites of copy number variation between individuals, thereby contributing considerably to the genomic diversity among humans. See also Segmental Duplications and Genetic Disease, and Structural Diversity of the Human Genome and Disease Susceptibility.

The detailed analysis of the repeat distribution in the human genome was key to addressing the long-standing mystery of Alu sequence enrichment in GC (guanine–cytosine)-rich genomic regions: strong positive selection appears to favour the retention of Alu sequences in GC-rich regions, where they may be in some way beneficial to their hosts (Lander et al., 2001).

Genetic diversity of the human genome

Initially, more than 1.4 million single nucleotide polymorphisms (SNPs) were identified in the human genome (Lander et al., 2001). These have been exploited by the Human Haplotype Map (HapMap) project with the aim of developing methods for the design and analysis of genome-wide association studies to map phenotypic variation in humans (International HapMap Consortium, 2005). Subsequently, a second-generation haplotype map based upon 3.1 million SNPs has been published (International HapMap Consortium, 2007). The map was obtained by genotyping 270 individuals from four geographically and ethnically diverse populations and includes approximately 25–35% of common SNP variation in the populations investigated. One novel finding has been that 10–30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent common ancestry. An additional discovery was that up to 1% of all common variants were not tagged by SNPs, primarily because they were located within recombination hotspots (International HapMap Consortium, 2007). Importantly, increased population differentiation at nonsynonymous SNPs was noted as compared to synonymous SNPs. These observations have also indicated systematic differences in the strength or efficacy of natural selection between populations from different geographical areas involving genes linked to Lassa virus in West Africa, skin pigmentation in Europe and hair follicle development in Asia (Sabeti et al., 2007). See also Evolution of Skin Pigmentation Differences in Humans, and HapMap Project.

In addition to SNPs, copy number variants and polymorphic inversions have also been shown to contribute to human genomic diversity, as shown by the results of genome assembly comparisons, array comparative genomic hybridization (arrayCGH) and mapping of large insert clones (Khaja et al., 2006; Redon et al., 2006). This type of genomic variation is likely to have a considerable impact on disease susceptibility in humans, as evidenced by several examples. See also Copy Number Variation in the Human Genome, Segmental Duplications and Genetic Disease, Segmental Duplications and Their Role in the Evolution of the Human Genome, and Structural Diversity of the Human Genome and Disease Susceptibility.

Reconstruction of ancestral mammalian/eutherian genomes

The sequence of the human genome has not only helped to improve our understanding of its structure and function, and to explore the full range of human genotypic diversity, but has also provided the key to understanding the evolutionary history of the human species as well as individual human populations. The importance of human–mammalian genome comparative sequence analysis for the reconstruction of the ancestral eutherian genome has been demonstrated by several studies (e.g. Murphy et al., 2007). Together with other techniques such as comparative chromosome painting, these sequence comparisons have the potential to provide new insights into the evolutionary interrelationship of the different eutherian orders within the mammalian phylogenetic tree. See also Comparing the Human and Canine Genomes, Comparing the Human and Chimpanzee Genomes, The Mouse Genome as a Rodent Model in Evolutionary Studies, The Rat Genome as a Rodent Model in Evolutionary Studies, and The Sequencing of the Rhesus Macaque Genome and its Comparison with the Genome Sequences of Human and Chimpanzee.


Conclusion

The analysis of the sequence of the human genome has had a major impact on biomedical research over the last few years. The HGP has made possible a multitude of genome-wide scaled analyses and has thus provided a wealth of information about the structure of the human genome. In many ways, the HGP has paved the way for what is coming to be called individualized genome medicine. The development of new technologies for improved, less cost-intensive and more precise genome sequencing and assembly has been driven by the overwhelming success of the HGP. The recent sequencing of an individual human's entire diploid genome and its comparison with the human reference sequence (Levy et al., 2007) has yielded new insights into the extent of genetic variation and marks a starting point of a new era of research into the basis of human genetic individuality.

Glossary

Antisense transcript

Antisense transcripts control gene expression via posttranscriptional gene silencing by annealing to the complementary sequence of the sense transcript.


Contig

An ordered arrangement of overlapping cloned fragments that together contain the sequence of the originally contiguous DNA strand.


Enhancer

A short region of DNA that can bind a transcriptional activator protein, thereby initiating the transcription of a gene which may be distant to the enhancer, possibly even on a different chromosome.


Euchromatic region

Portion of the genome which contains the euchromatin, a form of chromatin that is rich in actively transcribed genes.

Segmental duplication

Genomic duplication of a DNA segment longer than 1 kb.

Ultraconserved element

DNA sequences which have remained unchanged over an extended period of evolutionary time (indicating that they are biologically important) but whose functions remain largely unknown.


