Is your “gene of interest” interesting?


  • Julie C. Kiefer

    Corresponding author
    1. Department of Neurobiology and Anatomy, University of Utah, Salt Lake City, Utah
    • Department of Neurobiology and Anatomy, 20 North 1900 East, 401 MREB, University of Utah, Salt Lake City, UT 84132


Has a large-scale screen turned up a potential gene-of-interest that you know nothing about? Your computer is a portal to a wealth of information that can save you valuable time and resources. Freely available data can help to determine whether a particular gene is worthy of further research, and what direction that research should take. Presented here are approaches to mining the Internet, including searching popular model organism databases. The primer covers two typical scenarios: the gene of interest is well characterized, or mostly uncharacterized. Also featured are interviews with Monte Westerfield, PhD, Director of the Zebrafish Information Network (ZFIN) online database, and Principal Investigator of the Human Protein Reference Database (HPRD) project, Akhilesh Pandey, MD, PhD. Developmental Dynamics 236:2962–2969, 2007. © 2007 Wiley-Liss, Inc.


You have performed a microarray screen comparing primary Mus musculus (mouse) cells cultured with a specific growth factor to untreated cells. The microarray shows that the gene Hspg2 is significantly up-regulated in treated cells. You know nothing about the function of this gene or how it might be regulated by the growth factor. How can bioinformatics help determine whether Hspg2 is a “gene of interest” worth further investigation (Fig. 1)?

Figure 1.

Flow chart outlining approaches to finding information about a potential gene of interest. See text for details.

A. What Is Known About Mouse Hspg2?

Go to Mouse Genome Informatics (MGI; Eppig et al., 2005). Type in “Hspg2”, scroll to “Gene symbols/names” (default).

Note: Gene databases access a wealth of information. They generally contain data that fall into the following categories: data summary, genomic location, gene models, links to sequences, protein domains, gene ontology, homology, expression, interactions, alleles, phenotypes, reagents, external links, and references.

Results sampled from MGI (version 3.54): Names: Perlecan, heparan sulfate proteoglycan 2. Phenotypes: Null mutants die at embryonic day 10.5 with cardiac outflow defects and/or brain exencephaly at birth. They also exhibit skeletal dysplasia, including micromelia and craniofacial defects (Arikawa-Hirasawa et al., 1999; Costell et al., 1999; Rodgers et al., 2007). An exon 3 deletion mutant shows only a lens defect (Rossi et al., 2003). Expression: Hspg2 is expressed in nearly all tissues examined, including muscle, heart, and brain. Alleles: Five alleles, including targeted knockouts. Reagents: cDNAs, primer pairs, antibodies.

B. What Is Known About Human HSPG2?

Go to NCBI Entrez. Type in “human Hspg2”, scroll to search “gene” (Entrez Gene; Maglott et al., 2005).

Results sampled from Entrez Gene: Function: It stabilizes other molecules and regulates cell adhesion and glomerular permeability (Chakravarti et al.,1995; Morita et al.,2005). Expression: localized to cellular basement membrane.

C. Is HSPG2 Implicated in Any Human Genetic Diseases?

Go to NCBI Entrez. Type in “HSPG2”, scroll to “OMIM” (Online Mendelian Inheritance in Man; McKusick,1998).

Note: OMIM disease sites contain data that fall into the following categories: alternative titles, clinical features, other features, inheritance, mapping, molecular genetics, history of discovery, references.

OMIM search results: 142461 heparan sulfate proteoglycan of basement membrane; HSPG2 (last edited 12/15/2006); 255800 Schwartz-Jampel Syndrome, Type 1; SJS1 (last edited 6/11/2007); 224410 Dyssegmental Dysplasia, Silverman-Handmaker Type; DDSH (last edited 2/21/2007). Function: HSPG2 has growth-promoting and angiogenic properties. It can act as a co-receptor for FGF2 (Sharma et al.,1998). Consistent with this, yeast two-hybrid (Y2H) data analysis shows it binds FGF binding protein 1 (FGFBP1; Mongiat et al.,2001). Disease information: Schwartz-Jampel Syndrome, Type 1 (SJS1) can be caused by several mutations in HSPG2. The best characterized are mutations that result in a truncated protein, and a small deletion that leads to reduced expression of the nearly full-length protein (Nicole et al.,2000; Arikawa-Hirasawa et al.,2002; Stum et al.,2006). Phenotypes point to defects in neuromuscular function and cartilage formation. The syndrome is rare, and autosomal recessive.

Dyssegmental Dysplasia, Silverman-Handmaker Type (DDSH) is caused by homozygous functional null mutations resulting in lethal, neonatal, short-limbed dwarfism (Arikawa-Hirasawa et al.,2001).
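The Entrez searches in steps B and C are described for the web interface. As a rough sketch, the same queries could also be expressed as URLs for NCBI's E-utilities `esearch` service. The primer itself does not use E-utilities, so the helper name `entrez_search_url`, the choice of field tags, and the `retmax` value below are assumptions for illustration only. The code builds the URLs without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(database, term, max_records=20):
    """Build an esearch URL (hypothetical helper; no network access here)."""
    params = {
        "db": database,         # Entrez database name, e.g. "gene" or "omim"
        "term": term,           # the text one would type into the search box
        "retmax": max_records,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON reply instead of XML
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Step B: human HSPG2 in Entrez Gene (field tags are an assumed refinement).
gene_url = entrez_search_url("gene", "HSPG2[Gene Name] AND Homo sapiens[Organism]")

# Step C: HSPG2 in OMIM.
omim_url = entrez_search_url("omim", "HSPG2")

print(gene_url)
print(omim_url)
```

Opening either URL in a browser, or fetching it with a script, should return a list of record identifiers that can then be inspected on the corresponding Entrez pages.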

D. Are There Mouse Models for SJS1 and DDSH?

Go to MGI. Type in “SJS1” or “DDSH”, scroll to “Phenotype/Human disease”.

MGI search results: Targeted mutant strains with phenotypes similar to SJS1 or DDSH are listed.

E. What Can Other Animal Models Tell Us About Hspg2?

Note: Entrez HomoloGene (go to NCBI Entrez, scroll to “HomoloGene”) is another good place to start. A search for “Hspg2” gives a list of genes identified as putative homologs.

1. Caenorhabditis elegans (nematode worm).

Go to WormBase. Type in “Hspg2”, scroll to “any gene” (default).

Results sampled from WormBase (release WS177, 7/2007): Function: C. elegans unc-52 is a Hspg2 ortholog. unc-52 regulates muscle differentiation, and growth factor-like signaling pathways (Mackinnon et al.,2002; Merz et al.,2003). Mutant phenotypes: Paralyzed, locomotion abnormal, sterile, lethal. Expression: Expression begins in mid-embryogenesis. It is localized primarily in basement membrane of muscle cells (Mullen et al.,1999). In larvae and adults it is also expressed in M-lines, dense bodies, and muscle cell margins of body wall muscle. Reagents: Alleles, transgene strain, primers, microarray probes, SAGE tags, cDNAs, antibodies.

2. Drosophila melanogaster (fruit fly).

Note: A search on FlyBase for “Hspg2” yields “hsp67B”, a heat shock protein. Because it is of a different protein class (see step I–F-1), it is not an ortholog of HSPG2. A search for “perlecan” yields “trol”; the two bear the same protein domains.

Go to FlyBase. Type in “trol”, scroll to “genes” (default).

Results sampled from FlyBase (version FB2006_01, released 12/8/2006): Function: Terribly reduced optic lobes (trol) is a structural molecule that regulates neuroblast division, cell–cell adhesion, cell-matrix adhesion, and signal transduction (Ebens et al.,1993; Voigt et al.,2002; Park et al.,2003). Phenotypes: Allele phenotypes include lethal and sterile. Other phenotypes are not listed. See bibliography for details. Expression: Embryonic midgut, proventriculus, hypo/epipharynx, ventral nerve cord, lateral cord, embryonic and larval circulatory, and muscle systems. Reagents: Alleles, genomic clones, RNAi probes, cDNAs.

3. Danio rerio (zebrafish)

Note: A search on the Zebrafish Information Network (ZFIN; Sprague et al., 2006; http://zfin.org) for “Hspg2” and “perlecan” yields no hits.

Identify closest related zebrafish sequence to mouse Hspg2.

a. Go to Basic Local Alignment and Search Tool (BLAST; Altschul et al., 1990). Select “Danio rerio”. On the next page, input the mouse Hspg2 amino acid sequence in FASTA format. Select the database of choice and the TBLASTN program. RefSeq databases are useful to search against because they contain NCBI-curated, nonredundant sequences.

Note: Because the zebrafish genome sequencing project is not yet complete, not all proteins are curated. A search against a translated nucleotide database includes predicted proteins based on expressed sequence tags (ESTs) and gene models.

TBLASTN results: Closest related sequence is predicted hypothetical protein LOC565429, Entrez nucleotide accession XR_030096.1, E-value 0. Unlike the next closest match, it aligns with most of the mouse Hspg2 sequence.

b. Perform “reverse” BLASTP alignment of hypothetical protein LOC565429 against mouse RefSeq protein database.

Note: Unlike for most genes, an mRNA translation is not available on the XR_030096.1 Entrez Gene or Nucleotide pages. Perform a six-frame translation of the XR_030096.1 mRNA sequence in the Baylor College of Medicine (BCM) Search Launcher (last modified 6/4/2004). Input the translated sequence that aligns with mouse Hspg2 in the previous step.

Reverse BLASTP results: Hspg2 is the closest related mouse sequence to LOC565429.

c. For associated EST expression, go to NCBI Entrez. Type in “LOC565429”, scroll to “UniGene”.

Note: UniGene contains information (gene expression, genomic location, protein similarities, cDNA clone information) for transcript sequences from a common locus, including ESTs.

UniGene results: ESTs representing LOC565429 are expressed in heart and genitourinary tissues.
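The six-frame translation called for in step b is performed in the primer with the BCM Search Launcher. For readers who want to see what that operation does, here is a minimal sketch using the standard genetic code. The function names and the short test sequence are illustrative assumptions, not part of the original protocol, and this is not the Search Launcher's own code.

```python
# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(sequence):
    """Translate an mRNA/cDNA sequence in three forward and three reverse frames."""
    forward = sequence.upper().replace("U", "T")
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Toy sequence (hypothetical, not XR_030096.1): Met-Ala-stop in frame +1.
frames = six_frame_translation("ATGGCCTAA")
for name in sorted(frames):
    print(name, frames[name])
```

In practice, the frame to carry forward is the one whose translation matches the region that aligned with mouse Hspg2 in the TBLASTN step; stop codons appear as `*`.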

F. Are the Hspg2-Related Sequences True Orthologs?

1. Do they have similar domain structure?

Perform motif search on Prosite (Gasteiger et al., 2003). Input protein sequences in FASTA format. Note: The Human Protein Reference Database (HPRD) is a useful resource for identifying protein domains and visualizing domain structure in human proteins.

Prosite results: All proteins have the following predicted domains, although how often each domain is represented varies by species: LDL-receptor class A, Ig-like, Laminin IV type A, Laminin-type EGF-like, Laminin G, EGF-like. Mouse and human proteins also contain a SEA domain. Drosophila Trol is the only putative ortholog that bears a C-5 specific cytosine DNA methylases active site.

Sequence of the zebrafish predicted protein, LOC565429, is truncated at its 5' end. Its alignment with mouse Hspg2 initiates after the 5' SEA and LDL-receptor class A domains. Consistent with this, LOC565429 is lacking these two domains. Whether the protein bears these domains awaits completion of the genome sequence.
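Some Prosite signatures are short patterns written in Prosite's own syntax (others are profiles, which cannot be reduced to a pattern). As an illustration of what a pattern scan does, the sketch below converts a Prosite-style pattern into a regular expression and reports where it matches. It handles only the basic syntax elements, the helper names are hypothetical, and the example uses the well-known N-glycosylation pattern `N-{P}-[ST]-{P}` rather than any of the domains listed above.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a basic Prosite pattern to a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the pattern to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the pattern to the C-terminus
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # x means any residue
            core = "."
        elif core.startswith("{"):       # {ABC} means any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern, protein):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(protein.upper())]


# Hypothetical toy protein; only the N at position 2 satisfies the pattern.
sites = find_sites("N-{P}-[ST]-{P}.", "MNGTANPSNAS")
print(prosite_to_regex("N-{P}-[ST]-{P}."), sites)
```

Running the same scan over each putative ortholog and comparing the hits is, in spirit, how the domain lists above can be lined up across species, although the Prosite server applies its own curated rules and profiles.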

2. Do they have close family members that could also be orthologous?

Perform BLASTP alignment of Hspg2 ortholog against own species database.

BLASTP results: There are no close family members to human, C. elegans, and Drosophila Hspg2-like genes. In mouse, a BLASTP alignment of Hspg2 against mouse RefSeq protein database finds a sequence similar to Perlecan, predicted protein LOC100047061.
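When BLASTP results are saved as a tab-delimited table, the search for close family members can be reduced to a simple filter: drop the self-hit and keep subjects below an E-value cutoff. The sketch below assumes the 12-column tabular layout produced by BLAST's tabular output option; the sequence identifiers, scores, and the 1e-10 cutoff are hypothetical placeholders, not results from the searches described here.

```python
import csv
import io

# Column order assumed for BLAST tabular output (an assumption about the export).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_table(text):
    """Parse tab-delimited BLAST hits into a list of dictionaries."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(BLAST_COLUMNS, fields))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        rows.append(row)
    return rows


def family_candidates(rows, max_evalue=1e-10):
    """List non-self subjects at or below the cutoff, best E-value first."""
    best = {}
    for row in rows:
        if row["sseqid"] == row["qseqid"]:
            continue  # the query always finds itself
        if row["evalue"] <= max_evalue:
            previous = best.get(row["sseqid"])
            if previous is None or row["evalue"] < previous:
                best[row["sseqid"]] = row["evalue"]
    return sorted(best, key=best.get)


# Hypothetical hits: a self-hit, one strong hit, and one weak hit.
example_rows = [
    ["query_protein", "query_protein", "100.0", "500", "0", "0", "1", "500", "1", "500", "0.0", "1000"],
    ["query_protein", "candidate_A", "62.5", "480", "180", "3", "10", "490", "5", "480", "1e-150", "450"],
    ["query_protein", "candidate_B", "28.0", "90", "60", "2", "100", "190", "40", "130", "0.002", "38.5"],
]
example_text = "\n".join("\t".join(row) for row in example_rows)
candidates = family_candidates(read_blast_table(example_text))
print(candidates)
```

Any sequence that survives the filter still needs to be inspected by hand, as was done for mouse LOC100047061, because a strong hit may be a paralog, a duplicate annotation, or a fragment of the same gene.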

3. Is synteny conserved?

Go to Entrez Gene page for mouse, human Hspg2, and zebrafish LOC565429. Examine synteny under “Genomic context”.

Synteny results:
Mouse: Ela3–1700013G24–Hspg2–Ldlrad2–Usp48
Mouse: LOC638198 (similar to Ela3B)–1700013G24–LOC100047061 (similar to Perlecan)–Ldlrad2–LOC674195 (similar to USP48)–Rap1gap
Human: RAP1GAP–USP48–LDLRAD2–HSPG2–ELA3B–ELA3A
Zebrafish: LOC100007646–LOC100007663 (BLASTP alignment shows most similar to GSTκ1)–LOC565429–LOC561081 (similar to Oikosin1 protein)–LOC559970
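A crude way to summarize these gene orders is to ask which flanking genes are shared between species. The sketch below does this by comparing gene symbols case-insensitively, using the gene orders listed above. Matching by symbol alone is an assumption that undercounts conservation: it will not pair Ela3 with ELA3B, and it cannot recognize uncharacterized LOC entries, which must be compared by sequence as described in the text.

```python
def flanking_genes(gene_order, focal_gene):
    """Return the upper-cased symbols of all listed genes except the focal gene."""
    return {gene.upper() for gene in gene_order if gene.upper() != focal_gene.upper()}


def shared_neighbors(order_a, focal_a, order_b, focal_b):
    """Symbols found next to the focal gene in both species (name match only)."""
    return flanking_genes(order_a, focal_a) & flanking_genes(order_b, focal_b)


# Gene orders as listed under "Genomic context" in the results above.
mouse = ["Ela3", "1700013G24", "Hspg2", "Ldlrad2", "Usp48"]
human = ["RAP1GAP", "USP48", "LDLRAD2", "HSPG2", "ELA3B", "ELA3A"]
zebrafish = ["LOC100007646", "LOC100007663", "LOC565429", "LOC561081", "LOC559970"]

mouse_human = shared_neighbors(mouse, "Hspg2", human, "HSPG2")
zebrafish_human = shared_neighbors(zebrafish, "LOC565429", human, "HSPG2")

print(sorted(mouse_human))
print(sorted(zebrafish_human))
```

The mouse and human neighborhoods share Ldlrad2 and Usp48 by name, whereas the zebrafish neighborhood shares none, which is the pattern summarized under "Synteny" in step I-F-4.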

4. Compare information between Hspg2-related genes.

Function: Mouse, human, C. elegans, and Drosophila genes have conserved roles in cell adhesion, and signal transduction of growth factor signaling pathways.

Additional functions are reported in different model systems. Phenotypes in human SJS1 and DDSH patients, and disease mouse models, implicate roles for HSPG2 in neuromuscular function and cartilage formation. Drosophila trol plays a role in neuroblast division. C. elegans unc-52 likely regulates muscle function.

Properties unique to different organisms may be attributed to unique protein domains. For example, human and mouse have a SEA domain that is not conserved in the Drosophila and C. elegans genes, and Drosophila trol bears a C-5 specific cytosine DNA methylases active site. Alternatively, perhaps appropriate experiments have not yet been done to identify those specific properties in other model systems.

Expression: While mouse Hspg2 expression is nearly ubiquitous, expression of Drosophila and C. elegans genes is predominantly in muscle. Within the cell, these genes localize to the basement membrane. Expression of ESTs corresponding to zebrafish predicted protein LOC565429 is localized to heart and genitourinary tissues.

Protein domains: See I-F-1.

Synteny: Human and mouse Hspg2, and mouse predicted protein LOC100047061, have conserved synteny. Therefore LOC100047061 must be Hspg2. Zebrafish predicted protein LOC565429 does not have conserved synteny. Because of the significant evolutionary distance, synteny between zebrafish and mouse/human orthologs is not necessarily conserved.


Human/mouse Hspg2, Drosophila trol, and C. elegans unc-52 are likely functional orthologs in muscle. Whether zebrafish LOC565429 is also a functional ortholog awaits cloning of the entire gene, and further experiments. Reagents are readily available in mouse, Drosophila, and C. elegans for further study.

Bioinformatics searches outlined in section II can be performed to determine the evolutionary relationship between the genes.


A Drosophila enhancer trap screen has identified a line with an intriguing expression pattern in muscle cells and somatic muscle primordia. The insertion is in a gene with accession number CG10641. The gene is relatively uncharacterized. Presented here are ways to exploit bioinformatics for more information (Fig. 1).

A. What Is Known About Drosophila CG10641?

Perform FlyBase search for CG10641 (see I-E-2). Results sampled from FlyBase: Function: CG10641 is a calcium-binding protein involved in mesoderm development. No phenotypes are reported.

Expression: Trunk mesoderm primordium, somatic muscle primordium, and embryonic and larval muscles. Reagents: Alleles and transgenic constructs are available.

B. Which Genes Are Candidate Orthologs?

1. Perform BLASTP alignment against human, mouse, and C. elegans RefSeq protein databases (see I-E-3).

BLASTP results: Human: EF-hand domain family member D1 (EFHD1) E-value 3E-56; EF-hand domain family member D2 (EFHD2) E-value 3E-55; Mouse: EF-hand domain containing 2 (Efhd2) E-value 2E-56; EF-hand domain containing 1 (Efhd1) NP_083165.1, E-value 7E-55; C. elegans: Y48B6A.6a, NP_496961.2, E-value 1E-33.

2. Perform TBLASTN search against zebrafish RefSeq mRNA database.

TBLASTN results: Predicted hypothetical protein LOC795463, XM_001333603, E-value 7E-56. Predicted hypothetical protein Wu:fj19b07, XM_678161.2, E-value 9E-55.
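A common way to narrow a list of candidate orthologs is the reciprocal best hit test, the same forward and "reverse" BLAST logic used for zebrafish in step I-E-3. The sketch below assumes hits have already been collected as (subject, E-value) pairs. The forward E-values are those reported above; the reverse-search values are hypothetical placeholders, because the primer does not report a reverse search for these sequences.

```python
def best_hit(hits):
    """Return the subject with the lowest E-value from (subject, evalue) pairs."""
    return min(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(forward, reverse):
    """Pair each query with its best hit only if that hit points back to the query."""
    pairs = []
    for query, hits in forward.items():
        if not hits:
            continue
        candidate = best_hit(hits)
        back_hits = reverse.get(candidate)
        if back_hits and best_hit(back_hits) == query:
            pairs.append((query, candidate))
    return pairs


# Forward search: Drosophila CG10641 against zebrafish (E-values from the text).
forward_hits = {
    "CG10641": [("LOC795463", 7e-56), ("Wu:fj19b07", 9e-55)],
}

# Reverse search: zebrafish candidate against Drosophila.
# HYPOTHETICAL values for illustration; this search is not reported in the primer.
reverse_hits = {
    "LOC795463": [("CG10641", 1e-50), ("other_fly_protein", 1e-5)],
}

pairs = reciprocal_best_hits(forward_hits, reverse_hits)
print(pairs)
```

With two zebrafish hits of nearly equal E-value, as here, a reciprocal best hit should be read as a starting point only; the alignment and tree in section D are better suited to sorting out which sequences are co-orthologs.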

C. What Is Known About These Genes?

1. Search Entrez Gene, UniGene, OMIM for data on human sequences (see I-B, C, E-3).

Results: Function (Entrez Gene): None listed for EFHD1 or EFHD2; Expression (UniGene): One set of ESTs associated with EFHD1 is expressed in lung and esophagus; another set is expressed in multiple tissues. EFHD2-associated ESTs show expression in multiple tissues; Disease (OMIM): No genetic diseases are associated with either gene.

2. Search MGI, Entrez Gene, UniGene for data on mouse sequences (see I-A,B,E).

Results: Names (MGI): Efhd1 is also called mitocalcin. Function (MGI, Entrez Gene): Efhd1 is a calcium ion binding protein. Its overexpression can promote neurite extension in cultured 2Y-3t cells (Tominaga et al.,2006). There is no functional information for Efhd2. Expression (MGI, Entrez Gene, UniGene): Efhd1 is localized to the mitochondrial inner membrane. In the cerebellum it is localized to glomeruli of the internal granular and molecular layers (Tominaga et al.,2006). Associated ESTs are expressed in muscle, heart, bone marrow, brain, eye, inner ear, lung, female genital tissue, mammary gland, spleen, sympathetic ganglion, testis, thymus, urinary tissue. Efhd2 associated ESTs are nearly ubiquitous.

3. Search WormBase for data on C. elegans sequences (see I-E).

Results: Gene models: Y48B6A.6a is one of three splice variants of the gene Y48B6A. Phenotypes: An RNAi experiment yielded no discernable phenotype. There are no reported alleles. Expression: Pharynx, body wall muscle, nervous system, and ventral nerve cord.

4. Search Entrez Gene, ZFIN, UniGene for data on zebrafish sequences (see I-B,E).

Function (Entrez Gene, ZFIN): No data for LOC795463 or Wu:fj19b07. Expression (UniGene): LOC795463 also has no associated EST expression data. Wu:fj19b07-associated ESTs are expressed in brain, eye, genitourinary tissue, gills, muscle, and olfactory rosettes.

D. How Are the Genes Related?

1. Identify conserved, similar amino acid residues.

a. Perform protein sequence alignment with ClustalW (version 1.83; Thompson et al., 1994). Input sequences in FASTA format, select “execute multiple alignment”.

b. Use BOXSHADE (version 3.21) to create a pleasing format. Highlight ClustalW alignment from previous step (including symbols and heading “Clustal W…”), copy and paste into BOXSHADE. Select “RTF_new” output format for black and white shading that can be opened by Microsoft Word. Select “ALN” format for input sequence.
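To make the idea of "identical" and "similar" residues concrete, the sketch below classifies each column of an alignment, in the spirit of the consensus line under a ClustalW alignment or the black and gray shading in BOXSHADE. The similarity groups and the requirement that every sequence agree are simplifying assumptions; ClustalW and BOXSHADE each apply their own, more detailed rules, and the three short sequences are hypothetical.

```python
# Assumed groups of chemically similar amino acids (a simplification).
SIMILARITY_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def classify_column(residues):
    """Return '*' if all residues are identical, ':' if all similar, else ' '."""
    residues = [residue.upper() for residue in residues]
    if "-" in residues:
        return " "  # a gap in any sequence means the column is not conserved
    if len(set(residues)) == 1:
        return "*"
    if any(all(r in group for r in residues) for group in SIMILARITY_GROUPS):
        return ":"
    return " "


def consensus_line(aligned_sequences):
    """Build a conservation line for equal-length, gap-padded sequences."""
    if len({len(seq) for seq in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(classify_column(column) for column in zip(*aligned_sequences))


# Hypothetical aligned fragments, not taken from Figure 2.
alignment = ["DKDGD", "DRDGE", "DKNG-"]
marks = consensus_line(alignment)
for sequence in alignment:
    print(sequence)
print(marks)
```

Columns marked `*` correspond to residues that would be shaded black, and columns marked `:` to residues that would be shaded gray, under these simplified rules.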

2. Identify conserved protein domains (see I-F-1).

Prosite Results: All proteins have two conserved EF hand domains. Figure 2 incorporates ClustalW, BOXSHADE, and Prosite output (see II-D-1). Microsoft Word was used to demarcate and label domains.

Figure 2.

Amino acid ClustalW alignment of human EFHD1, EFHD2, mouse Efhd1, Efhd2, zebrafish (zfish) LOC795463 (LOC), Wu:fj19b07 (Wu), Drosophila CG10641, Caenorhabditis elegans Y48B6A.6a. C. elegans amino acids 51-450 do not align to other proteins and are excluded for ease of viewing. Two EF-hand domains are marked by gray bars. Residues highlighted in black are identical; residues highlighted in gray are similar.

3. Create phylogenetic tree.

Perform ClustalW alignment (see step II-D-1-a). Scroll to bottom of new page, in “select tree menu” select phylogenetic tree of choice.

Note: For a more rigorous phylogenetic tree analysis, use MEGA. See link in “Bioinformatics Resources”.

Results: Figure 3 shows output from “N-J tree with branch length”.

Figure 3.

Neighbor-joining phylogenetic tree of human EFHD1, EFHD2, mouse Efhd1, Efhd2, zebrafish (zfish) LOC795463 (LOC), Wu:fj19b07 (Wu), Drosophila CG10641, Caenorhabditis elegans Y48B6A.6a.
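For readers curious about what the "N-J tree" option computes, here is a compact sketch of the neighbor-joining procedure applied to a matrix of pairwise distances. It is a teaching illustration, not ClustalW's implementation; the five taxa and their distances are a toy data set, not distances derived from the EFHD alignment, and ties between equally good pairs are broken arbitrarily by list order.

```python
from itertools import combinations


def neighbor_joining(names, distances):
    """Join taxa pairwise; return a Newick string and the list of joins made.

    distances maps frozenset({a, b}) to the distance between a and b.
    """
    nodes = list(names)
    d = dict(distances)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all others.
        total = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair with the smallest Q value.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        length_a = 0.5 * d_ab + (total[a] - total[b]) / (2 * (n - 2))
        length_b = d_ab - length_a
        new_node = f"({a}:{length_a:g},{b}:{length_b:g})"
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new_node, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                )
        nodes = [k for k in nodes if k not in (a, b)] + [new_node]
        joins.append((a, b, length_a, length_b))
    last_a, last_b = nodes
    newick = f"({last_a},{last_b}:{d[frozenset((last_a, last_b))]:g});"
    return newick, joins


# Toy distances for five hypothetical taxa (not from the primer).
toy = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_distances = {frozenset(pair): value for pair, value in toy.items()}
newick, joins = neighbor_joining(["a", "b", "c", "d", "e"], toy_distances)
print(newick)
```

The resulting Newick string can be pasted into most tree viewers. For real analyses, the distances would come from the multiple alignment itself, and a dedicated package such as MEGA remains the more rigorous route.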

4. Compare information from Drosophila CG10641-related genes.

Expression: There is overlapping expression between Drosophila CG10641, C. elegans Y48B6A.6a, zebrafish Wu:fj19b07, and mouse Efhd1 and Efhd2 in muscle. All but Drosophila CG10641 are also expressed in the nervous system. Protein domains: Proteins encoded by all genes have two EF-hand domains. There are no other domains identified in any of the proteins. Phylogenetic tree: zebrafish LOC795463 and Wu:fj19b07 are more closely related to Efhd2 than to Efhd1. C. elegans Y48B6A.6a and Drosophila CG10641 may represent single ancestral genes to Efhd1/Efhd2.


Based on expression data, all Drosophila CG10641-related genes may function in muscle. C. elegans Y48B6A.6a, zebrafish Wu:fj19b07, and mouse Efhd1 and Efhd2 may have additional functions, including a role in the nervous system that could be conserved among these four genes. The relationship of Drosophila CG10641 to its related genes awaits further experimentation.


Genome Assemblies

Ensembl
UCSC Genome Bioinformatics
JGI Eukaryotic Genomics

Species-Specific Databases

Note: FlyBase, WormBase, and MGI have links to many databases (expression, etc.). These will not be listed separately here.
MGI (mouse)
FlyBase
WormBase
Xenbase (Xenopus, frog)
Gallus (chicken) EST and in situ hybridization analysis database (GEISHA)

Gene/Protein Information

NCBI Entrez

Note: Entrez has many useful links, including PubMed, Nucleotide, Gene, HomoloGene, UniGene, etc.

Human Protein Reference Database (HPRD)
GeneCards
Swiss-Prot

Human Inherited Diseases


Sequence Alignment


Multiple Sequence Alignment


Note: BOXSHADE formats alignment results.
MUSCLE
Molecular Evolutionary Genetics Analysis (MEGA)

Note: MEGA is only available as free downloadable software for PC.

Protein Translation

BCM Search Launcher

Protein Domain/Motif Prediction

Prosite
Pfam
BLOCKS

Phylogenetic Tree Prediction

ClustalW
Molecular Evolutionary Genetics Analysis (MEGA)
TreeTop

Phylogenetic Tree Curated Database


Nomenclature Information

Human
Mouse
Zebrafish
C. elegans

Other species: Try an Internet search for “species name”, “nomenclature”


Below is an interview with Monte Westerfield, PhD, Director of the Zebrafish Information Network (ZFIN) and the Zebrafish International Resource Center (ZIRC), and with Akhilesh Pandey, MD, PhD, Principal Investigator of the Human Protein Reference Database (HPRD) project, and Founder of the Institute of Bioinformatics (Fig. 4). They discuss the current status and future of two databases, ZFIN and HPRD.

Figure 4.

Akhilesh Pandey (L) and Monte Westerfield (R).

Developmental Dynamics: What was the motivation to launch your database?

Monte Westerfield: The first international meeting of the entire zebrafish research community was held at Cold Spring Harbor in 1994. In addition to discussion of science, the community identified two outstanding infrastructure needs: to establish a forum for information exchange and to establish a stock center. At that time, the World Wide Web was relatively new but held promise for online databases, so we started ZFIN as a Web-based resource that serves as the official zebrafish model organism database. Subsequently, we also established ZIRC that serves as the zebrafish stock center.

Akhilesh Pandey: I feel that the promise of Systems Biology can only be realized if we have good building blocks, i.e., basic information about individual pieces before we start attempting a systems approach. If the building blocks do not exist in sufficient numbers or are defective, I do not think that we could move beyond proof-of-principle systems biology studies. In 2002, we were struck by the lack of crucial information about the building blocks of life, i.e., proteins, in publicly available databases and decided to do something to correct the situation. To accomplish this, however, I first founded a nonprofit research institute called the Institute of Bioinformatics in Bangalore, India, where most of the curation and database work for the HPRD is carried out.

Where does funding come from?

M.W.: ZFIN is funded primarily by the National Human Genome Research Institute of the National Institutes of Health with contributions from some of the other Institutes and Centers.

A.P.: The entire HPRD effort came from my conviction that such a resource was sorely needed. I did not have any funding from any agency to accomplish such a goal. Therefore, the Institute of Bioinformatics and consequently the HPRD were funded for the first 2 1/2 years from several personal credit card loans that I took out. Since then, it has been funded sporadically by the NCBI (there are link outs to HPRD from Entrez-Gene and RefSeq pages, as well as information on protein–protein interactions and posttranslational modifications that is provided on these pages that is derived from HPRD).

What is the purpose of the database?

M.W.: The long-term goals for ZFIN are (a) to be the community database resource for the laboratory use of zebrafish; (b) to develop and support integrated zebrafish genetic, genomic, developmental, and physiological information; (c) to maintain the definitive reference data sets of zebrafish research information; (d) to link this information extensively to corresponding data in other model organism and human databases; (e) to facilitate the use of zebrafish as a model for human biology; and (f) to help serve broadly the needs of the biomedical research community.

A.P.: The purpose of HPRD was to provide carefully curated and accurate information from the published literature. The features of proteins that we focused on were those features that were essentially absent from other databases: protein–protein interactions, posttranslational modifications, subcellular localization, and tissue expression. Another goal of this project was to provide this information to the users in a simple manner.

What is the most used feature?

M.W.: Gene expression patterns are the most popular data, with mutant phenotypes running a close second. ZFIN has tens of thousands of images that illustrate gene expression and mutant phenotypes. In addition to using gene names, researchers can search these data using key words from ontologies that describe biological processes, cellular components, molecular functions, and zebrafish anatomy, including organs and cells.

A.P.: Protein–protein interactions are the most used features of HPRD.

How is HPRD different from other protein databases?

A.P.: Currently, the HPRD database is the largest database in terms of its coverage of protein–protein interactions, posttranslational modifications, subcellular localization, and tissue expression of human proteins. All of the features of this database can be queried in a simple manner. It also has a very nice graphic user interface that is lacking in most other databases.

What is the most important thing that someone who is not a zebrafish expert needs to know when using ZFIN?

M.W.: Non-zebrafishologists should know that ZFIN has a diverse, competent and highly professional staff of scientific curators and software developers who are available to help. All the curators were originally trained as researchers and can handle scientific questions. Also, because ZFIN needs to be responsive to the needs of the scientific community, nonzebrafish researchers should feel encouraged to contact ZFIN if they have questions or suggestions for changes or improvements. There is a button on every ZFIN page to provide input.

What is the most often requested item or service that your Web site does not provide?

M.W.: Users very much want to see ZFIN data integrated with the zebrafish genome sequence. ZFIN plans to implement this feature during the coming year, using a genome browser that will place ZFIN data onto the genome.

A.P.: We do not provide any data that are automatically extracted from abstracts of published papers. This is based on our philosophy that we would like to have experimentally proven data in HPRD.

What are the most difficult aspects of running the database and keeping it up to date?

M.W.: The information generated by scientific research is a moving target. New techniques are always being developed. Because it takes considerable planning and work to extend the database, the ZFIN staff has to work very hard to anticipate the future and implement new features in a timely manner while keeping up with curation.

A.P.: The data in the published literature are growing at a tremendous pace, which makes it increasingly difficult to keep the database current and yet error-free. The bottleneck is essentially that we do not have enough trained biologists to carry out the curation on the scale that is necessary. This is primarily due to lack of a funding mechanism for HPRD.

What new features do you anticipate adding in the next year or two?

M.W.: In addition to a genome browser, ZFIN plans to add support for antibodies and to start curating antibody labeling patterns. ZFIN will also provide expanded support for mutant phenotypes, including much more powerful methods to search the data.

A.P.: We would like to have greater community participation. In this regard, we have already initiated a community effort called Human Proteinpedia, which is a portal like Wikipedia that allows scientists to annotate their own data (for which they have experimental results) on top of HPRD data. We have over 70 proteomics labs already participating in this and hope to expand it such that, one day, every scientist who generates proteomic data will participate. We are also developing a pathway resource that is linked to HPRD called NetPath. We have been thinking of adding transcript level information as well as more detailed information related to human diseases such as cancers and levels of proteins in health and disease.

What have been the most significant advances over the past few years?

M.W.: One of the most significant recent advances was our successful negotiation with most of the scientific journals for permission to include figures from zebrafish research articles in ZFIN. The majority of these figures illustrate gene expression patterns and mutant phenotypes. Soon, ZFIN will also display journal figures with antibody labeling patterns. These data records link directly to the original journal article at the publisher's site, thus providing an easy and powerful means for researchers to find what they need.

A.P.: The amount of data in HPRD has been steadily increasing and the community is making greater use of the data (we have 70,000 downloads per month).

Do you anticipate that the database will look much different 5 or 10 years from now?

M.W.: ZFIN will unquestionably look much different in 5 to 10 years. The research enterprise constantly changes as new (and old) questions are studied with new methods. ZFIN will similarly change as it extends to support the new types of data generated by these studies. ZFIN will also need to change in response to constantly changing Web technology. The Web is much different today than it was in 1994, and we can expect similarly dramatic changes in the next 5 to 10 years.

A.P.: Yes, I think it will turn into a systems biology platform and become the reference for information about human proteins, much like an encyclopedia. It will serve as a model for how the community can get involved to continuously update and curate biological datasets in general.

What career opportunities are there for biologists in bioinformatics?

M.W.: One of the most interesting and exciting careers for biologists in bioinformatics is to become a scientific curator. In addition to working with a broad range of different kinds of data, curators often communicate with researchers to help resolve contradictions and ambiguities. Curators also help drive development of new software by providing an analysis of the needs of the scientific community and by participating in interface development. Recently, curators at the model organism databases formed an organization of biocurators that holds an annual meeting to discuss new developments in the research community. There is currently a shortage of scientific curators; more developmental biologists, in particular, are desperately needed.

A.P.: I think that biologists make the best bioinformaticians. We are already at a point where we are producing far more data than we can effectively analyze in a reasonable time frame. Some parts of the data are simply discarded or wrongly analyzed because of lack of competent bioinformaticians. With training in specific disciplines of biology such as genomics or proteomics along with additional expertise in computer programming and statistics, I think such biologists would be instantly recruited by most labs that produce lots of data. In addition, such biologists can also pore over the vast amounts of data in the public domain for meta-analysis or data mining purposes.


I thank Richard Dorsky for critical reading of the manuscript. I also thank Monte Westerfield and Akhilesh Pandey for sharing their insights.