Genome annotation in a community college cell biology lab


Address for correspondence to: C. Timothy Beagley, Department of Biology, Salt Lake Community College, Salt Lake City, Utah.


The Biology Department at Salt Lake Community College has used the IMG-ACT toolbox to introduce a genome mapping and annotation exercise into the laboratory portion of its Cell Biology course. This project provides students with an authentic inquiry-based learning experience while introducing them to computational biology and contemporary learning skills. Additionally, the project strengthens student understanding of the scientific method and contributes to student learning gains in curricular objectives centered around basic molecular biology, specifically, the Central Dogma. Importantly, inclusion of this project in the laboratory course provides students with a positive learning environment and allows for the use of cooperative learning strategies to increase overall student success.

© 2012 by The International Union of Biochemistry and Molecular Biology, 41(1):44–49, 2013

The science of genomics lends itself nicely to the contemporary educational goals of incorporating inquiry-based learning projects [1-3] and developing 21st Century skills [4, 5]. Indeed, many educators are finding creative ways to successfully integrate genomics into virtually every part of the life science curriculum [6-11]. Educators are providing rich critical thinking and self-directed learning opportunities for their students [12] by merging student interest in, and mastery of, computers [13] with the difficult concepts of molecular biology [14]. Additionally, the virtual learning environment used in these projects makes it relatively easy to assess student performance [15] and to facilitate the professional development of educators [16, 17].

Community colleges have struggled to implement the aforementioned laboratory reforms. With necessarily low tuition and diminishing governmental support, these two-year institutions operate on very lean budgets [18], with most of their resources going directly to personnel. Even so, the manpower needed to incorporate these important changes is often not available, due to the extraordinarily high teaching loads placed on full-time faculty and the large percentage of teaching performed by part-time adjunct instructors [19]. Ironically, recent dramatic enrollment increases have drawn in many students with accumulated risk factors [20], that is, the very students who would benefit most from these educational restructurings. Fortunately, the IMG-ACT toolbox [8] comprises a low-cost and ready-made resource that community college faculty may draw from to initiate the process of updating their curriculum and better serving students.

Using the IMG-ACT system, the Biology Department at Salt Lake Community College (SLCC) incorporates genome mapping and annotation in the laboratory portion of the sophomore-level Cell Biology (BIOL 2020) course. While providing students with an individualized inquiry-based project, this laboratory module also reinforces important student learning objectives. Students learn about the Scientific Method by hypothesizing protein coding genes for a segment of a prokaryotic genome, and then use the annotation tools of IMG-ACT to collect evidence in support of or against their hypotheses. They also gain a deep understanding of the Central Dogma of Biology by using it directly in their annotation projects, then communicating their results to their peers. Lastly, the mapping and annotation students perform in this exercise introduces them to the realm of computational biology and helps them see the capabilities and limitations therein.

The SLCC Genome Mapping and Annotation Project


BIOL 2020 (Cell Biology) is taught at SLCC to approximately 50 students per semester and a total of 100 students per year. Students who take BIOL 2020 are working toward a degree or certificate in a science field (e.g., Biology, Biotechnology, and Chemistry) or are fulfilling entrance requirements for professional programs such as Dental, Medical, and Pharmacy schools. As a prerequisite to Cell Biology, all students must previously have earned a grade of C or better in College Biology I (BIOL 1610).

Genome Mapping

Students initiate their genome projects by generating an Open Reading Frame (ORF) map for a 50,000 nucleotide base pair segment of the Halogeometricum borinquense genome. SLCC selected this archaeal genome to map and annotate in order to complement the research being done at some of our in-state sister institutions that are assembling the genome sequence for an extreme halophile isolated from the Great Salt Lake. Working in pairs, students use the ORF Finder website [21] and a self-generated spreadsheet to make the ORF map. Specifically, each team copies and pastes the DNA sequence (in FASTA format) into the open dialog box of the ORF Finder program. They then select Bacterial Codes from the drop-down menu for Genetic Codes and click the action box for OrfFind. The program instantly generates a listing of all ORFs in the sequence, ordered from the largest number of nucleotides to the smallest. The program also lists the position in the sequence for each ORF and the reading frame (+1, +2, and +3 are forward frames while −1, −2, and −3 are reverse frames). Each student team generates a map that graphically displays their interpretation of the ORFs for this segment of the genome. Some leeway is given in the construction of the ORF maps, so as to allow room for students to illustrate their own interpretation of the data. The construction of the map has only three requirements: (1) the nucleotide positions of each ORF must be shown, (2) the reading frame must be somehow indicated, and (3) the ORFs cannot overlap extensively. An example of one of these ORF maps is shown in Fig. 1. In the example shown, students used columns to show map position and color-coded rows to show the reading frames.
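
The kind of listing described above can be illustrated with a minimal Python sketch. It is not the ORF Finder algorithm: it assumes ATG as the only start codon and the three standard stop codons, whereas the bacterial genetic code setting also admits alternative starts. The function names, the `min_len` parameter, and the way reverse frames are numbered are assumptions made for this illustration, so frame labels may not match those reported by the website.

```python
from typing import List, NamedTuple

STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"  # simplification: alternative start codons are ignored


class Orf(NamedTuple):
    frame: int   # +1..+3 forward, -1..-3 reverse (numbering is this sketch's convention)
    start: int   # 1-based coordinate on the forward strand (lower end)
    end: int     # 1-based coordinate on the forward strand (upper end)
    length: int  # nucleotides, start codon through stop codon


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def _scan_strand(seq: str, sign: int, min_len: int) -> List[Orf]:
    n = len(seq)
    found = []
    for offset in range(3):
        start = None
        for i in range(offset, n - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                length = i + 3 - start
                if length >= min_len:
                    if sign > 0:
                        lo, hi = start + 1, i + 3
                    else:                      # map back to forward-strand coordinates
                        lo, hi = n - i - 2, n - start
                    found.append(Orf(sign * (offset + 1), lo, hi, length))
                start = None                   # stop codon closes the ORF
    return found


def find_orfs(seq: str, min_len: int = 75) -> List[Orf]:
    """Return ORFs from all six frames, longest first."""
    seq = seq.upper()
    orfs = _scan_strand(seq, +1, min_len)
    orfs += _scan_strand(reverse_complement(seq), -1, min_len)
    return sorted(orfs, key=lambda o: o.length, reverse=True)


print(find_orfs("CCATGAAATAGCC", min_len=9))
# [Orf(frame=3, start=3, end=11, length=9)]
```

Sorting by length mirrors the order in which students see the ORFs listed, and the forward-strand coordinates are what a team would transfer to its spreadsheet map.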

Figure 1.

Example of a student generated ORF map for a 50,000 nucleotide base pair segment of the H. borinquense genome. Two students work together using the NCBI ORF Finder website [21] to propose the location of protein coding genes for this segment of the genome. Each box in the map represents a distance of 200 nucleotide base pairs and each bold vertical line represents 1000 nucleotide base pairs color coded red or blue to show the accumulated distance on the map. The three forward and three reverse reading frames are color coded as shown on the left and correspondingly throughout the map where ORFs are identified. The red arrows show that these students have located the ORF in a six-frame translation diagram and have identified the in-frame codons immediately upstream and downstream of the ORF. The star near position 21,000 indicates an area of the genome where no satisfactory ORFs were identified and where further investigation is warranted.

Once the basic map is generated, students are required to locate the five largest ORFs in a six-frame translation of the sequence. To demonstrate their ability to perform this task students must list, on the map, the start and stop codons for each of the five largest ORFs as well as the in-frame codons immediately before and after these ORFs (shown by the red arrows in Fig. 1).
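
The task of reading off the start codon, the stop codon, and the in-frame codons on either side of an ORF can be sketched as follows. This is an illustration only: the function name and the dictionary keys are hypothetical, and the sketch handles forward-strand ORFs (a reverse-strand ORF would first need the sequence reverse-complemented and its coordinates converted).

```python
from typing import Dict, Optional


def flanking_codons(seq: str, start: int, end: int) -> Dict[str, Optional[str]]:
    """Report the codons a student would mark for one forward-strand ORF.

    start and end are 1-based, inclusive coordinates running from the first
    base of the start codon to the last base of the stop codon.
    """
    seq = seq.upper()
    s, e = start - 1, end            # convert to 0-based, half-open
    if (e - s) % 3 != 0:
        raise ValueError("ORF length must be a multiple of three")
    return {
        # in-frame codon immediately before the ORF, if the sequence extends that far
        "upstream": seq[s - 3:s] if s >= 3 else None,
        "start": seq[s:s + 3],
        "stop": seq[e - 3:e],
        # in-frame codon immediately after the ORF, if the sequence extends that far
        "downstream": seq[e:e + 3] if e + 3 <= len(seq) else None,
    }


print(flanking_codons("GGCATGAAATAGCCA", 4, 12))
# {'upstream': 'GGC', 'start': 'ATG', 'stop': 'TAG', 'downstream': 'CCA'}
```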

When all student teams have completed their ORF maps, the entire laboratory section works together to generate a consensus ORF map. This usually involves considerable discussion among students, centered on how to make the most realistic and likely hypothesis in terms of where the protein coding genes are located. This consensus ORF map then becomes a working group hypothesis that will be tested subsequently using computational biology and the IMG-ACT toolbox.

Gene Annotation using IMG-ACT

Using the consensus ORF map as a starting point, the laboratory instructor determines the similarity between the student-generated map and the genes that have been given gene object identifiers (OIDs) by the IMG-ACT program. Genes with OIDs that match the student ORFs are distributed to the laboratory students. Each student is assigned two genes to annotate, even if multiple students end up working on the same gene. ORFs that the students have identified that do not appear in IMG-ACT are also assigned to students, so that each student has at least one possible gene to annotate that is not directly accessible through the IMG-ACT system. These latter genes are made available to students as FASTA-formatted DNA sequences and proposed amino acid sequences. Students attempt to verify protein coding likelihood for these ORFs using the programs in the IMG-ACT system, but construct their own laboratory notebook file in which they place the evidence gathered.

Working in the Basic Information Module of IMG-ACT, students collect introductory information regarding their assigned ORFs from the Gene Details Page. They verify the position in the genome of their ORF and collect the nucleotide sequence as well as the proposed amino acid sequence. Students enter this information into a virtual laboratory notebook provided by IMG-ACT. While this module may seem somewhat routine, it is actually very important since it allows students the opportunity to become familiar with the IMG-ACT interface, as well as the virtual laboratory notebook where they will store all the accumulated evidence listed hereafter.

The bulk of the evidence students gather to show that a given ORF is, in fact, a protein coding gene comes from the Sequence-based Similarity Data Module. In this extensive module, students are linked to various internet programs and databases where they can look for similar genes in other species. Using the Basic Local Alignment Search Tool (BLAST), students use their hypothetical protein as a query to search through curated protein databases, looking for proteins with similar stretches of amino acids. They are also introduced to the concepts of computer-generated alignments and the statistical expect value (E-value). The BLAST results also allow students to see whether their proposed gene matches any of the multiple sequence alignments stored in the Conserved Domain Database (CDD). Students then generate their own multiple sequence alignment using the Tree-based Consistency Objective Function For alignmEnt Evaluation (T-COFFEE) tool. From the T-COFFEE results, students then generate a diagrammatic display of their alignment using the WebLogo program.
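
For readers who want to see what the E-value measures, the following sketch applies the standard relationship between a normalized bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n is the total database length. The function name is an assumption made for this illustration, and BLAST itself uses effective (edge-corrected) lengths and other adjustments, so values computed this way will only approximate those reported by the program.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S') with raw (not effective) lengths, so the result
    is an approximation of what BLAST would report.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Each additional bit of score halves the expected number of chance hits.
print(expect_value(10, 100, 1024))   # 100.0
print(expect_value(11, 100, 1024))   # 50.0
```

The example makes two points that students often miss: an E-value is an expected count rather than a probability, and the same alignment score becomes less significant as the database searched grows larger.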

From the Cellular Localization Module, students continue to look for evidence that they are working on a real protein coding gene. They use the Transmembrane Helices Hidden Markov Models (TMHMM) web site to see whether their proposed protein has similarity to any known proteins with transmembrane domains. This program also helps students predict where the proposed protein may be located in the cell. Students also use the SignalP web site to search for possible signal sequences that might play a role in the sorting of their proposed protein. The combined outputs of TMHMM and SignalP are consolidated into a single graph using the Phobius web site. Lastly, the PSORT-B program is used to look for cellular localization using a different algorithm. To finish this module, students record a formal hypothesis regarding the location in the cell where their proposed protein would most likely be found.

The final required IMG-ACT exercise is the Alternative Open Reading Frame Module. In this module, students use the IMG-ACT graphical interface to reexamine a six-frame translation centered on the start site of their ORF. They specifically look for other possible start codons in conjunction with possible ribosome binding sites. In this exercise, students are encouraged to refer back to their multiple sequence alignments as an additional tool for assessing the location of the most likely start position for translation. After gathering as much information as possible, students are asked to suggest the location for translational initiation and, therefore, the formal amino acid sequence of their proposed protein.
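
The search for alternative start codons can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the IMG-ACT procedure: the function name, the choice of ATG, GTG, and TTG as candidate starts, the use of "GGAGG" as a stand-in ribosome binding site motif, and the 20-nucleotide upstream window are all assumptions, and appropriate signals may differ for an archaeal genome.

```python
from typing import List, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}   # assumed candidate starts
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq: str, called_start: int,
                     rbs_motif: str = "GGAGG",
                     window: int = 20) -> List[Tuple[int, str, bool]]:
    """Walk upstream, in frame, from the called start of a forward-strand ORF.

    Returns (1-based position, codon, motif_found) for the called start and
    for every in-frame candidate start upstream of it, stopping at the first
    in-frame stop codon. motif_found is True when rbs_motif occurs within the
    `window` nucleotides immediately upstream of the candidate.
    """
    seq = seq.upper()
    s = called_start - 1
    results = []
    i = s
    while i >= 0:
        codon = seq[i:i + 3]
        if i != s and codon in STOP_CODONS:
            break                                  # cannot extend past an in-frame stop
        if codon in START_CODONS:
            upstream = seq[max(0, i - window):i]
            results.append((i + 1, codon, rbs_motif in upstream))
        i -= 3
    return results


print(candidate_starts("TAAGGAGGTTTTGTGAAAATGCCCTAA", 19))
# [(19, 'ATG', True), (13, 'GTG', True)]
```

A listing like this only narrows the options. As described above, students weigh the candidates against their multiple sequence alignments before proposing a site of translational initiation.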

Annotation Reporting

Students are allowed three weeks to gather data in support of or to refute the protein coding hypothesis for each of their assigned ORFs. This work is done largely on their own but some support is available to students via email, tutorials, and instructor availability for assistance. In addition to gathering data, students are required to prepare a presentation for the class. At the end of the time allocated, each student presents their evidence to the class and provides an interpretation of the ORFs they were assigned. If the gene was called by IMG-ACT the student is asked to verify or refute that call based on their gathered evidence. For the uncalled ORFs, students are asked to predict whether or not the ORF represents a real protein coding gene and, if so, what that protein may be. The interpretations reported by students are added to the consensus ORF map so students can see the genome section being annotated and the project being completed.

The laboratory portion of Cell Biology is required and accounts for 20% of the grade for each student. In total, the Genome Mapping and Annotation Project uses approximately one-fourth of the total laboratory time for the semester and, thus, accounts for 5% of the total student grade. The total project point value includes participation points, instructor evaluation points, and a point value assigned collectively by the other students in the laboratory based upon the final project presentations. Specifically, the mapping portion of the project has 10 points awarded for producing a complete ORF map that is accurate within the boundaries established above. The gene annotation portion of the project includes 10 points prorated to indicate how completely the laboratory notebook was filled in and 20 points awarded during the final presentation, which comprises a report for all three assigned ORFs. Each final presentation is judged by the instructor and the lab students, with 10 points coming from the instructor and 10 points from the average score awarded by the peers. The scoring rubric used by both the instructor and the peers is provided in the Supporting Information.
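
The arithmetic behind these weights can be checked with a short sketch. The point values and percentages come from the description above; the function names, and the assumption that the project's 5% share of the course grade scales linearly with the 40 project points, are illustrative only.

```python
from typing import Sequence

LAB_SHARE_OF_GRADE = 0.20        # laboratory portion of the course grade
PROJECT_SHARE_OF_LAB = 0.25      # approximately one-fourth of laboratory time

POINTS = {
    "orf_map": 10,
    "notebook": 10,
    "presentation_instructor": 10,
    "presentation_peers": 10,
}


def presentation_score(instructor: float, peer_scores: Sequence[float]) -> float:
    """Instructor score (out of 10) plus the average of peer scores (out of 10)."""
    return instructor + sum(peer_scores) / len(peer_scores)


def project_contribution(points_earned: float) -> float:
    """Fraction of the course grade earned through the project (assumed linear)."""
    total = sum(POINTS.values())                 # 40 points in all
    return LAB_SHARE_OF_GRADE * PROJECT_SHARE_OF_LAB * points_earned / total


presentation = presentation_score(8, [7, 9, 8])  # 8 + 8.0 = 16.0
earned = 10 + 9 + presentation                   # map + notebook + presentation
print(presentation, earned, round(project_contribution(earned), 4))
# 16.0 35.0 0.0437
```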

Pedagogy and Assessment

The Scientific Method

Students participating in this laboratory project build upon their understanding of the scientific method but, more importantly, use it directly. The consensus ORF map generated by each laboratory section represents a well thought out and testable hypothesis created by the students. Students learn to make reasoned assumptions as they build their hypothesis. For example, students can argue that the largest ORFs are more likely to be actual protein coding genes and that there is unlikely to be a large amount of non-coding sequence in a tightly packed prokaryotic genome. The IMG-ACT system provides students with the tools they need to begin the process of testing their consensus hypothesis. When they see that homologous genes have been identified in other species and that biochemical studies have already demonstrated the function of the encoded protein, they are able to show that their hypothesis has been supported. Of course, failure to find such evidence can be used to show a lack of support for their hypothesis and force the class to reconsider or change the consensus ORF map. Importantly, the project also allows students to see that they can participate in the process of science even as it introduces them to the collective and critical components therein.

The Central Dogma of Biology

Students in sophomore-level Cell Biology courses are expected to have already encountered the Central Dogma of Biology in introductory settings. That said, they often find the concepts to be abstract and difficult to understand. In particular, they struggle with the details and rationale of transcription, translation, and the genetic code. This project provides a hands-on approach to learning these concepts and a practical way for students to encounter the flow of genetic information. Students work directly with nucleotide and amino acid sequences as they construct their ORF maps and as they attempt to annotate their proposed protein coding genes. The double-stranded nature of DNA becomes clear to students as they look for ORFs in a six frame translation of their genome segment and construct a map that takes all six reading frames into account. Additionally, students use the genetic code table as they perform the Alternative Open Reading Frame Module of IMG-ACT.

Since the laboratory project begins after students have covered the material in lecture and student understanding has been measured on a midterm exam, valid assessment comparisons are possible before and after the project. During spring semester 2012, assessment items regarding various molecular biology topics showed an average student mastery of 64% after lecture exposure but prior to the laboratory project. When similar questions were included on the comprehensive final exam (i.e., after the laboratory project had been completed), the percentage of students showing understanding of these topics increased to between 84% and 95% (Fig. 2).

Figure 2.

Percentage of students answering a set of questions correctly for the five molecular biology topics listed. These results are from the comprehensive final exam given spring semester 2012. The line at 64% represents the average percentage of correct answers per student on the second midterm exam, which covers the entire collection of topics.

Computational Biology

The exercises included in this project provide students with a formidable introduction to the capabilities of computational biology. They encounter websites and algorithms that can rapidly manipulate and analyze large amounts of data, then provide practical output to the user. While the acquisition of literacy in computational biology is clearly beneficial for students, two other aspects of this feature have been directly observed. First of all, the majority of students perform this project on their own computer, be it a laptop, tablet, or even a hand-held device. Using an instrument that they feel comfortable with helps them achieve competency and understanding of a rather difficult academic topic. Students who do not have a portable computer or whose device demonstrates some sort of system incompatibility are allowed to use laptop computers provided in the lab or desktop computers at other SLCC locations. Second, the computer skills learned in this project are of value to a wide array of students. Community college students taking Cell Biology may end up with advanced degrees in Biological Science areas but are more likely to be seeking entrance into a professional program or simply seeking job skills to enhance their current capability. This project teaches skills and aptitudes that can benefit students in any number of these areas.

One concern that has emerged is that students might finish this project thinking that IMG-ACT is the only gateway they can use to perform gene annotations. It is for this reason that each student is asked to annotate one of the ORFs that did not show up in the IMG-ACT system. They are provided with the nucleotide and amino acid sequences for this possible protein coding gene and are asked to annotate it on their own. Some students go directly into the IMG-ACT laboratory notebook for one of their called genes and use that as a guide for their manual annotation. Other students, however, go directly to the websites used for annotation and produce a report independent of the IMG-ACT laboratory notebook. Either way, it is hoped that students gain an understanding of the true nature of the IMG-ACT toolbox, and appreciate the fact that gene annotations can occur outside of that system.

Student Perceptions

Laboratories continue to play an important role in science education even as their structure and practice change. The benefits of inquiry-based learning experiences are clear, and educators are working hard to provide students with more authentic laboratory opportunities [1-3]. It has also become evident that student predilection toward the laboratory component is a key indicator of success [22]. In order to scrutinize student perceptions of the Cell Biology Laboratory, students were asked to rate how helpful their laboratory experience was. To provide comparison, the same question was asked of students relative to their Microbiology Laboratory, which does not yet incorporate such an inquiry-based component. As shown in Fig. 3, while the lectures for Cell Biology and Microbiology were perceived by students to have similar degrees of helpfulness, students perceived the laboratory portion of Cell Biology to be significantly more helpful than that of Microbiology. Incorporating laboratory exercises such as the one described here can improve student attitudes toward the laboratory and provide them with a more positive learning experience.

Figure 3.

Results of student survey questions asking about the perceived helpfulness of different course components. Students participated in this survey at the end of Spring Semester 2012. The survey was taken by 22 Cell Biology students (55% response rate) and 48 Microbiology students (73% response rate). The standard deviations are as follows: Microbiology Lecture 0.27, Cell Biology Lecture 0.35, Microbiology Lab 0.67, and Cell Biology Lab 0.59.

Cooperative Learning

The community college setting is ideal for implementing learning projects that use and model cooperative learning strategies [23, 24]. With a stronger focus on skill acquisition and a somewhat diminished need for an ultracompetitive academic environment, faculty are freer to use this valuable tool to improve student success. This Genome Mapping and Annotation Project offers a number of cooperative learning opportunities. The mapping portion of the project initially involves two students working together to generate their draft ORF map, which then becomes available for all teams to examine and scrutinize. Additional cooperation is needed as the two-person teams then aggregate their results into a single consensus ORF map and formal scientific hypothesis. The project continues to be cooperative as individual annotations are presented to the entire class for discussion and the group hypothesis is reexamined and refined.

Individual Research

The Genome Mapping and Annotation Project often spins off questions and projects that students can work on individually and outside the confines of the traditional course setting. Students who become particularly interested in some aspect or another of the project can sign up for elective credit in a section of Independent Study (BIOL 2990). As an example of how this might work, Fig. 1 shows a segment of the genome where the class was not able to identify an ORF and where the IMG-ACT system did not call a protein coding gene (see the star at position 21,000). In this particular case, a student is further investigating that DNA sequence in order to see if there might be a ribosomal RNA gene, some transfer RNA genes, or some other as yet unidentified element. The ability to offer students opportunities for further analysis and research is invaluable at the community college level.


The IMG-ACT toolbox has allowed Salt Lake Community College to create the Genome Mapping and Annotation Project that is now a fixture in the Cell Biology Laboratory. This inquiry-based project provides students with an opportunity to perform authentic laboratory research and has helped produce significant learning gains relative to basic molecular biology. The project has also contributed to positive student attitudes toward the laboratory section, has provided a cooperative learning environment, and allows for further independent study by interested students.


The author would like to thank the developers and support staff in the Education Program at the Joint Genome Institute, especially Cheryl Kerfeld for directing the program and Seth Axen for endless support and assistance using IMG-ACT. The author would also like to thank Anna A. Beagley for helping prepare this manuscript and all the SLCC students who have participated in this project and provided valuable feedback and suggestions. Funding to support this publication was provided by the Microbial Genome Annotation Network, an NSF RCN-UBE funded project (DBI 0954829).