Genomics and bioinformatics in undergraduate curricula: Contexts for hybrid laboratory/lecture courses for entering and advanced science students

Abstract

Emerging interest in genomics in the scientific community prompted biologists at James Madison University to create two courses at different levels to modernize the biology curriculum. The courses are hybrids of classroom and laboratory experiences. An upper level class uses raw sequence of a genome (plasmid or virus) as the subject on which to base the experience of genomic analysis. Students also learn bioinformatics and software programs needed to support a project linking structure and function in proteins and showing evolutionary relatedness of similar genes. An optional entry-level course, taken in addition to the required first-year curriculum and sponsored in part by the Howard Hughes Medical Institute, engages first-year students in a primary research project. In the first semester, they isolate and characterize novel bacteriophages that infect soil bacteria. In the second semester, these young scientists annotate the genes on one or more of the unique viruses they discovered. These courses are demanding but exciting for both faculty and students and should be accessible to any interested faculty member.

INTRODUCTION

Two courses at the early and late stages of a biology/biotechnology curriculum were designed to fill a disciplinary gap and meet a need for more active student engagement in science education. Genomics, the study of the entire genome of an organism, has “come of age” as a subdiscipline of biology since the explosion of genome sequencing beginning in the early 1990s. This has created the need to include the underlying skills and content about genomics in a complete biology curriculum, either as a stand-alone course or embedded in related subject areas such as molecular genetics or bioinformatics [1–5]. Over the same time period, science education research has shown that the incorporation of hands-on, inquiry-based activities provides a superior learning environment for science students [6–14]. The experiences described herein provide this kind of valuable student experience and are based on a philosophy of conducting original research in the classroom.

Efforts to introduce genomics to the biology curriculum at James Madison University began with the development of an upper level Genomics course in 2005, in which multi-antibiotic resistance plasmids or viruses, isolated as part of ongoing research programs of faculty members, were the subjects of sequencing and analysis. With increasing interest by the scientific community in bacterial viruses (bacteriophages) as sources of unique genomic information, the Howard Hughes Medical Institute's Science Education Alliance (HHMI-SEA) program, which uses phage discovery as an educational tool, was a natural avenue for introducing genomics at a very early level. Thus, Viral Discovery and Bioinformatics, designed for entering college freshmen, began in 2008, and it is partially funded by HHMI-SEA. This is a two-semester sequence in which soil viruses are isolated and purified, and their DNA is isolated, sequenced, and analyzed. Both courses are based on original research, in that the information generated is new to everyone, including the instructors. Both are also unusual in that they are hybrids of lecture and laboratory, physically meeting in teaching laboratories with computers available either in the same room or in adjacent space. Thus, activities are often hybrids of wet lab and computer lab work. As an entry-level course, the goal of Viral Discovery and Bioinformatics is primarily to motivate students to study biology or related fields and pursue a career in research. In contrast, the goal of Genomics is to allow students to develop a set of skills and to become proficient in genome analysis.

The goal of this study is to describe the theoretical underpinnings of these two courses and to give enough detail so that others might create similar courses, using their own research models, or incorporate portions of our materials into existing courses.

GENOMICS: UPPER LEVEL BIOLOGY COURSE

The upper level genomics course (4 credits) in the biology curriculum is an elective course that can be used to meet one of the requirements for biology and biotechnology majors. A 2-year biology core sequence is required before students can enroll. Two molecular biologists, one studying plant proteins and the other microbial genomics, originally developed the course. This combination of interest and expertise led to the twin emphases of 1) DNA sequencing, analysis, and gene predictions with 2) protein structure and structure/function predictions based on alignments of protein families. This course has run for four semesters, with an enrollment varying from 9 to 18.

Genomics is structured on primary literature as a context for learning the concepts, processes, and skills required to understand genomics. Software skills and understanding are taught through tutorials with an application from the literature or using primary data from the course. For example, learning about and creating phylogenetic trees is performed after reading two articles [15, 16]. Furthermore, alignment of sequences using ClustalW is performed using multiple copies (paralogs) of a gene in Bordetella avium, the subject of research by one of the authors (LT). Simplified information about the theoretical background of certain software packages is often provided (for example, refs. 17 and 18). The exercises for one iteration of the course can be seen and downloaded at this web site: http://csm.jmu.edu/biology/Cresawsg/genomics/genomics.html. The wet lab and bioinformatics analyses are interwoven with short lectures, tutorials on software, literature presentations by students, and guest research presentations that illustrate practical applications of what the students are learning. A new module with rudimentary programming exercises relevant to bioinformatics was added recently to introduce students to this field.

Material Used for Sequence and Analysis

In each of four iterations of this course thus far, the sequence of all or a portion of a small (∼40–80 kb) genome (phage or plasmid) was obtained and analyzed. For reasons of cost, time, and the availability of suitable facilities on campus, sequencing was performed by collaborators at larger academic institutions or by commercial entities. In the case of phage genomic sequencing, the novel phages were obtained by research students of the authors (LT, SC). A number of scientists around the country have isolated and stored novel phages that have not yet been sequenced and analyzed; this is a source of materials that could be obtained on request. Anyone interested in pursuing this type of course using phage genomes should contact the Pittsburgh Phage Institute, HHMI-SEA, or the authors as sources of novel phages. The plasmid that was partially sequenced in another iteration of this course was obtained from a JMU professor who studies naturally occurring plasmids containing multiple antibiotic resistance genes. One very interesting observation that resulted from using a phage genome 1 year and the environmental plasmid the next is that the phage had a majority of unique genes (few or no matches in GenBank to genes with known functions), while the environmental plasmid was a mosaic of many, highly conserved sequences. These major differences required creative teaching approaches.

Cost of the Course

The major expense of this course is the cost of sequencing. Depending on the estimated size of the phage genome or plasmid to be sequenced, the work can be contracted for approximately $1,000–$3,000. Many computer programs for sequence analysis and assembly are freely available via the Internet. Examples include the Contig Assembly Program (http://www.cs.sunysb.edu/~algorith/implement/cap/implement.shtml) and the Phred/Phrap/Consed suite of programs, for which one must obtain a license (free to noncommercial users).

In a process as complex as sequencing a genome, student involvement at every possible step is important. Not only does this make sense from a faculty workload perspective, but it also helps students who might otherwise lose track of the big picture and encourages their ownership of the project.

Sequence Analysis

Sequence analysis and genome annotation make up a major portion of the course. This process involves learning about gene structure as well as deciding which genes to include in the final analysis, as some are not homologous to existing genes. In all cases thus far, our research subjects have been prokaryotic, thus obviating the need for predicting introns and exons. The programs GeneMark and Glimmer have been used for gene calling in the sequenced genomes (URLs on class web site, see earlier). By using both prediction programs, the students see that there can be some disagreement between them, and this situation is an excellent place for learning about the ambiguities and limitations of predictive programs. In one exercise, teams of students compare the gene predictions in assigned regions of the genome and make an informed choice between the two programs when differences occur. This may require students to argue the merits of their annotations.
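The team comparison of two gene callers can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the procedure used in the course: the GeneCall record, the compare_calls function, and the rule that two calls describe the same gene when they share a strand and a stop position are all hypothetical choices made for this example, and the sketch does not read the actual output files of either program.

```python
from typing import NamedTuple


class GeneCall(NamedTuple):
    """Hypothetical record for one predicted gene (not a real program's format)."""
    start: int   # coordinate of the first base of the start codon
    stop: int    # coordinate of the last base of the stop codon
    strand: str  # "+" or "-"; on the "-" strand start is greater than stop


def compare_calls(calls_a, calls_b):
    """Sort calls from two predictors into agreement categories.

    Assumption: two calls describe the same gene when they share a strand
    and a stop position; they may still disagree on the start codon.
    """
    by_stop_b = {(g.strand, g.stop): g for g in calls_b}
    identical, start_differs, only_a = [], [], []
    for g in calls_a:
        match = by_stop_b.pop((g.strand, g.stop), None)
        if match is None:
            only_a.append(g)                 # predictor B has no such gene
        elif match.start == g.start:
            identical.append(g)              # full agreement
        else:
            start_differs.append((g, match))  # same gene, different start
    only_b = sorted(by_stop_b.values())       # calls left unmatched in B
    return identical, start_differs, only_a, only_b
```

Calls that land in the start_differs, only_a, or only_b groups are the ones a team would need to examine and argue about; the identical group needs no discussion.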

Once the positional annotation of the genome is complete, students annotate the function of the genes using sequence homology along with domain and structural prediction programs. Once the set of genes has been agreed upon, procedures for choosing the information to record about each predicted gene product are established. Annotations are prepared in a format that conforms to GenBank or EMBL guidelines, so that later submission of the information will not require additional work. Although the submission format is defined by GenBank or EMBL, the information about the genes and other features depends entirely on what the authors (scientists) wish to include. An obvious question about the data acquisition and analysis is whether any of this is publishable, and if so, how easily. The answer depends somewhat on how much more information, besides the sequence and annotation, is essential. We have not thoroughly answered this question, as the genome in our first project was much larger than we had predicted; we only recently completed the analysis of the phage and are currently finishing studies needed for publication in a journal such as Virology, which publishes bacteriophage genome papers. The environmental plasmid sequence information was supplied to the professor who had originally isolated it; thus, this information will be published in the context of related elements in his research project. Publication of the data generated by laboratory experiences such as these is a very important matter, and one that instructors contemplating this work will want to consider in advance.

Results of genome analysis from spring 2009 are shown in diagrammatic form in Fig. 1, which illustrates genome relationships among a group of bacteriophages that infect Mycobacterium smegmatis. Students annotated viral or plasmid DNA using Artemis, Glimmer, and GeneMark. Annotations of viral genomes were loaded into Phamerator (described later) for comparative analysis and representation. In the spring 2009 semester, students annotated the genome of an M. smegmatis phage, Maury. This virus is closely related to a cluster of mycobacteriophages, though it contains several insertions, deletions, and other regions of difference [19]. Because approximately 85% of the genes found in these bacteriophages are of unknown function, these types of comparisons are a useful way to identify which genes may be essential for virus growth (Fig. 1).

Figure 1.

Phage Maury was sequenced and analyzed as part of the upper level Genomics course. Phages 244, Cjw1, Porky, and Kostya were sequenced and annotated by students at the Pittsburgh Bacteriophage Institute. Genes are color-coded according to phage protein “phamily,” and shading between genomes indicates areas of nucleotide conservation. Nucleotide shading is colored by BLASTN e-values, with colors at the violet end of the spectrum indicating lower e-values.

Gene Family Structure/Function Project

In parallel with the team-organized sequencing project, students in Genomics also worked independently on a gene family, investigating aspects of structure and function through amino acid sequence alignments and analysis of protein structure.

Amino acid sequence alignments are useful to determine the relatedness of proteins and the organisms in which they occur, and they can shed light on protein function. Three-dimensional models of proteins are commonly used in classes to illustrate concepts about protein structure. These two types of data are by themselves of somewhat limited value in the classroom, but when combined they can lead to interesting insights. After using alignments to identify conserved and nonconserved regions of a protein's primary structure, these regions can be mapped onto 3D models to allow students to make subtle but powerful connections between structure and function. What may initially seem to be complex and somewhat random noise becomes a detailed history of evolutionary change. When students generate the datasets and discover the connections themselves, the exercise becomes a valuable learning experience.

The protein family project was developed to introduce students to the power of linking amino acid sequence alignments to 3D protein structures using freely available tools. Students were provided with a list of protein families from which each student chose one to work on throughout the semester (see “GeneFam paper rubric.doc” in Supporting Information). The criteria for selecting the families included the following: a structure was known for at least one member, there was at least one genome that contained a relatively large number of paralogs, there was a relatively large number of known orthologs from different genomes, and lastly, something was known about the function of the family. In addition to the name of the family, students were provided with a Protein Data Bank ID number for one structure. After finding the structure, students then used the amino acid sequence associated with that structure and BLAST to identify orthologs and paralogs.

Once students identified a set of paralogs and orthologs using BLAST, the full-length amino acid sequences were obtained from the database. Alignments were performed using ClustalW (http://www.ebi.ac.uk/clustalw/). Alignments were then inspected to identify (and eliminate) identical sequences derived from multiple entries for the same sequence, or sequences that did not align well, either because they were not homologous or because they were very different in length making them difficult to align.
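Part of this cleanup can be approximated automatically before alignment. The Python sketch below is a rough stand-in for the manual inspection, not the course's procedure: the function name filter_sequences and the max_length_ratio cutoff of 1.5 are assumptions chosen for illustration, and the sketch cannot detect sequences that align poorly because they are not homologous.

```python
from statistics import median


def filter_sequences(seqs, max_length_ratio=1.5):
    """Drop duplicate sequences and length outliers before alignment.

    seqs maps a sequence identifier to its amino acid sequence.
    max_length_ratio is an illustrative cutoff (an assumption): sequences
    more than this factor longer or shorter than the median are dropped.
    """
    # Keep only the first entry seen for each distinct sequence.
    unique, seen = {}, set()
    for name, seq in seqs.items():
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        unique[name] = seq
    if not unique:
        return {}

    # Remove sequences whose length is far from the median length.
    mid = median(len(s) for s in unique.values())
    kept = {}
    for name, seq in unique.items():
        ratio = len(seq) / mid
        if 1 / max_length_ratio <= ratio <= max_length_ratio:
            kept[name] = seq
    return kept
```

The surviving sequences would still need to be inspected by eye after alignment, as described above.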

Once a ClustalW alignment was generated, it was used as input to two programs. First, the alignment was made more visually informative using BoxShade (http://www.ch.embnet.org/software/BOX_form.html). Second, the alignment generated by ClustalW was used as input in a second submission to ClustalW, in which the neighbor-joining (NJ) tree type was selected. The resulting tree was then visualized using Treeview (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).
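Neighbor joining builds a tree from a matrix of pairwise distances between the aligned sequences. As a hedged illustration of that input, the Python sketch below computes a simple proportion-of-differences distance (p-distance) for each pair of aligned sequences. The function names are hypothetical, and this is not a description of how ClustalW computes or corrects its own distances.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing residues over columns where neither row has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Return {(name_x, name_y): distance} for every pair of aligned sequences.

    alignment maps a sequence name to its aligned (gapped) sequence.
    """
    names = list(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

Pairs with small distances end up as near neighbors in the resulting tree, which is why identical duplicate entries add nothing and are removed beforehand.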

Sequence alignments were then scanned for regions of strong identity that might be single amino acids or entire domains, or for regions of high divergence either due to point mutations or insertion/deletions. Students then mapped these regions onto the 3D structure of the homologous protein to visualize the structure of conserved and nonconserved regions.
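Scanning an alignment for conserved regions can also be sketched in code. The Python example below is a minimal illustration, assuming a simple per-column identity score; the function names, the threshold, and the minimum run length are hypothetical choices, and students in the course scanned the alignments visually rather than with a script like this.

```python
from collections import Counter


def column_identity(rows):
    """Per column, the fraction of rows sharing the most common residue.

    rows is a list of aligned sequences of equal length. Gaps count as
    rows but never as the shared residue, so gappy columns score low.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for col in zip(*rows):
        counts = Counter(c for c in col if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(rows))
    return scores


def conserved_runs(scores, threshold=1.0, min_length=3):
    """Return (first, last) alignment columns, 1-based inclusive, of each
    run of at least min_length columns scoring at or above threshold."""
    runs, start = [], None
    for i, s in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_length:
                runs.append((start + 1, i))
            start = None
    return runs
```

The column numbers returned refer to positions in the alignment, not to residue numbers in a structure file, so they must be converted before the regions are located on a 3D model.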

In addition to exploring the structure and function of their gene family using sequence alignments and 3D structures, students also read the primary literature on their protein to gain insight into known active-site residues or those residues with known functions based on biological experiments. Once these regions of interest were identified, students located them within the 3D structure of the protein and in the alignments. For example, in 2007, a student analyzed the ribonuclease family RNase HII, enzymes that degrade ribonucleotides in a hybrid duplex consisting of RNA hybridized to complementary DNA. This student examined the 3D structure of the Pyrococcus horikoshii RNase using the PDB file, 1UAX. Paralogs from each branch of the eukarya and from archaea and bacteria were aligned, and the resulting tree was consistent with relatedness of the organisms (see “RNase.ppt,” “GF paper.doc,” and “Thioredoxin.ppt” in Supporting Information). This type of meta-analysis has the potential to produce original, publishable results.

Other Projects and Resources Used in the Course

To help students appreciate the breadth of genomics and also to put some of the human-centric genome information into what is otherwise a somewhat prokaryotic-oriented course, we used additional activities and resources. A small project done in groups of two required the students to research an assigned genome with regard to several parameters, such as the rationale for sequencing, the methods used, the funding source, and any interesting outcomes (see “Instructions for Whole Genome Poster.doc” in Supporting Information). Students then shared their knowledge through posters covering their research (see “fly_poster.ppt” and “mouse_poster.ppt” in Supporting Information). We held these reporting sessions as “poster sessions” in which the students were responsible for summarizing and evaluating all other posters. In our experience, peer review helps students participate more actively and learn more in a poster session.

Additional learning resources are described in Supporting Information.

VIRAL DISCOVERY AND BIOINFORMATICS FOR ENTERING FRESHMEN

The stated goals of the freshman course are to capture the interest and imagination of less-experienced students and retain bright students in science who might otherwise become bored with standard science experiences. The impetus for starting this course came from a solicitation from HHMI-SEA (see http://www.hhmi.org/grants/sea/). Two of us (LT and SC) wrote the application to become part of the first cohort of 12 schools doing similar experiences. The funding from HHMI-SEA included all the supplies for one section of the course and training for the instructors and teaching assistants. We selected students strictly based on interest in an original research project, having advertised widely to incoming freshmen as well as to a select group of current students. The latter group included students in our K-8 teaching program. We have recently completed the first iteration of the course, in which 32 students participated, including students majoring in biology, biotechnology, physics, health sciences, and education. Although this course is not required for any major, it does fulfill electives in several majors. A web site (http://phage.cisat.jmu.edu/) shows some of the management of and results from this class.

Semester 1: Discovery and Analysis

In the first semester (fall, 1 credit), students met for 3 hours a week to isolate and partially characterize a new bacteriophage. Two host bacteria were used for this discovery portion of the course. The HHMI-SEA-supported project involved the use of M. smegmatis, a fast-growing, nonpathogenic soil bacterium related to the human pathogen M. tuberculosis. A number of phages of M. smegmatis have been isolated and sequenced by the Pittsburgh Bacteriophage Institute (http://www.pitt.edu/~biohome/Dept/Frame/pbi.htm; [19]), so that further isolates can be placed into the growing data set for comparative genomics purposes. We also used as a host organism Bacillus pumilus, a strain isolated from soil in our local greenhouse. For both of these hosts, local soil samples were searched for cognate phages. Ideally, each student would isolate his or her own phage, which empowers students to make decisions in the experimental process.

Using a host bacterium and soil samples, students learned to extract phages from the soil, remove all bacteria, enrich on the host strain, and do phage infections. When plaques appeared, several rounds of purification were done to ensure that a single phage was purified. At that point, a large amount of phage particles was produced by infecting multiple plates. Then, a phage-specific DNA purification kit (Qiagen) was used to isolate DNA in adequate amounts for restriction digests and gel electrophoresis. Finally, the students placed a portion of their purified phage preparations onto electron microscope grids, stained them with uranyl acetate, and visualized them using transmission electron microscopy.

Semester 2: Bioinformatics and Genomic Analysis

In the second semester (spring, 2 credits), students met twice a week for about 1½ hours to learn and use the software required to analyze sequence data from one phage genome. The packages included Sequencher and Phrap/Phred/Consed, which take raw data to create a consensus sequence, and GeneMark and Glimmer, which make gene predictions. They also learned to use Phamerator, a software package designed to make comparisons among related phages based on protein families; this program was designed and is maintained by one of the authors (SC). For all of these programs, students used data from their phage to learn the software.

Working with the software and original data, the students assigned gene locations on one of the phage genomes (BC from B. pumilus). Then they used BLAST [20] to assess the similarity between the predicted gene products and the protein database at the NCBI site. Using an analysis program called Artemis (freely downloadable from http://www.sanger.ac.uk/Software/analysis.shtml), they entered the information about the closest relatives of each gene.
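Identifying the closest relative of each predicted gene product can be sketched as follows. This Python example is an illustration only and assumes something the course did not necessarily do: that BLAST results are available as 12-column tab-separated text (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score), as produced by the standalone BLAST+ tabular output. The function name best_hits and the e-value cutoff are hypothetical.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Pick the highest-bit-score hit per query from tabular BLAST text.

    Assumes 12 tab-separated columns per line, with the e-value in
    column 11 and the bit score in column 12. Hits with an e-value above
    max_evalue (an illustrative cutoff) are ignored, so queries with no
    acceptable hit are absent from the result.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

A predicted gene that is missing from the returned dictionary has no match below the cutoff, which is the situation described earlier for phage genes with few or no matches in GenBank.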

DISCUSSION

Compare and Contrast

Both genomics-based courses are a mixture of didactic information and experiments. Both are based on primary, original research projects. The freshman course is almost entirely experiential; there is some teaching of concepts, processes, and vocabulary on a need-to-know basis. The upper level course is grounded in literature, with the expectation that students will then be able to read and analyze literature on exams. The freshman course is “discovery” in the sense that new viruses are found from the soil, new genome sequences are produced, and new gene families are often found. The upper level course is original research but begins after genome sequencing so as to have more time for analysis of the sequence than in the freshman course. Both courses are focused on original analysis of novel genomes, and many novel genes are discovered. Both courses have the potential for students to make presentations at regional or national meetings, based on the work of the class, and there is also the potential for publication of the work, discussed further below. Both courses provide the opportunity for students to participate in ongoing projects and recognize their important role in future studies.

Challenges

Laboratory experiences of this type are demanding for the instructors and students. There have thus far been two instructors for most of the times the courses have been taught. The instructors must be comfortable with a chaotic and unpredictable schedule of activities. However, original research is time consuming, chaotic, and unpredictable, so these qualities are unavoidable and perhaps even desirable when transforming a laboratory course into an authentic research experience.

The courses are heavily dependent on data emerging in a timely way. This requires that the professors be quite flexible and creative in their preparation and handling of the class. Because of this, small class size is preferable. Not only can a flexible and somewhat open-ended course design be stressful and time consuming to the instructors but it also does not work well for all students, particularly if they have never encountered laboratories such as these that build from week to week for the duration of the semester. Because of the nature of the course, it is difficult to have rigid expectations spelled out to the students at the beginning of the course. This can particularly be a problem for those students who tend to plan ahead and be very organized about their work.

In courses focused on research and literature, evaluation of student performance can be challenging. In particular, assessment must not be so onerous for the students that the enjoyment of and excitement about the experience ceases to be the focus.

With the goal of dissemination always in mind, a course like Genomics requires careful handling, and we have not yet achieved what we aspire to in this regard. Data management in genomics is a challenge, even in a professional laboratory. Thus, to expect that a paper might emerge from one semester of this course is a naïve view. However, to expect that a paper might emerge after a faculty member, with or without undergraduate researchers helping, completes the work or checks through for quality control, is not unreasonable. We are currently preparing two manuscripts on phage genomes that have been the subject of this course in the past. One of these will include extensive work for about 1 year following the course, performed by an undergraduate researcher and a postdoctoral fellow. The other is virtually complete with regard to analysis of the genome, but may require further experimentation to complete a story adequate for publication.

The complete genome sequences of three Bacillus phages and one mycobacteriophage were obtained in the first iteration of the Viral Discovery and Bioinformatics class. Preliminary information with regard to DNA content and morphology of many of the 20 nonsequenced phages was also obtained. How best to use the data remains an open question, but one way is to use undergraduate researchers who take an interest in following up on the biology and genomics of uncharacterized phages. One advantage to the glut of data produced if multiple genomes are sequenced is that raw genome data might be shared with others whose course budgets or circumstances might not allow the time for discovery or the cost of sequencing. As with the Genomics course, it is most likely that even with two semesters of work, unless things go extremely well, there will be work remaining. We are working with others in the consortium to find the best ways to disseminate the information that we are gathering.

THE REWARDS

By and large, both courses have been successful, even in their first versions, based on faculty and student feedback. The over-arching goal of the Genomics course is to expose students through hands-on experiences to the world of genomics. The tutorials based on the literature and raw data are effective, probably due to the fact that there is a context for the exercises. Other software exercises build on each other, such that students have to take steps one at a time and use their results to move on. These kinds of experiences work because there is accountability at each step. The protein family project is very synthetic, requiring a number of skills acquired in the course or before. The whole genome poster works to broaden exposure to the burgeoning world of completed genomes from varied forms of life.

Student enthusiasm for science was definitely achieved in the freshman course. We retained 82% of the students between the two semesters, which contrasts with 55% retention for a comparison group in the first two semesters of the traditional biology curriculum. Thus far, all the students have remained science majors, foreshadowing success, but the true outcomes will not be known for several years. Extensive assessment data being gathered by HHMI-SEA using the large cohorts involved in similar courses will certainly yield interesting information on the outcomes of this type of course for beginning students.

The discipline of genomics requires the ability to acquire, process, and store large volumes of data. Additionally, it requires facile sharing of numerous forms of digital data such as sequences, annotations, genome maps, dot plots, protein families, and information about biological pathways. One way to maintain student interest in the course and maximize student effectiveness in the research is to encourage the use of collaboration and social networking tools that the students already know. The students already use this approach informally in their courses, and we are beginning to follow suit. For example, we have augmented our use of course management software with our own wiki (http://phage.cisat.jmu.edu/hhmi) for collaborative projects.

SUMMARY

We believe that any faculty member attempting an “innovative” laboratory experience needs to go into it with an awareness of the issues and challenges that are inherent. We believe the advantages outweigh these issues, and having collectively done original research in classes for many years before diving into genomics, we might never teach a traditional kind of laboratory again by choice. The students' message is unanimous: “This is really fun!” We agree.
