A data science practicum to introduce undergraduate students to bioinformatics for research

An explosion of data available in the life sciences has shifted the discipline toward genomics and quantitative data science research. Institutions of higher learning have been addressing this shift by modifying undergraduate curricula, resulting in an increasing number of bioinformatics courses and research opportunities for undergraduates. The goal of this study was to explore how a newly designed introductory bioinformatics seminar could leverage the combination of in-class instruction and independent research to build the practical skill sets of undergraduate students beginning their careers in the life sciences. Participants were surveyed to assess learning perceptions toward the dual curriculum. Most students had a neutral or positive interest in these topics before the seminar and reported increased interest after the seminar. Students reported increased confidence in their bioinformatic proficiency and in their understanding of ethical principles for data/genomic science. By combining undergraduate research with directed bioinformatics skills, classroom seminars facilitated a connection between students' life sciences knowledge and emerging research tools in computational biology.


| INTRODUCTION
The data revolution in the life sciences has initiated a gradual paradigm shift from a descriptive science to a primarily quantitative discipline. To address this, academia has gradually developed programs in bioinformatics and quantitative methods for biology. 1 These programs began at the graduate level and have percolated into undergraduate curricula. 2 Efforts to increase the number of informatics/data science courses in undergraduate curricula often fail to emphasize general principles of data science or consider skills required to perform analysis of multi-format data. 3 To be sure, articulating clearly defined general principles is imperative for students to choose the correct tools for analysis, assess the necessary computing resources, manage and clean data, and apply ethically approved standards in data collection, analysis, and storage. 1 Imparting fundamental disciplinary skills and integrating a solid theoretical background with more specialized knowledge creates a solid foundation for students in the life sciences.
Learning through traditional introductory molecular biology courses has historically been passive 4 ; however, it has been shown that activity-based learning increases student interest in the natural sciences. 5,6 An active learning approach can instill a theoretical background of the significant concepts in molecular biology while providing skills that mirror knowledge creation within a particular domain (Table 1). Moreover, active learning of foundational biological concepts through inquiry-based labs with data-oriented exercises can facilitate peer-to-peer learning. Students learn from talking to their peers during group discussions of topical issues. Such conversations can increase student interest while achieving a broader understanding that builds on the instructor's explanation. 7 One active method is through dynamic programming exercises for students, where students can combine molecular biology concepts with basic computational skills. 8 As the size of data increases, it is critical that students have functional skills related to technology and computing to interact with and explore these data. 9 Understanding next-generation biological tools and the full extent of biological data has become a highly computational exercise. 10 Several different computing languages are common in biology; these include R, 11 Python, 12 and MATLAB. 13 In bioinformatics, the currently ascendant language is R. 14 Many students do not have computing expertise; thus, programming concepts must be integrated into the active learning curriculum at the same time as biological concepts are introduced. 15
Our study aimed to explore how a newly designed introductory bioinformatics seminar could leverage the combination of in-class instruction and independent research to build the practical research skill sets of undergraduate students beginning their research careers in the life sciences. The seminar design included inquiry-based labs for teaching data science together with foundational concepts in molecular biology. This new curriculum was introduced within the INBRE IV (Institutional Biomedical Research Excellence, NIH project 5P20GM103466-18) student undergraduate research program to understand core concepts while teaching independent research practices through an independent research project. The 10-week undergraduate seminar's core concepts were gleaned from data science projects that traversed data with varying characteristics. 16,17 We assessed the impact of an active learning format (inquiry-based labs, data-oriented exercises, peer discussion, group projects) that integrated data science and R markdown with molecular biology content. We hypothesized that this method of instruction would improve student understanding of core genomic concepts. Student perceptions about the domain knowledge they are learning impact their approach to the material and how they learn it. 18 We tested this by surveying the student participants at the end of the seminar to understand their knowledge, perceptions of their knowledge, and the relevance of this knowledge to their research. The goal is to enhance student research outcomes by offering multidisciplinary training in genomics to complement student independent research projects.
The following criteria were considered in seminar design: operating system, programming languages, technology requirements, and student demographics. Seminar activities incorporated data retrieval, data cleaning, and data processing. Online labs were developed to provide practical experience in the data science process through bioinformatics. Students identified relevant topics and designed scripts to execute the appropriate strategies, including an inquiry-based final project. R may impose a learning curve that is too steep for demonstrations in an introductory bioinformatics seminar. To overcome this, our team developed R modules that were made available using R markdown so that students could complete them with minimal knowledge of R programming. Seminar labs focused on several core concepts: working with data, making inferences, and prediction modeling. A final project spanned the entire seminar. The project began with a biological rationale and subsequently utilized MEGA, 19 a multifaceted bioinformatics software for phylogenetics, to generate data about the chosen hypothesis.

| Survey design
At the end of the seminar, students completed a survey that included a set of questions assessing their interest level and knowledge before and after the seminar, and questions on knowledge gained, level of confidence in subject ability, the relevance of the seminar's topics/techniques to their (planned) research, and seminar elements (e.g., group project). The survey was covered under the University of Hawai'i IRB Protocol #2020-00940. A retrospective pre/post survey was selected primarily to avoid response-shift bias. 20 Specifically, the survey addressed bioinformatic tools and ethical concepts in data science, including phylogenetics, NCBI databases, and genomic privacy. Each closed-ended question utilized a 5-point Likert scale (for the full survey, see Document 2 in Data S1).

| Survey analysis
Survey results were analyzed using a scoring system as follows: very disinterested/not at all useful/not confident/strongly disagree, −2; disinterested/somewhat useful/somewhat confident/disagree, −1; do not know/no opinion/neutral, 0; interested/moderately useful/confident/agree, +1; very interested/very useful/extremely confident/strongly agree, +2. A standard error was calculated for the percent of students responding with a particular answer (Tables 3-8). A Pearson correlation was used to examine the relationships between self-reported student confidence, interest, and utility. Effect size was estimated using a one-sample Wilcoxon signed-rank test. Student responses from before and after the seminar were compared as the percentage of students giving a particular response, and the variance was expressed as ± standard error.
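As an illustration of the scoring and standard-error steps described above, the following is a minimal Python sketch rather than the analysis code used in the study. The response labels follow the interest wording of the scale, the function names `mean_score` and `percent_with_se` are hypothetical, and the binomial formula sqrt(p(1 − p)/n) for the standard error of a percentage is an assumption, since the text does not state which formula was applied. The example responses are invented.

```python
import math

# Numeric scores for the 5-point scale described in the text (-2 .. +2).
# Only the "interest" wording is shown; other wordings map the same way.
SCORES = {
    "very disinterested": -2,
    "disinterested": -1,
    "no opinion": 0,
    "interested": 1,
    "very interested": 2,
}


def mean_score(responses):
    """Average numeric score for a list of Likert labels."""
    return sum(SCORES[r] for r in responses) / len(responses)


def percent_with_se(responses, label):
    """Percent of respondents giving `label` and its standard error.

    Assumes a binomial standard error, sqrt(p * (1 - p) / n),
    reported in percentage points.
    """
    n = len(responses)
    p = sum(1 for r in responses if r == label) / n
    se = math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * se


# Hypothetical responses from 12 students (not data from the study).
after = ["interested"] * 6 + ["very interested"] * 3 + ["no opinion"] * 3
print(mean_score(after))
print(percent_with_se(after, "interested"))
```

With only 12 respondents, the standard error of any single percentage is large, which is worth keeping in mind when reading the tables.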

| Student interest and confidence
The seminar covered a variety of topics spanning genetic data analysis, programming, and data science. Average student interest increased across all topics covered during instruction (Table 3, Figure 1). The most significant gain in student interest was in sequence alignment (Wilcoxon effect size, 0.89 ± 0.01), with four students self-reporting increased interest in the topic during the seminar (Table 3). Learning to use NCBI tools also showed an overall increase; this topic also had the highest proportion of interested students before the seminar (Wilcoxon effect size, 0.9 ± 0.03). The pattern regarding student confidence was less clear. Terms and concepts that are more common in biology, such as sequence alignment and NCBI tools, seemed to garner higher confidence among students than more specialized downstream analytical processes like multiple sequence alignment and phylogenetic analysis; however, there was no statistical difference between these two groups (Table 4).
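One common way to express a Wilcoxon signed-rank effect size is r = |Z| / sqrt(n), where Z comes from the normal approximation to the signed-rank statistic. The sketch below illustrates that convention for paired before/after scores; it is an assumption that the reported values follow a similar definition, and the helper names (`average_ranks`, `wilcoxon_effect_size`) and example scores are hypothetical. For simplicity, zero differences are dropped, n counts the remaining pairs, and no tie correction is applied to the variance.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1-based); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_effect_size(before, after):
    """Effect size r = |Z| / sqrt(n) from a signed-rank normal approximation."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)  # sum of positive ranks
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction
    z = (w_plus - mean_w) / sd_w
    return abs(z) / math.sqrt(n)


# Hypothetical scored responses (-2 .. +2) for six students on one topic.
before = [0, 0, 1, -1, 0, 1]
after = [1, 2, 1, 0, 1, 2]
print(round(wilcoxon_effect_size(before, after), 3))
```

Other conventions divide by the total number of pairs, including zero differences, so effect sizes from different software are not always directly comparable.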

| Perceived utility by students for research
The seminar took place against the backdrop of the SARS-CoV-2 pandemic, with restrictions on in-person learning, coupled with an online format that was new to the INBRE program. There was a statistically significant correlation between students' interest and perceived usefulness (R² = 0.57, p = 0.05), suggesting the importance of practical applications for lab exercises (Figure S1 and Table S1). This relationship was not significant for student confidence (R² = 0.80, p = 0.16; Figure S1, Table S1).
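For readers unfamiliar with the statistic, the sketch below shows how a Pearson correlation coefficient and the corresponding R² can be computed from per-topic average scores. It is illustrative only: the function name `pearson_r` is hypothetical, the values are invented, and the p-value calculation (which requires the t distribution) is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-topic average scores (not data from the study).
interest = [0.5, 0.8, 1.0, 1.2, 1.5]
usefulness = [0.4, 0.9, 0.8, 1.3, 1.4]

r = pearson_r(interest, usefulness)
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")
```

Because R² is the square of r, it does not show the direction of the relationship; the sign of r does.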

| Self-reported understanding of core concepts
Students self-reported a greater understanding of bioinformatics, and the majority indicated they would like to learn more about the discipline after taking the seminar (Table 6). Students also indicated that the frequent practice of exacting computational techniques was helpful to them in improving their understanding. Finally, most students said that both exercises using R markdown documents and group projects were helpful in their learning of bioinformatics (Table 7).

Strongly disagree (%) | Disagree (%) | Agree (%) | Strongly agree (%)
The R tools provided in the homework have enhanced my learning as an INBRE student.
The R homework assignments were easy to use. 0 | 42 | 58 | 0
The group projects were a useful addition to the course content. 17 | 17 | 67 | 0

| Student understanding of general data science concepts
To measure the impact of student discussions, students were asked to rate their level of agreement with statements related to the discussion topics. As a comparison, students were also asked to rate their level of agreement with statements on topics that were not discussed but belong to a similar discipline. After completing the seminar, most of the students agreed with factual statements about the science concepts explicitly addressed in the curriculum (marked with an asterisk in Table 8). For example, students reported a better understanding of privacy concerns in genomics after completing group discussions on this topic. Algorithmic bias, a subject not explicitly discussed, was not understood better (Table 8).

| DISCUSSION
The results from this small study are encouraging; they suggest that in-class instruction, active learning, and connections to students' (planned) research did build the intended skills and interest. While not every aspect of every lesson contained active learning, many student learning objectives (SLOs) were addressed with active learning components, an instructional style shown to increase student learning. 21 In addition, the seminar structure allowed students to explore varied disciplinary techniques and identify where they thought they had understanding, deficiencies, or mastery. The seminar's inclusion of critical cross-disciplinary skills such as computer science and data ethics helps students gain a more accurate perception of life sciences and their training as scientists.
FIGURE 1 Scatterplot of student interest before versus after the bioinformatics seminars. Average self-reported interest in different seminar topics for 12 student survey respondents was compared in a retrospective pre/post survey. Red line represents a 1:1 relationship, that is, no change from before to after. Responses were scored as follows: very disinterested/not at all useful, −2; disinterested/somewhat useful, −1; do not know/no opinion, 0; interested/moderately useful, +1; very interested/very useful, +2.

Well-designed courses incorporate a flexible, goal-oriented approach with a first step in course design focused on identifying desired results. 22 Once the end goals of teaching are clear, the designer can determine acceptable evidence and plan learning experiences and instruction. In practice, we measured our end goal of enhancing student research outcomes through multidisciplinary training in genomics by means of group projects and a final pre/post survey. When running a decentralized online seminar that is cross-disciplinary, the limitations concerning the completion of student activities can provide an opportunity to assess student performance through authentic activities such as these students' group projects. 23 In a research-focused program, such as INBRE, it is possible that students became more interested in topics as they were perceived to be relevant to their laboratory work. Topics with a higher increase in student interest were biology-centric (e.g., sequence alignment, homology, phylogenetics; Figure 1). This evolution in interest can be attributed to understanding new methods of how to ask/answer biological questions.
Most students self-reported an increased understanding across all topics in genomics covered in the post-course survey (Table 4). Students reported the greatest increase in agreement with statements covered in the seminar, including the privacy concerns impacting human genetics and whether bioinformatic skills could be applied to their INBRE research projects. Students were most engaged when the seminar combined theory and practice in biology, tying their research to the lessons in bioinformatics and the ethical implications of those technologies, which underscores a need to design computational biology courses around current biological concepts and the ethical implications of these concepts. 24 The essential combination of theory and practice could explain why student interest was correlated with perceived usefulness, while not associated with increased student confidence (Figures S2 and S3). Engaging in peer discussions was a central component in improving student understanding of ethical concepts; students reported a better understanding of privacy concerns in genomics after completing group discussions on this topic. Incorporating ethics into college curricula has been demonstrated to be essential to student engagement and an ability to connect learning in the classroom to a broader context. 25
While this seminar's format was largely successful, we did receive student feedback that could help future cohorts. First, because coding was new to many students, students noted that introducing R sacrificed valuable class time that could be used to explore a single tool in depth without requiring R skills. In bioinformatic analysis, selecting the correct tool is critical. 26 In future iterations of the seminar, we hope to place additional emphasis on selecting between tools and programs that perform similar tasks. Previous studies have found self-taught R skills to be inadequate; therefore, we plan to continue emphasizing introductory R skills. 27 Second, students asked for group size to be limited for the final project. Cross-disciplinary group work contributes to success in computational biology. 28 We hope to incorporate this feedback and continue group projects but experiment with smaller groups. Third, students came into the program with different expectations for conducting in silico research projects through INBRE. Some expected to augment their laboratory experience with such approaches, and others expected to work solely in a wet lab (Document 3 in Data S1).
In addition to student feedback, we observed that additional effort is needed to reach the target student demographic. Though we sought to reach freshman and sophomore students, 20 of 26 students were upperclassmen (junior and above). The seminar was advertised equally to all levels of students at participating institutions. It is possible that freshman/sophomore students are still exploring their career aspirations and less interested in skills-based preparation for biomedical research. These challenges can be managed by identifying the interests of first- and second-year college students and through diligent communication about the expectations of the INBRE program.

| CONCLUSION
Data science, which relies on modern tools, requires that theory be interwoven tightly with concrete skill development. In this study, students improved their understanding of biology and genomics through practical exercises and discussion questions. Students were instructed on using multiple tools during lectures for each biological concept to ascertain when a specific technical solution is needed. Future iterations of our seminar series will see some changes. The groups for our seminar were too large and varied in size (from 1 to 13 students). Groups were not assigned. Students with a similar interest formed a group, regardless of how many students were in each group. This system did not give each student adequate exposure to computer skills and software, which groups of 3-4 would have done. Though they were engaged in the discussion, students expressed little disagreement. Solutions for future iterations of the seminar include having the student groups formed based on which discussion questions they choose and providing instruction on having productive debates. A secondary benefit would be grouping students with differing levels of computer skills. Given the size of the seminar and the number of survey respondents (n = 12), additional iterations are needed to understand the complex interplay between the seminar's concepts (e.g., the ethics of technology by genetics interaction). Further iterations are essential to understanding any links between demographic factors and levels of interest, the correlation between different sub-subjects, and the relationship between areas of interest with computer skills. Having a better understanding of these complex interactions will allow for the continued improvement of future seminars and tailored attention to students' needs, resulting in better student retention and job preparedness.
Students in the program were part of the INBRE program 'housed' at the University of Hawai'i. The student cohort

TABLE 1 Selected modes of instruction versus active modes of instruction.

2.2 | Seminar description
Bioinformatics seminars at the University of Hawaii introduce core concepts early in the college curriculum, that is, in the first and second years of college. Our INBRE (IDeA Networks of Biomedical Research Excellence) program consisted of cross-disciplinary seminars designed to give undergraduate students across the University of Hawaii system and INBRE partner institutes (Chaminade and Hawaii Pacific University) an introduction to bioinformatics concepts and resources during their college education. Students derive skills in data science from empirical research. Activities include introducing students to biological databases, R, and phylogenetic analysis (see syllabus in Document 1 in Data S1). The mode of seminar instruction is entirely online. Students' comprehension and mastery of bioinformatics are assessed through online lessons, labs, quizzes, and inquiry-based group activities. The seminar's active learning components are designed around embedded interactions with bioinformatics databases during lecture time culminating in a final group project. Group size ranged from 1 to 13 students. Student learning objectives (SLOs) included the following: utilization of R to enter and edit expressions and scripts; read, subset, and reshape tabular data; find and install external R packages; make figures and tables from data; gain a fundamental understanding of bioinformatic data; and learn the general principles of designing a bioinformatic study (see also Document 1 in Data S1; examples of R-markdown can be found in Document 4 in Data S1).
TABLE 2 Student demographics. Overview of the demographics of 26 total student seminar participants.
TABLE 3 Question 1: Please rate your level of interest in the following subjects in biological sequence data before you took this seminar and now, after the seminar.
TABLE 4 Question 2: As a result of this seminar, what is your confidence level in your knowledge of the following subjects in bioinformatics.
TABLE 5 Question 3: How useful/not useful are the following regarding the INBRE research you are conducting/planning to conduct.
TABLE 6 Question 4: Please indicate your level of agreement with the following statements before you took this seminar and now, after the seminar.
TABLE 7 Question 5: In thinking about this seminar, indicate your level of agreement with the following statements.
TABLE 8 Question 6: Please indicate your level of agreement with the following statements before you took this seminar and now, after the seminar.
a These topics were explicitly addressed in the seminar.
BARTLETT ET AL.