Lessons and recommendations from three decades as an NSF REU site: A call for systems‐based assessment

Abstract For more than 30 years, the US National Science Foundation's Research Experiences for Undergraduates (REU) program has supported thousands of undergraduate researchers annually and provides many students with their first research experiences in field ecology or evolution. REUs embed students in scientific communities where they apprentice with experienced researchers, build networks with their peers, and come to understand research cultures and how to work within them. REUs are thought to provide formative experiences for developing researchers that differ from experiences in college classrooms, laboratories, or field trips. REU assessments have improved through time, but they remain largely ungrounded in educational theory. Thus, evaluation of the long‐term impacts of REUs remains limited, and best practices for using REUs to enhance student learning are repeatedly re‐invented. We describe how one sociocultural learning framework, cultural–historical activity theory (CHAT), could be used to guide data collection that characterizes the effects of REU programs on participants' learning in an educationally meaningful context. CHAT embodies a systems approach to assessment that accounts for social and cultural factors that influence learning. We illustrate how CHAT has guided assessment of the Harvard Forest Summer Research Program in Ecology (HF‐SRPE), one of the longest‐running REU sites in the United States. Characterizing HF‐SRPE using CHAT helped formalize thoughts and language for program evaluation, reflect on potential barriers to success, identify assessment priorities, and reveal important oversights in data collection.

NSF established the Research Experiences for Undergraduates (REU) program. Since then, REU has become one of the largest supporters of undergraduate research; $1.12 billion was invested in supporting thousands of undergraduates each year between 2002 and 2017 through both REU Site and REU Supplement awards (Figure 1).
All REU Site and REU Supplement awardees share the common goal of preparing undergraduate students for careers in science, technology, engineering, and mathematics (STEM) fields by providing research opportunities. Our focus here is on REU sites, which support cohorts of six or more students working with more than one senior researcher and that explicitly include educational programming beyond the field or laboratory research itself. Individual REU sites are defined uniquely by their intellectual themes (approximately 10% related to ecology or evolution) and communities of researchers. The design of educational experiences at each REU site depends on these themes and the values articulated by program leadership and individual scientists. Sites vary in their personnel, infrastructure, intellectual pursuits, and the student populations they serve. Sites also vary in how they evaluate their goals and assess their success.
At least through 2010, if individual REU sites evaluated and assessed themselves at all, they selected and managed their own assessment protocols. Individual site assessments generally were unique case studies (Dávila, Cesani, & Medina-Borja, 2013) derived from internally developed participant surveys (McDevitt, Patel, Rose, & Ellison, 2016; Seymour, Hunter, Laursen, & DeAntoni, 2004) administered only after the program ended. Qualitative data from these surveys elicited insights about student experiences, but the data were neither representative nor randomly sampled and were expensive to collect. More widely used quantitative surveys created less of a burden on programs, but they often consisted of conceptually ambiguous questions that rarely were validated and were incomparable among programs (Linn, Palmer, Baranger, Gerard, & Stone, 2015; McDevitt et al., 2016).
The flexibility afforded to REU sites by NSF encourages innovative pedagogical approaches but also increases heterogeneity among programs. In contrast, surveys such as the Undergraduate Research Student Self-Assessment (URSSA) were developed to assess programmatic goals prioritized by NSF. Both individual site-based surveys and cross-site surveys like URSSA serve their intended purposes, but both lack theoretical underpinnings, which makes it difficult to relate their findings to the broader literature on education or to understand similarities and differences among REU sites (Beninson, Koski, Villa, Faram, & O'Connor, 2011; Linn et al., 2015; Wilson et al., 2018).
Our own experience suggests that using atheoretic assessment tools makes it difficult to understand why an REU program is successful. We previously analyzed 10 years of before/after ("pre/post") surveys of student participants in the Harvard Forest Summer Research Program in Ecology, which has been supported continuously by NSF as an REU site since 1989 (McDevitt et al., 2016). The design of our short self-report survey was an intentional compromise between sample size and survey depth, and we asked questions about topics we as scientists thought were important rather than those that educators might have identified as central to learning science. The former included changes in students' attitudes toward science; identification with scientific norms and professional practices; specific skills associated with conducting and disseminating scientific research; and postprogram career and educational plans. We observed significant differences in learning gains correlated with students' prior experiences in classrooms, laboratories, or the field, but we were unable to attribute causes to these observations or to compare our results with similar observations at other sites (e.g., Scott et al., 2012).
These experiences led us to consider aligning our assessment tools with established educational frameworks and theories. Here, we present one such systems-based framework, cultural-historical activity theory (CHAT), which we think would be useful for assessing and evaluating REU sites both singly and together. We illustrate how we have begun to apply the CHAT framework to study and improve our own REU site at the Harvard Forest. We suggest that by framing questions as testable hypotheses, results of REU evaluations and assessments can be used to adaptively improve individual undergraduate research experiences and illuminate causes of successes (and failures) across REU sites in ecology, evolution, and other STEM fields.

| USING A SYSTEMS-BASED APPROACH TO STUDY REU SITES
Ecologists have long recognized the complexity of biological systems and have developed techniques and models ("systems thinking") to study the interconnected components that make up these systems (Patten & Fath, 2018; Patten & Odum, 1981; Trewavas, 2006).
Key features of ecological systems include hierarchical structure, interconnectedness between system components, and emergent properties. REU programs are similarly complex, and by extension, we suggest that systems thinking could be applied to understand and evaluate REU programs if relevant system components could be identified and adequately contextualized.
At REU sites, groups of students engage in research guided by an experienced researcher or laboratory group. REU goals usually extend beyond learning research skills and completing a research project. They also aim to promote the development of scientific identity and cultural capital. Students not only are mentored in research, but they also are connected to a community of peers who can help them navigate through their research and life experiences. In such collaborative learning experiences, paths to success differ among students, cohorts, and programs. Context is very important for understanding both why a program is successful and how to transfer successful practices across programs.
Many learning theories recognize social and cultural influences on learning. A common property among most sociocultural learning theories is that learning is culturally mediated: words, texts, social cues, and other symbolic objects fundamentally shape how an individual constructs knowledge (e.g., Vygotsky, 1980; Wertsch, 1993).
Each of these sociocultural learning theories provides a slightly different perspective on learning, and the context of a research question determines the selection of a theoretical framework (or competing frameworks). Among these, cultural-historical activity theory (CHAT; Roth & Lee, 2007) includes all three themes and flexibly accommodates most concepts proposed in the other sociocultural learning theories. Thus, we consider it an ideal platform for a well-structured assessment of REU programs.

| CULTURAL-HISTORICAL ACTIVITY THEORY
CHAT provides a broad blueprint describing the components that influence the social construction of knowledge (Cole & Engeström, 1993). It is an expansion of activity theory that allows researchers to study the completion of goals by individuals or collaborative groups while recognizing interacting cultural and historical influences acting on the system (Roth & Lee, 2007). Activity theory as a framework for learning builds from a core tenet of cultural psychology (Cole, 1998): The process of learning by an individual can be culturally mediated (Wertsch, 1993). Activity theory is distinguished from other sociocultural learning theories through its explicit identification of the tools an individual uses to learn, how other individuals mediate learning through cultural norms, and the examination of their interactions (i.e., an activity system). The cultural-historical aspect of CHAT extends analysis of an activity system to understand how the activity develops and changes over time and how it relates to other activity systems with which an individual interacts.

| Visualizing CHAT systems
Cultural-historical activity theory's activity systems are best visualized through what are known as "activity triangles" (Figure 2; Roth & Lee, 2007). CHAT requires the identification of seven distinct elements ("nodes") that take part in an activity within a system of interest and the examination of connections ("edges") between them (Cole & Engeström, 1993; Roth & Lee, 2007; Yamagata-Lynch, 2010).

FIGURE 1 Funding for Research Experiences for Undergraduates (REU) programs. Support for REU programs based on (a) yearly congressional allocations and (b) NSF directorate support for REU sites. Funding data (2002-2017) were compiled from yearly NSF congressional budget requests. Archives of REU awards (nsf.gov/awardsearch/) provided estimates for remaining years and directorate contributions.

To help our colleagues cut through the educational jargon associated with CHAT, we illustrate its elements in the context of a student writing a research proposal:

1. Subject: the individual or group of focus during the specified activity (e.g., the undergraduate student(s) writing the proposal);
2. Object: the goal or motive behind the specified activity (e.g., students should think critically about their project, connect with the primary literature, and establish feasible milestones for it);
3. Rules: the stated or unstated rules that govern how individuals act within the context of the specified activity (e.g., proposal guidelines, conventions of scientific writing, laboratory expectations, or culture as established by the research mentor);
4. Community: the social context in which the specified activity is conducted (e.g., the student, research mentor, members of a laboratory, and the broader group of student participants);
5. Division of labor: how tasks are shared among the community to accomplish the specified activity (e.g., the student is responsible for most of the writing, the mentor provides some direction and feedback, and other laboratory members are available to answer questions);
6. Mediating artifacts: the tools used in creating or completing the object (e.g., example project proposals, relevant journal articles, workshops, written feedback);
7. Outcome: the effect generated by the subject working in concordance with other components of the activity system to accomplish the object (e.g., formal evaluation of the written proposal, performance review based on expectations outlined in the proposal, gaining a skill).
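The seven-element structure above maps naturally onto a small data model. As an illustrative sketch only (the class and field names are ours, not part of CHAT), an activity system can be represented so that its nodes and their pairwise connections can be enumerated when planning an assessment:

```python
from dataclasses import dataclass
from itertools import combinations

# Illustrative model of a CHAT activity system. The seven fields follow
# the elements listed above; the example contents are hypothetical.
@dataclass
class ActivitySystem:
    subject: str
    object: str
    rules: list[str]
    community: list[str]
    division_of_labor: dict[str, str]
    mediating_artifacts: list[str]
    outcome: str

    def nodes(self) -> list[str]:
        # The seven "nodes" of the activity triangle.
        return ["subject", "object", "rules", "community",
                "division_of_labor", "mediating_artifacts", "outcome"]

    def edges(self) -> list[tuple[str, str]]:
        # Every pairwise connection between elements is a candidate
        # "edge" to examine (e.g., for secondary contradictions).
        return list(combinations(self.nodes(), 2))

proposal_writing = ActivitySystem(
    subject="REU student",
    object="write a feasible, literature-grounded research proposal",
    rules=["proposal guidelines", "conventions of scientific writing"],
    community=["student", "research mentor", "laboratory members"],
    division_of_labor={"student": "writing", "mentor": "feedback"},
    mediating_artifacts=["example proposals", "journal articles"],
    outcome="formal evaluation of the written proposal",
)

print(len(proposal_writing.edges()))  # 21 pairwise connections among 7 nodes
```

Enumerating all 21 node pairs up front is one way to make sure an evaluation plan has at least considered each connection in the triangle, rather than only the familiar subject-object edge.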

| Using CHAT to make sense of contradictory information in complex learning environments
REU programs are complex social learning environments, and CHAT provides the ability to make sense of contradictory information that arises within the system and through time (Cole & Engeström, 1993). These contradictions are classified into four types (Engeström, 1987): primary contradictions exist within an element (e.g., contradictory rules); secondary contradictions exist within interactions between two elements (e.g., division of labor is not aligned with mediating artifacts); tertiary contradictions are manifested during temporal transitions of an activity system (e.g., mentors refining or modifying their approach "on the fly" while the student is writing their research proposal); and quaternary contradictions exist between similar activity systems of which the subject is a member (e.g., REU experience compared to scientific coursework).
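For evaluators who want to tag observations systematically, the four-type taxonomy above can be encoded directly. This is an illustrative sketch (the labels follow Engeström, 1987, but the encoding and the example note are our own):

```python
from enum import Enum

class Contradiction(Enum):
    """Engeström's (1987) four contradiction types, as summarized above."""
    PRIMARY = "within a single element (e.g., contradictory rules)"
    SECONDARY = "between two elements of one activity system"
    TERTIARY = "during temporal transitions of an activity system"
    QUATERNARY = "between adjacent activity systems"

# A hypothetical evaluation note tagged with its contradiction type:
# a mismatch between two elements (division of labor vs. mediating
# artifacts) is classified as SECONDARY.
note = ("Mentor feedback duties conflict with the new writing workshop",
        Contradiction.SECONDARY)

print(note[1].name)  # SECONDARY
```

Tagging field notes this way keeps qualitative observations sortable by contradiction type when prioritizing which tensions to address first.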
Primary contradictions often result from differing value judgments that underlie the system (Engeström, 1987). These contradictions are fundamental to the system and form the foundation of higher orders of contradictions (Engeström, 1987;Foot & Groleau, 2011). After program values are established, components within an activity system should be aligned to aid the subject in accomplishing the object, measured by the outcome(s). For example, in developing a research proposal, the student (subject) should be supported in a way that helps them write a successful research proposal (object) that is measured by the expectations set by their research mentor or review panel (outcome). However, it is common that two or more of these components are not aligned.
Secondary contradictions help to illuminate these misalignments and may lead to subsequent changes within the activity system (Engeström, 1987). For example, an undergraduate student (subject) writing a research proposal (outcome) may not possess the necessary background knowledge to read a highly technical literature review on their topic (mediating artifact); the research mentor or other laboratory members (community) may not have enough time to adequately support the student by answering questions and providing feedback (division of labor); or expectations conveyed via a micromanagement approach (rules) may conflict with the student's ability to meaningfully connect with the literature or think independently about their project (object). These conflicts between system components may result in specific obstacles that are manifestations of fundamental tensions (primary contradictions) within the activity system (Foot & Groleau, 2011). Because conflicts and contradictions may arise from fundamental components of the system, it is better to address their source(s) rather than their symptoms. To resolve secondary contradictions by addressing underlying primary contradictions, some type of change must occur in the activity system. For example, before trying to develop new mediating artifacts to help a student read a highly technical literature review (secondary contradiction), it would be prudent to first evaluate whether there already are mediating artifacts in place that send conflicting messages (primary contradiction), which, once addressed, might resolve the secondary contradiction.

FIGURE 2 System components of the cultural-historical activity theory (CHAT) framework. The activity triangle highlights how components interact with others within the system (top), and the contradictions that can be examined through CHAT.
Tertiary contradictions are differences in the system that occur at temporal transitions (Engeström, 1987); program directors may be interested in examining them as they change various instructional activities or procedures. For example, an REU program may implement a new proposal-writing workshop (mediating artifact) that is intended to help students (subject) connect their proposals to the available scientific literature (outcome) and simultaneously shift some of the duties from the research mentor to the workshop facilitator and the student's peers (division of labor). As new procedures are implemented, a transition to more "advanced" practices may not be immediate (Engeström, 1987;Foot & Groleau, 2011). Examining barriers to change may reveal additional information about primary contradictions and potentially lead to smoother tertiary transitions.
Alternatively, the cause of these underlying contradictions may not reside solely within the activity system itself, but rather may be rooted in cultural expectations from adjacent activity systems (quaternary contradictions). Students (subjects) bring their past experiences with them to the activity system, and members of the community may not share those experiences. The rules established in adjacent activity systems may carry over for an individual and shape how that individual interacts with system components such as mediating artifacts or the community. For example, if a student (subject) has prior experience writing a research proposal (object) in another context (e.g., a different laboratory, discipline, or institution), their perceptions of the current writing experience may be influenced by rules, mediating artifacts, or division of labor from that other experience (adjacent activity system). In this case, success in writing their REU research proposal (outcome) depends on recognizing these quaternary contradictions and intervening accordingly; adjusting rules, adding mediating artifacts, or changing the division of labor can lead to a more productive writing process for the student (subject).

| APPLYING THE CHAT FRAMEWORK TO THE REU EXPERIENCE
To help REU programs connect program evaluations with the CHAT framework, we have developed some guiding questions related to activity system components and contradictions (Table 1). These questions are intended to elicit values and perspectives that might not be included in atheoretic surveys or other assessment tools.
After fully characterizing the activity system of interest, we prioritized data collection efforts based on our understanding of program values, the magnitude of impact contradictions could have on the activity system, and the plausibility of contradictions occurring. We then suggest a rubric (Table 2) for rating the quality of the evidence gathered in response to these questions. In our illustrative examples (Tables 3-5), we specifically reflect on data we have collected in the last five years aimed at examining the alignment between our program priorities and current assessment practices.

| Recruitment and hiring practices
The first stage of all REU programs, including HF-SRPE, is the recruitment and hiring of a diverse cohort of student participants. At these earliest stages of the program, primary contradictions exist in establishing the priorities for recruitment (object).

Object
Ob.3 - If you are studying this activity system over time, how has the object changed over time?
Ob.4 - What additional activity systems might have similar objects/goals?

Outcome
Out.0a - What values or perspectives are important to recognize when assessing the object? (e.g., student "success" may mean many things to different people)
Out.0b - How do you plan on assessing the object?
Out.1 - Why is this outcome measurement appropriate for the object? What are the documented validity and reliability arguments for this object?
Out.2 - What evidence do you have that the outcome measures are aligned with your object?
Out.3 - If you are studying this activity system over time, how have your outcome measures changed? (e.g., refinement or development of instruments/surveys)
Out.4 - If you are using an instrument that was developed under a different context/population, how might this impact the validity and reliability arguments?
Community
C.0 - Who interacts with the subject to accomplish the object?
C.1 -Does the subject acknowledge or recognize the rest of the members of the community?
C.2 -Do both the subject and community know who is involved in the activity system?
C.3 -If you are studying this activity system over time, how has the community changed?
C.4 - How does this community compare to similar activity systems? What is the impact these differences have on the outcome?

Mediating artifacts
MA.2d - Are any rules in conflict with mediating artifacts (or with how the community divides the labor with these artifacts)?
MA.3 - How did the addition, removal, or refinement of mediating artifacts impact the activity system?
MA.4 - How do these mediating artifacts compare to similar activity systems? What is the impact these differences have on the outcome?
Division of labor
DL.0 - What are the expectations of the community to be involved in helping achieve the object?
DL.1 - What evidence can be provided to demonstrate that community members are achieving these expectations?
DL.2a - Is the division of labor appropriate to meaningfully support individual mediating artifacts?
DL.2b - Is the overall division of labor in the activity system appropriate to meaningfully support the object?
DL.3 - If you are studying this activity system over time, how has the division of labor changed over time?
DL.4 - How does the division of labor differ from similar activity systems? What is the impact these differences have on the outcome?

TABLE 1 (Continued)
TABLE 2 Recommendations for rating the quality of evidence in CHAT systems. This ordinal rating system is intended to evaluate the quality of evidence for responses to the guiding questions in Table 1.

Measurement
S.3 - Data were collected using a random sample of the population (or sampling frame) of interest, or data exist for the entire population.

TABLE 3 Example responses to the CHAT questionnaire (Table 1) for assessing participant selection into the Harvard Forest Summer Research Program in Ecology. For each row, we record the question (Table 1), our response, the relative priority to evaluate, and the quality of evidence (Table 2).

S.0a - HF-SRPE is a paid summer research program for undergraduate participants located at the Harvard Forest in Petersham, Massachusetts, USA. Over the past 30 years, HF-SRPE has been primarily supported through the NSF REU program; however, additional participants have been supported through various grants and programs. There are numerous stakeholders in the hiring process who, under a more extensive evaluation, would each receive their own activity system: applicants, referees for applicants, research mentors, and program administrators.

S.4 - An activity system can/should be created for each stakeholder in this process. Individual activity systems naturally group around the collective action of hiring for an individual position, followed by projects supported by multiple REU positions, and finally, HF-SRPE. However, applicants and mentors may also perceive aspects of this activity system differently based on their experiences or knowledge of similar systems.

Based on the nature of some projects, there may be a larger community involved during the hiring process (e.g., more applicants, larger research groups). Applicants may also receive support from their home institutions (faculty mentors, advisors, peers) in constructing application materials. We also have anecdotal evidence that our community is similar in structure to that of other REU programs.
However, we do expect that the size of our applicant pool (~800 undergraduates annually), number of PIs, and program staff are greater than at most REU sites.

DL.0 - Application materials: HF-SRPE staff create and disperse most of these materials; mentors may also disperse materials through their own networks; some applicants may need to seek out these materials, while others may be introduced to them through their network of peers, faculty, or career services programs.
Submission tool (front end): Applicants submit their own materials, but they may be aided in this process by peers, faculty, or career services programs. Applicants also indicate their top interests in projects.
Submission tool (back end): The HF-SRPE program director reviews all applicants, prioritized by established funding sources. This provides a second set of eyes on hundreds of applications and reduces some of the burden on mentors hiring for individual projects.
Hiring documents: Program mentors have some autonomy in how they write project descriptions, set expectations for desired experience, and conduct interviews. HF-SRPE program staff provide resources and accountability to help ensure that mentor decisions are aligned with the program's goals and values.

S.0a - The subjects are individual participants in the HF-SRPE program who come to Harvard Forest from a range of undergraduate institutions. Because research mentors hire participants for specific research projects (e.g., plant ecology, soil microbiology, biogeochemistry, paleoecology, programming, or data science), participants bring with them a variety of educational experiences. Additionally, some participants may be specifically selected based on their skillsets (or lack thereof), given the structure of project goals during the 11-week program.
For example, some projects may be structured in a way that allows participants to learn and explore with limited scientific skills or knowledge, whereas other projects may require participants to have a specific set of skills or background knowledge to generate a specific research product within the 11-week time period.

R.0 - There are many cultural norms and conventions associated with collecting, visualizing, analyzing, or communicating data, and these norms may also change within subdisciplines. We teach participants R, and there is a certain amount of fluency necessary to interact with this coding language.

We try to strike a balance between selecting students who appear to be best qualified (i.e., most experienced) to do research and those who have the most to gain from the experience. These contradictions arise in part from cultural biases of academic research, where success is measured through productivity (theses, posters, peer-reviewed papers); the "best" students are those with proven "track records" of productivity. As mentors and educators, we also want to work with students who are willing to push beyond their comfort zone and maximize the impact of a research experience. At HF-SRPE, this primary contradiction is further complicated by the different stakeholders involved in the hiring process. Individual research mentors advocate for their projects; funders push for students from certain institutions, demographics, academic majors, or skillsets; and program directors seek a lasting and cohesive identity for the program.
At HF-SRPE, we have sought to balance the quaternary contradictions between activity systems of multiple stakeholders (including the program directors, program manager, mentors, external collaborators, and funders) by building research teams (mediating artifact).
Research teams consist of multiple mentors and multiple students who work together to address scientific questions through complementary collaborations. Stakeholders meet to discuss the formation of research teams prior to creating a position (mediating artifact).

C.4 - We imagine this community structure may be like that of other coordinated research programs but will likely differ considerably from independent research and coursework. [Relative priority: Medium; Quality of evidence: QL.0, QT.0, A.0, S.0]

R.0 - Our goal is to treat students as employees and colleagues rather than "undergraduate students." The aim is to model professional behavior and provide support to students (for some of whom this is their first job) so that they eventually feel a sense of autonomy and accountability for their actions.

MA.1 - When designing the programming, we consider how professional development activities support students with respect to the mission of the program and students' long-term goals. These mediating artifacts are common to other research programs and educational settings. We find that even if a student has completed a similar activity before, the repetition is useful, as they may gain a different perspective the second (or third or more) time around.

The research mentor and student are asked to outline expectations for the project proposal at the beginning of the summer. Division of labor is somewhat variable among research mentors, which is why HF-SRPE provides formal programming to help ensure consistent exposure to professional development resources.
Sometimes, a student will maintain a research relationship with their mentor after the end of the summer (often resulting in a research product such as a poster, undergraduate thesis, or manuscript). For other professional development activities, HF-SRPE strives to provide resources for students that their research mentors may not otherwise have the time or expertise to provide. [Relative priority: Medium; Quality of evidence: QL.1, QT.0, A.1, S.1]

DL.1 - Although we have a proposal that outlines expectations for each student's project, we do not revisit these documents to evaluate whether the agreed-upon division of labor was met. Our reluctance to analyze these documents is due to how they are formatted (some projects require a lot of structure while others are more trial-and-error) and the fact that research goals may change rapidly throughout the summer.

...project requirements (rules) by hiring students with skillsets and the potential to gain additional value from the experience. This two-step applicant review process, although time consuming, provides additional oversight that helps guard against implicit biases that might cause us to overlook applicants who can both contribute to research outcomes and benefit from the research experience.
Another barrier to recruiting and hiring a diverse population of students arises from secondary contradictions between potential applicants (subjects), application materials (mediating artifacts), and the norms surrounding finding an internship (rules). Reviewing recruitment and application materials through a multicultural lens is a continual process and has been our primary tool for limiting these secondary contradictions. However, an unanticipated recruitment strategy for HF-SRPE has been to take advantage of our students' positive research experiences. They tell others about the program at their home institutions, at conferences and meetings, and through social media, and they forward emails and promotional material (mediating artifacts). In some cases, they have returned to HF-SRPE as mentors.
Characterizing these various components and assessing whether recruitment and hiring goals are being met is especially difficult when nearly 1,000 applications are reviewed in less than four weeks.
Applying CHAT to HF-SRPE's recruitment and hiring practices highlights assessment priorities that extend well beyond the data we currently collect.

F I G U R E 3 Panels above the diagonal are mosaic plots (Hartigan & Kleiner, 1984) that illustrate the observed frequencies of the y variable conditional on the x variable. For example, the plot of Gender (y) versus TUG (x) illustrates the frequencies of female, male, or other-gendered individuals conditional on whether each individual is from a group traditionally underrepresented in science. The area of each tile is proportional to the corresponding cell entry given any previous conditioning. Continuing with the Gender versus TUG example, we first conditioned on TUG (the x variable); there have been more non-TUGs than TUGs in the Harvard Forest Summer Research Program in Ecology, so the width of the "no" group is much larger than that of the "yes" group. We then split Gender conditional on TUG; there are many more females than males, and few nonbinary individuals. The shading (red to grey to blue) is proportional to the residual from a χ2 contingency table (i.e., the difference of observed from expected values); the overall p value for the χ2 test is given below the vertical residual scale bar. In the Gender versus TUG example, the residuals are small, and there is no significant relationship in our hiring of students of different genders given their ethnicity (p = .92). Panels below the diagonal are association plots (Cohen, 1980). As with the mosaic plots, the association plots illustrate differences from expectation of the y variable conditioned on the x variable. Rather than illustrating the observed frequencies, the association plot illustrates the standardized deviations of observed frequencies from the expected frequencies. The direction of each rectangle from the dotted (zero) line indicates the sign of the residual; its height is proportional to the magnitude of the residual; its width is proportional to the square root of the expected counts; and its area is proportional to the difference between the observed and expected frequencies. Colors match those of the mosaic plots. Plot constructed with the pairs() function within the vcd library in R (Meyer, Zeileis, & Hornik, 2006).

Currently, the most consistent information we and most other REU sites collect about the hiring process are demographic data and quantifiable metrics such as gender, ethnicity, grade-point average (GPA), class rank, and type of institution (Figure 3a). These data are relatively easy to gather from applicants, can influence decisions about whom to interview or hire, and are straightforward to track through time or compare among multiple REU sites. The data illustrate that our applicants are predominantly white, female, and from a mix of institutional backgrounds (Figure 3a). Conditional models suggest little difference from expectation, except that applicants who are the first in their family to attend college tend to be from ethnic groups broadly underrepresented in science and to attend either community colleges or comprehensive universities (Figure 3a).
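The red-grey-blue shading described above is driven by the standardized residuals of a χ2 contingency test. As a minimal cross-language sketch (illustrative counts only, not HF-SRPE application data), these residuals can be computed as follows:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table (NOT HF-SRPE data):
# rows = TUG status (no, yes); columns = gender (female, male, nonbinary).
observed = np.array([[120, 45, 3],
                     [30, 12, 1]])

# chi2_contingency returns the test statistic, p value, degrees of
# freedom, and the table of expected counts under independence.
chi2, p, dof, expected = chi2_contingency(observed)

# Pearson residuals: signed, standardized deviations of observed from
# expected counts -- these determine the direction (sign) and intensity
# of the shading in a mosaic or association plot.
residuals = (observed - expected) / np.sqrt(expected)
```

The same quantities underlie the vcd::pairs() panels produced in R; plotting them per cell reproduces the association-plot rectangles described in the caption.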
However, these data align with only a few of the priorities highlighted by CHAT and are insufficient for accurate evaluation and assessment. Although we have carefully thought about and tried to address these priorities, this reflective exercise reveals that we should integrate additional information into our formative and summative program evaluations: evaluate the effectiveness of recruitment materials by analyzing their messages through a multicultural perspective (Dumas-Hines, Cochran, & Williams, 2001; Pippert, Essenburg, & Matchett, 2013); consider how our application requirements and selection criteria may be biased against the student populations we wish to serve (Ployhart & Holtz, 2008); and review how tools, procedures, and policies affect the division of labor among various stakeholders (students, mentors, program administrators, and program leadership).

| Understanding variation in learning gains
The scientific theme for the most recent five-year (2015-2019) REU Site award for the HF-SRPE was the collection, visualization, analysis, and communication of ecological "Big Data." Like other REU sites in biology, we have used URSSA (Hunter et al., 2009) to provide self-assessment of gains in learning through questions about broad items related to thinking and working like a scientist.
URSSA includes questions that address students' attitudes, feelings, and motivation related to analyzing data for patterns, problem-solving, and identifying limitations. Superficially, these questions may seem to assess the learning gains of interest, but the developers of URSSA defined its scope only as a broad indicator of progress (Weston & Laursen, 2015). The questions are not aligned with our specific program goals (i.e., they have "poor criterion validity") and are unable to provide meaningful measurements for any of our "Big Data" learning outcomes. Additionally, the limited student and programmatic context collected alongside URSSA (e.g., demographics) rarely accounts for much of the variation in URSSA's measured gains (Figure 4). These limitations have constrained our ability to improve the HF-SRPE or to assess whether we are helping students achieve defined goals.
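The observation that demographic context explains little of the variation in measured gains can be made concrete with a simple variance-partitioning model. A hedged Python sketch follows, using simulated values (not actual URSSA responses) in which demographics are generated independently of gains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for survey responses (NOT actual survey data):
# self-reported gains on a 1-5 scale for 200 students, plus two
# hypothetical binary demographic indicators.
n = 200
gender = rng.integers(0, 2, n)       # hypothetical 0/1 indicator
first_gen = rng.integers(0, 2, n)    # hypothetical first-generation flag
gains = np.clip(rng.normal(4.0, 0.6, n), 1, 5)

# Ordinary least squares: gains ~ intercept + gender + first_gen
X = np.column_stack([np.ones(n), gender, first_gen])
beta, *_ = np.linalg.lstsq(X, gains, rcond=None)
fitted = X @ beta

# R^2: the share of variance in gains explained by demographics;
# values near zero mirror the pattern of weak demographic explanatory
# power described above.
ss_res = np.sum((gains - fitted) ** 2)
ss_tot = np.sum((gains - gains.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

A near-zero R² in such a model is what motivates collecting richer contextual variables than demographics alone.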
We describe components and contradictions within a CHAT activity system (Table 4) to identify more useful data for addressing the learning objectives of the HF-SRPE. Although it would be ideal to align and characterize all seven components of the activity system with respect to learning gains, we set priorities for assessment: characterizing the skills and knowledge a student brings with them to the research experience (subject); the resources used by the student during their research experience (mediating artifacts such as R workshops or project proposals); the level of support they received (division of labor); and what success (object) means given a student's prior research experience. These priorities align with the idea that the tools individuals use to construct knowledge are culturally mediated (Vygotsky, 1980; Wertsch, 1993). For example, although two students may participate in the same R workshop (mediating artifact), their prior experiences and the workshop's relevance to their project may fundamentally shape how they interact with the activity (secondary contradictions between the subject, rules, and mediating artifact). The variation in student projects also means that students may be interacting with different resources or using them to different extents (quaternary contradictions in mediating artifacts and division of labor). We recognize that the program's "Big Data" learning goals may not be a priority for all students (i.e., secondary contradictions between the subject, object, and outcome) and that "success" may mean different things to different students. We therefore used CHAT (Table 5) to prioritize the collection of the following data: the skills and knowledge a student brings with them to the research program (subject); the professional development opportunities available to them during their research experience (mediating artifacts); the interactions students have with other members of the research community (community); and the goals of the research experience (object).
As with our applicants, much of what we know about program alumnae(i) consists of basic demographic information.
Additionally, for alumnae(i), we have information on research products and responses to annual surveys. Reporting program impacts to our funders has focused primarily on the annual surveys, but these occur after students have participated in the HF-SRPE and collect data only on educational level or attainment and employment status.
We therefore know little about why our students do or do not persist in STEM disciplines or careers.
Other research provides evidence that students participating in structured undergraduate research programs obtain advanced degrees and generate research products at a higher rate than matched cohorts of students, but understanding of how or why this occurs is limited (Wilson et al., 2018). Given the priorities identified by CHAT, we would want to collect data that help explore hypotheses related to: the procedures and cultural expectations (rules) that determine who is selected to participate in HF-SRPE or other REU sites (biasing for characteristics that may be independent of basic demographic descriptors such as gender, ethnicity, home institution, or GPA); the specific mediating artifacts that help students achieve their career goals (recognizing that there are likely multiple equivalent paths to long-term success); and how students (subjects) and other members of their community may view and support success (outcome). Collecting rich data to explore these mechanisms would likely require ethnographic interviews (e.g., Carlone & Johnson, 2007; Hernandez & Morales, 1999). Although this type of study would certainly prove useful as formative program evaluation, the time and resources needed would make it impractical to conduct at the same scale as our annual surveys.

F I G U R E 5 Career outcomes ("pipeline") of participants in the Harvard Forest Summer Research Program in Ecology (HF-SRPE). Annual alumni surveys were sent to alumnae(i) (cohorts from 2001 onward) between 2012 and 2016. Averages of yearly snapshots reveal that most alumnae(i) have pursued or received environment- or ecology-related graduate degrees and continue to use these disciplines during their careers. Further information is required to determine the impact of HF-SRPE on these outcomes. The CHAT activity triangles (bottom) illustrate how components could be assessed with current frameworks (bottom right) or within a full CHAT framework (bottom center, bottom left)

| CONCLUSIONS
We have provided three examples that demonstrate the flexibility of CHAT for framing the study and assessment of different aspects of REU programs: recruitment and hiring practices, student learning gains, and impacts on participant persistence in STEM. CHAT provided an opportunity to reflect upon the complex educational system that is an REU site in a way that allowed us to connect with existing sociocultural frameworks. Examining HF-SRPE's hiring practices required us to consider the activity systems of all individuals contributing to the process and the quaternary contradictions between similar activity systems. This approach differed slightly from the one we used when examining learning gains related to "Big Data," where the emphasis was directed more toward the students (subject) and the application of CHAT focused on secondary contradictions that might hinder students from achieving the learning goal. Finally, when applying CHAT to the impact HF-SRPE may have on participants' persistence in STEM, we considered the different opportunities students may have had during their REU experience (quaternary contradictions) and acknowledged that we have limited information about the activity systems of other experiences that might also shape a participant's persistence in STEM.
Based on our positive experiences, we advocate for the integration of sociocultural frameworks such as CHAT into assessment and evaluation. This systems approach has proven useful for studying other complex educational phenomena by helping derive meaning from seemingly contradictory information (Daniels, Edwards, Engeström, Gallagher, & Ludvigsen, 2013; van Oers, Wardekker, Elbers, & van der Veer, 2008; Talbot et al., 2016). As with most scientific inquiry, the research questions ultimately should drive the types of data that are collected. However, we believe that CHAT is broad enough to guide the summative and formative evaluations of most aspects of REU programs. Meaningfully engaging with this framework requires both a clear understanding of programmatic goals and familiarity with the theory and literature of education research. We have found that spending time characterizing activity systems has helped us formalize our thinking and evaluate the alignment of our programmatic priorities with our assessment tools.
Characterizing the components of any activity system and examining its contradictions can help identify barriers to success within it (Engeström, 1987, 2001). REUs are complex activity systems, and characterizing them and connecting them to established theoretical frameworks should make it easier to transfer novel ideas and best practices across the larger REU community. Applying these principles to the HF-SRPE has revealed that we have concentrated our data collection efforts on subject-object-outcome while ignoring mediating artifacts, communities, division of labor, and rules. This is limiting because the REU experience is a sociocultural one that takes place within nested or articulating communities, and those communities are socially, culturally, and historically influenced. As evaluative research continues to develop within the REU community, we see systems-based theoretical frameworks as useful guides for assessing REU programs. REU programs provide an opportunity for students to work and learn with experienced researchers and to develop a community with their peers. The social and cultural experiences of REUs are their greatest strength, but REUs can fail students when the social-cultural-historical underpinnings of the program are not given their due. REU students must navigate sociocultural contexts, which in turn should influence how REU sites are designed and implemented. Sociocultural frameworks such as CHAT provide a systems-based perspective that helps characterize and identify important components and interactions within this complex learning environment.
ACKNOWLEDGMENTS
We thank Associate Editor Meghan Duffy and two anonymous reviewers for constructive comments that have greatly improved the manuscript.

CONFLICT OF INTEREST
Aaron Ellison and Manisha Patel are the founding principals of Sound Solutions for Sustainable Science LLC, which provides project management for scientific education and research programs. We do not advocate for any particular theoretical framework in managing, evaluating, or assessing research and education programs.

AUTHOR CONTRIBUTIONS
Andrew McDevitt collected data; analyzed data; and contributed to writing the manuscript. Manisha Patel is an HF-SRPE program manager; collected data; and contributed to writing the manuscript.
Aaron Ellison is an HF-SRPE PI/PD; obtained funding; developed conceptual framing; analyzed data; and contributed to writing the manuscript.

DATA AVAILABILITY STATEMENT
Data reported in this paper are available from the Harvard Forest Data Archive (http://harvardforest.fas.harvard.edu/data-archive/),