C‐programming self‐assessment exercises versus final exams: 12 years of experience

The new curricula derived from the Bologna process encourage continuous evaluation during the teaching period. This places an additional workload on already overburdened teachers during the course, so tools that automatically evaluate this work become necessary. In computer science, the ideal evaluation technique is the use of automatic evaluators of code written in languages such as C, Java, or C++. The main objective of this work is to analyze whether the use of C‐coding self‐assessment exercises correlates with an improvement in exam performance. For this purpose, data on the self‐assessment exercises run on the AulaWeb platform during the last 12 academic years (2010/2011–2021/2022) in a Programming Fundamentals course (Fundamentos de Programación) at an engineering faculty of the Universidad Politécnica de Madrid (UPM) were collected. The main tasks carried out in this research were: (i) recovery and analysis of the information collected from the AulaWeb platform and (ii) the study, through correlations, of the influence of the self‐assessment exercises on the final grades obtained by the students of the course. The most important findings are: (a) self‐assessment exercises build experience and confidence in facing problems (which translates into a higher probability of passing the course) and (b) self‐assessment exercises influence the final exams taken at the end of the semester, mainly due to the short time between the end of classes and the exam. Self‐assessment exercises provide valuable information for teachers to monitor the progress of large groups of students during the semester and help students pass their exams.


| INTRODUCTION
The methodologies proposed in the Bologna process 1 for university studies encourage the implementation of activities that promote continuous student evaluation throughout the academic period. Continuous evaluation, unlike the traditional approach, requires teachers not only to assess their students at the end of the learning process (typically using exams) but also to propose periodic evaluation activities that progressively facilitate the assimilation and development of the subject matter [10].
These actions can be very inefficient and difficult to implement in courses with a large number of enrolled students relative to the number of assigned teachers. According to Eurostat statistics, 2 there were 17.5 million tertiary education students in the EU-27 in 2018, and the average number of students per academic staff member, for ISCED 3 levels 5 through 8, was 15.3 (Greece had the maximum ratio, 38.7, and Luxembourg the minimum, 4.4). In view of these data, the activities that promote continuous evaluation, that is, setting evaluation tests and, especially, correcting them, can be an unaffordable task for the academic staff. Moreover, the delivery of correction results can be excessively delayed, resulting in a pedagogically ineffective process given the nature and content of some courses and the relatively short academic periods.
To address these problems, some strategies have been developed to involve students in their education through formal processes in which they can compare their learning with established criteria [22]. These strategies are called self-assessment. To make self-assessment effective, multiple automatic tools have been developed to make the process (from publishing an exercise to obtaining results) as fast as possible. Some authors propose web-based systems that allow the automatic and immediate evaluation of students before, during, or even after the course [3,4,13,17]. Most of these tools are used to correct a set of questions (single-choice, multiple-choice, or numerical-response test-type questions [18,26]). A positive aspect of these question templates is that teachers can save a lot of time and the evaluation is almost instantaneous. On the other hand, these types of questions may by their nature be suitable for many academic fields, but in others, such as computer programming, they may not be the best way to proceed [25]. In these disciplines, the best practice is for the student to give a complete program or a fragment of programming code as the solution to the problem. These coding exercises are beneficial for students for three main reasons: (i) they help to set up a real-world problem and to structure the student's thinking; (ii) they encourage students to become fluent in programming; and (iii) because they are similar to exercises that might be encountered in the exams, students can find a successful solution more easily by association of ideas. The first and second points give students very useful skills, not only in academia but also in the real world, where, from the companies' point of view, both features combined (fluent thinking and programming) are valued. The last point is relevant only at university, insofar as it enables students to pass the subject.
These reasons lead the authors to check how effective the use of self-assessment exercises is for the academic evolution of students.
Several studies have investigated the use of self-assessment exercises in programming courses. Chung and Hsiao [8], through self-assessment exercises, show how consistent students' effort remains during the course (a student's motivation for a subject can decline over time) and how this is related to academic performance. Along the same lines, Papamitsiou et al. [24] study how student motivation changed, as reflected in their self-assessment exercises, during the Covid-19 crisis. On the other hand, Cedazo et al. [5] measure the effectiveness of using self-assessment exercises in practical exams over four academic years.
This work checks whether the use of C-code self-assessment exercises, over twelve academic years, is directly correlated with improved exam performance.
The remainder of this article is organized as follows. First, some related work in the area is presented. Second, some previous concepts that might be useful to understand this study are described: (i) the subject under study; (ii) the platform used to carry out the self-assessment activities and to extract the data; (iii) a description of the self-assessment module; and (iv) the final exams. Third, a preliminary analysis of the data is presented and the correlation metric used is described. Fourth, the results achieved are presented and discussed. Finally, the main conclusions of the study and the outline of future work are drawn.

| RELATED WORK
The first tool used to assess programming exercises in an automatic way appeared in 1960, according to [2,14]. This system, developed by Hollingsworth [19], evaluated programming exercises written in machine code on punched cards. Since then, many other platforms have appeared, especially in recent years, due to the growth of online distance education (Massive Open Online Courses [MOOCs]) [21,27]. These MOOCs are interactive online courses aimed at an unlimited number of students. This massive attendance is both a strength and a weakness of such courses, because the number of students may exceed the number of professors, making it impossible for the latter to grade the former in a reasonable amount of time [12]. Although the problem of the student/professor ratio is more pressing in online distance education, face-to-face teaching also has to deal with it. For instance, according to the Spanish Ministry of Universities, there were over 1.4 million students enrolled in the 2019/2020 academic year and more than 125,000 professors, combining public and private universities, in 2018/2019. 4 The increase in the number of students is due to three main reasons: (i) families have more financial resources; (ii) the number of scholarships has increased (according to the Spanish Ministry of Universities, 41.9% of first-year students obtained a scholarship in 2018/2019); and (iii) technical degrees usually offer better employment and salary opportunities. 5 On the other hand, universities have not been able to enlarge their workforce in recent years due to economic crises. For this reason, educational systems that allow the automatic creation and evaluation of programming exercises are nowadays a key factor [17]. There are different approaches to generating and evaluating programming self-assessment exercises, some of which are summarized in Table 1.
The second column indicates the approach followed in each study and the third column explains the methodology applied to evaluate the proposed system. As can be seen, most of the studies reviewed used tools in which students upload their solutions and the output of their algorithms is compared with the output generated by the teacher's code. Most of these studies evaluate their proposals by conducting satisfaction surveys with their students; only [7,14] conduct correlation studies between performance in the exercises and the grade obtained in the exams.
This study aims: (i) to prove whether or not there is a correlation between the self-assessment exercises proposed during the Programming Fundamentals course (basic C programming; henceforth, FP) and passing the course; (ii) to relate the number of exercises done by the students to the drop-out rate; and (iii) to establish when, during the semester, students do their exercises, over the last 12 academic years.

| BACKGROUND KNOWLEDGE
We describe in this section: the course under study, the teaching tool used for student learning and self-assessment, the self-assessment module, and the format of the final exams.

| Programming fundamentals
The course on which we base our study is FP (Fundamentos de Programación or Programming Fundamentals). This course is taught in the second semester of the first year of Grado en Ingeniería en Tecnologías Industriales (GITI) at Escuela Técnica Superior de Ingenieros Industriales (ETSII) of Universidad Politécnica de Madrid (UPM), with an average of 655 students enrolled per year (counting both repeaters and first-year students) and 6 ECTS (European Credit Transfer and Accumulation System 8 ). The university entry-level requirement for GITI was 12.266 out of 14 for the academic year 2021/2022. 9 The syllabus for this subject is shown in Table 2.

| AulaWeb
The AulaWeb 10 platform has been developed by the UPM Computer Science Laboratory 11 and has been used by students and professors of this university since 1999 12 [16]. This software has been used mainly as a support tool for teaching subjects in this department. Among other things, the environment allows various academic activities to be carried out (publication of training resources, publication and submission of assignments, configuration and performance of self-assessment exercises, the establishment of virtual tutorials, etc.) through the use of a Web browser.

4 https://www.universidades.gob.es/stfls/universidades/Estadisticas/ficheros/Datos_y_Cifras_2020-21.pdf
5 https://www.puckermob.com/money/college-majors-highest-paying-lowest-unemployment-rates-and-more/

TABLE 1 Some state-of-the-art studies.

[5] System: An online C compiler where students create their code and test cases. These test cases allow students to check that their algorithm is correct by comparing its output with the solution. When the students have checked that everything is fine, they can mark the exercises as submitted, and the teachers can then access the answers.
Methodology: The tool was evaluated using a mandatory questionnaire completed by the 301 students who attended the practicals of the course that year (2013/2014). The questionnaire consisted of 10 questions rated on a scale of 0 to 10.

[9] System: DSLabs evaluates whether the final state of the code uploaded by the students is the same as the teacher's, in which case it is marked as correct; otherwise, it is evaluated as incorrect.
Methodology: The tool was used in an online course that lasted seven weeks, with 110 participants. The activities could be carried out individually or in pairs. Evaluation was via a questionnaire on the students' perception of the tool.

[23] System: Use of a gamification tool (Classcraft a) to engage students.
Methodology: The course under study was taught in an upper secondary school and involved 30 students randomly divided into two groups: a control group and an experimental group. The evaluation was carried out through a mixed system of surveys and questionnaires given to the students.

[14] System: A custom framework that automatically reviews programming assignments. It receives two inputs: the correct solution by the professor and the attempt made by the student, and applies combinatorial testing techniques to find out whether or not the student's and professor's outputs coincide.
Methodology: The study covers over 2000 students and eight academic years (2011-2018). There were four voluntary practicals (three corrected automatically and the last manually) of incremental difficulty. One of its main conclusions is that there is a correlation between completing the tasks and obtaining a good score on the exams.

[7] System: The ProgEdu tool [6] was used to extract the data. In this software, the students run their code until they meet the proposed requirements. Each submission is evaluated in three aspects: (i) compilation: the code is free of errors; (ii) functional: the code must comply with the requirements; and (iii) quality: the code must satisfy the coding conventions.
Methodology: The study was carried out on 69 students of a programming subject during the 2018/2019 academic year and correlated the effort and time spent on completing the tasks with the final grade obtained in the subject, characterizing the relationship between study behavior and achievement.

[11] System: A library called CAC++ to generate programming tasks. The library receives the solutions implemented by the teacher and the students, then verifies and executes the code, and finally generates an output with a list of errors and suggestions for the student.
Methodology: The tool was evaluated based on the opinion of 91 students, using a 9-question survey in which the first 8 questions were scored from 1 to 5 and the last was a free-text response.

[20] System: Proposes the blank element selection algorithm to generate automatic programming assessments. It consists of three parts: (i) constraint graph: a vertex is a blank candidate and an edge is the restriction that its incident vertices cannot all be blank simultaneously; (ii) compatibility graph: generated from the constraint graph; and (iii) maximal clique: the maximum set of blank elements with unique answers.
Methodology: The algorithm was evaluated on 42 students of a programming course. The authors generated two problems: (i) on basic concepts and (ii) on data structures and algorithms. For both exercises, the following characteristics were extracted: the ratio of correctly answered questions, the submission time ratio, and the number of submissions.

[28] System: Automatic generation of coding exercises using the OpenAI Codex b tool.
Methodology: A total of 240 coding questions were generated using different OpenAI Codex parameters. The evaluation of these programs was carried out in two aspects: (i) qualitative evaluation of 120 randomly selected exercises focusing on sensibleness, novelty, and readiness; and (ii) quantitative evaluation in terms of a manual assessment rubric applied by the evaluators.

a www.classcraft.com
b https://openai.com/blog/openai-codex/

| Self-assessment module
The AulaWeb self-assessment module includes different functionalities [15] to facilitate the management and generation of C programming exercises. Worth mentioning are:

1. The question manager, used to create self-assessment questions in the database. These questions can be wide-ranging: (i) true-false; (ii) single-choice; (iii) multiple-choice; (iv) numeric (integer or real); (v) string answer; (vi) variable formulation; and (vii) C-programming code. In addition, teachers can assign a unit (see Table 2) and a difficulty level (very easy, easy, medium, difficult, and very difficult) to each of them.

2. The configuration system, which creates self-assessment exercises and allows setting up all their parameters: (i) title of the exercise; (ii) target group of students; (iii) number of questions; (iv) random or nonrandom selection of questions; (v) type of questions; (vi) lesson and difficulty level; (vii) time limit; (viii) correction method; (ix) deadline; and (x) net mask of the IP addresses of the computers from which the exercise may be carried out.

3. The exercises manager, which, based on the content of the question database and the configuration parameters, composes the exercise content, presents the questions, stores the answers and results in the database, and finally displays these results to students and teachers.
As the semester progresses through the technical content of the FP syllabus, teachers set up new self-assessment exercises. Figure 1 shows the professor's interface with the list of exercises scheduled during an academic period. This schedule is very similar across all the academic years studied. Figure 2 shows the student's interface with a C-programming exercise.
At the end of the semester, the students obtain a grade, out of 10 points, computed as the arithmetic average of the self-assessment exercises performed for each unit of the syllabus.
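This grading rule amounts to a simple mean. The following C fragment is a minimal sketch of it; the function name `self_assessment_grade` and the score array are illustrative assumptions, not AulaWeb code.

```c
#include <stddef.h>

/* Final self-assessment grade, out of 10, as the arithmetic average
 * of the per-exercise scores (each already out of 10).
 * Illustrative sketch only; AulaWeb's real implementation may differ. */
double self_assessment_grade(const double scores[], size_t n)
{
    if (n == 0)
        return 0.0;          /* no exercises performed */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += scores[i];
    return sum / (double)n;
}
```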

| Final exams
The final exams of the FP course are held at the end of the academic year in two main calls: June and July. These exams are face-to-face (except in the 2019/2020 course, in which, due to COVID restrictions, all exams were online). Students must solve different problems by writing C code on paper. Each exam contains 10 questions and each question can only receive a binary mark (0 or 1); therefore, the total score is an integer between 0 and 10. Finally, students pass the course in two cases: (i) if the total score obtained is greater than or equal to 5; or (ii) if the score achieved in the June final exam is around 4 and the student satisfactorily completed the self-assessment exercises.

FIGURE 2 C-programming question interface.
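The passing rule above can be sketched in C. The course does not publish an exact threshold for "around 4", so the value 4 and the notion of a "satisfactory" self-assessment record below are assumptions for illustration only.

```c
#include <stdbool.h>

#define SCORE_PASS      5   /* rule (i): total score >= 5          */
#define SCORE_NEAR_PASS 4   /* rule (ii): "around 4" (assumption)  */

/* Sketch of the two passing conditions described above. */
bool passes_course(int june_score, bool satisfactory_self_assessment)
{
    if (june_score >= SCORE_PASS)
        return true;
    return june_score >= SCORE_NEAR_PASS && satisfactory_self_assessment;
}
```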

| EXPERIMENTAL FRAMEWORK
The primary focus of our experiments is to determine whether there is a correlation between performance in self-assessment exercises and the final grade obtained on the exams. To do so, we extracted relevant data, designed the experiments, and performed a preliminary study of the data obtained.

| Data collection
The data shown here, (i) the results of the self-assessment exercises and (ii) the exam grades, were collected through the AulaWeb platform. Exam correction records are available to the Teaching Unit in charge of the subject, to which the authors of this paper are attached.

| Experimental design
During the course, self-assessment exercises were available for completion for a certain period, usually 10 days per exercise. To complete these assignments, students needed to have a computer and an Internet connection that would allow them to connect to the portal.
The experiment was carried out on all students enrolled in the course (see Table 3 for the number of students per year). When the correlations were extracted, the students who were unqualified in each exam session were excluded.

| Preliminary analysis
In this section, we present a preliminary study about when students perform the exercises during the semester, followed by a discussion and a correlation analysis of the data retrieved from AulaWeb.

| Exercises completion
This section analyzes when students complete self-assessment exercises during the semester. Figure 3 shows, as an example, the distribution of exercise completion by date within the 2019/2020 course, together with the exposition and evaluation intervals for each.
The period for self-assessment exercises extends from 11/02/2020 to 29/05/2020 and involves the 11 exercises configured for this subject. Due to different teaching rhythms, and to avoid saturation problems on the server side, alternative deadlines (on consecutive days) are set for each exercise according to the student's group. As shown in Figure 3, students prefer to complete self-assessment assignments on working days; in fact, the number of self-assessment exercises performed over the weekend decreases. These results are consistently similar across all courses.
Lastly, Table 3 shows the performance of the FP students in self-assessment exercises during each academic year.
The first column indicates the academic year; the second column, the total number of students enrolled in FP; the third column, the number of students with at least one self-assessment exercise submitted (and its percentage over the total number of students); the fourth column, the number of students with at least half of the self-assessment exercises submitted (and its %); the fifth column, the number of students with all self-assessment exercises submitted (and its %); and the sixth column, the number of students who obtained at least 5 out of 10 in all their self-assessment exercises (and its %). As can be seen, most students complete at least one self-assessment exercise; however, enthusiasm for carrying out these exercises decreases over time. The maximum drop observed is 22.13% (2020/2021) by the middle of the assignment series and 55.52% (2012/2013) by the end of the academic year.

TABLE 3 Self-assessment performance according to each academic year.

| Data analysis
In this section, a preliminary analysis of the academic performance achieved by the students of the FP course is performed for the 12 years 2010/11-2021/22. Figure 4 shows the temporal evolution of FP students during the last decade. In Figure 4, there are three different outcome categories for each assessment: passed, failed, and unqualified. A student passed an exam if the corresponding grade was greater than or equal to 5, and failed it if the grade was less than 5. Finally, a student was unqualified if: (i) in the June evaluation, the student did not take the exam; (ii) in both calls (June and July), the student was absent from the exam; or (iii) the student failed the June test and was not present at the July evaluation.
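The per-assessment categorization can be sketched as a small C classifier. The sentinel value `ABSENT` and the function name are illustrative assumptions; the paper does not describe a concrete encoding.

```c
/* Sketch of the three outcome categories used in Figure 4. */
typedef enum { PASSED, FAILED, UNQUALIFIED } outcome_t;

#define ABSENT (-1)  /* assumed sentinel: student did not sit the exam */

outcome_t classify(int grade)
{
    if (grade == ABSENT)
        return UNQUALIFIED;        /* did not take the exam */
    return grade >= 5 ? PASSED     /* grade >= 5: passed    */
                      : FAILED;    /* grade <  5: failed    */
}
```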
From the point of view of the data, as we can see in Figure 4, this course is taken by many students: from 481 in 2010/2011 to 762 in 2017/2018. The information reveals that, in general, most of the students taking the final exam in June failed (around 50%). Only in 3 years (2011/2012, 2019/2020, and 2020/2021) did this not happen. Students' performance in the July call is also poor, since in none of the years reviewed does the number of students who passed exceed the number who failed. This may be due to the fact that there are only a few weeks between the June and July final exams: the publication of marks takes 2 weeks, and students have to take exams for other subjects, so those who failed in June in fact have only a few days to prepare for the July exam. In relation to this, we can observe that the number of unqualified students increases in the July call, probably due to the lack of time to study between exams.

| Correlation metric
The following correlation metric was used in the study:
• Pearson (standard) [1]: measures the strength of a linear correlation between two sets of data. This correlation is calculated as follows:

$$ r = \frac{\sum_{i=1}^{n} (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_{i=1}^{n} (x_i - m_x)^2 \, \sum_{i=1}^{n} (y_i - m_y)^2}} $$

where $x = \{x_1, \ldots, x_n\}$ and $y = \{y_1, \ldots, y_n\}$ are two sets of data of length $n$, and $m_x$ and $m_y$ are the means of $x$ and $y$, respectively.
The following section shows the results obtained by correlating the self-assessment exercises with the results obtained in both exams.

| ANALYSIS AND RESULTS
Tables 4, 5, and 6 show the descriptive statistics for each course and the proposed evaluation measurements: self-assessment exercises, and the June and July exams. The results in Table 4 show that, on average, the scores obtained in the self-assessment exercises are high. These scores are spread over a wide range, typically from 0 to 10, but tend to cluster around the mean. Additionally, all distributions are skewed to the left, so most students perform well on their tasks.
The results in Table 5 show that, on average, the scores obtained in the June call are poor: in no academic year does the average reach a passing grade. These scores are spread over a wide range, typically from 0 to 10, and are only slightly concentrated around their average. Also, almost all distributions are skewed to the right, which reflects the poor average performance in these exams.
The results in Table 6 show that the results obtained in July are worse than those of June, not only in the average score obtained but also in the range of grades. As in June, the grades are only slightly concentrated around their average, and almost all distributions are skewed to the right. Table 7 shows the main statistics obtained by correlating the self-assessment exercises with the June and July grades.
The r columns give the values of the correlation coefficient, the 95% CI columns the 95% parametric confidence intervals around r, and the p-val columns the p-value. The table shows that in both the June and July calls the correlation coefficient is greater than 0, so there is a correlation between the self-assessment exercises and the results obtained in the exams. However, there are cases in July where the correlation is very close to 0, so there is no meaningful correlation between the grade obtained in the exams of those years (2016/2017, 2018/2019, 2019/2020, and 2020/2021) and the self-assessment exercises.
This situation corroborates the hypothesis that self-assessment exercises matter more in June than in July. The main reason is that students in June have less time to prepare for the exams, so self-assessment exercises become critical for a better understanding of the concepts taught during the course. On the other hand, self-assessment exercises are less important for the July call because students focus their time on doing other kinds of problems instead of repeating the self-assessment exercises to prepare for the July exam. However, as can be seen, this correlation is declining over time (about a 43.85% loss when comparing the June correlation of 2010/2011 with that of 2021/2022). This may be due to the increase in students with 0 self-assessment exercises submitted: from 4.78% in 2010/2011 to 15.73% in 2021/2022.
The p-values indicate that the results obtained are statistically significant (since p ≤ 0.05), which reinforces the hypothesis that there is a correlation between performing self-assessment exercises and obtaining a good score on the exam. However, there are cases where this threshold is exceeded; in those years, the correlation coefficient is practically 0 (2016/2017, 2018/2019, 2019/2020, and 2020/2021).
The results derived from correlating the self-assessment exercises with the grades obtained in June and July are summarized in Figures 5 and 6, respectively. Figure 5 shows the existence of a correlation between the June exams and the self-assessment exercises, as seen in Table 7. However, Figure 6 shows a weaker correlation between the July exam and the self-assessment exercises. This evidence strengthens the above hypothesis that self-assessment exercises tend to be more important in June than in July.

| CONCLUSION AND FUTURE WORK
TABLE 7 Pearson correlation statistics (columns: academic year; June: r, 95% CI, p-val; July: r, 95% CI, p-val).

This work presents a novel example of the application of computational tools for the implementation of self-assessment systems for learning computer programming and provides a valuable contribution to the pedagogical justification of their use in a university degree subject with large groups of students. The main goal of this study was to show, for the subject of FP, whether there was a correlation between the results of the self-assessment exercises and the results of the final exams. To do so, we discussed and analyzed the data provided by AulaWeb over 12 academic years. The main results are as follows:
(a) Most of the students participate in solving self-assessment exercises throughout the course.
FIGURE 5 Self-assessment versus June grades from 2010/2011 (a) to 2021/2022 (l) courses.
(b) Completing self-assessment coding exercises during the course provides more experience in dealing with problems. This skill is necessary to pass the exams successfully (regardless of the call).
(c) In line with the results achieved by the studies reviewed in the state-of-the-art section, solving self-assessment exercises helps students to perform better in the final exams.
(d) Regarding the June final exam, solving these exercises is key to obtaining satisfactory results, mainly due to (i) the short time between the end of regular classes and the day of the exam; and (ii) the existence of exams in other subjects to prepare for.
(e) Regarding the July final exam, although self-assessment exercises are a useful tool for facing new problems, having less time to prepare for the exams makes them less critical than in the June call from the students' point of view.
FIGURE 6 Self-assessment versus July grades from 2010/11 (a) to 2021/22 (l) courses.
(f) Students prefer to do their self-assessment exercises on working days rather than at weekends.
(g) Around 20% of students do not pay enough attention to self-assessment exercises, even though they are a determining factor in passing the course.
The use of technologies that implement the self-assessment methodology benefits learning managers in many ways, for example, obtaining students' self-assessment results automatically and immediately, and saving time and human resources in the tedious task of correcting exercises. Students, in turn, improve their autonomous learning capacity by having a system that allows them to carry out exercises at any time and place and obtain immediate results, which increases the flexibility of the learning process, the immediacy of evaluation feedback, the disappearance of spatiotemporal barriers, and student motivation.
Our work has some limitations. The most important is that the experimental results may not generalize, since the information refers specifically to the students of this FP course at ETSII-UPM. In addition, some difficulties can be anticipated in applying this methodology to subjects whose contents cannot be easily evaluated by computer systems (e.g., artistic subjects).
Concerning future work, we want to: (a) study the relation between self-assessment exercises and final exams in the FP course taught in the first semester of the first year of Grado en Ingeniería en Organización and Grado en Ingeniería Química, also at ETSII UPM; and (b) study the relation between self-assessment exercises and final exams in other courses taught at ETSII UPM.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the author, AGB. The data are not publicly available due to their containing information that could compromise the privacy of research participants.