Impact of C‐coding self‐assessment exercises on exam performance: A study in engineering education

The Bologna process has brought about a revolution in university studies since, among other things, it encourages continuous evaluation throughout the academic period. This situation places an additional workload on already depleted teaching teams, so automatic tools are needed to evaluate the new assignments. In the computer science field, the ideal evaluation technique is to use automatic code evaluators (e.g., for Java, C, or C++). The main objective of this work is to analyze whether the use of C‐coding self‐assessment exercises correlates with an improvement in exam performance. For this purpose, we have collected self‐assessment exercises carried out on the AulaWeb platform in the course of Programming Fundamentals (Fundamentos de Programación) of the Degree in Organizational Engineering and the Degree in Chemical Engineering, over the last 5 and 3 years, respectively, at an engineering school of the Universidad Politécnica de Madrid. In total, 688 students completed these assignments. The most important results are: (a) regarding the January final exam, self‐assessment exercises have an influence on the final grade, and (b) regarding the June final exam, there are academic years in which students neglect programming problems of greater complexity, producing a negative correlation between self‐assessment performance and the final scores achieved.


| INTRODUCTION
The Bologna process* has brought about a revolution in university studies; this new paradigm promotes the use of activities that continuously assess student performance throughout the academic period. This continuous assessment does not only focus on evaluating students at the end of the academic period, usually through exams, as has been done in traditional methodologies; instead, in a progressive manner, it proposes activities that facilitate the assimilation and development of the theoretical and practical concepts seen in class.
As a result, already overloaded teaching teams must make extra efforts to prepare and correct these activities. Due to the high imbalance between the number of teachers and students,† these continuous assessment tasks can become unaffordable for teachers. Moreover, the transmission and correction of results may take too long to be a useful and pedagogically effective tool, given the relatively short periods of time in which academic years take place.
To overcome these problems, some strategies have recently been developed to involve students in their educational process by comparing their learning with clearly defined criteria [23]. These strategies are known as self-assessment. To make this self-assessment as effective as possible, by making the process of creating and correcting assignments as quick as possible, several automated tools have been developed. Some authors have proposed web-based systems that allow automatic assessment of students before, during, and after the course [3,4,14,18]. Most of these tools are used to correct a set of questions (single or multiple choice, numerical answers, etc. [19,26]). A positive aspect of these tools is that they help teachers assess and, in addition, transmit the results almost immediately to the students, thus improving the effectiveness of continuous assessment.
From the point of view of teaching computer science, the types of question mentioned above may not be the best way to proceed, since the interest is in evaluating the students' programmed solutions to problems set by teachers. Several studies have investigated the use of self-assessment exercises in programming classes. In Baruque and colleagues [1,18], different platforms are presented that allow the creation and evaluation of self-assessment programming problems. In Cedazo et al. [5], the effectiveness of using self-assessment exercises in practical exams is evaluated over four academic years. In Chung and colleagues [8,25], it is examined how student motivation changes during the academic year through the use of self-assessment exercises and how this affects academic performance.
As can be seen, the implementation of systems that allow self-assessment in learning programming languages in large groups of students is not a trivial problem.
This work studies whether the results obtained in the self-assessment exercises carried out by a student throughout a term (from September to December) are relevant to the final exam grades (in January or June) in a university programming course. For this purpose, data from several academic years of the same subject in two different degree courses are used.
The remainder of the article is organised as follows. First, some related work is presented. Then, some concepts that may be useful to understand this study are described: (i) the subject under study; (ii) the platform used to perform the self-assessment activities; (iii) a description of the self-assessment module, and (iv) the final exams. Third, the experimental framework is presented: (i) data collection; (ii) experimental design; (iii) preliminary analyses; and (iv) evaluation metrics. Fourth, the results obtained are presented and discussed. Finally, the main conclusions of the study are described and the lines of future work are outlined.

| RELATED WORK
The increase in student enrollment coupled with a stagnant number of teachers requires the development of automated systems for the creation and evaluation of exercises [18]. The evaluation task becomes nearly impossible for teachers given the student-to-teacher ratio, highlighting the need for efficient solutions [13]. The first system for assessing programming exercises, which evaluated machine code written on punched cards, was developed by Hollingsworth in 1960 [20]. Subsequently, numerous platforms have emerged, especially with the rise of Massive Open Online Courses (MOOCs) [22].
In Cedazo et al. [5], an online C compiler is presented, allowing students to create their own code and test cases. These test cases enable students to verify the correctness of their algorithms by comparing their output with the provided solutions. Daradoumis et al. [11] introduce DSLabs, a tool that checks whether students' uploaded code matches that of the teacher. If they match, the answer is marked as correct; otherwise, it is deemed incorrect. Papadakis and Kalogiannakis [24] use a gamification tool called Classcraft‡ to engage students. Galan et al. [15] develop a custom framework for automatically reviewing programming assignments. It compares the correct solution provided by the professor with students' attempts, using combinatorial testing techniques to determine whether their outputs yield the same result. In Chen et al. [7], ProgEdu [6] is utilized. This software evaluates each code submission based on three aspects: (i) compilation, ensuring the code is error-free; (ii) functionality, checking whether the code meets the requirements; and (iii) code quality, ensuring adherence to coding conventions. Delgado-Pérez and Medina-Bulo [12] introduce a library called CAC++ to generate programming tasks. The library verifies and executes the teacher's and students' code, generating a list of errors and suggestions for students. Kyaw et al. [21] propose a blank element selection algorithm for automatic programming assessment, involving a constraint graph, a compatibility graph, and a maximal clique. Lastly, Sarsa et al. [28] use the OpenAI Codex§ tool to automatically generate coding exercises. The works of Corral Abad and colleagues [9,10] introduce a novel Android tool named MaqTest. This tool serves multiple purposes, including assessing students' understanding of theoretical concepts, verifying their comprehension of these concepts, and providing comprehensive explanations and demonstrations of the concepts for learning.
Most of the reviewed studies utilize tools where students can upload their solutions and their algorithmic output is compared to the teacher's output. Satisfaction surveys of students are commonly used to evaluate these approaches, with only Chen and colleagues [7,15] conducting correlation studies between exercise performance and exam grades. In line with these studies, Rodríguez-Vidal et al. [27] correlate the performance achieved in self-assessment exercises with the final grades achieved by the students.
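Schematically, the output-comparison mechanism that most of these tools share can be sketched as follows. This is a minimal illustration, not code from any of the reviewed platforms; the names `TestCase`, `student_run`, and `grade` are hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* A submission passes a test case when its output matches the
 * teacher's reference output exactly. */
typedef struct {
    const char *input;     /* input fed to the student's program  */
    const char *expected;  /* reference output from the teacher   */
} TestCase;

/* Stand-in for running the student's submission: here, a toy
 * program that reads two integers and prints their sum. */
static void student_run(const char *input, char *out, size_t n) {
    int a, b;
    sscanf(input, "%d %d", &a, &b);
    snprintf(out, n, "%d\n", a + b);
}

/* Run every test case and count exact output matches. */
int grade(const TestCase *cases, int n_cases) {
    int passed = 0;
    char buf[128];
    for (int i = 0; i < n_cases; i++) {
        student_run(cases[i].input, buf, sizeof buf);
        if (strcmp(buf, cases[i].expected) == 0)
            passed++;
    }
    return passed;
}
```

Real systems additionally compile the submission in a sandbox and enforce time and memory limits, but the pass/fail decision typically reduces to this string comparison of outputs.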

| BACKGROUND KNOWLEDGE
In this section, we describe the course under study, the teaching tool used for student learning and self-assessment, the self-assessment module, and the way the final exams are carried out.

| Programming fundamentals
The course from which we conduct our study is called FP (Fundamentos de Programación or Programming Fundamentals). This course belongs to the first year and first semester of the Degree in Organizational Engineering (from now on, GIO) and the Degree in Chemical Engineering (from now on, GIQ) at the Escuela Técnica Superior de Ingenieros Industriales (from now on, ETSII) of the Universidad Politécnica de Madrid (from now on, UPM), with an average of 77 and 80 students enrolled per year, respectively (taking into account both first-year students and repeaters), and six ECTS (European Credit Transfer and Accumulation System**). The cut-off marks for the University access exam (out of 14) for GIO and GIQ students and the number of students enrolled per year in FP are shown in Table 1.
The syllabus for this subject is shown in Table 2.

| AulaWeb
The AulaWeb platform has been developed by the UPM Computer Science Laboratory and has been used by students and professors of this university since 1999 [17]. This software has been used mainly as a support tool for teaching subjects in this department. Among other things, the environment allows various academic activities to be carried out (publication of training resources, publication and submission of assignments, configuration and performance of self-assessment exercises, establishment of virtual tutorials, etc.) through the use of a Web browser. The AulaWeb platform uses IIS (Internet Information Services) on Windows Server 2016 and Microsoft SQL Server 2019. The technology used for the development of the system combines the use of ASP.NET and Java.

| Self-assessment module
The AulaWeb self-assessment module includes different functionalities [16] to facilitate the management and generation of C programming exercises. It is worth mentioning:
1. The question manager, used to create questions in the database. These questions can be wide-ranging: (i) true-false; (ii) single-choice; (iii) multiple-choice; (iv) numeric (integer or real); (v) string answer; (vi) variable formulation, and (vii) C-programming code. Additionally, teachers can assign a unit (see Table 2) and a difficulty level (very easy, easy, medium, difficult, and very difficult) to each of them.
2. The configuration system, which creates self-assessment exercises and allows setting up all their parameters: (i) title of the exercise; (ii) target group of students; (iii) number of questions; (iv) random or nonrandom selection of questions; (v) type of questions; (vi) lesson and difficulty level; (vii) time limit; (viii) correction method; (ix) deadline, and (x) netmask of the IP addresses of the computers from which to carry out the exercise.
3. The exercises manager, which, based on the content of the question database and the configuration parameters, composes the exercise content, presents the questions, stores the answers and results in the database, and finally displays these results to students and teachers.
As progress is made in the technical knowledge covered by the syllabus during the semester that FP lasts, teachers set up new self-assessment exercises. Figure 1 shows the professor's interface with the list of exercises scheduled in an academic period. This scheme is very similar in all previous courses.
Figure 2 shows the student interface with a C-programming exercise.
At the end of the semester, students obtain a grade, out of 10 points, computed as the arithmetic average of the self-assessment exercises performed for each unit of the syllabus.

| Final exams
The final exams of the FP course are held at the end of the first semester and at the end of the academic year in two main calls: January and June. These exams are face-to-face, except for the June 2019/20 call in which, due to COVID restrictions, the exam was online. Students must solve different problems by writing C code on a computer. Depending on the year, the exams follow a different structure: (i) one short problem and two long ones, or (ii) one self-assessment (S-A) question and two long problems, with a weight of 20% of the exam grade for the S-A question or the short problem, and 40% for each of the long problems. The total grades obtained by the students are real numbers between 0 and 10. Finally, students pass the course in two cases: (i) if the total score obtained is greater than or equal to 5, or (ii) if the score achieved in the January final exam is equal to or greater than 4 and the student satisfactorily completed the self-assessment exercises.
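The weighting and the passing rule described above can be made concrete with a short sketch. The function names are ours, and all partial scores are assumed to be on a 0-10 scale:

```c
#include <stdbool.h>

/* Weighted exam grade for structure (ii): 20% for the S-A question,
 * 40% for each of the two long problems. */
double exam_grade(double sa_question, double long1, double long2) {
    return 0.20 * sa_question + 0.40 * long1 + 0.40 * long2;
}

/* Passing rule: total score >= 5, or a January score >= 4 together
 * with satisfactorily completed self-assessment exercises. */
bool passes(double january_grade, bool sa_completed) {
    return january_grade >= 5.0 || (january_grade >= 4.0 && sa_completed);
}
```

For example, a student scoring 10 on the S-A question but only 5 on each long problem would obtain a total of 6, while a student with a January score of 4.5 passes only if the self-assessment exercises were satisfactorily completed.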

| EXPERIMENTAL FRAMEWORK
In this section, we present the data collection process, the experimental design, a preliminary analysis of the data retrieved from AulaWeb, and a description of the metrics used in the experimentation.

| Data collection
The data shown here, namely (i) the results of the self-assessment exercises and (ii) the exam grades, have been collected through the AulaWeb platform. Exam correction records are available to the Teaching Unit in charge of teaching the subject, to which the authors of this paper are attached.

| Experimental design
During the course, self-assessment exercises were available to be completed for a certain period, usually 10 days per exercise.To complete these assignments, students needed to have a computer and an Internet connection that would allow them to connect to the website.
The experiment was carried out on all students enrolled in the course (see Table 1 for the number of students per year). When the correlations were extracted, those students who were unqualified in each exam session were excluded.

| Preliminary analysis
In this section, a preliminary analysis of the academic performance achieved by the students in the FP course is carried out over 5 years, 2017/18-2021/22 (GIO) and 3 years, 2019/20-2021/22 (GIQ).

| Exercises completion
In this section, an analysis of when students complete self-assessment exercises during the semester has been carried out. Figure 3 shows, as an example, the distribution of exercise completion by date within the 2021/22 course, together with the exposition and evaluation intervals for each exercise. Only the distribution associated with the GIQ group is shown, as both groups are fully coordinated and therefore the exercise configuration is the same. The period for self-assessment exercises extends from September 21, 2021 to December 28, 2021 and involves 10 configured exercises for this subject. As shown in Figure 3, students prefer to complete their self-assessment exercises during the working days of the week. It can also be seen that the number of self-assessment exercises performed decreases as the academic year progresses.
Lastly, Table 3 shows the performance of the GIO FP students in self-assessment exercises during each academic year.
The first column indicates the academic year; the second column, the total number of students enrolled in FP; and the third column, the number of students with at least one self-assessment exercise submitted (and their associated percentage over the total number of students). The fourth column gives the number of students with at least half of the self-assessment exercises submitted (and their associated %). The fifth column shows the number of students with all self-assessment exercises submitted (and their associated %), and finally, the sixth column represents the number of students who obtained at least a five out of 10 in all their self-assessment exercises (and their associated %). As can be seen, most of the students complete at least one self-assessment exercise; however, the number of completed exercises decreases over time. The maximum drop reached is 8.54% (2017/18) at the middle of the total number of assignments and a maximum decrease of 31.77% (2021/22) at the end of the academic year. Furthermore, not all students achieve high enough scores to pass the self-assessment exercises. The difference between the number of students who perform at least one self-assessment exercise and those who pass them can be quantified between 14.64% (2017/18) and 1.22% (2020/21).

FIGURE 3 Distribution of completed exercises by date-GIQ.
Table 4 shows the performance of the GIQ FP students in self-assessment exercises during each academic year.
The meaning of each column has previously been explained.As in the previous case, the number of complete exercises decays over time.For GIQ, the maximum drop achieved is 12.19% (2021/22) in the middle of the total number of assignments and a maximum decrease of 25.27% (2021/22) at the end of the academic year.The difference between the number of students who perform at least one self-assessment exercise and those who pass it can be quantified between 14.28% (2021/22) and 6.19% (2020/21).

| Data analysis
In this section, a preliminary analysis of the academic performance achieved by students of the FP courses (GIO and GIQ) is performed. Figures 4 and 5 show the temporal evolution of the FP students during these years for GIO and GIQ.
In Figures 4 and 5, there are three different categories of outcome for each assessment: passed, failed, and unqualified. A student has passed an exam if the corresponding grade was greater than or equal to 5. A student has failed the exam if the evaluation mark was less than 5. Finally, a student was unqualified if: (i) the student did not take the exam in the January evaluation; (ii) the student was absent from both assessments (January and June); or (iii) the student failed the January test and was not present at the June evaluation.
According to the data presented in Figure 4, many students take this course: from 82 in 2017/18 and 2020/21 to 91 in 2018/19 for GIO, and from 62 in 2019/20 to 97 in 2020/21 for GIQ. The information extracted from Figure 4 reveals that the majority of students who attended in January failed these exams (53.22% on average). Only in 1 year, 2021/22, did this circumstance not occur. Student performance in the June call is poor, since in only 2 years (2018/19 and 2020/21) is the number of students who passed the exam above the number of students who failed it. In this call, we can observe the high number of students who did not take the exam. This may be because the subject belongs to the first semester of the course and there is a gap of almost 6 months between the January and June exams, so students who failed in January have very little recent knowledge of the concepts needed to prepare for the exam and prefer to focus on recovering more recent subjects.

TABLE 3 Self-assessment performance according to each academic year-GIO.

In Figure 5, the data show a high number of students enrolled: from 72 in 2019/20 to 97 in 2020/21. On the basis of the previous results, we can see that most students fail the exam in the January call (45.58% on average) and that the vast majority of students do not take the exam in June (49.62% on average). This last result reinforces the hypothesis that students prefer to study more recent subjects (those belonging to the second semester).

| Metrics
To evaluate the different tasks proposed in this study, the following metric was used:
• Pearson (standard) [2]: measures the strength of a linear correlation between two sets of data. This correlation is calculated as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $x$ and $y$ are two sets of length $n$, and $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, respectively.
The following section shows the results obtained by correlating the self-assessment exercises with the results obtained in both exams.

| RESULTS

Figures 6-11 show the descriptive statistics for each course and academic year for the proposed evaluation measurements: self-assessment exercises, January exam, and June exam.
The results in Figure 6 show that the average exhibits an upward trend in the last 2 years and that, overall, the results obtained are good. The range of values is consistently close to 10, indicating the spread of grades achieved by the students in this subject. As per the skewness indicator, the distributions tend to show a slight leftward skew, suggesting that a majority of students perform well on their assignments. Lastly, the kurtosis indicator points to the presence of outliers in the most recent 2 years (2020/21-2021/22).
The results in Figure 7 show that the average exhibits a consistent trend across all years; overall, the results obtained are favorable. As for the standard deviation and variance, they display higher values, particularly in the year 2020/21, possibly attributable to the fact that classes were conducted entirely remotely due to COVID-19 during that academic year. The range of values remains relatively consistent across all years, indicating a stable dispersion of results. According to the skewness indicator, the distributions tend to exhibit a slight leftward skew, suggesting that a majority of students tend to achieve high scores. Finally, the kurtosis indicator signals the presence of outliers in the 2020/21 course.
The results depicted in Figure 8 indicate that, on average, the scores obtained in the January call at GIO are low, with only 1 year having an average score exceeding 4. The scores exhibit a wide range of variation. Additionally, nearly all distributions are skewed to the right, which is associated with subpar performance on average in these exams.
The results shown in Figure 9 indicate that, on average, the scores achieved during the January call at GIQ are low, with 2 out of 3 years having an average score exceeding 4. The scores display a significant range of variation. Furthermore, nearly all distributions are skewed to the right, which is indicative of below-average performance in these exams on average.
The results in Figure 10 demonstrate that, on average, the scores obtained in the June call for GIO continue to be poor. It is worth mentioning that the results achieved by GIO students are better compared to the first call; in almost all academic years, the scores are above 4. The scores are distributed across a wide range. The majority of distributions in GIO are skewed to the left, which is related to the overall improvement in performance on these exams.
The results depicted in Figure 11 indicate that, on average, the scores achieved during the June call for GIQ are also poor, and in fact, they are even worse than those in January. Similar to GIO, the scores exhibit a wide range of variability. Most of the score distributions in GIQ are right-skewed, reflecting the overall poor performance on these exams.
Tables 5 and 6 show the main statistics obtained from the correlated self-assessment exercises with the January and June grades.
The columns r refer to the correlation coefficient values, the columns 95% CI give the 95% parametric confidence intervals around r, and the columns p value give the p values. The tables show that in the majority of January and June calls, the correlation coefficient is greater than 0 and therefore there is a correlation between the self-assessment exercises and the results obtained in the exams. However, there are cases in June for GIO where the correlation is negative and very close to 0, so there is no correlation between the grades obtained in the exams in those years (2019/20, 2020/21, and 2021/22) and the self-evaluation exercises.
The fact that self-evaluation exercises are part of the exams makes students pay more attention to these exercises and, therefore, there is a correlation between the grades obtained and the exercises performed during the course (except for the aforementioned academic years). However, as can be seen, this correlation declines over time (about a 21.43% loss if we compare the January correlations of 2018/19 and 2021/22 for GIO, and about 58.54% if we compare the January correlations of 2019/20 and 2020/21 for GIQ).
The p values indicate that the results obtained are statistically significant (since p ≤ 0.05), which reinforces the hypothesis that there is a correlation between performing self-assessment exercises and obtaining a good score on the exam. However, there are cases where this threshold is exceeded. In those cases (all June calls for both degrees and the January 2020/21 call for GIQ), the correlation coefficient is near the limit; moreover, in the cases of negative correlation the p value is too high, leading to the conclusion that in those years and calls the study factors are not correlated with each other.
In the following, the results derived from correlating the results of the self-assessment exercises with the grades obtained in January and June are discussed. Figure 12 shows the existence of a correlation between the January exams and the self-assessment exercises, as seen in Table 5. Figure 13 shows that there was a negative correlation between performance in June and the self-assessments (as shown in Table 5) in the last academic years. This is surprising considering that there is a specific self-assessment question in the exams. On the other hand, Figure 14 shows a positive correlation between all of its tests and the self-assessment exercises, as seen in Table 6.

| CONCLUSIONS
This study aimed to investigate the correlation between the results of self-assessment exercises and the performance achieved in final exams in the context of the FP course. The data shown in this work were collected from the AulaWeb platform for two different groups, GIO and GIQ, over different academic years, allowing us to draw the following main conclusions:
• The active participation of the students in this kind of exercise throughout the course suggests that self-assessment activities play an important role in the learning process.
• In line with the results achieved by the reviewed state-of-the-art studies, solving self-assessment exercises positively contributes to improving academic performance.
• Regarding the January final exam, these kinds of exercises are an important factor in obtaining satisfactory results, owing to various factors, including the similarity between these exercises and the exam format.
• However, in the June exam, there are academic years in which the correlation was negative. This suggests that some students focused too much on the shorter self-assessment exercises instead of on the resolution of more complex programming problems.
• When comparing the performance of GIO and GIQ students, it is noticeable that, in general, GIQ students obtain higher grades in self-assessment exercises. However, this does not necessarily translate into better results in the final exam of FP. This difference could be attributed to the characteristics of the students in each program.
The use of self-assessment technologies offers significant advantages for education managers, including automation and instant feedback, which enhance flexibility in the learning process.
Despite these findings, our work has some limitations. The most important one is that the experimental results cannot be generalized, since the information concerns only the students of this FP course, and the GIO and GIQ degrees in particular. In addition, this methodology cannot be easily implemented in other subjects (e.g., artistic matters).
In terms of future work, we want to study: (i) the influence of the statements of previous exams on academic performance; (ii) the relation between self-assessment exercises and final exams in other courses; and (iii) the performance achieved by first-year students versus repeating students and the influence of self-assessment exercises in both cases.

FIGURE 4 Evolution of the students' performance through the different academic years for GIO.

FIGURE 5 Evolution of the students' performance through the different academic years for GIQ.
FIGURE 6 Statistics for self-assessment exercises-GIO.
FIGURE 7 Statistics for self-assessment exercises-GIQ.

FIGURE 8 Statistics for the January exam-GIO.
FIGURE 9 Statistics for the January exam-GIQ.
RODRÍGUEZ-VIDAL and GARCÍA-BELTRÁN | 11 of 18

FIGURE 10 Statistics for the June exam-GIO.

FIGURE 11 Statistics for the June exam-GIQ.
TABLE 5 Pearson correlation statistics-GIO.
TABLE 1 GIO-GIQ university access requirements.
TABLE 2 Syllabus of FP.
FIGURE 1 List of exercises carried out during the 2021/22 academic year.
TABLE 6 Pearson correlation statistics-GIQ.