Re-defining the virtual reality dental simulator: Demonstrating concurrent validity of clinically relevant assessment and feedback

Introduction: Virtual reality (VR) dental simulators are gaining momentum as a useful tool to educate dental students. To date, no VR dental simulator exercise has been designed which is capable of reliably providing validated, meaningful clinical feedback to dental students. This study aims to measure the concurrent validity of the assessment, and the provision of qualitative feedback, pertaining to cavity preparations by VR dental simulators.

Methods: A cavity preparation exercise was created on a VR dental simulator, and assessment criteria for cavity preparations were developed. The exercise was performed 10 times in order to demonstrate a range of performances and, for each, the simulator feedback was recorded. The exercises were subsequently three-dimensionally printed and 12 clinical teachers were asked to assess the preparations according to the same criteria. Inter-rater reliability (IRR) between clinical teachers was measured using a free-marginal multirater kappa value. Clinical teacher assessment responses were compared with the VR simulator responses and percentage agreements calculated.

Results: IRR values for each exercise ranged from 0.39-0.77 (69.39-88.48%). The assessment of smoothness (κfree 0.58, 78.79%) and ability to follow the outline (κfree 0.56, 77.88%) demonstrated the highest agreement between clinical teachers, whilst the assessment of undercut (κfree 0.15, 57.58%) and depth (κfree 0.28, 64.09%) demonstrated the lowest agreement.


Introduction
Dental students must be capable of carrying out basic operative dental procedures prior to treating real patients safely and effectively 1,2 . Many of these skills are complex to learn, involving the acquisition and application of knowledge and the development of fine motor skills. Pre-clinical operative dental training is commonly carried out within a clinical skills laboratory and within Europe, the vast majority of these are equipped with mechanical patient simulators, commonly referred to as "phantom-heads" 3 . These phantom heads exist typically as replicas of a human head and torso, fitted with jaws that contain either extracted human, or plastic typodont teeth. Phantom heads are used as a basis for both teaching and assessing the necessary operative techniques in order to demonstrate that students are safe to progress to treat patients. Despite the ubiquitous nature of the clinical skills laboratory, the construct is resource-intensive, in terms of time, staffing, restorative materials and tooth substrates 4 .
In Dentistry, Virtual Reality (VR) simulators are computer-based systems that attempt to recreate aspects of the real world and often incorporate physical interactivity through haptic technology that provides tactile force-feedback to the user. VR simulators have been successfully employed in the learning of high-risk procedures in aviation and surgery 5,6. These systems are gaining momentum as a useful tool to educate dental students 7,8. The reported advantages of VR simulation in dental education include 9-11:
• the potential to provide iterative and unlimited practical learning
• greatly reduced overheads for resource consumables and teaching staff
• immediate, objective feedback
• the ability to create tailored and standardised exercises

It is clear that VR simulators have the potential to complement traditional teaching methods in preclinical operative skills training. However, it is important to recognise that VR simulators need to be supported by well-defined and clear pedagogic values in order to maximise their utility, and this includes validated approaches to assessment 12,13.

The validity of VR systems
Validity can be defined as "the extent to which an assessment instrument measures what it was designed to measure" 14 . Different aspects of validity can be demonstrated through objective (construct, concurrent and predictive validity) or subjective (face and content validity) means 15 .
Most of the literature that attempts to establish the validity of VR dental simulator feedback claims to establish construct validity, by comparing the assessment of the performance of experts and novices 11,16-18. Most often this involves comparison of single criterion data, although it is argued that a number of different sources of evidence are required in order to demonstrate and establish construct validity 19-21. Other studies have attempted to establish the predictive validity of their simulator feedback by comparing student performance with a VR simulator and subsequently, after a time lag, with traditional pre-clinical course performance 22,23.
To date, there is no published research that attempts to validate simulator feedback for an exercise by comparing it to an externally validated measure of the same performance. This is known as concurrent validity and would involve comparing simulator feedback to that of a trained clinical tutor.
A likely reason for this lack of research is that all of the published assessment methods in VR dental simulators are quantitative in nature 4,9,11,16-18,23,24. The exercises that have been developed for dental education typically involve either the preparation of various geometric shapes 16,18,23,24, or operative procedures on teeth 4,9,11,17,22,25,26. This quantitative feedback typically provides the user with a score that is based on the amount of the target material removed, the amount of surrounding (non-target) material removed and the time taken to complete the exercise.
Quantitative feedback is often considered advantageous 4,9,18,26,27, primarily due to the objectivity that it provides. However, the true usefulness of this quantitative feedback is questionable, as the scoring model is not truly reflective of the task or domain structure itself. For example, the presentation of a coloured region of tissue to be removed provides a clear indication of what is expected within the exercise, but the score does not reveal whether a good performance is the result of a sound understanding of the principles of cavity design or simply of the operator having a steady hand. This is known as construct-irrelevant easiness 21. In clinical settings, students receive qualitative feedback on their performance, which should be meaningful and actionable to support students in improving their performance. Examples of such feedback for an occlusal cavity may include: handpiece control, depth of the preparation and flatness of the floor of the preparation 28.
Despite multiple calls for feedback to conform with that given by dental educators in clinical settings 11,18 , to date no VR dental simulator exercise has been designed which is capable of reliably providing this meaningful clinical feedback. This sentiment is echoed by Bakr 29 and Rhienmora 4 . In reality, designing VR software that provides qualitative clinically relevant feedback is undoubtedly extremely complicated, and this may be the primary reason for its underdevelopment.

Accepted Article
This article is protected by copyright. All rights reserved.

Aims
This study aims to:
• introduce a novel process for measuring aspects of the validity of the assessment provided by VR dental simulators
• demonstrate a proof of concept for the provision of qualitative clinical feedback with VR dental simulators
• demonstrate the concurrent validity of the VR dental simulator feedback by comparing it with that obtained from clinical tutors

Methods
A visual outline of the methods is presented in Figure 1. An exercise that focussed on the essential features of occlusal cavity preparations was conceptualised by the authors, and developed for use on a Virteasy dental simulator by HRV (Changé, France). The exercise consisted of a block of material that had a simulated density similar to human enamel and a straight-line template on its surface. Users were asked to prepare a cavity of 2 mm depth, with maximum undercut, that followed the line. The instruments available for the exercise consisted of a high-speed dental handpiece, a pear-shaped diamond bur and a dental probe. A screenshot of the exercise can be seen in Figure 2.

Assessment criteria and feedback statements
Objective and qualitative criteria for assessing the preparation were obtained from existing published teaching material 28,30 . These criteria were combined with a range of feedback statements derived from published teaching material 28,30 and the expert opinions of experienced senior clinical teaching staff within the School of Clinical Dentistry, University of Sheffield, UK. The qualitative assessment criteria and associated feedback statements can be seen in Tables 1 and 2, respectively.

Development and testing of the assessment and feedback
Software modifications were made to the Virteasy simulator to enable it to make judgements about each of the qualitative assessment criteria based on user performance on the exercise. This involved empirical refinement of mathematical rules and thresholds based on user motions and handpiece angulation until the simulator analysis was aligned with each of the qualitative assessment criteria.

The methods of calculation that the simulator employed for each qualitative assessment criterion are summarised in Table 3. Based on the output of these measurements, threshold values were set to determine a "yes" or "no" judgement for each criterion. This allowed the simulator software to quantitatively assess a preparation, and yet provide qualitative feedback statements to the user.
Once these methods of calculation were established and the exercise was able to provide qualitative statements across the five assessment criteria, a period of testing was undertaken to ensure the simulator always provided the expected feedback. This testing involved the repetitive assessment of preparations of varying quality and a comparison between the clinician's judgement of the preparation and that provided by the simulator. The thresholds for each of the methods of calculation were modified until the simulator analysis was aligned with expected clinical feedback, as agreed by the clinical members of the project team (JD, JF, NM).
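The idea of turning a quantitative measurement into a yes/no judgement with an attached feedback statement can be sketched as follows. This is only an illustration, not the Virteasy implementation: the class name `CriterionRule`, the function `feedback`, the tolerance of 0.5 mm and the wording of the question and advice are all hypothetical. Only the 2 mm target depth is taken from the exercise description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CriterionRule:
    """One yes/no rule: the measurement passes if it lies within [lower, upper]."""
    question: str
    lower: float
    upper: float
    advice: str  # shown only when the judgement is "no"

    def judge(self, measurement: float) -> str:
        return "yes" if self.lower <= measurement <= self.upper else "no"


# Hypothetical rule: the 2 mm target depth comes from the exercise brief,
# but the +/-0.5 mm tolerance and the advice wording are invented here.
DEPTH_RULE = CriterionRule(
    question="Is the preparation the correct depth?",
    lower=1.5,
    upper=2.5,
    advice="Check the depth of the preparation with the probe as you cut.",
)


def feedback(rule: CriterionRule, measurement: float) -> tuple[str, Optional[str]]:
    """Return the judgement and, for a "no", the qualitative advice statement."""
    judgement = rule.judge(measurement)
    return judgement, (None if judgement == "yes" else rule.advice)
```

Under these assumed bounds, a measured depth of 2.2 mm would yield a "yes" with no advice, whereas 3.1 mm would yield a "no" together with the advice statement. Adjusting `lower` and `upper` corresponds to the threshold refinement described above.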

The delivery of feedback
Once the exercise is completed, users are asked to critically appraise their own work across the 5 cavity features (Table 1, Figure 4). The simulator then delivers its assessment of the actual performance along with any necessary recommendations for improving the performance (advice statements in Table 2) alongside the user's assessment of their own work. This should encourage critical reflection about any discrepancies in the user's perceived performance and the objective assessment of the simulator.

The validation procedure
To establish the concurrent validity of the assessment provided by the simulator, the obtained qualitative statements were compared to clinical teachers' assessment of the same preparations (as the standard). A series of 10 attempts at the exercise were produced by the project clinical skills lead (JD) in order to specifically demonstrate a range of good and bad performances based on the identified assessment criteria presented in Table 1. A combination of preparation errors was prescribed across the 10 exercises (Table 4). For each of these 10 exercise attempts (A-J), the simulator's assessment (yes or no) for each of the 5 assessment criteria was recorded.
Concurrently, the 10 exercise attempts were exported in stereolithography (STL) format and three-dimensionally (3D) printed in the same dimensions using a stereolithography (SLA) 3D printer (Form 2, Formlabs, Somerville, Massachusetts, USA). A separate overlay template showing the correct position of the straight line was printed in clear resin to facilitate assessment of the user's ability to follow the outline. An example of the 3D printed models can be seen in Figure 3.

Data collection
In order to assess the 3D printed models, assessors were equipped with a straight probe and magnification as per individual routine practice, plus the transparent position template.
Twelve clinical teachers were asked to assess each preparation, based on the same criteria as the VR simulator (Table 1). The clinical teachers were blinded to the VR simulator assessment scores, and the project clinical skills lead (JD), who produced the preparations, did not assess them.

Statistical Analysis
The inter-rater reliability (IRR) for assessment scores between the clinical teachers was determined by measuring a free-marginal multirater kappa value, as described by Randolph 31. This test was chosen to account for the fact that the distribution of examiner scores into categories was not restricted. The IRR was calculated per exercise and for each assessed criterion (cavity feature). Exercises that demonstrated low (<0.3) free-marginal multirater kappa scores for IRR were excluded from further agreement analyses with the VR simulator scores.
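A minimal sketch of how a free-marginal multirater kappa can be computed is given below. It follows the general form of Randolph's statistic (observed pairwise agreement corrected by a chance term of 1/k for k categories). The function names, the data layout and the example ratings are assumptions made for illustration and are not taken from the study data.

```python
from collections import Counter
from typing import Sequence

EXCLUSION_THRESHOLD = 0.3  # exercises with kappa below this were excluded


def free_marginal_kappa(
    ratings: Sequence[Sequence[str]], n_categories: int = 2
) -> tuple[float, float]:
    """Return (observed agreement, free-marginal kappa).

    ratings[i] holds the labels that every rater gave to item i, for
    example the 12 yes/no responses for one criterion of one exercise.
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Count ordered pairs of raters that chose the same category.
    agreeing_pairs = 0
    for item in ratings:
        agreeing_pairs += sum(c * (c - 1) for c in Counter(item).values())

    total_pairs = len(ratings) * n_raters * (n_raters - 1)
    p_observed = agreeing_pairs / total_pairs
    p_chance = 1 / n_categories  # free-marginal chance agreement
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


def keep_for_agreement_analysis(kappa: float) -> bool:
    """Apply the exclusion rule: kappa below 0.3 means the exercise is dropped."""
    return kappa >= EXCLUSION_THRESHOLD


# Illustrative data only: two criteria rated by 12 teachers.
example = [["yes"] * 12, ["yes"] * 9 + ["no"] * 3]
p_o, kappa = free_marginal_kappa(example)
```

With only two response categories the chance term is 0.5, so kappa is simply twice the observed agreement minus one.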
In order to validate the VR simulator feedback, pooled clinical teacher assessment responses were compared with the VR simulator responses and percentage agreements were calculated. The mode of clinical teacher responses for each assessment criterion for each exercise was also calculated. This allowed for comparison between the "average" clinical teacher and the VR simulator assessments through percentage agreements.
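The two comparisons described here (pooled responses and the modal response) can be sketched for a single exercise as follows. The function names, the dictionary layout and the example votes are illustrative assumptions. The source does not state how a tied vote among the 12 teachers was handled, so this sketch arbitrarily counts a tie as a disagreement with the simulator.

```python
from collections import Counter
from typing import Mapping, Optional, Sequence


def pooled_agreement(
    teacher_votes: Mapping[str, Sequence[str]], simulator: Mapping[str, str]
) -> float:
    """Percentage of all individual teacher responses that match the simulator."""
    matches = sum(
        vote == simulator[criterion]
        for criterion, votes in teacher_votes.items()
        for vote in votes
    )
    total = sum(len(votes) for votes in teacher_votes.values())
    return 100 * matches / total


def modal_response(votes: Sequence[str]) -> Optional[str]:
    """Most common response, or None when the top two responses are tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def modal_agreement(
    teacher_votes: Mapping[str, Sequence[str]], simulator: Mapping[str, str]
) -> float:
    """Percentage of criteria where the modal teacher response matches the simulator."""
    # Assumption: a tie (mode of None) is counted as a disagreement.
    agreed = sum(
        modal_response(votes) == simulator[criterion]
        for criterion, votes in teacher_votes.items()
    )
    return 100 * agreed / len(teacher_votes)


# Illustrative votes for four of the cavity features (not study data).
votes = {
    "smoothness": ["yes"] * 10 + ["no"] * 2,
    "outline": ["yes"] * 11 + ["no"] * 1,
    "undercut": ["yes"] * 5 + ["no"] * 7,
    "depth": ["yes"] * 4 + ["no"] * 8,
}
simulator = {"smoothness": "yes", "outline": "yes", "undercut": "yes", "depth": "no"}
pooled = pooled_agreement(votes, simulator)
modal = modal_agreement(votes, simulator)
```

The modal comparison discards the spread of opinion within the group, which is why the two percentages can differ for the same exercise.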

Results
The IRR per exercise, calculated as the free-marginal multirater kappa and the percentage of inter-rater (IR) agreement, can be seen in Table 5. The IRR for two exercises (C, D) fell below the 0.30 κfree threshold, and these exercises were subsequently removed from further analyses. The κfree values for the remaining exercises ranged from 0.33-0.77, with the percentage agreement ranging from 66.36-88.48%.

The IRR per assessment criterion (cavity feature), calculated as the free-marginal multirater kappa and the percentage of inter-rater agreement, can be seen in Table 6. The κfree values for the assessment criteria ranged from 0.15-0.58, with the percentage agreement ranging from 57.58-78.79%. The assessment of smoothness of the preparation (κfree 0.58, 78.79%) and the ability to follow the outline (κfree 0.56, 77.88%) demonstrated the highest agreement between clinical teachers, whilst the assessment of undercut (κfree 0.15, 57.58%) and depth (κfree 0.28, 64.09%) demonstrated the lowest agreement between clinical teachers.
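As a consistency check, the reported kappa values and percentage agreements appear to be linked by the free-marginal formula. The working below assumes k = 2 response categories (yes or no) and treats the reported percentage agreement as the observed agreement P_o. Both are inferences from the description of the data rather than formulas stated in this paper.

```latex
\begin{align*}
\kappa_{\text{free}} &= \frac{P_o - \tfrac{1}{k}}{1 - \tfrac{1}{k}} = 2P_o - 1 \qquad (k = 2) \\
\text{smoothness:} \quad 2(0.7879) - 1 &= 0.5758 \approx 0.58 \\
\text{outline:} \quad 2(0.7788) - 1 &= 0.5576 \approx 0.56 \\
\text{depth:} \quad 2(0.6409) - 1 &= 0.2818 \approx 0.28 \\
\text{undercut:} \quad 2(0.5758) - 1 &= 0.1516 \approx 0.15
\end{align*}
```

Under these assumptions, a percentage agreement of 65% corresponds to the 0.3 exclusion threshold.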
The degree to which the pooled clinical teacher assessments agreed with the VR simulator's assessment was then analysed. This is reported as a percentage agreement with the simulator, per exercise. Given that we expected a degree of variance in the clinical teachers' responses, the modal response (agree or disagree) for each assessment criterion and exercise was then compared to the VR simulator assessment (Table 8). These agreements ranged from 20-100% depending on the exercise.
The mean agreement across all exercises was 77.5%. Similar to the pooled data, exercises A (100%) and H (100%) demonstrated high agreement, whilst exercises F (40%) and I (20%) demonstrated the lowest agreement across the two assessors.

Discussion
Currently, there is no published evidence that VR dental simulators are able to provide validated, qualitative feedback in a manner akin to that provided by dental educators in a clinical setting.
Whilst there have been attempts to establish the construct validity of VR dental simulators by comparing the performance of expert and novice dental professionals 11,16-18, it is not clear how useful existing computer-derived quantitative feedback is to students. Repetitive practical experience might result in improvements in the performance of completing a specific task as measured by objective criteria, in the same way that expert dentists might perform better than novices. However, these task-specific percentage scores are more a measure of 'shape agreement' 8, i.e. how well the user can control the handpieces to follow a predetermined pattern. Whilst there may be a degree of demonstrable correlation with this approach 11,16-18,26, this feedback does not relate or translate to other operative clinical tasks or reflect the structural aspects of the construct domain 21.
Carter 32 argues that meaningful and clinically relevant feedback is a vital part of the learning process.
Some examples exist of VR exercises that provide feedback in relation to force application and mirror position 17; however, these are difficult to standardise in a VR system, and the value of this feedback to learners is questionable. Instead, the authors would argue the need for more 'human' or 'clinical teacher-style' feedback that more closely matches the feedback given within a real clinical environment. Further, this approach is more robust pedagogically, as it indicates to the user how they might improve and supports self-assessment and reflection. The importance of this in improving clinical competence was demonstrated by de Peralta et al. 33.
Other authors have used tutors to contribute to the assessment of criterion measurements of their simulators 11,34, by looking for independent corroborative evaluations of performance. However, this paper presents the first example of establishing a measure of external validity of a simulator's feedback approach using the same criteria as used by the simulator itself. The use of 3D prints of the exercise attempts allowed the assessors to evaluate the performances using the tools and approaches that they would normally use in a clinical setting. This facilitates a more authentic feedback process and mitigates the confounding factors which might be caused by the differences between the VR environment and the real world 8.
A high level of agreement was demonstrated between clinical teachers and the simulator after removal of two exercises that had low IRR. As it would not be appropriate to assess simulator agreement with an exercise that a group of experienced dental educators could not agree on, a decision was made by the authors to remove exercises that had low IRR, and a threshold was set at a free-marginal multirater kappa score of <0.3 35. The decision to remove these exercises from further analyses was taken to ensure that these analyses were comparing the simulator assessment with clinical teachers that showed a fair to moderate level of agreement. This point brings to light an unexpected level of poor correlation with some tasks; a point that will require further investigation in the development of these validation criteria. After the removal of exercises C and D, the free-marginal multirater kappa scores for the IRR between clinical teachers demonstrated fair to moderate agreement at a minimum 35.

It is important to highlight that the clinical teachers who assessed the preparations did so in the manner of a routine clinical teaching assessment and were not specifically calibrated to assess these exercises. Whilst calibration of assessors may have led to an increased IRR across all of the exercises and cavity features, the authors felt that calibration for a routine operative dental exercise (assessed against standardised features) would not be representative of a routine assessment of operative skills. Furthermore, a degree of variance is expected between clinical teachers even when assessing preparations against objective criteria, and this phenomenon is reported by Seet et al 36. As such, we expected that obtaining high levels of agreement between the clinicians and the VR simulator would be challenging. Despite this, the results demonstrated a mean agreement across all exercises of 77.5%.
Higher than average (over 80%) agreement between the clinical teachers and the simulator was obtained for exercises A, B, H and J. Interestingly, these exercises demonstrated the extremes of each set of criteria; these results are expected, and suggest that clinical teachers and simulators are more likely to agree when a preparation is more obviously "good" or "bad". Exercises that showed the lowest agreement between clinical teachers and the simulator (I and F), demonstrated similar errors with the preparations. These consisted of the preparations being too deep, having insufficient undercut and not being smooth enough. This finding is in agreement with the IRR scores per cavity feature, and, anecdotally with the authors' experience, that depth and undercut appear to be the most challenging of the criteria to reliably assess. The finding is also in keeping with Seet et al 36 who reported that less obvious features of crown preparations (such as occlusal reduction) resulted in lower inter-rater agreement than features that were more easily assessed (such as marginal width).
In that study, the kappa values reported for IRR were significantly lower, ranging from κ = 0.103 (slight agreement) to κ = 0.399 (fair agreement). The remaining exercises in this study (E, G) showed strong agreement and incidentally only contained one of the two challenging criteria described above (undercut). Finally, the results suggest that it is the more borderline performances that result in greater disagreement between clinical teachers. This is also expected and demonstrates the true value of the simulator scores in these cases: they ensure that consistent feedback is delivered to students. It also highlights the importance of the data analysis thresholds that are set for exercise analysis and feedback.
The statistical methods used in this study were carefully chosen to match a relatively complex data set. A free-marginal multirater kappa (κfree) was used to measure the IRR due to the number of assessors; the commonly used Cohen's kappa is only designed for two raters 35. The κfree was also selected due to the complexity involved in each assessor assessing all five independent criteria (cavity features) per exercise. When comparing clinical teacher and simulator agreements, the use of percentage agreement is a suitable test, and it is particularly useful when the responses are limited to two values (yes or no) 35.
Whilst the results of this study are very promising in terms of showing that a simulator can generate clinically relevant feedback based on assessment criteria comparable to those used by a tutor, this novel method of assessment and feedback is currently limited to a single simple exercise. As such, further research must look to employ this technique across a broader range of exercises that help to develop other complex operative dental skills. This method of objective qualitative assessment and feedback will be of particular value in relation to feedback criteria that typically generate low tutor IRR.
This proof-of-concept study has demonstrated that clinically relevant, qualitative feedback is possible with VR dental simulators. This was achieved by establishing assessment criteria and corresponding qualitative feedback statements for dental operative skills exercises, linking them to measurements on computer systems and subsequently comparing the assessment given by the simulator with that given by dental clinical teachers. The results of this study demonstrated a high level of agreement between clinical teacher assessment and that provided by the VR dental simulator. This suggests that, for the exercise used, it is possible for simulators to reliably assess and provide valid, meaningful and qualitative feedback to students on their performance.

Conclusion
The results of this study demonstrate that it is possible to provide reliable and clinically relevant qualitative feedback via a VR dental simulator. These findings provide a proof of concept for the concurrent validity of VR dental simulator assessment by comparing it to dental educator assessment. Further research should look to employ this technique across a broader range of exercises that help to develop other complex operative dental skills.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
