Clinical trainee performance on task‐based AR/VR‐guided surgical simulation is correlated with their 3D image spatial reasoning scores

Abstract This paper describes a methodology for the assessment of training simulator‐based computer‐assisted intervention skills on an AR/VR‐guided procedure making use of CT axial slice views for a neurosurgical procedure: external ventricular drain (EVD) placement. The task requires that trainees scroll through a stack of axial slices and form a mental representation of the anatomical structures in order to subsequently target the ventricles to insert an EVD. The process of observing the 2D CT image slices in order to build a mental representation of the 3D anatomical structures is the skill being taught, along with the cognitive control of the subsequent targeting, by planned motor actions, of the EVD tip to the ventricular system to drain cerebrospinal fluid (CSF). Convergence is established towards the validity of this assessment methodology by examining two objective measures of spatial reasoning, along with one subjective expert ranking methodology, and comparing these to AR/VR guidance. These measures have two components: the speed and accuracy of the targeting, which are used to derive the performance metric. Results of these correlations are presented for a population of PGY1 residents attending the Canadian Neurosurgical “Rookie Bootcamp” in 2019.


INTRODUCTION
At one level, this paper describes experiences and lessons learned while conducting a neurosurgical training 'bootcamp' (see Figure 1) at Canadian Surgical Technologies and Advanced Robotics (CSTAR), in Canada in 2019. It also describes the overall course objectives and the kinds of training techniques used at each of the AR/VR simulator-based [1,2] training stations that were scheduled for the 53 trainees. It further describes the kinds of simulations, with particular emphasis on the methodologies used for gathering data, the systematic evaluative procedures in place at each station, and the results of analysing these data, which were gathered from 53 PGY1 resident trainees (Figure 1). Moreover, at another, more philosophical level (more interesting for the AECAI audience), this paper carries a narrative and sets up an initial salvo that opens opportunities for discussion at the workshop about what the methodologies should be for evaluating the effectiveness of simulation-based training involving augmented environments for computer-assisted interventions, and especially for neurosurgical procedures [3].

Canadian neurosurgery PGY1 bootcamp training
Since 2013, the Canadian Neurosurgery Society, in collaboration with the Neurosurgery Specialty Committee of the Royal College of Physicians and Surgeons, has sponsored an intensive training camp for first-year post-graduate residents (PGY1) [4,5]. The ratio of instructors to trainees is approximately 1:2, and the trainees can expect to experience a full gamut of training experiences ranging from didactic to pragmatic, from lectures to hands-on low-fidelity box simulators.
This paper considers a subset of these training exercises: those that involve AR/VR-based simulator training scenarios. In particular, these involve external ventricular drain (EVD) placement or, alternatively, endoscopic third ventriculostomy (ETV). Both of these procedures have a main surgical phase that involves the targeting of brain ventricles.

The red herring: Training simulator realism and face validity
When considering the evaluation of a surgical simulator, one often considers the amount of 'realism' that can be designed into the simulation. This paper will argue that this is a red herring and that, in fact, it is generally confused with 'face validity' [6]. For example, if you present a realistic wound to a trainee without a task, and therefore without a test of performance (as if you were left in the room pictured in Figure 2), then its 'face validity' is zero. Face validity means 'on the face of it' (i.e., prima facie): would an expert consider the testing evaluation to be a reasonable one? Similarly, consider an AR/VR-based simulator that displays realistic views of a simulated neurosurgical case. Without a test, 'realism' has nothing to do with the 'face validity' of that simulator [7]. Accordingly, with a 'fully realistic' VR environment and no curriculum, a trainee might just as well be facing a cadaver head without any instructions. To be sure, 'face validity' must be estimated by domain experts based on an assessment of the evaluation 'metrics' for the explicit tasks that have been chosen from the curriculum (not from naive estimates of the 'realism' of a simulator) [8]. To make the point sharper: do the measures of performance on the simulator seem, to an expert, to be reasonable measures of the trainee's performance as related to the real task? That is 'face validity.'

The real problem: How can we move towards integrating the surgical 'process model' with the 'evaluation methodology'?
We will be examining a set of procedures that involve targeting neuroanatomical structures within the brain. The trainees will perform this targeting after viewing the patient's preoperative scans (Figure 3), using mixed reality for surgical guidance [9]. After reviewing the scans, they must form a mental representation of the target within the patient. The trainees will be required to learn how to transform that 2D information into 3D in order to visualize a location for entry (on the surface of the skull), along with a 3D direction vector and a depth of entry into the brain (with the experimental apparatus shown in Figure 4). This central phase of the surgical procedure is quite well-posed, and so measures of the 3D position, the 3D orientation, and the depth of entry are all well-specified quantities that can be gathered as time-stamped log file entries. These log files can be examined line by line in order to form assessments that facilitate inter- and intra-trainee evaluations. Individuals can be compared across their training group and can be compared to data gathered from expert neurosurgeons.

ESTIMATES OF PERFORMANCE IN LOCALIZATION TASKS
In this section, we will expose the numerical relationship between the rankings assigned by clinical experts (neurosurgery consultants) who are training PGY1 residents and the mathematically well-posed distance metric. In the same spirit as the mathematical distance metric, which is derived purely geometrically, a consultant neurosurgeon was asked to assign a subjective ranking of targeting performance to each targeting trial, on a scale of 1 to 6.
The subjective scores were assigned such that, 'if the targeting was placed well within the ventricle, with a good approach angle,' then the score given was 1. If the targeting was in the ventricle but at a poor angle, or inside but close to the ventricle wall, then the score was 2. If the targeting was slightly outside the ventricle, then the score was 3. If the targeting was outside the ventricle and had a poor approach angle, the score was 4. And finally, if the targeting was a wide miss, the score was 5 or 6, depending on whether the approach angle was good or not, according to expert subjective scoring.

FIGURE 2
Surgical simulators can have varied 'realism,' to be sure. This one, in particular, is extremely realistic but has no intrinsic curriculum; a grim demonstration that 'realism' is not the key issue for 'face validity.' Most simulators have a programmed curriculum workflow, and so in fact, it is the validity of the simulator's assessment metrics that can be studied.

FIGURE 3
The skilled task being trained is to view a set of CT slices, and then, on the basis of reasoning about those views, to plan a point of entry on the skull and then introduce the EVD into the ventricular system.

Just as reading a length measure from a ruler can be argued to be either 'subjective' or 'objective,' depending on how much trust is placed in the observer, so can an expert's ranking be considered objective, so long as there is a principled functional relationship from their scores to a metric function. A set of sample expert-assigned subjective scores [10] is illustrated in Figure 5, and full results and analysis are in the Results section.
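The six-level expert rubric described in this section can be written down as a small decision function. The sketch below is illustrative only: the category names, the function name, and the tie-breaking choices (for instance, that a wide miss with a good approach angle receives 5 rather than 6, and that a slight miss with a poor angle receives 4) are assumptions, since the text does not spell them out.

```python
# Hypothetical encoding of the expert rubric (1 = best, 6 = worst).
# Location categories and tie-breaking rules are assumptions for illustration.
LOCATIONS = ("well_inside", "near_wall", "slightly_outside", "wide_miss")


def expert_score(location: str, good_angle: bool) -> int:
    """Map a judged tip location and approach angle to a 1-6 rubric score."""
    if location not in LOCATIONS:
        raise ValueError(f"unknown location: {location!r}")
    if location == "well_inside":
        # Well inside with a good angle is the ideal case.
        return 1 if good_angle else 2
    if location == "near_wall":
        # Inside but close to the wall scores 2 regardless of angle.
        return 2
    if location == "slightly_outside":
        # Assumption: a poor angle moves a slight miss from 3 to 4.
        return 3 if good_angle else 4
    # Wide miss. Assumption: the better angle receives the lower score.
    return 5 if good_angle else 6


if __name__ == "__main__":
    for loc in LOCATIONS:
        print(loc, expert_score(loc, True), expert_score(loc, False))
```

Writing the rubric this way makes explicit where the verbal description leaves room for interpretation, which is exactly where inter-rater disagreement would be expected.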

FIGURE 4
Once the trainee has reasoned about the point of entry and the location of the target, they must move the surgical tool through their estimated 3D entry position along an estimated 3D orientation and depth.

FIGURE 5
After each trial, for each participant, the targeting of the randomly presented ventricular system is displayed as a green line as feedback to the participant (with a blue line showing an ideal trajectory). A screenshot of each trial is saved, and it is subsequently ranked by an expert neurosurgeon. Sample rankings of the index of accuracy (IA) assigned by the neurosurgical expert are shown.

RELATIONS BETWEEN OBJECTIVE METRICS AND EXPERT RANKINGS
Purely geometrical analysis can lead to the implementation of purely objective metrics. However, when considering patient-specific anatomical structures, objective metrics can be infeasible to formulate or compute. Fortunately, subjective metrics, which can be assigned by experts, can be articulated in a way that maintains consistency with the well-formed geometrical objective metrics. This consideration will then lead us to a broader and more over-arching discussion about the parallels between objective metrics and subjective scores as we consider the unavoidable trade-off between 'internal validity' and 'external validity.'

Construction of a purely geometrical metric
Consider Figure 6, which shows a number of measures proposed in the literature that can be used to estimate the accuracy of ventricle targeting: (a) 'engagement,' (b) MDM (distance between point and closest ventricle wall), (c) angulation in the sagittal plane, and (d) angulation in the coronal plane. We take a moment here to justify why 'engagement' may be a very appropriate measure for EVD placement performance and relate it to the classical Euclidean distance metric.
In principle, we would like to construct a metric such that, if the surgeon places the catheter anywhere inside the ventricle, the task has been performed with 'utility' (i.e., in a binary go/no-go sense). Yet questions can still arise about a quantitative score to be attributed to the procedure. Consider the case where the ventricle has been missed: heuristically, being far away from the ventricle should be associated with an error measure, and so the Euclidean distance metric (distance from the centre of the ventricle) is an appropriate measure. However, inside the ventricle, the metric should be 'close to zero' anywhere in the centre of the ventricle, but smoothly increasing as the trajectory approaches the ventricle wall. We will now extend this heuristic to a mathematical formalism as follows: consider $E$ to be the engagement, so that if the target volume were a sphere of radius $R$, we could use $E$ to form a very well-formed distance metric. Through simple geometry, since in the left diagram in Figure 6, $R^2 = (E/2)^2 + d^2$, we have the following expression: $d = \sqrt{R^2 - (E/2)^2}$. This distance metric has very interesting and satisfying properties. By observation, when the trajectory is through the centre, giving maximal 'engagement,' then $D = 0$ (perfect accuracy). As
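The spherical-target relation $R^2 = (E/2)^2 + d^2$ can be turned into a small numerical sketch. The function name and the optional normalization by $R$ are assumptions made for illustration (Figure 7 refers to a normalized distance metric, but the text does not state how the normalization is done); only the geometric relation itself comes from the text.

```python
import math


def distance_from_engagement(engagement: float, radius: float,
                             normalize: bool = True) -> float:
    """Distance of a chord from the centre of a sphere, given its engagement.

    engagement: length E of the trajectory segment inside the sphere.
    radius:     sphere radius R (idealized ventricle; an assumption).
    normalize:  if True, return d / R so the result lies in [0, 1]
                (the normalization is an assumption, not from the source).
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    if not 0 <= engagement <= 2 * radius:
        raise ValueError("engagement must lie between 0 and 2 * radius")
    # From R^2 = (E/2)^2 + d^2; max() guards against tiny negative round-off.
    d = math.sqrt(max(0.0, radius ** 2 - (engagement / 2) ** 2))
    return d / radius if normalize else d


if __name__ == "__main__":
    for e in (2.0, 1.2, 0.0):
        print(e, distance_from_engagement(e, radius=1.0))
```

A trajectory through the centre has engagement $2R$ and distance zero, and a trajectory that only grazes the sphere has engagement zero and distance $R$, which matches the behaviour described for the metric.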

EVALUATION OF BASELINE SPATIAL REASONING SKILLS OF PARTICIPANTS
Our over-arching commentary in this paper is intended to prompt discussion about whether the evaluation of the performance of a participant on a simulator should follow the model of expert rankings and subjective assessments, or whether it should be modeled more on the paradigms of experimental psychology and cognitive science. In the sequel, we will argue that 'construct validity' can only be attained asymptotically, after effortful iterative convergence between both approaches. To begin, we examine a classical methodology for establishing a baseline of user performance for spatial reasoning.

3D spatial reasoning from clinical 2D images
It is not controversial to state that the task of observing 2D slices of neuroanatomical structures and developing a 3D internal representation of those structures forms the foundation for the planning phase of this procedure.The internal representation allows the surgeon to form a plan for burr hole placement, and to move in a way that allows them to introduce the EVD from that location on the skull, at an angle that will allow the tool to intersect the ventricular system.
What is controversial, however, is the nature of this skill: from an information-processing perspective, how are the input images being transformed into a motor plan and an action? It is this skill that is the construct we are trying to measure. Accordingly, this consideration leads to the question: is this form of spatial reasoning a bottom-up process (learnable by trial and error, analogous to some deep-learning prescription), or is it a top-down cognitive process that can be learned through instruction and debriefing? To address this question, we make use of our evaluation methodology to explore the nature of spatial reasoning, within the context of what some would call 'mental rotation.'

Objective methodology for evaluating 3D spatial reasoning from 2D images
We have implemented a computer-based test that makes use of the same set of stimuli introduced by Roger Shepard and Jacqueline Metzler in 1971 [11]. For each trial, participants view a pair of randomly selected images, which are 2D presentations of 3D objects (simple block shapes). The pairs are systematically randomized so that half of the trials are 'Same' and the other half are 'Different.' The participants press the 's' or 'd' key, and the response is recorded along with the response time.
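A balanced, randomized trial list of the kind described here could be generated along the following lines. This is a minimal sketch, not the test software used in the study: the function names, the record fields, and the placeholder stimulus identifiers are all hypothetical.

```python
import random


def build_trials(same_pairs, different_pairs, n_trials, seed=None):
    """Return a shuffled trial list with half 'same' and half 'different' pairs.

    same_pairs / different_pairs: sequences of stimulus-pair identifiers
    (hypothetical placeholders for the image pairs).
    """
    if n_trials % 2 != 0:
        raise ValueError("n_trials must be even to balance the two conditions")
    rng = random.Random(seed)
    half = n_trials // 2
    # Sample with replacement so short stimulus lists still work.
    trials = [("same", p) for p in rng.choices(same_pairs, k=half)]
    trials += [("different", p) for p in rng.choices(different_pairs, k=half)]
    rng.shuffle(trials)
    return trials


def score_response(truth, key, response_time_s):
    """Record one key press ('s' or 'd') against the true trial type."""
    answer = {"s": "same", "d": "different"}[key]
    return {"truth": truth, "answer": answer,
            "correct": answer == truth, "rt_s": response_time_s}


if __name__ == "__main__":
    trials = build_trials(["A-A", "B-B"], ["A-B", "B-C"], n_trials=10, seed=1)
    print(trials)
    print(score_response("same", "d", 2.4))
```

Fixing the seed makes a session reproducible, which helps when the same trial order must be replayed for debriefing.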

Results and analysis
We begin with the analysis of the results of the test of baseline 3D spatial reasoning from 2D images. The table on the left of Figure 8 shows the number of errors made by each participant, along with the mean time to report 'Same' correctly and the mean time to report 'Different' correctly. Across all trials, some participants are very accurate but tend to have longer task times, while other participants are faster (shorter task times) but tend to make more errors, as shown in Figure 9.
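One way to quantify a speed-accuracy trade-off of this kind is to correlate each participant's error count with their mean response time. The sketch below uses a plain Pearson correlation as an assumed choice of statistic (the text does not say which measure was used), and the participant data in the usage example are invented purely to exercise the code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def speed_accuracy_correlation(summary):
    """summary maps participant id -> (error_count, mean_response_time_s)."""
    errors = [e for e, _ in summary.values()]
    times = [t for _, t in summary.values()]
    return pearson_r(errors, times)


if __name__ == "__main__":
    # Invented example values, not data from the study.
    demo = {"P1": (0, 4.0), "P2": (1, 3.0), "P3": (2, 2.0), "P4": (3, 1.0)}
    print(speed_accuracy_correlation(demo))
```

A negative coefficient would indicate that participants who respond faster tend to make more errors, which is the pattern a speed-accuracy trade-off predicts.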
We can also make use of this classical paradigm to test a hypothesis about whether performing a 3D task on the basis of 2D images is cognitive, involving top-down spatial reasoning (an alternative hypothesis to what Shepard and Metzler originally proposed [11]). Many researchers propose that the brain has a mysterious functionality that allows 2D/3D 'analogical' representations to be rotated 'in the head.' The postulated theory says something to the effect of seeing two objects, somehow 'rotating' them so that they overlap, and then checking to see if they are the same or different. One problem with this theory is that no suggestion has been made as to which direction the 3D representation should be rotated in order to check (and you would first need to know in which way they were different in order to know which way to rotate them to check: so, as they say, Catch-22). Aside from that, this analogical theory would predict that once the check has been made, the response at that time would be either 'same' or 'different,' and so the amount of time needed to say 'same' or 'different' would be identical.
Conversely, for the top-down spatial reasoning account, participants would examine the two side-by-side images that represent 3D structures. On this basis, they would deductively (top-down) check parts of the structures to see if there was any reason that they were different. Notice that once there is any evidence that the shapes are different, the participant can respond 'different,' whereas if the two objects are the same, the participant must terminate the checking process on their own. In other words, at some point they would conclude that, with no evidence for a difference, the objects must be the 'same.' Accordingly, this alternate hypothesis would predict that the reaction time to say 'same' would be longer than the reaction time to say 'different.'
The results in Figure 10 show the reaction time, averaged across all participants, as a function of a raw measure of the index of difficulty for each trial pair. (The raw measure is simply the number of errors made across the population of participants: some image pairs were never mistaken by any of the participants, and some image pairs are so difficult that as many as nine participants judged them as same/different incorrectly.) This figure shows that the reaction time is indeed a function of the index of difficulty, as is to be expected. In addition, we see that the correlation is different for 'same' versus 'different' trials: the time to say 'different' differs, on average, by at least 2 s from the time to say 'same.' One observation afforded by this analysis is the response bias among participants: 53 times they said 'same' when the pairs were different, and 29 times they said 'different' when the pairs were the same. In other words, participants are biased to say 'same,' even though the stimulus pairs are balanced evenly. Nevertheless, the difference shown here is problematic for an account based on a putative 'mental rotation' capacity, and it provides more converging evidence for a top-down cognitive account of spatial reasoning skills. This will have implications for the way that we recommend training surgical skills for PGY1 residents.
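The three quantities discussed in this analysis (mean correct response time per trial type, the response bias, and the raw per-pair index of difficulty) can be computed from a flat list of trial records. The following is a minimal sketch; the record layout with the fields `pair_id`, `truth`, `answer`, and `rt_s` is an assumption, and the example records are invented rather than taken from the study.

```python
from statistics import mean


def mean_correct_rt(records):
    """Mean response time (s) of correct trials, split by true trial type."""
    result = {}
    for truth in ("same", "different"):
        times = [r["rt_s"] for r in records
                 if r["truth"] == truth and r["answer"] == truth]
        result[truth] = mean(times) if times else None
    return result


def response_bias(records):
    """Count the two kinds of error: wrongly saying 'same' or 'different'."""
    said_same = sum(1 for r in records
                    if r["truth"] == "different" and r["answer"] == "same")
    said_diff = sum(1 for r in records
                    if r["truth"] == "same" and r["answer"] == "different")
    return {"said_same_when_different": said_same,
            "said_different_when_same": said_diff}


def pair_difficulty(records):
    """Raw index of difficulty: number of errors per stimulus pair."""
    errors = {}
    for r in records:
        errors.setdefault(r["pair_id"], 0)
        if r["answer"] != r["truth"]:
            errors[r["pair_id"]] += 1
    return errors


if __name__ == "__main__":
    # Invented records, only to show the expected input shape.
    demo = [
        {"pair_id": "p1", "truth": "same", "answer": "same", "rt_s": 4.0},
        {"pair_id": "p2", "truth": "different", "answer": "same", "rt_s": 5.0},
    ]
    print(mean_correct_rt(demo), response_bias(demo), pair_difficulty(demo))
```

Restricting the response-time means to correct trials mirrors the table described for Figure 8, which reports the mean time to report 'Same' correctly and 'Different' correctly.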
One motivation for studying user behaviour from a psychophysical perspective is that it can allow a focus on the modes of teaching and learning that are most effective.For example, from a cognitive science perspective, the behaviour of the user is modeled in a top-down fashion.Accordingly, the teaching and learning mode should be one of didactic training, debriefing, and mentorship (rather than low-level, bottom-up repetitive training and perceptual-motor adaptation).

A review of the construct tested with this simulator: 'EVD trajectory planning skill'
There is no broader division in our literature, among approaches to clinical skills training and assessment, than the divide between approaches that prescribe the use of subjective scores and those that prescribe objective metrics. Furthermore, because earlier sections of this paper have raised a discussion point about the use of the terms 'face validity,' 'content validity,' and the elusive 'construct validity,' we offer a controversial point that might help to refocus efforts, at least for discussion amongst participants in this AECAI workshop. On the one hand, 'objective metrics' seem to be desirable because, if well-formulated, they remove the ubiquitous cognitive biases that are typically manifest for both the experimenter and the subject of the experiment. On the other hand, in order to formulate an objective metric, many ideological assumptions must inevitably be asserted; so many, it would seem, that the desire to control the parameters of the simulated environment leads to an explosion of experimental conditions that makes their presentation infeasible (this is the burden that psychophysical investigations bear in the classical literature of experimental psychology and cognitive science). In contrast, when trying to assess the performance of a skill in a reasonably realistic variety of possible scenarios, the assessment of performance is subject to the ill-posed nature of such scenarios, in the sense that there may be several possible reasonable actions. Accordingly, the only recourse for evaluation is to appeal to the subjective scoring provided by a domain expert [12].
It does not seem to stretch the imagination of a reasonable researcher to propose that the distinction between 'objective metrics' and 'subjective scores' is exactly the irreconcilable distinction drawn in the validity literature [13] between 'internal validity' and 'external validity.' As such, we must face an inevitable trade-off, which has always been acknowledged in experimental psychology and cognitive science: the more internal validity 'baked' into your experimental paradigm, the more objective your measures can be. And, on the other hand, the more 'external validity' attributed to your experimental paradigm, the more inevitable the recourse to subjective scores will be. Accordingly, we have an inescapable trade-off that needs to be acknowledged face-on. You cannot have both! Your experimental paradigm cannot feasibly have both 'high internal validity' and 'high external validity.' Your AR/VR-based simulator cannot be 'anatomically faithful' and at the same time control for the index of difficulty induced by 'anatomical variations' across your training set [14][15][16].
In order to try to establish 'construct validity,' there must be iterations between the two paradigmatic styles (not to mention the often missed step of first identifying what you mean by your 'construct' before you try to validate whether your measures are able to provide evidence of such). Iterations between these two styles of experimental paradigm are necessary if they are to establish converging evidence that they are valid measures of the elusive 'construct.' Only through effortful convergence and iterative design and evaluation cycles can the holy grail of 'construct validity' be established. (You certainly cannot establish 'construct validity' by showing that your scores correlate with the PGY rankings of your participant pool. That is merely 'criterion validity,' and should be acknowledged as such: an initial indicator that shows a rough, but necessary, sensitivity of your measures.)

CONCLUSIONS AND DISCUSSION
Our main intent for this paper has been to foster a renewed discussion about the importance of the domain of the 'evaluation of clinical simulators for procedure training.' Our first comment on this topic, which will not be controversial, is that the evaluations should be as objective as possible, and therefore metrics of performance need to be formulated according to effortful and descriptive information-processing models of the task or procedure that is being trained within a clinical curriculum (as demonstrated in Figure 11). While that may seem trite and obvious, many reports in the literature ignore this. Furthermore, by adopting this as a principle, one can still be led to some difficult questions. For example, if your trainees are performing a procedure in which the outcome does not depend on 'path length,' then why would you propose measuring 'path length' just because another study measured it as an explicit constraint on their task? Likewise, if the performance of a procedure does not depend on the force with which you grasp a tool, then why include that as a measure of performance on the task? If performance on the task can be measured on the basis of speed and accuracy, then why measure the characteristics of the heartbeat or the conductance of the sweaty hands of the participant? Or the 'hot spots' of their eye movements? To be sure, each of these measures may (if you are lucky) correlate with the clinician's overall performance, but so might their performance correlate with the proportion of grey hairs on their head, or the price of their car in the parking garage. Each of these measures will probably correlate with performance. But at the end of the day, the only real measure of performance will be to decompose the procedure into phases, and the phases into tasks, and the tasks into sub-tasks, until the leaf nodes are either 'targeting' tasks or 'decision-theoretic choices.' At that point, each phase can be evaluated objectively in terms of speed and accuracy, or (inversely) task time and error rate. Yes, now those are controversial stances, so let us discuss.

FIGURE 1
25 PGY1 residents attended the Canadian Neurosurgical 'Rookie Bootcamp' in 2019, hosted at the Canadian Surgical Technologies and Advanced Robotics (CSTAR) Centre.

FIGURE 6
'Engagement' is a natural objective measure of the success of targeting a ventricle, but in addition, we can show the relationship between 'engagement' and a distance metric, which is then in a form to be used as an error measure.

FIGURE 7
Functional relationship between normalized objective distance metric D (domain) and expert ranking 'index of accuracy' (range), illustrating the central Lorentz transformation within the sphere of radius = 1.

FIGURE 8
After each session on the EVD simulator, participants were tested using the stimuli from Shepard and Metzler's classical task.

FIGURE 9
By comparing the raw response time with the number of error trials, we can demonstrate a classical speed-accuracy tradeoff across the population of participants.

FIGURE 10

FIGURE 11
The scores on EVD performance are correlated with the participants' objective scores on the spatial reasoning task.