Keywords:

  • *education, medical, graduate;
  • clinical competence/*standards;
  • professional practice/*standards;
  • quality control;
  • evaluation studies [publication type];
  • medical history taking;
  • physical examination;
  • judgement;
  • decision making;
  • England

Abstract

Objectives  This study represents an initial evaluation of the first year (F1) of the Foundation Assessment Programme (FAP), in line with Postgraduate Medical Education and Training Board (PMETB) assessment principles.

Methods  Descriptive analyses were undertaken for the total numbers of encounters, assessors and trainees; the mean numbers of assessments per trainee and per assessor; the time taken for the assessments; and the mean score and standard deviation for each method. Reliability was estimated using generalisability coefficients. Pearson correlations were used to explore relationships between instruments. The study sample included 3640 F1 trainees from 10 English deaneries.

Results  A total of 2929 trainees submitted at least one of each of the four methods. A mean of 16.6 case-focused assessments was submitted per F1 trainee. Based on a return per trainee of six of each of the case-focused assessments, and eight assessors for multi-source feedback, 95% confidence intervals (CIs) ranged between 0.45 and 0.48. The estimated time required for this is approximately 9 hours per trainee per year. Scores increased over time for all instruments and correlations between methods were in keeping with their intended focus of assessment, providing evidence of validity.

Conclusions  The FAP is feasible and achieves acceptable reliability, and there is some evidence to support its validity. Collated assessment data should form part of the evidence considered for selection and career progression decisions, although further work is needed to develop the FAP. Such a programme is, in any case, of critical importance to the profession’s accountability to the public.


Introduction

The first 2 years following medical school in the UK are now spent in foundation programmes which aim to provide broad clinical experience with a focus on acute care. In addition, the importance of patient safety and teamwork is emphasised. At the end of the programme there is competitive selection for specialist or general practice training. A competency-based curriculum for the foundation years sets out the expected outcomes, and central to its effectiveness is a robust system of workplace-based assessment (WBA).1

The goals of the Foundation Assessment Programme (FAP) are to: determine fitness to progress to the next stage of training; identify doctors who may be in difficulty; and provide focused feedback to all trainees in keeping with a quality improvement model. A blueprinting exercise against the Foundation curriculum and Good Medical Practice2 made it apparent that no single measure would meet all of these goals. Consequently, four complementary methods of assessment and feedback were employed: multi-source feedback using the mini-peer assessment tool (mini-PAT); the mini-clinical evaluation exercise (mini-CEX); case-based discussion (CbD); and direct observation of procedural skills (DOPS).3–6

Although the evaluation of individual assessment instruments is well established, there is less experience with systems of assessment. To fill this void, the Postgraduate Medical Education and Training Board (PMETB), a statutory body responsible for quality assurance (QA), defined nine principles against which the FAP system should be judged.7 Among these, three were immediately satisfied (i.e. fitness for a range of purposes, content based on Good Medical Practice,2 and the provision of relevant feedback). In recognition of the fact that 3–5 years of data would be required to fully evaluate the system, initial analyses were considered important to identify emerging strengths and weaknesses. This paper reports these initial QA analyses, using data for first-year (F1) trainees collected from August 2005 to July 2006. Specifically, it presents data on the validity, reliability and feasibility of the methods, their relationships with one another, and important issues that emerged.

Methods

Instruments and procedures

Multi-source feedback, the mini-CEX, CbD and DOPS required that trainees be assessed in a number of specific areas, as well as globally. For example, mini-CEX assessors were asked to grade trainees in six areas (history taking, physical examination skills, clinical judgement, professionalism, organisation and efficiency) as well as in overall clinical care. All four methods captured these assessments on a 6-point scale where 4 = meets expectations for completion of F1. Selection of a 6-point scale was based both on a review of the literature8 and a desire to facilitate feedback. As not all elements can be assessed on every occasion, there is also an ‘unable to comment’ (U/C) option. Assessors can use this option when, for example, they only observe a trainee explaining a diagnosis to a patient and are therefore unable to comment on physical examination skills.

Trainees were expected to participate in two rounds of the mini-PAT, nominating eight assessors from a range of health care colleagues on each occasion. Based on work which did not show any difference between self-nominated raters and those nominated by a senior colleague, trainees selected their own assessors for the mini-PAT.9 The trainee’s self-rating was also collected for each mini-PAT round. The process was managed centrally and the collated feedback was provided on a chart which also showed the trainee’s self-rating and the performance of a large cohort of peers, in a meeting facilitated by the trainee’s supervisor. Free-text comments were fed back verbatim but were not attributed to individual assessors.

All trainees were asked to undertake six each of mini-CEX, CbD and DOPS assessments by mid-June 2006. They were encouraged to spread the assessments over the year and were reassured that some low scores early in the year were to be expected. These methods were administered using triplicate carbonised pads. One copy was returned to a central location for scanning, one copy was retained in the trainee’s portfolio and one copy was kept by the trainee’s educational supervisor. Immediate feedback was provided at the end of each encounter and the form was designed to facilitate this.

Recognising that assessor variability and content specificity represent the greatest threats to reliability, trainees were asked to use a different assessor and cover a different clinical problem for each assessment event.10 Assessors were asked to record both the time taken for observation and the time taken to deliver feedback. They then rated their own level of satisfaction with the process (rather than the trainee’s performance) on a 10-point scale, where 1 = not at all satisfied and 10 = highly satisfied. A 10-point scale was chosen for this question to allow comparison with equivalent data collected in work involving the mini-CEX elsewhere.6

To facilitate standardisation and QA, 10 deaneries utilised a central data management system for the FAP. All forms were either scanned or directly downloaded into a Structured Query Language (SQL) database. Data collected for each encounter included the complexity of the case, the clinical setting and the occupational group of the assessor; this information was recorded on the forms. Basic demographic data for each trainee, such as ethnicity, gender and university of qualification, were also gathered to allow for exploration of potential sources of bias.

Analyses

All data were anonymised and analyses were undertaken using SAS (SAS Institute, Inc., Cary, NC, USA) and SPSS Version 14.0 (SPSS, Inc., Chicago, IL, USA). Analyses were conducted at the level of the individual assessment (form), the trainee, and the assessor. Clearly, complete analysis of this dataset is beyond the scope of any one study. Given their central importance to the QA of an assessment system, this paper focuses on feasibility, reliability and validity.11

For each method, mean scores and standard deviations (SDs) were calculated for each encounter. They were based on all items on the form excluding the global rating.
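
As an illustration of this scoring rule, the minimal sketch below computes the aggregate score for a single hypothetical mini-CEX form: all rated items are averaged, while the global rating and any ‘unable to comment’ responses are excluded. The item names and the use of None to encode ‘unable to comment’ are assumptions for illustration, not the actual data format.

    # Minimal sketch (illustrative, not the study's code): aggregate score for one
    # form = mean of all rated items, excluding the global rating and any
    # 'unable to comment' (U/C) responses, here encoded as None.

    from statistics import mean

    # Hypothetical mini-CEX form on the 6-point scale
    # (4 = meets expectations for completion of F1).
    form = {
        "history_taking": 5,
        "physical_examination": None,      # U/C - not observed in this encounter
        "clinical_judgement": 4,
        "professionalism": 5,
        "organisation_efficiency": 4,
        "overall_clinical_care": 5,        # global rating - excluded from the mean
    }

    def encounter_mean(form, global_item="overall_clinical_care"):
        """Mean of rated items, excluding the global rating and U/C items."""
        rated = [score for item, score in form.items()
                 if item != global_item and score is not None]
        return mean(rated)

    print(round(encounter_mean(form), 2))  # 4.5 for the example above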

Total number of encounters, assessors, trainees, mean number of assessments per trainee, mean number of assessments per assessor, and median time taken for the assessments (observation and feedback) were used to explore feasibility. Mean assessor satisfaction scores were also calculated. For CbD, mini-CEX and DOPS, the percentages of assessors who received face-to-face training, web-based training, read the guidelines, or did not respond to the question were also calculated.
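
The sketch below shows the kind of encounter-level summary described above, computed with pandas from a hypothetical export of the central database. The file name and column names (trainee_id, assessor_id, method, obs_minutes, feedback_minutes, assessor_satisfaction) are assumptions, not the programme’s actual schema.

    # Illustrative feasibility summary per method from a hypothetical
    # encounter-level table (one row per submitted form).

    import pandas as pd

    encounters = pd.read_csv("encounters.csv")   # hypothetical export

    for method, grp in encounters.groupby("method"):   # e.g. mini-CEX, CbD, DOPS
        print(method, {
            "total_forms": len(grp),
            "trainees": grp["trainee_id"].nunique(),
            "assessors": grp["assessor_id"].nunique(),
            "mean_forms_per_trainee": round(grp.groupby("trainee_id").size().mean(), 2),
            "mean_forms_per_assessor": round(grp.groupby("assessor_id").size().mean(), 2),
            "median_observation_min": grp["obs_minutes"].median(),
            "median_feedback_min": grp["feedback_minutes"].median(),
            "mean_assessor_satisfaction": round(grp["assessor_satisfaction"].mean(), 2),
        })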

Reliability was explored using 95% CIs based on generalisability theory.12,13 Generalisability theory systematically quantifies errors of measurement in educational tests. For this study, variance components were calculated based on a random-effects, encounter-within-assessor-within-trainee design. In an ideal study, it is possible to include the effects of facets such as hospitals, deanery and occasions to determine how much each contributed to measurement error. In this study, however, trainees and assessors were nested within hospitals, many had only one observation, and the encounters were not scheduled regularly throughout the year. Consequently, these results are subject to various potential biases and should ideally be replicated in a controlled setting.

The variance components were calculated using the VARCOMP procedure (SAS Institute, Inc.) with the MINQUE(0) method of estimation. For each method, 25 random samples of roughly half of the encounters were drawn. For each random sample, variance components were calculated for trainees (the variability in ratings that would occur if each trainee were examined by a large number of assessors while seeing a large number of patients), assessors within trainees (the within-assessor variation that would occur if each assessor examined a large number of trainees while seeing a large number of patients), and the residual (the within-trainee variation that would occur over a very large number of encounters with different patients and assessors). These variance components were averaged over the 25 random samples and their SD offers an index of each component’s stability.
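
The analysis itself was run in SAS; the sketch below is a simplified analogue in Python, intended only to make the nested design concrete. It fits a single model with a random intercept for trainee and a variance component for assessor nested within trainee using REML (statsmodels), rather than MINQUE(0) in PROC VARCOMP, and it omits the 25 random half-samples over which the study averaged. Column names are assumptions.

    # Simplified analogue (not the study's SAS code): trainee, assessor-within-
    # trainee and residual variance components for one method, estimated by REML.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("mini_cex_encounters.csv")   # hypothetical encounter-level export

    model = smf.mixedlm(
        "score ~ 1",
        data=df,
        groups="trainee_id",                            # random intercept for trainee
        vc_formula={"assessor": "0 + C(assessor_id)"},  # assessor nested in trainee
    )
    fit = model.fit(reml=True)

    var_trainee = float(fit.cov_re.iloc[0, 0])    # between-trainee variance
    var_assessor = float(fit.vcomp[0])            # assessor-within-trainee variance
    var_residual = float(fit.scale)               # residual (encounter-level) variance
    print(var_trainee, var_assessor, var_residual)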

These data were used to generate the 95% CIs reported in this study. In keeping with its use in biostatistics, the CI has the advantage of addressing both psychometric and practical issues simultaneously. To obtain the CIs for a trainee’s total score, the error variance (assessor : trainee plus residual) was divided by the number of encounters (one to 14 encounters), and the square root was taken and multiplied by 1.96. Adding or subtracting this from a trainee’s mean rating produces the range within which the trainee is expected to fall 95 times, if independent reassessment were to occur 100 times.
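
The calculation described above can be written out directly. In the sketch below, the assessor : trainee and residual variances are taken from Table 6; the function itself is illustrative.

    # 95% CI half-width = 1.96 * sqrt((assessor:trainee variance + residual
    # variance) / number of encounters), using the components in Table 6.

    from math import sqrt

    variance = {  # (assessor : trainee, residual)
        "CbD":      (0.263, 0.056),
        "mini-CEX": (0.251, 0.064),
        "DOPS":     (0.282, 0.080),
        "mini-PAT": (0.142, 0.200),
    }

    def ci_half_width(method, n_encounters):
        assessor_var, residual_var = variance[method]
        return 1.96 * sqrt((assessor_var + residual_var) / n_encounters)

    for n in (4, 6, 8, 12):
        print(n, {m: round(ci_half_width(m, n), 2) for m in variance})
    # For n = 6 this gives roughly 0.45 (CbD), 0.45 (mini-CEX), 0.48 (DOPS) and
    # 0.47 (mini-PAT), matching Table 5.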

The percentage of individual aggregate assessment (form) ratings of < 4 was calculated for each instrument, and separately for the first and second rounds of the mini-PAT. Relationships amongst the methods were explored using Pearson’s correlation, with the goal of determining whether the measures provided redundant information.
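
A minimal sketch of that correlation analysis, assuming the same hypothetical encounter-level table as above: scores are first averaged to one value per trainee per method, and the trainee-level means are then correlated.

    # Trainee-level mean score per method, then pairwise Pearson correlations
    # (the structure reported in Table 4). Column names are assumptions.

    import pandas as pd

    encounters = pd.read_csv("encounters.csv")   # hypothetical export

    trainee_means = (
        encounters
        .groupby(["trainee_id", "method"])["score"].mean()
        .unstack("method")                       # one column per method
    )
    print(trainee_means.corr(method="pearson"))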

As preliminary evidence to support validity, mean scores were calculated for assessments carried out in the first half of the training year and for those performed in the second half of the year for CbD, DOPS and the mini-CEX. For the same purpose, mean scores for all trainees were calculated for the first and second rounds of the mini-PAT.

Results

At least one assessment was submitted by 3640 trainees. Table 1 presents data analysed at the level of the encounter (i.e. the form) for the three case-focused methods (the mini-CEX, DOPS and CbD). The total number of assessments was 60 512, with a mean of 5.3, 5.2 and 6.2 assessments per trainee for the mini-CEX, CbD and DOPS, respectively. Mean scores (SD) at the level of the trainee were 4.88 (0.38), 4.77 (0.37) and 4.98 (0.40) for the mini-CEX, CbD and DOPS, respectively. Mean scores (SD) for the mini-PAT were 4.6 (0.5) for round 1 and 4.7 (0.5) for round 2. Overall, a mean of 16.6 case-focused assessments was submitted per F1 trainee, 40% of which were submitted centrally in May and June 2006.

Table 1.  Number of assessments by both trainee and assessor for each method and mean (SD), maximum and minimum aggregate scores at the level of the assessment

                                              Mini-CEX       CbD            DOPS
  Total number of assessments*                19 102         18 710         22 700
  Trainees                                    3592           3595           3640
  Mean assessments undertaken per trainee     5.3            5.20           6.23
  Raters                                      8728           9125           8701
  Mean assessments undertaken per rater       2.2            2.05           2.61
  Mean (SD) score by individual assessment    4.89 (0.62)    4.78 (0.62)    4.98 (0.68)
  Minimum                                     1.5            1.0            2.33
  Maximum                                     6.0            6.0            6.0

  * Total number of assessment forms submitted
  Mini-CEX = mini-clinical evaluation exercise; CbD = case-based discussion; DOPS = direct observation of procedural skills; SD = standard deviation

Table 2 shows the time taken for the assessments. Total time is equivalent to a median time of 8.4 hours per trainee per year to undertake and receive feedback on six each of the CbD, DOPS and mini-CEX assessments and to complete two rounds of the mini-PAT with eight assessors.

Table 2.  Time taken for assessments

                                              Mini-CEX      CbD           DOPS
  Median time for observation/discussion      15 minutes    15 minutes    10 minutes
  Median time for feedback                    10 minutes    10 minutes    5 minutes

                                              Mini-PAT
  Median time for form completion             7 minutes

  Mini-CEX = mini-clinical evaluation exercise; CbD = case-based discussion; DOPS = direct observation of procedural skills; Mini-PAT = mini-peer assessment tool

Mean assessor satisfaction scores (SD) for the mini-CEX, CbD and DOPS were 7.1 (1.78), 7.26 (1.71) and 7.34 (1.96), respectively. Of the assessors, 22%, 36% and 32% had received face-to-face training for the DOPS, CbD and mini-CEX assessments, respectively, whereas 43%, 37% and 38%, respectively, had only read the written guidelines. Only 2% of the assessors for each instrument had used the web-based training, but 33%, 23% and 27%, respectively, of assessors did not complete this question.

At least one of each of the four methods was submitted by 2929 trainees. However, 40% of these assessments were submitted in the last 6 weeks before the end-of-academic-year deadline. For the case-focused methods, the percentage of unsatisfactory encounters, as judged by mean scores of < 4, was 2.7% for CbD, 1.9% for DOPS and 1.1% for the mini-CEX. As shown in Table 3, mean (SD) scores were higher for the second half of the year than for the first for all methods. Moreover, comparison of round 1 of the mini-PAT with round 2 shows that the percentage of trainees with an aggregate rating of < 4 was higher in the first (13.1%) than the second (7.6%) round.

Table 3.  Mean (standard deviation) score for first and second halves of the year

                 Mini-CEX       CbD            DOPS           Mini-PAT*
  First half     4.82 (0.62)    4.72 (0.64)    4.96 (0.68)    4.6 (0.5)
  Second half    4.94 (0.59)    4.84 (0.59)    5.03 (0.63)    4.74 (0.46)

  * Represents mini-PAT 1 and mini-PAT 2
  Mini-CEX = mini-clinical evaluation exercise; CbD = case-based discussion; DOPS = direct observation of procedural skills; Mini-PAT = mini-peer assessment tool

Table 4 shows the Pearson correlation among the methods. The DOPS method correlates least highly with the other three measures (all < 0.40). The mini-CEX and CbD have the highest correlation (0.62) with each other. The mini-PAT has comparable correlations with the mini-CEX (0.45) and CbD (0.46).

Table 4.  Pearson correlations between methods

              CbD     DOPS    Mini-CEX    Mini-PAT
  CbD                 0.35    0.62        0.46
  DOPS                        0.39        0.32
  Mini-CEX                                0.45
  Mini-PAT

  Mini-CEX = mini-clinical evaluation exercise; CbD = case-based discussion; DOPS = direct observation of procedural skills; Mini-PAT = mini-peer assessment tool

Table 5 shows the 95% CI for four to 12 encounters or assessors for each of the four methods, based on the variance components reported in Table 6. For each method, the 95% CI decreases as the number of assessors or encounters increases. For the recommended minimum of six interactions, the 95% CI is 0.45 for both the mini-CEX and CbD, 0.48 for DOPS and 0.47 for the mini-PAT.

Table 5.  95% confidence intervals for each of the four methods for four to 12 cases

              CbD     Mini-CEX    DOPS    Mini-PAT
  4 cases     0.55    0.55        0.59    0.57
  6 cases     0.45    0.45        0.48    0.47
  8 cases     0.39    0.39        0.42    0.40
  12 cases    0.32    0.32        0.34    0.33

  Mini-CEX = mini-clinical evaluation exercise; CbD = case-based discussion; DOPS = direct observation of procedural skills; Mini-PAT = mini-peer assessment tool

Table 6.  Variance estimates used in the calculation of 95% confidence intervals

                        CbD              Mini-CEX         DOPS             Mini-PAT
  Trainee               0.059 (0.003)    0.061 (0.005)    0.071 (0.006)    0.117 (0.009)
  Assessor : Trainee    0.263 (0.027)    0.251 (0.020)    0.282 (0.018)    0.142 (0.290)
  Error                 0.056 (0.025)    0.064 (0.019)    0.080 (0.018)    0.200 (0.289)

  Mini-CEX = mini-clinical evaluation exercise; CbD = case-based discussion; DOPS = direct observation of procedural skills; Mini-PAT = mini-peer assessment tool

The use of G coefficients is less advantageous, both psychometrically and practically: the desired level of reliability is often set arbitrarily and dictates the sampling required of all participants. The use of 95% CIs is advantageous in terms of feasibility. For doctors who are doing well, who will represent the majority, smaller numbers of assessments are required to achieve a result that can be placed above the cut score with confidence. For the smaller number of doctors around or below the cut score, more assessments will be required. This supports feasibility by directing more resources toward those individuals who are potentially struggling and away from the majority who are doing well. However, it would require a more psychometrically supported implementation process.
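
To make this argument concrete, the sketch below (illustrative only, and not part of the programme’s decision machinery) asks how many encounters are needed before a trainee’s 95% CI excludes a hypothetical cut score of 4, using the CbD error variance from Table 6. A trainee scoring well clears the cut score after very few encounters, whereas a borderline trainee needs far more.

    # Illustrative only: smallest number of encounters for which the 95% CI
    # around a trainee's mean rating excludes a hypothetical cut score of 4.0,
    # using the CbD error variance (assessor : trainee + residual) from Table 6.

    from math import sqrt

    ERROR_VAR = 0.263 + 0.056   # CbD: assessor-within-trainee + residual variance
    CUT_SCORE = 4.0             # 4 = meets expectations for completion of F1

    def encounters_needed(trainee_mean, max_n=30):
        for n in range(1, max_n + 1):
            half_width = 1.96 * sqrt(ERROR_VAR / n)
            if abs(trainee_mean - CUT_SCORE) > half_width:
                return n
        return None             # still uncertain after max_n encounters

    print(encounters_needed(4.9))   # doing well: 2 encounters suffice
    print(encounters_needed(4.3))   # borderline: 14 encounters are needed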

Discussion

This paper reports preliminary results for the FAP. The implementation strategy and analysis generate a number of key messages that are important for any future programme development.

Assessment as a quality improvement model

Firstly, the FAP represents an important example of assessment in the workplace as a programme, rather than as an individual instrument or event.14 In its design as a conceptual assessment framework, the FAP is based firmly on a curriculum. The quality improvement model embeds the provision of feedback within its design, and it was therefore reassuring that a third of the total time taken up by the case-focused instruments was spent giving feedback. Quality improvement models also require the ability to identify doctors in difficulty. Such doctors were identified here, but in relatively small overall percentages, especially for DOPS. Following these individuals as they progress through training will be important to explore their outcomes and will provide opportunities to examine consequential validity.

Change management

A number of issues arise from the challenges experienced with change management. Overall high scores may reflect unfamiliarity with the system and assessors’ reluctance to be responsible for giving a trainee a poor assessment. It is important for assessors to recognise that their individual assessment is only part of an overall trainee profile and that failure to record and feed back areas for development is detrimental to patient safety, the trainee and the professionalism of the assessor.

The FAP was implemented in a very short time-frame in response to a central mandate, and there was understandably significant concern about feasibility and the time it would require. Despite this, a mean of 16.6 case-focused assessments was submitted by each F1 trainee, although 40% of these were submitted in the last 6 weeks. It is likely that this reflects anxiety about receiving low scores early in the year. Although the programme explicitly states that some scores of < 4 are to be expected early in the year, this represents a major cultural shift in assessment.

Change supported by training

A number of national training days were run and materials were provided to facilitate local cascading of this training. However, it was recognised that it would not be possible to train every assessor by August 2005. In reality, only about a third of assessors received face-to-face training and another third said they had read the guidelines. It will take some time for all assessors to be adequately trained, given the numbers involved, although this will be helped by the introduction of similar assessments at more senior levels. It is also important that training is directed at all the health professionals involved in assessments and that it includes senior trainees and nurse specialists. In order to fully meet the PMETB principles, not only will assessors need to be trained, but there will also need to be systematic processes in place to provide them with feedback on their performance. A programme of certified assessors may be desirable, as assessor bias, particularly leniency, is the most important factor undermining the validity of WBA. Centralised assessment management would allow for the provision of feedback to assessors and the identification of outliers; this will constitute part of future work.

Feasibility

The median time to undertake assessments and provide feedback equates to about 1 hour per month for each trainee, allowing for some time to find cases, fill in forms, etc. Although this may seem a minimal time investment from a public accountability and patient safety perspective, the time burden nevertheless remains a concern, especially once widespread WBA at all levels is implemented. The provision of high-quality assessment requires resources in terms of funding to implement, manage and evaluate, as well as time to undertake assessments effectively. Recognition of the investment of clinical staff time is essential: it is the only way to ensure that all assessment is delivered to a standard which will see patients protected, trainees in difficulty identified and the public reassured.

Validity

Determining whether or not the assessment methods are valid – whether they assess what they are supposed to – is difficult. Evidence of validity comes from two main sources in this cohort: the relationships among the instruments, and change in scores over time.

Correlations among methods were estimated to explore hypotheses based on the intended focus of assessment as evidence for construct validity. For example, the mini-PAT, where humanistic aspects of performance are particularly important, would be expected to correlate least well with DOPS, which focuses on technical expertise. The mini-CEX and CbD would be expected to correlate more highly with one another than with DOPS. Our findings support these hypotheses.

For instruments which measure a range of aspects of performance, an improvement in scores over the course of the year is expected. For the case-focused instruments, such an increase in ratings over time was observed. In addition, the mean score for the first mini-PAT round was lower than that for the second round and the percentage of ratings with an aggregate score of < 4 fell from 13.1% to 7.6%.

Decision making

Widespread use of WBA to inform high-stakes decisions about progression in training is new to postgraduate medical training. An important message for appraisers and course leaders is that a borderline trainee will need more assessments than a trainee who is doing very well or very poorly in order to be sure on which side of the satisfactory line he or she falls. Increasing the number of assessments reduces the 95% CI and thus reduces the uncertainty associated with the collated judgement. An evaluation of factors that might influence scores, such as assessor gender, ethnicity or occupation, will need to be undertaken, but is beyond the scope of this initial evaluation. Such an evaluation should also include the influence of trainee progression: any correctly placed ‘disagreement’ between assessors about trainee performance over time is simply included in the assessor-within-trainee variance in this study, when in fact it may represent real and hoped-for improvement in performance. Taking assessments clustered in time might improve reliability.

However, decisions about progression in the FAP are based on evidence from all the assessors and contexts. In this cohort, trainees had a mean of just over 16 case-focused assessments plus the two rounds of the mini-PAT, which provides evidence for triangulation.14 This evidence should include both quantitative (scores) and qualitative (free-text) information, enhancing its value in informing an overall judgement by providing a richness of data. Further work is needed on how to undertake this synthesis, but it ultimately relies on the expert judgements of experienced supervisors as to whether the combined evidence represents satisfactory performance.15 Only the local programme gives a complete picture of a trainee’s performance and of possible confounding factors such as ill health or other personal issues.

It will never be possible to entirely standardise WBA owing to its opportunistic nature. However, by ensuring that assessment processes are as standardised as possible, by including a core of common assessments for all trainees, and by centralising data and maintaining robust QA, it should be possible to make the process as fair as possible. Ultimately it is essential that the evidence underpinning decisions is as robust as possible, that the decision-making process is transparent, and that plans for remediation (including further assessments) are explicit. Where there is cause for concern about a trainee, additional assessment data are likely to be needed (such as the use of video to explore the nature of a communication problem).

Conclusions

Despite these issues, preliminary QA is reassuring. It must be recognised, however, that significant work remains to fully demonstrate that the FAP meets both its stated purposes and the requirements of the relevant national training board. Detailed evaluation of the individual methods is required and will be published elsewhere.5

The FAP represents an important step forward in WBA. Despite the many difficulties generated by the fact that it takes place in the real world, it most closely represents what doctors actually do on a day-to-day basis, which is of critical importance in terms of public accountability.

Contributors:  HD, JA, LS and JN were all involved in the design of the Foundation Assessment Programme. LS led the design of CbD, based on work undertaken to design the performance procedures at the General Medical Council (GMC). JA designed the mini-PAT as part of an ongoing programme of work on multi-source feedback for his PhD thesis. Data were collected, using scanning and online systems developed by HD and JA, from trainees within participating deaneries. HD and JN undertook the analyses. HD and JA wrote the paper, with assistance from the co-authors.

Acknowledgements:  the authors thank the members of the research team, Healthcare Assessment and Training (HcAT), based in Sheffield Children’s Hospital Foundation NHS Trust.

Funding:  none.

Conflicts of interest:  none.

Ethical approval:  the assessment programme was implemented as part of programme requirements within participating Deaneries. This paper reports the assessment quality assurance exercise.

References
