Evaluation of three short-listing methodologies for selection into postgraduate training in general practice


Professor Fiona Patterson, c/o Executive Assistant Sarah Marshall, Department of Psychology, City University, Northampton Square, London EC1V 0HB, UK. Tel: 00 44 115 940 7978; Fax: 00 44 115 940 3000; E-mail: f.patterson@city.ac.uk


Objective  This study aimed to evaluate the effectiveness and efficiency of three short-listing methodologies for use in selecting trainees into postgraduate training in general practice in the UK.

Methods  This was an exploratory study designed to compare three short-listing methodologies. Two methodologies – a clinical problem-solving test (CPST) and structured application form questions (AFQs) – were already in use for selection purposes. The third, a new situational judgement test (SJT), was evaluated alongside the live selection process. An evaluation was conducted on a sample of 463 applicants for training posts in UK general practice. Applicants completed all three assessments and attended a selection centre that used work-related simulations at final stage selection. Applicant scores on each short-listing methodology were compared with scores at the selection centre.

Results  Results indicate that the structured AFQs, CPST and SJT were all valid short-listing methodologies. The SJT was the most effective independent predictor. Both the structured AFQs and the SJT added incremental validity over the use of the CPST alone. Results show that optimum validity and efficiency are achieved using a combination of the CPST and SJT.

Conclusions  A combination of the CPST and SJT represents the most effective and efficient battery of instruments as, unlike AFQs, these tests are machine-marked. Importantly, this is the first study to evaluate a machine-marked SJT to assess non-clinical domains for postgraduate selection. Future research should explore links with work-based assessment once trainees are in post to address long-term predictive validity.


Unprecedented changes in postgraduate medical training in the UK, as a result of the Modernising Medical Careers initiative,1 have placed more emphasis on the methodologies used for selecting doctors into specialty training.2,3 Almost all large-scale recruitment uses a multi-stage process to progressively reduce the applicant pool.4 This paper compares the effectiveness and efficiency of three methods used to short-list applicants for selection into postgraduate training. Specifically, this paper compares the effectiveness of a machine-marked situational judgement test (SJT) targeting non-clinical domains with the effectiveness of traditional selection methodologies. The results will directly inform optimal selection practices for postgraduate training across all medical specialties, which has been a topic of fierce debate. We present these data to encourage further analysis within the scientific community and to develop a future evidence-based research agenda for selection methodology.

As with any selection methodology, various psychometric and legal criteria must be satisfied, including those pertaining to standardisation, reliability, validity and fairness.3,5 The method must discriminate fairly between applicants and ensure few ‘false negative’ decisions (i.e. where applicants who are potentially successful at interview are rejected). Processes must be cost-efficient, especially with large numbers of applicants, where recruiter time and administrative efficiency are important. The methodology should be acceptable to applicants so that procedural fairness is maintained.5

Recruitment into general practice in the UK

This study evaluates three short-listing methodologies for selection into postgraduate training in general practice (GP) in the UK. The selection system is designed to process several thousand applicants per year6,7 and the methodology comprises three stages. Stage 1 consists of long-listing eligibility checks. Stage 2 comprises short-listing via two instruments. The first is a clinical problem-solving test (CPST), a machine-marked test developed from an existing item bank in which the applicant applies clinical knowledge to solve a problem reflecting a diagnostic process or to develop a management strategy for a patient. The second is a set of structured application form questions (AFQs): open-ended questions targeting non-clinical domains in the person specification, such as empathy or integrity, to which applicants provide a word-limited response.8 Once short-listed, applicants attend Stage 3, a selection centre (SC) using work-relevant simulations (patient consultation, group and written exercises) to target both clinical and non-clinical domains from the person specification. This methodology has been validated.7 However, the short-listing process (Stage 2) is relatively resource-intensive, especially with large-volume recruitment.

Typically, 5–10% of applicants are rejected at long-listing and a further 20–30% are selected out at short-listing. This paper reports on the evaluation of a new short-listing methodology, a situational judgement test (SJT) targeting non-clinical domains, where applicants are presented with written depictions of scenarios they may encounter at work and asked to identify an appropriate response from a list of alternatives. The evaluation considered whether this new approach could improve efficiency by replacing the hand-marked AFQs (which require 30 minutes of assessor time per applicant in addition to invigilation) with a machine-marked measure, using a previously agreed scoring frame.

Recent reviews indicate that the predictive and incremental validity of SJTs is well established.9–13 Although the SJT methodology has been validated for use in medical school admissions,14 this is, to the authors’ knowledge, its first application in postgraduate specialty selection. Here, the SJT focused on three non-clinical selection criteria: empathy, integrity and coping with pressure. Table 1 provides two example items and illustrates different response formats (ranking and multiple best answer).

Table 1.   Example situational judgement test items using ranking and multiple best-answer response formats

1. A 55-year-old woman with ischaemic heart disease has smoked 20 cigarettes per day for 40 years. She requests nicotine replacement patches. She has had these previously but has been inconsistent in their use and has often continued to smoke while using the patches.
Rank in order the following immediate actions in response to this situation (1 = most appropriate; 5 = least appropriate)
  A. Emphasise the dangers of smoking but do not prescribe
  B. Enquire about the difficulties she has with stopping smoking and any previous problems with patches
  C. Insist on a period of abstinence before prescribing any further patches
  D. Prescribe another supply of patches and explain how they should be used
  E. Suggest that nicotine replacement therapy is not suitable for her but explore alternative therapies

2. You are looking after Mrs Sandra Jones, who is being investigated in hospital. You are asked by her family not to inform Mrs Jones if the results confirm cancer.
Choose the THREE most appropriate actions to take in this situation
  A. Ignore the family’s wishes
  B. Agree not to tell Mrs Jones
  C. Explain to the family that it is Mrs Jones’ decision
  D. Ask Mrs Jones whether she wishes to know the test results
  E. Ask Mrs Jones whether she wishes you to inform the family
  F. Inform Mrs Jones that her family do not wish her to have the results
  G. Give the results to the family first
  H. Give the results to the next of kin first

Table 2 provides a summary of the content, scoring process and related properties of each of the three short-listing methodologies. All are written tests, completed independently and under invigilated conditions. By comparing the predictive validity and resource requirements of each, the purpose was to identify the most effective methodology for future use.

Table 2.   Overview of three short-listing methodologies

Clinical problem-solving test (CPST)
  Item content and response format: The CPST evolved from an existing test of clinical knowledge developed by the North Western Region in the UK. Operational papers have around 100 questions and a completion time of 90 minutes. Each paper contains a mix of extended-match and single best-answer formats covering the range of clinical areas defined by the UK training curriculum.
  Scoring process: Papers are machine-marked according to the agreed key.

Structured application form questions (AFQs)
  Item content and response format: Seven AFQ items are drawn from an existing item bank. Each question targets a single non-clinical competency domain, including empathy, integrity and ability to cope with pressure. An example question is: ‘Give an example of how you won the trust of a worried and sceptical patient. Describe what you did, and why, and the effect it had on both you and the patient.’ Applicants are allowed 2 hours to complete seven questions, with a 250-word limit per question. The assessment is completed under invigilated conditions.
  Scoring process: Each paper is scored independently by two trained assessors working from a validated scoring framework. Assessors are trained during a 1-day workshop and calibrated to enhance reliability.

Situational judgement test (SJT)
  Item content and response format: The SJT is a newly designed instrument consisting of 50 questions of two different types (ranking and multiple best-answer formats; see Table 1). Twenty senior general practitioners (four female, one from a minority ethnic background), with experience in both training and designing assessments, worked with three psychologists to generate, review and refine a bank of questions. Questions target three non-clinical domains: empathy, professional integrity and ability to cope with pressure. In total, 186 items were written and divided into four pilot test forms of 50 items each (14 items were repeated in a second test form).
  Scoring process: Papers are machine-marked according to the agreed key.

As the CPST and AFQs were currently used in selection, the SJT was piloted alongside them. Specifically, this paper addresses the following questions.

  1. The machine-marked CPST is efficient to use. What validity does it offer compared with other short-listing methodologies?
  2. Structured AFQs are relatively time-consuming to score and require trained assessors. Do they add incremental validity over other short-listing measures?
  3. The SJT is a newly designed, machine-marked methodology in this context. Is it valid and does it have incremental validity over the other measures?
  4. Which combination of the three measures represents the most effective short-listing methodology?


Design and procedure

Data were collected during the 2006 selection round for GP training in the UK. At short-listing stage, applicants completed one of four pilot forms of the SJT together with operational assessments consisting of the CPST and AFQs. Applicants gave full consent to participation in the SJT as this was for evaluation purposes only and scores were not used in selection decisions. The SJT was completed in advance of the CPST and AFQs.

Short-listed applicants were invited to the SC, on the basis of which job offers were made. Performance at the SC was used as the outcome measure, using the mean score across three simulation exercises, as a primary aim of short-listing is to identify applicants who are likely to perform well at the SC. Performance at the SC has been shown to predict performance 3 months into GP training.7 To establish predictive validity, SC scores were regressed on the short-listing scores from the CPST, AFQs and SJT. Hierarchical regression analysis was used to explore incremental validity.

Psychometric properties of the instruments

The reliability of each assessment methodology was evaluated using Cronbach’s α coefficient and the Spearman–Brown formula where appropriate. The CPST, AFQs and SC show satisfactory levels of reliability (CPST α = 0.89; AFQs α = 0.78; SC α = 0.89). As the SJT was a pilot rather than an operational test, some items did not meet required psychometric standards and were not subsequently included in the bank. Results show that 71% of items had sufficient psychometric quality to be included in an operational test. The Spearman–Brown formula was used to estimate the α reliability of operational-length SJT forms from the items of acceptable quality in each pilot form; estimated reliability ranged from 0.80 to 0.83. To check the SJT response key, 10 experienced item writers from the examiner panel for Membership of the Royal College of General Practitioners (MRCGP) in the UK, with no previous involvement in the SJT development process, each responded to two pilot forms, providing a five-person concordance sample per form for the response key. This concordance analysis was undertaken to ensure experts were in agreement over the keyed response to each item. Kendall’s W was computed for each ranking item: 85% of items had a concordance above 0.6 and 71% above 0.7. This indicates substantial and significant levels of agreement between experts. Items with poor concordance were reviewed by item writers after piloting and only included in the bank if consensus was reached on the response key.
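The three statistics named in this section can be sketched in a few lines of Python. This is a minimal illustration using textbook forms of Cronbach’s α, the Spearman–Brown prophecy formula and Kendall’s W (without tie correction); the paper does not state which variants were used, and the function names and the example rankings below are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha. item_scores holds one list per item,
    each list giving that item's scores for the same respondents."""
    k = len(item_scores)
    totals = [sum(vals) for vals in zip(*item_scores)]
    item_var_sum = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))


def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied ranks.
    rankings holds one list of ranks (1..n) per judge."""
    m = len(rankings)        # number of judges
    n = len(rankings[0])     # number of ranked options
    rank_sums = [sum(col) for col in zip(*rankings)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical data: five experts ranking the five options of one ranking item.
example_rankings = [
    [3, 1, 5, 2, 4],
    [3, 1, 4, 2, 5],
    [4, 1, 5, 2, 3],
    [3, 2, 5, 1, 4],
    [3, 1, 5, 2, 4],
]
print(kendalls_w(example_rankings))
```

A value of W near 1 means the judges ordered the options almost identically; a value near 0 means their rankings cancel out.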


This study is based on a total of 524 doctors who applied for GP training programmes in the UK commencing in 2007 and who attended an SC as part of final-stage selection. Of those, 463 applicants had attended short-listing at a location which piloted the SJT alongside the CPST and AFQs. Situational judgement test data were not available for 12% of applicants because one pilot paper was not used as a result of an administrative error on the paper. Applicants were short-listed via the CPST and AFQs. Of the 463 applicants, 178 were male and 275 were female (10 unknown). Overall, 39.0% described themselves as White (UK and Ireland), 46.0% as Asian, 4.6% as Black and 10.4% as belonging to other ethnic groups (European, Arab, etc.). Approximately 58% of the sample were aged ≤ 30 years, 37% were aged 31–40 years and 5% were aged > 40 years.


All three short-listing methodologies and the SC showed score distributions close to normal with effective levels of variability (Table 3). There were significant correlations between scores on each of the short-listing methodologies. The correlation between the AFQs and the SJT was r = 0.41 (P < 0.001), which is consistent with the similarity of the non-clinical constructs measured by these two instruments (empathy, integrity, coping with pressure). The CPST and SJT also correlated positively (r = 0.39, P < 0.001). This may reflect the common closed-response format of both these measures (as opposed to the open-response format of the AFQs). The correlation between the CPST and AFQs was r = 0.32 (P < 0.001), reflecting the differences in content (clinical versus non-clinical domains) and response method (closed versus open). In summary, this level of correlation suggests that each pair of instruments has both common and independent variance.

Table 3.   Summary of descriptives for all measures in the study

| n = 463 | Selection centre | Application form questions | Clinical problem-solving test | Situational judgement test |
|---|---|---|---|---|
| Standard deviation | 0.46 | 0.26 | 7.5 | 0.57 |
| Min, max | 1.50, 4.00 | 2.1, 3.7 | 64, 98 | −1.24, 1.97 |

  1. Situational judgement test scores were standardised within the trial group in order to combine scores across the different trial forms
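The footnote to Table 3 says SJT scores were standardised so that results from different pilot forms could be combined. One plausible reading, sketched below, is a z-score computed separately within each form. The record layout, the function name and the use of the population standard deviation are assumptions made for illustration; the paper does not describe the exact procedure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_within_form(records):
    """records: list of (applicant_id, form_id, raw_score) tuples.
    Returns {applicant_id: z_score}, with z computed within each form."""
    by_form = defaultdict(list)
    for _, form_id, raw in records:
        by_form[form_id].append(raw)

    # Mean and standard deviation of raw scores for each pilot form.
    stats = {form: (mean(vals), pstdev(vals)) for form, vals in by_form.items()}

    z_scores = {}
    for applicant_id, form_id, raw in records:
        mu, sd = stats[form_id]
        # Guard against a form on which every applicant has the same score.
        z_scores[applicant_id] = (raw - mu) / sd if sd > 0 else 0.0
    return z_scores


# Hypothetical raw scores on two pilot forms of differing difficulty.
pilot = [("a", "form1", 10), ("b", "form1", 20),
         ("c", "form2", 30), ("d", "form2", 50)]
print(standardise_within_form(pilot))
```

After this step a score expresses an applicant's position relative to others who sat the same form, so forms of differing difficulty can be pooled in one analysis.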

In analysing associations between each short-listing methodology and SC scores, both uncorrected correlations and coefficients corrected for restriction of range are reported.15 In this context, uncorrected correlations underestimate ‘true’ relationships between predictors and criteria when, as in this case, the predictors are used to select applicants to be included in the sample (to attend the SC) and thus restrict the variation in scores. The uncorrected correlation between AFQs and the SC was r = 0.26 (P < 0.001), indicating moderate predictive validity for a selection methodology. Uncorrected correlations were r = 0.30 (P < 0.001) for the CPST and SC, and r = 0.46 (P < 0.001) for the SJT and SC. The AFQs and CPST were then corrected for direct restriction of range caused by their use in short-listing. The SJT was corrected for the indirect restriction of range resulting from the correlation of the SJT with the operational short-listing measures (i.e. AFQs and CPST). When corrected for restriction of range, the correlations between each short-listing measure and the SC increased to r = 0.40 (AFQs), r = 0.44 (CPST) and r = 0.56 (SJT). Of the three short-listing methodologies, the SJT showed the strongest association with SC performance and the AFQs the weakest, although the latter was still moderately strong.
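A correction for direct restriction of range can be illustrated with the widely used Thorndike Case II formula. The paper cites its source for the corrections but does not print the formula or the unrestricted standard deviations, so the sketch below is an assumption about the general form only. The function name and the standard deviation ratio in the example are hypothetical, and the indirect correction applied to the SJT is not shown.

```python
import math


def correct_direct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction.

    r_restricted    correlation observed in the selected (restricted) sample
    sd_unrestricted predictor standard deviation in the full applicant pool
    sd_restricted   predictor standard deviation in the selected sample
    """
    u = sd_unrestricted / sd_restricted
    return r_restricted * u / math.sqrt(
        1 - r_restricted ** 2 + (r_restricted * u) ** 2
    )


# Hypothetical illustration: an observed r of 0.26 and an applicant-pool
# standard deviation assumed to be 1.6 times that of the selected sample.
corrected = correct_direct_range_restriction(0.26, 1.6, 1.0)
print(round(corrected, 2))
```

When the two standard deviations are equal the function returns the observed correlation unchanged, and the corrected value grows as selection narrows the spread of predictor scores.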

Results demonstrate that the strongest prediction of SC performance is a combination of all three measures (uncorrected R = 0.51, P < 0.001). Hierarchical regression analyses showed each of the three measures to have significant incremental validity over the others (P < 0.001) (Table 4). The SJT is the best single predictor (adjusted R2 = 0.213, P < 0.001) and offers the most incremental validity over other methodologies. The predictive validity of the SJT is superior to that of both the CPST and AFQs, but all three are effective predictors. This has important implications for the development of selection methodologies in this context, especially in the assessment of non-clinical domains.

Table 4.   Predictive validity results for the clinical problem-solving test, structured application form questions and situational judgement test

| Predictors | Adjusted R2 | R2 change | F (d.f. 1, d.f. 2)* |
|---|---|---|---|
| CPST | 0.090 | | 46.85 (1,461) |
| AFQs | 0.066 | | 33.43 (1,461) |
| SJT | 0.213 | | 125.74 (1,461) |
| CPST and AFQs | 0.138 | | 37.93 (2,460) |
| Incremental prediction of CPST over AFQs | | 0.074 | 39.62 (1,460) |
| Incremental prediction of AFQs over CPST | | 0.049 | 26.42 (1,460) |
| CPST and SJT | 0.232 | | 70.81 (2,460) |
| Incremental prediction of CPST over SJT | | 0.021 | 12.69 (1,460) |
| Incremental prediction of SJT over CPST | | 0.143 | 86.12 (1,460) |
| AFQs and SJT | 0.235 | | 71.87 (2,460) |
| Incremental prediction of SJT over AFQs | | 0.170 | 102.91 (1,460) |
| Incremental prediction of AFQs over SJT | | 0.024 | 14.35 (1,460) |
| CPST, AFQs and SJT | 0.252 | | 52.94 (3,459) |
| Incremental prediction of AFQs over CPST and SJT | | 0.022 | 13.38 (1,459) |

  1. Performance at the selection centre was used as the criterion variable, using the mean score across three simulation exercises in the selection centre

  2. * All values were significant at P < 0.001

  3. CPST = clinical problem-solving test; AFQs = application form questions; SJT = situational judgement test
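The quantities in Table 4 are linked by standard regression identities, so a reader can check them against one another. The sketch below recovers R2 from a reported F statistic, converts it to adjusted R2 and recomputes the F for an R2 change. The function names are illustrative and n = 463 is taken from the sample description; this is a consistency check on the published figures, not a reconstruction of the authors' analysis.

```python
def r2_from_f(f_value, df1, df2):
    """R-squared implied by an overall regression F statistic."""
    return f_value * df1 / (f_value * df1 + df2)


def adjusted_r2(r2, n, n_predictors):
    """Adjusted R-squared for n cases and n_predictors predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def f_change(r2_full, r2_reduced, n_added, df2_full):
    """F test for the R-squared gained by adding n_added predictors."""
    return ((r2_full - r2_reduced) / n_added) / ((1 - r2_full) / df2_full)


N = 463  # applicants in the analysis sample

# Overall F values as reported in Table 4.
r2_cpst = r2_from_f(46.85, 1, 461)
r2_cpst_sjt = r2_from_f(70.81, 2, 460)

# Table 4 reports adjusted R2 = 0.090 for the CPST alone.
print(f"{adjusted_r2(r2_cpst, N, 1):.3f}")

# Table 4 reports an R2 change of 0.143 and F = 86.12 (1,460)
# for the incremental prediction of the SJT over the CPST.
print(f"{r2_cpst_sjt - r2_cpst:.3f}")
print(f"{f_change(r2_cpst_sjt, r2_cpst, 1, 460):.2f}")
```

The same three functions can be applied to any other pair of nested models in the table, for example the AFQs added to the CPST and SJT.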

Resource-efficiency is an important consideration in developing selection methodologies.2,16 In terms of cost comparison and efficiency between the methodologies, a utility analysis of the GP recruitment process (including analysis of assessor time and administration costs) produced estimated costs of £50 (€59, $74) per applicant using the AFQ method compared with actual costs of £20 (€24, $30) per applicant using a machine-marked test.


Main findings

All three short-listing methodologies independently show sufficient predictive validity for selection purposes. Corrected correlations over r = 0.3 between predictor and criterion indicate a moderately strong level of prediction.17 Although there is scope for improvement, the existing selection process using the CPST and AFQs is an effective methodology. The CPST focuses on clinical knowledge and problem-solving ability and the AFQs focus on non-clinical domains, including empathy, integrity and ability to cope with pressure. The results show that AFQs add significant incremental validity over the CPST in predicting SC outcome. This might be expected as AFQs target non-clinical domains, whereas the CPST focuses on clinical problem-solving skills. The SJT shows the strongest validity in predicting SC performance. This concurs with other studies using SJTs in college admissions procedures, where this methodology is best suited to assessment of non-clinical domains.14 However, this is the first application of an SJT in postgraduate selection and it demonstrates that the methodology is effective in this context. Although each methodology independently predicted later SC performance, the most accurate overall prediction was obtained using the newly designed SJT in combination with the other measures.

Interpretation of results

The degree of predictive validity is an important evaluative standard for any selection procedure. However, although sufficient validity is achieved, the AFQ method is relatively costly to implement, especially for large-volume recruitment such as this, as it requires substantial marking resources (two trained assessors score each response independently) in addition to invigilation. The SJT provides similar measurement properties to the AFQ process but does not carry the associated marking costs. When considering efficiency in addition to predictive validity, results show the optimum short-listing methodology to be a combination of the CPST and SJT, as both are machine-marked. It must, however, be acknowledged that SJTs (like CPSTs) require investment in development costs and, because they are machine-marked, are most cost-beneficial when there are large applicant numbers.


Although the validity of the three short-listing measures in predicting performance in the final stage of selection has been clearly demonstrated, further research is needed to examine whether the measures also predict performance in training. This is currently being explored in a separate validation study. One limitation is the loss of some data (12%) for administrative reasons. However, the sample size remains adequate and there is no evidence to suggest a systematic bias in the data lost. As the SJT was a pilot and candidates were aware that scores would not contribute to selection decisions, it could be argued that candidates may not have addressed this paper with the same degree of seriousness, which might have attenuated its observed reliability and, therefore, validity. Additionally, because the SJT was a pilot paper rather than an operational test, the validity coefficients may, again, be attenuated.


The results have important implications for developing selection systems for large-volume recruitment to optimise both effectiveness and efficiency. Although there are limitations to this study, as outlined above, the evaluation of a machine-marked SJT to assess non-clinical domains is an important innovation in this context. This is relevant to current proposals for selection into UK postgraduate training for all medical specialties regarding the use of machine-marked tests for short-listing purposes.18 However, recent evidence shows that although there are common competencies across specialties, there are also differing priorities in the selection criteria, especially in non-clinical domains.19 Research is required to examine whether an SJT is an appropriate methodology for use in other specialties. Future research should explore applicant reactions and perceptions of fairness5,20 and other crucial evaluative standards in judging the quality of selection methodologies. In addition, further research should explore whether assessments are prone to coaching or practice effects.14 Results at selection should be linked with other work-based assessment methodologies once trainees are in post to explore predictive validity in the long term.

Contributors:  FP contributed to the original conception and design of the study, data collection, analysis and interpretation, and the write-up of the paper. HB and VC contributed to the overall study design, data collection, analysis and interpretation, and the write-up of the paper. PL and SP contributed to the conception of the original study, and data collection and interpretation, and commented on the paper. All authors approved the final manuscript for publication.


Acknowledgements:  Gai Evans is acknowledged for her contribution to data collection via the General Practice National Recruitment Office. Our gratitude to the various contributors to item writing for all three short-listing methodologies is acknowledged.

Funding:  this research was funded by the Committee of General Practice Education Directors (COGPED) in the UK via the National Recruitment Office.

Conflicts of interest:  FP, HB and VC are employed by Work Psychology Ltd. This organisation has advised the UK Department of Health on selection and recruitment issues. SP has subsequently become the GP Adviser in the DH MMC team.

Ethical approval:  this study was approved by City University, London.