Early validity and reliability evidence for the American Board of Emergency Medicine Virtual Oral Examination

Abstract

Background: The American Board of Emergency Medicine (ABEM) in-person Oral Certification Examination (OCE) was halted abruptly in 2020 due to the COVID-19 pandemic. The OCE was reconfigured to be administered in a virtual environment starting in December 2020.

Objectives: The purpose of this investigation was to determine whether there was sufficient validity and reliability evidence to support the continued use of the ABEM virtual Oral Examination (VOE) for certification decisions.

Methods: This retrospective, descriptive study used multiple data sources to provide validity evidence and reliability data. Validity evidence focused on test content, response processes, internal structure (e.g., internal consistency and item response theory), and the consequences of testing. A multifaceted Rasch reliability coefficient was used to measure reliability. Study data were from two 2019 in-person OCEs and the first four VOE administrations.

Results: There were 2279 physicians who took the in-person OCE in 2019 and 2153 physicians who took the VOE during the study period. Among the OCE group, 92.0% agreed or strongly agreed that the cases on the examination were cases that an emergency physician should be expected to see; 91.1% of the VOE group agreed or strongly agreed. A similar pattern of responses was given to a question about whether the cases on the examination were cases that the physicians had seen. Additional evidence of validity was obtained from the use of the EM Model, the process for case development, the use of think-aloud protocols, and similar test performance patterns (e.g., pass rates). For reliability, the Rasch reliability coefficients for the OCE and the VOE during the study period were all >0.90.

Conclusions: There was substantial validity evidence and reliability to support ongoing use of the ABEM VOE to make confident and defensible certification decisions.


INTRODUCTION
Since its first administration in 1980, the American Board of Emergency Medicine (ABEM) Oral Certification Examination (OCE) had been delivered in an in-person format. The ABEM OCE was typically conducted once in the spring and once in the fall each year. In-person OCE administrations were halted abruptly in 2020 due to the COVID-19 pandemic; the 2020 spring and fall administrations were canceled. It soon became apparent that, due to public health limitations and institutional travel bans, the resumption of an in-person OCE was untenable. Consequently, ABEM reconfigured the OCE to be administered in a virtual environment. The first Virtual Oral Examination (VOE) was administered in December 2020.
Reliability is a necessary complement to validity. For the VOE, a sufficient level of reliability is required to demonstrate that the assessment is a credible measure that reflects a consistent level of quality. Reliability includes the ability of an assessment to produce reproducible results with high internal consistency. Highly reliable assessments have a lower incidence of chance error.
Ideally, a high-stakes examination such as a medical certification examination should have minimal error caused by extraneous factors (e.g., problems with consistent administration). For the past decade, the in-person OCE was highly reliable (Cronbach's α > 0.85) (ABEM, unpublished data, 2022).
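For context, Cronbach's alpha for k items is

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)

where \sigma_i^2 is the variance of item i and \sigma_X^2 is the variance of examinees' total scores. The minimal Python sketch below shows the calculation; it is illustrative only, and the score matrix is hypothetical rather than ABEM scoring code:

    import numpy as np

    def cronbach_alpha(scores):
        # scores: 2-D array, rows = examinees, columns = items (cases)
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)      # per-item variances
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    # Hypothetical ratings: 5 examinees x 4 cases (not actual examination data)
    demo = np.array([[6, 5, 6, 7],
                     [4, 4, 5, 4],
                     [7, 6, 7, 7],
                     [5, 5, 4, 5],
                     [6, 6, 6, 5]])
    print(f"alpha = {cronbach_alpha(demo):.2f}")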
Validity is an argument built on the evidence and experience resulting from the repeated use of an assessment. A validity argument supports the interpretation of an examination's results, as well as how those results are used, such as when making a decision about a physician's certification status. An assessment is valid if it measures what it is intended to measure and the assessment's scores can be interpreted as intended.1 For the purposes of this study, the authors modeled the types of validity evidence on the Standards for Educational and Psychological Testing,2 which include test content, response processes, internal structure, and consequences of testing.3 The Standards for Educational and Psychological Testing is the most widely used psychometric reference for assessing performance using high-stakes testing.
As a certifying organization, ABEM has an interest in being confident that the VOE measures competencies that a board-certified physician must have. Those who rely on the results of ABEM's assessments include certified physicians, hospital credentialers, state medical licensing boards, and the public, all of whom have a vested interest in certification being an accurate measure of a physician's capacity to deliver safe, high-quality care. Like the in-person OCE, the VOE must demonstrate that a physician has the competencies that contribute to providing such care.
Given the high-stakes nature of specialty board certification, it was important to determine the degree to which the VOE was reliable, as well as the presence of validity evidence to support the new format. The in-person OCE had substantial validity evidence and reliability data to support its use in the certification of emergency physicians.4-8 The purpose of this investigation was to determine whether there was sufficient early validity and reliability evidence to support the continued use of the ABEM VOE for certification decisions.

Content validity
Evidence for validity focused on multiple sources that included test content, response processes, internal structure, and consequences of testing. Finally, although the content emphasis is not on disease conditions per se, it is important that the content involving medical and traumatic conditions be relevant to clinical practice. When combined with the written, 305-question Qualifying Examination, the VOE creates an assessment system that covers a substantial span of the EM Model based on a weighted content blueprint.12 Adjustments to the content blueprint have drawn on several data sources, including responses to detailed surveys ("job analyses") completed by emergency physicians.
VOE cases (both "traditional" cases and structured interview [SI] cases) were either newly created or modified from previously administered cases. Modified content involved reformatting and updating cases from the eOral format (a quasi-simulation format), as well as case topics used in the paper-only format administered prior to the introduction of the eOral format in 2015.
The two post-examination survey items of interest that assessed content relevance used a 5-point Likert scale (strongly disagree, disagree, neutral, agree, and strongly agree). The items were: (1) "Overall, the types of cases on this examination were cases that an emergency physician should be expected to see" and (2) "In my practice, I have seen most of these cases." These survey items were developed by a panel of clinically active emergency medicine experts and have been used to support validity claims on prior OCEs.5
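To make the survey metric concrete, the short Python sketch below shows how a 5-point Likert item can be collapsed into the "agreed or strongly agreed" percentage reported in this study; the responses are hypothetical, not actual survey data:

    from collections import Counter

    # Hypothetical responses to item 1 (not actual survey data)
    responses = ["strongly agree", "agree", "agree", "neutral",
                 "strongly agree", "disagree", "agree", "strongly agree"]
    counts = Counter(responses)
    agreement = counts["agree"] + counts["strongly agree"]
    print(f"Agreed or strongly agreed: {agreement / len(responses):.1%}")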

Response process
Validity evidence for response processes included a clear chain-of-

Consequences of testing
Evidence for the consequences of testing was determined by pass rates as well as by the marketability of certified physicians and the value that hospital systems and credentialers placed on the resulting certificate.
Another dimension of consequential validity was based on how the examination results are used, not on the examination itself. The ability to use a test's results to make a high-stakes certification decision supports consequential validity. Further support comes from third-party use of the credential that the assessment determines. Specifically, the way academic departments, hospital systems, physician employers (including the military), and the public view certification obtained through the VOE can also support consequential validity arguments.

Reliability
To determine the reliability of the VOE, a multifaceted Rasch measurement model was used.13 This analytic tool is part of a family of mathematical models (item response theory) that attempt to explain the relationship between a latent trait (e.g., cognitive skill and medical knowledge) and performance on a test (e.g., the VOE).
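In general form, a many-facet Rasch model for rating-scale data expresses the log-odds of receiving rating category k rather than k-1 as an additive combination of facets. The notation below follows Linacre's standard formulation and is shown for illustration; it is not a specification of ABEM's exact model:

\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

where B_n is the ability of candidate n, D_i is the difficulty of case i, C_j is the severity of examiner j, and F_k is the calibration of rating step k. The Rasch reliability coefficient then summarizes how well the model separates candidates along the B_n scale.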

Analysis
The data were largely descriptive (e.g., survey response frequencies), with chi-square testing used to evaluate nominal values. Rasch reliability coefficients were calculated as a byproduct of fitting a multifaceted Rasch model to the examination data. ABEM conducted a post hoc analysis of the OCE using the Rasch method so that OCE and VOE reliability could be compared using similar methodology. A priori thresholds for reliability were set as good (0.80-0.89) and excellent (0.90-0.99), which are typical psychometric thresholds for most measures of reliability.14 All data were stored on a secure ABEM server, and all data reports used aggregate, deidentified data that could not be reidentified.
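As an illustration of the nominal-data comparison, the following Python sketch applies a chi-square test of independence to survey response frequencies across the two formats; the counts are made up for illustration and are not the study data:

    from scipy.stats import chi2_contingency

    # Hypothetical counts per Likert category, ordered strongly agree ... strongly disagree
    table = [
        [640, 485, 70, 30, 10],    # VOE (illustrative)
        [1310, 787, 115, 50, 17],  # OCE (illustrative)
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")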

RESULTS
There were 2279 physicians who took the in-person OCE in 2019 and 2153 physicians who took the VOE during the study period. Evidence supporting the test content validity of the VOE was found in the continued use of the EM Model as the source for all VOE material.15 The response process focused on assessing the reasoning that a physician used in working through a case. Skills such as obtaining a medical history and gathering data from a physical examination, which are axiomatic when caring for a patient with undifferentiated complaints, were assessed. Diagnostic uncertainty was often clarified using a think-aloud protocol, whereby the physician was asked to role-play, state a specific diagnosis, or provide the interpretation of a diagnostic study (e.g., describing a radiographic finding).
Validity support for the VOE was also evident in the new type of case that was used, the SI. This case format was a conversation between the examiner and the test taker in which test takers were regularly asked for the rationale behind their responses to specific questions. This interrogatory approach, a form of think-aloud protocol, can better assess a physician's thought processes and logic.
Of note, several other medical certifying boards have used this approach successfully to determine certification status (e.g.,

DISCUSSION
This study is the first report of validity and reliability evidence for the virtual format of the ABEM certifying examination. Prior studies provided validity evidence for the in-person OCE, as well as the psychometric performance of the examination cases.4-8 In addition, the OCE amassed consequential validity through the use of ABEM certification as a criterion for hiring or promotion in community and academic practice settings.

TABLE 3 Comparison of virtual (total n = 1235) versus in-person agreement to expected case relevancy (strongly agree, n [%]).
Establishing validity is an iterative process whereby evidence is acquired over time. Certain elements of validity evidence will be stronger than others, and some types of validity evidence require an assessment of clinical performance, which can take years to obtain.
Test content validity evidence was supported by the use of the EM Model in developing VOE content; all VOE content was contained in the EM Model. The EM Model is publicly available and is used to define educational content, including residency curricula.18-20 Basing the VOE on the EM Model provides substantial content validity evidence. In addition, substantial evidence of test content was provided by the way the EM Model was initially developed and is regularly amended: the EM Model's stability of form and content over time, its alignment with detailed specialty-specific surveys, and its use for multiple similar assessments in emergency medicine.21 Validity evidence was also provided by the VOE's psychometric performance, such as pass rates and distribution indices, compared with the OCE.
Evidence for test content validity was also provided by the physician survey responses regarding the types of clinical cases that an emergency physician has seen or should expect to see. Responses were measured for the VOE and compared with prior OCE survey responses. More than 90% of physicians confirmed that VOE cases were akin to cases seen in clinical practice, which provided additional evidence of "face validity" and content validity for the VOE. Of note, the frequency of agreement responses regarding cases that a physician had seen was high, which supports assertions of content relevance and provides additional test content validity evidence. Although survey results were relatively close in frequency, the OCE had a statistically significantly higher rate of agreement; the practical importance of this difference is uncertain.
Validity evidence based on the internal structure of the VOE was obtained largely through measuring reliability. The Rasch reliability coefficients demonstrate excellent internal consistency. The ability to equate each administration via the application of item response theory further supports the validity of the internal structure of the VOE. Moreover, equating allowed the maintenance of an interexamination difficulty scale, which provided additional validity evidence based on internal structure.
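For illustration, one generic way to maintain an interexamination difficulty scale through common-item (anchor) linking is a mean-shift adjustment. This is a standard equating sketch, not necessarily ABEM's exact procedure:

\hat{B} = \bar{d}^{\mathrm{ref}}_{\mathrm{anchor}} - \bar{d}^{\mathrm{new}}_{\mathrm{anchor}}, \qquad d_i^{*} = d_i^{\mathrm{new}} + \hat{B}

where the d terms are case difficulties in logits; shifting the new administration's difficulties by \hat{B} expresses them, and hence the candidate measures calibrated against them, on the reference scale.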
When applying evidence for the consequences of testing (consequential validity), the VOE scores were used to determine ABEM certification status, and ABEM certification achieved by passing the VOE was used by third parties to make hiring and promotion decisions. These certification decisions based on candidate performance on the VOE were identical to decisions made on the basis of candidate performance on the OCE, even though the test formats differed.
The in-person OCE has had decades of use in the marketplace; its consistent and generalizable results from one cohort to another support consequential validity. By hiring physicians who were certified through the VOE process, the market confirmed that certification awarded by passing the VOE was a sufficient credential. There has been insufficient time, however, to compare the hiring of physicians certified through the VOE process with that of physicians certified through the OCE process. ABEM is unaware of any instance in which the certification credential obtained through the VOE has been, by itself, insufficient for hiring or credentialing.

LIMITATIONS
Establishing validity for the VOE is a long-term and ongoing activity requiring the appropriate use of the assessment and resultant certification. Acquiring additional validity evidence will create greater confidence in physicians, physician employers, and the public about the use of the VOE for ABEM certification.
The data and validity evidence are early and limited. Additional experience may demonstrate differences not identified in this early study. As early career physicians become more familiar with the format of the SI and test performance changes, the validity and reliability evidence could change.
There was a large response rate difference between the OCE and the VOE. It is possible that the VOE had a lower response rate because test takers received the survey by email after the examination, whereas at the in-person examination, test takers received paper surveys that they would have to walk past to exit the examination venue. Despite this discrepancy, the sample sizes for both formats were large, which probably mitigates concerns about sample size and self-selection bias.
The survey questions about case relevancy were "agreement" questions that lend themselves to an affirmative response bias.
Whether such bias was present is less important than any difference between the two examination formats. The same questions were used for both formats, allowing for an accurate comparison. Although the difference in responses was statistically significant, the general level of agreement was similar.
Another limitation of these analyses is the lack of interrater reliability statistics. ABEM specifically chose to forgo its typical observer rating program for the initial phase of the VOE. Every examiner did, however, undergo direct observation by an examination leader to verify that cases were administered to ABEM standards, and every scored result was reviewed to ensure adherence to case scoring guidelines.
Finally, because the VOE was implemented after a 1-year delay, it is possible that the delay affected the examination experience. Clearly, a different format mandated a different examiner and candidate experience. Despite any potential impact, the VOE was administered at a record rate and led to results that were consistent with prior test administrations; any consequential impact was not apparent.

CONCLUSIONS
There is substantial early validity evidence to support ongoing use of the ABEM VOE to make confident and defensible certification decisions. There are also strong reliability data to support confidence in those decisions.