Systematic review of applied usability metrics within usability evaluation methods for hospital electronic healthcare record systems

Abstract

Background and objectives: Electronic healthcare records have become central to patient care. Evaluation of new systems includes a variety of usability evaluation methods and usability metrics (often referred to interchangeably as usability components or usability attributes). This study reviews the breadth of usability evaluation methods, metrics, and associated measurement techniques that have been reported to evaluate systems designed for hospital staff to assess inpatient clinical condition.

Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology, we searched Medline, EMBASE, CINAHL, the Cochrane Database of Systematic Reviews, and Open Grey from 1986 to 2019. For included studies, we recorded usability evaluation methods or usability metrics as appropriate, and any measurement techniques applied to illustrate these. We classified and described all usability evaluation methods, usability metrics, and measurement techniques. Study quality was evaluated using a modified Downs and Black checklist.

Results: The search identified 1336 studies. After abstract screening, 130 full texts were reviewed. In the 51 included studies, 11 distinct usability evaluation methods were identified. Within these usability evaluation methods, seven usability metrics were reported. The most common metrics were the ISO 9241-11 and Nielsen's components. An additional "usefulness" metric was reported in almost 40% of included studies. We identified 70 measurement techniques used to evaluate systems. Overall study quality was reflected in a mean modified Downs and Black checklist score of 6.8/10 (range 1–9), with 33% of studies classified as "high quality" (scoring eight or higher), 51% as "moderate quality" (scoring six or seven), and the remaining 16% as "low quality" (scoring five or below).

Conclusion: There is little consistency within the field of electronic health record system evaluation. This review highlights the variability within usability methods, metrics, and reporting. Standardized processes may improve the evaluation and comparison of electronic health record systems and improve their development and implementation.

| INTRODUCTION

Well-implemented systems should facilitate timely clinical decision-making.1,2 However,3 the prevalence of poorly performing systems suggests the common violation of usability principles.4 There are many methods to evaluate system usability.5 Usability evaluation methods cited in the literature include user trials, questionnaires, interviews, heuristic evaluation, and cognitive walkthrough.6-9 There are no standard criteria to compare results from these different methods,10 and no single method identifies all (or even most) potential problems.11 Previous studies have focused on usability definitions and attributes.12-17 Systematic reviews in this field often present a list of usability evaluation methods18 and usability metrics,19 with additional information on the barriers and/or facilitators to system implementation.20,21 However, many of these are restricted to a single geographical region,22 type of illness, health area, or age group.23 The lack of consensus on which methods to use when evaluating usability24 may explain the inconsistent approaches demonstrated in the literature. Recommendations exist,25-27 but none contain guidance on the use, interpretation, and interrelationship of usability evaluation methods, usability metrics, and the varied measurement techniques applied to assess EHR systems used by clinical staff. These are a specific group of end-users whose system-based decisions have a direct impact on patient safety and health outcomes.
The objective of this systematic review was to identify and characterize usability metrics (and their measurement techniques) within usability evaluation methods applied to assess medical systems used exclusively by hospital-based clinical staff for individual patient care. For this study, all such components in the included studies have been labelled "metrics" to facilitate comparison of methods when testing and reporting EHR system development.28 Under this convention, for example, Nielsen's satisfaction attribute is treated as equivalent to the ISO usability component of satisfaction.

| METHODS
This systematic review was registered with PROSPERO (registration number CRD42016041604).29 During the literature search and initial analysis phase, we decided to focus on the methods used to assess graphical user interfaces (GUIs) designed to support medical decision-making rather than visual design features. We changed the title of the review to reflect this decision. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines30 (Appendix Table S1).

| Eligibility criteria
Included studies evaluated electronic systems or medical devices that were used exclusively by hospital staff (defined as doctors, nurses, allied health professionals, or hospital operational staff) and that presented individual patient data for review.
Excluded studies evaluated systems operating in nonmedical environments, systems that presented aggregate data (rather than individual patient data), and systems not intended for use by clinical staff. Other systematic or narrative reviews were also excluded.

| Search criteria
The literature search was carried out by TP using Medline, EMBASE, CINAHL, Cochrane Database of Systematic Reviews, and Open Grey bibliographic databases for studies published between January 1986 and November 2019. The strategy combined the following search terms and their synonyms: usability assessment, EHR, and user interface. Language restrictions were not applied. The reference lists of all included studies were checked for further relevant studies. Appendix Table S2 presents the full Medline search strategy.

| Study selection and analysis
The systematic review was organized using Covidence systematic review management software (Veritas Health Innovation Ltd, Melbourne). 31 Two authors (MW, VW) independently reviewed all search result titles and abstracts. The full text studies were then screened independently (MW, VW). Any discrepancies between the authors regarding the selection of the articles were reviewed by a third party (JM) and a consensus was reached in a joint session.

| Data extraction
We planned to extract the following data from each included study: the usability evaluation methods applied; the usability metrics reported (these are referred to inconsistently in the literature, for example as usability components or usability attributes33; for the purpose of this review, we adopted the term "metric" to describe any such component, but we include all metric-similar terms used by authors in the included studies, for example (a) satisfaction, efficiency, and effectiveness metrics, and (b) learnability, memorability, and errors components); and the types and frequency of usability metrics analysed within usability evaluation methods.
We extracted data in two stages. Stage 1 involved the extraction of general data from each study that met our primary criteria, based on the original data extraction form. Stage 2 extended the extraction to capture more specific information, such as the measurement techniques for each identified metric, as we observed that these were reported in different ways.
The extracted data were assessed for agreement, which reached the goal of >95%. All uncertainties regarding data extraction were resolved by discussion among the authors.

| Quality assessment
We used two checklists to evaluate the quality of included studies. The first tool, the Downs & Black (D&B) Checklist for the Assessment of Methodological Quality,34 contains 27 questions covering the following domains: reporting quality (10 items), external validity (three items), bias (seven items), confounding (six items), and power (one item). It is widely used for clinical systematic reviews because it is validated to assess randomized controlled trials and observational and cohort studies. However, many of the D&B checklist questions have little or no relevance to studies evaluating EHR systems, particularly because EHR systems are not classified as "interventions." We therefore modified the D&B checklist to create a usability-oriented tool. Our modified D&B checklist, comprising 10 questions, assessed whether the aim of the study (specific to usability evaluation methods) was clearly stated and whether the methods and metrics used were supported by peer-reviewed literature. It also investigated whether the participants of the study were clearly described and representative of the eventual (intended) end-users, whether the time period over which the study was undertaken was clearly described, and whether the results reflected the methods and were described appropriately. The modified D&B checklist is summarized in the appendix (Appendix Table S3). Using this checklist, we defined "high quality" studies as those which scored well in each of the domains (scores of eight or above). Studies which scored well in most but not all domains were defined as "moderate quality" (scores of six or seven). The remainder were defined as "low quality" (scores of five and below). We decided not to exclude any paper due to low quality.
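To make the banding rule explicit, the following minimal Python sketch (ours, for illustration only; it is not part of the checklist or of any included study) reproduces the thresholds applied to the 10-point modified D&B score:

def classify_quality(score: int) -> str:
    """Band a modified D&B score (0-10) using the review's thresholds."""
    if score >= 8:
        return "high quality"      # scores of eight or above
    if score >= 6:
        return "moderate quality"  # scores of six or seven
    return "low quality"           # scores of five and below

# Illustrative checks against the thresholds described above
assert classify_quality(9) == "high quality"
assert classify_quality(7) == "moderate quality"
assert classify_quality(5) == "low quality"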

| RESULTS
We followed the PRISMA guidelines for this systematic review (Appendix Table S1). The search generated 2231 candidate studies, of which 1336 remained after removal of duplicates. After title and abstract screening, 130 full texts were reviewed, and 51 studies met the inclusion criteria.
Of the included studies, 16 evaluated generic EHR systems.
Eleven evaluated EHR decision support tools (four for all ward patients, one for patients with diabetes, one for patients with chronic pain, one for patients with cirrhosis, one for patients requiring haemodialysis therapy, one for patients with hypertension, one for cardiac rehabilitation, and one for the management of hypertension, type-2 diabetes, and dyslipidaemia). Seven evaluated specific electronic displays (physiological data for patients with heart failure, arrhythmias, genetic profiles, an electronic outcomes database, longitudinal care management of multimorbid seniors, chromatic pupillometry data, and pulmonary investigation results).
Four studies evaluated medication-specific interfaces. Three evaluated electronic displays for patients' clinical notes. Three evaluated mobile EHR systems. Two evaluated EHR systems with clinical reminders. Two evaluated quality improvement tools.
Two evaluated systems for use in the operating theatre environment, and one study evaluated a Sequential Organ Failure Assessment (SOFA) score calculator to quantify the risk of sepsis.
We extracted data on GUIs. All articles provided some description of the GUI, but these descriptions were often incomplete or limited to a single screenshot. It was not possible to extract further useful information on GUIs. Appendix Table S4 presents the types of data included in the EHR systems.

| Usability evaluation methods
Eleven types of methods to evaluate usability were used in the 51 studies included in this review. These are summarized in Table 2.
We categorized the 11 methods into broader groups: user trial analysis, heuristic evaluations, interviews, and questionnaires. Most authors applied more than one method to evaluate electronic systems. User trials were the most common method, reported in 44 studies (86%). Questionnaires were used in 40 studies (78%), heuristic evaluation in seven studies (14%), and interviews in 10 studies (20%). We categorized thinking aloud, observation, a three-step testing protocol, comparative usability testing, functional analysis, and sequential pattern analysis as user trial analysis. The types of usability evaluation methods are described in Table 3.
Three heuristic evaluation methods were used in seven of the included studies. Four studies used the method described by Zhang et al.; the three remaining studies used the heuristic checklist introduced by Nielsen.67,68 A severity rating scale was sometimes used to judge the importance or severity of usability problems.76 Findings from the heuristic analyses are summarized in Appendix Table S5.
Six types of interviews were used in 10 (20%) studies. The interviews were carried out before, during, or after the user trial.
The purpose of interviews conducted before the user trial (unstructured,38 follow-up,38 and semistructured38) was to understand the end-users' needs, their environment, and information/communication flow, and to identify possible changes.
The purpose of interviews conducted during the user trial (contextual73) was to observe end-users while they used the system and to collect information about its potential utility.
The purpose of interviews conducted following the user trial (prestructured,71 posttest,38 and semistructured) was mainly to gather information about missing data, system weaknesses, opportunities for improvement, and users' expectations for further system development.

| Usability metrics
The usability metrics are summarized in Table 4. Satisfaction was measured in 38 studies (75%), efficiency in 32 studies (63%), effectiveness in 31 studies (61%), learnability in 12 studies (24%), errors in 16 studies (31%), memorability in one study (2%), and usefulness in 20 studies (39%). Table 5 summarizes the variety of usability evaluation methods used to quantify the different metrics. Some authors used more than one method within the same study (e.g., a user trial and a questionnaire) to assess the same metric. Results were reported in different ways regardless of the usability evaluation methods or usability metrics applied, so we created a list of measurement techniques.
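The percentages above follow directly from the study counts; as a quick illustration, this short Python sketch (ours, with counts taken from Table 4) reproduces them:

# Study counts per usability metric, as reported in Table 4 (51 included studies)
metric_counts = {
    "satisfaction": 38,
    "efficiency": 32,
    "effectiveness": 31,
    "learnability": 12,
    "errors": 16,
    "memorability": 1,
    "usefulness": 20,
}
TOTAL_STUDIES = 51
for metric, n in metric_counts.items():
    # Prints, e.g., "satisfaction: 38/51 (75%)"
    print(f"{metric}: {n}/{TOTAL_STUDIES} ({n / TOTAL_STUDIES:.0%})")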

| Usability metrics' measurement techniques
We found that different measurement techniques were used to report the metrics. The usability evaluation methods within which these techniques were applied (Table 3) are defined as follows.

User trial: a process through which end-users (or potential end-users) complete tasks using the system under evaluation. Every participant should be aware of the purpose of the system and of the analysis; according to Neville et al.,35 participants should be "walked through" the task under analysis. One of the main objectives of a user trial is to collect observational data, but information sometimes also comes from post-test interviews or questionnaires.35 Studies using user trials are indicated in Table 2.

Thinking aloud: a verbal reporting method that generates information on the cognitive processes of the user during task performance. The user verbalizes their thoughts as they interact with the interface.36-46

Observation: direct and remote observation of users interacting with the system.52

Comparative usability testing: examines the time taken to acquire information and the accuracy of that information.53

Three-step testing protocol: tests for intuitiveness within the system. Step one asks users to identify relevant features within the interface; step two requires users to connect the clinical variables of interest; step three asks users to diagnose clinical events based on the emergent features of the display.54

Functional analysis: measures "functions" within the EHR and classifies them as either Operations or Objects; Operations are then subclassified into Domains or Overheads.55,56

Sequential pattern analysis: searches for recurring patterns within a large number of event sequences; designed to show "combinations of events" appearing consistently, in chronological order, and in a recurring fashion.57,58

Cognitive walkthrough: a walkthrough of a scenario, executing the actions that could take place during completion of the task while expressing comments about use of the interface; it measures ease of learning for new users.8,36,39,59-62

Heuristic evaluation: identifies usability problems using a checklist of heuristics. The types of heuristic evaluation method used are reported in Appendix Table S5.

Questionnaire/survey: a research instrument used to collect data from a selected group of respondents. The questionnaires used in the included studies are summarized in Appendix Table S7.

Interview: a structured research method, which may be applied before, during, or after the user trial. We identified six types of interview (follow-up, unstructured, prestructured, semi-structured, contextual, and post-test), described in Appendix Table S6; their purposes are described in the preceding section.

For the errors metric, the reported measurement techniques included information about the type of errors (n = 5) or the reason for errors (n = 1); these measurement techniques were investigated within user trials.
Other reported techniques included users' preferences across two tested system versions (n = 1).
The usefulness metric was reported using 12 different measurement techniques. These included users' comments regarding the utility of the system in clinical practice (n = 5), comments about the usefulness of the layout (n = 1), the average score of system usefulness (n = 5), and total mean scores for usefulness-related work system dimensions (n = 1).

| Quality assessment
Results for the quality assessment are summarized in the appendix (Appendix Table S9). We did not exclude articles due to poor quality.
For the D&B quality assessment, the mean score (out of a possible 32 points) was 9.9, and the median and mode scores were both 10. The included studies scored best in the reporting domain, with seven out of the 10 questions generating points. Studies scored inconsistently (and generally poorly) in the bias and confounding domains, and no study scored points in the power domain (Appendix Table S10).
Using the modified D&B checklist, the mean score was 6.8 and the median was 7.0 out of a possible 10 points. Seventeen studies (33%) were classified as "high quality" (scoring eight or higher), 26 studies (51%) were "moderate quality" (scoring six or seven), and the remaining eight studies (16%) were "low quality" (scoring five or below). The relationship between the two versions of the D&B scores is shown in the appendix (Appendix Figure S1).

| DISCUSSION

In the future, well-conducted EHR system evaluation requires established, human-factors-engineering-driven evaluation methods.
These need to include clear descriptions of study aims, methods, users, and time-frames. The Medicines and Healthcare products Regulatory Agency (MHRA) requires this process for medical devices, and it is logical that a comparable level of uniform evaluation may benefit EHRs.97

| Strengths
We have summarized the usability evaluation methods, metrics, and measurement techniques used in studies evaluating EHR systems. To our knowledge, this has not been done before. Our results tables may therefore be used as a goal-oriented matrix to guide those requiring a usability evaluation method, a usability metric, or a combination of the two when studying a newly implemented electronic system in the healthcare environment. We identified usefulness as a novel metric, which we believe has the potential to enhance healthcare system testing. Our modified D&B quality assessment checklist was not validated, but it has the potential to be developed into a tool better suited to assessing studies that evaluate medical systems. By highlighting the methodological inconsistencies in this field, we hope to improve the quality of research, which may in turn lead to better systems being implemented in clinical practice.

| Limitations
The limitations of the included studies were reflected in the quality assessment: none of the included studies scored >41% on the original D&B checklist, which is indicative of poor overall methodological quality. Results from the modified D&B quality assessment scale developed by our team were better, but still showed that over half the studies were of low or moderate quality. A significant proportion of the current research into EHR system usability has been conducted by commercial, nonacademic entities. These groups have little financial incentive to publish their work unless the results are favourable; so although this review may reflect publication bias, it is unlikely to reflect all current practice. It was sometimes difficult to extract data on the methods used in the included studies. This may reflect a lack of consensus on how to conduct studies of this nature, or a systematic lack of rigour in this field of research.

| CONCLUSION
To our knowledge, this systematic review is the first to consolidate applied usability metrics (with their specifications) within usability evaluation methods to assess the usability of electronic health systems used exclusively by clinical staff. This review highlights the lack of consensus on methods to evaluate EHR systems' usability. It is possible that the resulting inconsistencies hinder healthcare work efficiency.
The use of multiple metrics, and the variation in the ways they are measured, may lead to flawed evaluation of systems. This in turn may lead to the development and implementation of less safe and less effective digital platforms.
We suggest that the main usability metrics as defined by ISO 9241-11 (efficiency, effectiveness, and satisfaction), used in combination with usefulness, may form part of an optimized method for the evaluation of electronic health systems used by clinical staff.
Assessing satisfaction by reporting users' positive and negative comments; assessing efficiency via time to task completion and time taken to assess the patient's state; assessing effectiveness via the number/percentage of completed tasks and quantification of user errors; and assessing usefulness via user trials with think-aloud methods may also form part of an optimized approach to usability evaluation.
Our review supports the concept that high-performing electronic health systems for clinical use should allow successful (effective) and quick (efficient) task completion with high satisfaction levels, and that they should be evaluated against these expectations using established and consistent methods. Usefulness may also form part of this methodology in the future.

ACKNOWLEDGEMENTS
We would like to acknowledge Nazli Farajidavar and Tingting Zhu.

CONFLICT OF INTEREST
The authors declare that they have no competing interests.

ETHICS STATEMENT
Ethical approval was not required for this study.

AUTHORS' CONTRIBUTIONS
MW, LM and PW designed the study, undertook the methodological planning and led the writing. TP advised on search strategy and enabled exporting of results. JM, VW and DY assisted in study design, contributed to data interpretation, and commented on successive drafts of the manuscript. All authors read and approved the final manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in the supplementary material of this article.