EHR STAR: The State-Of-the-Art in Interactive EHR Visualization

Since the inception of electronic health records (EHR) and population health records (PopHR), the volume of archived digital health records has grown rapidly. Large volumes of heterogeneous health records require advanced visualization and visual analytics systems to uncover valuable insight buried in complex databases. Interactive EHR and PopHR visualization (EHR Vis) is a vibrant sub-field of information visualization and visual analytics, in which many systems have been proposed, developed, and evaluated by clinicians to support effective clinical analysis and decision making. We present the state-of-the-art (STAR) of EHR Vis literature and open access healthcare data sources and provide an up-to-date overview of this important topic. We identify trends and challenges in the field, introduce novel literature and data classifications, and incorporate a popular medical terminology standard called the Unified Medical Language System (UMLS). To address one of the main challenges in this field, we provide a curated list of electronic and population healthcare data sources and open access datasets as a resource for potential researchers. We classify the literature based on multidisciplinary research themes stemming from recurring topics. The survey provides a valuable overview of EHR Vis, revealing both mature areas and potential future multidisciplinary research directions.


Introduction and Motivation
Several healthcare data institutes strive to exploit software-based technology to study and ultimately improve a nation's collective health [Bus04,Act09,FJV*09,Lar14,Eur17]. Due to the large volume of heterogeneous electronic healthcare data, researchers combine techniques such as Machine Learning (ML), Event Sequence Simplification (ESS) and Natural Language Processing (NLP) with interactive visualization and visual analytics in order to extract useful information, enabling healthcare providers to obtain a more comprehensive understanding of the underlying patterns and behaviour related to health [SNLOB17]. Those insights provide useful context for optimizing diagnostic and treatment processes for both individuals and cohorts of patients, evaluating the quality and effectiveness of healthcare services, and preventing future public health crises [FP10].
We present a state-of-the-art (STAR) report of research literature focusing on visualization of EHRs and PopHRs to address these ongoing trends. The contributions we provide to the field in this EHR STAR include:
• An up-to-date overview of recent EHR Vis literature featuring a concise overview of important terminology and recent research in the field, with 213 related references and 19 tables,
• Novel classifications of 51 EHR Vis papers based on six recurring research themes and the Unified Medical Language System (UMLS),
• A survey of 34 high quality open access healthcare data sources and datasets,
• An EHR STAR that appeals to researchers from visualization, visual analytics, healthcare, biomedical and related disciplines,
• An overview of future challenges and open research directions in the field.
We have developed an online EHR STAR literature browser for the readers: https://ehr.wangqiru.com. It features all of the EHR papers and datasets along with several filtering and sorting options based on author, year, technique and search terms. We believe it offers a valuable resource for those interested in this topic.

Survey challenges
This section describes the challenges in the field of EHR Vis and in conducting a survey of related literature. We face a number of challenges stemming from the related literature search.

Diverse literature sources:
As literature is spread across conferences and journals from different communities, researchers struggle to keep up with the latest published work. This also increases the time and effort required to identify solved and unsolved problems.

Multidisciplinary research themes:
A well-defined classification and scope to organize relevant literature is challenging due to multidisciplinary research themes. As the complexity of research grows, cross-disciplinary collaborations are fostered, and the literature on EHR Vis often spans multiple themes. Different combinations of research expertise produce papers that may be difficult to classify. A typical EHR Vis project might involve visualization, Natural Language Processing (NLP) and Machine Learning (ML).

Inconsistent Medical terminology:
The choice of medical terminology standard varies between authors. This increases the work required to classify literature and makes it harder to provide a concise overview of recent research in the field. We address the challenge directly by adopting a medical terminology standard, UMLS, in Section 2.2, and presenting a list of standardized terminology and definitions used in the related literature in Section 1.4.
We also face a number of challenges stemming from digital healthcare data.
Healthcare data acquisition: It is generally challenging to find open and accessible healthcare datasets for conducting research in the field, due to the sensitive nature of the data [MIT16]. There are a number of ways of acquiring electronic healthcare datasets:
1. Cooperation with relevant health institutes: This can be the ideal situation but not every researcher has the opportunity to work closely with a relevant institute and obtain access to electronic healthcare data.
2. Open Access datasets: There are a number of open access datasets available online. In order to address this challenge directly, we classify and describe them in Section 6. However, the challenge with such datasets is that the access may be restrictive. EHRs may be redacted and lack some data dimensions that are important for EHR Vis research. Based on our investigation, some datasets are old and outdated.
3. Proprietary datasets: A license to access proprietary datasets can be expensive. We provide some example license costs in Section 6.5.4 where we describe some proprietary datasets.
Data protection: Electronic healthcare data contains highly sensitive information that requires extra precaution during analysis. Researchers and institutes must comply with the laws and regulations such as HITECH [Act09] and GDPR [Eur16]. This increases the difficulty in data acquisition for research.
Data heterogeneity: Electronic healthcare data is heterogeneous: a single record may include free text, images, and scalar, ordinal and categorical attributes [PL20].

Scalability:
The size of an electronic healthcare dataset is often huge. The rate of data growth exceeds the capacity of algorithms and software developed to visualize it [AKDA15].
High-dimensionality: Closely related to heterogeneity, healthcare datasets are high-dimensional and complex [GS14,TRL*17]. The ability to visualize large datasets with many attributes effectively remains a challenging problem [AKDA15].
We address some of these challenges directly in this STAR in Section 6, which includes a survey of open access electronic healthcare data sources, with a dedicated list of Data References. We also present related future challenges in the field in Section 7.

Literature search methodology
We started our literature search primarily with papers from a core set of conferences and journals. The Workshop on Visual Analytics in Healthcare (VAHC) is also reviewed, since VAHC primarily focuses on applying interactive visualization techniques to healthcare data. After the initial search and following up on the references, we found more literature from the venues listed in Table 1.
We first conduct a breadth-first search. Table 2 shows the list of keyword combinations we use for our breadth-first literature search. We use IEEE Xplore [IEE], The ACM Digital Library [Thea], Google Scholar [Gooc], Vispubdata [IHK*17], Semantic Scholar [The19], Mendeley [Els] and ResearchGate [Res] as digital libraries and tools for searching. Previous surveys serve as a good starting point for finding papers on topics of interest. Cross-referencing the extensive Survey of Surveys by McNabb and Laramee [ML17], we find another two related surveys on EHR Vis [RWA*13,WBH15].
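As a minimal sketch of how keyword combinations for a breadth-first search might be generated, the following pairs each data-related term with each technique-related term. The term lists and the `build_queries` helper are assumptions made for illustration; the combinations actually used in this survey are those listed in Table 2.

```python
from itertools import product

# Hypothetical keyword groups, chosen only for illustration.
DATA_TERMS = ["electronic health record", "EHR", "population health record"]
METHOD_TERMS = ["visualization", "visual analytics"]


def build_queries(data_terms, method_terms):
    """Pair every data term with every method term as a quoted AND query."""
    return [f'"{d}" AND "{m}"' for d, m in product(data_terms, method_terms)]


queries = build_queries(DATA_TERMS, METHOD_TERMS)
for query in queries:
    print(query)
```

Each generated string can then be pasted into a digital library's search field; the exact query syntax accepted by each library differs and is not modelled here.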
We then conduct a depth-first search on the results obtained from the breadth-first search. We review each paper to find other relevant research including:
• The previous related work section and its references
• Mendeley's [Els] "related documents" functions
• The "cited by" function provided by Google Scholar [Gooc] and Semantic Scholar [The19] to discover forward-looking related papers
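The depth-first step can be read as a traversal of a citation graph in which each paper links to its references, related documents and citing papers. The sketch below assumes such links are already available as a dictionary; the `RELATED` graph and the paper identifiers are hypothetical, and in practice the links are followed manually through the tools listed above.

```python
# Hypothetical toy graph: paper -> papers reached via its references,
# "related documents" or "cited by" links.
RELATED = {
    "seed_survey": ["paper_a", "paper_b"],
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c", "paper_d"],
    "paper_c": [],
    "paper_d": ["paper_a"],
}


def follow_links(paper, related, visited=None):
    """Depth-first traversal that returns papers in the order they are first visited."""
    if visited is None:
        visited = []
    if paper in visited:
        return visited  # already reviewed, so do not follow it again
    visited.append(paper)
    for linked in related.get(paper, []):
        follow_links(linked, related, visited)
    return visited


found = follow_links("seed_survey", RELATED)
print(found)
```

Tracking visited papers keeps the traversal from looping when two papers reference each other, which is common once forward "cited by" links are included.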

Survey scope
In this section, we describe the scope of the survey. Due to the large volume of publications related to EHR Vis, we apply the constraints described below to narrow down the list of literature. We include peer-reviewed literature focusing on real-world scenarios and empirical applications of EHR Vis. We emphasize research that uses healthcare data collected through clinical practice and that provides clinical decision support.
Novel techniques are also included. We include Event Sequence Simplification (ESS), a widely adopted technique to provide succinct visual layouts [MLL*13] hidden in EHR data-related processes. We include papers on EHR Vis with geospatial visualization, as a geographical dimension might be relevant in a PopHR dataset. Geospatial visualization partially overlaps with this survey. We include research describing EHR Vis with Natural Language Processing (NLP) techniques. Friedman and Hripcsak recognize text visualization with NLP as one of the most commonly used tools to extract information from EHR data and for studying clinical and research questions [FH99]. We also include papers describing EHR Vis systems developed with Machine Learning (ML) and data mining techniques, as they have gained traction in their applications in assisting clinical research [WZM*16].
We focus on papers published in the previous 10 years. We refer to these papers as focus papers. Older papers such as LifeLines [PMR*96], LifeLines2 [PMS*98b] and PatternFinder [FKSS06], contribute significantly to the field, with mature implementations deployed in clinical practices. We still include them as context papers and in the meta-data such as the classification Table 3, without a detailed description. By considering the publication year, we are able to investigate the fields that are less mature and provide more accurate future research directions.

Out of Scope:
We introduce the following criteria to constrain the scope of this STAR.
Non peer-reviewed publications: We exclude papers that are not peer-reviewed.

Resource-oriented:
We exclude papers focusing on the visualization of resource-oriented EHR data. We define resource-oriented EHR data as data that focuses on the management of clinical practices, such as hospital bed occupancy rates and inter/intra-hospital patient transfer times. These studies generally do not focus directly on clinical decision support.

Off-topic:
We exclude papers that focus on the use of EHR in the study of disease relations and pathogen outbreaks.

Basic visual designs:
In order to focus on novel and interactive visualization techniques, we exclude papers that describe EHR Vis with very basic, static visual designs such as a pie chart, line chart, bar chart or bubble chart. Including classic, static visual designs does not advance the state of the art.
Off-the-shelf solutions: We exclude papers that use off-the-shelf solutions to generate images. In general, they do not propose a novel visualization technique. We also exclude papers that demonstrate visual designs but do not provide a custom-built solution.
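The inclusion and exclusion criteria above can be summarized as a simple decision procedure. The sketch below is an illustration only: the `Paper` fields are hypothetical flags that a human reviewer would set after reading each paper, and the `survey_year` and `window` parameters are assumptions standing in for the "previous 10 years" rule.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int
    peer_reviewed: bool
    # Hypothetical reviewer-assigned flags mirroring the exclusion criteria.
    resource_oriented: bool = False
    off_topic: bool = False
    basic_static_design: bool = False
    off_the_shelf: bool = False


def classify(paper: Paper, survey_year: int, window: int = 10) -> str:
    """Return 'focus', 'context', or an 'out of scope' label with the reason."""
    exclusions = [
        (not paper.peer_reviewed, "not peer-reviewed"),
        (paper.resource_oriented, "resource-oriented"),
        (paper.off_topic, "off-topic"),
        (paper.basic_static_design, "basic visual design"),
        (paper.off_the_shelf, "off-the-shelf solution"),
    ]
    for excluded, reason in exclusions:
        if excluded:
            return f"out of scope: {reason}"
    # In-scope papers older than the window are kept as context papers.
    if survey_year - paper.year > window:
        return "context"
    return "focus"


print(classify(Paper("Recent system", 2018, True), survey_year=2020))
print(classify(Paper("Classic system", 1996, True), survey_year=2020))
```

Returning the reason together with the label mirrors how excluded papers are reported alongside the reason for exclusion.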

Background and terminology
Healthcare-related terminology is one of the challenges in the literature. We address this challenge by studying some of the most popular terms used in the literature. Here we provide and classify the terminology used in this STAR.

EHR:
To the best of our knowledge, there is no standard definition of an Electronic Health Record (EHR), despite its inception in the 1960s [MIT16]. Iakovidis defines EHR as digitized healthcare information on individual patients that is accessible, secure and highly usable for supporting the analysis of healthcare, education and research [Iak98]. Gunter particular provider, including demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data and radiology reports" [Thee]. The World Health Organization (WHO) defines EHR as, "Health records residing in an electronic system specifically designed for data collection, storage, and manipulation, and to provide safe access to complete data about patients" [WP17, p.16].
In this STAR we define EHR as a longitudinal collection of comprehensive patient medical information in machine readable formats, that is maintained and shared by healthcare providers, and stored securely in an electronic system.

EMR:
EHR and Electronic Medical Record (EMR) are sometimes used interchangeably to represent digitized health records used to improve quality of care and estimate costs [ZAR*11,Eva16,GBMG17,STBR18]. Unlike EHR, an EMR is stored and used internally without inter-organization sharing [HBAS17]. For purposes of this STAR, we group EMR terminology and literature into the EHR category.

PopHR:
[FP10, p.360]. A PopHR dataset focuses on the health of a population, without storing identifiable information on individual members of the population. We make a distinction between EHR and PopHR in this survey. The research focusing on PopHR is summarized in Section 4.6.

EHR Visualization:
We consider the visualization of EHR and PopHR for clinical decision support, as a sub-field of information visualization and visual analytics (EHR Vis), with the following definitions.

Literature Classification
This section describes our literature classification method. We derive classification dimensions based on:
• Recurring multidisciplinary research themes derived from our literature search, described in Section 2.1.
• The Unified Medical Language System (UMLS), introduced in Section 2.2, as the medical terminology standard for classifying literature.

Figure 1:
The various subdomains integrated in the UMLS Terminology. Image courtesy of [Bod04].

Multidisciplinary research themes
EHRs are often large-scale and may contain noisy data [CXR18]. An automated process can therefore help achieve both efficiency and accuracy in the pre-processing and visualization stages. From the related literature, we have identified several major research themes in processing and visualizing EHRs. We provide a brief description of these themes here and review the related literature in detail in Section 4. Table 8, which we describe in Section 4 on the visualization of EHR data, shows an overview of our literature classification based on multidisciplinary research themes.

Adopting a medical terminology standard
Gesulga et al. identify resistance from clinical professionals, due to a lack of expertise in computer systems including visualization, as one of the primary barriers to the adoption and deployment of EHR Vis systems in a clinical environment [GBMG17]. By adopting a medical terminology standard, we hope to bridge the gap between the two communities, reach a wider audience beyond information visualization and visual analytics, and take advantage of the extensive work invested in standardized terminology development.
UMLS is maintained by the US National Library of Medicine, which began developing it in 1986. It incorporates a growing list of 2.5 million medical concepts and 12 million relations among these concepts from multiple dictionaries in order to provide a standardized terminology. A schematic of the integrated dictionaries is shown in Figure 1. Dictionaries often use different lexical items to describe identical or similar terms. An integrated standard makes these resources interoperable and machine-readable, and helps dismantle the barrier to multidisciplinary research [Bod04].
In order to classify each paper, we first extract its keywords and look up their corresponding codes and terminology in the UMLS. Table 3 shows the overview classification of research papers found from our literature search. The x-axis is mapped to the number of EHRs visualized in the corresponding paper. The y-axis is mapped to the corresponding UMLS Code and terminology found along with the keywords appearing in each paper. We can observe from Table 3 the lack of convergence or consolidation with respect to the health conditions addressed in the EHR Vis literature. This is most likely due to the relative immaturity of the field. We also do not observe many research groups working together as a wider team effort to tackle challenges in the field. Finally, we can observe that few papers deal with the really large EHR and PopHR datasets with over 100,000 records.
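The classification procedure described above can be sketched as a lookup followed by a binning step. Everything named in the block is an assumption for illustration: the concept identifiers are placeholders rather than real UMLS codes, the bin boundaries other than the 100,000-record threshold are invented, and the paper entries are toy data.

```python
from collections import defaultdict

# Placeholder identifiers; these are NOT real UMLS codes.
KEYWORD_TO_CONCEPT = {
    "diabetes": ("CUI-PLACEHOLDER-1", "Diabetes"),
    "prostate cancer": ("CUI-PLACEHOLDER-2", "Prostate cancer"),
}


def record_bin(n_records: int) -> str:
    """Map a record count to a column label (hypothetical bin boundaries)."""
    if n_records <= 1:
        return "1"
    if n_records <= 100:
        return "2-100"
    if n_records <= 100_000:
        return "101-100,000"
    return ">100,000"


def classify_papers(papers):
    """Build a grid keyed by (concept, record bin) that lists paper ids."""
    grid = defaultdict(list)
    for paper in papers:
        column = record_bin(paper["records"])
        for keyword in paper["keywords"]:
            concept = KEYWORD_TO_CONCEPT.get(keyword.lower())
            if concept is not None:  # keywords without a concept are skipped
                grid[(concept, column)].append(paper["id"])
    return dict(grid)


# Toy input data, not taken from the surveyed literature.
papers = [
    {"id": "P1", "keywords": ["Diabetes", "timeline"], "records": 5000},
    {"id": "P2", "keywords": ["prostate cancer"], "records": 250_000},
]
grid = classify_papers(papers)
for cell, ids in grid.items():
    print(cell, ids)
```

A sparse grid of this kind makes it easy to see which health conditions and dataset sizes are covered by few or no papers.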

Related Work
This section introduces related work with a special emphasis on previous related surveys. Papers with a focus on visualization or visual analytics of EHR data are described in Section 3.1. We present previous PopHR survey papers in Section 3.2.
Our STAR differs from previous ones by including a novel, up-to-date overview using a medical terminology standard described in Section 2.2, with 29 more recent publications on EHR visualization. Tables 5-7 clearly indicate both the overlap and divergence between this STAR and previous surveys. In addition, we introduce a survey of 34 open healthcare data sources in Section 6 to address the challenge of healthcare data access.

Related work with an EHR focus
In this section, we divide related work with an EHR focus into two sub-categories: related work with an EHR Vis focus, and related work without an EHR Vis focus that concentrates on analysis instead. We also investigate both the overlap and divergence of the literature presented here with previous surveys, as shown in Tables 5-7 for focus papers, context papers, and out of scope papers respectively.

Related Work with an EHR Vis Focus
The IEEE Workshop on Visual Analytics in Healthcare (VAHC) started in 2010 and has been hosted six times at the IEEE VIS conference and four times at the American Medical Informatics Association (AMIA) Annual Symposium. EHR Vis has great potential for influencing the clinical decision-making process and conducting research on epidemiology [Eva16]. The quantity of literature has grown since an early survey published in 2013 by Rind et al. [RWA*13]. A number of related surveys have been published; we present them in this section.
Roque et al. compare six information visualization systems designed for providing overviews of EHR data [RST10]. Systems are classified based on the users, goals, and tasks. Four of these systems are included in our survey as context papers (Table 6) and two are excluded with reasons indicated in Table 7.
Rind et al. [RWA*13] review 14 information visualization systems for exploring and querying EHR documents, as shown in Table 5.2 in their work. The survey identifies four major challenges in the field, and highlights the potential that information visualization has on supporting medical tasks. Some 14 systems are compared by (1) supported data types (categorical and numerical), (2) multi-variate support, (3) subject cardinality (support for one patient versus multiple patient records), and (4) supported medical scenarios. Two systems are included in our survey as focus papers (Table 5), nine are included as context papers (Table 6), and seven are papers considered out of scope with reasoning indicated in Table 7.
Simpao et al. [SAGR14] discuss applications of visual analytics in healthcare since the HITECH Act in 2009. The authors review eight visual analytics tools for EHR and categorize their application into different scenarios: (1) using mathematical and algorithmic processing techniques such as text mining and NLP to derive insight from data, (2) using predefined data models that take EHRs as input and output predictive risk assessment results for stratifying patients, (3) enhancing EHR systems with more sophisticated rules-based functions, (4) analysing continuous data streams in non-traditional healthcare environments, such as data transmitted from wearable monitors, and (5) cutting costs and generating revenue, for example through automated billing and auditing and optimized resource allocation. Of these eight EHR Vis tools, one is included in our survey as a context paper (Table 6), and seven are considered out of scope (Table 7).
West et al. [WBH15] publish a systematic review of 18 papers, by highlighting crucial metrics to evaluate EHR systems. Those metrics include (1) visualization techniques applied to utilize the screen space efficiently while preserving as much data as possible, (2) interactive user options to identify abnormalities within the data, (3) visualization of the entire dataset even if there are missing values or inaccurate data entries, (4) visualization of temporal data including event sequences and real-time data streams, and (5) training time required for users and software. Some 13 EHR systems are described in these 18 papers. We include four as focus papers (Table 5), three as context papers (Table 6) and exclude six papers (Table 7).
Onukwugha et al. [OPS16] publish a survey of EHR Vis for cancer analysis. The authors describe five cancer-related EHR Vis systems followed by two EHR systems in detail with case studies visualizing a prostate cancer archive and a health insurance claim dataset. The authors focus on EHR systems from three perspectives, (1) the ability to identify and rectify errors in data, (2) visualization techniques and interactive options provided to support data analysis, and (3) cogent visualizations generated to present findings to decision-makers. From these seven EHR Vis systems, we include four as focus papers (Table 5), one as a context paper (Table 6) and exclude two papers (Table 7).
Gotz and Borland [GB16] discuss challenges and opportunities for the interactive visualization of EHR, with four EHR Vis systems reviewed in detail. The authors provide a broad range of empirical applications incorporating EHR Vis: (1) patient-centred point-of-care applications that provide support for clinicians on communication and analysis for a single patient, (2) patient-facing applications, similar to patient-centred point-of-care applications, providing patient-oriented support via techniques such as storytelling, (3) population management applications supporting institutional policymakers to allocate healthcare resources intelligently, and (4) health outcomes research that supports discovery and insight that generalize across a population at large. We include two as focus papers (Table 5) in our survey and exclude two papers (Table 7).
Rind et al. [RFG*17] publish a survey of EHR Vis with a focus on time-oriented datasets. The authors identify technical challenges arising from the temporal dimension of EHR datasets, as (1) the interpretation of discrete and continuous temporal dimensions, (2) the scalability from a single patient to a cohort of patients and (3) data-processing techniques to address uncertainties caused by data quality. Detailed descriptions of four EHR systems are provided; we include two as focus papers (Table 5) and two as context papers (Table 6).

Related Work with EHR Focus Outside the Visualization Community
To date, we have not found any further related EHR Vis surveys beyond what we describe. However, we found other work related to EHR analysis outside the visualization community with a focus on EHR data.
MIT Critical Data published a related book, Secondary Analysis of Electronic Health Records [MIT16]. The first chapter identifies the objective of secondary analysis of EHR data as the utilization of EHR data to provide evidence for informing best practices in clinical care. EHR has comparative advantages in both cost-effectiveness and feasibility. The second chapter reviews three open access EHR databases in detail, with compact descriptions of three additional databases with more restrictive access limitations. As one of the three open access databases no longer provides open access, we include only two of them in Table 17 in Section 6.5 as focus data sources. We exclude the three additional databases, as two have been discontinued and one no longer provides open access.
Chapter three introduces opportunities and challenges in the secondary analysis of EHR. EHR creates novel opportunities for researchers and clinicians, as large datasets and queries provide evidence to support hypotheses. The authors identify scalability and data accessibility as two major challenges in the field, which overlap with our findings in Section 7 and Table 18. Other challenges identified are data protection, data interoperability, the cost of data infrastructure and the varied quality of research outputs. The rest of the book describes techniques in data pre-processing and analysis with example studies conducted using EHR databases reviewed in chapter two.
Shickel et al. [STBR18] survey in detail six ML-EHR systems developed with Deep Learning techniques for predictive analytics using EHRs. In addition, 25 systems are included for comparison and discussion. These systems are divided into two categories based on their applied machine learning techniques: Supervised and Unsupervised, as shown in Figure 3 in their survey. Another classification dimension is derived from the target task and subtasks of previous EHR systems.
Koleck et al. [KDBB19] systematically review 27 systems that adopt NLP algorithms for extracting structured data from free text EHRs. Table 3 in their review shows the classification by clinical specialty. The survey scope is defined to include symptom science research that focuses on the description, evaluation, or use of an NLP algorithm or pipeline to process or analyse patient symptom terms. Reporting demographic information is essential for NLP-EHR studies, as symptom experience is known to vary by common demographic factors. Reporting such information helps avoid potential bias and improve the effectiveness of tailored interventions. Some 27 systems are evaluated, with eight critical indicators identified by the authors.

Related work with a PopHR vis focus
This section introduces related work with an emphasis on PopHR, which focuses on the visualization of the health of a population, rather than individuals.
Carroll et al. [CAD*14] publish a systematic review of 88 articles with a primary focus on infectious disease, needs of public health users, or usability of information visualizations. Each article is reviewed and classified into the following six categories with a focus on: (1) information needs and learning behaviour of public health professionals, (2) architecture of tools, (3) user preference with a focus on usability issues and barriers to adoption of tools, (4) features of tools, (5) usability and evaluation and (6) implementation and adoption. These categories are not mutually exclusive. In total, 14 EHR systems are reviewed in detail; we include three as focus papers (Table 5), none as context papers, and exclude 11 with reasons indicated in Table 7.
Preim and Lawonn review the existing visual analytics solutions for supporting Public Health (PH) [PL20] with structured data. The authors describe PH datasets as heterogeneous and high-dimensional, often containing temporal and spatial dimensions; flexible visual analytics solutions will therefore benefit the analysis process and provide support for PH decision-making. The survey classifies these solutions based on commonly used visualization and visual analytics techniques, as shown in Tables 4 and 5 in their work. The survey then expands into three particular areas of PH, (1) analysis and control of epidemics with 8 solutions, (2) visual analytics for epidemiological research with 14 solutions, and (3) visual analytics of population-based cohort study data. We include two as focus papers (Table 5), none as context papers, and exclude six with reasons indicated in Table 7.

Visualization of EHR data
This section describes 41 focus papers on EHR Vis found from our literature search. We further categorize these papers based on six multidisciplinary research themes derived from our investigation, as shown in Table 8. Each theme is described in this section in detail. We also provide an interactive EHR STAR Browser containing all literature described in this section. Note that each paper description follows the guidelines provided by Laramee [Lar11].

Machine learning
This section introduces the literature that combines Machine Learning (ML) and EHR Vis. We follow the definition of ML by Alpaydin [Alp10] as the process of optimizing the performance of a predefined model, based on example data or past experience. The outcomes from the process are either predictive to provide guidance on the future or descriptive to acquire knowledge from the existing data. The application of ML techniques such as deep learning [ROC*18], neural networks [KCK*19], support vector machines [ZXG19,BRMF19] and topic models [BRMF19] has evolved recently to increase the automation of processing EHR archives. From examining Table 8, we can observe that incorporating ML into EHR Vis is a relatively new trend and not very mature. We also believe that EHR Vis could benefit further from ML techniques. Table 9 presents an overview of the EHR literature in this subsection indicating which ML techniques are used. We can  ables physicians to evaluate the well-being status of prostate cancer patients by exploiting the patient's history as recorded in their respective EHRs. The phrase visual active learning system refers to a system that uses an active learning approach, which requires feedback and corrections from physicians during the training of the model. The resulting visualization enables quick identification of possible diagnoses of an individual patient's symptoms.
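To make the idea of active learning concrete, the following is a minimal sketch of pool-based uncertainty sampling on one-dimensional scores. It is not the method of the system described above: the threshold model, the `oracle` function standing in for a physician's feedback, and all values are assumptions chosen for illustration.

```python
def fit_threshold(labeled):
    """Fit a toy model: the midpoint between the means of the two classes."""
    lows = [x for x, y in labeled if y == 0]
    highs = [x for x, y in labeled if y == 1]
    return (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2


def active_learning(labeled, pool, oracle, rounds):
    """Repeatedly ask the oracle to label the most uncertain pool item."""
    labeled, pool, queried = list(labeled), list(pool), []
    for _ in range(rounds):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        # The item closest to the decision threshold is the most uncertain.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # expert feedback step
        queried.append(query)
    return fit_threshold(labeled), queried


def oracle(x):
    """Hypothetical expert: the true class boundary is assumed to be 0.6."""
    return int(x >= 0.6)


seed = [(0.0, 0), (1.0, 1)]
pool = [0.1, 0.3, 0.5, 0.58, 0.7, 0.9]
threshold, queried = active_learning(seed, pool, oracle, rounds=3)
print(threshold, queried)
```

The loop spends the expert's limited time on the examples the model is least sure about, which is the motivation for involving physicians during training rather than labeling everything up front.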
Dabek et al. propose a timeline-based framework for aggregating and summarizing EHRs [DJC17]. The main challenge they address is the heterogeneous nature of EHR data sources. The framework implements a patient timeline that conveys temporal events with nodes. Each node contains a textual summary generated automatically via machine learning. A separate panel is presented with a sunburst chart visualizing patient diagnoses and a horizon chart visualizing lab test results.
Glueck et al. present PhenoLines, a visual analysis tool for the interpretation of disease subtypes that exploits topic modelling applied to clinical data [GNDV*18]. Based on the Human Phenotype Ontology (HPO) extracting and mapping method introduced in the prior work [GHC*16, GGC*17], PhenoLines aims to support the filtering, comparison, simplification and interpretation of the temporal evolution of phenotype probabilities within and between disease subtypes. Topic modelling is used to mine cross-sectional patient comorbidity data from high-dimensional EHRs. PhenoLines enables interactive analysis of the derived topic models by encoding them in sunburst charts, as shown in Figure 2.
Kwon et al. [KCK*19] first present RetainEx, a recurrent neural network (RNN) approach that develops interactivity and interpretability for prediction tasks and incorporates the temporal dimension in patient history data. As the RNN uses a black-box approach, it is difficult to couple the predictions to a particular attribute used during training. The authors then introduce RetainVis, an interactive visual analytics tool for assisting the user in understanding the process of prediction. Histogram, scatterplot, matrix and glyph designs are used to present influential attributes leading to the prediction.
Jin et al. [JCG*20] introduce CarePre, an intelligent system that converts EHR data from a cohort of patients into sequences of events, and leverages machine learning techniques for the prediction of a patient's risk level during diagnosis. The system then recommends the most influential treatment plans. Based on the available EHR data, CarePre is also able to predict the likelihood of an outbreak for a set of potential diseases selected by the user. The MIMIC dataset [JPS*16] is used for thorough evaluations with seven physicians, including two case studies. We include this open access dataset in Table 17. Kwon et al. [KAS*20] present DPVis, a multiple-view visual analytics system that focuses on visual disease progression analysis in order to develop fully interpretable and interactive visualizations. Hidden Markov models (HMMs) are trained to infer the most probable state sequences based on the user-chosen attributes. DPVis incorporates multiple interactive visual designs, including matrix, chord diagram and parallel beeswarm plots, to support the exploration of disease progression and discover associations between patterns and variables.
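Inferring the most probable state sequence of a hidden Markov model, as described for DPVis above, is commonly done with the Viterbi algorithm. The following is a minimal sketch of that decoding step and not the DPVis implementation. The state names, observation symbols and all probabilities are invented for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that maximises the path log-probability.
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical two-state disease progression model (all values are assumptions).
STATES = ("stable", "progressing")
START_P = {"stable": 0.8, "progressing": 0.2}
TRANS_P = {"stable": {"stable": 0.9, "progressing": 0.1},
           "progressing": {"stable": 0.1, "progressing": 0.9}}
EMIT_P = {"stable": {"normal": 0.9, "elevated": 0.1},
          "progressing": {"normal": 0.2, "elevated": 0.8}}

lab_results = ["normal", "normal", "elevated", "elevated"]
decoded = viterbi(lab_results, STATES, START_P, TRANS_P, EMIT_P)
print(decoded)
```

Log-probabilities are used to avoid numerical underflow on long patient histories. The sketch assumes that every probability is strictly positive.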

Natural language processing
This section introduces EHR Vis papers incorporating Natural Language Processing (NLP) as a complementary technique. We follow the definition of NLP from Liddy as, "a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications" [Lid01]. As an active area of research, NLP has evolved since its inception in the 1940s.
As one of the most widely used analytical techniques in healthcare, NLP is capable of transforming unstructured text into a structured and machine-readable format [KDBB19]. Clinicians have very diverse ways of documenting patient records. This may require appropriate modifiers to capture words, phrases and their relationships in EHRs [SSBC19]. Table 10 shows a summary of the NLP techniques used in the EHR Vis literature. It is evident that incorporating NLP techniques is still in its early stages and has much room to grow. Zhang et al. develop AnamneVis in order to capture a complete picture of a patient's medical history [ZAR*11]. AnamneVis incorporates NLP algorithms to extract structured medical information from unstructured data sources such as doctor-patient dialogs and medical reports. The International Classification of Diseases (ICD) is used as the medical standard for mapping diseases and symptoms. The Five Ws concept [ZBA*13] is adopted for mapping the relations between extracted information. A sunburst diagram is used to visualize the data in two layouts: (1) a hierarchy-centric layout for the hierarchy information representing diagnosed ICD codes, and (2) a patient-centric layout for the past diagnoses and procedures taken. In addition, a Sankey diagram is used to illustrate the past medical diagnostic flow of the patient.
Trivedi et al. introduce NLPReViz, a visual analytics and visualization tool that uses a support vector machine for training NLP models in real time [TPC*18]. Users are able to train, review and revise trained NLP models by rectifying the binary results from the previous execution. Re-trained models are used for the next execution and provide a more accurate result. We classify NLPReViz as NLP since it combines NLP and ML with a stronger focus on NLP. This is reflected in Table 8.
Sultanum et al. present Doccurate, a system embodying a curation-based approach that automatically extracts relevant information from large clinical text datasets, to provide an accurate and sufficient overview for a patient [SSBC19]. After interviewing six domain experts, the authors conclude that preserving the original text in clinical notes is crucial for the visualization of EHRs. Doccurate provides automation in data processing and customization for visualization while preserving the link to the original data.

Event sequence simplification
This section includes EHR Vis literature with a focus on Event Sequence Simplification (ESS). We follow the definition of ESS as any technique used for reducing the visual complexity of event sequences in aggregated display overviews [MLL*13, MSD*16]. EHRs are by nature temporal events unfolding successively. ESS enables events to be trimmed down to their core elements, improving both the data processing and the visualization of EHRs. The technique is adopted by EHR Vis systems such as LifeLines [PMR*96] and EventFlow [MLL*13]. Table 11 provides a summary of event types appearing in this sub-section. Events associated with hospitals are a recurring theme. Wongsuphasawat et al. [WGP*11] introduce LifeFlow for providing an interactive visual overview of event sequence data. Following the approach used in LifeLines2 [WPS*09], the authors introduce an aggregation method that groups events into a tree-based hierarchical data structure. Nodes of the same type are rendered as a color-coded event bar. The height of an event bar is proportional to the number of records, and the gap between event bars represents the average time between events. Although the case study of LifeFlow focuses on the analysis of patient transfers between hospital departments, we still include the paper for its aggregation method. We believe the technique is also applicable to EHRs.
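The tree-based aggregation described for LifeFlow can be approximated with a prefix tree over event sequences. The sketch below is an illustration under our own assumptions and not the authors' code: the `Node` class, the `aggregate` function and the sample transfer records are hypothetical. Each node stores the number of records passing through it (the bar height) and the mean time since the preceding event (the gap between bars).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    event: str
    count: int = 0          # number of records passing through this node
    total_gap: float = 0.0  # summed time since the preceding event
    children: dict = field(default_factory=dict)

    @property
    def mean_gap(self):
        """Average time between the preceding event and this one."""
        return self.total_gap / self.count if self.count else 0.0

def aggregate(records):
    """Build a prefix tree from records given as lists of (time, event) pairs."""
    root = Node("root")
    for record in records:
        root.count += 1
        node, prev_time = root, None
        for time, event in sorted(record):
            # Records sharing the same event prefix share the same tree path.
            child = node.children.setdefault(event, Node(event))
            child.count += 1
            if prev_time is not None:
                child.total_gap += time - prev_time
            node, prev_time = child, time
    return root

# Hypothetical patient transfers, with times in hours since arrival.
records = [
    [(0, "Arrival"), (2, "ER"), (5, "ICU")],
    [(0, "Arrival"), (4, "ER"), (6, "Ward")],
    [(0, "Arrival"), (1, "Ward")],
]
tree = aggregate(records)
er = tree.children["Arrival"].children["ER"]
print(er.count, er.mean_gap)
```

In this example two of the three records pass through the ER after arrival, so the ER node has a count of 2 and a mean gap of 3 hours.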
Wongsuphasawat et al. [WG12] introduce Outflow, a visualization of temporal event sequence data. Outflow uses a different approach that visualizes the aggregation results using a graph-based representation, which simplifies the comparison of alternative paths with the same state. Both papers are supported by user studies. Monroe et al. introduce a technique to simplify temporal event sequence data [MLL*13], following their previous work called EventFlow [MWP*12]. EventFlow transforms temporal events into an aggregated display to identify hidden trends in the data. This is particularly useful for EHRs because, as their scale and dimensionality grow, the visual complexity also increases. An example is shown in Figure 3. The authors propose user-driven simplification, achieved via filtering-based selection: (1) filtering by record, which allows the user to remove records through querying or clicking, (2) filtering by category, which hides the selected categories and aggregates visual elements into fewer and larger displays, (3) filtering by time, which enables the user to define a time frame to reduce visual density, and (4) filtering by attributes, which enables the user to define threshold values. However, filtering-based simplification removes events from the original data. Transformation-based simplification is introduced to preserve the logical relations between events: (1) interval event merging, which removes gaps or overlaps between events, (2) category merging, which enables categories to be combined to reduce visual elements without removing events, and (3) marker event insertion, which allows the user to collapse multiple events into a single one.
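Two of the transformation-based simplifications listed above, interval event merging and category merging, can be sketched in a few lines. This is a simplified illustration rather than the EventFlow implementation. The function names, the `max_gap` parameter and the sample data are assumptions.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge (start, end) intervals that overlap or lie at most max_gap apart."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous interval instead of drawing a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def merge_categories(events, mapping):
    """Relabel (time, category) events; unmapped categories are kept unchanged."""
    return [(time, mapping.get(category, category)) for time, category in events]

# Hypothetical prescription intervals, in days since the first visit.
prescriptions = [(1, 5), (4, 9), (11, 14), (20, 25)]
print(merge_intervals(prescriptions))
print(merge_intervals(prescriptions, max_gap=2))

# Hypothetical point events with two drug categories folded into one.
events = [(1, "Drug A"), (2, "Drug B"), (3, "Surgery")]
print(merge_categories(events, {"Drug A": "Medication", "Drug B": "Medication"}))
```

Unlike filtering, neither transformation discards events: merged intervals still cover every original interval, and relabelled events keep their timestamps.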
Gotz and Stavropoulos [GS14] introduce DecisionFlow for visualizing large numbers (thousands) of high-dimensional temporal event sequences. Instead of visualizing the entire dataset from the beginning, DecisionFlow allows the user to construct a query with multiple constraints to retrieve the initial data. The result is then aggregated to generate milestones, and visualized for further analysis and interactions. The user is able to set and modify milestones interactively to achieve filtering and selection.
Malik et al. present CoCo [MDM*14a], a visual analytics tool for comparing groups (cohorts) of temporal event sequence data. Inspired by EventFlow [MLL*13] and Outflow [WG12], CoCo enables users to explore statistics about the underlying dataset as they interact with the simplified temporal event sequences. CoCo offers a combination of user-driven and automated methods to enable comparisons of cohort events. The authors evaluate the work [MDM*15] with two case studies. Loorak et al. [LPK*16] present TimeSpan, a visualization tool designed to explore the temporal aspects of the stroke treatment process. The authors collaborate with a team of domain experts to derive and classify a list of basic tasks in the domain of stroke care analysis. Temporal events are visualized using parallel coordinates with stacked bar charts extended with Bertin-style matrices [Ber99], and are aligned based on their positive effect on the patient. A unique evaluation with a focus group session is also presented.
Guo et al. describe EventThread [GXZ*18], a visualization system for revealing the evolution of patterns across stages in event sequence data. EventThread uses Term Frequency-Inverse Document Frequency (TF-IDF) [RJ76], a common technique used to measure the importance of text segments in a document, to capture the primary sequential pattern in the data. Events are then grouped into threads by similarity, with interactive options provided to [...] [BSKR19]. Instead of treating patient histories as event sequences, the segmented results are presented using a static dashboard, with extensive use of colors and glyphs for encoding variables, in order to visualize longitudinal changes in patient histories. The segmentation of patient histories is done by using a sliding window approach to traverse the dataset. Evaluation is performed with groups of both expert and non-expert users.
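To make the TF-IDF weighting mentioned for EventThread concrete, the sketch below applies the textbook formulation (term frequency multiplied by the logarithm of the inverse document frequency) to event sequences, treating each sequence as a document and each event type as a term. How EventThread applies TF-IDF internally is not specified here, so this mapping, the function name and the sample sequences are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical event sequences, one per patient.
sequences = [
    ["admission", "lab test", "lab test", "discharge"],
    ["admission", "surgery", "discharge"],
    ["admission", "lab test", "discharge"],
]
weights = tf_idf(sequences)
print(weights[1])
```

Events that occur in every sequence, such as admission and discharge here, receive a weight of zero, while an event that is distinctive for one sequence, such as surgery, receives the highest weight in that sequence.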

Visual analytics and comparison
This section describes research on visual analytics combined with analytical comparison of EHRs. We follow the three categories of comparative visual designs by Gleicher et al. [GAW*11]: juxtaposition, superposition and explicit encodings. Table 12 summarizes the types of comparison techniques used in the EHR Vis literature. Juxtaposition, the simplest, is the most common choice by a wide margin. Gschwandtner et al. present CareCruiser [GAK*11], an enhanced visual analysis system that explores the result of each applied clinical action and identifies sub-optimal treatment choices. CareCruiser supports the visualization of (1) hierarchical data, which includes the structure of treatment plans and sub-plans, (2) temporal data, referring to the execution sequence of treatment plans and sub-plans and the patient's condition over time, and (3) qualitative data, which represents relevant characteristics of treatment plans and sub-plans. Aligning, filtering and focus+context are provided for investigation of the patient's condition and responses to treatments, as well as comparison between multiple patients.
Borland et al. [BWH14] describe radial coordinates, a visualization technique based on parallel coordinates, a scatterplot and a chord diagram. The technique allows a more efficient utilization of space by representing each variable using an axis, arranged radially around a scatterplot. Chords are used to represent relationships between variables. The design supports comparison of high and low prevalence values across all dimensions in the data. The radial style parallel coordinates visual design is applied to NHS data from the UK. [...] of multiple patients with (1) an overview that supports direct selection of patients, (2) dynamic queries against attributes to achieve filtering, and (3) a history panel that stores previous cohorts that can be retrieved easily for comparison. The system also offers a guided analysis of correlations between patients in the cohort.
Federico et al. introduce Gnaeus, a guideline-based knowledge-assisted visual analytics system for EHRs [FUS*15]. Gnaeus utilizes computer-interpretable clinical guidelines (CIGs), which are generated based on evidence-based clinical practice guidelines, to assist the analysis of EHR data. Selected parameters from the raw data are placed in parallel with clinical actions executed to visualize the outcome, with related CIGs on the side to provide recommendations. The system enables the user to compare administered treatment with evidence-based best practices.

Glueck et al. introduce PhenoBlocks, a visual analytics tool that supports the comparison of phenotypes between patients [GHC*16]. PhenoBlocks introduces a differential hierarchy comparison algorithm for analysing phenotypes pairwise between patients, and uses a customized sunburst radial hierarchy layout [SZ00] for visualizing the results. Glueck et al. present PhenoStacks, a visualization system to support comparison of cross-sectional phenotypes within and between patient cohorts [GGC*17]. The system adopts glyphs of the Human Phenotype Ontology (HPO) developed in the prior work PhenoBlocks [GHC*16] and supports sorting and filtering by phenotype or patient attributes. Search is powered by natural language queries (see Figure 4). To reduce visual redundancy, the authors propose a topology simplification algorithm, a greedy depth-first approach, for eliminating duplicates in phenotype datasets. Zhang et al. describe IDMVis [ZCD19], a temporal event sequence visualization system developed for Type 1 diabetes treatment decision support. They provide a new method of hierarchical task abstraction for clinicians. Inspired by Temporal Folding, a technique for visualizing temporal event sequences [DSP*17], the authors propose a visual technique of dual sentinel event alignment and time scaling to further enhance the visualization for a large number of temporal event sequences. In addition to single-event alignment, which aligns trend lines based on a single designated event, the technique enables the alignment of trend lines between two user-chosen events with zooming. Wang et al. present LetterVis, a visualization tool to support the analysis of clinic letters through five customized visual layouts with support from natural language queries [WLLP21]. A letter-space layout is derived from the physical layout of text on A4-size letters used by clinicians, exploiting implicit knowledge of the clinicians who compose the letters. This layout is used to depict the query results in (1) a global view that shows all the letters loaded in one superimposed letter-space, (2) a thumbnail view for individual letters, and (3) a focus view for the original content with query results highlighted. (4) A co-occurrence matrix is included for visualizing antiepileptic drug (AED) co-prescriptions. (5) A drug chain view, where each AED is represented by a block in the chain, provides a visual representation of prescription progression.
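The idea behind the dual sentinel event alignment described for IDMVis, aligning trend lines between two chosen events and scaling the time in between, can be illustrated with a linear rescaling. This is a minimal sketch under our own assumptions and not the IDMVis implementation: the function name, the linear scaling and the sample readings are hypothetical.

```python
def align_dual_sentinel(samples, start, end):
    """Rescale (time, value) samples so `start` maps to 0.0 and `end` maps to 1.0.

    Samples outside the window between the two sentinel events are dropped.
    """
    if end <= start:
        raise ValueError("the second sentinel event must occur after the first")
    span = end - start
    return [((t - start) / span, v) for t, v in samples if start <= t <= end]

# Hypothetical readings (hour of day, value) on two days, with the two
# sentinel events occurring at different times on each day.
day_a = [(7.0, 130), (8.0, 150), (10.0, 180), (12.0, 120)]
day_b = [(7.0, 140), (10.0, 170), (13.0, 130)]

aligned_a = align_dual_sentinel(day_a, start=8.0, end=12.0)
aligned_b = align_dual_sentinel(day_b, start=7.0, end=13.0)
print(aligned_a)
print(aligned_b)
```

After rescaling, both days share the same normalized time axis, so trend lines between the two events can be superimposed even though the windows differ in length.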

Visual analytics with clustering and others
This section describes papers that use hierarchical clustering algorithms for EHR analytics. According to the survey by Xu and Wunsch [XW05], hierarchical clustering algorithms are widely used in the information visualization discipline. This conforms with our findings that all papers included in this section (Table 13) adopt hierarchical clustering algorithms to produce homogeneous subgroups based on similarities. EHR Vis may benefit from applying other clustering techniques (e.g. Vector Quantization and Estimation via Mixture Densities) to assist in analysis. Gotz et al. introduce DICON [GSCE11], a visualization tool that supports the exploration of similarity in cohorts of patients. Clusters are represented by dynamic icons and are generated using similarity and cluster analysis algorithms. The cluster refinement stage requires user guidance to evaluate cluster quality and apply refinements. Users can drag and drop, merge and split an individual patient or a cluster to refine clustering results.
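As a generic illustration of how hierarchical clustering produces homogeneous subgroups from similarities, the sketch below implements agglomerative clustering with average linkage in plain Python. It is not taken from any of the surveyed systems. The choice of Euclidean distance and average linkage, the function name and the sample patient attributes are assumptions.

```python
import math

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical patients described by (age, systolic blood pressure).
patients = [(34, 118), (36, 121), (71, 150), (69, 155), (35, 117)]
groups = agglomerative(patients, n_clusters=2)
print(sorted(sorted(group) for group in groups))
```

The quadratic pair search keeps the sketch short but is only suitable for small cohorts. In practice, attributes on different scales would usually be normalized before distances are computed.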
Kamaleswaran et al. use a tri-event heatmap representation for displaying high frequency complex data [KPT*14], neonatal spells, collected in neonatal intensive care units. Their clustering includes a temporal factor and a non-linear similarity metric. The authors apply both density estimation and logarithmic clustering to normalize and discretize the non-parametric distribution during data pre-processing. The resulting visualization supports the exploration of frequency, duration, and severity of spells.

Kamaleswaran et al. introduce a visualization technique called a Temporal Intensity Map (TIM) [KCJM16], a customized heatmap with the y-axis representing the critical distance interval determined by a density estimation function. The focus is on the visual analysis of event streams that reveal important information about the frequency and duration of streaming events derived from real-time event stream algorithms. The authors further introduce a dashboard visual analysis system, PhysioEx, formed by a TIM, a sequence graph, a linear graph, and a streams graph for analysing neonatal data and predicting physiological behaviours of newborns. Glicksberg et al. describe PatientExploreR [GOT*19], an interactive interface that facilitates the visualization and querying of EHRs. By incorporating the Observational Medical Outcomes Partnership (OMOP) common data model introduced by the Observational Health Data Sciences and Informatics [Obs20], PatientExploreR's advanced querying function allows physicians to search, filter and compare patients with combinations of items from multiple medical terminology standards such as the UMLS described in Section 2.2. When a patient is selected, an interactive timeline presents all clinical events with the ability to expand the details, along with basic visual designs. We include this paper for the advanced querying support coupled with the integration of the OMOP common data model.

PopHR vis and Geospatial visualization
This section describes research on EHR Vis with a geospatial focus. Table 14 summarizes the geospatial landscape covered by this sub-section of literature. PopHR Vis papers are also included in this section.

Alonso and McCormick describe the Epidemiological Parameter Investigation from Population Observations Interface (EPIPOI), which automatically extracts three parameters describing trends, seasonality and anomalies, and a time series from large epidemiological datasets [AM12]. These three dimensions can be visualized using maps combined with time series data to reveal spatial patterns. Based on the existing data, SIMID simulates the spread of infectious disease using interactive animated maps. With customizable input parameters such as vaccination rate and mortality rate, SIMID is able to generate different mitigation strategies with variation and uncertainty that reflect the randomness in disease outbreak progression. [...] to support the visual exploration of large healthcare datasets. Based on the UMLS described in Section 2.2, the authors extract related terms from unstructured clinical notes via NLP. The authors propose a spatial texture-based approach to integrate geospace with other dimensions, which consists of (1) constructing random noise patterns with color variations to map different attributes, and (2) color-coding the offset contours of geographical regions to map the temporal dimension. The authors propose a visual design called a Spiral Theme Plot, based on ThemeRiver [HHN00] and spiral patterns [WAM01], to help physicians discover patterns and trends in events. Health-Terrain is included in this section, since it is a combination of geospatial visualization and NLP with the main focus on the former. Klemm et al. [KLG*15] propose the 3D Regression Heat Map, a novel 3D visual encoding that offers an overview of a hepatic steatosis dataset (a subset of the SHIP dataset included in Table 17). The resulting 3D heat map enables the exploration of relationships between several user-defined independent features and a user-defined target disease. Each 3D heat map slice can be projected onto a 2D space for further analysis. The approach enables experts to verify their disease-specific hypotheses and derive new ones.
Ola and Sedig [OS16] present a geospatial visual design for studying large healthcare datasets. The design combines several visualization techniques to support the exploration of the relationships between age group, risk, and cause of death at multiple levels of granularity.
Tong et al. present a hybrid visual layout called a cartographic treemap, to visualize high-dimensional healthcare data collected by the National Health Service (NHS) in the U.K. [TML*17]. By combining the space-filling advantages of treemaps for the display of hierarchical, multivariate data together with geospatial information, cartographic treemaps support exploration, analysis and comparison of complex population healthcare data from Public Health England. They further extend the work by adding a time variate, enabling the visualization of the temporal evolution trends hidden in EHR data [TRL*17].
Tong et al. extend their previous work with a cartographic layout algorithm that generates cartograms with topological features using the NHS's population healthcare data [TML18]. The proposed algorithm preserves the topological features of nearby nodes to increase recognizability and reduce layout errors.
VIVID is a web-based framework proposed by Alemzadeh et al. [ANI*19] to support the handling of missing values in cohort study data. The framework includes various visual designs to enable the user to explore the missing values (stacked bar chart and matrix), build imputation models (bean plot and bee swarm plot) and generate predictions for the missing values (chord diagram and parallel coordinates).
McNabb and Laramee present a glyph placement algorithm to support multivariate geospatial visualization of a Public Health England dataset [ML19]. The authors identify four major challenges for representing geospatial data on existing choropleths: (1) Size perceivability: sizes of glyphs and areas on a map are not easily perceivable. (2) Visualization of multivariate geospatial data: geospatial designs such as choropleths, cartograms, symbol maps etc. generally fail to depict multivariate data. (3) Occlusion: glyphs on a map often overlap and are over-plotted. (4) Glyph placement: existing solutions to address occlusion often de-couple glyphs from the geospatial regions they are intended to represent. The authors address these challenges by introducing a scale-aware map that supports dynamic modification of the level-of-detail shown via zooming and custom scaling options. The algorithm produces a map that is enhanced with glyphs which are dynamic, scale-aware and coupled to their geospatial contexts.

Evaluation
Evaluation of EHR and PopHR visual designs is very difficult due to their complex visual interfaces. An EHR Vis system is often characterized according to target user requirements. The resulting visual designs may not seem useful to evaluators [Mun09]. Furthermore, an isolated evaluative process is hardly sufficient to assess an EHR Vis system. Grounded evaluation [IZCC08], where visualization designers work closely with EHR experts to (1) understand the pre-design context, (2) conduct iterative prototyping and refinement, and (3) conduct late-stage acceptance testing, might be a solution to address the evaluation problem. We observe that grounded evaluation is being practiced in many projects (especially from the visualization community) included in this STAR.
In this section, we summarize the evaluation techniques adopted by each paper. The result is summarized in Table 15. Controlled user study (8 papers, 16%) is a type of usability study, where a set of predefined tasks are performed by participants with a certain level of expertise (or novice participants after training) in a controlled environment. They may benefit from a large number of participants [FKSS06,WS09,GSCE11,MDM*14b,BSKR19]. We observe a decrease in popularity since 2014. Isenberg et al. suggest a controlled user study is typically time-consuming and resource-intensive to design, conduct and analyse [IIJ*13]. Controlled user studies are difficult to design for complex systems.

Open Access Healthcare Data
Finding open access EHR data is very time-consuming and sometimes challenging, because VIS researchers are not often involved in EHR data collection and curation. This is usually performed by healthcare organizations. As a response to the challenges stemming from healthcare data visualization, we present a collection of open health datasets, and our methodology for searching for open healthcare datasets, along with associated challenges, a carefully defined scope and classification in this section. The result is a useful overview of healthcare data sources, with a curated list of publicly accessible healthcare datasets. The entire collection of data sources is accessible via our interactive EHR STAR Browser, available at https://ehr.wangqiru.com. We hope this section provides a helpful jump-start for potential researchers to develop visual healthcare data systems and form collaborations.

Healthcare data challenges
In this section, we discuss some major challenges faced in EHR data.
The accessibility of EHR data is one of the main barriers to researchers in general [MIT16]. Searching for related data requires a considerable amount of time. User registration and verification required by some data providers increases the manual labour. EHR data is particularly sensitive, and often comes in unstructured forms, e.g. clinic letters and hospital discharge letters. Converting the data into a structured form may lose valuable insight. Furthermore, an anonymization process is usually applied to EHR data by the respective data governance group.
Data quality is critical to EHR research. As much data is entered and computed manually, it is likely to contain incomplete and erroneous values. A special case is identified by Shneiderman and Plaisant [SP19], where a patient was recorded as being admitted 14 times but discharged only twice by a hospital. Verifying data quality requires a significant amount of time and effort. EHRs were not originally created with supporting research in mind [MIT16]. Over time, the secondary use of EHR data in supporting healthcare research has emerged and become widely accepted worldwide, which in turn improves the quality control measures for collecting them [KRN*19].
Data interoperability is challenging. Given that there is no standard definition of an EHR, healthcare providers often develop their own format to support the clinical workflow [CBC*17]. The lack of a standardized terminology, such as the UMLS, also contributes to this challenge.
These challenges remain unsolved. We see recent efforts in addressing them, such as building freely accessible EHR databases [GAG*00, JPS*16] and improving data validation and interoperability [FJV*09].

Healthcare Data search methodology
We focus on healthcare datasets that are openly accessible from a reputable data provider such as a non-profit organization, scientific research or an initiative that provides trustworthy health related sources. We start by examining data sources mentioned in the related literature we found. Our search results are shown in Table 17. [...] [Har19] for relevant data. We also use keyword combinations listed in Table 16 with data search engines [Goob, Gooa] and well-known government data portals [Eur, Thef, Theb, Thed] to expand our results. We present 34 related healthcare datasets found in Table 17.

Healthcare data scope
The EHR data survey scope includes datasets that (1) offer free and open access to external researchers, (2) have greater than 500 records and 5 attributes in each record, (3) are published by credible providers, (4) have derived publications in peer-reviewed journals and (5) are archived in English for accessibility. To verify the eligibility, we examine each dataset, or the most popular datasets if multiple datasets are provided as a collection or catalogue. We refer to these as focus data sources.
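The five scope criteria above amount to a simple eligibility check, sketched below. The `Dataset` fields, the function names and the two sample entries are hypothetical and do not describe real datasets. Criterion (2) is read here as more than 500 records and at least 5 attributes per record, which is an assumption about the intended thresholds.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    open_access: bool
    n_records: int
    n_attributes: int
    credible_provider: bool
    peer_reviewed_publications: bool
    in_english: bool

def criteria(d):
    """Evaluate the five scope criteria for one dataset."""
    return {
        "open access": d.open_access,
        # Assumed reading of criterion (2): > 500 records and >= 5 attributes.
        "size": d.n_records > 500 and d.n_attributes >= 5,
        "credible provider": d.credible_provider,
        "peer-reviewed publications": d.peer_reviewed_publications,
        "English": d.in_english,
    }

def is_focus_source(d):
    """A dataset qualifies as a focus data source only if all criteria hold."""
    return all(criteria(d).values())

def failed_criteria(d):
    """List the names of the criteria a dataset does not meet."""
    return [name for name, ok in criteria(d).items() if not ok]

# Hypothetical candidates for illustration only.
candidate_a = Dataset("Example registry A", True, 12000, 24, True, True, True)
candidate_b = Dataset("Example registry B", True, 300, 8, True, False, True)

print(is_focus_source(candidate_a), failed_criteria(candidate_a))
print(is_focus_source(candidate_b), failed_criteria(candidate_b))
```

Reporting which criteria fail, rather than only a yes or no answer, makes it easier to see why a candidate falls outside the scope.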

Context Data and Out of Scope Healthcare Data Sources
During our search, we found some candidates that fulfil some but not all criteria. We still include them as context data sources in our data source overview Table 17.
We generally exclude datasets that require an access fee, with the exception of some candidates as context data sources. We generally exclude datasets that are accessible solely via project collaborations. We generally exclude datasets that are not archived in English. However, we do include some as context data sources (if they are high quality) in Table 17 for interested readers, and describe them in Section 6.5.4.
We exclude datasets that are not directly related to EHR. Here are some noteworthy examples. [Thec] provides datasets on the adoption, utilization and performance of information technology in healthcare facilities sponsored by the US government. These datasets are excluded. The VAST Challenge 2010 Mini Challenge 3 [VAS10b] provides a dataset on genetic sequences for tracing the mutations of the Drafa virus. Each sequence of single molecules is coded as a single letter, therefore the dataset does not contain any actual EHR information and is excluded. The VAST Challenge 2011 Mini Challenge 1 [VAS11a] provides data containing posts collected from social media platforms for the identification of an epidemic outbreak. These datasets are excluded due to the lack of an EHR dimension.

Healthcare data sources classification
We present a description of data sources in this section. Table 17 displays an overview of data sources we found.
Based on the focus data source and context data source introduced above, we classify data sources into three categories: A specialized source refers to datasets focusing on a single specialty or area of specialization. The Human Mortality Database [SBW] provides multiple datasets specifically on all-cause mortality from over 50 countries or regions, therefore we classify it as a specialized focus data source.
A collection source provides access to multiple datasets from different specialties, such as the UCI Machine Learning Repository [DG17], which provides data on breast cancer, diabetes, hepatitis and other diseases.
A catalogue source does not host data on its own website but provides links to other webpages. The Registry of Research Data Repositories (re3data) [Re3] is a catalogue source that lists over 2,000 scientific datasets, each with a comprehensive description and a link pointing to its homepage.

Open access healthcare data sources
Based on the classification, we briefly describe each open access healthcare data source in their corresponding section. We describe each data source using the Five Ws [ZBA*13]:
• Who the data provider is
• When the data was collected and published
• Where the data was collected
• Why the data was collected
• What the data contains

Specialized healthcare data sources
This section describes focus data sources that focus on a single health related specialty.
Human Mortality Database began as a collaborative project in 2000 [SBW], involving research teams in the Department of Demography at the University of California, Berkeley, USA and the Max Planck Institute for Demographic Research in Rostock, Germany. The database provides open access to detailed mortality and population data for over 50 countries and regions to promote relevant research. Depending on the geographical location, data archives may span over a century. [...] [vPCB18], incorporating a collection of death rate data from infectious diseases and their historical spread between 1888-2014. The initial archive focused on the history of diseases throughout the US. It has now expanded to include over 360 datasets on 92 infectious diseases at a global level in a standard format.
The COVID-19 Dashboard is an online interactive dashboard developed by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [Joh20b,Joh20a,DDG20], as a real time visualization for the number of COVID-19 cases, deaths and recovery rates around the world. The raw data is available for open access.

The Scottish COVID-19 Response Consortium (SCRC) was founded by the University of Glasgow. The consortium includes a group of epidemiologists, mathematicians and computer scientists developing new models to help inform the control of COVID-19 in Scotland. It offers open access to COVID-19 related data provided by 15 healthboard areas of NHS Scotland [CARA*20].

Collection healthcare data sources
This section describes collection data sources that provide access to multiple datasets from different specialties.
UCI Machine Learning Repository was created by David Aha and fellow graduate students at University of California Irvine in 1987, as a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The repository contains over 110 health related datasets, including subjects such as breast cancer, diabetes, epilepsy and more.
The National Health Service (NHS) of the United Kingdom provides open access to various healthcare data collected through its operation. The data is made accessible via different portals including Public Health Wales (established in 1999) [Pubb], NHS Scotland Open Data (2009) [NHSc], The Government Digital Service (2011) [Theb], OpenDataNI (2012) [Thed], Public Health England (2017) [Puba] and NHS England (2017) [NHSa]. Example datasets hosted on these portals include mortality rates from cancer, liver disease, cardiovascular diseases and more.
Big Cities Health Coalition [Big] is a forum founded in 2014 that serves as a platform for the leaders of the 14 largest metropolitan health departments in the US to exchange strategies and jointly address challenges related to promoting and protecting the health and safety of the people they serve. The forum provides open access to data including mortality from various causes, maternal and child health, HIV etc., covering over 62 million people from 2010-2016.
Global Health Data Exchange [IHM15], operated by the Institute for Health Metrics and Evaluation, provides a catalog of global health and demographic data. It currently hosts over 12 billion population health records collected from 195 countries. The mission of the exchange is to serve as a critical resource for informed policymaking. The exchange supports searching and filtering data by over 350 diseases, injuries and risk factors.

Catalogue healthcare data sources
This section describes catalogue data sources that do not host data on their website but provide links to other data sources.
FAIRsharing [FAI] started in 2007 as a community-driven registry providing descriptions of standards, databases and data policies. Datasets can be published on FAIRsharing to increase the visibility and foster collaboration. The registry not only hosts a catalogue of health related databases, but also provides access to proven standards and data policies to reduce the potential for unnecessary reinventions.
The U.S. Government's Open Data [Thef] and HealthData.gov [Hea] started offering links to datasets in 2011, to ensure compliance with relevant Open Data Policy and promote research and innovation. Public entities ranging from federal agencies to local government departments collected over 200,000 datasets, including popular healthcare data on cancer, diabetes and hypertension.
The European Data Portal [Eur] was established in 2012, aiming to serve as a point of access to public data published by institutions, agencies and other bodies across European countries. Over 10,000 health related datasets, covering topics including HIV, norovirus and cancer, are available.
Maelstrom Catalogue [Mae] is a catalogue of epidemiological research founded by McGill University in 2012. The catalogue later expanded to include population health studies, to promote collaborative research. It currently hosts links to over 200 well-known research projects.
re3data [Re3] was funded by the German Research Foundation in 2012, as a global registry of over 2,000 research data repositories from multiple academic disciplines. It aims to provide permanent storage and access to healthcare data for the scientific community.
The COVID-19 Open Research Dataset Challenge [The20a] is a challenge launched in 2020 by the Allen Institute for Artificial Intelligence on Kaggle, an online community of data scientists. The challenge offers over 59,000 academic journals for free, in order to attract researchers and develop novel solutions to study the ongoing evolution of COVID-19. Some 1,300 novel solutions have been submitted and many are accompanied by open access anonymized patient data, as a part of the submission requirements.

Context healthcare data sources
A context healthcare data source refers to a data source that does not fulfil all criteria listed in Section 6.3, but we include and describe some high quality sources here for interested readers.
UK Biobank [UK] recruited 500,000 participants aged 40-69 years in the U.K. from 2006-2010, with extensive physical measurements and blood, urine and saliva samples collected in conjunction with wearable monitors and online assessments of personal well-being. Researchers are obliged to return their results and findings to benefit the research community. We include the UK Biobank as a context data source only as it charges a one time access fee of £2,100 (reduced to £600 for researchers from developing countries or students).
LifeLines Biobank [SRP*08, Lif] archives 167,000 participants including all age groups in the Netherlands. The research collects physical and physiological measurements such as blood pressure and skin autofluorescence, and biomaterials such as blood and urine, from participants, along with regular online questionnaires on stress and quality of life. We include LifeLines Biobank as a context data source only as it charges a one time access fee of approximately €7,800.

Tracking Adolescents' Individual Lives Survey (TRAILS) [Tra] is an ongoing research project that studies the psychological, social and physical development of over 2,500 adolescents in the Netherlands since the year 2000. The research is conducted in the form of questionnaires and interviews on topics such as cognitive functioning, academic performance, tests on fitness condition, and physical measurements such as baroreflex sensitivity. We include TRAILS as a context data source only as it charges a one time access fee of over €3,000. However, the fee is waived if a collaboration is formed with the TRAILS research group.
The study described in [Dep] is another well-known population-based study, ongoing in Ommoord, Rotterdam since 1990, with a focus on the risk factors of cardiovascular, neurological, ophthalmological and endocrine diseases in the elderly aged 55 years and over. Three cohorts (1990, 2000, 2006) included 14,926 participants and have resulted in over 2,000 scientific articles. We include the study as a context data source only as it charges an access fee and access is only granted to collaborations formed with the study's principal investigators.

Secure Anonymised Information Linkage (SAIL) Databank [FJV*09] was established in the UK in 2006. It allows external researchers to access billions of EHRs on datasets such as outpatient, critical care and primary GP care in the UK. Access to additional restricted datasets such as bowel screening, breast test and cervical screening in Wales is granted with additional approval from data providers. We include this noteworthy SAIL Databank as a context data source as the access is granted via project collaboration only.

Study of Health in Pomerania (SHIP) [JHL*01, VAS*11b] started after the German reunification in the 1990s, as a population-based epidemiological study. The study includes 7,008 women and men aged 20-79 years, with a wide range of medical data being collected. We include SHIP as a context data source due to its lack of accessibility since the study is primarily archived in German.
Groningen Initiative to Analyse Type 2 Diabetes Treatment (GIANTT) [GIA] is a project focused on the quality of care for people with type 2 diabetes in Groningen, the Netherlands since 2004. The primary data source is from local general practices. We include GIANTT as a context data source only due to its restricted accessibility since the study is in Dutch. GIANTT also charges an access fee.

Future Research Challenges and Discussion
In this section, potential future research directions are derived from the discussion of the challenges reported in the literature. Future work and challenges are often discussed at the end of each research paper. Table 18 summarizes a list of the top 10 most popular future challenges we extract from the reviewed literature, ordered by their popularity. We observe that the top future challenges are to tackle scalability as data size grows, conduct additional in-depth and effective evaluations and improve the efficiency in screen space utilization. Another popular challenge is the interoperability between different EHR Vis systems, which can be potentially addressed by adopting a common terminology standard such as the UMLS. Finally, the ability to increase system usability while simultaneously introducing advanced interactive user options is a popular future research direction.
Table caption: A summary of future challenges identified in the literature, ordered by the publication year on the x-axis and the frequency on the y-axis. We use 1-2 words to represent these challenges in the table header, and describe them in detail in Section 7.
Table caption (fragment): Keim [Kei02], and categorize bar chart, line chart and pie chart as standard 2D display. One marker indicates the technique is applied in the literature; a second marker indicates a customized variant of the technique is applied in the literature.
Scalability (22 papers, 45%) and data dimensionality (7 papers, 14%) are reported as a future challenge 28 times in total. This is a result of data growth exceeding the capacity of existing EHR Vis systems [LK06]. Apart from the handling of high-dimensional and multivariate EHR data, maintaining system availability in a real-world scenario where multiple users are accessing the system concurrently is a trending future research direction [SNK*12]. From the table, we can see this has been a persistent theme.
While scalability is a challenge for all visualization systems, we note how the following challenges are inherent to EHR visualization.
In-depth evaluation (14 papers, 29%) and validation, including quantitative and qualitative studies, is reported 14 times as the second most popular future research direction. An in-depth evaluation and validation helps to reveal the weaknesses of and potential improvements for the system. We examine and describe the evaluation techniques adopted by the literature in Section 5. Some 14 papers report the lack of evaluation or an insufficient number of participants in their studies. The recruitment of qualified participants is challenging, as these participants often do not have the time to complete lengthy and thorough evaluations. The table of challenges indicates this as a prominent theme in recent years.
Limited screen space (12 papers, 24%) constrains the content visualized and reduces the effectiveness of an EHR system [GOT*19]. As the probability of using multiple views increases in EHR Vis systems, we categorize this challenge as a domain-specific one. Features with less significance are often hidden to make space for others [BSKR19]. This may result in over-simplification and missing potential insights [MLL*13]. This is highly related to the challenge of visual aggregation and clustering (4 papers, 8%) of multiple patients and requires more advanced interaction (9 papers, 18%) techniques to explore and navigate the data, especially the temporal dimension. Table 7 indicates that interaction is a popular future challenge in earlier years.
Data interoperability (10 papers, 20%) between EHR Vis systems and institutions continues to lag [MIT16] and is reported 10 times as a future challenge. This increases the difficulty for researchers to incorporate data from heterogeneous sources in varying formats [OS16]. Although Table 3 indicates that some papers focus on the same UMLS terms, these EHR Vis systems are built specifically for their given datasets and do not offer interoperability. This is a very EHR-specific challenge that can be potentially addressed by promoting collaboration between different research groups on the same topics, and adopting a common terminology standard such as the UMLS. Table 7 indicates limited screen space and data interoperability as re-occurring challenges over the last 10 years.
System usability (12 papers, 24%) and human factors are reported by 11 papers as a future challenge. Low usability often results in a longer learning curve that requires more training time for users [WPS*09,KCJM16]. This in turn may increase the occurrence of human errors. Due to the domain expertise required, it is difficult to conduct a full usability test on EHR Vis systems.
Data quality and uncertainty (7 papers, 14%) is another challenge reported in 7 papers. Data often contains missing or incorrect values, which requires further investigation during data collection and pre-processing [OPS16].
Open data access (4 papers, 8%) is reported 3 times, as the authors of most papers we review are collaborating with domain experts or institutions. However, access to high quality data still remains a big challenge for many researchers [MIT16]. We attempt to address this challenge here in Section 6. Even though the sensitive nature of EHR data requires special permission, open data access and accessibility are not mentioned more often in the literature. This is likely due to the collaborations formed between visualization and medical experts: in Section 5, we find 59% of the papers choose to collaborate with medical experts, who also provide EHR data for visualization researchers.
Table 19 shows an overview of visualization techniques applied in all papers included in this STAR. We observe that standard 2D displays and glyphs are the most popular techniques among 21 techniques found across all EHR Vis systems. This implies that using advanced visual techniques to mitigate the scalability challenge brought by EHR data dimensionality remains understudied.
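The paper counts and percentages quoted for the ten challenges in this section can be cross-checked with a short script. The sketch below is an illustration rather than part of the survey's method: the dictionary `REPORTED` copies the "(papers, %)" pairs as printed above, and the helper `consistent_totals` is a hypothetical function that searches for the totals under which every rounded percentage matches. The resulting total is an inference from the quoted figures, not a number stated in this section.

```python
# (paper count, rounded percentage) pairs as quoted in this section.
REPORTED = {
    "scalability": (22, 45),
    "data dimensionality": (7, 14),
    "in-depth evaluation": (14, 29),
    "limited screen space": (12, 24),
    "visual aggregation and clustering": (4, 8),
    "advanced interaction": (9, 18),
    "data interoperability": (10, 20),
    "system usability": (12, 24),
    "data quality and uncertainty": (7, 14),
    "open data access": (4, 8),
}


def consistent_totals(reported, max_total=500):
    """Return every candidate total for which all rounded percentages match."""
    # The total cannot be smaller than the largest single paper count.
    smallest = max(count for count, _ in reported.values())
    return [
        total
        for total in range(smallest, max_total + 1)
        if all(round(100 * count / total) == pct
               for count, pct in reported.values())
    ]


print(consistent_totals(REPORTED))  # prints [49]
```

Under these assumptions, the quoted percentages are mutually consistent only if they are taken over 49 papers.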

Conclusions
In this STAR, we present an up-to-date overview of research papers in the field of EHR and PopHR Visualization and Visual Analytics, with an in-depth investigation of 99 of them. We investigate some of the most commonly used terminology in the field and categorize the literature based on six re-occurring research themes. Our STAR differs from the eight related surveys by including 29 more recent publications, as well as a novel classification that utilizes UMLS, as a means to improve the understanding of recent development in research and foster potential interdisciplinary collaborations. We then investigate the evaluation techniques adopted by the literature. Furthermore, we invest over two months in investigating a collection of 34 high quality open access datasets, which aims to serve as a starting point for potential researchers. Lastly, our interactive EHR STAR Browser enables the reader to easily navigate through all literature and data sources collected in this STAR.