Towards a Universal Clinical Genomics Database: The 2012 International Standards for Cytogenomic Arrays Consortium Meeting
Contract grant sponsor: NIH (HD064525).
Communicated by Richard G. H. Cotton
Correspondence to: Erin Rooney Riggs, MS, CGC, 2165 N. Decatur Rd., Decatur, GA 30033. E-mail: email@example.com
The 2012 International Standards for Cytogenomic Arrays (ISCA) Consortium Meeting, “Towards a Universal Clinical Genomic Database,” was held in Bethesda, Maryland, May 21–22, 2012, and was attended by over 200 individuals from around the world representing clinical genetic testing laboratories, clinicians, academia, industry, research, and regulatory agencies. The scientific program centered on expanding the current focus of the ISCA Consortium to include the collection and curation of both structural and sequence-level variation into a unified clinical genomics database, available to the public through resources such as the National Center for Biotechnology Information's ClinVar database. Here, we provide an overview of the conference, with summaries of the topics presented for discussion by over 25 different speakers. Presentations are available online at www.iscaconsortium.org.
The International Standards for Cytogenomic Arrays Consortium Conference
The International Standards for Cytogenomic Arrays (ISCA) Consortium is a group of laboratories, clinicians, and researchers united in their efforts to raise the standard of patient care by improving the quality of cytogenomic microarray (CMA) testing. Improving the quality of CMA testing has many facets, but one of the most important remains the delivery of consistent, evidence-based interpretations of copy number variants (CNVs). Case reports and series are invaluable in providing the phenotypic information that informs clinical interpretations, but large case–control studies provide the statistical information often needed to strengthen an argument for or against the pathogenicity of a particular CNV. To this end, one of the primary initiatives of the ISCA Consortium has been the creation of a publicly available database that intends to capture both the volume of cases necessary to make statistical inferences, as well as the phenotypic information required to make these inferences clinically informative. This database relies on a commitment to data sharing and consists of cases from laboratories around the world. The information therein is available for both expert-level curation and use in other research applications, resulting in an ongoing community-wide effort to generate knowledge on the effects of structural variation.
The needs originally identified by the ISCA Consortium are not unique to the structural variation community. Within the sequencing community, there is also a need to augment patient care through improved testing, starting with the consistent, evidence-based interpretation of sequence-level variation. With the infrastructure already largely in place for the collection and curation of these variants through the efforts of the ISCA Consortium, it is a natural move to align the efforts of the sequencing community to those of the cytogenetic community. The focus of the 2012 ISCA Consortium Meeting was to foster discussion regarding these merged interests and to forge plans regarding the implementation of this initiative, including the following: the collection of clinical genomic data, data curation and evidence-based applications for quality assurance, the importance of phenotypic data, translating data into clinical actionability, and the integration of this initiative with complementary efforts.
Clinical Genomic Data Collection: Expanding the ISCA Experience and Learning from Related Efforts
The creation of any database is naturally reliant upon the willingness of the community to submit data. For the proposed unified database of clinical genomic variation, data generated through the course of both clinical testing and research testing (for both structural and sequence-level variants) will be accepted. However, it has been the experience of the ISCA Consortium that there are often barriers to collecting these data. Erin Kaminsky (Emory University) discussed the ISCA Consortium's experience with collecting genomic variant data including the current status of the ISCA database, barriers to data submission, and benefits of genomic data collection. Survey results from 92 member institutions of the ISCA Consortium were presented and showed that although 80% of these institutions performed microarray testing, the majority of those institutions did not submit data to the ISCA database. The most common barriers to data submission were found to be “lack of time” and “lack of resources.” Potential solutions for these issues were discussed, including various data submission options. These submission options range from manual submission to “one-click” data submission through software vendors and/or third-party data-brokering relationships such as the one the ISCA Consortium has developed with Cartagenia, a Belgian company producing a Web-based software and database platform for the interpretation of genomic variation. The continued development of resources to reduce the time and effort necessary will facilitate an increased rate of data submission.
The acquisition of large amounts of genomic data from different applications will be essential as whole-genome sequencing techniques move into the clinical arena. Tina Hambuch (Illumina) discussed her group's experience with the clinical interpretation of whole-genome sequencing in five healthy individuals. On average, 30–40 variants detected per genome were found to be categorized as “disease causing” in the Human Gene Mutation Database. After careful manual assessment of literature, variant frequencies, and other evidence, her group determined that a significant percentage of these variants reported in peer-reviewed publications and public databases are not likely to be pathogenic. She asserted that the time required to complete such a review for each case was significant, but that the effort required per genome declines rapidly over the first 50–100 genomes sequenced, emphasizing the need for large volumes of such data to be publicly available. Her conclusion was that patients would benefit from a community effort to improve clinical interpretation and patient care.
Sherri Bale (GeneDx) also illustrated the need for the collected data to be publicly available and easily accessible. She noted that publicly available variant data will help clinical laboratories interpret variants that are not yet reported in the literature and will save them time and resources in interpretation. In addition, Bale emphasized the need for expert curation by presenting anecdotal case examples in which information crucial to the interpretation of particular variants was either unpublished, in conflict with current published information, or fragmented among multiple, difficult-to-access sources. This lack of publicly available, easily accessible information directly impacted the care of these patients, and will continue to affect the care of subsequent patients whose clinicians and/or laboratories may also not be able to easily access the information.
Donna Maglott of the National Center for Biotechnology Information (NCBI) discussed the burden of gathering and evaluating the information needed to interpret novel variants from multiple systems such as individual databases, public databases, proprietary databases, publications, and analysis tools. Maglott echoed Bale on the need for sharing variant data and introduced ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/), a new registry of variants and related data including reported phenotype, sample size, frequency at which each variant was observed, and pathogenicity based on inheritance and functional data, which will hopefully remove barriers to data accessibility.
Experience with the ISCA Database
Though the ISCA database has traditionally focused on collecting results from postnatal CMA testing, information regarding the prenatal applications of CMA testing will become increasingly valuable as CMA becomes more frequently used in this setting. Ronald Wapner (Columbia University) presented and discussed data from the National Institute of Child Health and Human Development (NICHD) Prenatal Microarray Study Group's comparison of cytogenetic results obtained by karyotype with results from CMA. This study showed an increased detection of clinically relevant CNVs in pregnancies with and without structural abnormalities, suggesting that CMA should become the first-tier test for invasive prenatal cytogenetic diagnosis; data from the study have been submitted to the database of Genotypes and Phenotypes (dbGaP), and the full results of the study have been published [Wapner et al., 2012]. Wapner also discussed some future steps for the NICHD study, including updates to guidelines for the clinical application of CMA in prenatal diagnosis, long-term prospective studies of fetuses identified with pathogenic CNVs and variants of uncertain clinical significance, and assessment of patient preference for result disclosure.
Another potential prenatal application of CMA is the testing of products of conception (POC). Amy Fuller (GeneDx) reported on 127 POC specimens tested by a custom-designed 180K oligonucleotide + single-nucleotide polymorphism (SNP) platform, 72 of which also had karyotype testing. Although some shortcomings of POC microarray were identified, such as missing low-level mosaicism and balanced rearrangements, results showed that 19/39 (48.7%) aberrations identified as positive or as a variant were detectable using this type of microarray but not by chromosome analysis. She also presented several cases to illustrate the advantages of using an oligonucleotide + SNP array over karyotype on POC, including increased detection of CNVs and long stretches of homozygosity (including those indicative of uniparental disomy), and the improved success rate for both obtaining results and being able to provide accurate recurrence risk counseling; these advantages will greatly improve the field of POC diagnostics.
Several other speakers focused on examples of the benefits of having access to large pools of clinical testing results for knowledge acquisition. David Miller (Boston Children's Hospital) discussed his hospital's experience with parental testing requested for the interpretation of a child's uncertain CMA result. Using a relatively large dataset acquired through clinical genetic testing and subsequent follow-up at Boston Children's Hospital, his group found that parental array results tended not to influence clinicians’ medical management considerations or recurrence risk counseling when the CNV itself was of uncertain significance (e.g., no dosage-sensitive genes). Parental testing was perceived as more helpful when a CNV was interpreted as likely pathogenic or pathogenic.
Santhosh Girirajan (University of Washington) provided an example of using databases to study the link between CNVs and neurodevelopmental disorders and identified the challenges in studying and interpreting newly identified CNVs. This group studied cases from Signature Genomics Laboratories (PerkinElmer, Inc.) compared to population controls and identified a higher CNV burden in cases and also proposed some novel pathogenic CNV locations [Cooper et al., 2011]. Subsequent review following the publication indicated that their data were comparable to what has been reported in the ISCA database. They also noted that an increased CNV burden in one individual correlates with clinical severity and can account for phenotypic variability; this led to a suggestion that individuals with two or more large, rare CNVs are more likely to have an abnormal phenotype. Challenges identified through this study included the need for larger datasets of cases and controls to determine statistical significance between populations and the difficulty associated with interpreting the broad range of phenotypic variability associated with many CNVs due to the lack of comprehensive phenotype information—both initiatives of the ISCA Consortium.
A Sequence Database Example
Robert Nussbaum (University of California, San Francisco) illustrated the need for the collection of large datasets for individual genes through a discussion of the lack of publicly available knowledge regarding variation in the BRCA1 and BRCA2 genes. He pointed out that variants from particular genes are not available in a public database and proposed methods for collecting data on them. He discussed his plans to gather data on BRCA1 and BRCA2 variants by contacting specialty genetics providers through the National Cancer Institute. At the time of the meeting, 770 providers had been contacted and asked to send a copy of the BRCA1/2 report with protected health information removed in accordance with the Health Insurance Portability and Accountability Act. Nussbaum reported that he planned to submit any information that he received to ClinVar in an effort to build a catalog of variation within these genes, as well as to capture clinically relevant interpretations that are used for medical decision-making.
Other Related Efforts
The ISCA Consortium model of data sharing and collaboration has been successfully adopted by other organizations as well, including the Cancer Cytogenomic Microarray Consortium (CCMC) (www.cancergenomics.org). Marilyn Li (Baylor College of Medicine), speaking on behalf of the CCMC, reported that the consortium was focusing on the clinical validation of cancer CMA platforms through a multicenter clinical trial. Preliminary results indicated that CMA was able to detect clinically relevant aberrations at a level comparable to current standard technologies, such as fluorescence in situ hybridization and histological techniques, and offered the additional benefit of being able to further refine the aberration, yielding additional relevant information such as origin, size, and genomic content. The study also showed that CMA results were consistent among different laboratories and among different array platforms, indicating that CMA is a sensitive and reliable technique for genomic analysis in the oncology setting.
Laboratories should not be the only partners in data sharing and collaborative efforts. Pat Furlong of Parent Project Muscular Dystrophy discussed the power of engaging the patient advocacy community in such efforts, as these groups can be powerful allies in attempts to collect genotype and phenotype information to increase research opportunities through enhanced datasets. TreatNMD (http://www.treat-nmd.eu/), a global initiative focused upon bringing novel therapeutic approaches to patients with neuromuscular disorders, has responded to queries from academic and commercial researchers since 2007 and helped researchers in (1) locating participants, (2) planning clinical trials, (3) promoting clinical treatment recommendations and standards of care, and (4) characterizing the patient population. DuchenneConnect (www.duchenneconnect.org) is a contributing member of TreatNMD, which includes registries in 44 countries and over 10,000 patients with muscular dystrophy. The data collected in DuchenneConnect have been shown to compare with published norms for (1) mutation spectrum, (2) age of diagnosis, and (3) age of use of a wheelchair, demonstrating the validity of this type of data collection. Currently TreatNMD and DuchenneConnect are assisting 21 research trials wherein genotype and phenotype data are critical.
Data Curation and Evidence-Based Applications for Quality Assurance
Presentations regarding data curation and how an evidence-based process is being applied within the ISCA Consortium to facilitate clinical interpretations were provided. Christa Lese Martin (Emory University) discussed how data curation is needed to ensure consistency and confidence and to facilitate the acquisition of knowledge for variants of uncertain significance. The ISCA database provides an online mechanism for submitting laboratories to identify interpretation discrepancies both within their own laboratory and between their laboratory and others. At the time of the meeting, 4.6% of submitted cases were in conflict with the ISCA Consortium's curated pathogenic regions (http://www.ncbi.nlm.nih.gov/dbvar/studies/nstd45/). Martin then described the ISCA database curation process by which such conflicts are identified and presented to the submitting laboratory for resolution. This process promotes optimal patient care overall and prevents erroneous clinical interpretations based on the use of uncurated data or the limited experience of a given laboratory.
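The conflict check described above can be sketched in a few lines of Python. This is an illustration only, not the ISCA database's actual curation procedure: the `Region` and `SubmittedCall` types, the `find_conflicts` function, the rule that a conflict is a call that fully contains a curated pathogenic region yet is not interpreted as pathogenic, and all coordinates and laboratory names are assumptions made for the example. Gain versus loss is also ignored here, which a real check would need to consider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class SubmittedCall:
    lab: str
    region: Region
    interpretation: str  # e.g. "pathogenic", "uncertain", "benign"


def covers(call: Region, curated: Region) -> bool:
    """Return True when the call fully contains the curated region."""
    return (
        call.chrom == curated.chrom
        and call.start <= curated.start
        and call.end >= curated.end
    )


def find_conflicts(calls, curated_pathogenic):
    """Flag calls that contain a curated pathogenic region but are not
    interpreted as pathogenic (a simplified, hypothetical conflict rule)."""
    conflicts = []
    for call in calls:
        for curated in curated_pathogenic:
            if covers(call.region, curated) and call.interpretation != "pathogenic":
                conflicts.append(call)
                break  # one conflicting region is enough to flag the call
    return conflicts


# Made-up coordinates on made-up chromosomes, for illustration only.
curated = [Region("chrA", 1_000_000, 2_000_000)]
calls = [
    SubmittedCall("lab_a", Region("chrA", 900_000, 2_100_000), "pathogenic"),
    SubmittedCall("lab_b", Region("chrA", 950_000, 2_050_000), "uncertain"),
    SubmittedCall("lab_c", Region("chrB", 900_000, 2_100_000), "benign"),
]
conflicts = find_conflicts(calls, curated)
print([c.lab for c in conflicts])  # prints ['lab_b']
```

In this toy input only the second call is flagged: it spans the curated region but carries an "uncertain" interpretation, so it would be returned to the submitting laboratory for review.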
Erik Thorland (Mayo Clinic) provided an update on an expert-level curation process, based on an evidence-based protocol developed by the ISCA Consortium [Riggs et al., 2012], that is currently being applied to create a dosage-sensitivity map of the genome for use in clinical interpretation. Of the 385 genes that are currently targeted with increased probe coverage on the ISCA 180k array, 264 had been completely reviewed at the time of the meeting, and haploinsufficiency and triplosensitivity ratings had been assigned to indicate the level of evidence for pathogenicity that is available in the scientific literature. This information is publicly available at www.ncbi.nlm.nih.gov/projects/dbvar/ISCA, and efforts are ongoing to engage more of the genetics community in this large-scale project.
Finally, Heidi Rehm (Harvard Medical School) proposed curation considerations for data derived from molecular-based techniques that are intended for developing a unified clinical genomics database, including standard terminology and interpretation guidelines, maximizing phenotypic data quality, and optimal data capture methods. There is support for this effort from the American College of Medical Genetics and Genomics and the College of American Pathologists, and many clinical laboratories have indicated commitment to data sharing. There are also several disease-specific subgroups that have been formed to facilitate the collection of data within specialty groups, which will act as models for the larger, unified variant database.
The Importance of Phenotypic Data
Though there has been great focus on obtaining large volumes of cases for statistical analyses, P values alone do not necessarily provide the correlation to a specific phenotype, which is essential information for the ordering clinician and families in particular. The ISCA Consortium has previously devoted efforts to enhance the amount of phenotypic data that it collects. To make this information useful, it is necessary to use a standardized vocabulary to ensure that terms are consistent (i.e., a particular term means the same thing each time it is used), the data are generalizable and nonidentifying, and the data are easily indexable and searchable. For its database, the ISCA Consortium has opted to use Human Phenotype Ontology (HPO) codes. Peter Robinson (Charité-Universitätsmedizin Berlin) discussed the benefits of using an ontology-based system, including the idea that the relationships between specific phenotypes and their hierarchical associations within general body systems can, when associated with specific diseases and/or genes, help us understand the relationships between those phenotypes, diseases, genes, and gene families, adding to our knowledge of how genetic variation contributes to human disease.
Steven Van Vooren (Cartagenia) illustrated the ISCA Consortium's various processes for collecting phenotype information, including a one-page phenotype form and an algorithm to transform free text submitted by laboratories into HPO terms. He also discussed some proof-of-principle data showing that this type of information available within the ISCA Consortium database could be used to elucidate genotype–phenotype correlations.
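As a rough illustration of what transforming free text into HPO terms involves, the sketch below uses a simple phrase lookup. It is not the algorithm Van Vooren described, whose details are not given here; the `LEXICON` table and the `free_text_to_hpo` function are hypothetical, the lexicon is a tiny hand-picked sample, and the HPO identifiers should be checked against the current HPO release before any real use.

```python
import re

# Hypothetical mini-lexicon: lower-case phrase -> HPO identifier.
LEXICON = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "intellectual disability": "HP:0001249",
    "developmental delay": "HP:0001263",
    "microcephaly": "HP:0000252",
}


def free_text_to_hpo(text: str) -> list:
    """Return the sorted, de-duplicated HPO IDs whose phrases occur in text."""
    normalized = text.lower()
    found = set()
    for phrase, hpo_id in LEXICON.items():
        # Word boundaries keep a phrase from matching inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            found.add(hpo_id)
    return sorted(found)


example = "Global developmental delay, seizures; microcephaly noted"
print(free_text_to_hpo(example))
# prints ['HP:0000252', 'HP:0001250', 'HP:0001263']
```

A lookup like this misses negation ("no seizures"), misspellings, and synonyms outside the table, which is part of why a standardized vocabulary captured at submission time is preferable to free text.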
Marc Williams (Geisinger Health Systems) presented a strategy for collecting phenotype information from electronic health records (EHRs), and discussed the need to develop standard, structured data elements to facilitate this collection. A dedicated effort by bioinformatics professionals would be necessary to build and pilot the structures needed to accomplish this, but, if successful, it could greatly increase the pace of knowledge acquisition.
An example of how phenotypes could be used to further analyze variants was provided by Chad Shaw (Baylor College of Medicine). He discussed how a phenotypically contextualized pathogenicity score, combining factors such as high-resolution phenotype information, gene expression, Gene Ontology, protein interaction data, and so on, could be applied to potentially enhance the process of CNV interpretation. He described this in practice using the phenotype of “epilepsy” as a specific example, and determined an epilepsy-specific pathogenicity score for each detected CNV using this method. CNVs harbored by patients with epilepsy had significantly higher epilepsy-specific pathogenicity scores than those of patients referred for nonneurologic indications. His conclusion was that an integrative approach to confirm pathogenicity of rare and nonrecurrent CNVs using medical records and bioinformatics may enhance interpretation.
Translating Data into Clinical Actionability
Of all the benefits the unified clinical genomics database hopes to provide, one of the most important will be cataloging variants and evidence in a useful way, allowing clinicians and other experts to turn genomic testing information into clinically actionable directives for patient care. Deanna Church (NCBI) discussed the shift that is occurring from understanding the structure of the genome to understanding the biology and science behind disease and implementing this knowledge to improve healthcare [Church et al., 2011]. Centralized databases such as dbVar and dbSNP aid in this process by compiling and curating information from multiple submissions for researchers to use. Other databases at NCBI such as the Assembly database (http://www.ncbi.nlm.nih.gov/assembly) allow for standardization of genome representation. Robust data management is critical for assessing the analytical validity of individual variant calls. Data from the Genome Reference Consortium (http://genomereference.org) provided several examples of false-positive variant calls that were a consequence of genome misassembly.
Erin Ramos of the National Human Genome Research Institute (NHGRI) discussed the need for systematic collection and evaluation of variants of possible clinical significance identified by genome-wide association studies and sequencing studies. She emphasized the importance of reaching a consensus on which variants are actionable and on how to make this information available to clinicians, which illustrates the need for a centralized resource for clinical variants that is compatible with the EHR. Workshops have been held to discuss this issue and have supported publicly available tools to integrate genomic information and actionable variants into the EHR. Funding opportunities have been developed to assist in the development of this resource, referred to as the Clinically Relevant Genetic Variants Resource (CRVR), as well as to further explore the incorporation of genomic findings into clinical care (http://grants.nih.gov/grants/guide/rfa-files/rfa-hg-12-016.html).
Christine Micheel (Vanderbilt), speaking on behalf of the My Cancer Genome project (www.MyCancerGenome.org), provided an example of a related effort to connect genotype information with clinical actionability already defined and in development within the cancer community. Her group has developed a database to assist clinicians and researchers with the move from a traditional view of cancer based on location and histology to a system based on molecular features such as somatic mutations. Through the Vanderbilt Personalized Cancer Medicine Initiative, they found that 66% of melanoma tumors and 46% of lung tumors possessed actionable mutations. The project is integrated into the EHR, where genomic test results are displayed and linked to supporting information. The system currently covers 16 cancers, 24 genes, and 272 disease–gene–variant relationships.
Demonstration of clinical actionability is important not only to patient medical management and outcomes, but also for the reimbursement of CMA testing. Jay Ellison (Signature Genomics Laboratories, PerkinElmer, Inc.) reviewed microarray reimbursement issues and found that one reason cited by many third-party payers reluctant to cover microarray testing was lack of demonstrated clinical utility. He discussed the clinical utility project from his laboratory, which monitored follow-up actions of diagnoses made by microarray, queried physicians regarding follow-up actions, and tallied all actionable diagnoses in their database. The detection rate for actionable diagnoses was calculated on 23,156 postnatal microarray cases tested on oligonucleotide platforms. The total detection rate of pathogenic CNVs was 15.4%, with 35% of these cases (5.4% overall) having actionable features that require specific clinical follow-up. His recommendations for what is needed to increase insurance reimbursement rates included more published data on patient outcomes. Efforts surrounding clinical actionability can be enhanced by the direct patient participation that patient advocacy organizations can facilitate.
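The relationship between the percentages reported by Ellison can be checked with a short calculation. This is only an arithmetic sketch using the rounded figures quoted above; the variable names are ours, and the underlying case counts were not reported here, so none are derived.

```python
# Rounded figures as quoted in the text.
pathogenic_rate = 0.154      # pathogenic CNVs among all postnatal cases
actionable_fraction = 0.35   # share of pathogenic cases with actionable features

# Actionable diagnoses as a fraction of all cases tested.
overall_actionable_rate = pathogenic_rate * actionable_fraction

print(f"{overall_actionable_rate:.1%}")  # prints 5.4%
```

The product, 0.154 × 0.35 ≈ 0.0539, rounds to the 5.4% overall rate given in the text.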
Clinically actionable results are not necessarily limited to those related to the initial reason for referral; as CMA is a genome-wide test, incidental or secondary findings do arise, and can also contribute to a patient's medical management. Surabhi Mulchandani (Children's Hospital of Philadelphia) presented examples of such incidental findings, including unexpected familial relationships, later onset conditions, pathogenic alterations unrelated to the reason for testing, pathogenic findings in parents tested for unrelated reasons, and carrier status in minors, from a laboratory perspective. She used case presentations to discuss these incidental findings (identified in 78/7,200 cases) and concluded that guidelines and educational resources are needed.
Integration of the Unified Clinical Genomic Database with Complementary Efforts
To make a unified clinical genomics database truly useful, a number of issues must be considered, particularly its ability to display data in an intuitive, user-friendly format. One consideration that has been raised in relation to large-scale database efforts in the past is the ability of individuals to submit the same data to multiple, separate efforts. Paul Law (Kennedy Krieger Institute) discussed the use of GUIDs (globally unique identifiers) to facilitate the entry of information to multiple databases. This process would help centralize and share data for individuals across and within sites in a safe, controlled, and private manner, as well as facilitate requests for additional information from the site of origin. There are some limitations to using GUIDs, however, including determining the best way to choose IDs that are secure while still allowing for matching between sites. For example, GUIDs are typically generated using a combination of pieces of identifying information such as part of a social security number, a year of birth, and so on. People who know a subject's identifying information (e.g., friends, family, and colleagues) may be able to generate and match the GUID. Thus, the process of creating GUIDs must be carefully considered as the database system is built and may require updates along the way.
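The trade-off between matching across sites and resisting re-identification can be made concrete with a minimal Python sketch. It is not the GUID scheme Law presented: the `make_guid` function, the choice of inputs (last four digits of a social security number and year of birth), the SHA-256 hash, and the optional shared secret key are all assumptions made for illustration.

```python
import hashlib
import hmac


def make_guid(ssn_last4: str, birth_year: int, secret_key: bytes = b"") -> str:
    """Derive a deterministic identifier from pieces of identifying information.

    Hypothetical scheme for illustration only.
    """
    # Normalize so that every site formats the inputs identically.
    message = f"{ssn_last4.strip()}|{birth_year}".encode("utf-8")
    if secret_key:
        # Keyed hash: only holders of the shared key can reproduce the GUID.
        digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    else:
        # Unkeyed hash: anyone who knows the inputs can reproduce the GUID.
        digest = hashlib.sha256(message).hexdigest()
    return digest[:32]


# Two sites entering the same (made-up) details obtain the same identifier.
site_a = make_guid("1234", 2005)
site_b = make_guid(" 1234 ", 2005)
print(site_a == site_b)  # prints True

# An acquaintance who knows the same details can regenerate the unkeyed GUID.
acquaintance = make_guid("1234", 2005)
print(acquaintance == site_a)  # prints True

# With a key shared only among participating sites, the guess no longer matches.
keyed = make_guid("1234", 2005, secret_key=b"shared-secret")
print(acquaintance == keyed)  # prints False
```

The keyed variant shows one possible mitigation, at the cost of having to distribute and protect the key across sites; whether such a design suits a given database is a question for the groups building it.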
Having a unified clinical genomics database that is supported by an organization of laboratories, clinicians, and researchers united in their goals to develop best-practice standards for various aspects of the genetic testing process will ultimately help government agencies in their quest to more formally evaluate these emerging technologies. Wendy Rubinstein of the National Institutes of Health Genetic Testing Registry (www.ncbi.nlm.nih.gov/gtr) reported that the goals of the testing registry were to (1) aggregate and provide information about genetic tests for public access, (2) enable clinicians to find tests and gauge their value, and (3) enable experts to formally evaluate tests and develop evidence-based standards. She demonstrated the various features of the Website and requested feedback from the clinical, laboratory, and research communities.
Elizabeth Mansfield from the US Food and Drug Administration (FDA) discussed specific challenges in evaluating CMA testing platforms that include (1) how to know what was missed, (2) how to account for variability in gains and losses, and (3) how to provide size validation. The FDA is moving toward platform validation but believes that clinical interpretation by certified professionals will be necessary for CMA testing. The FDA is working with other groups to ensure the availability of samples for validation to allow platforms to show precision and reproducibility. She ended her presentation by stressing the importance of databases for the test validation process.
Planning the Future: Next Steps for the Consortium
Overall, the 2012 ISCA Consortium Meeting laid the foundation for the expansion of the traditionally cytogenetics-focused group to include the DNA sequencing community. Presentations highlighted both the work that the ISCA Consortium has done to date as well as the plans for similar efforts within the sequencing community. The success of the ISCA Consortium and locus-specific databases will serve as models for this expanded effort. Efforts to secure funding for this endeavor are underway. To promote this effort as a truly unified one, the ISCA Consortium announced plans for a name change to reflect all aspects of genomic variation it now represents; following the meeting, a decision was made to call this expanded consortium the International Collaboration for Clinical Genomics (ICCG) to represent laboratory and clinical genomics efforts. The ICCG intends to maintain active collaborations with other groups and initiatives working toward the shared goal of improved patient care such as the aforementioned CRVR, the Human Variome Project (www.humanvariomeproject.org), and others and will continue to encourage community support and input. All meeting attendees were urged to continue to participate in the development of communal resources by sharing data, voicing concerns, offering potential solutions, and providing valuable feedback.
We would like to cordially thank the official sponsors of the 2012 ISCA Consortium Meeting: Agilent Technologies (Santa Clara, CA; http://www.agilent.com), Affymetrix (Santa Clara, CA; http://www.affymetrix.com), Cartagenia (Leuven, Belgium; http://www.cartagenia.com), GenomeQuest, Inc. (Westborough, MA; http://www.genomequest.com), Roche NimbleGen, Inc. (Madison, WI; http://www.nimblegen.com), Illumina (San Diego, CA; http://www.illumina.com), Life Technologies (Carlsbad, CA; http://www.lifetechnologies.com), McKesson (San Francisco, CA; http://www.mckesson.com), NextBio (Santa Clara, CA; http://www.nextbio.com), and SciGene (Sunnyvale, CA; http://www.scigene.com). We also thank all speakers for their feedback during the preparation of this manuscript, and for their overall contributions to the 2012 ISCA Consortium meeting.
Disclosure statement: The authors declare no conflicts of interest.