Researchers, ethicists, and bioinformatics experts have long had to balance the need to protect the privacy of genomic research participants with the need for information that could lead to life-saving treatments. Recently, a team of researchers at the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts, reignited interest in this issue after they determined the surnames of people whose de-identified personal genomes are publicly available from the National Institutes of Health (NIH).
By profiling short tandem repeats on the Y chromosome and querying free, publicly accessible online databases, the researchers established the identities of almost 50 people [Gymrek et al., 2013]. While the use of Y-chromosome tandem repeats to identify individuals is not new, the researchers did so without DNA samples as references and with relatively quick online searches [Gymrek et al., 2013; Jobling, 2001].
Led by Yaniv Erlich, PhD, a geneticist and fellow at the Whitehead Institute, the researchers first tried to identify the surnames of 10 males whose genomic data were included in the 1000 Genomes Project. They cross-referenced the sequence data for Y-chromosome short tandem repeats (Y-STRs) with free, publicly available genetic genealogy data sets, which often contain genetic information voluntarily posted by family members. Those matches pointed to possible last names, ages, and states of residence. With this information, the researchers turned to websites that aggregate public data such as addresses, phone numbers, and social networking profiles to reveal donors' identities.
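The two-step lookup described above can be illustrated with a minimal Python sketch. It is not the researchers' actual method: the exact-match scoring, the function names (`shared_markers`, `rank_surnames`, `narrow`), the surnames, the repeat counts, and the public records are all hypothetical and chosen only for illustration. Only the general idea (match a Y-STR profile to surnames, then narrow candidates by birth year and state) comes from the account above.

```python
# Toy illustration only: all profiles, surnames, and records are hypothetical.

# A Y-STR profile maps a marker name to a repeat count (toy values).
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

# Hypothetical genealogy entries: (surname, Y-STR profile).
genealogy = [
    ("Avery", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Brook", {"DYS19": 14, "DYS390": 23, "DYS391": 11, "DYS393": 13}),
    ("Cole", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 13}),
]

# Hypothetical aggregated public records.
public_records = [
    {"name": "J. Avery", "surname": "Avery", "birth_year": 1950, "state": "UT"},
    {"name": "K. Avery", "surname": "Avery", "birth_year": 1972, "state": "UT"},
    {"name": "L. Brook", "surname": "Brook", "birth_year": 1950, "state": "UT"},
]


def shared_markers(a, b):
    """Count markers present in both profiles with identical repeat counts."""
    return sum(1 for marker, count in a.items() if b.get(marker) == count)


def rank_surnames(profile, records, min_shared=3):
    """Step 1: rank surnames by how many markers match the query profile."""
    best = {}
    for surname, candidate in records:
        score = shared_markers(profile, candidate)
        if score >= min_shared:
            best[surname] = max(score, best.get(surname, 0))
    # Highest score first; ties broken alphabetically.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


def narrow(people, surname, birth_year, state):
    """Step 2: filter public records by surname, birth year, and state."""
    return [
        p for p in people
        if p["surname"] == surname
        and p["birth_year"] == birth_year
        and p["state"] == state
    ]


ranked = rank_surnames(query, genealogy)
top_surname = ranked[0][0]
candidates = narrow(public_records, top_surname, birth_year=1950, state="UT")
print(ranked)
print([p["name"] for p in candidates])
```

In this toy setup the surname alone leaves more than one candidate, and the year of birth and state reduce the list to one, which shows why each extra piece of metadata matters to the risk of re-identification.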
Dr. Erlich's team informed NIH of their findings well before their paper was published. In response, NIH reduced the risk of a privacy breach by removing information about age from public view, says Laura Rodriguez, PhD, Director, Division of Policy, Communications, and Education, and Acting Chief of the Genomic Healthcare Branch at the National Human Genome Research Institute in Bethesda, Maryland.
Researchers first found the Y-STRs in the genome of biologist J. Craig Venter, who published his entire genome in 2007. They put this information into a genetic genealogy website, which returned a top hit for his last name. With a year of birth and a state of residence, an Internet search identified Venter.
The researchers then tried the technique with 10 unidentified men whose DNA sequences had been analyzed and posted online as part of the 1000 Genomes Project. These men and members of their families also participated in a separate NIH endeavor involving genetic samples. Information about the samples and the donors' relationship to one another appeared online and was available from a tissue repository. Using the same basic technique, Dr. Erlich identified some of the men, plus their relatives who had provided genetic samples.
Making a Point
The researchers' intent was not malicious, says Dr. Erlich. Rather, they wanted to provide a “snapshot of privacy challenges with genetic data.” Their research follows a National Cancer Institute meeting in June 2012 at which some participants maintained privacy risks were small because great effort is needed to find someone, according to Dr. Erlich.
Once the team had spent the two days needed to find Y-STRs, the search for subjects' identities was quick. “About after 30 minutes of work you know whether you are on the right track,” says Dr. Erlich, whose team revealed the identities of members of two extended families in less than seven hours. Identifying individuals in a third family was harder because they had left smaller digital traces, he adds.
Dr. Erlich predicts there will be more opportunities to breach individuals' privacy because people are increasingly using the Internet and leaving longer digital trails. For example, the Department of Veterans Affairs' (VA) Million Veteran Program (MVP) is building a mega-database to hold genomic and clinical information for future studies about veterans who receive their care from VA. However, the database is not set up for open access.
Thinking About Risk
In a commentary, Dr. Rodriguez and her colleagues question whether “complete de-identification of many types of human data is realistic in today's information-rich society,” and she suggests framing risk “along a continuum.”
That's a good idea, says Bradley A. Malin, PhD, Associate Professor of Biomedical Informatics and Computer Science at the Vanderbilt School of Medicine in Nashville, Tennessee, and Director of the Health Information Privacy Lab, part of NIH's Electronic Medical Records and Genomics (eMERGE) Consortium. “No system is impregnable to risk,” Dr. Malin says, but explaining to potential research subjects that risk exists on a continuum is tricky. “Behavioral research shows that people aren't good at reasoning about risk. You can make people overly concerned. Explanations should balance possible negative consequences with benefits, and compare degree of risk of identification to a situation people understand, like being in a car accident or winning the lottery,” Dr. Malin advises.
Sources of free, open-access genomic data in the U.S. are few, notes Dr. Rodriguez. They include the NIH's International HapMap, the 1000 Genomes Project, and the Personal Genome Project, which has a 24-page consent form and requires subjects to pass an enrollment test that ensures they understand and consent to the risks involved in making data and samples public.
Dr. Rodriguez and Dr. Erlich advise genetic counselors to have an honest discussion about the benefits and risks with research participants. “We need to be aware of the information-rich world in which we live,” says Dr. Rodriguez. “Connections can be made between sets of information more than we predicted in the past, or know of for the future. We need to be cognizant of, and transparent about, this reality.”
“Respect participants. Tell the truth. Say there's a privacy risk but also a big benefit for kids,” Dr. Erlich adds. “We can help them by doing research.”