Sharing genetic risk scores can unknowingly reveal secrets

Genetic data can be analyzed to estimate the risk of certain conditions

Science Photo Library / Alamy

Genetic risk scores that summarize a person’s likelihood of developing certain health conditions can be exploited through mathematical tricks to reveal hidden details about their DNA.

The method could theoretically be used by health insurance companies to reconstruct genetic data from a summary genomic report, revealing health risks the patient has not disclosed. Alternatively, people who share their scores anonymously could be identified by extracting the genetic data and querying public genealogy databases.

Polygenic risk scores measure the combined impact of tens to thousands of single-letter variations in the genome, known as single-nucleotide polymorphisms (SNPs). Used by researchers and DNA testing companies like 23andMe to summarize potential health risks, the scores are sometimes shared publicly, for example by people asking for advice on interpreting them.

Unpacking a polygenic risk score is like trying to figure out a phone number when you know only that its digits add up to 52. It is an example of the knapsack problem in mathematics, which is known for being computationally difficult. Because of this, the scores have been considered a low privacy risk.
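To get a feel for the phone number analogy, the short sketch below counts how many digit strings share a given digit sum. The 10-digit length is an assumption made for illustration (the analogy does not specify one), and `count_digit_strings` is a hypothetical helper name.

```python
def count_digit_strings(n_digits, target):
    """Count sequences of n_digits decimal digits whose digits sum to target."""
    # ways[s] = number of digit sequences built so far that sum to s
    ways = [1] + [0] * target
    for _ in range(n_digits):
        new = [0] * (target + 1)
        for s, count in enumerate(ways):
            if count:
                for d in range(10):
                    if s + d <= target:
                        new[s + d] += count
        ways = new
    return ways[target]


# Assumed for illustration: a 10-digit phone number whose digits add up to 52
candidates = count_digit_strings(10, 52)
print(f"{candidates:,} ten-digit strings have digits summing to 52")
```

Knowing only the sum leaves a very large number of candidates, which is why a single summary number looks safe at first glance.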

However, each SNP value used in a risk score is multiplied by an extremely precise weight – up to 16 digits long – that reflects its contribution to overall disease risk. This makes small risk models vulnerable to attack.
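A rough sketch of why precise weights matter: if a score is a weighted sum of SNP values, each coded here as 0, 1 or 2 copies of a variant, then high-precision weights tend to give almost every genotype its own score, while coarsely rounded weights make many genotypes collide. The eight random weights and the 0/1/2 coding are assumptions for illustration, not values from any real risk model.

```python
from itertools import product
import random


def polygenic_score(genotype, weights):
    """Weighted sum of SNP values (assumed coded as 0, 1 or 2 variant copies)."""
    return sum(g * w for g, w in zip(genotype, weights))


def count_distinct_scores(weights, decimals=12):
    """Count how many different scores all possible genotypes can produce."""
    scores = set()
    for genotype in product((0, 1, 2), repeat=len(weights)):
        scores.add(round(polygenic_score(genotype, weights), decimals))
    return len(scores)


rng = random.Random(0)
# Hypothetical weights: 8 SNPs with full floating-point precision
precise_weights = [rng.uniform(-0.5, 0.5) for _ in range(8)]
# The same weights rounded to a single decimal place
coarse_weights = [round(w, 1) for w in precise_weights]

n_genotypes = 3 ** len(precise_weights)
distinct_precise = count_distinct_scores(precise_weights)
distinct_coarse = count_distinct_scores(coarse_weights)

print(f"possible genotypes:            {n_genotypes}")
print(f"distinct scores, precise:      {distinct_precise}")
print(f"distinct scores, 1-decimal:    {distinct_coarse}")
```

The more distinct scores there are relative to possible genotypes, the more a single score narrows down the DNA behind it.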

“Because the final polygenic risk score is limited by a limited number of ways you can arrive at this number, and a statistically probable arrangement of the underlying SNPs, it can be derived with a high degree of accuracy,” says Gamze Gürsoy of Columbia University in New York.

Gürsoy and Kirill Nikitin, also at Columbia, ran 298 polygenic risk models using 50 SNPs or fewer on genetic data from 2,353 individuals. Working backwards, they calculated all the possible genomes that could have produced each given score, filtering out those with many unusual mutations.
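The working-backwards step can be illustrated with a brute-force sketch for a tiny model: enumerate every genotype, keep those whose weighted sum matches the published score, and rank the survivors by how common they would be in the population. This is not the researchers' actual algorithm. The weights, allele frequencies, tolerance and Hardy-Weinberg ranking are all assumptions made for illustration, and exhaustive enumeration only works for a handful of SNPs.

```python
from itertools import product
from math import isclose, log


def genotype_log_prob(genotype, allele_freqs):
    """Log-probability of a genotype, assuming Hardy-Weinberg proportions."""
    total = 0.0
    for g, p in zip(genotype, allele_freqs):
        probs = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
        total += log(probs[g])
    return total


def recover_genotypes(score, weights, allele_freqs, tol=1e-9):
    """Return all genotypes matching the score, most plausible first."""
    candidates = [
        g
        for g in product((0, 1, 2), repeat=len(weights))
        if isclose(sum(x * w for x, w in zip(g, weights)), score, abs_tol=tol)
    ]
    return sorted(
        candidates,
        key=lambda g: genotype_log_prob(g, allele_freqs),
        reverse=True,
    )


# Hypothetical 6-SNP model with high-precision weights
weights = [
    0.1834502719463358, -0.0729461035581127, 0.3310947265518846,
    0.0458812930174652, -0.2197365408812093, 0.1276650394417285,
]
# Hypothetical population frequencies of each variant
allele_freqs = [0.30, 0.12, 0.45, 0.05, 0.22, 0.38]

secret = (1, 0, 2, 0, 1, 1)  # the genotype an attacker does not know
observed = sum(x * w for x, w in zip(secret, weights))

recovered = recover_genotypes(observed, weights, allele_freqs)
print(f"{len(recovered)} candidate genotype(s); top candidate: {recovered[0]}")
```

With weights this precise, the list of matching genotypes is typically very short, and ranking by population frequency plays the role of filtering out candidates with many unusual mutations.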

As one SNP can be used by multiple polygenic risk models, Gürsoy and Nikitin were able to chain their attack, using SNPs revealed by smaller models to help solve larger ones.

They were able to reconstruct the donor genotype with 94.6 percent accuracy, correctly predicting 2,450 SNPs per individual. Tests showed that 27 SNPs were enough to identify an individual in a pool of half a million samples, and family members could be predicted with up to 90 percent precision. Individuals of African and East Asian descent were more easily identified as they are less well represented in genetic databases.

According to Gürsoy, 447 small, high-precision models in a public database of polygenic scores are vulnerable to this attack.

“We wanted to point out that the risk is low, but under [some conditions] there may still be some leakage,” says Gürsoy. “We should consider this when designing research studies, especially if we involve vulnerable populations.”

Ying Wang of Massachusetts General Hospital says that existing data protection and computational bottlenecks limit the risk of polygenic risk scores being exploited in this way. “The results may serve as a warning that small models should be treated as potentially sensitive data in clinical reporting and informed consent discussions,” she says.
