Significant SNPs have limited prediction ability for thyroid cancer

Recently, five genetic variants significantly associated with thyroid cancer (rs965513, rs944289, rs116909374, rs966423, and rs2439302) were discovered and validated in two independent GWAS and numerous case–control studies conducted in different populations. We genotyped these five single nucleotide polymorphisms (SNPs) in a Han Chinese population and performed thyroid cancer risk prediction with nine machine learning methods. We found that four SNPs were significantly associated with thyroid cancer in the Han Chinese population, while no polymorphism was observed for rs116909374. Small familial relative risks (1.02–1.05) and limited power to predict thyroid cancer (AUCs: 0.54–0.60) indicate limited clinical potential. The four significant SNPs have limited prediction ability for thyroid cancer.


Cancer Medicine, Open Access

Introduction
Thyroid cancer is the fifth most common cancer in women, and its incidence is increasing. It is considered to have one of the highest familial risks among all cancers [1,2]. Most common diseases are caused by multiple genetic loci rather than a few. In the last 2 years, two independent genome-wide association studies (GWAS) have been conducted to identify single nucleotide polymorphisms (SNPs) associated with thyroid cancer risk. Five SNPs (rs965513, rs944289, rs116909374, rs966423, and rs2439302) that were highly significantly associated with papillary thyroid carcinoma (PTC) were discovered by these GWAS. In addition, these five SNPs were validated by subsequent case-control studies in more than three different populations (Han Chinese, Ohio, Poland, etc.; Table 1).
To examine the prediction ability of variants with highly significant associations, we used all five SNPs to predict thyroid cancer with nine classification methods (K-nearest neighbors, logistic regression, naïve Bayes, random forest, support vector machine, Bayesian additive regression trees (BART), recursive partitioning, fuzzy rule-based system, and boosting). Contrary to our expectation, we found that although these SNPs were significantly associated with thyroid cancer, their accuracy in predicting thyroid cancer was very low.

Methods
The five SNPs were genotyped in 845 PTC cases and 1005 controls from the Han Chinese population using the SNaPshot multiplex single-nucleotide extension system. PTC patients who were treated in the Department of Head and Neck Surgery, Fudan University Shanghai Cancer Center, Shanghai, China, from January to December 2010 were enrolled in this study. All patients were ethnically Han Chinese and came from Eastern China. A total of 1005 cancer-free unrelated individuals were recruited from the Taizhou Longitudinal Study (TZL). Details of the SNPs (Table S1) and primers are listed in our previous article [3].
The relative risk to daughters of an individual affected by thyroid cancer that is attributable to a given SNP is calculated by the formula

$$\lambda^* = \frac{p(pr_2 + qr_1)^2 + q(pr_1 + q)^2}{(p^2 r_2 + 2pqr_1 + q^2)^2},$$

where $p$ is the frequency of the risk allele, $q = 1 - p$, and $r_1$ and $r_2$ are the relative risks (estimated by odds ratios [ORs]) for heterozygotes and rare homozygotes, respectively, relative to common homozygotes in the population [4,5]. Assuming a multiplicative interaction, the proportion of the familial risk attributable to the SNP is calculated as $\log(\lambda^*)/\log(\lambda_0)$, where $\lambda_0$ is the overall familial relative risk (FRR), estimated to be 8.48 for thyroid cancer [1]. Gender- and age-matched sets of cases and controls were constructed by resampling 1000 times.
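To make the FRR formula concrete, the following minimal Python sketch evaluates the per-SNP familial relative risk and the corresponding share of the overall familial risk. The function names and the input values (allele frequency 0.10, heterozygote OR 1.5, homozygote OR 1.5 squared) are hypothetical illustrations, not values from Table 2, and natural logarithms are assumed (the ratio of logs does not depend on the base).

```python
import math

def snp_familial_relative_risk(p, r1, r2):
    """Per-SNP familial relative risk (lambda*).

    p  : frequency of the risk allele (0 < p < 1)
    r1 : relative risk of heterozygotes vs. common homozygotes
    r2 : relative risk of rare homozygotes vs. common homozygotes
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    numerator = p * (p * r2 + q * r1) ** 2 + q * (p * r1 + q) ** 2
    denominator = (p ** 2 * r2 + 2 * p * q * r1 + q ** 2) ** 2
    return numerator / denominator

def proportion_of_familial_risk(lambda_star, lambda_0=8.48):
    """Share of the overall FRR explained by one SNP (multiplicative model)."""
    return math.log(lambda_star) / math.log(lambda_0)

# Hypothetical SNP, for illustration only (not study data).
p, r1 = 0.10, 1.5
r2 = r1 ** 2  # assumes a multiplicative (log-additive) allele effect
lam = snp_familial_relative_risk(p, r1, r2)
print(f"lambda* = {lam:.4f}")
print(f"proportion of familial risk = {proportion_of_familial_risk(lam):.2%}")
```

A useful sanity check is that a SNP with no effect (r1 = r2 = 1) gives a familial relative risk of exactly 1 and therefore explains none of the familial risk.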
Table 1 notes: GWAS, genome-wide association studies; OR, odds ratio. (1) ORs were calculated based on the multiplicative model. For the combined study populations, the OR values were estimated using the Mantel-Haenszel model. (2) ORs were calculated for the risk allele using multiple logistic regression analyses.

Nine machine learning methods were used to distinguish PTC cases from healthy individuals: K-nearest neighbors [6], logistic regression, naïve Bayes [7], random forest [8], support vector machine [7], BART [9], boosting, recursive partitioning, and a fuzzy rule-based system [10]. The parameters of the models were optimally selected. Classification accuracy, sensitivity, specificity, and AUC were used to evaluate the performance of the methods and were calculated by 10-fold cross-validation.
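The evaluation protocol (k-fold cross-validation scored by accuracy, sensitivity, and specificity) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's pipeline: the threshold classifier on the total risk-allele count is a hypothetical stand-in for the nine methods, the genotypes are simulated, and held-out predictions are pooled across folds, which may differ from how the reported metrics were aggregated.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def _ratio(num, den):
    return num / den if den else float("nan")

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }

def fit_threshold(X_train, y_train):
    """Hypothetical toy model: best cut-off on the total risk-allele count."""
    scores = [sum(x) for x in X_train]
    best_t, best_acc = 0, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == bool(label)
                  for s, label in zip(scores, y_train)) / len(y_train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict_threshold(threshold, x):
    return 1 if sum(x) >= threshold else 0

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, pool all predictions."""
    y_true, y_pred = [], []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        y_true.extend(y[i] for i in held_out)
        y_pred.extend(predict(model, X[i]) for i in held_out)
    return classification_metrics(y_true, y_pred)

# Simulated genotypes (risk-allele counts 0/1/2 at five SNPs); not study data.
rng = random.Random(1)

def simulate_genotypes(n, risk_allele_freq, n_snps=5):
    return [tuple(sum(rng.random() < risk_allele_freq for _ in range(2))
                  for _ in range(n_snps)) for _ in range(n)]

X = simulate_genotypes(200, 0.35) + simulate_genotypes(200, 0.30)
y = [1] * 200 + [0] * 200
print(cross_validate(X, y, fit_threshold, predict_threshold))
```

Because every sample is held out exactly once, each prediction that enters the metrics comes from a model that never saw that sample during fitting.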

Marginal FRR of the significant SNPs
Previous studies showed that the five SNPs, which have large ORs, were significantly associated with thyroid cancer in various populations (Table 1). Our previous data also showed that these SNPs were significantly associated with thyroid cancer in the Chinese population (the seventh study in Table 1). In the present study, we estimated the FRR for the five significantly associated SNPs in the Chinese population. We found that the FRRs were low, ranging from 1.02 to 1.05. Together, these five SNPs accounted for only 5.98% of the overall familial risk (Table 2), which is very close to the figure for the Polish population (about 6%) [11]. Our finding suggests that the majority of the heritability remains undiscovered.
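The combined share of familial risk can be illustrated with a short Python sketch. It assumes that, under the multiplicative model, the per-SNP log familial relative risks add across independent SNPs; the function name is hypothetical and the inputs are placeholders chosen from the reported 1.02 to 1.05 range, not the per-SNP values of Table 2.

```python
import math

def combined_familial_risk_proportion(frrs, overall_frr=8.48):
    """Fraction of the overall FRR explained by several SNPs together,
    assuming their log familial relative risks are additive."""
    return sum(math.log(f) for f in frrs) / math.log(overall_frr)

# Bounds if every one of five SNPs sat at either end of the reported range.
lower = combined_familial_risk_proportion([1.02] * 5)
upper = combined_familial_risk_proportion([1.05] * 5)
print(f"between {lower:.1%} and {upper:.1%} of the familial risk")
```

Even at the upper end of this range, five such SNPs would explain only a small fraction of an overall FRR of 8.48, which is the point the reported 5.98% makes.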

Genetic risk prediction for thyroid cancer based on five SNPs
The five significant SNPs were used to predict the risk of thyroid cancer with nine classification methods. The results are summarized in Table 3. The prediction accuracies of the nine methods ranged from 0.52 to 0.57, while the areas under the receiver operating characteristic curve (AUCs) ranged from 0.54 to 0.60. The sensitivity of the prediction (0.28-0.48) was much lower than the specificity (0.56-0.76), which suggests that the clinical application value might be limited (Table 3). In addition, the AUCs of classification based on five SNPs and gender, and on five SNPs, gender, and age, ranged from 0.49 to 0.58 and from 0.50 to 0.59, respectively. This indicates that including gender and age information does not improve prediction (Tables S2 and S3, Fig. S1).
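For readers less familiar with the AUC, the following Python sketch computes it as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. This rank-based definition is a standard equivalent of the area under the ROC curve; the function name and the toy scores are hypothetical and are not study data.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    y_true : 0/1 labels (1 = case)
    scores : predicted risk scores, higher meaning more case-like
    """
    case_scores = [s for t, s in zip(y_true, scores) if t == 1]
    control_scores = [s for t, s in zip(y_true, scores) if t == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for sc in case_scores:
        for sn in control_scores:
            if sc > sn:
                wins += 1.0
            elif sc == sn:
                wins += 0.5  # ties are common with discrete genotype scores
    return wins / (len(case_scores) * len(control_scores))

# Toy example: risk-allele counts used directly as scores.
labels = [1, 1, 0, 0]
allele_counts = [2, 1, 1, 0]
print(auc_from_scores(labels, allele_counts))
```

Under this reading, an AUC of 0.5 corresponds to ranking cases and controls no better than chance, so values of 0.54 to 0.60 indicate only weak separation.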

Conclusion
In the present study, we estimated the FRR and evaluated the thyroid cancer prediction accuracy of the five SNPs that showed significant association with thyroid cancer in several association studies. The results showed that although the OR of each SNP was large, the FRR of each SNP was very small. By 10-fold cross-validation, we found that the prediction accuracy of the five SNPs was low across all nine classification methods. In particular, the sensitivity of the five SNPs was very low, which suggests that their clinical application might be limited. Our results support the view that complex diseases are caused by a large number of SNPs, environmental factors, and their interactions. GWAS addressing common variants have reached their limit, and the missing heritability for most complex disorders is very high. Only about 5-10% of heritability has been explained under the common disease common variant (CDCV) model. To improve genetic prediction of complex diseases, we need to incorporate more common and rare SNPs, copy number variations (CNVs), and nongenetic susceptibility factors, such as iodine intake and exposure to radiation, in the classification analysis. Novel statistical methods for variable screening should be developed to optimally select SNPs and CNVs across the genome for disease risk prediction.

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Figure S1. ROC comparison among all the machine learning prediction methods. Nine machine learning methods were used to distinguish PTC cases from healthy individuals: K-nearest neighbors (KNN), logistic regression (LR), naïve Bayes, random forest, support vector machine, Bayesian additive regression trees (BART), boosting, recursive partitioning, and a fuzzy rule-based system. The parameters of the models were optimally selected. Classification accuracy, sensitivity, specificity, and AUC were used to evaluate the performance of the methods and were calculated by 10-fold cross-validation.
Table S1. Genomic information for the five acknowledged SNPs from GWAS.
Table S2. Model performance with methods based on five SNPs and gender.
Table S3. Model performance with methods based on five SNPs, gender, and age.