Fifty-Eighth Annual Meeting of the American Association of Physicists in Medicine
SU-D-204-06: Integration of Machine Learning and Bioinformatics Methods to Analyze Genome-Wide Association Study Data for Rectal Bleeding and Erectile Dysfunction Following Radiotherapy in Prostate Cancer
We investigated whether integrating machine learning and bioinformatics techniques on genome-wide association study (GWAS) data can improve models that predict the risk of radiation-induced late rectal bleeding and erectile dysfunction in prostate cancer patients.
We analyzed a GWAS dataset from 385 prostate cancer patients treated with radiotherapy. Using genotype information from these patients, we designed a machine learning-based predictive model of two late radiation-induced toxicities: rectal bleeding and erectile dysfunction. The model was built on two-thirds of the samples (training) and tested on the remaining one-third (validation). To identify important single nucleotide polymorphisms (SNPs), we computed an importance score for each SNP from our random forest regression model. We then performed gene ontology (GO) enrichment analysis on genes near the important SNPs.
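The split, model fit, and SNP ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the use of scikit-learn, the stratified split, the hyperparameters, and the impurity-based importance measure are all assumptions, since the abstract does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the GWAS data (NOT the study data): genotypes are
# minor-allele counts 0/1/2, and the outcome is a binary toxicity label.
rng = np.random.default_rng(0)
n_patients, n_snps = 385, 200
X = rng.integers(0, 3, size=(n_patients, n_snps))
# Hypothetical signal: the first five SNPs raise the toxicity risk.
risk = X[:, :5].sum(axis=1)
y = (risk + rng.normal(0, 1.5, n_patients) > 5).astype(int)

# Hold out one third of the patients for validation, keeping the toxicity
# rate similar in both parts (stratification is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

# Random forest regression on the 0/1 outcome; predictions are continuous
# risk scores. The hyperparameters are illustrative assumptions.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importance per SNP, normalized by scikit-learn to sum to 1.
importance = forest.feature_importances_
ranked_snps = np.argsort(importance)[::-1]  # most important SNP first
val_scores = forest.predict(X_val)          # risk scores for validation
```

The ranked list is what a downstream step would map to nearby genes for GO enrichment; a permutation-based importance could be substituted if impurity-based scores are considered too biased.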
After univariate analysis on the training dataset, we removed SNPs with p > 0.001, leaving 749 and 367 SNPs for model building for rectal bleeding and erectile dysfunction, respectively. On the validation dataset, our random forest regression model achieved an area under the curve (AUC) of 0.70 for rectal bleeding and 0.62 for erectile dysfunction. We performed GO enrichment analysis on the top 25%, 50%, 75%, and 100% of the SNPs selected in the univariate analysis. With the top 50% of SNPs, more plausible biological processes were obtained for both toxicities. An additional test using only the top 50% of SNPs improved predictive power, with AUC = 0.71 for rectal bleeding and 0.65 for erectile dysfunction. For erectile dysfunction, adding age and androgen deprivation therapy to the model raised the AUC to 0.67.
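Two steps in this paragraph lend themselves to a short sketch: keeping the top fraction of SNPs by importance, and scoring continuous model outputs against binary outcomes with an AUC. The code below assumes the AUC is the area under the ROC curve, computed in its rank (Mann-Whitney) form. The function names, SNP identifiers, and toy values are hypothetical and are not results from the study.

```python
import math


def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction(importance, fraction):
    """Return SNP ids in the top `fraction` by importance, highest first."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    # Round up so that a small list never yields an empty selection.
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical toy values, not results from the study.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # predicted risk per patient
labels = [1, 1, 1, 0, 0, 0]                # 1 = toxicity observed
toy_auc = auc(scores, labels)              # 8 of 9 pairs ranked correctly

importance = {"snp_a": 0.40, "snp_b": 0.05, "snp_c": 0.30, "snp_d": 0.25}
selected = top_fraction(importance, 0.5)   # ['snp_a', 'snp_c']
```

Refitting the model on `selected` and recomputing the AUC on the held-out patients mirrors the "additional test" with the top 50% of SNPs.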
Combining machine learning and bioinformatics techniques enabled us to design better predictive models and to identify more plausible biological processes associated with the outcomes.