Smooth‐Threshold Multivariate Genetic Prediction with Unbiased Model Selection
ABSTRACT
We develop a new genetic prediction method, smooth‐threshold multivariate genetic prediction, using single nucleotide polymorphism (SNP) data from genome‐wide association studies (GWASs). Our method consists of two stages. In the first stage, unlike the discontinuous SNP screening used in the gene score method, our method screens SNPs continuously based on the output of a standard univariate analysis of the marginal association of each SNP. In the second stage, the predictive model is built by a generalized ridge regression that uses the screened SNPs simultaneously, with each SNP's weight determined by the strength of its marginal association. Continuous SNP screening by smooth thresholding not only makes prediction stable but also leads to a closed‐form expression for the generalized degrees of freedom (GDF). The GDF in turn yields Stein's unbiased risk estimation (SURE), which enables a data‐dependent choice of the optimal SNP screening cutoff without cross‐validation. Our method is very fast because the computationally expensive genome‐wide scan is required only once, in contrast to penalized regression methods such as the lasso and elastic net. Simulation studies that mimic real GWAS data with quantitative and binary traits demonstrate that the proposed method outperforms the gene score method and genomic best linear unbiased prediction (GBLUP), and performs comparably to, and sometimes better than, the lasso and elastic net, which are known to have good predictive ability but at heavy computational cost. An application to whole‐genome sequencing (WGS) data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) shows that the proposed method has higher predictive power than the gene score and GBLUP methods.
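The two stages described in the abstract can be illustrated with a minimal NumPy sketch for a quantitative trait. It is not the authors' implementation: the abstract does not give the formulas, so the screening weight `delta_j = min(1, (cutoff / |z_j|)^(1 + gamma))`, the ridge penalty `delta_j / (1 - delta_j)`, and all function and parameter names (`marginal_z`, `smooth_threshold_delta`, `generalized_ridge`, `cutoff`, `gamma`) are assumptions, patterned on the general form of smooth‐threshold estimating equations. The GDF and SURE computations are not sketched here.

```python
import numpy as np


def marginal_z(X, y):
    """Stage 1 input: z-statistic of each SNP from a univariate linear regression."""
    Xc = X - X.mean(axis=0)          # center genotypes (assumes no monomorphic SNP)
    yc = y - y.mean()                # center the trait
    n = X.shape[0]
    sxx = (Xc ** 2).sum(axis=0)
    beta = Xc.T @ yc / sxx           # marginal slope per SNP
    resid_var = ((yc[:, None] - Xc * beta) ** 2).sum(axis=0) / (n - 2)
    return beta / np.sqrt(resid_var / sxx)


def smooth_threshold_delta(z, cutoff, gamma=1.0):
    """Assumed smooth-threshold weight: delta == 1 screens a SNP out,
    delta < 1 keeps it with a penalty that shrinks as |z| grows."""
    with np.errstate(divide="ignore"):
        ratio = cutoff / np.abs(z)   # z == 0 gives inf, which is clipped to 1 below
    return np.minimum(1.0, ratio ** (1.0 + gamma))


def generalized_ridge(X, y, delta):
    """Stage 2: ridge fit on retained SNPs with SNP-specific penalties
    D = diag(delta / (1 - delta)) (an assumed form). Screened SNPs get 0."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    keep = delta < 1.0
    beta = np.zeros(X.shape[1])
    if keep.any():
        Xa = Xc[:, keep]
        D = np.diag(delta[keep] / (1.0 - delta[keep]))
        beta[keep] = np.linalg.solve(Xa.T @ Xa + D, Xa.T @ yc)
    return beta
```

Under these assumptions, a typical call would compute `z = marginal_z(X, y)` once, then evaluate `smooth_threshold_delta(z, cutoff)` and `generalized_ridge(X, y, delta)` over a grid of cutoffs, which reflects the point in the abstract that the genome‐wide scan is needed only once.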



[Figure: (a) the indicator function (green dotted) and its approximation by the adaptive lasso smooth‐thresholding function (black solid); (b) the solution of the equation with respect to β, given D, under the indicator function (green dotted; any ρ gives an identical solution) and under smooth‐thresholding for three parameter values (black solid, red solid for 10, and blue solid for 20). In a special case of this parameter, the smooth‐threshold estimator reduces to the adaptive lasso estimator.]
[Figure caption fragment: […]‐scale (x‐axis). Black dotted line (ST.P): average of prediction squared error for test data (PSEte) for the predictive model from smooth‐threshold multivariate genetic prediction trained on the training data. Red solid line (CpST): average of the proposed […]‐type criterion (an unbiased estimator of the black dashed line).]
[Figure caption fragment: […] loglikelihood averaged over 20 simulation replicates for binary traits in polygenic scenarios (P19,...,P24). Black dashed line (ST): average of mean […] loglikelihood for training data (PSEtr) for predictive models from smooth‐threshold multivariate genetic prediction at each p‐value threshold in […]‐scale (x‐axis). Black dotted line (ST.P): average of prediction […] loglikelihood for test data (PSEte) for the predictive model from smooth‐threshold multivariate genetic prediction trained on the training data. Red solid line (CpST): average of the proposed […]‐type criterion (an approximate unbiased estimator of the black dashed line).]
