Asymptotic Conditional Singular Value Decomposition for High-Dimensional Genomic Data
Article first published online: 16 JUN 2010
© 2010, The International Biometric Society
Volume 67, Issue 2, pages 344–352, June 2011
How to Cite
Leek, J. T. (2011), Asymptotic Conditional Singular Value Decomposition for High-Dimensional Genomic Data. Biometrics, 67: 344–352. doi: 10.1111/j.1541-0420.2010.01455.x
- Issue published online: 20 JUN 2011
- Received October 2009. Revised February 2010. Accepted April 2010.
Keywords: False discovery rate; Gene expression; Singular value decomposition; Surrogate variables
Summary: High-dimensional data, such as those obtained from a gene expression microarray or second-generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that, under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model, for a finite fixed sample size and an infinite number of features, based on a scaled eigen-decomposition. We propose a practical approach for selecting the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718–18723).
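The consistency claim in the summary can be illustrated with a small simulation. The sketch below is not the paper's procedure. It assumes a simple Gaussian conditional factor model X = BS + ΓG + U (features in rows, samples in columns), where S holds known covariates and G holds the latent factors. All function names, dimensions, and distributions are hypothetical choices made for illustration. The code removes the known covariates, takes the top right singular vectors of the residual matrix, and measures how closely they span the same subspace as the residualized latent factors. It also returns the eigenvalues of the residual cross-product scaled by the number of features, which is only loosely in the spirit of a scaled eigen-decomposition; the paper's estimator of the number of factors and its threshold are not reproduced here.

```python
import numpy as np


def simulate(m, n, r, rng):
    """Simulate X = B S + Gamma G + U with m features and n samples.

    Illustrative assumption: all coefficients, factors, and noise are
    standard normal; S is an intercept plus an alternating group label.
    """
    S = np.vstack([np.ones(n), np.arange(n) % 2])   # known covariates (2 x n)
    G = rng.standard_normal((r, n))                 # latent factors (r x n)
    B = rng.standard_normal((m, S.shape[0]))        # covariate coefficients
    Gamma = rng.standard_normal((m, r))             # factor loadings
    U = rng.standard_normal((m, n))                 # independent noise
    return B @ S + Gamma @ G + U, S, G


def residualize(X, S):
    """Project the rows of X onto the orthogonal complement of the rows of S."""
    n = S.shape[1]
    P = np.eye(n) - S.T @ np.linalg.pinv(S @ S.T) @ S
    return X @ P, P


def top_right_singular_vectors(R, r):
    """Return the first r right singular vectors of R as rows (r x n)."""
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[:r]


def subspace_alignment(Vt, G_res):
    """Smallest cosine of the principal angles between two row spaces.

    A value near 1 means the estimated vectors span nearly the same
    subspace as the (residualized) latent factors.
    """
    Q, _ = np.linalg.qr(G_res.T)                    # orthonormal basis (n x r)
    cosines = np.linalg.svd(Vt @ Q, compute_uv=False)
    return float(cosines.min())


def scaled_eigenvalues(R):
    """Eigenvalues of R^T R / m in decreasing order (m = number of features)."""
    m = R.shape[0]
    return np.linalg.eigvalsh(R.T @ R / m)[::-1]


def alignment_for(m, n=20, r=2, seed=0):
    """Run one simulation and report subspace alignment and scaled eigenvalues."""
    rng = np.random.default_rng(seed)
    X, S, G = simulate(m, n, r, rng)
    R, P = residualize(X, S)
    Vt = top_right_singular_vectors(R, r)
    return subspace_alignment(Vt, G @ P), scaled_eigenvalues(R)


if __name__ == "__main__":
    # Sample size stays fixed at n = 20 while the number of features grows.
    for m in (50, 500, 20000):
        align, eigs = alignment_for(m)
        print(m, round(align, 4), np.round(eigs[:4], 2))
```

With the sample size held fixed, the alignment should approach 1 as the number of features grows, and the leading scaled eigenvalues should separate from the remainder. The recovered vectors identify the latent factors only up to rotation within their subspace, which is why the comparison uses principal angles and not an entry-by-entry match.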