Cell population‐based framework of genetic epidemiology in the single‐cell omics era

Genetic epidemiology is a rapidly advancing field due to the recent availability of large amounts of omics data. In recent years, it has become possible to obtain omics information at the single‐cell level, so genetic epidemiological models need to be updated to integrate with single‐cell expression data. In this perspective paper, we propose a cell population‐based framework for genetic epidemiology in the single‐cell era. In this framework, genetic diversity influences phenotypic diversity through the diversity of cell population profiles, which are defined as high‐dimensional probability distributions of the state spaces of biomolecules of each omics layer. We discuss how biomolecular experimental measurement data can capture the different properties of this distribution. In particular, single‐cell data constitute a sample from this population distribution where only some coordinate values are observable. From a data analysis standpoint, we introduce methodology for feature extraction from cell population profiles. Finally, we discuss how this framework can be applied not only to genetic epidemiology but also to systems biology.


INTRODUCTION
Understanding the phenotypic diversity among human populations is important in medicine and other life sciences. Genetic epidemiology evaluates phenotypic diversity by statistical models that combine genetic effects and environmental effects to identify the causal variants or genes of diseases. This has greatly contributed to the understanding of the genetic causes and mechanisms of disease.
In recent years, the field of genetic epidemiology has grown significantly due to the availability of genomics data. In particular, genome-wide association studies (GWAS) have identified many genetic variants that affect complex traits, including diseases. [1] In addition to genomic information, information from other omics layers such as transcriptomics can also be used to analyze phenotypic diversity. Furthermore, in the past

Omics model
Genetic factors influence phenotypic diversity of biomolecules such as RNA or proteins. Comprehensive biomolecular information is known as omics information, which is classified into genome, transcriptome, proteome, epigenome, or metabolome information. [2,3] Genetic epidemiologists have actively studied phenotypic diversity through such omics information, which is not limited to genomic information.
The Omics Model shown in Figure 1B is a framework for combining genetic epidemiology with omics data. In this model, genetic and environmental effects contribute to phenotypic diversity via biomolecular information. To identify the genetic effects on pools of biomolecular information such as the transcriptome, proteome, metabolome, or epigenome (blue arrow in Figure 1B), single nucleotide polymorphisms (SNPs) associated with these biomolecules (expression quantitative trait loci (eQTL), protein QTL, methylation QTL, and metabolite QTL) are being actively investigated. [4,5] For example, eQTL analysis identifies genetic variants that are associated with gene expression levels obtained from transcriptome data. The eQTLs identified in various tissues have been published in databases. [6,7] In addition, studies that examine the relationship between omics diversity and phenotypic diversity (red arrow in Figure 1B) constitute disease omics analysis. Studies that identify genes differentially expressed between diseased and healthy individuals fall into this category.
Both types of study designs have been widely implemented in omics research projects.
While association studies between cell population profiles and phenotype are often performed to identify a cellular subset associated with disease using cytometry data or single-cell RNA-seq data (red arrow in Figure 1C), [8][9][10] genetic epidemiological analyses based on such models have rarely been performed to date (blue arrow in Figure 1C). Previously, we performed the first GWAS on the diversity of lymphocyte populations in peripheral blood using a large-scale cytometry dataset based on this framework. [11] Although the sample size was relatively small, SNPs associated with individual differences in the lymphocyte profile were successfully identified. In recent years, large-scale acquisition of cytometry data has also become common. [12] Genetic epidemiological research under this model can be expected to bring new findings.

Multi-tissue model
The Cell Population Model can be extended to multiple tissues in the Multi-Tissue Model (Figure 1D). Under this framework, phenotypic diversity is understood as being generated by a combination of effects from the cell population profiles of multiple related tissues. For systemic diseases involving multiple tissues, such models are a natural expression of the mechanism. Although genetic epidemiological studies using the Multi-Tissue Model have not yet been conducted, it is considered meaningful as a future genetic epidemiological model in the single-cell era.
layer are information about mutations or DNA damage that accumulate in the somatic genome and are distinct from the germline genome information inherited from parents. For example, cancer is a disease caused by an increase in the number of cells with abnormal somatic genomic information, and cancer genome analysis has been used to identify genes involved in the pathogenesis of the disease. [13,14] In addition, considering the mitochondrial genome is beneficial for understanding the differences among cells. For example, a recent in vivo study in mice observed mitochondrial transfer between different cell types, which is related to biological or pathological phenomena. [15,16] Because each cell has individual omics information, one cell can be

EXPERIMENTAL DATA MEASURING BIOMOLECULES TO CAPTURE THE PROPERTIES OF THE CELL POPULATION PROFILE
Experimental data measuring biomolecules can be interpreted as capturing different parts of the distribution of the cell population profile in the Omics State Space. Because the distribution of cell population profiles is very high-dimensional and complex, no experimental technique can capture it completely. Existing biomolecular experimental data can be classified along three axes according to the information desired: the target omics layer, bulk versus single-cell, and candidate-based versus comprehensive. For example, bulk and candidate-based approaches in the proteome layer include western blotting and ELISA. Immunocytochemistry is a single-cell, candidate-based method primarily in the proteome layer, in which the number of cells that can be measured is small but protein localization can be distinguished.
Bulk and comprehensive approaches in the transcriptome layer include RNA-seq and DNA microarray. Methods for comprehensive measurements at the single-cell level in each layer have made rapid progress in the past few years. [15][16][17][18][19][20][21][22] Recent genomics assays can detect even mtDNA mutations at the single-cell level. [23] In particular, single-cell data and bulk data differ in their data structure. The bulk measurement is an estimate of the mean value for a par-
The ability to acquire more biomolecule information simultaneously at the single-cell level will allow us to understand the shape of the cell population profile at higher resolution. In recent years, the simultaneous measurement of omics information in multiple layers has been actively researched, and measurement techniques at the single-cell level have been developed. [24,25]
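The difference in data structure between bulk and single-cell measurements can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a model of any real assay: the two-subset mixture, its parameters, and all function names are hypothetical and chosen only for illustration. Bulk data are treated as one mean per biomolecule, and single-cell data as a random sample of cells in which only some coordinates are observed.

```python
import random
import statistics


def simulate_population(n_cells, seed=0):
    """Draw cells from a hypothetical mixture of two subsets over two biomolecules."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        if rng.random() < 0.7:  # assumed: subset A makes up 70% of cells
            cells.append((rng.gauss(2.0, 0.5), rng.gauss(8.0, 0.5)))
        else:  # assumed: subset B makes up the remaining 30%
            cells.append((rng.gauss(8.0, 0.5), rng.gauss(2.0, 0.5)))
    return cells


def bulk_measurement(cells):
    """Bulk data: a single mean value per biomolecule (coordinate)."""
    return tuple(statistics.fmean(axis) for axis in zip(*cells))


def single_cell_measurement(cells, n_sampled, observed_axes, seed=1):
    """Single-cell data: a random sample of cells with only some coordinates observed."""
    rng = random.Random(seed)
    sampled = rng.sample(cells, n_sampled)
    return [tuple(cell[i] for i in observed_axes) for cell in sampled]


population = simulate_population(10_000)
bulk = bulk_measurement(population)
sc = single_cell_measurement(population, n_sampled=500, observed_axes=(0,))

# The bulk mean hides the bimodality that the single-cell sample reveals.
low_fraction = sum(1 for (x,) in sc if x < 5.0) / len(sc)
print("bulk means:", bulk)
print("fraction of sampled cells with low biomolecule 0:", low_fraction)
```

In this toy setting the bulk value for the first biomolecule falls between the two subsets, where few cells actually lie, whereas the single-cell sample recovers the approximate mixing proportion but only along the observed coordinate.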

REQUIREMENT OF A CELL POPULATION
In this section, we will discuss important issues when considering cell population profiles as distributions, and the requirements that must be met for a cell population.
When the cell population profile is viewed as a probability distribution, each data point is considered independent and the cell location information disappears. For this treatment to be valid, cells must be able to move and mix freely within the cell population. This assumption holds well for peripheral blood cell populations. When blood cells are sampled from peripheral blood, each cell can be regarded as independent and randomly collected, and the single-cell data can be regarded as a statistical sample from the population distribution. However, in many anatomically defined tissues, each cell must not only maintain its proper biomolecular expression state but also occupy its proper position in the tissue to maintain tissue function. For example, tissue stem cells are maintained in a microenvironment called a niche. [26] Considering such cell populations as distributions would result in a loss of biological information.
In recent years, spatial omics technologies that simultaneously acquire positional and omics information have received much attention. For example, the spatial transcriptome can reveal transcriptome data while retaining spatial information in the tissue. [27] Such spatial information may be useful in determining the range of cell populations that can be treated as distributions and in compensating for the loss of positional information.
To extend the Cell Population Model to the Multi-Tissue Model, it is necessary to consider the interactions between the cell populations.
Cell populations exchange information through physical interactions or cellular signaling. In reality, the diversity of some complex phenotypes is generated by many cell populations that make up an individual and their interactions.

FEATURE EXTRACTION OF CELL POPULATION PROFILES
To design genetic epidemiology studies based on a cell population-based framework, such as the Cell Population Model or Multi-Tissue Model, it is necessary to perform association analysis between the cell population profile and individual labels such as genotype or phenotype. Since the cell population profile is represented as a probability distribution on the Omics State Space, conventional methods of genetic epidemiology and omics data analysis cannot be used directly in this situation. To apply these data analysis methods and conduct association analysis, feature values must be extracted from cell population profiles. In this section, we introduce three conventional approaches to feature extraction from cell population profiles: methods using bulk data, methods based on cellular subsets, and nonparametric methods.
The mean value of the distribution, obtained from bulk data, is one of the most commonly used features of cell population profiles. For example, bulk transcriptome data have contributed greatly to the identification of tissue-specific genes. [28] Identifying tissue-specific genes is a way to compare cell populations from multiple tissues and find transcriptome axes on the Omics State Space whose mean values differ significantly among those populations. In the medical science field, many searches for biomolecular markers using bulk data have been conducted. [29,30]
Feature extraction based on cellular subsets is frequently done with single-cell data. Each cell in a cell population differs slightly from the others, so no two cells are exactly the same. However, since cell populations are formed as cells proliferate and differentiate, a cell population contains cellular subsets whose members share properties and functions. Therefore, we can understand cellular function by classifying cells into subsets and annotating their functions. Since a cell population profile is a mixture distribution of cellular subsets, the proportion of each subset is also a valid feature of a cell population profile. Computational methods for clustering cells using single-cell data to identify cellular subsets are actively being studied by computational biologists. [31,32] Cellular subset-based feature extraction also loses information. One reason is that the results of feature extraction are affected by prior biological knowledge and assumptions about the pre-identified cellular subsets. However, it is not known exactly how many cellular subsets there are in our body or how we should classify them, and novel subsets continue to be identified. Even data-driven classification using information science methods cannot eliminate such biases, because of the assumptions made in the algorithms and statistical models.
In cytometry data analysis, a method using information theory-based dissimilarity quantification and multi-dimensional scaling (MDS) has been proposed. [33,34] Here, the dissimilarity matrix among probability distributions is calculated by nonparametric density estimation, and MDS is applied to this dissimilarity matrix to obtain coordinates that reflect the dissimilarity relationships. Decomposition into Extended Exponential Family expresses the cell population profile distribution as an exponential family-like formula in a nonparametric manner, giving coordinates based on the inner-product matrix among the distributions. [35] The coordinates obtained by these procedures can be treated as data-driven feature values of the cell population profiles.
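The general idea of dissimilarity-based feature extraction can be sketched in a few lines. This sketch is not the implementation of the cited methods: it assumes a histogram as the nonparametric density estimate, the square root of the Jensen-Shannon divergence as the information-theoretic dissimilarity, and classical (Torgerson) MDS reduced to its first coordinate by power iteration. The simulated marker data and all function names are hypothetical.

```python
import math
import random


def histogram_density(values, lo, hi, n_bins):
    """Nonparametric density estimate: a normalized histogram with a tiny pseudocount."""
    counts = [1e-9] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def classical_mds_1d(dist, n_iter=500):
    """First classical-MDS coordinate via power iteration on the double-centered matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant starting vector
    for _ in range(n_iter):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all profiles identical: no informative coordinate
            return [0.0] * n
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [math.sqrt(max(eigval, 0.0)) * x for x in v]


def simulate_individual(high_fraction, n_cells, rng):
    """Hypothetical one-marker profile: mixture of a low and a high subset."""
    return [rng.gauss(7.0, 0.8) if rng.random() < high_fraction else rng.gauss(3.0, 0.8)
            for _ in range(n_cells)]


rng = random.Random(0)
fractions = [0.3, 0.3, 0.3, 0.7, 0.7, 0.7]  # assumed subset proportions per individual
densities = [histogram_density(simulate_individual(f, 2000, rng), 0.0, 10.0, 20)
             for f in fractions]
n = len(densities)
dist = [[math.sqrt(max(js_divergence(densities[i], densities[j]), 0.0)) for j in range(n)]
        for i in range(n)]
coords = classical_mds_1d(dist)
print(coords)
```

Each individual receives one coordinate, which can then be used as a feature value in downstream association analysis. Real profiles are high-dimensional, so the density estimation step is considerably harder than in this one-marker illustration.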
The development of feature extraction methods that satisfy these requirements is a future challenge in data analysis for implementing genetic epidemiology models in the single-cell era. The advantage of cellular subset-based feature extraction is that the biological meaning of the obtained features is clear and easy to interpret. The advantage of nonparametric methods is that they can model cell population profiles without using prior assumptions about cellular subsets. However, nonparametric methods generally require larger sample sizes to perform robust analysis. Due to cost issues, it is often difficult to acquire single-cell data with very large sample sizes. While there are many methods to compare multiple samples in cytometry data, such methods are lacking for single-cell RNA-seq data in particular. [36] Developing them is a future task in single-cell data analysis.
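The subset-based route from a cell population profile to an association analysis can be sketched as follows. This is a toy illustration under stated assumptions: the cell labels, genotype dosages, and function names are hypothetical, and a simple least-squares slope stands in for the covariate-adjusted models and multiple-testing control that a real GWAS would require.

```python
import math
from collections import Counter


def subset_proportions(cell_labels, subsets):
    """Feature extraction: fraction of cells assigned to each cellular subset."""
    counts = Counter(cell_labels)
    total = len(cell_labels)
    return {s: counts.get(s, 0) / total for s in subsets}


def linear_association(dosages, feature):
    """Least-squares slope and Pearson correlation of a feature on genotype dosage."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(feature) / n
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in feature)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, feature))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical individuals: (dosage of the tested allele, subset label of each cell).
individuals = [
    (0, ["T"] * 50 + ["B"] * 50),
    (0, ["T"] * 55 + ["B"] * 45),
    (1, ["T"] * 60 + ["B"] * 40),
    (1, ["T"] * 65 + ["B"] * 35),
    (2, ["T"] * 70 + ["B"] * 30),
    (2, ["T"] * 75 + ["B"] * 25),
]

dosages = [d for d, _ in individuals]
t_fraction = [subset_proportions(labels, ("T", "B"))["T"] for _, labels in individuals]
slope, r = linear_association(dosages, t_fraction)
print("T-cell fraction per individual:", t_fraction)
print("slope per allele copy:", slope, "correlation:", r)
```

The extracted proportion is an ordinary scalar per individual, which is why conventional association methods become applicable once the profile has been reduced to features.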

DYNAMICS OF CELL POPULATION PROFILES
Research designs that combine genetic epidemiology and systems biology are useful in medical research. [37,38] In genetic epidemiology, genetic factors are identified by analyzing the diversity of complex phenotypes such as diseases among individuals. However, understanding the dynamics of complex phenotypes within the same individual is also important for controlling diseases with a systems biology approach. The dynamics of such cell population profiles can also be analyzed by applying ordinary data analysis methods through feature extraction. The changes in the cell population profile of each sample can be visualized and analyzed as time series data of its features. Various data analysis methods are available to analyze time series data. [39] A cell population-based framework such as the Cell Population Model or Multi-Tissue Model provides an integrated approach to both genetic epidemiology and systems biology (Figure 3). This framework will be useful for investigating genetic effects that are condition-specific or related to dynamics. For example, recent research suggested that an RNASEH2B variant is related to hemophagocytic lymphohistiocytosis (HLH) depending on the biological condition. [40]

CONCLUSION
In this perspective paper, we proposed a cell population-based framework for genetic epidemiology. In this framework, genetic diversity