Bulked sample analysis in genetics, genomics and crop improvement

Summary Biological assay has been based on analysis of all individuals collected from sample populations. Bulked sample analysis (BSA), which works with selected and pooled individuals, has been extensively used in gene mapping through bulked segregant analysis with biparental populations, mapping by sequencing with major gene mutants and pooled genomewide association study using extreme variants. Compared to conventional entire population analysis, BSA significantly reduces the scale and cost by simplifying the procedure. The bulks can be built by selection of extremes or representative samples from any populations and all types of segregants and variants that represent wide ranges of phenotypic variation for the target trait. Methods and procedures for sampling, bulking and multiplexing are described. The samples can be analysed using individual markers, microarrays and high‐throughput sequencing at all levels of DNA, RNA and protein. The power of BSA is affected by population size, selection of extreme individuals, sequencing strategies, genetic architecture of the trait and marker density. BSA will facilitate plant breeding through development of diagnostic and constitutive markers, agronomic genomics, marker‐assisted selection and selective phenotyping. Applications of BSA in genetics, genomics and crop improvement are discussed with their future perspectives.


Introduction
Biological assays in genetics, genomics and crop improvement, such as genetic mapping, usually involve using all the samples or individuals collected from a population, followed by analysis using an array of genetic factors such as molecular markers at DNA, RNA or protein level. To ensure enough power in statistical analysis, a large number of samples should be combined with sequencing or a high density of genetic markers. As almost all the traits with agronomic values are genetically complex, which are affected by many genes, environments and their interactions (Cramer et al., 2011;El-Soda et al., 2014;Grishkevich and Yanai, 2013), identification of involved genetic factors such as quantitative trait loci (QTL) has been playing a vital role in manipulating the traits of interest and understanding the genetic architecture (Holland, 2007;Xu, 2010). However, conventional analysis requires assaying all the individuals for the target traits collected from a sample population. As a result, it is usually expensive and time-consuming.
To maintain the statistical power by reducing cost and simplifying analytical process, selective assay, such as selective genotyping, by which only individuals with extreme phenotypes (usually the two tails selected from a sample population) are analysed, has been proposed (Darvasi and Soller, 1994;Sun et al., 2010). A further significant cost reduction is to bulk all the individuals selected from each tail of the population and analyse as a pool. For example, pooled DNA analysis for marker identification was developed by two groups independently but named differently as bulked segregant analysis (Michelmore et al., 1991) and DNA pooling (Giovannoni et al., 1991). More recently, bulked segregant analysis has been modified to locate the target genes, by using large populations, increased tail sizes and high-density markers so that there is no need to validate the putative markers by genotyping the entire populations using the positive markers (Sun et al., 2010;. As a consequence, it has dramatically reduced genotyping cost by using selective samples, while the statistical power in QTL mapping is comparable to the entire population analysis (Macgregor et al. 2008;Sun et al., 2010;Vikram et al., 2012). Considering a population with 500 individuals and 25 extreme ones selected to form each bulk, bulked segregant analysis will only cost 0.4% (=2/500) of the total cost required for entire population analysis.
With the development of molecular breeding technologies in recent years, bulked segregant analysis has also witnessed many improvements. The pooled DNA analysis can be used for two contrasting groups of individuals from any population as suggested by  and Sun et al., (2010), not just for those from biparental segregating populations. First, the same principle has been used in mapping by sequencing using two contrasting groups, such as major gene mutants and their corresponding wild types (Austin et al., 2011;James et al., 2013;Schneeberger et al., 2009), which is strategically different from MutMap using bulked segregants from the mutant-derived population (Abe et al., 2012;Takagi et al., 2013bTakagi et al., , 2015. Second, individuals with extreme phenotypes from natural populations have been bulked for sequencing and genomewide association study (GWAS) (Bastide et al., 2013;Turner et al., 2010;Yang et al., 2015). In this article, the term bulked sample analysis (BSA) is used to include all analyses using selected and pooled samples from genetics and breeding populations. We define BSA as a sampling-bulking method to achieve the best representativeness by selecting only a part of individuals from the entire sample set and pooling as bulks. To generalize the concept, we define two important components involved in BSA: samples that represent individuals collected from populations and markers that represent all types of biomarkers at DNA, RNA and protein levels.
In this article, the concept BSA will be described and examined for innovative researches in genetics, genomics and crop improvement. We will first extend the concept of BSA to include segregants from segregating populations and variants from all types of populations. The sample handling strategies, including sampling, bulking and multiplexing, and sample analysis strategies at DNA, RNA and protein levels will be then developed. Finally, applications of BSA in genetics, genomics and crop improvement will be discussed with future prospects.

Bulks: segregants and variants
Bulked sample analysis can be used for any populations with significant phenotypic difference for the target trait among individuals, with nontarget traits varied randomly, between the two contrasting samples. The samples can be collected from many populations with two types of genetic background: (i) segregants from segregating populations derived from bi-or multiparents and (ii) variants from any populations of a species including those with diverse genetic background.
Biparental populations have been most frequently used in BSA with any segregants or phenotypic contrasting extremes. This type of populations includes F 2 , F 2:3 , BC 1 , RILs (recombinant inbred lines), and DHs (doubled haploids), among which RIL and DH populations consist of individual homozygous lines so that they can be maintained by selfing and evaluated in multiple environments across years and locations.
As one of the two major types of multiparental populations, NAM is developed by crossing a common line with a diverse panel of lines, followed by generating a set of RIL populations. Different from NAM, a MAGIC population starts with multiple biparental crosses, and by the end, a composite hybrid is derived to include all the parental lines, from which a set of RILs or DHs is developed.
Several types of testcross populations can be derived from biparental (F 2 , F 2:3 , RIL, DH) or multiparental populations, by testcrossing each individual within a population with their parental lines, F 1 , or testers. In this case, genotyping is performed for the individuals that are used for testcrossing, while phenotyping is conducted on the testcross progeny. Several mating designs, including diallel, NCD I, II and III, and triple test cross, can be explored for generating testcross populations. The testcross populations can be used for understanding genetic mechanisms of important phenomena such as hybrid performance, combining ability and heterosis.

Variants
Fixed or homozygous individuals are defined in this article by the term 'variants', which together represent a full spectrum of variation for the target trait, but vary randomly for nontarget traits within a population. Different from a set of segregants that are derived directly from two or more parental lines, variants come from naturally existing populations, mutant libraries each containing a set of mutants or a panel of variants from different sources ( Figure 1). The use of prevailing phenotypic differences of such populations can bypass the requirement for developing large segregating populations.
A natural population consists of a panel of individuals such as varieties, inbreds, accessions, ecotypes, races, from a specific species, representing a full spectrum of variation for the target traits. This type of population usually involves a wide range of genetic backgrounds with several traits that can be targeted. Because of diverse variation for nontarget traits, phenotyping under managed or controlled environments is preferred (Araus and Cairns, 2014). Natural populations have been widely used in plants, including 75 wild, landrace and improved maize lines (Hufford et al., 2012), 278 maize inbreds (Jiao et al., 2012), 285 maize inbreds and their 570 testcrosses produced by crossing with two testers (Riedelsheimer et al., 2012), 517 rice landraces (Huang et al., 2010), 950 rice varieties (Huang et al., 2012) and 971 sorghum accessions (Morris et al., 2013). A large number of mutants have been generated using targetinduced local lesions (TILLING) strategy. To cover all the genetic loci across the genome, a large number of mutants need to be developed, which is time-consuming and also very expensive. Mutant libraries have been used to discover rare mutations in extensively pooled DNA in rice (Chi et al., 2014), identify and functionally analyse miRNAs in developing kernels of a viviparous mutant in maize (Ding et al., 2013), detect and catalogue genomewide ethyl methanesulfonate (EMS) induced mutations in rice and wheat (Henry et al., 2014) and screen for mutants of enhancing leaf yield and associated metabolic traits in tobacco (Reddy et al., 2012).
A panel of variants from different sources but with variation in the same target trait can be mixed and selected for BSA. This kind of panel may include samples from populations of multiple sources and mutants from multiple donor parents. It may also contain samples from segregants mixed with different sources of variants.

Samples: sampling, bulking and multiplexing
Target traits and phenotyping for sampling Bulked segregant analysis was originally designed to target the traits controlled by major genes with large effect and less confounded by environments. Recent developments in BSA have increased the power of bulked segregant analysis in identifying minor causal alleles (Bernier et al., 2007;Sun et al., 2010;Tuberosa et al., 2002;Venuprasad et al., 2009;Vikram et al., 2011;Figures 2 and 3). A simulation study indicated that BSA can be used for mapping QTL with relatively small effects, as well as linked and interacting QTL. With the original population size of 3000, selection of 10% of the extreme individuals from each tail and marker density of 5 cM, we could have the power of 95% to detect a QTL that explains only 1% of the phenotypic variation (Sun et al., 2010). This has been supported by a study in yeast with identification of several genes with minor effects for chemical resistance traits and mitochondrial function (Ehrenreich et al., 2010). The power of BSA largely depends on the feasibility of classifying individuals into groups with extreme phenotypes, which in turn depends largely on precision phenotyping under well-managed environments, particularly for the traits with low heritability and largely affected by environments. To improve phenotyping precision in field conditions, it is important to reduce 'signal-to-noise' ratio, by selection of research plots with low spatial variability in soil properties, uniform application of inputs with good weed, pest and disease control, use of adequate plot borders, use of experimental designs to control within replicate variability and data analysis to reduce or remove spatial trends (Xu, 2016;Xu et al., 2012Xu et al., , 2013. Precision phenotyping also depends on the utilization of new field-based techniques (precision fertilization, water management and weed control; remote sensing techniques for accurate evaluation of secondary traits) and correct selection, calibration and application of phenotyping instruments (such as neutron probes, radiation sensors and chlorophyll and photosynthesis meters).
For biotic and abiotic stresses, phenotyping needs to be performed simultaneously in two contrasting environments, or near iso-environments (NIEs) (Xu, 2002(Xu, , 2010, with one imposing much less stress on plants than the other. The effect of the stress environment can be measured using the much-less-stress or normal environment as a control. A relative trait value is then derived from two direct trait values to ascertain the sensitivity of plants to the stress. Traits suitable for measurement under NIEs include all abiotic/biotic stresses (e.g. disease resistance and drought tolerance) and agronomic practices (e.g. weed control). A relative trait value can be also derived by measuring of the same trait under the NIEs with one neutral factor significantly different such as plant responses to photoperiod or day-length.

Sampling
Two contrasting sampling methods, trait-based sampling and marker-based sampling, have been used in BSA ( Figure 3). The former is based on the phenotypic extreme plants for a trait of interest, and the plants are selected from the high and low tails of the phenotypic distribution (Lander and Botstein, 1989;Lebowitz et al., 1987). The second approach is based on molecular markers evenly covering the genome of the entire germplasm collection or segregating populations (Edwards et al., 1987;Soller and Beckmann, 1990), and individuals are selected by genotyping in the target region. In genetics, the former sampling method tends to be used for rough mapping (Vikram et al., 2011), while the latter mainly applies to fine mapping (Boopathi et al., 2013;Frouin et al., 2014;Yang et al., 2014). However, BSA is mainly based on trait-based sampling.
The power of BSA largely depends on sampling-related factors, particularly sample sizes including entire population size and tail size (the number of individuals selected for bulking) (Figure 3). In segregant-based BSA, the population size required mainly depends on population type, distance between markers, recombination frequency in the target region and genetic architecture of the target trait . As recombination frequency and relative information of genotypes usually vary across population types, the population size required in constructing the linkage map might also vary.
For the complex trait controlled by minor genes, other factors associated with the target genes, such as gene number, gene effect, gene interactions and the relative positions on chromosomes, should be taken into account to determine the required population size . To effectively identify marker- Genotyping factors Figure 3 The pipeline and affecting factors of bulked sample analysis. Two major types of populations, artificial population and nature variants, are taken as an example, with two bulks (Bulk A and Bulk B) formed by selection of individuals with extreme phenotypes. Artificial populations: a segregating population derived from biparents and multiparents. Natural variants: a population including various natural variants, which are selected from either single or multiple sources. The pipeline starts from original populations and ending-ups with applications.
trait association, the population size should also consider marker availability and genotyping cost . At the same time, with the reduction in genotyping cost, the increase in population size becomes more feasible. Taking genetic mapping as an example, how many individuals should be sampled from phenotypic extremes usually matters with the entire population size and gene effect. For small-to moderate-sized populations (each with 200-500 individuals), optimum tail size would be 20%-30% of the entire population (Gallais et al., 2007;Navabi et al., 2009). With the increase in the population size, selected proportion (SP) required for a given power of QTL detection will decrease. For a QTL of large effect (with phenotypic variation explained (PVE) = 10%-15% or larger), each tail should contain at least 20 individuals (or SP > 10%) selected from an entire population of around 200 (Sun et al., 2010). For a QTL of medium effect (PVE = 3%-10%), each tail should contain 50 individuals (SP = 5%-10%) from an entire population of 500-1000. For QTL of small effect (PVE = 0.2%-3%), each tail should contain 100 individuals (or SP < 5%) from an entire population of 3000-5000 (Sun et al., 2010). In terms of the optimum SP, it should consider the cost balance between genotyping and phenotyping for the selected samples (Darvasi and Soller, 1994;Gallais et al., 2007).

Bulking
Selected samples of phenotypic extremes may come from single or bidirectional selection, which results in uni-, bi-and multibulks for the target traits, providing comparative analyses with different options (Figure 2). There are four types of BSA. For qualitative traits such as disease resistance with two distinct phenotypes (R, resistance; S, susceptible), two bulked samples with qualitative difference can be generated ( Figure 2a). For most quantitative traits with normal distribution, two bulked samples can be selected from two tails with extremely low and high phenotypic values, respectively ( Figure 2b). To increase statistical power and reduce the false positives, multiple bulks can be selected independently from each of the two tails ( Figure 2c). In many cases, where only one bulk is available for the target trait from one tail while the other tail was killed by lethal genes or due to severe stresses, BSA can be performed by comparing the bulk with a group of individuals randomly selected from a control population under no stress with normal allele frequencies for the target trait ( Figure 2d).
There are two ways to bulk sampled individuals. Tissues sampled from the phenotypic extremes are pooled first, and then, a single DNA/RNA/protein isolation is performed; or DNA/RNA/ protein is isolated first from each extreme individual, and then, an equal amount of the extraction from each individual is bulked. As the two bulking methods for DNA analysis do not give significantly different results (Liu et al., 2010), bulking before extraction is more cost-effective.
When only one extreme (most resistant individuals under a severe abiotic and biotic stress condition) is available or reliable estimation of allele frequencies is not possible, BSA using a single bulk can be performed by comparing the available bulk with a phenotypic control that is randomly selected from the individuals under a normal environment , or by comparing with the theoretical expectation. A similar situation is that the target trait is associated with a lethal gene so that only survivors from one tail are available for being used as single bulk (Figure 2d).
To increase the power of BSA, multiple parallel bulks have been proposed to form from the same population (Ghazvini et al., 2013;Xu et al., 2012). Only the positive genetic signal will show up consistently between parallel bulks (Figure 2), which provides confirmation with each other to reveal the true genetic difference because the probability for false positives showing up simultaneously in different bulks becomes much lower as the number of bulks increases. For the traits controlled by lethal genes or severely selected under stress conditions, we may only get the extreme phenotypic data for one tail. In this case, we may just do single bulk analysis to see whether observed genetic signal (e.g. allele frequencies in DNA analysis) in the bulk deviates significantly from the expected, from the individuals under normal condition or from nonlethal case (Figure 2d).

Multiplexing
Multiplexing can be performed for samples and markers, both of which perform multiple assays in one reaction. Sample multiplexing is usually used along with individual-based selective genotyping, which makes it possible to achieve the same low cost as BSA by multiplexing many samples, while marker multiplexing is used along with BSA. Multiplexing will increase the total number of samples or markers without drastic increase in cost and time. Bulked samples can be also multiplexed as individual samples, resulting in a further cost reduction and throughput increase.
As an example for sample multiplexing, a unique sequence (barcode) can be attached to each sample so that multiple samples can be pooled in one sequencing run but can be distinguished and sorted during data analysis (http://www.illu mina.com/technology/next-generation-sequencing/multiplexingsequencing-assay.html). The barcode sequences can be designed follow the instructions (http://comailab.genomecenter.ucdavis. edu/index.php/Barcodes). The total number of available barcodes is determined by barcode length and the number of indices. Dual indexing, namely two indices used in multiplexing, further increases the total number of samples that can be pooled. With multiplex sequencing, a large number of samples can be simultaneously sequenced during a single experiment, while multisample pooling improves productivity by reducing time and reagent use. Illumina now provides a 384-sample kit which allows as many as 96 samples to be analysed in one run. For single nucleotide polymorphism (SNP) genotyping, up to tens or even hundreds of samples can be labelled individually but mixed and analysed as one sample (Livaja et al., 2013;Takagi et al., 2013a). As RNA can be analysed by its cDNA form in BSA procedure, the protocols for DNA multiplexing can be generally used for multiplexing RNA samples.
Early marker multiplexing efforts started with mixing several pairs of primers in PCR analysis (Henegariu et al., 1997). Marker multiplexing has been used for assays at DNA, RNA and protein levels. At DNA or cDNA level, SNP data can be obtained using one of the numerous multiplex SNP genotyping platforms that combine a variety of chemistries, detection methods and reaction formats. Putting thousands of markers onto a single chip is one of the best ways to multiplex markers. In maize, several SNP chips have been developed (Ganal et al., 2011;Unterseer et al., 2014;Yan et al., 2010), which can genotype 1536-600 K SNPs per run.
Multiplex sequencing has been accomplished by random DNA shearing followed by barcode tagging with short DNA sequences (barcodes) and pooling samples into a single sequencing channel (Craig et al., 2008), or using an inexpensive barcoding system to sequence restriction site-associated genomic DNA (i.e. RAD tags) (Baird et al., 2008). The former has been used to rapidly determine the complete organellar and microbial genome sequences (Cronn et al., 2008) and also for discovery and mapping of genomic SNPs (Huang et al., 2009(Huang et al., , 2010. The latter has been used for high-density SNP discovery and genotyping. To multiplex proteins or parallel protein interaction profiling, a single-molecular interaction sequencing (SMI-seq) technology has been developed. First, DNA barcodes are attached to proteins collectively via ribosome display or individually via enzymatic conjugation. To construct a random single-molecule array, the barcoded proteins are then assayed en masse in aqueous solution and subsequently immobilized in a polyacrylamide thin film for amplification and sequencing (Gu et al., 2014).
Protein multiplexing can be also performed by mass spectrometry. To eliminate the intrinsic bias towards detection of highabundance proteins, significant progress has been made in a large-scale study to detect a limit of~2 lg/mL (Addona et al., 2009), and a biomarker validation pipelines established to detect proteins in the ng/mL range in plasma (Addona et al., 2011;Whiteaker et al., 2011).
Antibody colocalization microarray (ACM) as a novel concept for protein multiplexing without mixing has been used to quantify proteins in the serum of patients with breast cancer and healthy controls, with six candidate biomarkers identified (Pla-Roca et al., 2012). ACM involves a physical colocalization of both capture and detection antibodies, spotting of the capture antibodies, and sample incubation, followed by spotting of the detection antibodies. Up to 50 targets and their binding curves can be produced. By comparing with enzyme-linked immunosorbent assay or conventional multiplex sandwich assay, the ACM can be validated.

Analyses: DNA, RNA and protein
With the advent of new biotechnologies such as new sequencing technologies and other bio-assay methods, genetic polymorphisms between two contrasting samples can be revealed at the level of DNA, RNA or protein through individual markers, microarrays and sequencing ( Figure S1; Table 1).

DNA analysis
At DNA level, genetic differences can be identified by different types of DNA markers, microarrays and genotyping by sequencing (GBS; Elshire et al., 2011;Poland and Rife, 2012) or wholegenome sequencing (Goff et al., 2002;Pizza et al., 2000).
Traditional BSA is usually performed with individual DNA markers, especially with PCR-based markers (Giovannoni et al., 1991;Michelmore et al., 1991). Array-or chip-based genotyping has become popular recently so that a large number of markers can be genotyped. Although microarray-based BSA is conducted just like the traditional genotyping methods, it is high-throughput and the number of markers used is far more than traditional methods, significantly improving output and efficiency.
With the significant reduction in sequencing cost in recent years, bulked samples can be genotyped by DNA sequencing at multiple depths. The number of markers that can be generated by sequencing will increase with sequencing depths, which thus increase mapping outputs for complex traits controlled by multigene effects (Magwene et al., 2011;Takagi et al., 2013a).
Whole-genome resequencing has been used for pooled DNA samples from the segregating individuals. Such a population can be developed by selfing the cross between a mutant plant and its wide-type, while the mutant plant can be generated from natural mutation or chemical mutagenesis. Such a strategy involving a mutant has been called MutMap, where the SNPs incorporated by Table 1 Analytical methods based on individual markers, microarrays and sequencing at DNA, RNA and protein levels mutagenesis should be used as markers to search for the region harbouring the mutation corresponding to a given phenotype (Abe et al., 2012). Such method has been used to identify the unique genomic positions most probable to harbour mutations in rice causing pale green leaves and semi-dwarfism (Abe et al., 2012), blast disease resistance (Takagi et al., 2013b) and salt tolerance (Takagi et al., 2015).

RNA analysis
There are many advantages to genotype by RNA-seq. First of all, compared to array-based technology, genotyping by RNA-seq can detect much more variances. It usually covers 70%-90% of the total genes based on the tissue and development stage of the sample. For example, with about 70M reads (100 bp), 71.6% of the genes (filtered-gene set) can be covered using 15-day-old seeding of maize (Fu et al., 2013). RNA-seq technology allows us to discover and profile the transcriptome in the species with or without a reference genome. Compared to other technologies such as microarrays, RNA-seq technology offers the following benefits (Ozsolak and Milos, 2011;Wang et al., 2009). First, it does not require species-or transcript-specific probes so it enables unbiased detection of novel transcripts, gene fusions, single-nucleotide variants, indels (small insertions and deletions) and other previously unknown changes that arrays cannot detect. Second, unlike the array hybridization technology, where gene expression measurement is limited by background at the low end and signal saturation at the high end, RNA-seq technology quantifies discrete, digital sequencing read counts, offering a broader dynamic range. Third, it offers increased specificity and sensitivity, for enhanced detection of genes, transcripts and differential expression. Fourth, sequencing depth can easily be increased to detect rare transcripts, single transcripts per cell or weakly expressed genes. In addition, RNA-seq offers the potential to refine existing gene annotation through discovery of novel exons and junction sites (Steijger et al., 2013). RNA-seq BSA reduces the cost remarkably when repetitive sequences or other 'junk' DNA is enriched in the genome. For example, maize and wheat genomes are 2.4 and 17 Gbp, with the former containing more than 80% of transposable elements (Liu et al., 2012;Ramirez-Gonzalez et al., 2014).
One of the disadvantages for RNA-seq is that, when a causal mutation lies in the nonexpressed region or not linked with a SNP that can be genotyped, it is impossible to be mapped. Another disadvantage of RNA-seq BSA is that RNA-seq cannot detect the changes in copy number, and therefore, a causal mutation caused by copy number variation cannot be mapped either.

Protein analysis
To identify the type and amount of proteins, methods based on individual markers, protein arrays and sequencing have been used (Table 1). In two-dimensional electrophoresis (2DE), proteins are separated according to their charges and molecular weights, and this technique has been significantly improved with the development of immobilized pH gradient strips for isoelectric focusing (G€ org et al., 2009). Although 2DE has been around since 1970s, which even predates the naming of proteomics, it still has its place in many laboratories and is certainly very useful for the analysis of post-translational modifications in particular (Rabilloud et al., 2010). To identify specific proteins from a complex protein mixture by Western blotting, proteins have to be separated by size, and then transferred to a solid support, followed by marking a target protein using a proper primary and secondary antibody (Mahmood and Yang, 2012).
Making protein assays needs a solid surface such as microscope slides, membranes, beads or microtitre plates to hold the protein, a coating with multiple functions to immobilize the protein and prevent its denaturation, and a hydrophilic environment for the binding reaction to occur. To apply the coating to the support surface, thin-film technologies, such as physical vapour deposition and chemical vapour deposition, are used (Gates, 1996).
There are three types of protein microarrays currently available, that is analytical microarrays, functional protein microarrays and reverse-phase protein microarrays (Zhu and Snyder, 2003). Protein array detection with fluorescence labelling, as the most common, highly sensitive and widely used method, is compatible with readily available microarray laser scanners. Label-free detection methods, such as surface plasmon resonance (Kodoyianni, 2011), carbon nanotubes, carbon nanowire sensors and microelectromechanical system cantilevers , offer much promise, with further development, for high-throughput protein interaction detection.
Due to proteins' high sensitivity to changes in microenvironments, maintaining protein arrays in a stable condition over extended periods of time is a great challenge. In situ techniques involve on-chip synthesis of proteins, directly from the DNA using cell-free protein expression systems. DNA array to protein array (He et al., 2008a), protein in situ array (He et al., 2008b) and nucleic acid programmable protein array (Miersch and LaBaer, 2011) are three examples of in situ methods. As tandem affinity purification tag fusions, 17 400 ORFs were generated in Arabidopsis to develop a platform for large-scale protein analysis and production of recombinant Arabidopsis proteins. By printing the purified recombinant proteins, a high-density Arabidopsis protein microarray was then produced, and used for protein-protein interaction analysis (Lee et al., 2011;Popescu et al., 2007) and identification of the target proteins (Popescu et al., 2009). Using parallel analysis of translated ORFs, the interaction of proteins can be discovered (Larman et al., 2014).
Unlike DNA sequencing which can be conducted by diverse platforms, there are only two major direct methods of determining the amino acid sequence of a protein (protein-seq), namely mass spectrometry (MS) and Edman degradation. Because of its sensitivity and efficiency, MS is not only becoming the main tool to study the primary structure of proteins, but also as a central technology for proteomics. Protein quantity can be determined by adding stable isotopes or mass tags into different samples, allowing equivalent peptides (or peptide fragments) to be identified by a specific increase in mass. When proteins and peptides are labelled by selective tags, such as isotope-coded affinity tags (Gygi et al., 1999) and isobaric tags for absolute and relative quantification (iTRAQ; Ross et al., 2004), a limited number of proteins can be measured. With nonselective labelling, such as isotope-coded protein label (Kellermann, 2008) or mass-coded abundance tagging (MCAT; Cagney and Emili, 2002), large-scale peptide sequencing and quantitation can be achieved. If samples are labelled when they are still metabolically active (stable isotope labelling with amino acids in cell culture, SILAC), a dynamic in vivo profile of thousands of proteins can be quantified (De Godoy et al., 2008). By combining SILAC and iTRAQ, 18-plex isotope labelling can be achieved, and a two-stage stable isotope labelling strategy which allows test of six different protein samples was developed .

Applications: genetics, genomics and crop improvement
Genetics Traditional applications of BSA in genetics are mainly in gene mapping and candidate gene discovery. Using molecular markers and bulked segregant analysis, early applications are almost exclusively used for mapping genes with relatively large effects for traits of agronomic importance, such as grain yield (Bernier et al., 2007), drought tolerance (Kanagaraj et al., 2010;Venuprasad et al., 2009;Vikram et al., 2011) and heat tolerance  in rice, waterstress tolerance in wheat (Altinkut and Gozukirmizi, 2003) and salt tolerance in Egyptian cotton (El-Kadi et al., 2006).
To identify metabolite QTL (mQTL), the genetic and metabolic basis of glucosinolate accumulation was dissected in oilseed rape/ canola (Brassica napus) through analysis of total glucosinolate concentration and its individual components in both leaves and seeds of a DH mapping population (Feng et al., 2012). QTL that had effect on glucosinolate concentration in either or both of the organs were integrated, resulting in 105 mQTL. In rice, gas chromatography-mass spectrometry analysis revealed significant differences between parental lines in fatty acid composition of brown rice oil, and 29 associated mQTL were identified in F 2 and/ or F 2:3 populations (Ying et al., 2012). To dissect the genetic architecture underlying the differences between quantitative and qualitative changes, mQTL mapping was performed in Arabidopsis using two segregating populations with 22 flavonoid QTL identified (Routaboul et al., 2012).
High-throughput genotyping platforms help move BSA recently to chip-based analysis. The examples of BSA based on DNA/RNA/ chip are summarized in Table 2. The method has been used to study a dozen of recessive mutants in maize (Liu et al., 2010), leaf rust in wheat (Forrest et al., 2014), sulphur and selenium contents in Arabidopsis thaliana (Becker et al., 2011), salt resistance in cotton (Rodriguez-Uribe et al., 2011), bean common mosaic virus in common bean (Bello et al., 2014), Phytophthora root rot in pepper  and photosynthetic traits in poplar (Wang et al., 2014a).
With the development of next-generation sequencing (NGS) technologies, BSA has been used for quick discovery of associated markers and candidate genes by sequencing the parents and bulks of phenotypic extreme individuals from the segregating populations, through BSA based on DNA-and RNA-seq. As a typical example for NGS-assisted BSA method, 430 extreme sensitive and 385 extreme tolerant rice seedlings to low temperature were selected from a very large F 3 population with 10 800 individuals, and genotyped with about 450 000 SNPs, with six QTL for low temperature, four QTL for partial resistance to the fungal rice blast disease and two QTL for seedling vigour identified . Using 946 F 2 plants, rice male sterility gene ms-IR36 was mapped to a 33-kb region on the short arm of chromosome 2 that includes 10 candidate genes (Frouin et al., 2014). Meanwhile, QTL-seq has been used to identify an early flowering QTL located near Flowering Locus T in cucumber . DNA-seq BSA has also been used to study nulliplex-branch and short-fibre mutant in cotton (Chen et al., 2015;Thyssen et al., 2014). In few cases, DNA-seq BSA has been used to identify several genes simultaneously from single population (Takagi et al., 2013a;Yang et al., 2013) and using multiple parallel bulks to identify the same gene (Ghazvini et al., 2013;Hiebert et al., 2014). With NGS and de novo transcriptome assembly, a similar pipeline as DNA-seq BSA can be developed to enrich molecular markers and identify eQTL and candidate genes through RNA-seq BSA. In maize, this method was used to rapidly and efficiently map genes for mutant phenotypes, in which 32 mutants (gl3-ref/ gl3-ref) and 31 nonmutant siblings (gl3-ref/Gl3-B73 or Gl3-B73/ Gl3-B73) were selected with the gl3 locus mapped (Liu et al., 2012). Other examples include genetic mapping of grain protein content (Trick et al., 2012) and major disease resistance for wheat yellow rust (Ramirez-Gonzalez et al., 2014) in wheat, downy mildew disease resistance in sunflower (Livaja et al., 2013), pericarp russet pigmentation in sand pear (Wang et al., 2014b) and cytoplasmic male sterility in radish (Lee et al., 2014b).

Functional analysis
Whole-genome information and high-throughput tools have contributed to the development of functional genomics, including transcriptomics (Zimmerli and Somerville, 2005) and proteomics (Roberts, 2002).
Protein arrays have five major applications in human and animals, including diagnostics, proteomics, protein functional analysis, antibody characterization and treatment development. For example, BSA-based functional genomics analysis identified 193 genes showing greater mRNA abundance in adult oocytes and 223 genes showing greater mRNA abundance in prepubertal oocytes (Patel et al., 2007).
In Lactobacillus rhamnosus, 100 strains isolated from diverse sources were used to understand the genetic complexity and ecological versatility of the species through genomic and phenotypic analysis (Douillard et al., 2013). By mapping their genomes onto the L. rhamnosus GG reference genome, a wide range of metabolic, antagonistic, signalling and functional properties were characterized.
With the rapid development in liquid chromatography coupling with tandem MS (LC-MS/MS), large-scale proteome identification and quantification can be achieved (Cooper et al., 2011;Nilsson et al., 2010;Ning et al., 2011). In plants, proteins with potential agronomic values have been detected by high-throughput protein sequencing. Around 200 gliadins and glutenins have been identified in wheat flour using MS, and their abundance was also examined (Dupont et al., 2011). Approximately 4975 nuclear proteins were detected in soybean leaves, and protein differently expressed in between soybean rust susceptible and resistant plants might be involved in disease resistance (Cooper et al., 2011). Using two-dimensional difference gel electrophoresis coupled with LC-MS/MS, 42 brassinosteroid (BR)-regulated proteins were identified in Arabidopsis. These proteins are predicted to play potential roles in specific cellular processes, including signalling, cytoskeleton rearrangement, vesicle trafficking and biosynthesis of hormones and vitamins (Deng et al., 2007).

Genotype by environment interactions
Phenotypic expression is environment-dependent. The environments can be defined as the sum total of circumstances surrounding or affecting an organism or a group of organisms. Cultivars as pure-breeding genotypes, when grown under a wide range of environments, are exposed to different soil types, fertility levels, moisture contents, temperatures, photoperiods, biotic and abiotic stresses and agronomic practices. As gene expression may be modified, enhanced, silenced or timed by regulatory mechanisms in the cell to respond to internal and external factors, the genotypes (cultivars) may specify a range of phenotypic expressions that are called the norm of reaction, or plasticity, which is simply the expression of variability (Bradshaw, 1965).
Genotype by environment interactions have been investigated with focus upon individual phenotypic traits. Generalized principles behind GEI could be revealed through high-throughput techniques that have greatly expanded the depth so that traits of agronomic importance can be analysed in terms of genotypic and environmental effects (Xu, 2016). Three genetic mapping approaches, linkage mapping using bi-parental populations, GWAS with natural populations and integrated linkage-GWAS, for example, can be used along with DNA-, RNA-and proteinseq. Combining a GWAS with transcriptional networks and metabolite or protein composition phenotypes will facilitate rapid identification and validation of many genes that are potential causal genetic candidates (Chan et al., 2011;Fu et al., 2009;Joosen et al., 2013). Therefore, combining ecophysiological modelling with genetic mapping to create QTL-based ecophysiological models could help narrow down genotype-phenotype or gene-phenotype gaps. For example, RNA-seq has been used to simultaneously examine tens of thousands of measurements in the form of gene expression levels. Gene expression can be monitored through whole-genome sequencing across individuals, developmental stages and environments to identify responsive genotypes. By incorporation of additional genes and regulatory links, gene regulatory network may have enriched more ancient and, consequently, more connected gene components for GEI (Grishkevich and Yanai, 2013).
The whole-genome strategies and associated methods hold great power in GEI analysis. Identification of GEI across four dimensions (genome 9 environment 9 space 9 time) will be able to query how gene expression at different developmental stages across different spaces and environments respond to GEI (Xu, 2016). To dissect GEI into its individual genetic components, the genetic complexity of the phenotypic responses to the environment should be examined with underlying genes and their allelic composition and combinations (haplotypes). As a consequence, GEI will be identified to correspond to (often several) QTL with environment-specific effects (El-Soda et al., 2014).
Understanding GEI can be facilitated by experiments under well-managed environments (Xu 2015(Xu , 2016. However, the results obtained in a managed environment such as in climate  , 14, 1941-1955 chamber or greenhouse should not be generalized to the field environments because of the expected large GEI (Araus and Cairns, 2014;Tuberosa, 2012). Such warning is supported by several reports on Arabidopsis, where a poor correlation has been observed between variation for flowing time (FT) scored in field experiments and FT variation observed under greenhouse conditions (Brachi et al., 2010;Hancock et al., 2011;M endez-Vigo et al., 2013).

Development of diagnostic and constitutive markers for breeding
The combination of advanced sequencing technology with BSA provides a powerful tool for rapid identification of genes or causal mutations, which can be used to develop markers for breeding, as using traditional bulked segregant analysis discussed by Xu and Bai (2015). Compared with entire population analysis, BSA provides a short cut to identifying and developing markers for important agronomic traits, which basically follows the approaches currently available. In addition to most frequently used DNA-based markers, results from RNA and protein analyses can be also used to develop markers. For more effective markerassisted selection (MAS), markers should be developed from genic or functional regions and associated with gene functions. To reveal allelic variation within a gene, multiple markers shall be developed to construct haplotypes that represent different combinations of marker alleles. The higher the density of markers can be established in a specific region, the more meaningful haplotypes can be constructed.

Agronomic genomics
With the development of RNA-seq, effects of agronomic practices on plant growth and development can be determined by examining the change in gene expression under different agronomic practices such as fertilization, irrigation, weeding and pest control. The term agronomic genomics can be used to represent the genomic study of the effects of agronomic practice on gene expression for the target traits, by which high-efficient, cost-effective and environment-friendly agronomic practices can be developed to optimize the gene expression and thus the crop production (Xu, 2016). The drivers of gene expression patterns can be determined more straightforward in the complex fluctuating environments where organisms typically live. However, one of the major difficulties in analysing data associated with agronomic practice is the complex, noisy and multiple environmental factors that affect the transcriptome change simultaneously in the field. Agronomic genomics can facilitate breeding crops that respond better to agronomic practices.
As an example in agronomic genomics, transcriptome data collected from the leaves of rice plants in a paddy field, along with the corresponding meteorological data, were used to characterize the changes in transcriptome under natural field conditions and to develop statistical models for the endogenous and external influences on gene expression (Nagano et al., 2012). The model was built using 461 microarray data with distinct sampling time points and the corresponding meteorological data including wind speed, air temperature, precipitation, global solar radiation, relative humidity and atmospheric pressure. The predictive performance of the model was evaluated using 108 and 16 microarray data collected from plants grown in the field in two crop seasons, compared with the microarray data collected from plants grown in light-and temperature-controlled growth chambers. Endogenous diurnal rhythms, ambient temperature, plant age and solar radiation predominantly control the transcriptome dynamics, allowing prediction of the influence of changing environments and also the relevant biological changes.

Marker-assisted selection and selective phenotyping
For more effective MAS, a large population can be classified into several subpopulations based on marker-trait associations or population properties. BSA can be then used to improve the efficiency of selection for each subpopulation. Furthermore, selective genotyping can be applied to each subpopulation to identify desirable individuals.
Selective phenotyping, that is phenotyping only a part of individuals from the target population, is most effective when prior knowledge of genetic architecture allows focus on specific genetic regions (Jannink, 2005;Jin et al., 2004) and specific allele combinations or haplotypes, particularly for the traits that are difficult or expensive to evaluate. As genotyping or sequencing becomes much cheaper, it may be more efficient to first genotype the whole population to identify the most informative subset of individuals, and then phenotype this subset precisely. Compared with unselective phenotyping, selective analysis should have dramatically improved power for the same number of individuals phenotyped, as a much wide genetic variation would have been sampled for genotyping. Selective phenotyping becomes much simplified when phenotypic extremes can be easily identified by a simple phenotypic screening, for example abiotic stress tolerance, where a large number of plants/families can be eliminated easily under severe stress .
High-density planting and selection at early stages of plant development, which allows one to work with more plants/families at the same cost, should also be investigated as a viable option for some traits . Where the target trait is largely influenced by planting density or strong selection pressure, this type of selective phenotyping will clearly confound the capacity to make genetic gain. However, many major genecontrolled traits can be selected in this way without much disturbance.

Perspectives
Bulked sample analysis has both the advantages, as discussed in previous sections, and some limitations. Although BSA has been widely used with many examples available in genetics and genomics, it has been largely focused on relatively simple traits controlled by major genes. For quantitative traits that are controlled by genes with different effects, selection of bulked samples can be improved through precision phenotyping assisted by envirotyping and performed with controlled environments or well-managed experimental trials (Xu, 2016), by which genes with relatively large effects could be targeted. For complex traits that involve many genes each with minor effect and affected significantly by environments, BSA may not be effective as the entire population analysis, as selection of the extremes containing all favourable alleles from a large number of loci would be impossible (Farkhari et al., 2013). Compared with traditional methods in genetic analysis, BSA requires to identify only individuals showing contrasting extreme phenotypes in the target populations, while not taking into account the accurate trait values for the rest . As a consequence, genetic analysis by BSA might upward or downward the real value such  [1941][1942][1943][1944][1945][1946][1947][1948][1949][1950][1951][1952][1953][1954][1955] as phenotypic variance, LOD score and additive effects (Vikram et al., 2012). Furthermore, BSA consistently fails to identify epistatic interactions (Schneeberger, 2014), and it is more insensitive to occasional phenotyping mistakes (Schneeberger et al., 2009). However, BSA can be improved by increased population and tail sizes and marker density (Sun et al., 2010), by using multiple bulked samples, and through precision phenotyping and envirotyping. In addition, GWAS using individual genotypes may be also used for validation of BSA results and elucidation of complicated scenarios more reliably.
With the significant reduction in cost in genome sequencing, especially the development of RNA-seq, genotyping has been more simplified than before. As a result, BSA can be performed by genotyping multiple bulks from a large-size population, by which the power of detection can be improved (Ghazvini et al., 2013;Hiebert et al., 2014;Sun et al., 2010) as rare events or alleles of interest can be verified among multiple bulks or by individual genotyping after appropriate sub-bulks are identified. When it comes to optimal designs of NGS-assisted BSA, it should take into account the number of bulks, number of individuals in each bulk and sequencing depth (Kim et al., 2010).
Compared with DNA-microarrays, developing protein microarrays requires a lot more steps in its creation and faces many challenges (Cretich et al., 2013;Gahoi et al., 2015;Hall et al., 2007;Liotta et al., 2003;Lueking et al., 2005;Zhu and Snyder, 2003). Technical challenge is to find a surface and a method of attachment that allows the proteins to maintain their secondary or tertiary structure, and to produce an array with a long shelf life so that the proteins on the chip do not denature over a short time. Experimental challenges are to acquire antibodies or other capture molecules against every protein in the target genome, quantify the levels of bound protein, extract the detected protein from the chip for further analysis and reduce nonspecific binding by the capture agents (https:// en.wikipedia.org/wiki/Protein_microarray). By the end, the capacity of the chip should be significantly increased to allow the whole proteome to be represented and less abundant proteins to be detected.
Technically, microarray-based high-throughput analysis of protein-protein and other biomolecular interactions holds immense potential for multiplex interactome mapping and also great opportunity for an inclusive representation of the signal transduction pathways and networks (Gahoi et al., 2015). Equipped with multiplexing, quantitative proteomics comes to the era of 'ultra' high-throughput, making it possible to comprehensively compare all major tissues/organs of humans (Paulo et al., 2015) and plants as well in a day.
Although BSA has been widely used at DNA and RNA levels, its use at protein level has not been reported. Since many MS-based protein quantification tools are designed for comparison between two samples, we would expect that many of them are suitable for protein-based BSA. A major challenge in proteinbased BSA would be that enrichment in cellulose and many plants' secondary metabolites such as polyphenols, lipids, organic acids, terpenes or pigments makes it hard to extract pure proteins from tissues.
Traditional genotyping methods, such as individual markerbased or marker chip-based, should be complemented by GBS, including both regular whole-genome sequencing and the simplified genome sequencing with target enrichment or reduction in genome complexity (De Donato et al., 2013;Elshire et al., 2011;He et al., 2014;Poland and Rife, 2012). GBS technology is flexible and efficient, providing acceptable marker density for genomic selection or GWAS at roughly one-third of the genotyping cost of currently available technologies (De Donato et al., 2013), and the cost becomes less when samples are multiplexed (Zhang et al., 2015). GBS can be used in GWAS, genomic diversity study, genetic linkage analysis, molecular marker discovery and genomic selection under a large scale of plant breeding programs . When the number of individuals contained in the two bulks is large enough, for example, more than 500, BSA can be combined with GBS technologies and used for GWAS (Duncan et al., 2011;Schl€ otterer et al., 2014).
It can be expected that BSA, which have been widely used with mixed success in genetic mapping and gene identification, will become increasingly important in genetics, genomics and crop improvement and will replace the analysis of all individuals (entire population) in many cases. BSA by sequencing (BSA-seq) will become more attractive with the development of dedicated software for BSA-seq data analysis, novel techniques for analysis of low-frequent or rare variants, new approaches to accurate estimates of haplotypes, and new sequencing technologies allowing generation of longer sequencing reads to facilitate the reconstruction of haplotype information (Schl€ otterer et al., 2014). As genomewide selective genotyping and BSA become possible, an effective information management and data analysis system will be required to make full use of BSA in genetics, genomics and plant breeding.