Indel‐seq: a fast‐forward genetics approach for identification of trait‐associated putative candidate genomic regions and its application in pigeonpea (Cajanus cajan)

Abstract Identification of candidate genomic regions associated with target traits using conventional mapping methods is challenging and time‐consuming. In recent years, a number of single nucleotide polymorphism (SNP)‐based mapping approaches have been developed and used for identification of candidate/putative genomic regions. However, in the majority of these studies, insertion–deletion (Indel) were largely ignored. For efficient use of Indels in mapping target traits, we propose Indel‐seq approach, which is a combination of whole‐genome resequencing (WGRS) and bulked segregant analysis (BSA) and relies on the Indel frequencies in extreme bulks. Deployment of Indel‐seq approach for identification of candidate genomic regions associated with fusarium wilt (FW) and sterility mosaic disease (SMD) resistance in pigeonpea has identified 16 Indels affecting 26 putative candidate genes. Of these 26 affected putative candidate genes, 24 genes showed effect in the upstream/downstream of the genic region and two genes showed effect in the genes. Validation of these 16 candidate Indels in other FW‐ and SMD‐resistant and FW‐ and SMD‐susceptible genotypes revealed a significant association of five Indels (three for FW and two for SMD resistance). Comparative analysis of Indel‐seq with other genetic mapping approaches highlighted the importance of the approach in identification of significant genomic regions associated with target traits. Therefore, the Indel‐seq approach can be used for quick and precise identification of candidate genomic regions for any target traits in any crop species.


Introduction
Conventional trait mapping methods are generally expensive and take much time in generating and analysing genotyping data on segregating populations. Trait mapping becomes more time-consuming if genotyping is performed using low-throughput marker systems such as simple sequence repeat (SSR) markers. Visual scoring in such marker systems also adds to the possibility of discovering spurious marker-trait associations (MTAs). High-throughput marker systems such as single nucleotide polymorphism (SNP) in combination with automated genotyping platforms (SNP arrays, KASpar assays, GoldenGate assays, etc.) have provided better options in generation of genotyping data. However, downstream analysis of such large volume data (quality assessment, identification of parental polymorphism and subsequently assessment of informative SNPs in population) takes time to provide meaningful information, which can be used for MTAs. This limits the rapid deployment of high probability MTAs in genomics-assisted breeding (GAB) and, subsequently, delays development of new breeding lines (Varshney et al., 2007). Additionally, meeting the increasing demand of nutritious food under anticipated climate change scenario along with ever-decreasing agricultural lands and limited water resources is a challenging task (Khoury et al., 2014). It requires sophisticated rapid genome mapping and targeted GAB approaches to produce better and high-yielding crop varieties in faster manner (Godfray, 2010;Varshney et al., 2005).
The rapid development of next-generation sequencing (NGS) technologies has enabled generation of genomic resources at large scale with faster pace during the last decade (Pazhamala et al., 2015;Pandey et al., 2016). NGS-based approaches have also provided rapid ways to establish relationship between genotype and phenotype at higher resolution . Nevertheless, despite the decreasing sequencing cost, development of individual reference-based assembly for each accession in a given species or progeny of mapping populations is still a challenging task. To overcome this bottleneck and to identify genomic segments responsible for phenotypic traits using NGS, many alternative approaches such as SHOREmap (Schneeberger et al., 2009), Next-generation mapping (NGM) (Austin et al., 2011), MutMap (Abe et al., 2012), Isogenic mapping-by-sequencing (Hartwig et al., 2012), SNP-ratio mapping (SRM) (Lindner et al., 2012), MutMap+ (Fekih et al., 2013), MutMap-Gap (Takagi et al., 2013a) have been used. Abovementioned studies rely on a number of different principles, which can handle mainly qualitative traits (traits governed by 1-2 genes). In contrast, QTL-seq approach was proposed primarily to deal with quantitative traits, based on D SNP index to map the target genomic region(s) for blast resistance and seedling vigour in rice (Takagi et al., 2013b). Similarly, whole-genome resequencing (WGRS)-based BSA was applied to calculate G 0 statistics to identify the QTLs for cold tolerance in rice seedling . Recently, genome resequencing of contrasting parents together with identification of nonsynonymous SNP (nsSNP) substitution was utilized for identification of candidate genes in defined QTL regions or new genic regions in many crops (Silva et al., 2012;Singh et al., 2016a;Xu et al., 2014). To list a few, nsSNP substitution approach has been successfully utilized in mapping the candidate genes for sheath blight resistance in rice (Silva et al., 2012), drought tolerance in maize (Xu et al., 2014) fusarium wilt (FW) and sterility mosaic disease (SMD) resistance in pigeonpea (Singh et al., 2016a). In all these studies, SNP genotyping data were used for establishing MTAs. However, Indels in the genomic regions based on bulked segregant sequencing have not yet been targeted for trait mapping. Evidence of involvement of Indels in altering the gene functions has been reported in different crops (see Kage et al., 2015). Further, in comparison with other markers, Indels have a number of inherent advantages such as abundance in the genome, multiallelic and codominant, ease in marker conversion and amenable to low-cost genotyping.
In view of above, this study reports a novel approach called 'Indel-seq', which is a combination of WGRS and BSA, for the identification of Indels associated with target traits. An example of application of Indel-seq has been provided in pigeonpea with FW and SMD resistance as target traits. In this context, the extreme bulks (resistant and susceptible) along with the resistant parents of recombinant inbred lines (RILs) segregating for FW and SMD resistance were sequenced. Candidate genomic regions/genes were identified for FW and SMD resistance in pigeonpea using Indel-seq approach. Further, the identified Indels were validated on a set of FW-and SMDresistant and FW-and SMD-susceptible genotypes. In summary, Indel-seq seems to be a suitable approach for coarse as well as fine mapping of quantitative traits in a rapid and precise manner.

Principle of Indel-seq
Indel-seq combines WGRS and BSA to identify the genomic regions associated with the target traits. To initiate Indel-seq approach, any segregating population (F 2 /RILs/back-cross) for the target traits could be utilized. Based on the phenotypic data of segregating population, 15-20 lines of extreme classes can be selected to constitute DNA pools in high trait bulk (HTB) and low trait bulk (LTB). Subsequently, two bulks (HTB and LTB) along with the high trait parent (HTP) are subjected to WGRS with high genome coverage (~109) (Figure 1). WGRS data, subsequently, can be analysed in a proposed manner to detect trait(s)-associated Indels. As the first step in analysis, high-quality WGRS data from HTP, HTB and LTB are mapped to the reference genome (RG). Mapped/aligned data are used for the identification of genomewide Indels. Identified Indels are then subjected to high-quality filtering parameters such as Q value >30, homozygous and no 'N' (missing call) in any tested sample. Further homozygous Indels supported by a minimum of seven sequencing reads in both the bulks (HTB and LTB) can be selected for establishing MTAs. In this direction, each Indel could be passed through the either (i) or (ii) of following criteria: 1. RG = HTP = HTB 6 ¼ LTB; here RG is similar to HTP. Indel should be selected if call is similar in RG, HTP and HTB and alternative call in LTB. 2. HTP = HTB 6 ¼ LTB = RG; here RG is similar to LTP. Indel should be selected if similar call is present in HTP and HTB and contrasting call in LTB and RG.
Further selected Indels based on above principles are subjected to chi-square (v 2 ) analysis to check their goodness of fit ratio, that is 1:1 in HTB and LTB. A significant deviation from the normally expected ratio of any Indel would indicate the possible association with the target trait. Effect of significantly associated Indels on genes and genomic regions can be predicted through SnpEff (http://snpeff.sourceforge.net/) software.

Application of Indel-seq approach in pigeonpea
Extreme pools for Indel-seq To deploy Indel-seq in pigeonpea for detecting the candidate genomic regions/genes for FW and SMD resistance, available sequencing and phenotypic data were utilized in this study (Singh et al., 2016a). In brief, phenotyping data generated for resistance to FW and SMD on the RIL population, that is ICPL 20096 (resistant to FW and SMD, HTP) 9 ICPL 332 (susceptible to FW and SMD, LTP), were used for defining resistant bulk (HTB) and susceptible bulk (LTB) of 16 individual RILs in each group (Figures 2 and S1-S2). Using WGRS, a total of 9.27, 8.99 and 8.43 Gb data were generated for the resistant parent or HTP and HTB and LTB, respectively (Table S1). Cleaned data were aligned to the pigeonpea reference genome resulting in mapping of total 90.6% (HTP), 81.8% (HTB) and 82.5% (LTB) of the total high-quality reads. Genome coverage was found to be 89.21% in HTP, 87.72% in HTB and 87.37% in LTB with an average depth of 13.4 X in HTP, 11.4 X in HTB and 10.8 X in LTB (Table S1).

Candidate Indels
Resequencing data sets for HTP, HTB and LTB were aligned with reference genome (RG) for identification of Indels (Varshney et al., 2012a). As a result, 211 603 genomewide Indels were identified. Of 211 603 Indels, 89 261 were identified on the pseudomolecules and the remaining were present on CcLG0 or floating scaffolds.
A total of 88,867 Indels with Q value >30 were selected for downstream analysis (Table S2 and Figure S3-S5). The lengths of these Indels were ranged from 1 bp to 99 bp ( Figure S6). Indels with heterozygous and 'N' (missing) calls in the HTP, HTB and LTB were also discarded, and a set of 33 577 Indels was subjected to further filtration. After applying final filtering criteria, that is Indels with read depth ≥7 were selected, the number of Indels reduced to 14 408 across HTP, HTB and LTB. On pairwise analysis, a total of 1290 Indels were identified between HTB and LTB. These Indels were checked for the concept, that is RG = HTP = HTB 6 ¼ LTB. As  Figure 3). The chi-square values in HTB ranged from 7 (P-value: 0.008151) to 12 (P-value: 0.000532) and in LTB ranged from 7 (P value: 0.008151) to 14.22 (P-value: 0.000163) ( Table 1). These 16 Indels were found affecting 26 genes (Table 1). Of 26 affected genes, 24 genes showed the effect in the upstream/downstream of the genic region and two genes have effect at genic level (Table S3). Few of these candidate genes have been reported to play significant role in the defence mechanisms in other plant species (Table S3).

Validation of candidate Indels
To validate and classify the identified 16 candidate Indels associated with the target genes for FW and SMD resistance, a comparative analysis based on allele frequencies in available sequence data was performed among four additional FW/SMDresistant and FW/SMD-susceptible genotypes along with HTP, RG, HTB and LTB ( Table 2). As a result, of 16 candidate Indels, five with an effect on eight candidate genes were validated (Table 2).

Indels for FW resistance
Three Indels, one each on CcLG02, CcLG07 and CcLG08, were found to be associated with FW resistance. For instance, one-bp deletion identified on CcLG02 (at position 1 253 647 bp) showed 'C' allele in FW-resistant genotypes and HTB, whereas 'CA' allele was present in LTB and FW-susceptible genotype (ICPB 2049) with a P-value <0.001. The identified one-bp insertion in susceptible genotypes ('C' to 'CA') was found to be affecting AP-1 complex subunit sigma-2 (C.cajan_05665) and L-ascorbate oxidase (C.cajan_05665) at upstream and downstream regions, respectively. At 405 527bp position on CcLG07, 'AT' allele was identified in HTB-and FW-resistant genotypes and 'A' allele identified in LTB and susceptible genotype (ICPB 2049) with a Pvalue of <0.001. This single-bp deletion ('AT' to 'A') in susceptible genotypes showed an effect at intronic region and targeting receptor-like protein kinase (C.cajan_17341). On CcLG08 (at position 7 106 619 bp), one-bp deletion was observed in HTB-and FW-resistant genotypes ('T' allele) in comparison with LTB and FW-susceptible genotypes ('TG' allele) (with P-value <0.001). The insertion of one bp ('T' to 'TG') in susceptible genotypes has shown the effect on two genes (C.cajan_16014; Transcriptional corepressor SEUSS and C.ca-jan_16015; Uncharacterized protein).

Indels for SMD resistance
For SMD resistance, Indel-seq analysis has provided two associated Indels, one each on CcLG02 and CcLG10. On CcLG02 at 14 020 849 bp position, one-bp insertion in HTB and SMD resistance genotypes ('CA' allele) was detected. In the case of LTB and susceptible genotype (ICP 8863), 'C' allele was present (with P-value <0.001). The identified one-bp deletion ('CA' to 'C') in susceptible genotypes targeting two genes (C.ca-jan_05815 at upstream and C.cajan_05816 at downstream region), and both the genes were annotated as conserved oligomeric Golgi complex subunit 5. Similarly, On CcLG10 (at position 18 889 276 bp) one-bp insertion was observed in HTBand SMD-resistant genotypes ('AT' allele) in comparison with LTB and SMD-susceptible genotype ('A' allele) (with P-value <0.001). This single-bp deletion in susceptible genotypes ('AT' to 'A') showed frame-shift effect in an uncharacterized protein (C.cajan_15032). (e) Raw reads of HTP, HTB and LTB are aligned to the reference genome (RG, which is similar to high trait parent in term of target phenotype) for the identification of Indels. (f) Bulked segregant analysis (BSA) approach is applied to identify the associated Indels with the target traits using several Indel filtering criteria to identify putatively associated Indels between resistance and susceptibility. (g) List of putatively linked Indels identified. Due to excess vegetative growth, without growing into reproductive phase, this disease is known as the green plague of pigeonpea. Discussion NGS-based genome mapping enables identification of candidate genomic regions/genes in a rapid way, which is often difficult using traditional methods in terms of time and resources required (Varshney et al., 2012b). Recently, a number of SNP-based approaches combining BSA and WGRS have been successfully developed and implemented to identify the target candidate genes (see Zou et al. 2016). In the present study, an Indel-seq approach has been proposed for the identification of candidate genes/Indels associated with target traits. This approach has been tested in pigeonpea for rapid identification of candidate genes associated with the FW and SMD resistance.
To enable WGRS-based identification of candidate genes using mapping-by-sequencing approach, several methods have been developed and discussed in different crops (Abe et al., 2012;Austin et al., 2011;Hartwig et al., 2012;Schneeberger et al., 2009;Takagi et al., 2013a;Trick et al., 2012). Based on the published literature and through large-scale simulation studies, James et al. (2013) developed user guide for mapping-by-sequencing. Among different NGS-based approaches, QTL-seq approach provided the first successful example of mapping candidate genomic regions through NGSbased approach in crop species like rice (Takagi et al., 2013b). QTL-seq approach was found successful for identification of candidate genomic regions (SNPs) for FW and SMD resistance in pigeonpea (Singh et al., 2016a) and 100-seed weight and root trait ratio (RTR %) in chickpea (Singh et al., 2016b). However, in the majority of above-mentioned studies, Indels have been ignored. For effective applications of Indels in trait mapping, we propose here Indel-seq approach that is a combination of WGRS and BSA. Deployment of Indel-seq approach has been used for identification of candidate genomic regions associated with FW and SMD resistance in the present study.

Application of Indel-seq approach for identification of trait-associated Indels
Two types of genetic variations, namely SNPs and Indels, are the most promising variations and used in the trait mapping studies in a number of crops (Huang et al., 2012;Li et al., 2013;Thudi et al., 2014). In the recent past, NGS-based trait mapping approaches utilizing a large number of SNPs generated through   resequencing/genotyping have been used for trait mapping . SNP-based mapping approaches identified candidate genes for the target traits in many reports but identification of a large number of cloned genes with the presence of functional Indels through map-based cloning experiments for different traits in different crops revealed the importance of Indels for trait mapping and development of functional markers (Kage et al., 2015). Comparative analysis of Indel-seq approach with other NGSbased QTL mapping approaches combining WGRS and BSA revealed some pros and cons over other methods of trait mapping ( Table S4). The additional advantage of Indel-seq mapping approach is to map the candidate genes in the population developed by crossing gamma-induced mutants with the wild types due to the presence of genomewide Indels in the genome. Another important feature of Indel-seq is the high probability of development of PCR-based markers for trait mapping. The rapid fall in the cost of sequencing will facilitate application of Indel-seq for trait mapping in diploid crops with relatively smaller genomes such as rice (389 Mb), chickpea (738 Mb), sorghum (818 Mb), pigeonpea (833 Mb). However, analysis of data sets for complex and large genome species requires some additional modification in the selection criteria of Indel for marker-trait association analysis.
Indel-seq analysis in pigeonpea for mapping FW and SMD resistance has been very effective as it overcomes many constraints like identification of polymorphic markers between parents, the time required for genotyping of the mapping population, preparation of the (low density) genetic maps, and identification of QTLs (with large intervals). WGRS data of parental line and bulks revealed a higher number of genomewide Indels; however, comparatively low mapping percentage and genome coverage was obtained after aligning the raw sequences to the reference genome. This lower mapping and coverage percentage might be due to sequencing library used, sequencing errors, structural rearrangements or insertions in the query genome or deletions in the reference, a high percentage of repetitive elements (Sims et al., 2014) and quality of the reference genome. WGRS analysis of resistant parent and both the bulks revealed 89 261 genomewide Indels and 33 577 Indels between the bulks (HTB vs LTB), which further narrowed down to 1290 Indels, based on stringent selection criteria (read depth and homozygosity of calls in the bulks). The number of Indels was further reduced to 464 based on Indel-seq principle. However, this number is comparatively higher than the previous SMD resistance mapping experiments in which only 120 and 78 SSRs were found polymorphic in two mapping populations after screening of 3000 SSR markers (Gnanesh et al., 2011). Finally, based on chi-square analysis 16 candidate Indels were identified targeting 26 different candidate genes. The Indel-seq pipeline discussed in this report is very simple and after mapping raw reads to the reference genome, analysis can be done using simple Perl Script or in Microsoft Excel program (2010 and above).

Identification of significant genomic regions for FW-and SMD-resistant breeding
To check the efficiency of Indel-seq in identifying possible candidates (markers/genes) for the target traits, we have also used identified Indels (Table S5) in a recently proposed method known as EXPLoRA-web BSA (Duitama et al., 2014). We have significant results from each of the models proposed in EXPLoRAweb BSA (Tables S6). As expected, the lowest number of QTLs was reported in the high sensitive model (a = 5, b = 1) and highest number of QTLs in the high specific model (a = 30, b = 1). Interestingly, 12 of 16 candidate Indels identified through Indel-seq approach were found common in EXPLoRAweb BSA analysis (Table S7). Moreover, from the five validated Indels in the present study, four were also found in EXPLoRA-web BSA analysis. This has enhanced our confidence in proposing Indel-seq as a possible approach for fast trait mapping experiments. However, it is important to mention that EXPLoRA-web BSA has provided a large number of possible Indels' associations, which directly cannot be applied for genomics-assisted breeding (GAB) programmes, whereas Indel-seq has provided reasonable numbers of high confidence MTAs (three for FW and two for SMD) which can be converted into KASP markers. After validation of KASP markers, it can be utilized in GAB for development of FW-and SMD-resistant pigeonpea genotypes.

Conclusions
It is evident from the present study that identification of candidate genes for targeted traits based on NGS will not only increase the precision and power but also generate results in less time than the conventional methods of genome mapping. In near future due to rapid declining of sequencing cost and availability of high-quality draft genome sequences in several crops, we envisage application of Indel-seq for trait mapping and GAB for crop improvement. Identified target genes and associated Indels in the present study were validated on defined sets of genotypes for which sequence data were available. These results after validation on larger sets of genotypes will be useful in guiding diseases resistance breeding efforts in pigeonpea.

Plant materials and construction of pools
Six pigeonpea genotypes were selected based on their FW and SMD responses identified from our previous experiments (Saxena et al., 2010a;Singh et al., 2016a;Varshney et al., 2012a These RILs were phenotyped for FW and SMD resistance using standard procedures as mentioned in Nene and Reddy (1976) and Singh et al. (2003). The detailed descriptions on sick plot nursery, filed design and construction of bulks have been presented in Singh et al. (2016a).

Sequencing libraries and alignment of short reads of bulks
Raw sequencing data of ICPL 20096 (resistant parent or HTP) and resistant bulk or HTB and susceptible bulk or LTB were used for Indel-seq analysis (Table S2). The generated paired end reads of 251 bp lengths were cleaned using the tool Sickle (Joshi and Fass, 2012) with minimum phred quality score of 30 and minimum read length of 70 bp. The reads containing 'Ns' were also removed. The clean data of samples were used to align to the  (Varshney et al., 2012a) using BWA: Burrows-Wheeler Aligner (Li and Durbin, 2009) to get the Sequence Alignment/Map (SAM)/BAM (Binary Alignment/Map) alignment files, which results in alignment files in BAM format. The BAM files were further processed for Indel realignment using IndelRealigner component of Genome Analysis Toolkit (GATK; McKenna et al., 2010), and Picard utility was used for adding read group information. These processed BAM files were then subjected for the variants calling through GATK (DePristo et al., 2011) using standard parameters for the parent and both the bulks. The identified genomewide variants were further used for Indel-seq analysis for the identification of MTAs.

Mining of resequencing data sets for validation
To validate the candidate SNPs, resequencing data sets of four genotypes, namely ICPL 20097 (R-FW and R-SMD) and ICP 8863 (R-FW and S-SMD), ICPB 2049 (S-FW and R-SMD) and ICPL 99050 (R-FW and R-SMD), were used to find out the genes/markers unique to FW and SMD . To test the association, p-value was calculated between identified Indels with the target traits using single factor ANOVA in Microsoft Excel 2013.

EXPLoRA-web BSA
EXPLoRA-web BSA works upon the principle of LD to detect QTLs using Hidden Markov Model (HMM) (Duitama et al., 2014). Genomewide mapping reads of susceptible bulk (LTB) onto the reference genome (RG) was utilized to develop input files for EXPLoRA-web BSA analysis. Only those positions were selected for analyses, which were supported by a minimum of 10 reads. LTB was chosen for BSA in the present analysis because RG was resistant to both the diseases (FW and SMD). To control the EXPLoRA-web models, three different parameters were utilized for identification of QTLs (i) a = 5; b = 1 (high sensitivity) (ii) a = 10; b = 1 (the middle ground between sensitivity and specificity) and (iii) a=30; b=1 (high specificity). The a/b ratio determines the shape of the b distribution in the models, which reflects the probability for the phenotype-linked states (Pulido-Tamayo et al., 2016).

Supporting information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1 Classification of RILs and parents based on FW percent disease score (PDI). Figure S2 Classification of RILs and parents based on SMD percent disease score (PDI). Figure S3 Genome-wide identified insertion plots for all linkage groups. Figure S4 Genome-wide identified deletion plots for all linkage groups. Figure S5 Genome-wide identified Indels plots for all linkage groups. Figure S6 Number and length of insertions and deletions identified after mapping.