Phasing and imputation of single nucleotide polymorphism data of missing parents of bi-parental plant populations

This paper presents an extension to a heuristic method for phasing and imputation of genotypes of descendants in bi-parental populations so that it can phase and impute genotypes of parents of bi-parental populations that are fully ungenotyped or partially genotyped. The imputed genotypes of the parent are then used to impute low-density genotyped descendants of the bi-parental population to high-density. The extension works in three steps. First, it identifies whether a parent has no or low-density genotypes available and identifies all of its relatives that have high-density genotypes. Second, using the high-density information of relatives, it determines whether the parent is homozygous or heterozygous for a given locus. Third, it phases heterozygous positions of the parent by matching haplotypes to its relatives. We implemented the new algorithm in an extension of the AlphaPlantImpute software and tested its accuracy of imputing missing parent genotypes in simulated bi-parental populations from different scenarios. We also tested the accuracy of imputation of the missing parent's descendants using the true genotype of the parent and compared this to using the imputed genotypes of the parent. Our results show that across all scenarios, the accuracy of imputation of a parent, measured as the correlation between true and imputed genotypes, was generally > 0.98 and did not drop below ~0.96. The imputation accuracy of a parent was always higher when it was inbred than when it was outbred, and higher when it had low-density genotypes than when it had no genotypes. Including ancestors of the parent at high-density, increasing the number of crosses and increasing the number of high-density descendants all increased the accuracy of imputation. The high imputation accuracy achieved for the parent across all scenarios translated to little or no impact on the accuracy of imputation of its descendants at low-density.
Key Message New fast and accurate method for phasing and imputation of SNP chip genotypes within diploid bi-parental plant populations.


Introduction
This paper presents an extension to a heuristic method for phasing and imputation of genotypes of descendants in bi-parental populations so that it can phase and impute genotypes of parents of bi-parental populations that are fully ungenotyped or partially genotyped. The imputed genotypes of the parent are then used to impute low-density genotyped descendants of the bi-parental population to high-density.

[...] breeding programs is that the number of selection candidates that would need to be genotyped at high-density in each cycle can be very large (Heffner et al., 2010).

In livestock and human populations, an effective strategy to overcome this cost barrier has been to genotype a subset of the population at high-density and to use this data for imputation of the rest of the population genotyped at low-density. [...]

The drawback of our previous algorithm is that it requires that both parents of each bi-parental population are known and have phased genotypes available at high-density. Although this is normally the case when parents are inbred, pedigree errors, sample loss or mislabelling, or poor DNA quality can mean that one or both parents have fully or partially missing genotype data. Additionally, if genotyping resources are limiting, breeders may choose not to genotype a parent that has only been used in one or two crosses. Furthermore, even if parents have high-density genotypes available, unless they are fully inbred (i.e., homozygous at every locus, so that all genotypes are phased de facto) it is unlikely that they have phased genotypes available for use in imputation.

This paper presents an extension to our previous algorithm in AlphaPlantImpute to enable it to phase and impute high-density genotypes of parents of bi-parental populations that are missing or that only have low-density genotypes available.
The extension requires that some relatives of the parent (e.g., descendants, ancestors, siblings) have high-density genotypes. The extension has three steps. First, it identifies whether a parent has no or low-density genotypes available and identifies all of its relatives that have high-density genotypes. Second, using the high-density information of relatives, it determines whether the parent is homozygous or heterozygous for a given locus. Third, it phases heterozygous positions of the parent by matching haplotypes to its relatives.

We tested the accuracy of imputing missing parent genotypes using the extension to AlphaPlantImpute in simulated bi-parental populations from different scenarios. These scenarios varied in the level of inbreeding in the missing parent, whether the parent had no genotypes or was genotyped at low-density, the number of crosses that the parent was used in and whether the ancestors of the parent had high-density genotypes available. We calculated the accuracy of imputation of the missing parent within each scenario as the correlation between the true and imputed genotypes. We also tested the accuracy of imputation of the missing parent's descendants using the true genotype of the parent compared to using the imputed genotypes of the parent. Our results show that across all scenarios, the accuracy of imputation of a parent was consistently high. The imputation accuracy of a parent was always higher when it was inbred than when it was outbred, and higher when it had low-density genotypes than when it had no genotypes. Including ancestors of the parent at high-density, increasing the number of crosses and increasing the number of high-density descendants all increased the accuracy of imputation. The high imputation accuracy achieved for the parent across all scenarios meant that using imputed rather than true parent genotypes had little or no impact on the accuracy of imputation of its descendants at low-density, which remained high.
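As an illustration of the accuracy measure described above, the following minimal Python sketch computes the Pearson correlation between true and imputed genotypes coded as 0, 1 or 2. The missing-genotype code `9`, the function names and the definition of yield as the proportion of loci with a called genotype are assumptions made for this sketch and are not taken from AlphaPlantImpute.

```python
from math import sqrt

MISSING = 9  # hypothetical code for an uncalled genotype (assumption)


def imputation_accuracy(true_genotypes, imputed_genotypes, missing=MISSING):
    """Pearson correlation between true and imputed genotypes (0/1/2).

    Loci where the imputed genotype is missing are skipped.
    """
    pairs = [(t, i) for t, i in zip(true_genotypes, imputed_genotypes)
             if i != missing]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_t = sum(t for t, _ in pairs) / n
    mean_i = sum(i for _, i in pairs) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_i = sum((i - mean_i) ** 2 for _, i in pairs)
    if var_t == 0 or var_i == 0:
        return float("nan")  # correlation is undefined without variation
    return cov / sqrt(var_t * var_i)


def imputation_yield(imputed_genotypes, missing=MISSING):
    """Proportion of loci with a called genotype (assumed definition)."""
    called = sum(1 for g in imputed_genotypes if g != missing)
    return called / len(imputed_genotypes)
```

Under these assumptions, a parent whose called genotypes all match the truth but which has one uncalled locus out of six would have an accuracy of 1 and a yield of 5/6.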

Materials and methods
Definitions
A focal individual is a descendant individual that is to be imputed. Parent A is the missing parent that is the target of imputation. The high-density (HD) array is the target array for imputation. In our test datasets, the HD array consisted of 25,000 SNP markers. The low-density (LD) array is the array at which focal individuals have genotypes and where Parent A may have genotypes. The LD array consisted of 50 SNP markers.

Description of the method
We present an extension to the original imputation method in AlphaPlantImpute to phase and impute parents of bi-parental populations that are missing or that have LD genotypes available. First, AlphaPlantImpute identifies parents with missing genotypes or unphased genotypes (hereafter described for a single parent referred to as Parent A). Second, AlphaPlantImpute gathers HD genotype information of all known relatives for Parent A. Relatives include ancestors, siblings, descendants and mates. AlphaPlantImpute then uses any genotype information available on Parent A and its relatives to first impute missing genotypes and then phase heterozygous genotypes of Parent A.
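The first two steps described above can be sketched as a simple pedigree lookup. This is only an illustrative sketch: the pedigree layout (a dictionary mapping each individual to a pair of parent identifiers), the function names and the definition of siblings as individuals sharing both parents with Parent A are assumptions, and do not describe the data structures used inside AlphaPlantImpute.

```python
def parents_without_hd(pedigree, hd_genotyped):
    """Return parents that appear in the pedigree but lack HD genotypes."""
    parents = {p for pair in pedigree.values() for p in pair if p is not None}
    return sorted(parents - hd_genotyped)


def descendants_of(pedigree, individual):
    """Collect all descendants of an individual, generation by generation."""
    found = set()
    frontier = {individual}
    while frontier:
        children = {ind for ind, parents in pedigree.items()
                    if frontier & set(parents)} - found
        found |= children
        frontier = children
    return found


def hd_relatives(pedigree, hd_genotyped, parent_a):
    """Group the HD-genotyped relatives of Parent A by relationship."""
    ancestors = {p for p in pedigree.get(parent_a, (None, None))
                 if p is not None}
    # Assumption: siblings share both recorded parents with Parent A.
    siblings = {ind for ind, parents in pedigree.items()
                if ind != parent_a and ancestors and set(parents) == ancestors}
    # Mates are the other parent of the direct offspring of Parent A.
    mates = {p for parents in pedigree.values() if parent_a in parents
             for p in parents if p is not None and p != parent_a}
    relatives = {
        "ancestors": ancestors,
        "siblings": siblings,
        "descendants": descendants_of(pedigree, parent_a),
        "mates": mates,
    }
    return {kind: sorted(ids & hd_genotyped) for kind, ids in relatives.items()}
```

For example, with a hypothetical pedigree in which Parent A (offspring of G1 and G2) is crossed to Parent B to give an F1 that is then selfed, `hd_relatives` would return G1 and G2 as ancestors, B as a mate, and the F1 and F2 individuals as descendants, provided each of them is in the HD-genotyped set.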

Parent A not genotyped
In livestock, the next generation is produced by a single cross of two ancestors. This means that loci where both ancestors are homozygous for the same genotype (i.e., both are genotype 0 or genotype 2) and loci where the ancestors are opposing homozygotes (i.e., one is genotype 0 and the other is genotype 2) can be confidently imputed in their offspring. In plant breeding populations, individuals are often the product of a single cross to produce F1 individuals, followed by many rounds of selfing. This means that if an offspring (in this case Parent A) has no genotypes but has ancestors genotyped at HD, the only loci that can be confidently imputed are those where both of its ancestors are homozygous for the same genotype. These loci are phased de facto.

If Parent A has HD descendants and mates, this information is used to phase and impute genotypes for Parent A in the following three steps: (1) Infer positions where Parent A is likely to be homozygous based on allele frequencies in descendants. For example, if all HD descendants are fixed for the 0 allele, then Parent A is likely to be genotype 0. If the allele frequencies are almost equal and the mate of Parent A is known to be genotype 0, then Parent A is likely to be genotype 2; (2) Infer positions where Parent A is likely to be heterozygous based on genotype frequency distortion in descendants. This is calculated using a chi-square test of observed genotype counts against the genotype counts expected given the observed allele frequencies. If there is significant distortion and the mate is homozygous, then Parent A is likely to be heterozygous; (3) To phase inferred heterozygous loci of Parent A at HD, collate the genotypes of all HD descendants and mates at these loci. Use these loci as anchor points in the [...]

Parents B, C, D and E were assumed genotyped at HD.
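The ancestor rule and steps (1) and (2) for an ungenotyped Parent A can be sketched for a single locus as follows. This is a hedged illustration rather than the AlphaPlantImpute implementation. Several details are assumptions, since the text does not specify them: the missing code `9`, the tolerance `freq_tol` for "almost equal" allele frequencies, the use of Hardy-Weinberg proportions as the expected genotype counts, the 5% critical value with one degree of freedom (3.841), and the symmetric cases (descendants fixed for the other allele, mate of genotype 2).

```python
MISSING = 9  # hypothetical code for an uncalled genotype (assumption)


def impute_from_ancestors(ancestor1, ancestor2, missing=MISSING):
    """Impute a locus only when both ancestors are the same homozygote."""
    if ancestor1 == ancestor2 and ancestor1 in (0, 2):
        return ancestor1
    return missing


def genotype_distortion_chi2(genotypes):
    """Chi-square statistic of observed genotype counts against counts
    expected from the observed allele frequency (assumed Hardy-Weinberg)."""
    n = len(genotypes)
    p = sum(genotypes) / (2 * n)  # frequency of the allele counted by the coding
    expected = [n * (1 - p) ** 2, n * 2 * p * (1 - p), n * p ** 2]
    observed = [genotypes.count(g) for g in (0, 1, 2)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def impute_from_descendants(descendants, mate, freq_tol=0.1,
                            chi2_critical=3.841, missing=MISSING):
    """Call Parent A at one locus from HD descendants and its mate."""
    n = len(descendants)
    p = sum(descendants) / (2 * n)
    # Step 1: descendants fixed for one allele -> Parent A homozygous.
    if p == 0.0:
        return 0
    if p == 1.0:
        return 2
    # Step 1: near-equal allele frequencies and a homozygous mate
    # -> Parent A is the opposing homozygote.
    if abs(p - 0.5) <= freq_tol and mate in (0, 2):
        return 2 - mate
    # Step 2: significant genotype distortion and a homozygous mate
    # -> Parent A heterozygous.
    if mate in (0, 2) and genotype_distortion_chi2(descendants) > chi2_critical:
        return 1
    return missing
```

Loci that are called heterozygous in step (2) would then be passed to the phasing step (3), which is not sketched here.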

Results
Unless otherwise stated, all results presented below had 10 HD descendants per cross.

Effect of whether Parent A is inbred or outbred
The imputation accuracy of Parent A was always higher when it was inbred than when it was outbred but the differences were small. Figure 1 plots [...] no genotypes (opaque) or had LD genotypes (transparent). Figure 1 shows that when Parent A had no genotypes, the accuracy of imputation was 1.01 times higher when it was inbred than when it was outbred (0.980 vs. 0.970). When Parent A had LD genotypes, the accuracy of imputation was 1.02 times higher when it was inbred than when it was outbred (0.999 vs. 0.983). For all cases, the yield of imputation was [...]

Effect of whether Parent A has LD genotypes or not
The imputation accuracy of Parent A was always higher when it had LD genotypes than when it had no genotypes but the differences were small. Figure 1 shows that when Parent A was inbred, the accuracy of imputation was 1.02 times higher when it had LD genotypes than when it had no genotypes (0.999 vs. 0.980). When Parent A was outbred, the accuracy of imputation was 1.01 times higher when it had LD genotypes than when it had no genotypes (0.983 vs. 0.970).

Effect of including Grandparent 1 and Grandparent 2 at HD
Including Grandparent 1 and Grandparent 2 increased the accuracy of imputation when Parent A had LD genotypes but the differences were small. When Parent A had no genotypes, the accuracy of imputation was the same regardless of whether Grandparent 1 and Grandparent 2 were included or excluded. Figure 2 is similar to Figure 1 and plots the genotype accuracy (Figure 2a) and genotype yield (Figure 2b) for Parent A in Scenarios 1 and 5. Figure 2a shows that the main benefit of including Grandparent 1 and Grandparent 2 for increasing the imputation accuracy was when Parent A was outbred and had LD genotypes. In this case, the accuracy of imputation of Parent A was 1.02 times higher when Grandparent 1 and Grandparent 2 were included than when they were excluded (0.997 vs. 0.983). However, this increase in accuracy was at the expense of yield. Figure 2b shows that when Parent A was outbred and had LD genotypes, the yield was 100% when Grandparent 1 and Grandparent 2 were excluded and was 97.4% when Grandparent 1 and Grandparent 2 were included.

Effect of the number of crosses with Parent A
Increasing the number of crosses that Parent A was used in generally increased the accuracy of imputation but the differences were small. Figure 3a is similar to Figure 1 and plots the genotype accuracy for Parent A in Scenarios 1, 2, 3 and 4. Figure 3a shows that increasing the number of crosses from one in Scenario 1 to two in Scenario 2 increased the imputation accuracy regardless of whether Parent A was inbred or outbred, or had no genotypes or had LD genotypes. When Parent A was inbred, the accuracy of imputation was 1.02 times higher in Scenario 2 than in Scenario 1 when it had no genotypes (0.999 vs. 0.980) and was just slightly higher when it had LD genotypes (1.0 vs. 0.999). When Parent A was outbred, the accuracy of imputation was 1.01 times higher in Scenario 2 than in Scenario 1 when it had no genotypes (0.975 vs. 0.970) and was 1.01 times higher when it had LD genotypes (0.992 vs. 0.983). For all cases, the yield of imputation was 100%.

Increasing the number of crosses that Parent A was used in increased the accuracy of imputation most when Parent A was outbred and had LD genotypes but the differences were small. Figure 3a shows that when the number of crosses increased from one in Scenario 1 to four in Scenario 4, the accuracy of imputation was 1.02 times higher in Scenario 4 than in Scenario 1 when Parent A was outbred and had LD genotypes (0.999 vs. 0.983).

Figure 3a also shows that increasing the number of crosses that Parent A was used in decreased the accuracy of imputation when Parent A was outbred and had no genotypes but the differences were small. When the number of crosses increased from one in Scenario 1 to four in Scenario 4, the accuracy of imputation was 1.01 times higher in Scenario 1 than in Scenario 4 (0.970 vs. 0.959).

Effect of number of descendants with HD genotypes
Increasing the number of descendants with HD genotypes increased the accuracy of imputation of Parent A but the differences were small. Figure 3b is similar to Figure 3a and plots the genotype accuracy for Parent A in Scenarios 1, 2, 3 and 4 when the number of descendants with HD genotypes was 50. For example, for Scenario 1, when the number of descendants increased from 10 to 50, the accuracy of imputation was 1.01 times higher when Parent A was inbred and had no genotypes [...] genotypes.

In general across all scenarios, the average accuracy was > 0.98 and did not drop below ~0.96. The yield was 100% for all scenarios except one: when Grandparents 1 and 2 (i.e., the ancestors of Parent A) were included with HD genotypes and Parent A was outbred and had LD genotypes. In this case, the yield dropped to 97%. The reason for this is that this scenario had HD genotypes available for both Grandparents 1 and 2 and for 10 offspring of Parent A. The heuristic algorithm uses the two sources of information independently to impute Parent A. Where they disagree, the genotype is set as missing.

As expected, adding more information from relatives genotyped at HD increased the accuracy of imputation for Parent A. When Parent A was used in a single cross, including its parents at HD increased the accuracy of imputation for Parent A, particularly when Parent A was outbred and had LD genotypes. However, the increase in accuracy when Parent A had LD genotypes was at the expense of yield. This decrease in yield was likely caused by disagreement between Parent A genotypes imputed using its descendants genotyped at HD and genotypes imputed using its parents genotyped at HD. When Parent A had no genotypes, including its parents at HD had no effect.
This is because the only loci that could be filled with confidence were loci where its parents were fixed for the same allele.

Increasing the number of crosses that Parent A was used in increased the accuracy of imputation for Parent A when it was inbred or outbred and had LD genotypes. This was likely due to two reasons. First, the extra HD information from other crosses increased the ability to call heterozygous loci. For example, by chance within a single cross one of the haplotypes of Parent A may have been underrepresented or not represented in the descendants selected for HD genotyping but may have been represented in HD descendants in the second cross. Second, the LD genotypes of Parent A were used to assign parent-of-origin to the haplotypes of HD descendants. Loci that were not informative of parent-of-origin within one cross may have been informative in another cross, providing extra information on the haplotypes of Parent A. Increasing the number of crosses that Parent A was used in had only a small benefit when Parent A was inbred and had no genotypes. In this case, the accuracy of imputation for Parent A was already ~0.98 with a single cross and increasing the number of crosses increased the accuracy of imputation for Parent A to > 0.999. The only exception to the benefit of increasing the number of crosses was when Parent A was outbred and had no genotypes. This could have been caused by incorrect assignment or the inability to assign parent-of-origin to the haplotypes of HD descendants, which would result in incorrect or uncalled genotypes for Parent A.

Increasing the number of descendants at HD within a cross increased the accuracy of imputation across all scenarios. This is expected, since more HD relatives provide more information for confidently calling the genotypes of Parent A.
Overall, the results suggest that high imputation accuracy of > 0.98 and an imputation yield of 100% in almost all cases can be achieved for Parent A by collating HD genotypes of as many relatives as possible. This is critical for ensuring accurate imputation of descendants genotyped at LD.
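The trade-off between accuracy and yield discussed above follows from how the two independent sources of information are reconciled: where the ancestor-based and descendant-based calls disagree, the genotype is set as missing. A minimal sketch of such a consensus rule is given below. The missing code `9`, the function name, and the choice to accept a call from one source when the other source is missing are assumptions made for illustration, not documented behaviour of AlphaPlantImpute.

```python
MISSING = 9  # hypothetical code for an uncalled genotype (assumption)


def combine_calls(from_ancestors, from_descendants, missing=MISSING):
    """Combine two independent genotype calls per locus for Parent A.

    Agreement keeps the call, disagreement sets the locus to missing, and a
    call from a single source is kept when the other source is missing.
    """
    combined = []
    for anc, desc in zip(from_ancestors, from_descendants):
        if anc == missing:
            combined.append(desc)
        elif desc == missing:
            combined.append(anc)
        elif anc == desc:
            combined.append(anc)
        else:
            combined.append(missing)  # conflicting evidence: leave uncalled
    return combined
```

Under this rule, every conflicting locus lowers yield rather than risking an incorrect call, which is consistent with the drop in yield to 97% observed when both grandparents and descendants of an outbred Parent A were genotyped at HD.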

Effect of using imputed genotypes or true genotypes of Parent A to impute F2 focal individuals
Using true or imputed genotypes of Parent A had only a small effect on the accuracy of imputation of F2 focal individuals. The largest increase in imputation accuracy when using true genotypes rather than imputed genotypes for Parent A was observed when Parent A was outbred and not genotyped, but even in this case the increase was 0.028. The likely reason for the small increase was that the accuracy of imputation of Parent A was in general > 0.96 across all scenarios. Therefore, our results suggest that some error in the imputation of Parent A is likely to have minimal, if any, effect on the imputation of focal individuals that are its descendants.
Relevance for breeding programs