Genotyping complex structural variation at the malaria‐associated human glycophorin locus using a PCR‐based strategy

Structural variation in the human genome can affect risk of disease. An example is a complex structural variant of the human glycophorin gene cluster, called DUP4, which is associated with a clinically significant level of protection against severe malaria. The human glycophorin gene cluster harbours at least 23 distinct structural variants, and accurate genotyping of this complex structural variation remains a challenge. Here, we use a polymerase chain reaction‐based strategy to genotype structural variation at the human glycophorin gene cluster, including the alleles responsible for the U– blood group. We validate our approach, based on a triplex paralogue ratio test, on publically available samples from the 1000 Genomes project. We then genotype 574 individuals from a longitudinal birth cohort (Tori‐Bossito cohort) using small amounts of DNA at low cost. Our approach readily identifies known deletions and duplications, and can potentially identify novel variants for further analysis. It will allow exploration of genetic variation at the glycophorin locus, and investigation of its relationship with malaria, in large sample sets at minimal cost, using standard molecular biology equipment.

The human glycophorin genes GYPE, GYPB and GYPA are located on three paralogous repeat units approximately 120 kb in size sharing 97% identity at chromosome 4. GYPA and GYPB encode glycophorin A and glycophorin B, respectively, both of which are major components of the erythrocyte membrane and are receptors for ligands on the surface of the P. falciparum merozoite (Baldwin, Li, Hanada, Liu, & Chishti, 2015;Mayer et al., 2009;Orlandi, Klotz, & Haynes, 1992;Pasvol et al., 1982;Sim, Chitnis, Wasniowska, Hadley, & Miller, 1994). No protein product of GYPE has yet been detected, and it is thought therefore to be a pseudogene. There is extensive structural variation at this locus resulting in at least eight distinct deletions of one or two repeat units, 13 distinct duplications of one or two repeat units, and two more complex variants (Leffler et al., 2017;Louzada et al., 2020). Some of these variants have been shown to be responsible for particular blood group phenotypes -for example the DUP4 variant codes for the Dantu blood group (Leffler et al., 2017), and homozygosity for either DEL1 and DEL2 variants results in the U-blood group as part of the S-s-U-phenotype observed in Africans (Gassner et al., 2020). These structural variants can be identified and genotyped using sequence read depth of mapped high-throughput sequencing reads (Algady et al., 2018;Leffler et al., 2017). However, the ability to rapidly genotype the different structural variants at this locus using simple polymerase chain reaction (PCR) approaches would have a number of benefits. Firstly, it would allow access to large cohorts with limited DNA, where whole genome amplification is both prohibitively costly and known to introduce bias in copy number measurement (Pugh et al., 2008;Veal et al., 2012). Secondly, it would allow laboratories in resource-limited situations to genotype glycophorin structural variation, as only standard molecular genetic laboratory equipment and an instrument suitable for DNA fragment analysis are needed. Previously we published a PCR-based approach to genotype the malaria-protective DUP4 variant using a paralogue-specific PCR targeting a junction unique to that variant. We used this PCR approach to genotype a Tanzanian cohort and demonstrate association of the DUP4 variant with haemoglobin levels in peripheral blood (Algady et al., 2018). In this report, we describe a simple triplex paralogue ratio test (PRT) system based on a single PCR of 10-ng DNA that can distinguish other structural variants of the glycophorin gene cluster.
PRT is a well-established form of quantitative PCR that is particularly robust in copy number detection in genomic DNA. It relies on co-amplification of a test and reference locus using the same PCR primer pair, and distinguishing the two products either by single nucleotide difference (Saldanha et al., 2011) or, more commonly, by a small size difference between the products (Armour et al., 2007).
Because the same primer pair is used to amplify test and reference regions that are similar in sequence, segmental duplications or diverged, dispersed and repeats, are usually targeted. The similarity in sequence means that the amplification of both the test and reference loci follows similar kinetics, and quantification of endpoint products reflects the relative amounts of the starting molecules. PRT has been shown empirically to be more robust than more common forms of quantitative PCR that use different primer pairs to amplify test and reference (Aldhous et al., 2010;Cantsilieris, Western, Baird, & White, 2014;Fode et al., 2011).
We also present a strategy to confirm the most common variants by junction fragment PCRs. Junction fragment PCR relies on knowledge of the copy number breakpoint to design paralogue-specific allele-specific primers flanking the breakpoint, so that if the copy number variant is present, the two paralogue-specific primers are brought into close proximity, enabling a short PCR product to be successfully amplified. Using these approaches, we present a cost-effective strategy to genotype structural variation in the human glycophorin region to facilitate large-scale genotyping studies of valuable DNA cohorts.

DNA samples
Control DNA samples from the 1000 Genomes project were chosen because DNA samples are publically available, and high throughput Illumina sequencing has been used to genotype structural variation at this locus (Leffler et al., 2017;Sudmant et al., 2015). A subset of genotypes of this sample set have been validated by fibre-FISH and breakpoints identified (Louzada et al., 2020). DNA samples for CEPH Europeans from Utah (CEU), Chinese from Beijing, Japanese from Tokyo and Yoruba from Ibadan Nigeria (YRI) were previously purchased from Coriell Cell Repositories. DNA from the Beninese infants came from a longitudinal birth cohort established in the district of Tori Bossito (Le Port et al., 2012). The protocol was approved by both the Ethical Committee of the Faculté des Sciences de la Santé (FSS) in Benin and the IRD Consultative Committee on Professional Conduct and Ethics (CCDE). All genomic DNA samples were diluted to a concentration of 10 ng/μl in water.

Paralogue ratio test (PRT)
Triplex PRT was conducted essentially as previously published (Armour et al., 2007;Hollox, 2017;Walker, Janyakhantikul, & Armour, 2009). Briefly, three primer pairs targeting different regions of the glycophorin repeat were combined in a single PCR reaction as follows: In a final volume of 10 μl, 10 ng of genomic DNA was amplified using 0.5 μl each of 10 μM forward and reverse primers for the three PRT assays (Table 1), 1 μl of 10× low dNTP buffer (Hollox, 2017), 1 μl of 10× Taq DNA polymerase buffer (KAPA Biosystems, with 15 mM MgCl 2 ) and 0.5 U Taq DNA polymerase. The forward primer of each PRT assay was 5′ labelled with a fluorescent tag, either 6-FAM or HEX, to allow visualisation of products by capillary electrophoresis. Cycling conditions were an initial denaturation at 94 • C for 2 min, followed by 25 cycles of 94 • C for 30 s, 65 • C for 30 s, 70 • C for 30 s and a final extension of 70 • C for 5 min. PCR products were denatured and run on an ABI3130xl capillary electrophoresis machine using POP-4 polymer with an injection time of 30 s under standard fragment analysis parameters and ROX-labelled Mapmarker 1000XL (Eurogentec) and areas under the peak of each PCR product subsequently analysed using the GeneMapper software. Samples with PCR product peak areas between 300 and 60,000 were accepted. For each sample, PRT area ratios were calculated using test and reference peaks as shown in Table 1. These values were then normalised using DNA samples of known copy number (Handsaker et al., 2015;Leffler et al., 2017;Louzada et al., 2020) that were included in every experiment: NA19190 (CN = 1; DEL2), NA19818 (CN = 1; DEL2), NA19085 (CN = 2; reference), NA19777 (CN = 2; reference) and NA19084 (CN = 3; DUP2). Following normalisation, data were further normalised to account for small batch effects by dividing against the median value for that experiment.

Junction fragment duplex PCR
Junction fragment duplex PCR was conducted using 10-ng genomic DNA, 1 μl each of 10 μM breakpointspecific primers with a 3′ terminal-linked nucleic acid to enhance specificity (Table 2), 0.1 μl each of 10 μM control primers (5′GAGTACGTCCGCTTCACC-3′ and 5′CTTCCACACTTTTGGCATGA-3′), 0.5 U Taq DNA polymerase and 2 μmol each of dATP, dCTP, dGTP and dTTP in a final volume of 10 μl 1× Kapa Buffer A (ammonium sulphate-based buffer, final concentration MgCl 2 1.5 mM). PCR cycling was 95 • C for 2 min, followed by x cycles of 95 • C for 30 s, y • C for 30 s, 72 • C for 30 s and a final extension of 72 • C for 5 min. The number of cycles (x) and annealing temperature (y) were assay specific and shown in Table 2. PCR products were analysed using standard 2% agarose gel electrophoresis, staining by ethidium bromide The PRT1E amplicon has a size of 418 bp but was not included in subsequent analysis as it was not present in all reference controls.
The 3′ nucleotide of each primer is bracketed, indicating that this nucleotide is a linked nucleic acid (LNA). and visualisation by UV light. The control primers amplify a 216 bp product, and the breakpoint-specific primers amplify product sizes shown in Table 2.

Development of triplex PRT system
A triplex PRT consists of three distinct PRTs that are amplified together in a single PCR reaction, which generates three independent measures of copy number in one single tube and electrophoresis capillary (Walker et al., 2009). If the PRTs have a test and reference on the same chromosome, they are called cis-PRTs, in contrast to trans-PRTs where the reference is located on a different chromosome from the test amplicon (Hollox, 2017). In this study, we designed three cis-PRTs, each one targeting a different region of the glycophorin repeat, with test and reference distinguished on the basis of product size by electrophoresis ( Figure 1). Because the most frequent deletion affects the GYPB gene, the test amplicon was the GYPB amplicon with the reference amplicons being either GYPA or GYPE amplicons. Because the extent of the most frequent structural variants are known, and the three different PRTs measure copy number at different regions of the glycophorin repeat, it is possible to predict the relative test/reference ratios of these different structural variants ( Figure 2). We validated our triplex PCR by analysing a cohort of 177 1000 Genomes samples with matching short read sequence data, where structural variants of the glycophorin region had previously been identified (Supporting Information). We could then directly compare our PRT results with estimates of copy number from previous sequence read depth analysis and fibre-FISH (Louzada et al., 2020). Analysis of PRT1 shows complete concordance with copy number measured by short-read sequencing depth, detecting the deletions and duplications identified previously (Figure 3a). Analysis of PRT2 shows a more complex situation; although the duplications are identified correctly, some of the deletion samples are identified by a gain of signal by PRT2 rather than a loss of signal (Figure 3b; Table 3). This discrepancy is expected, as PRT2 measures DEL2 as an increase in signal, because the amplicon corresponding to GYPA (reference) is lost not GYPB (test) (Figure 2). Three outliers were not previously identified as known structural variants and are likely to represent a variant with a gene conversion event affecting one of the PRT amplicons (Figure 3b). PRT4 yields similar results to PRT1 for this cohort, with a few individuals with no structural variant identified showing a higher than expected value, again probably due to gene conversion events.  By comparing across all three PRTs, the four different structural variants can be distinguished from individuals not carrying a structural variant. The exceptions are DUP2 and DUP27, which cannot be distinguished and are either the same variant or share very close breakpoints (Louzada et al., 2020), the deletions DEL6 and DEL13 and the duplications DUP26, DUP14 and DUP7 (Table 3). These variants are rare and would need to be distinguished either by developing a particular junction fragment PCR as we have done for three deletions described below or by genome sequencing. This analysis shows that by using different loci as test and reference loci, and by combining PRT assays together, different structural variants can be distinguished on the basis of copy number and breakpoint (Table 3). It should also be noted that, for PRT4, the clusters for the deletions are higher than expected, with a median value for heterozygotes around 1.25 compared to a theoretical value of 1 (Table 3), and a median value for homozygotes of 0.5 compared to a theoretical value of 0 (Table 3). Of the two DEL1 control samples, NA19818 shows the expected level of test amplicon amplification, whereas NA19190 shows a much lower level. The reason for this is unclear -it might be due to an undetected variant in the PRT4 test locus primer binding site on the reference allele in NA19190 -but the consequence is that the calibration curve for the positive controls for PRT4 is slightly distorted at lower copy number values. For future studies, we recommend using only NA19818 as the DEL1 positive control.

Analysis of Benin population
We used our triplex PRT approach on a cohort of 574 individuals from Benin in west Africa. Individuals from west African populations have been shown to have appreciable frequencies of different glycophorin structural variants (Leffler et al., 2017). A plot of PRT1 copy number estimates against PRT2 copy number estimates shows clear clustering of individuals (Figure 4a), with genotype inferred using the relative dosages predicted from the structures of the different variants ( Figure 2). A plot of PRT1 copy number estimates against PRT4 copy number shows dosages consistent with the assigned genotypes, but with several WT/WT samples as outliers for PRT4. The molecular basis for these outliers is not yet clear; they have not previously been identified as novel structural variants affecting copy number, so it seems likely that they are due to gene conversion events altering the ratio of test/reference amplicon dosage without altering the overall copy number of the locus.

Confirmation using junction fragment PCR
Junction fragment PCR uses the fact that, across a duplication or deletion breakpoint, two different sequences come into juxtaposition. By designing one PCR primer to the sequence one side of the breakpoint, and the other PCR primer to the other side of the breakpoint, a PCR amplification that is specific to genomic DNA containing this juxtaposition of sequences -that is the deletion or duplication -can be designed ( Figure 5). This approach can generate a positive test of the presence of a particular deletion or duplication in a particular genomic DNA. There are two main challenges in applying this approach to the glycophorin gene cluster deletions and duplications. Firstly, the location of the breakpoint must be known precisely to enable PCR primers to be designed to flank the breakpoint and still generate a small amplicon that is still likely to be amplified even if the genomic DNA is degraded. Secondly, because the sequences either side of the breakpoint are from glycophorin paralogous repeats that show >97% identity, each PCR primer must be carefully designed to be specific to that particular paralogue.
To confirm genotypes, we designed junction fragment PCRs to span the known breakpoints of DEL1, DEL2 and DEL7, as determined previously (Louzada et al., 2020). These were designed to generate short amplicons with a co-amplified non-variable control amplicon from an unrelated locus in the genome, to act as a positive control for DNA amplification. To ensure paralogue specificity, the 3′ end of each primer was designed over a nucleotide that distinguished the paralogue to be amplified from the other two paralogues, and was synthesised using a linked nucleic acid nucleotide analogue to enhance specificity in base-pairing (Latorra, Campbell, Wolter, & Hurley, 2003). These assays were developed using the samples for each variant as shown in Table 2. The assays for DEL1 and DEL2 were confirmed using the samples HGDP00449 and HGDP00474, whose genotypes had previous been inferred from short read genome sequencing (Louzada et al., 2020).

DISCUSSION
In this study, we present a genotyping strategy for structural variants of the glycophorin gene cluster in humans. We use a PCR-based approach called the PRT to estimate copy number gain and loss across the region. By combining information from three separate PRT assays, the genotype of particular structural variants can be called. These can then be confirmed by paralogue-specific junction fragment PCR assays. There are several strengths in the PRT approach. It calls copy number and does not rely on homogeneity of variant breakpoints, unlike junction fragment PCR. This allows a variant to be called that may not share the exact breakpoint with other examples of the same apparent variant, and this is known to be a complication in DUP2, encoding the Sta antigen. Because PRT is based on PCR of small fragments, it can be applied to large sample cohorts containing small quantities of degraded DNA. The two-stage approach of PRT followed by junction fragment PCR allows initial triage of samples homozygous for the reference variant and selection of samples carrying rarer variants for confirmatory junction fragment PCR, or other analyses, such F I G U R E 5 Detecting deletions using junction fragment PCR. Design of paralogue-specific PCR primers that span a deletion breakpoint allows detection of particular deletions by junction fragment PCR. In this example, paralogous sequences with a high level of identity from three glycophorin repeats are shown as purple (glycophorin A), orange (glycophorin B) and purple (glycophorin E). At the top is an example of a deletion formed between the glycophorin A and glycophorin B repeats. PCR primers (shown as arrows) are designed with the 3′ nucleotide at a position that distinguishes the repeat involved in the deletion from the other two repeats, and the combination of primers amplifies only across the deletion breakpoint. In this example, the primers are designed to distinguish B and E from A (left hand primer) and A from B and E (right hand primer). In combination, these two primers only amplify from the repeat unit at the top, spanning the deletion breakpoint [Colour figure can be viewed at wileyonlinelibrary.com] as targeted high-throughput sequencing. Because in most cases the breakpoints have previously been defined for these variants (Louzada et al., 2020), junction fragment PCR could also be designed for the very rare variants that currently cannot be distinguished by PRT, for example DUP7 and DUP14.
Our approach uses widely available low-cost molecular biology equipment, with the exception of capillary electrophoresis equipment. Our chosen positive controls are form the 1000 Genomes project, which have been validated extensively by short-read sequence read depth analysis, fibre-FISH and breakpoint Sanger sequencing (Louzada et al., 2020). It is also adaptable, for example using PRT1 alone would be the most straightforward to interpret for common deletions and duplications, and could be used to select for particular genotypes of interest for sequencing.
We demonstrated the usefulness of our approach by genotyping 574 individuals from a Beninese birth cohort. We found two deletions also observed in other west African populations (DEL1 allele frequency 7% and DEL2 allele frequency 5%) that, when homozygous, are responsible for the U-blood group (Gassner et al., 2020). Genotype frequencies are in Hardy-Weinberg proportions, and there is no previous epidemiological evidence that these affect the risk of malarial clinical phenotypes (Leffler et al., 2017). We found two examples of the DEL7 variant and two DUP3 variants, confirming that these are present, but rare, in west African populations. We found no examples of the DUP4 variant that is protective against malaria.
As with any approach, there are weaknesses. As mentioned above, the junction fragment PCR can be challenging to design and assumes homogeneity of breakpoints across all copies of the same variant. Although PRT can determine copy number robustly (Adewoye et al., 2018;Cantsilieris et al., 2014), care must be taken to include internal positive controls to allow for normalization within an experiment and across experiments (Hollox, 2017).
In conclusion, we hope our genotyping approach will be useful to other investigators in tackling this fascinating, complex region where variation has profound consequences on the susceptibility of individuals to malarial disease.

A C K N O W L E D G E M E N T S
We would like to thank Rachael Madison for technical support and Mark Jobling for access to the ABI 3100xl capillary electrophoresis machine. We would also like to thank the children and families from Tori Bossito in Benin for their contribution to this study. WA was funded by a PhD studentship from the Saudi Arabia Cultural Bureau in London.

AUTHOR CONTRIBUTIONS
WA and EJH conceived and designed the work. AG and DC contributed samples. WA, EW, DM and EJH collected and analysed data. EJH wrote the paper with contributions from all authors.

D ATA AVA I L A B I L I T Y S TAT E M E N T
Data are available in the Supporting Information.