Human perforin gene variation is geographically distributed

Abstract Background Deleterious mutations in PRF1 result in lethal, childhood disease, familial hemophagocytic lymphohistiocytosis type 2 (FHL 2). However, not all mutations in PRF1 are deleterious and result in FHL 2. Currently, these nondeleterious mutations are being investigated in the onset of numerous disorders, such as lymphomas and diabetes. Yet, there is still an overwhelmingly large amount of PRF1 mutations that are not associated with disease. Methods We conducted a post hoc analysis of the PRF1 mutations in the coding region using the recently published Exome Aggregation Consortium genomes, Leiden Open Variation Database, NCBI SNP database, and primary literature to better understand PRF1 variation in the human population. Results This study catalogs 460 PRF1 mutations in the coding region, and demonstrates PRF1 is more variant then previously predicted. We identify key PRF1 mutations with high allelic frequency and are only found in certain populations. Additionally, we define PRF1 SNVs are geographically distributed. Conclusions This study concludes with a novel hypothesis that nondeleterious mutation in PRF1, which decreases perforin expression and/or activity, may be an example of selective advantage in the context of environmental stressors prevalent near the equator. Our studies illustrate how perforin deficiency can be protective from injuries resulting in blood–brain barrier (BBB) disruption.

Another possibility is that perforin variants exist in the population as a consequence of selective advantage during certain kinds of inflammation. Perforin is critical for virus and intracellular bacteria clearance. Perforin is also required for cytotoxic activity against tumors and may play a role in immune surveillance (Pipkin & Lieberman, 2007). However, perforin has also been demonstrated to be a key modulator of pathologic blood-brain barrier (BBB) disruption during CNS viral infection and experimental cerebral malaria (Nitcheu et al., 2003;Suidan, McDole, Chen, Pirko, & Johnson, 2008). Severity of BBB disruption occurs in a perforin gene dosage-dependent manner with higher perforin expression resulting in more disruption (Willenbring et al., 2016). Therefore, if perforin diversity is considered in regard to balancing the need to defend against infection with the attenuation of immune-mediated pathology, for specific kinds of infection, perforin single-nucleotide variants (SNVs) could be associated with a selection advantage.
Verifying the potential selective advantage associated with various perforin alleles has been difficult due to lack of analysis on perforin SNVs in healthy individuals. Recently, it has become possible to investigate genetic variation of perforin in a large number of characterized individuals, from geographically distinct populations. The Exome Aggregation Consortium has compiled 60,706 genomes from individuals of African, Latino, South Asian, East Asian, and European descent (Lek et al., 2016). Using these data, we investigated the genetic variation of PRF1 modeling their impact on the perforin protein, and evaluate perforin diversity as compared to other related genes. From this work, we put forward a hypothesis and model of how perforin genetic diversity may provide of selective advantage in certain environments.

| Proteomic structure of perforin
Perforin protein structure was illustrated using PyMOL (pymol.org, Schrodinger, LLC). The 3NSJ file was uploaded into PyMOL and used to demonstrate perforin structure and diversity among subdomains. Structural data were from the NCBI Protein Database, originally published in 2010 (Law et al., 2010).

| Compiling perforin SNVs
All perforin SNVs were compiled using the Exome Aggregation Consortium Database, the Leiden Open Variation Database, and primary literature reporting human perforin mutations. Only SNVs in the coding region of perforin were used in this study. When comparing genomic diversity to other genes, only Exome Aggregation Consortium Database data were used for perforin and all other genes investigated.

| Identifying perforin mutation classes associated with geographical regions
To define which perforin mutations should be described in reference to the frequency in distinct ethnic groups, we first identified the missense mutations within the pooled ExAC database that had an allelic frequency >0.001. Additionally, mutations of high interest based on their prevalence in the literature and association with disease were also included (Lee et al., 2006;Voskoboinik et al., 2005). This resulted in a list of 33 mutations. From this list, we then identified mutations that are predominant in one particular ethnic group, and/or have a previously reported defined effect on perforin activity. It should be noted that many perforin mutations have not been tested to determine their effect on function and still remain undefined.

| Calculating genomic diversity
For the purposes of this manuscript, all mutations in the coding region of human perforin gene (PRF1) are taken into consideration to better understand genetic diversity or genetic variation. To determine the genomic diversity of PRF1 compared to other human genes, we calculated the single-nucleotide variant rate (SNVR), missense mutation rate (MR), and variance using the following equations:

| Defining equatorial distance
Equatorial distance was determined through use of latitude degrees. Genomic data were divided into six ancestral regions. These regions are Africa, South Asia, East Asia, Latin America, Europe non-Finland, and Finland. Ancestral ethnicity was associated with each region as defined by the ExAC browser database. For each of these regions, a representative latitude degree was assigned for basis of discerning equatorial distance. The representative latitude was determined by choosing the median latitude for the defined region.

| Statistical analysis
All graphs were made using GraphPad Prism (Graphpad Software, San Diego, CA). Statistical tests were run in GraphPad Prism Statistical Software (GraphPad Software).

| Ratio of missense mutations to amino acids is highest in amphipathic a-helix
Human perforin is 555 amino acids, has a molecular weight of 67 kDa, and is organized into three major domains: the membrane attack complex perforin-like/cholesterol-dependent cytolysin (MACPF/CDC) domain, the epidermal growth factor (EGF) domain, and the C terminus domain. These three domains are further organized into eight subdomains (Figure 1a,c). We have determined using the ExAC Browser database and primary literature sources that there are 460 SNVs present in the coding region of human PRF1 (Figure 1b) Clementi et al., 2006;Feldmann et al., 2002;Goransdotter Ericson et al., 2001;Horne et al., 2008;Husami, 2011;Lee et al., 2006;Lek et al., 2016;Stepp et al., 1999;Urrea Moreno et al., 2009;Voskoboinik et al., 2005;Zhang et al., 2011;Zur Stadt et al., 2006). Of the 555 amino acids in perforin, over half are associated with a mutation. Missense mutations account for 350 of the 460 SNVs (Table 1).
We next established which subdomain of perforin contained the highest ratio of missense mutations to amino acids ( Table 1). The amphipathic a-helix has the highest ratio of missense mutations to amino acids compared to other perforin subdomains with a ratio of 0.966 ( Figure 1, Table 1). Not all amino acids within the amphipathic ahelix are associated with missense mutations. However, there are residues within this subdomain that are associated with multiple missense mutations ( Figure 1, Table 1). The lytic peptide subdomain has the next highest ratio of missense mutations to amino acids, with a ratio of 0.853 (Table 1). In contrast, the leader peptide, C2 domain, and cleavable C terminus have the lowest missense mutations to amino acid ratios of 0.477 and 0.467, respectively (Figure 1, Table 1). These data demonstrate the difference in the ratio of missense mutations to amino acids among different perforin subdomains, with the amphipathic a-helix subdomain having the highest ratio.

| Specific perforin mutations are associated with certain populations
Due to large amount of perforin mutations, we wanted to determine if certain classes of perforin mutations were associated with geographical regions. Using the ExAC browser database we were able to investigate the allelic frequency of certain perforin mutations associated with different populations (see Section 2). From our identification criteria, we identified 33 perforin mutations of interest (data not shown). From this list of 33, we investigated the allelic frequency of 14 perforin mutations (Figure 2). Of these 14 perforin mutations, six did not have a defined perforin activity associated with them. However, these mutations nonetheless had a higher frequency in certain populations ( Figure 2a). The missense mutations Arg4His and Val135-Met are most frequent in the African population (Figure 2a), although the missense mutations causing Val145Ala, Ala211Val, and Ala437Val substitutions are most frequent in the South Asian population (Figure 2a), the amino acid substitution Ile266Val is most frequent in the Latino population (Figure 2a).
A missense mutation that does not affect perforin activity, Asn252Ser, is present at comparable allelic frequency in four of the six populations studied (Figure 2b). The perforin amino acid substitution Ala91Val is well documented and results in reduced perforin activity (Voskoboinik et al., 2005(Voskoboinik et al., , 2007. This mutation had the highest allelic frequency in European (non-Finnish) and Finnish populations ( Figure 2c). Other missense mutations that reduce perforin activity are Gly149Ser and Arg232His. These substitutions have the highest allelic frequency in the Latino and South Asian populations, respectively ( Figure 2c) (Risma, Frayer, Filipovich, & Sumegi, 2006). Perforin mutations that result in null perforin activity have the highest allelic frequency in the African, South Asian, and East Asian population (Figure 2d). In the list of perforin mutations investigated, deleterious mutations were not detected in European (non-Finnish) and Finnish populations (data not shown, Figure 2d).

| Perforin allelic diversity is comparable to HLA molecules
The variant nature of perforin prompted us to compare perforin allelic diversity to other, relevant molecules. First, we determined the SNV rate for the genes encoding b-actin (actin), tumor necrosis factor alpha (TNFa), perforin-2 (MPEG1), complement 9 (C9), granzyme B (GzmB), human leukocyte antigens A, B, and C (HLA-A,B,C), and perforin ( Figure 3a). SNV rate is defined as the number of mutations in the coding region of the protein divided by the total number of nucleotides in the coding region of the protein (see Section 2). As expected, the HLA genes had the highest SNV rate. Also as expected, the highly conserved gene encoding actin had the lowest SNV rate among genes we analyzed. Similarly, the proinflammatory cytokine, TNFa, also had a low SNV rate. In contrast, perforin had a high SNV rate, comparable to HLA molecules (Figure 3a). Meanwhile, the gene encoding C9, the most homologous protein to perforin, had an intermediate SNV rate, as does the GzmB and MPEG1 gene.
The presence of a SNV in a specific gene does not necessarily result in an amino acid change. Therefore, we next determined the missense mutation rate of each of the genes in our panel. Missense mutation rate (MR) was calculated by taking the number of missense mutations present in the coding region divided by the number of amino acids present in the protein (see Section 2). Consistent with our calculated SNV rate, actin and TNFa MRs were the lowest. The MR of C9, MPEG1, and GzmB was intermediate. Meanwhile, the MR of perforin was high and comparable to HLA-A, B, and C (Figure 3b).
The SNV rate and MR rate inform us on the presence of diversity in the perforin gene, but do not reflect the prevalence of variants within the population. To define prevalence of variants in the human population, we determined the SNV variance for each aforementioned gene using Tajima's D equation (Tajima, 1989). We apply Tajima's D equation to combine the amount of variants present in the gene and the allelic frequency of each variant to evaluate variance (Tajima, 1989). As expected, the genes for the HLA molecules had a very high level of variance. Furthermore, the genes encoding actin and TNFa had very low SNV variance. The other genes studied that encode for C9, GzmB, MPEG1, and perforin all had an intermediate level of variance as determined by Tajima's D equation (Figure 3c). Although not as high as the HLA molecules, perforin SNV variance was considerably higher than actin, TNFa, MPEG1, and C9. A molecule found in the cytotoxic granule with perforin, GzmB, has almost threefold higher variance than C9 and slightly higher variance than PRF1 (Figure 3c). Perforin genetic variance did not surpass GzmB or HLA gene diversity, but is still higher than other molecules investigated (Figure 3).

| Perforin variance is indirectly correlated with distance from equator
Allelic shaping pressures associated with geographically distinct regions have influenced diversity in immunological genes (Jeffery & Bangham, 2000). To determine if geographically distinct pressures could potentially shape perforin gene diversity, we investigated perforin SNV variance in populations closer and farther away from the equator. Since GzmB and perforin had comparable SNV variance, we also determined the equatorial distribution of GzmB SNV variance. Again, we applied Tajima's D to define the SNV variance for the genes encoding perforin and granzyme B for populations of African, Latino, South Asian, East Asian, European non-Finnish, and Finnish descent. Each population had a distinct geographical region which could be assigned a distance from the equator, in latitude degrees (Figure 4a). Then using linear regression analysis, we assessed the relationship between the equatorial distance and SNV variance of perforin and GzmB for each population. As shown in Figure 4, we detected a significant, indirect correlation between PRF1 SNV variance and equatorial distance (Figure 4c). Furthermore, we investigated the correlation between perforin SNV rate and equatorial distance using a linear regression analysis. There was a significant correlation detected between perforin SNV rate and equatorial distance (Figure 4d). While a similar relationship between granzyme B SNV variance and equatorial distance was evident, the correlation did not reach significance (Figure 4b).

| DISCUSSION
In this study, we indexed 460 perforin SNVs in the coding region of the perforin gene, which is fourfold greater than previously reported . Furthermore, we categorize each subdomain of perforin as the ratio of missense mutations to amino acids. Subdomains that have a higher ratio may include subdomains that are less rigid in the amino acids necessary to be functional. The amphipathic a-helix had the highest ratio of missense mutations to amino acids (Table 1). This domain is part of the lipid membraneinserting domain (Persechini et al., 1992). If a mutation still allows for the protein to be inserted into a lipid membrane, this subdomain will retain function. In contrast, subdomains with a lower ratio of missense mutations to amino acids tend to be more critical to perforin protein function. This includes the C2 domain. The C2 domain is responsible for binding calcium, which is required to maintain proper perforin function. Additionally, the leader peptide protein also had a low ratio of missense mutations to amino acids. This could be due to the leader peptide serving an important role in directing perforin toward the secretory pathway . Additionally, it has been reported that the first 19 amino acids are necessary for proper pore formation and lytic activity (Persechini et al., 1992). Altering this subdomain could therefore be deleterious to pore formation, rendering perforin nonfunctional. Structure function studies need to be conducted to better understand the extent the ratio of missense mutations to amino acids reflects the mutation tolerance of perforin subdomains.
Since there are 460 PRF1 SNVs, we wanted to identify mutations of interest to further investigate in future studies. These mutations included those that were frequent across all populations or may be associated with a certain population. We identified 33 perforin mutations that had an allelic frequency of >0.001 or had a defined perforin activity associated with them. Many of these mutations were not associated with a defined perforin activity. However, some perforin mutations were more prevalent in certain populations (Figure 2). Mutations that did affect perforin activity, Ala91Val, Gly149Ser, and Arg232His, had very different population distributions. Ala91Val, a mutation that has been reported to cause 50% reduction in perforin activity, has the highest allelic frequency in European (non-Finnish) and Finnish populations (Voskoboinik et al., 2005) (Figure 2c). However, the Latino population has the highest allelic frequency for the Gly149Ser substitution. The South Asian population has the highest allelic frequency for the Arg232H substitution. Both of these missense mutations result in decreased perforin activity. Future studies should investigate if these mutations are associated with any disease or selective advantage.
Deleterious mutations in PRF1 can lead to the lethal childhood disease FHL 2. The ExAC browser database excludes individuals with childhood diseases. Therefore, we did not expect to find many deleterious perforin mutations in the ExAC browser population. However, we did identify deleterious mutations in the population. All of F I G U R E 1 Perforin mutations are present in all domains and subdomains of this pore-forming protein. (a) Genomic organization of PRF1 with indicated domains and subdomain. (b) List of amino acids, using single letter code, with corresponding mutations. Lower case letters indicate that there is a frameshift or deletion that results in an amino acid termination later in the protein. An "X" indicated a termination at that amino acid (c) 3D rendering (cartoon, mesh, surface) of perforin structure T A B L E 1 Summary of perforin SNVs broken down by subdomain. Each subdomain lists the total number of SNVs, mutations, amino acids in that subdomain, and the ratio of missense mutations to amino acids these mutations were heterozygous and were not present in the European (non-Finnish) and Finnish populations (Figure 2d). Consistent with previous studies, Leu18Ter substitution was unique to the African population (Lee et al., 2006). The substitution Arg410Trp was present in the African, Latino, and highest in the East Asian population (Figure 2d). Previously, this mutation has been associated with FHL 2 patients (Feldmann et al., 2002). However, studies should continue to investigate any consequence being heterozygous for these mutations has on long-term health.
Defining perforin as having more mutations than previously reported led us to compare perforin variance to both related and unrelated genes. The SNV and mutation rate of perforin were comparable to HLA-A, B, and C, which are considered some of the most polymorphic molecules known. Furthermore, it has recently been reported the SNV rate average for the genome is higher than previously expected with 1 in 8 nucleotides are associated with a change in the average human population (SNV rate = 0.125) (Lek et al., 2016). Within the perforin gene (PRF1), over 1 in 4 nucleotides are associated with a change in the human population (SNV rate = 0.27) (Figure 3a). When we investigated the prevalence of SNVs using Tajima's D equation, perforin SNV variance was much lower than the variance calculated for each of the genes encoding the HLA molecules ( Figure 3c). However, perforin SNV variance was still higher than the other genes in our panel and is comparable to the GzmB gene. From these data, we conclude that perforin variants F I G U R E 3 Perforin variability is higher than other genes but not as high as class I genes. Perforin (a) SNV rate, (b) mutation rate, and (c) variance as measured by Tajima's D compared to other genes WILLENBRING ET AL. may have significance since they are more prevalent than what we expected based on previous predictions (Petrovski, Wang, Heinzen, Allen, & Goldstein, 2013).
The observation that perforin gene diversity is more extensive than previously thought prompts an analysis of the significance of this finding. Genetic diversity in the human immune response is important to effectively protect against extrinsic and intrinsic health threats. The most notable example is the diversity in the major histocompatibility complex (MHC) class I and class II loci (Jeffery & Bangham, 2000). The diversity of MHC class I has been tracked in parallel with the human migration pattern, starting in Africa, through Asia, across the Bering Strait, down through North America and ending in South America (Fernandez Vina et al., 2012). Furthermore, specific haplotypes of MHC class I and II have been attributed to resistance of certain pathogens (Jeffery & Bangham, 2000). Although many immune-related molecules have been studied in reference to their diversity, perforin, the effector molecule of CD8 T and NK cells, had not been investigated. The (a) SNV rate of perforin (PRF1), (b) variance of PRF1, and (c) variance of GZMB are plotted against equatorial distance. Linear regression analysis was used to test the relationship between these two values. R 2 and p values are indicated on graph. ps < .05 are considered significant One indication of selective pressure for SNVs is geographical distribution. It has previously been shown that pathogen richness, a major selective pressure, is correlated with equatorial distance (Cashdan, 2014). Regions that are closer to the equator have a more diverse pathogen population than regions that are farther away (Cashdan, 2014). Since perforin is crucial for controlling intracellular pathogens, we hypothesized that perforin gene diversity may be due to a selective pressure such as pathogen richness. Using a linear regression analysis, we tested the relationship between equatorial distance and perforin SNV variance (as calculated with Tajima's D). We did detect a correlation between PRF1 SNV variance and equatorial distance ( Figure 4). In contrast, we did not observe a correlation between GZMB variance and equatorial distance ( Figure 4). These data demonstrate that the higher perforin variance we observe in populations closer to the equator may be a result of an equatorial-associated pathogenic pressure.
Lethal pathogenic pressures could take numerous forms. In areas such as Africa, diseases that can result in CNS immune-mediated pathology are prevalent, such as Dengue Virus Hemorrhagic Fever (DVHF), Crimean-Congo Hemorrhagic Fever (CCHF), and cerebral malaria (CM) (Kebede, Duales, Yokouide, & Alemu, 2010). Perforin has been shown to be a key regulator of neuropathologies, such as BBB disruption in experimental models (Suidan et al., 2008). Perforin null mice are resistant to both viralassociated BBB disruption and hemorrhage associated with cerebral malaria (Nitcheu et al., 2003;Suidan et al., 2008). However, perforin null humans suffer from lethal childhood disease, FHL2 (Stepp et al., 1999). Yet, reduction in perforin, but not ablation, reduces BBB disruption (Willenbring et al., 2016). We put forward a hypothesis that variance in perforin may contribute to maintaining the balance between immune-mediated and pathogen-mediated pathology ( Figure 5). Perforin may be increasing the relative fitness of an individual. An individual that has a perforin mutation resulting in decreased activity and/or expression may have advantage over others in that group during a neuropathologic event due to the ability to better balance inflammation-induced pathology. We contend that future studies should focus on the relationship between perforin mutations and protection from a perforin-mediated pathology.

ACKNOWLEDGMENTS
The authors would like to thank the Exome Aggregation Consortium and the groups that provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about. We would also like to thank Courtney Malo for her greatly valued contribution to figure optimization.

DISCLOSURE
The authors have nothing to disclose.

AUTHOR CONTRIBUTION
RCW, YI, LRP, and AJJ conceived and designed experiments. RCW performed the experiments. RCW, YI, LRP, and AJJ analyzed data. RCW and AJJ wrote the manuscript.

Robin C. Willenbring
http://orcid.org/0000-0002-2138-2912 F I G U R E 5 Current working model of perforin as an example of selective advantage. Based on data presented in these studies, we put forward the novel hypothesis that perforin may be a selective advantage, in regard to mediating BBB disruption and proper viral control. As perforin expression and/or activity are increased, the ability to control a viral infection is also increased. However, an individual also increases the susceptibility of having pathologic BBB disruption