Bioinformatic screening and detection of allergen cross‐reactive IgE‐binding epitopes

Protein allergens can be related by cross‐reactivity. Allergens that share relevant sequence can cross‐react, whereas those lacking sufficient similarity in their IgE antibody‐binding epitopes do not. Cross‐reactivity is based on shared epitopes, which in turn are based on shared sequence and higher level structure (charge and shape). Epitopes are important in predicting cross‐reactivity potential and may provide a basis for establishing criteria that identify homology among allergens. IgE‐binding epitope sequences from selected allergens were used to determine how the FASTA algorithm could be used to identify a threshold of significance. A statistical measure (expectation value, E‐value) was used to identify a threshold specific to cross‐reactivity potential. Peanut Ara h 1 and Ara h 2, shrimp tropomyosin Pen a 1, and the birch tree pollen allergen Bet v 1 were sources of known epitopes. Each epitope or set of epitopes was inserted into random amino acid sequence to create hypothetical proteins used as queries to an allergen database. Alignments with allergens were noted for the ability to match the epitope's source allergen as well as any cross‐reactive or other homologous allergens. A FASTA expectation value range (1 × 10⁻⁵ to 1 × 10⁻⁶) was identified that could act as a threshold to help identify cross‐reactivity potential.


Correspondence: Scott McClain
E-mail: scott.mcclain@syngenta.com
Abbreviations: E-value, expectation value; NCBI, National Center for Biotechnology Information

Introduction
Cross-reactivity has generally been studied in the context of food safety. The goal has been to understand how sensitization to one particular food may allow for cross-reactivity to homologous allergens in other foods. Cross-reactivity begins with sensitization to one allergen. Exposure to another highly similar protein that shares enough of the IgE-binding epitope structure/sequence can also sensitize and/or elicit an allergic response. This shared sequence and shared reactivity is termed cross-reactivity. In some cases, a sensitizer may not cause allergy symptoms, and the cross-reactive allergen is the eliciting antigen without being sensitizing to the patient. This example illustrates the case where the eliciting, cross-reactive allergen is an incomplete allergen. In contrast, the allergen that can both sensitize and elicit a reaction is considered a "true" or complete allergen. Nevertheless, it is the shared sequence homology that is of interest in determining the potential to elicit a clinically relevant allergy response, that is, an allergy event associated with symptoms such as wheezing, urticaria, and oral allergy syndrome.
Protein allergens have the potential to cross-react if two or more allergens share amino acid sequence to a degree that there is shared IgE antibody binding. IgE binding is ascribed to the IgE-binding regions, or epitopes, of the allergen. Shared IgE binding across multiple proteins is based on the premise that there is a high degree of homology within the epitope(s) to maintain IgE binding. Thus, cross-reactivity results from binding of the Fc epsilon-RI (FcεRI) receptor by the IgE-allergen complex formed with either allergen [1]. Together, this complex is the basis for stimulating mast cells and basophils to release mediators, such as histamine.
IgE-binding epitopes can take the form of a sequential, uninterrupted amino acid string or of a discontinuous distribution throughout the larger allergen sequence. A sequential epitope tends to result in a minimum length of sequence that can bind IgE [2,3], and in some food allergen cases it retains IgE-binding capacity after surviving gastric enzyme reduction of the intact protein [4]. Sequential epitopes may be part of larger epitopes, but likely have the same physical constraints in binding IgE whether they are isolated as peptides or located within the intact protein [5]. Still, sequential epitopes acting to bind IgE as part of the larger allergen structure would be impacted by the constraints of structure and charge imparted by the rest of the sequence. Discontinuous epitopes (e.g., conformational or nonsequential) are defined by an allergen remaining intact and the maintenance of proper folding (secondary and tertiary sequence conformation) to bring distributed residues within range of one another to allow IgE binding.
The structural family of allergens with the most numerous sequences is the birch tree pollen allergens (Bet v 1). This allergen group is well recognized for birch tree respiratory sensitivity, which can be due to sensitivity to one or more of the many isoforms produced by Betula pendula (verrucosa) species [6]. However, the allergen belongs to a broader group of structurally related proteins in several species, some of which can induce allergy based on shared homology. This allergen group is a good example of epitope homology, in which patients with birch pollen hay fever can also experience clinical symptoms that are not caused by the original sensitizing allergen, Bet v 1, but by a Bet v 1 homologue in a food. One allergen that shares homology with Bet v 1 is Mal d 1, the pathogen resistance-associated protein in apples that can cause oral allergy syndrome [7]. The Bet v 1 sequence appears to be the "parental" source of the shared epitopes, as all of the Mal d 1 epitopes (both B- and T-cell) are contained within Bet v 1. Clinically, the focus is on the elicitation response, and this is shown by Bet v 1 being able to inhibit IgE binding to the B-cell epitopes of Mal d 1, whereas Mal d 1 is unable to fully inhibit IgE binding to Bet v 1 [7]. It should be noted that the foods themselves are not exclusively dependent on Bet v 1 as a sensitizer; they can also prompt naïve or original reactivity to the allergens in those foods directly [8]. It should also be recognized that other proteins in birch pollen, and in foods with Bet v 1 homologues, may sensitize and elicit allergy of their own accord.
Bioinformatics has the capacity to statistically determine the probability of taxonomic relatedness at the protein level [9,10]. As Pearson (2000) notes, ". . . with biological sequences (as opposed to fair coins), the assumptions underlying the statistical model may not be met. When the assumptions fail, the highest scoring unrelated sequence may have an expectation value (E-value) that is much too low [e.g., E < 10⁻³] or much too high [E > 100]" [11]. This sets the context for using FASTA as a tool, which needs to be vetted for its use in specific cases with appropriate context for the groups of proteins being evaluated. Bioinformatics has been extended herein in its application for assessing whether similarity can describe the possibility of cross-reactivity between protein allergens. The shared percent identity in amino acids remains a traditional way to describe how alike two proteins are in their sequence. Although using identity (i.e., a percentage of shared, exact amino acid matches across a total amino acid length) to "find" potential cross-reactivity among sequences is recognized as imperfect [5], an identity threshold has found its way into regulatory guidance [12,13]. Thus, the metric of a minimum 35% shared identity, plus a minimum 80 amino acid overlap length, has become the criterion for establishing significant shared sequence between an unknown or novel protein allergen and a known allergen. In the regulatory framework from 2001 (FAO/WHO; evaluation of allergenicity of genetically modified foods), the intent was to set a tiered approach: if an alignment between an allergen and a novel protein exceeded 35% identity over an 80 amino acid overlap, then a second step, serum screening, would be employed to confirm the existence or absence of cross-reactivity. However, as was recognized at the time, there was no qualified, complete list of known allergens [14] that could be systematically explored for similarity thresholds.
Together with the fact that very few epitopes for allergens were known at the time, the 35% over 80 amino acids represented a conservative approach to setting a tiered assessment that hinged on serology-based confirmation of allergenicity, but lacked a detailed exploration of allergens and the way in which cross-reactivity can be assessed bioinformatically.
It is well understood that bioinformatics and alignment algorithms base their probability assessments of homology on extrapolating to higher order protein structure from sequential sequence similarity (i.e., identical and similar residues). In the case of the FASTA algorithm, the intent is to identify local alignments between sequences [11] to find the portions of two proteins that may describe their core areas of shared sequence. This local alignment feature of FASTA is consistent with how epitopes tend to be localized as small portions within the larger, intact protein sequence. In the current study, verified IgE-binding epitopes of key allergens were used to determine a minimum alignment threshold to detect homologous sequences. In using only epitope sequence information, the focus was on testing the capacity of minimum, but immunologically relevant sequence to act as a bioinformatic screen for other homologous or cross-reactive allergens. Known, sequential and discontinuous epitopes were used to model the localized positioning of the epitopes within a larger protein sequence. Hypothetical query proteins were constructed of random amino acid sequence that was modified to include known allergen epitopes; in effect, doping random sequence with known, biologically relevant allergen epitopes. Each hypothetical protein was then compared to a database of known allergens to determine whether FASTA alignments could discern homology based on only epitopes. In using epitopes, the goal was to model the use of bioinformatics for establishing threshold criteria based on biological similarity among distinct allergen groups.

Methods
Random amino acid sequence was used to construct hypothetical protein sequence(s) that were of the same length as known allergens. Random sequence was used to fill in between the portions of known allergen epitopes. This random "filler" sequence was derived from random, alternative open reading frame sequence, as derived and translated from an original gene, human alpha-amylase; the filler sequence otherwise had no similarity to the parental gene or any known gene, including allergens. The actual primary reading frame was ignored, and a reverse reading frame was translated and prepared using the BLAST program, GETORF3 routine [15]; this amino acid sequence was then randomized (Supporting Information Fig. 1). Portions of each filler sequence were repetitively used to construct hypothetical proteins of the proper length for each allergen from which epitopes were derived (discussed below).
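The filler preparation described above can be sketched in a few lines. This is a minimal illustration only: the article does not state how the translated sequence was randomized or how portions of it were reused, so the residue shuffle, the fixed seed, the tiling scheme, and the placeholder input string are all assumptions made for the example.

```python
import random


def randomize_sequence(seq: str, seed: int = 0) -> str:
    """Shuffle the residues of seq, preserving its amino acid composition.

    Assumption: a simple seeded shuffle stands in for the unspecified
    randomization step used in the study.
    """
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def tile_filler(filler: str, length: int) -> str:
    """Repeat the filler as often as needed and trim to the target length."""
    repeats = -(-length // len(filler))  # ceiling division
    return (filler * repeats)[:length]


# Placeholder input, NOT the alternative reading frame used in the study.
translated = "MKLSTAVGHRQENDPWYFIC"
filler = randomize_sequence(translated, seed=42)

# Background for a 626-residue hypothetical protein (the Ara h 1 length).
background = tile_filler(filler, 626)
```

Shuffling keeps the residue composition of the input while destroying its order, which is one common way to obtain sequence with no meaningful similarity to the parental gene.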

Hypothetical sequence construction based on allergen epitopes
Each hypothetical sequence was prepared by spacing the known epitopes and placing them into their same locations, relative to the N-terminus of the native allergen. Allergen sequences were based on their identification, as listed in the Food Allergy Research and Resource Program (FARRP) database, www.allergenonline.org [16,17].
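As a rough sketch of this construction step, the function below overwrites stretches of a background sequence with epitope peptides at their 1-based start positions. The function name, the background, the peptides, and the positions are hypothetical values chosen for illustration; they are not the epitopes or coordinates used in the study.

```python
def embed_epitopes(background: str, epitopes: dict[int, str]) -> str:
    """Place each peptide into the background at its 1-based start position.

    The total length is unchanged: epitope residues replace filler residues.
    If two peptides overlap, the one with the later start position wins.
    """
    seq = list(background)
    for start, peptide in sorted(epitopes.items()):
        end = start - 1 + len(peptide)
        if start < 1 or end > len(seq):
            raise ValueError(f"epitope at {start} does not fit in the sequence")
        seq[start - 1:end] = peptide  # same-length slice replacement
    return "".join(seq)


# Toy example: a 30-residue poly-glycine background and two made-up peptides.
toy_background = "G" * 30
toy_epitopes = {5: "AKLTW", 20: "PEPTIDE"}
hypothetical = embed_epitopes(toy_background, toy_epitopes)
```

Keeping the native start positions preserves the spacing between epitopes, which is the property the Discussion later identifies as important for the FASTA alignments.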
To create the hypothetical sequence for Ara h 1, the IgE-binding 10-mer peptides that were identified by the Bannon laboratory [18,19] were placed into random sequence to create a 626 amino acid sequence to be used for allergen database comparisons (Fig. 1A). The length was based on the Ara h 1 protein (GI: 1168391), with epitope locations mapped into the hypothetical sequence according to their native location in Ara h 1 [18].
A second peanut-based allergen sequence was constructed based on the recent discovery that nonhomologous proteins may have cross-reactivity [20]. A hypothetical sequence was constructed using synthetic epitopes known to cross-react with peanut allergens. The epitopes AH2-1, AH2-3a, and AH2-3c from Bublin et al. [20] were loaded into random sequence to make a 172 amino acid sequence (Fig. 1B) based on the Ara h 2 protein (GI: 26245447). A supplementary 172 amino acid hypothetical sequence was also prepared, in which a contiguous 69 amino acid section covering the AH2-1, AH2-2, AH2-3a, AH2-3b, and AH2-3c epitopes [20] was inserted into random sequence.
The epitopes of the Pen a 1 tropomyosin allergen from the brown shrimp species Penaeus aztecus [21] were used to prepare a hypothetical sequence as with the other allergens (note that the genus is now listed as Farfantepenaeus). Epitope regions were inserted based on concatenating the individual, overlapping epitopes listed in the work by Reese et al. (Fig. 1A-E in [21]). The length is based on the Pen a 1 protein (GI: 73532979) and was used to create a hypothetical protein sequence of 284 amino acids (Fig. 1C).
The epitope [22] of the European White birch (B. pendula) pollen allergen, Bet v 1, was prepared similarly to the others. The single epitope is discontinuous along the length of the allergen. The length is based on the average length of several listed isoforms of the Bet v 1 protein (example given by GI: 1542865), with random sequence used to create a 160 amino acid hypothetical protein (Fig. 1D). In addition to using only the epitope residues, a separate hypothetical sequence was constructed using the entire region of the Bet v 1 protein over which the epitope residues are dispersed. This "epitope region" is 56 amino acids in length, and two separate sequences were constructed for comparing with the allergen database. It should be noted that using only one epitope is not necessarily a representation of the multiple, nonoverlapping epitopes required by an allergen for cross-linking of the FcεRI receptor [1].

FASTA comparisons
Each hypothetical sequence was compared to a protein allergen sequence database (FARRP, 2015). The database consisted of 1,897 sequences representing clinically confirmed, as well as putative, allergens. The comparison was performed using the FASTA algorithm, version 3.4t11 [10]. Parameter settings for FASTA were as follows: BLOSUM 50 matrix, gap penalty = 12, gap extension penalty = 2, Z = 2,000, and z = 1. A minimum 30% shared identity and a sequence overlap length of 40 amino acids were used as display limits for the output shown in tables; an upper threshold of an E-value of 10 was also used for display of alignments. This combination of display limits was set below the Codex Alimentarius (2009) guideline values of 35% and 80 amino acids to ensure alignments would be displayed above and below the Codex threshold values.
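The display limits can be expressed as a simple filter over alignment records. The sketch below assumes the three limits were applied together (logical AND), which the text implies but does not state outright; the record type, field names, and all hit values are invented for illustration and are not FASTA output from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    subject: str          # name of the aligned database sequence
    identity_pct: float   # percent identity over the aligned overlap
    overlap: int          # overlap length in amino acids
    evalue: float         # FASTA expectation value


def passes_display_limits(a: Alignment,
                          min_identity: float = 30.0,
                          min_overlap: int = 40,
                          max_evalue: float = 10.0) -> bool:
    """Return True if the alignment meets all three display limits."""
    return (a.identity_pct >= min_identity
            and a.overlap >= min_overlap
            and a.evalue <= max_evalue)


# Illustrative records only.
hits = [
    Alignment("hit_a", 62.5, 310, 5.3e-39),
    Alignment("hit_b", 28.0, 120, 2.0e-4),   # identity below 30%
    Alignment("hit_c", 33.0, 35, 0.5),       # overlap below 40
    Alignment("hit_d", 31.0, 45, 12.0),      # E-value above 10
    Alignment("hit_e", 32.0, 60, 1.0e-3),    # shown, although below Codex values
]
displayed = [a.subject for a in hits if passes_display_limits(a)]
```

Because the limits sit below the Codex values of 35% and 80 amino acids, a record such as `hit_e` is still displayed, which is what allows alignments on both sides of the Codex thresholds to be compared.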

Results
The alignments produced between each epitope-containing sequence and allergens were evaluated to identify homologous and possible cross-reactive allergens. Four sequences were used: peanut Ara h 1 and Ara h 2, shrimp tropomyosin Pen a 1, and the birch tree pollen allergen, Bet v 1. They have varying levels of taxonomic conservation at the epitope level and across the entire sequence that were expected to affect alignment metrics [23,24]. Tropomyosin, for example, has well-recognized conservation across many species that was expected to allow identification of all known tropomyosin allergens. In order to perform the analysis in this study, each allergen had representative, cross-reactive epitopes inserted into random, hypothetical sequence. FASTA was used to compare the hypothetical "full-length" protein to the allergen database in order to observe the primary alignment to any database sequences below the expectation cut-off of 10 (E-value = 10). Relevant alignments were judged by whether or not aligned database sequences matched the allergen (or group of allergens) from which the epitopes were derived, and by the E-value at which non-homologous alignments were observed. It should be noted that two or more significantly aligning sequences do not imply known cross-reactivity in every case. Although highly cross-reactive groups of allergens have been examined, serology work to confirm cross-reactivity has not been performed for many members of the respective allergen homologues discussed herein.

Figure 1. (B) A random selection of amino acids was loaded with the epitopes of peanut Ara h 2 (AH2-1, AH2-3a, and AH2-3c); total length = 172 aa. A contiguous epitope region covering 66 amino acids is identified by bold and highlighted lettering; underlined epitopes are, in order from the N- to C-terminal end, AH2-1, AH2-3a, and AH2-3c. (C) A random selection of amino acids was loaded with shrimp Pen a 1 epitopes; total length = 284 aa. Epitopes are identified by bold and highlighted lettering. (D) A random selection of amino acids was loaded with the region of sequence from Bet v 1 containing the discontinuous epitope residues; total length = 160 aa. The epitope region is identified by bold and highlighted lettering; underlined letters are the Bet v 1 epitope residues.
In the first exercise, the 10-mer peptides that were identified by the Bannon laboratory [18,19] were examined for their impact on alignments between allergens and a hypothetical 626 amino acid sequence. FASTA analysis showed that the overlap with the parental Ara h 1 protein was between 305 and 328 amino acids in length, with the best alignment producing an E-value of 5.3 × 10⁻³⁹ (Table 1). Only Ara h 1 and two beta-conglycinin proteins from soybean showed alignments; both are vicilin-like 7S globulins [25,26], and the least significant alignment had an E-value of 2.1 × 10⁻¹¹. Alignments indicated little variance among these homologous proteins, as there were no unexpected allergens from other species identified. The epitopes supported very specific identification of homology even though each epitope was only ten amino acids long (except for the case of the overlapping region). The epitopes in total represented 19% of the overall hypothetical sequence.
The Pen a 1 allergen is the tropomyosin protein from shrimp with a length of 284 amino acids (FARRP 2015), and it represents a highly conserved protein with less variability across species than Ara h 1. The percentage of epitope residues relative to the overall sequence length was 32%, the most of any of the hypothetical proteins. FASTA results showed a total of 72 alignments, with all of these far below an E-value of 0.1 (Table 1). The most significant alignment was nearly identical in its E-value to several other tropomyosins from related species (35 alignments had E-values between 1.3 × 10⁻²⁶ and 8.1 × 10⁻²²), indicating close homology within this structurally related group. The source organism of the Pen a 1 sequence is the shrimp species, Farfantepenaeus aztecus, and it displayed the second most significant alignment (E-value 1.5 × 10⁻²⁶).
The analysis of Bet v 1 was focused on exploring whether a discontinuous epitope [22] would retain the capacity to represent the protein in general; that is, could the epitope alone flag Bet v 1 (and homologues) when inserted into random sequence. Sixteen epitope residues (Fig. 1D) were considered a challenge in the use of FASTA due to the lower concentration of contiguous residues as well as the lower total number of residues relative to the random sequence length (10%). The hypothesis was that algorithms such as FASTA may be limited in identifying either the parental sequence or homologues with so few epitope residues scattered throughout the sequence. As shown in Table 1, the Bet v 1 allergen was clearly observed (as the most significant alignment), as were numerous other members of this large, cross-reactive allergen family (Tables 2 and 3). In all, there were 106 alignments (alignment data not shown), with only two alignments producing an E-value greater than the cut-off of 0.1 used for Table 1. An E-value range of 0.001-0.01 is recommended as an upper threshold below which alignments likely start to have significance [11]. Most of the other species with Bet v 1 homologues were represented, including the genera Castanea, Corylus, Malus, Quercus, and Carpinus, and the E-values ranged between 6.8 × 10⁻³ and 8.0 × 10⁻¹, a very narrow range.
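Because E-values are usually compared on a log scale, the width of a reported range is easiest to judge in orders of magnitude. The short sketch below does that arithmetic for the epitope-only Bet v 1 range given above and, for contrast, for the 35 closest Pen a 1 alignments reported earlier; the helper name is an assumption introduced for the example.

```python
import math


def log10_span(low: float, high: float) -> float:
    """Width of an E-value range in orders of magnitude (log10 units)."""
    if low <= 0 or high <= 0:
        raise ValueError("E-values must be positive to take a logarithm")
    return math.log10(high) - math.log10(low)


# Epitope-only Bet v 1 homologues: 6.8e-3 to 8.0e-1 (values from the text).
bet_v_1_span = log10_span(6.8e-3, 8.0e-1)

# The 35 closest Pen a 1 alignments: 1.3e-26 to 8.1e-22 (values from the text).
pen_a_1_span = log10_span(1.3e-26, 8.1e-22)

print(f"Bet v 1 epitope-only span: {bet_v_1_span:.2f} log10 units")
print(f"Pen a 1 closest-homologue span: {pen_a_1_span:.2f} log10 units")
```

The Bet v 1 epitope-only homologues fall within roughly two orders of magnitude, all close to the 0.001-0.01 region where significance begins, which helps explain why parental and homologous sequences are hard to separate in that exercise.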
In an attempt to determine how the Bet v 1 epitope impacted the FASTA E-value range, the entire native protein sequence encompassing the Bet v 1 epitope residues (56 amino acids; Fig. 1D) was also loaded into random sequence, similar in construction to the other hypothetical proteins, and compared against the allergen database. The expectation was that homologues of Bet v 1 would be more easily distinguished due to the much greater proportion of the allergen represented in the hypothetical sequence. The sequence consisted of 35% Bet v 1 sequence and produced a total alignment count (169) that was greater than that obtained using just the epitope residues. As before, there were only two alignments above an E-value of 0.1. The main difference across all the alignments was that most of the isoforms of Bet v 1 represented in the allergen database were observed (63), and many more representative isoforms from Carpinus and Daucus, for example, populated the alignment list (Table 2). The main impact on the alignment metrics was a much more significant E-value maximum (10⁻²² vs. 10⁻³) when the larger 56 amino acid region was used. This is an expectation with FASTA when a large localized portion of the sequence is an exact match [10]. The cut-off between homologues and nonhomologous sequences was also more in line with an established threshold for evaluating allergens, an E-value of 3.9 × 10⁻⁷ [27], when the larger 56 amino acid Bet v 1 section was used (Table 2).
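The two proportions quoted for the Bet v 1 hypothetical sequences follow directly from the residue counts given in the text, as the small check below shows. The function name is hypothetical; the counts (16 epitope residues, a 56-residue region, a 160-residue hypothetical protein) are the ones reported above.

```python
def native_fraction(native_residues: int, total_length: int) -> float:
    """Fraction of a hypothetical sequence made up of native allergen residues."""
    if not 0 <= native_residues <= total_length:
        raise ValueError("native residue count must lie within the sequence length")
    return native_residues / total_length


epitope_only = native_fraction(16, 160)    # discontinuous epitope residues only
epitope_region = native_fraction(56, 160)  # full region spanning the epitope

print(f"epitope only:   {epitope_only:.0%}")
print(f"epitope region: {epitope_region:.0%}")
```

Going from the epitope residues alone to the full epitope region raises the native content from 10% to 35% of the query, which is the change associated with the shift in the best E-value from about 10⁻³ to about 10⁻²².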
One last example of epitope detection was examined based on the recent discovery that nonhomologous proteins may have cross-reactivity [20]. The synthetic peptides AH2-1, AH2-3a, and AH2-3c aligned with Ara h 2, and these peptides showed cross-reactivity by IgE-binding inhibition [20]. These same three peptides were loaded into random sequence to make a 172 amino acid sequence for comparison to the allergen database. The bioinformatic results show that Ara h 2 was clearly identified (Table 1); similarity with Ara h 1 and Ara h 3 was not identified, as only four proteins aligned in total and all of them were Ara h 2 isoforms. The additional hypothetical sequence containing a contiguous 66 amino acid section of the Ara h 2 protein (Fig. 1B) was more effective in identifying Ara h 1 and Ara h 3 (Supporting Information Table 2). The only other peanut allergen identified by this additional comparison and studied by Bublin et al. was Ara h 6. A primary reason a larger portion of the Ara h 2 protein was required to identify the nonhomologous, but cross-reactive, proteins is the likely presence of undiscovered epitopes that are shared among these proteins. In contrast, the disparate epitope locations across the three proteins appear to have limited the ability to clearly identify shared similarity that may exist for this unique example of nonhomologous cross-reactivity. For example, the epitope regions in Ara h 1 and Ara h 3 are located far away from the N-terminal location [20]. The Ara h 1 region begins at amino acid position 584, and the Ara h 3 region begins at amino acid 328, whereas Ara h 2 begins very near the N-terminal end at position 26 for the peptide AH2-1.

Discussion
The premise behind comparing novel protein sequences to allergens has always been based on either identifying a known allergen or identifying a risk of allergic cross-reactivity. Presumably, unexpected cross-reactivity with a novel protein, or newly discovered protein, could happen in one of three ways. First, a portion of a known allergen could be inadvertently used to construct a synthetic novel protein. Second, a novel protein is derived from a species not yet identified as an allergen source and the novel protein is unexpectedly similar to its homologous counterpart, a known allergen. The third, and most unlikely scenario, involves the unintentional modification of a protein that results in enough similarity to share IgE binding with an allergen. For all practical purposes, even heavily modified novel proteins have to retain their function and structure to the extent that they are highly similar to the native protein expressed in the source organism. This level of retained, native similarity to a known structural class of protein makes it unlikely that random modification would somehow create an unexpected, but immunologically relevant level of similarity with an allergen. Otherwise, it is straightforward to identify the taxonomic class of the source organism and the structural family and function of the modified novel protein.
Well-characterized allergen epitopes were used to examine the sensitivity of bioinformatics as a screening tool. The goal was to examine the level of sensitivity for detecting known cross-reactivity potential by focusing on epitope sequence isolated from the rest of the parental core sequence. The B-cell epitope was considered the minimum level of biologically relevant sequence that could identify the parental allergen or homologues for which cross-reactivity risk could be observed. The results demonstrated that FASTA retained its usefulness in identifying localized areas of sequence similarity even when the localized areas are small. This is an extension of the concept discussed by Bannon and Ogawa [4] and experimentally tested using maize allergen sequences [27], as well as with motif-centric experiments where shared sequence across proteins is predicted first, then tested for immune reactivity [28]. In this article, the Bet v 1 exercise was the best example of detecting homology at a very low proportion of native allergen sequence, with only 16 total residues representing a single discontinuous epitope (Fig. 1D). The 16 residues, although not contiguous within the hypothetical sequence, created enough of an exact match within the larger sequence to identify Bet v 1 (Table 1). This is because spacing along the length of the hypothetical sequence was the same as in endogenous Bet v 1 and thus, still allowed FASTA to identify this unique pattern of similarity; any other order would lower the similarity score. In this regard, the results hint at the importance of the underlying unique structure of the whole protein and the inherent spacing along the sequential sequence for identifying similarity between sequences with FASTA.
The summary statistic, the E-value, performed well in identifying both the parental sequence from which the epitope was derived as well as homologous proteins with known cross-reactivity. This was improved upon (lower E-value and more homologous proteins from related species) when the whole Bet v 1 region was loaded into the random sequence, or when there were many more residues, as there are for Ara h 1 and tropomyosin. In the case of tropomyosin, the epitopes are well conserved and unique to a degree that the E-value range remained virtually unchanged even when the epitopes were inserted into a much longer (700 amino acids) hypothetical sequence (data not shown). In contrast to only lengthening the random amino acid content, a reduction from six to four total epitopes (two replaced with random sequence) was used as a separate comparison. This shifted the E-value range in an increasing direction, with the lowest value moving from 3.4 × 10⁻²⁷ to 2.1 × 10⁻¹⁵ and the highest moving from 1.5 × 10⁻⁹ to 1 × 10⁻⁶, plus one additional alignment at E-value = 7.4. The reduction in epitopes was impactful, but with such a highly conserved protein there was no reduction in the ability to detect homologues.
The sensitivity in detecting similarity was much better using an E-value rather than other metrics such as percent identity, or a combination of percent identity and overlap length (Table 2). For example, the E-value was consistent in grouping known homologous Bet v 1 sequences. This is in contrast to the inconsistent designation by percent identity and overlap. Those alignments with E-values below 10⁻¹ and also having <35% identity and an overlap length <80 are shaded in gray in Table 2. An E-value of 10⁻¹ was considered as a minimum to survey for the accuracy of all alignments with regard to the percentage and overlap metrics. Clearly, members of the Bet v 1 family were identified by >35% identity and 80 or more amino acid overlap (Codex metrics), but these metrics were comparatively poor at identifying all of the homologous proteins. In the most extreme example, where only the Bet v 1 epitope residues were loaded into the hypothetical sequence, no alignments exceeding Codex metrics were observed below an E-value of 10. Interestingly, the threshold in E-value observed with proteins from carrot (Daucus carota) is in line with the threshold of 3.9 × 10⁻⁷ calculated by Silvanovich et al. [27], but ranges just past this cutoff (E-values 5.3 × 10⁻⁵ to 5.7 × 10⁻⁶). These values are very close to one another considering that the E-value is typically judged on a log scale and both would be considered statistically significant [29]. However, the alignment between the hypothetical Bet v 1 protein and Daucus displays percent identity and overlap length values just below the thresholds of 35 and 80, respectively; this is an example in which those metrics [12] do not identify an important alignment from a screening perspective (Fig. 2). This would be important to recognize with regard to cross-reactivity given the recent confirmation of IgE reactivity in Bet v 1 sensitized patients to the Dau c 1 protein [30]. A lack of vicilin (vicilin-like) homologues to the Ara h 1 hypothetical sequence was noted.
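The disagreement between the two kinds of metric can be made concrete by listing alignments that are significant by E-value yet fail the Codex criteria. The sketch below is illustrative only: the record type, the helper names, and every hit value are hypothetical, and it flags any alignment failing the combined Codex test, which is slightly broader than the gray shading rule as worded for Table 2 (both identity <35% and overlap <80).

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    identity_pct: float
    overlap: int
    evalue: float


def meets_codex(hit: Hit) -> bool:
    """Codex-style criterion: more than 35% identity over 80 or more residues."""
    return hit.identity_pct > 35.0 and hit.overlap >= 80


def missed_by_codex(hits: list[Hit], evalue_cutoff: float = 0.1) -> list[Hit]:
    """Alignments below the E-value cut-off that the Codex criterion would miss."""
    return [h for h in hits if h.evalue < evalue_cutoff and not meets_codex(h)]


# Made-up alignments; homologue_2 mimics a hit sitting just under both limits.
example_hits = [
    Hit("homologue_1", 48.0, 150, 1.0e-22),
    Hit("homologue_2", 34.0, 78, 5.7e-6),
    Hit("unrelated", 31.0, 42, 3.2),
]
flagged = [h.name for h in missed_by_codex(example_hits)]
```

A hit like `homologue_2` is the case of interest from a screening perspective: its E-value points to real similarity, but a percent identity and overlap screen alone would pass over it.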
A further evaluation noted the fact that some expected homologues, such as those from Lupinus and Pisum, were not present until the display limit was lowered to 25% identity. Yet, these same Lupinus and Pisum allergens had E-values ranging from approximately 1 × 10⁻¹⁰ to 1 × 10⁻¹⁶, further indicating that percent identity can be disconnected from the more relevant overall similarity identified by E-value.
The more simplistic Codex metrics also fail to discriminate the tropomyosin and Ara h 2 region of epitopes (Supporting Information Tables 1 and 2). Alignments with the Pen a 1 hypothetical protein, for example, all align with significant E-values, but the percent identity values decline to the point where percent identity and overlap length do not coincide with obvious homology among the tropomyosins. This is also highlighted by the Bet v 1 analysis, where the recently confirmed cross-reactivity with Vig r 6 and Vig r 1 [31] from Vigna radiata was noted by identification of these using an E-value, but would not have been noted using Codex metrics (Table 2). The impact of lower percentages of epitope residues, relative to the rest of the allergen, is highlighted even more clearly by the inability to identify homologues for the Bet v 1 epitope. Compared with using E-values, there were no alignments in the first 14 best results identified where either 35% identity or 80 amino acid lengths were observed (Table 3).
An E-value range of 1 × 10⁻⁵ to 1 × 10⁻⁶ has been identified that delineates significant allergen similarity, as modeled for the Ara h 1, Pen a 1, and Bet v 1 (region) epitopes. This E-value range was selected based on the observation that E-values lower than 1 × 10⁻⁵ were exclusively observed for the Pen a 1 and Ara h 1 hypothetical sequences (Table 1). In addition, the Bet v 1 epitope region-containing hypothetical sequence displayed a breakpoint between cross-reactive and other, nonhomologous sequences near this E-value (Table 2). When modeling shorter proteins (the Bet v 1 protein is only 160-161 amino acids) with very limited epitope sequence in a discontinuous structure, a higher threshold may be appropriate (E-value = 1 × 10⁻³) to identify similarity. Below some minimum level of sequence information, FASTA cannot consistently produce the same threshold E-value as alignments based on the analysis of full-length sequences. The combination of a short protein and a single, dispersed epitope accounts for this impact.
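One way to apply a threshold range rather than a single cut-off is to sort E-values into three bands. The sketch below is an interpretation, not a procedure from the article: the band labels and the function name are assumptions, while the range bounds and the example E-values are taken from the text.

```python
def classify_evalue(evalue: float,
                    strict: float = 1e-6,
                    lenient: float = 1e-5) -> str:
    """Place an E-value relative to the 1e-5 to 1e-6 threshold range.

    Labels are illustrative: "below range" means more significant than the
    strict bound, "above range" means less significant than the lenient bound.
    """
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue < strict:
        return "below range"
    if evalue <= lenient:
        return "within range"
    return "above range"


# E-values quoted in the article, classified against the proposed range.
examples = {
    "Ara h 1 best alignment": 5.3e-39,
    "Silvanovich et al. threshold": 3.9e-7,
    "Daucus, lower bound": 5.7e-6,
    "Daucus, upper bound": 5.3e-5,
    "Bet v 1 epitope-only homologue": 6.8e-3,
}
bands = {label: classify_evalue(e) for label, e in examples.items()}
```

Treating the middle band as a zone for closer inspection, rather than an automatic pass or fail, is consistent with the suggestion above that short proteins with a single dispersed epitope may call for a more lenient threshold.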
Modeling of nonhomologous Ara h 2 cross-reactive proteins [20] requires further clinical confirmation and further bioinformatic modeling in order to identify a meaningful E-value threshold. In addition, E-values based solely on Bet v 1 and a single discontinuous cross-reactive epitope seem too indistinct to be considered predictive at this time. It is likely that the epitopes for both Bet v 1 and the Ara h 1/Ara h 2/Ara h 3 complex are so uniquely distributed that they produce bioinformatic outcomes distinct from modeling observations of other allergens. Thus, a single bioinformatic threshold for similarity based on FASTA E-values is unlikely to hold true for other discontinuous epitopes or for nonhomologous proteins. This points to limitations in trying to fit a single threshold value across the many disparate groups of allergens. In the case of nonhomologous proteins, the epitopes may reflect cross-reactivity due to very subtly distinct secondary and tertiary protein structures that simply cannot be resolved in the experiment herein, where the epitopes have been removed from the contextual core sequence that would otherwise help identify homologous regions of proteins (i.e., more numerous local aligning sequences).
In combination with a well-curated allergen database, allergen sequence screening benefits from an observed reduction in false-positive alignments when using an E-value of 1 × 10−5–1 × 10−6. More important from a safety perspective, false negatives based on percent identity were avoided by using an E-value. A range, rather than a single value, is appropriate due to the unique nature of individual proteins and cross-reactive allergen groups, for which thresholds are expected to vary to some degree. Certainly, for full-length proteins, an E-value of 1 × 10−5 would be an effective threshold. Nevertheless, the overall conclusion is consistent with the basic tenet of allergen cross-reactivity that there is no known risk of cross-reactivity without homology to a known allergen. FASTA has the capability to serve this screening purpose given the appropriate application of the algorithm. The key is the use of alignment parameters that are based on modeling of known allergens (both epitopes and core structure) and that incorporate a relevant statistical significance criterion (i.e., E-value).
Taken together, epitopes do appear to weight a random amino acid string enough to identify significant similarity and potential cross-reactivity. In reality, allergens do not consist of just epitopes; they have core sequence structure that gives them their unique secondary and tertiary structure. Core sequence homology among related organisms is a key to the FASTA calculation of probable similarity because it was designed to help identify conserved domains [10]. In terms of using a dedicated allergen database, shared core sequence is the basis for relatedness among allergens [32,33]. When at least some of the core sequence is present (e.g., Bet v 1 epitope region) or when epitopes are relatively numerous, FASTA easily identified homologues and produced an E-value threshold that was indicative of statistically significant alignments according to FASTA expectations. The bioinformatic screening of sequences based on FASTA E-values would be consistent with the intent of the FASTA algorithm's statistical metrics for homology and an improvement over shared identity and overlap length. Examination of the details of all of the alignments within an allergen class is always advised once screening for homology has been performed.
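The construction of a hypothetical query protein, with epitopes embedded in random amino acid sequence, can be illustrated with a short Python sketch. This is not the procedure used to generate the sequences in this study: the uniform sampling of the 20 standard residues, the function name, and the placeholder peptides are all assumptions made for illustration, and the placeholders are not real allergen epitopes.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_hypothetical_protein(length, epitopes, seed=0):
    """Embed epitopes in a random background sequence.

    epitopes maps a 0-based start position to an epitope string.
    Background residues are drawn uniformly (an assumption).
    """
    rng = random.Random(seed)  # seeded so the sequence is reproducible
    residues = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    occupied_until = 0
    for start, epitope in sorted(epitopes.items()):
        end = start + len(epitope)
        if start < occupied_until or end > length:
            raise ValueError("epitopes overlap or fall outside the protein")
        residues[start:end] = list(epitope)  # overwrite background residues
        occupied_until = end
    return "".join(residues)


# Placeholder peptides only; these are NOT real allergen epitopes.
placeholder_epitopes = {10: "WWWWWWWW", 40: "CCCCCC"}
query = build_hypothetical_protein(80, placeholder_epitopes, seed=1)
print(query)
```

Placing each epitope at its original position preserves the spacing between epitopes, so the resulting string could then be written to a FASTA-format file and searched against an allergen database.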
The study herein complements previous work [27,34–36] intended to identify biologically relevant metrics for screening novel proteins. The first attempts at creating regulatory thresholds for novel protein screening were based on percent identity and sequence overlap length, and they have persisted until the present day [12,13]. These values were primarily based on Bet v 1 homology structure, but it was unclear how these would perform using curated allergen databases such as FARRP [16], which were not available at the time. There are more sophisticated informatics methods that have been adapted for specific antigens [37]. For deep analyses, however, it is unlikely that one method is able to completely capture the variability across all known homologues of a given structural class [38], much less all of the different classes in an allergen database. Yet, a local-alignment approach based on shared similarity and the use of an accepted general standard is a way forward since it underpins the very use of algorithms such as FASTA and the similar BLAST. In effect, by using the summary E-value statistic, the base percent identity and length of alignment are incorporated into an analysis of similarity between proteins. As presented, the concept of building an analysis of cross-reactivity among known allergens offers a "from the ground up" approach that can be extended. It has the potential to support modeling of the different allergen structural groups to identify meaningful thresholds for shared similarity, as it remains to be identified whether there is a single shared feature of proteins that makes them allergens. The goal is to support the growing knowledge base on how allergens are similar in structure, function, and their propensity to cross-react, in order to work toward a predictive approach that is both conservative and accurate.
The author would like to thank Andre Silvanovich for conceptual and technical review.
No conflict of interest. Employed solely by agricultural biotechnology company, Syngenta Crop Protection, LLC.