AlphaFold predicts novel human proteins with knots

Abstract The fact that proteins can have their chain formed in a knot has been known for almost 30 years. However, as they are not common, only a fraction of such proteins is available in the Protein Data Bank. It was not possible to assess their importance and versatility up until now because we did not have access to the whole proteome of an organism, let alone a human one. The arrival of efficient machine learning methods for protein structure prediction, such as AlphaFold and RoseTTaFold, changed that. We analyzed all proteins from the human proteome (over 20,000) determined with AlphaFold in search of knots and found them in less than 2% of the structures. Using a variety of methods, including homolog search, clustering, quality assessment, and visual inspection, we determined the nature of each of the knotted structures and classified it as either knotted, potentially knotted, or an artifact, and deposited all of them in a database available at: https://knotprot.cent.uw.edu.pl/alphafold. Overall, we found 51 credible knotted proteins (0.2% of the human proteome). The set of potentially knotted structures includes a new complex type of knot not yet reported in proteins. That knot type, denoted 6_3 in mathematical notation, would necessitate a more complex folding path than any knotted protein characterized to date.


| INTRODUCTION
The AlphaFold method (Jumper et al., 2021) has already led to high-quality predictions for thousands of protein structures from different genomes. AlphaFold's approach bypasses the need to understand the complex rules that determine the physical protein-folding process. An in silico approach can lead to the prediction of new folds and structures which are hard to determine using approaches such as crystallization, NMR, or cryo-EM. AlphaFold uses a per-residue confidence score (pLDDT) to estimate the quality of prediction. However, is the pLDDT score sufficient for quality assessment, especially in the case of long multi-domain proteins? Since it is known that the probability of a linear chain being knotted increases with its length, can folding rules be simply bypassed? Are these longer protein structures more complicated than we can currently model accurately?
The fact that proteins can have their chain formed into a knot has been known for almost 30 years (Mansfield, n.d.; Mansfield, 1997; Takusagawa & Kamitori, 1996). Topology, and more specifically knot theory, provides tools to recognize different types of knots (e.g., Alexander and HOMFLY-PT polynomials [Alexander, 1928; Freyd et al., 1985; Przytycki & Traczyk, 2016]). Although knot theory defines a knot as a "closed curve", we can still study knots in proteins by connecting their N and C ends to use knot theory methods (Millett et al., 2013). So far, five types of knots have been found in experimentally solved protein structures (besides the unknot, these are the 3_1, 4_1, 5_2, and 6_1 prime knots [Jamroz et al., 2015; Dabrowski-Tumanski et al., 2019] and the 3_1#3_1 complex knot [Bruno da Silva et al., 2023]). All of these prime knots are twist knots, which means that they can be easily tied by twisting a loop of an open chain, pulling one end of the chain through the loop, and connecting the ends. Does AlphaFold predict protein structures with other types of knots?
Herein, we show the advantages of considering the topology of a model generated by a protein structure prediction method. In most cases, the presence of knots in proteins is conserved within a family, thus the topology of the predicted model should be the same as its homologs. Otherwise, there is a strong probability that the model was incorrectly predicted.
Furthermore, the topological analysis of proteins, especially human ones, can lead to the understanding of how evolution coded such high levels of protein organization (Sulkowska, 2020). Recent studies have shown that some proteins form open knots in their native folded structure (Mallam & Jackson, 2005; Mansfield, 1994; Sulkowska, 2020). These protein knots resemble well-known rope knots (Sułkowska et al., 2012). In general, the percentage of known knotted proteins is much lower than would be expected in random polymers with a similar length, compactness, and flexibility (Virnau et al., 2006). Among known protein structures only the simplest knot types have been observed (Jamroz et al., 2015) and, furthermore, only ones which can be constructed by single threading (Taylor, 2000). Numerical simulation has shown that many proteins can self-tie (a Beccara et al., 2013; Li et al., 2012; Noel et al., 2013; Sułkowska et al., 2009), even a protein with the 6_1 knot (Bölinger et al., 2010), and that knotting can be assisted by a chaperone (Soler et al., 2016) or ribosome (Baiesi et al., 2019; Chwastyk & Cieplak, 2015; Dabrowski-Tumanski et al., 2018), especially in the case of deeply knotted proteins (Jamroz et al., 2015). In vitro investigation has additionally shown that some proteins can self-tie (Wang, Chen, & Hsu, 2015), although chaperones facilitate their folding (Andrews et al., 2013; Jackson et al., 2017; King et al., 2010; Mallam & Jackson, 2012; Wang, Liu, et al., 2015; Ziegler et al., 2016). Interestingly, although its biological role is not clear (Jackson et al., 2017; Sulkowska, 2020), knotting has been found in proteins from all branches of the tree of life and is conserved even among sequences with low similarity (Sułkowska et al., 2012). The presence of knots in proteins raises many fundamental questions, of which we list here just a few: (1) Is the small percentage of knotted proteins deposited in the PDB due to the specificity of the data? (2) Are there more complex knots which have yet to appear in the PDB? (3) What is the origin of knotted proteins? (4) Is topology strictly conserved? (5) What role do knots play in proteins?
Based on human protein structures predicted by AlphaFold, we have found answers to some of these questions. We have conducted a comprehensive review of all 23,391 structures predicted by AlphaFold for the human proteome. For each protein structure, we determined the dominant knot type (Jamroz et al., 2015) and found the location of the knot cores (i.e., minimal portions of the protein backbone that form a given knot type). As a result, we found 340 knotted structures that we then further analyzed. All these proteins are deposited in an online database to enable further in vivo and in silico investigation, available at: https://knotprot.cent.uw.edu.pl/alphafold. They are also available in the AlphaKnot database, along with knotted proteins of 20 other proteomes, at https://alphaknot.cent.uw.edu.pl (Niemyska et al., 2022). After careful evaluation, we concluded that over 75% of the models containing knots are artifacts. However, most of the previously known knotted human proteins were correctly predicted, which shows that AlphaFold is capable of modeling structures with complicated topology (see Table S1 for details about all proteins we classified as knotted). Compelling evidence for this is shown by a bacterial protein from Aquifex aeolicus, which AlphaFold models with correct knotted topology even though the single template available (TrmD protein, PDB ID: 1oy5, UniProtKB ID: O67463) is unknotted. On the other hand, this is not always the case, since there are examples of errors in AlphaFold models that are due to problems in the templates from the PDB (Tata et al., 2022).
All the structures that were not classified as artifacts were put through a more detailed evaluation that included a topology analysis of homologous proteins found in the AlphaFold database. Using sequence clustering, we grouped the proteins by similarity and checked whether the topology is conserved both within each cluster and within the whole family of homologs. As a result, we obtained two additional pieces of information: (1) further verification of a given knot's presence in the human protein; and (2) whether the knot is robustly present (conserved) in each family. Interestingly, these results also highlighted models that were of good quality but probably of the wrong topology. Since we obtained the models from an early version of the AlphaFold Protein Structure Database, we recalculated problematic structures with our locally installed version of AlphaFold 2. For example, the model of Testis anion transporter 1 protein (UniProtKB ID: Q96RN1) available in the AlphaFold Protein Structure Database v.1 has a 3_1 knot (with a high average pLDDT of 86.1 for the knot core), whereas 96% of its homologs are unknotted. After recalculation with AlphaFold 2, we obtained a more credible unknotted model of this protein. Finally, for additional verification of the knotted models, we used RoseTTaFold to predict their structures and compare topologies.
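The homolog comparison described above can be illustrated with a minimal sketch. It assumes that each homolog model has already been assigned a knot label (here "0_1" stands for the unknot); the function names, the 0.8 threshold, and the example list are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter


def topology_consensus(homolog_knots):
    """Return (most common knot label, its fraction) among homolog models."""
    if not homolog_knots:
        raise ValueError("no homologs supplied")
    counts = Counter(homolog_knots)
    knot, n = counts.most_common(1)[0]
    return knot, n / len(homolog_knots)


def disagrees_with_family(model_knot, homolog_knots, threshold=0.8):
    """Flag a model whose topology differs from a strong family consensus.

    The threshold is an illustrative assumption.
    """
    knot, fraction = topology_consensus(homolog_knots)
    return knot != model_knot and fraction >= threshold


# Hypothetical family: 24 unknotted homologs and one 3_1-knotted homolog.
homologs = ["0_1"] * 24 + ["3_1"]
print(topology_consensus(homologs))
print(disagrees_with_family("3_1", homologs))
```

A model flagged this way is a candidate for recalculation rather than proof of an error, since a family may genuinely contain topologically distinct members.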
Here, we established how many knotted proteins are predicted to be in a single proteome and found new families of knotted proteins. Recently, an article searching for the most complex knot types was published (Brems et al., 2022). It also analyzes AlphaFold-predicted structures (including human proteins), and its topology results agree with ours. However, that extensive survey focuses on the most complex knot types and does not report any new knotted proteins with fewer than five crossings, with chains longer than 600 aa, with shallow knots (tails shorter than 5 aa), or with low quality (pLDDT lower than 80) (Brems et al., 2022). In our study, we find that all newly identified knotted human proteins are 3_1 knotted.

| TYPES OF PROTEIN KNOTS FOUND
Thus far, four main types of knots have been observed in experimentally solved proteins: 3_1, 4_1, 5_2, and 6_1 (Jamroz et al., 2015). Here, we have found three types: 3_1, 5_2, and 6_3 (Table 1). The 6_3 knot has not been seen before in a protein. Note that the 4_1 and 6_1 knots, which were not detected in our study, were previously observed mostly in bacterial and plant proteomes (16 proteins with 4_1 and two with 6_1).

| New 3_1 knotted proteins
Overall, we found 51 credible (robustly) knotted structures. Within this group, there are proteins already known as knotted because either their structure was resolved experimentally, or they are part of an established knotted family (e.g., SPOUT clan of knotted methyltransferases). Interestingly, in some cases, we were able to find knots in proteins considered unknotted before due to their incomplete crystal structure (as in the Ion channel regulator).
The remaining proteins make up a group of newly identified knotted proteins (Table 2). All of them contain a subchain forming the 3_1 knot, which is overall the most common knot type found in proteins.

| Ecdysoneless protein
We found the human Ecd (ecdysoneless) protein (UniProtKB: O95905) to have been modeled by AlphaFold with a deep 3_1 knot. (The same topology is formed in the models from RoseTTaFold.) This 644 amino acid long structure has the knot positioned between amino acids 208 and 314, with an average pLDDT for this region of 93/100. The knot is conserved in the homologs of the protein, including the original Drosophila melanogaster Ecdysoneless protein (UniProtKB ID: Q9W032) from which the name was derived. None of the homologous proteins have structures resolved experimentally, including the whole SGT1 family (Pfam ID: PF07093) that they are part of. This makes the knot identification in this protein even more important, since SGT1 is a large family that contains over 2000 sequences. In human cells, Ecd was found to function as a regulator of the tumor suppressor p53 protein (Zhang et al., 2006). Also, the deficiency of Ecd lowers the fidelity of mRNA splicing (Erkelenz et al., 2021).

TABLE 1 Number of knotted structures with different knot types found within the human proteome based on AlphaFold prediction. (Column headers: Knot types, No.)

| Ion channel regulator
Calcium-activated chloride channel regulator 1 (CLCA1; UniProtKB ID: A8K7I4) is a 914 amino acid long protein with a 3_1 knot in its structure modeled by AlphaFold. Interestingly, this protein was resolved experimentally, although only one domain was crystallized (domain VWA, between residues 303-459). Also, this domain by itself is unknotted: the knot is present only in the full-length protein, since it is located at residues 76-311 (Figure 1b). (Note that when all subchains are considered, an internal knot [slipknot] can be detected. This motif will be called K3_1 3_1.) The knotted core is marked by AlphaFold as a high-confidence region (average pLDDT 92.4/100). Moreover, we found that paralogs of this protein (also predicted by AlphaFold: CLCA2 and CLCA4) and other homologs all have the same knot type. Similarly, the models calculated by RoseTTaFold also have a 3_1 knot. This shows that the prediction of a knot in this group of proteins is credible and that the knot itself is a conserved feature that is likely to be advantageous for the protein.
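An average pLDDT over a knot core, such as the values quoted above, can be computed directly from a model file, because AlphaFold PDB files store the per-residue pLDDT in the B-factor column. The following is a minimal sketch under that convention; the function name is hypothetical, only C-alpha atoms are used (one value per residue), and whether the authors averaged in exactly this way is an assumption.

```python
def mean_core_plddt(pdb_lines, core_start, core_end):
    """Average the B-factor column (pLDDT in AlphaFold models) over the
    C-alpha atoms of residues core_start..core_end (inclusive)."""
    values = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (1-based), hence the 0-based slices below.
        if line[12:16].strip() != "CA":
            continue
        residue_number = int(line[22:26])
        if core_start <= residue_number <= core_end:
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha atoms found in the requested range")
    return sum(values) / len(values)


# Usage with a locally downloaded model (the file name is a placeholder):
# with open("model.pdb") as handle:
#     print(mean_core_plddt(handle, 76, 311))
```

Averaging over the core alone, rather than over the whole chain, keeps a confident knot from being masked by disordered tails and a poorly predicted knot from being hidden by well-modeled domains elsewhere.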
CLCA proteins undergo self-cleavage (CLCA1 at position 695) to form the N- and C-terminal parts of the proteins. The N-terminal portion is responsible for the function of the protein, which is activating ion channels (Yurtsever et al., 2012). Based on our analysis we now know that this part is knotted, which was not previously observed. This information will shed new light on the studies of CLCA proteins. Moreover, this is a great example that emphasizes the importance of having full structure information available for any protein studies.

FIGURE 2 Von Willebrand factor A domain-containing protein 5A (UniProtKB ID: O00534) with a potential new type of protein knot, 6_3, shown with essential strands colored in red, blue, and purple. Upper left panel shows a simplified knotted core of the protein.

| Knot or not?
We placed many constraints on when a model should be described as knotted (see Methods section). The structures that met only some of them were labeled as potentially knotted, usually due to either a low level of confidence in crucial parts of the knotted core or the topology not being conserved within the protein family. Those structures require further validation. In fact, this group forms a ready-to-use list of the most important protein targets for finding new knotted motifs, and thus can be used to significantly broaden the knowledge of entanglement in proteins (Table 2).

| Potentially knotted: 6_3 knot in BCSC-1
We found a new complex type of knot in the von Willebrand factor A domain-containing protein 5A (BCSC-1, breast cancer suppressor candidate 1, Figure 2). The 6_3 knot is located in a high-confidence region (average per-residue confidence score [pLDDT] is 88.6) between amino acids 45 and 625. It covers most of the protein, which is 786 residues long. However, because it has quite long tails (which consist of 45 and 161 amino acids, respectively), it is considered to be located rather deep within the model. This suggests that the knot will be stable in this structure and will not be untied spontaneously by thermal fluctuations of the protein.
Even though the model presents itself as accurate based on pLDDT, its 6_3 knot type is not conserved in other species. AlphaFold predicts both 6_3 and 3_1 knots within the BCSC-1 group of homologs (Table 2). Moreover, the model generated with RoseTTaFold has a 3_1 knot. Finally, the models generated by ESMFold also have both 6_3 and 3_1 knots. Therefore, further experimental examination is needed to verify the existence of this complex knot, especially since this is not a twist knot (i.e., it is not the result of twisting a loop followed by a single threading [Taylor, 2000]). To tie such a knot, the protein chain must cross the energy barrier at least twice during folding as it is pulled through twisted loops. None of the experimentally resolved structures in the PDB possess this type of knot. Right now, such knots are found only in predicted structures: here we report a 6_3 knot, and 5_1 and 7_1 knots were recently found (Brems et al., 2022).
The BCSC-1 protein is interesting not only due to its complex topology, but also because of its crucial function: there are studies showing that it is involved in cancer development and can act as its suppressor (Di et al., 2018).

| Potentially knotted: integrins alpha
Often experimentally resolved structures have missing fragments that are not structurally organized. Such regions can be crucial for forming a knot since the way they will be modeled into the structure can either make a knot or not. The integrins alpha are a great example that shows the importance of this aspect, since within their models predicted by AlphaFold we found both knotted and unknotted structures.
We detected the 3_1 knot in seven out of 18 human integrins alpha (Table 2). In each of them, the knot is located in the C-terminal part of the protein (in the domain called Calf2; Figure 1C). Based on the AlphaFold predicted structures, it appears that whether an integrin will be knotted is determined by a single loop. Only integrins with longer loops are knotted (green region in Figure 1D). Unfortunately, this loop has a low confidence (pLDDT) score in most of the structures, making its location uncertain. Similarly, the topology of the integrins' homologs is also mixed, making topology difficult to assess. Therefore, we cannot verify whether the knot is present in these proteins; experimental evidence is needed. However, if only some of the integrins are indeed knotted, integrins would provide a second example of a family containing topologically distinct proteins, the first being aspartate/ornithine carbamoyltransferases (Sułkowska et al., 2008).

| CONSERVED KNOTTED PROTEINS
Table 3 shows proteins for which an entangled structure of a distinct homolog is known, and which are thus expected to be knotted. Most of them belong to four families (UCH, SPOUT, sodium/calcium exchanger, carbonic anhydrase-related). First, let us discuss transmembrane proteins, for which we found the knot conservation most interesting. All eight transmembrane proteins belong to only two families, SLC24 and SLC8 (sodium/calcium exchangers). They all share a 9-helix inverted repeat motif that forms a transmembrane channel, whose active 6-helix part, responsible for ion transport, forms a 3_1 knot. SLC24 and SLC8 are distant homologs of the only known family of knotted membrane proteins (represented by CAX, NCX) (Jarmolinska et al., 2019). These proteins belong to a huge membrane family, PF01699, represented by more than 47,000 sequences. In this context, one is tempted to speculate whether all members possess nontrivial topology. Nevertheless, the 3_1 knot is conserved between SLC24, SLC8, and CAX, NCX proteins, which share less than 30% similarity. Evolutionary conservation of a knotted topology could, in this case, imply its beneficial role, as has been suggested for slipknotted membrane proteins (which also belong to the alpha-helical polytopic transmembrane class, based on the OPM classification; Zayats et al., 2021). Therein, it was suggested that the channel may be held together more tightly when its α-helices are strapped together by a slipknot loop embracing several of the helices (Sułkowska et al., 2012).
Second, ubiquitin carboxyl-terminal hydrolase BAP1 (UCH-L2) with a 5_2 knot is a distant homolog of other UCH proteins. All homologs (UCHL1, UCHL3, UCHL5, UBL1-YEAST) with known structures possess the same type of topology (including internal 3_1 knots), even though their sequence similarity is very low, around 23% (Sułkowska et al., 2012). BAP1 is known to play a role in cancer, functioning both as a tumor suppressor and as a metastasis suppressor.
Third, there are plenty of carbonic anhydrase-related proteins already deposited in the PDB, and they are all found to possess a 3_1 knot when the full sequence is determined. Finally, the proteins with UniProtKB IDs Q6PF06, Q6IN84, and Q9UJK0 are members of the SPOUT family (Tkaczuk et al., 2007), which is known to contain proteins with a conserved 3_1 knot embedded in the active site (Christian et al., 2016).

| Artifacts
We found more than 75% of the knotted human structures from AlphaFold to be artifacts. In such structures, the knot is often predicted in low-confidence regions (pLDDT < 50), which means its position and presence are not reliable. Moreover, about 1/3 of the structures with artifact knots represent large proteins (>2700 aa) that were predicted using multiple overlapping models. In fact, more than 50% of the artifact structures, versus only 17% of the credibly knotted structures, have more than 1000 residues. This behavior is expected, since the difficulty of predicting a structure (including its topology) increases with the length of the protein chain. Thus, topology can be used as a tool to assess the quality of experimental or simulated data. Some examples of interesting artifacts are described on the website.
TABLE 3 Proteins expected to be entangled. Here are the proteins that do not have a resolved structure but are part of a known family of knotted proteins, and thus have a high probability of being correctly predicted as knotted. Close homologs are represented by proteins with at least 60% sequence identity.

| CONCLUSION
While knots are relatively rare in proteins, their conservation may suggest a functional utility. Many knotted proteins are enzymes with a knot located in the active site. However, as they are not common, only a fraction of such proteins is available in the PDB. It was not possible to assess their importance and versatility up until now because we did not have access to the whole proteome of an organism, let alone a human one. The arrival of efficient machine learning methods for protein structure prediction, such as AlphaFold, RoseTTaFold, and ESMFold, changed that.
Herein, we analyzed all proteins from the human proteome (over 20,000) determined with AlphaFold in search of knots and found them in less than 2% of the structures. Information about their topology, knot position, and quality (pLDDT score) is available in a database at: https://knotprot.cent.uw.edu.pl/alphafold. Importantly, after detailed analysis we found that the majority of the knotted proteins are artifacts and only 51 structures (0.2% of the proteome) are credibly knotted.
The set of credibly knotted proteins includes new knotted families and new types of knots (Table 2). The set of potentially knotted structures includes a new complex knot type in proteins. That knot type, denoted 6_3 in mathematical notation, would necessitate a more complex folding path than any knotted protein characterized to date.
To sum up, we select proteins with the best chance of being entangled, and which can be further analyzed without immediate structure determination. However, it would be beneficial to verify experimentally the structures of several proteins, including those of uncertain topology (Table 2), like integrins alpha. Also, our work emphasizes the importance of validation of the predicted models, as pLDDT in some cases is not sufficient to distinguish between correct and wrong models, in particular in terms of their topology.

| MATERIALS AND METHODS
The human proteome was downloaded from the AlphaFold Protein Structure Database (Jumper et al., 2021). It consisted of 23,391 structures (20,504 proteins). All the models were analyzed by computing the HOMFLY-PT polynomial for 200 random closures. The details of the method are explained in Jamroz et al. (2015). We determined the position of the knot core in the structure based on the so-called matrix model (Jamroz et al., 2015). Homologs for each protein were obtained with GGSearch from EMBL-EBI (Madeira et al., 2022) using an E-value cutoff of 10^-10 and searching within the AlphaFold Protein Structure Database. For the analysis of the conservation of topology in protein families, we clustered the proteins using CD-HIT with a 60% sequence identity cutoff (Huang et al., 2010). Proteins with very low pLDDT were remodeled by our group using AlphaFold (version 2.1.1) (Jumper et al., 2021), with the program installed locally. For the RoseTTaFold prediction, we used software available online (Baek et al., 2021).
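The idea of assigning a dominant knot type from many random closures can be sketched as follows. This is an illustration only: it uses one common stochastic closure (joining both termini to a single random point on a large sphere), which may differ from the closure scheme in the cited method, and it takes the knot classifier as a caller-supplied function (`classify_closed_curve`, a hypothetical placeholder for a HOMFLY-PT based routine that is not implemented here).

```python
import math
import random
from collections import Counter


def random_unit_vector(rng):
    """Draw a direction uniformly on the unit sphere."""
    while True:
        v = (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return tuple(c / norm for c in v)


def close_chain(coords, rng, radius_factor=10.0):
    """Close an open chain by appending one random far-away point.

    The closed polygon is coords + [far_point], with the last vertex
    implicitly joined back to the first. radius_factor is an assumption.
    """
    n = len(coords)
    center = tuple(sum(p[i] for p in coords) / n for i in range(3))
    radius = radius_factor * max(math.dist(p, center) for p in coords)
    u = random_unit_vector(rng)
    far_point = tuple(center[i] + radius * u[i] for i in range(3))
    return list(coords) + [far_point]


def dominant_knot(coords, classify_closed_curve, n_closures=200, seed=0):
    """Return (most frequent knot label, its frequency) over random closures."""
    rng = random.Random(seed)
    votes = Counter(
        classify_closed_curve(close_chain(coords, rng))
        for _ in range(n_closures)
    )
    knot, count = votes.most_common(1)[0]
    return knot, count / n_closures
```

Voting over many closures matters because a single closure can add or remove crossings depending on its direction, especially when a terminus lies close to the knot core.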
A structure with a knot is classified as credible when random closures form a nontrivial knot type more frequently than a trivial knot, the pLDDT score for the knot core is above 50, there are no clashes in the knot core region (checked with MolProbity tool [Williams et al., 2018]), the same knot type is found in more than 80% of the protein's homologs and in a model generated by RoseTTaFold, and no suspicious geometry was found after visual inspection. The structures which did not pass the visual inspection were classified as potentially knotted. The structures with obvious problems are called artifacts.
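The criteria above can be read as a decision rule, sketched below. The field names and the `KnotEvidence` container are hypothetical. The split between "artifact" and "potentially knotted" is an assumption made for illustration (failed closure majority, core pLDDT, or clash checks are treated as obvious problems), since the text does not give an exact rule for that boundary.

```python
from dataclasses import dataclass


@dataclass
class KnotEvidence:
    knot_closure_fraction: float      # closures giving the nontrivial knot
    unknot_closure_fraction: float    # closures giving the trivial knot
    core_plddt: float                 # average pLDDT over the knot core
    has_core_clashes: bool            # e.g., from a MolProbity check
    homolog_same_knot_fraction: float # homologs sharing the knot type
    rosettafold_same_knot: bool       # RoseTTaFold model has the same knot
    passes_visual_inspection: bool


def classify(evidence):
    """Classify a knotted model as 'knotted', 'potentially knotted', or 'artifact'."""
    # Assumed "obvious problems": these failures make the knot itself unreliable.
    if (
        evidence.knot_closure_fraction <= evidence.unknot_closure_fraction
        or evidence.core_plddt <= 50
        or evidence.has_core_clashes
    ):
        return "artifact"
    if (
        evidence.homolog_same_knot_fraction > 0.8
        and evidence.rosettafold_same_knot
        and evidence.passes_visual_inspection
    ):
        return "knotted"
    return "potentially knotted"
```

Keeping every criterion as an explicit field makes it easy to see which check moved a structure out of the credible set.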