Using machine learning to study protein–protein interactions: From the uromodulin polymer to egg zona pellucida filaments

Neural network‐based models for protein structure prediction have recently reached near‐experimental accuracy and are fast becoming a powerful tool in the arsenal of biologists. As suggested by initial studies using RoseTTAFold or the ColabFold implementation of AlphaFold2, a particularly interesting future development will be the optimization of these computational methods to also routinely yield high‐confidence predictions of protein–protein interactions. Here I use AlphaFold2 and ColabFold to investigate the activation and polymerization of uromodulin (UMOD)/Tamm‐Horsfall protein, a zona pellucida (ZP) module‐containing protein whose precursor and filamentous structures have been previously determined experimentally by X‐ray crystallography and cryo‐EM, respectively. Despite having no knowledge of the UMOD polymer structure (coordinates for which were neither used for model training nor as template), AlphaFold2/ColabFold are able to recapitulate a crucial conformational change underlying UMOD polymerization, as well as the general organization of protein subunits within the resulting filament. This surprising result is achieved by simply deleting from the input sequence a stretch of residues that correspond to a polymerization‐inhibiting C‐terminal propeptide. By mimicking in silico the activating effect of propeptide dissociation triggered by site‐specific proteolysis of the protein precursor, this example has implications for the assembly of egg coat proteins and the many other molecules that also contain a ZP module. Most importantly, it shows the potential of exploiting machine learning not only to accurately predict the structures of individual proteins or complexes, but also to carry out computational experiments replicating specific molecular events.


1 | INTRODUCTION
From mollusk to human, the egg coat (called zona pellucida [ZP] in mammals and vitelline envelope [VE] in non-mammals) is a specialized extracellular matrix that plays key biological roles during oogenesis, fertilization and, in the case of mammals, preimplantation development (Killingbeck & Swanson, 2018; Litscher & Wassarman, 2020).
These functions are intrinsically linked to the architecture of the coat, which in turn depends on the assembly of filaments mediated by the "ZP domain," a bipartite polymerization module conserved in all egg coat subunits as well as many other extracellular proteins with highly different biological functions (Bork & Sander, 1992; Jovine et al., 2002, 2005). Structural studies of individual ZP/VE subunits revealed that the ZP module consists of two structurally related immunoglobulin-like domains, ZP-N and ZP-C, that are separated by an interdomain linker (Bokhove & Jovine, 2018). ZP module polymerization is activated upon cleavage-dependent dissociation of a C-terminal propeptide (CTP) that includes a polymerization-blocking external hydrophobic patch (EHP) constituting the last β-strand of ZP-C (Han et al., 2010; Jovine et al., 2004; Schaeffer et al., 2009). Recent X-ray and cryoelectron microscopy (EM) structures of the ZP module-containing urinary protein uromodulin (UMOD; also known as Tamm-Horsfall protein) revealed the dramatic conformational changes that are triggered by the dissociation of its CTP, as a result of site-specific cleavage by the transmembrane protease hepsin (Bokhove et al., 2016; Stanisich et al., 2020; Stsiapanava et al., 2020). However, the magnitude of the observed rearrangements raises the question of whether ZP module filament assembly may also involve polymerization chaperone(s), and it is unclear if equivalent conformational changes take place in the case of heteromeric egg coat filaments.
Here, I use AlphaFold2 (Jumper et al., 2021) and ColabFold (Mirdita et al., 2021) to investigate whether the recent advances in the application of machine learning to protein structure prediction (AlQuraishi, 2021) can provide insights into these kinds of questions, using ZP module activation as an example of the complex conformational changes that can take place during the assembly of polymeric proteins.
2 | RESULTS

2.1 | Modeling of the UMOD ZP module in its polymerization-inhibited state
As a necessary control, I first used AlphaFold2 to model the C-terminal half of UMOD, whose structure was previously determined by X-ray crystallography (Bokhove et al., 2016). This part of the protein, which encompasses its elastase/pronase-resistant fragment and is sufficient for polymerization (Bokhove et al., 2016; Jovine et al., 2002), consists of an epidermal growth factor domain (EGF IV) that is not involved in subunit/subunit interactions within the UMOD filament (Stsiapanava et al., 2020), the ZP module and, in the nonpolymeric precursor form of UMOD, the C-terminal EHP-containing propeptide (Figure 1a).
Although no experimental structures of UMOD were used as templates during modeling (see Section 4.1), and consistent with the corresponding prediction deposited at EMBL-EBI, the top five models produced by AlphaFold2 are in very good agreement with the crystallographic information on the protein; this is both at the level of ZP-N (average root mean square deviation [RMSD] over 106 Cα: 2.2 Å) and the region that includes the interdomain linker, ZP-C and the CTP (average RMSD over 180 Cα: 2.0 Å). Moreover, although the relative orientations of the ZP-N and ZP-C domains of the monomeric AlphaFold2 models are only approximately similar to those observed in the homodimeric crystal structure (average RMSD over 286 Cα for models 1-4: 6.8 Å), the interdomain linkers of all models closely adopt the experimentally observed pre-polymerization conformation, consisting of an α-helix (α1) and a β-strand (β1) (average RMSD over 24 Cα: 1.1 Å) (Figure 1b,c).
Together with Global Distance Test (GDT_TS) scores of 95.3 (ZP-N) and 88.9 (linker + ZP-C + EHP), these results indicate that AlphaFold2 can accurately model the two moieties of the UMOD ZP module. On the other hand, in agreement with the interchain variability observed in the crystal structure (Bokhove et al., 2016), the relative orientation of the ZP-N and ZP-C domains is less well defined.
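To make the two agreement metrics quoted above concrete, the following is a minimal sketch of how an RMSD and a simplified GDT_TS can be derived from matched Cα coordinates. It assumes that the model and reference have already been superimposed (for example in PyMOL) and that residues are paired one-to-one. The coordinates are hypothetical toy values, and the function names are illustrative. The GDT_TS shown here uses a single fixed superposition, whereas dedicated servers search for the best superposition at each cutoff, so this version can only underestimate the reported score.

```python
import math


def ca_distances(model, reference):
    """Per-residue Cα-Cα distances (Å) between two pre-superimposed,
    residue-matched coordinate lists."""
    if len(model) != len(reference):
        raise ValueError("coordinate lists must have the same length")
    return [math.dist(a, b) for a, b in zip(model, reference)]


def rmsd(distances):
    """Root mean square deviation over the supplied distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS: mean percentage of residues lying within each
    distance cutoff, evaluated for one fixed superposition only."""
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical toy coordinates (NOT taken from any UMOD structure).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

d = ca_distances(model, reference)
print(f"RMSD   = {rmsd(d):.2f} Å")
print(f"GDT_TS = {gdt_ts(d):.2f}")
```

In the toy data a single badly placed residue dominates the RMSD while changing GDT_TS far less, which is one reason the two metrics are usefully reported side by side.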

2.2 | Modeling of the polymerization-activated state of the UMOD ZP module
To investigate the state of the protein activated for polymerization, I then used AlphaFold2 to model a variant of the same region of UMOD that was C-terminally truncated at the hepsin cleavage site and thus lacked the EHP. This resulted in a significantly different set of relative ZP-N/ZP-C orientations, with the top three ranked models having an interdomain linker whose α1 region converted to a β-strand (α1β') that faces the internal hydrophobic patch of ZP-C (IHP; an EHP-like element that corresponds to β-strand A and is also involved in polymerization) and pairs with its β-strand F to replace the missing EHP (Figure 2a). Strikingly, this conformational change and intramolecular interaction closely mimic one of the key intermolecular interactions observed in the cryo-EM structure of the UMOD filament (Stsiapanava et al., 2020), so that the ZP-N and ZP-C domains of the activated ZP module can be readily superimposed onto the ZP-N domain of a UMOD subunit and the ZP-C domain of the preceding subunit within the filament (RMSD 1.2 Å over 229 Cα; Figure 2b). A reminiscent but different conformation is instead found in the fourth and fifth ranked models, where the ZP-N/ZP-C linker adopts an extended conformation that also contacts the same region of ZP-C; however, in this case the interaction is mediated by the C-terminal part of the linker, which pairs (in reverse orientation compared to models 1-3) with both βA" (another ZP-C strand involved in UMOD polymerization) and the beginning of βA/IHP (data not shown).
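The in silico activation described above amounts to a simple edit of the input sequence: everything after the cleavage site is removed before prediction. A minimal sketch of that step is given below. The sequence, the residue numbering and the cleavage position are hypothetical placeholders rather than the real UMOD values, and the helper names are illustrative only.

```python
def truncate_at_cleavage_site(sequence, last_residue, first_residue=1):
    """Return the part of `sequence` up to and including `last_residue`.

    `first_residue` is the full-length numbering of sequence[0], so that
    constructs which do not start at residue 1 are handled correctly.
    """
    end = last_residue - first_residue + 1
    if not 0 < end <= len(sequence):
        raise ValueError("cleavage site lies outside the supplied sequence")
    return sequence[:end]


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Hypothetical toy precursor: NOT the UMOD sequence.
precursor = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
precursor_start = 101  # hypothetical numbering of the first residue
cleavage_after = 125   # hypothetical last residue retained after cleavage

activated = truncate_at_cleavage_site(precursor, cleavage_after, precursor_start)
print(to_fasta("toy_precursor", precursor))
print(to_fasta("toy_activated", activated))
```

Keeping the numbering offset explicit avoids off-by-one errors when the modeled construct is only a fragment of the full-length protein, as is the case for the C-terminal half of UMOD used here.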

| Extension to egg coat protein filaments
Since ColabFold is able to produce models that recapitulate part of the main protein-protein interactions stabilizing the UMOD homopolymer, what kind of interactions does it suggest between the different subunits that make up egg coat filaments? To answer this question, I modeled complexes of ZP2 and ZP3, the two major subunits of the mouse ZP that are thought to form heterodimers repeating along the filaments (Litscher & Wassarman, 2020). As shown in Figure 4, the outcome of this prediction essentially mirrored what was observed in the case of UMOD by suggesting that the interdomain linkers of ZP2 and ZP3, which are largely disordered in crystal structures of the individual proteins (Bokhove et al., 2016; Han et al., 2010), also adopt a β-strand conformation that allows them to pair with the ZP-C and ZP-N domains of the adjacent subunits within the ZP filament.
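As an illustration of how such a complex prediction can be set up, the sketch below assembles a multi-chain query string. It assumes the convention of current ColabFold releases, in which the chains of a complex are joined with a colon in a single query; the notebook version used for this study may have taken its input differently, so the separator should be treated as an assumption. The sequences are hypothetical toy fragments, not the real ZP2, ZP3 or UMOD sequences.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(seq):
    """Strip whitespace, upper-case, and accept standard residues only."""
    seq = "".join(seq.split()).upper()
    unknown = set(seq) - AMINO_ACIDS
    if not seq or unknown:
        raise ValueError(f"empty sequence or non-standard residues: {sorted(unknown)}")
    return seq


def complex_query(chains):
    """Join chain sequences with ':' (assumed ColabFold chain separator)."""
    return ":".join(clean_sequence(s) for s in chains)


# Hypothetical toy fragments, NOT real protein sequences.
toy_subunit_a = "MKTAYIAKQR"
toy_subunit_b = "GSHMLEERLG"

heterodimer = complex_query([toy_subunit_a, toy_subunit_b])  # two different chains
homotrimer = complex_query([toy_subunit_a] * 3)              # three copies of one chain
print(heterodimer)
print(homotrimer)
```

The same helper covers both cases discussed in this work: repeating one sequence gives a homopolymer fragment of the UMOD kind, while listing different sequences gives a heteromeric complex of the ZP2/ZP3 kind.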

3 | DISCUSSION
The recent developments that culminated with the release of open-source code for both AlphaFold2 (Jumper et al., 2021) and RoseTTAFold (Baek et al., 2021) brought protein structure prediction to a level where it can both rival and facilitate experimental structure determination. This is highlighted by a growing number of reports that models produced by both systems can be successfully used to phase by molecular replacement native X-ray diffraction data for the corresponding proteins (Baek et al., 2021; Flower & Hurley, 2021; Millán et al., 2021; Pereira et al., 2021), as well as to fit maps obtained by cryo-EM (Baek et al., 2021; Gupta et al., 2021). The availability of a database of high-quality structure predictions for the proteomes of several major experimental systems, including human, is bound to significantly expand these and other applications in the near future. At the same time, it will no doubt inform functional studies of a plethora of biological problems for which experimental structural data is either limited or missing.
Against this background, a future development of major interest will be the extension of these computational approaches to the reliable prediction of protein-protein interactions. Indeed, even though both the AlphaFold2 and RoseTTAFold networks were originally developed to predict individual protein structures (and, thus, trained on monomeric proteins rather than complexes), RoseTTAFold was already shown to be able to successfully predict a number of known protein complexes (Baek et al., 2021). Moreover, a comparable functionality was added to the ColabFold implementation of AlphaFold2 (Mirdita et al., 2021).
In this study, I explored the possibility of using AlphaFold2 and ColabFold to gain insights into the polymerization mechanism of ZP module-containing proteins, a large family of extracellular molecules with highly variable architecture and biological functions (Bork & Sander, 1992; Jovine et al., 2005). In particular, I focused on UMOD, because it is the only member of the family for which experimental structural information is available for both precursor and polymeric states of the molecule (Bokhove et al., 2016; Stanisich et al., 2020; Stsiapanava et al., 2020). [...] (Baek et al., 2021), AlphaFold2 has potentialities that go well beyond the prediction of the structure of individual proteins (Jumper et al., 2021; Mirdita et al., 2021).

[Figure caption fragment] In the UMOD filament, the N-terminal half of the ZP-N/ZP-C linker of a subunit forms a long β-strand (α1β) which can also be described as two distinct strands, α1β' and α1β", that respectively pair with βF and βA" of the ZP-C domain of the previous subunit. Only the former interaction is recapitulated in the monomeric AlphaFold2 models, due to the fact that their interdomain linker must fold back to connect with the beginning of ZP-C.
[Figure caption fragment] [...] (Jumper et al., 2021) scale that ranges from 0 (red; minimum confidence) to 100 (blue; maximum confidence), respectively. The members of both sets of models are highly similar to each other and show interdomain linker interactions equivalent to those of uromodulin (UMOD). Each set of models 1-4 is also clearly separated from the respective fifth model (rank_5; shown in the top panel as a semi-transparent gray cartoon), which in both cases adopts a significantly different ZP-N/ZP-C conformation that is clearly scored as inferior by all the metrics reported in the bottom panel (Jumper et al., 2021; Mirdita et al., 2021). Note that a low inter predicted aligned error (PAE) between chains indicates a confident prediction; also, only approximately half of each interdomain linker interacts with the adjacent ZP-N/ZP-C domain from the other subunit, thus explaining the high structural variability/low pLDDT of the remaining half (which, in the simplified two-domain systems modeled by this prediction, lacks a binding partner).

The results reported in Figures 1c and 2 [...] Stsiapanava et al., 2020). This is striking, considering that, unlike in the case of the precursor state of the protein (whose experimental structure, PDB ID 4WRN, was excluded as a template during modeling [...] (Stsiapanava et al., 2020). Notably, such a state would be very difficult to detect and study experimentally, because UMOD expression constructs truncated at the hepsin cleavage site are retained in the endoplasmic reticulum and not secreted (Schaeffer et al., 2009).
Regardless of whether this particular form of UMOD truly exists, it is clear that machine learning-based structure modeling has the potential to unveil protein conformations that may be hard or even impossible to capture experimentally.
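As noted earlier in this section, a low predicted aligned error (PAE) between chains indicates a confident prediction of how the subunits of a complex are placed relative to each other. The sketch below shows one simple way to summarize this, assuming that the PAE is available as a square matrix of per-residue-pair values in Å and that the chain lengths are known. The matrix is a hypothetical toy example, and averaging the off-diagonal (inter-chain) blocks is one illustrative choice among several possible summaries.

```python
def mean_interchain_pae(pae, chain_lengths):
    """Average the PAE over all residue pairs that belong to different chains.

    `pae` is an N x N list of lists (Å); `chain_lengths` lists the number of
    residues per chain, in the order in which the chains were concatenated.
    """
    n = sum(chain_lengths)
    if len(pae) != n or any(len(row) != n for row in pae):
        raise ValueError("PAE matrix does not match the total chain length")
    # Map every residue index to the index of the chain that contains it.
    chain_of = [c for c, length in enumerate(chain_lengths) for _ in range(length)]
    values = [
        pae[i][j]
        for i in range(n)
        for j in range(n)
        if chain_of[i] != chain_of[j]
    ]
    if not values:
        raise ValueError("at least two chains are required")
    return sum(values) / len(values)


# Hypothetical toy PAE matrix for a 2-residue chain plus a 1-residue chain.
toy_pae = [
    [0.5, 1.0, 4.0],
    [1.0, 0.5, 6.0],
    [5.0, 7.0, 0.5],
]
print(mean_interchain_pae(toy_pae, [2, 1]))
```

Because the PAE matrix is not symmetric, both off-diagonal blocks are included in the average rather than only one of them.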
When analyzing the models of complexes consisting of multiple copies of the activated ZP module of UMOD, it is important to consider that the current implementation of AlphaFold2/ColabFold does not allow any kind of spatial restraint to be specified that may account for the aforementioned membrane-anchoring requirement for filament assembly. Despite this, ColabFold was able to generate models that not only include all the main interactions made by the ZP-N/ZP-C linker of polymeric UMOD (Figure 3a), but also recapitulate the basic subunit organization observed in the filament (Figure 3b). Clearly, these filament fragment models remain quite far from the corresponding experimental structure, as they miss additional contacts between subunits (in particular the one involving ZP-N βF' and ZP-C αEFβ; Figure 3b) that also contribute to the unique overall conformation of the UMOD polymer (Stsiapanava et al., 2020).
However, they do provide a valuable structural framework that, in the absence of an experimental structure, could either complement pre-existing biochemical and/or functional data, or be used to derive new hypotheses that could then be experimentally tested. Along these lines, the ZP2/ZP3 complex models shown in Figure 4 are, for example, entirely consistent with mass spectrometric and negative stain EM data suggesting that the basic architecture of VE/ZP filaments resembles that of UMOD (Stsiapanava et al., 2020). Together with further structural studies, these very different but com-

| Structure analysis and comparison
Structures were visualized, inspected, and superimposed using PyMOL (Schrödinger, LLC), which was also used to make all figures. GDT_TS scores were calculated using the AS2TS server (Zemla, 2003).