Enabling cryo‐EM density interpretation from yeast native cell extracts by proteomics data and AlphaFold structures

In the cellular context, proteins participate in communities to perform their function. The detection and identification of these communities as well as in‐community interactions have long been the subject of investigation, mainly through proteomics analysis with mass spectrometry. With the advent of cryogenic electron microscopy and the “resolution revolution,” their visualization has recently been made possible, even in complex, native samples. The advances in both fields have resulted in the generation of large amounts of data, whose analysis requires advanced computation, often employing machine learning approaches to reach the desired outcome. In this work, we first performed a robust proteomics analysis of mass spectrometry (MS) data derived from a yeast native cell extract and used this information to identify protein communities and inter‐protein interactions. Cryo‐EM analysis of the cell extract provided a reconstruction of a biomolecule at medium resolution (∼8 Å; FSC = 0.143). Utilizing MS‐derived proteomics data and systematic fitting of AlphaFold‐predicted atomic models, this density was assigned to the 2.6 MDa complex of yeast fatty acid synthase. Our proposed workflow identifies protein complexes in native cell extracts from Saccharomyces cerevisiae by combining proteomics, cryo‐EM, and AI‐guided protein structure prediction.


INTRODUCTION
The cell displays an astounding heterogeneity, harboring diverse biomolecules at a wide range of concentrations. Proteins represent the largest group of cellular components, at 20%-30% (w/v), or 200-300 g/L [1]. With state-of-the-art technology, protein structure analysis from cells is only feasible at low resolution by cross-linking mass spectrometry (XL-MS) [2][3][4] or at higher resolution by in situ cryo-electron tomography (cryo-ET) [5]. However, these methods mainly analyze very abundant complexes such as ribosomes [6] or the nuclear pore complex [7]. To reduce complexity while enriching for less abundant biomolecules, the cell must be lysed and fractionated, using, for example, centrifugation to separate membranes and aggregates from soluble cell content. Such coarse fractionation may then be followed by chromatography, for example, size exclusion chromatography (SEC), to separate biomolecules by size while retaining their native assemblies. Efficient structure determination from such extracts has been demonstrated by the structural analysis of the MDa-sized pyruvate dehydrogenase complex from the thermophilic fungus Chaetomium thermophilum [8].
The coupling of SEC fractionation with mass spectrometry (MS), termed co-fractionation mass spectrometry (CF/MS) [9], provides detailed insights into higher-order protein complexes. This is because co-elution profiles of proteins might contain information about stable as well as transient interactions. Recently, CF/MS data analysis has been enhanced by artificial intelligence (AI) [10] and has led to the identification of known and unknown protein complexes from red blood cells [11]. These protein complexes often assemble in larger functional units, termed protein communities [12,13].
Functional units of metabolic complexes in particular, referred to as metabolons [14], show interaction with various binders, including scaffold proteins, membrane patches, nucleic acids, and others [15,16].
Currently, only cryo-EM of fractionated cell extracts is able to visualize such complexity at relatively high resolution [8,15,17]. Usually, classical cryo-EM single-particle analysis (SPA) is applied to the images acquired from heterogeneous fractions. SPA comprises the following steps: (a) particle picking from acquired micrographs, (b) 2D classification of extracted single particles, and (c) 3D reconstruction of a Coulomb potential map that can often reach atomic resolution in the case of purified and stable specimens [18]. The majority of these steps already employ AI to learn, identify, and reconstruct these structures and are implemented in specialized user-friendly toolkits, for example, RELION [19], Xmipp/Scipion [20], or cryoSPARC [21], which are either freely available (RELION, Scipion) or accessible to academic users (cryoSPARC).
SPA of complex native cell extracts has several technical limitations compared to SPA of purified proteins. For near-atomic resolution, a small pixel size (e.g., 1.6 Å/px for ∼3 Å at Nyquist frequency) is required during data collection [17], but for cell extracts a larger pixel size is preferable. This choice limits resolution but increases the total number of particles per micrograph, which is favorable for low-abundance macromolecular complexes. One, if not the major, prerequisite for the reconstruction of a high-resolution map from cryo-EM data is a sufficiently large number of single particles [22]. If the particle shape is recognizable, either directly in the micrograph to be picked manually or during the 2D classification of, for example, blob-picked particles (unbiased circular picking based on contrast), the analysis can be streamlined accordingly [23]. Statistical occurrence of a particle might be correlated with the MS-derived protein abundance, guiding the identification as well as the 3D reconstruction of the selected protein complex [15]. Additionally, technological advances in cryo-EM [18] ensure that a low-abundance protein will be present in high enough copy numbers to allow for atomic-resolution reconstruction, provided that the target particle signature can be identified. However, megadalton cryo-EM maps from flexible macromolecules derived from native cell extracts are often of medium resolution and are therefore hard to interpret in the context of molecular models [24].
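The pixel-size trade-off described above follows from the Nyquist limit, which places the finest recoverable detail at twice the pixel size. The following minimal sketch only illustrates this arithmetic; the function name is hypothetical and not part of the published workflow.

```python
def nyquist_resolution(pixel_size_angstrom: float) -> float:
    """Return the Nyquist-limited resolution (in Å) for a pixel size in Å/px."""
    if pixel_size_angstrom <= 0:
        raise ValueError("pixel size must be positive")
    # Two pixels are needed to sample one period of the finest detail.
    return 2.0 * pixel_size_angstrom


# 1.6 Å/px is the example quoted in the text for near-atomic work;
# 3.177 Å/px is the pixel size of the cell-extract micrographs in this study.
for px in (1.6, 3.177):
    print(f"{px} Å/px -> Nyquist limit {nyquist_resolution(px):.2f} Å")
```

Under this relation, the coarser sampling used for the cell extract caps the attainable resolution at roughly 6.4 Å, which is compatible with the medium-resolution map reported here.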
In this work, we present an automated workflow that incorporates MS-based protein identification and database knowledge integration via computational analysis to visualize protein communities. These proteomics data enable structural identification and model building of a cryo-EM map derived from native cell extracts utilizing AlphaFold2-predicted monomeric protein structures. The derived low-resolution structure of the yeast fatty acid synthase (FAS) is correctly identified, and the derived model is in agreement with previously published data (Figure 1A).

Network analysis
The proteinGroups.txt file was obtained from the MS data and contains all identified proteins together with their respective label-free quantification (LFQ) values. Only proteins identified in at least 50% of the 12 experiments were considered, and their mean LFQ values were calculated. For every protein identified, the STRING identifier [27] and the respective protein interaction network, containing both physical and functional interactions, were fetched using the STRING API.
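The filtering step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that an LFQ value of 0 marks a run without identification and that the mean is taken over the runs in which the protein was detected. The function name and the toy intensities are invented for the example.

```python
from statistics import mean


def filter_and_average(lfq_by_protein, min_fraction=0.5):
    """Keep proteins detected in at least `min_fraction` of runs; return mean LFQ.

    Assumptions (not stated in the text): a value of 0 means "not identified",
    and the mean is computed over the runs with a non-zero value.
    """
    kept = {}
    for protein, values in lfq_by_protein.items():
        detected = [v for v in values if v > 0]
        if values and len(detected) / len(values) >= min_fraction:
            kept[protein] = mean(detected)
    return kept


# Toy data with 4 runs instead of 12; all intensities are invented.
toy = {
    "FAS1": [2.0e9, 1.8e9, 0.0, 2.2e9],  # detected in 3/4 runs, kept
    "FAS2": [1.9e9, 0.0, 0.0, 2.1e9],    # detected in 2/4 runs, kept
    "RARE": [0.0, 0.0, 0.0, 5.0e6],      # detected in 1/4 runs, dropped
}
print(filter_and_average(toy))
```

Whether undetected runs should count as zeros in the mean is a design choice that changes the abundance ranking of sparsely detected proteins, so it is exposed here as an explicit assumption.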
Only binary interactions between partners whose stoichiometry, based on LFQ intensities, was less extreme than 1:10 were considered. This is because LFQ intensities reflect the relative abundance of the protein species, which in turn translates into relative abundance in the cryo-EM micrographs. Complexes whose members differ this strongly in relative abundance are challenging to capture within micrographs from native cell extracts, for example, the E3 of the PDHc [8]. The protein interaction network plot was generated using the Python module NetworkX [28]. Edges were colored from green to red according to their "exp. score" value, a parameter indicating the confidence of a physical interaction between two nodes, as described in Szklarczyk et al. [27].
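A minimal sketch of the edge filter and edge coloring is given below in plain Python. It reads the 1:10 criterion as "the two mean LFQ values differ by at most ten-fold" and assumes a linear color ramp from green at an exp. score of 0 to red at 1; both are assumptions, as are the function names and the toy values. The returned `(node, node, attributes)` tuples have the form accepted by NetworkX's `Graph.add_edges_from`.

```python
def within_ratio(lfq_a, lfq_b, max_fold=10.0):
    """True if the two abundances differ by no more than `max_fold`."""
    low, high = sorted((lfq_a, lfq_b))
    return low > 0 and high / low <= max_fold


def score_to_rgb(exp_score):
    """Map a score in [0, 1] to an RGB tuple from green (0) to red (1).

    The direction of the ramp is an assumption; the text only states
    that edges run from green to red.
    """
    t = min(max(exp_score, 0.0), 1.0)
    return (t, 1.0 - t, 0.0)


def filter_edges(edges, mean_lfq, max_fold=10.0):
    """Keep edges whose two partners pass the abundance-ratio criterion."""
    kept = []
    for a, b, exp_score in edges:
        if a in mean_lfq and b in mean_lfq and within_ratio(
            mean_lfq[a], mean_lfq[b], max_fold
        ):
            attrs = {"exp_score": exp_score, "color": score_to_rgb(exp_score)}
            kept.append((a, b, attrs))
    return kept


# Toy abundances and scores, invented for illustration.
toy_lfq = {"FAS1": 2.0e9, "FAS2": 2.0e9, "FAA1": 5.0e7}
toy_edges = [("FAS1", "FAS2", 0.9), ("FAS1", "FAA1", 0.3)]
print(filter_edges(toy_edges, toy_lfq))
```

In this toy case the second edge is dropped because the two partners differ 40-fold in abundance.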
Particles were picked using the "blob picker" module with a minimum particle diameter of 150 Å and a maximum particle diameter of 300 Å. Single-particle images were extracted with a box size of 180 px. Retrieved single particles were iteratively 2D classified, with 400 classes per round. Asymmetric (C1) ab initio reconstructions were calculated for clear 2D classes, and classes containing "junk" single particles (e.g., ice contaminants or broken/damaged single particles) were discarded after each iteration of 2D classification. For the dome-shaped map, later identified as the fatty acid synthase complex (FAS), D3 symmetry was applied during homogeneous refinement after prediction of symmetry utilizing ChimeraX [29]. From the initial symmetrized reconstruction, 20 2D projections were generated using the "Create Template" module of cryoSPARC.
Template-based particle picking then followed, with a particle diameter of 300 Å, a low-pass filter of templates and micrographs set to 25 Å, and a minimum particle separation set to a distance of 1.25 diameters. Picked particles were extracted with a box size of 210 px. After 2D classification and selection of clear classes, an asymmetric (C1) ab initio reconstruction was calculated, followed by a symmetrized (D3) homogeneous refinement.

Significance Statement
Progress in the analysis of heterogeneous biochemical samples, specifically in cryo-EM of native cell extracts, allows the identification and characterization of protein interactions at the structural level. Here, we propose a robust workflow that incorporates information from proteomics experiments to guide the identification of protein complexes and leverages AlphaFold to predict their structures. Our workflow forgoes the protein backbone tracing step and is able to characterize large protein complexes at medium resolution.

Unambiguous fitting of AlphaFold2 models
The proteins identified by MS were sorted by their LFQ intensities, and the most abundant 150 proteins were selected. Proteins were annotated according to the UniProt database [30], and protein names containing the keyword "ribosome" were removed. Fitting of the AlphaFold2 models in the density was performed with a modified local installation of ChimeraX (version 1.4.dev202202240543; Table S1A) [29]. Each model was globally fitted 10,000 times with ChimeraX by random placement within the map density and local placement optimization (Table S1B). Not each random placement results in an accepted solution. All fits were saved and ranked by atom engulfment, which is defined as the fraction of protein atoms overlapping with the electron density at a certain contour level. Taking into account the best fitting solution, an atom engulfment threshold of 0.8 was applied. Additionally, a coarse analysis was carried out to identify steric clashes, using a threshold of Cα-clashing >25% of the total number of Cα atoms in the monomer.
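The atom engulfment score defined above can be illustrated with a short NumPy sketch. It is not the ChimeraX implementation: it assumes nearest-voxel lookup, a cubic voxel, a density array indexed as `density[ix, iy, iz]`, and that atoms outside the map count as not engulfed. The function name and toy map are hypothetical.

```python
import numpy as np


def atom_engulfment(coords, density, voxel_size, origin, contour):
    """Fraction of atoms whose nearest voxel has a density >= `contour`.

    coords:     (N, 3) atom positions in Å
    density:    3-D array indexed as density[ix, iy, iz] (assumed axis order)
    voxel_size: Å per voxel (cubic voxels assumed)
    origin:     position of voxel (0, 0, 0) in Å
    """
    coords = np.asarray(coords, dtype=float)
    offset = coords - np.asarray(origin, dtype=float)
    idx = np.rint(offset / voxel_size).astype(int)
    # Atoms that fall outside the map are treated as not engulfed.
    inside = np.all((idx >= 0) & (idx < np.array(density.shape)), axis=1)
    engulfed = np.zeros(len(coords), dtype=bool)
    i, j, k = idx[inside].T
    engulfed[inside] = density[i, j, k] >= contour
    return float(engulfed.mean())


# Toy 4x4x4 map with a dense 2x2x2 core; values are invented.
toy_map = np.zeros((4, 4, 4))
toy_map[1:3, 1:3, 1:3] = 3.0
toy_atoms = [[2, 2, 2], [4, 4, 4], [0, 0, 0], [20, 20, 20]]
score = atom_engulfment(
    toy_atoms, toy_map, voxel_size=2.0, origin=(0, 0, 0), contour=2.0
)
print(score, "accepted" if score >= 0.8 else "rejected")
```

With an acceptance threshold of 0.8, as used in the text, a fit like the toy example (half of the atoms inside the contour) would be rejected.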

Resolving of Cα-clashing and real-space refinement
For calculating Cα-clashes between two given protein structures (in PDB format), the (x,y,z) coordinates of the Cα atoms were extracted into a NumPy array [31], pairwise distances between the atoms were calculated with SciPy [32], and a clash was identified when an atom resided at a distance of less than 3.65 Å from the nearest atom of the other structure (Table S1C).
The acceptance threshold for homomultimers was set to a mean Cα-clash value of ≤10.
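The clash criterion can be sketched as below. For self-containment this example computes the pairwise distances with NumPy broadcasting rather than SciPy, and it takes coordinate arrays directly instead of parsing PDB files; the function name and the toy coordinates are illustrative assumptions.

```python
import numpy as np

CLASH_CUTOFF = 3.65  # Å, cutoff stated in the text


def ca_clash_fraction(ca_a, ca_b, cutoff=CLASH_CUTOFF):
    """Fraction of Cα atoms in chain A closer than `cutoff` to any Cα in chain B."""
    ca_a = np.asarray(ca_a, dtype=float)
    ca_b = np.asarray(ca_b, dtype=float)
    # Pairwise distance matrix of shape (len(A), len(B)) via broadcasting.
    dists = np.linalg.norm(ca_a[:, None, :] - ca_b[None, :, :], axis=-1)
    # An atom clashes if its nearest neighbour in the other chain is too close.
    return float((dists.min(axis=1) < cutoff).mean())


# Toy Cα traces (Å); coordinates are invented.
chain_a = [[0, 0, 0], [10, 0, 0], [20, 0, 0], [30, 0, 0]]
chain_b = [[1, 0, 0], [10, 3, 0], [50, 0, 0]]
fraction = ca_clash_fraction(chain_a, chain_b)
print(f"{fraction:.0%} of chain A Cα atoms clash")
```

A full distance matrix is adequate for a pair of monomers; for screening many pairs, a spatial index such as a k-d tree would scale better.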

Local resolution estimation
Local resolution maps for the FAS final reconstruction were generated with the "Local Resolution Estimation" module in cryoSPARC v3.1.1 [21] and visualized with ChimeraX v1.4 [29].

Code availability
The software employed in this study was written in Python and is available upon request.

FIGURE 1 (caption fragment) ... Edges are colored based on the direct interaction confidence (refer to B). All proteins identified in the sample are plotted as a boxplot in terms of relative abundance as measured by the label-free quantification (LFQ) score [35]. Members of the Metabolism group are highlighted as dots. (D) Fatty acid synthesis group. The nodes are colored based on their reactions: fatty acid synthase (red), acetyl-CoA synthesis (light green), long-fatty-acid ligases (dark green), and enoyl-ACP-reductase (blue). Edges are colored based on the exp. score (refer to B). The relative abundance, based on LFQ values reported by MaxQuant, of all proteins identified in the sample is plotted as a boxplot, and members of the FAS group are highlighted as dots. The boxplot minima represent the 25th percentile, the maxima represent the 75th percentile, the notch indicates the data's median, and the whiskers extend to the minimum and maximum value within a 1.5 interquartile range.

High molecular weight protein complexes from native cell extracts: Identifying protein communities by combining MS and database knowledge
Two high-molecular-weight fractions of yeast native cell extracts were previously analyzed by Schmidt et al. [25] to retrieve the endogenous L-A helper virus and various states of translating ribosomes.
Here, we re-analyzed the reported MS results as well as 12,795 cryo-EM micrographs that were acquired at a pixel size of 3.177 Å/px to capture the cell extract content beyond these abundant biomolecules (Figure S1A). To further classify identified proteins, KEGG pathway analysis was performed [34]. In total, eight principal classes were recognized: (1) ribosomes, (2) proteasome, (3) carbon metabolism (including glycolysis, ethanol fermentation, and Krebs cycle among others), (4) fatty acid biosynthesis, (5) RNA polymerases, (6) RNA degradation, (7) phagosome, and (8) mRNA surveillance (Table S2). These classes cover 40% of all identified proteins, highlighting the high complexity of the sample.
The protein-protein interaction analysis, utilizing the standard STRING aggregated score, revealed a densely packed network with highly interconnected proteins. Nevertheless, an apparent grouping of proteins is visible (Figure 1B). A large and heterogeneous group, referred to as Metabolism, includes proteins involved in cytosolic glycolysis, the malate shuttling mechanism, the mitochondrial pyruvate dehydrogenase complex, and the Krebs cycle (Figure 1C). Relative abundance values were estimated using the LFQ intensity score [35] (see Section 2), also previously used for deriving stoichiometric data for human protein-protein interactions (PPIs) [36]. Diverse abundance values (LFQ score) and low interaction values (exp. score [27]) indicate that the entire group probably does not form a stable metabolon but is composed of different subcomplexes. This is comparable with previously identified structures of co-eluting pyruvate and α-ketoglutarate dehydrogenase complexes [37,38]. Judged by the exp. score, protein-protein interactions are visible for mitochondrial proteins, but cytosolic proteins have a lower exp. score, that is, a reduced probability of interaction. A notable exception is phosphofructokinase (PFK): both its subunits (α, β) are present with very high abundance and direct interaction scores (Figure 1C).
Another interesting group includes proteins related to fatty acid synthase (FAS; Figure 1D). Notably, apart from the canonical α and β subunits of the FAS (FAS1, FAS2), acetyl-CoA carboxylase (ACC1), which is involved in the production of malonyl-CoA, a substrate for FAS, is also identified, along with two ligases and one reductase. Correlating with the MS-derived protein abundances, the α (FAS2) and β (FAS1) subunits of FAS have the same copy number, indicating a 1:1 stoichiometry in the complex, which corresponds to the known structure of fungal FAS [39], where they form an A6B6 complex. The carboxylase is present at a similar abundance but due to its megadalton size [40] might co-elute, while the ligases and reductase are of much lower abundance, indicating a transient interaction. This is also underpinned by pre-existing experimental evidence, namely that only the α and β subunits of FAS form a stable complex, while the interaction of ACC1 and other carboxylases is of a transient nature [12,41].

Template-free cryo-EM reconstruction
For an unbiased picking of particles for cryo-EM, a blob picker module with a target particle size of 150-300 Å (Figure 2A) [...] classifications were required to effectively identify clear signatures (Figure 2C). Very clear 2D classes appeared after multiple rounds of 2D classification (Figure 2D) but the total number of particles contained in each final class was low. To increase final particle numbers, ab initio 3D reconstructions utilizing the particles contained in these classes were generated and, based on these, 2D templates for template-based picking were created. With this approach, particle picks were considerably increased (Figure 2E). After the template-based picking, 6,923,214 particles could be identified, of which 7,895 were eventually selected, again through iterative 2D classifications. A final reconstruction (D3 symmetry) resulted in a cryo-EM map of 8.0 Å resolution (FSC = 0.143) (Figures 2F and S1B). The cryo-EM density map shows an overall uniform resolution despite the missing views (Figure S1C), with additional localized lower resolution features (Figure S1D). An additional challenge posed by the current reconstruction is the limited coverage of particle views in the 2D class averages (Figure S1A). This effect of FAS particles has been previously observed [15] but due to the particles' shape and symmetry this does not pose a limitation to achieve sufficient resolution for further map analysis.

AI-guided protein structure modeling: Identification and refinement of FAS
The dome-shaped reconstruction that was retrieved (Figure 2F) could not be built with current refinement tools; due to the resolution of ∼8 Å, AI tools like FindMySequence [42], DeepTracer [43], or ModelAngelo [44] are not applicable. To address this issue, the 150 most abundant proteins were selected, ribosomal proteins were excluded, and modeling was focused on less abundant protein signatures. Ribosomes display distinct structural signatures in the raw micrograph data, 2D classes, and 3D reconstruction and can be easily distinguished from other, less abundant, protein signatures in native cell extract [25,37]. The remaining 61 proteins were systematically fitted in the derived reconstruction (Figure 2F). To this end, all 61 protein structures were retrieved from the AlphaFold2 database and fitted 10,000 times each in the density to capture a variety of solutions (Figure 3A).
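The candidate selection described above (abundance ranking followed by keyword exclusion) can be sketched in a few lines. The function name, the case-insensitive substring match, and the toy identifiers and protein names are assumptions made for illustration; the default keyword follows the wording of the Methods.

```python
def select_candidates(mean_lfq, names, top_n=150, keyword="ribosome"):
    """Rank proteins by mean LFQ, keep the top `top_n`, drop keyword matches.

    The case-insensitive substring match against the protein name is an
    assumption about how the keyword filter was applied.
    """
    ranked = sorted(mean_lfq, key=mean_lfq.get, reverse=True)[:top_n]
    needle = keyword.lower()
    return [p for p in ranked if needle not in names.get(p, "").lower()]


# Toy input; identifiers, names, and intensities are invented.
toy_lfq = {"P1": 9.0e9, "P2": 5.0e9, "P3": 4.0e9, "P4": 1.0e6}
toy_names = {
    "P1": "Toy ribosome-associated protein",
    "P2": "Fatty acid synthase subunit beta",
    "P3": "Fatty acid synthase subunit alpha",
    "P4": "Toy low-abundance protein",
}
print(select_candidates(toy_lfq, toy_names, top_n=3))
```

Note that the exclusion is applied after the abundance cut, which matches the order given in the text (150 selected, 61 remaining).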
The best-fitting solutions were isolated and grouped into predicted homomers. To assess the quality of these homomeric complexes, the number of Cα-clashes between subunits was calculated, and the complexes were ranked according to both average and total clash number. A high clash number indicates an ambiguous placement of a protein (Figure 3B). Only 18 homomers fulfilled the criteria. To further statistically analyze these assemblies, the map coverage at various contour levels was calculated (Figure 3C). The majority of homomers explained only a minor part of the map (<10% map coverage at a contour level of 2.0), but two proteins stood out: FAS1 and FAS2, each explaining nearly half of the map.
Even though these two could be the correct hits, all heteromeric assemblies were generated and analyzed analogously to the predicted homomers: only seven heteromeric assemblies fulfilled the low clash value criterion (maximum 2.5% of Cα clashing; data not shown), and their map coverage was calculated (Figure 3D). Interestingly, the fitted heteromeric complex consisting of FAS1 and FAS2 was one of those hits. Unlike the other predicted heteromeric complexes, this complex was fully covered by the density at high contour levels (Figure 3D).
The unambiguously fitted, AI-generated monomeric protein structures resulted in a dodecameric assembly of FAS1 and FAS2 (each in six copies) (Figure 3). This was further cross-validated by the initial network analysis, where a direct interaction between these two proteins was visible (Figure 1D). To reduce clashes or close contacts between the fitted monomers, a simple real-space refinement with default values was performed (Figure 3E

DISCUSSION
Modern approaches in structural biology are able to identify and structurally characterize protein communities in a near-native state [15,24].
Due to the nature of the native environment, the samples are highly heterogeneous, and multiple disciplines must be combined to eventually decipher the structural information. These include, among various others, different EM techniques, including tomography [45,46], SPA [8,17,37,47], single-cell lysis during sample preparation for cryo-EM [48], and MS [49].
While MS is precise and can identify proteins even at very low abundances, these data must be set in the context of protein communities.
Here, we demonstrate how network visualization and grouping, based on relative protein abundances and on KEGG and STRING information, can identify these communities in a single chromatographic fraction, even without the inclusion of elution profile information. In the future, our protocol will be able to incorporate more data, for example, MS results from multiple consecutive fractions. Such data could act as input to identify more protein communities and differentiate these from random co-elution using advanced techniques like deep learning algorithms for more precise data interpretation [11,50]. Using cryo-EM to analyze these highly heterogeneous samples adds another layer of knowledge, but also increases complexity. To tackle this, we combined MS data with AlphaFold2-predicted protein structures and were able to unambiguously recapitulate the FAS complex. This demonstrates that, even without reaching near-atomic resolution (<3.8 Å) during map reconstruction, it is feasible to retrieve models of protein complexes in a robust manner. This result complements our recent work [37] where we showed unambiguous identification of a protein improving interface energetics, for example, computed using tools such as HADDOCK [51].
The proposed workflow is generally applicable to any given cryo-EM map if a set of potential target proteins is available, as medium resolution (∼8 Å) maps hold enough information to accurately fit AlphaFold models. Other approaches require the knowledge of the proteins that form the target assembly, for example, HADDOCK [52,53], EMBuild [...] maps (<6 Å) with clear secondary structure separation, for example, Pathwalker/ROSETTA [57,58]. Additionally, they require information about the sequence and structure. Moreover, such algorithms require flexible refinement procedures; therefore, docking hundreds of proteins and their pairs within the map is computationally challenging. Our workflow does not have these limitations, as the sequence information is directly derived from experimental MS data and the rigid fitting is computationally less demanding. A lower limit in protein size should be set, as, for example, a 25 kDa protein can most likely be fitted in various rotations in a given low-resolution map. Additionally, an overall domain shape must be recognizable, setting an upper resolution limit to approximately 15 Å; however, these estimates can be quantitatively approached in the future by further benchmarking the proposed workflow. Moreover, a derived, highly heterogeneous macromolecular assembly that includes more than protein density (for example, ribosomes, which consist of rRNA for a major part) might also negatively influence fitting results. Identification and modeling of ribonucleoproteins in cryo-EM maps is nevertheless possible with other tools, such as DRRAFTER [59]. It should be noted that FAS, even though in the MDa molecular weight range, is highly symmetric, allowing an identification based on only two polypeptide chains. More complex cryo-EM density maps might require adaptations in the fitting and scoring procedures.
Another major limitation of our workflow is how the parameters are set during particle picking. Blob-picked particles must fulfill a certain diameter threshold to be selected. Varying the diameter, which is feasible during particle picking, can greatly influence the chemical and structural heterogeneity of the retrieved particles.
Also, a large majority of picked particles can be either noise or damaged [60] and must be separated in lengthy, manually curated 2D classifications. Some 2D class averages might never be retrieved due to particle heterogeneity: phosphofructokinase was never observed in the data, although the molecule should have been seen, as its diameter is in the range of blob sizes set during picking.
Future developments in cryo-EM should include "on-the-fly" analysis combined with advanced AI-assisted algorithms that cover all steps in a cryo-EM pipeline (from image analysis to model building).
Single solutions for individual steps exist (e.g., DeepCryoPicker [61], DeepPicker [62], phenix autobuild [63], or others [64,65]), but all of these require specialized knowledge, which limits usability for the majority of scientists. Future developments should aim at making these systems available and easy to use (simple point-and-click). These systems can include generalizable neural networks that eventually discriminate noise, contamination and signal, and ultimately provide an output of a