Yeast diversity in open agave fermentations across Mexico

Yeasts are a diverse group of fungal microorganisms that are widely used to produce fermented foods and beverages. In Mexico, open fermentations are used to obtain spirits from agave plants. Despite the prevalence of this traditional practice throughout the country, yeasts have only been isolated and studied from a limited number of distilleries. To systematically describe the diversity of yeast species from open agave fermentations, here we generate the YMX‐1.0 culture collection by isolating 4524 strains from 68 sites with diverse climatic, geographical, and biological contexts. We used MALDI‐TOF mass spectrometry for taxonomic classification and validated a subset of the strains by ITS and D1/D2 sequencing, which also revealed two potential novel species of Saccharomycetales. Overall, the composition of yeast communities was weakly associated with local variables and types of climate, yet a core set of six species was consistently isolated from most producing regions. To explore the intraspecific variation of the yeasts from agave fermentations, we sequenced the genomes of four isolates of the nonconventional yeast Kazachstania humilis. The genomes of these four strains were substantially distinct from a European isolate of the same species, suggesting that they may belong to different populations. Our work contributes to the understanding and conservation of an open fermentation system of great cultural and economic importance, providing a valuable resource to study the biology and genetic diversity of microorganisms living at the interface of natural and human‐associated environments.


| INTRODUCTION
Yeasts are single-celled fungi that have played a significant role in human history.These microorganisms have a wide geographical distribution, in part as a result of their use by humans in the production of fermented foods and beverages; they occur naturally in a variety of climates, ecosystems, and raw fermentation materials such as grapes and cereals (Gallone et al., 2016;Giannakou et al., 2020;Marsit et al., 2017;Molinet & Cubillos, 2020).The oldest archeological records of fermented beverage production are dated to 8000-3000 BC in China, Sumeria, Iran, and Egypt (De Chiara et al., 2022;Marsit et al., 2017;Turker, 2015).Long before humans knew about their existence, yeasts had already been handled in fermentations to produce wine, cider, distilled spirits, bread, cheese, and other products.In addition, Saccharomycetales yeasts are used nowadays in a variety of biotechnological applications, such as bioremediation, biocatalysis, biofuel production, protein production, and biomedical research (Cubillos, 2016;Johnson & Echavarri-Erasun, 2011;Turker, 2015).
Yeast diversity influences the properties of fermented foods and beverages.It is known that Saccharomyces cerevisiae is a salient species in the fermentation process, while "nonconventional" yeasts have the potential to metabolize uncommon carbon sources or to produce desirable metabolites of biotechnological interest.For example, in wine production, S. cerevisiae is responsible for alcoholic fermentation of grapes, Pichia yeasts produce metabolites that add desirable organoleptic properties, while Brettanomyces bruxellensis is associated with unwanted off flavors (Burini et al., 2021;Ciani & Comitini, 2011;Gallone et al., 2016;Godoy et al., 2017;Jolly et al., 2014;Molinet & Cubillos, 2020).
Moreover, yeast populations associated with human-related environments show specific adaptations.For instance, there are S. cerevisiae wine lineages that are tolerant to the presence of copper, sulfates, or high ethanol concentrations, while others used for beer production have the capacity to metabolize more complex sugars such as maltotriose (Cubillos et al., 2019;De Chiara et al., 2022;Gallone et al., 2016;Giannakou et al., 2020Giannakou et al., , 2021;;Liti et al., 2009;Peter et al., 2018;Warringer et al., 2011).Yeast domestication and its mechanisms have been most thoroughly documented in extreme scenarios of human handling, such as the use of starter cultures of S. cerevisiae in the production of beer (Gallone et al., 2016;Molinet & Cubillos, 2020).
In Mexico, agave spirits are produced through the fermentation of juice and bagasse obtained from cooked hearts of agave plants (Agave spp., Asparagaceae).This refers to all distilled beverages obtained from agave, such as tequila, mezcal, bacanora, raicilla, and other locally named spirits.Traditional production takes place in small or medium-sized distilleries across the country, where open fermentation occurs "spontaneously" driven by microorganisms from the surrounding environment, which is characterized by a diverse range of local climates.A variety of production techniques are employed (Arellano-Plaza et al., 2022;CONABIO, 2006) and over 50 different agave taxa are used (Colunga-GarcíaMarín et al., 2017).In brief, agave spirit production involves plant harvesting, cooking, grinding, fermentation, and distillation (Kirchmayr et al., 2017;Nolasco-Cancino et al., 2018).The cooking step transforms complex carbohydrates such as fructans into simple sugars, predominantly fructose, and generates inhibitory compounds for microorganisms, such as furans, aldehydes, and organic acids (Mancilla-Margalli & López, 2002).Through grinding, pulp, fibers, and juice are obtained,

Take-away
• We isolated and identified 4524 yeast strains from open agave fermentations in Mexico.
• A core set of six yeast species was consistently found across diverse regions.
• Kazachstania humilis genomes differed significantly from those of isolates in other regions of the world.
• We report two candidate new species related to the Pichia clade.
Despite the rising demand for agave spirits in international markets, the diversity of yeasts involved in agave fermentation remains largely unexplored.Pioneering studies by Marc-André Lachance in a traditional tequila distillery showed that local insects and objects within the distillery can serve as yeast reservoirs (Lachance, 1995).In following efforts, yeast strains have been isolated from fermentation tanks in some traditional distilleries located in the states of Oaxaca, Durango, and Tamaulipas (Kirchmayr et al., 2017;Lappe-Oliveras et al., 2008;Nolasco-Cancino et al., 2018;Páez-Lerma et al., 2013).From these studies, at least 43 yeast species and over 50 bacterial species have been found in agave fermentations.Among these, S. cerevisiae, Kluyveromyces marxianus, and Pichia species emerged as the most prevalent, irrespective of the sampling location sampled or the year of study.However, all these studies have focused on specific locations within a few producing regions, thus not fully encompassing the entire range of biogeographical and cultural diversity associated with the production of agave spirits.
The Yeast Genomes Mx consortium aims to provide a comprehensive understanding of the genomic and functional diversity of yeasts in Mexico.Considering the pivotal role microorganisms play in agave fermentations and the limited information available about their natural history in such environments, these communities are a good starting point for investigating regional yeast diversity.For this purpose, we assembled the YMX-1.0culture collection, comprising thousands of yeast strains isolated from fermentation tanks in 68 traditional distilleries throughout the country.These strains were initially identified using matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry.Further confirmation was obtained through ITS and D1/D2 sequencing of a subset of the isolates.This approach not only confirmed the regular occurrence of S. cerevisiae but also revealed the widespread presence of nonconventional yeasts, including two candidate new species.The YMX-1.0 culture collection provides a unique resource for launching genomic and population studies of yeasts from agave fermentations in the context of a megadiverse region.As a working example, we generated and analyzed draft genome sequences of four isolates of the nonconventional yeast Kazachstania humilis, showing that these strains are markedly different from those found in other regions.

| Agave distilleries and site information
Sampling sites were chosen to include representative traditional distilleries from each of the seven agave spirit-producing regions in Mexico, as described by Aguirre et al. (2006).The West area was subdivided into West I and West II due to the contrasting environmental conditions and agave species employed.Fieldwork took place between March 2018 and December 2020.Samples were collected from 68 distilleries distributed across 13 states of Mexico, including Colima, Durango, Estado de México, Guanajuato, Guerrero, Jalisco, Michoacán, Oaxaca, Puebla, San Luis Potosí, Sonora, Tamaulipas, and Zacatecas (Figure 1a).The sampling effort focused solely on mezcal, raicilla, bacanora, and other traditional agave spirits, excluding tequila which is predominantly produced in industrialized distilleries (see Data set S1).  (García, 1998).For each distillery, information such as geographic coordinates, oven type, grinding type, and other structural building characteristics were recorded.Metadata associated with each isolate is provided in Data set S1.

| Collection and handling of agave fermentation samples
Samples were collected from distilleries that had at least one active fermentation tank at the time of the visit.If multiple tanks were available, we used information on agave species and fermentation phase, as provided by the producers, to select which tanks to sample-namely, one tank from each species or stage was sampled.The fermentation phase for each sample was determined by dividing the ongoing fermentation time at sampling by the estimated total duration of fermentation, as reported by the producer.The total fermentation time was divided into three phases defined as terciles.A total of 201 tanks were sampled, ranging from one to five per distillery; all but 10 tanks were sampled only once.
Sampling involved collecting agave must from a depth of 25-30 cm below the surface at arm's length from the tank edge, using a sterile 25 mL serological pipette.pH for each sample was determined with colorimetric bands (Civeq CVQ2051) and the temperature of the tank was measured with a needle thermometer placed 10 to 15 cm below the surface.Collected agave must was used to fill two pre-labeled 4 mL cryovials with glycerol to achieve a final concentration of 12.5% and immediately frozen in liquid nitrogen.All samples were kept in liquid nitrogen until they were transferred to a −80°C freezer in the laboratory.
Additional field data, such as the material of the fermentation tank, and the agave species used, were also collected (see Data set S1). bacto peptone, 10 g/L glucose, 1 ml/L HCL, and 200 mg/L chloramphenicol), as described by Liti et al. (2017).Ethanol concentration in the isolation medium was reduced to 6% v/v, mimicking what is usually found in agave fermentations and also promoting the isolation of nonconventional yeast species with lower ethanol tolerance.Out of 201 samples, 158 (86.7%) from 147 different tanks showed signs of fermentation within 2-7 days.These samples were diluted 1:100 and 1:1000 and spread on Wallerstein Laboratory (WL) Nutrient Agar (Sigma 17222) which enables screening for morphological and physiological variations (Verdugo Valdez et al., 2011).Samples were incubated for 5 days at 25°C, followed by an additional 2 days at 4°C to enhance morphological differentiation.Colonies with differing morphologies were randomly selected from the plate exhibiting the greatest number of isolated colonies; 12-60 colonies were transferred to 150 μL of YPD medium with 12.5% glycerol in 96 microtiter-well plates (Corning CLS3367) and stored at −80°C.For most samples (144 out of 158), between 24 and 36 strains were isolated.Note that only strains subjected to DNA extraction and ITS or D1/D2 sequencing were restreaked twice to achieve isogenic colonies.The YMX-1.0 culture collection consists of 4524 strains organized in 53 microtiter plates.

| Strain identification by MALDI-TOF mass spectrometry
Strains were analyzed as in García-Gamboa et al. (2021).In brief, microtiter plates were thawed, strain arrays were printed onto YPD agar plates and then incubated for 48 h at 30°C.Biomass was directly spotted onto MSP 96 polished steel MALDI target plates.Each spot was treated with 1 μL of 70% formic acid until dry, followed by a 1 μL covering of 10 mg/mL HCCA (α-cyano-4-hydroxycinnamic acid) matrix solution, containing 50% acetonitrile and 2.5% trifluoroacetic acid as the organic solvent.The plates were processed using a MICROFLEX LT instrument (Bruker Daltonics).Biomass fingerprints for each spot were acquired with flexControl 3.4 software with the MBT_FC.par method.A total of 40 laser shots were collected at six random positions within each spot, automatically generating mass spectra ranging from 2000 to 20,000 Da.The output files were compared to the BDAL v.8 or v.10 databases (see Data set S1).No mass-spectrometry peaks were obtained for 462 (10.2%) of the 4524 strains analyzed.For samples with usable mass-spectrometry data, 3063 (67.7%) passed a stringent Score Value of 1.7 used for species assignment.Using a more lenient Score Value of 1.5 increased the number of identified strains to 3503 (77.4%).

| Statistical testing of associations between community composition and sample metadata
A non-parametric analysis of similarities test (ANOSIM) was employed (Clarke, 1993; vegan library in R, Oksanen et al., 2022).
Samples or distilleries served as sampling units and communities were encoded at either the species or genera taxonomic level.Tests were performed with the dissimilarity of presence or absence among sampling units encoded in Jaccard's indexes as the distance and grouping factors were the metadata related to each sample or distillery (Table S1).Region, climate group, and grinding type were considered as grouping factors for distilleries (three simultaneous tests, α = 0.05/3).For individual samples, seven additional factors were considered, namely locality, fermentation phase, tank material, expected total duration of the fermentation, agave species, pH, and tank temperature (10 simultaneous tests, α = 0.05/10).For fermentation phase and climate group, taxa that differed among groups were identified with a multilevel pattern analysis using 9999 permutations to assess the significance (p < 0.05, point biserial correlation index r.g.) of each species to each group (indicspecies library in R; De Cáceres et al., 2010).Whether yeast diversity was similar across fermentation phases was evaluated with a mixed-effects model with Simpson's diversity per sample as the dependent variable, fermentation phase as fixed effect, localities within regions as nested random effects, and tank temperature and pH as crossing random effects (lme4 library in R; Bates et al., 2015).This model was compared with a likelihood ratio test to a null model in which the fermentation phase effect was discarded.

| Strain identification by ITS and D1/D2 sequencing
A panel of 474 strains including four of the most abundant putative species identified by MALDI-TOF was selected.Isolates were The YMX-1.0 culture collection was generated with yeasts isolated from diverse environmental and production conditions.(a) Sampling sites were 68 traditional agave distilleries across Mexico (black dots); shaded colored areas indicate the seven regions where agave spirits are produced in the country.(b) Scatter plots showing the diversity of climatic parameters of sampling sites grouped by region: Average annual temperature (top left), isothermality (top right), average annual rainfall (bottom right), and altitude above sea level (bottom left).(c) Samples were collected from actively fermenting tanks; pictures show examples of containers including (i) cement tanks, (ii) wood barrels, (iii) cowhide containers, (iv) plastic barrels, (v) clay pots, (vi) and steel barrels.(d) Flowchart depicting the generation of the collection and the initial analyses performed in this work.Fermented-agave samples were enriched in liquid growth medium and plated onto WL Nutrient Agar.Isolated colonies were grown and stored in microtiter 96-well plates and assigned a putative species by MALDI-TOF biotyping.Species assignment was validated by ITS sequencing for a subset of isolates.D1/D2 and ITS sequencing were used to identify putative new species and genome sequencing to assess intraspecific diversity (see Section 2).restreaked two times to obtain axenic cultures, from which genomic DNA was extracted with sodium hydroxide (Sylvester et al., 2015).
The ITS/5.8 rDNA ITS region was amplified using universal fungal primers ITS-1 and ITS-4, as previously described (Sylvester et al., 2015), and sequenced using the same primers.Sequences were filtered and processed using a 0.01 trimming threshold, using a modified Mott (M1) cut algorithm implemented in the sangeranalyseR package for R (Chao et al., 2021).For a subset of isolates that yielded no reliable identification by MALDI-TOF, taxonomic identity was assigned by analyzing sequences of both the ITS-5.8Sregion as mentioned above, and the D1/D2 variable domains that were amplified and sequenced using the NL1 and NL4 primers.The sequences were assembled, edited, and aligned with MEGA11 (Tamura et al., 2021) and compared with annotated yeast sequences in the GenBank database using BLAST, as previously described (Sayers et al., 2020).A low-depth sequencing approach was used for the taxonomic assignment of the isolates YMX003144, YMX000711, and YMX000142, which were assigned to two candidate new species.Briefly, genomic DNA was purified using the MasterPure DNA purification kit as recommended by the manufacturer and sequenced using IlluminaNextSeq platform (2 × 150 bp).Poor quality reads were removed with fastp v0.20.0 (Chen et al., 2018).The remaining reads were assembled with SPAdes v3.12.0 (Bankevich et al., 2012).Nucleotide sequences of the ITS and D1/D2 regions were recovered from the assemblies with biostrings R package (Pagès et al., 2023) using the sequence of the corresponding primers.
These alignments were concatenated and positions with gaps were removed.The resulting alignment was used to infer a Maximum Likelihood tree with RaxML (Stamatakis, 2014) employing the GTR+GAMMA model for nucleotide substitution and 1000 bootstrap replicates.Structural variation between the CLIB1323 and YMX004033 genomes was identified through comparisons of whole-genome assemblies of both genomes using the SyRI software (Goel et al., 2019).Both assemblies were scaffolded with each other using RagTag, which organized the 16 contigs of CLIB 1323 into 13 scaffolds and the 21 contigs of YMX004033 into 15 scaffolds without loss of information.The assembled genomes were aligned using MUMmer to identify pairs of homologous scaffolds (Delcher et al., 2003), resulting in a set of 13 comparable scaffolds for each strain.This set of scaffolds maintained 100% of the original CLIB 1323 sequence and 98.63% of the YMX004033 sequence.To identify genes encoded within misaligned regions exceeding 1 kb between the two genomes, we overlaid these regions with gene annotations using the findOverlaps function of the GenomicRanges R package (Lawrence et al., 2013).Orthologs among the Kz.humilis strains were determined using OrthoFinder (Emms & Kelly, 2019).

| A comprehensive collection of yeast strains isolated from agave fermentations
The YMX-1.0 culture collection comprises 4524 yeast strains isolated from 68 artisanal agave distilleries in Mexico.Each isolate is associated with metadata regarding species-level identification, biogeographical information, production practices, and source tank pH and temperature at the time of sampling (Data set S1). Fieldwork was conducted between 2018 and 2020, and a detailed description is provided in Section 2.
Sampling sites were traditional distilleries located in 13 states across the seven regions of Mexico where agave spirits are produced: the Northwest, Northeast, West I, West II, Balsas-basin, Centralhighlands, and South-central regions (Figure 1a).Tequila distilleries were excluded because, nowadays, axenic monocultures are commonly used for controlled fermentation in these industrialized settings.The maximum distance between sampled distilleries was 2033 km (1263 mi) from Baviácora, Sonora to San Luis Amatlán, Oaxaca.The sampling sites represented at least four general climate types: warm subhumid, semi-warm, semiarid, and temperate.
Environmental annual temperature at the sampling sites ranged from 14°C to 24°C, with isothermality ranging from 49 to 78.Mean annual precipitation varied from 389 L/m 2 in San Luis de la Paz, Guanajuato (Central highlands) to 1510 L/m 2 in Cabo Corrientes, Jalisco (West II).
Altitudes ranged from 395 meters above the sea level (MASL) in San Pedro Ures, Sonora (Northwest) to 2508 MASL in San Felipe, Guanajuato (Central highlands) (Figure 1b).The diversity of production practices was reflected in the types of fermentation tanks, which were built from wood, cement, plastic, stone, clay, steel, or cowhide (Figure 1c).Agave fermentation samples were nitrogen-frozen with glycerol in the field and used in the laboratory to isolate yeast strains for further analysis (Figure 1d; see Section 2).

Hanseniaspora valbyensis.
Based on MALDI-TOF classification data, despite the great diversity of environmental conditions, we observed that six species-S.cerevisiae, P. kudriavzevii, Kl. marxianus, P. manshurica, P. kluyveri, and T. delbrueckii-were isolated from at least five of the seven agave-spirit producing regions, with extensive presence across sampling sites (Table 1).These yeasts could be considered the core species of agave fermentation; if only species isolated in all producing regions are taken into account, only S. cerevisiae, P. kudriavzevii, and Kl.marxianus would constitute the core of these yeast communities.
The remaining 12 species were found in less than five producing regions and together accounted for 8.8% of the total 3024 yeast isolates with species assignments in our collection (Figure 2a).
We calculated the Shannon-Weiner (H′) and Simpson (D′) indexes for each region, showing that yeast species diversity varied from H' = 1.15 in the Northwest to H' = 2.11 in the South-central region (Figure 2b).Dominance of the most abundant yeast species was marked in the Northwest (D = 0.51) and less evident in other regions.It is of note that isolation of the otherwise widespread P. kudriavzevii yeast was uncommon in the Central highlands and P. manshurica was rare in the Northwest.In contrast, T. delbrueckii and H. lachancei isolates were common in the South-central region, although T. delbruecki was present in six out of the seven producing regions, while H. lachancei was only present in the South-central and West I regions.Other species were isolated at lower frequencies (0.5%-1%) but were nonetheless distributed across different sampling sites and regions (Figure 2c; Table 1).
Although a core set of species was common throughout agave fermentations, based on ANOSIM, the composition of yeast communities in the fermentation samples had statistically significant differences among localities, climate groups, fermentation phases, and total duration of the fermentation (Table 2).In contrast, the producing regions, grinding procedures, agave species, tank pH, and tank temperature did not show statistically significant differences.It is noteworthy that in all cases, the magnitudes of the effects were small (r < 0.25), indicating that the variation in the composition of the communities associated with these variables is small.These results were mostly insensitive to the taxonomic level used for the analyses (species or genera); only the tank material showed an association at the genus level not seen when considering species (Table 2).On the contrary, when factoring the communities at the level of distilleries, we did not find statistically significant differences in the composition of yeast communities across any of the groupings tested (producing regions, climate groups, or type of grinding used, Table 2).The small r statistics at the level of samples together with the not statistically significant differences at the level of distilleries, could reflect the limitation of our sampling design to fully disentangle the variables that influence the composition of yeast communities in these fermentations.However, in general, the similarities of the ranks within fermentation phases and localities suggest that short-term and local factors seem to be the most important determinants of the composition of yeast communities in agave fermentation.
To further assess the changes in yeast community composition through time, we first estimated the Simpson's diversity index in the different fermentation phases.The index decreased with fermentation time, but the change was not statistically significant after controlling for geographical variation (localities within regions) and tank conditions (pH and temperature) (p = 0.059, likelihood ratio test).
However, the Bray-Curtis dissimilarity rank did show a statistically significant decrease as a function of fermentation phase, although the magnitude of the change was small (r = 0.06, p < 10 3 , ANOSIM).
Overall, these results suggest there is a common core of yeast species in agave fermentations and that local variables shape the composition of yeasts in the sampled fermentations.

| Validation of MALDI-TOF biotyping by ITS sequencing
To verify species classification based on MALDI-TOF biotyping, we sequenced the rDNA internal transcribed spacers (ITS region) of a subset of 474 strains (Table S1).ITS sequences revealed that 390 (82.3%) of the isolates were correctly identified initially by MALDI-TOF biotyping at the species level (Figure 3).Discrepancies between identifications by these approaches are likely due to a combination of factors; the accuracy of MALDI-TOF identification depends on the species analyzed, sample processing, and completeness of the reference database.The highest frequency of accurate identifications was obtained for strains classified as S. cerevisiae (88.1%) and Kl.marxianus (87.6%), while the lowest accuracy was for P. manshurica (58.3%); 25 out of the 72 putative P. manshurica strains (~35%) corresponded to other Pichia species.Noteworthy, a small fraction (7.6%) of the putative S. cerevisiae isolates were assigned to its sister species Saccharomyces paradoxus by ITS sequencing, increasing the total number of S. paradoxus isolates to 10 in the YMX-1.0collection.

| Candidate new species of yeasts in the agave fermentation environment
Many isolates of the YMX-1.0 with high-quality spectra (996, 22.0%) were not assigned to any species in the database using a high confidence threshold (Score Value ≥ 1.7).This number was still The distribution of yeasts across Mexico reveals a core of six species and local variation of uncommon species.Using MALDI-TOF biotyping, 3063 isolates of the YMX-1.0culture collection were assigned to putative species.(a) Bar plot shows the frequency of core species; other species comprised 8.8% of the identified strains.(b) Stacked bars show the species distributions of isolated yeasts in each of the seven producing regions; the number of isolates from each region is shown in parenthesis.Only the set of six core species are shown separately and the rest of the species are grouped in others (Otrs).The Shannon-Weiner (H′) and Simpson (D′) diversity indexes are indicated for each region (top).(c) Stacked bars show the frequencies of producing regions where the five prevalent, but less common species were isolated.These species were isolated at 0.5%-1.0%frequencies but were present in at least 5% of the sampling sites.The number of isolates of each species is shown in parenthesis and Saccharomyces cerevisiae data is shown for reference.(d) Plot shows the frequencies of species that are significantly associated with different fermentation phases (multilevel pattern analysis, p < 0.05); all other yeasts are grouped and plotted in a single data series (gray line).The three phases of fermentation were defined by the tercile of the time at which each sample was taken with respect to the total fermentation time.
considerable when using a lenient cutoff of ≥1.5 resulting in 439 isolates with unreliable identification.To explore the diversity of yeasts in this set of non-identified strains, we sequenced the D1/D2 domains and ITS region of 22 isolates with no MALDI-TOF classification from Oaxaca (South-central region) and Durango (West I region).These distant sites were chosen because they were among the most thoroughly sampled and are considerably different in terms of climate and production practices.Analysis of the sequences showed that the strains isolated belong to three genera, Saccharomyces, Torulaspora, and Pichia.Six species were identified, four of which were previously known, while two were candidates for novel species (Table 3).
Three strains from this set of 22 isolates which had no species assignment by MALDI-TOF, showed high ITS identity to a strain isolated from fruit fly formerly classified as S. cariocanus; a MALDI-TOF data were obtained for 4062 (89.8%) samples; species were assigned for 3063 isolates (67.7%) using a stringent cutoff (Score Value > 1.7).b A minor fraction of the isolates was assigned to a different species when analyzed by ITS sequencing (Table S1).F I G U R E 3 Validation of MALDI-TOF biotyping by ITS sequencing.A subset of 474 isolates assigned to the four most abundant species were confirmed by ITS sequencing; plot shows the frequency of true (dark gray) and false (light gray) classifications.
Correct species based on ITS sequencing are provided in Table S1.
this species has been reclassified as S. paradoxus based on the low sequence divergence between the strains (Liti et al., 2009) to the new isolate.These two candidate new species will be formally described in future work.Together, these results suggest that the sub-sampled agave fermentation environment harbors novel fungal species yet to be formally described.

| Genome analysis reveals substantial intraspecific genetic diversity in Kazachstania humilis
The YMX-1.0 culture collection provides a unique biological resource for performing genomic and population studies on the various yeast species found in agave fermentations.In our study, the yeast Kz.
humilis (syn.Candida humilis, Candida milleri) was isolated with low frequency, with 24 isolates in the complete collection, but was present in seven distilleries (10%), spanning four of the seven producing regions (Table 1).To gain a better understanding of the genomic diversity of Kz. humilis from agave fermentations, we sequenced and analyzed the whole genome of four strains from distant regions.This set included the previously sequenced YMX004033 strain (García-Ortega et al., 2022).The resulting assemblies were compared with the genome sequence of the typestrain CLIB 1323 isolated from sourdough in France (Jacques et al., 2016).
The draft genome sizes of the three new Kz.humilis isolates sequenced in this study (YMX00387, YMX000554, and YMX003162) varied from 17.1 to 19 Mbp, a larger size than that of the YMX004033 and CLIB 1323 genome references (13.9 Mbp) T A B L E 3 Identification of 22 Saccharomycetales isolates from agave fermentation based on sequences of the ITS region and D1/D2 domains of the large subunit of the rRNA gene.a All strains were restreaked for single colonies before rDNA sequencing and new IDs were assigned to restreaked strains (see Data set S1).
b Query sequences for all 22 strains had 100% coverage.
c Queries were retrieved from low-depth genome sequences (see Section 2).
(Table S2).Differences in estimated genome sizes may arise from short-read assemblies of diploid heterozygous genomes.In such instances, heterozygous haplotypes might be present as two separate contigs instead of being merged in the haploid assembly.Additionally, short-read assemblies tend to have more scaffolds, in this case, 229 scaffolds on average, compared to those obtained with long-reads such as YMX004033 (21 scaffolds) and CLIB 1323 (16 scaffolds).
These discrepancies deriving from different sequencing technologies have been previously documented (Korlach et al., 2017).The average number of predicted genes for the four strains from agavefermentation was 5333, which is higher than the 4295 predicted for the type strain.However, this value aligns with the gene numbers reported for other species within the genus, including Kz. barnettii (5276), Kz. exigua (5416), Kz. unispora (5287), Kz. saulgeensis (5329), Kz. africana (5375), and Kz.naganishii (5319).It remains possible that these figures overestimate the actual gene content due to the generation of duplicates as a result of assembling haploid genome drafts from non-haploid genomes, particularly in regions of higher heterozygosity (Ko et al., 2022).
To delineate the phylogenetic position of the four Kz.humilis strains from agave fermentation, we retrieved the nucleotide sequences of five frequently used genetic markers (ITS, D1/D2 LSU rRNA, RPB1, RPB2, and EF-1α) (Schoch et al., 2012) from the genome assemblies.After concatenation, a phylogenetic tree was constructed based on the alignment with publicly available molecular marker sequences from Kazachstania species (Figure 4a).The reconstructed phylogenetic tree aligned with phylogenies previously reported for these species (Jacques et al., 2016).As anticipated, strains from agave fermentations clustered within a single clade.Notably, the Kz.humilis type strain CLIB 1323, along with two other isolates-CBS 5658 (isolated from Bantu beer in South Africa) and NRRL Y-7245 (from sourdough in the United States)-formed a neighboring clade.
To quantify the genetic relationship between the Kz.humilis strains from agave and the type strain at the genome-wide level, we determined the average nucleotide identity (ANI) between all strains with available genome sequences.The ANI values ranged from 96.5% to 99.6% (Figure 4a, inset), which aligned within the range of diversity expected for yeast strains of the same species (Gostinčar, 2020).The highest ANI value was observed between YMX000554 and YMX003162 (99.56%), both of which are agave yeast strains from the same producing region.The CLIB 1323 type strain had the lowest median ANI value when compared to all strains from agave fermentations (96.68%), a value that is close to that expected in two different species.Although analysis of more isolates of the species is needed, these differences suggest the existence of distinct population of Kz. humilis in agave and sourdough.
To uncover structural variations within the genomes of these strains, we performed a synteny analysis between the two genome sequences assembled from long-read data: the Kz.humilis type strain CLIB 1323 (Jacques et al., 2016) from sourdough in France and the agave-fermentation isolate YMX004033 from the West II producing region in Mexico (García-Ortega et al., 2022) (Table S3).Synteny Our work considerably expands previous studies of yeast communities performed in specific locations in few of the agavespirit producing regions (Kirchmayr et al., 2017;Lachance, 1995;Nolasco-Cancino et al., 2018;Páez-Lerma et al., 2013), providing an extensive and comparative view of the diversity of yeasts living in this environment.It is important to acknowledge that the description herein presented is based solely on cultured, isolated strains.
Moreover, results are likely biased by the specific enrichment conditions and the MALDI-TOF approach used for species assignment.Because of these limitations, we chose to test association of the community composition only with a handful of metadata variables using the presence and absence of the species.This is a conservative approach as differences in the relative abundance of species within samples are not considered when testing associations between metadata variables and a given community.In addition, we did not include metadata for which sampling was considerably uneven.This is the case of localities (municipality), as only about half of our samples came from localities with more than one distillery sampled.Despite the limitations and conservative approach, the systematic nature and extension of our sampling effort offer an unprecedented opportunity to contrast the diversity of yeast species isolated from agave fermentations from diverse biogeographical and cultural contexts.This suggests that factors such as the distillery or time of fermentation have a moderate but statistically significant effect on the composition of the yeast communities.Future studies using metagenomics approaches will provide a more complete description of the yeasts in these fermentations and may reveal unique features of the microbial communities of specific producing regions.In addition, the culture collection resource and our initial findings will likely fuel hypotheses-driven studies focused, for example, on understanding the specific physicochemical characteristics that result in the local differences of the yeast communities.
The YMX1.0 culture collection also opens new research avenues to understand crucial aspects of yeast biology such as genomic and functional diversity within populations that would not be possible using approaches that are not based on cultured organisms.For instance, the gene duplication of the ZWF1 gene found in a Kz.humilis isolate herein sequenced and missing in strains from other sources could be associated with tolerance to furans in this environment.
Increased expression of genes of the pentose-phosphate pathway such as ZWF1 has been shown to underlie tolerance to hydroxymethylfurfural in S. cerevisiae (Gencturk & Ulgen, 2022;Gorsich et al., 2006;Park et al., 2011).Likewise, the set of genes that are present or missing specifically in agave fermentation isolates could be candidate genetic underpinnings of niche adaptation.The availability of a large collection of strains isolated from agave fermentations will now allow experimental testing of this type of hypothesis.
In this survey, yeast strains were taxonomically classified mostly by MALDI-TOF biotyping using direct biomass transfer (García-Gamboa et al., 2021), which allows rapid identification directly from single colonies.This approach results in higher rates of incorrect species assignment compared to sequencing-based methods.We evaluated the extent of this problem by sequencing the ITS region of a subset of 474 isolates and showed that disparate species were assigned in ~19% of the cases.Notably, García-Gamboa et al. (2021) also used yeast extracts for MALDI-TOF classification, achieving 100% correct identification of P. kluyveri isolates while comparing them to ITS sequencing.In our study, incorrect classification was strongly biased to one of the species tested (P.manshurica), underscoring that the reference databases used for mass-spectrometry biotyping are yet to be improved for less studied species.This is especially relevant for the characterization of agave fermentations in which Picha species have been shown to be common.ITS and D1/D2 sequencing also revealed the occurrence of S. paradoxus in agave fermentations.As opposed to its sister S. cerevisiae, this species is relatively less common in fermentative environments (Dashko et al., 2016).Importantly, our survey revealed two new species candidates related to the Pichia clade, one of which seemed to be broadly present in two of the producing regions surveyed.Further work will be needed to characterize these isolates.These results suggest that the YMX-1.0culture collection may harbor new fungal species associated with this understudied fermentative environment.
The YMX-1.0 culture collection offers a powerful resource for population genomics of the Saccharomycetales yeasts.Analyzing the many strains of model species such as S. cerevisiae or Kl.marxianus will help to understand their evolution and natural history at the interface of natural and human-related environments.In fact, the comprehensive genome survey of S. cerevisiae by Peter et al. (2018) has already revealed widespread introgressions from the sister species S. paradoxus in the agave fermentation lineage, a characteristic only seen at a similar level in one other S. cerevisiae linage.In the case of Kl. marxianus, the genome of a strain isolated from agave juice, has been shown to form a specific clade, far apart from all other sequenced strains of this species (Ortiz-Merino et al., 2018).
The YMX-1.0 culture collection is also a powerful resource to study the diversity of nonconventional yeast species.Here, we report that Kz. humilis thrives in agave fermentations.This is one of the dominant species in type I sourdoughs (Carbonetto et al., 2020;García-Ortega et al., 2022;Wittwer et al., 2022) and one of the most abundant fungi during the preparation of a variety of fermented vegetables and beverages in China (Guan et al., 2020;Gullo, 2003;Wittwer et al., 2022).However, few occurrences of the species have been reported in the Americas.Given their genetic distance to sourdough isolates, the draft genome sequences of four of these Kz.humilis strains from agave fermentation suggested that they may belong to a different population, even when analysis of more isolates is needed to validate this hypothesis.By comparing with isolates from other environments, yeast strains in the YMX-1.0culture collection provide the opportunity to compare the evolutionary trajectories of distinct species that have been exposed to the same environment.
Most of the emphasis in the study and conservation of agave spirit production has been placed on the plants and production practices employed.However, to preserve this system integrally, we still need to consider the microbial communities that are essential for open fermentation associated with small and medium-sized distilleries.Our study provides a unique resource to grant deeper knowledge of the genomic and functional diversity of yeasts associated with this exceptional fermentation system in one of the megadiverse regions of the world.The YMX-1.0 collection directly enables the conservation of the microorganisms that are essential to produce agave spirits, an activity of crucial commercial and cultural importance for rural communities in Mexico.Future research based on our work will contribute to a better understanding of not only how the microbial communities contribute to the production of agave spirits, but also to the biology and natural history of microorganisms living at the interphase of natural environments and traditional distilleries.

AUTHOR CONTRIBUTIONS
Conceptualization: Lucía Morales, Eugenio Mancera, Alexander DeLuna.Methodology: Porfirio Gallegos-Casillas, Adriana Espinosa- Biogeographical parameters were obtained by recording the geographical coordinates of each site over the corresponding raster cartography.Elevation data were gathered at a 1 km scale resolution through the US Geological Survey GTOPO30 data set (earthexplorer.usgs.gov;entities GT30W100N40 and GT30W140N40).Environmental data were obtained from WorldClim climatic layers (worldclim.org/bioclim) at a 1 km (30″) resolution.Analyzed variables included Mean Annual Precipitation (wcbio_12), Mean Annual Temperature (wcbio_01), and Isothermality (wcbio_03); climates were classified as Aw (warm subhumid), (A)C (semiwarm), C(w) (temperate), and BS (arid and semiarid), based on Yeast isolation and establishment of the YMX-1.0culture collection Samples were thawed, and 500 μL were used to inoculate Saccharomyces isolation medium (3 g/L yeast extract, 3 g/L malt extract, 5 g/L U R E 1 (See caption on next page).

T
A B L E 2 ANOSIM test statistics based on environmental and fermentation conditions at two taxonomic and sampling unit levels.a (p value calculated with 9999 permutations, vegan in R) using Jaccard's index from the presence-absence matrix.b Bonferroni corrected significant statistics at α = 0.05/10 = 0.005 for samples and α = 0.05/3 = 0.0167 for distilleries.N.S., nonsignificant.c Not considered, as only 35/68 distilleries were in localities with more than one distillery sampled.
analysis across 13 consistent scaffolds revealed ~11.8 Mb of syntenic regions (~85.9% of the CLIB 1323 genome), 66 translocations, 137 duplications in CLIB 1323 and 71 duplications in YMX004033, 57,063 INDELs and 50,3408 SNVs.Additionally, SyRI identified 0.24 Mb of CLIB 1323 and 0.69 Mb of YMX004033 DNA sequences not aligned with the other respective genome (Figure 4b).These non-aligned regions were associated with 159 genes in YMX004033 and eleven genes in CLIB1323.Of these, 20 genes from YMX004033 and one uncharacterized gene from CLIB 1323 had no orthologous counterparts in their respective sequences.Moreover, 4 out of 21 genes present in two YMX004033 contigs, which were not considered in the synteny analysis (see Section 2), also lacked orthologs in the CLIB1323 genome.The genes unique to the agave strain YMX004033 included homologs of AZR1, CDD1, egt-2, EHT1, FLO9, FLO10, GPM3, MOG1, OPT1, PIN2, PHO3, PDX3, SOA1, SUC4, STB3, Tf2-6, THI5, ygeX, YCT1, YJR142W, YKL068W-A, and three dubious ORFs.These major structural variants further showed that the Kz.humilis strains from agave and sourdough are clearly differentiated at the genome level, with potential reduced genetic compatibility between them.Structural variation also included several segmental and dispersed duplications, with some encompassing complete open reading frames.Together, these findings indicated that the strains isolated from agave fermentations are genetically distinct, highlighting the YMX-1.0culture collection's potential for advancing research on the population genomics of nonconventional yeasts pivotal for food and beverage production worldwide.4| DISCUSSIONThe diversity of yeasts associated with agave fermentation has remained underexplored, despite the widespread biological and industrial interest in the Saccharomycetales yeasts and an increasing demand for agave spirits in the global markets.In this study, we have identified the yeast species present in this human-related environment by systematically isolating thousands of strains from traditional distilleries across Mexico.Each isolate of the YMX-1.0culture collection is linked to metadata for the fermentation sample and isolation site including biogeographical information, pH and temperature at the time of sampling, and production practices.Despite the great diversity of environmental conditions, number of agave species used, and variety of production practices, a set of six species were recurrently isolated throughout sampled distilleries.This suggests that there is a core consortium of yeast species associated with these fermentations whose presence is more strongly influenced by the common characteristics of the fermentation substrate.Nonetheless, the yeast composition of agave fermentations seems rather determined by short-term, local environmental, and production factors of each distillery.It remains to be tested whether intraspecific genetic differentiation exists across producing regions and whether there are phenotypes associated with adaptations to these fermentations.Each of the identified species may represent undescribed populations with unique gene pools, as preliminarily suggested by genomic differences of the nonconventional yeast Kz. humilis here sequenced.
Comparative genomics of Kazachstania humilis isolates.(a) Maximum-likelihood phylogenetic tree for Kazachstania species based on the alignment of the concatenated ITS, D1/D2 LSU rRNA, RPB1, RPB2 and EF-1a genes (2863 positions in total); Zygosaccharomyces rouxii CBS 732 was used as the outgroup.The four strains from agave fermentation are shown in red.The remaining Kz. humilis strains were isolated from various sources; CLIB 1323 and NRLL Y-7245 are from sourdough in France and United States, respectively, while CBS 5658 is from Bantu beer in South Africa.Numbers on the branches are bootstrap support based on 1000 replicates; only numbers higher than 50% are shown.The genetic distance scale represents substitutions per nucleotide.The matrix (inset, bottom right) shows the genome-wide pairwise average nucleotide identity (ANI) values for Kz.humilis strains with available genome sequences.(b) Comparison between the genomes of the type strain Kz. humilis CLIB 1323 (black lines) and strain YMX004033 (red lines).Figure shows syntenic (gray), duplicated (cyan), translocated (green), and inverted (orange) regions larger than 10 kb.SNP density of YMX04033 compared to CLIB 1323 is shown in gray bars.Duplicated regions resulting in gene-copy gains are indicated by magenta triangles, while duplicated regions with simple and low-complexity repeat elements are shown as purple squares.Names of duplicated genes are shown next to the genome where the duplication is observed.