Constructing and Machine Learning Calabi-Yau Five-folds

We construct all possible complete intersection Calabi-Yau five-folds in a product of four or fewer complex projective spaces, with up to four constraints. We obtain $27068$ spaces, which are not related by permutations of rows and columns of the configuration matrix, and determine the Euler number for all of them. Excluding the $3909$ product manifolds among those, we calculate the cohomological data for $12433$ cases, i.e. $53.7 \%$ of the non-product spaces, obtaining $2375$ different Hodge diamonds. The dataset containing all the above information is available at https://www.dropbox.com/scl/fo/z7ii5idt6qxu36e0b8azq/h?rlkey=0qfhx3tykytduobpld510gsfy&dl=0 . The distributions of the invariants are presented, and a comparison with the lower-dimensional analogues is discussed. Supervised machine learning is performed on the cohomological data, via classifier and regressor (both fully connected and convolutional) neural networks. We find that $h^{1,1}$ can be learnt very efficiently, with very high $R^2$ score and an accuracy of $96\%$, i.e. $96 \%$ of the predictions exactly match the correct values. For $h^{1,4},h^{2,3}, \eta$, we also find very high $R^2$ scores, but the accuracy is lower, due to the large ranges of possible values.


Introduction
Calabi-Yau manifolds are of paramount importance in string theory, since they (together with their orbifolds) are the most promising candidates for internal spaces in the compactification mechanism [1]. The simplest example of a Calabi-Yau manifold arises as the zero locus of a homogeneous polynomial in a complex projective space. The natural generalisation of this setup consists of a number of homogeneous polynomials in a product of projective spaces; under some specific conditions, the common zero locus of such polynomials defines a Calabi-Yau manifold. Spaces obtained using the above prescription are usually referred to as complete intersection Calabi-Yau's, or CICY's for short. This algebro-geometric construction, first introduced in the seminal work of Candelas [2], was a fundamental result, since it provided a systematic way of producing examples of Calabi-Yau manifolds of complex dimension three. CICY three-folds were classified (up to some equivalences, see [3]) in [4], counting 265 distinct Hodge diamonds among roughly 8000 manifolds. A number of other techniques were also devised to produce examples of Calabi-Yau manifolds [5][6][7], motivating the community to collect large (sometimes extremely large) sets of data pertaining to such spaces, arranged in massive databases of algebro-geometric information (see [8], for instance). These lists focus mainly on the Hodge numbers, since they play the roles of physical parameters in the lower-dimensional theory obtained via compactification [1,9]. The importance of such holomorphic invariants, also in the context of pure mathematics, cannot be overstated: they count the number of classes of Kähler metrics and the possible complex deformations, just to mention the two most quoted results. In the age of big data and increased computing power, the databases mentioned above also provide fertile ground for machine learning. In fact, applying neural networks to the study of the Hodge numbers of CICY three-folds was one of the earliest applications of machine learning to theoretical physics, as presented in [10]. A number of papers investigating neural network performance on Calabi-Yau three-fold datasets followed in the subsequent years [9,11,12,13].
While three-folds are suitable candidates for compactifications of theories with ten dimensions, their higher-dimensional analogues need to be considered when dealing with theories having more than ten dimensions. M-theory [14][15][16], F-theory [17][18][19][20], and S-theory [21] are the most famous examples of such theories. Motivated by the study of F-theory vacua (see [22,23]), Gray, Haupt and Lukas classified all CICY four-folds, obtaining more than 900000 inequivalent configurations, with over 4000 different sets of Hodge numbers, in [24,25]. The dataset was investigated via different machine learning techniques in [26,27] with very promising outcomes. More recently, five-folds also made an appearance in the physics literature, although studies specifically dedicated to classifying Calabi-Yau five-folds are scarce (we could only find a partial classification in [8], with incomplete Hodge information). M-theory, when compactified on a Calabi-Yau five-fold, results in an $N = 2$ supersymmetric quantum mechanics [28,29]. Moreover, Calabi-Yau five-folds play a role in F-theory, where upon compactification they provide a way to systematically construct $N = (0, 2)$ CFTs [30,31], which might allow one to classify them. Therefore, a good dataset of such Calabi-Yau manifolds is crucial in the search for relevant CFTs. Additionally, in [32], a 3D string vacuum has been constructed by compactifying S-theory on a $T^2 \times T^3$ fibered Calabi-Yau five-fold.
Having outlined the context for the present work from the string theory perspective, let us review some recent developments in the application of state-of-the-art data science techniques in this field. As mentioned above, machine learning as a tool for exploring the landscape of string compactifications on Calabi-Yau manifolds has appeared multiple times in the recent literature (see [33][34][35][36][37][38][39][40][41][42][43][44]).$^1$ These investigations uncovered new features and insights, such as a previously unobserved clustering behaviour of the Hodge numbers [13], a new approximation method [56], and hints of an unknown formula for $h^{1,3}$ [26]. Restricting to CICY's only, astonishingly good results with 100% accuracy have been obtained for four-folds in [27], and investigations in [57] show promising new results on extrapolating predictions from low to high Hodge numbers for both three-folds and four-folds. The latter technique is particularly relevant for the case of five-folds, whose exhaustive list is estimated to be astronomical in size; such an approach might allow one to extract global properties of the dataset quickly without the need to generate it all (which could be unfeasible with current computational resources).
Our work aims to be the first useful effort in classifying CICY five-folds and applying machine learning techniques to the resulting dataset. From the sizes of the CICY three-fold and four-fold datasets, we expect the five-fold one to be of the order of $10^8$. Given the computational power at our disposal, we limit ourselves to a subset of it for this first investigation.
The paper is structured as follows. In section 2 we review the mathematics behind CICY five-folds. We start by recalling general facts about Calabi-Yau manifolds and outlining the construction of such manifolds as complete intersections (sections 2.1 and 2.2). Then, section 2.3 presents the techniques that were employed for the calculation of the Hodge numbers. We build upon the analogous works in the context of three-folds and four-folds (summarised in [4,25]), which are adapted to the case under investigation in sub-section 2.3.1; but we also introduce a further and new step in the computation, described in sub-section 2.3.2, which is needed to determine the full Hodge diamond. Section 3 is devoted to outlining how the dataset was constructed. We comment on the expected size (3.1), summarise the implementation of the algorithm for the generation of configuration matrices (3.2) and discuss the algorithm for the calculation of the Hodge numbers (3.3). The subset of the dataset that we restricted ourselves to is presented in section 4, together with its properties. Section 5 introduces the architectures that we chose for the neural network analysis of the Hodge numbers. Our findings are described in section 6, where each subsection focuses on the performance of machine learning (ML) in the prediction of one of the invariants. Finally, section 7 consists of the conclusions and outlook. The appendix contains the proof of finiteness of CICY five-folds, and detailed examples illustrating the generation of the dataset and the Hodge numbers calculation.

Constructing and Characterising Calabi-Yau Fivefolds
In this section we provide a quick overview of complete intersection Calabi-Yau manifolds in products of projective spaces. For more details on Calabi-Yau spaces in general, the reader is referred to [58][59][60], while some relevant sources for the construction of CICY's are [2,61,62,63].

The Construction
As mentioned in the introduction, CICY's are defined as smooth submanifolds in a product of projective spaces. We can write such a product generically as
$$A = \mathbb{CP}^{n_1} \times \cdots \times \mathbb{CP}^{n_m},$$
and it is usually referred to as the ambient space, which we denote as $A$. In order to consistently define a non-trivial submanifold, we consider homogeneous polynomials in the ambient space. Let us label them as $p_\alpha$, with $\alpha = 1, \dots, K$. Each $p_\alpha$ depends on $m$ sets of homogeneous coordinates (one for each projective space in the ambient space). Then, in general, the common zero locus of these polynomials defines a smooth submanifold of dimension $\dim(A) - K$. It turns out that the holomorphic invariants of the spaces originating from this construction do not depend on the specific form of the homogeneous polynomials, but only on their degrees. We therefore define the following notation: the degree of the polynomial $p_\alpha$ in the $r$-th set of coordinates, associated to the $r$-th projective space, is denoted as $q^r_\alpha$. All the relevant information concerning the submanifold can thus be arranged in a configuration matrix:
$$\left[\begin{array}{c|cccc} n_1 & q^1_1 & q^1_2 & \cdots & q^1_K \\ n_2 & q^2_1 & q^2_2 & \cdots & q^2_K \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_m & q^m_1 & q^m_2 & \cdots & q^m_K \end{array}\right],$$
which is the object that is commonly used to label CICY's. Any space obtained according to the prescription above is compact and Kähler, by construction. In order to obtain a manifold of complex dimension five, the condition
$$\sum_{r=1}^{m} n_r - K = 5 \tag{2}$$
must be satisfied. Moreover, so far we have only achieved the "CI" in "CICY". In order to ensure the Calabi-Yau property (which is defined, together with its implications, in the next section), we also need to impose a constraint on the degrees of the polynomials:
$$\sum_{\alpha=1}^{K} q^r_\alpha = n_r + 1 \quad \text{for all } r = 1, \dots, m. \tag{3}$$
We conclude this section by introducing an important point, which is discussed in more detail in section 3.2: more than one configuration matrix may correspond to the same manifold. For example, it is evident that permuting columns does not change the nature of the space described, since it is just a relabelling of the polynomials. The same goes for permuting rows, which just amounts to changing the order of the projective spaces in the ambient space. These two observations already show that there is a very large amount of redundancy in the description of CICY's via configuration matrices. There are other sources of redundancy as well, which are also described in section 3.2.
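As an illustration of the two conditions stated in this section (the number of constraints must equal $\dim(A) - 5$, and the degrees in each row must sum to $n_r + 1$), here is a minimal Python sketch. The function name `is_cicy_fivefold` and the list-based encoding of the configuration matrix are assumptions made for this example, not part of the construction described above.

```python
from typing import List


def is_cicy_fivefold(n: List[int], q: List[List[int]], dim: int = 5) -> bool:
    """Check the dimension condition and the Calabi-Yau condition.

    n[r]    : dimension of the r-th projective space CP^{n_r}
    q[r][a] : degree of the a-th polynomial in the coordinates of CP^{n_r}
    (hypothetical encoding of the configuration matrix, chosen for this sketch)
    """
    if not n or len(q) != len(n):
        return False
    K = len(q[0])
    if any(len(row) != K for row in q):
        return False
    if any(d < 0 for row in q for d in row):
        return False
    # Dimension condition: dim(A) - K must equal the target dimension.
    if sum(n) - K != dim:
        return False
    # Calabi-Yau condition: each row of degrees sums to n_r + 1.
    return all(sum(row) == n_r + 1 for n_r, row in zip(n, q))


print(is_cicy_fivefold([6], [[7]]))          # degree-7 hypersurface in CP^6
print(is_cicy_fivefold([1, 5], [[2], [6]]))  # bidegree (2, 6) in CP^1 x CP^5
print(is_cicy_fivefold([6], [[6]]))          # row sum is 6, not n + 1 = 7
```

The first two configurations satisfy both conditions, while the third violates the row-sum condition.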

Basic Properties, Index Theorem and Euler Number
This section follows closely Appendix B of [28], which is itself very similar to the detailed construction of four-folds that can be found in [24].
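Let us consider a complex 5-dimensional Kähler manifold $M$. If its canonical bundle is trivial, or equivalently if its global holonomy group is contained in $SU(5)$, then we say that $M$ is a Calabi-Yau five-fold. This also implies that its first Chern class vanishes, i.e., $c_1(M) = 0$. The Hodge numbers $h^{r,s}(M)$ are the holomorphic invariants used to classify the above manifolds. $h^{r,s}$ is defined as the complex dimension of the Dolbeault cohomology group $H^{r,s}(M)$ of $M$, namely the $r$-th sheaf cohomology of the sheaf of germs of holomorphic $s$-forms:
$$h^{r,s}(M) = \dim_{\mathbb{C}} H^{r,s}(M), \qquad H^{r,s}(M) \simeq H^r(M, \Omega^s_M).$$
Let us summarise some mathematical facts that are useful for calculating the Hodge numbers of Calabi-Yau five-folds. It follows from Bochner techniques that $h^{0,p}(M) = h^{p,0}(M) = 0$ for $p = 1, 2, 3, 4$. The Calabi-Yau condition, namely the triviality of the canonical bundle, ensures that $h^{0,0}(M) = h^{0,5}(M) = h^{5,0}(M) = h^{5,5}(M) = 1$ (where we also implicitly used Poincaré duality). Finally, we have the following symmetries between Hodge numbers: $h^{p,q}(M) = h^{q,p}(M)$, by conjugation, and $h^{p,q}(M) = h^{5-p,5-q}(M)$, by Serre duality. Overall, the Hodge diamond of a Calabi-Yau five-fold is of the form
$$\begin{array}{ccccccccccc} & & & & & 1 & & & & & \\ & & & & 0 & & 0 & & & & \\ & & & 0 & & h^{1,1} & & 0 & & & \\ & & 0 & & h^{1,2} & & h^{1,2} & & 0 & & \\ & 0 & & h^{1,3} & & h^{2,2} & & h^{1,3} & & 0 & \\ 1 & & h^{1,4} & & h^{2,3} & & h^{2,3} & & h^{1,4} & & 1 \\ & 0 & & h^{1,3} & & h^{2,2} & & h^{1,3} & & 0 & \\ & & 0 & & h^{1,2} & & h^{1,2} & & 0 & & \\ & & & 0 & & h^{1,1} & & 0 & & & \\ & & & & 0 & & 0 & & & & \\ & & & & & 1 & & & & & \end{array} \tag{5}$$
Given the usual decomposition for the Betti numbers,
$$b_i(M) = \sum_{p+q=i} h^{p,q}(M), \tag{6}$$
and the Hodge diamond (5), it follows that
$$b_0 = 1, \quad b_1 = 0, \quad b_2 = h^{1,1}, \quad b_3 = 2h^{1,2}, \quad b_4 = 2h^{1,3} + h^{2,2}, \quad b_5 = 2 + 2h^{1,4} + 2h^{2,3}, \tag{7}$$
and $b_i(M) = b_{10-i}(M)$ for $i > 5$, by Poincaré duality. This, in turn, implies that the Euler number $\eta(M)$ of $M$ can be written as
$$\eta(M) = \sum_{i=0}^{10} (-1)^i b_i(M) = 2h^{1,1} - 4h^{1,2} + 4h^{1,3} + 2h^{2,2} - 2h^{1,4} - 2h^{2,3}. \tag{8}$$
As is made clear in [28], there is one more condition on the Hodge numbers that follows from $c_1(M) = 0$; its analogue for four-folds was derived in [64]. First, let us state the general form of the Atiyah-Singer index theorem,
$$\mathrm{ind}(V) \equiv \sum_{i=0}^{5} (-1)^i \dim H^i(M, V) = \int_M \mathrm{ch}(V) \wedge \mathrm{Td}(TM), \tag{9}$$
for a complex vector bundle $V$ on $M$ and with $\mathrm{ch}(V)$ being the Chern character of $V$.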
We denote the Todd class of the complexified tangent bundle of $M$ by $\mathrm{Td}(TM)$. In our case, $V$ is the bundle of holomorphic $q$-forms, $V = \Omega^q_M$ where $q = 0, 1, 2, 3$. The cohomology groups of these bundles can be related to the cohomology groups of the complex manifold $M$ as $H^i(M, \Omega^q_M) \simeq H^{i,q}(M)$. The Chern class and character of the tangent bundle are computed as
$$c(TM) = \prod_{i=1}^{5} (1 + x_i), \qquad \mathrm{ch}(TM) = \sum_{i=1}^{5} e^{x_i}. \tag{10}$$
In the above equations, we have implicitly used the splitting principle to write the bundle $TM$ as a direct sum of line bundles $L_i$, with the $x_i$'s being their first Chern classes (see [62] or [58]). Then, the integrand of the index theorem reads:
$$\mathrm{ch}(\Omega^q_M) \wedge \mathrm{Td}(TM) = \Big( \sum_{i_1 < \cdots < i_q} e^{-x_{i_1} - \cdots - x_{i_q}} \Big) \prod_{i=1}^{5} \frac{x_i}{1 - e^{-x_i}}. \tag{11}$$
Expanding this expression and re-writing it in terms of Chern classes using (10), the index theorem (9) yields one relation for each $q = 0, 1, 2, 3$, equations (12)-(15), between the alternating sum $\sum_i (-1)^i h^{i,q}$ and an integral over $M$ of a polynomial in the Chern classes $c_i = c_i(TM)$. Now, as a consequence of $c_1 = 0$ and of the symmetries in (5), we have that (12) is automatically satisfied, and (14), (15) describe the same equation. Subtracting (13) from (14), and comparing the resulting equation to (8), yields
$$\eta(M) = \int_M c_5(TM). \tag{16}$$
If we instead eliminate $c_5$, we find a constraint which depends solely on Hodge numbers:
$$11h^{1,1} - 10h^{1,2} - h^{2,2} + h^{2,3} + 10h^{1,3} - 11h^{1,4} = 0, \tag{17}$$
reducing the number of Hodge numbers needed to characterise five-folds from six to five.

Everything that was derived so far applies to any Calabi-Yau five-fold. As anticipated, we now specialise to Calabi-Yau manifolds constructed as complete intersections of polynomials in a product of projective spaces. These manifolds do not exhaust all possible Calabi-Yau's, but allow a systematic construction of many examples. We now make use of all the definitions given in the previous section, with $A$ denoting the ambient space and $M$ being the complete intersection Calabi-Yau. We let $J_r$ be the Fubini-Study Kähler form associated with one of the factors in the ambient space, i.e. $\mathbb{CP}^{n_r}$ for some $r$. We assume it to be normalised as
$$\int_{\mathbb{CP}^{n_r}} J_r^{n_r} = 1. \tag{18}$$
Then, $\eta(M)$ can be computed by using (16), together with the expansion (19) of the Chern classes of $M$ in terms of the Kähler forms, $c_n = c_n^{r_1 \cdots r_n} J_{r_1} \wedge \cdots \wedge J_{r_n}$, whose coefficients are given (again) in [28]. Computing the components $c_n^{r_1 \cdots r_n}$ is just a matter of tedious, but straightforward, summations, which can be easily implemented on a computer. It should be noted that some non-zero components are associated to a combination of Kähler forms which vanishes. For instance, consider a product of $\mathbb{CP}^1$ and $\mathbb{CP}^5$. Then, in the expansion of $c_2$, we have a term of the form $c_2^{11} J_1 \wedge J_1$. In that case, although $c_2^{11}$ might be non-zero, the term vanishes since $J_1^2$ is zero on $\mathbb{CP}^1$. Once $c_5$ is computed, and the vanishing terms in (19) have been established, we need to integrate the non-zero terms according to (16) in order to obtain the Euler number. The integration is over $M$, but we can translate it into an integral over the ambient space $A$ using the relation
$$\int_M \omega = \int_A \omega \wedge \mu, \qquad \mu = \bigwedge_{\alpha=1}^{K} \Big( \sum_{r=1}^{m} q^r_\alpha J_r \Big), \tag{20}$$
where $\omega$ is a $(5,5)$-form. By considering the combinations that vanish and the normalisation (18), the integral can be quickly evaluated.$^2$ Let us now present the techniques used for the calculation of the cohomological numbers that we focus on in our investigations.
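The Euler number computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the dataset. Instead of the component formulae of [28], it uses the standard adjunction formula $c(M) = \prod_r (1 + J_r)^{n_r+1} / \prod_\alpha (1 + \sum_r q^r_\alpha J_r)$, then multiplies the top Chern class by $\prod_\alpha (\sum_r q^r_\alpha J_r)$ and reads off the coefficient of $\prod_r J_r^{n_r}$, using the normalisation $\int_{\mathbb{CP}^{n_r}} J_r^{n_r} = 1$. Polynomials in the $J_r$ are stored as dictionaries from exponent tuples to integer coefficients; the names `poly_mul` and `euler_number` are hypothetical.

```python
def poly_mul(a, b, n):
    """Multiply polynomials in J_1..J_m, dropping terms with J_r^(n_r + 1)."""
    out = {}
    for ea, ca in a.items():
        for eb, cb in b.items():
            e = tuple(x + y for x, y in zip(ea, eb))
            if all(x <= n_r for x, n_r in zip(e, n)):
                out[e] = out.get(e, 0) + ca * cb
    return out


def euler_number(n, q):
    """Euler number of the complete intersection with ambient dimensions n
    and degrees q[r][a] (assumed adjunction-formula approach)."""
    m, K = len(n), len(q[0])
    dim = sum(n) - K                      # complex dimension of M
    zero = (0,) * m

    def unit(r):
        return tuple(int(i == r) for i in range(m))

    one = {zero: 1}
    # c(TA) = prod_r (1 + J_r)^(n_r + 1)
    c = dict(one)
    for r in range(m):
        factor = {zero: 1, unit(r): 1}
        for _ in range(n[r] + 1):
            c = poly_mul(c, factor, n)
    top = dict(one)                       # top Chern class of the normal bundle
    for a in range(K):
        L = {unit(r): q[r][a] for r in range(m) if q[r][a] != 0}
        minus_L = {e: -v for e, v in L.items()}
        # 1 / (1 + L) = sum_k (-L)^k, a finite sum because L is nilpotent
        inv, power = dict(one), dict(one)
        for _ in range(sum(n)):
            power = poly_mul(power, minus_L, n)
            for e, v in power.items():
                inv[e] = inv.get(e, 0) + v
        c = poly_mul(c, inv, n)
        top = poly_mul(top, L, n)
    # Keep the degree-dim part of c(M) and integrate over the ambient space.
    c_top = {e: v for e, v in c.items() if sum(e) == dim}
    return poly_mul(c_top, top, n).get(tuple(n), 0)


print(euler_number([6], [[7]]))   # degree-7 hypersurface in CP^6
print(euler_number([4], [[5]]))   # quintic three-fold, as a cross-check
```

The function is written for arbitrary dimension, so well-known three-fold values such as the quintic can be used as a sanity check before applying it to five-folds.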

Spectral sequences
The calculation of the Hodge numbers involves the study of two exact sequences. The first one was already considered in the literature for the cases of three-folds and four-folds (see [4,24,25,62,65]). On the other hand, the appearance of the second one is much rarer (an instance can be found in [66]), but it is the key to tackling the more complicated problem of five-folds. We present them in order.

The Adjunction Sequence
This section is based on the lower-dimensional analogues of this work, mentioned above. We refer the reader to those sources for a detailed description, while only sketching the procedure here. The analysis that we are about to present is sufficient to determine the full Hodge diamond for CICY three-folds and four-folds, but the same is not true for five-folds. The extra machinery necessary for the latter manifolds is described in the next subsection.
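Let $TA$ and $TM$ be the holomorphic tangent bundles of $A$ and $M$, respectively. The normal bundle to $M$ is defined as the following quotient: $N = TA|_M / TM$. Since $M$ is embedded in $A$ as the zero locus of some holomorphic section $\xi$, we obtain the short exact sequence:
$$0 \to TM \to TA|_M \to E|_M \to 0, \tag{21}$$
where $E$ is a holomorphic bundle over $A$. This is called the adjunction sequence, and it is the key result for studying the cohomology of CICY's. It implies that $E|_M = N$, which we can write as a sum of line bundles as
$$N = \bigoplus_{\alpha=1}^{K} O_M(q_\alpha), \qquad O_A(q_\alpha) = h_1^{q^1_\alpha} \otimes \cdots \otimes h_m^{q^m_\alpha}, \tag{22}$$
where $O_M(q_\alpha)$ is the restriction of $O_A(q_\alpha)$ to $M$ and $h_r$ denotes the hyperplane bundle over $\mathbb{CP}^{n_r}$. The long cohomology sequence associated to (21), which is at the core of our computation of the Hodge numbers, reads:
$$\begin{array}{ccccccc} 0 & \to & H^0(M, TM) & \to & H^0(M, TA|_M) & \to & H^0(M, N) \\ & \to & H^1(M, TM) & \to & H^1(M, TA|_M) & \to & H^1(M, N) \\ & & \vdots & & \vdots & & \vdots \\ & \to & H^4(M, TM) & \to & H^4(M, TA|_M) & \to & H^4(M, N) \\ & \to & \cdots & & & & \end{array} \tag{23}$$
with $H^i(M, TM) \simeq H^{5-i,1}(M)$. As usual, Serre duality and the Dolbeault theorem have been used to relate the cohomologies of the sequence to $H^{\bullet,1}(M)$. In order to determine the Hodge numbers from $h^{1,1}$ to $h^{1,4}$, one first needs to calculate the cohomologies in the right two columns. The key object for the computation is the Koszul resolution:
$$0 \to \wedge^K E^* \to \wedge^{K-1} E^* \to \cdots \to E^* \to O_A \to O_M \to 0. \tag{24}$$
In fact, by twisting the above sequence in precise ways we can determine the cohomologies valued in the tangent bundle (indirectly) and those valued in the normal bundle (directly).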
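Let us start from the former, for which we need yet another short exact sequence, the Euler sequence:
$$0 \to O_M^{\oplus m} \to R \to TA|_M \to 0. \tag{25}$$
The bundle $R$ is defined as
$$R = \bigoplus_{i=1}^{m} O_M(e_i)^{\oplus (n_i + 1)}, \tag{26}$$
where $e_i$ are the unit vectors. The associated long sequence in cohomology reads
$$\begin{array}{ccccccccc} 0 & \to & H^0(M, O_M^{\oplus m}) & \to & H^0(M, R) & \to & H^0(M, TA|_M) & & \\ & \to & H^1(M, O_M^{\oplus m}) & \to & H^1(M, R) & \to & H^1(M, TA|_M) & & \\ & & \vdots & & \vdots & & \vdots & & \\ & \to & H^5(M, O_M^{\oplus m}) & \to & H^5(M, R) & \to & H^5(M, TA|_M) & \to & 0. \end{array} \tag{27}$$
The first column on the left is easy to compute, since
$$H^0(M, O_M^{\oplus m}) \simeq H^5(M, O_M^{\oplus m}) \simeq \mathbb{C}^m,$$
and all the other cohomologies are trivial. The central column can be determined by considering (26). Specifically, we can consider the twisted version of the Koszul resolution (24) for each of the unit vectors and the associated spectral sequence, to calculate the cohomology of each $O_M(e_i)$. We note that the spectral sequence in this case is trivial, i.e., the filtration converges immediately, for all CICY three-folds (see [4]). The cohomology valued in $R$ is then obtained by summing all such contributions, each with the appropriate multiplicity. This allows us to determine the missing column, i.e., the cohomology valued in the restricted ambient space tangent bundle, which is needed in the adjunction long cohomology sequence (23).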
Regarding the normal bundle, the situation is simpler, without the need for introducing any other sequences. According to the decomposition (22), we can find the cohomology valued in the normal bundle by summing the cohomology of the line bundles twisted by each of the constraints. To calculate all these contributions, we can again consider the twisted version of the Koszul resolution, and its spectral sequence.$^3$ Usually, when considering the spectral sequences above, it is the case that they degenerate at the first page. However, this is not guaranteed in general, and the condition that prevents the degeneration is referred to as the obstruction in [4]. Before getting to it, let us point out that the procedure outlined above has already been studied in detail for the case of three-folds, and general formulae for the spectral sequences involved have been worked out. Closed expressions for the non-vanishing groups appearing in the spectral sequences, namely $E_1^{j,k}(TA)$ and $E_1^{j,k}(E)$, can be found in the appendix (again, see [65] and [4] for more details). The obstruction is present whenever there is a pair of non-vanishing groups $E_1^{j,k}(E)$, $E_1^{j',k'}(E)$ such that $j \geq j'$. This implies the presence of non-trivial maps, which makes the filtration non-trivial. More examples of obstructions appear in the next section, and the specific approach that we chose to deal with this complication is discussed in section 3.3. Finally, we recall that this machinery is not sufficient for determining the complete Hodge diamond of CICY five-folds. Hence, we need to consider yet another sequence, which brings us to the next subsection.
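The entries on the first page of these spectral sequences are ultimately built from the cohomology of line bundles on the ambient space. The closed formulae of the appendix are not reproduced here; as a hedged illustration of the basic building block, the following Python sketch uses the standard Bott formula for $H^i(\mathbb{CP}^n, O(k))$ together with the Künneth formula for products. The function names `h_proj` and `h_ambient` are assumptions made for this example.

```python
from itertools import product
from math import comb


def h_proj(n: int, k: int):
    """Dimensions [h^0, ..., h^n] of H^i(CP^n, O(k)) from the Bott formula."""
    h = [0] * (n + 1)
    if k >= 0:
        h[0] = comb(n + k, n)        # sections: degree-k polynomials
    elif k <= -n - 1:
        h[n] = comb(-k - 1, n)       # Serre dual of H^0(CP^n, O(-k-n-1))
    return h                         # all zero for -n <= k <= -1


def h_ambient(n, k):
    """Dimensions of H^i(A, O(k_1, ..., k_m)) for A = prod_r CP^{n_r},
    obtained by combining the factors with the Kunneth formula."""
    factors = [h_proj(n_r, k_r) for n_r, k_r in zip(n, k)]
    total = [0] * (sum(n) + 1)
    for degs in product(*[range(len(f)) for f in factors]):
        dim = 1
        for f, d in zip(factors, degs):
            dim *= f[d]
        if dim:
            total[sum(degs)] += dim
    return total


print(h_proj(6, 7)[0])               # number of degree-7 monomials on CP^6
print(h_ambient([1, 1], [-2, 3]))    # a mixed-sign line bundle on CP^1 x CP^1
```

Restricting such line bundles to $M$ through the Koszul resolution, including the treatment of possible obstructions, is the part that this sketch deliberately leaves out.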

The Symmetrised Adjunction Sequence
The method just described allows us to find four Hodge numbers out of six (modulo the obstruction issue, which we discuss in detail). A naive counting would tell us that we have found the full diamond, since there are two extra constraints, namely (8) and (17), coming from the Euler number and the index theorem that we discussed, respectively. However, $h^{2,2}$ and $h^{2,3}$ cannot be determined independently from those equations, since they appear in the same combination in both. Hence the need for an additional computation, based on the symmetrisation of the adjunction sequence, which allows us to find the remaining Hodge numbers. An application of such a sequence can be found in [66], where only CICY's defined on a single projective space are studied. The symmetric square of (21) yields:
$$0 \to \wedge^2 TM \to \wedge^2 TA|_M \to TA|_M \otimes N \to \mathrm{Sym}^2 N \to 0. \tag{28}$$
This sequence is not short, but it can be split into the following two short exact sequences, where $Q$ denotes an auxiliary bundle:
$$0 \to \wedge^2 TM \to \wedge^2 TA|_M \to Q \to 0, \qquad 0 \to Q \to TA|_M \otimes N \to \mathrm{Sym}^2 N \to 0. \tag{29}$$
The associated long sequences in cohomology read:
$$\cdots \to H^i(M, \wedge^2 TM) \to H^i(M, \wedge^2 TA|_M) \to H^i(M, Q) \to H^{i+1}(M, \wedge^2 TM) \to \cdots \tag{30}$$
and
$$\cdots \to H^i(M, Q) \to H^i(M, TA|_M \otimes N) \to H^i(M, \mathrm{Sym}^2 N) \to H^{i+1}(M, Q) \to \cdots \tag{31}$$
respectively. In the first sequence, the cohomology of the tangent bundle squared is related to the Hodge numbers of the Calabi-Yau, $H^i(M, \wedge^2 TM) \simeq H^{5-i,2}(M)$, again by using Serre duality, the Dolbeault theorem and the symmetries of the Hodge diamond. By straightforward generalisations of the techniques used in the previous section, we can compute $H^\bullet(M, \wedge^2 TA|_M)$, $H^\bullet(M, TA|_M \otimes N)$ and $H^\bullet(M, \mathrm{Sym}^2 N)$. Let us start from the last one. The cohomology can be found by summing the contributions coming from the line bundle twisted by every possible pair of constraints. Each of those is obtained by considering the associated version of the Koszul resolution. Analogously, $H^\bullet(M, \wedge^2 TA|_M)$ is determined by summing the cohomology of the line bundles twisted by each pair of unit vectors (see (26)). Again, all these terms are found by considering the appropriate twisted Koszul resolution. Finally, $H^\bullet(M, TA|_M \otimes N)$ is obtained by twisting the tangent bundle of the ambient space by each of the constraints, and summing their cohomologies. These three results can be plugged into (30) and (31), from which the unknown dimensions of $H^{2,2}$ and $H^{2,3}$ can be determined, most of the time. We end this section by briefly discussing when this procedure might not be sufficient for determining the desired Hodge numbers. The first immediate observation is that, under the assumption that $H^\bullet(M, \wedge^2 TA|_M)$, $H^\bullet(M, TA|_M \otimes N)$ and $H^\bullet(M, \mathrm{Sym}^2 N)$ are known, there might be too many non-zero entries in (30) and (31) to find the Hodge numbers uniquely. However, we find that this is not the case; the real obstacle is the assumption above. In fact, the computation of those three cohomologies involves the study of filtrations which are not always trivial, i.e. there might be an obstruction, as mentioned previously. For such cases, the presence of additional maps introduces extra variables, which complicates the problem. In section 3.3 we discuss how we tackled this issue when implementing the above procedure algorithmically.
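The statement that the Euler number and the index-theorem constraint only fix the difference $h^{2,2} - h^{2,3}$ can be checked with a short Python sketch. The explicit forms used below, $\eta = 2\sum_{i=0}^{4}(-1)^i b_i - b_5$ and $11h^{1,1} - 10h^{1,2} - h^{2,2} + h^{2,3} + 10h^{1,3} - 11h^{1,4} = 0$, are the forms assumed for relations (8) and (17). The Hodge numbers of the degree-7 hypersurface in $\mathbb{CP}^6$ are standard values brought in for illustration, not data from this section, and the function names are hypothetical.

```python
def euler_from_hodge(h11, h12, h13, h14, h22, h23):
    """Euler number from the six independent Hodge numbers of a CY five-fold."""
    # Betti numbers b_0..b_5 read off the Hodge diamond; b_i = b_{10-i} above.
    b = [1, 0, h11, 2 * h12, 2 * h13 + h22, 2 + 2 * h14 + 2 * h23]
    return 2 * sum((-1) ** i * b[i] for i in range(5)) - b[5]


def index_constraint(h11, h12, h13, h14, h22, h23):
    """Left-hand side of the index-theorem constraint (assumed form)."""
    return 11 * h11 - 10 * h12 - h22 + h23 + 10 * h13 - 11 * h14


# Degree-7 hypersurface in CP^6 (standard values, quoted for illustration).
septic = dict(h11=1, h12=0, h13=0, h14=1667, h22=1, h23=18327)
print(euler_from_hodge(**septic), index_constraint(**septic))

# Shifting h22 and h23 by the same amount changes neither relation, so the
# two constraints cannot separate h22 from h23.
shifted = dict(septic, h22=septic["h22"] + 5, h23=septic["h23"] + 5)
print(euler_from_hodge(**shifted), index_constraint(**shifted))
```

Both lines print the same pair of values, which is why an independent computation of $h^{2,2}$ or $h^{2,3}$ is required.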

Building a Dataset
Given a method for computing the Hodge numbers from a given configuration matrix, the next natural step is to construct a list of many such configuration matrices. The only subtle point in this procedure is to take redundancy into account. Specifically, a number of configuration matrices might describe the same space, in which case only one representative should be picked.$^4$ As described in [24], the first check is whether two matrices just differ by a permutation of rows and columns. There are also other known techniques to identify equivalent configurations (again, see [24]) which need to be implemented in order to build a dataset free of redundancies.
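For small matrices, the permutation check can be illustrated by brute force. The Python sketch below computes a canonical representative of a degree matrix under row and column permutations; it is not the efficient algorithm of [24], and the function name `canonical_form` is an assumption made for this example. Only the degrees $q^r_\alpha$ are compared, since the Calabi-Yau condition fixes each $n_r$ from its row sum.

```python
from itertools import permutations


def canonical_form(q):
    """Lexicographically smallest arrangement of q under row/column swaps.

    For every column permutation the rows are sorted, which is the best row
    ordering for that column order; the minimum over all column permutations
    is then a canonical representative of the equivalence class.
    """
    cols = range(len(q[0]))
    return min(
        tuple(sorted(tuple(row[c] for c in cp) for row in q))
        for cp in permutations(cols)
    )


a = [[1, 2], [3, 0]]
b = [[0, 3], [2, 1]]   # a with its rows and its columns both swapped
c = [[1, 2], [0, 3]]   # same row sums as a, but a different configuration
print(canonical_form(a) == canonical_form(b))
print(canonical_form(a) == canonical_form(c))
```

The cost grows factorially with the number of columns, which is acceptable for matrices up to $4 \times 4$ but not for the largest configurations.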
In this section, we comment on the global properties of the dataset that can be inferred a priori, we describe the process of generating configuration matrices and we outline the implementation of the Hodge numbers calculation.

The Size
Finiteness of CICY three-folds has been proven in [67], where the maximum size of a configuration matrix for such spaces is derived. By similar arguments, one can show that four-folds are also finite in number (see [24]), and clearly the same holds for five-folds. In fact, we find that configuration matrices for five-folds can be at most of size 25 × 30, i.e., 25 projective space factors and 30 constraints over them. The proof of this can be found in the appendix. However, the actual size of the dataset is a whole different story. The three-fold list consists of 7890 spaces, while the four-fold one has 921497 CICY's. Even though not all equivalences have been removed from those datasets, it is expected that the effective number of inequivalent configurations is of the same order of magnitude [3,24]. Now, by considering that there are just a few two-folds, we can estimate the expected size of our dataset to be of the order of $10^8$. Even with current computational power, constructing the whole dataset would be a very challenging task, and thus we will be concerned with a subset of it, as described in section 4.

Generation and Redundancies
As mentioned in section 2.1, there might be more than one configuration matrix describing the same manifold. There are different reasons for this. The most intuitive one is that exchanging rows and columns corresponds to a simple relabelling of the projective spaces and constraints, respectively. Hence, it does not affect the construction, leading to the same Calabi-Yau. This insensitivity to permutations accounts for the largest part of the redundancy in the dataset, and we will shortly outline how it has been removed in our case. However, there are (at least) two other sources of equivalence, which we did not consider in this work. According to [24], they consist of matrices related by ineffective splitting and those related by accidental identities. We refer the reader to the discussion therein for further details, and we just quickly comment on the nature of these three redundancies.
Row/column permutations relate matrices of the same dimensions and, as we mentioned, they constitute the vast majority of the equivalence class. On the other hand, the other two redundancies involve pairs of matrices with different dimensions. This feature allows one to trade small matrices for larger ones, with more zeroes, whose spectral sequence calculations simplify. We return to this point at the end of the next section, and we now focus on our algorithm for removing matrices related by permutations.
Given a maximum size for the configuration matrix, the algorithm creates the possible choices of $n$ (the column of projective space dimensions) such that the above bound is satisfied. The number of rows is simply given by the number of projective spaces, i.e., the length of $n$, while the number of columns is immediately found according to (2). The $n$'s found this way are arranged in lexicographic order, which is the zeroth way of avoiding redundancies. Then, for each element of the above list, there might be many configuration matrices satisfying the Calabi-Yau condition (3), hence defining a CICY. Since such a condition should be satisfied row by row, the algorithm finds the allowed combinations of $q^r_\alpha$ for each $r$. These sets of rows are again arranged in lexicographic order. To obtain all the possible configuration matrices out of these sets, we take their Cartesian product. Clearly, when $n_r = n_{r'}$ ($r \neq r'$), one should avoid repetitions when taking the product of the two sets, and the same is true if more than two $n_r$'s are equal. This is also implemented in the algorithm to avoid redundant calculations. The configuration matrices thus obtained all share the same column $n$, i.e., they have the same ambient space, by construction. Moreover, because of the way they were constructed, no two matrices in the list can be related by permutations of rows and columns. For the next step, we choose one of these matrices, and we generate (possibly) new CICY configurations by permuting the entries within each row. The strategy is the same as above: we create a permutation set for each row, and then take the Cartesian product of these sets, ignoring repetitions for identical rows. Now, this list might contain matrices that are related by permutations of rows and columns. To check this, we employ the efficient algorithm described in [24] (resorting to brute force when the eigenvalues are degenerate). We now repeat the last procedure, where we permute elements within each row and check equivalence, for all the matrices that share the same $n$. And, finally, we move on to the next $n$ in the list. The above procedure can be summarised schematically as follows. Given some bounds for the dimensions of the configuration matrix, we perform these steps, in order:
1. Find the $n$'s (ambient spaces) that satisfy the bounds, and order them lexicographically.
2. Pick an n, and build all the possible matrices out of the lexicographically ordered rows satisfying the Calabi-Yau condition.
3. Pick one matrix of the above, and build all the matrices obtained by permuting each row independently.
4. Identify and remove matrices that are related by permutations of rows and columns.
5. Repeat for all the matrices obtained in 2, and for all n's obtained in 1.
This procedure is best illustrated with an example, which can be found in the appendix.
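The first two steps of the list above can be sketched in Python as follows. This is a simplified illustration under assumptions, not the code used to build the dataset: ambient spaces are encoded as non-decreasing tuples of projective space dimensions (one possible way of fixing the row order), the number of constraints is taken to be $K = \sum_r n_r - 5$, and the candidate rows for a given $n_r$ are the non-increasing $K$-tuples of non-negative degrees summing to $n_r + 1$. The function names `ambient_spaces` and `sorted_rows` are hypothetical.

```python
from itertools import combinations_with_replacement


def ambient_spaces(max_rows, max_cols, dim=5):
    """Non-decreasing tuples n with 1 <= K = sum(n) - dim <= max_cols."""
    out = []
    for m in range(1, max_rows + 1):
        # No factor can exceed dim + max_cols, otherwise K would be too large.
        for n in combinations_with_replacement(range(1, dim + max_cols + 1), m):
            if 1 <= sum(n) - dim <= max_cols:
                out.append(n)
    return sorted(out)   # lexicographic order


def sorted_rows(n_r, K):
    """Non-increasing K-tuples of non-negative degrees summing to n_r + 1."""
    def rec(remaining, slots, cap):
        if slots == 0:
            if remaining == 0:
                yield ()
            return
        for first in range(min(remaining, cap), -1, -1):
            for rest in rec(remaining - first, slots - 1, first):
                yield (first,) + rest
    return list(rec(n_r + 1, K, n_r + 1))


print(ambient_spaces(2, 1))   # ambient spaces with at most 2 rows, 1 column
print(sorted_rows(1, 2))      # candidate rows for a CP^1 factor with K = 2
```

Steps 3 to 5 (permuting entries within rows and removing permutation-equivalent matrices) would then act on the Cartesian products of these row sets.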
To conclude this section, we note that building the whole dataset (up to matrices with dimensions 25 × 30) with the algorithm just described is hopeless with the computational resources at our disposal. In fact, this was already noted in [24], when constructing all CICY four-folds. They employed a version of the above algorithm to produce only matrices that correspond to spaces without $\mathbb{CP}^1$ factors. The rest of the dataset is obtained by performing effective splittings to produce inequivalent configurations, until it is no longer possible to do so. This is a natural extension of our work, as we discuss in the next section.

Algorithm for the Hodge Numbers
We now outline our implementation of the procedure described in section 2.3. Regarding the (standard) adjunction sequence, we implement the calculation using the formulae presented in [65], which can be found in the appendix. They describe the first page of the spectral sequences associated to the ambient space tangent bundle and the normal bundle, which allow us to determine the cohomologies that appear in (23). As we discussed in section 2.3, it is not always the case that the spectral sequence degenerates at the first page (see the obstruction mentioned above). In fact, when there are maps connecting two non-zero entries we need to introduce a new variable for each of those, representing the dimension of the kernel of the map. We refer to all of the extra variables that might appear in the computation, but are not the Hodge numbers, as supplementary variables. Hence, in general, one obtains more unknowns than the four Hodge numbers appearing in (23), with a number of equations that depends on how many zeroes appear in the sequence and on their positions. For each configuration matrix, our algorithm determines such a system of equations and proceeds to find the unique solution whenever the system is well-posed. When it is not well-posed, we still proceed to find a number of solutions, because it might be the case that the system is underconstrained only in the supplementary variables.$^5$ Solutions are found within a region bounded by the inequalities that follow from the cohomology sequence. Specifically, for any exact sequence $\cdots \to A \to B \to C \to \cdots$, exactness implies $\dim B \leq \dim A + \dim C$. Applying this to the long cohomology sequence (23) yields bounds on the variables involved, ensuring finiteness of solutions. To solve the system of linear equations, with the above bounds, we employ MiniZinc. Once the software has found a number of solutions, we can identify three possible scenarios:
• The solution is unique.
• The solution is not unique, in which case there are two further sub-cases.
-The supplementary variables decouple, and the solution is unique in the Hodge numbers.
-There are solutions with different sets of Hodge numbers.
In the first case and in the first sub-case, the Hodge numbers are successfully determined. In the second sub-case, on the other hand, our approach fails to provide a definite numerical answer. All the configuration matrices for which we were not able to determine the full Hodge diamond fall into this category. Clearly, this does not mean that their Hodge numbers cannot be calculated. The way to proceed is to consider the ineffective splitting equivalence. This equivalence allows one to trade small matrices for larger ones, which describe the same manifold but significantly simplify the calculation of the cohomological properties. The ineffective splitting is ubiquitous in the CICY literature (see [4,24,68], for instance), and it is a very powerful tool. Preliminary investigations show that it is as useful for five-folds as it was for the lower-dimensional analogues. However, we do not apply it to our sub-dataset, in light of the considerations made at the end of the previous section. It is more natural to implement the ineffective splitting, together with the effective splitting algorithm, in a unified effort to produce and classify a larger set of CICY five-folds. Hence, we leave this for future work.
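The three scenarios can be illustrated with a small brute-force sketch. It is not the MiniZinc model used for the dataset: the enumeration strategy, the toy system, the variable names and the bounds below are all assumptions, chosen only to show how a set of solutions is grouped according to its Hodge-number part.

```python
from itertools import product

def solve_bounded(variables, bounds, equations):
    """Enumerate every integer assignment inside `bounds` satisfying all equations.

    `bounds` maps a variable name to an inclusive (low, high) pair; each equation
    is a pair (coeffs, rhs) encoding sum(coeffs[v] * v) == rhs.
    """
    ranges = [range(bounds[v][0], bounds[v][1] + 1) for v in variables]
    solutions = []
    for values in product(*ranges):
        point = dict(zip(variables, values))
        if all(sum(c * point[v] for v, c in coeffs.items()) == rhs
               for coeffs, rhs in equations):
            solutions.append(point)
    return solutions

def classify(solutions, hodge_vars):
    """Sort a solution set into the scenarios listed in the text."""
    if not solutions:
        return "no solution"
    if len(solutions) == 1:
        return "unique"
    # Project each solution onto the Hodge numbers only.
    hodge_sets = {tuple(s[v] for v in hodge_vars) for s in solutions}
    if len(hodge_sets) == 1:
        return "unique in the Hodge numbers"
    return "undetermined"

# Hypothetical toy system: h is a Hodge number, x and y are supplementary.
toy = solve_bounded(
    ["h", "x", "y"],
    {"h": (0, 5), "x": (0, 1), "y": (0, 1)},
    [({"h": 1}, 2), ({"x": 1, "y": -1}, 0)],
)
print(classify(toy, ["h"]))  # unique in the Hodge numbers
```

Exhaustive enumeration is only viable for tiny bounds; a constraint solver is the sensible choice for the actual systems.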

Properties of the 4 × 4 Dataset
In this section, we present and comment on the properties of the dataset constructed according to the techniques just described. It is available at https://www.dropbox.com/scl/fo/z7ii5idt6qxu36e0b8azq/h?rlkey=0qfhx3tykytduobpld510gsfy&dl=0 . Due to the expected astronomical size of the full dataset (see section 3.1), we chose to restrict ourselves to matrices with dimensions up to $4 \times 4$. This yielded 27068 configuration matrices, out of which 3909 are products. We believe that this number is large enough to exhibit some general properties, and it is definitely large enough to allow an efficient application of machine learning techniques. In fact, very promising results were obtained in [26] for CICY four-folds, when restricting to configuration matrices with size up to $4 \times 4$. We computed the Euler number for all the configuration matrices in the dataset, and we have been able to compute the full Hodge diamond for 12433 matrices, i.e. 53.7% of the non-product matrices.
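As a quick consistency check on the figures just quoted (pure arithmetic on the numbers in the text, with no additional input), the number of non-product matrices and the quoted fraction follow as:

```latex
27068 - 3909 = 23159, \qquad \frac{12433}{23159} \approx 0.537 .
```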
Let us now present some general properties of our results. The distributions of the Hodge numbers and the Euler number are shown in figure 1, where we also plot the mean, upper bound and lower bound.
Figure 1: These plots show the distribution of the invariants that we computed. Since $h^{1,1}$ has the smallest range, it is the most suitable piece of data in the machine learning context. Conversely, we can anticipate that the wide ranges of $h^{1,4}$, $h^{2,2}$ and $h^{2,3}$ make it very hard for the neural network to predict those numbers exactly. Finally, it is interesting to note how $h^{1,2}$ and $h^{1,3}$ vanish in the vast majority of cases, resulting in a very skewed sampling.
The number of values that each invariant can take, for configuration matrices up to $4 \times 4$ representing non-product spaces, is given in table 1 below. For reference, there are 265 different Hodge diamonds among all CICY three-folds, and 4417 distinct Hodge sets in the CICY four-folds dataset. It is also interesting to compare our work with previous results on lower-dimensional Calabi-Yau spaces. We do so in figure 2, where a lot of information is encoded.
Figure 2: The plots represent Hodge data of Calabi-Yau spaces of different dimensions, obtained via different constructions. CICY three-folds and CICY four-folds data were taken from [69] and [70], respectively. The toric hypersurfaces data can be found at [8].
We start by reproducing the (famous) scatter plot of Calabi-Yau three-folds constructed as toric hypersurfaces, in orange, and as complete intersections, in red. This can be found in [71], for instance. By the notation Hodges, we mean the sum of the non-trivial Hodge numbers. For three-folds, they are $h^{1,1}$ and $h^{1,2}$, and the Euler number reads $\eta = 2(h^{1,1} - h^{1,2})$. For four-folds, the non-trivial Hodge numbers are $h^{1,1}$, $h^{1,2}$, $h^{1,3}$, $h^{2,2}$, with the Euler number given by $\eta = 4 + 2h^{1,1} - 4h^{1,2} + 2h^{1,3} + h^{2,2}$; the associated plot is shown in the top-right. A detailed analysis of the properties of the cohomological data of CICY four-folds can be found in [25]. Finally, we represent the manifolds of complex dimension three, four and five on a single plot, by employing logarithmic axes. This is shown in the bottom plot, where an additional line is drawn to indicate how the asymptotic behaviours of three-folds and five-folds match exactly. They both cluster along this line. This analysis shows one of the interesting investigations that naturally follow from our results. The knowledge of the holomorphic invariants for five-folds allows one to study Calabi-Yau properties across the first two non-trivial odd complex dimensions. This is the subject of ongoing work, aimed at unveiling such a connection.
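The two Euler number formulae quoted above, together with the "Hodges" sum used on the plot axes, can be written as a minimal sketch. The function names are hypothetical, and the sample Hodge numbers (the quintic three-fold and the sextic four-fold) are quoted from the standard literature rather than taken from the datasets discussed here.

```python
def euler_threefold(h11, h12):
    """Euler number of a Calabi-Yau three-fold: 2 (h11 - h12)."""
    return 2 * (h11 - h12)

def euler_fourfold(h11, h12, h13, h22):
    """Euler number of a Calabi-Yau four-fold: 4 + 2 h11 - 4 h12 + 2 h13 + h22."""
    return 4 + 2 * h11 - 4 * h12 + 2 * h13 + h22

def hodge_sum(*hodge_numbers):
    """Sum of the non-trivial Hodge numbers ("Hodges" in the text)."""
    return sum(hodge_numbers)

print(euler_threefold(1, 101))          # quintic three-fold: -200
print(euler_fourfold(1, 0, 426, 1752))  # sextic four-fold: 2610
print(hodge_sum(1, 101))                # 102
```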

Neural Network analysis
In this section, we describe the neural network architectures that we employed for the investigations on the dataset. We were inspired by some recent applications of ML techniques to the lower-dimensional CICY datasets [9,11,12,26,27,38]. Before doing that, we list some choices concerning the analysis of the data that we present in this section. The data were split into training, validation and test sets in the proportions 70%, 20% and 10%, respectively. All matrices have been padded to the maximum dimension of $4 \times 4$. The batch size was set to 128. All the investigations were run on a MacBook Air M1 with 16GB RAM and 8 cores. We employed PyTorch, which uses the Metal Performance Shaders (MPS) backend for GPU training acceleration.
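The preprocessing choices listed above can be sketched in plain Python. Several details are assumptions not stated in the text: zeros are appended to the right and at the bottom of each matrix, only the degree entries (not the ambient dimensions) are padded, and the split is a seeded random shuffle. The function names are hypothetical.

```python
import random

def pad_matrix(matrix, size=4):
    """Zero-pad a matrix (list of rows) to size x size, keeping it top-left."""
    padded = [[0] * size for _ in range(size)]
    for i, row in enumerate(matrix):
        for j, entry in enumerate(row):
            padded[i][j] = entry
    return padded

def split_dataset(samples, percentages=(70, 20, 10), seed=0):
    """Shuffle and split into training, validation and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percentages[0] // 100
    n_val = len(shuffled) * percentages[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

print(pad_matrix([[1, 1, 1], [0, 0, 7]]))
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 20 10
```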

Classifier
A classifier neural network is most efficient when the number of classes is much smaller than the size of the dataset, and the samples are distributed roughly uniformly among them. For these reasons, this kind of architecture is particularly suitable for learning $h^{1,1}$. The most natural choice, following [26], can be summarised as: where $N_h$ is the number of possible $h^{1,1}$ values, i.e. the number of classes. In our case $N_h = 16$, which gives a total of 209936 parameters. For the whole CICY four-folds dataset, where $N_h = 24$, the authors of [26] report a 96% accuracy, with 20% (15 + 5) of the dataset used for the training session. Motivated by their results, we employ this architecture for our investigations on $h^{1,1}$.
Regarding the other Hodge numbers, we modify the above architecture slightly, to the following neural network: where $N_H$ is the number of classes for the higher Hodge numbers (which can be found in table 1).

Linear Regressor
Moving to the realm of regressors, we begin with those of the simplest type, i.e. linear regressors. Since such architectures do not rely on a fixed number of classes, they can be more easily adapted to different problems. Again, by following the work presented in [26], we take the network as the starting point for learning $h^{1,1}$. It is composed of 177329 parameters. For the four-fold case, the accuracy reached by this network is roughly 93%. We use exactly this network for investigating $h^{1,1}$ in our case. For higher Hodge numbers, we employ the slightly larger neural network $N_{4*4}(1024, s, 1024, s, 512, s, 64, s, 16, s, 1)$ (37), with 1625697 parameters.
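The parameter count quoted for network (37) can be reproduced with a short sketch. It assumes that the $4 \times 4$ input is flattened to 16 features, that every layer is fully connected with a bias term, and that the entries denoted by $s$ are activations carrying no trainable parameters; the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Flattened 4 x 4 input followed by the widths listed in (37).
print(count_parameters([16, 1024, 1024, 512, 64, 16, 1]))  # 1625697
```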

Convolutional Regressor
The last type of architecture that we employed for our investigations is the convolutional regressor. This choice is motivated by the results in [38]. Specifically, the neural network described therein can be written as involving 575841 trainable parameters. In the prediction of $h^{1,1}$ for three-folds, the authors of [38] report an accuracy of 94% for this network. We test it on all the Hodge numbers.

Results
In this section, we present the results of our ML investigations on the cohomological and topological properties contained in the dataset with the architectures just described. We mainly focus on two measures: the standard $R^2$ score and the accuracy, defined as: $\text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions}}$. (39) It is generally the case that $h^{1,1}$ is learnt with higher precision and accuracy, compared to the other Hodge numbers (see [13,26,27,38,56]). Our investigation is no exception; we find that, by simply applying the architectures that were devised for CICY four-folds, very good results are obtained. This is a very promising result, showing that for CICY five-folds, in line with the results for three-folds and four-folds, machine learning is very effective in predicting the lowest non-trivial Hodge number. One of the reasons for this performance is the limited range of $h^{1,1}$, shown in figure 1.
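The two measures can be sketched in plain Python as follows. The accuracy follows definition (39); rounding a regressor output to the nearest integer before comparing it with the target is an assumption, since the text does not specify how real-valued outputs are matched against integer Hodge numbers. The $R^2$ score uses the standard definition $1 - SS_{\rm res}/SS_{\rm tot}$.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that, once rounded, exactly match the target."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t)
    return correct / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and regressor outputs.
targets = [1, 2, 3, 4]
outputs = [1.1, 1.9, 3.2, 2.6]
print(accuracy(targets, outputs))  # 0.75
print(r2_score(targets, outputs))
```

The example shows how the two measures can disagree: three out of four predictions are exact after rounding, while the single large miss dominates the $R^2$ score.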

$h^{1,2}$
As can be seen in figure 1, the distribution of this Hodge number is heavily skewed towards zero, because the vast majority of the $h^{1,2}$'s vanish. Probably due to these features, we find that the neural network struggles in the training process, which is evident from the negative $R^2$ measure. This is shown in table 3.
In fact, $h^{1,2}$ is zero 97.7% of the time, which makes the distribution of the samples unsuitable for ML purposes, since the network does not have enough non-zero values for training. This percentage also shows that, as we expect, the classifier is doing no better than guessing $h^{1,2} = 0$ by default.
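The comparison with a default guess can be made concrete with a small sketch of a constant predictor. The sample below is hypothetical (977 zeros and 23 ones, mimicking only the quoted 97.7% fraction, not the actual distribution). For any constant guess $c$ one has $SS_{\rm res} = SS_{\rm tot} + N(\bar{y} - c)^2$, so the $R^2$ score of such a predictor can never be positive.

```python
def constant_baseline(y_true, guess=0):
    """Accuracy and R^2 score of a predictor that always outputs `guess`."""
    n = len(y_true)
    acc = sum(1 for t in y_true if t == guess) / n
    mean = sum(y_true) / n
    ss_res = sum((t - guess) ** 2 for t in y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return acc, 1 - ss_res / ss_tot

# Hypothetical sample: 97.7% of the targets vanish.
sample = [0] * 977 + [1] * 23
acc, r2 = constant_baseline(sample)
print(acc)      # 0.977
print(r2 < 0)   # True
```

A trained network is only informative on such a sample if it beats both numbers.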

$h^{1,3}$
The distribution of $h^{1,3}$ is similar to that of $h^{1,2}$, with slightly fewer vanishing values. We observe a slight improvement in the $R^2$ score, but the inadequacy of the samples, discussed in the previous case, still applies. This is illustrated in table 4.
We find that 95.7% of the $h^{1,3}$'s vanish, showing again that the classifier is probably always predicting the class $h^{1,3} = 0$.

$h^{1,4}$
In this case, the distribution is very different from the previous two. $h^{1,4}$ takes values in a much larger range, spanning three orders of magnitude. We first present the results, in table 5, then comment on their interpretation.
These results hint at a very simple behaviour: the neural network is learning the pattern behind the Hodge number (hence the very high $R^2$ score), but is not able to predict the exact integer. This is a reasonable outcome, since the range is very large, as we mentioned, and the size of the training set is probably not optimal. However, despite the failure to give exact predictions, the network is providing good approximations. To give a quantitative reference for this, we tested an additional measure, which is reported in the last column of the table. The accuracy with 10% tolerance counts how many predictions differ from the actual value by less than 10% of it, and divides their number by the total number of predictions. Hence, the results show that the regressor neural networks are indeed able to approximate the Hodge numbers well, even though they do not manage to predict their exact values.
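The accuracy with 10% tolerance described above can be sketched as follows. The function name and the sample values are hypothetical; the strict inequality follows the wording "less than 10% of it", which also means that a vanishing target is never counted as a match under this measure.

```python
def accuracy_with_tolerance(y_true, y_pred, tolerance=0.10):
    """Fraction of predictions within `tolerance` (relative) of the target."""
    within = sum(1 for t, p in zip(y_true, y_pred)
                 if abs(p - t) < tolerance * abs(t))
    return within / len(y_true)

# Hypothetical targets and regressor outputs.
targets = [100, 200, 1000, 50]
outputs = [105, 150, 1090, 56]
print(accuracy_with_tolerance(targets, outputs))  # 0.5
```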

$h^{2,2}$
The distribution of $h^{2,2}$ roughly interpolates between those of $h^{1,2}$ and $h^{1,3}$ and that of $h^{1,4}$; most of the values are close to zero (even though only 5 of them are actually zero) and the upper bound is 510, giving it quite a wide range. We find that the machine learning performance is quite poor with respect to both measures, as shown in table 6 below.

$h^{2,3}$
Our findings show that $h^{2,3}$ has the largest number of possible values and spans the widest range among the Hodge numbers. Its distribution is very similar to that of $h^{1,4}$, scaled by a factor of 10. Accordingly, we observe similar results, with a very high $R^2$ measure but very low accuracy, shown in table 7.
Again, by employing the accuracy with 10% tolerance measure, introduced for the $h^{1,4}$ investigation, we show that the neural network is able to approximate the Hodge numbers well.

η
We end this section by presenting the investigation on the Euler number. Its range is again quite extended, which makes classification efforts ineffective, as can be seen from table 8 below.
In this case, the linear regressor also gives very poor results, with a low $R^2$ score and zero accuracies. The convolutional regressor, on the other hand, is still performing as a good approximator, with a very high $R^2$ and a promising 83% accuracy with 10% tolerance.

Conclusions
We presented a partial classification of complete intersection Calabi-Yau (CICY) five-folds and their Hodge diamonds. Due to their importance in compactifications of F-theory and M-theory, these spaces have appeared multiple times in the physics literature in the last twenty years. Lower-dimensional CICY's, with analogous applications in the dimensional reduction of string theory, have already been classified in previous works, and their Hodge numbers have been computed. The key mathematical fact employed for the computations in those cases is the existence of the short exact adjunction sequence. However, in complex dimension five, this is not sufficient to determine the complete Hodge diamond of CICY's. We showed that, by also considering the symmetrised version of the above sequence, it is possible to determine the full Hodge diamond. On the other hand, the generation of configuration matrices does not differ from the lower-dimensional analogues. Both procedures (the generation of configuration matrices and the computation of Hodge numbers) were implemented on a computer, and applied to a subset of all the possible spaces. This subset was chosen according to the following results and considerations. We derived the maximum size for the configuration matrix, which is $25 \times 30$, and estimated the expected size of the dataset to be of the order of $10^8$. Since this number, with our current algorithm, makes the construction of the whole dataset unfeasible, it was natural to restrict ourselves to a subset of it. Motivated by the recent machine learning investigations presented in [26], we chose to focus on all configuration matrices with size up to $4 \times 4$, a subset that yielded promising results in the above work based on four-folds. We obtained 27068 matrices, inequivalent under permutations of rows and columns, for all of which we calculated the corresponding Euler number. Out of these, 3909 were excluded from the rest of our investigations, because they represent product spaces. Of the remaining 23159 non-product matrices, we computed the full Hodge diamond for 12433 (53.7%), finding 2375 different sets of Hodge numbers. These figures are large enough for machine learning techniques to be effective. Hence, we attempted to learn all the non-trivial Hodge numbers, namely $h^{1,1}$, $h^{1,2}$, $h^{1,3}$, $h^{1,4}$, $h^{2,2}$, $h^{2,3}$, and the Euler number $\eta$, with three types of architectures: classifier, linear regressor and convolutional regressor. In agreement with the machine learning investigations on the lower-dimensional CICY datasets, we find that $h^{1,1}$ can be learnt to very high levels of accuracy, with the best result being achieved by the convolutional regressor. For this architecture, we find an $R^2$ score of 91% and an accuracy, defined as the ratio of the exact predictions to the total predictions, of 96%. This result corroborates what was found by analogous investigations for three-folds and four-folds, and the success of ML is partially due to the fact that the range of $h^{1,1}$ is relatively small. The opposite is true for $h^{1,4}$, $h^{2,3}$ and $\eta$, with each of their distributions spanning several orders of magnitude. This makes any classification effort almost hopeless, but it is still sensible to apply regressor neural networks. For those, we find very high $R^2$ scores for the invariants above, but, again due to the size of the range, only a very small fraction of the predicted values exactly match the Hodge numbers. However, we find that around 85% of the predictions of the convolutional regressor differ by less than 10% from the exact result, showing it to be an efficient approximator. Due to their distributions, the remaining Hodge numbers did not yield promising results. A more complete set of samples would be beneficial for a proper training of the neural network. This brings us to the final considerations about this work, and its possible extensions.
The construction of the full CICY five-folds dataset, which is the most natural development of our work, is itself a very challenging task. As we mentioned, because of its estimated size of the order of $10^8$, building it completely is likely to be computationally very heavy. Machine learning could help make this process more manageable, and could be useful to quickly extract information on the full dataset, such as bounds for the Hodge numbers, by constructing only part of it and predicting the rest. Progress in this direction, i.e. extrapolating predictions from low to high Hodge numbers, has been made for CICY three-folds and four-folds in [57]. This new collection of data would also allow for an investigation of the properties of CICY's cohomological numbers across different complex dimensions, both with and without neural networks. This has never been done before, and it has the potential to unveil unknown properties and reveal a new predictive power. Moreover, we believe that substantial improvements in the ML performance could also be reached with the current subset of data. Applying the inception network developed for CICY four-folds in [27] would be a promising starting point in that direction. Finally, another interesting extension consists of a detailed analysis of the topological properties in our dataset. This would include computing the intersection numbers, and other investigations along the lines of [25]. Such a study would also have a particular relevance from the phenomenological perspective, in that it would identify the elliptically fibered spaces.

A.1 Finiteness of the CICY Five-folds
In this section, we present the proof of the bounds on the configuration matrices stated in section 3.1. The first step is to isolate the $\mathbb{CP}^1$ factors in the ambient space, and we do so by adopting the following notation: Since we have $f$ projective spaces of dimension 1 and $F$ projective spaces of dimensions Then, according to (2), we need $K$ constraints such that to construct a five-fold. The Calabi-Yau condition (3) translates into: Without loss of generality, we can assume that all constraints obey: From the equations above, we obtain which gives: Let us now find an upper bound for the quantity $\sum_{j=f+1}^{f+F} (n_j - 1)$. By using (41) again, and the above inequality, we get: This allows us to infer the following inequality: and also Hence, we have just derived an upper bound for the number of projective spaces with dimensions greater than one. To determine the analogous bound for the number of $\mathbb{CP}^1$'s, we need to make a few considerations on the equivalence between certain configuration matrices. Let us first observe that for generic $a$, $b$, $X$, $M$. Hence, we can assume that any bilinear constraint that involves a $\mathbb{CP}^1$ factor must also involve a $\mathbb{CP}^{n_j}$ factor with $j \in \{f + 1, ..., f + F\}$ (if it were to involve two $\mathbb{CP}^1$ factors, we could apply the equivalence above). Let us assume that there are $t$ constraints of this type, which we denote as All these constraints involve one of the $\mathbb{CP}^{n_j}$ factors with $j \in \{f + 1, ..., f + F\}$, hence we have that Let us make another remark on the general form of the configuration matrices of the form (49) (left hand side). Suppose that the constraint on the left is again of degree two. This time, assume that it is of degree two over one $\mathbb{CP}^1$, and therefore trivial everywhere else. This implies that the configuration matrix is block diagonal, having the form of a product space. Hence, when considering a constraint with degree greater than one over the one-dimensional projective spaces, labelled by some $\alpha$, we impose This ensures that more than one $\mathbb{CP}^1$ factor is involved. We denote by $s$ the sum of the degrees over the $\mathbb{CP}^1$ factors coming from all constraints of this type. Then, by definition, we have that $t + s = 2f$. Now, let us focus on the following quantity: In the light of (43), if all the constraints in the configuration have degree less than or equal to three, then this quantity counts the number of constraints with degree three. Let us outline what happens with constraints of higher degree. Any constraint of degree $g > 3$, together with $g - 3$ constraints of degree two, can be exchanged for $g - 2$ constraints of degree three. In order to find which of the two cases (before or after the exchange) maximises $s$, we look at both. In the first case, the maximum value of $s$ is $g$ (recall the condition (52)). In the second case, $s$ can be up to $3(g - 2)$, which is greater than $g$ since $g > 3$. Hence, the maximum value of $s$ is attained when all the constraints have degree 3, which yields This implies that Hence, this gives $f + F \le 25$. Since we had that $\sum_{j=f+1}^{f+F} n_j \le 20$ (see (48)), then an upper bound for the dimensions of the ambient space is given by dim According to (41), the maximum number of constraints is therefore 30. Summarising, in complex dimension five we have that: the maximum size of a configuration matrix describing a CICY is $25 \times 30$.
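The exchange used in the last step can be checked by simple bookkeeping. The block below is only a consistency check, under the assumption that the exchange preserves both the number of constraints and their total degree; the symbols are those of the text.

```latex
\underbrace{1 + (g - 3)}_{\text{constraints before}} = \underbrace{g - 2}_{\text{constraints after}},
\qquad
\underbrace{g + 2\,(g - 3)}_{\text{total degree before}} = 3g - 6 = \underbrace{3\,(g - 2)}_{\text{total degree after}},
\qquad
3\,(g - 2) - g = 2\,(g - 3) > 0 \quad \text{for } g > 3 .
```

The last relation is the statement that $3(g - 2)$ exceeds $g$ whenever $g > 3$, which is what selects the all-cubic configuration as the one maximising $s$.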

A.2 Generation Algorithm Exemplified
In this section, we present an example that illustrates the steps involved in the algorithm for the generation of inequivalent matrices.
Let us choose $3 \times 3$ as the maximum dimension of the configuration matrix, by which we mean that the matrix $q^r_\alpha$ should be at most $3 \times 3$. Then, the algorithm finds the possible ambient spaces, $n$, which satisfy this condition, which are: m = 3 : .
Let us pick one of these to illustrate the next step. To fit this example within a page or so, let us choose $n = \begin{bmatrix} 2 \\ 2 \\ 3 \end{bmatrix}$.
The algorithm now finds the sets of rows satisfying the Calabi-Yau condition (3), but it does so while also taking into account that some projective spaces may be identical. Specifically, the first and second factors are the same, and this is taken into account to avoid redundant elements in the set. Hence, we generate all valid inequivalent polynomials living in the first two ambient spaces, and all valid inequivalent polynomials living in the last ambient space; once arranged in lexicographic order, the list reads: The next step consists of taking the Cartesian product of the above sets, in order to find all the possible configuration matrices that can be built out of them. We find: where, again, we ordered the matrices lexicographically. Note that, since we keep the lexicographic order in every step, the matrices generated at this stage cannot be related by permutations of rows and columns, by construction. Now, we have to pick one of them in order to proceed to the next step, and (for illustrative purposes) we choose For this matrix, we perform all permutations within the same row, again avoiding repetitions when two rows are identical, and we obtain: The above set can contain matrices that are related by permutations of rows and columns, and this is checked using the algorithm of [24], with brute force comparison when the eigenvalues are degenerate. This yields only three inequivalent matrices: Now, the same steps must be performed for the other matrices listed in (60), before moving to another $n$ from (58).
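Two ingredients of the procedure above can be sketched in a few lines: checking that a configuration describes a Calabi-Yau five-fold, and deciding whether two matrices are related by permutations of rows and columns. This is not the algorithm of [24]: the equivalence test below is a plain brute-force canonical form, viable only for small matrices, and the conditions checked (dimension $\sum_r n_r - K = 5$ and row sums $\sum_\alpha q^r_\alpha = n_r + 1$) are the standard CICY conditions, assumed to be those referred to as (2) and (3). All function names and sample matrices are hypothetical.

```python
from itertools import permutations

def is_cicy(ambient, matrix, dimension=5):
    """Check the dimension and Calabi-Yau conditions for a configuration.

    `ambient` lists the n_r; `matrix` holds one row of degrees q^r_alpha
    per projective space and one column per constraint.
    """
    n_constraints = len(matrix[0])
    right_dimension = sum(ambient) - n_constraints == dimension
    calabi_yau = all(sum(row) == n + 1 for n, row in zip(ambient, matrix))
    return right_dimension and calabi_yau

def canonical_form(ambient, matrix):
    """Lexicographically smallest representative under row/column permutations.

    Each row travels together with the dimension of its projective space.
    """
    rows = [(n, tuple(row)) for n, row in zip(ambient, matrix)]
    n_cols = len(matrix[0])
    best = None
    for row_perm in permutations(rows):
        for col_perm in permutations(range(n_cols)):
            candidate = tuple((n, tuple(row[c] for c in col_perm))
                              for n, row in row_perm)
            if best is None or candidate < best:
                best = candidate
    return best

def equivalent(ambient1, matrix1, ambient2, matrix2):
    """True if the two configurations differ only by row/column permutations."""
    return canonical_form(ambient1, matrix1) == canonical_form(ambient2, matrix2)

n = [2, 2, 3]
m1 = [[3, 0], [1, 2], [2, 2]]
m2 = [[2, 1], [0, 3], [2, 2]]   # m1 with the first two rows and the columns swapped
m3 = [[3, 0], [0, 3], [2, 2]]
print(is_cicy(n, m1), equivalent(n, m1, n, m2), equivalent(n, m1, n, m3))
```

The cost grows factorially with the matrix size, which is why a smarter comparison is needed for anything beyond small configurations.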

A.3 Hodge Number Computation Exemplified
We provide the explicit steps of the computation outlined in sections 2.3.1, 2.3.2 and 3.3 for a specific example: the CICY five-fold described by $\left[\begin{array}{c|ccc} 2 & 1 & 1 & 1 \\ 6 & 0 & 0 & 7 \end{array}\right]$. (64)
As we mentioned in the relevant section, general formulae for the first page in the spectral sequences associated to the tangent bundle and normal bundle have been presented in [65] (as well as in [4]). We refer the reader to the above article, together with the more pedagogical treatment in [62], for more details. We just report the main definitions and results here. Let us first clarify the notation. Given a holomorphic vector bundle $\mathcal{V}$ over $\mathcal{A}$, we denote the $i$-th page with $E_i^{j,k}(\mathcal{V})$, and the $i$-th differential is of the form The first page is defined as: with the subsequent ones being $E_{r+1}(\mathcal{V}) = H(E_r(\mathcal{V}), d_r)$. For our problem, we are interested in the special cases of $\mathcal{V}$ being $\mathcal{E}$ and $T_{\mathcal{A}}$. As we mentioned, the first pages of both sequences have already been worked out in [65] (or, alternatively, in [4]), and they read: $H^{\gamma_r}\big(\mathbb{CP}^{n_r},\, T(\mathbb{CP}^{n_r}) \otimes (h_r)^{-\sum_{\alpha \in A} q^r_\alpha}\big)$ (with the remaining factors indexed by $s = 1, \dots, m$, $s \neq r$) for the tangent bundle of the ambient space, and $H^{\gamma_s}\big(\mathbb{CP}^{n_s},\, (h_s)^{q^s_\beta - \sum_{\alpha \in A} q^s_\alpha}\big)$ for the normal bundle. In both equations $A$ denotes a subset of indices $\alpha = 1, ..., K$, and $|A|$ its cardinality. To calculate the dimensions of the cohomologies involved, we use Bott's formula: It is evident that all these formulae can be easily implemented on a computer. Calculating the ranks of the quantities above is the first step of the algorithm. From now on, by a slight abuse of notation, we will indicate the groups with their dimensions, since it makes it easier to present our computations. With this in mind, the spectral sequence associated with the normal bundle of the five-fold described by (64) reads: It is evident that it degenerates at the first page. However, this is not always the case, as shown by the spectral sequence for the tangent bundle: We see that this sequence has a non-trivial map in it which prevents it from degenerating. By defining $x$ as the dimension of the kernel of such a map (so that, in this case, $x \in \{0, 1, 2\}$), we can proceed to the subsequent pages, and obtain the cohomology on the right. We can plug these results into the long cohomology sequence associated to the adjunction sequence, i.e. (23). Then, by indicating the cohomologies with their dimensions, we have the following schematic sequence: Clearly, the associated system of equations is: $x + 1 - 2 = 0$, $h^{1,1} = x$, $50 - 1717 + h^{4,1} = 0$, (71) with the other Hodge numbers being zero. Its solution is $x = 1$, $h^{1,1} = 1$ and $h^{4,1} = 1667$.
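The ranks entering the spectral sequences above are obtained from Bott-type counting. As a minimal sketch, the block below implements only the standard special case of line bundles $\mathcal{O}(k)$ on a single projective space $\mathbb{CP}^n$; it does not cover the tangent-bundle factor, for which the full formula of [65] is needed. The function name is hypothetical.

```python
from math import comb

def line_bundle_cohomology(n, k):
    """Dimensions [h^0, ..., h^n] of H^q(CP^n, O(k)).

    Only h^0 (for k >= 0) or h^n (for k <= -n - 1) can be non-zero.
    """
    dims = [0] * (n + 1)
    if k >= 0:
        dims[0] = comb(n + k, n)       # homogeneous polynomials of degree k
    elif k <= -n - 1:
        dims[n] = comb(-k - 1, n)      # Serre dual of the case above
    return dims

print(line_bundle_cohomology(6, 7)[0])   # 1716 degree-7 monomials on CP^6
print(line_bundle_cohomology(2, -3))     # [0, 0, 1]
print(line_bundle_cohomology(2, -1))     # [0, 0, 0]
```

For a product of projective spaces, the cohomology of a tensor product of such line bundles then follows from the Künneth formula, which is how the individual factors are combined.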
Let us now move to the computations for the symmetrised adjunction sequence. Once again, we have some spectral sequences to consider, but let us employ a couple of shortcuts to write them more compactly. We avoid drawing the horizontal maps in the first row just for aesthetic reasons, while we explicitly draw arrows to represent all the other maps. We also do not write out all the elements of the subsequent pages, as we did for the tangent bundle, since they follow straightforwardly from the first page. With this in mind, the spectral sequences involved in the calculation of the other Hodge numbers schematically read: In all three cases there are maps preventing the sequence from degenerating at the first page, which are taken into account as shown. This introduces additional unknowns in the problem. Plugging our results in sequences (30) and (31), and focusing on the dimensions as above, we obtain: We find 5 non-trivial equations from (72) and 4 non-trivial equations from (73). The system can be enlarged by adding the Euler equation (see (8)) and the constraint coming from the index theorem discussed (see (17)). Moreover, it is easy to see that the auxiliary variables $x$ and $y$ drop out of the system, leaving a system for which a unique solution can be found. Ignoring the supplementary variables, the result of this computation yields the Hodge diamond: (74)

Table 1: Total number of distinct values for the cohomological quantities of interest.

They are shown in table 2 below.

Table 2: ML performances on the Hodge number $h^{1,1}$.