3D Generative Model Latent Disentanglement via Local Eigenprojection

Abstract
Designing realistic digital humans is extremely complex. Most data-driven generative models used to simplify the creation of their underlying geometric shape do not offer control over the generation of local shape attributes. In this paper, we overcome this limitation by introducing a novel loss function grounded in spectral geometry and applicable to different neural-network-based generative models of 3D head and body meshes. Encouraging the latent variables of mesh variational autoencoders (VAEs) or generative adversarial networks (GANs) to follow the local eigenprojections of identity attributes, we improve latent disentanglement and properly decouple the attribute creation. Experimental results show that our local eigenprojection disentangled (LED) models not only offer improved disentanglement with respect to the state-of-the-art, but also maintain good generation capabilities with training times comparable to the vanilla implementations of the models. Our code and pre-trained models are available at github.com/simofoti/LocalEigenprojDisentangled.


Introduction
In recent years, digital humans have become central elements not only in movie and video game production, but also in augmented and virtual reality applications. With a growing interest in the metaverse, simplified creation processes for diverse digital humans will become increasingly important. These processes will benefit experienced artists and, more importantly, will democratise the character generation process by allowing users with no artistic skills to easily create their unique avatars. Since digitally sculpting just the geometric shape of the head of a character can easily require a highly skilled digital artist weeks to months of work [GFZ*20], many semi-automated avatar design tools have been developed. Albeit simpler and faster to use, they inherit the intrinsic constraints of their underlying generative models [FKSC22]. Usually [ABWB19], these models are either limited in expressivity or they cannot control the creation of local attributes. Considering that deep-learning-based approaches, such as VAEs and GANs, offer superior representation capabilities with a reduced number of parameters, and that they can be trained to encourage disentanglement, we focus our study on these models.
By definition [BCV13; HMP*17; KM18], with a disentangled latent representation, changes in one latent variable affect only one factor of variation while being invariant to changes in other factors. This is a desirable property to offer control over the generation of local shape attributes. However, latent disentanglement remains an open problem for generative models of 3D shapes [ATJD19] despite being a widely researched topic in the deep learning community [HMP*17; KM18; KWKT15; EWJ*19; DXX*20; WYH*21; RL21]. Most research on latent disentanglement of generative models for the 3D shape of digital humans addresses the problem of disentangling the pose and expression of a subject from its identity [ATJD19; ATDJ21; CNH*20; ABWB19; ZYL*20; LYF*21; ZYHC22], but none of these works is able to provide disentanglement over the latent variables controlling the local attributes characterising the identity. Some control over the generation of local attributes was achieved for generative models of 3D furniture by leveraging complex architectures with multiple encoders and decoders independently operating on different furniture parts [NW17; YML*20; RDC*21]. In contrast, [FKSC22] recently proposed a method to train a single VAE while enforcing disentanglement among sets of latent variables controlling the identity of a character. This approach allows their Swap Disentangled VAE (SD-VAE) to learn a more disentangled, interpretable, and structured latent representation for 3D VAEs of bodies and heads. However, although [FKSC22] disentangles subsets of latent variables controlling local identity attributes, variables within each set can be entangled and not orthogonal. In addition, their curated mini-batching procedure based on attribute swapping is applicable only to autoencoder-based architectures and it significantly increases the training duration. In this work, we aim to overcome these limitations by leveraging spectral geometry to achieve disentanglement without curating the
mini-batching. In particular, we encourage the latent representation of a mesh to equal the most significant local eigenprojections of signed distances from the mean shape of the training data. Since the eigenprojections are computed using the eigenvectors of combinatorial Laplacian operators, we require meshes to be in dense point correspondence and to share the same topology. This is a standard requirement for most of the traditional [BV99; BRZ*16; DPSD20; GFZ*20; LMR*15; OBB20; PWP*19; PVS*21] and neural-network-based [FKD*20; FKSC22; GCBZ19; RBSB18; ZWL*20; YLY*20] generative models, which not only simplifies the shape generation process, but also the definition of other digital human properties that will be automatically shared by all the generated meshes (e.g., UV maps, landmarks, and animation rigs).
To summarise, the key contribution of this work is the introduction of a novel local eigenprojection loss, which is able to improve latent disentanglement among variables controlling the generation of local shape attributes contributing to the characterisation of the identity of digital humans. Our method improves over SD-VAE by enforcing orthogonality between latent variables and avoiding the curated mini-batching procedure, thus significantly reducing the training times. In addition, we demonstrate the flexibility and disentanglement capabilities of our method on both VAEs and GANs.
Blendshapes are still widely adopted for character animation or as consumer-level avatar design tools because, by linearly interpolating between a predefined set of artistically created shapes, the blend-weights can be easily interpreted [LAR*14]. However, to compensate for the limited flexibility and diversity of these models, large amounts of shapes are required. This makes the models very large, and only a limited number of shapes can be used in most practical applications. An alternative approach capable of offering more flexibility is to build models relying on principal component analysis (PCA) [BV99; EST*20]. These data-driven models are able to generate shapes as linear combinations of the training data, but the variables controlling the output shapes are related to statistical properties of the training data and are difficult to interpret. In recent years, PCA-based models have been created from large numbers of subjects. For example, LSFM [BRZ*16] and LYHM [DPSD20] were built collecting scans from 10,000 faces and 1,212 heads respectively. The two models were later combined in UHM [PWP*19], which was subsequently enriched with additional models for ears, eyes, teeth, tongue, and the inner mouth [PVS*21]. Also, [GFZ*20] combined multiple PCA models, but theirs controlled different head regions and an anatomically constrained optimisation was used to combine their outputs and thus create an interactive head sculpting tool. PCA-based models of the body were also combined with blendshapes in SMPL [LMR*15] and STAR [OBB20], which were trained with 3,800 and 14,000 body scans respectively. PCA-based models generally trade off the amount of fine detail they can represent against their size. The advent of geometric deep learning techniques brought a new set of operators making possible the creation of neural network architectures capable of processing 3D data such as point clouds and meshes. [RBSB18] introduced the first VAE for the generation of head meshes. In its comparison
against PCA, the VAE model used significantly fewer parameters and exhibited superior performance in generalisation, interpolation, and reconstruction. This pioneering work was followed by many other autoencoders, which differed from one another mostly in their application domain and the mesh operators used in their architecture [LBBM18]. Other simple methods that leverage statistical properties and do not require supervision over the generative factors are, for instance, the DIP-VAEs [KSB18] and the FactorVAE [KM18]. All the methods above were re-implemented to operate on meshes by [FKSC22], but they did not report good levels of disentanglement with respect to the identity attributes. In the 3D realm, there are currently two prominent streams of research: one disentangling the identity from the pose or expression of digital humans [ATJD19; ATDJ21; CNH*20; ZYL*20; ZBP20; TSL21; JWCZ19; HHS*21; OFD*22], and one attempting to disentangle parts of man-made objects [YML*20; NW17; LLW22; RDC*21]. In both cases, the proposed solutions require complex architectures. In addition, in the former category, current state-of-the-art methods do not attempt to disentangle identity attributes. The latter category appears better suited for this purpose, but the type of generated shapes is substantially different because the generation of object parts needs to consider intrinsic hierarchical relationships, and surface discontinuities are not a problem. More similar to ours is the method recently proposed by [FKSC22], where the latent representation of a mesh convolutional VAE is disentangled by curating the mini-batching procedure and introducing an additional loss. In particular, by swapping shape attributes between the input meshes of every mini-batch, it is possible to know which of them share the same attribute and which share all the others. This knowledge is harnessed by a contrastive-like latent consistency loss that encourages subsets of latent variables from different meshes in the
mini-batch to assume the same similarities and differences of the shapes created with the attribute swapping. This disentangles subsets of latent variables, which become responsible for the generation of different body and head attributes. We adopt the same network architecture, dataset, and attribute segmentation as SD-VAE. This choice is arbitrary and simplifies comparisons between the two methods, which differ only in their disentanglement technique.
Like VAEs, the research on GANs comes mostly from the imaging domain, where good levels of control over the generation process were recently made possible. Most of these models leverage segmentation maps [HMWL21; LLWL20; LKL*21], additional attribute classifiers [HZK*19; SBKM21], text prompts [RKH*21], or manipulate the latent codes and the parameter space of the pretrained model to achieve the desired results [KAL*21; HHLP20; SYTZ22; LKL*21]. We argue that while the first two approaches require more inputs and supervision than our method, the last two offer less editing flexibility. In fact, describing the shape of human parts is a difficult task that would ultimately limit the diversity of the generated shapes, while post-training manipulation may limit the exploration of some latent regions. Only a few methods explicitly seek disentanglement during training [AW20; VB20] like ours. However, [AW20] is specifically designed for grid-structured data, like images, and [VB20] still requires a pretrained GAN and two additional networks for disentanglement. In the 3D shapes domain, GAN disentanglement is still researched to control subject poses and expressions [CTS*21; OBD*21] or object parts [LLHF21]. However, they suffer the same problems described for 3D VAEs: they have complex architectures and do not have control over the generation of local identity attributes.

Spectral Geometry.
Spectral mesh processing has played an essential role in shape indexing, sequencing, segmentation, parametrisation, correspondence, and compression [ZVD10]. Spectral methods usually leverage the properties of the eigenstructures of operators such as the mesh Laplacian. Even though there is no unique definition for this linear operator, it can be classified either as geometric or combinatorial. Geometric Laplacians are a discretisation of the continuous Laplace-Beltrami operator [Cha84] and, as their name suggests, they encode geometric information. Their eigenvalues are robust to changes in mesh connectivity and are often used as shape descriptors [RWP06; GYP14]. Since they are isometry-invariant, they are also used in VAEs for identity and pose disentanglement [ATJD19; ATDJ21]. However, being geometry-dependent, the Laplace-Beltrami operator and its eigendecomposition have to be precomputed for every mesh in the dataset. On the other hand, combinatorial Laplacians treat a mesh as a graph and are entirely defined by the mesh topology. For these operators, the eigenvectors can be considered as Fourier bases and the eigenprojections are equivalent to a Fourier transformation [SNF*13] whose result is often used as a shape descriptor. If all shapes in a dataset share the same topology, the combinatorial Laplacian and its eigendecomposition need to be computed only once. For this reason, multiple graph and mesh convolutions [BZSL13; DBV16] as well as some data augmentation techniques [FKD*20] and smoothing losses [FKSC22] are based on combinatorial Laplacian formulations.
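The graph Fourier transform induced by a combinatorial Laplacian can be sketched in a few lines of NumPy. This is a minimal illustration on a toy path graph, with function names of our own choosing, not code from the paper:

```python
import numpy as np

def kirchhoff_laplacian(edges, n_vertices):
    """Combinatorial (Kirchhoff) Laplacian K = D - A from an undirected edge list."""
    A = np.zeros((n_vertices, n_vertices))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))  # diagonal degree matrix
    return D - A

def graph_fourier(signal, laplacian):
    """Project a per-vertex signal onto the Laplacian eigenvectors (Fourier modes)."""
    _, U = np.linalg.eigh(laplacian)  # symmetric matrix: orthonormal eigenvectors
    return U.T @ signal

# A path graph with 4 vertices: 0-1-2-3
K = kirchhoff_laplacian([(0, 1), (1, 2), (2, 3)], 4)
coeffs = graph_fourier(np.array([1.0, 2.0, 3.0, 4.0]), K)
```

Because the eigenvectors are orthonormal, the transform preserves the signal energy, and for a connected graph the first Fourier mode is constant, so the first coefficient is proportional to the signal's sum.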

Method
The proposed method introduces a novel loss to improve latent disentanglement in generative models of 3D human shapes. After defining the adopted shape representation, we introduce our local eigenprojection loss, followed by the two generative models on which it was tested: a VAE and two flavours of GANs.

Shape Representation.
We represent 3D shapes as manifold triangle meshes with a fixed topology. By fixing the topology, all meshes M = {X, E, F} share the same edges E ∈ N^(ε×2) and faces F ∈ N^(Γ×3). Therefore, they differ from one another only in the position of their vertices X ∈ R^(N×3), which are assumed to be consistently aligned, scaled, and with point-wise correspondences across shapes.
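The fixed-topology representation can be made concrete with a toy example: the connectivity arrays are shared across subjects, and per-vertex statistics such as the mean shape are well defined thanks to point correspondence. All names and values below are illustrative:

```python
import numpy as np

# Hypothetical toy dataset: all meshes share the edges E and faces F (fixed
# topology); only the vertex positions X differ between subjects.
E = np.array([[0, 1], [1, 2], [2, 0]])  # shared edge list, shape (epsilon, 2)
F = np.array([[0, 1, 2]])               # shared face list, shape (Gamma, 3)
X_subject_a = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
X_subject_b = X_subject_a + 0.1         # same connectivity, displaced vertices

def mean_shape(vertex_sets):
    """Per-vertex mean of the training set, valid thanks to point correspondence."""
    return np.mean(vertex_sets, axis=0)

M = mean_shape([X_subject_a, X_subject_b])
```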

Local Eigenprojection Loss.
We define F arbitrary attributes on a mesh template by manually colouring anatomical regions on its vertices. Thanks to the assumptions of our shape representation, the segmentation of the template mesh can be consistently transferred to all the other meshes without manually segmenting them. Mesh vertices can then be grouped per-attribute such that X = {X_ω} for ω = 1, …, F. Seeking to train generative models capable of controlling the position of the vertices corresponding to each shape attribute X_ω through a predefined set of latent variables, we evenly split the latent representation z into F subsets of size κ, such that z = {z_ω} for ω = 1, …, F, and each z_ω controls its corresponding X_ω.

submitted to COMPUTER GRAPHICS Forum (4/2023).

Figure 2: The signed distance between a given mesh X and the mean shape template is computed as sd(X). sd(X) is locally eigenprojected into a vector z where each subset of variables is a spectral descriptor of a shape attribute. The projection is performed by matrix-multiplying the signed distance by Ū_ω, the highest-variance eigenvectors of each shape attribute ω. The heads in the bottom part of the figure represent one-dimensional vectors whose values are mapped with diverging colour maps on the mean shape head. On the heads corresponding to the columns of Ū_ω, the black seams mark the different attributes that we seek to control during the generation procedure.

To establish and enforce a direct relationship between each X_ω and z_ω, we rely on spectral geometry and compute low-dimensional local shape descriptors in the spectral domain. We start by computing the Kirchhoff graph Laplacian corresponding to each shape attribute as K_ω = D_ω − A_ω, where A_ω ∈ N^(N_ω×N_ω) is the adjacency matrix of attribute ω, D_ω ∈ R^(N_ω×N_ω) its diagonal degree matrix, and N_ω the number of its vertices. Values on the diagonal of D_ω are computed as D_aa = Σ_b A_ab. The Kirchhoff Laplacian is a real symmetric positive semidefinite matrix that can be eigendecomposed as K_ω = U_ω Λ_ω U_ω^T. The columns of U_ω ∈ R^(N_ω×K) are a set of K orthonormal eigenvectors known as the graph Fourier modes and can be used to transform any discrete function defined on the mesh vertices into the spectral domain. The signal most commonly transformed is the mesh geometry, i.e., the signal specifying the vertex coordinates. However, the local eigenprojection X̂_ω = U_ω^T X_ω would result in a matrix of size K×3 containing the spectral representations of the 3 spatial coordinates. Instead of flattening X̂_ω to make it compatible with the shape of the latent representation, we define and project a one-dimensional signal: the signed distance between the vertices of a mesh and the per-vertex mean of the training set M (see Fig. 2). We have:

sd(X) = ⟨X − M, N⟩,    (1)

where ⟨·, ·⟩ is the row-wise inner product, and N are the vertex normals referred to the mesh template with vertex positions M. If X̄ denotes X standardised by subtracting M and dividing by the per-vertex standard deviation of the training set Σ, with ⊙ being the Hadamard product, Eq. 1 can be rewritten as:

sd(X) = ⟨X̄ ⊙ Σ, N⟩.    (2)

We assume that not all eigenprojections are equally significant when representing shapes. Therefore, for each attribute ω, we eigenproject all the local signed distances sd(X_ω) computed over the training set and identify the κ (with κ ≪ K) spectral components with the highest variance. While these spectral components are responsible for most shape variations, the small shape differences represented by the other components can be easily learned by the neural-network-based generative model. After eigenprojecting the entire training set, we select the Fourier modes Ū_ω ∈ R^(N_ω×κ) associated with the highest-variance eigenprojections (Fig. 2) and use them to compute the eigenprojection loss. During this preprocessing step we also compute the mean and standard deviation of the highest-variance local eigenprojections, which we denote by m̄_ω and s̄_ω respectively. We thus define the local eigenprojection loss as:

L_LE(X, z) = Σ_ω ‖ (Ū_ω^T sd(X_ω) − m̄_ω) / s̄_ω − z_ω ‖²,    (3)

where the division by s̄_ω is element-wise. Note that combinatorial Laplacian operators are determined exclusively by the mesh topology. Since the topology is fixed across the dataset, the Laplacians and their eigendecompositions need to be computed only once. Therefore, the local eigenprojection can be quickly determined by matrix-multiplying signed distances by the precomputed Ū_ω. In contrast, if the Laplace-Beltrami operator were used in place of the Kirchhoff graph Laplacian, the eigendecomposition would need to be computed for every mesh. Not only would this significantly increase the training duration, but backpropagating through the eigendecomposition would also be more complex, as it introduces numerical instabilities [WDH*19]. Alternatively, an approach similar to [MRC*21] would have to be followed.
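The preprocessing step and the loss described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the helper names, the per-attribute indexing scheme, and the squared-error form of the loss are our own illustrative choices.

```python
import numpy as np

def kirchhoff(adjacency):
    """Kirchhoff graph Laplacian K = D - A of one shape attribute."""
    D = np.diag(adjacency.sum(axis=1))
    return D - adjacency

def signed_distance(X, M, normals):
    """Per-vertex signed distance between a mesh X and the mean shape M (Eq. 1)."""
    return np.einsum('ij,ij->i', X - M, normals)

def fit_local_eigenprojection(train_X, A_omega, idx_omega, M, normals, kappa):
    """Preprocessing for one attribute omega: keep the kappa Fourier modes whose
    eigenprojections have the highest variance over the training set, together
    with the mean and standard deviation of those projections."""
    _, U = np.linalg.eigh(kirchhoff(A_omega))      # orthonormal eigenvectors
    sd = np.stack([signed_distance(X[idx_omega], M[idx_omega], normals[idx_omega])
                   for X in train_X])              # (n_train, N_omega)
    proj = sd @ U                                  # eigenprojections of train set
    top = np.argsort(proj.var(axis=0))[::-1][:kappa]
    U_bar = U[:, top]
    z = sd @ U_bar
    return U_bar, z.mean(axis=0), z.std(axis=0)

def local_eigenprojection_loss(X, z_pred, U_bar, idx_omega, M, normals, m_bar, s_bar):
    """Squared-error term pushing the predicted latents towards the standardised
    local eigenprojections (one attribute; summed over attributes in practice)."""
    sd = signed_distance(X[idx_omega], M[idx_omega], normals[idx_omega])
    target = (sd @ U_bar - m_bar) / s_bar
    return np.mean((z_pred - target) ** 2)
```

Note that `fit_local_eigenprojection` is run once before training, so the loss at training time reduces to a matrix multiplication and a standardisation.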

Mesh Variational Autoencoder.
Like traditional VAEs [KW13], our 3D-VAE is also built as a probabilistic encoder-decoder pair parameterised by two separate neural networks. The probabilistic encoder is defined as a variational distribution q(z|X) that approximates the intractable model posterior.
It predicts the mean µ and standard deviation σ of a Gaussian distribution over the possible z values from which X could have been generated. The probabilistic decoder p(X|z) describes the distribution of the decoded variable given the encoded one. During the generation process, a latent vector z is sampled from a Gaussian prior distribution p(z) = N(z; 0, I) and an output shape is generated by the probabilistic decoder. Since the decoder is used as a generative model, it is also referred to as the generator. Following this convention, we define our architecture as a pair of non-linear functions {E, G}, where E : X → Z maps from the vertex embedding domain X to the latent distribution domain Z, and G : Z → X vice versa. Since traditional convolutional operators are not compatible with the non-Euclidean nature of meshes, we build both networks as in [FKSC22], using the simple yet efficient spiral convolutions [GCBZ19] and sparse matrix multiplications with transformation matrices obtained with quadric sampling [GCBZ19; RBSB18] (see Supplementary Materials for more details).
As in [FKSC22], the 3D-VAE is trained by minimising a combination of a reconstruction term, a KL divergence, and a Laplacian smoothing term. The reconstruction term L_rec = ‖X′ − X‖²_F encourages the output of the VAE to be as close as possible to its input by computing the squared Frobenius norm between X′ = G(E(X)) and X. The KL divergence L_KL can be considered as a regularisation term that pushes the variational distribution q(z|X) towards the prior distribution p(z); since both prior and posterior are assumed to be Gaussian, it can be computed in closed form. L_L(X′) = ‖TX′‖²_F is a smoothing term computed on the output vertices X′ and based on the Tutte Laplacian T = D⁻¹K = I − D⁻¹A, where A, D, and K are the adjacency, diagonal degree, and Kirchhoff Laplacian matrices introduced in the previous paragraph, here computed on the entire mesh rather than on shape attributes.
Latent disentanglement is enforced by separately applying the local eigenprojection loss to the encoder and the generator. We thus define the total loss as:

L_tot = L_VAE + η₁ L_LE(X, µ) + η₂ L_LE(X′, µ),

where L_VAE denotes the weighted sum of the reconstruction, KL, and smoothing terms described above, and η₁ and η₂ are two scalar weights balancing the contributions of the two local eigenprojection losses. Note that L_LE(X, µ) is backpropagated only through E. This term pushes the predicted µ towards the standardised local eigenprojections of the input, while the KL divergence attempts to evenly distribute the encodings around the centre of the latent space. Similarly, L_LE(X′, µ) is backpropagated only through G, and it enforces the output attributes to have an eigenprojection compatible with the predicted mean.
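A hypothetical assembly of the total 3D-VAE objective might look as follows. This is a NumPy sketch: the closed-form KL for diagonal Gaussians is standard, but the function names, weighting scheme, and the scalar stand-ins `le_enc`/`le_gen` for L_LE(X, µ) and L_LE(X′, µ) are our own assumptions.

```python
import numpy as np

def kl_gaussian(mu, sigma):
    """Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I)."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def vae_total_loss(X, X_rec, mu, sigma, T, le_enc, le_gen,
                   w_kl=1e-4, w_lap=1.0, eta1=1.0, eta2=1.0):
    """Illustrative total objective: reconstruction + KL regularisation +
    Tutte-Laplacian smoothing + the two local eigenprojection terms.
    All weights are illustrative, not the paper's values."""
    rec = np.sum((X_rec - X) ** 2)       # squared Frobenius norm ||X' - X||_F^2
    smooth = np.sum((T @ X_rec) ** 2)    # Laplacian smoothing ||T X'||_F^2
    return (rec + w_kl * kl_gaussian(mu, sigma) + w_lap * smooth
            + eta1 * le_enc + eta2 * le_gen)
```

In the actual training loop the two eigenprojection terms would be backpropagated selectively (through E and through G respectively), which a plain sum like this does not capture.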

Mesh Generative Adversarial Networks.
We propose two flavours of 3D Generative Adversarial Networks: one based on Least Squares GAN (LSGAN) [MLX*17] and one on Wasserstein GAN (WGAN) [ACB17]. Like VAEs, GANs also rely on a pair of neural networks: a generator-discriminator pair {G, D} in LSGAN and a generator-critic pair {G, C} in WGAN. The architecture of the generators is the same as that of the 3D-VAE generator. The architectures of D and C are similar to E, but with minor differences in the last layers (see Supplementary Materials). Nevertheless, all networks are built with the same mesh operators of our 3D-VAE and [FKSC22; GCBZ19].
In the LSGAN implementation, G samples an input latent representation from a Gaussian distribution p(z) = N(z; 0, I) and maps it to the shape space as G(z) = X′. While it tries to learn a distribution over generated shapes, the discriminator operates as a classifier trying to distinguish generated shapes X′ from real shapes X. Using a binary coding scheme for the labels of real and generated samples, we can write the losses of G and D respectively as:

L_G_LSGAN = (1/2) E_z[(D(G(z)) − 1)²],
L_D_LSGAN = (1/2) E_X[(D(X) − 1)²] + (1/2) E_z[D(G(z))²].

We also add the Laplacian regularisation term L_L(X′) to smooth the generated outputs. When seeking disentanglement, we train the discriminator by minimising L_D_LSGAN and the generator by minimising the following:

L_G_LSGAN + L_L(X′) + η L_LE(X′, z).

In WGAN, G still tries to learn a distribution over generated shapes, but its critic network C, instead of classifying real and generated shapes, learns a Wasserstein distance and outputs scalar scores that can be interpreted as measures of realism for the shapes it processes. The WGAN losses for G and C are L_G_WGAN = −E_z[C(G(z))] and L_C_WGAN = E_z[C(G(z))] − E_X[C(X)] respectively. Similarly to the LSGAN implementation, when enforcing disentanglement, the critic is trained by minimising L_C_WGAN, while the generator minimises:

L_G_WGAN + L_L(X′) + η L_LE(X′, z).

Datasets.
For comparison, we rely on the 10,000 meshes (with neutral expression and pose) generated in [FKSC22] using two linear models that were built from large numbers of subjects: UHM [PWP*19] and STAR [OBB20] (Sec. 2.1). We also use the same data split, with 90% of the data for training, 5% for validation, and 5% for testing. Since these data are generated from PCA-based models, we also train our models on real data from the LYHM dataset [DPSD20] registered on the FLAME [LBB*17] template. In addition, even though it is beyond the scope of this work, we attempt to achieve disentanglement through local eigenprojection also on CoMA [RBSB18], a dataset mostly known for its wide variety of expressions. All models and datasets are released for non-commercial scientific research purposes.
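The adversarial objectives reduce to a few expectations over discriminator and critic scores. The following sketch evaluates them on plain score arrays; it is a simplification that omits the mesh networks, the Laplacian and eigenprojection terms, and (for WGAN) any gradient penalty:

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives with 0/1 target coding:
    L_D = 1/2 E[(D(X)-1)^2] + 1/2 E[D(X')^2],  L_G = 1/2 E[(D(X')-1)^2]."""
    loss_d = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    loss_g = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return loss_d, loss_g

def wgan_losses(c_real, c_fake):
    """Wasserstein GAN objectives (gradient penalty omitted for brevity):
    L_C = E[C(X')] - E[C(X)],  L_G = -E[C(X')]."""
    return np.mean(c_fake) - np.mean(c_real), -np.mean(c_fake)
```

In the disentangled variants described above, the generator loss would additionally include the smoothing term L_L(X′) and the local eigenprojection term computed on the generated vertices.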

Local Eigenprojection Distributions.
We observe that the eigenprojections are normally distributed for datasets with neutral poses or expressions (Fig. 3). By standardising the eigenprojections in Eq. 3, we ensure that their mean and standard deviation are 0 and 1 respectively. Since we enforce a direct relation between the local eigenprojections and the latent representations, this is a desirable property that allows us to generate meaningful shapes by sampling latent vectors from a normal distribution.
In order to explain why this property holds for datasets with neutral poses and expressions, we need to hypothesise that shapes follow a Gaussian distribution. This is a reasonable hypothesis for datasets generated from PCA-based models, such as those obtained from UHM and STAR, because vertex positions are computed as linear combinations of generative coefficients sampled from a Gaussian. However, following the maximum entropy explanation [Lyo14], it is also reasonable to assume that shapes in datasets obtained by capturing real people (like LYHM) are normally distributed. [Lyo14] argues that although the Central Limit Theorem is the standard explanation of why many things are normally distributed, the conditions to apply the theorem are usually not met or cannot be verified. We assume that, like people's height, body and head shapes are also largely determined by genetics and partially by environment and epigenetic effects. The selection pressure determines an ideal shape with some variability to hedge against fluctuating circumstances in the environment. This amounts to fixing the mean and an upper bound on the variance. Apart from that, the population will naturally tend to a state of maximal disorder (i.e., maximum entropy). Therefore, according to the maximum entropy explanation, human shapes are normally distributed because the distribution maximising entropy subject to those constraints is a normal distribution.
If the shapes are normally distributed, we can consider vertex positions consistently sampled on the shape surfaces to each follow a different Gaussian distribution centred at the corresponding vertex coordinates on the mean shape. Considering that the signed distance and the local eigenprojection are both linear operations, they preserve normality; for this reason the local eigenprojections are also normal. Note that expressions are subject-specific deformations with a highly non-linear behaviour [CBGB20]. There is no guarantee that these transformations preserve the normality of the shape distribution. Therefore, datasets containing expressions, such as CoMA, may not satisfy the normality assumption. In fact, we observe that the standardised eigenprojections have more complex distributions which appear to be mixtures of Gaussians (see Fig. 3). Intuitively, each Gaussian in the mixtures could be related to a different subset of expressions.
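The normality-preservation argument can be checked numerically: pushing Gaussian vertex perturbations through the (linear) signed-distance and eigenprojection operations should yield coefficients with approximately Gaussian moments. A synthetic sketch with random unit normals and random orthonormal modes standing in for a real mesh:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices, n_shapes = 8, 20000

# Unit vertex normals and an orthonormal basis playing the role of the
# Laplacian eigenvectors (illustrative stand-ins for a real mesh).
normals = rng.normal(size=(n_vertices, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
U = np.linalg.qr(rng.normal(size=(n_vertices, n_vertices)))[0]

perturb = rng.normal(size=(n_shapes, n_vertices, 3))   # X - M, Gaussian
sd = np.einsum('svj,vj->sv', perturb, normals)          # signed distances
proj = sd @ U                                           # eigenprojections

# Sample skewness per spectral component: ~0 for a Gaussian.
skew = np.mean(((proj - proj.mean(0)) / proj.std(0)) ** 3, axis=0)
```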

Comparison with Other Methods.
We compare our local eigenprojection disentangled (LED) methods against their vanilla implementations and against the only state-of-the-art method providing control over the generation of local shape attributes: the swap disentangled VAE (SD-VAE) proposed in [FKSC22]. The authors compared their SD-VAE with other VAEs for latent disentanglement. Among their implementations of DIP-VAE-I, DIP-VAE-II, and FactorVAE, the first one appeared to be the best performing. Therefore, we report results for DIP-VAE-I. For a fair comparison, all methods were trained on the same dataset (UHM) using the same batch size and the same number of epochs. In addition, they share the same architecture, with minor modifications for the GAN implementations (see Supplementary Materials). The SD-VAE implementation, as well as the evaluation code and the benchmark methods, are made publicly available at github.com/simofoti/3DVAE-SwapDisentangled. All models were trained on a single Nvidia Quadro P5000, which was used for approximately 18 GPU days in order to complete all the experiments.
The reconstruction errors reported in Tab. 1 are computed as the mean per-vertex L2 distance between input and output vertex positions. This metric is computed on the test set and applies only to VAEs. We report the generation capabilities of all models in terms of diversity, JSD, MMD, COV, and 1-NNA. The diversity is computed as the average of the mean per-vertex distances among pairs of randomly generated meshes. The Jensen-Shannon Divergence (JSD) [ADMG18] evaluates the KL distances between the marginal point distributions of real and generated shapes. The coverage (COV) [ADMG18] measures the fraction of meshes matched to at least one mesh from the reference set. The minimum matching distance (MMD) [ADMG18] complements the coverage by averaging the distance between each mesh in the test set and its nearest neighbour among the generated ones. The 1-nearest neighbour accuracy (1-NNA) is a measure of similarity between shape distributions that evaluates the leave-one-out accuracy of a 1-NN classifier. In its original formulation [YHH*19], it expects values converging to 50%. However, following [FKSC22], in Tab. 1 we report absolute differences between the original score and the 50% target value. All the generation capability metrics can be computed either with the Chamfer or the Earth Mover distance. Since we did not observe significant discrepancies between the metrics computed with these two distances, we arbitrarily report results obtained with the Chamfer distance.
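For reference, MMD and COV can be sketched with a brute-force Chamfer distance as follows. This is a naive O(N·M) illustration on toy point sets with function names of our own, not the evaluation code used for Tab. 1:

```python
import numpy as np

def chamfer(p, q):
    """Symmetric Chamfer distance between two point sets of shape (N, 3), (M, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def mmd_cov(generated, reference):
    """Minimum matching distance and coverage of a generated set w.r.t. a
    reference set. MMD averages, over the reference set, the distance to the
    nearest generated mesh; COV is the fraction of reference meshes that are
    the nearest neighbour of at least one generated mesh."""
    d = np.array([[chamfer(g, r) for r in reference] for g in generated])
    cov = len(set(d.argmin(axis=1))) / len(reference)
    mmd = d.min(axis=0).mean()
    return mmd, cov
```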
Observing Tab. 1, we notice that none of the models consistently outperforms the others. GANs generally report better diversity scores than VAEs, but they are worse in terms of coverage and 1-NNA. GANs were also more difficult to train and were prone to mode collapse. On the other hand, VAEs appeared stable and required significantly less hyperparameter tuning. The scores of our LED models were comparable with other methods, thus showing that our loss does not negatively affect the generation capabilities. However, LED models are consistently outperformed in terms of MMD, COV, and 1-NNA. These metrics evaluate the quality of generated samples by comparing them to a reference set. Since comparisons are performed on the entire output shapes, we hypothesise that a shape whose local identity attributes each resemble a different subject from the test set is penalised more than a shape whose attributes could plausibly come from a single subject. Note also that MMD, COV, and 1-NNA appear to be inversely proportional to the diversity, suggesting that more diverse generated shapes are also less similar to shapes in the test set. LED models report higher diversity because attributes can be independently generated. This negatively affects MMD, COV, and 1-NNA, but the randomly generated shapes are still plausible subjects (see Fig. 4 and Supplementary Materials). Interestingly, SD-VAE appears to be still capable of generating shapes with attributes resembling the same subject from the test set, but at the expense of diversity and latent disentanglement (see Sec. 4.4).

LED-LSGAN and LED-WGAN train almost as quickly as the vanilla LSGAN and WGAN. Training LED-VAE takes approximately one hour more than its vanilla counterpart because the local eigenprojection loss is separately backpropagated through the encoder. However, since latent disentanglement is achieved without swapping shape attributes during mini-batching, the training time of LED-VAE is reduced by 61% with respect to SD-VAE. Note that the additional initialisation overhead of LED models (3.72 minutes) is negligible when compared to the significant training time reduction over SD-VAE, which is the only model capable of achieving a satisfactory amount of latent disentanglement.

Figure 5: Effects of traversing each latent variable across different mesh attributes. For each latent variable (abscissas) we represent the per-attribute mean distances computed after traversing the latent variable from its minimum to its maximum value. For each latent variable, we expect a high mean distance in one single attribute and low values for all the others.
If we then qualitatively evaluate the random samples in Fig. 4, we see that the quality of the meshes generated by LED-LSGAN and LED-WGAN is slightly worse than those from LED-VAE. We attribute this behaviour to the (usually undesired) smoothness typically introduced by 3D VAE models. In this case, the VAE model itself acts as a regulariser that prevents the shape artefacts introduced by the local eigenprojection disentanglement. In addition, traversing the latent variables, we find that mesh defects tend to appear when latent variables approach values near ±3 (see supplementary material video). This might be a consequence of the reduced number of training samples whose local eigenprojections take these values (see Fig. 3). Nonetheless, the problem can be easily mitigated with the truncation trick, i.e. sampling latent vectors from a Gaussian with standard deviation slightly smaller than one.
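The truncation trick can be sketched as follows. This is a minimal sketch under the assumption of a standard-normal latent prior; the standard deviation of 0.8 is an illustrative choice, not a value from the paper.

```python
import numpy as np

def sample_truncated(n, latent_dim, std=0.8, seed=0):
    """Sample latent vectors from a zero-mean Gaussian whose standard
    deviation is slightly smaller than one, so that latent variables rarely
    approach the +/-3 region where mesh defects tend to appear."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=(n, latent_dim))

z = sample_truncated(16, 60)
print(z.shape)  # (16, 60)
```

The sampled vectors would then be fed to the generator in place of standard-normal samples, trading a little diversity for fewer defects.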

Evaluation of Latent Disentanglement.
Latent disentanglement can be quantitatively evaluated on datasets with labelled data. However, such labels are not available for the disentanglement of shape attributes, so traditional metrics such as the Z-Diff [HMP*17], SAP [KSB18], and Factor [KM18] scores cannot be used. Since the Variation Predictability (VP) disentanglement metric does not require labelled data and has shown good correlation with the Factor score [ZXT20], we rely on this metric to quantify disentanglement across different models (see Tab. 1). The VP metric averages the test accuracies across multiple few-shot trainings of a classifier. The classifier takes as input the difference between two shapes generated from two latent vectors differing in only one dimension and predicts the varied latent dimension. We implement the classifier network with the same architecture as our encoders, discriminators, and critics. The network was trained for 5 epochs with a learning rate of 1e-4. As in [ZXT20], we set η_VP = 0.1, N_VP = 10,000, and S_VP = 3.
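The data-generation step of the VP metric can be sketched as below. This is a hypothetical numpy sketch in which a toy linear map stands in for the mesh generator, and the classifier training itself is omitted.

```python
import numpy as np

def make_vp_pairs(generator, n_pairs, latent_dim, delta=1.0, seed=0):
    """Build (shape difference, varied-dimension) training pairs for the
    Variation Predictability (VP) metric. Each pair is the difference between
    two shapes generated from latent vectors differing in a single randomly
    chosen dimension; the classifier must predict which dimension varied."""
    rng = np.random.default_rng(seed)
    diffs, labels = [], []
    for _ in range(n_pairs):
        z = rng.normal(size=latent_dim)
        d = int(rng.integers(latent_dim))  # dimension to vary
        z2 = z.copy()
        z2[d] += delta
        diffs.append(generator(z2) - generator(z))
        labels.append(d)
    return np.stack(diffs), np.array(labels)

# toy linear "generator" standing in for the mesh decoder (illustrative only)
W = np.random.default_rng(1).normal(size=(60, 300))
X, y = make_vp_pairs(lambda z: z @ W, n_pairs=100, latent_dim=60)
print(X.shape, y.shape)  # (100, 300) (100,)
```

The VP score is then the average few-shot test accuracy of a classifier trained on such (X, y) pairs.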
In addition, we qualitatively evaluate disentanglement as in [FKSC22] by observing the effects of traversing latent variables (Fig. 1, left). For each latent variable, we compute the per-vertex Euclidean distances between two meshes. After setting all latent variables to their mean value (0), the first mesh is generated by setting a single latent variable to its minimum (−3) and the second by setting the same variable to its maximum (+3). The Euclidean distances can be either rendered on the mesh surface using colours proportional to the distances (Latent Traversals in Fig. 4 and Fig. 6), or plotted as their per-attribute average distance (Fig. 5 and Fig. 6). When plotted, the average distances isolated to each attribute provide an intuitive way to assess disentanglement: good disentanglement is achieved when the traversal of a single variable determines high mean distances for one attribute and low mean distances for all the others. Observing Fig. 4 and Fig. 5, it is clear that the only state-of-the-art method providing control over local shape attributes is SD-VAE. Since the eigenvectors used in the local eigenprojection loss are orthogonal, we improve disentanglement over SD-VAE. In fact, traversing latent variables of LED models determines finer changes within each attribute of the generated shapes. For instance, this can be appreciated by observing the eyes in the latent traversals of Fig. 4, where the left and right eyes are controlled by different variables in LED-VAE, but by the same one in SD-VAE (more examples are depicted in the Supplementary Materials). We also notice that the magnitude of the mean distances reported in Fig. 5 for our LED models is larger than for SD-VAE within attributes and comparable outside. This shows superior disentanglement and allows our models to generate shapes with more diverse attributes than SD-VAE. Our model also exhibits good disentanglement performance on other datasets (Fig. 6).

Figure 6: Results of LED-VAE on other datasets. For each dataset we display the effects of traversing latent variables (UHM is reported in Fig. 5), three random samples, and three vertex-wise distances highlighting the effects of traversing three latent variables (UHM is reported in Fig. 4). Mean distances are plotted following the colour coding depicted in Fig. 3.
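The per-attribute mean distances plotted in Fig. 5 can be computed as in the following sketch. The generator and the per-attribute vertex masks here are illustrative stand-ins, not the paper's models.

```python
import numpy as np

def traversal_distances(generator, latent_dim, attr_masks, lo=-3.0, hi=3.0):
    """For each latent variable, generate two meshes with that variable at its
    minimum and maximum (all other variables at their mean, 0) and return the
    mean per-vertex Euclidean distance restricted to each attribute. With good
    disentanglement each row has one large entry and small values elsewhere."""
    dists = np.zeros((latent_dim, len(attr_masks)))
    for i in range(latent_dim):
        z_lo, z_hi = np.zeros(latent_dim), np.zeros(latent_dim)
        z_lo[i], z_hi[i] = lo, hi
        d = np.linalg.norm(generator(z_hi) - generator(z_lo), axis=1)
        for a, mask in enumerate(attr_masks):
            dists[i, a] = d[mask].mean()
    return dists

# toy generator: 8 vertices, 4 latents; latent i moves only vertices 2i and 2i+1
def toy_gen(z):
    return np.repeat(z, 2)[:, None] * np.ones((1, 3))

masks = [np.arange(2 * a, 2 * a + 2) for a in range(4)]
D = traversal_distances(toy_gen, 4, masks)
print(np.round(D, 1))  # diagonal matrix: each latent moves exactly one attribute
```

A perfectly disentangled toy generator yields a diagonal distance matrix; entangled models spread mass across each row.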

Direct Manipulation
Like SD-VAE, LED-VAE can be used for the direct manipulation of the generated shapes. As in [FKSC22], the direct manipulation is performed by manually selecting ϒ vertices on the mesh surface (S • X = S • G(z) ∈ R^(ϒ×3)) and by providing their desired location (Y ∈ R^(ϒ×3)). Then, min_{z_ω} ‖S • G(z) − Y‖²₂ is optimised with the ADAM optimiser for 50 iterations with a fixed learning rate of lr = 0.1. Note that the optimisation is performed only on the subset of latent variables z_ω controlling the local attribute corresponding to the selected vertices. If vertices from different attributes are selected, multiple optimisations are performed. As can be observed in Fig. 7, LED-VAE is able to perform the direct manipulations while causing fewer shape changes than SD-VAE in areas that should remain unchanged.
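A minimal sketch of this optimisation follows, assuming a linear toy generator so the gradient is analytic. The paper optimises through the full network with ADAM; here plain gradient descent is used instead, and the matrix G and all sizes are illustrative.

```python
import numpy as np

def direct_manipulation(G, sel, Y, z0, opt_idx, lr=0.1, iters=50):
    """Minimise ||S o G(z) - Y||_2^2 with gradient descent, updating only the
    subset of latent variables (opt_idx) controlling the selected attribute.
    G: (3*n_vertices, latent_dim) linear generator, sel: selected vertex ids,
    Y: (len(sel), 3) desired vertex positions."""
    rows = (np.asarray(sel)[:, None] * 3 + np.arange(3)).ravel()
    Gs = G[rows]                          # rows of G mapping z to selected coords
    z = z0.copy()
    for _ in range(iters):
        grad = 2.0 * Gs.T @ (Gs @ z - Y.ravel())
        z[opt_idx] -= lr * grad[opt_idx]  # only the attribute's latents move
    return z

rng = np.random.default_rng(0)
G = 0.1 * rng.normal(size=(30, 8))        # 10 vertices, 8 latent variables
z0 = rng.normal(size=8)
sel, Y = [2, 5], rng.normal(size=(2, 3))
z = direct_manipulation(G, sel, Y, z0, opt_idx=np.arange(4))
```

Restricting the update to `opt_idx` is what keeps the other attributes, and hence the rest of the shape, unchanged during manipulation.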

Conclusion
We introduced a new approach to train generative models with a more disentangled, interpretable, and structured latent representation that significantly reduces the computational burden required by SD-VAE. By establishing a correspondence between local eigenprojections and latent variables, generative models can better control the creation and modification of local identity attributes of human shapes (see Fig. 1). Like the majority of state-of-the-art methods, the main limitation of our model is the assumption on the training data, which need to be consistently aligned, in dense point correspondence, and with a fixed topology. Even though this is surely a limitation, as we mentioned in Sec. 1, this assumption can simplify the generation of other digital humans' properties. Among the different LED models we proposed, we consider LED-VAE to be the most promising. This model is simpler to train, requires less hyperparameter tuning, and generates higher-quality meshes. We trained and tested this model also on other datasets, where it showed equivalent performance. Datasets with expressions have complex local eigenprojection distributions (Fig. 3) which are more difficult to learn. In fact, random samples generated by LED-VAE trained on CoMA present mesh defects localised especially in areas where changes in expression introduce significant shape differences characterised by a highly non-linear behaviour (e.g. the mouth region). Controlling the generation of different expressions was beyond the scope of this work and we aim to address this issue in future work. We proved that our loss can be easily used with both GANs and VAEs. Being efficient to compute and not requiring modifications to the mini-batching procedure (unlike SD-VAE), it could also be leveraged in more complex architectures for 3D reconstruction or pose and expression disentanglement. In LED-VAE the local eigenprojection loss is also computed on the encoder (see how this improves disentanglement in the ablation study provided with the supplementary materials). Having an encoder capable of providing a disentangled representation for different attributes could greatly benefit shape-analysis research in plastic surgery [OvdLP*22] and in genetic applications [CRW*18]. Therefore, we believe that our method has the potential not only to benefit experienced digital artists, but also to democratise the creation of realistic avatars for the metaverse and find new applications in shape analysis. Since the generation of geometric shapes is only the first step towards the data-driven generation of realistic digital humans, as future work we will research more interpretable generative processes for expressions, poses, textures, materials, high-frequency details, and hair.

LED-VAE Additional Experiments
As mentioned in Sec. 5, we consider LED-VAE to be the most promising generative model among the proposed LED models because it is simpler to train, requires less hyperparameter tuning, and generates higher-quality meshes. Therefore, we conduct experiments to evaluate the importance of the different assumptions made in its construction (Sec. 5.1 and Sec. 5.2), and observe the smoothness of the latent space (Sec. 5.3).

Ablation Study
The ablation study in Fig. 9 is performed by re-training the proposed LED-VAE without some of its characterising design choices. Models are re-trained with the same architecture (Sec. 3) and implementation details (Sec. 4) as LED-VAE. Only one design choice is altered per ablation experiment. When the data are not standardised (No stand in Fig. 9), we observe noisy random samples. Some control over the generation of local attributes appears to be retained, but the presence of noise contributes to shape variations across the entire shape during latent traversals. When the local eigenprojection loss is not computed on the encoder (η_1 = 0), not only does the encoder lose its disentanglement capabilities, but the control over the generation of local shape attributes is significantly reduced (No LE E in Fig. 9). Similar results are obtained when the local eigenprojection loss is not computed on the generator (η_2 = 0), though the encoder should retain its disentanglement power (No LE G in Fig. 9). If, instead of selecting the κ eigenvectors corresponding to the spectral components with the highest variance, we select the first κ eigenvectors as Fourier modes for the local eigenprojection, the generator creates unrealistic shapes (No max var in Fig. 9). Unrealistic shapes are also generated if the local eigenprojections are not standardised, i.e. m_ω = 0 and s_ω = 1 in Eq. 3 (No LE stand in Fig. 9). In addition, note that the vanilla VAE is equivalent to LED-VAE without local eigenprojection losses, and its results are equivalent to those of an ablation experiment where both local eigenprojection losses are set to zero.
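The eigenprojection standardisation removed in the "No LE stand" ablation can be sketched as follows. This is a hypothetical sketch: m_ω and s_ω are the per-component statistics computed over the training-set eigenprojections, as in Eq. 3.

```python
import numpy as np

def standardise_eigenprojections(Z):
    """Standardise local eigenprojections with the per-component training-set
    mean m and standard deviation s. Setting m = 0 and s = 1 instead (the
    'No LE stand' ablation) leads the generator to unrealistic shapes."""
    m, s = Z.mean(axis=0), Z.std(axis=0)
    return (Z - m) / s, m, s

# toy training-set eigenprojections (500 meshes, 60 spectral components)
Z = np.random.default_rng(0).normal(3.0, 2.0, size=(500, 60))
Z_std, m, s = standardise_eigenprojections(Z)
print(np.allclose(Z_std.mean(axis=0), 0.0), np.allclose(Z_std.std(axis=0), 1.0))
# → True True
```

Standardisation puts every spectral component on a comparable scale, so the latent variables can plausibly match a standard-normal prior.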
In addition to ablating characterising design choices of LED-VAE, we also experiment with the strength of their weighting coefficients. In Fig. 10, we observe the effects caused by changing the smoothing weight (α). As expected, reducing α also reduces the quality of the randomly generated samples. Interestingly, the disentanglement performance also slightly deteriorates. Increasing α does not have major effects on the generation and disentanglement performance. In Fig. 11, we report the effects of altering the local eigenprojection weighting coefficients. Even though Fig. 9 (No LE E) shows the importance of enforcing the local eigenprojection loss on the encoder, we do not observe significant differences when altering its weight (η_1). Most differences can be appreciated in the first latent traversal, showing how lower weights slightly deteriorate disentanglement. On the contrary, η_2, which modulates the disentanglement on the generator, has more influence on sample quality and disentanglement. In fact, high η_2 values improve disentanglement, but reduce sample quality.

Different segmentations
The segmentations in Fig. 3 were performed with clinical supervision and were aimed at identifying key anatomical areas of the face and body. Nevertheless, different segmentations are admissible. To observe the disentanglement performance of LED-VAE with different segmentations, we re-trained LED-VAE using a coarser and a finer segmentation. As shown in Fig. 12, LED-VAE successfully disentangles local identity attributes when varying the size and number of local shape attributes.
Note that the local segments are used only by the local eigenprojection losses and are not an input to the network. For this reason, attributes can be connected, overlapping, or even disconnected. The segmentation used to train our models is connected. Overlapping segments could be used, but large overlaps may be counterproductive, as they would increase entanglement between neighbouring regions and may produce unexpected results when the eigenprojections of overlapping regions are incompatible.

Latent Interpolations and Replacements
We perform two latent interpolation experiments and compare results between LED-VAE and SD-VAE, the two variational autoencoders providing control over the generation of local shape attributes. Two randomly selected test shapes X_start and X_finish are encoded to compute their respective latent representations. Fig. 13 shows the reconstructed shapes X_start and X_finish as well as the shapes generated from latent vectors linearly interpolated between the latents of X_start and X_finish. Fig. 14, Fig. 15, and Fig. 16 depict the effects of changing each z_ω of X_start with the corresponding z_ω of X_finish. This is equivalent to progressively replacing attributes of the initial mesh with those of the final mesh. The experiment in Fig. 14 is better represented in the supplementary video, where each z_ω is interpolated instead of being replaced. These experiments show that the latent space of our LED-VAE is smooth. Even though some self-intersections are visible on the ears of heads generated by LED-VAE, this model appears to be better than SD-VAE at replacing attributes.
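Per-attribute latent replacement amounts to swapping the subset z_ω of one attribute between two latent vectors (5 variables per attribute, as in the paper). A minimal sketch, with an illustrative attribute ordering:

```python
import numpy as np

def replace_attribute(z_start, z_finish, attr, vars_per_attr=5):
    """Return a copy of z_start where the subset of latent variables
    controlling attribute `attr` is replaced with the corresponding
    subset from z_finish."""
    z = z_start.copy()
    s = attr * vars_per_attr
    z[s:s + vars_per_attr] = z_finish[s:s + vars_per_attr]
    return z

z_a, z_b = np.zeros(60), np.ones(60)
z = replace_attribute(z_a, z_b, attr=2)
print(z[10:15], z[:10].sum(), z[15:].sum())  # [1. 1. 1. 1. 1.] 0.0 0.0
```

Decoding the resulting vector yields the start shape with a single attribute taken from the finish shape; repeating the swap per attribute reproduces the progressive replacements of Fig. 14.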

Random Generation and Latent Disentanglement
We report more randomly generated shapes and latent traversals than those already depicted in Fig. 4 and Fig. 6. In fact, in Fig. 18 we show shapes obtained by all methods trained on heads from UHM and in Fig. 17

Figure 2: Schematic representation of the local eigenprojection, the operation at the core of our local eigenprojection loss. The signed distance between a given mesh X and a mean shape template is computed as sd(X). sd(X) is locally eigenprojected into a vector z where each subset of variables is a spectral descriptor of a shape attribute. The projection is performed by matrix-multiplying the signed distance by U_ω, the highest-variance eigenvectors of each shape attribute ω. The heads in the bottom part of the figure represent one-dimensional vectors whose values are mapped with diverging colour maps on the mean shape head. On the heads corresponding to the columns of U_ω, the black seams mark the different attributes that we seek to control during the generation procedure.
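The projection described in Figure 2 can be sketched as follows. This is a hypothetical numpy sketch: in the paper each U_ω would hold the selected highest-variance Laplacian eigenvectors of a segment, whereas here random matrices and toy segments stand in for them.

```python
import numpy as np

def local_eigenprojection(sd, eigvecs_per_attr):
    """Project the per-vertex signed distance sd(X) onto each attribute's
    eigenvector matrix U_omega (restricted to that attribute's vertices) and
    concatenate the resulting spectral descriptors into a single vector z."""
    return np.concatenate([U.T @ sd[idx] for idx, U in eigvecs_per_attr])

rng = np.random.default_rng(0)
sd = rng.normal(size=100)  # signed distance, one value per vertex
# two toy attributes of 50 vertices each, kappa = 5 eigenvectors per attribute
attrs = [(np.arange(0, 50), rng.normal(size=(50, 5))),
         (np.arange(50, 100), rng.normal(size=(50, 5)))]
z = local_eigenprojection(sd, attrs)
print(z.shape)  # (10,)
```

Each 5-dimensional slice of z is the spectral descriptor of one attribute, which is the correspondence the local eigenprojection loss enforces on the latent space.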
While α and β are weighting constants, L_R is the reconstruction loss and L_L is a Laplacian regulariser.

Figure 3: Local eigenprojection distributions. All training meshes are locally eigenprojected to observe the distributions of the elements in the resulting vectors. Distributions are colour-coded according to the shape attribute they refer to. The segmentation of the shape attributes displayed next to the distributions is rendered on the mean shape templates of the corresponding dataset. The dashed distributions, which are obtained by sampling a Gaussian, are reported for comparison.
Note that to make C a 1-Lipschitz function, and thus satisfy the Wasserstein distance computation requirements, the C weights are clipped to the range [−c, c].

Figure 4: Random samples and vertex-wise distances showing the effects of traversing three randomly selected latent variables (see Supplementary Material to observe the effects for all the latent variables).

Figure 7: Direct manipulation. Left: the user manually selects an arbitrary number of vertices (blue) and specifies their desired position (red). Right: results of the direct manipulation optimisation for LED-VAE and SD-VAE. For each method, the output shape, a close-up of the manipulated attribute, and the rendering of the per-vertex distances between the initial and manipulated shapes are reported. The colour map used to represent vertex distances is blue where distances are zero and red where they reach their maximum value.

Figure 9: Ablation study. The LED-VAE is ablated by removing the data standardisation and the computation of the local eigenprojection loss on either the encoder or generator. Ablations are also performed selecting the first eigenvectors instead of those associated with the maximum variance, and without standardising the local eigenprojections in the loss computation.
operating on G is set to 1e-4, while the learning rate of the optimiser operating on C is 5e-5. The C network weights are clipped to the range [−c, c] with c = 0.01. The weight of the Laplacian smoothing term is set to α = 10 in WGAN and α = 50 in LED-WGAN. In LED-WGAN the local eigenprojection loss weight is set to η = 0.25 and K = 50 eigenvalues are computed.
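The critic weight clipping used by WGAN and LED-WGAN can be sketched as follows. This is a hypothetical sketch applied to plain numpy arrays; in practice the clipping is applied to every critic weight tensor after each optimiser step.

```python
import numpy as np

def clip_critic_weights(weights, c=0.01):
    """Clip every critic weight tensor to [-c, c] after each update so that
    C stays approximately 1-Lipschitz, as required by the Wasserstein
    distance estimate."""
    return [np.clip(w, -c, c) for w in weights]

# toy critic "weights": one dense layer matrix and one bias vector
W = [np.random.default_rng(0).normal(size=(4, 4)), np.zeros(3)]
W = clip_critic_weights(W)
print(all(np.abs(w).max() <= 0.01 for w in W))  # True
```

With c = 0.01 as above, every weight lands in [−0.01, 0.01], matching the value reported for the C network.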

Figure 10: Effects of the smoothness weight (α) on random samples and latent traversals. The row highlighted in pink reports results obtained with the proposed implementation of LED-VAE. Latent traversals always refer to the same latent variable.

Figure 11: Effects of the local eigenprojection weights on random samples and latent traversals. Rows highlighted in pink report results obtained with the proposed implementation of LED-VAE. The left column shows the effects of changing η_1, which controls the local eigenprojection loss affecting the encoder. The right column shows the effects of changing η_2, which controls the local eigenprojection loss affecting the generator. Latent traversals always refer to the same latent variable.
shapes generated with LED-VAE trained on LYHM, CoMA, and STAR. Then we report the effects caused in the generated shapes by traversing all the latent variables. In particular, Fig. 19 shows the shapes generated by traversing all 5 latent variables in each z_ω for LED-VAE, SD-VAE, LED-LSGAN, and LED-WGAN. Fig. 20 represents the effects of latent traversals for methods that are not able to enforce disentanglement with respect to local shape attributes, namely VAE, DIP-VAE-I, LSGAN, and WGAN. Finally, Fig. 21 reports latent traversal results for LED-VAE trained on LYHM, CoMA, and STAR.

Figure 12: Effects of traversing each latent variable of LED-VAE trained enforcing latent disentanglement with different attribute segmentations. Note that since 5 latent variables are used to represent each attribute, the latent size is 30 with 6 attributes, 60 with 12, and 100 with 20. When 20 attributes are disentangled, we not only segment the supraorbital area, but also separate the left from the right attributes. For instance, while with 12 attributes the left and right eyes were grouped together, they are now separate.

Figure 13: Latent interpolations with LED-VAE and SD-VAE. Two shapes (X_start and X_finish) are randomly selected from the test set. Their latent representations are computed by feeding the two shapes to the encoder network. 10 intermediate latent vectors are obtained by linearly interpolating all the latent variables. Shapes generated from these latent vectors smoothly transition from the reconstructed initial (X_start) to the final shape (X_finish).

Figure 14: Per-attribute latent replacements with LED-VAE and SD-VAE. Subsets of the latent variables corresponding to different head attributes (z_ω) are progressively replaced. While the left-most and right-most heads are the reconstructions of the initial and target shapes, the others are obtained with latent replacements. Each shape is generated starting from the one on its left. For example, the second heads from the left are generated with the latent vector of X_start, replacing the subset of latent variables controlling the eyes of X_start with the subset controlling the eyes of X_finish. Similarly, the third head has the same latent representation as the second one, but the subset of latent variables controlling the ears is also replaced. The remaining shapes are obtained by repeating the same procedure.

Figure 16: Additional per-attribute latent replacements with LED-VAE trained on shapes from STAR (see Fig. 14).

Figure 17: Random samples generated by LED-VAE models trained on shapes from LYHM, CoMA, and STAR.

Figure 21: Complete latent traversals of LED-VAE grouped per-dataset along columns and per-attribute along rows.The LED-VAE models are trained on shapes from LYHM, CoMA, and STAR.

Table 1: Quantitative comparison between our model and other state-of-the-art methods. All methods were trained on UHM [PWP*19]. Diversity, JSD, MMD, COV, and 1-NNA evaluate the generation capabilities of the models, while VP evaluates latent disentanglement. The different metrics are computed as detailed in Sec. 4.3. Note that the training time does not consider the initialisation time.
Since our main objective is to train a generative model capable of generating different identities, we require datasets containing a sufficient number of subjects in a neutral expression (pose). Most open-source datasets for 3D shapes of faces, heads, bodies, or animals (e.g. MPI-Dyna [PRMB15], SMPL [LMR*15], SURREAL [VRM*17], CoMA [RBSB18], SMAL [ZKJB17], etc.) focus on capturing different expressions or poses and are not suitable for identity disentanglement.