PVP: Personalized Video Prior for Editable Dynamic Portraits using StyleGAN

Portrait synthesis creates realistic digital avatars that enable users to interact with others in a compelling way. Recent advances in StyleGAN and its extensions have shown promising results in synthesizing photorealistic and accurate reconstructions of human faces. However, previous methods often focus on frontal face synthesis, and most are not able to handle large head rotations due to the training data distribution of StyleGAN. In this work, our goal is to take as input a monocular video of a face and create an editable dynamic portrait able to handle extreme head poses. The user can create novel viewpoints, edit the appearance, and animate the face. Our method utilizes pivotal tuning inversion (PTI) to learn a personalized video prior from a monocular video sequence. We can then input pose and expression coefficients to MLPs and manipulate the latent vectors to synthesize different viewpoints and expressions of the subject. We also propose novel loss functions to further disentangle pose and expression in the latent space. Our algorithm shows much better performance than previous approaches on monocular video datasets, and it is also capable of running in real-time at 54 FPS on an RTX 3080.


Introduction
Digital avatars offer a compelling way to represent human appearance and expression, enabling various applications like telepresence and virtual reality. Recent advances in StyleGAN [22,23] and neural radiance fields (NeRFs) [30] have accelerated the development of digital avatars [3,18]. Radiance fields have also spawned many extensions [3,14,18,53,55] which capture various characteristics of human portraits, such as expression, appearance and identity.
However, the NeRF-based methods mostly focus on reconstructing the input sequence and do not provide a good way to edit or manipulate various properties. Many of these approaches rely on learning a 3D generative prior, thus requiring inversion into the latent space, which entails generator fine-tuning and an inevitable loss of 3D representation ability. These methods are also non-trivial to extend to video sequences of a subject. We argue that editability also plays an important part in making digital avatars engaging, and it offers the user more control over their desired looks (see Fig. 1). A recent paper, FreeStyleGAN [26], enables editing on static human portraits, but it requires multiple carefully-shot images with the subject holding still for several seconds. Concurrent work [21,41] provides an editing interface to allow user inputs, but these methods are not animatable (e.g., enabling 3DMM-like control [4]) and thus do not offer the same level of interactivity. Lastly, while ClipFace [2] provides editability on 3D morphable models (3DMM) and their texture, it does not handle other facial features like hair.

Table 1. Comparison of different methods across five categories: head pose editing, appearance editing, monocular input, riggability and personalization. Our proposed method bridges the gap between 2D and 3D methods by enabling reconstruction of extreme viewpoints through personalization. 2D methods allow for efficient image synthesis from single-view inputs, but they often cannot handle larger motions; even if the input video contains large motion changes, 2D methods [51,54] are not able to use that information to reconstruct the more difficult viewpoints. On the other hand, 3D methods [14,16] (e.g., HeadNeRF [18]) can handle difficult viewpoints, and NeRFBlendShape achieves some form of personalization in expression; however, they do not allow full appearance editability like the 2D methods. Our method achieves editable head poses, appearance editing via StyleGAN, monocular input, riggability through FLAME-like controls, and personalization with the help of a personalized video prior.
Our proposed method offers a new way to tackle portrait synthesis using a personalized StyleGAN from monocular portrait videos. First, we carefully sample several frames from the input video to use as pivots, which are used to perform PTI [39] and fine-tune the StyleGAN generator to produce a personalized manifold, allowing for faithful reconstruction of the subject under extreme poses and smooth transitions between different head poses. Then, we employ lightweight pose and expression encoders to enable finer control over the representation. We can render novel head poses and expressions in real-time (at 54 FPS) by feeding in the corresponding pitch and yaw angles and FLAME coefficients [27]. We also propose a novel expression matching loss and pose consistency loss to disentangle the editing directions in the latent space, making it easier to change one attribute at a time. Additionally, exploiting various methods in StyleGAN editing [36,40], our method provides good editing capability. Project website and code: https://cseweb.ucsd.edu/~viscomp/projects/EGSR23PVP/

We summarize our contributions as follows: 1. a personalized video prior derived from a monocular portrait video of a given subject (Sec. 4.1); 2. a novel algorithm that enables control of a dynamic portrait within the personalized manifold, allowing editing of pose, expression and appearance (Sec. 4.2, Fig. 2); 3. expression matching and pose consistency losses to better disentangle the poses and expressions given only a short portrait video of a subject (Sec. 4.3).

Related Work
Previous methods have shown promising results in reconstructing photorealistic face renderings. In particular, there exist four promising directions: (a) StyleGAN models, which can synthesize 2D face renderings and allow for user control by traversing their latent space (Sec. 2.1); (b) neural radiance fields (NeRFs), which handle dynamic facial expressions and complex visual effects through volumetric rendering (Sec. 2.2); (c) parametric models, like 3DMM and FLAME, which provide explicit pose and expression control through skinning (Sec. 2.3); and (d) facial reenactment methods, which enable facial expressions to be transferred to different subjects (Sec. 2.4). We give an overview of these methods in the following subsections and a comparison of our method against previous methods in Table 1.

Face Synthesis with StyleGAN
Generative models like StyleGAN2 [23] have demonstrated an impressive capability of synthesizing photorealistic portraits while trained only on in-the-wild imagery. In order to reconstruct human faces, a common approach is to invert the underlying latent code. GAN inversion methods, like e4e [48] and pSp [38], have shown remarkable results in finding the most representative latent code. Given the latent codes, PTI [39] and MyStyle [31] further optimize the StyleGAN generator to learn a personalized latent space, enabling editing while staying faithful to the same identity. In terms of controlling the pose and expression of StyleGAN renderings, StyleRig [46] predicts modified latent codes directly from semantic control parameters, while StyleHEAT [54] performs warping on the intermediate feature layers of the StyleGAN synthesis network. Although these methods show promising results, there are some shortcomings. For example, StyleRig can introduce a shift in identity after rotating the head pose, and the warping scheme used by StyleHEAT cannot handle large viewpoint changes well (see Fig. 5). The aforementioned StyleGAN methods mainly handle 2D images.
While 3D GAN-based methods [5,6,10,17,32] generate photorealistic 3D representations from 2D images, the editability of such models requires further research to reach its full potential. SofGAN [7], IDE-3D [42], FENeRF [43] and CIPS-3D [58] have demonstrated some level of editability over the appearance and expression of 3D GANs. Additionally, inverting a video into their latent space is non-trivial, as each frame is inverted independently, and 3D GAN PTI often collapses to a 2D representation.
As a result, our paper looks at lifting 2D StyleGAN renderings to a 3D representation that allows editing of both the facial appearance and expression. One related work is FreeStyleGAN [26]; however, it requires multi-view input images of a static subject, whereas our method seeks to handle dynamic portraits with only a single-view input video. Please refer to Table 1 for a comparison of different methods.

3D Methods for Dynamic Portraits
NeRF [30] brought about exciting advancements in the field of image-based rendering, and recently a plethora of work tackling portrait reconstruction [3,14,18,33,34,45,55,57] has emerged. NeRFace [13] conditions the MLP with learnable codes and expression coefficients to represent dynamic facial motions. HeadNeRF [18] encodes identity, expression, albedo and illumination into latent codes and uses them to condition the NeRF MLP, providing some level of user control. RigNeRF [3] combines 3DMM deformation fields with NeRF to allow explicit control of the expression and head pose. Nerfies [33] and HyperNeRF [34] provide ways to encode deformable NeRFs, enabling capture of 3D facial movements. Following the above methods, FDNeRF [55] further extends to few-shot inputs, and NeRFBlendShape [14] blends hash grids with different expression coefficients to allow for fast and expressive portrait reconstruction. While the above approaches mostly focus on reenactment and pose manipulation, they do not provide comprehensive appearance editing.
On the other hand, NeRFFaceEditing [21] focuses on the editability of different facial parts. SofGAN [7] provides finer control over different facial parts, including appearance, shape and lighting effects. However, these methods do not offer explicit facial pose control. Our work aims to achieve granular pose control and appearance editing at the same time.

Parametric Model for Human Faces
There has been a lot of interest in using morphable face models to enable facial animation [15,49,52]. The FLAME model [27] offers a good trade-off between controllability and expressiveness. Derivatives like DECA [11] can infer FLAME parameters given only a single image. ROME [24] further expands DECA to include displacements for the head mesh and preserves more details in non-facial regions (e.g., hair, neck and shoulders). A notable work in this category is NHA [16], which optimizes a head model given a single-view video. NHA provides a good geometry estimate of the subject while keeping explicit control over the FLAME parameters. Although NHA supports editing the facial expression and head pose through the FLAME parameters, it does not have an interface for appearance editing. Our proposed method focuses on this aspect and seeks to combine the editability of StyleGAN images with FLAME-like control parameters.

Facial Reenactment Techniques
In addition to the above methods, there are some other techniques that focus on facial reenactment tasks; in other words, these methods transfer expressions from one subject to another. Face2Face [47] achieves real-time face capture and reenactment by fitting a parametric 3D model to monocular RGB input videos. Deep Video Portraits [25] utilizes a translation network to convert coarse facial renderings into photorealistic video portraits. Different from these methods, our work enables reenactment and editing simultaneously through the use of pose and expression encoder networks and the StyleGAN latent space.

Background
In this section, we give a brief overview of StyleGAN and its latent space design. Then, as we seek to edit real portraits, we introduce pivotal tuning inversion (PTI), the state-of-the-art method for GAN inversion that maintains good editability. PTI serves as an important foundation for our method, since we explore the idea of a personalized video prior by fine-tuning the StyleGAN generator on selected frames of a given video (more in Sec. 4). Lastly, we briefly go through editing with StyleGAN.
StyleGAN [23] demonstrates high-quality image synthesis results, and it has a well-structured latent space, which allows smooth transitions between different latent codes through linear interpolation. To be more specific, it takes a latent code z ∈ R^512 (often referred to as the Z space) as input to a mapper network, which maps the input to an intermediate latent code w ∈ W. The latent code w is then affine transformed and fed to each convolutional layer in the synthesis network via adaptive instance normalization (AdaIN) [19]. Allowing a different w code for each synthesis layer yields the extended latent space, often referred to as W+. We can then write the StyleGAN image synthesis as I' = G(w^+(z); θ) for a StyleGAN generator G with weights θ.
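As a toy illustration of these latent spaces (shapes only; in the real model the mapper is an 8-layer MLP and the synthesis network is convolutional, so the names and the single linear map below are our simplifications), the following numpy sketch shows how a single w code is replicated per synthesis layer to form a W+ code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for StyleGAN's mapper and synthesis pipeline; only the
# latent-space shapes match the real model (512-d codes, 18 layers at 1024px).
Z_DIM, W_DIM, N_LAYERS = 512, 512, 18

def mapper(z, weight):
    """Z -> W: an 8-layer MLP in StyleGAN; a single linear map here."""
    return np.tanh(z @ weight)

def to_w_plus(w):
    """W -> W+: one copy of w per synthesis layer; editing methods may then
    modify each layer's copy independently."""
    return np.tile(w, (N_LAYERS, 1))

weight = rng.standard_normal((Z_DIM, W_DIM)) * 0.01
z = rng.standard_normal(Z_DIM)          # a sample in Z space
w = mapper(z, weight)                   # its image in W space
w_plus = to_w_plus(w)                   # the corresponding W+ code
assert w_plus.shape == (N_LAYERS, W_DIM)
```

A W+ code produced by an encoder like e4e generally has different rows per layer; the tiled version above is the special case that lies inside W.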
A way to adapt StyleGAN to editing real images is through GAN inversion. The idea is to freeze the StyleGAN weights and optimize for the latent code in either W [23] or W+ [1]. However, there is a tradeoff between the two: W-space inversion offers better editability but poor reconstruction quality, whereas W+ provides superior reconstruction yet produces inferior editing results. To address this tradeoff, a better way is to find pivot latent codes in the W+ space via e4e [48] and then fine-tune the StyleGAN generator based on the pivots, namely pivotal tuning inversion. With PTI, it is possible to extend StyleGAN to handle samples which differ greatly from the training distribution, for instance, persons with makeup and faces viewed at large angles (≥ 60°). To be more specific, we can define the pivots as

w^+ = E(I), (1)

where E is the e4e encoder, I the input image, and w^+ the inverted latent code in W+ space. We can then synthesize the rendering with the StyleGAN generator G by

I' = G(w^+; θ), (2)

where θ denotes the weights of the StyleGAN generator. Once we acquire the pivots, we can optimize the StyleGAN generator G and obtain the fine-tuned weights θ^p with the following objective:

θ^p = arg min_θ [ L_LPIPS(I, I') + λ^p_L2 L_L2(I, I') + λ^p_R L_R ], (3)

where L_LPIPS is the perceptual loss [56], L_L2 denotes the MSE loss, λ^p_L2 the weight for the MSE loss, L_R the locality regularization loss [39], and λ^p_R the weight for the regularizer. The p superscript denotes personalization, since we also use an L2 loss later for training our mapping networks. Aside from the LPIPS and MSE losses, the locality regularization ensures the PTI changes stay local without affecting other parts of the latent space. Once the network is fine-tuned, it can be seen as a personalized generator which now incorporates a prior based on the given subject, instead of the domain prior it originally had [31].
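The pivotal tuning idea (freeze the pivot latent, optimize only the generator weights) can be illustrated with a deliberately tiny stand-in: a linear "generator" fitted with plain gradient descent on the L2 term alone. This is only a sketch of the optimization structure; the real objective also includes LPIPS and the locality regularizer and fine-tunes a StyleGAN with Adam:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "generator": a linear map from a latent vector to a flattened image.
LATENT, PIX = 64, 256
theta = rng.standard_normal((PIX, LATENT)) * 0.1   # generator weights to tune
w_pivot = rng.standard_normal(LATENT)              # pivot from the encoder (frozen)
target = rng.standard_normal(PIX)                  # the input frame I

def render(theta, w):
    return theta @ w

lr = 1e-3
losses = []
for _ in range(200):
    residual = render(theta, w_pivot) - target     # I' - I
    losses.append(float(residual @ residual))      # L2 term of the objective
    # Gradient step on the generator weights only: the pivot latent stays
    # frozen, which is the defining property of pivotal tuning.
    theta -= lr * np.outer(residual, w_pivot)

assert losses[-1] < losses[0]
```

The reconstruction error drops while the pivot itself never moves, mirroring how PTI pulls the generator toward the subject rather than pulling the latent toward the generator.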
Finally, we can edit an inverted image by manipulating the latent code, which can be described as

w^+_edit = w^+ + ∆w^+, (4)

where ∆w^+ could be any editing direction from InterfaceGAN [40] or GANSpace [20].
In the next section, we discuss our proposed algorithm: we start by sampling frames from the input video, perform PTI on the selected frames to learn a personalized video prior, and then learn pose and expression mapper networks to represent different head poses and facial expressions on the personalized manifold.

Proposed Method
Given a monocular video of a dynamic portrait, we seek to produce an editable representation with fine control over the pose, expression and appearance of the subject. To achieve this goal, we develop an optimization pipeline with two stages: (a) learning the personalized video prior in StyleGAN, and (b) training pose and expression mapper networks. We discuss the first stage in Sec. 4.1. Then, we describe the second stage, namely how we train the pose and expression mappers and how we use them to control different head poses and expressions, in Sec. 4.2. Last, we describe our loss design for the mappers in Sec. 4.3. Please refer to Fig. 2 for an overview of our optimization pipeline.

Personalized Video Prior in StyleGAN
Our main goal is to generate an edited portrait video with StyleGAN2 [23]. A naive way to address this is to perform GAN inversion on a frame-by-frame basis. However, previous methods generally struggle with extreme viewing angles (∼90°). This is because large head rotations (> 40° in yaw, > 25° in pitch) are less represented in the StyleGAN latent space, as shown in the left diagram in Fig. 3. We notice that these out-of-domain (OOD) samples can be better represented with PTI [39], as discussed in Sec. 3. Another property we discovered is that interpolation between the pivots offers a smooth transition between different head poses (see the right diagram of Fig. 3), and we can represent head motions as a linear combination of different pivot vectors, similar to the notion of linear blending in animation. As a result, our idea is to seek out these pivots in the given portrait video and construct a manifold that allows animating the portrait under user control.
As discussed, when presented with an image sequence, our task is to seek a personalized video prior p that forms a local manifold W^p within the latent space of the fine-tuned StyleGAN G(·; θ^p). First, we preprocess the images by aligning and cropping the face following FFHQ [22]. As the facial landmark detection could introduce slight inconsistencies and cause the results to flicker, we employ Gaussian filters to smooth out the noise [12,50]. Since there are many frames in a video, we seek to select the most useful frames as pivots to ensure efficient learning of the personalized manifold. We design a sampling strategy to uniformly sample different head poses and expressions throughout the video to ensure good coverage of all possible facial changes. To be more specific, we utilize an off-the-shelf face detector, DECA [11], to estimate the yaw ψ, pitch ϕ, neck pose κ, jaw pose γ and expression parameters ξ of a given sequence. We then stack ψ, ϕ and ξ channel-wise and perform K-means clustering [28] to select the K most prominent clusters. While uniform sampling sometimes yields similar results, our sampling strategy avoids oversampling cases where the pose or expression stays similar. We find that this strategy provides good coverage of the different expressions and head poses in the input video. We define the K samples from the video as I = {I_1, I_2, ..., I_K}. Finally, we optimize the StyleGAN generator, similar to Eq. 3, with the following objective:

θ^p = arg min_θ Σ_{k=1}^{K} [ L_LPIPS(I_k, I'_k) + λ^p_L2 L_L2(I_k, I'_k) ] + λ^p_R L_R. (5)

The main difference from Eq. 3 is that we optimize over the K images instead of only one image. From this point forward, we introduce the shorthand G_p = G(w^+; θ^p). Once we acquire the fine-tuned StyleGAN G_p, we can define the β-dilated convex hull [31] in the W+ space as the local manifold W^p:

W^p = { Σ_{i=1}^{K} α_i w_i | Σ_{i=1}^{K} α_i = 1, α_i ≥ -β }, (6)

where α_i is the blending weight for pivot w_i. As discussed in Sec.
3, interpolation between StyleGAN latent codes provides smooth transitions between different pivots. We also notice that interpolating between poses at 90° and −90° in W^p gives stable and high-quality renderings of the poses in between, while maintaining the same identity of the subject.

Discussion. Our proposed fine-tuning stage has two differences from MyStyle [31]. First, MyStyle uses 100 to 200 images of the given subject under different settings, including lighting, ages, expressions and hairstyles, whereas our method focuses on frames from the same input video, where the subject has a similar appearance throughout. This difference makes redundant samples and overfitting more likely due to less variety in the training samples. However, our proposed sampling strategy ensures the samples differ enough in pose or expression by clustering and selecting representative frames from the input video. This way, our method can represent a wide variety of facial expressions and poses given a short video of around 20 seconds. Another difference is that fine-tuning the StyleGAN without any regularizer L_R could lead to degraded editability (see Fig. 9). We theorize that this is because the fine-tuned StyleGAN overfits to the input video and causes other parts of the latent space to lose the smoothness of the original StyleGAN. To prevent this, it is important to apply the locality constraint in our setting to keep the structure of the StyleGAN latent space relatively unchanged after the fine-tuning. It is also beneficial to use StyleGAN's expressive latent space to synthesize expressions which are not observed in the input video. We design a loss function that utilizes random expression objectives to encourage the network to synthesize these unseen expressions. More details are in Sec. 4.2 and Sec. 4.3.
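The pivot-sampling strategy of this subsection can be sketched as follows. This is a minimal illustration on mock DECA coefficients with a hand-rolled K-means (a real implementation could instead use sklearn.cluster.KMeans); the frame count, K, and the lack of per-channel scaling are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frame DECA estimates for a mock 600-frame video: yaw psi and pitch phi
# (one value each) stacked channel-wise with the 50-d expression vector xi.
n_frames, K = 600, 16
feats = np.concatenate([
    rng.uniform(-1.0, 1.0, (n_frames, 2)),   # yaw, pitch
    rng.standard_normal((n_frames, 50)),     # expression coefficients
], axis=1)

# Minimal K-means (Lloyd's algorithm) over the stacked features.
centers = feats[rng.choice(n_frames, K, replace=False)]
for _ in range(20):
    dists = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = dists.argmin(1)
    for k in range(K):
        if (labels == k).any():
            centers[k] = feats[labels == k].mean(0)

# Use the frame closest to each cluster center as a pivot image I_k,
# avoiding oversampling near-duplicate poses/expressions.
dists = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
pivot_ids = sorted(set(int(dists[:, k].argmin()) for k in range(K)))
assert 0 < len(pivot_ids) <= K
```

The selected `pivot_ids` index the frames that would then go through PTI to build the personalized manifold.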

Pose and Expression Mapping Networks
A key contribution of our proposed method is the ability to easily control the face with parameters like yaw, pitch and a 50-dimensional expression vector from DECA [11]. To this end, we implement the pose and expression mappers, which are MLPs taking pose and expression coefficients as input and producing the corresponding latent code (see Fig. 4 for examples).
First, in order to change the head pose of the rendering, we manipulate the latent code by moving within the personalized manifold W^p. Specifically, we employ an MLP F_rot to calculate the latent code w_rot as a linear combination of the pivots {w_i}_{i=1}^{K}:

w_rot = Σ_{i=1}^{K} α_i^rot w_i, where {α_i^rot}_{i=1}^{K} = F_rot(ϕ, ψ; θ_r), (7)

ϕ denotes the pitch, ψ the yaw of the head, and θ_r represents the weights of F_rot. We use the MLP to predict the blending weights instead of a latent code in W or W+ space, because the blending weights effectively constrain the latent code to stay inside the manifold, where all points are meaningful and represent the subject (see the video results in the supplementary materials for an example). On the other hand, it is possible for editing directions in W and W+ space to go beyond the personalized manifold and degrade the rendering significantly. Lastly, we introduce expression control by learning a mapping network F_expr that takes as input the jaw pose γ and the expression parameters ξ and outputs a latent-code offset ∆w_expr in W+ space. Specifically, we have

∆w_expr = F_expr(γ, ξ; θ_e). (8)

We then add it to the latent code w_rot:

w = w_rot + ∆w_expr. (9)

In practice, we only predict the first 8 layers of the latent code instead of the full 18 layers. This is because we would like to focus on geometry changes instead of other appearance changes. We are effectively predicting a local change in the latent space to represent changes like lip motions during a talking sequence and facial expressions like raising of the eyebrows. The local neighborhood of the latent space contains a rich and disentangled representation of the subject's possible facial expressions. Note that it is possible to move the latent code slightly outside the convex hull. To regularize, we only use the first 8 layers of W+ to reduce degrees of freedom, and we apply L_local (see Sec. 4.3) to keep the changes small. Finally, we can render the novel view I' with the fine-tuned StyleGAN generator G_p:

I' = G_p(w_rot + ∆w_expr). (10)

Note that once the mapping networks are trained, they can be run in real-time given the pose and expression parameters.
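The two mappers and the latent composition can be sketched as below. The hidden sizes, the softmax used to produce normalized blending weights, the 3-d jaw pose, and the random stand-in weights are our assumptions for illustration; the paper's trained MLPs and exact parameterization may differ (e.g., slightly negative weights are permitted by the dilated hull):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N_LAYERS, W_DIM = 16, 18, 512
pivots = rng.standard_normal((K, N_LAYERS, W_DIM))   # fine-tuned pivots w_i

def mlp(x, sizes, seed):
    """Tiny random-weight MLP; a stand-in for the trained F_rot / F_expr."""
    r = np.random.default_rng(seed)
    for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:])):
        x = x @ (r.standard_normal((m, n)) * 0.1)
        if i < len(sizes) - 2:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers
    return x

def f_rot(pitch, yaw):
    """Pose mapper: (phi, psi) -> K blending weights over the pivots."""
    logits = mlp(np.array([pitch, yaw]), [2, 64, K], seed=1)
    e = np.exp(logits - logits.max())
    return e / e.sum()               # softmax keeps w_rot inside the hull

def f_expr(jaw, expr):
    """Expression mapper: (gamma, xi) -> offset for the first 8 W+ layers."""
    out = mlp(np.concatenate([jaw, expr]), [53, 64, 8 * W_DIM], seed=2)
    return out.reshape(8, W_DIM)

alpha = f_rot(0.1, -0.4)
w_rot = np.einsum('k,kld->ld', alpha, pivots)        # blend of pivots
w = w_rot.copy()
w[:8] += f_expr(rng.standard_normal(3), rng.standard_normal(50))
assert abs(alpha.sum() - 1.0) < 1e-6 and w.shape == (N_LAYERS, W_DIM)
# w would then be fed to the fine-tuned generator to render the frame.
```

Because only the first 8 layers receive the expression offset, the later layers (which mostly govern appearance) are untouched, matching the geometry-only intent described above.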

Loss Design for Mapping Networks
Next, we fix the StyleGAN generator weights θ^p and train the mapping networks F_rot and F_expr with the loss design discussed in this section, optimizing for their weights θ_r and θ_e. As we aim to reconstruct an input video, we supervise the output with rendering losses like LPIPS [56] and L2 loss, denoted by L_LPIPS and L_L2, respectively. Additionally, to ensure the synthesized identities are the same, we apply an identity loss L_id [48], which uses a pretrained ArcFace [9] network to obtain identity features.
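The identity term is typically a cosine-distance penalty between face embeddings; a minimal sketch follows, with random vectors standing in for the ArcFace embeddings (the `identity_loss` name and the exact 1 - cosine form are our assumptions about the standard formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def identity_loss(feat_a, feat_b):
    """1 - cosine similarity between two face-embedding vectors.

    In the actual pipeline the embeddings come from a pretrained ArcFace
    network applied to the rendered and target images.
    """
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return 1.0 - float(a @ b)

feat = rng.standard_normal(512)                    # "target" embedding
noisy = feat + 0.05 * rng.standard_normal(512)     # "rendered" embedding
assert identity_loss(feat, feat) < 1e-9            # identical faces: ~0 loss
assert identity_loss(feat, noisy) < identity_loss(feat, -feat)
```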
Since the input video often shows only one combination of head pose and expression at a time, it is insufficient to train on the input video alone. To avoid overfitting, we introduce regularization by synthesizing images with a slightly perturbed expression input. Precisely, for each target view I with corresponding jaw pose γ and expression ξ, we render an additional image I_e with G_p by changing the input to the perturbed parameters γ' = γ + ϵ, ξ' = ξ + ϵ, where ϵ ∼ N(0, σ) is drawn from a normal distribution. Then we feed the new rendering I_e to the encoder of DECA [11] to predict yaw ψ_e, pitch ϕ_e, jaw pose γ_e, expression ξ_e and neck pose κ_e, and define the expression matching loss as:

L_expr = ||ξ_e − ξ'||^2 + ||γ_e − γ'||^2. (11)

We found this loss to improve the disentanglement of expression and head pose, since the synthesized portrait has the same head pose but a slightly different facial expression. This additional objective increases the diversity of facial expressions that the expression mapper can generate. Visualizations of the perturbed renderings are shown in Appendix A. Furthermore, we enforce a pose consistency loss and an RGB consistency loss to make sure I_e does not show large head motions. We define the pose consistency loss as:

L_pose = ||ψ_e − ψ||^2 + ||ϕ_e − ϕ||^2 + ||κ_e − κ||^2, (12)

where ψ and ϕ are the yaw and pitch of the input frame. Similarly, κ is the neck pose for the input frame, and κ_e denotes the DECA-estimated neck pose of the perturbed rendering. Note that we do not feed the neck pose κ as an input to the pose MLP. For the RGB consistency loss, we calculate a mask m excluding the eyes and mouth regions, and then minimize the loss between the perturbed rendering I_e and the original rendering I':

L_cons = ||m ⊙ (I_e − I')||^2. (13)

To further encourage the expression network to explore local regions in the manifold, we apply a regularization loss on the predicted latent offset ∆w_expr, namely,

L_local = ||∆w_expr||^2. (14)

Lastly, our training objective is defined as follows:

(θ_r, θ_e) = arg min_{θ_r, θ_e} λ_LPIPS L_LPIPS + λ_L2 L_L2 + λ_id L_id + λ_expr L_expr + λ_pose L_pose + λ_cons L_cons + λ_local L_local, (15)

where we update the pose network weights θ_r and the expression network weights θ_e.
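To make the perturbed-input bookkeeping concrete, here is a hypothetical numpy sketch of assembling the expression-matching, pose-consistency and locality terms for one frame. The DECA re-estimates on the perturbed rendering are stubbed with mock values (a real pipeline would render I_e and run the DECA encoder), and the loss weights follow Sec. 5.1:

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.5
lam = {"expr": 0.1, "pose": 0.1, "local": 0.5}   # weights from Sec. 5.1

def sq(a, b):
    """Mean squared difference between two coefficient arrays/scalars."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return float(((a - b) ** 2).mean())

# DECA parameters for one training frame.
xi, gamma = rng.standard_normal(50), rng.standard_normal(3)
psi, phi, kappa = 0.2, -0.1, np.zeros(3)

# Perturbed inputs used to render the extra image I_e.
xi_p = xi + rng.normal(0.0, SIGMA, 50)
gamma_p = gamma + rng.normal(0.0, SIGMA, 3)

# Mock stand-ins for DECA's re-estimates on I_e.
xi_e, gamma_e = xi_p + 0.01, gamma_p - 0.01
psi_e, phi_e, kappa_e = psi + 0.005, phi - 0.005, kappa + 0.01

# Expression matching: I_e should realize the perturbed expression...
loss_expr = sq(xi_e, xi_p) + sq(gamma_e, gamma_p)
# ...while pose consistency keeps its head pose at the original frame's.
loss_pose = sq(psi_e, psi) + sq(phi_e, phi) + sq(kappa_e, kappa)
# Locality term on the predicted latent offset.
delta_w = rng.standard_normal((8, 512)) * 1e-3
loss_local = sq(delta_w, np.zeros_like(delta_w))

total = lam["expr"] * loss_expr + lam["pose"] * loss_pose + lam["local"] * loss_local
assert total >= 0.0
```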

Experiments
We demonstrate the efficacy of our proposed method against SOTA 2D methods [37,51,54] and a 3D method [16] in this section. We first describe implementation details in Sec. 5.1. We show quantitative comparisons in Sec. 5.2 and Table 2. Then we provide visual results in Sec. 5.3 and Figs. 5 to 8. Finally, in Sec. 5.4, we discuss limitations and possible future directions.

Implementation Details
We implement our algorithm using PyTorch [35]. We first run PTI [31] with the LPIPS threshold set to 0.03 and the maximum number of PTI steps set to 350 on the samples from the input video to acquire a personalized manifold. Then we train our pose and expression mappers for 50k steps. We set the learning rate for the MLPs to 5 × 10^−4. For the loss functions in Sec. 4.3, we use λ_LPIPS = 10, λ_L2 = 10, λ_id = 0.5, λ_pose = 0.1, λ_expr = 0.1, λ_cons = 1.0, and λ_local = 0.5. We set σ = 0.5 for the perturbed expression parameters. Further details can be found in Appendix B.
For the datasets, we use the monocular video datasets from NHA [16] and NeRFBlendShape [14]. For the NHA dataset, we choose the first 750 frames as training data. Because the last few frames are corrupted in the original video data provided by the authors, we choose frames 751 to 1450 as the evaluation dataset instead of frames 751 to 1500. The NeRFBlendShape dataset has videos with 3000 to 4000 frames, and we choose the last 500 frames as evaluation data. We briefly describe the dataset composition in Appendix C. Source code can be found on our website: https://cseweb.ucsd.edu/~viscomp/projects/EGSR23PVP/.

Quantitative Results
To evaluate the visual quality, we use metrics like peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and perceptual loss (LPIPS). We evaluate all methods on the held-out views from the evaluation dataset discussed in Sec. 5.1. As for the baseline methods, we evaluate at a resolution of 512 × 512 and remove the background as a preprocessing stage to ensure fairness when comparing to methods like NHA [16]. We also set the first frame of the video as the source frame for the 2D-based methods and train NHA on the same training set as our method. We show the quantitative results in Table 2.

Figure 5. Visual results of different methods. We show reconstruction results on the held-out views. Our method can predict better geometry and appearance of the subject. For example, in columns (a), (b) and (c), we can see that the ear reconstruction failed for NHA [16]. The 2D methods mostly fail to reconstruct the correct details and expressions. To be more specific, PIRenderer [37] and StyleHEAT [54] both fail for large viewpoint changes, with StyleHEAT generating overly-smoothed results, as shown by the red arrow in column (c). LIA [51] has an overall tone shift and fails to reconstruct details correctly, like the earring shown in (d). Our method offers faithful reconstruction of the head pose and expressions of the subject, as well as details like the tint on the eyeglasses in (f).
Our method offers the best performance across all metrics. Specifically, it can handle difficult head poses like 90° to the left and to the right. Moreover, the personalized manifold enhances the details in the reconstructed images, as the StyleGAN generator is fine-tuned to produce highly similar images to the input video. Note that although our method is 2D-based, it can still produce multi-view consistent imagery. This is because the head rotations are mostly represented by interior points of the personalized manifold, and we observe that interpolation between the pivots shows good consistency in identity and geometry. As a result, our method can provide better visual quality than 3D methods such as NHA. Finally, our method supports real-time rendering thanks to its lightweight mapping networks. The inference time for our method is 0.018 s, which is about 54 FPS, on an RTX 3080 GPU.

Qualitative Results
Figure 5 demonstrates the visual results of our method against other baseline methods. Our algorithm is able to reconstruct face poses at extreme viewpoints, while previous methods like StyleHEAT and PIRenderer fail to generate reasonable results. We can observe distorted cheeks in (a) for both methods and even some repeating patterns near the back of the head. Moreover, StyleHEAT often produces changes in the subject's eyes, as can be seen in (b) and (c). The StyleHEAT results are overly smoothed, leading to missing details like the wrinkles seen in (d) and (e). On the other hand, PIRenderer, while keeping some details, misses the mouth texture, such as the missing teeth shown in (d).

In Figs. 4 and 6, we demonstrate how our method can be used to generate rotated views of the subject while fixing the expression. Previous methods [54] often have a difficult time handling extreme viewpoints, as these viewpoints are out of the distribution of the pretrained StyleGAN generator. Since we construct a personalized manifold, our method can represent extreme head poses well, as long as they are present in the input views. Note that we choose to learn the blending weights in the manifold, instead of an editing direction which controls the head motion. This design choice stems from our finding that while linear interpolation changes smoothly between pivots, it does not fully represent the desired head motion. In other words, the personalized manifold might still contain some nonlinearity, which causes the actual rotation to follow a curved path instead of a linear one. The learned MLP ensures that we can follow a correct trajectory to represent different head poses in a high-dimensional latent space.

Figure 8. Editing results of our method. We show a random frame from the input video sequence and the edited renderings with different head poses and facial expressions. Below each inset, we show the StyleCLIP prompt used to edit the latent code and produce different editing results. Our method preserves the same identity and edits across different expression and pose changes. For example, in the "Kid" prompt, we show various expressions of the subject as a younger avatar while keeping a similar identity. Also, note that the tint in the "Chubby" prompt is preserved and shows view-dependent changes across different viewpoints.
Moreover, we show reenactment results in Fig. 7. Our method can be used to reenact different subjects with the provided expression parameters. We extract the FLAME parameters ψ, ϕ, γ and ξ from the driving sequence in (d) and feed them to the learned mapping networks for each identity in (a), (b) and (c). Then, we can produce reenactment results with good accuracy. For instance, in the rightmost column, we show reenactment results of all avatars with a mouth-open expression. Each avatar performs this expression properly with its unique and accurate teeth texture, which is often difficult to recover with previous methods. We notice that since the FLAME parameters might have different distributions for each individual, it is better to renormalize the input parameters with the mean and standard deviation of the source and target.
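The renormalization mentioned above can be sketched as a per-channel z-score transfer between the driving subject's and the target subject's FLAME coefficient statistics (the function and variable names below are ours, and the mock data stands in for per-subject parameter tracks):

```python
import numpy as np

def renormalize(driving, src_stats, tgt_stats):
    """Map driving-sequence FLAME coefficients into the target's distribution.

    src_stats / tgt_stats are (mean, std) tuples computed per coefficient
    channel over each subject's own video.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    return (driving - mu_s) / (sd_s + 1e-8) * sd_t + mu_t

rng = np.random.default_rng(0)
src = rng.normal(0.5, 2.0, (100, 50))    # source subject's expression track
tgt = rng.normal(-0.2, 0.5, (100, 50))   # target subject's expression track
src_stats = (src.mean(0), src.std(0))
tgt_stats = (tgt.mean(0), tgt.std(0))

out = renormalize(src, src_stats, tgt_stats)
# The transferred coefficients now match the target's per-channel statistics.
assert np.allclose(out.mean(0), tgt.mean(0), atol=1e-6)
```

Without this step, a driving subject with exaggerated expressions could push the target's expression mapper outside the coefficient range it was trained on.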
Finally, we show editing results in Fig. 8. In the figure, we demonstrate various editing tasks using StyleCLIP [36]. Our method can also be adapted to use other editing methods like InterfaceGAN [40] and GANSpace [20]. Note that our method provides consistent renderings and maintains the edits even after the head is rotated. Furthermore, we show that the edited avatar can still produce various expressions, as shown in the "Kid" inset. The edited latent code also shows good multi-view consistency. Most notably, the "Eyeglasses" inset shows rotated versions of the subject wearing the eyewear with good geometry across different head poses. Some view-dependent effects are retained, for instance, the tint on the eyeglasses in the "Chubby" inset.
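Applying one edit across all viewpoints amounts to adding the same latent offset to every per-pose code, which is why the edits survive head rotation. A minimal sketch (the direction vector stands in for one found by StyleCLIP; shapes and names are illustrative):

```python
import numpy as np

def edit_across_poses(pose_codes, edit_direction, strength=1.0):
    """Apply a single latent edit direction to every per-pose latent code.

    pose_codes:     (N, D) latent codes for N head poses of the same subject.
    edit_direction: (D,) direction, e.g. derived from a StyleCLIP text prompt.
    Adding the same offset to each code keeps the edit consistent across poses.
    """
    return pose_codes + strength * edit_direction

codes = np.zeros((3, 4))                # toy codes for 3 head poses
direction = np.array([1., 0., 0., 0.])  # hypothetical "Eyeglasses" direction
edited = edit_across_poses(codes, direction, strength=0.8)
```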
It is worth noting that while we fine-tune the StyleGAN generator on selected video frames, the editing capability is not impacted, thanks to the regularizer L_R. A comparison of results with and without the regularizer can be found in Fig. 9. Since we only have a handful of views of the given subject, the fine-tuning stage is highly likely to overfit to the distribution of these views. It is therefore critical to enforce the regularizer to avoid degraded editing results. For instance, without the regularizer, the edited images often show color-shift artifacts, as shown in "Eyeglasses" and "Long Hair". Additionally, details and expressions are not retained after the edits, as shown in the "Chubby" and "Man" insets.

Limitations and Future Work
While our proposed method shows promising results in enabling editing of pose, expression, and appearance of digital avatars, there are several limitations that we would like to address in future work.
First, our method only handles the regions within the face-alignment bounding box, excluding the back of the head and the upper body. Future work could improve the StyleGAN generator or the cropping to incorporate these regions, allowing for a more comprehensive representation of the subject.
In addition, our approach requires learning a personalized manifold for each subject, which takes some time to optimize. More specifically, learning the manifold takes around 3 hours for 200 pivot images on an RTX 3080, and training the mapping networks takes around 4 hours on an NVIDIA A10 24 GB GPU. Note that after the optimization stage, our pipeline runs in real time at 54 FPS on an RTX 3080. It could be interesting to explore meta-learning and train a network that outputs a personalized manifold directly from images. The mapping network could also be pretrained on a large-scale dataset and applied to each avatar directly, or after a fast fine-tuning stage. This could remove the optimization overhead and enable faster adaptation to new subjects.
We also note that the gaze or eye regions are sometimes imperfect. However, this is not a limitation of our network design, but rather of the DECA algorithm. Future work could swap DECA for better facial-expression detectors like EMOCA [8] and add gaze regularization. Finally, our method works best for interpolation and may not perform well when extrapolating beyond the training data. Future work could explore additional regularization or data-augmentation techniques to enable extrapolation and improve the overall generalization of the model.

Ethical considerations
The success of recent approaches in synthesizing photorealistic, editable representations of a given subject, as achieved in this paper, has necessitated the introduction of various methods to detect whether an image is fake [29]. However, the best detection methods can often be used as critics in the training paradigm of state-of-the-art generative or editing models in order to avoid detection. Additionally, as generative models improve, existing detection methods may fail to keep up. To prevent the misuse of editing methods, the development of more robust detection and verification techniques is paramount.

Conclusions
We propose a novel algorithm that encodes a monocular portrait video into a personalized manifold to enable editing of pose, expression, and appearance. Our approach selects useful pivots from the video sequence, allowing for efficient learning of the personalized manifold. We also design loss functions to learn pose and expression mapping networks, enabling granular control of the rendering given only a single video. Moreover, our method provides good editing capability through various StyleGAN editing methods. Overall, our work contributes to the development of digital avatars by making them more interactive, engaging, and personalized.

Figure 2. Overview of our proposed method. Our algorithm is divided into two stages. First, we use PTI to fine-tune the StyleGAN generator with the selected frames from the input video (see Sec. 4.1). The fine-tuned StyleGAN now has a personalized manifold that reconstructs the pivots faithfully, and we utilize its smoothness to interpolate between the pivots and represent different head poses. Then, we train the pose MLP to output the blending weights of the personalized manifold, providing a latent code that represents the rotated head. We use the expression MLP to calculate the W+ latent code residuals (see Sec. 4.2). Finally, the combined latent code is sent to the fine-tuned StyleGAN to synthesize the final rendering, which we supervise with a reconstruction loss and other supervision (see Sec. 4.3).

Figure 3. Head pose distribution of vanilla StyleGAN2 and the personalized manifold. We randomly synthesize 10k facial samples with the truncation parameter set to 0.5. Then we use DECA [11] to detect the yaw and pitch of the head. Most head poses have yaw inside [−60°, 60°] and pitch in [−20°, 30°], making it difficult to invert to head poses outside of this distribution. In the right figure, we randomly sample linear combinations of the pivots and synthesize novel views with StyleGAN. We highlight the head poses from the input video in red and the samples from the personalized manifold in black. The learned personalized manifold can reconstruct the extreme yaw poses (≥ 60°) in the input video.
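The manifold sampling in the right panel can be sketched by drawing random convex combinations of the pivots (the Dirichlet weighting here is an illustrative way to produce weights that are non-negative and sum to one, not necessarily the paper's exact sampling scheme):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_manifold(pivot_latents, n_samples):
    """Draw random convex combinations of the pivot latent codes.

    Dirichlet weights are non-negative and sum to 1, so every sample lies
    inside the convex hull of the pivots, i.e. on the personalized manifold.
    """
    k = pivot_latents.shape[0]
    weights = rng.dirichlet(np.ones(k), size=n_samples)  # (n_samples, K)
    return weights @ pivot_latents

samples = sample_manifold(np.eye(3), n_samples=100)  # toy 3-pivot example
```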

Figure 4. Examples of latent codes described in Sec. 4.2. (a) The personalized manifold W_p provides a good representation of the subject and allows us to further edit the pose and expression. (b) The rotation code w_rot enables free pose control of the synthesized portraits. Here we show specific yaw (−60° to 60°) and pitch (−15° to 15°) controls. (c) The expression code w_expr allows finer control over the facial expressions.

Figure 5. Visual results of different methods. We show reconstruction results on the held-out views. Our method predicts better geometry and appearance of the subject. For example, in columns (a), (b) and (c), we can see that the ear reconstruction failed for NHA [16]. The 2D methods mostly fail to reconstruct the correct details and expression. More specifically, PIRenderer [37] and StyleHEAT [54] both fail for large viewpoint changes, with StyleHEAT generating overly smoothed results, as shown by the red arrow in column (c). LIA [51] has an overall tone shift and fails to reconstruct details correctly, like the earring shown in (d). Our method offers a faithful reconstruction of the head pose and expressions of the subject, as well as details, like the tint on the eyeglasses in (f).

Figure 6. Visual results of our method with extreme head poses. A key contribution of our method is enabling direct control over head poses with StyleGAN renderings. Our method can synthesize renderings with yaw from −90° to 90°, which previous 2D-based methods cannot achieve.

Figure 7. Reenactment results of our method. We show reenacted sequences in (a), (b) and (c), and the driving sequence in (d). Our method is able to transfer expressions across different subjects.

Figure 9. Effects of the regularizer on the editing results. We observe that using selected frames from the input video makes it easier to overfit. Consequently, unlike MyStyle [31], we apply a regularizer during the fine-tuning stage to maintain the editing ability of StyleGAN.

Table 2. Evaluation of our method and baseline methods. Our renderings offer the best visual quality across all metrics. 2D-based methods struggle to handle viewpoints at larger angles, resulting in poor visual quality. While 3D-based methods like NHA can handle larger head motions, they fail to reconstruct good 3D geometry for in-the-wild videos, leading to inferior results across all metrics.