IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions

Can we make virtual characters in a scene interact with their surrounding objects through simple instructions? Is it possible to synthesize such motion plausibly with a diverse set of objects and instructions? Inspired by these questions, we present the first framework to synthesize the full-body motion of virtual human characters performing specified actions with 3D objects placed within their reach. Our system takes textual instructions specifying the objects and the associated intentions of the virtual characters as input and outputs diverse sequences of full-body motions. This contrasts with existing works, where full-body action synthesis methods generally do not consider object interactions, and human-object interaction methods focus mainly on synthesizing hand or finger movements for grasping objects. We accomplish our objective by designing an intent-driven full-body motion generator, which uses a pair of decoupled conditional variational auto-regressors to learn the motion of the body parts in an autoregressive manner. We also optimize the 6-DoF pose of the objects such that they plausibly fit within the hands of the synthesized characters. We compare our proposed method with the existing methods of motion synthesis and establish a new and stronger state-of-the-art for the task of intent-driven motion synthesis.


Introduction
Humans regularly use and interact with objects in numerous ways in the real world. Interactions like eating a fruit or brushing our teeth, as shown in Fig. 1, are part of our daily routines. Being able to synthesize such interactions in a virtual 3D environment through textual instructions has widespread applications in several areas, including computer graphics and robotics [ALNM20; HTBT22; WLK*22], movie script visualization [HMLC09], and game design [SSR07]. For instance, in a digitally created movie scene or a virtual role-playing game, it is natural for the character to interact with the scene objects based on a set of instructions, such as wielding tools, using objects, or eating various items. Manually modeling such 3D character-object interactions or intentions is time-consuming and laborious, especially when we desire to synthesize a variety of possible motions with the same intention and object. Existing data-driven approaches partially automate this process: autoregressive motion synthesis frameworks predict short-term future sequences from a short history. There are also several methods for hand-object interactions [KYZ*20; TGBT20; JLWW21; ZYSK21; CKA*22], which focus on generating only the wrist and finger movements for grasping various objects. However, modeling hand motion alone is insufficient to create a plausible motion sequence for an intent-driven virtual character. Instead, we believe it is crucial to operate in the space of full-body motion synthesis. There are two prime reasons for this. Firstly, synthesizing full-body movements allows for a broader range of interactions (Fig. 1). For several intents, such as eating, drinking, inspecting, passing, or exchanging objects between hands, the head, the arms, and the torso are also part of the complete action sequence [TGBT20]. Secondly, trivially attaching the synthesized hand motion to the remaining body [PRB*18] leads to an uncanny and physically implausible motion generation (see supplementary video). Further, recent works [TCBT22; WWZ*22] have demonstrated the ability to generate whole-body grasping motion starting from a T-Pose till the moment of the grasp. However, synthesizing a plausible motion sequence after the first grasp moment, based on an intent guiding the human-object interaction, remains unaddressed.
To address these limitations, we propose IMoS, a novel framework to synthesize diverse, full-body motion sequences of human-object interactions. Crucially, we synthesize the motions based on the input textual instructions consisting of actions (intentions) and objects (Fig. 2). We learn generalizable intent encodings from the input intent-object pairs using a CLIP encoder [RKH*21], which is a large-scale language model trained on a large corpus of text-image pairs. Given the initial body poses and the 3D object positions, we design an intent-driven full-body motion generator model to autoregressively generate full-body motions (Sec. 3). We follow a decoupling approach and model the arm and body motions using separate conditional variational autoregressors to make our output arm and body movements more precise. Since these autoregressors are variational in nature, they allow us to sample diverse motions from the latent space at inference time. We also observe that regressing the motion from a larger past context is crucial in modeling the long-term temporal dependence between the joints. We use a position-encoded self-attention mapping to model correlations between the different joints to allow a broader range of interactions. Lastly, we perform an optimization routine to estimate the corresponding 6-DoF object positions relative to the hand position in each frame (Sec. 3.2.4). We use the recovered object positions to condition future motion synthesis.

Figure 2: Overview of Our Intent-Driven Full-Body Motion Generator. Our model takes in the initial 3D body poses and object positions (upper-left) and instruction labels (upper-middle) describing the object types and the intended actions. We design a pair of decoupled conditional variational auto-regressors, the Arm Synthesis Module and the Body Synthesis Module (lower-middle), to separately synthesize the arms and the rest of the body. We also design a Condition Encoder (middle) to condition our decoupled autoregressors based on the input instruction labels and the body shape parameters. We concatenate our synthesized arm and body motions and use our Object Optimizer Module (lower-right) to optimize the 6-DoF parameters of the object while satisfying the grasping constraints. Our model outputs the synthesized full-body motion sequence together with the object positions (upper-right).
We train and evaluate our method on the recent GRAB dataset [TGBT20] (Sec. 5.1), consisting of ∼1.3K sequences of human-object interactions exhibiting multiple intents. We quantitatively evaluate our synthesized sequences on established metrics, such as the mean per-joint position error, the average variance error, the Fréchet Inception Distance, recognition accuracy, diversity, and multimodality, to test the effectiveness of the model. Further, we conduct a visual perceptual study for a subjective evaluation of our synthesized motions compared to recent conditional motion synthesis methods (Sec. 5.6).
In summary, our primary technical contributions are threefold:
• A new framework for generating diverse motion sequences of virtual human characters interacting with objects of known shapes placed within their reach, according to text-based instruction labels. In contrast to previous works on character-object interactions, our proposed method also optimizes the 6-DoF object positions in 3D.
• Synthesizing interactions involving both hands, including sequences where the character exchanges an object between the hands ("offhand"), a previously unexplored setting.
• Learning variational latent embeddings for the arms separately from the rest of the body to enable diversity in the synthesized motions and accurate synthesis of both-handed interactions.

Table 1: Overview of the Problem Definitions of Existing Methods (compared on full-body motion synthesis, intent-driven synthesis, synthesizing only till the grasp, and object manipulation). Our method is the only one combining three important characteristics and the first one to synthesize intent-driven full-body pose sequences for motions with object manipulation.


Related Work
Our work aligns with past works on modeling 3D human-object interactions. We study these works from four vantage points: human pose forecasting and synthesis, human-object 3D interaction modeling, hand-object grasp synthesis, and full-body grasp synthesis.
Human Pose Forecasting and Synthesis. Human pose forecasting methods predict future motions from a sequence of past poses given as joint positions [MBR17].

Hand-Object Grasp Synthesis. Taheri et al. [TGBT20] introduce the GRAB dataset, which captures the contact map from hands and the full-body motions before and during the grasp. They also propose GrabNet, a network that estimates MANO parameters at the moment of grasp for unseen objects in a coarse-to-fine manner. [KYZ*20] propose Grasping Field, a method that learns an implicit representation of the hand-object interaction using a generative model. Grady et al. [GTT*21] derive physically plausible hand pose estimates by optimizing estimated hand meshes with contact predictions. We differ from all these methods as our work focuses on synthesizing full-body sequences. While modeling hand-object interaction is a well-researched problem, it is inherently limited in its ability to model several types of human-object interactions that require the full human body (e.g., tilting back the head when drinking from a glass).
Full-Body Grasp Synthesis. This is a relatively recent line of work following the success of hand-object grasp synthesis. GOAL [TCBT22] synthesizes full-body motion for grasping a given object by first estimating the whole-body grasping pose for the object and then treating this pose as the goal for a motion-infilling module that interpolates the motion between a T-Pose and the goal pose. SAGA [WWZ*22] also follows a similar strategy of motion infilling but uses markers to represent the body pose while also learning a contact map for the grasp for additional supervision. Both these methods synthesize full-body motions until the point of grasping. In contrast, we synthesize the motion taking place after the object is grasped (see Table 1). This is a non-trivial and more challenging setup. Conditioning human and object motions based on the intended actions while also ensuring diversity in the generated motion sequences requires additionally learning their intent-based mutual interactions in an efficient and generalizable manner.

Intent-Driven Full-Body Motion Generator
We show the architecture of our intent-driven full-body motion generator model in Fig. 3. Given a human character's shape and initial 3D body pose, a rigid 3D object placed within their reach, and an intended action to perform with that object, our goal is to synthesize a full-body motion sequence of the character performing the intended action with the object. We pose this problem as synthesizing the full-body motion sequence conditioned on the given object and a textual instruction label indicating the intent. We solve this problem through four modules. First, we encode the input instruction labels, consisting of the type of the object and the associated action, using our Condition Encoder. We also input the subject's body shape parameters into our Condition Encoder. We use this encoding as a conditioning signal for all the modules. A key characteristic of our problem is that the arms are the primary movers during human-object interactions. Therefore, we use a pair of decoupled conditional variational autoregressor networks to synthesize the arm movements and the rest of the body movements separately, using an Arm Synthesis Module and a Body Synthesis Module, respectively. Lastly, we use an Object Optimizer Module to optimize the 6-DoF pose of the given object such that it fits plausibly within the hands of the synthesized character.

3D Human Body and Object Representation
We represent the human mesh using the SMPL-X [PCG*19] parametric body model. SMPL-X parametrizes the full human body along with the hands and the face as a differentiable function $\mathrm{SMPLX}(\beta, r, \Psi, t)$, consisting of body shape parameters $\beta \in \mathbb{R}^{10}$, the root translation $t \in \mathbb{R}^{3}$, the axis-angle rotations for the body joints $r \in \mathbb{R}^{J \times 3}$ ($J = 55$), and the face expression parameters $\Psi \in \mathbb{R}^{10}$. It maps the parameters to a body mesh with 10,475 vertices. To improve the stability and the convergence characteristics of our model, we use the 6D continuous representation [ZBL*19] $\theta \in \mathbb{R}^{J \times 6}$ to represent the body joint rotations. We downsample all the objects in the dataset to 300 vertices for faster optimization. The object's 6-DoF pose is represented using a rotation matrix $R \in \mathbb{R}^{9}$ and a translation vector $T \in \mathbb{R}^{3}$.
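The 6D rotation representation can be derived from the SMPL-X axis-angle parameters in a few lines of PyTorch. The following is a minimal sketch using the standard Rodrigues formula and Gram-Schmidt orthonormalization; it is not taken from the authors' implementation, and the placeholder pose is purely illustrative.

```python
import torch
import torch.nn.functional as F

def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: (..., 3) axis-angle -> (..., 3, 3) rotation matrices."""
    angle = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    x, y, z = (aa / angle).unbind(-1)
    zero = torch.zeros_like(x)
    K = torch.stack([zero, -z, y, z, zero, -x, -y, x, zero], dim=-1).reshape(aa.shape[:-1] + (3, 3))
    s, c = torch.sin(angle)[..., None], torch.cos(angle)[..., None]
    return torch.eye(3, device=aa.device) + s * K + (1.0 - c) * (K @ K)

def matrix_to_rot6d(R: torch.Tensor) -> torch.Tensor:
    """Keep the first two columns of R as the 6D continuous representation [ZBL*19]."""
    return R[..., :, :2].transpose(-1, -2).reshape(R.shape[:-2] + (6,))

def rot6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Gram-Schmidt orthonormalization back to a valid rotation matrix."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)          # b1, b2, b3 become the matrix columns

# Example: convert the J = 55 SMPL-X axis-angle joint rotations r to 6D features theta.
r = torch.randn(55, 3) * 0.1                          # placeholder pose, not real data
theta = matrix_to_rot6d(axis_angle_to_matrix(r))      # (55, 6)
```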

Model Design
We now discuss each of our modules in detail. Our synthesis pipeline assumes that the character interacts with only one object at a time. Interactions can be either one-handed or both-handed, depending on the type of action and the object.

Condition Encoder
We input the object's category label using a one-hot vector $w_o \in \mathbb{R}^{51}$. To represent the intended action information, we pass the intended action label, given as an English word, through the pre-trained CLIP [RKH*21] model and use the embeddings $w_a \in \mathbb{R}^{512}$ that it outputs. The idea behind encoding the action labels with a pre-trained text encoder is the general relevance between the action semantics and the corresponding body movements. For example, actions such as "drink" and "pour" typically invoke similar arm movements and are semantically close. In contrast, other actions, such as "inspect" and "pass", invoke different body movements and are semantically different. Therefore, their embeddings, given by a large-scale language model such as CLIP, provide a regularized, semantics-based distribution of the intended actions and stabilize further processing.
We concatenate $w_o$ and $w_a$ with the body shape parameters ($\beta \in \mathbb{R}^{10}$) and pass them into our Condition Encoder $q_c$. Our Condition Encoder uses a series of MLPs to encode these input signals and projects them onto an encoded feature vector $\phi \in \mathbb{R}^{400}$ as
$$\phi = q_c\!\left(w_o, w_a, \beta\right). \qquad (1)$$
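The conditioning path can be sketched with the publicly released CLIP package as follows. Only the input and output dimensions (51, 512, 10, and 400) follow the text; the hidden width, MLP depth, and the chosen action word and object index are our assumptions.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

NUM_OBJECTS = 51                                 # object categories in GRAB

class ConditionEncoder(nn.Module):
    """Sketch of q_c: (w_o, w_a, beta) -> phi in R^400."""
    def __init__(self, hidden: int = 512, out_dim: int = 400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_OBJECTS + 512 + 10, hidden),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(hidden),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, w_o, w_a, beta):
        return self.net(torch.cat([w_o, w_a, beta], dim=-1))

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)        # frozen pre-trained text encoder
with torch.no_grad():
    w_a = clip_model.encode_text(clip.tokenize(["drink"]).to(device)).float()  # (1, 512)

w_o = torch.zeros(1, NUM_OBJECTS, device=device)
w_o[0, 0] = 1.0                                  # one-hot object label (index is illustrative)
beta = torch.zeros(1, 10, device=device)         # SMPL-X shape parameters of the subject

encoder = ConditionEncoder().to(device).eval()   # eval() so BatchNorm accepts a batch of one
phi = encoder(w_o, w_a, beta)                    # (1, 400)
```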

Arm Synthesis Module
Our Arm Synthesis Module is a conditional variational autoregressor that synthesizes the arm movements, conditioned on our Condition Encoder output $\phi$ and the previous $k$ frames of synthesized arm poses along with the 3D object positions. The encoder of this module, $q_a$, takes in the tuple $q^{in}_a = \left(\hat{\theta}^{a}_{t-k:t-1},\, T_{t-k:t-1},\, R_{t-k:t-1},\, \phi\right)$, where $\hat{\theta}^{a}_{t-k:t-1}$ are the rotations for the arm joints synthesized in the past $k$ frames, and $T_{t-k:t-1}$, $R_{t-k:t-1}$ are the translation and rotation parameters of the object for the past $k$ frames. During training, $q_a$ uses a series of MLPs on the input and maps it to the parameters of a latent normal distribution, $\mu_a, \sigma_a \in \mathbb{R}^{32}$. The decoder, $\hat{q}_a$, samples $z_a \in \mathbb{R}^{32}$ from the latent distribution and uses the previous pose information ($q^{in}_a$) to synthesize the arm pose for the current frame ($\hat{\theta}^{a}_{t}$) through a series of MLPs with skip connections as
$$\hat{\theta}^{a}_{t} = \hat{q}_a\!\left(z_a, q^{in}_a\right). \qquad (2)$$
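A minimal sketch of such a conditional variational autoregressor is shown below. The latent size (32) and the use of the past $k$ frames follow the text; the number of arm joints, the layer widths, and feeding the current ground-truth pose to the encoder (standard CVAE practice) are our assumptions.

```python
import torch
import torch.nn as nn

class ArmSynthesisModule(nn.Module):
    """Sketch of the arm auto-regressor: past context + condition -> next-frame arm pose."""
    def __init__(self, n_arm_joints: int = 14, k: int = 4, cond_dim: int = 400,
                 latent: int = 32, hidden: int = 256):
        super().__init__()
        self.pose_dim = n_arm_joints * 6                       # 6D rotations per arm joint
        ctx_dim = k * (self.pose_dim + 3 + 9) + cond_dim       # past poses + object (T, R) + phi
        self.enc = nn.Sequential(nn.Linear(ctx_dim + self.pose_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(ctx_dim + latent, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, self.pose_dim))
        self.latent = latent

    def forward(self, q_in, theta_gt=None):
        if theta_gt is not None:                               # training: posterior over z_a
            mu, logvar = self.enc(torch.cat([q_in, theta_gt], dim=-1)).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        else:                                                  # inference: sample the prior
            mu = logvar = None
            z = torch.randn(q_in.shape[0], self.latent, device=q_in.device)
        theta_hat = self.dec(torch.cat([q_in, z], dim=-1))     # next-frame arm pose
        return theta_hat, mu, logvar

# q_in concatenates the flattened arm poses and object (T, R) of the past k frames with phi.
q_in = torch.randn(1, 4 * (14 * 6 + 3 + 9) + 400)
theta_hat, _, _ = ArmSynthesisModule()(q_in)                   # (1, 14 * 6), fed back next step
```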

Body Synthesis Module
Similar to the Arm Synthesis Module, the Body Synthesis Module is a variational autoregressor. We use the term 'body' to denote the rest of the body parts apart from the arms. It includes the head, the torso, the hips, and the legs. We also note that the movements of all these parts are correlated when performing a full-body action.
For example, to drink from a cup, one has to tilt their head back when bringing the cup to their mouth. To model such fine-grained correlations, we first compute a self-attention mapping between all the joints in each pose as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where the query $Q$ is a joint position and the key-value pair $(K, V)$ is the information of all other joints, provided as $J$ sinusoidal positional encodings for each of the $k$ frames.
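A sketch of this position-encoded self-attention over the joints of the past $k$ frames is given below, assuming the standard scaled dot-product formulation; the feature width, the single attention head, and the joint count of the non-arm body parts are our assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(n_positions: int, dim: int) -> torch.Tensor:
    """One standard sinusoidal positional encoding per joint index."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(n_positions, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class JointSelfAttention(nn.Module):
    """Self-attention across the body joints of each of the past k frames."""
    def __init__(self, n_joints: int = 41, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(6, d_model)                  # 6D joint rotation -> feature
        self.register_buffer("pe", sinusoidal_encoding(n_joints, d_model))
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (B, k, J, 6) past body poses in the 6D rotation representation
        x = self.proj(joints) + self.pe                    # add per-joint positional encoding
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.scale, dim=-1)   # (B, k, J, J)
        return attn @ v                                    # attended joint features

feats = JointSelfAttention()(torch.randn(2, 4, 41, 6))    # (2, 4, 41, 64)
```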
The encoder of the module, $q_b$, takes in the corresponding input tuple $q^{in}_b$, consisting of the attention-weighted body poses of the past $k$ frames, the object parameters, and the conditioning vector $\phi$. The structure of $q_b$ is similar to that of the Arm Synthesis Module encoder $q_a$, and it maps the input $q^{in}_b$ to the parameters of a latent normal distribution, $\mu_b, \sigma_b \in \mathbb{R}^{100}$. The decoder, $\hat{q}_b$, samples $z_b \in \mathbb{R}^{100}$ from the latent distribution and outputs the rest of the body pose for the current frame as
$$\hat{\theta}^{b}_{t} = \hat{q}_b\!\left(z_b, q^{in}_b\right).$$
We then concatenate $\hat{\theta}^{a}_{t}$ and $\hat{\theta}^{b}_{t}$ to obtain the full-body pose $\hat{\theta}_{t}$ at time $t$. We pass $\hat{\theta}_{t}$ to our Object Optimizer Module, along with the last predicted object position, to generate the object position for the current frame.

Object Optimizer Module
We have so far focused only on synthesizing the body poses for a given instruction. For a complete synthesis, we also need to estimate the corresponding 6-DoF positions of the object. Although fine-grained object synthesis is not the primary goal of our work, we aim to produce plausible object trajectories faithful to the synthesized full-body motion. To this end, our core assumptions are that (a) at the moment of grasping in the initial frame, the object is at rest in an upright position, and (b) the inter-vertex distances between the vertices of the object and the hand remain constant throughout our intent-driven motion synthesis.

Figure 4: Our Hand-Object Setup. We design the energy term $E_d$ to enforce that the distances between the hand and the object vertices remain constant throughout the synthesis. Through the hand-object contact term $E_c$, we also enforce that the points in contact in the first frame remain in contact during the synthesis.
With these assumptions, we optimize the object's rotation $R$ and translation $T$, as well as the pose parameters of the hand, $P_h$, in the SMPL-X parameter space.
We first compute the matrix of Euclidean distances $D \in \mathbb{R}^{N \times M}$ between the $N$ vertices on the hand, $V_h$, and the $M$ vertices on the surface of the object, $V_o$, for the initial frame. We retrieve the hand vertices using the SMPL-X parameterization (Sec. 3.1). For each subsequent frame, we then minimize the objective
$$R^{*}, T^{*}, P_h^{*} = \arg\min_{R,\,T,\,P_h} \; \lambda_d E_d + \lambda_c E_c + \lambda_r E_r.$$
We use an energy term, $E_d$, to enforce that the inter-vertex distances between the hand and the object vertices remain the same in all the subsequent frames as in the first frame. However, this term alone does not guarantee that the object is in contact with the hand in subsequent frames because, in practice, the hand joints do not converge to plausible poses using $E_d$ alone. We address this issue by introducing the contact term $E_c$, which forces the distance between the in-contact vertex pairs of the first frame to be zero. Here, $\delta(\cdot, \cdot)$ is a contact indicator function for the elements of the distance matrix whose distance is below a threshold: $\delta(i, j) = 1$ if $D_{i,j} < \tau$ and $0$ otherwise, as we show in Fig. 4.
Finally, $E_r$ consists of $L_2$ regularizers on $\Delta R$, $\Delta T$, and $\Delta P_h$, where $\Delta$ signifies the difference in values between the current and the previous frame, to ensure that the object and hand poses do not deviate significantly from the previous frame and thus enforce temporal consistency. We initialize the hand poses using a state-of-the-art grasp estimator proposed in [TCBT22]. The optimization routine iteratively corrects the initial estimates of the finger movements while placing the object within the person's hands. Fig. 5 illustrates the optimization routine.

Figure 5: Object Position Optimization. We optimize the 6-DoF pose of the object such that it fits plausibly within the hands of the virtual character. We show three snapshots of such fitting after the 0th, 500th, and 1200th iterations of our optimization.
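The sketch below illustrates the per-frame optimization with the λ weights reported in Sec. 4. The squared-error forms of $E_d$, $E_c$, and $E_r$, the Adam optimizer, and the free hand-vertex offsets (standing in for the SMPL-X hand pose $P_h$) are our assumptions; the paper only specifies the constraints the three terms encode.

```python
import torch

def object_energies(V_h, V_o0, R, T, D0, contact, R_prev, T_prev):
    """Assumed squared-error forms of the three energy terms from Sec. 3.2.4."""
    V_o = V_o0 @ R.T + T                                 # transform the canonical object vertices
    D = torch.cdist(V_h, V_o)                            # (N, M) hand-object distances
    E_d = ((D - D0) ** 2).mean()                         # keep inter-vertex distances as in frame 1
    E_c = (D[contact] ** 2).sum()                        # keep first-frame contacts closed
    E_r = ((R - R_prev) ** 2).sum() + ((T - T_prev) ** 2).sum()   # temporal term (hand part omitted)
    return E_d, E_c, E_r

# Synthetic per-frame example. In the full method V_h comes from SMPL-X and is a function of
# the hand pose P_h; here free offsets dV act as a stand-in so the sketch stays self-contained.
N, M = 778, 300
V_h0, V_o0 = torch.rand(N, 3) * 0.1, torch.rand(M, 3) * 0.1
D0 = torch.cdist(V_h0, V_o0)
contact = D0 < 0.01                                      # tau = 1 cm (illustrative threshold)

w = torch.zeros(3, requires_grad=True)                   # axis-angle object rotation
T = torch.zeros(3, requires_grad=True)                   # object translation
dV = torch.zeros(N, 3, requires_grad=True)               # stand-in for the hand pose update
opt = torch.optim.Adam([w, T, dV], lr=1e-2)
R_prev, T_prev = torch.eye(3), torch.zeros(3)
zero = torch.zeros(())

for _ in range(200):
    opt.zero_grad()
    W = torch.stack([zero, -w[2], w[1], w[2], zero, -w[0], -w[1], w[0], zero]).reshape(3, 3)
    R = torch.matrix_exp(W)                              # rotation matrix from the axis-angle w
    E_d, E_c, E_r = object_energies(V_h0 + dV, V_o0, R, T, D0, contact, R_prev, T_prev)
    loss = 1.0 * E_d + 0.005 * E_c + 0.005 * E_r         # lambda_d, lambda_c, lambda_r (Sec. 4)
    loss.backward()
    opt.step()
```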

Implementation
This section describes our training and inference routines and the implementation details for our generator network.
Training and Inference Routines. To maintain a fixed number of input frames for computational stability, to reduce the parameter load and the associated training overheads, and to avoid overfitting to redundant frames, we represent our ground-truth motion sequences using T = 15 frames, taken at a sampling rate of 8-10 fps. These 15 frames act as the key frames determining the motion sequence.
The encoders and the decoders inside our four modules use fully-connected layers with skip connections, LeakyReLU activations, and batch normalization [Aga18; BGSW18]. We use $k = 4$ past frames (optimized through experiments) to synthesize the subsequent time steps. We train our autoregressor-based Arm Synthesis and Body Synthesis Modules to minimize the KL divergence loss
$$\mathcal{L}_{KL} = D_{KL}\!\left(\mathcal{N}(\mu_a, \sigma_a)\,\|\,\mathcal{N}(0, I)\right) + D_{KL}\!\left(\mathcal{N}(\mu_b, \sigma_b)\,\|\,\mathcal{N}(0, I)\right).$$
We compute the pose and the velocity reconstruction loss between the ground-truth rotations $\theta$ and the predicted rotations $\hat{\theta}$ as the $\ell_1$ loss
$$\mathcal{L}_{p} = \left\lVert \theta - \hat{\theta} \right\rVert_1 + \left\lVert \dot{\theta} - \dot{\hat{\theta}} \right\rVert_1,$$
where $\dot{\theta}$ denotes the frame-wise temporal difference of the rotations. We train our model on the following weighted sum of these losses:
$$\mathcal{L} = \lambda_{KL}\,\mathcal{L}_{KL} + \lambda_{p}\,\mathcal{L}_{p},$$
where $\lambda_{KL}$ and $\lambda_{p}$ are the weight parameters. We can then use the regressed body motion parameters to optimize the 6-DoF object positions at every time step.
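A compact sketch of this training objective follows. The KL term is the standard Gaussian-prior form implied by the normal latent distributions above; reading the reconstruction term as an ℓ1 pose-plus-velocity loss follows the later discussion of the ℓ1 loss (Sec. 5.6), so treat the exact formulation as an assumption.

```python
import torch

def kl_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, averaged over the batch."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()

def reconstruction_loss(theta_hat: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """l1 pose + velocity loss; velocities are frame-wise differences along the time axis."""
    pose = (theta_hat - theta).abs().mean()
    vel = ((theta_hat[:, 1:] - theta_hat[:, :-1]) - (theta[:, 1:] - theta[:, :-1])).abs().mean()
    return pose + vel

def training_loss(theta_hat, theta, mu, logvar, lambda_kl=0.001, lambda_p=1.0):
    return lambda_kl * kl_normal(mu, logvar) + lambda_p * reconstruction_loss(theta_hat, theta)

# theta, theta_hat: (batch, T, J * 6) ground-truth and predicted 6D rotations of a sequence.
theta = torch.randn(8, 15, 55 * 6)
theta_hat = theta + 0.01 * torch.randn_like(theta)
mu, logvar = torch.zeros(8, 32), torch.zeros(8, 32)
loss = training_loss(theta_hat, theta, mu, logvar)
```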
During inference, we synthesize motions for novel intent-object pairs and novel body shape parameters. We input an initial body pose, a 3D object placed within reach of the character, and an intended action to be performed with the object, and autoregressively synthesize the intent-based full-body motion sequence.
Implementation Details. We train our model for 1,600 epochs using the Adam optimizer [KB14] with a base learning rate of $5 \times 10^{-4}$ and a batch size of 64, which takes roughly four hours on an NVIDIA A100-PCIE-40GB GPU. We decay the learning rate (LR) using a reduce-on-plateau LR scheduler with a patience of 3 epochs and a decay rate of 0.999. We set $\lambda_{KL} = 0.001$, $\lambda_{p} = \lambda_{d} = 1.0$, and $\lambda_{c} = \lambda_{r} = 0.005$. During inference, synthesizing the full-body poses and the corresponding object positions for a motion sequence of 15 frames takes approximately 1-1.5 minutes. Finally, we perform a linear interpolation on our generated frames to up-sample the motion to 30 frames per sequence for cleaner visualization. We have implemented our network, training, and inference using the PyTorch framework [PGC*17].
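These optimization settings map directly onto PyTorch. The snippet below is a configuration sketch only: the placeholder parameter and validation loss stand in for the full model and training loop, and interpreting the 0.999 decay rate as the scheduler's `factor` argument is our reading of the text.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]        # placeholder for the generator parameters
optimizer = torch.optim.Adam(params, lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.999, patience=3)

for epoch in range(1600):
    val_loss = 1.0                                    # placeholder: validation loss of the epoch
    scheduler.step(val_loss)                          # reduce the LR when the loss plateaus
```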

Experiments and Results
This section reports the results of our experimental evaluation, including the dataset, the evaluation metrics we use, and our ablation studies. Since there are no existing methods for generating full-body human-object interactions, we use existing methods that generate full-body poses based only on action labels as our baselines.

Dataset
We use the GRAB dataset [TGBT20], consisting of whole-body grasping sequences performed by ten different subjects. The subjects interact with 51 different objects via four basic intents: "use", "pass", "lift", and "offhand". "Use" further has a sub-category of 26 different actions depicting plausible intent-object interactions, ranging from drinking or pouring from a cup to taking a picture with or browsing a camera. Following the split of [DFD22], we take subject 'S1' for validation, 'S10' for testing, and the remaining subjects 'S2' through 'S9' for training. This data split ensures that we test on novel subjects with different body shapes and that our inference contains novel intent-object pairs, such as offhanding a water bottle, which is not present in our training set. We discard the "lift" intent from our setting as the motions depicting lifting an object are inconsistent in the dataset. Thus, our train, validation, and test splits consist of 789, 157, and 115 sequences, respectively.

Baselines
We compare our results with ACTOR [PBV21], Action2Motion [GZW*20], and TEMOS [PBV22]. Since these methods were not originally trained on the GRAB dataset, we re-train them for our setting. We re-train ACTOR and Action2Motion for 1,600 epochs (the same number of epochs we train our model for, see Sec. 4), conditioned only on the action labels with no object information. For comparison with TEMOS, we create sentences of the form "A person <action> the <object>" (e.g., "a person eats the apple") to use as input sentences, and re-train the TEMOS model for 1,600 epochs as well. We apply our Object Optimizer Module to all three motion synthesis methods to generate the object positions for visual comparison.

Evaluation Metrics
We evaluate our method using the Mean Per-Joint Positional Error (MPJPE), which measures the mean joint error over all time steps, and the Average Variance Error (AVE) [GCO*21], which measures the variance error between the joint positions.

Table 2: Quantitative Evaluation. We compare our method with ACTOR [PBV21], Action2Motion [GZW*20], and TEMOS [PBV22], and three ablated versions of our model (Sec. 5.4). We evaluate the methods on the MPJPE, AVE, FID, recognition accuracy, diversity, and multimodality metrics. "↓" denotes lower values are better, "↑" denotes higher values are better, and "→" denotes values closer to the ground truth are better.

Figure 6: Perceptual Study Evaluation. We conduct a user study where participants answer two questions: "Which animation looks more realistic?" and "Which animation best corresponds with the input instruction label?". We show them 30 randomly sampled motion sequences synthesized by our method and the two baselines, ACTOR [PBV21] and Action2Motion [GZW*20]. We see that our method is chosen more than 80% of the time.
We further evaluate the naturalness and the overall diversity of our generated motions using the Fréchet Inception Distance (FID) [HRU*17], recognition accuracy, diversity, and multimodality. Following ACTOR [PBV21] and Action2Motion [GZW*20], we train a standard RNN action recognition classifier on the GRAB dataset and use the final layer of the classifier as the motion feature extractor for calculating the FID, diversity, and multimodality.
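Given the classifier features for real and synthesized motions, the FID can be computed with the standard Fréchet formula; the sketch below is the generic computation on such feature sets, not the authors' evaluation script, and the random inputs are placeholders.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets (rows = motion samples)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(s1 @ s2, disp=False)     # matrix square root of s1 * s2
    if np.iscomplexobj(covmean):
        covmean = covmean.real                         # discard tiny numerical imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

# Features would come from the final layer of the RNN action classifier described above.
score = fid(np.random.randn(100, 128), np.random.randn(100, 128))
```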

Ablation Studies
We compare the performance of our model with the following ablated versions:
• Ablation 1: Randomly initializing the input action labels with 512-d vectors. To study how the CLIP model influences the conditioning of the synthesized motion, we train our Condition Encoder with a 512-dimensional randomly initialized vector for the input action labels instead of the CLIP embeddings.
• Ablation 2: Training the Body Synthesis Module without the self-attention mapping. In this ablation, we exclude our position-encoded multi-head self-attention from the input of the Body Synthesis Module to see how it influences the quality of our motion.
• Ablation 3: Training the full body instead of decoupling it into the Arm Synthesis and the Body Synthesis Modules. We train the whole-body movements in one module instead of separately synthesizing the arms and the rest of the body.

Quantitative Evaluation
Table 2 shows the MPJPE, AVE, FID, recognition accuracy, diversity, and multimodality on our test set compared to the three state-of-the-art methods ACTOR [PBV21], Action2Motion [GZW*20], and TEMOS [PBV22]. We also include the ablated versions of our method (Sec. 5.4) in our evaluation. We repeat each experiment 20 times, as done in ACTOR [PBV21], and report a statistical interval with 95% confidence. Our method shows significant improvements on all the metrics compared to the existing methods and the ablated versions.

Perceptual Study
To evaluate the visual quality of our motions, we conduct a perceptual study where we compare our results with ACTOR [PBV21] and Action2Motion [GZW*20]. Except for TEMOS [PBV22], which would quickly settle on the mean pose, the other two methods generated plausible full-body motions after re-training. We, therefore, exclude TEMOS from the user study. We conduct our perceptual study in the following two sections.
Comparison with Motion Synthesis Methods. In the first section, we displayed our results and the results from ACTOR and Action2Motion side-by-side in random order, along with the input instruction label. We asked the participants to answer two questions for each sequence: "Which motion looks the most realistic?" and "Which motion best corresponds with the input instruction label?". We collected answers for 30 such sequences from 75 participants. Fig. 6 illustrates the results of the study. In 80% of the responses, participants marked our method as the most realistic compared to ACTOR and Action2Motion. Likewise, 81.6% of the participants chose our method as having the best semantic fidelity with the instruction label. Upon examining the cases where the participants preferred ACTOR over our method, we found that it performed better for a few actions, e.g., "screwing" the light bulb and "toasting" with the wineglass, where the motion does not need hand-to-eye or hand-to-mouth coordination. These actions do not include significant variations within the dataset and are, therefore, easy to overfit.
Comparison with Ground-Truth. While ACTOR and Action2Motion are the closest methods to our paradigm, they were not originally designed to synthesize motions conditioned on intents. Therefore, to get an additional perspective on the performance of our method, we asked the participants to compare our best synthesis results with the ground-truth motions in the second section. To establish an upper bound on our performance, we chose the 10 best samples from various intent-object pairings to compare with the corresponding ground-truth motions. Again, we displayed our synthesized and ground-truth motions side-by-side in random order. This time, we added an extra option: "cannot distinguish". While our synthesized motions are, expectedly, preferred less often than the ground-truth motions (15.6% vs. 36.9%), 47.5% of the responses rate our best syntheses as indistinguishable in terms of realism. We also note that participants rated our synthesized motions as more realistic than the ground-truth motions for actions such as "eating" an apple with one hand, which has abundant training samples. On the other hand, our method encounters difficulties when synthesizing intents involving high-frequency wrist or finger movements, such as "shaking" or "squeezing". This is because our $\ell_1$ loss function (Sec. 4) tends to smooth out the high-frequency components from the motion sequence, and the GRAB dataset does not have sufficient samples of these actions to train them separately.

Qualitative Evaluation
We show full qualitative results in our supplementary video. When qualitatively compared with the ablated versions (Sec. 5.4), we find that Ablation 1 (using randomly initialized vectors instead of CLIP) and Ablation 3 (training one module for the whole body) fail to synthesize precise hand-mouth or hand-eye coordination for actions such as "drinking" and "eating". Ablation 2 (without the self-attention mapping) lacks subtle body movements, such as tilting back the head or bending the knee to pick up an object, which improve the motion plausibility. We further analyze our generated motions under the following headings:

Diversity Analysis. As we noted earlier (Sec. 1), generating diverse motion sequences for the same input instruction label is crucial for an immersive user experience. Sampling from the variational latent space allows us to synthesize such diverse motion sequences. In Fig. 7, we show two different sequences, "taking picture" with a camera (left) and "eating" an apple (right), with two variations of each motion (upper and lower rows). We note that the variations are diverse w.r.t. how the head, arms, and torso are angled to use the object. Our method benefits from operating in the full-body space and produces more natural results compared to naïvely performing a fixed mapping from the global hand pose parameters to the end effectors of the remaining body.
Synthesis of Both-Handed Interactions. Our method is the first to plausibly synthesize full-body motions for both-handed interactions. We achieve this by decoupling the arm synthesis from the full-body synthesis in our generator design (Sec. 3.2). The wrist and the elbow joints play a crucial role in tasks such as picking up an object with both hands or holding the object precisely. Learning the arm motions in a separate latent space helps our generator focus more on such precise synthesis.
Object Position Predictions for Off-Handing Interactions. In addition to both-handed interactions, we encounter sequences in the GRAB dataset where the character passes an object from one hand to the other. It is non-trivial to optimize for accurate object positions when the object switches hands. Here, we first compute the most likely frame at which the switch occurs and then transfer the optimized hand parameters to the other hand. Fig. 9 shows two such off-handing interactions with two objects.
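The paper does not spell out how the switching frame is chosen, so the following is purely an illustrative heuristic under our own assumptions: pick the first frame at which the object centroid becomes closer to the receiving wrist than to the giving one.

```python
import torch

def switch_frame(obj_center, left_wrist, right_wrist):
    """obj_center, left_wrist, right_wrist: (T, 3) trajectories over the sequence."""
    d_left = (obj_center - left_wrist).norm(dim=-1)
    d_right = (obj_center - right_wrist).norm(dim=-1)
    closer_to_right = d_right < d_left
    giving_right = bool(closer_to_right[0])            # which hand holds the object initially
    changed = closer_to_right != giving_right          # frames where the nearer hand flips
    idx = torch.nonzero(changed, as_tuple=False)
    return int(idx[0]) if len(idx) > 0 else None       # first flip, or None if no hand-over

frame = switch_frame(torch.zeros(15, 3), torch.rand(15, 3), torch.rand(15, 3))
```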
Plausibility of Head Motions. Similar to the motion of the fingers and the arms, the coordinated movement of the head and the hands also determines the synthesis quality. While recent works like GOAL [TCBT22] explicitly account for the head direction vector during network training and optimization, we observe that our model learns visually plausible head orientations and hand-head coordination without any explicit supervision. This raises the question of whether such explicit supervision is indeed necessary.

Discussion and Limitations
Through quantitative evaluations and a perceptual study, we establish that our method synthesizes plausible motions of virtual characters performing intended actions with given objects. While we can synthesize motions for various intents and objects, we observe failure cases for certain rare intents with high-frequency wrist motions, e.g., "squeeze" and "shake" (see supplementary video). Additionally, our Object Optimizer Module (Sec. 3.2.4) optimizes the fingers and the object positions based on an initial distance between them. This assumption works well for most of the intents in the GRAB dataset, which involve static grasps. However, dynamic grasping, which involves hand slipping and relative motion between the object and the hands (such as "rotating" a cube and "stretching" an elastic band), is limited in our setting. We also note that the contacts between the body and the objects are not fully precise for all samples in the GRAB dataset, possibly due to the sparse marker-based motion capture. In some sequences, the fingers do not touch the object while grasping, while in others, there are interpenetrations between the hand and the object (Fig. 8). Lastly, we do not address long-term motion synthesis (in the order of minutes) involving a sequence of actions performed with an object.

Figure 9: Off-Handing. We show two interactions of "offhanding" where the character passes the object from one hand to the other. Such interactions pose a unique optimization challenge when the object is switching hands.
Ethical Considerations. Our method does not support texture and fine appearance details and cannot be used to produce deceptive content. Our results are not photo-realistic by design and cannot be confused with real scenes. However, combining our technique with a method supporting more realistic textures might raise ethical concerns in the future.

Conclusion and Future Work
We presented the first full-body motion synthesis method for character-object interactions. Such a motion synthesis pipeline can become a useful, practical tool in applications requiring large-scale character animations. We demonstrate that a decoupling approach that separately models the arm and body motions using conditional variational autoregression leads to measurable perceptual improvements and advances the state of the art on multiple quantitative evaluations. We also synthesize interactions involving both hands, including sequences where the object exchanges hands.
In the future, we intend to extend our model to synthesize dynamic grasps and full-body poses such that the virtual character can change its grasp within a sequence. We also plan to explore descriptive sentence embeddings for the interactions (e.g., "a person passes the bowl using the right hand") to generate more precise and controllable motions.

Figure 3: Architecture of Our Intent-Driven Full-Body Motion Generator Model. Given the previous k frames of body poses and object positions, we train the arms and the rest of the body separately using our Arm Synthesis (upper-middle) and Body Synthesis (lower-middle) Modules, respectively. We jointly synthesize the entire motion sequences autoregressively, conditioned on the input intent, the object, and the body shape, all encoded through our Condition Encoder (upper-left). We use position-encoded self-attention on the past k frames for the body joints before passing them through our Body Synthesis Module. After generating the body pose, our Object Optimizer Module (lower-right) optimizes the 6-DoF pose of the given object such that it plausibly fits within the hands of the synthesized character.

Figure 7: Qualitative Results Showing Diversity in the Synthesized Motions. The two rows depict two diverse motion sequences generated by our model. Our method can generate variations for the same instructions using either or both hands, along with plausible coordination of the head and the body. Please refer to the supplementary video for more results.

Figure 8: Examples of Imprecise Contacts in the GRAB Dataset [TGBT20]. We show five (ground-truth) frames where the body and the object are in contact. However, these contacts are not precise. The fingers do not touch the object when grasping the mug, the camera, or the cup. On the other hand, we see inter-penetration between the hand and the object for the cube and the toothpaste.