Markerless Multiview Motion Capture with 3D Shape Model Adaptation

In this paper, we address simultaneous markerless motion and shape capture from 3D input meshes of partial views onto a moving subject. We exploit a computer graphics model based on kinematic skinning as the template tracking model. This template model consists of vertices, joints and skinning weights learned a priori from registered full-body scans, representing true human shape and kinematics-based shape deformations. Two data-driven priors are used together with a set of constraints and cues for setting up sufficient correspondences. A Gaussian mixture model-based pose prior of successive joint configurations is learned to soft-constrain the attainable pose space to plausible human poses. To make the shape adaptation robust to outliers and non-visible surface regions, and to guide it towards realistically appearing human shapes, we use a mesh-Laplacian-based shape prior. Both priors are extracted from the training set of the template model learning phase. The output is a model adapted to the captured subject with respect to shape and kinematic skeleton, as well as the animation parameters to resemble the observed movements. With example applications, we demonstrate the benefit of such footage. Experimental evaluations on publicly available datasets show the achieved natural appearance and accuracy.


Introduction
Motion and performance capture as well as human modelling techniques have revolutionized modern movie and game productions as well as telecommunication scenarios. There is a clear demand for 3D capturing of performing humans and the ability to manipulate such footage manually/automatically, e.g. by exchanging the appearance model while keeping the motions (retargeting), assembling certain movements into a desired sequence, fine-tuning certain actions (e.g. gaze correction), etc.
Inspired by this demand, the work presented here provides an approach to automatically model subject-specific human geometry (shape vertices as well as kinematic skeleton) while tracking the subject's motion (Figure 1). Adaptation of the template model towards the observed subject is required to accurately resemble the captured input sequence. The surface of the human tracking model is based on skinning techniques and thereby provides a typical, easy-to-manipulate kinematic skeleton control structure. A fully rigged model of the subject is learned and can be animated in various ways, e.g. with slight corrections of the input sequence as well as completely new movements. Our motion capture and shape adaptation framework processes an input sequence of noisy meshes, which represent a partial view onto a moving subject. As a result, an animation of a topologically consistent template model is generated which closely resembles the input sequence. Alternatively, the proposed optimization framework is well suited to create a kinematically animatable human model of a subject from a set of 3D scans of static poses. The main challenges are the inconsistency of the vertex and triangle topology, the noise level obscuring the details in the captured data, and the fact that only partial views onto the subject are visible, so that considerable surface areas are not always observed and need to be inferred in a plausible manner.

Typically, a generic model (or even just a broad approximation, e.g. a set of cylinders) of a human is used, which simplifies the estimation of motion data. However, such a rough model usually lacks the personal characteristics and consequently cannot explain certain observable peculiarities. On the other hand, modelling very realistic human models has become relatively simple with today's modelling software, but modelling an existing real person in detail is still a very cumbersome task.
This has been considered in recent years by several researchers and has led to approaches combining modelling with tracking. Quite thorough overviews of the literature can be found in [JDKL14, Bla16, CCM*18]. In the following, we point out some approaches closely related to our motion capture approach.

Animation and modelling
Animation is the process of transforming the vertices of a model in rest pose to a desired target pose. One approach is to specify certain conditions the result should comply with and to transform the vertices by minimizing a corresponding error function with optimization techniques. A frequently used method is to control the orientation of triangles [ASK*05, HSS*09, PMRMB15, BBLR15, LMR*15, KPMP*17] or to employ a probabilistic approach [ZB15]. This flexible formulation supports very realistic animations. However, solving for all vertices at once usually incurs considerable computational costs.
In contrast, skinning-based methods provide the capability of calculating vertex motion directly, thereby simplifying the task of generating or manipulating animation sequences. Linear blend skinning (LBS) is dominantly used throughout academic research as well as industrial applications [MG03, LD12, LD14, CLC*13] because of its simplicity and widespread support in graphics hardware. However, direct vertex animations based on linear skinning typically suffer from severe artefacts, e.g. the candy-wrapper effect, as shown in Figure 2. Various non-linear approaches have been investigated to mitigate such negative effects [KCvO08, KS12, FHE16]. In order to model human characters effortlessly in a realistic and detailed manner, data-driven methods have been developed. Such statistical models are generated using Principal Component Analysis (PCA) to extract a parametric model from a set of training data consisting of registered example scans. By varying the eigenvalues, the shape of the resulting model is adjusted according to the statistically dominant variations of the training data. Deformations can either be represented by vertex displacements [JTST10, PWH*17, LMR*15, RTB17, JSYS18] or triangle deformations [ASK*05, ZFL*10, HTRS10, WHB11, HLRB12, ZFYY14, ZYZD16].
Motion capture and shape adaptation: Motion capture aims at recovering the frame-by-frame pose configuration of a human captured in a temporally sequential manner. Marker-based methods require the subject to wear accurately placed markers (active or passive) or clothes with certain patterns. In contrast, markerless motion capture directly exploits the captured image data. The omission of any subject preparation allows, for example, to also extract texture maps. Several newer approaches in motion capture exploit machine learning techniques to infer pose information from sensor data.

A very recent trend in motion capture is to simultaneously adapt a geometric template model with individual geometric details towards the captured subject during tracking. In [BBLR15], a low-resolution model is fitted with respect to shape and pose to the depth channel of a monocular RGB-D sequence in order to initialize a registered high-resolution model. Then, the initialized high-resolution model is refined to match the input frames by exploiting photoconsistency, thereby simultaneously calculating a texture map. Finally, a displacement map is calculated encoding the remaining geometric details. An approach to process sequences of full-body 3D scans is presented in [ZPBPM17]. A statistical model, enhanced with a term to compensate for cloth deviations, is initially fitted to an input sequence. Based on the per-input-frame adjusted statistical model, a free-form surface is optimized to better resemble the input sequence. This free-form surface is coupled to the statistical model in order to prevent too strong deviations from it. In [YZG*18], a human model representation is presented consisting of two layers that support model adaptation and tracking in real-time on a single depth input stream. The inner layer is a parametric shape model encoding the subject identity and pose without cloth or other disturbances.
An outer layer, which is progressively refined and extended with the input frames, is used to model everything observable, like cloth, hair, etc., and to support free-form deformations. These layers are coupled through a node graph, so that the kinematic skeleton structure of the inner layer constrains the deformation of the outer layer. In [AMX*18b, AMX*18a], the authors enhance a statistical model with an auxiliary term to support individual geometric peculiarities. Using this enhanced statistical model, a geometrically individualized and textured model of a subject is generated from a monocular RGB image sequence by using robust silhouette matching in a canonical reference frame. Further fine geometric details are included in the individualized statistical model by means of additional vertices, which are initially placed onto the already existing triangles. Using shape-from-shading techniques, these additional vertices are moved along the triangle normal direction in order to optimally resemble the captured input.

Contribution of proposed approach
The proposed method falls in the latter category of algorithms: detailed model adaptation together with pose optimization. The contribution of this paper is the following: A motion capture framework with shape adaptation is presented, focusing on processing sequences of noisy 3D reconstructions captured with a multi-view setup covering only partial views of a moving subject. Alternatively, the method supports the generation of a subject-adapted model from sets of 3D reconstructions representing independent poses.
In contrast to indirect animation approaches, the underlying tracking model used here is based on a recent compact, vertex-based, kinematic skinning function [FHE16] and thus can be directly animated. Since the model is optimized with respect to surface vertices as well as skeleton joints, the result closely resembles the captured world data. The output is a kinematically animatable model resembling the shape of the captured subject as well as the animation parameters extracted from the input sequence. This is achieved by applying several heuristics to tackle the challenge of processing such imperfect real-world input data, such as outlier filtering, constraining the visibility of vertices, using a robust norm, etc. Further, prior knowledge extracted a priori from preprocessed examples is used to guide the optimization towards realistic results. By using a data-efficient Gaussian mixture model (GMM) formulation as a prior on valid joint configurations, we reduce the risk of unlikely poses.
In contrast to statistical models with their limited parametric space of attainable shapes, we support free-form deformations to adapt the template model's surface to the details of the captured subject. Soft constraints, learned a priori from example data, are used to ensure realistic human shapes without compromising subject-specific peculiarities, e.g. a ponytail hairstyle.

Model Adaptive Pose Estimation and Tracking
The proposed method calculates a kinematically animatable human model from 3D reconstructions of a continuously moving subject, although a set of independent poses is also suitable. The input data are given as one noisy, non-registered 3D mesh per time instant, each representing a partial view onto the subject. The template tracking model is adapted with respect to shape vertices as well as skeletal joints to closely resemble the observed subject in all captured pose configurations. Since optimizing shape and pose independently resembles a chicken-and-egg problem [HLRB12], we jointly optimize all parameters simultaneously: the pose variables of each input frame (global translation and rotation and kinematic rotations for each skeletal joint) and the subject-specific model/shape parameters for the complete sequence (template vertices and skeletal joints). For accelerated convergence, the optimization is divided into two steps: given a generic template model, the pose parameters for each input frame are optimized first, followed by jointly adapting the model towards the captured input data in conjunction with refining the pose parameters. Further, in order to prevent convergence to undesired model configurations (unnatural poses and/or models), prior knowledge, learned a priori from training data, is used to guide the optimization towards plausible poses and humanoid models.

Template model for tracking
Our framework is based on a skinned humanoid template model with fixed topology, used for tracking partial 3D reconstructions with inconsistent topology. This kinematic-skinning-function-based template model is set up only once. In order to capture true human characteristics, the template model is generated with a data-driven optimization using registered 3D scans. During tracking, this template model is adapted to the captured subject with respect to kinematic pose as well as shape, including the skeletal joints.
We chose a computer graphics model based on a kinematic skinning function as the template tracking model [FHE16], in contrast to complex physiology-simulation-based models like [LST09]. As shown in Figure 2, typical skinning functions are prone to undesired artefacts: LBS suffers from the candy-wrapper effect for twist motions, and dual quaternion skinning (DQS) suffers from bulging artefacts in swing motions.
We use a combination of skinning functions that drastically reduces these artefacts, as shown in [KS12]: LBS for swing motions and DQS for twist motions. As a consequence, each joint rotation needs to be decomposed into its 2-DoF swing and 1-DoF twist components [PB12]. This implies that each joint has its uniquely orientated coordinate system in which the joint rotation takes place. In order to increase the expressive flexibility of the skinning function, we use dual quaternion linear blending (DLB) [KCvO08] instead of DQS, which supports the blending of multiple joint rotations. This allows vertices to be influenced by more than one joint and thereby better approximates the true behaviour (Figure 2).
We denote the complete pose-aligning transformation by T() and the DLB/LBS-based skinning interpolation by S(). The rigid transformations representing all previous joints in the kinematic chains are denoted by K(), and the global similarity alignment by G(). Hence, the full vertex transformation is given by

v′ = T(v) = G(K(S(v))), (1)

with v ∈ ℝ³ representing a vertex of the template model in rest pose and v′ being the resulting vertex of the animated template model.
In our blending function S() of joint transformations, a model vertex v in rest pose is first transformed by the vertex-specific DLB combination of twist rotations of kinematic joints. The twist transformation is followed by the swing transformation using LBS:

S(v) = LBS_swing(DLB_twist(v)). (2)

Using the x-axis of the joint's local coordinate system as the axis to twist around, the unit quaternion q = [q_w, q_x, q_y, q_z] representing a joint rotation is decomposed into its swing and twist components:

q_twist = [q_w, q_x, 0, 0] / ‖[q_w, q_x, 0, 0]‖ and q_swing = q · (q_twist)*, (3)

with q = q_swing · q_twist and (·)* denoting quaternion conjugation.
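The swing-twist decomposition of Equation (3) can be sketched in a few lines of Python. This is a minimal illustration under the paper's conventions, not the authors' implementation; quaternions are stored as [w, x, y, z] and the twist axis defaults to the local x-axis, as in the text:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def quat_conj(q):
    """Quaternion conjugate (inverse for unit quaternions)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def swing_twist(q, axis=np.array([1.0, 0.0, 0.0])):
    """Decompose a unit quaternion q into swing and twist about `axis`,
    so that q = swing * twist (Equation (3) with the default x-axis)."""
    w, v = q[0], q[1:]
    proj = np.dot(v, axis) * axis            # vector part projected onto the twist axis
    twist = np.array([w, *proj])
    n = np.linalg.norm(twist)
    if n < 1e-12:                            # 180-degree swing: twist is undefined, pick identity
        return q.copy(), np.array([1.0, 0.0, 0.0, 0.0])
    twist /= n
    swing = quat_mul(q, quat_conj(twist))    # swing = q * (twist)^{-1}
    return swing, twist
```

By construction the twist has its vector part along the axis and the swing has none, so recomposing `swing * twist` recovers the original rotation.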
Representing a joint's local orientation with quaternion o, the joint rotations combined within DLB are given as o* q_twist o. These quantities are augmented with a dual quaternion part representing the joint rotation centre to form q̂_twist [DKL98]. According to DLB, a vertex-specific twist transformation is calculated by summing up the twist dual quaternions, each weighted with its vertex-specific DLB skinning weight w_j^twist, with subsequent normalization:

q̂_DLB = (Σ_j w_j^twist q̂_j^twist) / ‖Σ_j w_j^twist q̂_j^twist‖. (4)

The resulting blended unit dual quaternion q̂_DLB represents the combined twist rotation of all involved joints and is used to transform a rest-pose vertex v into v_DLB.
The following swing transformation acts on the twist-transformed vertex v_DLB. For this purpose, the oriented swing components o* q_swing o of all involved joints are converted into rotation matrices R_j^swing and blended together with the joint rotation centres t_j according to standard LBS:

v_LBS = Σ_j w_j^swing ( R_j^swing (v_DLB − t_j) + t_j ). (5)

Having applied the interpolated twist/swing transformation, the joint rotations of the remaining kinematic chain are applied via K(), followed by the global rigid transformation G() to locate and orient the model.
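As an illustration of Equation (5), a minimal LBS routine that blends per-joint rotations about their rotation centres might look as follows. This is a sketch with hypothetical names; in the full pipeline the input vertex would already be twist-transformed by the DLB step:

```python
import numpy as np

def quat_to_mat(q):
    """Rotation matrix from a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def lbs(v, weights, quats, centers):
    """Linear blend skinning about per-joint rotation centres:
    v' = sum_j w_j * (R_j @ (v - t_j) + t_j)  (cf. Equation (5))."""
    out = np.zeros(3)
    for w_j, q_j, t_j in zip(weights, quats, centers):
        R = quat_to_mat(np.asarray(q_j, dtype=float))
        out += w_j * (R @ (np.asarray(v, dtype=float) - t_j) + t_j)
    return out
```

For a single joint with weight 1, this reduces to an ordinary rigid rotation about the joint centre; blending several joints is what produces the smooth (but non-rigid) LBS deformation.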
The template model based on this skinning function consists of the mesh itself, the kinematic joints, as well as per-vertex and per-joint skinning weights for LBS and DLB. Similar to our previous work [FHE16], we optimize all three constituents to closely approximate an example real-world dataset of registered scans of a single person in various poses, fitting the shape and pose of the captured person (we use SCAPE [ASK*05] with M = 71 meshes, shown in Figure 3). Since this optimization scheme is fully data driven, we achieve a very naturally appearing human model. In comparison to our previous approach [FHE16], we added a heavily weighted symmetry soft constraint to enforce that corresponding bones on the left- and right-hand side have the same length. Additionally, we enhance the deformation characteristics of resulting animations by using a block coordinate descent algorithm [JDKL14] for the skinning weight optimization, instead of simple coordinate descent. Finally, the flexibility of the model is improved by increasing the number of joints from J = 15 to J = 18 (Figure 3).

Shape adaptive motion capture
Our optimization framework processes meshes, 3D-reconstructed from multi-view setups covering a partial view of the working volume in which a subject is captured (one mesh for each captured multi-view frame). The key idea is to track the subject's pose for each frame by kinematically fitting the template model in an iterative closest point manner, in combination with adapting the template model's shape and skeleton to the 3D-reconstructed meshes. To achieve this successfully for noisy and incomplete real-world data, prior knowledge on plausible shapes and poses is integrated as soft constraints.

Heuristics to set up meaningful vertex correspondences
For motion capture, an objective function Q (Equation (6)) is minimized based on vertex correspondences between the template tracking model and each input frame. Setting up meaningful correspondences for a sequence of independent 3D reconstructions is challenging [vKZHCO11], because the number of captured vertices and their connecting mesh topology vary (in contrast to the input data used to learn the base template model in Section 3.1). Additionally, real-world 3D reconstructions may contain strong noise as well as 3D reconstruction artefacts and may cover only a pose-dependent partial area of the subject's surface.
Finding correspondences using a simple nearest neighbour search, e.g. exploiting kd-trees, is not sufficient. In order to find sufficiently good correspondences, the following additional heuristics have been integrated:

Visibility constraint: Adopting the idea from [HBB*13], we use the external calibration data to limit the set of potential correspondences among the vertices of the template model per frame to those which are visible to any of the capturing sensors, ignoring all template vertices that are facing backwards or are occluded with respect to all cameras.

Normal constraint: The nearest neighbour search ignores matching candidates if the angle between their normals exceeds the threshold of 90°. This drastically reduces the risk that vertices of close but opposite surface areas are matched.

2-Way matching: Correspondences are set up by finding for each vertex in the input mesh a correspondence within the visible template mesh and vice versa. This principle enforces that the visible template mesh and the input mesh cover each other, in contrast to only one being covered by the other.

Uniform outlier removal: In order to reduce the negative influence of strong noise, the correspondences with the largest distances are skipped and not used for optimization.
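The normal constraint and uniform outlier removal can be sketched on top of a kd-tree query as follows. This is a simplified one-way matcher with hypothetical names (`match`, `trim_ratio`); the visibility constraint would additionally pre-filter the template vertices, and 2-way matching is obtained by calling the function in both directions and concatenating the pairs:

```python
import numpy as np
from scipy.spatial import cKDTree

def match(src_pts, src_nrm, dst_pts, dst_nrm, trim_ratio=0.1):
    """One-way nearest-neighbour correspondences with the normal
    constraint (angle < 90 degrees) and uniform outlier trimming.
    Returns an array of (i_src, i_dst) index pairs."""
    src_pts, src_nrm = np.asarray(src_pts, float), np.asarray(src_nrm, float)
    dst_pts, dst_nrm = np.asarray(dst_pts, float), np.asarray(dst_nrm, float)
    tree = cKDTree(dst_pts)
    d, j = tree.query(src_pts)                       # nearest dst point per src point
    dot = np.einsum('ij,ij->i', src_nrm, dst_nrm[j])
    keep = dot > 0.0                                 # normals within 90 degrees
    idx = np.nonzero(keep)[0]
    # uniform outlier removal: drop the trim_ratio fraction with largest distance
    order = np.argsort(d[idx])
    n_keep = max(1, int(len(idx) * (1.0 - trim_ratio)))
    idx = idx[order[:n_keep]]
    return np.stack([idx, j[idx]], axis=1)
```

The trimming fraction plays the role of the paper's uniform outlier removal; its value is an assumption here and would be tuned to the noise level of the capture setup.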
Every iteration of pose optimization or model parameter adaptation is based on an up-to-date set of vertex correspondences. Consequently, the first step of any optimization is the registration of the input frame(s) with the template model. Based on these vertex correspondences, one iteration of optimizing pose and/or model parameters takes place. Then, vertex correspondences are recalculated, prior to using them again for the next iteration.

Setting up a suitable objective function Q
The proposed model adaptive pose estimation is treated as an optimization problem. The objective Q to be minimized in order to find optimal parameters for pose, shape and skeleton is the sum of several objective terms, each introduced to enforce compliance with a certain sub-criterion:

Q = Q_data + w_pose Q_pose + w_temp Q_temp + w_shape Q_shape + w_sym Q_sym. (6)

The scalar-valued, non-negative weights w_pose, w_temp, w_shape, w_sym are used to weight the influence of their corresponding objective terms Q_pose, Q_temp, Q_shape, Q_sym in relation to the data term Q_data.
For each input frame, one set of pose parameters is optimized, consisting of global similarity and kinematic joint rotations. The subject-specific rest pose shape vertices and skeletal joints are optimized commonly for all input frames.
Data term Q_data: The objective term Q_data enforces that corresponding vertices of the template model and the input meshes are close to each other for all N input frames. It is formulated as a normalized sum over all C_n vertex correspondences, independently for each input frame n:

Q_data = (1/N) Σ_{n=1..N} (1/C_n) Σ_{c=1..C_n} | v_c,n^rec − T_n(v_c,n^templ) |_* , (7)

with v_c,n^rec being the c-th correspondence vertex of the 3D reconstruction in input frame n, v_c,n^templ being the template vertex corresponding to v_c,n^rec, and T_n() being the pose alignment transformation (Equation (1)), parameterized to map the template model from rest pose to the pose of input frame n.
In order to make Q_data more robust towards outliers, a robust norm is used, namely the Charbonnier norm [SRB10]: |x|_* = √(x² + ε²), with ε being a small constant, e.g. 10⁻⁶. The Charbonnier norm can be interpreted as a continuous approximation of the L1 norm. Minimizing this robust norm can be implemented using an iteratively reweighted least squares (IRLS) scheme [Zha95].
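The Charbonnier norm and the corresponding IRLS weights can be sketched as follows. This is an illustrative toy (a robust 1D location estimate with hypothetical names), not the paper's solver; the weight formula comes from matching the gradient of the weighted quadratic to the gradient of the robust norm:

```python
import numpy as np

def charbonnier(x, eps=1e-6):
    """|x|_* = sqrt(x^2 + eps^2), a smooth approximation of |x|."""
    return np.sqrt(x * x + eps * eps)

def irls_weights(residuals, eps=1e-6):
    """IRLS weights so that minimizing sum_i w_i r_i^2 matches the
    gradient of sum_i |r_i|_* : w_i = 1 / (2 sqrt(r_i^2 + eps^2))."""
    return 1.0 / (2.0 * np.sqrt(residuals**2 + eps**2))

def robust_mean(data, iters=50, eps=1e-6):
    """Toy IRLS: robust (median-like) location estimate of 1D data."""
    data = np.asarray(data, dtype=float)
    mu = np.mean(data)
    for _ in range(iters):
        w = irls_weights(data - mu, eps)   # reweight by current residuals
        mu = np.sum(w * data) / np.sum(w)  # weighted least-squares update
    return mu
```

With a gross outlier in the data, the IRLS estimate stays near the inlier cluster, whereas the plain mean is dragged towards the outlier; the same mechanism makes Q_data tolerant of bad correspondences.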
This data term Q data is influenced by any parameter change and needs to be considered in any optimization case (in contrast to, e.g. Q sym , which needs to be considered only if the skeletal joint locations are optimized).
Objective term Q_pose to enforce plausible poses: Estimating the pose of a human from depth information poses the challenge that ambiguities need to be resolved. Twist rotations of cylinder-like limbs cannot be determined from depth information because of their almost symmetric nature. Unnatural poses, e.g. with self-intersections or unnatural joint angles, might minimize the objective Q. Further, missing views onto surface areas (due to occlusion) can result in an under-determined equation system (singular matrix).
Consequently, a prior term Q_pose for plausible pose configurations is introduced. We set up a likelihood function of joint angle configurations, which is maximized within the objective Q. We exploit the assumption that rotations of neighbouring joints are more strongly correlated than rotations of joints further apart in the kinematic chain. Hence, we set up a data-efficient likelihood function by using a GMM [Jae16] for each pair of successive joint rotations. Thus, P pairs of consecutive kinematic joints result in P GMMs. Since we use the 3D exponential map [Ude98] as representation for rotations, the employed GMM probability function P_p() for each pair p of successive joints is defined on a six-dimensional domain. According to the GMM framework, the probability density function P_p(), consisting of |P_p| weighted Gaussian distributions N(), is formulated as follows:

P_p(r_pre, r_suc) = Σ_{c=1..|P_p|} π_c N([r_pre, r_suc]; μ_c, Σ_c), (8)

with r_pre and r_suc representing the rotations of two successive joints, π_c being positive mixing weights summing up to one, and μ_c (6D mean vector) and Σ_c (6 × 6 covariance matrix) representing the parameters of mixture component c of the p-th GMM.
Within the GMM learning phase, we set the off-diagonal values in the precision matrix (the inverse covariance matrix) to zero if their correlation turns out to be too small. Because the precision matrix needs to be positive semi-definite [RW06], zeroing is not always possible. In such cases, we lower the respective values in magnitude as long as the matrix remains positive semi-definite, which can be checked by inspecting the eigenvalues. In order to reduce the risk of over-fitting, the optimal GMM configuration for each pair of successive joints is determined using the leave-one-out cross-validation principle, with respect to the number of mixture components as well as the locations of zeros for the off-diagonal precision matrix elements. In order to train the GMMs, the pose parameters resulting from the initial template model learning phase are used (shown in Figure 3).
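Evaluating the learned prior of Equation (8) amounts to computing a GMM density over the concatenated 6D exponential-map pair. A minimal sketch (density evaluation only; the fitting itself would be done with EM, and the negative-log penalty form is an assumption about how the likelihood enters the minimized objective):

```python
import numpy as np

def gmm_pdf(x, pis, mus, covs):
    """Density of a Gaussian mixture at x: sum_c pi_c * N(x; mu_c, Sigma_c)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    total = 0.0
    for pi, mu, cov in zip(pis, mus, covs):
        diff = x - np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        inv = np.linalg.inv(cov)
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
        total += pi * norm * np.exp(-0.5 * diff @ inv @ diff)
    return total

def pose_penalty(r_pre, r_suc, pis, mus, covs, eps=1e-300):
    """Negative log-likelihood of a successive-joint rotation pair
    (exponential-map vectors) under the learned GMM."""
    x = np.concatenate([r_pre, r_suc])
    return -np.log(gmm_pdf(x, pis, mus, covs) + eps)
```

A pair of joint rotations close to the training distribution yields a low penalty; implausible combinations are penalized, which is exactly the soft constraint Q_pose provides per frame.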
The resulting prior term Q_pose for plausible pose configurations consists of a normalized sum enforcing a natural pose independently for each frame:

Q_pose = (1/N) Σ_{n=1..N} Q_n^pose with Q_n^pose = −(1/P) Σ_{p=1..P} log P_p,n(r_p,n^pre, r_p,n^suc), (9)

with Q_n^pose being the objective term penalizing for frame n the deviation of the pose parameters from the a priori learned GMM of likely poses, and P_p,n being the GMM (Equation (8)) evaluated for the pair p of successive kinematic joints in frame n.
This pose prior Q_pose is influenced solely by the kinematic pose parameters. It does not need to be considered during optimization if the pose parameters are kept constant.
Objective term Q_temp to enforce temporal consistency: Real-world data of successively recorded input frames typically contain noise. Consequently, the extracted poses are noisy as well if the input frames are processed independently, and the perception of a smooth flow of movement is distorted. On the other hand, smoothing the pose variables independently of the captured sequence can successfully remove the jitter, but at the cost of no longer resembling the true motion of the observed subject.
The objective term Q_temp is a soft constraint to enforce temporal consistency within the recovered movement across a complete input sequence. Since motion corresponds to the difference of pose parameters in successive frames, undesired changes in motion are reduced by penalizing the difference of the differences of pose parameters in two successive pairs of input frames. This objective term Q_temp is formulated as a normalized sum over the interior input frames (the first frame n = 1 and the last frame n = N are considered implicitly, see Equation (11)):

Q_temp = (1/(N − 2)) Σ_{n=2..N−1} Q_n^temp, (10)

with Q_n^temp being the objective term penalizing the pose parameter difference in frame n with respect to the linear interpolation of its neighbour frames n − 1 and n + 1. Using the exponential map representation [Ude98] for the rotations of the J joints, each encoded as a 3D vector r = [r_1, r_2, r_3], the temporal consistency soft constraint is formulated as:

Q_n^temp = Σ_{j=1..J} ‖ r_j,n − (r_j,n−1 + r_j,n+1)/2 ‖². (11)

This constraint Q_temp is influenced solely by the frame-by-frame pose parameters and reduces the likelihood of drastic changes in motion. It is optional in the sense that it should be discarded from the objective function Q if the input frames contain independently captured poses of a subject, for example a set of laser scans.
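The second-difference penalty underlying Q_temp is easy to sketch for stacked pose vectors (rows = frames). The normalization by the number of interior frames is an assumption here; only the second-difference structure is taken from the text:

```python
import numpy as np

def q_temp(poses):
    """Penalize the second temporal difference of stacked pose vectors:
    each interior frame should lie close to the linear interpolation of
    its two neighbour frames."""
    P = np.asarray(poses, dtype=float)
    N = len(P)
    if N < 3:
        return 0.0
    mid = 0.5 * (P[:-2] + P[2:])                 # linear interpolation of neighbours
    return float(np.sum((P[1:-1] - mid) ** 2) / (N - 2))
```

A perfectly linear motion incurs zero penalty, while frame-to-frame jitter is penalized, which matches the stated goal of smoothing changes in motion rather than the poses themselves.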
Shape prior term Q_shape: The shape of the template model is successively adapted to the captured data in order to approximate the subject, thereby increasing the pose estimation accuracy. We use prior knowledge extracted from the training phase of the initial template to guide this shape adaptation towards plausible human shapes. We calculate the uniform Laplacian vector [Sor06], i.e. the average of the 1-ring neighbours minus the vertex itself, δ_v,m = (1/|N(v)|) Σ_{u∈N(v)} v_u,m^train − v_v,m^train, for each vertex v of each of the M registered training scan meshes, where the pose parameters are known through the template model learning phase (Figure 3).

The prior term Q_shape is introduced into the objective function Q to guide the shape adaptation robustly towards realistic shape results. This prior enforces that the local neighbourhood of each template model vertex is similar to that of the corresponding vertex in the training set. It is formulated as a normalized sum over all M registered example scan meshes used for training the initial template model:

Q_shape = (1/M) Σ_{m=1..M} Q_m^shape. (12)

For each training set mesh m, the shape prior is formulated as a normalized sum over all V vertices:

Q_m^shape = (1/V) Σ_{v=1..V} ‖ L(T_m(v_v)) − δ_v,m ‖², (13)

with v_v being vertex v of the template model, T_m() the pose alignment transformation to training pose m and L() the uniform Laplacian operator. Soft-constraining the proposed objective function Q with this shape prior Q_shape allows adapting the shape vertices of the template model directly to the captured input frames without suffering from imperfect input data. A suitable value for the weight w_shape depends on the nature of the input data. As illustrated in Figure 4, too low values result in noisy shapes, whereas too large values result in losing the details intended to be learned from the input data.
The shape prior Q_shape is influenced by the rest pose vertices as well as the kinematic skeleton used within the skinning function T_m() to transform the rest pose vertices to the registered training scan m. It does not need to be considered during optimization if only pose parameters are optimized.
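The uniform mesh Laplacian on which the shape prior is built can be sketched in a few lines. This is a minimal illustration assuming the 1-ring neighbourhoods are given as index lists per vertex:

```python
import numpy as np

def uniform_laplacian(verts, neighbors):
    """Uniform Laplacian vector per vertex: mean of the 1-ring
    neighbours minus the vertex itself. `neighbors[i]` is the list of
    vertex indices adjacent to vertex i."""
    verts = np.asarray(verts, dtype=float)
    return np.array([verts[nb].mean(axis=0) - verts[i]
                     for i, nb in enumerate(neighbors)])
```

The Laplacian vector is (near) zero where a vertex lies at the centroid of its neighbours, so penalizing deviations of these vectors from their training-set counterparts preserves local surface detail while suppressing noise.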
Symmetry term Q_sym within the kinematic skeleton: Since the ability of the template model to adopt certain poses depends on the proportions of the embedded kinematic skeleton, the locations of the joints are adapted towards the captured data as well. While optimizing the locations of the joints, symmetry within the kinematic skeleton is enforced using a heavily weighted soft constraint. The prior term Q_sym penalizing differences in length of corresponding bones on the left- and right-hand side of the kinematic skeleton is formulated for the set J_s of pairs of joint indices of bones intended to be of equal length as follows:

Q_sym = Σ_{(l,r)∈J_s} ‖ (t_l − t_suc[l]) − R_lr (t_r − t_suc[r]) ‖², (14)

with t_l and t_r being the locations of corresponding joints on the left and right of the skeleton control structure and suc[·] the successor function yielding the index of the next joint within the kinematic chain. The 3 × 3 rotation matrix R_lr is used to bring the orientations of both 'bones' into alignment, so that their lengths are directly comparable. This matrix R_lr is calculated from the 'bones' t_l − t_suc[l] and t_r − t_suc[r] using the dot and cross product.
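Building R_lr from the dot and cross product of the two bone vectors is the classical Rodrigues construction; a sketch with hypothetical names (`align_rotation`, `bone_length_penalty`) follows:

```python
import numpy as np

def align_rotation(a, b):
    """Rotation matrix R with R @ a_hat = b_hat, built via Rodrigues'
    formula from the cross product (axis) and dot product (angle)."""
    a = np.asarray(a, float); b = np.asarray(b, float)
    a = a / np.linalg.norm(a); b = b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = np.dot(a, b)
    s = np.linalg.norm(v)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)                      # already aligned
        e = np.eye(3)[np.argmin(np.abs(a))]       # anti-parallel: rotate 180 deg
        n = np.cross(a, e); n /= np.linalg.norm(n)
        return 2.0 * np.outer(n, n) - np.eye(3)
    K = np.array([[0, -v[2], v[1]],
                  [v[2], 0, -v[0]],
                  [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K * ((1.0 - c) / (s * s))

def bone_length_penalty(t_l, t_l_suc, t_r, t_r_suc):
    """Squared length mismatch of two corresponding bones, after
    rotating the right bone onto the left bone's direction."""
    bl = t_l - t_l_suc
    br = t_r - t_r_suc
    R = align_rotation(br, bl)
    return float(np.sum((bl - R @ br) ** 2))
```

Because R_lr only rotates, the residual vanishes exactly when the two bones have equal length, regardless of their orientations in the skeleton.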

Optimization of objective Q
The proposed model adaptive pose estimation is treated as an optimization based on minimizing the objective Q. Because of the unknown shape and pose parameters, an accurate registration of the template model with the input scans is difficult, particularly because of the high number of DoF the pose variables and model parameters impose. If solved naively, the optimization is likely to end up in a local minimum of the objective function.
Our optimization can be used either for a sequence of frames representing a continuous movement, or alternatively, for a set of input frames representing independent poses. In the latter case, a rough pose initialization of all input frames is required, whereas in the former case such an initialization is required only for the first frame. Such a pose initialization can be achieved manually, through pre-defined poses (A-pose, T-pose, etc.), or automatic methods, like shape similarity trees [BHH11].
First step: optimization of pose parameters The first step of the proposed method is to optimize for the pose parameters of all input frames using the initial non-adapted template model, without any temporal consistency enforcement (w temp = 0). Since neither the skeleton control structure nor the template vertices are modified in this step, no shape prior or symmetry constraint is used (w shape = 0 and w sym = 0). Consequently, the objective function reduces to Q = Q data + w pose Q pose .
In the case of independent poses, already roughly pose initialized, the pose parameters are independently optimized in parallel for each input frame. For sequences of continuous movements, one frame after another is optimized, using the result of the already optimized predecessor frame as initialization of the next pose.
The specific variables to be optimized during pose fitting for each of the input frames are the global translation (3DoF) and rotation (3DoF) as well as the rotation parameters for each kinematic joint (J × 3DoF). This results in a highly non-linear equation system, which is minimized using a combination of Levenberg-Marquardt and line search. The Levenberg-Marquardt scheme is used to determine the optimal direction of the optimization update step and the line search algorithm is used to determine the optimal step width.
Additionally, for robustness purposes, the optimization is restricted to a reduced solution space of poses. A hard min/max box constraint is imposed independently for each DoF of the joint rotation variables. These lower/upper bounds correspond to the min/max values present in the training set of the initial template model (Figure 3).
Using all pose-adapted frames simultaneously as input (or alternatively a representative subset of them), the subject-specific uniform scale parameter (1DoF) can be jointly optimized together with the pose parameters. Solving this optimization including scale is accelerated by exploiting the block structure found in the Jacobian matrix due to the independence of the pose parameters among the different input frames.
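The described block structure can be sketched as follows: with independent per-frame pose blocks plus one shared scale variable, the normal-equation matrix becomes an "arrowhead" matrix that can be solved via a Schur complement on the 1-DoF scale. This is our illustration of the idea, not the paper's code; `solve_arrowhead` and its arguments are hypothetical names:

```python
import numpy as np

def solve_arrowhead(blocks, couplings, a, g_blocks, g_s):
    """Solve normal equations whose Hessian is block-diagonal in the
    per-frame pose parameters plus one dense row/column for the shared
    scale, via a Schur complement on the scale variable."""
    Binv_c = [np.linalg.solve(B, c) for B, c in zip(blocks, couplings)]
    Binv_g = [np.linalg.solve(B, g) for B, g in zip(blocks, g_blocks)]
    # Schur complement of the (scalar) scale entry
    schur = a - sum(c @ bc for c, bc in zip(couplings, Binv_c))
    rhs = g_s - sum(c @ bg for c, bg in zip(couplings, Binv_g))
    s = rhs / schur
    # Back-substitute the scale into each independent frame block
    thetas = [bg - bc * s for bg, bc in zip(Binv_g, Binv_c)]
    return thetas, s

# Demo: two frames with 3 pose DoF each plus one shared scale
B1 = np.array([[4., 1., 0.], [1., 3., 0.], [0., 0., 2.]])
B2 = np.diag([2., 3., 4.])
c1, c2 = np.array([1., 0., 1.]), np.array([0., 1., 0.])
thetas, s = solve_arrowhead([B1, B2], [c1, c2], 10.0,
                            [np.array([1., 2., 3.]),
                             np.array([4., 5., 6.])], 7.0)
```

Each per-frame block is factorized independently, which is exactly the saving that the independence of pose parameters among input frames provides.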
Second step: adaptation of template model jointly together with pose parameters Our model adaptation optimizes the model parameters, specifically skeletal joints (J × 3DoF) and shape vertices (V × 3DoF), in conjunction with refining the pose variables mentioned above in one common joint optimization over all input frames (or alternatively a representative subset). Besides the objective terms Q_data and Q_pose, the shape prior Q_shape and the skeletal symmetry constraint Q_sym are employed at this step to keep the skeleton control structure symmetric and the shape plausibly human.
Inspecting the complete pose aligning transformation T(·) (including kinematics-based skinning as well as the global similarity, Equation (1)), it can be seen that an animated template vertex v′ depends linearly on its rest-pose vertex v. Thus, the transformation can be rewritten in matrix form, v′ = T_T · v + t_T, with T_T being a 3 × 3 matrix and t_T a 3-vector. Note that, as a consequence of LBS, T_T is in general not a rotation. Both quantities, T_T and t_T, are specific to each vertex because they are influenced by the corresponding skinning weights.
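A minimal sketch of this per-vertex affine form, assuming plain linear blend skinning with per-bone rotations and translations (the function and variable names are ours):

```python
import numpy as np

def lbs_vertex_transform(weights, bone_rotations, bone_translations):
    """Blend the bone transforms of one vertex with its skinning weights.
    Because the blend is a weighted sum of rotation matrices, the
    resulting 3x3 matrix T_T is in general NOT a rotation."""
    T = sum(w * R for w, R in zip(weights, bone_rotations))
    t = sum(w * tr for w, tr in zip(weights, bone_translations))
    return T, t  # animated vertex: v' = T @ v + t

# Two bones, a 90-degree z-rotation vs. identity, blended 50/50
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
T, t = lbs_vertex_transform([0.5, 0.5],
                            [Rz, np.eye(3)],
                            [np.zeros(3), np.zeros(3)])
# det(T) = 0.5 < 1: the blended matrix shrinks volume, so it is
# not a rotation (the well-known LBS "candy-wrapper" effect)
```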
Employing a first-order Taylor approximation scheme to discard higher order terms, an equation system is set up resulting in a huge sparse matrix. Following the Gauss-Newton optimization scheme, this equation system is solved efficiently in an (iteratively reweighted) least squares sense using a sparse solver, such as the sparse PARDISO solver from the Intel MKL library.
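A hedged sketch of such an iteratively reweighted Gauss-Newton solve, using SciPy's generic sparse solver in place of PARDISO and Huber weights as one plausible reweighting choice (the toy line-fitting demo is ours, not from the paper):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def irls_gauss_newton(residual, jacobian, x, iters=30, delta=1.0):
    """Iteratively reweighted least squares with Gauss-Newton updates.
    The sparse normal equations would be handed to a direct sparse
    solver such as PARDISO; here SciPy's spsolve stands in."""
    for _ in range(iters):
        r = residual(x)
        J = sp.csr_matrix(jacobian(x))
        # Huber weights downweight outlier residuals
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))
        W = sp.diags(w)
        H = (J.T @ W @ J).tocsc()
        g = J.T @ (w * r)
        x = x - spsolve(H, g)
    return x

# Toy: fit a line with one gross outlier; Huber IRLS stays near the truth
xs = np.arange(10.0)
ys = 2.0 * xs + 1.0
ys[5] += 100.0
params = irls_gauss_newton(
    residual=lambda p: p[0] * xs + p[1] - ys,
    jacobian=lambda p: np.column_stack([xs, np.ones_like(xs)]),
    x=np.zeros(2))
```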
Optional third step: enforcement of temporal consistency Jittering motions are avoided by enforcing temporal consistency in an additional optimization pass with coupled pose variables of successive frames using the objective term Q_temp. The resulting equation system for a whole sequence requires the complete input data to be processed at once and kept in memory. This high memory requirement is reduced by employing an overlapping sliding window algorithm: process the first n frames (e.g. n = 10), i.e. indices 1 to n; then process frames n/2 to n/2 + n while keeping the pose parameters of frame n/2 constant; then frames n to 2n, and so forth. Because the pose parameters of the first frame in each sliding window are already optimized by the previous window iteration and kept constant for the current one, smooth continuous movements along the complete sequence are enforced, while still correctly following the observed movements of the captured subject.
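The overlapping window scheduling can be sketched as follows (function name and return layout are our own; frames are 0-indexed here):

```python
def sliding_windows(num_frames, n=10):
    """Yield (anchor, window) pairs for half-overlapping windows.
    Frames in `window` are optimized jointly; the `anchor` frame's
    pose is kept constant (it was already optimized by the previous
    window), which stitches consecutive windows together smoothly."""
    windows = [(None, list(range(0, min(n, num_frames))))]
    start = n // 2
    while start + 1 < num_frames:
        end = min(start + n, num_frames)
        windows.append((start, list(range(start, end))))
        start += n // 2
    return windows
```

For 25 frames and n = 10 this yields windows 0–9, then 5–14 with frame 5 fixed, then 10–19, and so on, so every frame is optimized at least once and each window shares a fixed frame with its predecessor.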
Practical considerations Typically, an insufficiently adapted template model is not well suited to resemble a captured input sequence, especially for complex motions and/or a subject too different from the initial template model. Mismatching limb proportions as well as a non-conforming shape lead to tracking loss. In order to increase the tracking robustness, the model parameters can be adapted multiple times, e.g. for the first pose-optimized input frame, then again after 10, 50, 100, etc. frames.

Evaluation, Discussion and Applications
The proposed model adaptive pose estimation supports tracking the kinematic pose of a moving human from noisy, partial 3D observations while adapting the template model's skeleton joints and shape vertices to better resemble the captured subject. The optimization is guided towards realism with integrated prior knowledge, controlled through prior weights. As demonstrated experimentally in Section 4.1, this approach is capable of tracking motion and learning a realistic articulated model from a 3D reconstructed multi-view sequence or a set of different poses. The output is a model adapted to the captured subject with respect to shape vertices and kinematic skeleton joints, as well as animation parameters to resemble the observed movements. The proposed approach focuses on modelling kinematically caused motions. Handling of other kinds of non-rigid deformations (e.g. facial expressions and loose cloth) can be integrated, as discussed in Section 4.2. The separation of animation parameters and 3D shape makes the model well suited for various manual/automatic manipulations. Further, by exploiting the vertex correspondences, this approach can be used to directly animate captured 3D scans.

Experimental evaluation
The proposed optimization provides pose estimation and tracking while simultaneously adapting the template tracking model towards the observed subject. The model used for all experimental evaluations is shown in Figure 3. It has been learned from the SCAPE dataset [ASK*05] using the algorithms described in Section 3.1.
Several investigations have been carried out to experimentally evaluate the performance of the complete algorithmic framework as well as of its two main components: model adaptation and pose estimation. All results are shown in the accompanying video (Video S1) to provide a better impression of the performance. To judge the accuracy of the proposed algorithms, the Euclidean distance between the fitted template model and the input data is analysed: for every input 3D scan vertex, the distance to its closest triangle of the fitted template model is computed. The widely used mean Euclidean distance and standard deviation are calculated. Since the input data are imperfect 3D measurements with noise as well as severe outliers, these non-robust statistical measures are negatively impacted and give a skewed impression. Consequently, the robust, outlier-tolerant median (MED) is computed as well. In order to robustly quantify the dispersion, we use the median absolute deviation (MAD) and the interquartile range (IQR), which are commonly used in robust statistics [IH93, UC97].
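These robust statistics are straightforward to compute; the short sketch below (ours, with made-up numbers) illustrates how a single gross outlier skews the mean but barely moves MED, MAD and IQR:

```python
import numpy as np

def robust_stats(distances):
    """Robust accuracy statistics: median (MED), median absolute
    deviation (MAD) and interquartile range (IQR), which tolerate
    the outliers that skew mean and standard deviation."""
    d = np.asarray(distances, dtype=float)
    med = np.median(d)
    mad = np.median(np.abs(d - med))
    q1, q3 = np.percentile(d, [25, 75])
    return med, mad, q3 - q1

# One gross outlier barely moves the robust measures
clean = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1])
dirty = np.append(clean, 50.0)
```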

Model adaptation
The capability to adapt the template model with respect to shape vertices and skeletal joints towards captured 3D data of other individuals is validated using the publicly available dataset MPI FAUST [BRLB14]. This dataset consists of non-watertight full-body 3D scans of 10 subjects, five males and five females, in an age range between 18 and 70 with strong variation in physical properties (height, weight and fitness). Captured meshes of 20 different static pre-defined poses are provided for each subject, resulting in a dataset of 200 3D reconstructions in total.
Initially, the template model is manually brought into coarse kinematic alignment. Since each subject adopted the same set of poses, this task is done for the first subject only, providing a suitable initial pose configuration for all subjects. The pose parameters are optimized independently for each input frame using the initial non-adapted template model. Using the resulting optimized pose parameters, the template model is adapted with respect to shape vertices and skeletal joints, jointly together with refining the already optimized pose parameters. The prior weights are empirically set to w_pose = 0.001, w_shape = 1000 and w_temp = 0.
As shown qualitatively in Figure 5, the adapted models resemble the geometry of the captured subjects very well in all cases. This wide spectrum, men as well as women, skinny as well as corpulent physiques, is recovered in detail from the input data, leading to an appropriately adapted template model. The fists of the initial template model are transformed and closely approximate the finger postures of the different input scans.
A quantitative evaluation is summarized in Table 1; the left value in each column denotes the measurement across the whole sequence, on the right the maximum value is provided. Despite the 3D scan artefacts (noise, missing data and outliers) shown in Figure 5, the broad range of physical properties present in this dataset is well approximated. On average, the median Euclidean distance of input scan vertices to the closest template model triangle is well below 2 mm, with a median absolute deviation of 1.3 mm.

Pose estimation and tracking
The pose estimation and tracking accuracy of our optimization is evaluated using the publicly available dataset MPI BUFF [ZPBPM17]. This dataset, captured with 30 FPS, consists of unregistered full-body 3D reconstruction sequences of five different subjects (three males and two females). Each subject performed the same set of three different movements with two different types of cloth, resulting in six sequences per subject. In total, the complete dataset of all subjects contains ≈10 k non-watertight meshes with an average per-scan size of ≈125 k vertices and ≈250 k triangles.
A crude pose alignment is manually generated for the first frame of the first sequence. This pose is used as tracking initialization for all sequences, because all start with a nearly identical pose. The proposed optimization processes each input sequence according to Section 3.2.3. In order to keep the computational load and memory requirements at a moderate level, every 25th frame is used for the model adaptation, resulting in 43 frames being used on average. The temporally consistent pose estimation uses the overlapping sliding window mechanism with a window size of 11 frames and 5-frame increments. The prior weights are empirically set to w_pose = 5, w_temp = 10^3, w_shape = 10^5 and w_sym = 10^6. The intermediate results of the different processing steps are shown in Figure 6. A qualitative result is shown in Figure 7 (also Figure 8) with an input/output comparison for every subject. It can be clearly seen that even difficult motions are well recovered, e.g. a rotated upper body. The hands, represented in the template model as closed fists and in the dataset as spread fingers, are well approximated without any orientation error. Individual geometric peculiarities are recognizably reproduced, e.g. facial appearance, haircut and, to a certain degree, the loose cloth.
Quantitatively, as summarized in Table 2, the median Euclidean distance of input frame vertices to the closest template model triangle is on average well below 2 mm, with a median absolute deviation from this value of 1.4 mm. With this result, the proposed model adaptive motion capture is as accurate as current state-of-the-art approaches. The method presented in [ZPBPM17] reported an average mean accuracy of 2.5 mm on the same MPI BUFF dataset, whereas the proposed method achieves 3.3 mm. Qualitatively, [ZPBPM17] is able to better recover the details of the hands and face, but does not extract details of cloth, like the folds and wrinkles of the trousers and shirt, as the proposed method does. The method presented in [AMX*18b] reported an average mean accuracy of 5.37 mm on this dataset. It needs to be stated, however, that this comparison is only indicative, because they solve a different problem: model adaptive motion capture using a single monocular 2D RGB sequence of a moving person as input (generated artificially for this dataset via projection onto the 2D plane).

Influence of model adaptation on pose estimation
The influence of model adaptation on pose estimation accuracy is experimentally evaluated, again using the MPI BUFF dataset. To this end, the interim results of the different processing steps presented in the previous section are analysed. Additionally, each sequence has been processed both with and without model adaptation, but without temporal consistency enforcement.
A qualitative result is shown in Figure 8. The non-adapted template model is not able to accurately adopt the pose of the input frame: only a very rough approximation is achieved, because the kinematic bone structure and surface properties of the template model do not match the subject's characteristics. The single-frame adaptation of the template model to the first frame of the input sequence individualizes the model only roughly, but already sufficiently to allow accurate tracking of the complete sequence. The multi-frame adaptation of the template model finally resembles the captured subject in finer detail, e.g. the face becomes recognizable. The quantitative analysis, summarized in Table 3, measures the distance between the vertices of the input 3D scans and the closest triangle of the adapted template model. The median distance for tracking with a non-adapted template model is on average 6.5 mm. After adapting to the first frame, the median tracking accuracy already reduces to 2.8 mm. Finally, adapting the model to the complete sequence results in a median tracking accuracy of less than 2 mm. Similarly, the median absolute deviation from the median distance reduces from 4.4 mm for the non-adapted template model, over 1.9 mm for the first-frame-adapted template model, to 1.4 mm for the fully adapted template model.

Temporal consistency
The enforcement of temporal consistency using a weight w_temp > 0 reduces the jittery effect to a non-visible minimum. The resulting sequence of poses is smooth, but still accurately follows the captured subject, thereby enhancing the qualitative perception. Quantitatively, the statistics with and without enforcement of temporal consistency are nearly identical, indicating that there is no loss in accuracy.

Noisy, incomplete input data
Our method provides pose estimation with simultaneous model adaptation on incomplete 3D reconstructions, where not every direction has been captured (missing views). Such footage results, for example, from hardware setups with cameras on only one frontal plane, or if the cameras are placed along a half circle. To experimentally evaluate the achievable quality and accuracy, three different hardware setups with increasing hardware performance have been used to capture three different subjects (one male and two females). Two of the datasets have been made publicly available [Fec17].
In all captured sequences, the subjects wore everyday, relatively tight-fitting clothes. The subjects were instructed to perform movements in which every major limb was moved at least once (in order to properly extract the characteristics of the kinematic joints) and to turn around 360° during this movement (in order to capture every surface patch at least once).
The characteristics of the three multi-view hardware setups are summarized in Table 4. All sequences were recorded with 25 FPS (hardware synchronized). The 1920 × 1080 resolution cameras were Basler ace acA2000-50gc and the 5120 × 3840 resolution cameras were Ximea CB200CG-CM. In all capture scenarios, the cameras were placed in pairs to allow stereo 3D reconstruction algorithms to be used, whose results are merged into one 3D mesh per captured frame [WFS*16].
As can be seen in the qualitative results of Figure 9, the three captured sequences are of different quality. The resulting animations of the template model are, for all three sequences, accurately aligned in pose and shape with the captured 3D reconstructions. Complex motions, like turning around, are recovered without difficulty, which, for example, is not possible with the Kinect device. The model adaptations closely resemble the subjects including detailed individual peculiarities, e.g. a ponytail hairstyle.
The quantitative measurements are summarized in Table 4. The achieved accuracy varies with the quality of the input data: a high noise level within the input data produces higher measurement errors (median: >4.3 mm, MAD: >2.7 mm), whereas a low level of noise results in higher accuracy (median: <2.2 mm, MAD: <1.5 mm).

Validation and discussion
The specific run time of the proposed method depends on the number of input meshes and vertices. On an Intel E5-2687W CPU with 3.40 GHz, one iteration of pose parameter optimization for an MPI FAUST dataset mesh takes less than 300 ms. One iteration of model adaptation for one MPI FAUST subject (20 example meshes) takes ≈100 s. The complete computation of one MPI FAUST adapted template model including pose refinement requires up to 6 h. Processing the MPI BUFF sequences requires ≈8 h per sequence (≈400 frames) to calculate the adapted model together with temporally consistent pose parameters.
The implemented algorithms are only moderately optimized, with strong potential for further enhancements. For instance, the computational load can be reduced by using a hierarchical approach similar to [dAST*08, BBLR15]: a registered low-poly and high-poly version of the template model is employed, using the low-poly version for the majority of the computations and the corresponding high-poly model only for the final high-detail steps. Similarly, in cases of high-poly input 3D scans, a mesh simplification [KH13] or sampling approach [YSD*16] might be beneficial. Last but not least, the inherent parallelism could be exploited to a much stronger degree on CPU as well as GPU, thereby further reducing the computation time. Using such acceleration mechanisms, real-time tracking could be achieved by shifting the model adaptation to a prior offline process. Alternatively, a progressive model adaptation scheme could be realized by employing the recent Levenberg-Marquardt Kalman filter [SHA15, TTR*17].
The proposed optimization approach focuses on modelling and handling surface deformations evoked by kinematic dynamics. It does not model pose-dependent shape deformations (e.g. muscle bulging) or facial expressions. Such capabilities could be integrated using blend shape approaches in a similar way as [LMR*15, AMX*18b]. Moreover, small-scale deformations, as required for finger motions, are out of the scope of this work. In principle, it is straightforward to directly extend the template model with kinematically animatable fingers, but this significantly increases the demand for high-resolution 3D scanning [JSYS18] or alternatively requires other means to robustly extract the hand poses from the captured input data, e.g. [RTB17]. Explicit handling of complex non-rigid, non-kinematic deformations, e.g. from loose cloth or hair, has not been considered in this work; identification of cloth vertices, automatic estimation of cloth properties as well as real-time animation of cloth deformations could be integrated in a straightforward manner. However, despite the restriction to kinematically evoked deformations, the proposed optimization experimentally proved to perform well. Further, its versatile applicability is demonstrated in the next section with relevant example applications.

Applications
The output of our framework is a model adapted in shape and skeleton to a captured subject as well as animation parameters to resemble the observed movements. This footage is beneficial for several kinds of applications.
Since the model animates the whole surface instead of only a partial view, it can be used for free-viewpoint video. The separation of animation parameters and shape makes it well suited for manipulations. Combining sequences of animation parameters is straightforward, and since animation parameters correspond to joint angles, interpolations between configurations provide valid interim poses.
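For example, valid interim poses between two joint configurations can be obtained by interpolating each joint rotation on the rotation manifold; a minimal sketch using quaternion slerp (our illustration, not the paper's code):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions,
    giving a valid rotation for every t in [0, 1]."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:            # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: lerp and renormalize
        q = (1 - t) * q0 + t * q1
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0
            + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_pose(pose_a, pose_b, t):
    """Blend two poses (one unit quaternion per joint) joint by joint."""
    return [slerp(qa, qb, t) for qa, qb in zip(pose_a, pose_b)]
```

Because every interpolated quaternion is again a unit quaternion, each interim pose remains a valid joint configuration of the skeleton.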
Textured animations can be achieved in real time [FPHE16, FPE14]. The model is animated with a Kinect, using retargeting to map the skeleton parameters. A static full-body texture is extracted from the multi-view sequence used for the subject-adapted model generation. Realistic facial expressions are provided by dynamically updating the texture's facial area from camera input.

Manipulation and editing of motion
Since the animation model is based on skinning with a kinematic skeleton, the animation parameters are directly accessible and can be manipulated artistically by hand as well as in an automatic manner. Generating new/edited animations is compatible with traditional character modelling, thereby making it amenable to human modelling artists/animators. Importing the model into standard human modelling software can be achieved using methods like [KCO09]. Automatic adaptations can be achieved by exploiting the knowledge of the underlying mesh topology of the original template model, providing semantic annotations. An example is to automatically control the gaze direction in real time. A virtual line going through the area between the eyes and the back of the head is used to infer the animation parameters required to let the model look into a certain direction. To provide a natural appearance of the head movement, the animation parameters are blended, temporally and spatially: temporally, the magnitude of the gaze direction adaptation is faded in and out using sigmoid functions to seamlessly start and stop the control over the viewing direction; spatially, the gaze correction is distributed among the breast joint (10%), neck joint (40%) and head joint (50%). As shown in Figure 10, this automatic gaze direction control seamlessly adapts the viewing direction without disturbing the natural appearance.
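The temporal/spatial blending of the gaze correction can be sketched as follows. This is a hypothetical illustration: the function names and the sigmoid steepness `k` are our assumptions, while the 10/40/50% joint split is taken from the text:

```python
import numpy as np

def gaze_blend(correction_angle, t, t_start, t_end, k=10.0):
    """Fade a gaze correction in/out with sigmoids over [t_start, t_end]
    and distribute it spatially over breast, neck and head joints."""
    fade_in = 1.0 / (1.0 + np.exp(-k * (t - t_start)))
    fade_out = 1.0 / (1.0 + np.exp(-k * (t_end - t)))
    magnitude = correction_angle * fade_in * fade_out
    # Spatial distribution over the kinematic chain (10/40/50 %)
    return {'breast': 0.10 * magnitude,
            'neck':   0.40 * magnitude,
            'head':   0.50 * magnitude}
```

Inside the active interval both sigmoids are close to 1, so nearly the full correction is applied; well before `t_start` or after `t_end` the correction vanishes, which makes the takeover of the viewing direction seamless.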

Direct animation of 3D reconstructed input data
To further increase the realism, the 3D reconstructed input data can be animated directly by using the model only internally, without visualizing it. This is achieved by starting from a frame where shape and pose are already fitted to the captured subject. For each of the scan vertices, the location relative to the closest triangle of the model is calculated and held fixed, virtually gluing the scan vertex to the template triangle with constant distance and orientation. To this end, the scan vertex is projected onto the plane of the closest triangle, and the barycentric coordinates are calculated as well as the orthogonal distance vector between the scan vertex and the triangle's plane. Modifying an animation parameter alters the location of the model vertices. The updated model triangles are then used to calculate new positions of the scan vertices, thereby animating the scan itself.
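The gluing step can be sketched as follows (our illustration with hypothetical names): store the barycentric coordinates plus the signed normal offset once, then replay them against the animated triangle:

```python
import numpy as np

def glue_to_triangle(p, tri):
    """Express scan vertex p relative to its closest model triangle:
    barycentric coordinates of its in-plane projection plus the signed
    orthogonal offset along the triangle normal."""
    a, b, c = tri
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    offset = np.dot(p - a, n)          # signed distance to the plane
    q = p - offset * n                 # projection onto the plane
    # Barycentric coordinates of q: q = u*a + v*b + w*c, u + v + w = 1
    T = np.column_stack([b - a, c - a])
    v, w = np.linalg.lstsq(T, q - a, rcond=None)[0]
    return (1.0 - v - w, v, w), offset

def replay(bary, offset, tri):
    """Reapply the stored relation to an animated (moved) triangle."""
    a, b, c = tri
    u, v, w = bary
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    return u * a + v * b + w * c + offset * n
```

Since barycentric coordinates and the normal offset are invariant under rigid motion of the triangle, the scan vertex rigidly follows its glued triangle through any animation.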
In Figure 11, a full-body scan of a human is shown, consisting of 227 k vertices and 453 k triangles, 3D reconstructed using [NLM*17]. Using the pose- and shape-optimized model to deform a captured human produces realistic animations of 3D scans. Since the original captured data rather than the model is visualized, no loss of detail occurs, reproducing every captured detail as accurately as possible.
This method can also be applied for the animation of dynamic body scans (e.g. created by [EFR*17]) in order to allow for pose modification while keeping the original details of the scans.

Conclusions
In this paper, we have presented a novel markerless model adaptive motion capture method. A rigged shape model, learned a priori from real-world data, is used as the kinematic tracking model. In order to properly resemble an input 3D mesh sequence, the template model is adapted to the observed input data with respect to shape vertices and skeleton joint positions. Prior knowledge extracted from the training set of the initial template model is used to guide the optimization towards plausible poses and shapes: (1) a data-efficient pose prior based on GMMs defined on every pair of successive joints in the kinematic chain and (2) a shape prior based on mesh Laplacians defined on the vertex neighbourhood structure of the template model's triangle topology. Since a semantic mapping of a template to the input 3D sequence is achieved, it is straightforward to manipulate the resulting animation sequence, e.g. looping over sub-sequences, exchanging the model, adapting movements, etc. Due to the underlying skinning-based animation style, manipulations are easily accessible to human modelling artists/animators. The high accuracy of pose and model fitting of this approach has been experimentally demonstrated on publicly available datasets and captured multi-view sequences.