Human motion retrieval based on freehand sketch

In this paper, we present an integrated framework for human motion retrieval based on freehand sketches. By following a few simple rules, the user can acquire a desired motion by sketching several key postures. To retrieve efficiently and accurately by sketch, the 3D postures are projected onto several 2D planes. The limb direction feature is proposed to represent both the input sketch and the projected postures. Furthermore, a novel index structure based on the k-d tree is constructed to index the motions in the database, which speeds up the retrieval process. With our posture-by-posture retrieval algorithm, a continuous motion can be obtained directly or generated by using a pre-computed graph structure. In addition, our system provides an intuitive user interface. The experimental results demonstrate the effectiveness of our method. © 2014 The Authors. Computer Animation and Virtual Worlds published by John Wiley & Sons, Ltd.


INTRODUCTION
With the development of motion capture techniques, motion capture data has been widely used for capturing character motions in many applications such as performance animation, video games, featured films, and virtual reality [1,2]. The manipulation of motion capture data such as motion retargeting [3,4], motion synthesis [5][6][7][8], and motion editing [9,10] needs an easy, efficient, and accurate method to retrieve similar motions from the large amount of recorded motion capture data.
To get a desired motion from a large human motion repository, the form of the query is an important consideration for animators. Prior works usually use either a textual description [11] such as 'running', 'kicking', and 'fighting', or a sample motion clip [12][13][14] as the input of the retrieval system. Textual queries find similar motions efficiently, but annotating a large number of motions in the repository requires a great deal of manual work, and text cannot always describe human motion sufficiently. For example, it is hard to describe a complicated motion such as 'run in the first half, several loops in the second half', and it is difficult to decide on suitable labels for two different freestyle dances. With a motion clip as the query, it is easy to measure the similarity between clips, especially between single postures, but the query clip is usually hard to acquire: the user may have to review a large number of examples in the motion repository. Although Numaguchi et al. [15] use a puppet interface to generate the query motion clip, a puppet instrumented with sensors and potentiometers is not an easily obtained device. In this paper, the hand-drawn sketch is used as the input. A sketching interface is intuitive even for novices, because most people grew up scrawling and drawing line figures. Our sketching interface adds a few simple constraints, such as requiring that the first stroke be the torso.
Efficiency and accuracy are the key issues of a retrieval system: users care about whether the retrieved result is similar to what they want and how quickly it is returned. The major challenge for a sketch-based retrieval system is matching the given 2D sketches to the appropriate sequences in the 3D motion database. In this work, we extract the key postures of the 3D motions and then project them onto several 2D planes, which yields a large number of projected postures. The limb direction (LD) feature is proposed to match sketches with projected postures accurately. A novel method based on the k-d tree [16] is used to index the LD features of the projected postures, which makes searching a large number of human motions efficient. Sometimes the database does not contain the motion the user desires, but it can be synthesized by blending existing motion segments; a motion graph [7] is used to generate the continuous motion.
An intuitive user interface makes the retrieval system easy to use for both professional animators and ordinary users. We provide an interface on which the user can draw any number of key posture sketches. Only a few very simple rules have to be obeyed, and the sketched postures can be modified at any time. For each sketched posture, our system provides a group of candidate postures so that the user can select the best matches.
This paper is organized as follows. We first cover related work in Section 2. Section 3 gives an overview of our human motion retrieval framework. Section 4 describes the sketching interface of our system. The motion index based on the k-d tree and the LD feature is introduced in Section 5. Section 6 presents the retrieval algorithm and how the motion transition graph is built. Finally, we present the experimental results and the conclusion in Section 7 and Section 8.

RELATED WORK
Motion retrieval is a key issue of motion synthesis and motion reuse and has been studied for over a decade. The existing motion retrieval approaches can be classified into three categories based on the query input: motion clip, puppet, and sketch.
Because of the complexity of motion data, the motion clip itself would be the best query to search for similar motions. For instance, Chiu et al. [17] propose a two-stage motion retrieval system. They use an index map structure based on an affine invariant posture feature to find the candidate clips; then Dynamic Time Warping (DTW) is used to calculate the similarity between the query example and each candidate clip. However, the high computational complexity of DTW makes it difficult to meet the requirements of real-time motion retrieval. Müller et al. [14] use boolean geometric posture features that describe geometric relations between several specified body joints for an efficient comparison between different motion data. Deng et al. [18] suggest a motion pattern discovery and matching approach that breaks human motions into a part-based, hierarchical motion representation. They use a fast string matching algorithm for efficient runtime query processing. Zhu et al. [19] present a quaternion space decomposition method that uses quaternion combinations to represent the high-dimensional human motion data. All of the methods mentioned above require a query motion clip that is similar to the desired one. When the user does not have a similar clip, considerable effort must be spent capturing the query clip, and some motions, such as a back flip or a handstand walk, are hard to perform.
A few researchers use an instrumented puppet as the input interface. Johnson et al. [20] develop an instrumented puppet to control a bird-like character, and hidden Markov models (HMMs) are used to recognize the user's manipulation. Feng et al. [21] acquire the query key postures by posing a wooden artist's doll with painted joints in front of a stereo camera. Numaguchi et al. [15] present a puppet instrumented with orientation sensors and potentiometers. The user can control the puppet to perform an approximation of a desired motion. However, the cost of the devices and the operational complexity limit these methods.
In recent years, sketch input has gradually been adopted in character animation. Thorne et al. [22] introduce an approach in which animated motion is created for the character by drawing a continuous sequence of lines, arcs, and loops, which are then mapped to a parameterized set of output motions. Li et al. [23] propose a sketch-based approach for creating Kung-fu motions. They retrieve similar motions by sketching the initial and closing postures of a Kung-fu motion and the trajectories of specific moving joints. However, their sketch is a rather complex robot style, which is difficult for non-professional users to draw, and the joint locations must be labeled manually. Choi et al. [24] suggest a retrieval method based on hand-drawn stick figures: they convert the 3D motions into 2D stick figures on a selected plane, and the user retrieves motions by drawing stick-figure sketches after reviewing the generated stick figure database. However, the view plane chosen by the user may not be the one their algorithm selects, and their retrieval algorithm, which searches the whole database, takes too much time when the database is very large. In our work, the 3D postures are projected onto multiple planes, and the motion index helps speed up the retrieval process. In addition, some studies [25][26][27] use sketches to define the trajectory constraints along which the character or a specific joint moves. However, the user still has to provide some 3D query postures.

OVERVIEW
Our retrieval method can be divided into two stages: the preprocessing stage and the run-time stage. Figure 1 shows the framework of our method. The preprocessing stage has two tasks. First, we construct a k-d tree index for the whole motion database based on key posture extraction and the LD feature. Then, we build a motion graph by clustering the postures of all the motions.
In the run-time stage, the user inputs several sketched postures; the joints are then labeled automatically, and the limb direction feature is extracted for each posture. Next, we apply the posture-by-posture retrieval algorithm. If a result motion is returned directly, which means that a motion in the database is the user's target, the retrieval is over. Otherwise, the system traverses the motion graph to find a shortest transition path and generates a new consecutive motion.

SKETCHING INTERFACE
The sketching interface allows the user to define a desired motion by drawing several simple strokes. The user is required to obey a few rules, which are easy to learn and should not be an obstacle for novices. When the user finishes a posture sketch, the joints are labeled automatically.

User Rules
View plane: Generally speaking, people draw key postures from a specific view plane when they want to express a certain motion, and the view planes selected by different users are not always the same. The projection of a 3D motion changes considerably when the view plane changes. Generally, the view tilt and elevation angles are set to zero, and an azimuth angle that makes the motion easy to distinguish is selected. For example, when viewing a running motion, the side view is a better choice because it shows the moving direction clearly. For a hand-waving motion, the front view is chosen most often.
Key postures: Drawing a few key postures is an effective and succinct way to express a desired motion. The user is required to select the postures at which the motion changes direction or stops [24]. For instance, the key postures of a boxing motion are the moments at which the fist reaches out and at which it is drawn back. In addition, the user is required to select a representative posture as the first sketch, which benefits our retrieval algorithm.
Sketching order: In our system, a sketched posture is limited to five parts: the torso, two upper limbs, and two lower limbs. The head is included in the torso. To make the automatic joint labeling easier, the user is required to start a posture by drawing the torso, followed in order by the right upper limb, the left upper limb, the right lower limb, and the left lower limb.
Posture rotation: In our retrieval system, the same pose performed standing and lying down is treated as two different postures. The user should respect this when drawing the input postures.
These rules are very simple for a professional animator. They are also not an obstacle for ordinary users after some training.

Automatic Joint Labeling
In this paper, the joints considered are the head, neck, elbows, hands, spine, pelvis, knees, and feet (12 in total). Figure 2 shows an example of a sketched posture. The joints are labeled automatically as the user draws. In addition, spline interpolation and smoothing are applied to the user's strokes. The first stroke is regarded as the torso, and its two endpoints are labeled as the head and the pelvis, respectively. When the two upper limbs are finished, the intersection of the torso with the straight line connecting the starting points of the two upper limbs is labeled as the neck. The two end points of the upper limbs are the hands. The midpoint between the neck and the pelvis is the spine. Next, the endpoints of the two lower limbs that are farther from the pelvis are labeled as the feet. Because the elbow and knee are highly flexible and sketching styles differ between users, we cannot simply label the midpoint of a limb as the elbow or knee as in Choi's work [24]. Instead, we detect the corner point of the limb and label it as the elbow or knee. If no corner point is detected, the midpoint is selected.
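The elbow and knee rule can be illustrated with a minimal sketch. The text does not say which corner detector is used, so the version below makes several assumptions: a limb stroke is a polyline of 2D points, the "corner" is the interior point with the largest turning angle, and the threshold `min_turn_deg` and both function names are hypothetical. A smoothed hand-drawn stroke would likely need resampling before per-point turning angles become meaningful.

```python
import math

def turning_angle(a, b, c):
    """Angle in degrees between segments a->b and b->c (0 means straight)."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.degrees(math.acos(cos_t))

def label_mid_joint(stroke, min_turn_deg=30.0):
    """Return the index of the stroke point labeled as elbow or knee.

    Assumption: the corner is the sharpest interior turn above
    min_turn_deg; otherwise the middle sample of the stroke is used.
    """
    best_idx, best_angle = None, min_turn_deg
    for i in range(1, len(stroke) - 1):
        angle = turning_angle(stroke[i - 1], stroke[i], stroke[i + 1])
        if angle > best_angle:
            best_idx, best_angle = i, angle
    if best_idx is None:
        return len(stroke) // 2  # fallback: middle sample, not arc-length midpoint
    return best_idx

bent_arm = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]
straight_arm = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
elbow_bent = label_mid_joint(bent_arm)
elbow_straight = label_mid_joint(straight_arm)
```

In this toy input the bent stroke yields the point at the right-angle turn, while the straight stroke falls back to its middle sample.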

MOTION DATA INDEX
A human motion database contains a very large number of motions, so comparing the input sketch with the projected postures one by one is time-consuming and does not meet the efficiency requirements of many applications. In this paper, we propose an effective method to index the motions in the database. It consists of four procedures.

Key Posture Extraction
An original motion in the database includes a large number of postures, and adjacent postures in a motion are usually similar to each other. Therefore, a few key postures can be selected to represent the motion, just as users express their desired motion by sketching several key postures.
Qi et al. [13] propose an approach based on k-means clustering: the motion postures are clustered into several classes, and all the class center postures are selected as the key postures. However, the number of classes in k-means must be set manually in their method. Peng et al. [28] introduce another cluster-based method for motion data preprocessing. Inspired by their work, we propose a cluster-based and fully automatic key posture extraction approach.
At the beginning of clustering for a motion, we initialize a new cluster and set the first posture as its center posture. For each subsequent posture, the distance between the new posture and the center posture of the last cluster is computed. If the distance is greater than a user-specified threshold, a new cluster is created and the current posture becomes its center posture. Otherwise, we continue to the next posture. Finally, the center postures of all the clusters are selected as key postures. Two examples presented in Figure 3 show the validity of our method. The distance function follows Lee et al. [25]. Their formula is used to calculate the cost of transitioning from one posture to another; we remove its spatial distance terms, because only geometric similarity is considered in our work and the spatial position has no effect on our posture similarity. It is computed as

$$D(i, j) = \sum_{k=1}^{n} w_k \left\| \log\left(q_{j,k}^{-1} q_{i,k}\right) \right\|^2 \qquad (1)$$

where $q_{i,k} \in S^3$ is the orientation of joint k with respect to its parent in posture i, and joint angle differences are summed over n rotational joints. $\log\left(q_a^{-1} q_b\right)$ is a vector $V$ such that a rotation of $2\|V\|$ about the axis $V / \|V\|$ takes a body from orientation $q_a$ to orientation $q_b$. The weight $w_k$ is set to one for the 12 selected joints and to zero for the remaining joints such as the ankles, toes, and wrists, which have less effect on obvious differences between two postures.
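The sequential clustering described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: a posture is a list of unit quaternions `(w, x, y, z)`, one per joint; the distance is a weighted sum of squared quaternion log norms in the spirit of Lee et al.; and all function names, the toy motion, and the threshold value are hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def log_norm(q):
    """||log(q)|| of a unit quaternion, i.e. half of its rotation angle."""
    w, x, y, z = q
    # abs(w) treats q and -q as the same orientation (assumption of this sketch).
    return math.atan2(math.sqrt(x * x + y * y + z * z), abs(w))

def posture_distance(pi, pj, weights):
    """Weighted sum over joints of ||log(qj^-1 * qi)||^2."""
    total = 0.0
    for qi, qj, w in zip(pi, pj, weights):
        qj_inv = (qj[0], -qj[1], -qj[2], -qj[3])  # inverse of a unit quaternion
        total += w * log_norm(quat_mul(qj_inv, qi)) ** 2
    return total

def extract_key_postures(postures, weights, threshold):
    """Return indices of cluster-center postures (the key postures)."""
    if not postures:
        return []
    keys = [0]
    center = postures[0]
    for idx in range(1, len(postures)):
        # Compare only with the center of the most recent cluster.
        if posture_distance(postures[idx], center, weights) > threshold:
            keys.append(idx)
            center = postures[idx]
    return keys

def rot_z(deg):
    """Unit quaternion for a rotation of `deg` degrees about the z axis."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

# Toy motion: a single joint turning about z; threshold ~ 30 degrees of rotation.
motion = [[rot_z(a)] for a in (0, 10, 20, 50, 55, 100)]
keys = extract_key_postures(motion, weights=[1.0], threshold=math.radians(15) ** 2)
```

Because each new posture is compared only with the most recent center, one pass over the motion suffices and the number of clusters does not have to be chosen in advance.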

Posture Projection
To compare a 3D posture with a 2D sketched posture, the 3D posture is projected onto a 2D plane. Prior work [24] tries to find the best projection plane, namely the one that maximizes the standard deviation of the projected 2D points and therefore gives the widest image of the projected posture. However, for motions such as walking or running, the user tends to sketch a side view, which may not give the widest 2D image. In this paper, we set the view directions parallel to the XOZ plane and apply orthographic projection. Figure 4 shows an example of posture projection.
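A minimal sketch of such a projection follows. It assumes that Y is the vertical axis (so that view directions lie in the XOZ plane), that the views are spaced evenly in azimuth, and that eight views are used; the number and spacing of views and the function names are hypothetical, since the text only states that several planes are used.

```python
import math

def project_posture(joints3d, azimuth_deg):
    """Orthographic projection for a view direction lying in the XOZ plane.

    The image x axis is the horizontal direction perpendicular to the
    view direction; the image y axis is the world Y (up) axis.
    """
    a = math.radians(azimuth_deg)
    right = (math.cos(a), 0.0, -math.sin(a))
    return [(x * right[0] + z * right[2], y) for (x, y, z) in joints3d]

def project_all_views(joints3d, num_views=8):
    """Project one 3D posture onto num_views evenly spaced view planes."""
    step = 360.0 / num_views
    return {k * step: project_posture(joints3d, k * step) for k in range(num_views)}

posture = [(1.0, 2.0, 3.0), (0.0, 1.0, 0.0)]
front = project_posture(posture, 0.0)
side = project_posture(posture, 90.0)
views = project_all_views(posture)
```

Each 3D key posture thus produces several projected postures, which is why the index described later has to handle many more 2D postures than there are 3D key postures.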

Limb Direction Feature of Posture
In the retrieval process, features are extracted from the input sketch and from the projected posture, and then the distance between them is computed. The feature has to describe the similarity between the user's sketch and the projected posture accurately while remaining insensitive to variation in the user's input. Each user has a different sketching style, and body proportions are hard to keep consistent even for the same posture. Therefore, information such as limb length cannot be used as an effective feature for posture similarity comparison (as illustrated in Figure 2, a limb is defined as the body part that links two joints). In this paper, the direction of each limb is used as the 2D posture feature. Limb direction expresses a posture intuitively and accurately, and it can be computed efficiently. The LD feature is calculated as

$$LD = \left(\vec{ld}_1, \vec{ld}_2, \ldots, \vec{ld}_n\right) \qquad (2)$$

$$\vec{ld}_i = (x_i, y_i) = \frac{\left(q_i^x - p_i^x,\; q_i^y - p_i^y\right)}{\sqrt{\left(q_i^x - p_i^x\right)^2 + \left(q_i^y - p_i^y\right)^2}} \qquad (3)$$

where $\vec{ld}_i$ is the normalized direction vector corresponding to the i-th limb, n is the number of limbs, $x_i$ and $y_i$ are the elements of $\vec{ld}_i$, $p_i$ and $q_i$ are the joints at the endpoints of the i-th limb, and $(p_i^x, p_i^y)$ and $(q_i^x, q_i^y)$ are their 2D coordinates. The limb direction feature is not rotation invariant, which matches the last rule we set for the user in Section 4.1.
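To make the LD feature concrete, here is a small sketch that turns labeled 2D joints into one unit direction vector per limb. The limb list `ARM_LIMBS`, the joint names, and the direction convention (from the first listed joint to the second) are assumptions for illustration; the text does not enumerate the limbs.

```python
import math

# Hypothetical limb list: each limb links two labeled joints (p, q).
ARM_LIMBS = [("neck", "r_elbow"), ("r_elbow", "r_hand")]

def limb_direction_feature(joints2d, limbs):
    """Return one normalized 2D direction vector per limb."""
    feature = []
    for p_name, q_name in limbs:
        px, py = joints2d[p_name]
        qx, qy = joints2d[q_name]
        dx, dy = qx - px, qy - py
        length = math.hypot(dx, dy)
        if length == 0.0:
            feature.append((0.0, 0.0))  # degenerate limb: no direction
        else:
            feature.append((dx / length, dy / length))
    return feature

sketch = {"neck": (0.0, 0.0), "r_elbow": (3.0, 4.0), "r_hand": (3.0, 9.0)}
scaled = {name: (3.0 * x, 3.0 * y) for name, (x, y) in sketch.items()}
ld_sketch = limb_direction_feature(sketch, ARM_LIMBS)
ld_scaled = limb_direction_feature(scaled, ARM_LIMBS)
```

Scaling the whole figure leaves the feature unchanged, which reflects the stated motivation that body proportions vary between users, whereas rotating the figure would change it.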

Feature Index with K-D Tree
In the motion database, different motions share many similar segments. For example, a high jump and a broad jump both begin with the same running segment. Consequently, key posture extraction over all the motions in the database yields a large number of key postures, many of which are similar. The feature vectors of these similar key postures lie close together in the feature space. For efficient retrieval, we use a k-d tree to construct a motion index. Traditionally, one dimension of the feature vector is selected to partition the feature space in each iteration of k-d tree construction. In this paper, a two-dimensional LD vector $\vec{ld}_i$ is chosen as the partition basis: in each iteration, the limb whose directions have the maximal variance is chosen. The variance of the i-th LDs is calculated from the cosine distance $\cos(\cdot,\cdot)$ between two vectors. Assume that the i-th limb is chosen as the partition basis in the current iteration and n = 5. As shown in Figure 5, the i-th limb in the fourth posture, which is closest to $\vec{ld}^{\,mean}_i$, is chosen as the current partition node. The search space is divided equally into two subspaces by the dotted line.
Because hand-drawn sketches deviate from the exact posture, the LD of a sketch and that of a projected posture cannot be matched exactly, so some modifications are made to the k-d tree search algorithm. Assume that we arrive at the partition node in Figure 5 when searching the k-d tree. If
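The choice of partition limb and partition node can be sketched as follows. The exact variance formula is not available here, so this example assumes a stand-in spread measure, the average of one minus the cosine between each direction and the normalized mean direction. The function names and the toy features are likewise hypothetical, and the split into two subspaces is not reproduced.

```python
import math

def cos_sim(u, v):
    """Cosine between two unit 2D vectors (their dot product)."""
    return u[0] * v[0] + u[1] * v[1]

def mean_direction(vectors):
    """Normalized sum of unit vectors; (0, 0) if they cancel out."""
    sx = sum(v[0] for v in vectors)
    sy = sum(v[1] for v in vectors)
    norm = math.hypot(sx, sy)
    return (sx / norm, sy / norm) if norm > 0.0 else (0.0, 0.0)

def limb_spread(features, i):
    """Assumed spread measure for limb i: mean of (1 - cosine to the mean)."""
    dirs = [f[i] for f in features]
    m = mean_direction(dirs)
    return sum(1.0 - cos_sim(d, m) for d in dirs) / len(dirs)

def choose_partition(features):
    """Pick the limb with the largest spread and the posture nearest its mean."""
    n_limbs = len(features[0])
    limb = max(range(n_limbs), key=lambda i: limb_spread(features, i))
    m = mean_direction([f[limb] for f in features])
    node = max(range(len(features)), key=lambda j: cos_sim(features[j][limb], m))
    return limb, node

d45 = (math.cos(math.radians(45)), math.sin(math.radians(45)))
features = [
    [(1.0, 0.0), (1.0, 0.0)],   # posture 0
    [(1.0, 0.0), (0.0, 1.0)],   # posture 1
    [(1.0, 0.0), d45],          # posture 2
]
limb, node = choose_partition(features)
```

In this toy data only the second limb varies, so it becomes the partition basis, and the posture whose second limb points along the mean direction becomes the partition node.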

RETRIEVAL
Given an input sketched posture S and a projected 2D posture P, the similarity between them is calculated from $\cos\left(\vec{ld}^S_i, \vec{ld}^P_i\right)$, the cosine distance between the i-th limbs in the sketched posture and the projected posture, over all n limbs. $\vec{ld}^S_i$ and $\vec{ld}^P_i$ are computed as in Formula 3, and n is the number of limbs.
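A limb-wise comparison of this kind might look like the sketch below. How the per-limb cosines are combined is an assumption: the example averages them over the n limbs, so identical postures score 1. The brute-force ranking is only for illustration and bypasses the k-d tree index; `posture_similarity`, `best_matches`, and the candidate names are hypothetical.

```python
def posture_similarity(ld_s, ld_p):
    """Mean cosine between corresponding unit limb directions (assumed form)."""
    total = 0.0
    for (sx, sy), (px, py) in zip(ld_s, ld_p):
        total += sx * px + sy * py  # cosine of two unit vectors
    return total / len(ld_s)

def best_matches(sketch_ld, candidates, top=4):
    """Rank candidate projected postures by similarity, best first."""
    ranked = sorted(candidates,
                    key=lambda name: posture_similarity(sketch_ld, candidates[name]),
                    reverse=True)
    return ranked[:top]

sketch_ld = [(1.0, 0.0), (0.0, 1.0)]
candidates = {
    "same": [(1.0, 0.0), (0.0, 1.0)],
    "half": [(1.0, 0.0), (1.0, 0.0)],
    "opposite": [(-1.0, 0.0), (0.0, -1.0)],
}
ranking = best_matches(sketch_ld, candidates, top=2)
```

Returning a short ranked list rather than a single hit mirrors the interface described earlier, in which the user picks the best match from a group of candidate postures.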
For a sequence of input sketched postures, what the user desires is a consecutive motion in which the similar postures appear in the same order as the input postures. Given n sketched postures $\{S_1, S_2, \ldots, S_n\}$, our posture-by-posture retrieval algorithm is presented in Algorithm 1.
In Algorithm 1, we use a pre-computed directed motion graph [7], similar to that in Peng's work [28], to splice motion segments into a consecutive motion.
A clustering method for key posture extraction is proposed in Section 5.1. Here, we modify it to cluster all the postures in the database. Namely, for each incoming posture, we compute the distance between the new posture and the center postures of all existing clusters and decide which cluster it belongs to. Each motion graph node represents a cluster in the motion database. If two clusters are adjacent in a motion file, an edge between the corresponding graph nodes is added. As illustrated in Figure 6, three motions are clustered into five classes, and a directed motion graph with five nodes is created. A transition path 'C-A-D' is found when the user requires a motion from node C to node D.
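The graph construction and the path search can be sketched as follows. The example assumes that each motion has already been reduced to its sequence of cluster labels and that edges are unweighted, so breadth-first search returns a path with the fewest transitions; the actual edge costs are not specified in the text. The toy label sequences are made up and do not reproduce Figure 6.

```python
from collections import deque

def build_motion_graph(motions):
    """Build a directed graph from per-motion sequences of cluster labels."""
    graph = {}
    for labels in motions:
        for label in labels:
            graph.setdefault(label, set())
        for a, b in zip(labels, labels[1:]):
            if a != b:  # consecutive frames in the same cluster add no edge
                graph[a].add(b)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Toy data: three motions over five clusters.
motions = [["C", "C", "A", "B"], ["A", "D"], ["E", "C"]]
graph = build_motion_graph(motions)
path = shortest_path(graph, "C", "D")
```

The returned node sequence only identifies which clusters to pass through; blending the corresponding frames into a smooth transition is a separate step that this sketch does not cover.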
In Step 6 of Algorithm 1, assume that U contains k groups. After selecting one motion segment from each group, k motion segments $MS_1, MS_2, \ldots, MS_k$ are acquired. We have to find a transition path that passes from $MS_1$ to $MS_k$ in order by traversing the motion graph. For any two neighbouring result segments $MS_i$ and $MS_{i+1}$, we first locate the two nodes to which the last posture of $MS_i$ and the first posture of $MS_{i+1}$ belong. Then, we find the shortest path between the two nodes in the motion graph. After that, the path from $MS_i$ to $MS_{i+1}$ is obtained, from which a continuous motion can be synthesized [28].

Algorithm 1 (posture-by-posture retrieval)
...
where $SP_i$ is the posture in a motion corresponding to the sketch $S_i$.
5: If k < n, set i = k + 1 and go to Step 2.
6: If U contains only one group, the motion segments in the group are returned as the retrieval results. Otherwise, we select any one motion segment from each group and splice them into a consecutive motion by the motion graph, which is introduced in Section 6. All the synthetic motions are returned as final results.

EXPERIMENTAL RESULT
Our experiments are performed on a desktop computer with a 3.10 GHz Intel Core i5 CPU, 4 GB RAM, and the Windows 7 operating system. To demonstrate the efficiency of our approach, we experiment on the CMU motion database [29], which contains 2434 motion data files. From these, 222 584 projected postures are extracted, and a motion graph with 40 137 nodes is constructed.
A user study is conducted to show the effectiveness of our method. Ten participants, who are students and animators, take part in our experiments. They receive a few minutes of training before the user study. First, we evaluate the accuracy of single posture retrieval. The participant sketches a posture, and four similar candidate 3D postures are returned. If one of the results is accepted by the participant, we treat it as a successful retrieval. Each participant does this ten times with a different input posture each time. As shown in Table I, the accuracy of single posture retrieval is 93%. Some retrieval results are shown in Figure 7. Failed retrievals, such as the third example in Figure 7, occur mainly because the participant's sketched posture does not exist in the database. The average time to retrieve a single pose is about 1.5 s. Without the k-d tree index, the average time for single pose retrieval is about 4.1 s. Next, we evaluate the accuracy of motion sequence retrieval. Ten motions are picked randomly from the database, and the participants retrieve them by sketching several key postures. If the candidate motion sequence results contain the target motion, we treat it as a successful retrieval. Each given motion is displayed before the participants start to retrieve it. This experiment is repeated ten times. Table II(a) shows the results. About three key posture inputs are needed to retrieve a motion sequence, and about 87 s, including the sketching time, are spent retrieving a motion. Once the participant finishes drawing the key postures, our system returns the results in about 2.5 s. In addition, we ask each participant to retrieve ten arbitrary motions of their choice, which may not exist in the database. The results are evaluated by the participants in the same way as in the single posture retrieval experiment and are shown in Table II(b).
Figure 8 presents some examples of motion sequence retrieval, and the last one is synthesized by motion graph. Choi's [24] method is also evaluated in this phase. We gain the approximate

CONCLUSION
In this paper, we present a human motion retrieval framework that uses hand-drawn key posture sketches as the query input. After cluster-based key posture extraction and posture projection, a novel k-d tree index structure based on the LD feature is constructed to index the motions in the database, which is one of the main contributions of this work. In addition, an effective retrieval algorithm is proposed to obtain a consecutive motion result from a large motion database, which is another contribution. Our system can retrieve not only motions that exist in the database but also combined motions spliced from existing motions. However, our retrieval framework still has some limitations. First, a few complicated human motions, such as martial arts motions and human interaction motions, are difficult for non-professional users to sketch. Second, the posture-by-posture retrieval algorithm cannot handle the following situation well. The user sketches three key postures, and the third one has a large deviation. The algorithm may then find two motion segments, one containing the first two key postures and the other containing the third key posture, and splice them into a motion. But
Table notes: $P_i$ is the i-th participant. 1: The average number of input sketching postures. 2: The average time in seconds, including sketching time, for retrieving one motion.