Review of the techniques used in motor-cognitive human-robot skill transfer

A conventional robot programming method extensively limits the reusability of skills from a developmental perspective. Engineers programme a robot in a targeted manner to realise predefined skills. The low reusability of general-purpose robot skills is mainly reflected in their inability to handle novel and complex scenarios. Skill transfer aims to transfer human skills to general-purpose manipulators or mobile robots to replicate human-like behaviours. Commonly used skill transfer methods, such as learning from demonstration (LfD) or imitation learning, endow the robot with the expert's low-level motor and high-level decision-making abilities, so that skills can be reproduced and generalised according to the perceived context. The improvement of robot cognition usually relates to an improvement in autonomous high-level decision-making ability. Based on the idea of establishing a generic or specialised robot skill library, robots are expected to autonomously reason about the need to use skills and to plan compound movements according to sensory input. In recent years, many successful studies in this area have demonstrated their effectiveness. Herein, a detailed review is provided of skill transfer techniques, applications, advancements, and limitations, especially in LfD. Future research directions are also suggested.


| INTRODUCTION
Much like the robots discussed here in appearance, the world's first robot prototype, Unimate, was officially born in 1959. Only two years later, Unimate 1900 series robots were used as the first batch of automated mass production tools in factories. Since then, with a vision for the future, humans have embarked on a long journey, at a surprisingly fast speed, to explore the potential of robots. Unquestionably, today, 60 years later, the boom in the robotics market is clear, behind which has been the active development of robotics technologies and artificial intelligence. So far, robots with skill learning have generally appeared in industrial manufacturing [1,2], logistics [3], field robotics [4], surgery [5] and other fields. In addition to these common areas, robots are also expanding into nursing [6], human-robot collaboration [7], autonomous vehicles [8] and transfer learning [9] (i.e. sharing knowledge between agents with similar but discrepant mechanisms). It is not difficult to see that robots are continuously finding new ways to impact people's lives, and robotics technologies have gradually begun to develop towards lightweight and collaborative designs, which could potentially allow more people to benefit from the safety and efficiency brought about by robots in the future.
In manufacturing, robotic surgery and various other robot applications, traditional robot programming is oriented towards specific skills, and engineers focus on the realisation of each task by hard-coding everything that needs to be executed. A traditional robot performing the task of 'bottle-to-cup water pouring' needs to attend to factors such as the position of the bottle/cup mouth and path planning. However, the task requirements in real scenes are far from simple. When encountering challenges like obstacles, moving objects and environmental changes, robots struggle to adapt. These pre-programmed skills take engineers a long time to develop, and they may not be compatible with different task environments, as such demands are beyond their capacity [11]. Once the developed system starts to be used, the robot's performance will not be enhanced any further, and any improvement will depend entirely on the engineer's knowledge of a specific task. In order to solve these problems, modularised skills, improved adaptation ability, and a higher robot cognitive level are necessary. Moreover, in an era where robots are expected to perform a variety of personalised tasks, developing a useful skill transfer framework capable of endowing a robot with human-like motor and reasoning skills is an inevitable direction.
The term learning from demonstration (LfD) or programming by demonstration (PbD) is sometimes also referred to as imitation learning or apprenticeship learning. Whatever it is called, it is an efficacious technique for transferring skills and reducing the complexity of the search space for a robot trying to learn a skill [12]. Research into LfD started about 40 years ago; as its name suggests, LfD allows a robot to learn a skill by analysing the skill performed by the demonstrator and then imitating it. This transforms the tiresome and complicated (sometimes even unmanageable) skill pre-programming process into a demonstrating (i.e. teaching) process, which can be carried out by laymen (i.e. people who do not have programming skills), and resembles the methods of human-human interaction and social learning. LfD can be categorised as a supervised learning approach, assuming the experts' demonstrations are (sub-)optimal. The observed demonstrations direct the robot policy search to a locally optimal area where refinement can be further applied.
Overall, these kinds of technologies that replace traditional programming have a broad range of applications; for example, remote underwater vehicle operation [13], reusable surgical skills [14], autonomous service robotics [15] and physical collaborative tasks [16]. Most importantly, LfD is ideal for highly customised/personalised tasks, as it reduces the time and labour cost of re-programming. A typical LfD human-robot skill transfer method involves three main steps: (1) demonstration, (2) modelling and learning, and (3) skill extraction and reproduction. Herein, readers are guided through a recap of the recent outcomes that have made breakthroughs in these three steps.
In [17], four primary issues were summarised that deal with imitation learning/LfD problems. Two of the issues raised, 'what to imitate' and 'how to imitate', are of general interest to robotics researchers, and many solutions have been proposed and tested. When imitators learn a skill, a good learner knows clearly, at a high level, what they want to imitate. The world we observe is composed of countless pieces of low-level and high-level information, and only a small portion of this information is essential for imitation. For example, when imitating table-wiping skills, agents need to pay attention to hand position and pressing force, while speed information is usually irrelevant. The fundamental reason a human can continuously improve his/her skill performance, based on observing his/her own skill completion, is an understanding of performance and a corresponding systematic evaluation metric. The question of 'how to imitate' then explores how to maximise that metric. This implies that imitators need sufficiently high-level cognitive abilities to solve these problems.
The term 'cognition' refers to all psychological skills that support learning, understanding, reasoning, and perception. Decision-making is a higher-level cognitive process, and is also a consideration focused on here. Artificial intelligence techniques such as machine learning (ML), deep learning (DL) and other data-driven technologies can achieve reasoning, planning, learning, decision-making and other functions to a certain extent [18]. After Gibson's theory of affordance [19], research on affordance learning also began. Affordance explains the mapping between objects, their affordable behaviours, and the effects of those behaviours, enabling skill transfer and learning from a unique viewpoint [20]. Robots in the new era require more personalised skills and a more user-friendly programming interface. So far, most research into LfD has been limited to specific scenarios and skills. Owing to the inadequate cognitive/decision-making capabilities of robots, most applications are not suitable for frequently changing task scenarios because they cannot reduce the reprogramming load well (i.e. reprogramming is needed when the task is changed). In some current application scenarios, robots need to possess a great variety of skills and be able to evolve novel ones by reorganising existing skills, rather than acquiring them from a human teacher each time. In addition to the realisation of learning individual (motor) skills, the text herein will also motivate readers to review and understand research into cognitive skill transfer and into LfD dealing with challenges in complex task scenarios, high-level decision-making and action-sequence planning.
A detailed overview is provided of the critical techniques, advancements, problems, and developmental trends of human-robot skill transfer (in particular, LfD) from both high-level and low-level perspectives. Section 3 introduces works in LfD that realise low-level individual (motor) skills. Section 4 focuses on the high-level aspects, introducing cognition/decision-making and complex compound skills. Section 5 distils the current works and their shortcomings, putting forward constructive suggestions for future developments. Finally, Section 6 provides a summary.

| PRELIMINARIES
As one of the most comprehensive topics in robotics, human-robot skill transfer involves a great deal of work, some of which is solved and some of which remains unsolved. Figure 1 depicts the main issues that have long attracted researchers' attention. The big picture can be understood from a few perspectives, as shown below.
• Demonstration: When tackling skill transfer problems, the first thing to specify is the formation of a skill. Depending on the choice of modality, different sensors may be applied as parts of the demonstration interface. Expert (or failure) demonstrations are captured through three main kinds of interface. Each interface has its pros and cons, which are discussed later.
• Learning low-level motor primitives: Temporal alignment, motion trajectory segmentation, and recognition may be prerequisites for learning individual sub-tasks, as people usually demonstrate a complex sequence of skill primitives. Two main types of primitive representation approach can now be found: the dynamical system-based approach and the probabilistic approach. Each model usually has its own dedicated learning techniques that have been proven to work effectively. Sometimes, reinforcement learning can be added to model learning, as it refines a skill so that the robot can perform it with a higher success rate.
• High-level symbolic reasoning: Being able to perform high-level reasoning is an improvement to robot cognition. From that, robots automatically recognise skill primitives and give them labels. The task goal is usually predefined, as this reduces system complexity. However, modern industrial robots are expected to identify the task goal by themselves, so that conventional robot programming can truly be replaced with advanced autonomous methods. Some techniques for primitive re-sequencing and motion planning already exist; however, learning the task planning skill with object affordance seems more intuitive, since it allows planning from high level to low level.
• Skill reproduction and generalisation: Techniques employed for regenerating skills can differ, depending on the model used in the primitive learning stage. Perception plays a crucial role when generalisation is necessary. Feedback terms may be added to the motion planning module in order to achieve online adaptation. Active perception would certainly improve perception accuracy; however, computational modelling of active perception skills is only at a preliminary research stage.
• Incremental learning: To further refine skill performance, correction is a means of providing additional information for robots to learn a user preference based on inverse reinforcement learning. Allowing emulation is an important user preference; emulation differs from imitation in being effect-driven rather than motion-driven. Again, affordance would facilitate the emulation process, which merits researchers' attention.

| Overview of LfD
LfD is a skill-learning process for a robot that does not require people to do hands-on, hard-coded programming during learning. Instead, it provides other, user-friendly interfaces for demonstrating the skill, which makes the interaction more intuitive, as shown in Figure 2. Performing demonstrations requires an interface (i.e. a medium); there are three main categories of interfaces for demonstrating a skill. Methods such as kinesthetic teaching and observational learning are commonly used, as they are fundamental to routine human social learning. Methods like teleoperation, which are not applicable in skill transfer between humans, work efficiently between humans and robots. It is worth mentioning that some methods, like trial-and-error learning [21], are not within the scope of 'demonstrating', and hence cannot be classified as LfD.
Kinesthetic teaching is an approach in which the teacher directly interacts with the robot via physical contact and manipulates each degree of freedom to complete a task, for example, holding a robot arm to open a door [22]. This kind of direct-contact interaction retains the teacher's haptic perception while demonstrating, but the performance level depends hugely on the teacher's operating level. In other words, for a remarkably complex behaviour (such as manipulating a humanoid robotic arm with redundancy [23]), it is difficult for the teacher to take into account the coordinated movement of all degrees of freedom, which reduces the expertise level of the skill performance to a certain extent. Demonstrators often just hold the end-effector of the manipulator to roam in its workspace and feel that their motion is constrained, which clearly shows that the method is inadequate for highly dynamic tasks.
The observational learning approach primarily exploits computer vision systems, or systems involving other modalities, which map the pose of humans to robots by estimating human joint states and computing them with the corresponding kinematic algorithms. In this process, a binocular camera with a depth sensor [24,25] and markers [26] can be used. Other motion-recording systems, such as gyroscopes and accelerometers, can also substitute for vision. Through observational learning, the pose of the entire human arm is considered, not just the end-effector (i.e. the hand), which maximises the dexterity of robot skills. However, the drawback of this method is that no interactive force feedback can be obtained.
The teleoperation method requires the teacher to control a leader device (e.g. a manipulator, joystick, exoskeleton or haptic device) [27], and thereby control the movement of the follower mechanism. Compared to kinesthetic teaching, this method has to use kinematics to map the configurations at both ends. Moreover, not all teleoperated systems are equipped with force/tactile feedback [28], so the teacher loses haptic perception while manipulating, which degrades the performance of the skill; even if the system is based on a haptic device, time delays can lead to resonance, shaking and other problems [29].
The correspondence problem [30] describes the effect of discrepancies in the embodiment (i.e. nature and working mechanism) of humans and robots on accomplishing the same skill. Different agents utilise different perceptual systems and physical mechanisms to interact with the environment or with other agents, which makes their sensory and motor capabilities different. From a technical aspect, for example, a robot might use depth-imaging sensors [31] as its visual sensor, which do not capture as much detail as human eyes do, and a 3 degree-of-freedom (DoF) robot may not be suitable for a 6-DoF skill that is easy for humans. To successfully evaluate a learnt skill, it is necessary to ensure that both agents share the same, or at least similar, visual and motor capabilities [32]. However, the capability requirements for learning skills are task-oriented. Because it benefits from direct contact with the robot, kinesthetic teaching can simplify the correspondence problem, as humans adapt their skill based on their own experience and cognitive skill.

| What is a skill?
The composition of skills mainly involves the problem of modalities, or 'what to imitate'. The development of robotic skills usually progresses with demand. Robots were initially used in industrial production, and the task goal was principally to control the pose (i.e. position and orientation) of the robot. Later, pure position control could not meet the needs of force-related and contact-rich task scenarios, such as mopping and polishing, so researchers began to consider encoding force information into the skill [33]. Later still, some used bioelectric signals such as surface electromyography (sEMG) to match the human muscle activation level to the robot impedance, thereby enhancing the adaptability of the robot [27,34].
In LfD, the motor skills of robots are usually subdivided into various skill primitives. When performing complex tasks, the robot needs to find the most suitable primitives in the existing skill library and arrange them in the correct order. In addition, there are many uncertain factors in the environment, and it is particularly necessary to perceive and select appropriate sensory information, which greatly reflects the robot's symbolic reasoning ability. Does the robot need to perceive and analyse all modalities in this high-level decision-making process? If not, how can humans quickly extract the key information needed to complete the task from sensory information? Can such cognitive skills be transferred to robots technically? Problems like these have not yet been solved well.
Lopes et al. [35] proposed a computational model of social learning. The article points out the essential difference between imitation and emulation, and suggests using the 'strategy weighting triangle' to explain the 'what to imitate' problem. In their theory, imitation refers to fully understanding and copying the demonstrator's behaviour and intentions, while emulation refers to understanding the effect of the demonstrator's behaviour and then achieving this effect with a more open choice of action (i.e. observed actions or never-before-seen actions). An example of emulation: the demonstrator grabs an object from the desktop and puts it down at another position, while the robot (i.e. the emulator) realises that the goal is to change the location of the object, so it pushes the object to the goal position instead of grasping it. The remaining baseline preference is to minimise the energy consumed by the behaviour. By trading off between the three, learnt skills may be better evaluated. Figure 3 is a visualisation of the theory of Lopes et al.: non-social behaviour, performed in a passive role, intends to minimise energy consumption; imitation and emulation, the two common imitative behaviours, are the motion/action copier and the goal/final-effect copier, respectively; and points located in the middle of the triangle indicate the nature of the imitator. From the above points of view, skills comprise low-level motor primitives and high-level cognition/symbolic reasoning, although it is very challenging to model and integrate high-level reasoning into robot skills.

FIGURE 2 Human-robot skill transfer [10]: (a) learning via teleoperation; (b) kinesthetic teaching

| Represent and learn individual motor skills
In order to facilitate computational modelling and reusability, motor skills are often modelled as various primitives at the trajectory level. The article [36] summarises and introduces algorithms for implementing skill learning at the trajectory level. A survey [37] broadly summarised the different technologies (including LfD and other techniques) used in robot skill transfer learning and lists applications in different scenarios. Recently, some new review articles on LfD have covered the advances using multiple categorisations [38] and applications in the robotic assembly domain [39]. Meanwhile, readers can also obtain information from some classic LfD review articles, such as [40][41][42].
Simple recording-and-replay software would satisfy behavioural cloning, which is a brute-force method. However, skill synthesis and encoding allow generalisation to different situations. This kind of skill modelling technique seeks an abstract and generic expression of the skill, which usually reduces the total number of feature points of the skill and generalises/modifies it at an abstraction layer. Generally speaking, skills can be described in two broad ways: dynamic system-based approaches and probabilistic approaches.

| Dynamic system (DS)-based approaches
DS-based methods generally seek to encode the dynamic attractor landscape in the state space (i.e. typically position and velocity) of the demonstrations.
Dynamic movement primitive (DMP) is one of the most popular techniques for modelling skills at the trajectory level; it was first officially proposed in 2002 [43] and then updated and described in detail in 2013 [44]. The essence of DMPs is a second-order non-linear dynamic system, which contains a spring and a damper. The original DMPs (i.e. non-periodic ones) are presented as the following set of differential equations in first-order notation:

τ dv/dt = α_v (β_v (g − x) − v) + f(s)  (1)
τ dx/dt = v  (2)
f(s) = (Σ_i ψ_i(s) ω_i / Σ_i ψ_i(s)) · s · (g − x_0)  (3)
τ ds/dt = −α s  (4)

Since DMPs are oriented towards the trajectory, we have the trajectory x and its rate of change v. g is the set point of the system and τ is a time constant related to the movement duration. α_v and β_v are positive constants relating to the spring constant and the damping coefficient of the second-order system. f is a non-linear forcing term driven by a state parameter s; i is the index of a basis function, x_0 is the initial position of the trajectory and α is a constant. The state parameter s, as stated in Equation (4), monotonically decreases to zero as time passes, forming an indicator of task completion. In other words, s → 0 as t → ∞, which makes the forcing term f(s) go to zero when the task is about to complete. While the forcing term is zero, Equations (1) and (2) become a standard spring-damper system that always drives x towards g. The personalised pattern of a trajectory is encoded in the forcing term, as stated in Equation (3), by a set of weighted Gaussian kernel basis functions ψ. The learnt pattern is stored as the weights ω. The scaling term g − x_0 helps to adapt the skill to a new initial position while maintaining the global shape of the trajectory. The high popularity of DMPs confirms some of their advantages; the four main features are discussed below.

FIGURE 3 A computational model describing the relationship between imitation and emulation in social learning [35]
GUAN ET AL.
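To make the dynamics above concrete, the following sketch rolls out a one-dimensional discrete DMP by Euler integration of Equations (1)-(4). It is a minimal illustration: the gains, kernel placement and integration step are illustrative choices, not values from [43,44]. With zero learnt weights the forcing term vanishes and the rollout reduces to the spring-damper system converging to g.

```python
import numpy as np

def dmp_rollout(w, x0, g, tau=1.0, alpha_v=25.0, beta_v=6.25, alpha=4.0,
                dt=0.001, T=1.0):
    """Euler rollout of a 1-D discrete DMP (illustrative constants)."""
    n = len(w)
    c = np.exp(-alpha * np.linspace(0.0, 1.0, n))            # kernel centres in phase space
    h = 1.0 / (np.diff(c, append=c[-1] * 0.9) ** 2 + 1e-8)   # kernel widths
    x, v, s = x0, 0.0, 1.0
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-h * (s - c) ** 2)                       # Gaussian basis activations
        f = (psi @ w) / (psi.sum() + 1e-10) * s * (g - x0)    # forcing term, Eq. (3)
        v += dt / tau * (alpha_v * (beta_v * (g - x) - v) + f)  # Eq. (1)
        x += dt / tau * v                                       # Eq. (2)
        s += dt / tau * (-alpha * s)                            # Eq. (4)
        traj.append(x)
    return np.array(traj)

# Zero weights: pure spring-damper behaviour, converging to the goal g = 1.
path = dmp_rollout(w=np.zeros(10), x0=0.0, g=1.0)
```

Changing τ rescales the duration, and changing g or x_0 moves the endpoints while the weights w preserve the trajectory shape, which is the generalisation property discussed in the text.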
• DMPs can encode not only discrete (non-periodic) skills but also rhythmic (periodic) skills, implemented by replacing the point-attractor system with a limit-cycle dynamic system. Nakanishi et al. [45] successfully realised biped walking locomotion of a humanoid robot with rhythmic DMPs. However, the walking trajectory in the experiment was generated by a state-machine controller, and a human walking demonstration would cause failure. This reflects well that the physical or sensory discrepancy between imitator and demonstrator greatly affects the success rate of skill transfer.
• DMPs can easily be extended to multi-dimensional systems by sharing the same canonical system (Equation 4) and using a different transformation system for each trajectory. In [46], DMPs are modelled separately for each joint of the manipulator by establishing multiple non-linear functions and transformation systems. In [47], the motion trajectory and the force trajectory are encoded into a single skill.
• Inherent generalisability is another advantage of DMPs, and can be implemented straightforwardly. The task duration can be adjusted by changing the temporal scaling factor τ to obtain a more rapid or slower state change, and the start/goal positions can also be adapted while maintaining the overall shape stored in the learnt weights ω [48].
• Because of the spring-damper system, DMPs are very stable and robust, and are resistant to small external disturbances. For large disturbances, such as someone obstructing the robot arm while it is moving, a feedback term can be added to the system to adjust the behaviour online [49].
Despite these advantages in skill learning, DMPs also have certain disadvantages [44]. The system cannot be used when the start and goal positions are too close together. Moreover, there is a mirror effect when generalising the movement to a certain goal. To this end, [50] proposed a modified DMP formulation, as shown in Equation (5), which replaces Equation (1):

τ dv/dt = K (g − x) − D v − K (g − x_0) s + K f(s)  (5)
K and D are the stiffness and damping coefficients, respectively, playing roles analogous to α_v and β_v. The left of Figure 4 intuitively depicts DMPs operating with a canonical system and a virtual spring-damper attached to the start and goal. The canonical system, as a virtual timer, controls the phase s according to the time t, and the virtual spring-damper driven by s produces a virtual force at each instant, attracting the system state towards the equilibrium.
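A single integration step of the modified transformation system of [50], τ dv/dt = K(g − x) − Dv − K(g − x_0)s + Kf(s), can be sketched as follows; the gains K and D are illustrative values and the function name is our own. Note that at the start (x = x_0, s = 1, f = 0) the spring term is exactly cancelled, so there is no initial acceleration jump, one motivation for this reformulation.

```python
def modified_dmp_step(x, v, s, f_s, x0, g, K=150.0, D=25.0, tau=1.0, dt=0.001):
    """One Euler step of the modified DMP transformation system of [50]."""
    v_dot = (K * (g - x) - D * v - K * (g - x0) * s + K * f_s) / tau
    x_dot = v / tau
    return x + dt * x_dot, v + dt * v_dot

# At the start (x = x0, s = 1, f = 0) the spring terms cancel: no acceleration jump.
x1, v1 = modified_dmp_step(x=0.0, v=0.0, s=1.0, f_s=0.0, x0=0.0, g=1.0)
```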
Other works that achieve better performance with DMPs are described below. By adding an external signal to rhythmic DMPs and introducing a set of additional dynamic systems for the temporal scaling factor, a smoothly changing, speed-adjustable rhythmic skill with synchronisation is obtained [52]. Kober et al. [53] extend DMPs by involving external variables in each DoF, so that perceptually coupled motor primitives are obtained.
Based on the insight that humans utilise a small number of motor primitives to generate a large number of motions according to different environmental stimuli (perceptual information or task parameters), gating networks can be used to further extend the generalisability of a skill. A gating network-based model, Mixture of Movement Primitives (MoMP) [54], outputs a weighted sum of old motor primitives to form a new movement. In MoMP, the gating network is triggered by an augmented state associated with meta-parameters (e.g. the hitting position, velocity and orientation of the racket in a table tennis task). To learn the meta-parameters in that table tennis hitting task, two methods can be used: (1) analytically predicting and converting to joint space using inverse kinematics; or (2) an episodic reinforcement learning approach (Cost-regularised Kernel Regression (CrKR) [55]). Summing up MP candidates may sometimes fail to deliver a good policy (and may even be worse than a single primitive) [56]. The weights (i.e. the responsibility of each motor primitive in producing a new movement) can be updated through reinforcement learning methods, such as one inspired by Relative Entropy Policy Search (REPS) [57].
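The gating idea can be illustrated with a toy sketch: responsibilities computed from the distance between the current query (augmented state) and each primitive's stored query weight a sum of stored rollouts. The Gaussian gating function and the toy data are assumptions for illustration; in MoMP the responsibilities come from a learnt gating network and can be refined by reinforcement learning, as noted above.

```python
import numpy as np

def momp_blend(query, primitive_queries, primitive_rollouts, bandwidth=1.0):
    """Blend stored rollouts with Gaussian responsibilities (toy MoMP-style gating)."""
    d2 = np.sum((primitive_queries - query) ** 2, axis=1)  # squared query distances
    gates = np.exp(-d2 / (2.0 * bandwidth ** 2))
    gates /= gates.sum()                                   # normalised responsibilities
    return gates @ primitive_rollouts                      # weighted sum of rollouts

# Two stored primitives with 1-D queries; querying at the second one's query
# makes the blend lean towards the second rollout.
qs = np.array([[0.0], [1.0]])
rollouts = np.array([np.zeros(5), np.ones(5)])
blended = momp_blend(np.array([1.0]), qs, rollouts)
```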
The authors of [58,59] introduce the idea of a query q to DMPs, inspired by MoMP, to extend generalisability. The key concept is the mapping q → [ω, τ, g], which links queries (i.e. goals of primitives) to the learnt parameters. Then, for any given novel query q, the robot can generate a motion plan that is more human-like. The work [60] further refines the query system, encoding each style directly into the attractor landscape (i.e. the forcing term f). The authors of [61] point out the importance of movement styles, which can be utilised to accomplish specific tasks like shooting a ball in the Small Size League and table tennis. 'Style Adaptive DMPs' (SADMPs) are presented, which merge the weights of different styles to adapt to changes in the goal position.
Inspired by some findings in biology, Rückert et al. [62] provide a parameterised version of DMPs called 'DMP-Synergies'. This technique parameterises the basis functions in the forcing term so that, in multi-task learning, the synergy knowledge in each task can be shared to speed up the learning process. Very recently, another parameterised version of DMPs was proposed in [63]: Compliant Parametric DMPs (CPDMPs). This uses a Parametric Hidden Markov Model (PHMM) to encode the forcing term of the modified DMPs, so that f(s) becomes f(θ_m, s), where θ_m can be any high-dimensional variable that affects the shape of the trajectory; for example, θ_m can be the position of an obstacle. After training the PHMM parameters on multiple demonstrations with the Expectation-Maximisation algorithm [64], the system is ready to generate a trajectory that avoids obstacles. In addition, a force-feedback coupling term is introduced into the transformation system so that the robot is able to handle external forces.
DMP Plus [65] was designed to increase trajectory reproduction precision by adding a bias to each Gaussian basis kernel ψ_i and truncating the kernels. In [66], the exponentially decaying canonical system of Equation (4) is replaced by a linearly decaying one; the system shows comparable results and reduces the user's time expenditure while teaching the robot. The authors of [67] developed a new DMP representation called arc-length DMPs (AL-DMPs) that decouples the temporal speed and spatial shape of motion trajectories. This was done by expressing the DMP differential equations using derivatives with respect to arc length rather than time. It is worth pointing out that AL-DMPs solve the temporal scaling and time alignment problems well. Gams et al. [68] studied DMPs in a bimanual scenario where two arms are tightly coupled, modifying the DMPs into cooperative DMPs by adding complementary coupling terms to each transformation system.
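The reparameterisation underlying AL-DMPs, indexing the trajectory by distance travelled along its path instead of elapsed time, can be sketched minimally as follows (this computes only the normalised arc-length parameter, not the full AL-DMP equations).

```python
import numpy as np

def arc_length_param(points):
    """Normalised cumulative arc length of a sampled trajectory (rows = samples)."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)  # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # cumulative distance
    return s / s[-1]                                       # scaled to [0, 1]

# For a straight path sampled at x = 0, 1 and 3, the parameter is 0, 1/3 and 1,
# regardless of how fast those points were traversed in time.
s_vals = arc_length_param(np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]))
```

Because the parameter depends only on the path geometry, a fast and a slow execution of the same shape yield the same parameterisation, which is why speed decouples from shape.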
The stable estimator of dynamic systems (SEDS) is complementary to DMPs [69,70]. A SEDS model uses a Gaussian mixture model (GMM) to encode the attractor landscape as the joint probability of position and velocity. Learning becomes an optimisation problem: finding GMM parameter values that minimise the trajectory error with respect to all the demonstrations, subject to the system being globally asymptotically stable. Similar to DMPs, high stability always allows convergence to the goal. While DMPs allow encoding from a single demonstration, SEDS is suited to learning from multiple demonstrations. A drawback of SEDS is that it assumes the dynamic system is time-invariant, which is less versatile than DMPs, where the system dynamics can change with time as the forcing term changes. Another drawback is that the fixed attractor normally does not allow the state vector (i.e. position and velocity) to move away from the attractor, which may distort the overall shape of the motion. To this end, the Control Lyapunov Function-based Dynamic Movements (CLFDM) approach was designed [71], which is also guaranteed to be globally asymptotically stable. The difference between the two models is that CLFDM uses regression techniques like Gaussian mixture regression (GMR) to model a possibly unstable motion from demonstrations, but learns a Lyapunov function to enforce stability at runtime during task reproduction/generalisation by solving a constrained convex optimisation problem.
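The way such a GMM-encoded dynamic system is evaluated at runtime can be sketched with Gaussian mixture regression: condition the joint model of position and velocity on the current position to obtain a velocity command. The two-component, one-dimensional toy model below is an illustration only; it omits the stability constraints that a real SEDS imposes on the GMM parameters.

```python
import numpy as np

def gmr_velocity(x, priors, means, covs):
    """GMR estimate of E[xdot | x] under a GMM over (x, xdot); means[k] = [mu_x, mu_xdot]."""
    # Responsibility of each component given the observed position x
    h = np.array([p * np.exp(-0.5 * (x - m[0]) ** 2 / c[0, 0]) / np.sqrt(c[0, 0])
                  for p, m, c in zip(priors, means, covs)])
    h /= h.sum()
    # Per-component conditional mean of xdot given x, then mix by responsibility
    xdot_k = np.array([m[1] + c[1, 0] / c[0, 0] * (x - m[0])
                       for m, c in zip(means, covs)])
    return float(h @ xdot_k)

# Toy model: both components drive the state towards the attractor at x = 0.
priors = [0.5, 0.5]
means = np.array([[-1.0, 1.0], [1.0, -1.0]])
covs = np.array([[[0.5, 0.0], [0.0, 0.1]]] * 2)
```

Evaluating at x = -1 yields a positive velocity and at x = +1 a negative one, so repeated evaluation drives the state to the attractor, the convergence property the text attributes to SEDS.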

| Probabilistic approaches
Probabilistic approaches use probability theory to encode a spatial or temporal pattern as a joint probability density; a method called probabilistic movement primitives (ProMP) is slightly different, and is introduced below.
The GMM is one of the most popular models for encoding a trajectory. The term GMM is also referred to as mixture of Gaussians (MoG); it was first proposed to describe multimodal probability distributions [73] and was later utilised to encode the complex shape of a trajectory of low or high dimension [74]. The idea of a GMM is shown below:

p(ξ_j) = ∑_{k=1}^{K} π_k N(ξ_j; μ_k, ∑_k)

where p(ξ_j) is the probability density of a point ξ_j on the trajectory; there are K Gaussian components in total, π_k is the prior of the k-th Gaussian, N denotes a Gaussian distribution, and μ_k and ∑_k are the mean and covariance of the k-th Gaussian, respectively. Put simply, a GMM encodes the trajectory as a K-modal probability map that shows the spatial position at which a point in trajectory space is most likely to occur. An example of GMM encoding with single-demonstration one-shot learning can be seen in the left of Figure 5, where a 3D trajectory is encoded into a four-modal Gaussian map.

FIGURE 4 Dynamic movement primitives in conjunction with Gaussian mixture model [51]

The uniqueness of a DMP is its dynamic-equation-based representation, which means that the frequently used DMP forcing term composed of weighted Gaussian kernels can be replaced by a GMM [75]. As shown in the right of Figure 4, the GMM models the joint probability of the forcing term f and phase s, which is more compact when the trajectories have more fluctuations. The hidden Markov model (HMM) and hidden semi-Markov model (HSMM) are both alternatives to the GMM [76,77]. The key difference between an HMM and a GMM is that the HMM considers the transition probability between states (i.e. the Gaussian modes). One can think of an HMM as a GMM with latent variables (not directly observable) changing over time. An HMM encodes state duration implicitly in the probability of no state transition, which may be inaccurate, while an HSMM explicitly defines the state duration using a log-normal distribution (LN) with mean μ_i^D and covariance ∑_i^D, as shown in Equation (10). The parameters for modelling an HSMM (Θ_HSMM) based on K state components can be seen in Equations (8) and (9).
D_i is the duration of the i-th state, a_{i,j} is the state transition probability between states i and j, and Π_i is the prior of state i, indicating the probability that the initial state is i. The parameters of an HMM (Θ_HMM) are similar to those of the HSMM, with the duration terms removed and j allowed to equal i. Figure 6 visualises an example of HSMM encoding; an HMM example is not shown, since it has a similar structure without the state duration probability. It can be seen from the figure that six Gaussian components are used for encoding, with state transitions (i.e. arrows between states) annotated with transition probabilities. Close to each state, a state duration probability density function is calculated, shown as small bell-shaped functions. On top of this, state sequence probabilities, which describe the conditional probability of being in a certain state given the time variable, are plotted on the timeline. Trajectory-based GMM is a method that exploits not only static but also dynamic features (i.e. the relationship between the trajectory and its derivatives, e.g. position, velocity, acceleration and jerk) [80]. As shown in Equation (11), ξ_t can be computed through an Euler approximation using the spatial positions (of any dimension) of three consecutive time steps, where Φ_t is the matrix encoding the Euler approximation. By stacking the data of all time steps, a generic relation (Equation 12) is obtained, where Φ is a large sparse matrix.
Suppose multiple trajectories are obtained; the information can be encoded in a GMM, HMM or HSMM. Taking the GMM as an example, the likelihood of a trajectory ξ given a state sequence s can be computed, where s is a sequence of indicators (one per time step) denoting which Gaussian component is responsible for each time step, as shown in Equation (13).
where μ_{s_t} and ∑_{s_t} are the mean and covariance of the state (i.e. Gaussian component) s_t at time step t. Maximum likelihood estimation can then be performed with the help of Cholesky and QR decompositions [81] to solve the motion generation problem. As shown in Figure 7, the trajectory-GMM works well in a four-demonstration, 18-Gaussian-component setting, synthesising the demonstrations into a complex trajectory that contains a diversion in the middle, allowing motion to proceed in either direction. This method does not require temporally or spatially aligned demonstrations.
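The mixture density p(ξ) = ∑_k π_k N(ξ; μ_k, ∑_k) that underlies these encodings can be evaluated directly from its definition; a minimal sketch (in practice the parameters would come from E-M fitting, discussed later):

```python
import numpy as np

def gmm_density(xi, priors, means, covs):
    """p(xi) = sum_k pi_k * N(xi; mu_k, Sigma_k) for one D-dimensional point xi."""
    d = len(xi)
    p = 0.0
    for pi_k, mu_k, cov_k in zip(priors, means, covs):
        diff = xi - mu_k
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov_k))  # Gaussian normaliser
        p += pi_k * np.exp(-0.5 * diff @ np.linalg.solve(cov_k, diff)) / norm
    return p
```

Evaluating this density over a grid of trajectory points reproduces the kind of K-modal probability map described above.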
Gaussian process regression (GPR) is a generic method that computes the high-dimensional trajectory distribution by brute force [82]. It finds the correlation between each degree of freedom and stores the trajectory distribution in a high-dimensional positive definite covariance matrix. Trajectories can be regenerated or resampled by conditioning. Figure 8 illustrates the process of using GPR to model a 1D trajectory. This method results in a large covariance matrix and mean vector, which may nevertheless be easy to use. However, its performance depends entirely on the demonstration quality, and it is sensitive to spatial/temporal variations. A tremendous number of demonstrations may be needed to synthesise and infer a smooth trajectory.
Task-parameterised solutions are well summarised in [83], which clearly identifies three types of solution: GPR with a trajectory-model database; the multi-streams approach; and the parametric Gaussian mixture model (PGMM/PHMM). Generally speaking, a task-parameterised model includes the task parameters (i.e. offset positions and transformations) as query points for each demonstration data point. If the task parameters within each demonstration are fixed, the problem can be treated as a GPR problem [82], solved with techniques like GMM and Gaussian mixture regression. This allows the motion to be inferred from novel task queries in real time; however, GPR cannot handle query points that are too far from the demonstrated ones. The multi-streams approach uses a different strategy that observes the motion in multiple different frames and trains each model separately, which, however, may raise computational limitations [84]. The parametric hidden Markov model (PHMM) takes all the demonstrated motions and queries into a single model, while the PGMM follows the same strategy but does not take state transitions into account [85]. PGMM/PHMM can sometimes be problematic, as the covariances are not parameterisable (i.e. they are constants), which leads to local-minimum solutions in Expectation-Maximisation (E-M) learning. An improved version based on the PHMM is presented in [83], which fuses the main features of the above three models. Task parameterisation can also be used in conjunction with the trajectory-GMM model, where the motion derivatives are introduced.
Probabilistic movement primitives (ProMP) [86] use an idea similar to DMPs: an artificial clock maps time to a phase variable through an arbitrary non-linear function, and locally weighted regression can likewise be applied to learn a set of basis function weights. The uniqueness of a ProMP model is that it further encodes the trajectory in the weight space (finding a distribution over the weights), so that abstraction is carried out twice to obtain a compactly represented model. This yields a number of useful properties and makes temporal/spatial generalisation and online modification easy. It estimates a trajectory shape using a weighted sum of basis functions, x_t = Φ_t ω + ε_x, where Φ_t is a matrix of time-varying basis functions, ω is the weight vector that encodes the spatial information, and ε_x is zero-mean Gaussian noise with covariance matrix ∑_x.
Once the weights are learnt, a probability map of the trajectory can be obtained as in Equation (14). To learn the weights from multiple demonstrated trajectories, the parameters Θ_ω = {μ_ω, ∑_ω} of a Gaussian distribution over the weights of all demonstrations can be estimated; Equation (15) can then be used to derive the weights. An example application of ProMP can be seen in [87] for table tennis, where the authors also studied the probability distributions in task space and robot joint space, so that a task-space trajectory can be obtained from a ProMP model trained in joint space. The ProMP model has a flexible generalisation ability through probabilistic conditioning, and it can also be used for action recognition, which is not so easily accomplished with DMPs.
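The ProMP pipeline can be sketched minimally: fit per-demonstration weights by least squares, fit a Gaussian over the weights, and condition on a via-point. The basis count, kernel widths and noise levels below are illustrative choices:

```python
import numpy as np

def basis(t, centers, h=0.01):
    """Normalised Gaussian basis functions evaluated at phase t in [0, 1]."""
    psi = np.exp(-(t - centers) ** 2 / (2 * h))
    return psi / psi.sum()

def learn_promp(demos, n_basis=10):
    """Per-demonstration weights by least squares, then a Gaussian over the weights."""
    centers = np.linspace(0, 1, n_basis)
    T = demos.shape[1]
    Phi = np.stack([basis(t, centers) for t in np.linspace(0, 1, T)])   # T x K
    W = np.stack([np.linalg.lstsq(Phi, d, rcond=None)[0] for d in demos])
    return Phi, W.mean(axis=0), np.cov(W.T) + 1e-6 * np.eye(n_basis)

def condition(mu_w, Sigma_w, phi_t, y_star, sigma_y=1e-4):
    """Condition the weight distribution on an observed via-point y* at one step."""
    k = Sigma_w @ phi_t / (phi_t @ Sigma_w @ phi_t + sigma_y)  # Kalman-style gain
    return mu_w + k * (y_star - phi_t @ mu_w), Sigma_w - np.outer(k, phi_t @ Sigma_w)
```

Conditioning moves the whole distribution so that sampled trajectories pass through the via-point, which is exactly the generalisation mechanism described above.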

| Locally weighted regression
Locally weighted regression (LWR) [88] is a very fast algorithm combining the simplicity of linear least-squares regression with the flexibility of non-linear regression; it was originally proposed in [89]. It allows one-shot learning by performing linear regression locally on a non-linear problem. An extended version called locally weighted projection regression (LWPR) was introduced in [90] to reduce redundant complexity using partial least squares. LWR is commonly used for learning DMPs; recently published examples can be seen in [91,92] for target reaching and for pouring with obstacle avoidance, respectively.

FIGURE 5 (left) Example of Gaussian mixture model motion encoding with five Gaussian components; (right) Gaussian mixture regression trajectory generation and its covariance [72]
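The core idea fits a weighted least-squares line around each query point; a minimal sketch (the Gaussian locality weights and bandwidth are illustrative, and LWR for a DMP forcing term fits the kernel weights analogously):

```python
import numpy as np

def lwr_predict(x_query, X, Y, bandwidth=0.03):
    """Fit a weighted least-squares line around x_query and evaluate it there."""
    w = np.exp(-(X - x_query) ** 2 / (2 * bandwidth ** 2))  # Gaussian locality weights
    A = np.stack([np.ones_like(X), X], axis=1)              # [bias, slope] design matrix
    WA = A * w[:, None]
    beta = np.linalg.solve(A.T @ WA + 1e-8 * np.eye(2), WA.T @ Y)
    return beta[0] + beta[1] * x_query

# A locally linear fit tracks a clearly non-linear function.
X = np.linspace(0, 1, 200)
Y = np.sin(2 * np.pi * X)
```

Each prediction only solves a 2x2 system, which is what makes the method so fast.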

| Expectation-maximisation
E-M is an algorithm for learning the parameters of GMM- and HMM-based models [93]. It is an iterative algorithm with two steps, the E-step and the M-step. Intuitively, the idea of E-M is to update a lower bound of the objective function and then maximise that lower bound, which implicitly (indirectly) maximises the overall objective. Note that the E-M algorithm may not work efficiently when the GMM or HMM contains too many components [94], as this increases the dimensionality of the learning problem and significantly adds computational complexity. A variant of E-M, the Baum-Welch algorithm, can be seen in [64], and an example of using it to train an HMM can be seen in [10].
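For a one-dimensional GMM, both steps have closed forms; a minimal sketch (the initialisation scheme and regulariser are illustrative):

```python
import numpy as np

def em_gmm_1d(x, K=2, iters=100):
    """Fit a 1-D Gaussian mixture with E-M."""
    mu = np.linspace(x.min(), x.max(), K)  # spread initial means over the data range
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities r[n, k] proportional to pi_k * N(x_n; mu_k, var_k)
        r = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates that maximise the lower bound
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
        pi = Nk / len(x)
    return pi, mu, var
```

The cost of each E-step grows with the number of components K and the data size, which is the practical limitation noted above.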

| Gaussian mixture regression
GMR is a technique frequently used in combination with GMM [95], HMM [10] and DMPs [51,96,97]. An excellent review of different regression algorithms can be found in [98]. Compared with LWR or LWPR, GMR computes the regression function from the (multivariate) joint density function. For a robot trajectory, the temporal value often acts as the model input and the remaining data as the output; by finding the conditional probability of the trajectory given time, p(x_trajectory | t), an infinitely differentiable (non-discontinuous) spatial trajectory can be obtained [99]. Another interesting example of GMR usage can be found in [70], which uses a GMM to encode p(x_trajectory, ẋ_trajectory) and derives the velocity at different spatial positions with GMR. Time-dependent GMR [100] is another variant that has proven useful; it takes time, position and velocity together as variables of the joint probability density and then estimates p(x_trajectory, ẋ_trajectory | t). In the right of Figure 5, the computed trajectory means and covariances can be visualised. In [101], not only are the motion dynamics encoded; the joint probabilities between position, contact force and joint stiffness are also encoded using an HSMM, and GMR is then used to generate motion with the desired contact force, joint stiffness and velocity at each position.
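For a joint GMM over (t, x) with one input and one output dimension, the GMR conditional mean is a responsibility-weighted blend of per-component linear regressors; a minimal sketch (the two-component parameters in the test are illustrative):

```python
import numpy as np

def gmr(t, priors, means, covs):
    """E[x | t] from a joint GMM over (t, x): means[k] = (mu_t, mu_x), covs[k] 2x2."""
    # responsibility of each component for this t (marginal over the input dimension)
    h = np.array([pi * np.exp(-0.5 * (t - m[0]) ** 2 / c[0, 0]) / np.sqrt(c[0, 0])
                  for pi, m, c in zip(priors, means, covs)])
    h /= h.sum()
    # blend per-component conditional means mu_x + Sigma_xt / Sigma_tt * (t - mu_t)
    return sum(h_k * (m[1] + c[1, 0] / c[0, 0] * (t - m[0]))
               for h_k, m, c in zip(h, means, covs))
```

Because the blend weights vary smoothly with t, the regressed trajectory is smooth even where components overlap.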

| Reinforcement learning
The idea of reinforcement learning in robot policy learning is to define a cost or reward function with which the robot optimises its actions, hence obtaining better outcomes. Conventional trial-and-error [21] requires robots to search a very large action space, which means training takes a very long time and is unfriendly to a real robot. Combined with LfD, the action space is significantly confined to the neighbourhood of a local optimum, which reduces training time. In other words, the success rate of reproducing a learnt skill after one-shot learning may not be high, for known and unknown reasons, which highlights the effectiveness of reinforcement learning (RL) and makes it an irreplaceable tool. In a ball-in-the-cup skill transfer study [102], the manipulator could not reproduce the skill at all from a single demonstration because some spikes in acceleration could not be reproduced correctly by the motor. Inverse reinforcement learning (IRL) [103] has been proposed to extract reward functions from observed actions and the environment, as shown in Figure 9. In some scenarios, such as minimally invasive surgery (MIS), expert surgeons continuously optimise their actions through evaluation; however, even the surgeons themselves cannot fully explain their evaluation policy. This again motivates the use of IRL. Recently, Li et al. [104] presented inspiring work implementing IRL in robotic surgery procedures to derive a policy for surgical skill evaluation.
Most works on robot reinforcement learning are time-consuming and based on the real system. Some studies have tried to train a robot in a physical simulator, which may significantly reduce the time spent on the real system as the search is carried out in a submanifold of the policy space and high-dimensional sensory input. However, it is quite common for the robot to produce catastrophic failures after transfer to the real scenario, which is why people often use model-free RL on the real system. Domain randomisation (DR) is a useful technique that helps close the reality gap (i.e. the mismatch between the real world and simulation) by randomising the simulation scenes, embedding the uncertainty of the world into the learnt parameters to form a more robust policy. Examples can be seen in [105,106], which utilise DR to train for a balancing task and a ball-in-a-cup task. Peng et al. [107] also developed a dynamics randomisation technique that randomises the unknown dynamic properties of objects to bridge the sim-to-real gap.
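At its core, DR just resamples the simulator's physical parameters every episode; a minimal sketch, where the parameter names and ranges are illustrative rather than those used in the cited works:

```python
import numpy as np

def sample_dynamics(rng):
    """Draw one randomised simulator configuration (ranges are illustrative)."""
    return {"mass": rng.uniform(0.8, 1.2),       # +/- 20 % around the nominal mass
            "friction": rng.uniform(0.5, 1.5),   # friction coefficient
            "latency": int(rng.integers(0, 3))}  # sensor delay in control steps

def domain_randomized_rollouts(run_episode, n_episodes=100, seed=0):
    """Run each training episode under freshly randomised physics."""
    rng = np.random.default_rng(seed)
    return [run_episode(sample_dynamics(rng)) for _ in range(n_episodes)]
```

A policy trained this way must succeed across the whole parameter distribution, which is what makes it robust to the unknown real-world values.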
FIGURE 8 A 1D example of Gaussian process regression [80]

In summary, DS-based methods are often used for direct learning control because of the characteristics of dynamic equations. Users can access and improve the robot system's stability and complete the task stably through online regulation. The methods based on probability theory, however, are more flexible and suitable for target/via-point generalisation; they assume that the user has an effective robot controller. Therefore, these methods are usually used for task-space trajectory planning rather than for directly learning motor control variables. Nevertheless, this also facilitates high-level robot task planning in the task space based on cognitive skills.

| Temporal alignment
With multiple demonstrations of complex skills, differences are commonly seen in both spatial and temporal values: demonstrators may not be able, or may not bother, to produce temporally and spatially aligned motions. The spatial difference depends entirely on the demonstrator; for the temporal difference, however, some models avoid the issue by construction, for example the AL-DMP and the trajectory-based GMM introduced in the previous sections. Trajectory-based methods solve the issue by pushing the model complexity to a relatively high level, which is not always ideal. As other methods provide no built-in mechanism for temporal alignment, commonly used alignment techniques are introduced in this section.
Dynamic time warping (DTW) was first proposed in [108] to tackle speech recognition problems; as speech and robot motion share a similar time-series trajectory structure, DTW is also widely applied in the robotics domain. In [109], the authors use DTW and the E-M algorithm to synthesise multiple demonstrations and compute the time-aligned trajectories at the same time. The outcome is a synthesised reference trajectory z together with a set of time mappings τ for each demonstration. Their model is based on Equation (16), where x(τ_t) is a demonstrated trajectory with its time mapping at time t, and M is the total number of demonstrations. N denotes a Gaussian distribution; here the mean is zero, and the covariance matrices R are the weights indicating the contribution of each demonstration to shaping the reference trajectory z. The open parameters are z, R and τ. A Kalman smoother is then established and the results are optimised via E-M: the E-step computes a Gaussian distribution Z for the reference z given R and τ, and the M-step updates R and τ given Z by maximising the likelihood, iterating until convergence. An extension of DTW, generalised time warping (GTW), can be found in [110], which handles high-dimensional, multi-modal data efficiently using a Gauss-Newton algorithm. DTW often produces non-continuous trajectories with very large acceleration jumps because it imposes no constraint on velocity or acceleration; [111] proposes the local time warping (LTW) method to solve that problem by optimisation using local information.
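The classic dynamic-programming formulation of DTW is compact enough to sketch; it is quadratic in sequence length, and GTW/LTW exist precisely to improve on this cost and on its jumpy alignments:

```python
import numpy as np

def dtw_cost(a, b):
    """Minimal-cost monotonic alignment between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])       # local distance
            D[i, j] = step + min(D[i - 1, j],      # repeat a sample of b
                                 D[i, j - 1],      # repeat a sample of a
                                 D[i - 1, j - 1])  # advance both
    return D[n, m]
```

Backtracking through the same table D yields the time mapping itself, which is what alignment for LfD actually uses.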

| COMPOUND SKILLS AND HIGH-LEVEL COGNITIVE REASONING
In the previous section, the learning outcomes of a robot were mainly defined as low-level policies, which can be utilised for a robot to perform a single task element. However, to endow the robot with more capability in handling people's routine work, achieving human-like high-level reasoning is required. High-level skills are the premise for combining low-level skills into more complex compound skills. Research in this area usually involves skill segmentation, labelling, recognition, and planning/sequencing. The significance of these studies is notable: if the learnt skill primitives cannot be arranged correctly, the robot will not be able to complete more complex tasks autonomously. Meanwhile, the robot can use this learnt high-level reasoning knowledge to produce novel skills through generative frameworks. This mirrors the learning process of humans, well reflected in learning to play a musical instrument. For example, a skilful violinist can compose and play a piece of never-before-seen sheet music almost immediately, since that is just a resequencing of all the learnt 'skill primitives'. In this example, composing involves high-level decision-making, while playing primarily involves resequencing and smoothly joining the learnt motor skills. Learning such complicated high-level cognitive skills is no different from learning the interactions between the labels of environmental conditions, agents and objects. The following section reviews previous and current work on techniques for skill segmentation, recognition, motion taxonomy, joining, and sequencing; then the inspiring topic of affordance learning in robotics is introduced, which addresses imitation in an affordance-based manner.

FIGURE 9 An intuitive comparison between reinforcement learning (RL) and inverse reinforcement learning (IRL)

| Skill segmentation and recognition
Segmentation of a continuously performed complex compound skill (e.g. computer assembly) can be approached in various ways. Supervised (manual) segmentation is intuitive and precise, but also time-consuming. Achieving a perfect generic framework for segmentation is extremely challenging, since segmentation problems are often task-dependent. The most straightforward unsupervised method that works efficiently is based on stopped motion, assuming each cut-off point has zero velocity.
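The zero-velocity heuristic can be sketched in a few lines; the thresholds and the finite-difference velocity estimate are illustrative choices:

```python
import numpy as np

def segment_by_pauses(positions, dt=0.01, v_thresh=0.05, min_len=5):
    """Split a (T, D) trajectory wherever the speed drops below v_thresh (a pause)."""
    speed = np.linalg.norm(np.gradient(positions, dt, axis=0), axis=1)
    moving = speed > v_thresh
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                        # a movement burst begins
        elif not m and start is not None:
            if i - start >= min_len:
                segments.append((start, i))  # burst ended at a pause
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments
```

Real demonstrations rarely come to a full stop between sub-skills, which is exactly why the more elaborate methods below were developed.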
In [112], a segmentation algorithm based on spectral analysis is introduced. In this study, the target trajectory for segmentation was the temporally varying 3D position of the hand. It is noteworthy that the affinity matrix of the spectral analysis is computed using a Gaussian kernel function based on spatial and temporal differences, which is considered less intuitive. The method has the drawback of manual parameter selection (i.e. the standard deviation of the Gaussian kernel and the total number of clusters); for a more complex trajectory it is difficult to take all kinds of motor pattern into account, and one does not want to specify the number of sub-trajectories manually.
Meier et al. [113] modified DMPs into a discretised Kalman-filter version to achieve movement segmentation with online movement recognition. At each time step, the E-M algorithm updates the optimal parameter estimate for each DMP in the library. The DMP with the maximum likelihood is then the best-predicted motion, which can be determined online with increasing confidence. One issue is that the estimated optimal DMP temporal scaling factor can sometimes be unreasonably large or small. Auto-segmentation is done by monitoring drops in the likelihood value. This method assumes all skill primitives are stored in a library and that the segmentation target joins each primitive with no spacing, which largely constrains its effectiveness.
Another method, based on the beta-process autoregressive hidden Markov model (BP-AR-HMM), which is less straightforward but performs well, was proposed by Fox et al. [114,115]. For the details of the algorithm, refer to their works; the process is briefly summarised below. A beta process (BP) is the conjugate of the Bernoulli process (BeP): it describes the parameter of the BeP and implies an infinite number of features/modes (i.e. infinitely many possible patterns of a segment). Given multiple compound demonstrations (i.e. time series), each demonstration can share modes from the infinite feature set; the subset of features used is encoded in a binary vector f_i, which indicates which modes occur in series i. Features can thus appear multiple times in the same or a different time series, which benefits teaching with multiple demonstrations. A Dirichlet distribution is then used to derive the transition probability vector π(i), with finitely many parameters, for the HMM, which helps compute the mode z(i) for the next time step based on all accessible knowledge. Finally, the problem is solved by defining VAR dynamics for the observations y(i). Notice that π(i) has finitely many parameters, which means the total number of modes does not need to be specified. The structure of BP-AR-HMM can be seen in Figure 10, and application examples in [116,117].
The above-mentioned methods work well in certain situations and share the key concept of similarity; in other words, they attempt to maximise a similarity/likelihood measure. The works below tackle segmentation in another way, based on indicative events. Konidaris et al. proposed a segmentation method based on changepoint detection [118]. The idea is to find the point at which an incoming time step can no longer be fitted by the same model as before. This method can only be used under the prerequisite that all candidate models are stored as priors, or with simple models such as linear/quadratic trajectories.
Su et al. presented a sensorimotor primitive segmentation algorithm based on triggering events [119]. The method applies Bayesian online changepoint detection (BOCPD) [120] to multi-modal sensory data, including the robot state, the environment/object state and tactile signals from a BioTac sensor [121]. The results show that using such multi-modal sensory signals is very effective, especially tactile signals within contact-rich tasks.

| Skill taxonomy and skill library
In Section 3, it was shown that DMPs and probabilistic approaches can store the skill pattern in each primitive, which facilitates the classification process. The motivation for classification is that, given a motor skill auto-classifier, one can implement motion prediction and choose a reactive skill from a pre-constructed skill library to solve problems. DMPs are mainly designed for skill generation rather than classification, although they show some classification capability [122]. Probabilistic models such as the HMM are well suited to classification: motions can easily be categorised via their model parameter values, for example the HMM parameters Θ_HMM. A motion is classified by finding the set of parameters Θ_HMM that maximises the likelihood of observing the motion x, as shown in Equation (17).
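The argmax-likelihood classification of Equation (17) can be sketched with simple stand-in models; a diagonal Gaussian over per-step positions is used here in place of a full HMM, purely for illustration:

```python
import numpy as np

def log_lik(traj, model):
    """Log-likelihood of a trajectory under one library entry (mean, variance)."""
    mu, var = model
    return float(np.sum(-0.5 * ((traj - mu) ** 2 / var + np.log(2 * np.pi * var))))

def classify(traj, library):
    """Pick the skill label whose model maximises the likelihood of the motion."""
    return max(library, key=lambda label: log_lik(traj, library[label]))
```

Replacing `log_lik` with the forward-algorithm likelihood of a trained HMM gives the scheme described above without changing `classify`.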
All of the classified motion models, each encoding its own characteristic pattern, together constitute a skill library for use in specific application domains. However, such classified motions are not automatically labelled with physical meaning.
In other words, similar motions in a skill library constructed through LfD can be used interchangeably. However, skills like 'push' do not necessarily hold similar spatial shapes in the task space (e.g. a push could be in any direction at any speed). This increases the difficulty of assigning symbolic labels to motion primitives under current LfD skill representation models, whereas symbolic labels can be very helpful for planning a compound motion at a high level. To this end, it would be very helpful to derive a better motion taxonomy methodology for generic usage by finding the most effective and most explicable sensory modalities.
Aksoy et al. [123] proposed the semantic event chain (SEC) tool for object-action relations, as shown in Figure 11(a). In a SEC, objects in the scene are segmented and semantic scene graphs are established based on the objects' spatial relations in four modes. With SEC as the basis, the most fundamental manipulations in everyday life are studied in [124]. That research summarises a manipulation ontology tree containing three fundamentally different types of manipulation and six manipulation goal categories, as shown in Figure 11(b). The authors further explored whether the motion trajectory matters in motion taxonomy and found a positive answer. Differently from the above, a motion taxonomy for manipulation embedding with a similarity metric was proposed in [125]. This taxonomy encodes motions into binary vectors along four main aspects: interaction type, object structural outcome, motion trajectory, and an active descriptor indicating whether a tool is used to manipulate actively. The motion trajectory analysis is based on principal component analysis (PCA); interestingly, the binary encoding allows motion differences to be computed using a weighted Hamming distance with tuneable weights (e.g. the trajectory profile may influence the taxonomy more than the interaction type). The calculated distances between motions help cluster similar motions, as shown in Figure 12, which is produced with the visualisation tool t-SNE [126].
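The weighted Hamming distance used to compare binary motion codes is straightforward; the codes and weights below are illustrative, not the actual encoding of [125]:

```python
import numpy as np

def weighted_hamming(a, b, weights):
    """Weighted Hamming distance between two binary motion codes; a heavier
    weight makes that feature group dominate the taxonomy."""
    a, b, w = np.asarray(a), np.asarray(b), np.asarray(weights, dtype=float)
    return float(np.sum(w * (a != b)))
```

Feeding the pairwise distance matrix to a clustering or embedding tool such as t-SNE then groups similar motions, as in Figure 12.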
In addition to all the above, other inspiring works can be seen in [127-129], which present motion taxonomies from different perspectives. Additionally, motion taxonomy could be beneficial for extending skills in affordance-based learning, which is introduced in Section 4.4.

| Skill sequencing and joining primitives
Enabling robots to solve problems in a complex environment and perform a series of actions requires the robot to know the meaning of the label behind an action. A metric for inferring task constraints and goals can be essential.
Ekvall et al. [130] presented a framework for complex task planning and resequencing of learnt sub-tasks. The essence of this framework is to express complex task demonstrations as series of states/effects (effects mainly being absolute/relative position changes). From all the demonstrations, task priority constraints can be found, and hence a feasible skill sequence for achieving an effect. The framework also allows incremental learning, updating the knowledge as new demonstrations arrive. Related work that trains the robot incrementally with the help of vocal comments under an LfD framework can be seen in [131]. The HMM is commonly used in high-level skill planning and recognition [132,133], as it encodes the transitions between states in a time series. A hierarchical hidden Markov model, formally defined as a hierarchical dynamic Bayesian network (H-DBN), with multiple levels is presented in [134]. Another version of the H-DBN, the growing H-DBN, can be seen in [135]; it has two levels encoding high-level abstraction and low-level motor skills, respectively.
Nevertheless, one can always define the action labels manually and join all the primitives in a predefined sequence. Joined primitives can be problematic when reproduced because of discontinuities. Such problems can be solved by refining the connection between two primitives; for example, a third-order DMP formulation allows a smooth acceleration profile [136], or the primitive model parameters can be adapted to combine primitives into one skill [137].

| Affordance-based learning
Gibson [19] first proposed the theory of affordance a few decades ago. It describes the relationship between objects/agents, actions and the effects of actions. For example, a larger hand force will push a ball away along a direction for a long distance, while a lower force produces only a short movement. Examples can be seen in [138], where a robot learnt the rolling capability of a toy car and learnt tool selection by searching the affordances of the objects in the scene to achieve an effect. Figure 13 shows that affordance can usually be used in three ways: as action, object and effect are correlated, inference of one component is feasible when the other two are known. Do et al. [140] proposed a deep-network-based framework for affordance recognition called AffordanceNet. As shown in Figure 14, once an object is detected, its affordance attribution can be inferred (the colour-augmented area). Based on that, the robot WALK-MAN is able to grab a bottle and pour towards a pan.

FIGURE 10 The illustrative structure of beta-process autoregressive hidden Markov model [115]
LfD and affordance learning can be used in conjunction: in [139], affordance is learnt with Markov chain Monte Carlo and a Bayesian network [141]. In addition, the robot is able to perform task abstraction based on action-object state demonstrations; hence, the robot learns the policy for optimally finishing the task. Learning of the policy is achieved by adopting a variant of Bayesian inverse reinforcement learning [142,143]. Based on the idea of affordance, Kroemer et al. [144] present a kernel-based approach that works together with DMPs and improves action planning through better perception. In their work, the affordances of object subparts are learnt (i.e. a particular object shape affords specific actions and effects). Kernel basis functions are used for computing subpart shape similarity, and kernel logistic regression is used to choose a DMP. Other interesting examples can be seen in [145,146], where potential object grasp positions are learnt via image processing with machine learning or estimated by a fuzzy Gaussian mixture model (FGMM).

| Collaborative skills
Collaboration between robots and humans is remarkably useful but challenging, even for a simple human-like hand-over task. Achieving collaborative or interactive skills requires a robot to have an accurate human motion recognition and estimation module, which must also take safety factors into account. Collaboration is the consequence of reacting to an action or actively promoting the overall task goal. It is relatively easy to train a robot to collaborate in a supervised way, where action primitives are manually labelled. However, an unsupervised approach is preferable because it removes the labelling work, so that robots can observe human co-workers all day long, or be actively involved in the task as co-workers, and discriminatively learn intention-action pairs. The HMM, a widely used time-series model capable of encoding both temporal and spatial patterns, is one way to approach collaboration. In [147], a mimetic communication model, shown in Figure 15, was proposed to realise collaboration control. The motions of humans and robots are recognised with continuous HMMs (CHMMs), where human motion triggers interaction state changes in a discrete HMM (DHMM) and selects an appropriate reference motion primitive for the robot. The reference trajectory is then modified online to adapt to human behaviour; meanwhile, humans also adapt to the robot's behaviour online (no computational model of the human's reaction is needed, although in the future it would be interesting to model the human's adaptive behaviour so as to influence the human indirectly).
Another fascinating example of a collaborative skill transfer framework is proposed in [148], where ProMP is used. For a single-task collaborative skill, the motion demonstrations of both the human (with P DoF) and the robot (with Q DoF) are encoded into a single ProMP model using M demonstrations, as described in the previous section. The probability distribution of a motion pattern (i.e. p(ω; Θ), where ω is the weight vector of the basis functions encoding a motion pattern and Θ is a hyperparameter) is then a multivariate distribution in which part of the dimensions relate to human motion and the rest to robot motion. In the inference stage, given an observed human motion y*, the conditional distribution p(ω|y*; Θ) is obtained by conditioning. To generate a robot motion, the robot-related weights are drawn from this posterior distribution. This interactive skill also has a mixture representation if multiple subtasks exist, which allows a non-linear correlation between tasks. The process is relatively straightforward since it utilises the idea of the Gaussian mixture, which was covered in the previous section. All the key concepts of this method are depicted in Figure 16. Apart from skill learning for collaboration, perception accuracy is also an important factor influencing performance, as perception is directly related to the inference of people's intent. Su et al. [149] trained a model-free Deep Convolutional Neural Network (DCNN) for the recognition of surgeon gestures based on information from a Kinect camera, an inertial measurement unit (IMU) and EMG sensors.

FIGURE 12 A motion taxonomy: visualised by t-SNE [125]
FIGURE 13 Put affordances into use [139]
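The conditioning step at the heart of the interactive ProMP can be illustrated in one dimension: given a joint Gaussian over one human weight and one robot weight, observing the human value shifts the robot's distribution. The numbers below are purely illustrative.

```python
def condition_gaussian(mu_h, mu_r, var_h, var_r, cov_hr, y_star):
    """Condition a 2-D joint Gaussian over (human, robot) ProMP weights
    on an observed human value y_star; returns the robot posterior."""
    gain = cov_hr / var_h                    # scalar form of Sigma_rh Sigma_hh^-1
    mu_post = mu_r + gain * (y_star - mu_h)  # posterior mean shifts with y_star
    var_post = var_r - gain * cov_hr         # posterior variance always shrinks
    return mu_post, var_post

# Hypothetical numbers: human and robot weights positively correlated (0.8),
# so observing the human above their mean pulls the robot's mean upwards.
mu, var = condition_gaussian(mu_h=0.0, mu_r=1.0, var_h=1.0, var_r=1.0,
                             cov_hr=0.8, y_star=0.5)
# Posterior mean 1.4, posterior variance about 0.36.
```

In the full P + Q DoF case the same formula holds with matrices, with the observed human trajectory entering through the basis-function model.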

| Learning to emulate
Start with two questions: 'If you needed to get an object right in front of you, would you choose to grab it with your hand or grasp it with a grasping tool? If the object were out of reach of the robot's end-effector, would the robot be able to use a grasping tool within its reach to extend its capability of achieving the goal, without a human's explicit demonstration or instruction to do so?' For a human, reasoning about a task goal is relatively easy given knowledge of affordances, which is the result of training that happens every day. Whiten et al. [150] presented a taxonomy of social learning effects that classifies emulation within copying. Emulation was first distinguished from imitation by the observation that children can achieve a goal using idiosyncratic means never seen previously [151]. This suggests that engineers can re-examine imitation learning problems from an emulation perspective. Normally, the term emulation means copying the end-state of a series of actions, or copying their final effect (Figure 17).
While affordance learning is one route to emulation that facilitates the generalisation of skills, emulation itself can be a unique way to produce novel skills, forming an alternative to conventional imitation learning. A good example of emulation learning in robot skill learning can be seen in [152], where skills such as pick-and-place are performed amid moving obstacles. In contrast to previous works, where obstacle positions are assumed fixed after an optimal trajectory is calculated, this framework contains an emulation module that feeds an estimated parameter (i.e. related to the Gaussian representation of the obstacles) back to the system through a utility function. At each time step, the solution is the trajectory that minimises the utility function, which in effect minimises the error between the learnt demonstration trajectory and the generated, obstacle-avoiding trajectory (i.e. satisfying both task and state constraints). In that method the task and state constraints are manually defined; affordance-based methods could help select those constraints from learnt high-level knowledge, which would be a research direction worth attempting in the future.
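A minimal sketch of such a utility function, with a 1-D trajectory and a single Gaussian obstacle estimate (all values hypothetical, not the formulation of [152]):

```python
import math

def utility(traj, demo, obstacle_mu, obstacle_var, w_obs=5.0):
    """Utility of a candidate trajectory: demonstration-tracking error plus a
    penalty from a Gaussian representation of the (moving) obstacle."""
    track = sum((p - d) ** 2 for p, d in zip(traj, demo))
    # Gaussian obstacle penalty: large when the path passes near the obstacle mean.
    obs = sum(math.exp(-((p - obstacle_mu) ** 2) / (2 * obstacle_var))
              for p in traj)
    return track + w_obs * obs

demo = [0.0, 0.5, 1.0]       # learnt demonstration (1-D positions)
candidates = [
    [0.0, 0.5, 1.0],         # follow the demonstration exactly
    [0.0, 0.9, 1.0],         # detour around the obstacle
]
# Hypothetical obstacle estimate: centred exactly where the demonstration passes.
best = min(candidates,
           key=lambda t: utility(t, demo, obstacle_mu=0.5, obstacle_var=0.01))
# The detour wins: a small tracking error is traded for obstacle clearance,
# i.e. the end-state is emulated rather than the exact demonstrated path.
```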

| Active perception
Perception provides information for an agent to learn the context of the environment. To the best of the authors' knowledge, robot perception is commonly used in a passive manner, whereas humans often take an active role in perceiving information across useful modalities. For example, perceiving a circular outline does not necessarily mean the object is a ball; it could also be a piece of paper cut in a circular shape, and people usually move their viewpoint actively to get a better perception. Figure 18 intuitively demonstrates this problem. Some articles have also pointed out the importance of being an active perceiver, in experiments with animal subjects [153]. The essence of active perception is defined as setting up goals based on the current belief about the state of the world and performing behaviours that may achieve them [154]. One of the most informative sensory modalities in perception is vision, yet the information embedded in images is not fully extractable in one go. Since the goal of active perception is to know the context well, adopting a top-down attention strategy in the vision system has been proven efficient [155,156]; this is a feature selection strategy that starts from the highest level.

FIGURE 14 AffordanceNet: objects detected with the affordance attribution attached [140]
FIGURE 15 Mimetic communication model [147]
FIGURE 16 Key concept of a probabilistic movement primitives (ProMP)-based collaborative skill realisation: (top) single-task interactive ProMP training and inference key process; (bottom) multi-task mixture of interactive ProMP training and inference key process [148]. CHMM, continuous HMM; DHMM, discrete HMM; HMM, Hidden Markov model
FIGURE 17 A taxonomy of social learning effects [150]
GUAN ET AL.
In [157], an active sensing control method is proposed and tested in a planar simultaneous localisation and mapping (SLAM) application. In their work, an online gradient descent strategy is used to shape a B-spline into an optimal path for perception. The results showed that, compared with a straight path, following the computed optimal path reduces the maximum estimation uncertainty. This work is being extended to more complex scenarios [158]. As seen in the previous sections, adding perceptual coupling to the skill model makes the skill adaptive to changes. If better perception could be achieved by a robot as a skill in itself, robots would be more human-like, which may attract the attention of increasing numbers of researchers.
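A toy version of this 'set a goal from the current belief' loop: choose the next viewpoint that minimises the expected posterior entropy over the ball-versus-paper-disc hypotheses from the example above. The sensor models are invented for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_posterior_entropy(belief, liks):
    """Expected entropy of the posterior belief after observing from a
    viewpoint, where liks[h][o] = p(observation o | hypothesis h)."""
    total = 0.0
    for o in range(len(liks[0])):
        p_o = sum(belief[h] * liks[h][o] for h in range(len(belief)))
        post = [belief[h] * liks[h][o] / p_o for h in range(len(belief))]
        total += p_o * entropy(post)
    return total

# Belief over two hypotheses for a circular outline: a ball vs. a paper disc.
belief = [0.5, 0.5]
# Hypothetical sensor models: the frontal view cannot tell the shapes apart,
# while the side view is highly discriminative.
viewpoints = {
    "front": [[0.5, 0.5], [0.5, 0.5]],
    "side":  [[0.9, 0.1], [0.1, 0.9]],
}
best = min(viewpoints,
           key=lambda v: expected_posterior_entropy(belief, viewpoints[v]))
# The side view is selected: it is expected to leave the least uncertainty,
# mirroring how people move their viewpoint to disambiguate the object.
```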

| Learning from failure
Another interesting topic is learning policies from failures, which are often ignored in conventional LfD. Shiarlis et al. [159] proposed an algorithm called Inverse Reinforcement Learning from Failure (IRLF), based on IRL. Using IRLF, a policy can be derived from both successful and failed demonstrations, even if a failure trajectory is very similar to a successful one. Common LfD methods aim to maximise the similarity between the generalised trajectory and the demonstrations; when failures are available, robots can additionally minimise the possibility of reproducing such mistakes. Based on this idea, [160] proposed a framework in which a human demonstrates failed actions, followed by the robot's exploratory trials that make the policy diverge from those failures; the framework was tested on a real robot.
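The core idea, that failures carry a usable signal, can be sketched by scoring candidate trajectories with a bonus for similarity to successes and a penalty for similarity to failures. This is only a distance-based caricature of the idea, not the IRLF algorithm of [159]:

```python
def score(candidate, successes, failures, w_fail=2.0):
    """Score a candidate trajectory: reward similarity to successful
    demonstrations, penalise similarity to failed ones."""
    def sim(a, b):
        # Similarity as negative squared distance between 1-D trajectories.
        return -sum((x - y) ** 2 for x, y in zip(a, b))
    s = max(sim(candidate, d) for d in successes)
    f = max(sim(candidate, d) for d in failures)
    return s - w_fail * f

successes = [[0.0, 1.0, 2.0]]
failures = [[0.0, 1.1, 2.0]]   # a failure very similar to the success
candidates = [[0.0, 1.0, 2.0], [0.0, 0.9, 2.0]]
best = max(candidates, key=lambda c: score(c, successes, failures))
# The candidate shifted AWAY from the failure is preferred, even though both
# candidates are close to the successful demonstration.
```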

| Incremental learning from correction
As discussed above, an action may not be perfectly reproduced after the learning process, so a skill refinement process is very helpful. Even when a huge number of state-action pairs have been used for training, robots still produce errors due to the complexity and uncertainty of the environment; for example, an action may be generalised towards an unexpected goal. In [161], an interactive algorithm named confidence-based autonomy was proposed. This method uses the expert's corrective demonstrations to incrementally update the trained policy. Additionally, the corrective demonstration replaces the action that the policy would otherwise generalise under that specific state.
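A minimal sketch of the confidence-based autonomy loop, with a hypothetical nearest-neighbour policy and confidence measure (not the classifier of [161]): the robot acts when confident and otherwise requests a corrective demonstration, which immediately extends the policy.

```python
def predict_with_confidence(state, policy):
    """Nearest-neighbour policy over 1-D states: return (action, confidence),
    where confidence decays with distance to the closest stored state."""
    nearest = min(policy, key=lambda s: abs(s - state))
    conf = 1.0 / (1.0 + abs(nearest - state))
    return policy[nearest], conf

def act_or_request(state, policy, threshold=0.8, teacher=None):
    """Confidence-based autonomy (sketch): act autonomously when confident,
    otherwise request a corrective demonstration and learn it incrementally."""
    action, conf = predict_with_confidence(state, policy)
    if conf >= threshold:
        return action
    correction = teacher(state)   # expert's corrective demonstration
    policy[state] = correction    # incremental policy update
    return correction

policy = {0.0: "push", 1.0: "pull"}          # hypothetical state-action pairs
a1 = act_or_request(0.05, policy)            # near a known state: acts alone
a2 = act_or_request(3.0, policy,             # unfamiliar state: asks the
                    teacher=lambda s: "lift")  # expert and stores the answer
```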
However, there is another, rarely tried, route for correction learning: seeking changes in both low-level dynamics and high-level goals. A robot may continuously monitor its dynamic changes; if the human physically intervenes, most likely an error has occurred in either the low-level motion or the high-level decision, and refinement is required. By abstracting the human's intentions or preferences during the correction process and putting this knowledge to use, could the performance keep improving?
Very recently, a fascinating correction-based learning-from-demonstration framework was proposed [162], where the robot and the human share control of the motion. The human's correction is analysed to learn the human's preference. For example, the human may want the robot to hold fragile objects carefully (i.e. slow down the motion), which is not a predefined preference the robot would otherwise take into account. The robot needs to recognise its limited ability to explain the human's intention by lowering its confidence, and then reason about how to behave to meet the person's requirements, a problem potentially solved in [162,163].
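One hedged way to picture preference learning from physical correction: treat each correction as evidence about a scalar preference (here, a preferred carrying speed) and update it with a simple learning rule. This is an illustrative sketch, not the method of [162].

```python
def update_preference(theta, planned_speed, corrected_speed, lr=0.5):
    """Update a scalar 'preferred speed' parameter from a physical correction:
    the human pushing the arm to move slower is evidence that the preferred
    speed is lower than the robot currently assumes."""
    return theta + lr * (corrected_speed - planned_speed)

theta = 1.0  # robot's current preferred speed (hypothetical units)
# The human repeatedly slows the robot down while it carries a fragile object.
for corrected in [0.6, 0.5, 0.55]:
    theta = update_preference(theta, planned_speed=theta,
                              corrected_speed=corrected)
# theta has moved toward the demonstrated slower speed, so future plans will
# respect the inferred preference without further intervention.
```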

| Open issues
There remain many open issues in human-robot skill transfer that are difficult to answer or close. On a technical level, it is widely accepted that humans arrange tasks through high-level cognition and use low-level motor (muscle) memory to generalise certain behaviours. However, whether computers can have the same cognitive abilities as humans may remain controversial for a long time to come. Whether or not such abilities can exist, finding a generic framework or solution for arranging tasks and solving problems through cognition will require the community's joint efforts. An agent's emulation behaviour is in fact the production of a kind of 'creativity'; whether this kind of 'creativity' is necessary to serve the development of human society is also a very interesting topic. In terms of law and ethics, the skill transfer framework is based on the experience of one individual or a small number of individuals, and more investigation may be needed to establish whether it is suitable for real industrial applications.

FIGURE 18 Illustration of the use of active perception to obtain more information [153]

| CONCLUSION
A comprehensive review has been provided herein on the techniques used in human-robot skill transfer, to enable readers to capture the bigger picture of the issues they may meet in research. Both dynamic system-based models and probabilistic models can be used to model individual skill primitives, and different models admit different learning and generalisation techniques. To allow reactive and complex task performance, high-level reasoning is the most important aspect, and HMM-based methods have been shown to be very effective for high-level knowledge learning. Beyond that, active perception deserves further study, ideally with a compact computational model, to improve perception accuracy during collaborative skill performance. Affordance-based learning is then a useful tool for tackling motion/intention recognition and action planning at a high level. Finally, the importance of incremental learning for robot skill refinement has been emphasised. Based on the introduction given here, it is hoped that researchers will have a clear idea of the structure of robot skill learning and transfer, and will explore more potential topics to further extend robot capability.

ACKNOWLEDGEMENT
This work was partially supported by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/S001913.