Multimodal Learning‐Based Proactive Human Handover Intention Prediction Using Wearable Data Gloves and Augmented Reality

Efficient object handover between humans and robots holds significant importance within collaborative manufacturing environments. Enhancing the efficacy of human–robot handovers involves enabling robots to comprehend and foresee human handover intentions. This article introduces a human-teaching–robot-learning–prediction framework, allowing robots to learn from diverse human demonstrations and anticipate human handover intentions. The framework facilitates human programming of robots through demonstrations utilizing augmented reality and a wearable dataglove, aligned with task requirements and human working preferences. Subsequently, robots enhance their cognitive capabilities by assimilating insights from human handover demonstrations, utilizing deep neural network algorithms. Furthermore, robots can proactively seek clarification from humans via an augmented reality system when confronted with ambiguity in human intentions, mirroring how humans seek clarity from their counterparts. This proactive approach empowers robots to anticipate human intentions and assist human partners during handovers. Empirical results underscore the benefits of the proposed approach, demonstrating highly accurate prediction of human intentions in human–robot handover tasks.


Introduction
This collaboration allows robots to handle monotonous, repetitive tasks while reserving intricate, skillful work for human operators. [3-6] The exchange of objects between humans and robots (referred to as human-robot handovers) becomes pivotal in assembly activities involving human-robot collaboration. [7,8] Implementing human-robot handovers contributes substantially to time and labor savings during the assembly process, thereby augmenting overall efficiency. [9] In automobile assembly, for example, human workers spend considerable time and effort collecting and delivering parts or tools to their partners (or receiving parts or tools from partners and correctly placing them). [10] The integration of human-robot handovers significantly diminishes these time-related expenses and human effort, as robots assume the role of cooperative assistants, adeptly managing these straightforward yet time-intensive tasks. [11] Efficiently orchestrating the seamless and intuitive transfer of objects between a human and a robot necessitates meticulous coordination in both the spatial and temporal domains. [12,13] A fundamental component of this coordination lies in predicting human handover intentions. [14] This prediction process encompasses the sensing of human intentions and the subsequent forecasting of these intentions. Human intention sensing entails gathering human-centric information, while human intention prediction involves generating anticipatory outcomes for human intentions based on the acquired intention data. Recent research endeavors have delved into the sensing and prediction of human handover intentions.
In terms of research on human handover intention sensing, some scholars have explored the use of vision-based systems to discern human intention information. For instance, Rosenberger et al. achieved secure object-independent handovers using robotic vision to interact with previously unseen objects. [15] Their approach involved object detection through segmentation of the human hand utilizing an RGB-D camera. Consequently, they succeeded in facilitating object-independent handovers from humans to robots. Melchiorre et al. put forward a visual control architecture tailored for scenarios involving human-robot handovers. [16] To enable seamless handovers between humans and robots, they employed a 3D sensor system to predict the hand position of the human worker, subsequently adapting the robot's tool center point posture to match the human worker's hand pose. Ye et al. developed a visual system for establishing a benchmark to evaluate human-human handovers from a visual perspective. [17] Using Intel RealSense cameras, they recorded RGB-D videos of handover processes involving two humans, which could then be employed to predict human handover intentions. However, vision-based systems encounter challenges in cases involving occlusion. Certain researchers have explored physical contact-based methods to sense human handover intentions. For example, Alevizos et al. [18] utilized force and torque sensors to gauge the contact force and torque between robots and humans. They proposed a technique for predicting human motion intentions using the measured contact force, facilitating smooth and natural physical interactions. Wang et al. compiled a dataset containing haptic information for physical human-robot interactions. [19] By tracking 50 individuals using a tactile sensor, they captured 12 distinct types of tactile motions. This dataset enabled the prediction of human intentions during interactions with robots. Chen et al. [20] introduced a methodology for adjusting admittance factors based on human intentions. By incorporating a torque detector at the manipulator endpoint, they quantified interaction forces to anticipate human motion intentions. Yu et al. put forward a distinctive method for estimating human motion and impedance commands for controlled human-robot interactions. [21] They utilized angle and torque sensors to gather human motion and interaction force data, subsequently estimating human motion intentions and promoting smooth and natural interactions between humans and robots. Yu et al. introduced a novel methodology for human-robot co-carrying. [22] They used a combination of visual and force sensing to construct the human-robot co-carrying framework. Visual sensing is used to capture human motion and estimate the human's intended motion, while force sensing is used to measure the external force on the robot gripper and further estimate the human force and human motion intention. Nevertheless, these techniques require continuous physical contact between humans and robots throughout the handover process, potentially endangering humans. To address the limitations of methods based on vision and physical contact, wearable sensing technology has recently emerged in the context of human-robot handovers. For instance, Zhang et al. developed a human-robot cooperation platform on the basis of electromyography (EMG). [23]
On this platform, EMG sensors were utilized to capture EMG signals from the human forearm. They proposed an algorithm leveraging simultaneous localization and mapping to control robot motions based on the sampled EMG data, offering a safer and contact-free approach for orchestrating various motions during human-robot interactions. Sirintuna et al. harnessed a neural network model trained with EMG signals to direct a robot in executing particular tasks during human-robot interactions. [24] They positioned several wearable sensors on different arm muscles to capture EMG signals. Mendes et al. explored a deep-learning-based method for enhancing collaboration between humans and robots. [25] Their methodology involved wearable EMG sensors to capture surface EMG signals, effectively controlling a robot's movement during human-robot interactions. Cifuentes et al. introduced a wearable controller designed for tracking human motion during handovers between humans and robots. [26] This controller integrated data from a laser rangefinder and an inertial measurement unit (IMU) to develop a human-robot handover plan based on the collected movement data. Artemiadis et al. devised a user-robot control interface utilizing wearable sensors to detect EMG signals from human upper-limb muscles. [27] It is worth noting that many of these techniques primarily rely on single-modal perception methods, which may limit their versatility in various scenarios.
In terms of research on human handover intention prediction, Tanaka et al. proposed a motion planner tailored for human-robot handover applications, developing a method for forecasting human action trajectories within discrete workspaces using a Markov model. [28] Cohen et al. [29] used a Kinect camera to establish a "metaphor-free" interface and then introduced a technique for predicting human intentions based on a computational cognitive science framework. Song et al. [30] devised a probability-based visual framework to anticipate human intentions in the context of human-object interactions using visual frames. Hawkins et al. [31] put forth a graph-based model for probabilistically predicting human intentions, exemplified in a scenario where a robot assists its human counterpart in various tasks. Aplaza et al. introduced a distinctive attention-based deep learning algorithm incorporating context information as well as human intention to predict human body movements within human-robot handovers. [32] This model uses a multihead attention architecture, considering factors such as obstacle positions, the end effector, and human motion, and it yields predictions for both human intention and body motion. Mavsar et al. proposed an innovative technique to discern a human worker's intentions during collaborative interactions with a robot. [33] They designed two recurrent neural network (RNN) architectures capable of forecasting a worker's intention: in the first RNN architecture, hand locations are tracked using markers, while the second captures human motion from RGB-D videos. Liu et al. [34] proposed an approach for predicting the object handover point based on a model of human comfort. In their research, they developed a human comfort model by combining a joint-torque model and a joint-angle model. This model was integrated into a binary cost function for predicting human handover intentions within human-robot handovers. Yan et al. [35] introduced an online learning approach for forecasting human handover intentions in terms of task initiation and handover positions. Their method accommodates both regular and irregular mobility patterns of human caregivers through online updating of probabilistic models. Liu et al. [11] proposed an evolving hidden Markov method to incrementally learn human intentions. This method dynamically adjusts the model structure and parameters based on observed sequences, facilitating the recognition of novel human intentions. The capacity for incremental learning enhances the viability of this method in dynamic contexts with changing tasks. Huang et al. introduced the intention-tracking concept and implemented a platform capable of concurrently tracking human intentions across several hierarchical levels. [36] The overarching objective was to comprehend human interactions, enabling the robot to prevent collisions with humans, minimize disruptions, and assist humans in resolving issues for safe and effective work. However, it is important to note that many of these approaches entail manual annotation of human intention data, a process that can be labor-intensive and costly. Additionally, these methods predominantly rely on offline programming, which demands substantial human involvement when updating handover tasks. Moreover, these approaches may struggle to address scenarios wherein the robot encounters confusion regarding human handover intentions.
To summarize, the existing research on human handover intention sensing and prediction exhibits several limitations.
The predominant reliance on visual and vocal systems for collecting human intention information is susceptible to challenges such as occlusion and noise, which can impede accurate perception.
Current methods predominantly utilize single-modal perception techniques to gather information about human intentions. Compared with more comprehensive multimodal techniques, this approach results in a narrower range of predictable human intentions due to the lack of sufficient data.
While many studies have successfully predicted human intentions, these predictions often hinge on manually annotated human intention data. This annotation process can be labor-intensive. Only a handful of studies have explored automated methods for annotating human intention data in human-robot handover tasks.
Individual human workers tend to possess distinct handover preferences, necessitating swift robot programming adjustments to accommodate different workers or new tasks.
Limited attention has been given to scenarios wherein robots encounter confusion stemming from human handover intentions. For instance, during a human-robot handover involving multiple objects of varying geometric shapes, certain objects might share similar geometry but differ in weight. Current methods can only predict that the human intends to hand over a cylindrical object; discerning the weight of such an object remains unclear, which leads to confusion. (In this article, no quantifiable threshold is required to determine whether confusion exists; however, the prior knowledge that certain objects, e.g., cylindrical objects, share similar geometry but differ in weight must be provided in order to judge the existence of confusion.) The challenge lies in devising strategies for robots to mitigate this confusion, ensuring accurate interpretation of human intentions.
These drawbacks underline the need for further research to address these challenges and enhance the effectiveness of human-robot handover interactions.
To address these limitations, this study introduces an innovative framework that leverages a deep neural network (DNN) to capitalize on multimodal human handover intention data obtained from both wearable datagloves and augmented reality (AR) systems. The proposed framework facilitates seamless robot programming through multimodal handover demonstrations, where human work preferences and new task requisites are integrated. The robot subsequently enhances its cognitive capacity via the DNN-based model, enabling accurate prediction of human handover intentions. Recognizing the notion that humans proactively seek clarity when confronted with confusion, an active-intention inquiry approach was developed. This approach empowers the robot to actively request essential information from humans through an AR system whenever it encounters uncertainty, thus enhancing the precision of intention prediction. Based on the anticipated human handover intentions, the robot undertakes action and motion planning, encompassing tasks such as delivering objects to humans, receiving objects from humans, or adjusting its motion pattern during human-robot handovers. Practical experiments focusing on human-robot handover tasks were executed, and the results underscored the accuracy and efficiency of the proposed approach.
The key contributions of this work can be summarized as follows.
A pioneering approach is introduced that leverages natural multimodal cues, including human gestures and eye-gaze information, to enhance human-robot handover interactions. Each natural cue has its own advantages; by combining the benefits of both cues, the multimodal approach makes the handover more robust and accurate.
A novel interactive interface is developed, comprising a wearable dataglove and an AR system. The wearable dataglove can be worn on the human hand to obtain human gesture information, and the AR system can be worn on the human head to obtain eye-gaze information. Through this interface, the robot gains insights from human handover demonstrations and anticipates human handover intentions within human-robot handover tasks.
A comprehensive human-teaching-robot-learning-prediction (HTRLP) framework is proposed for human-robot handover tasks. This framework mainly includes a human teaching module, a robot learning module, and a human intention prediction module. This framework effectively reduces the need for manual robot programming and substantially enhances the efficiency of human-robot collaboration.
An innovative active intention inquiry approach is put forth for human-robot handovers. Drawing inspiration from common interpersonal dynamics, this approach empowers the robot to proactively request essential information from humans through an AR system when it faces uncertainty regarding human handover intentions. Thanks to this method, the robot gains a stronger ability to predict human handover intentions.
The structure of the HTRLP framework is detailed in Section 2. Sections 3 and 4 elaborate on a comprehensive HTRLP modeling approach that incorporates human teaching through multimodal human handover intention information, robot learning via human handover demonstrations, and human intention prediction utilizing a DNN. The study proceeds to showcase experimental assessments carried out to gauge the effectiveness of the proposed method, covered in Sections 5 and 6, respectively. Finally, the study's findings and implications are encapsulated in Section 7, providing a conclusive perspective on the research.

Overall Framework
The primary objective of this research was to empower robots to learn from human handover demonstrations, predict human handover intentions, and adaptively assist humans during handover tasks. The comprehensive framework is visually represented in Figure 1 and comprises four fundamental components: human teaching of robots using multimodal information, robot learning from human handover demonstrations through a DNN-based approach, prediction of human handover intentions, and the collaborative performance of humans and robots in joint tasks.
In scenarios where the robot needs to deliver or retrieve objects from humans, the ability to predict human handover intentions becomes crucial. To achieve this, prior to engaging in a human-robot handover within a novel task, the human imparts instruction to the robot by showcasing multiple handover demonstrations for each intended action. These demonstrations utilize a wearable dataglove system and an AR setup, tailored to individual working preferences.
During this teaching phase, quantifiable attributes of human handover intentions are acquired from multimodal information and subsequently integrated into the robot's learning algorithm. This process serves to enhance the robot's cognitive proficiency in comprehending human handover intentions.
Throughout the robot's learning phase, the gathered human handover information is subjected to a more intricate parameterization. Eye-gaze data, obtained via the AR system, is refined into learning objectives, while human gesture information from the wearable dataglove system is transformed into a knowledge set. These learning objectives serve as the training labels, while the knowledge sets act as the training features for the learning-based models. Subsequently, these training labels and features are input to the proposed DNN-based model, enabling the robot to establish a cognitive framework for comprehending human handover intentions. This process can be likened to an analogy where a child gains the ability to lift objects from observing an adult's actions, and this newly acquired cognitive capacity is subsequently utilized to predict different human handover intentions. In the context of human-robot handovers, variations in human experiences and task requisites necessitate a tailored approach. As different individuals possess distinct handover preferences and handover tasks exhibit varying demands, it is imperative to adapt the robot's cognitive framework accordingly. Traditional offline robot programming methods, which involve coding to establish cognitive abilities, become labor-intensive when updates are required. The proposed approach in this study, however, significantly mitigates this coding effort by enabling robot programming through human demonstrations.
Leveraging the acquired cognitive understanding of human handover intentions, the robot adeptly anticipates human intentions during the handover process. Subsequently, the robot devises action and motion plans in accordance with the prediction outcomes, fostering seamless collaboration with humans and facilitating the handover process. Furthermore, if alterations are made to the handover task midprocess, the proposed method can be effectively employed to instruct the robot in participating in the new task, streamlining adaptability and enhancing human-robot cooperation.

Human Handover Intentions Representation Using Multimodal Information
In the context of human-robot handovers, the ability to predict human handover intentions holds paramount importance for effective collaboration between humans and robots. Human handover intentions pertain to the desired actions the human intends for the robot to execute, such as "I want to give you an object" or "I require an object from you". To assess the efficacy and merits of the proposed method, a selection of nine distinct human intentions was made, as depicted in Figure 2. These nine intentions were identified following observations of five persons (three males and two females) engaging in handover tasks involving the exchange of objects with a robot. A thorough examination of their handover interactions revealed that these nine selected handover intentions were recurrently utilized throughout the human-robot handover process. These intentions include 1) giving a large and heavy object (solid), 2) giving a large and light object (hollow), 3) giving a middle object, 4) giving a small object, 5) moving upward, 6) moving downward, 7) moving far, 8) moving closely, and 9) needing. The difference between subfigures (a) and (b) of Figure 2 is that subfigure (a) represents the handover intention of "giving a large and heavy object (solid)", while subfigure (b) represents the handover intention of "giving a large and light object (hollow)". The objects in subfigures (a) and (b) have identical geometry but different weights: the object in subfigure (a) is heavy because it is solid, while the object in subfigure (b) is light because it is hollow. Predicting these two handover intentions requires multimodal information, including both geometry and weight information. We use these two objects to evaluate whether the robot can distinguish these two different handover intentions using our proposed method.
Various approaches exist for representing human intentions, often utilizing visual and vocal systems. However, visual systems can falter in cases of occlusion, while vocal systems can be ineffective amidst noise. Moreover, these methods often rely on single-modal interfaces for human-robot interaction, which may lack robustness during human-robot handovers. To enhance the resilience of the human-robot interface in such scenarios, this study innovatively integrates natural multimodal information, encompassing authentic eye-gaze and gesture cues, to formulate a new multimodal human-robot interactive interface.
In the context of human-robot handover tasks, natural eye-gaze information serves as a label for human intentions, while human gestures contribute the features characterizing these intentions. As illustrated in Figure 1, the augmented reality system (Microsoft HoloLens 2) facilitates the acquisition of natural eye-gaze information. In parallel, the wearable dataglove system is employed for the collection and processing of human gesture information. Hence, the human handover intention $I_H$ can be formalized by the intention labels $I_{AR}$ acquired from the eye-gaze information and the features $I_{WS}$ acquired from the human gesture information as

$$I_H = \{I_{AR}, I_{WS}\} \tag{1}$$

As shown, each human worker has personalized preferred manners for teaching the robot, and information on these manners is contained in $I_{WS}$. In addition, the learning objectives as well as the knowledge sets of the robot are contained in $I_H$. $I_H$ is used in the proposed DNN-based learning algorithm for teaching the robot to build its cognitive understanding of human handover intentions. The cognitive capacity developed through this process plays a pivotal role in predicting human handover intentions and governing the handover procedure. In practice, diverse individuals exhibit varying preferences when engaged in human-robot handovers. As outlined in Equation (1), these individualized preferences can be encompassed within the definition of $I_H$. This accommodation allows different human workers to instruct the robot based on their preferred methods of interaction. Furthermore, the utilization of multimodal information offers distinct advantages over relying solely on single-modal data. For instance, when confronted with multiple objects of identical geometry but differing weights, relying solely on hand gestures for recognition may prove insufficient. However, integrating additional sources of human handover intention data, such as eye-gaze information and interactive interfaces like AR systems, can bridge this gap. By displaying all the objects within the AR system, the robot can prompt the human to specify the object they wish to hand over, and the human's eye gaze can be used to select the desired object. This integration enables successful recognition of the specific object, illustrating the crucial role of multimodal information in facilitating effective communication between humans and robots. Sections 3.2 and 3.3 delve into the details of how the robot is instructed using this multimodal information.

Teaching Robots Using the Augmented Reality System
The process of teaching human handover intentions through eye gaze involved the utilization of a Microsoft HoloLens 2 AR headset to capture human eye-gaze information. [37,38] To instruct the robot that their intention involves interacting with a specific object, such as a red cylinder, as depicted in Figure 3a, the human fixes their gaze upon the red cylinder hologram and sustains this focus for a designated dwell time, thereby confirming the intention, as illustrated in Figure 3b. The sequential steps of how a human selects a red cylinder using eye gaze are outlined below. 1) Onset delay: Upon looking at the red cylinder, an immediate response is avoided to prevent overwhelming or frustrating the human. Instead, a timer is initiated to assess whether the human is purposefully fixating on the red cylinder or simply glancing over it. In this study, the onset time was set to 200 ms. If the onset time is too short, the human feels overwhelmed or frustrated; if it is too long, the human loses patience from waiting. Therefore, setting an appropriate onset time is essential. Through repeated tuning in the experiment, we found that 200 ms is a suitable onset time, so we set the onset time to 200 ms. 2) Start dwell feedback: For the HoloLens 2 system, the Eye Tracking API provides information about where and what the user is looking at as a single eye-gaze ray (gaze origin and direction). The user keeps looking at the target they would like to select and dwells for some time, which reveals where the user's visual attention is concentrated. Once it is determined that the human is deliberately concentrating on the red cylinder, feedback is provided to signify the initiation of dwell activation, enabling the human to recognize the commencement of the process. 3) Continuous feedback: As the human persists in maintaining their focus on the red cylinder, ongoing feedback is presented to remind them to keep their gaze fixed. For eye-gaze input, a partial circle gradually progresses to completion, capturing the human's visual attention. Upon the conclusion of the dwell period, an indicator of the final state (the complete circle) signifies the completion of the process. 4) Finish: If the human sustains their gaze on the red cylinder for an additional 800 ms, the dwell action concludes, and the red cylinder is chosen as the intended object for interaction.
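For clarity, the dwell-based selection procedure above can be summarized as a small state machine. The following is a minimal Python sketch, assuming hypothetical callback names for the AR-side feedback; it only reproduces the onset-delay and dwell timings described in the text and is not the actual HoloLens 2/Unity implementation.

```python
# Minimal sketch of the dwell-based gaze selection logic described above.
# The 200 ms onset delay and 800 ms dwell time follow the text; the class and
# callback names are hypothetical and do not correspond to the HoloLens 2 API.

import time

ONSET_DELAY_S = 0.20   # ignore brief glances shorter than this
DWELL_TIME_S = 0.80    # additional fixation time required to confirm selection


class DwellSelector:
    def __init__(self, on_start_feedback, on_progress, on_select):
        self.on_start_feedback = on_start_feedback  # show "dwell started" cue
        self.on_progress = on_progress              # update the partial circle
        self.on_select = on_select                  # target confirmed
        self._target = None
        self._gaze_start = None
        self._dwell_started = False

    def update(self, gazed_target, now=None):
        """Call every frame with the object hit by the eye-gaze ray (or None)."""
        now = now if now is not None else time.monotonic()
        if gazed_target != self._target:
            # Gaze moved to a new target (or away): reset the state machine.
            self._target, self._gaze_start, self._dwell_started = gazed_target, now, False
            return
        if self._target is None or self._gaze_start is None:
            return
        elapsed = now - self._gaze_start
        if not self._dwell_started and elapsed >= ONSET_DELAY_S:
            self._dwell_started = True            # onset delay passed: start dwell feedback
            self.on_start_feedback(self._target)
        if self._dwell_started:
            progress = min((elapsed - ONSET_DELAY_S) / DWELL_TIME_S, 1.0)
            self.on_progress(self._target, progress)  # drive the filling circle
            if progress >= 1.0:
                self.on_select(self._target)          # e.g., the red cylinder is selected
                self._target, self._gaze_start, self._dwell_started = None, None, False
```

In practice, such an update routine would be driven once per rendering frame by the AR application, with the gaze ray's hit target passed in each call.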
There are typically several handover intentions for a human-robot handover task. In this study, handover intentions expressed through the human eye gaze were parameterized using the target factor $I_{AR}$. The target factor $I_{AR}$ can be represented as follows.

$$I_{AR} = \{i^{1}_{AR}, i^{2}_{AR}, \ldots, i^{N}_{AR}\} \tag{2}$$

where $N$ represents the number of human handover intentions and $i^{n}_{AR}$ refers to the nth human handover intention target factor. $I_{AR}$ is used as the robot's learning objective in the proposed approach.

Teaching Robots Using the Wearable Dataglove
Human gestures play a crucial role in communicating human handover intentions. To effectively convey these intentions, a wearable dataglove was employed to capture information about human gestures, as depicted in Figure 4a,b.
The wearable dataglove is designed for easy wearability and removal. It incorporates six 9-axis IMUs, [39-41] with one IMU placed on the back of the hand and the remaining five IMUs positioned on the second segment of each finger. Each 9-axis IMU in the dataglove includes a 3-axis gyroscope, a 3-axis accelerometer, and a 3-axis magnetometer, combined with a proprietary adaptive sensor fusion algorithm to provide accurate pose estimation. The signal drift of the IMU is corrected in real time by a Kalman filter algorithm using the geomagnetic data. These IMUs facilitate the sensing of finger and hand postures, and their sensing data can be wirelessly transmitted via Bluetooth. [42,43] The sensing data acquired from each IMU comprise three-axis acceleration data as well as three-axis angular velocity data. [44] Through an internal data processing algorithm, these data were transformed into quaternions. [45] The quaternion of each IMU is represented as

$$u = [w, x, y, z] \tag{3}$$

where $w$, $x$, $y$, $z$ denote the components of the quaternion. The quaternion was determined using an established global coordinate system, as shown in Figure 4c. The IMUs of the dataglove sample at a frequency of 120 Hz. The six IMUs' quaternions are combined to represent human gestures as

$$u = [u_1, u_2, u_3, u_4, u_5, u_6] \tag{4}$$

where $u_i$ denotes the quaternion obtained by the ith IMU. Human gestures can be described using $u$. Additionally, $u$ can be input into the human handover intention prediction algorithm to determine the prediction results.
The handover intention data are expressed by the quaternions of the six IMUs, as described in Equation (4). For each IMU there are four elements (u.x, u.y, u.z, u.w), and for six IMUs there are 24 elements; therefore, each handover intention sample is a 24-D vector. The 24-D vector causes a significant computational burden. Therefore, we suggest a new approach for expressing the human intention in order to reduce the calculation load. It is widely acknowledged that the five fingers' bending angles vary across different gestures. Consequently, human handover intentions can be expressed through these fingers' bending angles. To ascertain these bending angles, a glove-hardware coordinate system was established. This system adopts a left-hand coordinate framework to define the coordinates of the glove hardware. In Figure 4c, the X-axis, Y-axis, and Z-axis are depicted by the red, green, and blue coordinate axes, respectively. The determination of the bending angle for each finger follows a specific procedure. First, we determine the angular displacement of the finger's second segment with reference to the wrist joint as [46]

$$\Delta u = u_{w}^{*} \otimes u_{f} \tag{5}$$

where $\Delta u$ denotes the angular displacement of the finger's second segment with reference to the wrist, $u_f = [u_f.w, u_f.x, u_f.y, u_f.z]$ denotes the quaternion representing the posture of the second segment of each finger, $u_w = [u_w.w, u_w.x, u_w.y, u_w.z]$ denotes the wrist's quaternion, $u_w^{*}$ is its conjugate, and $\otimes$ denotes quaternion multiplication. Second, for simplicity, we convert $\Delta u$ into Euler angles as follows [47]

$$\begin{aligned}
\varphi &= \arctan2\big(2(\Delta u.w\,\Delta u.x + \Delta u.y\,\Delta u.z),\; 1 - 2(\Delta u.x^{2} + \Delta u.y^{2})\big)\\
\theta &= \arcsin\big(2(\Delta u.w\,\Delta u.y - \Delta u.z\,\Delta u.x)\big)\\
\phi &= \arctan2\big(2(\Delta u.w\,\Delta u.z + \Delta u.x\,\Delta u.y),\; 1 - 2(\Delta u.y^{2} + \Delta u.z^{2})\big)
\end{aligned} \tag{6}$$

where $\varphi$, $\theta$, $\phi$ denote the rotation angles around the X-axis, Y-axis, and Z-axis, respectively. In this article, the X-axis is the only axis around which the finger can bend; thus, $\phi$ and $\theta$ are identically 0, and each finger's bending angle can be expressed by $\varphi$.
Ultimately, the human gesture is described as

$$\varphi = [\varphi_1, \varphi_2, \varphi_3, \varphi_4, \varphi_5] \tag{7}$$

where the bending angles of the thumb, index finger, middle finger, ring finger, and pinky finger are represented by $\varphi_1$, $\varphi_2$, $\varphi_3$, $\varphi_4$, $\varphi_5$, respectively. $\varphi$ stands for the five fingers' bending angles, which can be used to indicate human gestures. Figure 4d shows the calculation of the five fingers' bending angles. Consequently, the dimension of the human handover intention data decreased from 24 to 5, which reduced the amount of computation. In addition, the handover-intention information obtained from the wearable dataglove is characterized by

$$I_{WS} = \{i^{1}_{WS}, i^{2}_{WS}, \ldots, i^{M}_{WS}\} \tag{8}$$

where $i^{m}_{WS}$ is a five-element vector that represents the mth human handover intention expressed by human gestures and $M$ stands for the number of human handover intentions. $I_{WS}$ was used as the robot's knowledge set in the proposed method.
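To make the gesture pipeline concrete, the sketch below implements Equations (5)-(7) in Python. It assumes each IMU pose is available as a (w, x, y, z) unit quaternion; the relative-rotation convention (wrist conjugate times finger quaternion) is an assumption, since the commercial glove firmware's exact convention is not specified in the text.

```python
# Minimal sketch of the finger bending-angle computation in Equations (5)-(7).
# Assumes each IMU pose is given as a unit quaternion (w, x, y, z); the exact
# relative-rotation convention used by the glove firmware may differ.

import math

def quat_conjugate(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def quat_multiply(a, b):
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def bending_angle(finger_quat, wrist_quat):
    """Rotation of the finger's second segment relative to the wrist,
    reduced to the rotation angle about the X-axis (the phi component of Eq. (6))."""
    dw, dx, dy, dz = quat_multiply(quat_conjugate(wrist_quat), finger_quat)
    return math.atan2(2.0 * (dw * dx + dy * dz), 1.0 - 2.0 * (dx * dx + dy * dy))

def gesture_vector(finger_quats, wrist_quat):
    """Five-element gesture feature phi = [phi_1, ..., phi_5] (Equation (7))."""
    return [bending_angle(q, wrist_quat) for q in finger_quats]
```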

Learning and Predicting Human Handover Intentions Based on DNN
In our work, we propose an approach for predicting human handover intentions utilizing a DNN. The approach encompasses three main aspects. First, the creation of a human handover intention dataset is outlined. Subsequently, we elucidate the training process of a DNN for comprehending human handover intentions using the established dataset. Finally, we describe the utilization of the trained DNN model to predict human handover intentions.

Human Handover Intention Dataset Construction
In human-robot handover tasks, human intentions are labeled using natural eye-gaze information. Additionally, gesture information serves as a distinctive feature of human intention. As depicted in Figure 1, the acquisition and processing of natural eye-gaze information were facilitated through the AR system (Microsoft HoloLens 2 AR headset). Furthermore, the wearable dataglove system was employed for the acquisition and processing of human gesture information. Hence, the human handover intention $I_H$ can be formalized by the intention labels $I_{AR}$ acquired from the eye-gaze information and the features $I_{WS}$ acquired from the human gesture information as

$$I_H = \{I_{AR}, I_{WS}\} \tag{9}$$

Humans can use their own preferred manners to demonstrate handover intentions to the robot. We collected $N$ demonstrations to form a human handover intention dataset
$$I = \{(\varphi^{(n)}, l^{(n)})\}_{n=1}^{N} \tag{10}$$

where $I$ is used to train the DNN-based handover intention learning model in the following section, $N$ is equal to 4500, $\varphi = [\varphi_1, \varphi_2, \varphi_3, \varphi_4, \varphi_5]$ denotes the five fingers' bending angles, and $l$ denotes the human handover intention class.
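As an illustration of how the dataset in Equation (10) could be assembled from the demonstrations, a minimal sketch is given below. The class names and their integer encoding are illustrative assumptions; in the article, the labels come from eye-gaze selections in the AR system and the features from the dataglove.

```python
# Minimal sketch of how the demonstration dataset I in Equation (10) could be
# assembled. The label-encoding order is an illustrative assumption.

import numpy as np

INTENTION_CLASSES = [
    "give_large_heavy", "give_large_light", "give_middle", "give_small",
    "move_up", "move_down", "move_far", "move_close", "need",
]  # nine intentions, integer-coded 0..8

def build_dataset(demonstrations):
    """demonstrations: iterable of (phi, intention_name) pairs,
    where phi is a length-5 list of finger bending angles."""
    features, labels = [], []
    for phi, intention_name in demonstrations:
        features.append(phi)
        labels.append(INTENTION_CLASSES.index(intention_name))
    return np.asarray(features, dtype=np.float32), np.asarray(labels, dtype=np.int64)

# Example: 4500 demonstrations (500 per intention) would yield
# X.shape == (4500, 5) and y.shape == (4500,).
```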

DNN-Based Human Handover Intention Learning
The goal of learning human handover intentions is to create a model from the dataset $I$ described in Equation (10). This model enables the robot to accurately forecast human handover intentions. The problem at hand is a multiclassification problem, which can be addressed using various machine-learning methods such as support vector classification (SVC), k-nearest neighbor (KNN), and more. Among these, DNNs exhibit superior performance. [48] This data-driven method does not depend on expert knowledge and offers enhanced prediction through more data or network optimization. Thus, this study adopts a DNN-based approach to construct the human handover intention prediction model.
The goal of human intention prediction is to forecast the correct handover intention class $l$ from the input gesture information $\varphi$. According to the Bayes decision rule, the prediction result for the human handover intention can be acquired using the maximum a posteriori probability criterion. We define $P(l \mid \varphi)$ as the probability of the human handover intention prediction result. Then the prediction result can be obtained as

$$\hat{l} = \arg\max_{l} P(l \mid \varphi) \tag{11}$$

The key problem is how to model $P(l \mid \varphi)$. To solve this problem, in the following section we use a DNN to model $P(l \mid \varphi)$, and we then use the human handover intention dataset $I$ to train the model.

Model Structure
The structure of the human handover intention prediction model is demonstrated in Figure 5.
First, a multilayer neural network is applied to map the received human gesture information $\varphi \in \mathbb{R}^{M \times 1}$ to a D-dimensional vector $d$ as follows.

$$d = f(\varphi; \theta) \tag{12}$$

where $M$ equals 5, indicating the five fingers' bending angles, and $D$ equals 9, indicating the nine types of human intentions illustrated in Figure 2. The function $f$ is defined by the multilayer neural network structure and its parameters. We constructed a six-layer neural network containing one input layer, four hidden layers, and one output layer. Each hidden layer comprises six neurons, a count established through performance evaluation across trials, whereas the output layer contains nine neurons, aligning with the nine distinct human handover intentions. Each hidden layer activation $h_l$ is calculated as follows.

$$h_l = \sigma(W_l h_{l-1} + b_l) \tag{13}$$

where $\sigma$ is the rectified linear unit (ReLU) activation function. [49] Then, the vector $d$ is passed through a softmax activation function, and a D-dimensional output vector $y$ can be derived as

$$y_i = \frac{\exp(d_i)}{\sum_{j=0}^{D-1} \exp(d_j)} \tag{14}$$

Since $\sum_{i=0}^{D-1} y_i = 1$ and $y_i \geq 0$, $y_i$ can be treated as a probability. In this article, $y_i$ represents the modeled posterior probability. According to the designed DNN structure shown in Figure 5, $y_i$ can be formulated as

$$y_i = P(l = i \mid \varphi; \theta) \tag{15}$$

where $l$ represents the human handover intention class, $\theta = [W_1, b_1, \ldots, W_L, b_L]$ represents the weights and biases of each layer in the DNN, and $\varphi$ denotes the input five fingers' bending angles.
According to the Bayes decision rule, the human handover intention class $l$ can be determined by maximizing the posterior probability $y_i$ as follows.

$$\hat{l} = \arg\max_{i} y_i \tag{16}$$

Before Equation (16) is applied to predict a human intention in a human-robot handover task, we need to determine the parameter $\theta$. In the following section, we describe how to determine the parameter $\theta$.
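As a concrete illustration of the structure described above, the minimal Keras sketch below builds a network with five bending-angle inputs, fully connected ReLU hidden layers of six neurons each, and a nine-way softmax output. The number of hidden layers is left as a parameter because the article later compares several depths; this is a sketch under those assumptions, not the authors' exact implementation.

```python
# Minimal Keras sketch of the intention prediction network (Equations (12)-(15)):
# five bending-angle inputs, ReLU hidden layers with six neurons each, and a
# nine-way softmax output. The hidden-layer count is configurable.

import tensorflow as tf

def build_intention_model(n_hidden_layers=4, n_inputs=5, n_classes=9):
    inputs = tf.keras.Input(shape=(n_inputs,))            # phi: five bending angles
    x = inputs
    for _ in range(n_hidden_layers):
        x = tf.keras.layers.Dense(6, activation="relu")(x)  # h_l = ReLU(W_l h_{l-1} + b_l)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)  # y = softmax(d)
    return tf.keras.Model(inputs, outputs)
```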

DNN Model Training
To determine the parameter $\theta$, we use the human handover intention dataset $I$ constructed through the approach described in Section 4.1 to train the DNN model shown in Figure 5. The training process was as follows.
First, we selected the cross-entropy function as the loss function because the label $l$ in dataset $I$ is coded as a one-hot vector. The target of the model training is to minimize the cross-entropy function as follows.

$$L(\theta) = -\sum_{n=1}^{N} \sum_{i=0}^{D-1} l_i^{(n)} \log y_i^{(n)} \tag{17}$$
Then, we adopt the gradient descent approach to iteratively update the parameter $\theta$ by

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) \tag{18}$$

where $\eta$ denotes the learning rate, which controls the iteration step size. Here, we set the value of $\eta$ to 0.001. When the minimum value of the cross-entropy function (Equation (17)) satisfies the accuracy requirements, the model training ends. The trained model is then applied to predict human handover intentions.
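The training procedure can be sketched in Keras as follows. The learning rate of 0.001 and the cross-entropy loss come from the text; because the article also reports sparse categorical cross-entropy, integer class labels are assumed here, and the plain SGD optimizer, validation split, and batch size are illustrative choices rather than reported settings.

```python
# Minimal training sketch: gradient-descent updates with a learning rate of
# 0.001 and a cross-entropy loss (Equations (17)-(18)). Optimizer variant,
# validation split, and batch size are illustrative assumptions.

import tensorflow as tf

def train_intention_model(model, X_train, y_train, epochs=100):
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),  # eta = 0.001
        loss="sparse_categorical_crossentropy",                   # Eq. (17) with integer labels
        metrics=["accuracy"],
    )
    history = model.fit(
        X_train, y_train,
        validation_split=0.1,  # hold out part of the demonstrations for validation
        epochs=epochs,
        batch_size=32,
    )
    return history
```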

Human Handover Intention Prediction Using DNN
Based on the trained model, the robot is able to collaborate with its human partner in handover tasks. When a human expresses a handover intention, the newly acquired handover intention information $\varphi$ is input to the trained model. Then, the prediction result for the handover intention can be evaluated as

$$\hat{l} = \arg\max_{i} y_i = \arg\max_{i} P(l = i \mid \varphi; \theta) \tag{19}$$

Therefore, leveraging Equation (19), the robot is able to predict human intentions and cooperate with its human partner on human-robot handover tasks.
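At run time, Equation (19) amounts to a single forward pass followed by an argmax, as in the short sketch below; the trained model is the one produced by the earlier illustrative sketches.

```python
# Minimal inference sketch for Equation (19): feed the current glove reading
# (five bending angles) to the trained model and take the argmax class.

import numpy as np

def predict_intention(model, phi):
    """phi: length-5 list of finger bending angles from the dataglove."""
    probs = model.predict(np.asarray([phi], dtype=np.float32), verbose=0)[0]  # y_0..y_8
    class_id = int(np.argmax(probs))           # Equation (19)
    return class_id, float(probs[class_id])    # predicted intention and its probability
```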

Experimental Platform
As shown in Figure 6, the experimental setup encompassed an operator station, an engineer station, a six-degree-of-freedom robot, and multimodal interactive interfaces, namely the AR system and the wearable dataglove system. The human-robot interface was constructed utilizing a wearable dataglove and the Microsoft HoloLens 2 AR headset, as detailed in Section 3. The operator station has an Intel Core i5-8500U CPU with 16 GB of RAM. This station serves as the nexus for amalgamating the multimodal interactive information exchanged between the human and robot. Additionally, the DNN-based prediction algorithm proposed in this article was executed at the operator station. The engineers crafted the algorithms for data reception and processing, human handover intention prediction, and robot motion control using programming languages such as C++, C#, and Python. These algorithms were developed using programming interfaces including Visual Studio 2017, Unity, and ROS. The study was conducted in accordance with the Declaration of Helsinki and approved by the Medical Ethics Committee of Harbin Institute of Technology.

Human Handover Intentions
We engaged five participants (comprising two females and three males) in executing handover tasks involving the transfer and retrieval of objects to and from a robot. Through an examination of handover interactions, we identified three frequently observed human intentions: "Giving", "Needing", and "Motion mode adjustment". The "Giving" intention denotes the human's desire to offer an item to the robot, while the "Needing" intention signifies a human's request for assistance from the robot. Furthermore, the "Motion mode adjustment" intention emerges when a human seeks to modify the robot's motion during the handover process. This adjustment encompasses variations like "Moving upward", "Moving downward", "Moving closely", and "Moving far". These modifications dictate whether the robot should ascend, descend, approach closely, or distance itself from the human during handover. This study recognizes nine distinct human intentions, depicted in Figure 2. Notably, the proposed approach holds the potential for extension beyond these nine intentions, thereby evading limitations in scope.

Handover Task Description
To validate the efficiency of the proposed method, a series of human-robot handover experiments were conducted. Prior to commencing these experiments, participants were provided with a concise tutorial outlining the usage and procedural steps of our approach. In the initial experiment, the aim was to meticulously capture human handover intentions during the instruction phase. Participants were instructed to execute hand gestures while wearing the dataglove, followed by directing their gaze toward the corresponding human intention icon in the AR system. This approach resulted in the accumulation of 4500 sets of human handover intention features. Notably, individuals possess the freedom to teach the robot handover intentions using multimodal information in their preferred manner. In the subsequent experiment, participants randomly selected objects and then communicated the "Giving" intention to the robot. The objects designated for handover were segregated into two distinct groups: one group comprising objects with varying geometries and weights and another group with identical geometries but varying weights. For the former group, the prediction of the object intended for handover was facilitated using human gesture information; notably, human gestures differ based on object geometries. However, for the latter group, utilizing human gesture information alone proved insufficient for object prediction. This limitation emerged because hand gestures remained consistent for objects with identical geometries but differing weights. Consequently, a combination of human gesture information and human eye-gaze information was employed to predict the object intended for handover. We randomly selected the large heavy object and the large light object to perform this kind of prediction. However, our method is not limited to large objects with different weights; other objects, such as middle or small objects, can also be used for handover intention prediction. In the third experiment, our investigation incorporated four distinct "Motion mode adjustment" intentions to assess the predictive capability of the robot. These intentions include the "Moving upward", "Moving downward", "Moving closely", and "Moving far" intentions. In the fourth experiment, the "Needing" intention was also expressed to the robot to identify whether the robot can forecast the human's intention. The proposed approach accommodates diverse human workers, enabling them to instruct robots according to their unique preferences for various handover tasks. This inclusivity stems from the inherent attributes of personalization and customization embedded within the proposed approach. The system proposed in this article is general and is not limited to the four objects displayed in the article; it can be used for many kinds of objects. As long as the geometry and weight of the objects differ, our system can distinguish them based on the proposed multimodal learning-based handover intention prediction method. The human gesture information is used to distinguish objects with different geometries, and the eye-gaze information can further be used to distinguish objects with the same geometry but different weights. The combination of different sensing information enables our system to be widely used for many kinds of objects.

Results and Discussions
In our work, the human handover intentions contain nine subclasses: "Giving the large and heavy object", "Giving the large and light object", "Giving the medium object", "Giving the small object", "Moving upward", "Moving downward", "Moving closely", "Moving far", and "Needing". As described in Section 5.3, for "Giving" intentions, objects with different geometries and weights can be recognized using human hand gestures. However, objects with the same geometry but different weights cannot be recognized using human hand gestures alone. Therefore, these objects need to be recognized by fusing information from human gestures and eye gaze, employing the wearable dataglove and the AR system together. To develop the robot's cognitive capacity for comprehending human handover intentions using DNN algorithms, a collection of 4500 sets (with 500 sets allocated to each intention) of human handover intention features were garnered from these human handover demonstrations. These demonstrations were conducted sequentially, progressing through the sequence (a) to (i) as depicted in Figure 7.
In our work, TensorFlow together with Keras, a commonly used deep learning framework, was employed to train the DNN models. [50] This framework offers rapid testing of diverse DNN models. Deep learning algorithms encompass multiple hyperparameters that influence their behavior and performance. As the number of hidden layers and neurons increases, the search space for hyperparameters expands considerably. Consequently, discovering an optimal hyperparameter combination is a time-intensive undertaking. Various hyperparameter optimization algorithms are available to address this task; however, hyperparameter optimization remains a developing field and lies beyond the scope of this research. In our work, we concentrated on investigating the depth of the DNN, i.e., the number of hidden layers. This hyperparameter significantly impacts the DNN's behavior. The proposed DNN parameter selections are outlined in Table 1. We assessed the influence of increasing DNN depth on its prediction performance for human handover intentions. Following the parameters in Table 1, initial training was executed using a single hidden layer. Subsequently, we evaluated the DNN's estimation error in predicting human handover intentions. The assessment relied on the sparse categorical cross-entropy metric to gauge DNN performance. The identical procedure was followed as the DNN depth increased.
Figure 8 illustrates the learning curves plotted throughout the training phase. Notably, the validation loss consistently diminished from epoch 0 to 100 for all DNN models. After epoch 20, all DNN models, except the two-hidden-layer DNN, exhibited a plateau in validation loss until the training's completion. The efficiency of the proposed DNN was assessed using the constructed human handover intention training dataset. A comparison of the sparse categorical cross-entropy values for all DNN models is illustrated in Figure 9. The sparse categorical cross-entropy data displayed a "U-shaped" trend with increasing hidden layers. Among the DNN models, those with two and eight hidden layers showed the least favorable performance, while the model with six hidden layers exhibited the most optimal results. Thus, a DNN comprising six hidden layers was selected for further utilization.
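Under these assumptions, the depth study can be reproduced with a short loop such as the one below; it reuses the hypothetical build_intention_model and train_intention_model helpers from the earlier sketches, and the depth range and selection criterion (validation loss) are illustrative.

```python
# Minimal sketch of the depth study: train models with an increasing number of
# hidden layers and compare their validation loss (sparse categorical
# cross-entropy). Reuses the earlier illustrative helper functions.

def sweep_hidden_layers(X_train, y_train, max_depth=8, epochs=100):
    results = {}
    for depth in range(1, max_depth + 1):
        model = build_intention_model(n_hidden_layers=depth)
        history = train_intention_model(model, X_train, y_train, epochs=epochs)
        results[depth] = min(history.history["val_loss"])  # best validation loss at this depth
    return results  # e.g., pick the depth with the lowest validation loss
```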

Human-to-Robot Handover
Figures 10 and 11 depict the progression of human-to-robot handover interactions. In this study, we defined a set of predefined actions for the robot. These actions serve as the foundation for the robot to plan its corresponding movements and align with the predicted human handover intentions. The outcome of the human intention prediction triggers the action mapping. Subsequently, the relevant control signals are dispatched to the robot's controller, directing its actions and facilitating interaction with the human during handover tasks. For instance, if the robot anticipates the intention "Give the small object", it adjusts its proximity to the human, accepts the small object, and positions it accordingly. As shown in Figure 10, the initial two images portray the human preparing to hand over a small object to the robot. The robot then effectively predicts the human's handover intention by analyzing human gesture information through the proposed approach, as depicted in Figure 10(3). Upon prediction, the robot commences the process of receiving the object, as shown in Figure 10(3)-(4). Subsequently, it positions the object correctly, as illustrated in Figure 10(5)-(6), before returning to its original location, as depicted in Figure 10(7)-(8). Furthermore, as indicated in Figure 11(1) and (2), the human presents the robot with a large and heavy object. Utilizing human hand gesture information, the robot can infer whether the object is large, yet it cannot ascertain its weight, as evident in Figure 11(3) (larger images are shown in Figure 12a). This situation led to confusion for the robot. To resolve this, the robot utilized the AR system: it chose to ask the human about the weight of the object, as seen in Figure 11(3). Through the AR system, the robot presented icons representing the "heavy" and "light" options. The human's choice was conveyed through their gaze fixation and its duration, as depicted in Figure 11(4) and more clearly shown in Figure 12b. Subsequently, the robot accurately interpreted the human's intention, as evident in Figure 11(5). It proceeded to pick up the heavy object and then place it correctly, as shown in Figure 11(6)-(8). The robot then returned to its initial position, marking the completion of the interaction, as depicted in Figure 11(9) and (10).
These processes demonstrate that through the utilization of multimodal information, the robot adeptly predicts human intentions using the proposed approach.Subsequently, it effectively collaborates with the human to successfully execute the entire human-to-robot handover task.
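To illustrate the intention-to-action mapping and the active intention inquiry described above, a minimal sketch is given below. The robot command names and the AR query helper are hypothetical placeholders, not an actual robot or HoloLens API; the ambiguity rule reflects the fact that the two "large object" intentions share the same hand gesture.

```python
# Illustrative sketch of the intention-to-action mapping and the active
# intention inquiry. Robot command names and the ar_query helper are
# hypothetical placeholders.

# The two "large object" intentions share the same hand gesture, so a
# gesture-only prediction of either one is treated as ambiguous.
AMBIGUOUS_CLASSES = {"give_large_heavy", "give_large_light"}

ACTION_MAP = {
    "give_large_heavy": "receive_heavy_object",
    "give_large_light": "receive_light_object",
    "give_middle":      "receive_middle_object",
    "give_small":       "receive_small_object",
    "move_up":          "raise_gripper",
    "move_down":        "lower_gripper",
    "move_close":       "approach_human",
    "move_far":         "retreat_from_human",
    "need":             "fetch_requested_object",
}

def select_action(predicted_class, ar_query):
    """Map a predicted intention to a robot action, asking the human via the
    AR system when the gesture-based prediction is ambiguous."""
    if predicted_class in AMBIGUOUS_CLASSES:
        weight = ar_query("Is the object heavy or light?", options=["heavy", "light"])
        predicted_class = "give_large_heavy" if weight == "heavy" else "give_large_light"
    return ACTION_MAP[predicted_class]
```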

Adjustment of Robot Motion Mode During Handover
We outline an experiment involving the adjustment of a robot's motion mode based on various human intentions. Over the course of the human-robot handover process, the human communicated four intentions: "move upward", "move downward", "move closely", and "move far". Employing our method, the robot predicted these human handover intentions and then adapted its own motion mode accordingly. Figure 13 illustrates the utilization of the proposed method to modify the robot's motion mode. As depicted in Figure 13a-(1),(2), when the human signifies the "Move upward" intention, the robot elevates its gripper. Conversely, Figure 13a-(3),(4) showcase the robot lowering its gripper in response to the human's "Move down" intention. Shifting focus to Figure 13b-(1),(2), it is evident that the robot moves closer to the human upon the expression of the "Move close" intention. Furthermore, upon the human conveying the "Move far away" intention, the robot retreats from the human's position, as indicated in Figure 13b-(3),(4). These images effectively highlight the robot's precise prediction of human intentions and its subsequent adjustment of motion mode using the proposed method.
The "adjustment of robot motion mode" can be used to manually guide robot motion such as guiding the robot to avoid obstacles during the handover process (e.g., if the external visual perception system fails due to occlusion, the robot loses its ability to localize and autonomously avoid obstacles and therefore needs to be manually guided to avoid obstacles).In the following, we give an example scenario during the robot-to-human handover task.As shown in Figure 14, during the process of robot-tohuman handover, there is a box obstacle in front of the robot.The human guides the robot across the obstacle through the adjustment of the robot's motion mode.First, as depicted in Figure 14(1), the human conveys the "Move up " intention, the robot elevates its gripper over the top of the obstacle.Then the human signifies the "Move close" intention as shown in Figure 14(3), and the robot surmounts the obstacle and moves closer to the human.Next, the human conveys the "Move down" intention to guide the robot to lower its gripper to the appropriate handover locations, as indicated in Figure 14(5).Finally, the human picks up the object from the robot gripper, as shown in Figure 14 (6).

Robot-to-Human Handover
Figure 15 illustrates the robot-to-human handover process. Initially, the human communicates a "Need" handover intention to the robot, as shown in Figure 15(1) and (2). The proposed approach utilizes data from the wearable dataglove system to forecast the human handover intention. As portrayed in Figure 15(3), the robot, upon receiving the prediction result, employs the AR system to pose the question, "What kind of object do you want?" This interaction is executed through a holographic interface. Subsequently, the human's intention, which involves the need for a large red object, is conveyed by gazing at the corresponding icon in the hologram, as depicted in Figure 15(4), with a more detailed view provided in Figure 16b. Upon accurate forecasting of the human intention, as indicated in Figure 15(5), the robot proceeds to retrieve a large and heavy object and hands it over to the human. This sequence of events is captured in Figure 15(6)-(9). Following the human's receipt of the object, the robot returns to its starting position, as depicted in Figure 15(10) and (11). The robot's precise prediction of the human intention is evident through the utilization of the proposed method, leveraging multimodal information. This collaboration between the human and robot culminates in the successful completion of the robot-to-human handover task.

Evaluation of Prediction Performance
We assessed the predictive performance for the nine distinct human handover intentions using a range of methods. These methodologies encompass KNN, [51] support vector machine (SVM), [52] and linear discriminant analysis (LDA). [53] Each algorithm underwent evaluation on the same operating station. To verify the efficiency of the proposed method, we asked five participants (three males and two females, aged between 25 and 33) to collaborate with a robot to perform handover tasks. A total of 4500 sets (comprising 500 sets for each human intention, with 9 human intentions in total) of human handover intention features were sampled and employed for the assessment of prediction performance. To objectively evaluate the applicability of these approaches for human intention prediction, a cross-validation technique was adopted. [54] Utilizing empirical estimations, the dataset was partitioned into ten equally sized segments. [55] Within this framework, one subset was designated as the validation set, and the remaining nine subsets were amalgamated to form the learning set. This cross-validation procedure was iterated ten times, with each subset utilized as validation data once. The accuracy assessment, detailed in Table 2, was achieved by averaging the prediction results across the validation folds for each distinct human handover intention.
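A minimal sketch of the 10-fold cross-validation comparison is shown below, using scikit-learn baselines (KNN, SVM, LDA) on the five-angle feature matrix and nine-class labels. The baseline hyperparameters are not reported in the article, so defaults are used here; the proposed DNN would be evaluated with the same folds by wrapping the earlier training sketch.

```python
# Minimal sketch of the 10-fold cross-validation comparison of baseline
# classifiers. X holds the five bending angles per sample, y the nine
# intention labels; classifier hyperparameters are illustrative defaults.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def compare_baselines(X, y):
    baselines = {
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "LDA": LinearDiscriminantAnalysis(),
    }
    results = {}
    for name, clf in baselines.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")  # 10-fold CV
        results[name] = (scores.mean(), scores.std())
    return results  # average prediction accuracy and its spread per method
```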
The average prediction accuracy (APA) results for the nine human intentions are depicted in Figure 17a, while the standard deviations of prediction errors (StD-E) for these intentions are shown in Figure 17b. The histogram portrays APA outcomes for the methods as 99.3%, 87.2%, 77.7%, and 89.2%, respectively. It is evident that our method predicts all human intentions with higher accuracy than the other methods. Furthermore, Figure 17b shows that the StD-E value generated by our method is about 0.0060, significantly lower than those of SVM (0.078), KNN (0.062), and LDA (0.285). This disparity underscores the stability of the proposed method in human handover intention prediction.
We compared our method's performance with that of several previous approaches that also address the understanding and prediction of human intention. These previous approaches are detailed in refs. [30,56-58]. Our method uses a DNN model, whereas the previous approaches include self-organizing maps with Gaussian mixture models (SOM with GMM), the intention-driven dynamics model (IDDM), a Bayesian network (BN), and hidden Markov models with SVM (HMM with SVM). For our method and these previous approaches, we use the APA as the indicator to evaluate performance; here, the APA refers to the success rate of human-robot interaction tasks. As shown in Table 3, our method achieves higher accuracy than the previous methods.

Conclusion
This study introduces a novel HTRLP framework for robots to acquire insights from human multimodal demonstrations and anticipate human handover intentions. Diverging from existing methods, our framework offers simplicity in robot programming through multimodal handover demonstrations facilitated by an augmented reality system and a wearable dataglove. This approach aligns with task requirements and human preferences. Our framework employs a DNN algorithm to facilitate robot learning from human multimodal demonstrations, fostering the enhancement and adaptation of a robot's cognitive capability for deciphering human handover intentions. Furthermore, an active intention inquiry technique is embedded in this framework, enabling the robot to proactively seek necessary information from humans when faced with confusion, analogous to how humans seek clarification from their partners. The proposed framework empowers robots to anticipate human intentions actively, contributing to collaborative efforts in handover tasks. Empirical results underscore the advantages of our approach in accurately predicting human handover intentions within human-robot handovers.

Figure 2. Human handover intentions. a) Giving the large and heavy object (solid), b) Giving the large and light object (hollow), c) Giving the middle object, d) Giving the small object, e) Moving upward, f) Moving downward, g) Moving far, h) Moving closely, and i) Needing.

Figure 3. Teaching the robot using the augmented reality system. a) The process of teaching human handover intentions through eye gaze using a Microsoft HoloLens 2 augmented reality (AR) headset. b) The sequential steps of how a human selects a red cylinder using eye gaze.

Figure 7 illustrates the process of the robot learning from human handover demonstrations. In alignment with the proposed approach, humans convey distinct handover intentions to the robot by blending eye gaze and gestures in accordance with their individual preferences. The collection of multimodal handover information is integral to this learning process for the robot.

Figure 7. Learning from human handover demonstrations. a) Giving the large and heavy object (solid), b) Giving the large and light object (hollow), c) Giving the middle object, d) Giving the small object, e) Moving upward, f) Moving downward, g) Moving far, h) Moving closely, and i) Needing.

Figure 8. Learning curves of all tested DNN models during training.

Figure 9. Sparse categorical cross-entropy for all tested DNN models using the human handover intention dataset.

Figure 11. Process of human-to-robot handover. The human handover intention is "Give the large and heavy object to the robot".

Figure 10. Process of human-to-robot handover. The human handover intention is "Give the small object to the robot".

Figure 13. Adjustment of the robot motion mode in accordance with the human intentions. a) Move upward and downward. b) Move closely and far away.

Figure 15. Procedure of the robot-to-human handover. The human intention is "Need a large heavy object".

Figure 14. Example scenario. The "adjustment of robot motion mode" can be used to manually guide robot motion, such as guiding the robot to avoid obstacles during the handover process.

Figure 17. Results of comparing different approaches in predicting human handover intentions. a) APA. b) StD-E.

Table 1. Hyperparameter values of the proposed DNN.

Table 2. Prediction accuracy of each human handover intention using different approaches.

Table 3. Comparative results of our method and previous approaches.

Study | Sensing system | Method | APA
Song et al. [30] | Kinect vision system | SOM with GMM | 93.5%
Wang et al. [56] | Prosilica vision system | IDDM | 83.8%
Bu et al. [57] | EMG sensors | BN | 86.4%
Wang et al. [58] | IMU and EMG sensors | HMM with SVM | 92.91%