Multimodal Human–Robot Interaction for Human‐Centric Smart Manufacturing: A Survey

Human–robot interaction (HRI) has grown rapidly in prominence in recent years, and multimodal communication and control strategies are needed to guarantee a safe, efficient, and intelligent HRI experience. Despite the considerable focus on multimodal HRI, comprehensive reviews that delineate the various modalities and analyze their combinations in depth remain scarce, limiting holistic understanding and future advancements. This article aims to bridge this gap by conducting an in-depth exploration of multimodal HRI, concentrating on four principal modalities: vision, auditory and language, haptics, and physiological sensing. An extensive review encompassing algorithms, interface devices, and applications forms part of this discourse. The article distinctively connects multimodal HRI with cognitive science, probing the three dimensions of perception, cognition, and action to demystify the algorithms intrinsic to multimodal HRI. Finally, it highlights the remaining challenges and outlines future directions for multimodal HRI in human-centric smart manufacturing.

With applications sprawling across sectors such as healthcare, tourism, hospitality, surgery, construction, agriculture, and smart manufacturing, the focus of HRI pivots based on the unique demands of each field.
HRI plays a cardinal role in service and social robots within the healthcare, tourism, and hospitality sectors. Healthcare positions social robots as surrogate family members and caregivers, extending support to patients with dementia, autism, and other mental or physical conditions. [8] Concurrently, the tourism and hospitality spheres leverage service robots to enhance customer experiences. [9] Conversely, fields like surgery, construction, agriculture, and smart manufacturing primarily utilize HRI for human-robot collaboration (HRC) and collaborative robots. The surgical field has witnessed the advent of surgical robots, which offer heightened precision and spatial dexterity and facilitate safer, quicker, and less invasive interventions with improved surgical outcomes. [10] In construction, collaborative robots execute tasks like excavation and assembly, merging worker expertise with autonomous efficiency. [11] Agriculture utilizes HRI for targeted crop recognition, aiming to enhance agricultural adaptability and efficiency. [12] In smart manufacturing, the aim is to integrate AI in a human-centric way to improve capabilities and create personalized products, a step toward Industry 5.0. [2] This paper focuses on industrial HRI, specifically multimodal technologies. Human-centric smart manufacturing (HSM) is one of the most important characteristics of Industry 5.0: it places the human at the center of the production system, improves worker health and safety through a synergistic combination of humans and machines, supports individual needs as well as factory requirements for flexibility, agility, and robustness, focuses on human needs and interests, shifts from a technology-centric to a human-centric and societal-centric approach, and leverages AI methodologies to support decision-making that improves system performance and human well-being. [1] In this context, collaborative intelligence is essential for HSM. [2] It necessitates leveraging the capabilities of both humans and robots to establish smart manufacturing systems that prioritize human needs, and HRI is the pivotal factor in achieving this outcome. Human beings possess qualities such as leadership, creativity, versatile problem-solving, and decision-making abilities. Meanwhile, robots excel in speed, endurance, quantitative accuracy, scalability, and a vast knowledge base. In a cohesive human-robot system, humans serve as trainers and commanders of robots, playing the role of creators and decision-makers, while robots enhance human cognitive skills and physical capabilities, resulting in a mutual increase in capabilities through HRI that enables them to accomplish complex tasks together. [2,13]

Human-Robot Interaction with Cognitive Intelligence
Industrial HRI usually occurs when human operators and robots share a workstation and communicate. [14] The goal of HRI research is to boost production efficiency and ease the workload by creating cooperative robots that fuse their abilities with human skills. [3] According to the Institute of Electrical and Electronics Engineers (IEEE), robots are autonomous machines capable of real-world perception, computation, and action. [15] Commonly employed robots for interaction and collaboration include robotic arms, automated guided vehicles (AGVs), and chatbots. With progress in areas like AI, soft materials, and bioelectronics, the popularity of HRI is surging, drawing numerous researchers from related fields.
In cognitive science, human intelligence is broadly classified into perception, cognition, and action, [16] as shown in Figure 1. Perception involves sensing the environment, humans, and objects and interpreting sensor data. This process encompasses gathering information via sensory modalities like vision, audio, and touch and processing it for recognition and prediction. Cognition involves reasoning, making recommendations based on perceived information and assigned goals, and communicating with others. This includes procedures such as planning, problem-solving, and learning. Action involves physically interacting with others and the environment and executing tasks, based on perceptual and cognitive abilities. This includes object manipulation, collaborative tasks, and physical actions. Equipping robots with maximal degrees of these intelligence types will edge us closer toward more natural HRI.

Multimodal Communication and Control for Natural HRI
The three levels of natural HRI involving cognitive robot intelligence all mandate multimodal information analysis, as shown in Figure 1. At the robot perception level, multimodal sensing and interpretation complement each other, providing more accuracy and robustness than single-modality perception. At the cognition level, multimodal information processing enhances robots' comprehension of their environment and human behavior, making task planning and execution more proficient; it also improves the efficiency and efficacy of human-robot communication. Finally, multimodal robot control at the action level ensures human interaction safety and enhances ergonomics. Overall, multimodal technology is crucial for natural HRI and for advancing HSM. HRI with multimodal communication and control methods represents a crucial foundation for future HRI development, supporting a safe, natural, efficient, and intelligent human-robot collaborative system.
In this article, modalities are classified into four categories: vision, auditory and language, haptics, and physiological sensing, as shown in Figure 2. Of the classical five human senses, only sight, hearing, and touch are considered here because of their broad usage in robotics technology. In addition, given the extensive role of bioelectronics in sensing and analyzing human physiology, as well as its essential role in communication between humans and robots, physiological sensing is regarded as a critical modality.

Literature Review Process
To facilitate efficient search and collection of research papers, Web of Science (WOS) (https://webofscience.com) and Scopus (http://www.scopus.com/) are selected due to their comprehensive representation of high-quality peer-reviewed publications in the engineering field.
Over 30 top journal publications are collected based on their relevance to multimodal and HRI topics, rather than searching all publications. The top five publications in categories like computer science, engineering, and multidisciplinary are selected following the Journal Citation Reports in Web of Science. These categories cover diverse research areas, including information science, automation, bioinformatics, imaging science, industrial manufacturing, electrical and electronic engineering, and robotics. Additionally, top AI and robotics conferences, such as the International Conference on Intelligent Robots and Systems and the IEEE International Conference on Robotics and Automation, are included.
The paper searching and selection process is shown in Figure 3. This review covers two aspects: different unimodal technologies and interfaces in HRI, and the combination and fusion of multiple modalities in HRI. The search string (WOS version) for unimodal HRI is '(TS = (vision OR visual OR haptic OR touch OR tactile OR physio* OR bio OR motion OR gesture OR auditory OR linguistic OR language OR acoustic OR speech OR voice) AND (human-robot OR "human robot" OR "human machine" OR interaction OR interactive OR collaboration OR collaborative) AND (industrial OR industry OR manufacturing OR production)) AND (PY = 2020 AND 2021 AND 2022))', while the search string for multimodal HRI is '(TS = (multimodal OR multi-modal) AND (human-robot OR "human robot" OR "human machine" OR interaction OR interactive OR collaboration OR collaborative)) AND (PY = 2018 AND 2019 AND 2020 AND 2021 AND 2022)'.
After the initial search, 179 journal papers on multimodal topics and 2098 journal papers on single modalities are obtained from WOS, while 102 conference papers on multimodal topics and 104 conference papers on single modalities are provided by Scopus. Among them, numerous conference papers report advancements in algorithm design and accuracy, particularly in computer vision (CV) and natural language processing (NLP), but do not deploy the algorithms on agents such as robots; these are excluded. As a result, in step 2 of Figure 3, 145 relevant papers are selected after filtering out papers unrelated to the survey topic based on title, keywords, and abstract. Additionally, relevant papers are included from the selected papers' references, as shown in step 3 of Figure 3. Thus, a total of 199 papers form the basis of this review.

Preliminary Results
During the literature review process, certain initial findings surfaced. As evident in Figure 4a, there has been a consistent rise in the number of articles related to multimodal HRI over the past decade. This indicates that the field of multimodal HRI has steadily gained popularity among scholars in recent years.
The distribution of the selected multimodal papers across the three cognitive dimensions is illustrated in Figure 4b. Most of the articles concentrate primarily on the perception dimension, while cognition and action receive less emphasis. This can be attributed to the relative maturity of perception algorithms, such as object detection, human recognition, and intention estimation, which act as a foundation for cognition and action. The advent of large language models (LLMs) has instigated significant advances in the AI field. Consequently, further enhancements in both robotic cognitive intelligence and action intelligence are anticipated in future research.

Existing Technologies for Human-Robot Interaction
To cultivate natural and intelligent HRI, a thorough comprehension of the state-of-the-art technologies within each modality is pivotal to facilitating multimodal interaction. This section illuminates the techniques, distinctive tasks, and uses of each modality within HRI contexts.

Vision-Based Technologies
The realms of CV, machine learning (ML), and deep learning (DL) are witnessing exponential advancement, with vision-based methods receiving primary emphasis in recent years. This review, centering on HRI, principally accentuates algorithms relating to human-centric recognition and prediction, encompassing four critical aspects: human position, human activity, human pose, and human emotion. A summary of contemporary studies employing vision-based algorithms in the HRI sphere is compiled in Table 1.

Human Position
Tethered to two primary objectives, human position detection encompasses discerning the presence of a human body and identifying a human face. [17] Body detection algorithms [18][19][20][21] precisely chart the location of a complete human figure while discerning certain characteristics, including color. Meanwhile, face detection algorithms [22,23] go beyond locating the human face to authenticating the individual's identity. These algorithms are implemented for collision avoidance during HRC tasks and enable mobile robots to track specific individuals during task execution.
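To make this concrete, the minimal sketch below detects human bodies in a workcell camera frame with OpenCV's pretrained HOG pedestrian detector and flags detections that overlap a robot workspace region; the image path, workspace coordinates, and thresholds are illustrative placeholders rather than any specific system from the surveyed literature.

```python
import cv2

# Pretrained HOG + linear-SVM pedestrian detector shipped with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def humans_in_workspace(frame, workspace_box):
    """Return detected person boxes that intersect the (x, y, w, h) workspace."""
    boxes, _weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    wx, wy, ww, wh = workspace_box
    hits = []
    for (x, y, w, h) in boxes:
        overlap_w = min(x + w, wx + ww) - max(x, wx)
        overlap_h = min(y + h, wy + wh) - max(y, wy)
        if overlap_w > 0 and overlap_h > 0:   # person overlaps the shared workspace
            hits.append((x, y, w, h))
    return hits

frame = cv2.imread("workcell.jpg")            # hypothetical sample frame
if frame is not None:
    alerts = humans_in_workspace(frame, (300, 100, 400, 400))
    print(f"{len(alerts)} person(s) inside the shared workspace")
```

In a deployed HRC cell, the detector would be replaced by the learning-based body detectors cited above, but the downstream overlap check for collision avoidance remains the same.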

Human Pose
The primary goal of human pose recognition is to discern the configuration of the human body, achieved by detecting various bodily joint positions (such as elbows, knees, and hips) and gauging the angles formed between them. Given the specificity of HRI, human pose recognition bifurcates into two categories: body pose recognition and hand gesture recognition. [47][48][49][50][51][52]
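As a minimal illustration of joint-position extraction, the sketch below uses the MediaPipe Pose solution to obtain 2D body landmarks from a single RGB frame; the image path is a placeholder, and downstream HRI logic (e.g., reach or lean detection) would consume the landmark coordinates.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_joints(image_bgr):
    """Return a list of (x, y, visibility) body landmarks, or None if no person is found."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        results = pose.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return None
    # Landmark coordinates are normalized to the image width/height.
    return [(lm.x, lm.y, lm.visibility) for lm in results.pose_landmarks.landmark]

image = cv2.imread("operator.jpg")            # hypothetical sample frame
if image is not None:
    joints = extract_joints(image)
    if joints:
        print(f"Detected {len(joints)} body landmarks")
```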

Human Emotion
Human emotion recognition [55] aims to gauge the emotional state of humans in order to glean insights into their mental and physical well-being, thereby enabling more efficient and secure HRI and HRC.

Auditory and Language-Based Technologies
Language, ubiquitous in human communication, serves as an indispensable modality for natural HRI. Bolstered by advances in NLP algorithms, three principal tasks have surfaced in HRI over the past three years: automatic speech recognition (ASR), spoken language understanding (SLU), and question-answering (QA) & dialogue systems, compiled in Table 2.

Automatic Speech Recognition
An ASR system forms the bedrock for voice-assisted interactions between humans and automated machines, equipped to recognize patterns in human speech and language subtleties and distinguish individual voices.In an industrial setting, deep neural network (DNN)-based ASR algorithms [56] have been proposed for instructing robotic systems in aligning and picking tasks.

Spoken Language Understanding
An SLU system focuses on the robot's higher-level understanding of the human intent behind spoken language. This process commences after ASR. The application of corresponding algorithms [57] enhances the smooth execution of speech-oriented HRI tasks.
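For illustration only, the sketch below is a rule-based stand-in for an SLU stage (it does not reproduce the BERT + Bi-LSTM + CRF pipeline of the cited work): it maps an ASR transcript to an intent label and slot values that a robot task planner could consume. The intent keywords and part names are hypothetical.

```python
import re

INTENTS = {
    "pick":  re.compile(r"\b(pick|grab|take)\b"),
    "place": re.compile(r"\b(place|put|drop)\b"),
    "stop":  re.compile(r"\b(stop|halt|freeze)\b"),
}
OBJECTS = ("bolt", "bracket", "housing", "gear")   # hypothetical part names

def understand(transcript: str):
    """Map a recognized utterance to a coarse intent and object slot."""
    text = transcript.lower()
    intent = next((name for name, pat in INTENTS.items() if pat.search(text)), "unknown")
    slots = {"object": next((o for o in OBJECTS if o in text), None)}
    return {"intent": intent, "slots": slots}

print(understand("Please pick up the bracket on the left"))
# -> {'intent': 'pick', 'slots': {'object': 'bracket'}}
```

A learned SLU model replaces the keyword rules with token-level classification, but the output contract, an intent plus filled slots handed to the robot planner, is the same.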

Question Answering & Dialogue System
ASR and SLU form the backbone of QA and dialogue systems in HRI. [60] However, compared to other modalities, auditory and language-based technologies are less observed in HRI. This is due to the sensitivity of ASR to noise, which is usually present in HRI environments, and to the need for an understanding of deeper, context-based language use. These factors have limited such technologies' application in industry-based HRI. Nevertheless, the recent emergence of LLMs, such as ChatGPT, shows promise in enhancing the inclusion of auditory and text elements in HRI.

Haptics-Based Technologies
Haptics is a crucial modality in HRI due to its active and bidirectional interaction capabilities, encompassing tactile sensing and haptic feedback. [61] Sensing involves capturing tactile information with tactile sensors, which enables robots to perceive and comprehend the characteristics of objects, humans, and the environment. Display, also known as actuation, refers to the use of actuators to provide tactile feedback to humans, creating an immersive and interactive experience. The methods for processing tactile signals and their respective tasks and applications in HRI are summarized in Table 3.

Sensing
Pressure sensors, encompassing piezoelectric and piezoresistive varieties, function as energy transducers, eliciting electrical signals responsive to imposed pressure.Methods derived thereof [62][63][64][65][66] are adept at facilitating high-precision HRI tasks, such as detection, localization, and identification of human tactile interactions, as well as in-hand object recognition predicated on learning algorithms.
Force and torque (F/T) sensors, integral to HRI, measure the forces and torques exerted between the robot and its environment or components, allowing the robot to tailor its reactions in real time. This capacity is particularly noteworthy for HRC tasks. [27,67,68] A spectrum of distinctive sensors, proficient at detecting specific attributes of objects and environments, including material composition, texture, rigidity, and conductivity, also bears relevant implications for HRI. The role of these tactile sensors in robotic grasping, manipulation, safety precautions, VR, and mixed reality (MR) applications suggests a broad sphere of influence within the field. Supporting details delineating these facets are presented in Table 3.
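A minimal sketch of how wrist F/T readings can drive real-time reactivity is given below: the six-axis reading format and the two force thresholds are assumptions, and practical systems typically filter the signal and adapt thresholds to the task rather than using fixed values.

```python
import numpy as np

LIGHT_CONTACT_N = 5.0    # assumed force magnitude [N] above which contact is inferred
COLLISION_N = 25.0       # assumed force magnitude [N] treated as a collision

def classify_wrench(wrench):
    """wrench = (Fx, Fy, Fz, Tx, Ty, Tz) from a wrist-mounted F/T sensor."""
    force = np.linalg.norm(wrench[:3])
    if force > COLLISION_N:
        return "collision: stop and retract"
    if force > LIGHT_CONTACT_N:
        return "contact: switch to compliant/impedance mode"
    return "free motion"

print(classify_wrench(np.array([1.2, -0.4, 8.3, 0.02, 0.01, 0.0])))
# -> "contact: switch to compliant/impedance mode"
```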

Display
Tactile sensations can be conveyed to humans via a multitude of mechanisms, including but not limited to electrical stimulation, [69] electromagnetic stimulation, [45] temperature stimulation, [70] pneumatic actuation, [71] and vibration. [72] The feedback of tactile information constitutes a critical component in establishing a sense of immersion within VR environments, particularly in remote HRC scenarios. A selection of actuation methodologies is elucidated in Table 3.

Physiological Sensing Technologies
Physiological sensing represents an anthropocentric approach to detection, entailing the use of sensors conceived to monitor human physiological states encompassing brainwave activity, cardiac frequency, dermal conductance, and muscular dynamics.

These technologies relay feedback to robots, thereby calibrating their conduct to better accommodate human requirements, culminating in a more natural and effective HRI experience. Physiological sensing signals may be broadly categorized into electrical and physical varieties, as explicated in Table 4.

Table 2. Methods and applications of auditory and language-based technologies in HRI.
ASR: DNN [56] — speech recognition for aligning and picking of an industrial robot in HRI.
SLU: BERT + Bi-LSTM + CRF [57] — speech-centric HRI system.
QA & dialogue system: Word2Vec + Bi-LSTM [58]; semi-supervised learning with multimodal data augmentation [59]; RoBERTa [60] — chatbot; user state, barge-in, and backchannel selection detection for human-like interaction in spoken dialogue systems; QA tasks in the Artificial Intelligence of Things domain.

Electrical Signals
Electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) typify electrical input signals within the HRI domain. EEG and ECG respectively relay information pertaining to cerebral and cardiac activity, offering direct insight into human mental and physiological states.

In contrast, EMG gauges the electrical activity engendered by muscle contraction, thereby detecting particular movements tied to human activity and posture. An array of algorithms designed to decipher and analyze these signals, along with associated tasks and applications, is concisely summarized in Table 4.

Physical Signals
Physical signal detection primarily serves sensor-based gesture identification within HRI. Using inertial measurement units (IMUs) and accelerometers, shifts in acceleration and orientation are recorded to discern specific gestures for subsequent recognition. Concurrently, strain and pressure sensors monitor mechanical distortion and pressure dispersal to detect particular hand movements for recognition. The algorithms and tasks pertinent to HRI applications are cataloged in Table 4.
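A minimal sketch of this pipeline is given below: IMU streams are windowed, simple statistical features are computed per window, and an off-the-shelf classifier is trained for gesture recognition. The synthetic random data merely stands in for recorded accelerometer and gyroscope windows; window length, feature set, and the number of gesture classes are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(window):
    """window: (T, 6) array of accel (x, y, z) + gyro (x, y, z) samples."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0),
                           np.abs(window).max(axis=0)])

rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 50, 6))   # 200 synthetic 50-sample IMU windows
labels = rng.integers(0, 4, size=200)     # 4 hypothetical gesture classes

X = np.stack([window_features(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print("predicted gesture id:", clf.predict(X[:1])[0])
```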

Limitations of Each Sensing Modality
While the four designated modalities maintain the capacity to execute an extensive array of tasks, they are not without certain constraints.These include sensitivity to occlusion, illumination variances, environmental noise, diminished sensor resolution, compromised dynamic sensor performance, and signal drift.A more comprehensive overview of these limitations is provided in Table 5.
The use of multimodal approaches offers the potential to counterbalance the limitations inherent in individual modalities, enabling each to leverage the strengths of its counterparts. For instance, vision-based algorithms may be affected by issues such as alterations in lighting, occlusion, and object zooming and rotation, yet incorporating data from wearable sensors can effectively negate these factors. In tasks such as audio-visual speech recognition, the input of visual information can mitigate the impact of environmental noise on ASR tasks. [73] Additionally, integrating haptic feedback with visual and auditory stimuli in VR significantly enhances the immersive experience. [58] In human intention prediction tasks, vision-based technologies are typically reactive, only activating after an action has been completed. This limitation can be addressed by fusion with EEG, which is capable of detecting prescient intention signals. The specifics of modality combinations and the associated advantages are explored in greater detail in Sections 5 and 6.
While multimodal HRI offers significant advantages over single-modality HRI, its popularity within the research community remains comparatively underdeveloped. The distribution of articles focusing on a single modality versus multiple modalities among HRI papers published in the last five years is represented in Figure 4c. Despite a marked increase in prevalence over the past decade (as depicted in Figure 4a), a mere 5% of the articles examined multimodal HRI in the past five years, whereas 95% concentrated on single-modality studies. This glaring disparity underscores the pressing need to direct increased attention toward multimodal HRI in upcoming research endeavors.

Interface
For successful integration of the algorithms elaborated in Section 3, proper consideration of the hardware required for their deployment is crucial. Accordingly, in Section 4, a discussion is presented regarding the devices and equipment employed as interfaces in practical HRI applications, with an emphasis on real-world industrial settings. Table 6 outlines select mainstream commercial devices along with their respective manufacturers, pertinent to various HRI interfaces.

Table 5. Limitations of each type of modality.
Vision: affected by lighting, occlusion, object zooming, and rotation [6]; difficult to differentiate between incidental and intended contact types [27]; passive for action intention prediction. [122]
Auditory & language: contaminated by serious speech overlap; affected by strong background noise; unknown number of speakers. [191]
Haptics: sensors show poor dynamic performance and low temporal and signal resolution [6,62,117,192]; sensors tend to provide a binary output, which merely signals contact or not [62]; tactile gloves cannot recognize multiple elements such as material, pressure, temperature, or hand pose at the same time. [113,117]
Physiological sensing: certain physiological signals are challenging to capture and highly susceptible to suppression [193]; sensors are characterized by hysteresis, signal drift, sensitivity to environmental influence, and poor dynamic performance [193]; it is difficult to fabricate complex components with high integration density and spatial resolution [5]; wearable sensors may cause discomfort. [17]

Web and App Interface
Web and app interfaces on touchscreens serve as commonplace conduits for HRI, enabling human operators to issue simple robot control commands via touch-based interaction with virtual buttons and other elementary operations. These interfaces also furnish real-time feedback information on the screen. In this vein, Rey et al. [74] utilized this methodology for the management of manufacturing processes.

XR Interface
Extended reality (XR) encompasses VR, augmented reality (AR), and MR. VR submerges users in an entirely simulated digital environment; conversely, AR superimposes virtual graphics and data onto the physical world, thereby heightening human perceptual abilities and facilitating interaction with both the real and virtual worlds concurrently. MR's application and definition tend to be somewhat nebulous. Depending on the context of its usage, MR might function similarly to AR, serve as a transitional experience between AR and VR, or simulate virtual scenarios that closely mirror real-world conditions. [75]

VR
The VR market offers a wide array of head-mounted displays (HMDs), including products like the HTC Vive (https://www.htc.com/hk-en/) and the Oculus Rift (https://www.oculus.com/rift-s/). These devices often operate together with a multitude of additional peripherals, including computer screens, haptic feedback devices, laser positioners, handheld controllers, the Kinect Sensor v2, and the Leap Motion controller. They can be deployed in a myriad of applications, such as VR chatbots for HRI, [58] digital twin (DT)-based HRC, [76] remote HRI, [77] and even gestural interpretation for visual navigation within VR environments. [78]

MR
Equipment for MR applications is typically shared with VR and AR devices and is often allied with DT applications. For instance, Li et al. [85] utilized the Microsoft HoloLens 2 to effectuate human-in-the-loop control for a multirobot collaborative manufacturing system. In a similar vein, Al-Sabbag et al. [86] leveraged the Microsoft HoloLens 2 for visual inspection and damage detection in human-machine collaboration. In addition, Su et al. [87] combined an HTC Vive HMD, computer monitor, HTC Vive controller, and a Vive motion tracking system to facilitate MR vision and motion mapping for the teleoperation of mobile robotic manipulation.

Vision-Based Interface
Beyond XR devices, the most prevalent device for visual sensing is the camera, including monocular cameras for 2D image sensing and stereo RGB and depth (RGB-D) cameras for 3D image sensing. Notably, two depth cameras frequently employed in HRI research are the Intel RealSense D415 (https://www.intelrealsense.com/depth-camera-d415/) [39,88,89] and the Microsoft Kinect V2 (https://learn.microsoft.com/en-us/). [92] Once processed by the CV algorithms outlined in Table 1, images acquired from such depth cameras can accomplish a range of tasks in HRI outlined in Section 3.1, including detection, recognition, and prediction. Devices integrating several cameras are also prominently featured in HRI research. The Leap Motion controller (https://www.ultraleap.com/), operating on the principle of stereo vision, is outfitted with two cameras to capture hand points, allowing for hand gesture recognition. The OptiTrack system, comprising six FLEX3 cameras, enables motion tracking, thereby facilitating tasks like human body detection and pose estimation. These devices are widely utilized in HRC and robot teleoperation tasks. [45,51,52,93,94] Beyond the cameras noted, radar, laser, and even mirrors can serve to capture visual information. Zhang et al. proposed ReflectU, which is based on mirror reflection, to detect human motion in human and multirobot interaction. Furthermore, devices such as the Hypersen Solid-State Lidar and the Vayyar Imaging Radar Walabot 60 GHz are used for human activity recognition within the scope of HRI. [95] Despite often being overlooked in vision-based technologies, eye tracking has the potential to estimate human attention and recognize emotions during HRI. The Tobii Eye Tracker is a device that logs eye data, including gaze position and pupil diameter, facilitating the identification of emotions during human-computer interaction (HCI). [92]
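A minimal capture sketch for such a depth camera is shown below, using the pyrealsense2 SDK to stream one color and depth frame pair (stream resolutions and frame rate are illustrative defaults); the resulting arrays would then be handed to the CV algorithms of Table 1.

```python
import numpy as np
import pyrealsense2 as rs

# Configure a RealSense pipeline with one depth and one color stream.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()
    depth = np.asanyarray(frames.get_depth_frame().get_data())
    color = np.asanyarray(frames.get_color_frame().get_data())
    print("depth:", depth.shape, "color:", color.shape)
finally:
    pipeline.stop()
```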

Auditory-Based Interface
The auditory and speech devices used in HRI are typically uncomplicated, often combining microphones with headsets or speakers. This configuration is evident in a variety of applications, such as speech recognition for aligning and picking in industrial robots [56] and user-oriented programming of collaborative robots. [96] Additionally, several software interfaces can be directly applied in HRI. Google's Cloud Speech, for instance, provides translation from spoken language to written text. Rosen et al. [82] incorporated this interface for speech recognition, thereby enabling MR HRI.
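For reference, a minimal call pattern for Cloud Speech-to-Text is sketched below: a short 16 kHz mono WAV file is transcribed so the text can be handed to an SLU or command-parsing stage. Credentials, the audio file path, and the audio format are assumptions about the deployment environment.

```python
from google.cloud import speech

client = speech.SpeechClient()
with open("operator_command.wav", "rb") as f:           # hypothetical recording
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print("transcript:", result.alternatives[0].transcript)
```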
Beyond naturally occurring sound, ultrasound can also play a contributory role in HRI.Wei et al. [97] employed a wrist-worn device, Elonxi, to capture A-mode ultrasound (AUS) and surface EMG (sEMG) signals for hand gesture recognition, achieving higher accuracy than when utilizing sEMG signals alone.

Haptics Interface
As detailed in Section 3, haptics technology facilitates bidirectional interaction, in which the robot senses and analyzes its external environment, objects, and people, and also provides tactile feedback to humans for a more immersive experience. Consequently, this subsection delves into five categories of tactile interfaces: e-skin for robots, sensors installed on robots, e-skin for humans, gloves, and contactless devices.

E-skin for Robots
Electronic skin (e-skin) is a thin, flexible material that can be attached to human skin or to robots to sense haptic information and perform other functions beyond the capabilities of conventional electronics. [98] Utilizing various smart materials, e-skins for robots can offer highly accurate tactile sensing.
One function of e-skin is the ability to decouple various tactile signals at a single position. Certain e-skins can precisely decouple normal and shear forces for adaptive robotic grasping and dexterous manipulation, [99,100] while others focus on differentiating thermal and mechanical information to provide real-time force direction and strain profiles for various tactile motions such as shear, pinch, spread, and torsion. [101] Another conventional use of e-skin is to detect the exact touch position and pressure intensity. [104] Moreover, e-skins integrated with multimaterial detectors have the ability to sense the composition of the surrounding environment. Yu et al. [105] proposed Soft e-skin-R, allowing robots to sense a broad range of hazardous materials in their environment. Additionally, Armleder et al. [106] presented a robot skin cell comprising a microcontroller and several tactile sensors for robust force control in HRC. Further, Ge et al. [107] introduced a capacitive and piezoresistive hybrid sensor array, facilitating long-distance proximity and wide-range force detection for HRC in complex, precise, and safety-critical conditions.

Sensors Installed on Robots
Sensors enhance robot perception by gathering pertinent characteristics of surrounding objects, humans, and the environment and translating them into the information necessary for executing specific tasks. In the context of robot tactile and haptic perception, F/T sensors and pressure sensors are commonly mentioned in numerous studies.
An F/T sensor such as the ATI Mini-45 (https://www.ati-ia.com/products/ft/ft_models.aspx?id=mini45) measures real-time force and torque, as well as the motion state of the robot, and is typically installed on robot arms. [110] Standard pressure sensors like the SynTouch BioTac sensor (https://syntouchinc.com/robotics/), on the other hand, primarily focus on the detection of force values and are habitually installed on robot hand fingers, grippers, and other end effectors. Nonetheless, some advanced tactile sensors [64,65,88,111] are capable of aiding robots in executing more agile and meticulous operations, including in-hand object recognition, detection of subtle collisions, object weighing, slip detection, and touch modality recognition.

E-Skin for Humans
Contrary to e-skins for robots, most e-skins designed for humans produce tactile stimulations to be perceived by humans. Once programmed, these e-skins aim to recreate the roughness, shape, and size of perceived or virtual objects through soft pneumatic actuators [70] and vibration patterns. [112] Beyond providing haptic feedback, these e-skins are capable of recognizing human gestures and actions. Further, they aid humans in accurately detecting an object's material, texture, and other attributes, [113] as well as distinguishing between different types of touch and pressure forces. [114] Within the context of HRI, e-skins for humans can be utilized for remote interaction and VR applications to offer an immersive user experience.
Gloves

Beyond haptic feedback, gloves also have the capacity to identify hand poses and sensory inputs for HRI, [117] determine tactile signatures for robot grasping learning, [119] and detect force and inertial data during handshakes between humans and robots as a source of control feedback. [120] Most haptic feedback gloves currently available on the market provide only single-modal feedback, primarily focusing on vibration. However, human touch perception is inherently multimodal. Therefore, to enhance the immersive user experience in VR, further research is required to design gloves capable of delivering multiple haptic feedback modalities, such as vibration, temperature, and force feedback (Challenge and Future Direction 7.1).

Contactless
Certain contactless devices can also supply tactile feedback to human operators. Christou et al. [121] proposed an air-based haptic feedback device that allows users to interact with virtual objects while delivering midair tactile feedback. Similarly, Du et al. [45] presented a device capable of providing contactless electromagnetic force feedback, facilitating accurate and efficient collaborative manipulations in HRC.

Physiological Sensing Interface
Physiological sensing apparatuses, typically worn, are designed for the acquisition of human electrical and physical physiological data.Herein, these devices are introduced based on their placement on diverse segments of the wearer's body.

Head
Devices worn on the head to monitor EEG, ocular activity, and facial sEMG are employed in contemporary technology. Integrated devices making use of EEG electrodes record EEG waves from a multitude of cerebral zones. These include the Emotiv Epoc Plus (https://www.emotiv.com/epoc-x/), [122,123] the Neuroscan SynAmps2, [124] the ESI NeuroScan System, [125] SKINTRONICS, [126] ADInstruments PowerLab, [92] and TGMA. [127] All of these have found applications in HRI and HCI. Devices and algorithms of this nature allow human operators to control and interact with robots through intention alone. Concurrently, certain interfaces grant human users interaction capabilities through ocular activity. These include SMI ETG eye-tracking glasses, [125] used for emotional recognition, the portable eye-tracking headset referred to in ref. [83] for safeguarding HRI through gaze detection, and graphene electronic tattoo electrooculogram (EOG) sensors [128] for directing rotors via EOG tracking. Additionally, an epidermal sEMG tattoo-like patch facilitates perceptive silent speech recognition as well as voice synthesis for HRI. [129]

Hand

Devices donned on the hand customarily function to discern gestures, acquiring the motion of hands and fingers through the measurement of disparate physical metrics, encompassing inertial characteristics, acceleration, and mechanical strain. The most commonly used on the market are VR controllers and VR gloves, such as the VIVE Cosmos Controller (https://www.vive.com/hk/accessory/cosmos-controller-right/), the Meta Quest 2 Controller (https://www.meta.com/quest/accessories/quest-2-controllers/), ROKOKO Smartgloves (https://www.rokoko.com/products/smartgloves), and MANUS Quantum Gloves (https://www.manus-meta.com/products/ultimate-mocap-package). Most of these VR controllers and gloves are capable of motion tracking based on IMUs and accelerometers, [72,130,131] while other gloves are based on strain sensors [132,133] and are likewise capable of gesture recognition. These hand-worn devices are widely observed in VR and AR applications, remote HRI, robot control, and navigation tasks. [134][135][136][137]

Upper Limb

Devices worn on the upper limbs primarily perform EMG sensing, exemplified by devices like the ELONXI sEMG Analyzer (http://elonxi.cn/?list_15/25.html). Utilized in HRI, integrated EMG electrode devices such as electrode armbands and e-skin facilitate gesture recognition, [97,105,138-140] motion tracking, and robotic arm manipulation. [141,142] They also aid stiffness detection for robotic skill learning in HRC [143] and help provide precise control for prosthetic hand manipulation. [144] Likewise, arm-placed IMUs deliver human pose estimates vital for safe HRC [94] and capture human motion useful for robotic skill learning in HMI. [3] Furthermore, Araromi et al. [145] proposed a motion-detecting arm sleeve, integrated with a textile-based strain sensor, for HCI and environmental interaction.

Others
Yang et al. [146] unveiled a conformal and adhesive polymer electrode, an innovative tool for continuously tracking electrocardiogram signals that reflect the electrical activity of the heart, which is potent for emotion recognition. Furthermore, Kim et al. [147] delineated a pliable strain-sensor-based suit, adept at tracking body movements and thereby finding utility in VR and AR applications.

Multimodal Technologies in HRI
Whereas ML and DL algorithms have procured notable victories within single modalities, markedly within the realms of CV and NLP, the surge of multimodal ML cannot be overlooked. This burgeoning branch of study has flourished across numerous research terrains, including HRC, which necessitates the fusion of diverse sensory inputs for natural interaction. Subsequently, this section delves into an array of multimodal ML algorithms and interfaces that have found application in actualizing multimodal HRI within the industrial sphere.

Multimodal Fusion
Multimodal fusion encompasses the joint analysis of two or more data sets derived from distinct modalities for classification and regression tasks, standing as a flagship technique within multimodal ML. Such task-specific algorithms typically couple ML and DL models that analyze the individual modalities with fusion methodologies that amalgamate data, features, and decisions stemming from multiple modalities. While Section 3 casts light upon the algorithms harnessed for dissecting each standalone modality, the present segment turns its focus toward the fusion methodologies deployed in multimodal HRI applications, with an emphasis on industrial utilization.
Three broad categories of multimodal fusion methods exist, namely early fusion, late fusion, and intermediate fusion, contingent upon the phase at which fusion is incorporated within a multimodal ML algorithm. [148] Paradigmatic models of each subgroup are depicted in Figure 5.

Early Fusion
Early fusion, often denoted feature-level fusion, grapples with raw and preprocessed data derived from varied modalities, amalgamating them into a feature vector that feeds into a model for subsequent analysis. [148] Figure 5a portrays a prototypical model adopting early fusion for three modalities. Early fusion techniques can exploit the correlations present in the input modality data, thereby enhancing the precision of the final decision, while also being resource- and time-efficient owing to the single learning phase. [149] Nonetheless, because they combine the data from disparate modalities rather than previously extracted features, early fusion methods grapple with challenges including temporal synchronization, dimensionality disparity, and differing sampling frequencies. [148]
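To make the distinction concrete, a minimal PyTorch-style sketch of early fusion is given below; the feature dimensions, number of modalities, and classifier head are illustrative placeholders rather than any specific model from the surveyed literature.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Concatenate per-modality features into one vector, then classify."""
    def __init__(self, dims=(128, 64, 32), n_classes=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(dims), 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, vision, audio, tactile):
        fused = torch.cat([vision, audio, tactile], dim=-1)   # feature-level fusion
        return self.head(fused)

model = EarlyFusionNet()
logits = model(torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 32))
print(logits.shape)  # torch.Size([8, 5])
```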

Late Fusion
Late fusion, also referred to as decision-level fusion, alludes to the amalgamation of data at the final decision phase of the algorithm. [148] The initial steps entail employing separate models for each modality to analyze the relevant data and render a preliminary decision. These initial decisions are subsequently merged via various fusion techniques into a definitive judgment. Figure 5b illustrates a traditional model utilizing late fusion. This approach touts several merits, such as simplified implementation due to the standardized input representation of the fusion stage, the potential for uncorrelated errors, and the ability to select the most apt model for each modality. Conversely, the drawback of late fusion is conspicuous: given that fusion transpires at the end of the algorithm, exploiting correlations between modalities proves taxing. Moreover, it is time-intensive, as it necessitates the engagement of multiple models for analysis. [149]
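The corresponding late-fusion sketch is shown below: each modality receives its own classifier, and the per-modality class probabilities are averaged at decision level (weighted voting or stacking are common alternatives). Dimensions and the averaging rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """One classifier per modality; average the class probabilities at the end."""
    def __init__(self, dims=(128, 64, 32), n_classes=5):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_classes))
             for d in dims])

    def forward(self, *modalities):
        probs = [torch.softmax(net(x), dim=-1)
                 for net, x in zip(self.experts, modalities)]
        return torch.stack(probs).mean(dim=0)   # decision-level fusion

model = LateFusionNet()
out = model(torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 32))
print(out.shape)  # torch.Size([8, 5])
```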

Intermediate Fusion
Concurrent with substantial advances in DL, intermediate fusion has ascended as an evolution of early fusion. Once each modality has been transmuted into a representation within the DL model, these representations may be concurrently fused via the fusion stratum (the shared representation layer). Additionally, the representations may also undergo a gradual fusion process, ultimately culminating in a decision. [148] Figure 5c delineates a rudimentary model of intermediate fusion. This multimodal fusion approach, adeptly melding the virtues of both early and late fusion strategies, has witnessed a significant upswing in popularity in recent years.
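A minimal intermediate-fusion sketch completes the comparison: each modality is first encoded into a learned representation, the representations are merged in a shared layer, and one head produces the decision. Again, dimensions and layer choices are placeholders.

```python
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    """Encode each modality, merge encodings in a shared representation layer, classify."""
    def __init__(self, dims=(128, 64, 32), hidden=64, n_classes=5):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims])
        self.shared = nn.Sequential(nn.Linear(hidden * len(dims), hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, *modalities):
        reps = [enc(x) for enc, x in zip(self.encoders, modalities)]
        shared = self.shared(torch.cat(reps, dim=-1))   # shared representation layer
        return self.head(shared)

model = IntermediateFusionNet()
print(model(torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 32)).shape)
```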

Multimodal Algorithms in HRI
This subsection will delve into recent multimodal algorithms apt for deployment within HRI applications, addressing aspects such as pertinent task, fusion category, specific fusion technique, amalgamated modalities, accuracy metrics, and the interface employed for data harvesting.In recognition of the tripartite dimensions of robot cognitive intelligence, these algorithms are trifurcated based upon tasks, namely perception, cognition, and action.

Perception
Within the context of HRI scenarios, it is imperative for the robot to possess cognizance of the human operator's attributes. In line with the discussion in Section 3, most foundational attributes can be discerned based on single modalities; however, these possess various constraints. The integration of multiple modalities can redress these inadequacies and facilitate precise, robust robot perception, thereby rendering multimodal technologies critical. This subsection encapsulates the algorithms pertinent to multimodal perception within HRI, including the perception of human activities, gestures, speech, and emotions, as showcased in Table 7. Some typical algorithms are also provided in Figure 6 for reference.
With regard to human activity and gesture recognition, the predominant methodologies employed in HRI stem from vision-based technologies, which are subject to three primary limitations, as expounded in Section 3.5. First, vision-based algorithms succumb to the influences of lighting, occlusion, object scaling, and rotation, resulting in information loss; fusion with physiological sensing data from wearable apparatus such as IMUs and EMG sensors serves as a viable remedy to this constraint. Second, differentiating between incidental and intended contact is difficult using vision alone, which fusion with tactile or F/T sensing can help resolve. Third, vision-based methods are passive for action intention prediction, a limitation that fusion with EEG signals can address.

As delineated in Table 7, several endeavors have exploited multimodal technologies for human activity recognition. For instance, Pohlt et al. [150] proposed a Product-of-Experts (PoE) approach to synthesize vision and touch data captured by a stationary FLIR Bumblebee2 stereo (SC) camera and a Basler acA1920-155um (GC) camera, in addition to joint torque sensors on a Kuka LBR iiwa. Notwithstanding the loss of a modality's information or ambiguity in the number of sensors, the PoE approach can render exemplary fusion results across different modalities and streams. Further, Islam et al. [151] introduced a multimodal graphical attention-based model founded on two modalities: vision data harnessed from cameras and skeleton data gleaned from physical sensors. The proposed model innovatively implements a multimodal mixture-of-experts (Multi-MoE) for segregating and extracting unimodal attributes and also incorporates crossmodal graphical attention to master the intermodality representations.

Figure 6. Typical multimodal fusion algorithms for perception. Reproduced with permission. [151] Copyright 2021, IEEE. c,d) MMCANet algorithm for hand gesture recognition based on the fusion of sEMG and AUS modalities: (c) sEMG and AUS fusion device; (d) the structure of MMCANet. Reproduced with permission. [97] Copyright 2022, IEEE. e) Multimodal sparse transformer network for human speech recognition based on the fusion of vision and auditory modalities. Reproduced with permission. [73] Copyright 2022, IEEE. f,g) Stacked RBMs for human emotion estimation based on the fusion of vision and EEG modalities: (f) the framework of the proposed approach; (g) the stacked RBM structure. Reproduced with permission. [125] Copyright 2019, IEEE.
Recent endeavors utilizing multimodal technologies for hand gesture recognition in HRI are chronicled in Table 7. Lee et al. [72] introduced a glove equipped with IMUs to record hand kinematic information. Visual markers were installed for the HMD camera to seize visual information, and the data were then processed through tightly coupled filtering-based visual-inertial fusion. Wang et al. [137] harnessed a convolutional neural network (CNN) to analyze visual information captured by a commercial camera. Subsequent steps encompassed concatenating the vision feature vector with strain-based somatosensory information, followed by a sparse neural network for classification. Moreover, Wei et al. [97] employed Elonxi, a device merging four-channel sEMG and AUS, to accumulate EMG and AUS data. A transformer encoder was then applied for fusion to facilitate hand gesture recognition.
Speech recognition serves as the initial conduit for HRI via auditory cues and language. However, audio information's susceptibility to environmental noise significantly impacts recognition accuracy and practical implementation, particularly in real industrial settings. Incorporating facial expression recognition based on visual inputs can considerably enhance accuracy. For instance, Song et al. [73] split a video into three constituent modalities: visual, optical flow, and audio. Their proposed multimodal sparse transformer network employs crossmodal attention fusion to augment the extracted visual and optical flow attributes. These features are then conjoined with audio features for subsequent analysis.
Human emotion recognition relies on a multitude of factors: facial expression, posture, auditory cues, language, as well as physiological indicators such as brain waves and body temperature.Each modality narrates a distinctive facet of emotion and encapsulates complementary information.The fusion of these varied modalities, therefore, possesses the potential to establish an accurate and dependable emotion recognition system for HRI.
Kanjo et al. [152] amalgamated disparate multimodal data, comprising physiological metrics, environmental aspects, and locational features, and fed them into a CNN-long short-term memory (LSTM) model for in-depth analysis of emotional states.
Considering the profound synergy among text, visual, and auditory information, many researchers have proposed early fusion techniques to intertwine these modalities. Mansouri-Benssass et al. [153] exploited spiking neural networks (SNNs) to ascertain crossmodal associations between visual and auditory characteristics. Additional models, including the hierarchical feature fusion network (HFFN), [154] the coupled-translation fusion network (CTFN), [155] and the multiway multimodal transformer (MMT), [156] have been pioneered to decipher the intricate correlations of textual, auditory, and visual data conducive to human emotion analysis. Besides early fusion techniques, models such as attention mechanisms [157] and Bayesian networks [158] have manifested as late fusion methodologies for these modalities.
Pure physiological sensing data also exhibit potential for estimating human emotion. Zhang et al. [159] elucidated regularized deep fusion of kernel machines (RDFKM) trained on multiple physiological signals, encompassing EEG, EMG, GSR, RES, MEG, EOG, and ECG.
Merging vision and EEG data for emotion estimation has also gained considerable traction. Certain studies, for instance those using stacked restricted Boltzmann machines (RBMs) [125] and concatenation of CNN outputs, [92] have demonstrated impressive results.

Cognition
Building on perception, natural HRI necessitates that the robot possess the faculties of reasoning and comprehension, a realm that falls under cognition. The robot needs to comprehend, process, and utilize perceptual data at the cognition level and independently formulate recommendations, without the prerequisite of conducting actual actions. The robot's most prevalent behavior involves interaction with humans through multimodal channels, primarily revolving around vision and language. Hence, several vision-language (VisLang) tasks are leveraged in HRI, encompassing visual question answering (VQA), visual dialogue, vision-language navigation (VLN), and spatial reasoning, as delineated in Table 8. Some typical algorithms are also provided in Figure 7 for reference.
Pertaining to a generalized framework for VisLang tasks, the majority of extant studies employ intermediate fusion. This process entails initial encoding of visual and linguistic inputs, followed by multimodal feature fusion. [160] Subsequently, the synthesized representation is funneled into pertinent decoders for providing answer predictions, generating queries, and strategizing future actions, contingent on the task at hand.
VQA encapsulates the aptitude to accurately respond to a query based on an image or video. [160] A compendium of recently proposed VQA models viable for HRI implementation is given in Table 8. In the context of image-based QA, Yu et al. [161] harnessed a CNN for image feature extraction and an LSTM for text feature extraction, subsequently merging them via generalized multimodal factorized high-order pooling (MFH) for visual reasoning and answer production. Cao et al. [162] provided the blueprint for the decision tree and proposed a parse-tree-guided reasoning network for interpretable VQA. Wang et al. [163] fashioned a model that learns multimodal interaction representations from trilinear transformers (MIRTT) for VQA tasks. In the domain of video QA, Peng et al. [164] unveiled a multilevel hierarchical network (MHN) that takes into account information spanning various temporal scales.
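To illustrate this family of bilinear fusion, the sketch below implements a simplified factorized bilinear pooling module in the spirit of MFB/MFH: projected image and question features are multiplied elementwise, sum-pooled over a factor dimension, and then power- and L2-normalized. The feature dimensions, factor size, and normalization constants are assumptions and do not reproduce the cited implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    """Simplified MFB-style fusion of an image feature and a question feature."""
    def __init__(self, img_dim=2048, txt_dim=1024, out_dim=1000, k=5):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.proj_img = nn.Linear(img_dim, out_dim * k)
        self.proj_txt = nn.Linear(txt_dim, out_dim * k)

    def forward(self, img_feat, txt_feat):
        joint = self.proj_img(img_feat) * self.proj_txt(txt_feat)          # (B, out*k)
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)            # sum-pool over k
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)    # power normalization
        return F.normalize(joint, dim=-1)                                  # L2 normalization

fusion = FactorizedBilinearPooling()
print(fusion(torch.randn(4, 2048), torch.randn(4, 1024)).shape)  # torch.Size([4, 1000])
```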
Visual dialogue, encompassing VQA and generation, signifies maintaining a dynamic conversation with humans based on archived preceding dialogues. [160] There exist various channels for dialogue engagement. Rosen et al. [82] conceptualized a dialogue system wherein a human operator utilizes speech, eye gaze, and pointing gestures to command the robot, akin to "hand me that box." The robot responds by physically gesturing toward or highlighting the object via MR to validate its understanding. They advanced a physio-virtual deixis partially observable Markov decision process (PVD-POMDP) model for decoding human speech, eye gaze, and pointing gestures and for determining the ideal juncture to ask a question. Additionally, they employed Google Cloud Speech for auditory recognition, a Microsoft Kinect v2 equipped with OpenNI software for gesture tracking, and an MR-HMD armed with Magic Leap One for MR responses. Furthermore, Geng et al. [165] formulated a spatiotemporal scene graph representation (STSGR) for multimodal representation learning and reasoning anchored in video, audio, and text data for human-robot video dialogue systems, representing a prevalent mode of communication within HRI.
VLN implicates interactive cooperation between a human interlocutor and an AI agent, facilitated through dialogue, to orchestrate the agent's maneuvering within an environment. [166] Pashevich et al. [167] configured an episodic transformer to actualize the VLN task for autonomous agent interaction with humans and the environment, mediated via visual and textual modalities. Concurrently, Yan et al. [168] introduced a memory vision-voice indoor navigation (MVV-IN) system, enabling humans to guide an AI agent verbally for VLN tasks. In the MVV-IN system, they employed FC or CNN layers to transform features into aggregatable tensors, opted for concatenation or self-attention for fusion, followed by a memory network for further action generation by the agent.
A host of additional studies have offered insightful contributions to VisLang HRI. Venkatesh et al. [169] introduced a two-phase model, LANG-UNet, for spatial reasoning steered by natural language directives. In this study, they innovated a novel binary 2D image representation for object position data furnished by a separately trained object-detection mechanism. Subsequently, a language model, U-Net, was utilized to process the decoded natural language directives in conjunction with the object positions. Consequently, the start and end locations for the manipulator, predicated on the directive, could be designated.
In the bulk of the studies explored above, HRI was meticulously pre-scripted. In essence, the communicative exchange in HRI was more or less predetermined and hinged on a predefined interaction language. However, this technique disrupts the organic continuity of genuine human communication. A model of natural HRI ought to accommodate the unpredictability inherent in typical, organic human exchanges and remain unhampered by predetermined constituents. Despite advancements in HRI research, the field retains a vast expanse of untrodden terrain awaiting exploration (Challenge and Future Direction 7.2).

Action
While perception lays emphasis on preliminary processing of sensory data, cognition is centered on reasoning, comprehending, and generating recommendations, whereas action constitutes the physical behavior consequent to cognitive processes. In the realm of HRI, one of the pivotal tasks pertains to robot control. Correspondingly, Table 9 outlines a compendium of papers proposing control frameworks that facilitate physical interaction between robots and humans through multiple communication pathways, including tactile, visual, F/T feedback, and EMG signals.
In these studies, the multimodal data are incorporated into algorithm frameworks, though the fusion techniques referenced are not as explicitly delineated as in Tables 7 and 8. This tends to be because the bulk of the work involving multimodal information occurs during the perception and cognition stages, with the focus in the action phase being predominantly on devising a control system that facilitates the robot's accurate and stable execution of commands from the preceding phase.
Despite limited exploration, distinctive papers apply intermediate fusion methods at the action level, as noted in Table 9.Also, some typical algorithms are provided in Figure 8 for reference.
Fazeli et al. [89] advanced a hierarchical learning-based methodology to effectuate robot self-learning for complex manipulations. This was accomplished by combining visual data captured by an Intel RealSense D415 camera and force information gleaned from an ATI Gamma F/T sensor mounted on the robot's wrist.

Figure 7. Typical multimodal fusion algorithms for cognition. Reproduced with permission. [161] Copyright 2018, IEEE. c,d) PVD-POMDP algorithm for visual dialogue based on the fusion of speech, eye gaze, and gesture modalities: (c) an HRI example; (d) a graphical model of the PVD-POMDP. Reproduced with permission. [82] Copyright 2020, IEEE. e,f) Episodic Transformer algorithm for VLN based on the fusion of vision and text modalities: (e) overall illustration of VLN; (f) episodic Transformer architecture. Reproduced with permission. [167] Copyright 2021, IEEE. g,h) LANG-UNet algorithm for spatial reasoning based on the fusion of text and vision modalities: (g) overall illustration of the spatial reasoning task; (h) the LANG-UNet model architecture. Reproduced with permission. [169] Copyright 2021, IEEE.

Besides robot learning through self-experimentation, learning from human demonstration is also a popular approach within HRI. Le et al. [108] employed robot F/T feedback in concert with visual data to replicate learned skills derived from end-effector poses and measured forces. They devised a learned model proficient in merging pose and force features, optimizing stiffness for varying stages of skill acquisition, and implementing an online execution algorithm for adaptive enactment. Built on task-parameterized optimization and attractor-based impedance control, this approach allows robotic manipulators to learn from demonstrations, particularly during industrial processes such as assembly. Feng et al. [88] proposed a robust grasp strategy for unfamiliar objects, which incorporates slip detection and regrasp planning predicated on the object's center of mass. They used an RGB-D camera to capture images for object segmentation and detected slips via pressure sensors. By retrieving pressure data and F/T feedback from these sensors, they employed an LSTM network to extract pressure and F/T features. They then amalgamated these extractions with other parameters through fusion at concatenation layers for further scrutiny. The model then predicted grasp robustness and updated the position for regrasp learning and planning.
Martin et al. [170] proposed an enhanced hierarchical control approach that reconciles position-based visual servoing (PBVS) tasks with collision avoidance through reactive skin control. They used a Logitech C270 HD USB RGB camera to collect visual data for the PBVS task and developed CellulARSkin, a responsive robot skin that unites accelerometers, force sensors, proximity sensors, and temperature sensors. Armleder et al. [106] integrated the CellulARSkin system into robot force control tasks. This self-organizing skin technology was built into a tactile omnidirectional mobile manipulator (TOMM) and complemented by lidar. They introduced a whole-body reactive hierarchical force control framework utilizing vision, acceleration, force, proximity, temperature, and torque data, enabling tasks such as tactile guidance, collision avoidance, and compliance.
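One standard way to realize such a strict task hierarchy, shown below as a simplified sketch rather than the published controllers, is null-space projection: the higher-priority task (for example, skin-driven collision avoidance) is satisfied first, and the lower-priority task (for example, PBVS) acts only in its null space.

```python
import numpy as np

def prioritized_velocity(J1, dx1, J2, dx2):
    """Two-level task-priority resolution: task 1 (e.g., skin-based avoidance)
    is satisfied exactly; task 2 (e.g., PBVS) acts in the null space of task 1."""
    J1_pinv = np.linalg.pinv(J1)
    q_dot1 = J1_pinv @ dx1                       # primary-task joint velocities
    N1 = np.eye(J1.shape[1]) - J1_pinv @ J1      # null-space projector of task 1
    q_dot2 = np.linalg.pinv(J2 @ N1) @ (dx2 - J2 @ q_dot1)
    return q_dot1 + N1 @ q_dot2

# Toy example: 7-DoF arm, 3D avoidance direction and 6D PBVS twist (random stand-ins)
q_dot = prioritized_velocity(np.random.randn(3, 7), np.random.randn(3),
                             np.random.randn(6, 7), np.random.randn(6))
```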
Moreover, Rezaei-Shoshtari et al. [171] addressed the task of predicting the outcomes of physical interactions. They utilized the see-through-your-skin (STS) sensor to capture tactile and visual information about the object and employed a multimodal variational autoencoder (MVAE) for fusion. In the MVAE, each modality is encoded and per-modality experts learn a shared latent representation, allowing the outcomes of physical interactions to be predicted in the latent space with minimal error.
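A common ingredient of such multimodal VAEs is a product-of-experts combination of the per-modality Gaussian posteriors into a single shared latent distribution. The snippet below is a generic sketch of that step (with arbitrary latent dimensions, not the exact model of Rezaei-Shoshtari et al.).

```python
import torch

def product_of_experts(mus, logvars):
    """Combine per-modality Gaussian posteriors q(z|x_m) into a joint posterior.
    mus, logvars: lists of (batch, latent_dim) tensors, one entry per modality."""
    # Include a standard-normal prior expert (mu = 0, var = 1)
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    weighted_mus = [torch.zeros_like(mus[0])] + [m * torch.exp(-lv)
                                                 for m, lv in zip(mus, logvars)]
    joint_precision = sum(precisions)
    joint_mu = sum(weighted_mus) / joint_precision
    joint_logvar = -torch.log(joint_precision)
    return joint_mu, joint_logvar

# Toy example: fuse visual and tactile posteriors over a 16-D latent space
mu_v, lv_v = torch.randn(4, 16), torch.randn(4, 16)
mu_t, lv_t = torch.randn(4, 16), torch.randn(4, 16)
mu_z, lv_z = product_of_experts([mu_v, mu_t], [lv_v, lv_t])
```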
Furthermore, Zhang et al. [172] revised the fast orthogonal search (FOS) algorithm for motion intention prediction based on EMG and IMU information and introduced a deep deterministic policy gradient (DDPG) reinforcement learning controller to realize the predicted intentions.
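Conceptually, the predicted intention can simply be appended to the robot state that the DDPG actor consumes. The sketch below illustrates only this interface (hypothetical dimensions; the full DDPG training loop and the authors' exact design are omitted).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mapping (robot state + predicted human intention)
    to a bounded continuous robot command, as in DDPG-style control."""
    def __init__(self, robot_dim=7, intention_dim=3, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(robot_dim + intention_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # bounded action
        )

    def forward(self, robot_state, intention):
        return self.net(torch.cat([robot_state, intention], dim=-1))

actor = Actor()
action = actor(torch.randn(1, 7), torch.randn(1, 3))  # intention from an EMG/IMU predictor
```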
Apart from studies that adopt multimodal ML methods and integrate multimodal data into their proposed control algorithms, various papers propose other interaction methods that also yield positive results but do not incorporate modality fusion within their algorithms. For example, some papers resort to a temporal cascade of different modalities, [27,105,173] employing one modality first before leveraging another for the subsequent step.

Figure 8. Multimodal fusion algorithms for action. a) Regrasp planner with multisensor modules for regrasp learning after slip detection based on the fusion of tactile and F/T modalities. Reproduced with permission. [88] Copyright 2020, IEEE. b) Enhanced hierarchical control approach for the PBVS task with collision avoidance based on the fusion of vision, acceleration, force, proximity, and temperature modalities. Reproduced with permission. [170] Copyright 2020, IEEE. c) Overall diagram of the proposed method for a robot LFD task based on the fusion of vision and F/T modalities. Reproduced with permission. [108] Copyright 2021, IEEE. d) The proposed framework with FOS and DDPG for human motion intention recognition and realization based on the fusion of EMG, IMU, and F/T modalities. Reproduced with permission. [172] Copyright 2022, Elsevier Ltd. All rights reserved. e) Whole-body reactive hierarchical force control framework for force control based on the fusion of vision, acceleration, force, proximity, temperature, and F/T modalities. Reproduced under the terms of the Creative Commons Attribution License (CC-BY). [106] Copyright 2021, The Authors, published by Wiley-VCH.

Other papers employ redundant modalities, allowing the human operator to select a preferred channel for interaction from several available modalities. For instance, Fogli et al. [96] proposed a collaborative robot that gives human operators the option to program the robot through either natural-language chat or a block-based interface. However, the single interaction channel per unit of time and the manual modality switching can render the overall interaction somewhat unnatural.
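The pattern of redundant, manually switched input channels can be sketched as a simple dispatcher; the interface and command format below are hypothetical and only illustrate the idea, not Fogli et al.'s system.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class RobotCommand:
    action: str
    args: dict

class RedundantModalityInterface:
    """One command channel is active at a time; the operator switches manually."""
    def __init__(self, parsers: Dict[str, Callable[[str], RobotCommand]]):
        self.parsers = parsers
        self.active = next(iter(parsers))  # default to the first registered channel

    def switch_to(self, modality: str):
        if modality not in self.parsers:
            raise ValueError(f"unknown modality: {modality}")
        self.active = modality

    def handle(self, raw_input: str) -> RobotCommand:
        return self.parsers[self.active](raw_input)

# Hypothetical parsers for a chat channel and a block-program channel
ui = RedundantModalityInterface({
    "chat":   lambda text: RobotCommand("pick", {"object": text.split()[-1]}),
    "blocks": lambda prog: RobotCommand("run_program", {"source": prog}),
})
ui.switch_to("chat")
cmd = ui.handle("please pick up the bracket")
```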
Overall, there are fewer studies dealing with action intelligence than with perception and cognition intelligence, owing to the complexity of melding hardware and software elements. Yet, as robotics matures and LLMs emerge, embodied intelligence is garnering increasing attention. Within this context, robots with action intelligence capable of executing specific physical tasks promise a rich area for future research (Challenge and Future Direction 7.3).
Moreover, a substantial fraction of tasks focuses on accurately understanding commands and ensuring their consistent and safe execution, with few efforts directed toward robots autonomously exploring unfamiliar objectives and environments. However, robots with proactive interaction capabilities hold great potential to enable bidirectional, self-organized collaboration between humans and robots. This progression could greatly spur the advancement of HSM, suggesting another promising avenue for research (Challenge and Future Direction 7.4).

Multimodal HRI Applications
The algorithms for HRI tasks, using unimodal or multimodal methods, are described in Sections 3, 5.1, and 5.2, and the devices used to collect data from different modalities are covered in Section 4. This section examines applications that combine multiple modalities for HRI, illustrating how various modality combinations are used in specific industrial situations. Figure 9 presents a typical human-centric HRC application scenario based on a combination of several modalities and encapsulates four key tasks. First, within the HRC work zone, the robotic arm and the human jointly perform a complex assembly task supported by the visual and tactile modalities. Second, an AGV moves to the storage rack and searches for materials while interacting with the human operator, a VLN task based on the visual and auditory modalities. Third, the robotic arm mounted on the AGV picks up the required material and transfers it to the human operator, a handover task relying on the visual and tactile modalities. Finally, the human operator is equipped with EMG electrodes for ergonomic analysis based on physiological sensing. Applications with various modality combinations are summarized in Table 10.
Within these applications, the interaction partners are predominantly ground robots, including the robotic arm and the AGV in the workstation. These enable complex tasks such as manipulation, assembly, and navigation, realizing HSM within the smart factory. Human-drone interaction, in contrast, offers great scalability and a comprehensive aerial perspective and could bring substantial advances for HSM, but this avenue has yet to attract much attention owing to its technical and design challenges (Challenge and Future Direction 7.5).

Combination of Two Types of Modalities
Table 10 can be read as a 4 × 4 matrix, where the four rows and four columns correspond to the four modalities, vision, auditory and language, haptics, and physiology, as indicated by the first column and row of Table 10. Each matrix element lists implementations of multimodal HRI in industrial contexts; an application's row denotes its principal modality and its column the ancillary modality. Grouped by principal modality, the entries include the following.

Vision: gesture interpretation for visual navigation in a VR environment; [78] spatial reasoning for robot pickup manipulation in HRC; [169] human activity recognition for safe HRC; [27] hand gesture recognition for robot control, navigation, and HMI; [137] human activity recognition for HRI; [95] vision-and-voice navigation for autonomous agents interacting with humans and the environment; [168] contactless force feedback and gesture tracking for more accurate and efficient human-robot manipulation; [45] human position estimation for safe HRC; [94] 3D object detection in autonomous driving; [194] MR bidirectional communication for HRI; [82] visuo-haptic guidance for the mobile collaborative robotic assistant MOCA; [173] human activity recognition for HRC in noisy environments; [151] audio-visual scene-aware dialog for human-machine conversation; [165] human activity recognition for pHRI and HRC; [150] gestures for HRI and social robots; [130] predicting interactions between objects and the environment via tactile and visual feedback for intelligent robotics; [171] emotion recognition for HCI; [92] bidirectional navigation intent communication for safe HRI; [83] visual-inertial hand motion tracking for HRI and VR/AR applications. [72]

Auditory and language: human emotion recognition for natural HRI; [153,157,158] voice user interface for in-vehicle interaction; [174] audio-based motion generation for HRI; [195] user-oriented programming of collaborative robots; [96] audio-visual speech recognition for industrial-robot HRI and HMI; [56,73] VLN for autonomous agents interacting with humans and the environment. [167]

Haptics: teaching-by-demonstration tasks for HRC; [196] hardness, temperature, and roughness feedback for robot hand control and VR applications (haptic glove); [118] ion-electronic skin providing real-time force directions and strain profiles for various tactile motions (shear, pinch, spread, torsion, etc.); [101] slip detection and regrasp planning of unknown objects for robust robot grasping; [88] a wearable glove for hand pose and sensory input identification for HRI; [117] a multimodal robotic sensing system (M-Bot) for HMI; [105] PBVS with collision avoidance for safe pHRC; [170] robot self-learning of complex manipulation skills (playing Jenga); [89] robotic manipulators learning from demonstrations for industrial processes such as assembly; [108] in-hand pose estimation for robotic assembly. [110]

Physiology: human activity recognition (MR glasses) for hands-free HRI; [197] hand gesture recognition for HMI; [97] a wearable glove for hand pose reconstruction and identification of sensory inputs such as holding force, object temperature, conductivity, material stiffness, and user heart rate; [117] gesture-based control (EMG + EEG) to detect and correct robot mistakes in HRI target-selection tasks; [198] human motion intention recognition for an HRC sawing task; [172] human emotion recognition for HRI and HCI; [125,152,157] lower-limb movement prediction for HRI. [199]

Vision + Another Modality
As Table 10 shows, a significant portion of applications falls in the first row and column. This indicates that, in industrial contexts, HRI predominantly combines vision with another modality, whether as the principal or the ancillary source. Figure 10 summarizes applications that can be realized by integrating visual input with other modalities. 'Vision plus vision' refers to collecting visual data with multiple devices; for instance, using a Kinect camera for full-body pose estimation while employing a Leap Motion sensor for fine hand gesture recognition improves precision and broadens the task spectrum compared with a single-camera setup.
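A minimal sketch of such a two-device setup, assuming hypothetical data structures for the two trackers, is to keep the coarse body pose as the base observation and attach the fine hand pose only when the close-range tracker is confident.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BodyPose:             # e.g., from a Kinect-style full-body tracker
    joints: np.ndarray      # (25, 3) joint positions

@dataclass
class HandPose:             # e.g., from a Leap Motion-style close-range tracker
    fingertips: np.ndarray  # (5, 3) fingertip positions
    confidence: float

def combined_observation(body: BodyPose, hand: HandPose, min_conf: float = 0.5):
    """Use the coarse body pose everywhere; attach the fine hand pose only when
    the close-range tracker reports sufficient confidence."""
    obs = {"body_joints": body.joints}
    if hand.confidence >= min_conf:
        obs["fingertips"] = hand.fingertips
    return obs

obs = combined_observation(BodyPose(np.zeros((25, 3))),
                           HandPose(np.random.rand(5, 3), confidence=0.9))
```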
The pairing of vision with auditory and language data is a well-studied topic in artificial intelligence. These modalities are crucial for HRI tasks including navigation, dialogue, manipulation, and collaboration. Nevertheless, the primary application remains communication and consultation, owing to the inherent difficulty of mediating physical interaction through these modalities. Vision equips the robot with person, object, and environment recognition capabilities, while humans use text or speech (the language and auditory modes) to interact with the robotic system.
The 'vision plus haptics' combination is also prevalent and enables a wider array of physical HRI applications, including robotic grasping, collision avoidance (safe HRC), fine manipulation, assembly, robot self-learning, and learning from demonstration (LFD). Vision handles recognition and prediction tasks, while tactile information aids in controlling and learning the robot's motion trajectory during the interaction.
As for the fusion of vision with physiological sensing, the principal applications are recognition tasks such as human position, gesture, activity, and emotion recognition, as well as hand tracking. These applications typically prioritize the visual modality and supplement it with physiological sensing to compensate for the weaknesses of visual algorithms, such as noise, illumination changes, shadows, and occlusions, thereby improving overall accuracy.

Haptics + Another Modality
As shown in Table 10, combining haptics with other modalities is common in HRI. Figure 11 summarizes popular applications that fuse tactile information with visual and physiological information. Visual algorithms provide broader, macroscopic recognition, while tactile sensors enable fine-grained sensing of aspects such as pressure, object temperature, conductivity, material stiffness, and surface roughness. Applications incorporating tactile modalities therefore improve the precision of robot control and support refined tasks ranging from shearing and pinching to spreading, twisting, and sawing. The fusion of haptics and physiological sensing provides both haptic feedback to the human and human-centered sensory capabilities, features that are indispensable for XR haptic gloves to guarantee an immersive user experience.

Others
The four blocks in the lower-right corner of Table 10 cover applications incorporating the haptic and physiological sensing modalities. 'Haptics plus haptics' refers to deploying multiple tactile sensors on the robot to acquire external data and using diverse actuators to reproduce this information, such as hardness and temperature. Among applications that emulate haptics, one of the most prevalent is the smart glove for HRI. Physiological signals such as EEG and EMG can be used for human gesture recognition and anticipation of human movements to facilitate robot control, while also detecting user emotions and fatigue for ergonomic purposes.
For the auditory and language modalities, use without vision is infrequently reported. This is partly attributable to the vulnerability of speech recognition to environmental noise, and typing text is not always a convenient way for humans to interact physically with robots. Furthermore, there are remarkably few natural HRI scenarios that require natural language communication, tactile perception and feedback, and physiological sensing without visual interaction; consequently, only a couple of such applications appear in Table 10. Wei et al. [97] combined AUS with sEMG to improve the accuracy of gesture recognition, and Jung et al. [174] used tactile information to augment the voice user interface in vehicular applications.

Combination of Three Types of Modalities
Certain studies exploit three modalities. Hong et al. [158] introduced a social robot that combines visual, auditory, and tactile data to enable bidirectional emotional communication with human users. They used a Kinect camera to collect 3D imagery of the human and a microphone for audio recording to jointly discern human emotions, and integrated a 2D camera and tactile sensors on the robot hand to elicit reactive emotional displays. For human emotion recognition, they deployed a Bayesian network for multimodal classification and implemented a two-tiered emotional model, combining a hidden Markov model with a rule-based layer, to generate the robot's emotional expressions.
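As a rough illustration of probabilistic multimodal fusion of this kind (not Hong et al.'s actual Bayesian network), per-modality emotion likelihoods can be combined under a conditional-independence assumption.

```python
import numpy as np

def fuse_emotion_posteriors(prior, likelihoods):
    """Naive Bayes-style fusion: combine per-modality emotion likelihoods
    (e.g., from vision and audio classifiers) with a prior over emotion classes."""
    posterior = prior.copy()
    for lik in likelihoods:   # assume conditional independence given the emotion
        posterior *= lik
    return posterior / posterior.sum()

prior = np.array([0.25, 0.25, 0.25, 0.25])   # e.g., happy, sad, angry, neutral
p_vision = np.array([0.6, 0.1, 0.1, 0.2])    # hypothetical classifier outputs
p_audio = np.array([0.5, 0.2, 0.1, 0.2])
print(fuse_emotion_posteriors(prior, [p_vision, p_audio]))
```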
Lee et al. [72] introduced smart gloves that integrate visual, physiological, and tactile data for VR applications. The gloves incorporate numerous IMUs and passive visual markers and, in tandem with a head-mounted stereo camera, achieve highly accurate hand motion tracking; fingertip cutaneous haptic devices (CHDs) provide haptic feedback. Such haptic smart gloves are poised to enhance remote HRI and enable new interaction formats in manufacturing.
Considerable attention is warranted for novel multimodal interaction applications in manufacturing, such as manufacturing resource allocation based on vision and language. Additional modalities, richer combinations, and innovative interaction formats remain ripe for future exploration (Challenge and Future Direction 7.6).

Challenges and Future Directions
In this section, some challenges and the corresponding future directions for natural HRI based on multimodal technologies will be discussed.

Haptic Feedback-Based XR
The surge in remote work and learning underscores the importance of remote HRI. By enhancing collaboration, eliminating time and space constraints, and boosting productivity, remote HRI combined with XR provides a seamless blend of the virtual and the real for immersive remote communication scenarios. [175] Current research focuses largely on enhancing the visual experience through VR and AR. [77,87] However, immersive remote HRI also requires congruous haptic feedback. As observed in Section 4.5.4, research has been undertaken on feedback such as vibration and force, [70,72,112,116] and several haptic feedback gloves on the market can provide vibration or force feedback, such as HaptX Gloves (https://haptx.com/), TESLAGLOVE (https://teslasuit.io/products/teslaglove/), and MANUS Prime X Haptic VR Gloves (https://www.manusmeta.com/products/prime-x-haptic). Yet thermal feedback, a critical aspect of human tactile sensation, has been relatively overlooked. [115] Future research may hence prioritize interfaces that integrate vibration, force, and thermal feedback within XR and remote HRI (similar to the 5D movie experience). Although such a concept, embodied by products like WEART TouchDIVER (https://www.weart.it/), brings its own challenges, such as actuator decomposition and electronic component interaction, it presents an exciting frontier for creating a more immersive interaction experience.

Nonpredefined Human-AI Interaction
Just as interpersonal conversations are not prescripted, natural HRI should minimize preset elements. Nonetheless, as noted in Section 5.2.2, contemporary HRI involves a certain degree of predefined interaction language. Take gesture recognition-based HRI as an example: most research predefines a set of gestures and their meanings and then controls the robot by recognizing gestures during the interaction, limited to those in the dataset. However, human gestures are circumstance dependent and therefore highly diverse, and natural HRI ought to address this discrepancy. Recent strides have been made in this direction. For example, Wu et al. [78] employed deep reinforcement learning (DRL) for a VLN task in which the agent could interpret natural gestures with unfamiliar semantics and accomplish the navigation task without any predefinitions. Looking ahead, nonpredefined human-AI interaction could broaden to encompass more modalities and multimodal applications, and the advent of LLMs such as ChatGPT is driving rapid progress in this field.

Industrial Embodied Intelligence
The growing power of LLMs calls for their integration into HRI within smart manufacturing, shifting research focus toward robotic action intelligence as a step toward embodied intelligence. As discussed in Section 5.2.3, this area, when combined with multimodal technologies, remains underexplored compared with robotic perception and cognition. This is because equipping robots to execute precise and deliberate actions is a complex challenge, demanding attention to elements such as automation and control algorithms for balance and adaptability, alongside state-of-the-art hardware for energy, speed, and locomotion. Encouragingly, prominent firms have recently pioneered action-intelligent robots. Boston Dynamics' humanoid robot Atlas (https://www.bostondynamics.com/atlas) exemplifies this with its high-level perception and object manipulation, and Tesla's Optimus has demonstrated the accurate execution of diverse robotic operations. These advancements underscore industrial embodied intelligence as a promising future research direction.

Proactive HRC with Autonomic Learning
Proactive HRC centers on cognitive intelligence to establish reciprocal, self-organizing collaboration between humans and robots. [176] In this case, robots, once acquainted with the task objective, deduce their responsibilities and plan how to assist humans in completing the task by interacting and communicating with them. Such anticipatory robots engage in activity prediction, proactive ambiguity resolution, and exploration of novel environments. A key technique for realizing proactive HRC is autonomic learning, in which robots proactively engage with their environment through assorted sensors for self-guided learning, rather than relying on manually collected training datasets. [6] Although research on robots proactively interacting with humans using multisensory inputs to learn collaborative strategies is scarce (as highlighted in Section 5.2.3), Fazeli et al. [89] introduced a robot that employs hierarchical learning for autonomous learning, in which the robot learns to play Jenga by independently exploring the environment and objects using visual and tactile sensors. Proactive HRC and autonomous learning therefore warrant further exploration.

Human-Drone Interaction
Unmanned aerial vehicles (UAVs), or drones, possess unique capabilities that surpass conventional machinery. They facilitate efficient aerial logistics, enhance factory safety through surveillance and anomaly detection, and expedite data collection, processing, and analysis. Recent attention has been drawn to large-scale aerial operations involving drone swarms, such as drones fitted with 3D printing nozzles for unconstrained 3D building printing, as demonstrated by Zhang et al. [177] However, as discussed in Section 6, the use of UAVs in HRI remains limited, primarily because their interaction methods differ from those of conventional robotic arms, AGVs, and humanoid robots, and because of flight safety considerations. These issues compound the complexity of developing human-drone interaction technologies. Future efforts should focus on leveraging the distinct opportunities UAVs offer for HRI; novel aerial robots could be engineered for HRI, such as drones equipped with grippers [178,179] for basic manipulation tasks. Despite challenges spanning mechanical design, mathematical modeling, flight control, edge computing, payload, power consumption, and more, the potential transformative impact on industrial processes makes this an exciting research frontier.

Unified Modalities for Industrial Artificial General Intelligence
Natural human interaction involves modulating communication modes such as speech, gestures, and touch as the situation demands. Yet, as outlined in Section 6.2, contemporary research, which primarily focuses on particular modality-task combinations, lacks the richness of authentic HRI mirroring human intelligence. Future HRI should unify modalities, enabling robots to adjust interaction modalities to situational demands and moving further toward general AI. The emergence of multimodal LLMs promises such a unified-modality future for HRI. Models like OpenAI's GPT-4, [180] Tang et al.'s Composable Diffusion (CoDi), [181] and Girdhar et al.'s IMAGEBIND [182] can align multiple modalities to realize arbitrary inputs and outputs. Incorporating these techniques into industrial robots offers the chance to endow them with the capability of executing several tasks concurrently, moving one step closer to industrial artificial general intelligence. These approaches may radically reshape the industrial landscape, ushering in unprecedented innovation and progress.

Conclusion
To achieve safe, efficient, intelligent, and natural HRI, effective multimodal communication and control methodologies are imperative. This article offers an extensive exploration of the four prevailing modalities in HRI, along with an analysis of multimodal algorithms through a cognitive science lens. It provides an overview of algorithms for each modality, common tasks and limitations in mainstream HRI, and typical interface devices and application areas for these modalities. It then examines multimodal fusion algorithms and their primary applications in HRI for HSM, and concludes by discussing challenges within multimodal HRI and corresponding future directions.
Overall, this article underscores the critical role of multimodal communication and control approaches in fostering natural HRI. It is hoped that the insights provided herein will benefit both scholars and industry professionals in developing multimodal HRI solutions for future human-centric smart manufacturing.

Figure 1. The importance of multimodal technologies for robot cognition intelligence.

Figure 3. Searching process and filtering result. Step 1: search keywords in selected publications and conferences; the search date is 4/12/2022. Step 2: filter out papers that are not related to HRI and collaboration topics (e.g., purely mathematical or biological). Step 3: add related papers from the references of the selected papers.

Figure 4. Preliminary results: a) papers related to multimodal HRI published each year; b) distribution of selected multimodal papers on three cognitive dimensions; c) distribution of articles focusing on a single modality and multiple modalities among the HRI papers published in the last five years (search date: 5 June 2023).

Figure 6. Multimodal fusion algorithms for perception. a,b) Multi-GAT algorithm for human activity recognition based on the fusion of vision and skeleton modalities: (a) the vision modality focuses on RGB and depth information; (b) the skeleton modality focuses on physical sensors' information. Reproduced with permission. [151] Copyright 2021, IEEE. c,d) MMCANet algorithm for hand gesture recognition based on the fusion of sEMG and AUS modalities: (c) sEMG and AUS fusion device; (d) the structure of MMCANet. Reproduced with permission. [97] Copyright 2022, IEEE. e) Multimodal sparse transformer network for human speech recognition based on the fusion of vision and auditory modalities. Reproduced with permission. [73] Copyright 2022, IEEE. f,g) Stacked RBMs for human emotion estimation based on the fusion of vision and EEG modalities: (f) the framework of the proposed approach; (g) the stacked RBM structure. Reproduced with permission. [125] Copyright 2019, IEEE.

Figure 7. Multimodal fusion algorithms for cognition. a,b) MFH algorithm for VQA based on the fusion of vision and text modalities: (a) overall illustration of the VQA task; (b) co-attention network architecture with MFB or MFH for VQA. Reproduced with permission. [161] Copyright 2018, IEEE. c,d) PVD-POMDP algorithm for visual dialog based on the fusion of speech, eye gaze, and gesture modalities: (c) an HRI example; (d) a graphical model of the PVD-POMDP. Reproduced with permission. [82] Copyright 2020, IEEE. e,f) Episodic Transformer algorithm for VLN based on the fusion of vision and text modalities: (e) overall illustration of VLN; (f) episodic Transformer architecture. Reproduced with permission. [167] Copyright 2021, IEEE. g,h) LANG-UNet algorithm for spatial reasoning based on the fusion of text and vision modalities: (g) overall illustration of the spatial reasoning task; (h) the LANG-UNet model architecture. Reproduced with permission. [169] Copyright 2021, IEEE.

Figure 9. A typical human-centric smart manufacturing application scenario based on multimodal HRI.

Figure 10. Various applications that can be achieved by vision + another modality.
Figure 11. Various applications that can be achieved by haptics + another modality.

Table 2. Auditory and language-based technologies.