Persistent Human–Machine Interfaces for Robotic Arm Control Via Gaze and Eye Direction Tracking

Recent advances in sensors and electronics have enabled electrooculogram (EOG) detection systems for capturing eye movements. However, EOG signals are susceptible to the sensor's skin-contact quality, limiting the precise detection of eye angles and gaze. Herein, a two-camera eye-tracking system and a data classification method for persistent human–machine interfaces (HMIs) are introduced. Machine-learning technology is used for continuous real-time classification of gaze and eye directions to precisely control a robotic arm. In addition, a deep-learning algorithm for classifying eye directions is developed, and the pupil center-corneal reflection method of an eye tracker is utilized for gaze tracking. The system uses a supervisory control and data acquisition architecture that can be applied universally to any screen-based HMI task. The study shows that the deep-learning classification algorithm achieves exceptional accuracy (99.99%) with a large number of actions per command (≥64), the highest performance among comparable HMI systems. Real-time control of a robotic arm for playing chess and manipulating dice demonstrates the unique advantages of the precise eye-tracking system. Overall, this paper shows the HMI system's potential for remote control of surgery robots, warehouse systems, and construction tools.

However, EOG signals are susceptible to the skin-contact quality of the sensors and users' small movements, limiting precise eye angle and gaze detection. Thus, the HMIs with EOG shown in previous studies can only perform simple actions, such as unidirectional motions of drones and wheelchairs. These limited control capabilities fall short in areas that require complex movement with a high degree of freedom (DOF) and precision, such as surgical applications (e.g., surgery robots). [6] In addition, gel electrodes are commonly used for high-fidelity recording. However, they have poor breathability, potentially cause skin irritation, and suffer performance degradation during long-term monitoring due to drying. [7] In this regard, the recent development of machine-learning technology with video monitoring systems based on eye trackers has gained increasing attention in various fields, such as autism spectrum disorder diagnosis, [8] facial emotion diagnosis, [9] and surgical robot support. [10,11] However, most commercial eye trackers also have limitations: 1) they only track gaze and lack control functions; 2) even though some eye trackers have added control functions, they are expensive; 3) commercial software offers control functions for commercial eye trackers, but these functions require complicated eye movements for HMI applications and can even cause extreme eye fatigue; and 4) camera-based image analysis is heavily influenced by environmental lighting conditions. Since our work is a proof-of-concept model under controlled experimental settings with controllable lighting conditions, we used an eye tracker based on camera image analysis.
Here, we introduce a two-camera eye-tracking system (TCES) that records eye movements and demonstrate its visual recognition capability and its application to robotic arm control. We have adopted 1) machine-learning technology, a convolutional neural network, for detecting eye directions with a webcam; 2) the pupil center-corneal reflection (PCCR) method for gaze tracking with a commercial eye tracker; and 3) a single-module platform including HMI/supervisory-level system control and data acquisition. As a result, the TCES can track the user's gaze and eye directions at low cost and offers precise HMI control with simple eye movements. The TCES shows potential for use in various applications, such as medical devices, surgery robots, and remote heavy-equipment controllers.

Overview of an HMI System
In this section, we elaborate on the details of the techniques used for TCES and their applications in various fields.

Figure 1. Overview of a human-machine interface (HMI) system using a screen-based hands-free eye tracker. A) Photos of a subject using the eye-tracking interface to control a robotic arm; a screen-based hands-free system (left) and a frontal photo with a webcam (right). B) Flowchart showing the sequence from data recording (eye movements) with two devices (webcam and eye tracker) to robotic arm control. C) Schematic illustration capturing possible implementation examples of the HMI system with eye tracking.
Figure 1A shows the overview of TCES, composed of a webcam and a commercial eye tracker with an embedded convolutional neural network (CNN) model to monitor eye movements (eye directions and gaze). A CNN model has been employed to detect eye directions because of its excellent performance in dealing with image data, such as in computer vision. [12] Hundreds of different eye image inputs are used with this CNN model to classify eye directions (up, blink, left, and right). In addition, the image data of the pupil acquired through the commercial eye tracker are used to track the eye's gaze and act as a trigger for the all-in-one interface. The classified eye directions send commands to the robotic arm to perform various tasks (up: move, blink: stop, left: grip, and right: release), as shown in Figure 1B. Previous studies using screen-based eye trackers could control only 6-9 limited actions with eye movements. [11,13] Additional controls often rely on other gestures, such as the user's hands or complex eye movements. TCES, with the all-in-one interface, enables control of 32 grids through one eye movement with two versions of actions (grip and release) per grid. More grids can be created within the operating range of the robot arm without control issues.

Figure 1C shows the potential applications: 1) People with disabilities who cannot move their hands can benefit from TCES, which only requires eye movements to perform tasks such as calling a doctor/nurse or controlling the medical bed. 2) An endoscopic upper-airway surgery system for airway obstruction, recurrent hemolysis, and severe granuloma formation can also benefit; small-diameter endoscopic equipment for infants and experienced professionals are urgently needed because foreign substances can be removed only under camera guidance. A robot-assisted endoscopic upper-airway surgery system combined with our system enables rapid and efficient surgery, and solo surgery is possible without an assistant. 3) TCES can help people work remotely at dangerous construction sites or warehouses, preventing exposure to hazards such as controlling heavy equipment, as many construction workers are often exposed to unsafe work environments. Our target is to automate box-moving tasks in warehouses and distribution centers. A robot arm with our system can work in workplaces where repetitive box lifting is required, such as unloading trucks, building pallets of boxes, and order building, making warehouse operations more efficient and safer for workers.

Figure 2 summarizes the overview of the data-driven image classification process using our CNN model. We have prepared hundreds of different image inputs, which were used to develop a CNN classifier for four eye directions (up, blink, left, and right). Experimentation and model selection determined the optimal range of parameters and hyperparameters. The structure and parameters were determined based on several factors, including layers, convolution filters, stride, pooling, and activation functions. The images were split into a training set (80%) and a test set (20%). Figure 2A illustrates the CNN architecture and the details of the classifier development process. First, the CNN model, featuring layers of 2D convolutions, consists of three hidden layers. The 2D-convolutional layers use a (3,3) kernel size with ten filters, followed by 2D max pooling with a (2,2) pool size.
After batch normalization to prevent overfitting, the sequence of three single convolutional cells uses (3,3) kernels with 32, 64, and 128 filters, each followed by 2D max pooling with a (2,2) pool size. Then, the output of the last convolutional layer is flattened and passed to a fully connected layer. Lastly, the model uses rectified linear unit (ReLU) activations and softmax for output classification. The overall real-time eye direction classification process is shown in Figure 2B. Moreover, deep neural networks are prone to overfitting because they contain many parameters, so we applied the ImageDataGenerator function in TensorFlow to enhance accuracy and mitigate overfitting in our classification model. The ImageDataGenerator increases the diversity of the training data by applying random rotations, width and height shifts, and shear transformations to the input images. As shown in Figure S1, Supporting Information, eye image data obtained from the webcam capture the user's eye through the face detection function (face landmark detection using the CMake, Dlib, and OpenCV libraries). The captured eye image data are scaled to 34 × 26 pixels and then converted to grayscale. The CNN model classifies the scaled images into four eye directions (up, blink, right, and left). For the demonstration, various eye movements are detected with classification scores ranging from 0 to 1 (Figure 2C). Figure 2D presents the confusion matrix with an accuracy of 99.99% for the four classification classes (up, blink, left, and right). Table 1 captures the advantages of TCES and its superior classification performance compared to existing technologies introduced in prior studies.
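As a concrete illustration of this pipeline, the following is a minimal sketch in TensorFlow/Keras, assuming a directory of labeled grayscale eye crops. The layer sequence mirrors the description above, while the exact filter counts, dense-layer size, directory name, and training settings are illustrative assumptions rather than the authors' reported configuration.

```python
# Minimal sketch of an eye-direction CNN with augmentation, assuming TensorFlow/Keras.
# Hyperparameters and the "eye_images/" directory are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SHAPE = (26, 34, 1)                 # grayscale eye crops (height, width, channels)
CLASSES = ["up", "blink", "left", "right"]

def build_model():
    model = models.Sequential([
        layers.Input(shape=IMG_SHAPE),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(len(CLASSES), activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Augmentation to reduce overfitting: random rotation, shifts, and shear,
# with an 80/20 train/validation split of the labeled eye images.
datagen = ImageDataGenerator(rescale=1.0 / 255,
                             rotation_range=10,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             shear_range=0.1,
                             validation_split=0.2)

train_gen = datagen.flow_from_directory("eye_images/", target_size=IMG_SHAPE[:2],
                                        color_mode="grayscale", classes=CLASSES,
                                        class_mode="categorical", subset="training")
val_gen = datagen.flow_from_directory("eye_images/", target_size=IMG_SHAPE[:2],
                                      color_mode="grayscale", classes=CLASSES,
                                      class_mode="categorical", subset="validation")

model = build_model()
model.fit(train_gen, validation_data=val_gen, epochs=30)
```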

Eye Tracking and Principles of Operation
Eye gaze discloses users' intentions and attention. Naturally, the human eye gaze focuses on objects related to the task at hand, reveals responses, and predicts future behaviors. [14] Eye-gaze-tracking solutions use PCCR and time-to-first-fixation (TTFF) algorithms to find specific details in human eyes. PCCR delivers eye movement input to the commercial eye tracker by computing the light reflected from the center of the pupil and the cornea. [14] TTFF calculates the timing of the eye stimulus and finds the area of interest (AOI) by computing fixation time information. Figure 3A,B shows that eye movement is analyzed using infrared light. The infrared light directly enters the pupil, reflects from the iris with a clear reflection, and renders the boundary of the pupil. [14] An embedded optical sensor and two cameras in the Tobii eye tracker utilize PCCR and TTFF to generate an aggregated gaze plot on the screen. PCCR has been used as the primary eye-tracking method in this Tobii eye tracker, and data processed through TTFF have been used for better accuracy. Eye-tracking accuracy relies on analyzing the pupil's contour lines with corneal reflection. The Tobii eye tracker in this study uses two PCCR illumination techniques: bright pupil eye tracking and dark pupil eye tracking, as shown in Figure 3C. The Tobii eye tracker computes both techniques for accuracy purposes and analyzes gaze comprehensively. Bright pupil eye tracking captures bright glints to identify the pupil in eye images when the light source is aligned with the camera's optical axis. [15] Dark pupil eye tracking follows a similar process; for pupil detection, dark glints are captured in eye images from an indirect (off-axis) light source. [15] The Tobii eye tracker applies and complements both technologies to track with high accuracy, even under unpredictable light sources. TCES can estimate the eye position accurately and convert it to Cartesian coordinates (Figure 3D,E). Figure 3F shows an overview of task activation through the user's simple actions, such as "blinking" and "looking up". For the preliminary test, we designed the interface to play chess (Figure 3F and Video S1, Supporting Information). Playing chess is a complex task that continuously involves selections and movements of pieces. Eye movements were classified into three control sequences, as shown in Figure 3G, to reflect the user's intentions. Upon the user's "blinking" motion, the gaze point data are converted to position data and trigger the interface. We found that our interface was able to use combined inputs of eye directions and gaze to move the chess pieces in the multigrid setting. A typical delay time in the control of the chess game is summarized in Figure S2, Supporting Information.
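The text does not specify how a gaze point is mapped to a board square; as a rough illustration of the principle (a blink confirms the square currently under the gaze), the following sketch converts normalized gaze coordinates to grid indices. The grid layout, GazeSample type, and select_square callback are assumptions for illustration only, not the authors' implementation.

```python
# Rough sketch: map a normalized gaze point to a grid cell and use a detected
# blink as the selection trigger. All names and constants are illustrative assumptions.
from dataclasses import dataclass

GRID_COLS, GRID_ROWS = 8, 4   # e.g., a 32-cell (8 x 4) on-screen grid

@dataclass
class GazeSample:
    x: float  # normalized gaze coordinates in [0, 1], as reported by the eye tracker
    y: float

def gaze_to_cell(sample: GazeSample) -> tuple[int, int]:
    """Map a normalized gaze point to the (column, row) indices of the on-screen grid."""
    col = min(int(sample.x * GRID_COLS), GRID_COLS - 1)
    row = min(int(sample.y * GRID_ROWS), GRID_ROWS - 1)
    return col, row

def on_eye_event(sample: GazeSample, direction: str, select_square) -> None:
    """Select the square under the gaze when the classified eye direction is a blink."""
    if direction == "blink":
        select_square(*gaze_to_cell(sample))

# Example: a gaze slightly right of and below screen center, confirmed by a blink.
on_eye_event(GazeSample(0.55, 0.60), "blink",
             lambda c, r: print(f"selected grid cell ({c}, {r})"))
```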

HMI Applications
In HMI applications, many control scenarios involve complex, multitasking operations. [16] This study introduces TCES with an all-in-one interface that can control a fully commercialized robotic arm. This all-in-one interface presents a hybrid integration of eye movement detection, eye tracking, and a robotics control system that can meet various needs, including health applications, surgery robots, and remote heavy-equipment controllers. A graphical user interface (GUI) synchronously controls the robot with eye and gaze movement and completes specific tasks according to the user's intentions. The study introduces the all-in-one interface compatible with a remotely controlled robotic arm and a computer-based GUI. An overview of the developed system is illustrated in Figure 4A. We demonstrate a robust interface by integrating two HMI software models (supervisory control and data acquisition (SCADA) and machine-level embedded systems) with the industrial control panel (LynxMotion Programmable Logic Controller, PLC) robotic system. Eye movements are divided into two categories: eye direction and eye gaze. [14] The all-in-one interface classifies and tracks eye information to capture the user's intention. The user's eye direction from the webcam is classified in real time, and the result triggers the robot's movement decisions. The commercial eye tracker reports the user's detected gaze in real time.
We have demonstrated a 32-grid interface that reflects the location of gaze on the grid. The interface can also support a higher-resolution grid, depending on the eye-tracker resolution and the purpose and needs of the user. Figure 4B shows both the eye gaze and the classified eye direction displayed on the screen in real time.
Combining the two primary input variables commands the robot's motion control. The fused data are synchronized using a built-in generic transistor-transistor logic (TTL) signal embedded in the control hardware for robot arm control. These features enable real-time visualization input with only eye gaze. The system can expand behavior or trigger signals into multiple custom applications. Figure 4C shows the two primary input data streams fed into a microcontroller chip inside the circuit system. The interconnected servomotors operate the robot arm with a triggered control signal from the microcontroller. The robot moves to a specific location and performs a specific behavior. The Lynxmotion PLC control panel handles machine-level robot control under the central computer's command. The robot's arm and grabber operation was designed to take the shortest path to the final position. This all-in-one interface between the end devices and the robot allows a remote user to view and control the system in front of a screen. The SCADA system in this study processes real-time image and optical data to interpret eye information (gaze and direction) from the two camera-embedded end devices operated by a central computer. A custom GUI (Python- and C-based) is used as the master central SCADA system that triggers the commercial eye-tracker software (Tobii Experience) and controls the Lynxmotion PLC. The central computer runs the custom GUI, back end, network, and computation. This study shows that the TCES operates and performs according to the user's intention, expressed through eye movements and eye gaze. The eye-tracking system detects eye movements via cameras, which are used to control the robotic arm. Using only eye images, a user can remotely conduct complex, multiple tasks in front of a screen showing the objects (dice and robotic arm). The user's eye direction and eye gaze inputs express the user's intention and perform a sequence of tasks. [16] For a simple and intuitive demonstration, the dice are thrown randomly to produce a random location, as shown in Figure 5A and Video S2, Supporting Information. The user identifies the random number on the dice, which determines the target location, through the screen. This random dice number ensures that the task reflects the user's free will.
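Before describing the dice demonstration in detail, the following rough sketch illustrates how such a supervisory loop might fuse the two inputs (classified eye direction and gazed grid cell) into robot commands. The serial port, command strings, and polling functions are hypothetical placeholders, not the authors' actual SCADA implementation.

```python
# Sketch of a supervisory loop fusing classified eye direction and gaze grid cell
# into robot commands sent over a serial link. All names here are assumptions.
import serial  # pyserial

ACTIONS = {"up": "MOVE", "blink": "STOP", "left": "GRIP", "right": "RELEASE"}

def control_loop(get_eye_direction, get_gaze_cell, port="/dev/ttyUSB0"):
    """Poll the two input streams and forward fused commands to the robot controller."""
    with serial.Serial(port, baudrate=9600, timeout=1) as link:
        while True:
            direction = get_eye_direction()   # e.g., "up", "blink", "left", "right"
            col, row = get_gaze_cell()        # (column, row) of the gazed grid cell
            if direction in ACTIONS:
                command = f"{ACTIONS[direction]} {col} {row}\r"
                link.write(command.encode("ascii"))
```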
Using TCES with the all-in-one interface, the user can move the dice from a random location to the location corresponding to the dice number. Each eye direction triggers a mode: double click (up), stop (blink), grab (left), and release (right), as shown in Figure 5B. To distinguish different eye movements, the all-in-one interface triggers direction commands only after a continuous 1 s input, so that the user's intention is classified clearly. Prior literature on eye blinking shows that an adult human blinks about every 5 s, with each blink taking about one-third of a second. [17] To distinguish spontaneous "blinking" motions from the user's intentions, the TCES captures four combined "blink" classifications within a 1 s window to detect intentional blinks accurately. Accordingly, even if the user moves their eyes freely, the robot arm does not make mistakes. The eye direction commands used in the demonstration are shown in Figure S3, Supporting Information. Through the demonstration, TCES proves the feasibility of simulating real-world scenarios for medical applications, surgery robots, and remote heavy-equipment controllers.
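The exact debouncing logic is not detailed in the text; the following is a minimal sketch of one way such an intent window could be implemented, in which a command fires only when the same classification recurs within 1 s. The class name, frame-count threshold, and timing constants are assumptions.

```python
# Sketch of an intent window: a command fires only when the same per-frame
# classification is observed repeatedly within 1 s, so spontaneous blinks and
# free eye movement do not trigger the robot. Names and thresholds are assumptions.
import time
from collections import deque
from typing import Optional

class IntentFilter:
    """Confirm a label only when it recurs within a short time window."""

    def __init__(self, window_s: float = 1.0, required_count: int = 4):
        self.window_s = window_s
        self.required_count = required_count
        self.history = deque()            # (timestamp, label) pairs

    def update(self, label: str) -> Optional[str]:
        """Add one per-frame classification; return the label once it is confirmed."""
        now = time.monotonic()
        self.history.append((now, label))
        # Drop classifications that fall outside the window.
        while self.history and now - self.history[0][0] > self.window_s:
            self.history.popleft()
        if sum(1 for _, seen in self.history if seen == label) >= self.required_count:
            self.history.clear()          # avoid re-triggering on the same gesture
            return label
        return None

# Example: only the fourth "blink" frame within 1 s confirms an intentional blink.
intent = IntentFilter()
confirmed = None
for frame_label in ["blink", "blink", "blink", "blink"]:
    confirmed = intent.update(frame_label)
print(confirmed)  # -> "blink"
```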

Conclusion
The presented research in this work introduces a TCES that enables continuous real-time classification of eye movements and control of robotic systems with the aid of an embedded CNN model. The combination of a webcam and a commercial eye tracker based on the PCCR method, in conjunction with a deep-learning CNN, allows highly accurate classification of four eye movement classes (up, blink, left, and right).
The significance of TCES is that it can track gaze and eye direction, which overcomes the limitations of conventional EOG monitoring systems using skin-mounted sensors. Furthermore, with only a low-cost commercial eye tracker and a webcam, the designed all-in-one interface can control the robotic arm hands-free with high DOF, requiring no other input actions such as the user's hands. The TCES presented in this study shows broad applicability to the further development of eye-tracking systems for remote control of surgery systems, construction devices, and warehouse systems. Future work will develop a video camera system to detect eye motions and overcome the limitations of classifying eye images. Such a motion detector would offer more commands with high accuracy, such as turning the eyes clockwise or lowering them from up to down.

Experimental Section
Experimental Setup: The central computer in this work ran the CNN on both Windows (Intel 7th Gen CPU + Nvidia GTX 1080ti) and macOS (M1 chip, Apple) systems. We prepared two kinds of cameras for the experimental setup. One was a webcam combined with a CNN model that could detect the four eye directions (up, blink, left, and right), and the other was a commercial eye tracker that could track the eye gaze at a low price. The commercial eye tracker was located at the bottom of the monitor to track the user's head and pupils, and the user had to remain within the sensor's field of view. The operating distance of the commercial eye tracker was around 85 cm. The webcam was installed on the monitor to record a frontal shot of the user's face. The TCES we designed was operated in a sufficiently controlled experimental environment; the ambient light level was around 400 lux. Eye movement was detected from the user's face. Eye movements captured by the eye tracker and webcam controlled the all-in-one interface we developed on the desktop. The signal from the all-in-one interface controlled the robotic arm via a wired or Bluetooth wireless connection.
Face Detection: We used an HD 1080P autofocus webcam (Wansview Technology Co., Shenzhen, China) to record a person's face and used the Dlib library to detect faces and eyes. Any webcam that records a person's face could be used to detect the face and establish landmarks based on feature positions. The Dlib library included face position and facial landmark detection. Dlib face detection used the histogram of oriented gradients (HOG) method, and facial landmark detection followed Kazemi's model.
The face was annotated with a set of 68 landmarks. Figure S4 (left), Supporting Information, shows the positions of the 68 points identified on the face. Dlib included a pre-built model for face landmark detection called shape_predictor_68_face_landmarks.dat. The left eye (36-41) and the right eye (42-47) were numbered as shown in Figure S4 (right), Supporting Information.
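For reference, a minimal sketch of this eye-cropping step using Dlib and OpenCV is given below. The landmark indices follow the numbering above, while the margin, output crop size, and function name are illustrative assumptions.

```python
# Minimal sketch of cropping the eye region with Dlib's 68-point landmarks, assuming
# OpenCV and the publicly available shape_predictor_68_face_landmarks.dat model.
import cv2
import dlib
import numpy as np
from typing import Optional

detector = dlib.get_frontal_face_detector()      # HOG-based face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_eye(frame_bgr: np.ndarray, indices=range(36, 42),
             size=(34, 26)) -> Optional[np.ndarray]:
    """Return a grayscale, resized eye crop from the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in indices],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    margin = 5                                    # small margin around the eye landmarks
    eye = gray[max(y - margin, 0): y + h + margin,
               max(x - margin, 0): x + w + margin]
    return cv2.resize(eye, size)                  # (width, height) for cv2.resize
```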
Mechanical Specifications of an Eye Tracker: The Tobii Eye Tracker 5 is a commercial device that tracks eye location on a computer screen. [18] Eye tracking has been widely used to translate and analyze the visual attention of users. Practical eye tracking was based on fixation and gaze points revealing visual attention. The basic concept was to use a light source that produces strong reflections in both eyes, capture images of the eyes to identify those reflections, and calculate the gaze direction from geometric information such as the angle between the corneal and pupil reflections. As shown in Figure S5, Supporting Information, the eye tracker (Tobii) had one camera and four light sources that emitted near-infrared light.
Configuration of the Motor Servo on the Robot Arm Frame: In the robot arm system for the demonstration, the hardware components were integrated with a Lynxmotion AL5D robot arm (RobotShop Inc., Mirabel, QC, Canada), an SSC-32U servo controller, and five different types of Hitec servomotors (HS-422, HS-485HB, HS-645MG, HS-755HB, and HS-805BB). The five HS servomotors were connected to five different channels of the SSC-32U servo controller, as shown in Figure S6, Supporting Information. Each channel had three inputs: a pulse width modulation pin, a voltage common collector pin, and a ground pin. Each input had to be connected to the corresponding HS servomotor in order. [19] This robot arm has 5 DOFs. Detailed specifications of the robot can be found in Figure S7, Supporting Information. The SSC-32U servo controller was a dedicated robot controller board with USB and serial inputs and an ATmega328p chip that controlled up to 32 servomotors, as shown in Figure S8, Supporting Information.
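For illustration, the SSC-32U accepts simple ASCII position commands over its serial port; the following sketch (using pyserial) shows one way the central computer could drive individual servo channels. The port name, channel assignments, and pulse widths are assumptions for illustration and do not reproduce the authors' control code.

```python
# Sketch of driving the SSC-32U over its documented serial command format
# "#<channel>P<pulse width in microseconds>T<time in ms>". Port, channels,
# and pulse widths below are illustrative assumptions.
import serial  # pyserial

def move_servo(link: serial.Serial, channel: int, pulse_us: int, time_ms: int = 500) -> None:
    """Command one servo channel to a target pulse width (typically 500-2500 microseconds)."""
    link.write(f"#{channel}P{pulse_us}T{time_ms}\r".encode("ascii"))

with serial.Serial("/dev/ttyUSB0", baudrate=9600, timeout=1) as link:
    move_servo(link, channel=0, pulse_us=1500)   # e.g., base servo to its center position
    move_servo(link, channel=4, pulse_us=1900)   # e.g., close the gripper
```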
Human Subject Study: The study involved healthy volunteers aged between 18 and 40 and was conducted following the approved Institutional Review Board (IRB) protocol (#H22479) at Georgia Institute of Technology. In addition, written informed consent was provided by all volunteers.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.