Three‐dimensional posture estimation of robot forceps using endoscope with convolutional neural network

Abstract Background In recent years, there have been significant developments in surgical robots. Image-based sensing of surgical instruments, without the use of electric sensors, is preferred for easily washable robots. Methods We propose a method to estimate the three-dimensional posture of the forceps tip by using an endoscopic image. A convolutional neural network (CNN) receives an image of the tracked markers attached to the forceps as an input and outputs the posture of the forceps. Results The posture estimation results showed that the posture estimated from the image followed that measured by the electrical sensor. The external force calculated from the estimated posture also followed the measured values. Conclusion The method of estimating the forceps posture from the image using a CNN is effective. The mean absolute error of the estimated external force is smaller than the human detection limit.


| INTRODUCTION
Surgical robots have been developed to support minimally invasive surgeries. A minimally invasive surgery is an operation in which instruments such as endoscopes or forceps are inserted into the abdominal cavity through small ports. It offers patients the benefits of smaller scars, faster recovery, and fewer complications, as compared to conventional open surgeries. However, such operations are complicated by the narrow field of view of the endoscope, the pivot motion of the forceps centered on the insertion point, and the lack of tactile feedback. 1 The surgical robot da Vinci solved these problems through a master-slave type teleoperation. The slave manipulator in the patient's body follows the movement of the master device, which is operated by the doctor. Da Vinci is also capable of reducing hand tremors and adjusting motion scaling, which enables it to perform complex operations. Currently, da Vinci is being used for operations on the abdomen, pelvis, and chest as a surgery support robot to alleviate the burden on the operator. 2 Da Vinci and a majority of other robots are equipped with sensors that facilitate precise positioning. However, the presence of many electrical elements around the surgical robot can damage the sensor system. An electric knife in contact with robotic instruments may induce a large current to flow near them, thereby damaging the sensors or causing excess sensor noise. 3 An effective approach to address this issue is to provide alternatives to the sensors, such as estimating the posture of the forceps via endoscopic images. Posture estimation via images also facilitates easier washing.
In the field of computer-aided intervention, there are many studies that segment the forceps region from images. Attaching a marker to the forceps and extracting the forceps region is the easiest method for such estimations. 4,5 However, when using a single marker, it is difficult to estimate the posture of a forceps with joints. Segmentation methods employing deep learning can extract the entire region of forceps without markers in the endoscopic image. 6,7 However, when the forceps tip is hidden behind an organ, a different tracking image is obtained from the same posture.
Allan et al estimated the posture of a surgical instrument from a camera image using a 3D model. 8 A random forest was used to stochastically classify the pixels of the endoscope image into surgical instruments and organs. The 3D posture was then restored from the segmentation image and low-level optical flow. However, this method required 1-20 seconds for classification and posture estimation, and it was confirmed that the error from the previous frame gradually increased.
Tanaka et al estimated the posture of a surgical instrument in real time. 9 To estimate the posture of a surgical instrument, they used a database of projected contour images of a 3D model created in advance. Real-time estimation was realized by using a high-performance computer. However, when estimating the posture of an instrument with a joint, many images were needed for the database, which can impair real-time estimation.
Du et al constructed a convolutional neural network (CNN) to estimate the 2D posture of a surgical instrument from semantically segmented images. 10 However, this CNN only estimated 2D postures; 3D posture estimation via CNNs has not been verified.
In this study, we propose a system to estimate the 3D posture of a surgical instrument by using a CNN, without the use of a position sensor. The system performs instrument tracking by using markers prior to inputting the image in the CNN such that any background data unrelated to the instrument posture is removed. This combination of the CNN and marker tracking enables posture estimation in unknown environments; the data set acquired in a dry environment can be used in vivo. The CNN outputs the estimated instrument posture, which is obtained from the position sensor in the learning phase. Moreover, it is possible to estimate the external force acting on the tip of the instrument by using the estimated instrument posture and a backdrivable pneumatic actuation system. The proposed method does not track the image temporally, and the error from the previous frame does not increase. Furthermore, it can estimate forceps posture from a large number of image databases at a constant rate because the computational speed of a CNN depends only on the structure of the CNN. This article is organized as follows. The robot system used in this paper is described in Section 2, and the method of posture estimation via CNN is described in Section 3. The experimental results are presented in Section 4, and the results are discussed in Section 5. Finally, the conclusions of this study are presented in Section 6.

| SURGICAL ROBOT SYSTEM
This section describes the system components used in this study. The proposed system consists of a master device, a slave manipulator, and an endoscope.

| Slave manipulator
The slave manipulator used in this study is shown in Figure 1. The slave manipulator consists of a holder robot with four degrees of freedom (DOFs) and a forceps with three DOFs. 11,12 The holder robot has three rotational joints q1 (yaw), q2 (pitch), and q4 (roll), and a linear joint q3. The direction of the motion of each axis is defined to be positive when it moves along the arrow shown in Figure 1A. The holder robot is designed to pivot around the point O shown in Figure 1A. The forceps has a 2-DOF tip bending (ϕ1 and ϕ2) and a grasping mechanism. The driving mechanism of the forceps is illustrated in Figure 2. The push-pull operation of the nickel titanium wires attached to the pneumatic cylinders causes the flexible joint of the forceps tip to bend. The four pneumatic cylinders are arranged at equal intervals.
The control block diagram of the forceps is presented in Figure 3.
where f_ext is the external force, J_p is the 3-DOF Jacobian matrix, and (J_p^T)^+ is a generalized inverse of J_p^T. J_q is the Jacobian matrix from the joint angular velocity q̇ to the cylinder velocity Ẋ. Both (J_p^T)^+ and J_q are functions of the posture q. Therefore, Equation (1) shows that the external force is estimated using the forceps posture, its differential value, and the driving force. Haraguchi et al verified the accuracy of this external force estimation. 12 In this study, we use the same force estimation algorithm as that described in Reference 12.
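Ignoring the dynamic and friction terms, the quasi-static part of this estimation can be sketched as follows. This is our simplification for illustration, not the authors' implementation; the function name and the omission of dynamics are assumptions:

```python
import numpy as np

def estimate_external_force(J_p, J_q, f_drive):
    """Quasi-static sketch of the force estimation in Equation (1).

    J_p     : 3x3 Jacobian from joint velocities to tip velocity (posture-dependent)
    J_q     : Jacobian from joint velocities to cylinder velocities (posture-dependent)
    f_drive : driving forces of the pneumatic cylinders

    The cylinder driving forces are mapped to joint torques through J_q^T,
    and the joint torques are mapped to a tip force through a generalized
    inverse of J_p^T. Dynamic and friction terms of the real estimator are
    omitted here.
    """
    tau = J_q.T @ f_drive                  # joint torques from cylinder forces
    f_ext = np.linalg.pinv(J_p.T) @ tau    # (J_p^T)^+ tau
    return f_ext
```

Because both Jacobians depend on the posture q, an error in the estimated posture propagates directly into the estimated force, which is the coupling discussed in Section 5.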

| Master device
Sensable's PHANTOM Desktop, shown in Figure

| Endoscope
The endoscope used in this study (ENDOEYE FLEX 3D, Olympus) is a stereo camera. However, we used only one camera because similar images can be obtained from both cameras. In this study, we obtained 680 × 540 pixel images before applying the proposed image processing algorithm. The operator observes the 3D image on the display when teleoperating the slave robot from the master.

| THE PROPOSED METHOD
The proposed method combines traditional marker tracking programming and a CNN.

| Forceps tracking
During posture estimation, the background area, excluding the forceps, is redundant. Therefore, removing this background information may increase the robustness of the machine learning in an unknown environment.
In this study, marker-based forceps tracking is performed to remove the background information, followed by a CNN-based posture estimation. The tracking image helps to accelerate training convergence 13 and enables posture estimation without the influence of the background areas. Another advantage of the marker-based method is that it does not require the endoscope to observe the entire instrument; posture estimation is possible as long as the markers are visible. The image output of each process is shown in Figure 4. By performing the aforementioned processing, the influence of the variation in the distance between the endoscope and the forceps can be reduced, and a similar image can be obtained from the same forceps posture. In other words, when creating a training data set, it is unnecessary to create a separate data set for each distance between the endoscope and the forceps. Therefore, the total number of training data sets is reduced.
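A minimal sketch of this kind of tracking-and-normalization pipeline is given below. The threshold, output size, and segmentation step are assumptions for illustration; the authors' actual marker detector is not specified in this section:

```python
import numpy as np

def track_and_normalize(image, thresh=128, out_size=(64, 64)):
    """Sketch of the marker-tracking preprocessing (assumed pipeline):
    1) binarize to find marker pixels,
    2) crop the bounding box of the detected markers (removes background),
    3) resample to a fixed size so that the endoscope-forceps distance
       no longer changes the apparent scale of the markers."""
    mask = image > thresh                      # crude marker segmentation
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return np.zeros(out_size)              # no marker detected
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(float)
    # nearest-neighbour resize to the fixed CNN input size
    ry = np.linspace(0, crop.shape[0] - 1, out_size[0]).astype(int)
    rx = np.linspace(0, crop.shape[1] - 1, out_size[1]).astype(int)
    return crop[np.ix_(ry, rx)]
```

The fixed-size crop is what makes a separate data set per endoscope-forceps distance unnecessary: two images of the same posture taken at different distances map to nearly the same input.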

| CNN construction
We used CNNs to estimate the forceps posture from the tracking image of the marker. The CNN used in this study is shown in Figure 5.
Each CNN is composed of N convolutional layers and one fully connected layer. The tracking image acts as the input to the CNN, and the CNN outputs the forceps posture variable ϕ1 or ϕ2. We use two independent CNNs to estimate ϕ1 and ϕ2. The kernel and output shape of each layer are listed in Table 1.
The kernel size of the convolutional layers was empirically determined to improve estimation accuracy. To obtain the posture as an output, each convolutional layer is activated using the ReLU function, and the fully connected layer uses a linear activation. 14
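The structure above can be sketched as a forward pass in NumPy. This is illustrative only; the single-channel simplification, kernel values, and absence of pooling are assumptions, not the configuration in Table 1:

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2-D convolution of a single-channel image - minimal, unoptimized."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def forward(image, conv_kernels, fc_w, fc_b):
    """Sketch of the regressor: N convolutional layers with ReLU activation,
    then one fully connected layer with linear activation that outputs a
    single angle (phi1 or phi2). Two such networks run independently."""
    h = image
    for k in conv_kernels:
        h = np.maximum(conv2d(h, k), 0.0)       # convolution + ReLU
    return float(fc_w @ h.ravel() + fc_b)       # linear output layer
```

The linear output layer is what turns the network into a regressor: unlike a softmax classifier, its range is unbounded, so it can emit a continuous joint angle.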

| Training data
Endoscopic surgery employing a surgical robot is controlled using a master-slave method. However, creating the training data set from a real operation employing a master-slave method results in a nonuniform distribution of the forceps posture in the data set. Therefore, we assigned the slave robot discrete target angles and created a uniformly distributed training data set. The layout of the devices when creating the training data set is shown in Figure 6. The coordinate system of the forceps is represented by solid lines, and the initial coordinate system of the forceps is represented by dashed lines. In this study, we defined the z, ϕ1, and ϕ2 axes of the initial coordinate system to be parallel to the xc, yc, and zc axes of the endoscope coordinate system, respectively.
Additionally, we used one DOF of the bending motion ϕ1 and the roll q4 of the holder robot to create training data equivalent to the 2-DOF bending motion of ϕ1 and ϕ2. The motion range of the slave manipulator is defined in Table 2. The joint angles in Table 2 are expressed as displacements from the initial coordinate system. A step input of 5° is given for each degree of freedom, and the reference angles and the endoscope image at the steady state are recorded. Thus, we obtained 112 554 images. However, when the direction of the forceps tip bending is parallel to zc of the endoscope coordinate system, the marker at the forceps tip could not be detected, as it was hidden behind the shaft or the flexible joint. Therefore, we excluded the postures wherein the area of the tip marker was less than 10% of that of the root marker. Finally, 107 166 images remained in the training data set.
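The exclusion rule can be expressed as a simple predicate; the function name and the way the marker areas are obtained are illustrative:

```python
def keep_sample(tip_marker_area, root_marker_area, ratio=0.10):
    """Filtering rule from the data-set construction: discard postures whose
    tip marker is mostly hidden behind the shaft or the flexible joint,
    ie, whose tip-marker area is below 10% of the root-marker area."""
    return tip_marker_area >= ratio * root_marker_area
```

Applied to the 112 554 recorded images, this rule is what reduces the set to the 107 166 usable training samples.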

| Selecting the CNN structure
The structure of a CNN suitable for learning from tracking images is unclear. Therefore, we experimentally select the number of convolutional layers suitable for posture estimation.
The created training data set consists of postures within the movement range defined in Table 2. In this study, we evaluate the mean absolute error (MAE) over the entire training data set as the evaluation function for selecting the CNN. We prepared eight CNNs with 0-7 convolutional layers; the minimum MAE of ϕ2 is achieved by the CNN with two layers. These CNNs are used in the subsequent experiments.
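The selection procedure can be sketched as follows, assuming each candidate CNN is wrapped as a prediction function (the interface and names are ours, not the paper's):

```python
import numpy as np

def select_depth(models, images, targets):
    """Sketch of the model-selection step: evaluate each candidate CNN
    (here, 0-7 convolutional layers) by its MAE over the whole training
    set and keep the depth with the smallest error.

    models  : dict mapping depth -> prediction function (image -> angle)
    images  : iterable of preprocessed tracking images
    targets : ground-truth angles from the position sensor
    """
    maes = {depth: float(np.mean(np.abs(
                np.array([predict(x) for x in images]) - targets)))
            for depth, predict in models.items()}
    best = min(maes, key=maes.get)
    return best, maes
```

One such selection is run per output angle, which is why the best depths for ϕ1 and ϕ2 can differ.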

| Validation data
The training data were created without moving the linear joint q 3 of the holder robot or the posture ϕ 2 of the forceps. However, when operating in the master-slave mode, these two values change continuously.
Therefore, to verify the accuracy of the estimated forceps posture when realistic trajectory data with continuous values are provided, validation data were created using the master-slave method. In this study, two types of validation data were created, as shown below: Case A. The forceps operates freely without a load.
Case B. A palpation motion in which an external force acts on the tip of the forceps.
In case A, the operator randomly created trajectory data, and in case B, the operator created trajectory data by pressing the forceps against an object simulating an organ, as shown in Figure 7. Each validation data set has a total of 1000 images acquired at 25 Hz using the endoscope. In case B, the driving force of the pneumatic cylinder was simultaneously recorded to estimate the external force.

| Posture estimation
To verify the accuracy of the forceps posture estimation, we compared the estimated posture with the posture measured using the sensor. For example, the forceps used in this study are able to perform the task of block transfer without problems. 17 The results of posture estimation and the estimation error in case A are shown in Figure 8. Figure 8A,B shows the results of posture estimation of ϕ1 and ϕ2. In Figure 8A,B, the red lines are the forceps postures measured by the position sensor, and the blue lines are the forceps postures estimated from the image. Figure 8C,D shows the absolute error of Figure 8A,B, respectively. The MAE of ϕ1 and ϕ2 in case A are 6.7° and 6.5°, and the maximum errors of ϕ1 and ϕ2 are 25.6° and 25.6°. The results of posture estimation and the estimation error in case B are shown in Figure 9. Figure 9A,B shows the results of posture estimation of ϕ1 and ϕ2. Figure 9C,D shows the absolute error of Figure 9A,B, respectively. The MAE of ϕ1 and ϕ2 in case B are 9.9° and 14.1°, and the maximum errors of ϕ1 and ϕ2 are 23.9° and 14.2°.

| External force estimation
To verify the accuracy of the external force estimation, we compared the estimated force with the force calculated from the position sensor and the driving force of the pneumatic cylinder using Equation (1). For the comparison, we used the calculated force instead of a ground truth (eg, a three-axis force sensor) because the force estimation accuracy had been verified in previous research. 12 We verified the accuracy of the external force estimation using case B. The results of the force estimation and the estimation error for case B are shown in Figure 10. Figure 10A shows the results of force estimation. In Figure 10A, the red line is the external force measured based on the position sensor, and the blue line is the external force estimated based on the image. Figure 10B shows the absolute error of Figure 10A. The MAE of the x, y, and z external

| DISCUSSION
This section considers the accuracy of posture estimation and external force estimation.

| Posture estimation
In posture estimation in case A, Figure 8A,  Table 4.
where φ̄1 and φ̄2 are the postures measured in the forceps coordinate system when the roll of the holder robot q4 = 0. We consider the relationship between the distribution of φ̄1 and φ̄2 and the estimation errors of φ̄1 and φ̄2. The posture estimation error maps of the training data and case A are shown in Figure 11. Figure 11 maps the root mean square error (RMSE) of φ̄1 and φ̄2. Figure 11A shows that the estimation accuracy of the training data is nonuniform, and Figure 11B shows that the estimation accuracy of case A is nonuniform. In both figures, the RMSE tends to be large in the region where the angle of φ̄1 or φ̄2 is large.
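The construction of such an error map can be sketched as follows; the bin width and the interface are assumptions, as the paper's actual binning is not given in this section:

```python
import numpy as np

def rmse_map(phi1, phi2, err, bin_deg=10):
    """Sketch of the error map in Figure 11: bin the samples by their
    (phi1_bar, phi2_bar) posture (assumed non-negative angles here) and
    compute the RMSE of the posture estimation error within each bin."""
    b1 = (np.asarray(phi1) // bin_deg).astype(int)
    b2 = (np.asarray(phi2) // bin_deg).astype(int)
    squared = {}
    for i, j, e in zip(b1, b2, np.asarray(err, dtype=float)):
        squared.setdefault((i, j), []).append(e ** 2)
    # RMSE per occupied bin; empty bins are simply absent from the map
    return {k: float(np.sqrt(np.mean(v))) for k, v in squared.items()}
```

Plotting the returned dictionary over the (φ̄1, φ̄2) plane reveals the nonuniformity: bins at large bending angles accumulate larger RMSE values.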
Therefore, a uniform distribution of the training data is also important for improving posture estimation accuracy. Figure 9A,B shows that the posture estimation in case B contained a large error compared to case A. Table 5

| External force estimation
During the external force estimation in case B, as shown in Figure 10, the proposed method was capable of estimating the external force. It has been previously reported that a person can detect an external force exceeding 0.3 N in a master-slave operation. 18 The MAE of the external force in our experiment was 0.3 N; therefore, the image-based force estimation is sufficient for a master-slave surgical robot. However, the maximum error in the external force was 0.82 N.
To reduce the discomfort experienced by the operator due to the error in the estimated external force, it is necessary to improve the estimation accuracy. The posture estimation error is the dominant component of the error in the estimated force because the same driving force data are used in both estimates. Therefore, an improvement in posture estimation leads to acceptable force estimation.

| CONCLUSIONS
In this study, we