Deep Imitation Learning for Optimal Trajectory Planning and Initial Condition Optimization for an Unstable Dynamic System

In this article, an innovative offline deep imitation learning algorithm for optimal trajectory planning is proposed. While many state‐of‐the‐art works achieved optimal trajectory planning, their systems were stable or quasistable, and their approaches rarely optimized the system's initial conditions (ICs). Here, a new unstable dynamic system task called “internal sliding object stabilization control” is proposed, modeled, and solved by deep imitation learning. Given the system's ICs, the neural networks (NNs) can imitate the iterative linear quadratic regulator (iLQR), generate optimal trajectories, and compute faster. A proportional–integral–derivative (PID) controller is used to track the unstable trajectories. Leveraging on the gradients of NNs, it can optimize the system's ICs, avoid obstacles stepwise, and ensure the worst bounds of NNs for safety. Subsequently, thorough simulations are conducted, including comparing the iLQR and PID controllers in the task, optimizing the system's different ICs by gradient descent, and finding the worst bound of the performance by gradient ascent. Results show that the proposed algorithm achieves considerably improved performance. Finally, experiments are conducted with a real manipulator to compare the proposed structure with the original iLQR. Results indicate that the proposed algorithm resembles the iLQR well. Program code and experiment results are in https://github.com/DanielYamChen/ISOSC.git.

continuous trajectory optimization problem, which can be solved by classic optimization tools. [13]Campana et al. presented a gradient-based optimization technique to shorten the initial collision-free polygonal path and applied it to operate a manipulator in a box-filling task and a PR2 robot working in the kitchen. [14]agyar et al. proposed an optimization algorithm for robots in picking-and-placing, drawer-opening, and cupboard-opening tasks.The algorithm used the dynamic movement primitives (DMPs) to generate the initial trajectories and the stochastic trajectory optimization for motion planning (STOMP) to optimize them. [15]But, these papers usually did not handle unstable tasks, which needs considering further control issues, but only quasistable or stable tasks.
[18][19] Mordatch and Todorov utilized the alternating direction method of multipliers to couple the NN controller and the trajectory optimizer, which modifies the cost and allows the optimization progress to adapt to the NN behavior. [20]To address the nonlinear kinodynamic constraints, Li et al. presented the near-optimal rapidly-exploring random tree (RRT), used the NN to approximate the cost function, and tested it in underwater vehicle dynamic motion planning and car collision-avoidance navigation. [21]Qureshi et al. proposed the motion planning network (MPNet), which comprises an encoder network to encode the point clouds of objects into the latent space and a planning network for supervised training based on trajectories from the optimal rapidly-exploring random tree (RRT*). [22]n the other hand, some works researched on optimal trajectory planning based on reinforcement learning (RL), which is a goal-oriented method to skip addressing the complicated system dynamics.Schulman et al. proposed the proximal policy optimization (PPO) algorithm for deep RL, which utilizes the ratio of the new to old actor outputs as the learning rate for the actor NN, and applied it to a simulated humanoid locomotion task. [23]Bellegarda and Byl combined optimal control as trajectory optimization and PPO as deep RL.The structure deploys the action of larger reward from either optimal control or PPO and optimizes the actor NN from the better action. [24]Kulkarni et al. weightedly combined a nominal strategy and the RL based on recurrent neural network (RNN) to generate the trajectory tracked by impedance control.It aimed to train the model within a few minutes for assembly tasks in the real world. [25]Yang et al. deployed a hierarchical planning to navigate microrobot in blood vessels, where the high-level controller set up short-term targets along the desired path, and the deep RL output rotation sequence to approach the targets. [26]onetheless, these machine-learning-based works only generated near-optimal trajectories for the targets but did not consider improving the IC of the tasks.
As mentioned, these reported works mainly addressed the optimization problem on stable or quasi-stable mechanical systems, where the dynamic states would not easily diverge to boundaries.Such a stable or quasi-stable system is easier to control with near-optimal trajectory planning, since it usually has larger endurance for offsets between the optimal trajectory and near-optimal trajectory.By contrast, the control problem of unstable dynamic systems, such as balance control for the uncertain center of gravity (COG) in bicycle riding, [27,28] is challenging.Even when the action only has a subtle offset from the desired dynamically balanced state or only a small perturbation is injected into the system, the target state can easily be diverged and make the system collapse.In this work, a new unstable dynamic system that resembles some practical applications is set up and tackled, which, to the best of our knowledge, is rarely discussed.The scenario is depicted as follows.A manipulator holds a box and moves along a predefined trajectory.The COG of the box is unconstrained and movable, so it would easily drift around because of box movement.The control goal is to maintain the location of the COG at the center of the box as much as possible.The possible applications of this system include a robotic server holding a tray of dishes and drinks and moving around in the restaurant, a manipulator holding an opaque box for package delivery, and so on.
Although several works have established strategies for finding optimal trajectories, few papers have tackled the problem of finding optimal ICs.Such optimization is important, since it can contribute significant improvements to the system and achieve specific requirements, such as obstacle avoidance.Therefore, this study proposes a near-optimal trajectory planning based on deep imitation learning, which can solve the trajectory optimization problem in a shorter period and further optimize the IC by utilizing the easily available gradients of NNs.The algorithm is then combined with trajectory tracking to solve the unstable COG problem of the moving box held by the manipulator.The contributions of this work are as follows: 1) A new unstable dynamic system control problem named internal sliding object stabilization control (ISOSC), which aims to maintain the drifting COG of the beheld box to the box center when the box held by a manipulator moves, is proposed, mathematically modeled, and solved.A creative method based on the extended Kalman filter (EKF) to estimate the displacement of the drifting COG by a torque sensor is also proposed.Thorough simulated comparisons using different control strategies and realistic experiments were conducted.The results are demonstrated in the article.2) A creative offline deep imitation learning, which mimics the iterative linear quadratic regulator (iLQR) based on NN models, x mB -Net and θ B -Net, is proposed to solve the ISOSC task.This process can generate near-optimal trajectories and solve the task combined with a proportional-integral-derivative (PID) trajectory-tracking controller.The proposed algorithm can generate target trajectories faster than the original iLQR computation while achieving almost the same performance in simulations and experiments.
3) Leveraging the easily available gradients of the NNs, the proposed deep imitation learning can further easily optimize the IC of the task based on the gradient descent (GD) method and the chain rule, which can achieve less cumulative COG displacement errors and obstacle avoidance and ensure the worst bound of the NN performance of the task.
Our methods are similar to those proposed by Levine et al. [29] but have certain essential differences.First, their method was applied to a quasi-stable system, while we dealt with an unstable dynamic system.Second, their algorithm generated actions in real-time.In contrast, owing to the unstable nature of the system, the proposed algorithm needs to generate the whole trajectories before the task, so it can incorporate the trajectory-tracking PID controller to stabilize the unstable system during the task.
In addition, this work utilized different concepts in comparison with our previous work, [30] which set the PID controller as the main control role and deep RL as the compensation controller.This work set the PID controller as the assistant for trajectory tracking and the deep imitation learning as the main trajectory generator that produced the complete trajectories offline before the task.
The remainder of this article is organized as follows.Section 2 constructs the proposed unstable system and derives the mathematical model under certain assumptions.Section 3 elaborates on the proposed algorithms, including the mimicked algorithm iLQR, feature point extraction from the optimal trajectories, the near-optimal trajectory generation NNs, and IC optimization based on GD.Section 4 demonstrates the simulation results of the proposed system and algorithms.Section 5 shows the experiment results, and Section 6 concludes the work.

The Dynamic Model
In this section, some conditions are assumed and simplified to address the stabilization control problem for the unstable COG of the manipulated box.The dynamic model is then derived following these assumptions.

Problem Description
The target scenario of the problem can be set as the handling of the box by the manipulator.The COG of the box is movable within the box, whose motion is determined by the state of the box, such as orientation or velocity.The manipulator can control the box's COG by adjusting the box's orientation.To simplify the problem, only two-dimensional planar motion is involved.The system can be modeled as shown in Figure 1 and described as follows: 1) The box is modeled as a massless bar with a limited length; 2) The COG of the box is modeled as a point mass (m) called the internal sliding object which can slide on the bar with no friction.Thus, any tilt of the bar or any force with a horizontal component initiates the sliding of the mass; 3) The motion of the bar can be directly controlled by the manipulator, including its planar translational motion ðx B , y B Þ and rotational motion θ B .
Given the initial position of the point mass relative to the bar, the initial angle of the box, and the translational movement path of the box, a controller can be designed to generate an angular trajectory of the box/bar such that the internal sliding object can be maintained at the center of the box/bar when the box moves during the task.This problem is hereafter referred to as the ISOSC problem, which is an optimal control problem.

Derivation of the Dynamic Model
The settings for model derivation are described as follows.The body coordinate is fixed on the bar, as shown by the blue arrows in Figure 1.The displacement and velocity of the point mass are set as x mB and v mB , respectively, with respect to (w.r.t.) the body coordinate.Their initial values are defined as x mB,0 and v mB,0 .Note that y B ðtÞ ¼ 0 unless the point mass flies away from the box.Before a task, the translational moving trajectories of the box, ðx B ðtÞ, y B ðtÞÞ, are provided and parameterized as cubic polynomials.
According to the rigid-body dynamics, the acceleration ẍ@m at the site of the internal object in the box can be formulated as where the Coriolis term is ignored.According to the fictitious force theory and by observing the free body diagram of the sliding object on the site of the box where the object is located, as shown in Figure 2, the relative acceleration ẍmB of the object to the box and the normal force N between them can be calculated by Newton's second law ẍmB N Substituting ẍ@m and ÿ@m in Equation (2) with Equation (1), along with the dominant dynamic nonlinear differential equation of the relative displacement x mB and the normal force N can be obtained as In this equation, the angle of the box θ B can be seen as the system input, and x mB is the output.In the control task, a series of θ B should be sent into the dynamic system to make the magnitude of the output x mB as small as possible.

Position Estimation
To obtain the position of the internal object as feedback and generate a control input signal, the position can only be estimated by the applied torque measured from a torque sensor mounted between the box and the end effector of the manipulator.An EKF is utilized to estimate the position. [31]The measured applied torque τ msr is After taking the partial derivative of Equation ( 6) w.r.t. the variables ðx mB , ẋmB , ẍmB Þ, the linear expansion of the measurement matrix H can be expressed as In addition, the dominant dynamic equation, Equation (4), is discretized for digital simulation and calculation in a computer.In summary, the EKF structure can be expressed as where k is the iteration serial number and Δt is the loop time of the digital control system.

Near-Optimal Trajectory Generation Neural Network Based on Imitation Learning Algorithm
To solve the ISOSC task, a near-optimal trajectory generation NN is proposed, which is based on imitation learning and takes the ICs as inputs.The IC includes the translation path of the box among the task, as well as the initial angle of the box and the initial displacement of the sliding object.The translation path is generated from one start point, two via points, and one endpoint, along with three average speed values between these four points.
The trajectory comprises three segments of third-order polynomials, setting the velocity value at each via point as zero.In short, the IC can be represented and defined as an IC vector as follows where x B,0-3 , y B,0-3 , and v avg,0-3 are the column collections of x-coordinates and y-coordinates of the via points and the intermediate average speeds, respectively.Given the ICs, the total time t tot to complete the task can be calculated by The procedure for constructing the NNs is depicted as follows.First, the classic optimal control algorithm solves the task for each IC, generating the corresponding optimal angle trajectory of the box and the optimal displacement trajectory of the object.Two sequences of couples of timestamps and feature points are extracted from these two optimal trajectories, respectively.Second, these different sequences of couples of timestamps and feature points, which are deemed as label data, along with their ICs, which are set as input data, are packed to train two deep NNs, respectively.One NN is trained by feature points from the optimal angle trajectory, while the other is trained by those from the optimal displacement trajectory.Finally, after such imitation learning, the two NNs can generate a near-optimal angle trajectory and a near-optimal displacement trajectory, respectively, given an IC as the NNs' input.
This article shows two aspects that are different from the previous works.First, given that the controlled plant in the task is extremely unstable, instead of directly using the generated nearoptimal angle trajectories as system inputs, this optimal control problem needs to be converted into a trajectory-tracking control problem, or the position of the internal sliding object would easily diverge to two ends of the manipulated box.Second, leveraging the unique property that an NN model possesses easily accessible gradients given the NN inputs, the IC of the trajectory generation task can be further optimized based on the GD algorithm, a topic that has been rarely discussed previously.Each part mentioned previously will be explained in the following parts.

The Imitated Target: Iterative Linear Quadratic Regulator
The iLQR was chosen as the imitated target to create a couple of near-optimal trajectory NNs based on imitation learning. [32]The linear quadratic regulator, which is based on dynamic programming and is mathematically derived, is a typical algorithm with good performance in solving optimal control problems with known linear dynamic models and ICs.And, the iLQR is an advanced version that handles nonlinear problems by iteratively using the linear quadratic regulator to solve the linear expansions of nonlinear dynamic models with regard to the solution obtained from the previous iteration, gradually converging to the optimal solution of the nonlinear problem.
iLQR has several advantages for solving the dynamic system in this article compared to other nonlinear model predictive control (MPC) or optimization techniques.Some off-the-shelf optimization solvers, like ACADO, [33] OSQP, [34] IPOPT, [35] CasADi, [36] or Gurobi, [37] take several minutes to find the optimal solution, while iLQR only takes few seconds, since the dynamic system is low-dimensioned, relatively simple, and easily realized in the matrix form.These are also the reasons why other nonlinear MPC methods were not used.Since the dynamic system is differentiable, the total differential matrix in iLQR can be easily obtained.Besides, the dynamic system is a non-convex process depending on sequential decision making, which is intuitively to use dynamic programming instead of the interior point or Newton methods.The nonlinearity of the system has no analytic solution, so it requires a numerical iterative way to solve by iLQR.
First, the state space and action space vectors are set as where ω B is the angular velocity of the manipulated box.The optimal control problem can be defined as follows with where V 0 is the expected summation of cost values from the initial state and action to the final state and action that need to be minimized, cðx k , u k Þ is the cost function for the state and the action at each iteration, and C xx and C uu are the weighted cost matrix to determine the cost value of the state and action, respectively.Although the goal in this task only emphasizes minimizing the total object displacement value during the whole task, when the summation of squared x mB is used as one of the previously mentioned cost functions, the summation of the squared angular velocity ω B is also assigned weight to be punished.This is due to the fact that not only is this positive term necessary in the iLQR algorithm to make the Hessian matrix of the cost function positive, but it can also act as a form of regularization, avoiding the input angle changing over-abruptly and causing damage to the real robot arm.Second, according to Equation (8), this optimal control problem is constrained to the following dynamic system Taking the first-order Taylor expansion of Equation (15) w.r.t. the state and action vector ðx k , u k Þ at ð xk , ûk Þ, which is the solution obtained from the previous iteration, the total derivative can be attained by In summary, the iLQR algorithm can be represented as Algorithm 1, [32] where N is the discretized task time and Using the iLQR algorithm, an optimal angle trajectory of the box, θ B,iLQR ðtÞ, and a corresponding optimal displacement trajectory of the internal sliding object, x mB,iLQR ðtÞ, can be obtained for every IC of the dynamic system.

Feature Point Extraction from Optimal Trajectories
To implement the near-optimal trajectory generation NNs, ICs of the dynamic control task are sent into NNs as input data.Two NNs are expected to generate near-optimal angle and displacement trajectories, respectively, with optimal trajectories attained from the iLQR algorithm as the label data.However, the numbers of points in optimal trajectories are too large for NN output, and different ICs have different lengths of optimal trajectories.Therefore, feature point extraction from optimal trajectories is used before NN training.Instead of directly using the optimal trajectories, the feature points extracted from the optimal trajectories are used as label data for training.The near-optimal trajectories are then reconstructed by the feature points produced from the couple of NN models.Instead of using the NN auto-encoder as trajectory embedment, which has been utilized in previous research, [38,39] extreme point extraction, which is simpler and more intuitive, is employed in this work.An extreme point of the optimal trajectory means a trajectory point where its derivative with regard to time equal to zero.They are chosen as the feature points, θB,i or xmB,i .In addition, the incremental time between two consecutive feature points is chosen as the featured timestamp, ΔT θ,i or ΔT mB,i , ensuring that the time mentioned later is positively increasing.Thus, an optimal angle or displacement trajectory generated from iLQR can be represented as a sequence of couples of feature timestamps and feature points fðΔT θ,i , θB,i Þg M θ i¼1 or fðΔT mB,i , xmB,i Þg M mB i¼1 .Besides, to enhance the extent of recovery, the initial velocity of the angular trajectory recorded as ωB,0 and the last trajectory points in the two sequences are also picked up.
For trajectory reconstruction, cubic polynomials are used and connected based on the extracted feature points and timestamps as the extreme points to generate the polynomials by the following equations where By using this method, various ICs are set as input data.Each IC then generates two sequences of couples of a timestamp and a feature point that are respectively extracted from the optimal angle and displacement trajectories using iLQR.These two sequences can be set as output label data to train the two NNs which respectively generate two sequences of near-optimal angle and displacement timestamps and feature points.

Near-Optimal Trajectory Generation Neural Network
Two NN models named θ B -Net and x mB -Net are created.By inputting a normalized IC vector r 0 IC , the two NN models can produce two sequences of couples of a normalized timestamp, ΔT 0 θ,i or ΔT 0 mB,i , and a normalized feature point, θ0 B,i or x0 mB,i , which are further used to reconstruct the near-optimal angle trajectory θB,NN ðtÞ and the displacement trajectory xmB,NN ðtÞ, respectively.The normalized IC vector r 0 IC referred to in Equation ( 10) can be calculated by Where y U , y D , x L , and x R are the up, down, left, and right boundaries of the translation range of the box, respectively; v min and v max are the minimum and maximum translational speed values of the box, θ B,range is a positive range value of the initial angle of the box, and L box is the half-length of the side of the box where the internal object can slide.In addition, the relationships of timestamps, ΔT θ,i or ΔT mB,i , and feature points, θB,i or xmB,i , between their normalized counterparts can be represented as follows, with t tot obtained from Equation (11) The training mode is supervised learning, and both of the NNs have the same structure, except for the different number of neurons in their output layers.Each NN has two fully connected hidden layers, and each layer is composed of 1024 rectified linear unit (ReLU) functions as neurons with L2 regularization weighted at 10 À4 and a dropout rate of 0.1, as well as batch normalization during training.The output layer is composed of sigmoid functions, such that the output values can be limited to 0-1, ensuring positive incremental timestamps.All the input and label data are normalized between 0 and 1 before being sent to train the NN models.The loss function of NN is set as the summation of squared errors (SSE) between the output and label data, and the optimization method is Adam.
The number of the original label data is different from each IC for sequences of either angle or displacement optimal trajectories.To solve this problem, the dimensions of the output layers are determined as the average length of sequences of the label data.For those whose length is larger than the average, superfluous couples counted from the end of the sequences are deleted except for the last couple, whereas for those with smaller lengths, the last couple is repeatedly duplicated until the length reaches the average number.
The necessity of using the NN method is explained here.The model-based iLQR provides a real optimal and reliable solution to the optimal trajectory planning problem due to its good numerical convergence.However, considering the purpose of optimizing the ICs of the dynamic system, the iLQR solver cannot offer a simpler, more precise, or more analytical solution for the IC optimization problem, because the iLQR is an iterative, non-closed-formed, and non-analytical solver for the IC inputs.In contrast, the NN solver, which also inputs the ICs of the dynamic system and outputs the near-optimal trajectory, is more likely to solve the IC optimization problem based on its GDs, which will be explained in Section 3.5.This is why NNs are chosen to solve the proposed task in this article.More specifically, the NN solver is a closed-form analytical approximator of the iLQR solver, and the NN can solve the IC optimization problem more easily.Thus, the NN method is chosen not due to its data-driven property but due to its accessible GD for optimizing the IC of the dynamic system.Admittedly, an uncertain difference exists between the data-driven and modelbased methods.To verify the bound of the difference, the ratio of root mean squared errors (RMSEs) of the output trajectories between generated from the iLQR and our proposed NN methods was calculated and compared as shown in Figure 10.

Transferring to Trajectory-Tracking Control Problem
The controlled dynamic system is extremely unstable, and the optimal trajectories generated from the iLQR are coherently calculated based on the exact result of every state.Even if the near-optimal angle trajectory has only little bias from the optimal one at any time, if without any feedback control, the resulting displacement trajectory would easily diverge to the two ends of the box.In addition, the near-optimal angle trajectory must have a difference to some extent from the optimal one.Conversely, in a realistic robotic system, even if the optimal angle trajectory is directly used as the system input, the sliding internal object will almost not follow the desired displacement trajectory due to disturbance, friction, and robot arm vibration.
Therefore, a trajectory-tracking controller is necessary to resist the disturbance.In this study, a PID controller is used.The tracking target is the near-optimal displacement trajectory xmB,NN ðtÞ, and the output of the PID controller is added to the near-optimal angle trajectory θB,NN ðtÞ.The controller structure is shown in IC .The output displacement trajectory x mB,NN ðtÞ can be very similar to the target xmB,NN if the tracking control is conducted well.In other words, the performance index, the RMSE of x mB , has a significant relationship with xmB,NN .Therefore, if the target trajectory can be further optimized with regard to some form of cumulative errors, the RMSE of x mB will also be optimized.Furthermore, xmB,NN is essentially a complicated function of the normalized IC vector r 0 IC with a chained composition of the NN models, linear normalization, and cubic polynomials.These functions are all differentiable.Therefore, in this article, another differentiable cost function is defined from the displacement target trajectory xmB,NN .This cost function can then be optimized for the normalized IC vector r 0 IC based on the GD method since all the functions are differentiable.Such optimization for the IC with regard to the target trajectory can finally optimize the RMSE of x mB .
Seeing that the optimization goal also considers the total time duration t tot of the task, another cost function in relation to the displacement target trajectory xmB,NN is defined as In addition, the gradient of the newly defined cost function J 2 w.r.t. the normalized IC vector r 0 IC can be calculated by where Except that can be easily obtained from the Jacobian of the x mB -Net by calling the function in Keras, the other terms for gradient calculation in Equation ( 26) can be determined through the aforementioned list of equations.Given the gradient of the cost function J 2 w.r.t. the normalized IC vector r 0 IC , normalized IC optimization can be achieved by iteratively conducting the GD method to find the optimal normalized IC as follows IC (29)   where η Adam is the descending rate calculated from the Adam algorithm. [40]Meanwhile, the uniform grid search from 0 to 1 for each optimizable dimension is used for the initial positions of the ICs in the GD optimizations.The IC with the least cost value among all the optimized ones is selected as the final optimal IC solution.

Stepwise Obstacle Avoidance Algorithm
A special case is considered in this process.A step-based gradual obstacle avoidance algorithm can be used along with the GD optimization if only the positions of the two via points are to be optimized and circular 2D obstacles exist to block the path of the robotic arm.When the cost function iteratively descends opposed to the gradient direction in the IC vector subspace, step vectors can be added to the two via points in the IC vector subspace to move away from the obstacles.This can not only optimize the IC using the GD method, but also make the via points gradually move away from the obstacles.The pseudo-code is depicted in Algorithm 2, where step 1 and step 2 are the step vectors added to the two via points in each iteration to move away from the obstacles, and η mov is a fixed positive length of the steps.

Simulation Results
After demonstrating the control structure and algorithm, the simulation for verification will be presented in this section.The general simulated environment setting is listed in Table 1.The RMSE of the sliding object displacement x mB of the task is set as the performance index for comparison.The computer hardware for the simulations includes Intel CoreTM i9-9900 K processor, 32 GB RAM, and NVIDIA GeForce RTX-2070 8 GB.The simulations were conducted on Python 3.7.3,Keras 2.2.4,and Tensorflow-GPU 1.14.0.

Comparison of iLQR with PID Control
To verify the effectiveness of the iLQR optimal control in the unstable dynamic system, the typical and high-performed feedback controller, PID control, was utilized for comparison.Four different ICs were tested, as shown in Figure 4 and Table 2.For the parameter setting in the iLQR optimal control, the iteration number was set to 15 and the weights of the displacement cost w x and angular velocity cost w ω in Equation ( 14) were respectively set to 5 Â 10 À8 and 10 À4 by roughly tuning only one digit in the exponent and the mantissa to render the RMSE of x mB as small as possible.Conversely, the setting for the PID controller is as follows where parameters ðK p , K I , K D Þ were tuned to ð0.7½deg mm À1 , 0.14½deg mm À1 -s À1 , 0.0238½deg-s mm À1 Þ to Algorithm 2. Simple Obstacle Avoidance Algorithm.
Given the current start p B,0 , end p B,3 , and two via points p B,1 , p B,2 ; Given the circular obstacles of center at p C,i and radius r i ; ðunit vector normal to p B,0 p B,1 and outward the circleÞ; ðunit vector normal to p B,3 p B,2 and outward the circleÞ; make the RMSE of x mB as small as possible, with the output limited within AE60deg.
The simulated results' performance indexes, which are usually analyzed in control theories, are listed in Table 3.The RMSEs of using the PID feedback control are considerably smaller than those using the iLQR optimal control.However, the overshoots of the PID control are two to four times those of the iLQR, and the settling times of the PID control are slightly larger than those of the iLQR.Before entering the steady state, the x mB using the PID control underwent 1.5-2 more vibration periods than using the iLQR, manifesting more violent behaviors.This finding can be verified by inspecting the plots of the processes for the task.Taking IC 1 for example, the strategy used by the PID control was more drastic than that used by iLQR, as shown in Figure 5.
By compositely considering the overshoot, settling time, vibration periods, and RMSE of x mB , although the PID control promises smaller RMSEs, the iLQR that can move the internal sliding object to the box center in a more precise and gentle way is more favorable.

Training and Performance of x mB -Net and θ B -Net
To conduct supervised learning and build x mB -Net and θ B -Net, five random seeds were used to respectively generate 10 000 random ICs of the dynamic system.For each IC, the iLQR using the same parameters mentioned in the previous subsection solved and built the two optimal trajectories.The feature points of the optimal trajectories were extracted as the label data using  To validate the effectiveness of the trained x mB -Nets and θ B -Nets and the recovery ability, the two trajectories generated from x mB -Net and θ B -Net are respectively compared with those generated from the iLQR in the simulations of ICs 2 and 3 in Table 2.These tests were conducted for all the five model couples of five different random seeds as mentioned, and the means and STDs of trajectories were compared.The results are shown in Figure 7.
For both ICs 2 and 3, the overall recoveries of trajectories xmB,NN are much better than θB,NN , since x mB was the controlled target.The overall variations of trajectories x mB,iLQR were essentially smaller than those of trajectories θ B,iLQR , so they were easier to imitate for NN.In addition, the box angle trajectories of ICs 2 and 3 both have a common defect, wherein the ups and downs of the trajectory θB,NN were insufficient compared with the imitated target θ B,iLQR .That is, the extreme maximum values were not large enough and the extreme minimum values were not small enough, manifesting a weaker characteristic of extremity.Such defect and the difference between θB,NN and θ B,iLQR lead to trajectory x mB,iLQR of the internal sliding object not being tracked perfectly.So, x mB would diverge to the box ends if without any feedback control, since it is an unstable dynamic system.
Furthermore, 100 random ICs were generated using a new random seed to statistically test the effectiveness.The 100 curves of ð xmB,NN,mean À x mB,iLQR Þ and 100 curves of ð θB,NN,mean À θ B,iLQR Þ for these 100 random ICs using the five couples of trained NN models are illustrated in Figure 8.The results show that the difference of x mB was roughly restricted within 40 mm, approximately one-third of the range of displacement.Moreover, the differences were comparatively large during the first one-fifth of the task, which is the transient response time, and small and roughly limited to 5 mm during the last four-fifth, which is after the transient response time.The differences of θ B during the first one-tenth of the task were also larger than those during the last nine-tenth, but the reduction of differences along time was poorer than that of x mB .
In short, the two trajectories xmB,NN and θB,NN generated by the x mB -Net and θ B -Net can considerably resemble the imitated    trajectories x mB,iLQR and θ B,iLQR generated by iLQR, except for some differences in the transient responses.

Simulation Results of x mB -Net and θ B -Net in the Dynamic System
In this section, the proposed control structure and algorithm are verified in a complete simulation by considering the dynamic system.To validate the tracking of the trajectories generated from x mB -Net and θ B -Net in the ISOSC task, ICs 1-4 in Table 2 were used for the proposed algorithm and compared to those using iLQR in the simulations.Since iLQR was directly calculated based on the dynamic system, additional feedback controllers were not needed to stabilize the system.Comparatively, the proposed algorithm somewhat shows a difference with the imitated iLQR, thereby needing a PID controller to track the target trajectory xmB,NN , as mentioned in Section 3.4.The parameters   ðK p , K I , K D Þ in Equation ( 30) for the PID controller here were set to ð0.4½deg mm À1 , 0.2½deg mm À1 -s À1 , 0.08½deg-mm À1 Þ, with the output box angle limited within AE60deg.In all following sections, the ensemble of bagging was leveraged, and the feasibility that all the five couples of the NN models can run separately in parallel computers was considered.That is, only the trajectory couple ðx mB,NN , θ B,NN Þ with the minimum RMSE of x mB,NN generated from the five NN model couples was chosen as the result comparison and demonstration.And, only the longest calculation time among the five models was compared with the other algorithms.
The results are shown in Figure 9 and Table 3.The RMSEs using the proposed algorithm are 0.90-1.10times those using iLQR.It indicates that the RMSEs using the two methods are very close.Besides, the overshoots of the proposed algorithm are larger than those of iLQR.This is because the dynamic system is unstable, and the trajectory-tracking control task uses the target trajectory that is not perfectly calculated according to the dynamic system.So, it would easily make the output trajectory x mB,NN have additional large variances at extreme points compared with the trajectory x mB,iLQR .This can also be observed in Figure 9.Nevertheless, these overshoots are smaller than those from using the pure PID controller, except for IC 3. The settling times of the proposed algorithm are also larger than those of the iLQR, and close to those using the PID control.However, this is not apparent in Figure 9.By considering the vibration periods before entering the steady state, the x mB using the PID control underwent 0 to 1 more period than those using iLQR, which can also be manifested in Figure 9, for the same reason of larger overshoots.However, these results are still 1-2 fewer periods than those observed using the PID.
To statistically verify the proposed algorithm used in the dynamic system, 1000 random ICs were generated using the random seed mentioned in paragraph 4 of Section 4.2 to compare the performance between the iLQR and the proposed algorithm.The ratio of RMSE of x mB using the proposed algorithm to that using iLQR and the running time were compared.For each IC, five NN model sets were separately run, and the least RMSE of x mB was used.For a fairer comparison, the iLQR algorithm was added with early stopping, such that it would stop if the difference between the RMSEs of two consecutive iterations was less than 0.01 mm.
The time used by iLQR to calculate these 1000 ICs was 365.30 s, while the longest time of the proposed algorithm among using the five NN model couples was 120.92 s, which To conclude, the proposed deep imitation learning algorithm with trajectory-tracking control was compared to the iLQR strategy in the dynamic system of ISOSC task through simulations.In terms of the trajectory profiles and control performance indexes, the output trajectories, x mB , of the proposed algorithm are considerably similar to those of the iLQR, except for additional small variances along the trajectories due to the instability of the dynamic system.And, the control performances of the proposed algorithm are still better than those of using the pure PID control when inclusively considering overshoot, settling time, and vibration periods before entering the steady states.In addition, considering the statistical comparison for RMSE of x mB , the results of the two algorithms are very close, but the proposed algorithm only took one-third of the time of the iLQR to calculate.Therefore, the proposed algorithm and control structure resemble the iLQR well but use much less time for computation.

Simulation Results of Optimization for the IC Based on GD
In this section, the optimization for the IC based on GD, as mentioned in Section 3.5, which is one of the main contributions of this article, is conducted for several parts by selecting some components of the IC to optimize while fixing the other components.Instead of using the RMSE of the dynamic system output x mB as the performance index, the SSE of x mB calculated using Equation ( 24) was used since the total time of the task can be shortened to make the cost function J 2 as small as possible when optimizing the via points in the IC.All the GD methods were conducted for 500 iterations.Furthermore, for each component of the IC used to optimize, several points uniformly distributed among the range mentioned after Equation ( 21) were set as the beginning points for the GD methods and run separately.Then, the respective optimized ICs of the minimum J 2 and task time were further used to compare their SSE of x mB .Thus, the bestoptimized IC of the NN model set can be obtained by utilizing such grid search.Similar to the previous section, the algorithm was run for all the five NN model couples, and only the one with the minimum SSE of x mB was demonstrated based on the concept of ensemble.

Optimization for the Initial Box Angle θ B,0
In this part, the initial angle of the box θ B,0 was optimized to make J 2 as small as possible.ICs 1 and 2 in Table 2 and Figure 4 were used, except for the θ B,0 .For both of the ICs, the values of θ B,0 at the beginning of the GD methods were set to ðÀ15°, 0°, 15°Þ.The optimization progresses of the GD methods for the two ICs are shown in Figure 11.For IC 1, the bestoptimized θ B,0 is the point from 15°, descending to the achievable minimum J 2 at 30°.For IC 2, the best-optimized θ B,0 is the point from À15°, descending to the achievable minimum J 2 at À30°.Considering that J 2 is a complicated function of the IC vector, as depicted in Equation ( 24), the profiles of Figure 11 can be seen as the cross-section of J 2 along the θ B,0 -axis.Such inspection can preliminarily realize the profile of the J 2 function in the highdimensional vector space.
Afterwards, the optimized ICs were utilized in the dynamic system and the proposed algorithm structure to verify the optimization effectiveness.The results are shown in Figure 12.The profiles demonstrate that both optimized initial box angles  became extreme angles, so the boxes could tilt to the extreme opposite orientation and make the sliding objects rush back to the box centers as soon as possible and achieve a shorter rising time.For the performance index, the SSEs of IC 1 before and after optimization are 6418½mm 2 s and 3198 ½mm 2 s, indicating a reduction of 50%.The SSEs of IC 2 before and after optimization are 4324 ½mm 2 s and 3405 ½mm 2 s, indicating a reduction of 21%.
To statistically test the optimizing strategy in this part, 50 random ICs were generated using the random seed mentioned in Paragraph 4 of Section 4.2.The improvement rate before and after optimization for these ICs is shown in Figure 13a.The result shows that only 14 cases located between ½0, 0.05 almost did not improve; the other cases are averagely distributed among ½0.05, 0.75, indicating that about 72% of the cases exhibited obvious improvement.The optimized initial box angles versus the initial object displacements x mB,0 of the respective ICs is shown in Figure 13b.The relationship is approximately linear with saturation at x mB,0 ¼ AE50 mm.This result shows that the best initial box angle had a strong positive relationship with the initial object displacement.Moreover, the algorithm tended to set the best initial box angle to the limit values when the initial box displacement was larger than 50 mm.

Optimization for the Initial Object Displacement x mB,0
Similar to the previous part, the initial displacement of the internal sliding object x mB,0 was optimized to achieve the smallest J 2 here.ICs 1 and 2 in Table 2 and Figure 4 were also used, except for their x mB,0 .For both ICs, the initial values of x mB,0 at the beginning of the GD methods were set to ðÀ62.5 , 0, 62.5Þmm.The optimization progresses of the GD methods for the two ICs are shown in Figure 14.For IC 1, the best-optimized x mB,0 converges to the points from À62.5, 0, 62.5 mm and descending to the achievable minimum J 2 at À30.93 mm.For IC 2, the best-optimized x mB,0 comprises the points from À62.5, 0, and 62.5 mm and descending to the achievable minimum J 2 at À0.22 mm.The profiles of Figure 14 are seen as the cross-section of the J 2 function along the x mB,0 axis.
The optimized ICs were also deployed in the dynamic system for verification.The results are shown in Figure 15.For the performance index, the SSE of IC 1 before and after optimization are 6418 ½mm 2 s and 177 ½mm 2 s, respectively, reducing by 97%.Meanwhile, the SSE of IC 2 before and after optimization are 4324 ½mm 2 s and 35 ½mm 2 s, respectively, reducing by 99%.
The same 50 ICs mentioned before were used to verify the optimization strategy in a statistical manner.The improvement rates before and after optimization for these ICs are shown in

Figure 16a.
The result shows that only four cases are located between ½0, 0.05 with almost no improvement, and 41 cases are over 0.55, distributed increasingly as the improvement rate grows.That is, 82% of the cases were sufficiently improved.The optimization performance in this section is obviously better than that in Section 4.4.1, since the optimized target is x mB,0 , which has a more direct relationship with x mB than θ B,0 .The optimized initial object displacements in relation to the initial box angles of the respective ICs are illustrated in Figure 16b.The relationship is highly linear, showing that the best initial object displacement has a strong positive relationship with the initial box angle.The slope in this section is much smaller than that in Section 4.4.1, and the extreme values only occupy about one-third of the range values, revealing that as θ B,0 increases or decreases, the algorithm tends to modify x mB,0 slightly to keep x mB,0 as small as possible compared with Section 4.4.1.

Optimization for the Two via Points Without Obstacles
In this part, the two via points of the box movement path were optimized.The modified IC 1 in Table 2 with changing v avg,2 to 150 ½mm s À1 was used.First, only the x-coordinate of the two via points, i.e., x B,1 and x B,2 , in the IC vector were optimized with fixing the other components.For both x B,1 and x B,2 , the beginning points of the GD methods were set to ð162.5, 325, 487.5Þ mm, respectively.There were nine sets of beginning points for the GD methods.The stepwise changing progress of the box path for each beginning point set from an NN model couple is shown in Figure 17a, and the optimization progresses for J 2 w.r.t.x B,1 and x B,2 are shown in Figure 17b.The results show that the best-optimized ðx B,1 , x B,2 Þ is the point from ð487.5, 162.5 mmÞ, descending to the achievable minimum J 2 at ð271, 0mmÞ.The J 2 of the other beginning points also descended to lower levels of the function by utilizing the GD methods.The results of the dynamic system and the box path based on the best-optimized IC are shown in Figure 18a.The SSE of x mB is 6361 ½mm 2 s.
Second, both the Â and y coordinates of the two via points in the modified IC 1 mentioned before were optimized.The points ð0, 217, 433, 650Þmm were set as the beginning points of the GD methods for each of the four components.Therefore, a total of 256 sets of beginning points were used to conduct optimizations.The optimizing progresses for J 2 w.r.t.two via points are shown in Figure 17c, which shows that the GD methods are effective in making J 2 's start from different beginning point sets and decrease as the epochs increase.The two best-optimized via points are ð650, 650 mmÞ and ð0, 0 mmÞ, respectively, conforming to intuition: if the y-coordinate of the two via points can also be optimized, the best way is to make the two via points coincide with the start or end points.So, the box path would become only one segment of a straight line between the start and end points, in which the box can avoid unnecessary acceleration and deceleration along the path and impact on the internal sliding object.The results of the dynamic system and the box path based on the best-optimized IC are shown in Figure 18b.

Optimization for the Two via Points with Obstacles
The GD methods with the obstacle avoidance algorithm mentioned in Section 3.6, which was used to optimize the position of the two via points when obstacles placed in the box moving space, were verified in this part.Given the numerous sets of beginning points for each NN model couple, only the progress of the minimum final optimized J 2 from each of the NN model couples is illustrated.The one with the minimum SSE of x mB among the five NN model couples was selected to illustrate the x mB and θ B in the ISOSC task.First, only one circular obstacle was considered.The circular obstacle was located at ð325, 325 mmÞ with a diameter of 390 mm.The Â and y coordinates of the two via points in the modified IC 1 were to optimize with changing v avg,2 to 150 ½mm s À1 in Table 2 and Figure 4.The points ð0, 217, 433, 650Þ mm were set as the beginning points of the GD methods for each of the four components, and a total of 256 beginning point sets were used to conduct optimizations.
The stepwise changing progresses of the box path for each NN model couple are shown in Figure 19a, and the optimizing progresses for J 2 w.r.t. the two via points are shown in Figure 19b.The two best-optimized via points are ð650, 648 mmÞ and ð435, 18 mmÞ, respectively.Results show that the progress of optimizing did not descend as monotonically as the previous parts, since the combined algorithm had to avoid the obstacle blocking the center of the path space during the progress.Since there was only one obstacle, and the box path only needed one turning via point to circumvent the obstacle, the path with the minimum optimized J 2 of each NN model couple only had two segments of straight lines.Instead of stepwise evolving to the shortest path, the algorithm tended to flatten the last segment of the path.The results of the dynamic system and the box path based on the best-optimized box movement path are shown in Figure 20a.The SSE of x mB is 5499 ½mm 2 s.This result is  compared with the two shortest paths, which comprise two segments of lines tangent to the circle, and using iLQR, as shown in Table 4.The proposed algorithm even has less SSE of x mB than iLQR, whether along the best-optimized path or the shortest paths.The SSEs of the best-optimized paths are less than those of the shortest paths for iLQR and the proposed algorithm, since the steady states where the displacement of the internal object maintained around zero after the transient states took a small portion of the SSE values.Thus, the time duration of the steady states had little influence on the SSE values.
Second, two circular obstacles were considered.The circular obstacles with a diameter of 364 mm were located at ð117, 221 mmÞ and ð533, 429 mmÞ, respectively.All other conditions were set the same as the last paragraph.The stepwise changing progresses of the box paths are shown in Figure 19c, and the optimizing progresses for J 2 w.r.t. the two via points are shown in Figure 19d.The two best-optimized via points are ð390, 642 mmÞ and ð275, 70 mmÞ, respectively.This result shows that the optimizing progresses did not descend monotonically because of the two obstacles.Noticeably, saw-shaped trembles appeared in the black progress curve in Figure 19d, forming a dynamic balance and seesaw struggles between the directions to optimize for J 2 -value and avoid obstacles.Such phenomenon is also slightly shown in the pink progress curve in Figure 19b.The results applied to the dynamic system and the box path based on the best-optimized via points are shown in Figure 20b.The SSE of x mB is 5771 ½mm 2 s.This result was compared with the shortest path and the use of iLQR, as listed in Table 4.The SSE of x mB from the best-optimized paths is less than that of the shortest path for the proposed algorithm, which manifests that the shortest path is not the best solution.
To conclude, by inspecting the proposed optimization algorithm being used in several cases for the ICs based on the GD methods, it verified that the proposed algorithm could effectively stepwise find the best-optimized IC and make the SSE of x mB as small as possible.

The Worst IC for the J 2 Function
The NN models are usually seen as black-box models that cannot be safely employed in the manufacturing industry because the models are very complicated and hard to mathematically analyze.However, finding and checking the boundaries of the models can somewhat ensure the behavioral range of the NN models (i.e., the safety of the NN models).In this subsection, we tried to find the worst case for the NN models to assess their worst behavior.Unit: ½mm 2 s.The bold-marked numbers are the best results (least values) compared to other approaches of the conditions.
The performances of the NN models are judged by the J 2 function.Since the J 2 function is differentiable w.r.t. the IC vector, the gradient ascent (GA) method, which is the opposite of the GD method implemented in the previous subsection, was utilized to find the worst worsened IC and the largest J 2 value.If the proposed algorithm and the control structure still perform reasonably or acceptably using the worst worsened IC, the safety of the NN models, which has been thought not to be thoroughly analyzed, can be assured as safe somewhat.
ð0.33, 0.67Þ were set as the beginning points of the GD methods for every component of the normalized IC vector, and a total of 2 13  The worst cases may be caused by the bad behaviors of the x mB -Nets, indicating that the NN models resemble the iLQR very badly at certain ICs.However, such bad target trajectories and displacement trajectories obtained from the dynamic system were still bounded and converged back to zero in the last half of the tasks.Therefore, the NN models can be said to be roughly safe after the inspection of the worst cases for the J 2 functions.

Experimental Section
In this section, the first part of the proposed algorithm structure, which utilized x mB -Net and θ B -Net from deep imitation learning and trajectory-tracking control in the ISOSC task, was verified and compared with using iLQR in a realistic experiment platform.The EKF mentioned in Section 2.3 to estimate the displacement of the internal sliding object by the measured applied torque was also verified by comparing it with the measurements from a camera in the experiments.
The experiment platform is shown in Figure 23.The mechatronics and controller of the platform comprise an industrial 6-DOF articulated manipulator (RT605-710-GB, HIWIN, Taiwan), which is position-controlled and driven by a Syntec motor driver system, and a six-axis F/T sensor (WEF-6A200-4-RCD, WACOH, Japan) installed on the wrist of the arm to measure the applied torque perpendicular to the box moving plane.This mechatronic system, along with the real-time trajectory-tracking PID control, was controlled by an industrial computer (PXIe-8840, National Instruments, USA) using LabVIEW with a loop time of 12 ms.The target trajectories, either from iLQR or from the NN model couples, were generated by a personal computer in advance and then sent to the industrial computer for tracking.
Regarding the end-effector, a self-designed plate made of pieces of medium-density fiberboard and brackets made from PLA were set as the "box" and screwed onto the F/T sensor, so the outside camera could observe the position of the internal object.In addition, a set of railways made of acrylic bars were attached on the plate to confine the movement of the internal object and ensure the object only moved along a straight line on the plate.Meanwhile, a stainless-steel ball weighing 365gw with a diameter of 1.75 in was set as the "internal sliding object."Although the steel ball did not actually slide but rolled on the acrylic railways, this was the most realistic way to achieve the friction-free assumption in the dynamic model.The steel ball was so intensive and small enough that the rolling could be approximated as sliding.The box movement was constrained to the plane of Figure 23.
In terms of parameter settings, some parameters in Table 1 were changed due to physical limitations in reality.The v min and v max were respectively reduced to 50 and 100½mm s À1 , and the weight of the angular velocity cost w ω in Equation ( 14) was increased to 10 À2 considering the maximum rotation speed of the manipulator.The training data were re-generated in the computer based on these settings, and the NN models were re-trained for implementation in the experiments.ICs 1 and 2 in Table 2 were used to conduct the experiments, except that the start, via, and end points were proportionally reduced to 550 mm due to manipulator workspace limitations.In addition, v avg also decreased from 100 and 150½mm s À1 to 50 and 100½mm s À1 , respectively.Parameters ðK p , K I , K D Þ in Equation ( 30) of the PID controller for trajectory tracking were tuned to (0.02½deg mm À1 , 0.002½deg mm À1 -s À1 , 0.018½deg-s mm À1 ).
The iLQR was used to generate trajectories to compare with the proposed algorithm in the experiments.Since disturbances and other unrecognized factors exist in the real world, the same real-time trajectory-tracking PID control was also applied for the trajectory from iLQR; otherwise, the internal object could have easily left the desired trajectory.Similarly, the trajectories generated from the NN model couple with the least RMSE of xmB,NN were chosen as the target trajectories.For each experiment condition, experiments were conducted five times, the mean and standard deviation (STD) of resulting trajectories from experiments were demonstrated, and the RMSE of x mB was set as the performance index for comparison.
The displacement of the internal object estimated by EKF was compared with measurement from a digital video recorder.Three small blue paper stickers as markers were stuck at the two ends of the box and the center of the steel ball, so the displacement of the ball can be obtained from the position of the three stickers by image processing.The experiments were recorded at 60 frames per second and 1920 Â 1080 resolution.
The experiment results are shown in Figure 24, and the snapshots of the experiments are shown in Figure 25 and 26.The modified IC 1 is discussed first.The target trajectories xmB generated from the iLQR and the NN were very similar, except the rising time of NN shorter than that of iLQR.For this reason, considering the realistic displacements, the overshoot of x mB,NN was 12.5% larger than that of x mB,iLQR , indicating a more drastic movement during the transient response of the NN method.And, both of x mB,NN and x mB,iLQR exhibited about 3.5 periods of oscillations before entering the steady state because of the instability of the dynamic system, friction between the internal object and the box, and about 0.15s time delay from the controller command to the manipulator drivers.Both of x mB,NN and x mB,iLQR showed a steady-state error of about 17mm in the steady  state due to the static friction between the object and the box.Overall, the x mB,NN of the proposed control structure can effectively approach 0 and resemble the x mB,iLQR of the imitated iLQR algorithm in a realistic case.
For the modified IC 2, the target trajectory from the iLQR, xmB,iLQR , was slightly underdamped, while the xmB,NN from NN was critically damped.However, the trajectory during the transient response of xmB,NN was also more drastic than xmB,iLQR , resulting in the overshoot of displacement x mB,NN being 55% larger than that of x mB,iLQR .For the oscillation behavior during the transient response, x mB,iLQR showed decayed oscillations in the steady state, whereas x mB,NN showed a constant bias.This result may be due to the mean value of x mB,iLQR being very close to 0 and mainly affected by the sensitive controller, while x mB,NN achieved a static balance between the static friction and controller angle command at the biased displacement.Moreover, the magnitude of x mB,NN , similar to x mB,iLQR , was gradually minimized during the process in this case.Comparing the displacement estimated by the EKF and measured by the camera, the EKF can estimate the displacement well, considering the trends and magnitudes, especially when the x mB were static.However, EKF usually underrates the magnitude of x mB during vibration periods.This might be because the inertia and weight of the box, which was ignored in the dynamic model of EKF, still took comparable portion of the applied torque and was not removed from estimation. it had some delay when x mB 's were dynamic.Statistically saying, for the modified IC 1, the RMSE of x mB,iLQR estimated by the EKF is 37.18 AE 0.60½mm and the RMSE of x mB,NN is 35.60 AE 0.36½mm, which even reduces 4.2%.For the modified IC 2, the RMSE of x mB,iLQR by the EKF is 32.75 AE 0.31½mm and the RMSE of x mB,NN is 37.03 AE 0.88½mm, which is 13.1% worse.
To conclude that, the proposed control structure, which deploys x mB -Net and θ B -Net based on deep imitation learning with trajectory-tracking control, can effectively decrease the displacement of the internal object as small as that of using iLQR in the ISOSC task in the real world.

Conclusion and Future Work
As Industry 4.0 has prevailed among the world, the demand for more automatic optimal trajectory planning of manipulators is also ascending.Although many previous works have researched on optimal trajectory planning, they rarely investigated on application of unstable dynamic systems, which are hard to control for near-optimal planning algorithm.And previous research also scarcely addressed optimization for the IC of dynamic systems.
In this article, an unstable dynamic system named ISOSC task is proposed, mathematically modeled with some assumptions, simulated, and experimented on a real robotic platform.A corresponding EKF based on torque measurement to estimate the internal object displacement in the box is also constructed.To address the ISOSC task, a near-optimal trajectory planning structure based on deep imitation learning aided with a PID trajectorytracking controller is proposed.The ISOSC tasks are firstly solved by the iLQR as the imitated algorithm given many ICs.Then, the extreme points of trajectories x mB,iLQR and θ B,iLQR generated from the iLQR are extracted as the label data.Taking the IC as input data, two NN models named x mB -Net and θ B -Net are trained.In this way, given an IC, the corresponding near-optimal trajectory couple xmB,NN and θB,NN can be fast re-constructed based on these feature points output from the couple of NNs.Besides, to further optimize the IC of the dynamic system, the GD method, which leverages the easily accessible gradient of NN models, is utilized.Such GD method can improve the IC to achieve minimum displacement of the internal object and avoid obstacles blocking among the box movement, and it can investigate the worst bound of NNs to assure the safety of NNs.Some simulation results are obtained.Comparing using the iLQR optimal control with the PID controller in the ISOSC task, although PID control can achieve smaller RMSE of x mB than iLQR, iLQR promises 1.5-2 less vibration periods than PID control and 25-50% overshoot percentages of PID control during transient responses.Moreover, in addition that the two trajectories xmB,NN and θmB,NN generated from the proposed algorithm are similar to their imitated counterparts x mB,iLQR and θ mB,iLQR , the behavior of x mB,NN and θ mB,NN resulted from these target trajectories xmB,NN and θmB,NN used in the dynamic system with a trajectory-tracking PID controller are also comparably like those of using iLQR, except for larger vibration magnitudes and overshoots during transient responses.But, it takes only 1/3 of calculation time of the iLQR.Lastly, About the IC optimization based on the GD method, it can optimize the initial box angle θ B,0 for 50 random ICs to achieve the cost J 2 of 72% cases improved from 5% to 75%, and it can optimize the initial object displacement x mB,0 for 50 random ICs to achieve the cost J 2 of 82% cases improved more than 55%.Also, It can optimize the two via points of the box movement path to achieve minimum J 2 , it can modify the two via points to avoid obstacles among the box movement path, and it can be conducted reversely as GA to find the largest bound of J 2 and the worst performance of x mB -Net to ensure the safety of NNs.The proposed near-optimal trajectory planning algorithm was also experimented and compared with the iLQR by using a commercial 6-DOF manipulator to manipulate the box and a force/torque sensor to measure the applied torque.Experiment results show that the RMSE of x mB,NN from the proposed algorithm is 95.8% to 113.1% of that of x mB,iLQR from the iLQR, indicating that the proposed algorithm can imitate the iLQR well.Some works will be implemented to make this research more complete.For the dynamic model of the ISOSC task, some assumptions can be removed to make it more like the realistic situation.Such as, the friction between the box and the internal sliding object should be considered in the model, the weight and inertia of the box should also be considered in EKF, and it can be extended to a 3D model.In addition, the size of the NN models will be further researched.What the minimum number of layers and neurons of NNs are required to mimic the iLQR successfully, as well as the structure, will be researched.Thus, the complexity of the NN model will reduce, and the worst bound of its performance found by the GA method will be conducted in a more efficient and more confirmed way, which will also be researched in the future.
Although the experiment in this article is restricted to two dimensions for simplicity, this case not only has clear modelbased optimal solutions analyzed in a mathematical way but also sufficiently delivers the advantage and concept of using our proposed NN-based IC optimization structure in a clear way.The proposed method can be applied in scenarios with more degrees of freedom (DOFs) and complexity.For example, the ISOSC task can be extended to three dimensions such that a robot server holds a plate with a glass sliding on it.In such a case, one more translational dimension and two more rotational dimensions should be added.The dynamic system would become much more complicated and even need tensor calculation for attitude control.The increase in system complexity mainly affects the complexity of iLQR construction and calculation.The calculation complexity of iLQR increases in squared as the system dimension increases for matrix computation.On the other hand, as the dimension of the system increases, the number of input IC entries of the NN would increase linearly.So, the number of initial starting points for initializing IC optimization in the grid search way would increase in the exponential speed, which has large room to improve as the future work.

Figure 1 .
Figure 1.Demonstration of the task.The bottom rectangle represents the box, while the upper rectangle represents the sliding mass point.Figure 2. Free body diagram of the object observed from the box.

Figure 2 .
Figure 1.Demonstration of the task.The bottom rectangle represents the box, while the upper rectangle represents the sliding mass point.Figure 2. Free body diagram of the object observed from the box.

Figure 3 .
Given that the last half of the trajectory xmB,NN is nearing zero, the tracking goal of the last half of the task is directly set to zero for stability by multiplying a smooth inverse step function to the xmB,NN ðtÞ e 50À 100t t tot 1 þ e 50À 100t t tot (23) 3.5.Using Gradient-Descent to Find the Optimal Initial Condition Following Section 3.2 and 3.3, a displacement target trajectory xmB,NN can be obtained and set as the tracking goal by inputting a normalized IC vector r 0

Figure 3 .
Figure 3.The near-optimal trajectory-tracking control structure, where gðtÞ is the dynamic controlled plant.

Figure 4 .
Figure 4. a-d) Four different initial conditions (ICs) of the ISOSC task.The red line represents the box and its length, the blue lines are the predefined box translational trajectories, and the green square is the internal sliding object in the box.

Figure 5 .
Figure 5. Displacement of the object x mB (top) and the angle of the box θ B (bottom) versus time of the task of IC 1.

Figure 6 .
Figure 6.Training progress of a) x mB -Net and b) θ B -Net.The light color patterns show the plus and minus one standard deviation (STD) among five different random seeds.

Figure 7 .
Figure 7.Comparison of the object displacement x mB and the box angle θ B to time between trajectories from iLQR and NN models in the tasks using a) IC 2 and b) IC 3. The light blue patterns are the plus and minus one STD of five random seeds.

Figure 9 .
Figure 9. Object displacement x mB and the box angle θ B versus time of the task with a) IC 1, b) IC 2, c) IC 3, and d) IC 4.

Figure 10 .
Figure 10.Statistical histogram for the ratio of RMSE of the proposed algorithm to RMSE of iLQR for the 1000 ICs, with the bar width set to 0.25.The closer the ratio to 1.0 (as the red dashed line marks), the better performance the result has.

Figure 11 .
Figure 11.Optimization progresses for J 2 w.r.t.θ B,0 of a) IC 1 and b) IC 2, where the filled dots are the beginning points of the algorithm, the colors show three curves from different start points, and the line colors are for clear demonstration.

Figure 13 .
Figure 13.a) Statistical histogram of the improvement rate for the 50 ICs.b) Optimized initial box angles versus the initial object displacements of respective ICs.

Figure 12 .
Figure 12.Object displacement x mB and box angle θ B versus time of the task before and after optimizing θ B,0 for a) IC 1 and b) IC 2.

Figure 14 .
Figure 14.Optimization progresses for x mB,0 w.r.t.J 2 of a) IC 1 and b) IC 2, where the filled dots are the beginning points of the algorithm, the colors show three curves from different starting points, and the line styles are only for clear demonstration.

Figure 15 .
Figure 15.Object displacement x mB and box angle θ B versus time of the task before and after optimizing x mB,0 for a) IC 1 and b) IC 2.

Figure 16 .
Figure 16.a) Statistical histogram of the improvement rate for the 50 ICs.b) Optimized initial object displacements versus the initial box angles of respective ICs.

Figure 17 .
Figure 17.a) The stepwise changing progress of the nine box paths by setting the lightest-colored lines as the beginning paths and stepwise progressing to the deepest-colored lines as the final paths, b) the optimizing progressions for x B,1 and x B,2 with respect to J 2 for only x coordinates to optimize, and c) the optimizing progressions with respect to J 2 versus epoch number for both x and y coordinates to optimize.

Figure 18 .
Figure 18.a) The object displacement x mB and the box angle θ B versus time of the task and the box moving path for a) only x coordinates to optimize and b) both x and y coordinates to optimize.

Figure 19 .
Figure 19.a) The stepwise changing progress for every 50 epochs of the box path.The light-colored lines are set as the beginning paths and stepwise progressing to the deep-colored lines as the final paths.b) The optimizing progressions with respect to J 2 for each NN model set for one obstacle illustrated as a dashed circle; c,d) are for two obstacles.Each NN model set was matched to one corresponding color, including black, red, blue, pink, and green.

Figure 20 .
Figure 20.a) The object displacement x mB and the box angle θ B versus time of the task and the box moving path for a) one obstacle and b) two obstacles, which shown as the black dashed circles.The dotted gray lines demonstrate the shortest paths for comparison.
¼ 8192 sets of beginning points were used.The worsening progressions of J 2 are shown in Figure 21.Only the progress of the maximum final worsened J 2 from each NN model couple is illustrated due to large number of curves.The Figure shows that most of the slope of the worsening progress curves approached zero at the end of the epoch number, indicating convergence to the local maximum values.Therefore, the behavior of the ICs with the maximum J 2 values can somehow represent the worst performance of the proposed algorithm and control structure.The results of running the dynamic system and the worst worsened box paths for the five NN model couples, respectively, are shown in Figure 22.The SSEs of x mB are 17 790, 23 691, 21 043, 36 323, and 21 344 [mm 2 s], while their counterparts using the iLQR with the same ICs are 8012, 8188, 8233, 8093, and 8272 [mm 2 s], respectively.The results show that the worst cases of the proposed algorithm are not caused by the worsened ICs, since using iLQR can still have a reasonable optimal profile of trajectories and SSE with the same ICs.The worst cases are also not caused by the bad resemblance of θ B -Nets, since the profiles of θ B,iLQR and θB,NN are similar for each NN model couple.

Figure 21 .
Figure 21.The worsening progress of J 2 for each NN model couple.

Figure 22 .
Figure 22.Object displacement x mB and box angle θ B versus time of the task compared with the target trajectories from the NN models and those using iLQR for NN a) Model 1, b) Model 2, c) Model 3, d) Model 4, and e) Model 5 and f ) their box moving paths.

Figure 24 .
Figure 24.Experiment comparison of the object displacement x mB and the box angle θ B to time between the tracked trajectories from the iLQR (left) and the NN models (right) for a) modified IC 1 and b) modified IC 2. Each diagram of x mB compares the measured data from the camera (red solid line and shaded pattern) and the EKF (blue solid line and shaded pattern), along with marking the tracked-target trajectory, x mB,NN or x mB,iLQR .

Figure 25 .
Figure 25.The snapshots of the experiment in chronological order of using a) iLQR and b) the proposed NN method, respectively, for modified IC 1.

Figure 26 .
Figure 26.The snapshots of the experiment in chronological order of using a) iLQR and b) the proposed NN method, respectively, for modified IC 2.
2 is not totally outside of Circle ðp C,i , r i ÞÞ step 1 ← step 1 þ η mov Â ðunit vector normal to p B,1 p B,2 and outward the circleÞ; step 2 ← step 2 þ η mov Â ðunit vector normal to p B,1 p B,2 and outward the circleÞ;

Table 1 .
General environment setting.

Table 2 .
Four different ICs of the ISOSC task.

Table 3 .
Comparison of performance using iLQR, PID, and the proposed algorithm.

Table 4 .
SSE comparison with paths and iLQR.