Hierarchical Planning with Deep Reinforcement Learning for 3D Navigation of Microrobots in Blood Vessels

Designing intelligent microrobots that can autonomously navigate and perform instructed routines in blood vessels, a crowded environment with complexities including Brownian disturbances, concentrated cells, confinement, different flow patterns, and diverse vascular geometries, offers enormous opportunities and challenges in biomedical applications. Herein, a hierarchical control scheme mimicking biological agents is reported that enables a microrobot to efficiently navigate and execute customizable routines in simplified blood vessel environments. The control scheme consists of two decoupled components: a high-level controller that decomposes complex navigation tasks into short-ranged, simpler subtasks, and a low-level deep reinforcement learning (DRL) controller responsible for maneuvering microrobots to accomplish those subtasks. The proposed DRL controller utilizes 3D convolutional neural networks and is capable of learning control policies directly from raw 3D sensory data. It is shown that such a control scheme achieves effective and robust decision-making within unseen, diverse, complicated environments and offers flexibility for customizable task routines. This study provides a proof of principle for designing intelligent control systems for autonomous microrobot navigation in vascular networks.

capture several key characteristics such as varying vessel geometry and rich, unknown red blood cell configurations. The hierarchical control scheme is inspired by the hierarchical problem-solving strategy of biological agents. [21] Specifically, we design a high-level controller to automatically decompose a complex navigation task into simpler subtasks, represented by a series of navigation subgoals leading toward the ultimate goal. The high-level controller is accompanied by a low-level deep reinforcement learning (DRL) controller that maneuvers robots to accomplish these subtasks. By training the DRL controller via a reinforcement learning method on extensive raw 3D sensor data, [22] our data-driven approach to navigation control not only simplifies sensor and algorithm development but also enables navigation in unknown, diverse blood vessel environments. The hierarchical control design offers great flexibility to customize navigation routines in large-scale, complex environments; it ultimately provides an algorithmic route to address the navigation challenges arising from a broad range of biomedical applications, such as targeted drug delivery, blood clot clearance, precision surgery, and numerous circulatory-system-based disease diagnostics and therapeutics.

Hierarchical Control Algorithm
A hierarchical control scheme is established to address the 3D navigation of self-propelled microrobots in blood vessels (Figure 1A). The proposed scheme consists of a high-level controller that dynamically sets short-ranged navigation targets along a desired path (length scale >100 μm) (Figure 1B) and a low-level DRL controller responsible for navigating robots around RBC obstacles (length scale <10 μm) and toward the specified dynamic targets using local observations (Figure 1C,D). The choice of the DRL controller is motivated by its exceptional capability in sequential decision-making in various challenging settings such as games [23,24] and robotics, [25] as well as the recent success in applying DRL to learn generalizable navigation strategies in 2D microstructured environments. [15,20] A navigation task can be represented by a preset 3D path connecting the starting point to the ultimate target point. The high-level controller selects a point on the path near the microrobot as a temporary target position. As the microrobot is steered by the low-level controller (Figure 1B) and gets closer to the temporary target, a new, farther target along the path is selected. By following these guiding targets, the robot approximately follows the designed path. Mathematically, let the path be represented by a parametric function T(q) ∈ ℝ³; the sequentially generated targets are then given by T(q1), T(q2), …, T(qN), where q1 < q2 < … < qN and T(qN) denotes the final target point, i.e., the desired path endpoint. The generation of new temporary targets is paced with the progress that the microrobot makes toward them, as summarized by Algorithm 1.
The hierarchical control algorithm performs iterations on two levels. The high-level controller iteratively updates temporary short-ranged targets along the desired path as navigation subtasks. The low-level DRL controller iteratively updates the rotational decisions at an interval of t_c based on the microrobot state and the local observation, with the objective of accomplishing the navigation subtask. The target update is triggered only when the microrobot is making progress, i.e., getting closer to the current target. The complete algorithm is given in Algorithm 1. A scalar parameter d_s is used as the threshold for issuing a new temporary target as the next subtask. Throughout this work, d_s is set to 20a, slightly larger than the size of an RBC (≈12a to 16a), where a is the radius of the microrobot. This choice of d_s strikes a good balance between task decomposition and low-level controller learning. Setting d_s too small can place the temporary target inside a red blood cell obstacle and cause the robot to become trapped as it tries to circumvent the cell. In contrast, setting d_s too large makes the subtask harder because of the enlarged state-action space, which hinders the low-level controller from learning effective strategies and defeats the purpose of hierarchical decision-making.

Figure 1. Hierarchical control scheme for autonomous microrobot navigation. A) Schematic representation (not to scale) of the low-level controller steering a microrobot to navigate in a blood vessel. Our deep reinforcement learning (DRL) algorithm employs deep neural networks that take the 3D sensing of the microrobot's neighborhood, the microrobot's state (position and orientation), and the target (octahedron) location as inputs and output rotational decisions. The details of the architecture are provided in the Experimental Section. B) Scheme of 3D local sensing around the microrobot. The sensation is represented by a 3D binary image with width W and resolution (pixel size) U. A pixel of the 3D binary image takes a value of 1 if its central point is inside a red blood cell (RBC) or outside the vessel, and 0 otherwise. C) A target generator serving as the high-level controller sequentially generates short-ranged targets (octahedrons) that guide the microrobot along a prescribed path. D) The local 3D sensory input, microrobot state (position and orientation), and target position are fed into a neural network, which outputs the rotational decisions to steer the microrobot towards the target. Here, the RBCs have diameters uniformly sampled between 6 and 8 μm, and the microrobot has a diameter of 2a = 1 μm.

www.advancedsciencenews.com www.advintellsyst.com
Algorithm 1: Hierarchical control algorithm for microrobot navigation

Given a desired path represented by a parametric function T(q) ∈ ℝ³. Denote the microrobot position by r.
While True:
    Select a temporary target on the path, r_t = T(q*), where q* = min{q : ‖T(q) − r‖ > d_s} and the solved q* is required to increase monotonically across iterations.
    While the microrobot is not getting closer to the target:
        Steer the microrobot towards the target r_t based on the DRL policy.
    End While
End While
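As a concrete illustration, the high-level target-selection step of Algorithm 1 can be sketched in a few lines of Python (a minimal sketch; the discretized path grid, the function names, and the endpoint fallback are our assumptions, not the released implementation):

```python
import numpy as np

def select_target(path_fn, q_grid, q_prev, r, d_s):
    """High-level controller step of Algorithm 1 (sketch): among grid points
    q > q_prev (enforcing monotonicity), pick the first whose path point
    T(q) lies farther than d_s from the current microrobot position r."""
    for q in q_grid:
        if q <= q_prev:
            continue  # q* must increase monotonically across iterations
        if np.linalg.norm(path_fn(q) - r) > d_s:
            return q, path_fn(q)
    q_end = q_grid[-1]  # no farther point left: use the path endpoint
    return q_end, path_fn(q_end)
```

For a straight path T(q) = (0, 0, q) sampled on a unit grid with d_s = 20, the first selected target is the first grid point strictly farther than 20 length units from the robot.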
In the following, the dynamics model of the microrobots and the DRL algorithm used to derive the control policy for accomplishing navigation subtasks are discussed.

Microrobot Dynamics
In this work, a type of microrobot that is engaged in constant self-propulsion but allows continuous control of orientation via external stimuli or intrinsic features (e.g., electric [26] and magnetic fields, [5,27] light, [28–30] agent chirality, [7,31,32] flexible structure mechanics [33]) was considered. The dynamics of such a direction-controllable microrobot is given by

$$\frac{\mathrm{d}\mathbf{r}}{\mathrm{d}t} = v_{\mathrm{SP}}\,\mathbf{p} + \boldsymbol{\xi}_r, \qquad \frac{\mathrm{d}\mathbf{p}}{\mathrm{d}t} = (w_1\mathbf{q}_1 + w_2\mathbf{q}_2)\times\mathbf{p} + \boldsymbol{\xi}_p\times\mathbf{p}$$

where r and p denote the position vector and the orientation vector (which is also the self-propulsion direction), respectively; t is time; and v_SP is the propulsion speed, taking a constant value. w = (w1, w2), with −w_max < w1, w2 < w_max, are the two bounded control inputs that change the self-propulsion direction along the two orthogonal basis directions q1 = e_z × p (e_z is the unit vector in the z-direction) and q2 = p × q1. Brownian translation and rotation are characterized by the zero-mean, independent, multivariate Gaussian noise processes ξ_r and ξ_p with covariances E[ξ_r(t)ξ_r(t′)] = 2D_t I δ(t − t′) and E[ξ_p(t)ξ_p(t′)] = 2D_r I δ(t − t′), where D_t is the translational diffusivity, D_r is the rotational diffusivity, and I denotes the unit tensor. All lengths are normalized by the microrobot radius a and time is normalized by τ = 1/D_r. The control update time is t_c = 0.02τ, the integration time step is Δt = 0.001τ, and D_t = 1.33a²D_r.
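The stochastic dynamics above can be integrated with a simple Euler-Maruyama step, as sketched below (a sketch under our reading of the reconstructed equations; the propulsion speed value is an illustrative assumption, and the case of p parallel to e_z would need special handling):

```python
import numpy as np

def step_dynamics(r, p, w1, w2, v_sp=30.0, D_t=1.33, D_r=1.0,
                  dt=0.001, rng=None):
    """One Euler-Maruyama step of the (reconstructed) microrobot dynamics:
    dr/dt = v_sp*p + xi_r and dp/dt = (w1*q1 + w2*q2) x p + xi_p x p,
    with q1 = e_z x p and q2 = p x q1. Lengths in units of a, time in
    units of 1/D_r; v_sp is an illustrative placeholder value."""
    rng = rng or np.random.default_rng()
    e_z = np.array([0.0, 0.0, 1.0])
    q1 = np.cross(e_z, p)
    q1 /= np.linalg.norm(q1)                 # first rotation axis (assumes p not || e_z)
    q2 = np.cross(p, q1)                     # second rotation axis
    xi_r = rng.normal(0.0, np.sqrt(2.0 * D_t * dt), 3)   # translational noise
    xi_p = rng.normal(0.0, np.sqrt(2.0 * D_r * dt), 3)   # rotational noise
    r_new = r + v_sp * p * dt + xi_r
    p_new = p + np.cross(w1 * q1 + w2 * q2, p) * dt + np.cross(xi_p, p)
    return r_new, p_new / np.linalg.norm(p_new)          # keep |p| = 1
```

Setting both diffusivities to zero recovers the deterministic limit, which is useful for sanity checks.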

DRL Controller Design
Given a short-ranged target specified by the high-level controller (Figure 1B), the low-level DRL controller aims to steer the microrobot to the specified target within the minimum time. Based on the dynamics model of the microrobot, w = (w1, w2), with −w_max < w1, w2 < w_max, are the two control inputs that change the self-propulsion direction p along two orthogonal bases. Here, the robot state s refers to its position r and orientation p, and the system state ϕ(s) is defined to include the microrobot's state s, the target position r_t, and its local 3D observation (the 3D binary image of the microrobot's neighborhood with a range of ≈15 μm, double the size of a typical RBC).
To seek an optimal control policy π that maps the system state ϕ(s) to rotational decisions w, the expected reward collected during a navigation process, $\mathbb{E}\big[\sum_{n=0}^{\infty} \gamma^n R(s_{n+1})\big]$, is maximized in the policy space, [34,35] where R is the instant reward function that encourages or penalizes system states, γ is the discount factor, and n denotes the time step. In the DRL framework, the optimal Q* function associated with the reward-collecting process is defined as

$$Q^*(\phi(s), \mathbf{w}) = \mathbb{E}\Big[\sum_{n=0}^{\infty} \gamma^n R(s_{n+1}) \,\Big|\, \phi(s), \mathbf{w}, \pi^*\Big]$$

which is the expected sum of rewards collected along the navigation process by following the optimal policy π* after observing ϕ(s) and making a rotational decision w. Given the Q* function, the optimal policy π* is obtained via

$$\pi^*(\phi(s)) = \arg\max_{\mathbf{w}} Q^*(\phi(s), \mathbf{w})$$

The navigation policy π is optimized through the deep deterministic policy gradient algorithm, [36] which simultaneously trains one deep neural network, called the Critic network, to approximate the optimal Q* function, and another deep neural network, called the Actor network, to approximate the policy π* (Supplementary Materials, Figure S1, Supporting Information). The discount factor γ is set to 0.99 to encourage the microrobot to seek rewards in the long run, and R is set to 1 for all states within a threshold distance of 1 to the target and 0 otherwise. Both neural networks employ 3D convolutional layers to process the 3D local sensory input and fully connected layers to process the system state. The neural network is trained extensively to estimate Q* through multiple episodes of navigation in different blood environments (see Supplementary Materials for details; Figures S2 and S3, Supporting Information) to learn robust and generalizable navigation strategies across scenarios (different RBC configurations, vessel sizes, and target locations). The code with training instructions is released at https://github.com/yangyutu/DeepReinforcementLearning-PyTorch.
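In the spirit of the released PyTorch code, the Actor network can be sketched as a 3D CNN whose features are merged with the low-dimensional state before predicting the two bounded rotation commands (all layer sizes, the input width, and the state dimension here are illustrative assumptions, not the published architecture):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Illustrative actor network: a 3D CNN encodes the binary occupancy
    image; its features are concatenated with the low-dimensional state
    (position, orientation, target) to output the two rotation controls,
    bounded by w_max through a tanh."""
    def __init__(self, img_width=15, state_dim=9, w_max=1.0):
        super().__init__()
        self.w_max = w_max
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened CNN feature size
            n_feat = self.cnn(torch.zeros(1, 1, img_width,
                                          img_width, img_width)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_feat + state_dim, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh(),   # two rotation controls in (-1, 1)
        )

    def forward(self, obs3d, state):
        feat = self.cnn(obs3d)
        return self.w_max * self.head(torch.cat([feat, state], dim=1))
```

The tanh output scaled by w_max enforces the bounded controls −w_max < w1, w2 < w_max from the dynamics model; a Critic network would analogously merge the observation, state, and action to predict Q.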

Free Space Navigation
We first examine the free-space navigation strategies learned by the DRL controller. Figure 2 shows the rotational speeds (normalized by the maximum allowed rotation speed w_max) parameterized by target locations. For clarity, we place the microrobot at the origin and align its self-propulsion direction with the lab-frame x-axis. In-plane rotation changes the self-propulsion direction in the xy plane, while out-of-plane rotation changes it in the xz plane. Analogous to typical steering, the microrobot constantly adjusts its propulsion direction according to the relative position of the target throughout the navigation process. Considering targets in the xy plane, the key aspects of the navigation strategy are summarized as follows (Figure 2A,B): i) when the target is in front, propulsion direction adjustment is achieved mainly through in-plane rotation in proportion to the angular deviation; ii) if the target is located behind the microrobot, both in-plane and out-of-plane rotations are engaged at nearly the maximum value to quickly reorient the propulsion direction.
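The qualitative behavior in (i) and (ii) can be mimicked by a hand-crafted proportional steering rule (an illustrative baseline, not the learned policy; the gain value and the dynamics convention dp/dt = (w1 q1 + w2 q2) × p are our assumptions):

```python
import numpy as np

def steer(p, r, r_t, w_max=1.0, gain=3.0):
    """Illustrative proportional steering baseline: rotate the propulsion
    direction p toward the target in proportion to the angular deviation,
    saturating at w_max. Assumes dp/dt = (w1*q1 + w2*q2) x p with
    q1 = e_z x p and q2 = p x q1 (and p not parallel to e_z)."""
    d = r_t - r
    d /= np.linalg.norm(d)                   # unit vector toward the target
    e_z = np.array([0.0, 0.0, 1.0])
    q1 = np.cross(e_z, p)
    q1 /= np.linalg.norm(q1)
    q2 = np.cross(p, q1)
    # the desired change of p is the component of d perpendicular to p;
    # since q1 x p = -q2 and q2 x p = q1, matching components gives:
    w1 = np.clip(-gain * np.dot(d, q2), -w_max, w_max)
    w2 = np.clip(gain * np.dot(d, q1), -w_max, w_max)
    return w1, w2
```

When the target lies straight ahead the commanded rotations vanish, and a target far off-axis saturates the in-plane command at w_max, qualitatively matching Figure 2A,B.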
The resulting controlled trajectories of the microrobot navigating to targets at different locations are shown in Figure 2C and Movie S1, Supporting Information, where we arrange targets on a lattice surrounding the microrobot for comprehensive testing (see the 3D scheme). For targets lying in front (e.g., r_t = (30, 0, 0)), the microrobot navigates directly toward the target. For other target locations that the microrobot does not initially point to, rotational actions are first engaged to quickly reorient the microrobot towards the target and thereafter to maintain the direction against Brownian motion. In either situation, nearly straight-line trajectories are produced, suggesting the optimality of the navigation strategy. [14] Rotational actions are engaged to correct stochastic Brownian disturbances and keep the trajectories moving toward the target. The learned control policy enables not only rapid navigation toward the target but also stable localization around the target upon arrival (Figure 2D). Because the propulsion is constantly engaged, after arriving at the target the microrobot must continuously adjust its orientation to remain in the vicinity of the target. As the microrobot hovers around the target, it orbits periodically, tracing out circular trajectory patterns (radius = v_SP/w_max).

Figure 2. Free space navigation. A) Learned rotational decision for the in-plane rotation speed w1. B) Learned rotational decision for the out-of-plane rotation speed w2. w1 and w2 are normalized by the maximum rotation speed w_max. In presenting the control policies, we fix the microrobot at the origin with its orientation pointing along the +x direction. The target locations vary in the xy plane. C) Representative controlled trajectories (200 control steps, or 20τ) of the microrobot (initially located at the origin) navigating towards different target locations. The targets are arranged on a lattice, at locations (−30, −30, −30), (−30, −30, 0), (−30, −30, 30), …, (30, 30, 30). D) A representative localization trajectory of the microrobot around the target located at (−30, −30, −30) (upper panel); the distance between the microrobot and the target versus time is shown in the lower panel. E,F) Representative microrobot trajectories navigating towards different target locations with external flow. The setup is the same as (D) except for a steady external flow in the x direction.
So far we have demonstrated the learned control policy under one hyperparameter setting (i.e., one choice of v_SP and w_max). Control policies under other hyperparameter settings can be obtained via a simple arithmetic transformation (Supplementary Materials, Equation (S3) and Figure S4, Supporting Information) without retraining the model. Moreover, the control policy under external flow fields can be derived accordingly by treating the system as a microrobot navigating in the absence of flow but with changed hyperparameters (Supplementary Materials, Equation (S4), Supporting Information). We applied a flow field in the x-direction and verified the derived control policies (Figure 2E,F; Supplementary Materials, Figure S5, Supporting Information). Despite the adversarial impact of the external fluid flow, the microrobot still eventually reaches the prescribed targets at different locations. The external fluid flow affects the microrobot motion asymmetrically: it speeds up the microrobot when it travels along the flow direction but slows it down when it travels against it. Therefore, the controlled navigation trajectories no longer resemble straight lines but are bent toward the flow direction, as also predicted by the theoretical optimal trajectory of micro-swimmers in simple flows. [14] In particular, when the magnitude of the flow increases to v_f = 0.8 v_SP, the trajectories are strongly bent as the microrobot struggles toward the target. The presence of flow fields also causes delayed arrivals when microrobots travel against the flow, as well as additional disturbances to the localization process (Supplementary Materials, Figure S5, Supporting Information): the radii of the hovering trajectories are significantly larger than those in the absence of flow. It is important to note that when v_f is greater than the propulsion speed v_SP, the microrobot is no longer controllable.

Navigation in Blood Vessels
Navigation in blood vessels poses additional challenges, as biconcave RBCs and vessel walls can act as traps and barriers. As a first step to evaluate the learned navigation strategy, we consider steering microrobots in a simple blood vessel environment with a few RBCs (Figure 3A,B). We arrange targets at different locations as in the free-space navigation test in Figure 2C and examine whether the steered microrobot can circumvent RBC obstacles blocking its way. As shown by the representative trajectories in Figure 3A, when no RBC blocks the way, the microrobot follows a nearly ideal straight-line path to the target, as in free-space navigation. In contrast, when an RBC blocks the direct path, the microrobot adjusts its propulsion direction to get around the RBC. After arrival, the microrobot employs localization strategies around the target similar to those in free-space navigation. To investigate the impact of vessel wall confinement on navigation, we perform a similar evaluation near the vessel wall. As shown in Figure 3B, the microrobot successfully arrives at all targets near a curved vessel wall. In particular, when a near-wall RBC blocks the path to the target, the microrobot adjusts its propulsion direction to circumvent the RBC while avoiding collision with the vessel wall. We now evaluate the robustness, generalization, and efficiency of the navigation strategies in the more realistic blood environments of Figure 3C-K, which have typical sizes of arteries or veins and different RBC volume fractions. The major assumption of these blood model environments is that the blood flow is spatially uniform rather than turbulent, such that all objects in the flow drift at similar speeds and appear still relative to each other. We also assume that the self-propelled robots move much faster than the RBCs, so the RBCs appear effectively static.
We randomly place RBCs with different configurations (position and orientation) and sizes (diameters uniformly sampled between 6 and 8 μm) to create unseen blood environments for testing the generalization of the learned strategies. The high-level controller sequentially generates temporary targets to guide the microrobot along a straight path on the vessel axis extending from the bottom to the top (Algorithm 1). The robots can navigate through the vessels by circumventing all RBCs in the way (Movies S2 and S3, Supporting Information). Since the RBC configurations are randomly generated and unseen during the training stage of the neural network, this test suggests that the neural network has learned a generalizable navigation strategy.
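Randomized test environments of this kind can be sketched as follows (uniform placement inside a cylindrical vessel without overlap checks is a simplification; random unit vectors stand in for the biconcave RBC orientations):

```python
import numpy as np

def sample_rbc_environment(n_cells, vessel_radius, vessel_length, rng=None):
    """Sketch of a randomized blood-vessel test environment: RBC centers
    placed uniformly inside a cylinder aligned with the z-axis, diameters
    sampled uniformly in [6, 8] um, and random orientations (unit vectors).
    Overlap checks and biconcave shapes are omitted for brevity."""
    rng = rng or np.random.default_rng()
    # uniform points in a disk via sqrt-radius sampling, extruded along z
    rad = vessel_radius * np.sqrt(rng.uniform(size=n_cells))
    ang = rng.uniform(0.0, 2.0 * np.pi, n_cells)
    centers = np.stack([rad * np.cos(ang), rad * np.sin(ang),
                        rng.uniform(0.0, vessel_length, n_cells)], axis=1)
    diameters = rng.uniform(6.0, 8.0, n_cells)
    normals = rng.normal(size=(n_cells, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    return centers, diameters, normals
```

The square-root radius sampling ensures the centers are uniform over the vessel cross section rather than clustered near the axis.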
We further quantify the navigation performance in blood vessels by calculating the mean travel distance ⟨L⟩ versus the mean time t when the target is set at the end of the vessel in Figure 3C. As a benchmark, in the deterministic limit, the theoretical optimal performance is given by ⟨L⟩ = v_SP t. A roughly linear relationship indicates that the microrobot navigates through different portions of the vessel at a similar speed as the RBC configuration varies. In particular, in the free-space navigation case, the navigation speed reaches ≈90% of the optimal deterministic speed. In general, microrobots travel faster in vessels with a larger radius and fewer RBCs. When the vessel sizes are the same, more RBCs lead to frequent adjustments of orientation and therefore slow down navigation. At similar RBC concentrations, the stronger confinement in small vessels makes it harder for microrobots to get around RBC obstacles and therefore also leads to slower navigation.
As a further test of the robustness and generalization of the learned navigation strategies, we also examine navigation in curved vessels with varying diameters (Figure 3J-L, Movie S3, Supporting Information), from bottom to top. Surprisingly, while the microrobots are trained only in cylindrical blood vessels, the generalization of the DRL controller enables successful navigation in curved vessels. We note that while we have achieved impressive performance across different blood environments using a single neural network, additional performance gains can be expected if the neural network is fine-tuned to a specific blood environment, which is a topic for future studies. The aforementioned results assume that the RBCs and the microrobots experience the same ambient flow, so the RBCs appear stationary with respect to the microrobot. An extra robustness test is to let the microrobots experience an additional external flow field with speed v_f. We find that the microrobots are capable of reaching targets via a simple control policy remapping when the external flow speed is small (v_f ≤ 0.5 v_SP) and the RBCs are dilute (e.g., 5%) (Supplementary Materials, Figure S6, Equations (S3) and (S4), Supporting Information).

Exhaustive Spatial Survey in Blood Vessels
We have demonstrated that the present hierarchical DRL controller can steer the microrobot towards specified targets in both RBC-absent and RBC-present environments. To further illustrate that our hierarchical control scheme allows controlled navigation according to a preset routine, we consider the problem of steering a microrobot to exhaustively survey a blood vessel, analogous to a vacuum robot cleaning a room. The capability to quickly and completely survey a blood vessel is crucial for applications such as deploying robots to search hard-to-reach regions and clear sparse, hidden biological threats (e.g., cancer cells or toxins), or to rapidly release and mix drugs in complex environments.
Here we consider steering the microrobot to closely follow a predefined path T: (x(q), y(q), z(q)) given by the parametric function

$$\begin{cases} x = R_0\cos(k_2 q)\cos(k_3 q)\\ y = R_0\cos(k_2 q)\sin(k_3 q)\\ z = k_1 q \end{cases} \qquad (3)$$

where q ≥ 0, k2 and k3 determine the projection pattern of the path onto the xy plane, R0 denotes the coverage range, and k1 determines how fast the path elevates in the z-direction. We choose k1 = 5, k2 = 5, k3 = 7, and R0 = 45; the 3D trajectory and its projection on the xy plane are shown in Figure 4A. By gradually increasing the parameter q, the curve (x(q), y(q), z(q)) traces out a multi-helix pattern elevating from one end to the other, which can be used to guide a microrobot to sufficiently sample the space in a vessel (Figure 4A). In the baseline case with no RBCs in the vessel, the controlled robots can follow the predefined path with high fidelity, with random deviations quickly corrected by the control policy (Figure 4B and Movie S4, Supporting Information).
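The survey path can be generated directly from Equation (3) (the z-component z = k1 q follows our reading of the text, which states that k1 sets the elevation rate):

```python
import numpy as np

def survey_path(q, k1=5.0, k2=5.0, k3=7.0, R0=45.0):
    """Multi-helix survey path of Equation (3); q may be a scalar or an
    array of path parameters. Defaults follow the values chosen in the text."""
    x = R0 * np.cos(k2 * q) * np.cos(k3 * q)
    y = R0 * np.cos(k2 * q) * np.sin(k3 * q)
    z = k1 * q
    return np.stack([x, y, z], axis=-1)
```

With the chosen parameters, the path starts at (R0, 0, 0) and winds upward while its xy projection stays within the coverage radius R0.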
In vessels with RBCs, the microrobot manages to adhere closely to the prescribed path by circumventing the RBCs in its way (Figure 4C,D and Movie S4, Supporting Information). As the RBCs get denser (e.g., 10%), the microrobot must deviate from the ideal prescribed path more frequently and the trajectories (top view, Figure 4A-D) appear chaotic. Since the microrobots perform the preset routine from the bottom to the top, the efficiency of routine execution can be measured by the elevation speed along the z-axis (Figure 4E). The theoretical optimal elevation speed is obtained by assuming a deterministic microrobot with speed v_SP that exactly follows the preset path (Equation (3)). In all blood environments, we observe a roughly linear relationship between elevation and time, indicating that the microrobots make constant progress in this task. With 0%, 5%, and 10% RBCs in the vessels, the elevation speeds are 85.9%, 73.9%, and 64.2% of the optimal speed, respectively, reflecting slowdowns caused by more RBC blockages.
Another quantification of routine execution quality is the distance between the 3D preset path T (Equation (3)) and the actually executed path r after appropriate alignment. In particular, we can define the point-wise deviation between the two paths at an arbitrary point q as

$$\Delta(q) = \lVert \mathbf{r}(q) - T(q_0) \rVert$$

where q₀ is the corresponding optimal alignment in T computed using the dynamic time warping algorithm in MATLAB. [37] The mean deviation between the two paths is given by averaging over sufficient sample points within the paths (Supplementary Materials). As shown in Figure 4F, the point-wise deviation Δ at different elevations fluctuates around mean deviations of 4.8a, 6.0a, and 7.2a in blood environments with 0%, 5%, and 10% RBCs, respectively. With increasing RBCs, occasional spikes in Δ become more frequent as microrobots take detours to get around RBCs. Overall, our control scheme enables microrobots to execute preset surveying routines within different microstructured environments with high fidelity. Moreover, by modifying the preset routine path defined by Equation (3), different surveying strategies, such as adaptive exploration in vessels of varying sizes, can be implemented.
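The alignment-based deviation can be reproduced with a minimal dynamic-time-warping routine (a textbook dynamic-programming implementation standing in for MATLAB's dtw; the step pattern and sampling density are our assumptions):

```python
import numpy as np

def dtw_deviation(path_exec, path_ref):
    """Align an executed path to the preset reference path with dynamic time
    warping and return the per-point deviation ||r(q) - T(q0)|| along the
    optimal alignment, together with its mean."""
    n, m = len(path_exec), len(path_ref)
    d = np.linalg.norm(path_exec[:, None, :] - path_ref[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):          # accumulated-cost DP table
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                            D[i - 1, j - 1])
    i, j, dev = n, m, []               # backtrack the optimal alignment
    while i > 0 and j > 0:
        dev.append(d[i - 1, j - 1])
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    dev = np.array(dev[::-1])
    return dev, dev.mean()
```

Two identical paths yield zero deviation everywhere, which serves as a quick sanity check of the alignment.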

Model Analysis
We now analyze what has been learned by the DRL-enabled decision-making module to understand the navigation performance in the aforementioned tasks. In a toy blood environment (Figure 5A), we apply the t-distributed stochastic neighbor embedding (t-SNE) algorithm [38] to embed the learned representations of randomly sampled states into a 2D plane and color each point by the state value

$$V(s) = \max_{\mathbf{w}} Q^*(\phi(s), \mathbf{w}) \qquad (5)$$

The state value indicates whether one state s is more favorable than another: from a state with a higher V, the controlled microrobot can arrive at the target sooner than from states with a lower V.
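The embedding analysis can be sketched with an off-the-shelf t-SNE (scikit-learn here stands in for the original tooling; the hidden representations in the usage below are random placeholders, not trained network activations):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_states(hidden_reps, perplexity=20.0, seed=0):
    """Project high-dimensional hidden-layer representations onto a 2D
    plane with t-SNE, as done for Figure 5A; the resulting points can then
    be colored by the state value of Equation (5)."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=seed, init="pca")
    return tsne.fit_transform(hidden_reps)
```

In practice, `hidden_reps` would be the last-hidden-layer activations collected from the trained Critic or Actor network over randomly sampled states.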
We consider five configurations (I)-(V) in Figure 5 to examine how the network perceives different situations. As shown in Figure 5A, the high-dimensional system states are embedded in the 2D plane apparently according to the shortest-path distance to the target location, with closer states on the right. For example, configuration (I), with the closest distance to the target, has its embedding on the right. Similarly, configurations (II) and (IV) are assigned similar values, even though the microrobot in (II) is blocked by an RBC and has to reorient to reach the target while the microrobot in (IV) faces no RBC blockage. Additionally, in configurations (III) and (V), the microrobot in (V) is blocked by two RBCs in the way, yet this configuration is evaluated to have a state value similar to that of (III), where the microrobot is not blocked by any RBCs. We hypothesize that the neural network implicitly estimates the shortest paths based on local sensor information and the target position, and uses this estimate to guide the rotational decisions to follow these shortest paths. To validate this hypothesis, we use the Dijkstra algorithm to estimate the shortest-path distance from each state to the target. Under this hypothesis, the shortest-path distance provides the state value estimation of Equation (5) via

$$V(s) \approx \gamma^{\,l_S/(v_{\mathrm{SP}} t_c)} \qquad (6)$$

where γ is the discount factor used in Equation (5), l_S is the shortest path length from the microrobot's position to the specified target, and l_S/(v_SP t_c) is the number of control steps needed to move along the shortest path. The similarity between the learned state value function and the estimated one (Figure 5B) suggests that the microrobot has acquired nearly optimal navigation strategies, that is, making rotational decisions to follow approximate shortest paths. Although we never explicitly provide this information in the development of our model, orientation rotation following the shortest path emerges after deep reinforcement learning on extensive navigation data.

Figure 5. Analysis of learned representations in neural networks. A) The 2D t-distributed stochastic neighbor embedding (t-SNE) of the last-hidden-layer representation of the neural network in an example navigation task. Every point corresponds to a 2D representation of the internal state associated with the observations at the microrobot states (r, p). Points are colored by the state value. B) Estimated state value based on the shortest-path estimation (Equation (6)).
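The shortest-path value estimate of Equation (6) can be computed on a voxelized environment with Dijkstra's algorithm (a 6-connected grid and the parameter values below are simplifying assumptions made for illustration):

```python
import heapq
import numpy as np

def value_estimate(occ, start, target, v_sp=30.0, t_c=0.02, gamma=0.99, U=1.0):
    """Estimate the state value per Equation (6), V ~ gamma**(l_S/(v_sp*t_c)),
    where l_S is the shortest collision-free path length found by Dijkstra on
    a 6-connected voxel grid. occ: 0/1 occupancy array; U: voxel size.
    Returns (value, path length); (0, inf) if the target is unreachable."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        l, u = heapq.heappop(pq)
        if u == target:
            return gamma ** (l * U / (v_sp * t_c)), l * U
        if l > dist.get(u, np.inf):
            continue                       # stale heap entry
        x, y, z = u
        for v in [(x + 1, y, z), (x - 1, y, z), (x, y + 1, z),
                  (x, y - 1, z), (x, y, z + 1), (x, y, z - 1)]:
            if all(0 <= v[k] < occ.shape[k] for k in range(3)) and occ[v] == 0:
                if l + 1 < dist.get(v, np.inf):
                    dist[v] = l + 1
                    heapq.heappush(pq, (l + 1, v))
    return 0.0, np.inf
```

Blocking the direct route forces a detour, lengthening l_S and lowering the estimated value, mirroring how configurations with RBC blockages are ranked in Figure 5.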

Conclusions
We have presented a proof-of-principle study on designing a hierarchical control scheme to solve the complex microrobot navigation problem in blood vessels. The integration of a low-level DRL controller and a high-level controller enables the customization of navigation routines beyond the simple navigation scenarios considered previously. [15,20] While we do not attempt to fully model the realistic complexity of the blood environment (e.g., nonsteady blood flow, RBC deformation, [39,40] and hemodynamics [41]), we aim to emphasize the key ideas of local sensing, hierarchical decision-making, and data-driven learning. We show that 3D sensing of the local environment together with DRL-based control enables learning robust and efficient navigation strategies in blood vessels with diverse RBC configurations and varying vessel geometries. We further demonstrate that the hierarchical control scheme can steer robots to efficiently and reliably accomplish preset spatial survey routines in blood vessels. Finally, we illustrate that the neural network learns effective representations of observations that underpin the successful navigation performance. Our results not only demonstrate a general data-driven control scheme enabling navigation in human blood vessels but also lay the foundation for achieving more sophisticated nano/microrobot autonomy in a broad spectrum of complex environments, either in vivo or in vitro.
Our control framework can be applied in experimental settings [2] as well as extended in other computational studies. The data-driven nature of our approach lowers the hurdle of sensing and control algorithm development, offering an end-to-end approach that maps raw sensor data to decisions. The highly decoupled nature of our control scheme also allows the modification of individual low-level control modules for different purposes, including adapting the controller to specific robots and motors and accommodating experimental measurement errors using additional state estimation components (e.g., a Kalman filter). The proposed algorithm can be combined with high-fidelity blood physics simulators to learn control strategies in realistic blood environments. Similarly, our control scheme also applies to a broad class of microrobots in other navigation scenarios, such as the urinary tract or the eye in the human body, or 3D porous media in environmental applications. The high-level controller can also be extended from the rule-based one considered here to more generic learning-based ones, [42] which enable data-driven task decomposition and opportunities for joint optimization with the low-level controller. A further extension could include controlling a swarm of microrobots via a single-agent control paradigm [43] or a multiagent stochastic control paradigm [44–46] to achieve swarm intelligence for more complicated tasks such as capturing circulating tumor cells in the blood. [47,48]

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.