Efficient Navigation of Colloidal Robots in an Unknown Environment via Deep Reinforcement Learning

Equipping micro‐/nanoscale colloidal robots with artificial intelligence (AI) such that they can efficiently navigate in unknown complex environments can dramatically impact their use in emerging applications such as precision surgery and targeted nanodrug delivery. Herein, a model‐free deep reinforcement learning algorithm is developed that trains colloidal robots to efficiently navigate in unknown environments with random obstacles. A deep neural network architecture is used that enables the colloidal robots to mimic animal navigation decision‐making by directly processing raw sensor input and decomposing long‐range navigations to short‐range ones. The trained robot agents learn to make navigation decisions regarding both obstacle avoidance and travel time minimization, based solely on local sensory inputs without prior knowledge of the global environment. Such agents with biologically inspired mechanisms can acquire competitive navigation capabilities in large‐scale, complex environments containing obstacles of diverse shapes, sizes, and configurations. Herein, the potential of AI to enable colloidal robots in extensive applications is illustrated.


Introduction
Self-propelled colloidal particles have recently demonstrated great promise as micro-/nanorobots capable of functioning in complex confined and crowded microenvironments. [1] In potential real-world applications (e.g., nanodrug delivery, precision surgery, environmental remediation, and machines [2][3][4][5][6][7][8][9][10][11][12] ), micro-/nanorobots are confronted with navigation challenges, including long-distance travel (e.g., travel in tissue, soil, and vasculature), unknown or spatiotemporally changing environments abundant with obstacles and dead ends, and additional time and fuel constraints. Beyond developing sophisticated micro-/nanorobot systems with more efficient transport mechanisms and sensing capabilities, [4,[13][14][15] efforts have also been directed toward developing better navigation strategies. [16][17][18][19] Examples include the application of a Markov decision process framework [20] and a variational Fermat's principle [21] to compute optimal navigation paths in mazes and flow fields. However, these methods generally require pre-existing knowledge of the entire environment, which is unavailable in the vast majority of expected applications; even when such knowledge is available, the computational cost becomes prohibitive for large-scale navigation.
Animals and humans are capable of effortlessly navigating and exploring unknown, complex, visually rich environments (e.g., cities, fields, mountains, sea, space, etc.). [21] An adult in an unfamiliar city has little difficulty traversing city blocks and exploiting shortcuts when possible. In the context of colloidal robot navigation in an unknown obstacle field, here we aim to overcome navigation challenges by equipping colloidal robots with human-like decision capabilities, or artificial intelligence (AI), using deep reinforcement learning (DRL), which is a subset of AI algorithms inspired by the human learning and decision-making process. DRL has recently achieved remarkable performance, even superhuman performance, in various domains that require sequential decision-making, ranging from games [19,22] and robotics [23] in traditional computer science to novel applications such as drug design, [24] chemical reaction optimization, [25] and medical treatment. [26] The usage of deep neural networks in DRL, particularly the convolutional neural network, [27] has significantly expanded the success of traditional reinforcement learning to challenging visual domains like robotics and self-driving cars. [28,29] In this study, we use a deep neural network architecture [30] (Figure 1a) that aims to minimally mimic the animal navigation decision-making system in two key aspects. First, we use convolutional neural layers, [29,31] which mimic the animal visual system [32] by directly processing high-dimensional raw sensory information (i.e., obstacle images). Raw sensory information has been shown to enable learning of efficient representations from high-dimensional observations and facilitate the generalization of experiences to new situations (i.e., unknown environments).
[29,33] Second, we dynamically transform distant targets into local short-range proxy targets, which mimics animal navigation behaviors that decompose long-range goals into a series of shorter-range subgoals (on the scale of the visual perception limit). [29,34] We train the neural network (Figure 1a) via a model-free (i.e., data-driven) learning framework, which makes our approach applicable to a broad range of colloidal robots exhibiting different dynamics. [35,36] Our DRL algorithm also contains critical enhancements to stabilize the learning process, as the strong stochasticity in colloidal robots arising from Brownian motion can make the training process unstable and divergent. We deploy the robot agent to navigate across diverse environment structures (e.g., different obstacle shapes, sizes, spacing, and target distances) (Figure 1b) and train the neural network directly from extensive navigation trajectory data, comprising sequences of observations, self-propulsion decisions, subsequent observations, and reward signals. Ultimately, the trained agent learns to make self-propulsion decisions to circumvent obstacles and reach targets, accommodating the unique stochastic dynamics of the robot. Because our DRL algorithm relies only on local information, the agent is inherently trained to acquire short-range navigation capability; however, the architecture (Figure 1a) allows it to navigate long distances by implicitly decomposing them into a series of short-scale navigations.
In a series of navigation tests, we show in the following section that the trained agent can efficiently navigate in large, complex environments with obstacles of unknown shapes and arrangements. We demonstrate that the neural network can learn effective representations of observations, which sheds light on its successful navigation performance. Our results demonstrate a general framework for training an intelligent colloidal robot to master the rules of efficient navigation based on local visual information, in contrast to developing algorithms that compute navigation strategies on a case-by-case basis. Such an approach is promising for overcoming the navigation challenges in emerging future applications of colloidal robots.

Colloidal Robot Model
We model colloidal robots as Brownian-type self-propelled particles that can control their self-propulsion speed but have no direct control over their orientation. Although direct orientation control has been realized for some types of self-propelled particles using magnetic fields, [4] we herein consider the more general and more challenging situation in which particles cannot directly control their orientation and must smartly exploit Brownian motion and self-propulsion to realize efficient navigation. The equations of motion of such a colloidal robot confined on a plane are given by

dx/dt = v cos θ + ξ_x,  dy/dt = v sin θ + ξ_y,  dθ/dt = ξ_θ

where x, y, and θ denote the position and orientation, t is time, and v is the propulsion speed, taking binary values of 0 and v_max as the control input. The Brownian translational and rotational displacement processes ξ_x, ξ_y, and ξ_θ are zero-mean Gaussian noise processes with variances ⟨ξ_x(t)ξ_x(t′)⟩ = 2D_t δ(t − t′), ⟨ξ_y(t)ξ_y(t′)⟩ = 2D_t δ(t − t′), and ⟨ξ_θ(t)ξ_θ(t′)⟩ = 2D_r δ(t − t′), respectively, where D_t is the translational diffusivity and D_r is the rotational diffusivity. All lengths are normalized by the particle radius a, and time is normalized by τ = 1/D_r. The control update time is t_c = 0.1τ, the integration time step is Δt = 0.001τ, v_max = 2a/t_c, and D_t = 1.33a²D_r. Notably, our DRL algorithm is model free and can be readily applied to other dynamic models governed by specific physics. [15,[37][38][39]
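The dynamics above can be integrated with a simple Euler–Maruyama scheme. The following minimal sketch uses the paper's normalized units (a = 1, τ = 1, so v_max = 2a/t_c = 20); the function names are ours:

```python
import numpy as np

def step_particle(state, v, dt=0.001, D_t=1.33, D_r=1.0, rng=None):
    """One Euler-Maruyama step of the overdamped dynamics
    (lengths in units of a, time in units of tau = 1/D_r):
        dx/dt = v cos(theta) + xi_x
        dy/dt = v sin(theta) + xi_y
        dtheta/dt = xi_theta
    """
    rng = np.random.default_rng() if rng is None else rng
    x, y, theta = state
    x += v * np.cos(theta) * dt + np.sqrt(2 * D_t * dt) * rng.standard_normal()
    y += v * np.sin(theta) * dt + np.sqrt(2 * D_t * dt) * rng.standard_normal()
    theta += np.sqrt(2 * D_r * dt) * rng.standard_normal()
    return x, y, theta

def run_control_interval(state, v, t_c=0.1, dt=0.001, rng=None):
    """Hold a propulsion decision v (0 or v_max = 2a/t_c = 20) for one
    control interval t_c, integrating with time step dt."""
    for _ in range(int(round(t_c / dt))):
        state = step_particle(state, v, dt=dt, rng=rng)
    return state
```

For example, holding v = v_max for one control interval displaces the particle by roughly 2a along its orientation, plus Brownian noise.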

Deep Reinforcement Learning
We denote the particle state at time step n by s_n = (x_n, y_n, θ_n). The observation ϕ(s_n) at s_n consists of the pixelated image of the particle's neighborhood and the target position (x_t, y_t), as shown in Figure 1a.

Figure 1. a) The main body of the neural network architecture used in our DRL algorithm. The details of the architecture and other units, such as nonlinearity, normalization, and pooling, are in the Experimental Section. Two streams of sensory inputs are fed to the neural network: a pixel image (30a × 30a) of the particle's neighborhood, fed into a convolutional layer, and the target's position, fed into a fully connected layer. A distant target is transformed into a local proxy target. The network outputs two Q values.

We seek an optimal control policy, π*, which maps an observation ϕ(s_n) to a self-propulsion action (i.e., OFF or ON) to maximize the expected reward accumulated during a navigation process, E[Σ_{n=1}^∞ γ^n R(s_n)], where R is the one-step reward function and γ is set to 0.99 to encourage the agent to value rewards in the distant future. To seek an optimal policy minimizing the arrival time, [40,41] R is set equal to 1 for all states within a threshold distance 2 of the target and 0 otherwise. The optimal control policy is obtained by training the neural network (Figure 1a) to approximate the optimal state–action value function (known as the Q* function), given by

Q*(ϕ(s), v) = max_π E[Σ_{n=1}^∞ γ^n R(s_n) | ϕ(s), v, π]

which is the expected sum of rewards accumulated along the process by following the optimal policy π* after observing ϕ(s) and self-propelling with speed v. The neural network contains convolutional layers to process the sensory information of the particle neighborhood, represented by a W × W binary image (W = 30), and a fully connected layer to process the target's position in the particle's local coordinate system. Distant targets (distance > W) are transformed into local proxy targets by projecting onto a circle of radius W centered on the particle (Figure 1a).
The neural network finally outputs the two Q* values associated with the ON and OFF actions. The local neighborhood sensory input directly impacts the learned navigation strategies and the computational cost during the training stage. The sensory input is characterized by two parameters: the vision field size W (i.e., the size of the binary image) and the resolution. The vision field size determines the range the robot can perceive, and the resolution determines the minimum obstacle feature that can be detected. W has to be greater than twice the largest obstacle size for the agent to learn robust navigation strategies. This is because the agent makes self-propulsion decisions based solely on current observations, without access to historical information (i.e., the agent is memoryless). A smaller W can cause ambiguity in the decision-making process when the agent is trying to circumvent large obstacles in a bounded environment (see Figure S1, Supporting Information, for illustrative examples). High resolution increases the computational cost, whereas low resolution can cause delays because the robot is unable to detect small trapping obstacle features. A reasonable criterion for the resolution, characterized by the pixel length scale U, is that U is comparable to the diffusion distance within unit characteristic time, √(2D_t τ) ≈ 1.6a, and that U is small enough to detect trapping concave geometry features. We choose U = a in the present study.
We train the neural network through multiple episodes of navigation in free space and two obstacle-present environments (Figure 1b) (see Experimental Section for details). Each episode is initiated at a random starting position of the agent and a random target; the episode ends when the agent arrives at the target or when the maximum control step is reached, whichever comes first. Extensive navigation data are necessary to enable the agent to learn robust navigation strategies in various scenarios (different obstacle shapes, sizes, spacing, and target locations). We use the canonical deep Q-learning algorithm [21] to iteratively improve the estimate of Q*, with several critical enhancements [34,42] to improve sample efficiency and the rate of convergence. With the estimated Q* function, the optimal propulsion decision at a given observation ϕ(s) is given by

v* = argmax_{v ∈ {0, v_max}} Q*(ϕ(s), v)

Free Space and Simple Obstacles
The trained DRL particle agent can efficiently navigate in free space and in unknown bounded obstacle environments not observed during its training process (Figure 2a-d). In free-space navigation (Figure 2a), the navigation strategy derived from the learned optimal Q* function is similar to those in previous studies [18,43,44] and can be summarized approximately as

v = v_max if d_n > d_c or α_n < α_c, and v = 0 otherwise

where d_n is the projection of the target-particle vector onto the orientation vector n = (cos θ, sin θ), α_n is the angle between the target-particle distance vector and n, and the parameters d_c and α_c are fitted to be ≈0.4 v_max τ and ≈30°. Despite the inability to directly control its orientation, the agent exploits Brownian rotation to realize the desired reorientation. Intuitively, the particle agent has learned an "orientation timing" strategy, whereby it propels itself when the target is located in front of the particle and otherwise waits for a favorable orientation to be sampled by Brownian rotation.
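As a rough illustration, this "orientation timing" heuristic can be written as a small decision function. All names are ours, and the disjunctive combination of the two fitted conditions is an assumption consistent with the description above (normalized units, v_max = 20 a/τ):

```python
import numpy as np

V_MAX = 20.0  # v_max = 2a/t_c in units of a per tau

def free_space_policy(x, y, theta, xt, yt,
                      d_c=0.4 * V_MAX, alpha_c=np.deg2rad(30)):
    """Approximate learned free-space strategy: self-propel when the
    target lies sufficiently 'in front' of the particle, otherwise wait
    for Brownian rotation to sample a favorable orientation."""
    r = np.array([xt - x, yt - y])
    n = np.array([np.cos(theta), np.sin(theta)])
    d_n = float(r @ n)  # projection of target vector on orientation
    alpha_n = np.arccos(np.clip(d_n / (np.linalg.norm(r) + 1e-12), -1.0, 1.0))
    return V_MAX if (d_n > d_c or alpha_n < alpha_c) else 0.0
```

For a target directly ahead the function returns v_max, while for a target behind the particle it returns 0 (wait for reorientation).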
We further consider navigating in environments with obstacles of different shapes and arrangements unseen in training, including regular rectangular obstacles (Figure 2b), re-entrant rectangular obstacles (Figure 2c), and moon-shaped obstacles (Figure 2d). These obstacles, with different shapes, sizes, and spacing, are designed to block paths to targets or trap particles via concave geometries (obstacle-wall arrangements in Figure 2b,c or the obstacle geometry itself in Figure 2d). Successful navigation trajectories indicate that the trained particle agent has learned navigation "knowledge" that generalizes beyond the training environments (e.g., Figure 1b). Specifically, the agent learns to wait for favorable orientations such that it can propel itself forward without hitting convex blocking obstacles or getting trapped in dead ends (Figure 2b,c). The agent will also temporarily propel itself away from the target to get out of concave traps if it accidentally gets trapped (Figure 2d). In general, the agent learns to avoid the obstacles and approximately follow the shortest geometric path toward the target, even though only local information is accessed.
We examine the distribution of displacements of the trained agent during the navigation processes in Figure 2a,b. In collecting the statistics, trajectories start at the same position but with randomly sampled initial orientations. The displacements are projected along and perpendicular to the shortest geometric path (represented by the direction vector connecting the initial position to the target), which are used to capture the navigation progress and the deviation from the ideal direct path, respectively. As shown in Figure 2e,f, for navigation in both the free space and rectangle-obstacle environments, the displacement distributions exhibit two modes at t = τ: 1) a near mode located at 0, due to trajectories with unfavorable initial orientations waiting at the initial position for favorable orientations from stochastic rotation, and 2) a far mode located at ≈0.75 v_max τ, due to trajectories with favorable initial orientations rapidly self-propelling. At longer times, the near mode continues to spread out, and the far mode propagates toward the target, indicating that the trained agent can readily propel itself to get closer to the target when favorable orientations and positions appear via Brownian motion. In addition, the perpendicular displacement (i.e., the vertical deviation from the ideal path) distributions exhibit a narrow peak around 0 but with tails extending to ≈v_max τ in free space and ≈0.5 v_max τ in obstacle-present environments, indicating that the DRL-trained agent can usually maintain a close distance to the ideal path, although occasional large deviations can also occur (Figure 2g,h).

Complex Obstacles
After understanding the navigation behavior of the trained agent in free space and simple obstacle environments, we now analyze its navigation performance in larger and more complex unknown obstacle environments (Figure 3a,b). Both environments contain obstacles of the same shape, but the environment in Figure 3b is more crowded due to the reduced space between obstacles. Representative trajectories show that the trained agent, starting at different locations, is capable of circumventing the obstacles in its way and reaching the target. In these larger and more complex environments, the agent repeatedly uses the same navigation strategies as in the smaller and simpler environments (Figure 2a-d), "smartly" self-propelling only at favorable orientations to avoid the obstacles and get out of traps.
We now quantify the navigation performance across different length scales (Figure 3a,b) by benchmarking it against a model-based Markov decision process (MDP) algorithm (see Supporting Information). [20,21] As a basic tool widely used in optimal sequential decision-making problems, the MDP algorithm can compute the optimal navigation strategy using a discretized particle dynamic model. However, the MDP algorithm cannot be applied to navigation in unknown environments (as it requires prior knowledge of the environment) or to large-scale navigation tasks, due to prohibitive computational cost. The performance of the DRL and MDP algorithms is first compared based on the mean first arrival time 〈T〉 (see Experimental Section for the comparison setup). For trained agents navigating different distances in both free space and the sparse-obstacle environment (Figure 3a,b), the DRL and MDP agents have similar mean arrival times (Figure 3c). Surprisingly, the DRL agent slightly outperforms the MDP agent, by ≈10%, in the dense-obstacle environment (Figure 3d). This is probably because the DRL algorithm optimizes over a continuous state space, whereas the MDP algorithm optimizes over a discretized one; thus, in crowded environments that require frequent and careful steering, the DRL agent can make more accurate decisions based on continuous state observations.
We further compare the traveled path lengths of the DRL and MDP agents, where the traveled path length of a trajectory is defined by

L = Σ_{i=1}^{N−1} √[(x_{i+1} − x_i)² + (y_{i+1} − y_i)²]

where (x_1, y_1), (x_2, y_2), …, (x_N, y_N) are the particle positions observed at consecutive control update times. The traveled path length provides information on how the agents plan their routes to the target. Since the MDP agent seeks globally optimal performance by planning the globally shortest paths according to the preexisting map, [21] a much larger path length compared with the MDP agent would indicate that the DRL agent plans inferior paths to the target. Figure 3d shows that the MDP and DRL agents have comparable mean traveled path lengths 〈L〉, indicating that the DRL agent can also plan efficient routes toward the target based solely on local information. The DRL agent's average travel speeds v = 〈L〉/〈T〉 are fitted to ≈0.36 v_max and ≈0.22 v_max for navigation in the sparse- and dense-obstacle environments, respectively, smaller than the fitted ≈0.5 v_max in free space. Because circumventing obstacles requires frequent reorientation and thus more waiting for desired orientations, the travel speed decreases with increasing crowdedness of the environment.
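The path-length definition above is a straightforward sum over successive sampled positions; a minimal helper (name ours) might look like:

```python
import numpy as np

def traveled_path_length(positions):
    """Traveled path length of a trajectory sampled at consecutive
    control update times: the sum of Euclidean distances between
    successive positions (x_i, y_i)."""
    p = np.asarray(positions, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(p, axis=0), axis=1)))
```

Note that waiting steps (identical consecutive positions) contribute zero length, so L reflects only actual motion.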

Infinitely Large-Obstacle Environments
To ultimately evaluate the navigation capability in infinitely large environments, we deploy trained agents to perform directed transport in free space and in periodic environments containing sparse obstacles and dense obstacles (Figure 3a,b; also see Experimental Section). In all three cases, the trained agents can successfully travel along the specified directions (i.e., the horizontal direction in free space and the 45° direction in the obstacle environments). In particular, even in the crowded dense-obstacle environment, the trained agents are able to circumvent all obstacles in the way and approximately maintain the ideal directed path (i.e., the radial path in the 45° direction) (Figure 4a,b). We further perform mean squared displacement (MSD) analysis on the navigation trajectories (Figure 4c), which provides insight into the combined effect of the control strategy and the environment on the motion modes of the particle agent. At time scales t both shorter and longer than τ, the MSD of the trained particle agent navigating in free space follows MSD(t) ≈ t², as the navigation strategy produces intermittent directed transport. In comparison, a constantly self-propelled particle becomes an effective random walker at longer time scales [45] and has MSD(t) ≈ t. Surprisingly, for navigation in the obstacle environments, the DRL agent still displays approximately MSD(t) ≈ t² at longer time scales, although at smaller time scales it displays MSD(t) ≈ t^1.7, where the smaller exponent arises from detours when circumventing obstacles. For the same reason, the transition from t^1.7 to t² occurs at a later time in the environment with dense obstacles (≈100τ vs ≈20τ). In short, the trained agent can perform directed transport in obstacle-present environments as in obstacle-absent ones (i.e., without getting trapped), albeit at a reduced speed.
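A time-averaged MSD estimate of the kind used in this analysis can be sketched as follows (helper name ours; the paper's exact averaging procedure may differ):

```python
import numpy as np

def mean_squared_displacement(positions, max_lag=None):
    """Time-averaged MSD over a single trajectory:
    MSD(k) = < |r_{i+k} - r_i|^2 >_i for lag k control steps."""
    p = np.asarray(positions, dtype=float)
    max_lag = len(p) // 2 if max_lag is None else max_lag
    lags = np.arange(1, max_lag + 1)
    msd = np.array([np.mean(np.sum((p[k:] - p[:-k]) ** 2, axis=1))
                    for k in lags])
    return lags, msd
```

Fitting log(MSD) against log(t) over the relevant time window then yields the scaling exponents (e.g., ≈2 for directed transport, ≈1 for diffusion).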

Model Analysis
Finally, we examine the representations learned from high-dimensional sensory inputs by the neural network to understand the successful performance in the preceding navigation tasks. We consider an unknown environment with only one obstacle and apply the nonlinear dimension reduction t-distributed stochastic neighbor embedding (t-SNE) algorithm [21] to embed the learned representations in the last hidden layer into a 2D plane (Figure 5a). Each 2D point is colored by the state value defined by

V(s) = max_v Q*(ϕ(s), v)

where V(s) can be interpreted as the expected maximum total reward a robot can achieve when it is initialized in state s and follows the optimal navigation policy π*. A particle state with higher V indicates that a particle starting at this state can arrive at the target faster than one starting at a state with lower V. As shown in Figure 5a, high-dimensional sensory observations at different particle states are embedded in the 2D plane according to the shortest path distance to the target location (in the horizontal direction) and the particle's orientation (in the vertical direction). At the left end, particle states with the shortest distance and favorable orientations are assigned the highest state values (Figure 5a). Particle states at intermediate distances are placed in the upper part if the colloidal particle orients toward the target and in the lower part if the colloidal particle orients opposite to it (Figure 5a₄). For particles located farthest from the target due to blockage by the obstacle, their observations are assigned the lowest state values and placed at the rightmost end, in either the upper or lower wing (Figure 5a₅,a₆). While the Q* function assigns lower values to states with unfavorable orientations for particles within an intermediate range, it assigns similar state values to distant particle states regardless of their orientation; this is because the initial orientation does not add appreciable value to long-distance navigation. In short, as demonstrated in this toy example (Figure 5a), the reward signals have shaped the neural network to learn representations from observations that can distinguish, via state values, whether one particle state is more favorable than another. Another way to understand the learned navigation strategy is by visualizing the normalized gradient vector (see Supporting Information) of the orientation-averaged optimal state value V_XY(x, y), defined by

V_XY(x, y) = (1/2π) ∫_0^{2π} V(x, y, θ) dθ

Because the trained agent takes actions to move from low-value states (e.g., particle states with unfavorable orientations and far from the target) to high-value states (e.g., particle states with favorable orientations and smaller distances to the target), the gradient of V_XY(x, y) represents the effective navigation direction in which the trained agent wishes to move. In other words, if the agent orients along the effective navigation direction, it will self-propel; otherwise, it waits for such an orientation from Brownian rotation sampling. In Figure 5b, the effective navigation direction in the complex test environment clearly shows the agent's intent to circumvent all obstacles (even those not observed by the agent before) and follow paths directed toward the target, which is consistent with the navigation trajectories shown in Figure 3a,b. (Figure 5b caption: implied effective navigation direction from the gradient of the orientation-averaged optimal state value, with the target denoted by a red star.)

Conclusion and Outlook
In summary, we have developed a bioinspired DRL methodology to tackle the challenge of efficient navigation of colloidal robots in an unfamiliar complex environment. By training the agent through extensive navigation trajectories in diverse navigation scenarios with varying obstacle shapes, sizes, spacing, and target distances, the agent is able to derive effective navigation strategies that generalize beyond the training environments. Using t-SNE, we show that the neural network can learn useful representations of visually rich, high-dimensional sensory input and use them to obtain generalizable navigation strategies.
Our DRL algorithm provides a general model-free framework that applies to the navigation of colloidal robots with different dynamical models arising from different actuation mechanisms, [4,46,47] translation-rotation hydrodynamic coupling, [48] etc. Our DRL architecture can also be used as a basic building block to construct more complex learning architectures for navigation tasks in more challenging environments, including incorporating additional visual channels for navigation in flow fields, [41] adding a memory module [49] for navigation in nonstationary environments with limited visibility, using 3D convolutional layers [50] for 3D navigation, extending to continuous-control DRL [51] for high-precision localization tasks, and building hierarchical neural networks [52] for navigation in environments with multiple-scale obstacle features. Our algorithm can also be extended to a multiagent system [48,53,54] to control multiple robots to cooperate on tasks and assemble into nonequilibrium machines and devices [55] or applied as a general end-to-end controller for stochastic colloidal assembly. [56] Ultimately, our DRL algorithm can be integrated with experimental systems, as it can directly process raw sensor inputs of microrobot systems (e.g., from a microscope). [57] Possible data scarcity and noisy state estimation issues in real experiments can be addressed by techniques such as imitation learning, [58] transfer learning, [17] and the use of a memory module. [59]

Experimental Section

Deep Convolution Neural Network Architecture
We approximated the Q*(ϕ(s), v) function using a neural network described as follows. The observation ϕ(s) consisted of two streams of inputs. The first input was the pixel-level binary sensory input of a 30a × 30a neighborhood centered on the particle and aligned with its orientation (the pixel width is a; 1 denotes the presence of an obstacle or an out-of-boundary region, and 0 otherwise). The second input was the target position in the local coordinate frame of the particle (note that distant targets are transformed to local proxy targets, as discussed in the Sensory Input Construction section). The neighborhood sensory input first entered a convolutional layer [29,60] consisting of 32 filters with kernel size 2 × 2, stride 1, and padding 1, followed by a batch normalization layer, [61] a rectifier nonlinearity [62] (i.e., max(0, x)), and a 2 × 2 max pooling layer. [32] The output then entered a second convolutional layer consisting of 64 filters with the same kernel, stride, and padding as the previous layer, followed similarly by a batch normalization layer, a rectifier nonlinearity, and a max pooling layer. The local target coordinate first entered a fully connected layer consisting of 32 units followed by a rectifier nonlinearity. The outputs from the target coordinate branch and the sensory input branch were then merged and entered a fully connected layer of 64 units followed by a rectifier nonlinearity. The output layer was a fully connected linear layer with two outputs associated with the binary actions ON and OFF.
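The layer stack described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation (which is available at the linked repository); the class name and layer grouping are ours:

```python
import torch
import torch.nn as nn

class NavQNet(nn.Module):
    """Sketch of the described Q network: two conv blocks (32 and 64
    filters, 2x2 kernels, stride 1, padding 1, each with batch norm,
    ReLU, and 2x2 max pooling) for the 30x30 binary neighborhood image,
    a 32-unit fully connected branch for the local target coordinate,
    a merged 64-unit layer, and two Q-value outputs (ON and OFF)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=2, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=2, stride=1, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.target_fc = nn.Sequential(nn.Linear(2, 32), nn.ReLU())
        # spatial sizes: 30 -> 31 (conv) -> 15 (pool) -> 16 (conv) -> 8 (pool)
        self.merge = nn.Sequential(nn.Linear(64 * 8 * 8 + 32, 64), nn.ReLU())
        self.out = nn.Linear(64, 2)  # Q values for ON and OFF

    def forward(self, image, target_xy):
        h = self.conv(image).flatten(start_dim=1)
        t = self.target_fc(target_xy)
        return self.out(self.merge(torch.cat([h, t], dim=1)))
```

A forward pass on a batch of (1, 30, 30) images and (2,) local target coordinates yields a (batch, 2) tensor of Q values.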

Sensory Input Construction
For navigation in the obstacle environments, the environment maps were represented by a binary image (1 in obstacle regions) with a pixel size of a. For each particle state s = (x, y, θ), we constructed the observation ϕ(s), which consisted of the local neighborhood sensory input and the target coordinate input, in the following way: 1) The local neighborhood sensory input was obtained by first constructing a square window of width W = 30a centered on the particle and aligned with its orientation, and then extracting a 30 × 30 binary matrix from the environment map. Our image processing procedures included resizing the local neighborhood image to the desired resolution (i.e., the actual length per pixel) and thresholding (i.e., determining whether a part of an obstacle lies on a pixel); we used a 50% gray level as the threshold. 2) For a given target position, we first transformed the target position to a local proxy target position in the following way: if the target's distance to the particle was greater than W = 30a, its local proxy target position was the projection of the original target onto a circle of radius W; otherwise, the local proxy target was the original target itself. Local proxy target positions were further transformed to the local coordinate system of the particle before being fed into the neural network.
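The proxy-target transformation and the change to the particle's body frame can be sketched as small helpers (function names ours):

```python
import numpy as np

def local_proxy_target(particle_xy, target_xy, W=30.0):
    """If the target is farther than W from the particle, project it onto
    the circle of radius W centered on the particle; otherwise return the
    target itself (positions in global coordinates, lengths in units of a)."""
    p = np.asarray(particle_xy, dtype=float)
    t = np.asarray(target_xy, dtype=float)
    r = t - p
    dist = np.linalg.norm(r)
    return t if dist <= W else p + r * (W / dist)

def to_local_frame(particle_xy, theta, point_xy):
    """Express a global point in the particle's body frame, with the
    local x axis along the particle orientation."""
    dx, dy = (np.asarray(point_xy, dtype=float)
              - np.asarray(particle_xy, dtype=float))
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * dx + s * dy, -s * dx + c * dy])
```

The coordinate fed to the fully connected branch is then `to_local_frame(particle_xy, theta, local_proxy_target(particle_xy, target_xy))`.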

Obstacle Representation and Collision Dynamics
We directly converted environment maps to pixel images using image processing software. Obstacle regions had a value of 1, whereas free-space regions had a value of 0. Particle-obstacle collision dynamics was modeled in the following simplified way: 1) In every control time step, we evolved the particle state (x(t), y(t), θ(t)) using the equation of motion for an interval of t_c. 2) If the new particle position (x(t + t_c), y(t + t_c)) from step 1) was in an obstacle region, we reset the position to the previous one (x(t), y(t)) but still updated the orientation.
This approximation is reasonable because particle-obstacle collisions are not the dominant factor affecting the navigation process. Note that for nonspherical particles (e.g., rod-shaped), a better approximation is necessary, since collisions will affect the particle orientation (see our previous study for more details [21] ).
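The two-step collision rule can be sketched as follows (a simplified, self-contained sketch in the paper's normalized units with pixel size a = 1; helper names are illustrative):

```python
import numpy as np

def control_step_with_collision(state, v, obstacle_map, t_c=0.1, dt=0.001,
                                D_t=1.33, D_r=1.0, rng=None):
    """One control interval with the simplified collision rule: integrate
    the Brownian dynamics for t_c, then, if the tentative position lands
    on an obstacle pixel (or off the map), revert the position to the
    previous one but keep the updated orientation. `obstacle_map` is a
    binary array indexed as obstacle_map[iy, ix]."""
    rng = np.random.default_rng() if rng is None else rng
    x0, y0, theta = state
    x, y = x0, y0
    for _ in range(int(round(t_c / dt))):
        x += v * np.cos(theta) * dt + np.sqrt(2 * D_t * dt) * rng.standard_normal()
        y += v * np.sin(theta) * dt + np.sqrt(2 * D_t * dt) * rng.standard_normal()
        theta += np.sqrt(2 * D_r * dt) * rng.standard_normal()
    ix, iy = int(np.floor(x)), int(np.floor(y))  # pixel size a = 1
    outside = not (0 <= iy < obstacle_map.shape[0]
                   and 0 <= ix < obstacle_map.shape[1])
    if outside or obstacle_map[iy, ix] == 1:
        x, y = x0, y0  # revert position only; orientation stays updated
    return x, y, theta
```

Checking only the end-of-interval position (rather than every substep) is part of the stated simplification.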

Training Algorithm
The algorithm we used to train the agent was the canonical deep Q-learning algorithm [63] plus enhancements from double Q-learning, [43] hindsight experience replay, [44] and scheduled multistage learning. A hindsight experience (ϕ(s_h), v_h, R(s_h), ϕ(s_{h+1})) was created from an ordinary experience (ϕ(s_n), v_n, R(s_n), ϕ(s_{n+1})) using the following procedure: ϕ(s_h) and ϕ(s_{h+1}) were the observations at s_n and s_{n+1} but with the target set at the particle position (x_{n+1}, y_{n+1}) reached at step n+1; R(s_h) was set to 1; and v_h = v_n. Note that we only constructed hindsight experiences from experiences that did not collide with obstacles. The complete algorithm is presented below, with training parameters set according to Table 1. Our implementation of the algorithm is available at https://github.com/yangyutu/DeepReinforcementLearning-PyTorch. The enhancement of scheduled multistage learning consisted of the following steps: 1) At the beginning of each episode, the initial particle state and target position were randomly generated. To speed up convergence, we used ideas from curriculum learning [64] and required that they be generated such that their distance gradually increased from a small value; during the training process, the agent could thus learn short-distance navigation before learning long-distance navigation. Mathematically, let D denote the distance between the generated initial particle position and the target position; we require D to satisfy

D ≤ min(S_m, T_e − (T_e − T_s)·exp(−episode/T_d))

where episode is the episode number, S_m is the maximum of the width and height of the training environment, T_s is the initial target threshold, T_e is the final target threshold, and T_d is the target threshold decay.
2) The whole training process consisted of two stages. The first stage lasted 74 000 episodes, during which the free-space environment and the sparse-obstacle environment were randomly selected (with probabilities 0.2 and 0.8) to train the active particle. The second stage lasted 60 000 episodes, during which the free-space environment, the sparse-obstacle environment, and the dense-obstacle environment were randomly selected (with probabilities 0.1, 0.4, and 0.5) to train the active particle. We determined the episode counts of the two stages empirically by monitoring when the running average reward plateaued. The typical training time on a desktop with a 1080Ti GPU was ≈4 days.
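The hindsight relabeling described above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the dictionary-based state layout (with `pos` and `target` keys) and the function name are assumptions for the sketch; the observation encoding ϕ(·) would be applied afterward.

```python
def make_hindsight_experience(s_n, v_n, s_n1, collided):
    """Relabel an ordinary transition (s_n, v_n, s_{n+1}) by pretending the
    target was the position actually reached at step n+1, turning it into a
    guaranteed success: R(s_h) = 1 and v_h = v_n. Transitions that collided
    with an obstacle are excluded, as in the text."""
    if collided:
        return None
    hindsight_target = s_n1["pos"]          # reached position becomes the goal
    s_h = {**s_n, "target": hindsight_target}
    s_h1 = {**s_n1, "target": hindsight_target}
    return (s_h, v_n, 1.0, s_h1)
```

Both relabeled and ordinary experiences would then be stored in the replay memory M.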
Algorithm: Deep Q-learning with hindsight experience replay and double Q evaluation
  Initialize replay memory M to capacity N_M
  Initialize action-value function Q with random weights ω
  Initialize target action-value function Q′ with weights ω′
  For episode = 1, N_E do
    Randomly generate a particle state s_0 and a target position; obtain initial observation ϕ(s_1)
    For n = 1, maxStep do
      With probability ε select a random action; otherwise select v_n = argmax_v Q(ϕ(s_n), v; ω)
      Execute action v_n in the simulation and obtain the new particle state s_{n+1} and reward R(s_{n+1})
      Generate the observation ϕ(s_{n+1}) at state s_{n+1}
      Store transition (ϕ(s_n), v_n, R(s_{n+1}), ϕ(s_{n+1})) in M
      Perform a gradient descent step on (y_j − Q(ϕ(s_j), v_j; ω))^2 with respect to the network parameters ω
      If the particle arrives at the target, break the inner For loop
      Every C steps, reset ω′ = ω
    End For
  End For
Note that, as discussed in Section 5.2.1, distant targets were transformed into local proxy targets. The local proxy target position changes whenever the particle translates or rotates, and this target shift would introduce artifacts into the learned control policy. To remedy this undesired effect, when executing the line "Store transition (ϕ(s_n), v_n, R(s_{n+1}), ϕ(s_{n+1})) in M" in the algorithm, we constructed the observation ϕ(s_{n+1}) based on the target position at the previous step n instead of the target position at step n+1.
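The "double Q evaluation" in the algorithm decouples action selection from action evaluation: the online network picks the greedy action for the next state, and the target network scores it. A minimal sketch of the target y_j, assuming the standard double-DQN form with a discount factor γ (the paper's actual value would be among the Table 1 parameters; the function name and batch layout here are illustrative):

```python
import numpy as np

def double_q_targets(rewards, q_next_online, q_next_target, done, gamma=0.99):
    """Double Q-learning targets for a batch of transitions.
    rewards: (B,) array of R(s_{j+1}); done: (B,) boolean arrival flags;
    q_next_online / q_next_target: (B, A) Q-values of s_{j+1} under the
    online network Q(·; ω) and target network Q'(·; ω')."""
    greedy = np.argmax(q_next_online, axis=1)                # online net selects
    q_eval = q_next_target[np.arange(len(rewards)), greedy]  # target net evaluates
    return rewards + gamma * q_eval * (~done)                # bootstrap unless done
```

The gradient step in the algorithm then minimizes (y_j − Q(ϕ(s_j), v_j; ω))² over a sampled minibatch.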

Displacement Distribution Analysis
For the navigation examples in free space (Figure 2a) and in the rectangle-obstacle environment (Figure 2b), we collected displacement statistics from 60 000 trajectories initialized at positions (5, 40) and (35, 30), respectively, with random orientations. The targets were located at (40, 5) and (9, 5), respectively. Displacements at different observation times (t = 1τ, 5τ, 10τ, 20τ, 30τ) were projected along the unit vector connecting the initial position to the target position to obtain the parallel component r_|| and the perpendicular component r_⊥. In the rectangle-obstacle environment, displacements were instead projected along the piecewise linear segments (the shortest geometric path) connecting the initial position to the target position to obtain r_|| and r_⊥. Projections onto these piecewise linear segments were calculated using the distance2curve MATLAB routine (https://www.mathworks.com/matlabcentral/fileexchange/34869-distance2curve).
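For the free-space case, where the projection axis is a single straight segment, the decomposition into r_|| and r_⊥ can be sketched as below (the function name is illustrative; the piecewise-segment case in the obstacle environment requires the distance2curve routine instead):

```python
import numpy as np

def project_displacement(start, target, pos):
    """Decompose the displacement pos - start into components parallel and
    perpendicular to the unit vector pointing from start to target (2D)."""
    d = np.asarray(pos, float) - np.asarray(start, float)
    u = np.asarray(target, float) - np.asarray(start, float)
    u /= np.linalg.norm(u)                    # unit vector start -> target
    r_par = float(d @ u)                      # parallel component r_||
    r_perp = float(u[0] * d[1] - u[1] * d[0])  # signed 2D cross product, r_perp
    return r_par, r_perp
```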

MSD Analysis
The MSDs for trajectories of the DRL agent and the constant self-propelled agent were calculated from 300 trajectories. For navigation in free space, trajectories were initialized at a fixed location (0, 0) with random orientations and navigated toward a remote target at (100 000, 0). For navigation in periodic obstacle environments (Figure 4), trajectories were initialized at a fixed location (5, 5) with random orientations and navigated toward a remote target at (5760, 5760). For the obstacle environments in Figure 4a,b, we applied periodic boundary conditions to the simulation box without the hard wall; for all other navigation tests, we applied hard-wall conditions to the simulation box.
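An ensemble MSD of this kind can be computed as sketched below, averaging the squared displacement from each trajectory's own starting point over all trajectories at each lag time (the array layout and function name are assumptions of the sketch):

```python
import numpy as np

def ensemble_msd(trajectories):
    """Ensemble mean-squared displacement.
    trajectories: array of shape (N, T, 2) holding N trajectories of T
    time steps in 2D. Returns MSD(t), a length-T array averaged over the
    N trajectories relative to each trajectory's initial position."""
    traj = np.asarray(trajectories, float)
    disp = traj - traj[:, :1, :]              # displacement from t = 0
    return (disp ** 2).sum(axis=2).mean(axis=0)
```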

Mean First Passage Time and Mean Travel Path Length
The mean first passage time and mean travel path length for each navigation task were averaged over 300 trajectories, all initialized at different positions with random orientations. The shortest geometric paths between the starting positions and the target position in obstacle environments were calculated using the fast-marching algorithm (see Figure S2, Supporting Information; https://www.mathworks.com/matlabcentral/fileexchange/6110-toolbox-fast-marching).
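These two averages can be sketched as follows, under the assumption (labeled here, not stated in the text) that each recorded trajectory ends at its first arrival at the target, so the passage time is the trajectory duration and the path length is the summed step displacement; Δt is the simulation time step:

```python
import numpy as np

def mean_passage_stats(trajectories, dt):
    """Mean first passage time and mean travel path length over a set of
    trajectories, each a (T, 2) array of positions sampled every dt and
    assumed to terminate at first arrival at the target."""
    times, lengths = [], []
    for traj in trajectories:
        traj = np.asarray(traj, float)
        times.append((len(traj) - 1) * dt)               # duration to arrival
        steps = np.diff(traj, axis=0)                    # per-step displacements
        lengths.append(np.linalg.norm(steps, axis=1).sum())
    return float(np.mean(times)), float(np.mean(lengths))
```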

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.