Independent Control and Path Planning of Microswimmers with a Uniform Magnetic Field

Artificial bacteria flagella (ABFs) are magnetic helical micro-swimmers that can be remotely controlled via a uniform, rotating magnetic field. Previous studies have used the heterogeneous response of microswimmers to external magnetic fields for achieving independent control. Here we introduce analytical and reinforcement learning control strategies for path planning to a target by multiple swimmers using a uniform magnetic field. The comparison of the two algorithms shows the superiority of reinforcement learning in achieving minimal travel time to a target. The results demonstrate, for the first time, the effective independent navigation of realistic micro-swimmers with a uniform magnetic field in a viscous flow field.


Introduction
The magnetic control of micro-swimming devices [1,2,3,4,5] through micro-manipulation [6,7], targeted drug delivery [8,9] or convection-enhanced transport [10], has created new frontiers for bio-medicine. A particularly promising technology involves corkscrew-shaped magnetic micro-swimmers (artificial bacterial flagella (ABFs)) that propel themselves when subjected to a rotating magnetic field [11]. Rotating magnetic fields can form propulsive gradients and they are arguably preferable to alternatives, such as electric fields, for in-vivo conditions [12,13]. However, the independent, yet coordinated, control of individual ABFs is challenging as it requires balancing between the magnetic forces and the hydrodynamic interactions between the swimmers while the employed magnetic fields are practically uniform over lengths of few micrometers. We note that independent navigation of mm-sized micro-swimmers has been shown in [14] through experiments and simulations, while in [15] a reinforcement learning (RL) algorithm was applied to adjust the velocity of an idealized swimmer in simulations with one way coupling with a complex flow field. Control of swimmers using two-way coupling and RL have been demonstrated with linked-spheres at low Reynolds numbers [16] and for artificial fish in macro-scales [17]. Similarly, genetic algorithms have been used to navigate micro-swimmers towards high concentrations of chemicals [18]. The problem of heterogeneous micro-robots navigation via a uniform input has been studied in two dimensions on surfaces [19] and in a fluid at rest [20]. The steering of two micro-propellers along two distinct paths in 3 dimensions has been accomplished with the help of magnetic fields gradients [21]. These advances exploited the heterogeneous response of micro-swimmers to a uniform input to achieve independent trajectories along a prescribed path. These control methods are based on short horizon objectives (stay on the prescribed path) and do not provide the trajectory that minimizes the travel time to a target position, particularly in the presence of a background flow. In addition, strong background flows restrict the set of feasible paths for given micro-swimmers. To the best of our knowledge the steering of multiple micron-sized swimmers towards a target in a minimal time under a background flow and a uniform magnetic field has not been reported before. In this work, we present two methods to independently guide two micro-ABFs towards a single target in the presence of a uniform magnetic field. The two methods rely on simulations of swimming ABFs using an ordinary differential equation (ODE) model. The model is calibrated with the method of dissipative particle dynamics (DPD) [22,23], taking into account the particular geometry of the swimmers and their interactions with the viscous fluid. We first present a semi-analytical solution for the simple yet instructive setup of multiple, geometrically distinct ABFs in free space, with zero background flow. This result enables understanding of the design constraints for the ABFs necessary for independent control and how their geometric characteristics relate to their travel time. We then employ RL to control multiple ABFs trajectories in a broad range of flow conditions including a non-zero background flow.

Artificial bacterial flagella
The ABFs are modeled as microscopic rigid bodies of length l with position x and orientation q (represented by a quaternion), immersed in a viscous fluid and subjected to a rotating, uniform, magnetic field. We estimate that the magnetic and hydrodynamic interactions between ABFs are orders of magnitude smaller than those due to the magnetic field for dilute systems (see supplementary material) and we ignore inertial effects due to their low Reynolds number (Re ≈ 10 −3 ). Following this approximation, the system is fully described by the position and orientation of the ABFs. Additionally, the linear and angular velocities of the ABF, V b and Ω b , are directly linked to the external force and torque, F b and T b , via the mobility matrix [24], where the superscript b indicates that the quantity is expressed in the ABF frame of reference, for which ∆, Z and Γ are diagonal. The matrices Γ and Z represent the application of torque to changing the angular and linear velocity, respectively. The ABFs are propelled by torque applied through a magnetic field and we assume that it can swim only in the direction of its main axis so that Z has only one nonzero entry (Z 11 ). The coefficients in the mobility matrix are often estimated by empirical formulas for low Reynolds number flows [25]. Here we estimate the components of the mobility matrix for the specific ABF by conducting flow simulations using DPD [22,23], which we validate against experimental data of [8] (see supplementary material). We remark that the shape (pitch, diameter, length, thickness) of the ABF influence the elements of these matrices and the present approach allows to account for these geometries.
The ABF with a magnetic moment m is subjected to a uniform magnetic field B and hence experiences a torque T = m × B.
No other external force is applied to the ABF, hence F = 0. Combining eq. (1) with the kinematic equations for a rigid body gives the following system of ODEs: where ⊗ denotes the quaternion product, andΩ the pure quaternion formed by the vector Ω. The transformations between the laboratory frame of reference and that of the ABF are given by: where q is the conjugate of q and R(q) is the rotation matrix that corresponds to the rotation by a quaternion q [26]. The system of differential equations (2) and (3) Figure 1: Left: Dimensionless time averaged forward velocity of two ABFs, differing in shape and magnetic moment, against the field rotation frequency (in units of the step-out frequency of the first swimmer, ω c,1 ). Right: The ABFs geometries. The arrows represent the magnetic moment of the ABFs.
We note that when simulating multiple non-interacting ABFs in free space, we use the above ODE system for each swimmer with the common magnetic field but different mobility coefficients and magnetic moments.

Forward velocity
ABFs were designed to swim under a rotating, uniform magnetic field [27,11]. We first study this scenario by applying the field B(t) = B (0, cos ωt, sin ωt) to ABFs initially aligned with the x axis of the laboratory frame. Note that in the later sections, the magnetic field is able to rotate in any direction so that the swimmers can navigate in three dimensions. We consider two ABFs with the same length but different pitch and magnetic moments, as shown in fig. 1. In both cases, the magnetic moment is perpendicular to the helical axis of the ABF. Under these conditions, by symmetry of the problem, the swimmers swim along the x axis. The difference in pitch results in different coefficients of the mobility matrix and along with the different magnetic moments results in distinct propulsion velocities for the two ABFs. For each ABF velocity we distinguish a linear and a non-linear variation with respect to the frequency of the magnetic field. First, the ABF rotates at the same frequency as the magnetic field and its forward velocity increases linearly with the frequency of the magnetic field, consistent with the low Reynolds approximation [28,8,29,3]. In the non-linear regime, the magnetic torque is no longer able to sustain the same frequency of rotation as the magnetic field. The onset of non-linearity depends on the geometry and magnetic moment of the ABF as well as the imposed magnetic field. Indeed, the magnitude of the magnetic torque is bounded while that of the hydrodynamic torque increases linearly with the ABF angular velocity Ω. The torque imbalance at high rotation frequencies causes the ABF to slip, resulting in an alternating forward and backward motion (see supplementary material). Increasing the frequency further increases the effective slip and accordingly decreases the forward velocity. The two regimes are distinguished by the step-out frequency ω c corresponding to the maximum forward velocity of the ABF. The differences in propulsion velocities for the ABFs can be exploited to control independently their trajectories. The slope V /ω in the linear regime depends only on the shape of the ABF. The step-out frequency depends on both the shape and the magnetic moment (it can also be changed by varying the surface wetability of the ABF [30]). These two properties can be chosen such that the forward velocities of two ABFs react differently to the magnetic rotation frequency ( fig. 1). By changing ω, it is then possible to control the relative velocities of the two ABFs: one is faster that the other in one regime while the opposite occurs in an other regime. This simple observation constitutes the key idea for independent control of several ABFs even with a uniform magnetic field. We remark that, while this potential has been previously identified [28,30,31,32,33], the control of similar systems have been performed in the simple case of free space, non interacting propellers and no background flow [21,20]. To the best of our knowledge, this is the first time that such independent controlled navigation of multiple micro-swimmers is materialised in three dimensions with a complex background flow. In the following sections we propose two methods to tackle the problem of steering ABFs towards a target in a minimal amount of time.

Independent control I: semi-analytical solution
In the absence of an external flow field, we derive a semi-analytical strategy for the navigation of N ABFs towards a particular target. Each ABF has a distinct magnetic moment and without any loss of generality, we set the target position of all swimmers to the origin and define the initial position of the i th ABF as x (i) . We assume that the time required by one ABF to align with the rotation direction of the field is much smaller than |x (i) |/v, where v is the typical forward velocity of the ABF. The proposed strategy consists in gathering all ABFs along one direction n k at a time, such that x (i) · n k = 0, i = 1, 2, . . . , N after phase k. We choose a sequence of orthogonal directions, n k · n k = δ kk . The choice of the orientations of n k is not restricted to the basis vectors of the laboratory frame and is described at the end of this section. In three dimensions, the strategy consists of three phases, k = 1, 2, 3, until all ABFs have reached their target: they first gather on a plane, then on a line and finally to the target. All ABFs are gathered along a given direction n k by exploiting the different forward responses of the ABFs when we alternate the frequency of rotation of the magnetic field. More specifically, for N ABFs, the field rotates in the direction n k for t j time units at frequency ω c,j , j = 1, 2, . . . , N , where ω c,j is the step-out frequency of the j th swimmer. We define the velocity matrix with elements U ij = V i (ω c,j ), denoting the velocity of swimmer i when the field rotates with the step out frequency of swimmer j. We can relate the above quantities to the (signed) distances d j covered by the ABFs as where s j ∈ {−1, 1} determines if the field rotates clockwise/counterclockwise. Equivalently, the vector form of the above is d = U β, where β j = t j s j . Setting d i = x (i) · n k , we can invert this linear system of equations for each phase k and obtain the times spent at each step-out frequency We emphasize that this result holds only if the velocity matrix is invertible, restricting the design of the ABFs to achieve independent control. The total time spent at phase k is then given by The yet unknown directions n k , k = 1, 2, 3, are chosen to minimize the total travel time. The directions are parameterized as n k = R(φ, θ, ψ)e k , k = 1, 2, 3, where R(φ, θ, ψ) is the rotation matrix given by the three Euler angles φ, θ and ψ. Note that this choice of handedness of the three directions does not influence the final result. The optimal angles satisfy φ , θ , ψ = arg min We solve the above minimization problem numerically with derandomised evolution strategy with covariance matrix adaptation (CMA-ES) [34] (see supplementary material for the configuration of the optimizer).

Independent control II: Reinforcement Learning
We now employ a RL approach to solve the problem introduced in section 4. Each of the N ABFs is initially placed at a random position netic field frequency of rotation and direction, and has the goal of bringing all ABFs within a small distance (here two body lengths, d = 2l) from the target origin. This small distance is justified by the assumption of non-interacting ABFs. The agent sets the direction and magnitude of the magnetic field frequency every fixed time interval. An episode is terminated if either of the two conditions occur: (a) all ABFs reached the target within a small distance d, or (b) the simulation time exceeds a maximum time T max . The positions x i and orientations q i of the ABFs describe the state s of the environment in the RL framework. The action performed by the agent every ∆t time encodes the magnetic field rotation frequency and orientation for the next time interval. The reward of the system is designed so that all ABFs reach the target and the travel time is minimized. Additionally, a shaping reward term [35] is added to improve the learning process. The training is performed using VRACER, the off-policy actor critic RL method described in [36]. More details on the method can be found in the supplementary material.

Reaching the targets
In this section, we demonstrate the effectiveness of the two methods introduced in sections 4 and 5. We first consider 2 ABFs in free space with zero background flow. Figure 2 shows the distance of the ABFs to their target over time, and the corresponding magnetic field rotation frequency for both methods. In both cases, the ABFs successfully reach their target. Interestingly, the rotation frequencies chosen by the RL agent correspond to the step-out frequencies of the ABFs. Indeed, these frequencies allow the fastest absolute velocity difference between the ABFs, so it is consistent that they are part of the fastest solution found by the RL method. Furthermore, the RL trained swimmer was about 25% faster than the semi-analytical swimmer. We remark that the RL solution amounts to first blocking the forward motion of one swimmer while the other continues swimming (see fig. 3). The blocked swimmer is first reoriented such that its magnetic moment is orthogonal to the plane of the magnetic field rotation, thus the resulting magnetic torque applied to this swimmer is zero. On the other hand, the method presented in section 4 makes both ABFs swim at all time, even if one of them must go further from its target. In such situations, the "blocking" method found by RL is advantageous over the other method. We now employ the RL method in the case of 2 ABFs swimming in a background flow with non zero velocity. The assumptions required for deriving the semi-analytical approach are violated and therefore we do not use this approach in this case. In the presence of a background flow u ∞ , eqs. (4c) and (4d) become with A = B = C/2 = V 1 (ω c,1 ) and a = b = −c = 2π/50l. With these parameters, the maximum velocity of the background flow is larger than the maximum swimming speed of the ABFs. The distances between the swimmers and the target over time are shown for the RL method on fig. 4. Despite the background flow perturbation, the RL method successfully navigates the ABFs to their target. The magnetic action space exhibits a similar behavior as in the free space case: the rotation frequency of the magnetic field oscillates between the step out frequencies of both swimmers and never exceeds the highest of these frequencies, where the swimming performance would degrade considerably. The trajectories of the ABFs seem to make use of the velocity field to achieve a lower travel time: fig. 4 shows that the trajectories tend to be parallel to the velocity field. The RL method not only found a solution, but also made use of its environment to reduce the travel time.

Robustness of the RL policy
The robustness of the RL method is tested against two external perturbations, unseen during the training phase. In both cases, the robustness of the method is measured in terms of success rate (expected ratio between the number of successful trajectories and the number of attempts). First, a flow perturbation δu(r) = εu ∞ (r/p) is added to the background flow described in the previous section, where ε controls the strength of the perturbation and p controls the wave length of the perturbation with respect to the original one. Figure 5 shows the success rate of the RL approach against the  perturbation strength ε for different p. For large wave lengths (p = 2), the RL agent is able to successfully steer the ABFs to their target in more than 90% of the cases when the perturbation strengths of less than 20% of the original flow. In contrast, the success rate degrades more sharply for smaller wave lengths (p = 1/2), suggesting that the method is less robust for short wave length perturbations. The RL policy seems more robust to perturbations with the same wavelengths as the original flow (p = 1) for large perturbation strengths: the success rate is above 30% even for large perturbations. At small length scales, micro-swimmers are subjected to thermal fluctuations. We investigate the robustness of the RL policy (trained with the background flow, eq. (5)) on swimmers subjected to thermal noise and background flow (eq. (5)). The thermal fluctuations are modeled as an additive stochastic term to the linear and angular velocities of each swimmer, following the Einstein relation with the mobility tensor given by eq. (1). Defining the generalized undisturbed velocityV = (V, Ω), the resulting stochastic generalized velocities satisfy V =V + δV, δV i = 0, i = 1, 2, . . . , 6, δV i , δV j = k B T M ij , i, j = 1, 2, . . . , 6, where M is the mobility tensor and k B T is the temperature of the system, in energy units. The above property is achieved by adding a scaled Gaussian random noise with zero mean to the velocities at every time step of the simulation. The success rate of the policy is shown in fig. 5 for various temperatures k B T , in units of the room temperature k B T 0 . As expected, a large thermal noise causes the policy to fail at its task. Nevertheless, this failure only occurs at relatively high temperatures: the success rate falls below 50% for k B T > 25k B T 0 , which is well above the normal operating conditions of ABFs. With temperatures below 2k B T 0 , the RL method sustains a success rate above 99%. We remark that this robustness is achieved successfully even with a policy trained with k B T = 0.

Conclusion
We have presented two methods to guide multiple ABFs individually towards targets with a uniform magnetic field. The semi-analytical method allows to understand the basic mechanisms that allow independent control and we derive the necessary condition for the independent control of multiple ABFs: their velocity matrix must be invertible, a condition that can be accommodated by suitably choosing the geometries of the swimmers. This result may help to optimize the shapes of ABFs to further reduce the travel time. The RL approach can control multiple ABFs in quiescent flow as well as in the presence of a complex background flow. Additionally, this approach is resilient to small flow perturbations and to thermal noise. When the background flow vanishes, the RL method recovers a very similar behavior as the first method: the rotation frequency is alternating between the step-out frequencies of the swimmers. Furthermore, the RL method could reach lower travel time than the first method by blocking one swimmer while the other is swimming. Steering an increased number of swimmers requires longer travel times according to the first method. We thus expect that applying the RL approach to more than two swimmers requires longer training times and might become prohibitively expensive as the number of swimmers increases. Possible solutions to this problem might include pre-training the RL agent with the policy found by the semi-analytical method. The current work focused on the simplified case of non-interacting swimmers. In practice, the ABFs may interact hydrodynamically and magnetically with each other, encounter obstacles, evolve in confined geometries or experience time varying flows. Nevertheless, we expect the RL method to be a good candidate to overcome these variants, in the same way as it naturally handled the addition of a background flow.

Neglecting the interactions between swimmers
The artificial bacterial flagella (ABFs) experience hydrodynamic drag forces and torques. Additionally, they are subjected to a magnetic torque from the external, imposed, magnetic field. Furthermore, when multiple swimmers compose the system, hydrodynamic and magnetic interactions occur. Here we estimate the magnitude of these interaction forces by considering a simplified representation of the ABFs immersed in a fluid of viscosity η = 10 −3 Pa s, subjected to a rotating magnetic field of magnitude B = 1 mT. To have a rough estimate of the hydrodynamic torque, we represent the ABFs as ellipsoidal bodies of size 2a × 2b × 2c = 10 µm × 2 µm × 2 µm. The magnetic fields rotates with angular velocity ω = 10 · 2πs −1 which is also the step-out frequency ω c of the ABFs. These choices correspond to similar conditions as those found in experimental work for ABFs [1]. Considering the equivalent sphere of the ABF, with a radius R = (abc) 1/3 , the hydrodynamic torque magnitude can be expressed as [2] T H = 8πηR 3 ω.
At the step out frequency, the magnetic torque reaches its maximum magnitude, where m is the magnetic moment of the sphere. Equating eqs. (1) and (2) gives an estimate for the magnetic moment hence m ≈ 8 × 10 −15 Am 2 . The magnetic field generated by a dipole of moment m is given by wherer = r/r and r = |r|. Therefore, the magnetic torque exerted by one sphere of moment m 2 to another of moment m 1 (separated by a distance r) is given by Considering a small distance between the spheres r = 10a (five body lengths between the centers) and replacing with the numerical values, we obtain T DD /T M ≈ 10 −5 , hence we neglect the magnetic interactions between swimmers. Similarly, the magnetic force caused by the gradient of the magnetic field generated by the dipole is negligible compared to that of the hydrodynamic drag. We now estimate the hydrodynamic interactions between 2 rotating ABFs by computing the fluid velocity caused by the rotation of one at the other's location. The magnitude of the velocity caused by the equivalent sphere of radius R rotating with angular velocity ω is given by v int (r) = ωR 3 /r 2 . This rough estimate, at a distance r = 10a (5 body lengths), gives a flow velocity of about v int ≈ 0.12 µm s −1 .
ABFs are able to swim at velocities of around one body length per second, that is v ≈ 10 µm s −1 . Hence the velocity due to the interactions amounts to about 1.2% of the swimming velocity when the swimmers are close to each other. This velocity decreases further when the swimmers are separated by larger distances (0.03% when they are separated by 10 body lengths), which is the most common configuration in the current problem. We therefore neglect the hydrodynamic interactions between ABFs in our simulations.

Forward slip
As explained in the main text, the velocity of an ABF subjected to a uniform magnetic field rotating at constant frequency first increases linearly (its rotation follows that of the magnetic field, hence its velocity is constant). After the frequency of rotation of the field reaches the critical value ω c , the hydrodynamic forces that would be needed to follow the magnetic field frequency are too high, causing the ABF to "slip". This can be seen on fig. 1. Below ω c , the velocity is constant, hence the displacement of the ABF increases linearly with time. For higher magnetic field frequency, the ABF slips and hence perform a back and forth motion. The time averaged velocity hence decreases.

Quaternions
Quaternions are useful mathematical tools to represent rotations in 3 dimensions [3]. Unlike the rotation matrices, which have 9 components in 3 dimensions, quaternions store only 4 components, which make them also desirable on a computational point of view. A quaternion q can be represented as where q w , q x , q y , q z ∈ R are real numbers and i 2 = j 2 = k 2 = ijk = −1. A common notation is to decompose the quaternion into a scalar part and a vector part, where q = (q x , q y , q z ). A quaternion with zero scalar part is called a pure vector quaternion, and a quaternion with the null vector is called a pure scalar quaternion. The conjugate of q is defined as q = (q w , −q).
Furthermore, the norm of a quaternion is The product between two quaternions, noted ⊗, can be derived from the properties of i, j, k. In vector notations, it is given by where × denotes the vector product. Rotations can be represented by unit quaternions |q| = 1. The rotation of angle φ around the unit vector u is represented by the quaternion The rotation of a vector x by q is given in terms of quaternion product, where x = (0, x) is the pure vector quaternion of x and y is the result of the rotation on x. The same operation can be written in matrix form, where 4 Calibration of the propulsion matrix from dissipative particle dynamics The mobility matrix presented in the main text links the external forces and torques applied to a rigid body with its instantaneous linear and angular velocities, relatively to the flow: The elements of this mobility matrix depend on the shape of the body. In this work, we estimated the mobility matrix of ABFs of different pitch but with the same length and diameter. The body was immersed in a fluid at rest and subjected to rotations and translations with constant speed, and the hydrodynamic torques and forces were collected. The flow solver, Mirheo [4], employs the dissipative particle dynamics (DPD) method [5], that we briefly summarize here for completeness. The flow is discretized into n particles of mass, positions and velocities m i , r i and v i , respectively, with i = 1, 2 . . . , n. We consider particles with the same mass, m i = m. They evolve in time according to Newton's law of motion, where f i is the total force acting on the i th particle, where r ij = r i − r j , r ij = |r ij |, e ij = r ij /r ij and v ij = v i − v j . The coefficients a, γ and σ are the conservative, dissipative and random force coefficients, respectively. ξ ij is a random variable with zero mean and unit variance satisfying The kernel w(r) = max (1 − r/r c , 0) vanishes after the cutoff radius r c . The dissipative and random kernels are linked through the fluctuation-dissipation theorem where k B T is the temperature of the fluid in energy units. We set w D = w s with s = 1/4. The ABF is modeled as a rigid object consisting of frozen DPD particles that move as a rigid body and interact with the surrounding solvent particles. The frozen particles are particles chosen from a separate simulation of a DPD solvent at rest. Once equilibrated, the particles that are inside the volume of the ABF are selected. Furthermore, to enforce no-slip and no-penetrability boundary conditions, the solvent particles are bounced back from the surface of the ABF. The ABFs were first oriented along their principal axes and we assumed that the matrices ∆, Z and Γ had non zero elements only on their diagonal. For each ABF, we perform 6 simulations: in each simulation, only one of the 6 components of the linear and angular velocities, V and Ω, is non zero. The time averaged forces and torques were collected for each of these simulations, F i and T i , where i = 1, 2, . . . , 6 is the simulation index. The coefficients of the mobility matrix are then estimated as for i = 1, 2, 3. The coefficient Z 11 , on the other hand, is computed from the simulation of a freely translating ABF with an imposed rotation with constant angular velocity Ω along its swimming direction in a DPD fluid. After reporting the averaged swimming velocity V , we obtain All simulations are performed with a particle number density of n d = 10r −3 c . The length of the ABF is set to 27r c , resulting in more than 3000 ABF frozen particles in each case. The simulation domain is periodic and has dimensions L = 270r c (ten times the length of the swimmer), which was high enough to avoid disturbances from the periodic images.  The geometries of the two ABF for which the propulsion matrix has been calibrated. In arbitrary units, the left one has a pich of 2 and the right one has a pitch of 5. The arrow represents the magnetic moment m of the ABFs. We choose the magnetic moment such that the maximum velocity of the swimmer equals one body length per second, which is a typical value found in experiments for a magnetic field of the order of 1 mT. This constraint is expressed as V max = Z 11 mB, hence we set

Validation of the DPD method
The simulation of a single swimmer in 3 dimensions, in free space, is modeled similarly as in the previous section with the DPD method. The swimmer has a magnetic moment perpendicular to its principal component and is immersed in a uniform, rotating magnetic field B(t) = B(0, cos ωt, sin ωt). The simulation employs the same resolution as described in the previous section. The swimmer's geometry is reproduced from [6] and the magnetic moment (not reported in the experiments) has been tuned to match the experimental data (m = 1.0 × 10 −14 N m T −1 ), with a magnetic field magnitude of B = 3 mT. The mean swimming velocity along the x axis is reported for different angular velocities ω on fig. 3 and agree well with the experimental data.

Settings of the CMA-ES minimization
The minimization of the travel time with respect to the travel directions n k is performed with the gradient free method derandomised evolution strategy with covariance matrix adaptation (CMA-ES). The population size is set to µ = 16, the initial standard deviation is σ = π/2 and the minimization is stopped when the standard deviation reaches δ = 10 −5 . The first generation is centered at (φ, θ, ψ) = (0, 0, 0).

Reinforcement Learning
The reinforcement learning (RL) method consists in an agent controlling the magnetic field through an action A = (a ω , a x , a y , a z ). This action sets the frequency and direction of the rotating magnetic field for ∆t simulation time. The angular velocity is set to ω = a ω and the direction is set to the unit vector u = (a x , a y , a z )/ a 2 x + a 2 y + a 2 z . Every ∆t time, the environment returns its state s and reward r to the agent. The state is composed of all positions and quaternions of the ABFs. Due to the absence of inertia, this state describes the full system. The reward has two components, designed to guide the ABFs towards the target in a minimal time: The first term is designed from reward shaping [7]: it increases if the ABFs come closer to the target, so that the cumulative reward (we use a discount factor γ = 1), until the final simulation time T , depends only on the initial conditions if the ABFs reached their goal. The constant K φ in eq. (28) is set such that the sum of the shaping part of the reward over one successful episodes is of the order of 1. Therefore, this part of the cumulative reward does not affect the resulting policy [7]. The second term in eq. (27) is designed to penalize long travel times, c t = ∆t/T max , Furthermore, an additional positive term is added to the reward if all ABFs reached the target within the given distance d success . We set this constant to K success = 1. The parameters used for the two-swimmers setup are listed in table 2.
Parameter Value ∆t 128.5/ω c,1 d success 2l T max 139l/V K success 1 K φ 1/72.3l x 0 1 (50l, 20l, −5l) x 0 2 (−10l, 30l, 20l) σ 3l We used the off-policy actor-critic algorithm V-RACER, implemented in the software smarties [8,9], with the default parameters. A single deep neural network (with 2 hidden layers of 128 units each) maps the state space to a continuous approximation of the policy, the state value function and the action value function. More precisely, the policy is parameterized as a gaussian whose means and variances are output of the neural network. The neural network weights are updated through policy-gradient. More details on the algorithm and implementation can be found in [8].
The number of episodes needed to converge for the RL method is of the order of 2.5 × 10 4 , as shown on fig. 4. The travel time is first equal to the maximum allowed one (when the RL agent is not able to guide the ABFs towards the target). After about 5000 episodes, the ABFs reach the target and the travel time decreases with the number of training episodes. The travel time converges to a smaller value than that of the semi analytical solution. ) in free space, zero background flow, and corresponding magnetic field rotation frequency ( ), where ω c,1 is the step-out frequency of the first swimmer.