Massively parallel computation of globally optimal shortest paths with curvature penalization

We address the computation of paths globally minimizing an energy involving their curvature, with given endpoints and tangents at these endpoints, according to models known as the Reeds-Shepp car (reversible and forward variants), the Euler-Mumford elasticae, and the Dubins car. For that purpose, we numerically solve degenerate variants of the eikonal equation, on a three-dimensional domain, in a massively parallel manner on a graphics processing unit. Due to the high anisotropy and nonlinearity of the addressed partial differential equation, the discretization stencil is rather wide, has numerous elements, and is costly to generate, which leads to subtle compromises between computational cost, memory usage, and cache coherency. Accelerations by a factor 30 to 120 are obtained w.r.t. a sequential implementation. The efficiency and the robustness of the method are illustrated in various contexts, ranging from motion planning to vessel segmentation and radar configuration.

discretization grid, but solve the resulting coupled system of equations using an iterative method implemented on a massively parallel computational architecture, namely a Graphics Processing Unit (GPU), following References 9-12. Our numerical schemes involve finite difference offsets which are often numerous (30 for Euler-Mumford), rather wide (up to 7 pixels), and whose construction requires nontrivial techniques from lattice geometry.2 This is in sharp contrast with the standard isotropic eikonal equation addressed by existing GPU solvers, which requires only a few small finite difference offsets when it is discretized on Cartesian grids,9,10 and depends on unrelated geometric data when the domain is an unstructured mesh.11,12 Due to these differences, the compromises needed to achieve optimal efficiency (a delicate balance between the cost of computations and of memory accesses) strongly differ between previous works and ours, and even between the different models considered in this paper.
Our study provides the opportunity to inspect these compromises as the stencil of the finite difference scheme grows in width, number of elements and complexity, from two offsets of width 1 pixel (isotropic model in two dimensions, 2D), to 30 offsets of width up to 7 pixels (elastica model). Alternative finite difference discretizations may benefit from shorter stencils, but lack structural properties such as causality, leading to different compromises whose investigation is an opportunity for future research. Specifically, our observations regarding the models with wider finite difference stencils are the following: (i) They work best, somewhat counter-intuitively, with a more finely grained parallelization, in our case obtained with smaller tiles and a smaller number of fixed point iterations within them, see Section 3.1 and Table B1. (ii) Precomputing and storing the stencil weights and offsets offers a significant speedup, up to 40% in our case, but the memory cost is prohibitive unless one can take advantage of symmetries in the equation to share this data between grid points, see Section 3.2. (iii) The scheme update operation involves a sort of the solution values fetched at the neighbors defined by the stencil, whose cost becomes dominant in the wide stencil case unless implemented in a GPU friendly manner, see Section 3.3. We expect our findings to transfer to other wide stencil finite difference methods, a class of numerical schemes commonly used to address Hamilton-Jacobi-Bellman PDEs arising in various applications, including deterministic (as here) and stochastic optimal control, optimal transport and optics design via the Monge-Ampère equation,13 etc. Finally, our GPU accelerated eikonal solver is 30× to 120× faster than the CPU fast marching method from Reference 2, see Table B2.
In the numerical experiments Section 4, which include applications to medical image segmentation, boat routing and radar configuration, computations times on typical problem instances are often reduced from 30 s to less than one, enabling convenient user interaction.

OUTLINE
We describe in Section 2.2 the curvature penalized path optimization problems addressed, and in Section 2.

Notations
We use square brackets to denote real intervals, such as ]a, b[, ]a, b], [a, b[ and [a, b] ⊂ R, whose bounds a and b are either contained or excluded following the usual convention. We denote e_θ ∶= (cos θ, sin θ) for all θ ∈ R.

Curvature penalized path models
Throughout this paper we fix a bounded and closed domain Ω ⊂ R², and a continuous and positive cost function ρ ∶ Ω × [0, 2π[ → ]0, ∞[, with periodic boundary conditions in the angular variable. The general objective of this paper is to compute paths (x, θ) ∶ [0, L] → Ω × S¹ in the position-orientation state space, which globally minimize the energy

E(x, θ) ∶= ∫₀^L ρ(x(l), θ(l)) 𝒞(θ̇(l)) dl,  subject to ẋ(l) = e_θ(l),   (1)

where we denoted e_θ ∶= (cos θ, sin θ), and θ̇ ∶= dθ/dl and ẋ ∶= dx/dl. An additional constraint to (1) is that the initial and final configurations (x(0), θ(0)) and (x(L), θ(L)) are imposed, in other words the endpoints of the physical path and the tangents at these endpoints. The path is parametrized by Euclidean length in the physical space Ω, and the total length L is a free optimization parameter. The constraint (1, right) requires that the physical path velocity ẋ(l) matches the direction defined by the angular coordinate, e_θ(l) ∶= (cos θ(l), sin θ(l)), for all l ∈ [0, L]. This constraint is said to be nonholonomic because it binds together the first order derivatives (ẋ(l), θ̇(l)) of the path for each l ∈ [0, L].
The choice of curvature penalty function 𝒞(κ), where κ ∶= θ̇ is the derivative of the path direction in (1), is limited to three possibilities in our approach, in contrast with the state dependent cost ρ which is essentially arbitrary. The considered curvature penalties are defined by the following expressions, which correspond to the Reeds-Shepp, Euler-Mumford, and Dubins models respectively: we define 𝒞(κ), for all κ ∈ R, as

𝒞(κ) ∶= √(1 + κ²),  𝒞(κ) ∶= 1 + κ²,  𝒞(κ) ∶= 1 + ∞_{|κ|>1},   (2)

where ∞_cond stands for +∞ where cond holds, and 0 elsewhere. The Reeds-Shepp model penalizes curvature in a roughly linear manner; because this is a rather weak regularization, smooth minimizers of (1) may not exist,14 and a relaxed formulation allowing for singularities is required, introducing either cusps or in-place rotations, and leading to two variants referred to as the Reeds-Shepp reversible and the Reeds-Shepp forward models, see Remark 3. The quadratic curvature penalty of the Euler-Mumford model corresponds to the energy of an elastic bar, hence minimal paths follow the rest positions of such objects. Finally the Dubins model forbids any path section whose curvature exceeds that of the unit disk, by assigning to it the cost +∞. Minimal paths for these models are qualitatively distinct, as illustrated on Figure 1. The curvature penalty may also be scaled and shifted, so as to control its strength and symmetry, see Remark 2 and Section 4.4.
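The three penalties of (2) can be evaluated in a few lines; the following sketch is our own illustration (the function name and API are not from the paper):

```python
import math

def curvature_penalty(kappa, model):
    """Evaluate the curvature penalty C(kappa) of equation (2).

    Illustrative sketch; model names are our own labels for the three cases.
    """
    if model == "reeds_shepp":      # roughly linear penalty, C(k) = sqrt(1 + k^2)
        return math.sqrt(1.0 + kappa**2)
    if model == "euler_mumford":    # quadratic (elastica) penalty, C(k) = 1 + k^2
        return 1.0 + kappa**2
    if model == "dubins":           # hard curvature bound: |k| <= 1, else infinite cost
        return 1.0 if abs(kappa) <= 1.0 else math.inf
    raise ValueError(model)
```

Note that all three penalties equal 1 for a straight path (κ = 0), so they only differ in how strongly turning is discouraged.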
In the following, we fix a seed point (x_*, θ_*) ∈ Ω × S¹ in the state space, and denote by u(x, θ) the minimal cost of a path from this seed to an arbitrary target (x, θ) ∈ Ω × S¹:

u(x, θ) ∶= min {E(x̃, θ̃); (x̃, θ̃) admissible path for (1), from (x_*, θ_*) to (x, θ)}.   (3)

For the Reeds-Shepp models, the path cost is modified as in Remark 3. Once the map u ∶ Ω × S¹ → R is numerically computed, as described in Section 2.3, a standard backtracking technique2 allows extracting the path (x, θ) ∶ [0, L] → Ω × S¹ globally minimizing (1), from the seed state (x_*, θ_*) to any given target (x^*, θ^*) ∈ Ω × S¹.
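The backtracking step can be sketched in a simplified discrete form (our own toy stand-in for the gradient-descent backtracking of Reference 2, on a 2D grid rather than the 3D state space): starting from the target, repeatedly step to the neighbor with the smallest value of u, until the seed (where u = 0) is reached.

```python
import numpy as np

def backtrack(u, start):
    """Greedy discrete backtracking on a value function array.

    From `start`, step to the 8-neighbor with the smallest value of u,
    until a local minimum (the seed) is reached. Illustrative sketch only.
    """
    path = [start]
    p = start
    while True:
        i, j = p
        best, best_q = u[i, j], None
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                qi, qj = i + di, j + dj
                if 0 <= qi < u.shape[0] and 0 <= qj < u.shape[1] and u[qi, qj] < best:
                    best, best_q = u[qi, qj], (qi, qj)
        if best_q is None:   # local minimum reached: this is the seed
            return path
        p = best_q
        path.append(p)

# Example: Euclidean distance field from the seed (0, 0).
u = np.hypot(*np.meshgrid(np.arange(10), np.arange(10), indexing="ij"))
path = backtrack(u, (9, 9))
```

On this distance field the backtracked path is the diagonal from (9, 9) down to the seed, with strictly decreasing values of u along the way.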
Remark 3 (Well-posed formulation of the Reeds-Shepp model). The minimization of the energy (1) associated with the Reeds-Shepp cost 𝒞(κ) ∶= √(1 + κ²) is not a well-posed optimization problem in general,14,15 since minimizing sequences may develop nondifferentiable singularities in the limit, see Figure 1. To obtain a well-posed problem, following Reference 6, we reject the convenient planar Euclidean arclength parametrization l ∈ [0, L] used in (1), in favor of an arbitrary time parametrization t ∈ [0, 1], thus allowing for times where the spatial velocity vanishes, ẋ(t) = 0, but the angular velocity does not, θ̇(t) ≠ 0.
The formulation (4) allows the proportionality constant λ(t), in the relaxed constraint ẋ(t) = λ(t) e_θ(t), to be positive or negative, and for this reason we refer to (4) as the Reeds-Shepp reversible model. A cusp is observed when λ changes sign.
One may also introduce in (4) the additional constraint ⟨ẋ, e_θ⟩ ≥ 0, thus ensuring that the proportionality constant λ(t) ≥ 0 is nonnegative, and leading to the Reeds-Shepp forward model.6 A rotation in place is observed when λ vanishes over some time interval, which often happens at the start and at the end of the path.
F I G U R E 1 Planar projections of minimal geodesics for the Reeds-Shepp, Reeds-Shepp forward, Elastica and Dubins models (left to right). Seed point (0, 0) with horizontal tangent, regularly spaced tip points with random tangents (but identical for all models).

Nonholonomic eikonal equations and their discretization
The minimal travel cost (3), from a given source point to an arbitrary target, is the value function of a deterministic optimal control problem. As such, it obeys a first-order static nonlinear PDE, a variant of the eikonal equation, of the generic form 𝒜u(x, θ) = ρ(x, θ), where the operator 𝒜 involves the partial derivatives ∇ₓu(x, θ) ∈ R² and ∂_θ u(x, θ) ∈ R of the unknown u ∶ Ω × S¹ → R w.r.t. the physical position x and the angular coordinate θ. This PDE holds in Ω × S¹ ⧵ {(x_*, θ_*)}, while the constraint u(x_*, θ_*) = 0 is imposed at the seed point (x_*, θ_*), and outflow boundary conditions are applied on ∂Ω. The detailed arguments and adequate concepts of optimal control, Hamilton-Jacobi-Bellman equations, and discontinuous viscosity solutions, are nontrivial and unrelated to the object of this paper (which is GPU acceleration), hence we simply refer the interested reader to References 2,16. For comparison, the standard isotropic eikonal equation1,17 on R^d, which corresponds to an omni-directional vehicle not subject to maneuverability constraints or curvature penalization, is defined by the operator 𝒜u = ‖∇u‖.
The considered variants of the eikonal PDE involve nonlinear and anisotropic operators 𝒜u(x, θ), whose detailed expressions (5) and (6) are given in Reference 2. We rely on a finite differences discretization Fu of the operator 𝒜u, on the Cartesian grid

X_h ∶= (Ω × [0, 2π[) ∩ (h Z)³,   (7)

where the physical domain is usually rectangular Ω = [a, b] × [c, d] (or padded as such), and where the grid scale h > 0 is such that 2π/h ∈ N, so that the sampling of S¹ ∶= [0, 2π[ is compatible with the periodic boundary conditions. By convention, the value function u is extended by +∞ outside Ω, thus implementing the desired outflow boundary conditions on ∂Ω. For any discretization point p = (x, θ) ∈ X_h, the finite differences operator Fu(p) is defined as the square root of the following expression

Fu(p)² ∶= (1/h²) max_{1≤k≤K} ( ∑_{1≤i≤I} α_{ik} (u(p) − u(p + h e_{ik}))₊² + ∑_{1≤j≤J} β_{jk} (u(p) − min{u(p + h f_{jk}), u(p − h f_{jk})})₊² ),   (8)

where a₊ ∶= max{0, a}, where I, J, K are fixed integers, α_{ik}, β_{jk} ≥ 0 are nonnegative weights, and e_{ik}, f_{jk} ∈ Z³ are finite difference offsets, for all 1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ k ≤ K. The weights and offsets may depend on the current point p. Before turning to the variants (6) and (5), let us mention that the standard discretization17 of the isotropic eikonal equation (𝒜u = ‖∇u‖) fits within this framework, with meta-parameters J = d (and I = 0, K = 1), choosing unit weights β_{j1} = 1, 1 ≤ j ≤ d, and letting (f_{j1})_{j=1}^d be the canonical basis of R^d. Riemannian eikonal PDEs can also be addressed in this framework, with J = d(d + 1)/2 (and I = 0, K = 1), and using weights and offsets defined by an appropriate decomposition of the inverse metric tensor, see References 3,18. The anisotropy of the Riemannian metric is not bounded a priori, but strong anisotropy leads to large stencils: specifically, the offset norms grow with the anisotropy ratio μ(M) ∶= √(‖M‖ ‖M⁻¹‖), where M denotes the Riemannian metric tensor, see Reference 18 (Proposition 1.1). Excessively large stencils in turn lead to longer execution times due to cache misses, slower convergence of the iterative method, and less precise boundary conditions.
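The consistency of the framework (8) with the isotropic case can be checked on a linear function, for which upwind finite differences are exact. The sketch below (our own naming, not the paper's code) evaluates one branch of the scheme with symmetric offsets only, and recovers ‖∇u‖² exactly for u(x) = ⟨g, x⟩:

```python
import numpy as np

def scheme_residual(u_p, neighbor_vals, weights, h):
    """One branch (fixed k, I = 0) of the upwind scheme (8): a discrete ||grad u||^2.

    `neighbor_vals[j]` holds the pair (u(p + h f_j), u(p - h f_j)).
    Illustrative sketch only.
    """
    total = 0.0
    for (u_plus, u_minus), w in zip(neighbor_vals, weights):
        d = max(u_p - min(u_plus, u_minus), 0.0)   # upwind finite difference
        total += w * d * d
    return total / h**2

# Consistency check on u(x) = <g, x>: unit weights and canonical offsets
# (the isotropic special case of (8)) should recover ||g||^2 exactly.
g = np.array([0.3, -1.2])
h, p = 0.1, np.array([0.5, 0.7])
u = lambda x: g @ x
nbrs = [(u(p + h * e), u(p - h * e)) for e in np.eye(2)]
res = scheme_residual(u(p), nbrs, [1.0, 1.0], h)
```

Here the upwind difference along each axis picks the smaller of the two neighbor values, so each term contributes h²gⱼ², and the residual equals ‖g‖² up to rounding error.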
In the curvature penalized case, the weights and offsets in (8) depend on a relaxation parameter ε > 0. Convergence is established in the limit where ε → 0 and h/ε → 0, see Reference 2, yet in practice the fixed choice ε = 0.1 appears suitable for the considered applications.
A fundamental property of discretization schemes of the form (8) is that they can be solved in a single pass over the domain, using a generalization of the fast-marching algorithm,2,3,18 thanks to a property known as causality. This is highly desirable when implementing a CPU solver, but of little benefit for a GPU eikonal solver, whose massive parallelism forbids taking advantage of this property. See Section A for more discussion of the properties of these schemes and of their relevance to GPU implementations. Nevertheless, those schemes are robust and well tested. Alternative approaches offering different compromises and possibly more suited to GPUs will be considered in future works.

IMPLEMENTATION
We describe the implementation of our massively parallel solver of generalized eikonal PDEs, assumed to be discretized in the form (8). The bulk of the method is split in three procedures, Algorithms 1, 2, and 3, discussed in detail in the corresponding sections.
For simplicity, Algorithms 2 and 3 are written in the special case where the meta parameters of the discretization (8) are J = 0 and K = 1, whereas I is arbitrary. The case of arbitrary J and K is discussed in Section 3.3. The assignment of a value val to a scalar (resp. array) variable var is denoted var ← val (resp. var ⇐ val).

Algorithm 1. Parallel iterative solver (PYTHON)
Variables: active, next ∶ B_h → {0, 1} (blocks marked for current and next update)
Set the seed point value, and mark its block for update
While an active block remains:
    For all active blocks b in parallel: BlockUpdate(b)

The BlockUpdate procedure (Algorithms 2 and 3) loads or computes the stencil weights, loads the neighbor values, and then, for r from 1 to R, applies the local update of Algorithm 3, which sorts the indices so that u_{i₁} ≤ · · · ≤ u_{i_I} before looping over the candidate intervals, for r from 1 to I.

Parallel iterative solver
Massively parallel architectures divide computational tasks into threads which, in the case of graphics processing units, are grouped into blocks (indexed by b) following a common sequence of instructions, and able to take advantage of shared data, see Remark 4. Following References 9,10,12, the main loop of our iterative eikonal equation solver is designed to take advantage of this computational architecture, see Algorithm 1. It is written in the Python programming language, which is also used for the pre- and post-processing tasks, and launches Algorithm 2 as a CUDA kernel via the cupy† library.
The discretization domain X_h, which is a three dimensional Cartesian grid (7), is split into rectangular tiles X^b_h, indexed by b ∈ B_h, see Figure 3, left. The update of a tile X^b_h is handled by a block of threads, and the tile should therefore contain no less than 32 points in view of Remark 4. The best shape of the tiles X^b_h was found to be 4 × 4 × 4 for the Reeds-Shepp models (forward and reversible), and 4 × 4 × 2 for the Euler-Mumford and Dubins models, see Section 3.2 and Table B1. One of the findings of our work is indeed that the schemes featuring wider stencils work best with smaller tile sizes, see also the discussion in the second paragraph of Section 3.2. Some padding is introduced if the dimensions of the tiles X^b_h do not divide those of the grid X_h.
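The padding and tiling arithmetic can be sketched as follows (a minimal NumPy-free illustration with our own naming; the real solver performs this on device arrays):

```python
def split_into_tiles(shape, tile):
    """Pad a grid shape to a multiple of the tile shape, and count the tiles.

    Mirrors the domain decomposition of Section 3.1: the grid X_h is split into
    rectangular tiles X_h^b, padded if the tile dimensions do not divide the
    grid dimensions. Illustrative sketch only.
    """
    padded = tuple(-(-s // t) * t for s, t in zip(shape, tile))  # ceil division
    counts = tuple(p // t for p, t in zip(padded, tile))
    return padded, counts

# The empty-domain experiment grid (Section 4.1), with Reeds-Shepp tiles:
padded, counts = split_into_tiles((300, 300, 96), (4, 4, 4))
```

For the 300 × 300 × 96 grid and 4 × 4 × 4 tiles no padding is needed; for the 705 × 447 × 60 museum grid of Section 4.2, padding to 708 × 448 × 60 is introduced.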
A boolean table active ∶ B_h → {0, 1} records all tiles tagged for update. Denote by N_h ∶= #(B_h) the total number of tiles, and by N_act ∶= #{b ∈ B_h; active(b) = 1} the number of active tiles in a typical iteration of Algorithm 1. Since we are implementing a front propagation in a three-dimensional domain, one generally has N_act ≪ N_h, the active tiles concentrating along a two-dimensional front. In each iteration of Algorithm 1, the active table is checked for emptiness, in which case the program terminates. More importantly, the indices of all non-zero entries of the active table are extracted, so as to update only the relevant blocks. The complexity 𝒪(N_h ln N_h) of this operation is in practice negligible w.r.t. the cost 𝒪(N_act N_b R K (I + J)) of the block updates themselves, where N_b is the number of points per tile, R is the number of inner loops in Algorithm 2, and I, J, K are the scheme parameters (8). A second boolean table next ∶ B_h → {0, 1} is used to mark the blocks which are to be updated in the subsequent iteration.
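The extraction of the active block indices can be sketched as follows (a NumPy stand-in for the device-side stream compaction, with our own naming; the solver performs this with cupy):

```python
import numpy as np

def extract_active(active):
    """Indices of the blocks tagged for update, as in the main loop of
    Algorithm 1: only these blocks are launched in the next kernel call."""
    return np.flatnonzero(active)

active = np.zeros(8, dtype=bool)
active[[2, 5]] = True
idx = extract_active(active)   # only blocks 2 and 5 are updated this round
```

When the returned index list is empty, the front has stopped propagating and the program terminates.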
A single array u ∶ X_h → [0, ∞[ holds the solution values. Indeed, the block update operator benefits from a monotonicity property, see Section A, which guarantees that the successive values (u_n)_{n≥0} of the numerical solution decrease along the iterations of Algorithm 1, toward a limit u_∞. As a result, load/store data races on u between the threads are innocuous.

Remark 4 (SIMT architecture). A block of threads is handled under the hood by a GPU device in a Single Instruction Multiple Threads (SIMT) manner: the same instructions are applied simultaneously to 32 threads of a same block (also called a warp). For this reason, the number of threads within a block should preferably be a multiple of the width of a warp. For the same reason, thread divergence (threads within a warp following different execution paths, due to conditional branching statements, which is implemented sequentially by "muting" the threads of the inactive branch) should be avoided for best efficiency.
Our numerical solver reflects these properties through the choice of the tile size X^b_h, and through the choice of an Eulerian discretization scheme on a Cartesian grid (8), whose solution by Algorithm 3 involves little branching and yields an even load between threads, as opposed to an unstructured mesh, where these properties are by design harder to guarantee.11

Block update
The BlockUpdate procedure, presented in Algorithm 2, is the most complex part of our numerical method. It is executed in parallel by a block of threads, each handling a given point p ∈ X^b_h of a tile of the computational grid, where the tile index b ∈ B_h is fixed.
A local array u^b, shared between the threads of the block, is initialized with the values of the unknown u ∶ X_h → [0, ∞] at the same positions. Throughout the execution of the BlockUpdate procedure, the values of u^b are updated several times, and then finally exported back into the main array u. If the number R of updates of u^b is sufficiently large, then this procedure amounts to solving a local eikonal equation on the tile X^b_h, with the values of u outside the tile treated as boundary conditions. A similar approach is used in References 9,10,12. We empirically observe that schemes using a wide stencil work best with a small number R of iterations, and a small tile size, see Table B1. Our interpretation is the following: using several iterations is meant to propagate the front through the tile X^b_h and to stabilize the local solution u^b within the tile,10 but this objective loses relevance when the stencil is so wide that the scheme update at a point x ∈ X^b_h involves fewer values of the local array u^b in X^b_h than of the global array u in X_h ⧵ X^b_h, which is cached and frozen throughout the iterations in Algorithm 2. In addition, each of the R iterations has a higher cost when the stencil is wide and has numerous elements, see the description of Algorithm 3 in Section 3.3.
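The interplay between the R inner iterations and the frozen outside values can be illustrated on a one-dimensional toy analogue of BlockUpdate (entirely our own construction, far simpler than the 3D CUDA procedure):

```python
def block_update_1d(u, lo, hi, h, rho, R):
    """Toy 1D analogue of the BlockUpdate procedure (Algorithm 2).

    Performs R Gauss-Seidel-like sweeps of |u'| = rho on the tile u[lo:hi],
    with the values outside the tile frozen as boundary conditions. The min
    with the current value reflects the monotone (decreasing) updates.
    """
    INF = float("inf")
    for _ in range(R):
        for i in range(lo, hi):
            left = u[i - 1] if i > 0 else INF
            right = u[i + 1] if i + 1 < len(u) else INF
            u[i] = min(u[i], min(left, right) + h * rho)
    return u

INF = float("inf")
u = [0.0] + [INF] * 7              # seed at index 0
block_update_1d(u, 1, 4, 1.0, 1.0, R=3)   # update the tile {1, 2, 3}
```

After the call, the front has propagated through the whole tile (values 1, 2, 3), while the points beyond the tile remain at +∞ until their own block is updated.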
The finite difference scheme (8) used for curvature penalized fast marching is built using nontrivial tools from lattice geometry,2 whose numerical cost cannot be ignored. Empirical tests show that precomputing the weights and offsets usually reduces overall computation time by 30%-50%; however, the corresponding memory cost is prohibitive unless this data can be shared between grid points, taking advantage of symmetries of the PDE, see Section 3.2. Last but not least, the block b and its immediate neighbors b′ need to be tagged for update in the next iteration of the eikonal solver Algorithm 1, if appropriate, via the boolean array next ∶ B_h → {0, 1}. This step is not fully described in Algorithm 2, and in particular the neighbors of a tile and the appropriate condition for marking them are not specified. Indeed, a variety of strategies can be plugged in here, and our numerical solver is not tied to any of them. Good results were obtained using Adaptive Gauss-Seidel Iteration12,19 and with the Fast Iterative Method,10 while other variants9 were not tested.

Remark 5 (Walls and thin obstacles). Our finite differences scheme involves rather wide stencils, see Figure 2, raising the following issue: the update of a point p may involve neighbor values u(p + h e_i) across a thin obstacle. In order to avoid propagating the front through the obstacles, if any are present, an additional walls array is introduced in Algorithm 2, as well as an intersection test between the segment [p, p + h e_i] and the obstacles.
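The intersection test of Remark 5 can be sketched as follows (our own simplified 2D illustration, not the paper's implementation): sample the segment densely against a boolean walls array, which suffices for one-pixel-thick walls.

```python
def segment_blocked(walls, p, q):
    """Test whether the segment [p, q] crosses a wall pixel, in the spirit of
    Remark 5: wide-stencil updates must not look across thin obstacles.

    `walls` is a 2D boolean array (list of rows); p, q are integer pixel
    coordinates. Illustrative sketch only.
    """
    (i0, j0), (i1, j1) = p, q
    n = 2 * max(abs(i1 - i0), abs(j1 - j0)) + 1   # dense enough sampling
    for s in range(n + 1):
        t = s / n
        i = round(i0 + t * (i1 - i0))
        j = round(j0 + t * (j1 - j0))
        if walls[i][j]:
            return True
    return False

walls = [[False] * 8 for _ in range(8)]
for i in range(8):
    walls[i][4] = True        # a one-pixel-thick vertical wall at column 4
walls[0][4] = False           # with a door at row 0
```

A stencil offset reaching across the wall is rejected, while the same offset through the door is accepted; in the actual solver most such tests are bypassed when no wall lies near the stencil.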

Local update
This section is devoted to the local update operator presented in Algorithm 3. From the mathematical standpoint, one defines Λu(p) as the solution to the equation Fu(p) = ρ(p) w.r.t. the variable u(p), regarding all neighbor values as constants, see Reference 18 (Appendix A). This process of solving a discretized PDE at a single point p, w.r.t. the corresponding unknown u(p), is often referred to as a Gauss-Seidel update. We prove in this subsection that Algorithm 3 does compute the correct update value Λu(p), and comment on its numerical complexity and efficient implementation.
These results generalize and abstract the update step of the standard fast marching method for isotropic eikonal equations, 20 whose discretization is a special case of (8) as mentioned in Section 2.3.
For simplicity, and consistently with the presentation of Algorithm 3, we first assume a numerical scheme of the following form: denoting a₊ ∶= max{0, a},

f(λ) ∶= ∑_{i=1}^I α_i (λ − u_i)₊² − h² ρ(p)²,   (10)

in other words J = 0 and K = 1 in (8), where α_i > 0 and u_i ∶= u(p + h e_i) for all 1 ≤ i ≤ I. The Gauss-Seidel update Λu(p) is then the unique root λ_* of f.

Lemma 1. The function f defined by (10) is continuous, nondecreasing on R, increasing on [min_i u_i, ∞[, and admits a unique root λ_*.

Proof. Assume w.l.o.g. that u_1 ≤ · · · ≤ u_I. Observing that f(λ) = −h²ρ(p)² < 0 for all λ ∈ ]−∞, u_1], and that f(λ) → +∞ as λ → +∞, we see that f admits a root, which is unique since f is increasing on [u_1, ∞[. ∎

Proposition 1. Sort the values, so that u_{i_1} ≤ · · · ≤ u_{i_I}, and denote J_r ∶= [u_{i_r}, u_{i_{r+1}}] for 1 ≤ r < I, with the convention J_I ∶= [u_{i_I}, ∞[. Then f admits the piecewise quadratic expression

f(λ) = P_r(λ) ∶= a_r λ² − 2 b_r λ + c_r,  for all λ ∈ J_r,   (11)

where a_r ∶= ∑_{s=1}^r α_{i_s}, b_r ∶= ∑_{s=1}^r α_{i_s} u_{i_s}, and c_r ∶= ∑_{s=1}^r α_{i_s} u_{i_s}² − h² ρ(p)². Denoting r_* ∶= max{r; 1 ≤ r ≤ I, f(u_{i_r}) < 0}, one has:
• b_r² − a_r c_r > 0 for all 1 ≤ r ≤ r_*, so that λ_r ∶= (b_r + √(b_r² − a_r c_r))/a_r is well defined.
• λ_{r_*} ∈ J_{r_*} is the unique root of f, whereas λ_r > u_{i_{r+1}} for all r < r_*.

Proof. The proof is illustrated on Figure 4, left. Assume w.l.o.g. that u_1 ≤ · · · ≤ u_I, so that i_r = r for all 1 ≤ r ≤ I. The piecewise quadratic expression (11) on J_r ∶= [u_r, u_{r+1}] is obtained by inserting (λ − u_i)₊ = λ − u_i if i ≤ r, and (λ − u_i)₊ = 0 if i > r, in the defining expression (10).

Recalling that f is increasing on [u_1, ∞[, we obtain that f(u_1) ≤ · · · ≤ f(u_{r_*}) < 0. The polynomial P_r thus takes a negative value P_r(u_r) = f(u_r) < 0, for any r ≤ r_*, and has a positive dominant coefficient a_r > 0. It follows that P_r admits two distinct real roots, for any r ≤ r_*, the largest being λ_r; in particular b_r² − a_r c_r > 0.

When r < r_* one has P_r(u_{r+1}) = f(u_{r+1}) < 0, and thus λ_r > u_{r+1} as announced. On the other hand P_{r_*}(u_{r_*}) = f(u_{r_*}) < 0 and P_{r_*}(u_{r_*+1}) = f(u_{r_*+1}) ≥ 0 (or P_{r_*}(λ) → ∞ as λ → ∞ in the case r_* = I), so that P_{r_*} admits a root in the interval J_{r_*} by the intermediate value theorem. Since P_{r_*} = f is increasing on the interval J_{r_*}, this is the largest root λ_{r_*} of P_{r_*}. We have found a root λ_{r_*} ∈ J_{r_*} of f, and it is unique by Lemma 1, which concludes the proof. ∎

Proposition 2 below shows how to characterize and compute the Gauss-Seidel update λ_* = Λu(p) when the numerical scheme has the general form (8), with arbitrary parameters I, J, K. It is obtained as the unique root of f ∶ R → R defined by

f(λ) ∶= max_{1≤k≤K} f_k(λ),  where f_k(λ) ∶= ∑_{i=1}^I α_{ik} (λ − u_{ik})₊² + ∑_{j=1}^J β_{jk} (λ − u′_{jk})₊² − h² ρ(p)²,   (12)

F I G U R E 4 Left: illustration of Proposition 1, with I = 3 and u_1 < u_2 < u_3. The function f is piecewise quadratic (11). In this case, the unique root of f is also the largest root of P_2, which belongs to the interval [u_2, u_3]. Right: illustration of Proposition 2, with K = 3. The function f is the maximum (a.k.a. the upper envelope) of the nondecreasing functions f_1, f_2, f_3, whose unique roots are λ^(1), λ^(2), λ^(3), respectively. The unique root of f is thus λ_* = min{λ^(k); 1 ≤ k ≤ 3} (here λ_* = λ^(2)).
and where u_{ik} ∶= u(p + h e_{ik}) and u′_{jk} ∶= min{u(p − h f_{jk}), u(p + h f_{jk})}. Note that the weights α_{ik} and β_{jk} are nonnegative by design for all i, j, k, yet for simplicity we assume that they are positive, up to dropping the corresponding terms in the sums.

Proposition 2. (i) Each function f_k defined by (12), where 1 ≤ k ≤ K, admits a unique root, denoted λ^(k), which can be characterized and computed by applying Proposition 1. (ii) The function f defined by (12) admits a unique root, which is λ_* = min{λ^(k); 1 ≤ k ≤ K}.
Proof. The function f_k has a structure similar to (10), up to gathering the two sums featuring I and J terms, respectively, into a single sum with I + J terms, which establishes the first point.

The proof of the second point is illustrated on Figure 4, right. Since f_k is nondecreasing and admits the unique root λ^(k), one has f_k(λ) < 0 for all λ < λ^(k), and f_k(λ) > 0 for all λ > λ^(k). Denoting λ_* ∶= min{λ^(k); 1 ≤ k ≤ K}, it follows that f(λ) = max_k f_k(λ) < 0 for all λ < λ_*, that f(λ_*) = 0 since the maximum is attained by a function vanishing at λ_*, and that f(λ) > 0 for all λ > λ_*. Thus λ_* is the unique root of f, as announced. ∎

NUMERICAL EXPERIMENTS
We illustrate our numerical solver of curvature penalized shortest paths in a variety of contexts ranging from motion planning with obstacles or drift, to image segmentation, and to the configuration of radar systems. Some of the test cases are new, whereas others are closely related to previous works2,4-8 and are meant to illustrate the benefits of the GPU solver over an earlier CPU implementation in common use cases. Test data is synthetic except for the medical image segmentation problem of Section 4.3.
We report in Table B2 the running times of the GPU eikonal solver presented in this paper, and of the CPU solver introduced in Reference 2, as well as the GPU/CPU speedup, which varies significantly depending on the experiment. Indeed, the running time of the GPU eikonal solver, which is an iterative method, depends on the presence and layout of obstacles or slow regions in the test case, as noted in Reference 9. This is in contrast with the fast-marching-like method2 implemented on the CPU, which is guaranteed to update each discretization point at most N_neigh = K(I + 2J) times, where I, J, K are the scheme parameters (8) (for this reason, slightly abusively, fast-marching is referred to as a single pass method), and whose complexity 𝒪(N_neigh N ln N) is independent of the specific test case, where N is the total number of discretization points. The logarithmic factor in the fast-marching complexity comes from the overhead of managing a priority queue of the trial discretization points, sorted by the solution values, which is implemented using a binary heap. This complexity can be slightly improved to 𝒪(N_neigh N + N ln N) using the more advanced Fibonacci heap data structure,22 but with little practical benefit in our experience.23 Indeed, the sequential nature of the fast marching method is the main obstacle to improving its computation time.
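For comparison purposes, the sequential fast marching method with a binary heap can be sketched for the standard isotropic eikonal equation on a 2D grid (a minimal illustration of the priority-queue mechanism, not the curvature-penalized CPU solver of Reference 2):

```python
import heapq, math

def fast_marching_2d(n, h, seed):
    """First-order fast marching for ||grad u|| = 1 on an n x n grid, grid scale h.

    Trial points are kept in a binary heap sorted by value, and each point is
    frozen exactly once, in increasing order of u. Illustrative sketch only.
    """
    INF = float("inf")
    u = [[INF] * n for _ in range(n)]
    frozen = [[False] * n for _ in range(n)]
    u[seed[0]][seed[1]] = 0.0
    heap = [(0.0, seed)]
    while heap:
        val, (i, j) = heapq.heappop(heap)
        if frozen[i][j]:
            continue                      # stale heap entry, skip
        frozen[i][j] = True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            x, y = i + di, j + dj
            if 0 <= x < n and 0 <= y < n and not frozen[x][y]:
                a = min(u[x - 1][y] if x > 0 else INF, u[x + 1][y] if x + 1 < n else INF)
                b = min(u[x][y - 1] if y > 0 else INF, u[x][y + 1] if y + 1 < n else INF)
                if a > b:
                    a, b = b, a
                # solve the upwind quadratic: one- or two-sided update
                new = a + h if b - a >= h else 0.5 * (a + b + math.sqrt(2 * h * h - (a - b) ** 2))
                if new < u[x][y]:
                    u[x][y] = new
                    heapq.heappush(heap, (new, (x, y)))
    return u

u = fast_marching_2d(41, 1.0 / 40, (0, 0))
```

The solution is exact along the grid axes, and approximates the Euclidean distance elsewhere up to the first-order discretization error, which is largest near the point-source singularity.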
The numerical experiments presented in the following sections are designed to illustrate the following features of the eikonal solver introduced in this paper:

1. Geodesics in an empty domain. Illustrates the qualitative properties of the different path models, and the GPU/CPU speedup in the ideal case.
2. Fastest exit from a building. Illustrates the implementation of walls and thin obstacles, which is nontrivial with wide stencils as described in Remark 5.
3. Retinal vessel segmentation. Illustrates a realistic application to image processing, based on the choice of a carefully designed cost function.
4. Boat routing. Illustrates a curvature penalty whose strength and asymmetry properties vary over the PDE domain, as described in Remark 2.
5. Radar configuration. Illustrates the automatic differentiation of the eikonal PDE solution u w.r.t. the cost function ρ, for the optimization of a complex objective.
Remark 6 (Computation time and hardware characteristics). Program runtime is dependent on the hardware characteristics of each machine.
The reported CPU and GPU times were obtained on the Blade® Shadow cloud computing service, using the provided Nvidia® GTX 1080 graphics card for the GPU eikonal solver, and an Intel® Xeon E5-2678 v3 for the CPU eikonal solver (a single thread was used, with turbo frequency 3.1 GHz).

Geodesics in an empty domain
We compute minimal geodesics for the Reeds-Shepp, Reeds-Shepp forward, Euler-Mumford elastica and Dubins models, in the domain [−1, 1]² × S¹ without obstacles. The front is propagated from the seed point (x_*, θ_*) placed at the origin x_* = (0, 0), imposing a horizontal initial tangent θ_* = 0.
Geodesics are backtracked from several tips (x^*, θ^*), where x^* is placed at 16 regularly spaced points in the domain, whereas θ^* is chosen randomly (but consistently across all models).
This experiment is meant to illustrate the qualitative differences between minimal geodesic paths associated with the four curvature penalized path models, see Figure 1. The Reeds-Shepp car can move both forward and backward, and reverse gear along its path, as evidenced by the cusps along several trajectories. The Reeds-Shepp forward variant cannot move backward, but has the ability to rotate in place (with a cost), and such behavior can often be observed at the endpoints of the trajectories. 6 The Elastica model produces pleasing smooth curves, which have a physical interpretation as the rest positions of elastic bars. Trajectories of the Dubins model have a bounded radius of curvature, and can be shown to be concatenations of straight segments and of arcs of circles, provided the cost function is constant as here.
The generalized eikonal PDE (5) or (6) is discretized on a 300 × 300 × 96 Cartesian grid, following (8), thus producing a coupled system of equations featuring 8.6 million unknowns ‡ . Computation time for the GPU eikonal solver ranges from 0.28 s (Reeds-Shepp forward) to 1.54 s (Euler-Mumford elastica), reflecting the complexity of the discretization stencil, see Figure 2. A substantial speedup ranging from 60× to 120× is obtained over the CPU implementation; let us nevertheless acknowledge that, as noticed in Reference 9, the absence of obstacles and of a position dependent speed function is usually the best case scenario for an iterative eikonal solver such as our GPU implementation.

Fastest exit from a building
We compute minimal paths within a museum map, for the four curvature penalized models under consideration in this paper, as illustrated on Figure 5. Due to the use of rather wide stencils, often 7 pixels long, see Figure 2, some intersection tests are needed to avoid propagating the front through the walls, which are only one pixel thick. A careful implementation, as described in Remark 5, makes it possible to bypass most of these intersection tests and limits their impact on computation time. In contrast with Reference 10, we do not consider "slightly permeable walls," since they would not be correctly handled with our wide stencils, and since, as far as we know, they have little relevance in applications. A closely related experiment is presented in Reference 6 for the Reeds-Shepp models, using a CPU eikonal solver.
The front propagation starts from two seed points located at the exit doors, and a tip is placed in each room for geodesic backtracking, with an arbitrary orientation. The extracted paths are smooth (Euler-Mumford case) or have a bounded curvature radius (Dubins case), but they minimize a functional (1) which is unrelated to safety, and thus may not be directly suitable for motion planning. Indeed, in many places they are tangent to the obstacles, walls, and doorposts, without any visibility behind, which is a hazardous way to move.
The PDE is discretized on a Cartesian grid of size 705 × 447 × 60, where the first two factors are the museum map dimensions, and the third factor is the number of angular orientations, for a total of 19 million unknowns. Computation time on the GPU ranges from 0.59 s (Reeds-Shepp forward) to 3.2 s (Euler-Mumford elastica), a reduction by approximately 50× over the CPU eikonal solver.

Tubular structure segmentation
A popular approach for segmenting tubular structures in medical images, such as blood vessels on the retinal background as illustrated on Figure 6, is to devise a geometric model whose minimal paths (between suitable endpoints) are the centerlines of the desired structures.
For that purpose a key ingredient, not discussed here, is the careful design of a cost function ρ ∶ R² × S¹ → ]0, ∞], which is small along the vessels of interest in their tangent direction, and large elsewhere.24 Curvature penalization, and in particular the Reeds-Shepp forward and Euler-Mumford elastica models,4-6 helps avoid a classical artifact where the minimal paths do not follow a single vessel but jump from one to another at crossings.
The test cases have size 512 × 512 × 60, 387 × 449 × 60, and 398 × 598 × 60, respectively, and the computation time of the GPU eikonal solver ranges from 1 s (Reeds-Shepp forward) to 3 s (Euler-Mumford elastica). This is compatible with user interaction, in contrast with the CPU run time, which is 30× to 80× longer, see Table B2. Note that by construction, the front propagation is fast along the blood vessels, and slower in the background.

Boat routing with a trailer
The Dubins-Zermelo-Markov model 26 […] which is usually a scarce resource, hence we regard it as the default despite the longer runtime, see the discussion in Section 3.2; it is nevertheless 59× faster than the CPU implementation.

Optimization of a radar configuration
We consider the optimization of a radar system, so as to maximize the probability of detection of an intruder vehicle. The intruder has full knowledge of the radar configuration and does its best to avoid detection, but is subject to maneuverability constraints, like a fast plane. Following References 7,8 the intruder is modeled as a Dubins vehicle, traveling at unit speed with a turning radius of 0.2, whose trajectory starts and ends at a given point x_* ∈ Ω and must visit a target keypoint x^* ∈ Ω in between. § The problem takes the generic form […] where Ξ is the set of radar configurations, and Γ is the set of admissible trajectories. A trajectory γ escapes detection from a radar configured as ξ with probability exp(−𝒞(γ; ξ)). Following (1), a trajectory is represented as a pair γ = (x, θ) : [0, L] → Ω × S¹, and its cost is defined as […] where 𝒞_D denotes the Dubins cost (2, right), and ρ(x, θ; ξ) is an instantaneous probability of detection depending on the radar configuration ξ, and on the intruder position x and orientation θ. We refer to Reference 8 for a discussion of the detection probability model, and settle for a synthetic and simplified, yet already nontrivial, construction. The detection probability is the sum of three terms ρ(x, θ; ξ) = ∑_{i=1}^3 ρ̃(x, θ; y_i, r_i, v_i), corresponding to as many radars, each of the form illustrated in Figure 8, top left.
FIGURE 8 (Top left) Instantaneous detection probability of a vehicle by a radar, depending on the radial distance r and radial velocity v of the vehicle, in addition to the radar parameters, see (14). Note the blind distance and blind velocity periods. (Top right) Trajectory minimizing the probability of detection between two points in the presence of a single radar. It approximates a concatenation of circles, at a multiple of the blind radial distance, and of spirals, corresponding to a multiple of the blind radial velocity.
We also optimize the blind distances r_1, r_2, r_3 ∈ ]0, ∞[, and we define the blind velocities as v_i = 0.2∕r_i for 1 ≤ i ≤ 3, reflecting the fact that the product r_i v_i is constrained by the physics of the radar system. Summarizing, the collection ξ of radar configuration parameters, and their domain Ξ, are defined for this experiment as: […] The trajectory of the intruder has a curvature radius bounded below by 0.2 (Dubins model), stays within the rectangular domain [−1, 1] × [0, 1], starts and ends at (−0.9, 0.5), and visits the target point (0.8, 0.5), which completes the description of the test case.
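The qualitative features of the figure (detection vanishing at multiples of the blind radial distance and blind radial velocity, and decaying with range) can be mimicked by a synthetic model such as the following. This is purely our illustrative guess, not formula (14): the functional form, the `strength` parameter, and the range-decay factor are assumptions.

```python
import numpy as np

def detection(x, theta, y, r, v, strength=1.0):
    """Illustrative instantaneous detection probability (our guess, not
    formula (14)): a unit-speed vehicle at position x with heading theta,
    observed by a radar at y with blind distance r and blind velocity v.
    Detection dips to zero when the radial distance (resp. the radial
    velocity) is a multiple of r (resp. of v), and decays with range."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    d = float(np.linalg.norm(x - y))
    u = (x - y) / max(d, 1e-12)  # unit radial direction
    v_rad = abs(np.cos(theta) * u[0] + np.sin(theta) * u[1])  # radial speed
    dip_d = np.sin(np.pi * d / r) ** 2
    dip_v = np.sin(np.pi * v_rad / v) ** 2
    return strength * dip_d * dip_v / (1.0 + d**2)
```

Trajectories hugging a multiple of the blind distance, or keeping their radial velocity near a multiple of the blind velocity, are then nearly invisible, consistent with the circles and spirals visible in the figure.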
Minimization over the trajectory γ ∈ Γ in (13) is solved numerically using the eikonal solver presented in this paper, thus defining a function 𝒟(ξ) := inf{𝒞(γ; ξ); γ ∈ Γ} depending on the radar configuration ξ ∈ Ξ alone. We differentiate 𝒟(ξ) in an automatic manner, as described in Reference 7, and optimize this quantity by gradient ascent, with a projection of the updated parameters onto the admissible domain Ξ at each gradient step. Using these tools, a local maximum of 𝒟(ξ) is reached in approximately a dozen iterations, see Figure 8, bottom right. Computation time is dominated by the cost of solving a generalized eikonal equation at each iteration, which takes 0.26 s on the GPU and 9.6 s on the CPU (Dubins model on a 200 × 100 × 96 grid). Since the optimization landscape is highly nonconvex, obtaining the global maximum w.r.t. ξ would require a nonlocal optimization method, in complement or replacement of local gradient ascent, thus requiring many more iterations and benefitting even more from GPU acceleration.
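The optimization loop described above amounts to projected gradient ascent; a minimal sketch follows, where the gradient of the value 𝒟 is an abstract callable (in the paper it comes from the eikonal solver and automatic differentiation), and clipping to a box merely stands in for the projection onto the true admissible set Ξ.

```python
import numpy as np

def gradient_ascent(grad_D, xi0, lo, hi, step=0.05, n_iter=12):
    """Projected gradient ascent on the intruder's optimal cost D(xi):
    take a gradient step, then project the updated parameters back onto
    the admissible domain. `grad_D` is an abstract callable; the box
    [lo, hi] is a stand-in for the true admissible set Xi."""
    xi = np.asarray(xi0, dtype=float)
    for _ in range(n_iter):
        xi = np.clip(xi + step * grad_D(xi), lo, hi)
    return xi
```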

CONCLUSION AND PERSPECTIVES
Geodesics and minimal paths are ubiquitous in mathematics, and their efficient numerical computation has countless applications. In this paper, we present a numerical method for computing paths which globally minimize a variety of energies featuring their curvature, by solving a generalized anisotropic eikonal PDE, and which takes advantage of the massive parallelism offered by GPU hardware for computational efficiency. In comparison with previous CPU implementations, a 30× to 120× speed-up in computation time is achieved, which enables convenient user interaction in the context of image processing and segmentation, and reasonable run times for applications, such as radar configuration, which solve these problems within an inner optimization loop.
Future work will be devoted to additional applications, to efficient implementations of wide stencil schemes associated with other classes of Hamilton-Jacobi-Bellman PDEs, and to the study of numerical schemes based on different compromises, in favor of, for example, allowing grid refinement or using shorter finite difference offsets.

DATA AVAILABILITY STATEMENT
An open-source implementation of the library is available at: https://github.com/Mirebeau/AdaptiveGridDiscretizations. Other data are available on request from the authors.

APPENDIX A. MONOTONY AND CAUSALITY
We briefly present in this appendix the concepts of monotony and causality, which are two key properties for the analysis of discretization schemes of eikonal equations, and we discuss their relevance to a GPU implementation. We refer the interested reader to Reference 3 (section 2.1 and section A) for additional discussion, and for the proofs of the results presented below.
A (finite differences) scheme on a finite set X is a mapping F : ℝ^X → ℝ^X, defined as in (A1). Given such a scheme, one is often able to prove that the univariate nonlinear equation admits a unique root λ ∈ ℝ, thus defining the Gauss-Seidel update operator Λu(p) := λ, see Lemma 1 for the class of schemes considered in this paper. This operator is monotonous, in the sense that Λu ≤ Λv if u ≤ v on X, provided the original scheme F is DDE. It is causal, in the sense that Λu(p) may depend on u(q) only if Λu(p) > u(q), provided the original scheme F is in addition causal in the previously defined sense. See Reference 3, proposition A.4, for a complete statement and proof.
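In the simplest special case, the standard two-dimensional upwind discretization of |∇u| = c on a Cartesian grid, the root of the univariate equation even has a closed form, which makes the update operator Λ explicit. Below is a sketch of this classical update (the paper's curvature-penalized schemes instead involve wide anisotropic stencils and numerical root finding).

```python
import numpy as np

def update(a, b, h=1.0, c=1.0):
    """Closed-form Gauss-Seidel update Lambda u(p) for the standard
    two-dimensional upwind discretization of |grad u| = c, where a and b
    are the upwind neighbor values along the two axes and h is the grid
    scale. A classical special case, shown for illustration only."""
    a, b = min(a, b), max(a, b)
    lam = a + c * h  # one-sided update, using the smallest neighbor only
    if lam <= b:
        return lam
    # two-sided update: solve (lam - a)^2 + (lam - b)^2 = (c*h)^2
    return 0.5 * (a + b + np.sqrt(2 * (c * h) ** 2 - (a - b) ** 2))
```

One checks directly that the returned value is nondecreasing in a and b, which is the monotony property discussed above.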
A solution u of the finite difference scheme Fu = 0 defines a fixed point Λu = u of the corresponding Gauss-Seidel update operator, and conversely.
Alternatively, semi-Lagrangian discretizations of the eikonal equation directly define an operator Λ whose fixed point must be computed, and which is obtained via a variational principle 19 rather than from a finite difference scheme as in (A1). In that case, monotony holds by construction, whereas the causality property can be derived from a geometrical acuteness property of the discretization stencils. 23,[27][28][29][30] The fixed point of a monotonous operator Λ can be obtained using a variety of iterative methods, such as those of References 17,19, or, on the GPU, 9-12.
For concreteness, global iteration is the simplest approach (but usually not the most efficient): let formally u⁰ := +∞, and define u^(n+1) := Λu^n for all n ≥ 0. Then, using monotony and an immediate induction argument, one obtains that u^(n+1) ≤ u^n for all n ≥ 0; under mild additional assumptions, one finds that u^n converges decreasingly to a fixed point of Λ, as desired, see for example Reference 30, proposition D.3. Monotony is of key significance for the GPU implementation, since it ensures stability along the iterations, and also allows the use of a single array for both reading and writing the solution values (data races are indeed innocuous, since the latest computed value is always the smallest and the closest to the fixed point).
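The global iteration can be sketched, again in the simplest isotropic two-dimensional case (|∇u| = 1 with its classical closed-form update), as follows. This Jacobi-style sketch uses two arrays for clarity; the monotony argument above is precisely what allows the actual GPU code to read and write a single array instead.

```python
import numpy as np

def update(a, b, h=1.0):
    """Closed-form update for |grad u| = 1 (classical special case)."""
    a, b = min(a, b), max(a, b)
    lam = a + h
    if lam <= b:
        return lam
    return 0.5 * (a + b + np.sqrt(2 * h**2 - (a - b) ** 2))

def global_iteration(shape, seed, h=1.0, n_iter=200):
    """Monotone global iteration u^{n+1} = Lambda u^n: start from
    u^0 = +inf (0 at the seed) and apply the update operator at every
    grid point; each iterate decreases pointwise, by monotony, and the
    loop stops once the fixed point is reached."""
    u = np.full(shape, np.inf)
    u[seed] = 0.0
    for _ in range(n_iter):
        v = u.copy()
        for i in range(shape[0]):
            for j in range(shape[1]):
                if (i, j) == seed:
                    continue
                a = min(u[i - 1, j] if i > 0 else np.inf,
                        u[i + 1, j] if i + 1 < shape[0] else np.inf)
                b = min(u[i, j - 1] if j > 0 else np.inf,
                        u[i, j + 1] if j + 1 < shape[1] else np.inf)
                v[i, j] = min(u[i, j], update(a, b, h))
        if np.array_equal(v, u):
            break  # fixed point reached
        u = v
    return u
```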
The fixed point of an operator Λ which is both monotonous and causal can be computed in a single pass over the domain using the extremely efficient fast marching method (FMM), see Reference 20 or 3, proposition A.2. The FMM, however, has little parallelization potential (notably, the parallel implementation 27 does not scale well due to the excessive overhead of message passing), hence it is less relevant to GPU eikonal solvers which instead rely on iterative methods as discussed above.
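For comparison with the iterative approach, the single-pass structure of the FMM can be sketched as follows in the same simplified isotropic setting: causality lets each point be frozen exactly once, in increasing order of its value, via a priority queue, a data structure that is intrinsically sequential.

```python
import heapq
import numpy as np

def update(a, b, h=1.0):
    """Closed-form update for |grad u| = 1, as in the global iteration."""
    a, b = min(a, b), max(a, b)
    lam = a + h
    if lam <= b:
        return lam
    return 0.5 * (a + b + np.sqrt(2 * h**2 - (a - b) ** 2))

def fast_marching(shape, seed, h=1.0):
    """Single-pass fast marching sketch: pop the smallest tentative
    value, freeze it, and update its non-frozen grid neighbors."""
    u = np.full(shape, np.inf)
    u[seed] = 0.0
    frozen = np.zeros(shape, dtype=bool)
    heap = [(0.0, seed)]
    while heap:
        _, p = heapq.heappop(heap)
        if frozen[p]:
            continue  # stale queue entry
        frozen[p] = True
        i, j = p
        for q in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if not (0 <= q[0] < shape[0] and 0 <= q[1] < shape[1]) or frozen[q]:
                continue
            a = min(u[q[0] - 1, q[1]] if q[0] > 0 else np.inf,
                    u[q[0] + 1, q[1]] if q[0] + 1 < shape[0] else np.inf)
            b = min(u[q[0], q[1] - 1] if q[1] > 0 else np.inf,
                    u[q[0], q[1] + 1] if q[1] + 1 < shape[1] else np.inf)
            lam = update(a, b, h)
            if lam < u[q]:
                u[q] = lam
                heapq.heappush(heap, (lam, q))
    return u
```

On this monotone and causal scheme, the single pass produces the same fixed point as the global iteration, at a fraction of its sequential cost.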