Optimal control under nonconvexity: A generalized Hamiltonian approach

This article extends the analysis of optimal control based on a generalized Hamiltonian which covers situations of nonconvexity. The approach offers several key advantages. First, by identifying a global solution to a constrained optimization problem, the generalized Hamiltonian approach solves the problem of distinguishing between a global optimum and the (possibly many) nonoptimal points satisfying the Pontryagin principle under nonconvexity. Second, in our generalized approach, the interpretation of the slopes of the separating hypersurface as shadow prices of the states continues to hold. Third, we discuss how the generalized Hamiltonian approach can be used in solving dynamic optimization problems under nonconvexity.

Under nonconvexity, the possibility for the Pontryagin conditions to be satisfied at (possibly many) nonoptimal points makes them problematic in applied optimization. Attempts have been made to reformulate nonconvex optimal control problems as convex optimization problems, [8][9][10] but these approaches do not apply in general. This lack of generality comes from the fact that the standard Lagrange multipliers are the slopes of a separating hyperplane. From the separating hyperplane theorem, 4 a separating hyperplane is guaranteed to exist under convexity. But introducing nonconvexity in optimal control can undermine the existence of a separating hyperplane, indicating a need to move away from the standard Lagrangian approach and its associated Hamiltonian. This motivates the generalized Hamiltonian approach explored in this article.
This article develops the analysis of dynamic optimization under nonconvexity. It considers replacing a separating hyperplane by a separating nonlinear hypersurface. As shown by Gould, 11 this generalization applies under general conditions. In this context, we derive a generalized Hamiltonian which covers situations of nonconvexity when a separating hyperplane does not exist. This formulation offers several key advantages. First, by identifying a global solution to a constrained optimization problem, the generalized Hamiltonian approach solves the problem of distinguishing between a global optimum and the (possibly many) points that may satisfy the Pontryagin principle under nonconvexity. Second, in our generalized approach, the interpretation of the slopes of the separating hypersurface as shadow prices of the states continues to hold. Third, we explore how the generalized Hamiltonian approach can be used in solving dynamic optimization problems under nonconvexity.

OPTIMAL CONTROL
This section presents a dynamic optimization problem and reviews well-known results from the previous literature about its optimal control solution. Consider an optimization problem about the path of m state variables x_t = (x_{1t}, … , x_{mt}) ∈ R^m and n control variables u_t = (u_{1t}, … , u_{nt}) ∈ R^n over a T-period planning horizon with t ∈ {0, 1, … , T}. Throughout, we assume that x_0 and T are fixed. Letting x = (x_1, … , x_{T+1}) and u = (u_0, … , u_T), consider the dynamic optimization problem (1) of choosing u_t ∈ U, t ∈ {0, 1, … , T}, where f_t : R^m × R^n → R and g_t : R^m × R^n → R^m are continuously differentiable functions for each t ∈ {0, … , T}, k : R^m → R is a continuously differentiable function, and U ⊂ R^n is a normed vector space.* For each t ∈ {0, 1, … , T}, f_t(x_t, u_t) is the objective function at time t and g_t(x_t, u_t) describes the change in the state variables through the state equation x_{t+1} − x_t = g_t(x_t, u_t). † And k(x_{T+1}) measures the cost of the state variables at time T + 1. Denote the solution to (1) by (x*, u*) and its optimal value by V, where S = {(x, u) : x_{t+1} − x_t = g_t(x_t, u_t), u_t ∈ U, t ∈ {0, … , T}} is the feasible set. Throughout the article, we assume that S ≠ ∅ and that V is finite.
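In this notation, and as a reconstruction consistent with the definitions above, problem (1) takes the form:

```latex
\min_{x,\,u} \;\; \sum_{t=0}^{T} f_t(x_t, u_t) \; + \; k(x_{T+1})
\quad \text{subject to} \quad
x_{t+1} - x_t = g_t(x_t, u_t), \qquad u_t \in U, \qquad t \in \{0, 1, \ldots, T\},
```

with x_0 and T fixed. The feasible set S collects the pairs (x, u) satisfying these constraints, and V denotes the minimized value.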
A dual formulation to (1) involves the standard Lagrangian L_0(x, u, λ), where λ_t = (λ_{1t}, … , λ_{mt}) ∈ R^m are Lagrange multipliers (or costate variables) corresponding to the state equations x_{t+1} − x_t = g_t(x_t, u_t), t ∈ {0, 1, … , T}, and λ = (λ_0, … , λ_T). 3,4,12 Consider the case where L_0(x, u, λ) has a saddle point (x*, u*, λ*) ∈ R^m × U × R^m satisfying (3) for all (x_{t+1}, u_t, λ_t) ∈ R^m × U × R^m and all t ∈ {0, 1, … , T}. The linkages between a saddle point (x*, u*, λ*) of the Lagrangian L_0 and the optimization problem (1) have been investigated in the previous literature. 4,13 They are summarized next. ‡
Lemma 1. Assume that the Lagrangian in (2) has a saddle point (x*, u*, λ*) in (3). Then, 1. (x*, u*) is a solution to (1) and V = L_0(x*, u*, λ*); 2. (x*, u*, λ*) satisfies the conditions in (4), § where H_{0t}(x_t, u_t, λ_t) is the standard Hamiltonian defined in (4e).
Lemma 1 presents a dual formulation to the optimization problem (1) based on the Lagrangian L_0 in (2) and (3). The condition L_0(x*, u*, λ*) = V states that there is a "zero duality gap" between the primal approach (1) and the dual approach given by (2) and (3). 13 Importantly, note that Lemma 1 holds under general conditions. In particular, it applies without requiring that the set U be convex, or that the functions f_t, g_t or k be convex.
Lemma 1 assumes that the Lagrangian L_0 in (2) has a saddle point. In this context, it states that Equations (4a)–(4d) provide necessary conditions for a solution to problem (1). But these conditions are not sufficient, for two reasons. First, the Lagrange multipliers in (4) are not always well defined; avoiding these occurrences requires introducing some regularity conditions on the feasible set S. 13,15 This motivates imposing the constraint qualification** As1: a rank condition on the constraints of problem (1). The constraint qualification As1 guarantees the existence of Lagrange multipliers that satisfy Equation (4). 13 Second, Equation (4) can identify points that fail to be a solution to problem (1). As is well known, this can arise under nonconvexity. This issue is further discussed below.
While Lemma 1 applies under discrete time, similar results apply under continuous time when the time interval Δt between successive periods becomes small, with Δt → 0. Under continuous time, the optimization problem (1) satisfies the well-known conditions (4′) stated in Corollary 1, 1 where H_0(x_t, u_t, λ_t, t) is the standard Hamiltonian defined in (4e′). Equation (4′) in Corollary 1 is the well-known Pontryagin principle. 1,4 Note that (4c′) allows u*_t to have multiple solutions. Examples of multiple solutions include: (1) a "bang-bang" solution when the Hamiltonian H_0 is linear in u_t and U is closed and bounded, implying that u*_t can switch between lower bounds and upper bounds of U; and (2) the case where the Hamiltonian H_0 is nonconvex in u_t and u*_t occurs in different regions of U. In these cases, Equation (4a′) would apply almost everywhere (a.e.), the derivatives in (4a′) being evaluated at all points away from the switching points.
Corollary 1 also covers the calculus of variations as a special case. Under convexity and As1, Equation (4) gives conditions that are both necessary and sufficient for (x*, u*) to be a solution to problem (1). [2][3][4] † † In other words, under As1 and convexity, Lemma 1 provides a complete characterization of the optimal solution of minimization problems. In this case, numerical methods can be used to solve Equations (4a)–(4d) efficiently, yielding the optimal solution (x*, u*) to the optimization problem (1). 14,18,19 This includes the sweep method, which treats Equations (4a)–(4c) as a two-point boundary problem: solving (4d) forward for x_t starting at x_0, and solving (4a) backward for λ_t starting with λ_{T+1} given in (4b). This is illustrated next in an example.
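The forward/backward structure of the sweep method can be sketched on a simple convex instance. The functions below (f_t(x, u) = (x² + u²)/2, g_t(x, u) = u, k(x) = x²/2) are illustrative choices, not those of Example 1, and the sketch assumes the discrete Hamiltonian H_t = f_t + λ_{t+1} g_t with costate recursion λ_t = λ_{t+1} + ∂H_t/∂x_t and terminal condition λ_{T+1} = k′(x_{T+1}):

```python
# Sketch of the sweep method on an illustrative convex instance of (1):
#   f_t(x, u) = (x^2 + u^2)/2,  g_t(x, u) = u,  k(x) = x^2/2
# (hypothetical functions; Example 1 in the text uses different ones).

def sweep(x0=1.8, T=2, iters=300, step=0.2):
    u = [0.0] * (T + 1)                       # controls u_0, ..., u_T
    for _ in range(iters):
        # forward pass: state equation x_{t+1} = x_t + g_t(x_t, u_t)
        x = [x0]
        for t in range(T + 1):
            x.append(x[t] + u[t])
        # backward pass: costate lam_t = lam_{t+1} + dH_t/dx_t,
        # starting from the terminal condition lam_{T+1} = k'(x_{T+1})
        lam = [0.0] * (T + 2)
        lam[T + 1] = x[T + 1]
        for t in range(T, -1, -1):
            lam[t] = lam[t + 1] + x[t]
        # gradient step on the controls toward dH_t/du_t = u_t + lam_{t+1} = 0
        u = [u[t] - step * (u[t] + lam[t + 1]) for t in range(T + 1)]
    return x, u, lam

x, u, lam = sweep()
```

Upon convergence, the controls satisfy the first-order condition ∂H_t/∂u_t = 0 while x and λ solve the forward and backward equations, mirroring how the sweep method treats (4a)–(4d).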
With f_t(x_t, u_t) and g_t(x_t, u_t) being convex functions mapping R^2 → R, Equation (1) is a convex optimization problem. As noted above, under convexity, finding the solution (x*, u*) to the minimization problem (1) reduces to solving Equation (4). Efficient methods (such as the sweep method) are conveniently available for this task. For example, in the case where x_0 = 1.8 and T = 2, the sweep method applied to (4) provides a convenient way to find the optimal controls u* = (−1.077, −0.545, −0.500), along with V = 2.759 and associated Lagrange multipliers λ* = (2.154, 1.090, 1.000). But the next example shows how introducing nonconvexity in the analysis becomes problematic.

Example 2.
Consider the optimization problem (1) of Example 1, with the exception that g_t now takes a nonconvex form. In the case where x_0 = 1.8 and T = 2, the solution to (1) is the same as in Example 1. But Equation (4) now also identifies a second point which is only a local minimum (and not a global minimum), since the value of its objective function is 51.097 > 2.759 = V. This is problematic since this local solution also satisfies Equation (4), with λ = (11.893, 3.350, 1.000). This is a scenario where, in the presence of nonconvexity, there are multiple solutions to Equation (4). This makes it clear that Equation (4) alone cannot be used to find the solution to problem (1).
These arguments are illustrated in Figure 1. As in Example 1, Figure 1A represents the convex case. Figure 1A shows that the optimal point (point O) is at the tangency of the feasible set and the indifference curve going through point O. Under convexity, the separating hyperplane theorem holds, implying the existence of the hyperplane AOA' separating the feasible set and the indifference curve going through point O. This validates the dual approach to the optimization problem (1) based on the Lagrangian (2) and (3), where the slopes of the separating hyperplane define the Lagrange multipliers. In a way consistent with Example 1, under convexity, Equation (4) simply reflects that, at the optimal point O, the separating hyperplane AOA' is tangent to both the indifference curve and the boundary of the feasible set S.
The failure of Equation (4) is illustrated in Figure 1B, which represents a nonconvex case where the functions f's and g's are both nonconvex. Figure 1B shows that no hyperplane separates the feasible set from the indifference curve going through the optimal point O. In the absence of a separating hyperplane, the saddle-point characterization of the standard Lagrangian approach and its associated Hamiltonian no longer applies. In the next section, we explore a generalized Lagrangian approach that resolves this issue. Under nonconvexity, one way to deal empirically with local solutions to (1) is to search for all local optima that satisfy Equation (4) (e.g., by conducting an extensive numerical search across starting values using the sweep method) and then use the values of the objective function to identify which local optimum is the global optimum. This is a convenient approach in problems where the number of local optima is small (e.g., in Example 2). But this approach is less attractive when the number of optima is large. To see that this can be a serious problem, consider situations where the nonconvexities are associated with each of the n state variables x_t = (x_{1t}, … , x_{nt}) at each time t ∈ {0, … , T}. Under the simple scenario where each nonconvexity gives rise to 2 local solutions, the potential total number of local solutions is 2^{n(T+1)}. For example, having n = 10 and T = 20 would generate 2^{210} ≈ 1.6455 × 10^{63} possible local solutions, a very large number that would make a numerical search for all local solutions empirically impractical. In such situations, there is a need to rely on a different approach that can deal more effectively with local solutions. This is the topic of the next section, where we present a generalized Hamiltonian approach to the characterization of optimal control problems under nonconvexity.
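The search-across-starting-values strategy can be sketched as follows, using a hypothetical one-dimensional nonconvex objective (not one of the paper's examples) with two stationary minima:

```python
# Multi-start search: run a local descent from several starting values and
# keep the best local optimum found.  f below is an illustrative nonconvex
# objective with a local minimum near x = 1 and the global one near x = -1.

def f(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

def fprime(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x, step=0.01, iters=5000):
    # plain gradient descent: converges only to the local minimum of
    # whichever basin the starting value x lies in
    for _ in range(iters):
        x -= step * fprime(x)
    return x

starts = [-2.0, -0.5, 0.5, 2.0]            # grid of starting values
candidates = [descend(x0) for x0 in starts]
x_best = min(candidates, key=f)            # compare objective values
```

With 2^{n(T+1)} candidate basins, the number of starting values needed grows explosively, which is exactly the practical limitation discussed above.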

GENERALIZED HAMILTONIAN
As just discussed, a separating hyperplane may no longer exist under nonconvexity, thus invalidating the standard Lagrangian approach to constrained optimization. But these arguments can be generalized. As first argued by Gould,11 a separating hyperplane can be replaced by a separating hypersurface. 11,20 This is illustrated in Figure 1B showing that, under nonconvexity, there is no separating hyperplane that goes through the optimal point O. Yet there is a separating hypersurface AODD' that separates the boundary of the feasible set from the indifference curve going through point O. This is the key insight obtained by Gould 11 : the saddle-point characterization of a generalized Lagrangian still applies under nonconvexity provided that one allows for a nonlinear separating hypersurface.
To evaluate a generalized dual formulation to (1), consider penalty functions p_t ∈ P, where P is the set of continuous functions p_t : R^m → R satisfying p_t(0) = 0, and let p = (p_0, … , p_T) ∈ P^{T+1}. Note that (5) includes as a special case the augmented Lagrangian approach proposed by Hestenes 12 and Rockafellar. 21 The standard Lagrangian approach in (2) is also obtained as a special case of (5) when p_t ∈ P_0 ⊂ P, where P_0 is the set of linear functions p_t : R^m → R satisfying p_t(0) = 0, that is, where p_t(a) = Σ_{j=1}^m λ_{jt} a_j, λ_t = (λ_{1t}, … , λ_{mt}) ∈ R^m being Lagrange multipliers at time t.
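The nesting of the standard Lagrangian inside the generalized one can be illustrated directly. The sketch below (with hypothetical multipliers and weights) contrasts a linear penalty in P_0 with an augmented-Lagrangian-style quadratic penalty in P; both vanish at 0, as required by p_t(0) = 0:

```python
# Linear penalty in P0: p_t(a) = sum_j lam_jt * a_j (standard multipliers).
def linear_penalty(lam):
    return lambda a: sum(l * ai for l, ai in zip(lam, a))

# Nonlinear penalty in P, in the spirit of the augmented Lagrangian of
# Hestenes and Rockafellar: adds a quadratic term, but still p_t(0) = 0.
def augmented_penalty(lam, rho):
    return lambda a: sum(l * ai + 0.5 * rho * ai * ai
                         for l, ai in zip(lam, a))

p_lin = linear_penalty([2.0])            # illustrative multiplier
p_aug = augmented_penalty([2.0], 10.0)   # illustrative multiplier and weight
```

Both penalties vanish when the state equation holds (a = 0), so they leave the value of the Lagrangian unchanged on the feasible set; they differ only in how they price violations.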
To evaluate the implications of Lemma 2, define the sets B and B_0 such that B − B_0 is the "indifference curve" going through (x*, u*). Note that (x*, u*) being a solution to (1) can be written as (x*, u*) ∈ S ∩ (B − B_0), or equivalently as Equation (8). Equation (8) implies that (x*, u*) is on the boundary of both S and B. Such separation properties have been the subject of much interest (e.g., Giannessi 22 ; Chinaie and Zafarani 23 ). Having (x*, u*) separate S and B provides useful insights into the role of the penalty function p* ∈ P^{T+1}.
To see the linkages between p* and a separating hypersurface, define the set C_X of hypersurfaces separating S and B. In the dual approach portrayed in Lemma 2, p* in the saddle-point problem (6) is associated with a separating hypersurface F and is chosen to satisfy p*(S ∩ B) = 0. Under convexity, the separating hyperplane theorem holds and p*(x) can be taken to be linear. This reduces to the case discussed in Section 2 and illustrated in Figure 1A, where p* defines a separating hyperplane given by the line AOA'. Under nonconvexity, the separating hypersurface may need to be nonlinear. This is illustrated in Figure 1B, where the hyperplane AOA' no longer separates S and B. Figure 1 also indicates that a nonlinear hypersurface always exists which separates S and B. In Figure 1B, such a separating hypersurface is given by the nonlinear line AODD'. For additional numerical examples under nonconvexity, see Gould. 11 In general, p* in (6) is chosen such that the associated separating hypersurface F is in the set C_X while going through S ∩ B. Assuming that problem (1) has a solution, C_X ≠ ∅ and there are many options for choosing p* 22,23 (e.g., Gould 11 ). For example, one could choose p* such that, besides the points in S ∩ B, the associated hypersurface F in C_X is close to the upper bound of C_X, close to the lower bound of C_X, or somewhere in between. In Figure 1B, this amounts to choosing p* such that the separating hypersurface F in C_X is close to the indifference curve, close to the boundary of the feasible set, or in between. These arguments have two implications: (1) the choice of p* in (6) (and the associated separating hypersurface F) is not unique; and (2) the nonlinearities in p* depend on the nature of nonconvexity in the optimization problem (1). While Lemma 2 establishes the general validity of the dual approach given in (6), the presence of nonconvexity makes the evaluation of p* in (6) more challenging. These challenges have been noted in the previous literature. 11,22,23 They are further discussed in Section 5 below.
Next, we explore the implications of Lemma 2 when the functions k and {(f_t, g_t) : t ∈ {0, … , T}} are continuously differentiable. (All proofs are in the Appendix.)
Proposition 1 introduces the generalized Hamiltonian H_t(x_t, u_t, p_t) defined in Equation (9e). It also establishes useful linkages between the penalty function p_t ∈ P and the solution to the optimal control problem (1). Comparing Proposition 1 with Lemma 1 is instructive. Importantly, under As1, Proposition 1 states that a saddle point of the generalized Lagrangian L is a necessary and sufficient condition for a solution to the optimization problem (1). This result applies under nonconvexity. This contrasts with Lemma 1, where a saddle point of the standard Lagrangian L_0 is only a sufficient condition. As illustrated in Figure 1B, the failure of the standard Lagrangian approach arises under nonconvexity (when a separating hyperplane does not exist).
Comparing Equations (9) and (4) is also instructive. Equation (9) reduces to (4) when p* is linear. As noted above, the linearity of p* is convenient and always valid under convexity (from the separating hyperplane theorem). But nonconvexity can require p* to be nonlinear. The main differences then appear between Equations (4a)–(4b) and their counterparts (9a)–(9b).

Corollary 2. (Generalized Pontryagin principle). Under continuous time, assume that the functions k and {(f_t, g_t) : t ∈ {0, … , T}} are continuously differentiable. Then, under As1, conditions (9′) hold.

Corollary 2 is a generalized Pontryagin principle. This can be seen by comparing Equations (9′) to (4′). There are two main differences. First, the standard Hamiltonian in (4e′) is replaced by the generalized Hamiltonian in (9e′). While H_0(x_t, u_t, λ_t, t) in (4e′) is linear in the Lagrange multipliers λ_t, H(x_t, u_t, p_t, t) can involve a nonlinear penalty function p_t. As illustrated in Figure 1B, a nonlinear penalty function is needed to represent a nonlinear separating hypersurface under nonconvexity. Second, Equation (4a′) differs from (9a′). As discussed above, Equation (4a′) holds at two points, O and O′, reflecting the fact that (4) can fail to identify the global optimum O (as it fails to distinguish between point O and the local optimum O′). This inability to distinguish between global and local optima does not arise in Figure 1B under the separating hypersurface AODD'. In contrast with (4a′), note that (9a′) allows p_t to be nonlinear and thus allows the slope ∂p(ẋ_t, t)/∂ẋ_t to vary with both ẋ_t and t. As discussed next, Equation (9a′) represents the dynamics of the shadow prices of the states.

PROPERTIES OF THE PENALTY FUNCTION
As noted above, a saddle point of the generalized Lagrangian in (6) supports a dual approach to the primal problem (1). Besides identifying the solution (x*, u*), the generalized Lagrangian also involves the penalty function p*. The properties of p* are of significant interest. To see this, rewrite the optimization problem (1) as the perturbed problem (10), in which the state equations become x_{t+1} − x_t = g_t(x_t, u_t) + b_t and V(b) denotes the optimal value as a function of the perturbations b = (b_0, … , b_T). The generalized Lagrangian associated with (10) is L(x, u, p, b), where p_t ∈ P, t ∈ {0, … , T}. Let (x_b, u_b, p_b) be the saddle point of L(x, u, p, b) in (10) conditional on b. The effects of changes in b on V(b) are presented next.

Proposition 2. Under As1 and for any two points b and b ′ , the following inequalities hold
Equation (12) shows how the change in the function V(b) is closely related to the function p_b(⋅) in the generalized Lagrangian approach. When the functions p_b and V are differentiable at b, letting b′ → b gives the following result:

Equation (13) is a version of the envelope theorem: the derivative of the indirect objective function V(b) with respect to b_t is equal to the derivative of the penalty function p_{bt}. When the objective function has a monetary interpretation, this implies that the derivative of the penalty function can be interpreted as measuring the marginal values (or shadow prices) of the states. When problem (1) corresponds to a cost minimization problem, ∂p_{bt}(g_t + b_t)/∂b_t can be interpreted as measuring the marginal cost of the states at time t. This is an important generalization of the standard Lagrangian approach. Indeed, when p is taken to be linear, ∂p_{bt}(g_t + b_t)/∂b_t reduces to the standard Lagrange multipliers. Proposition 2 establishes that such arguments generalize when p is nonlinear, with the slopes of p_{bt}(g_t + b_t) providing a general measure of the shadow prices of the states. This stresses that nonlinear shadow prices become an integral part of optimization problems under nonconvexity. This seems important to the extent that nonlinear pricing is commonly observed in many markets. 24 In this context, the additional information provided by the generalized Hamiltonian approach about nonlinear pricing can be very useful.
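The envelope-theorem relation in (13) can be checked numerically on a miniature problem. The example below is hypothetical (a single constraint pinning down a single decision, with illustrative numbers c and b): V(b) = min_u {u² : u = c + b}, whose multiplier from stationarity of a linear penalty is λ = 2(c + b):

```python
# Envelope-theorem check: dV/db equals the slope of the (here linear)
# penalty, i.e. the Lagrange multiplier.  c and b are illustrative numbers.
c = 0.7

def V(b):
    u = c + b              # the constraint u = c + b pins down the optimum
    return u * u

b = 0.2
lam = 2.0 * (c + b)        # stationarity of u^2 + lam*(c + b - u): 2u = lam
h = 1e-6
dVdb = (V(b + h) - V(b - h)) / (2.0 * h)   # numerical derivative of V
```

Here dVdb and lam agree to numerical precision, which is what Equation (13) asserts: the slope of the penalty prices a marginal relaxation of the constraint.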

ITERATIVE ALGORITHM
From Proposition 1, solving the optimization problem (1) is equivalent to finding a saddle point of the generalized Lagrangian L(x, u, p) in (5) under nonconvexity. In this context, the challenge is to identify constructive ways to find a saddle point of L. This section presents an algorithm that can be used to solve the optimization problem (1). † † † The algorithm is based on the generalized Lagrangian L(x, u, p) in (5). Since the state equation is an equality constraint in (1), we focus our attention below on the case where the penalty function satisfies p_t ∈ P′, t ∈ {0, … , T}, where P′ ≡ {p_t ∈ P : p_t(a) ≠ p_t(a′) for all a ≠ a′} ⊂ P. Note that p_t ∈ P′ implies that p_t evaluated at the state-equation residual is zero if and only if (x_{t+1} − x_t) = g_t(x_t, u_t). From Proposition 2, p_t ∈ P′ also implies that the corresponding shadow prices of the states are nonzero.

Proposition 3. Upon convergence, the iterative Algorithm A gives
as a solution to the optimization problem (1).
Proposition 3 states that, if it converges, the iterative Algorithm A will find a solution to problem (1). This is an important result that applies in the presence of nonconvexity. Equation (14) involves an iteration from (x^{q−1}, u^{q−1}) to (x^q, u^q); it corresponds to the second inequality in (6). Equation (15) involves an iteration from p^{q−1} to p^q, corresponding to the first inequality in (6): it provides a move toward satisfying the complementary slackness condition in (7). ‡ ‡ ‡ While Proposition 3 states that the iterative Algorithm A can identify a solution to problem (1), it raises three challenges. The first challenge is that the convergence mentioned in step 3 may not be achieved. For example, this would occur if the iterative changes stated in (14) and (15) are infinitely small. One way to avoid this issue is to replace (14) and (15) in Algorithm A by the conditions (14′) and (15′), where (ε_1, ε_2) > 0 are small positive numbers. § § § Under (14′) and (15′), each iteration would avoid making infinitely small changes, implying that it would be more likely to make progress toward convergence and a solution to the optimization problem (1). The second challenge relates to the speed of convergence of Algorithm A. Achieving convergence may take a large number of iterations, in which case the iterative scheme A becomes less practical. In general, the number of iterations depends on how steps 1 and 2 are chosen in Algorithm A. 25,26 Iterative schemes used to solve optimization problems include gradient descent (GD) methods and evolutionary algorithms (EA). GD methods rely on local gradients to guide the search for a global minimum. 26 For convex problems, GD methods perform well: they are guaranteed to converge, and they converge faster when using a refined choice of the step size at each iteration. But the reliance on local gradients makes GD methods less attractive for nonconvex problems for two reasons: (1) they may converge only to a local minimum; and (2) they may fail to converge in the presence of variable-specific saddle points.**** 25 Evolutionary algorithms rely on stochastic simulations. 27 EA are inspired by biological evolution, where randomly generated mutations are followed by selection that improves fitness. EA methods perform reasonably well under nonconvexity: they are more likely to identify a global solution in the presence of local solutions or variable-specific saddle points, but they can be slow to converge. 27 GD and EA can be combined in evolutionary gradient descent (EGD) methods. 28 EGD methods benefit from the merits of both GD and EA: they work well under nonconvexity (by escaping local solutions and variable-specific saddle points) while converging faster than EA methods. 28 In general, the speed of convergence depends on the nature of the optimization problem and the specifics of the iteration scheme in (14) and (15). In this context, knowing the structure of the optimization problem can help speed up convergence. Using Proposition 1, the time-additive structure of the generalized Lagrangian L(x, u, p) in (5) suggests the following recursive formulation of "step 1" in (14):
• Start with t = 0.
• Step 1a: Using the generalized Hamiltonian H_t(x_t, u_t, p_t) defined in (9e), choose u_t^q ∈ U that satisfies the minimization condition (9c).
• Step 1b: Update the state by solving the state equation forward: x_{t+1}^q = x_t^q + g_t(x_t^q, u_t^q).
• Step 1c: If t < T, let t = t + 1 and go to step 1a. Otherwise, stop and record (x^q, u^q).
Note that "step 1a" is closely related to Equation (9c) in Proposition 1: it is a search for the control variables u_t ∈ U that would minimize the generalized Hamiltonian H_t(x_t, u_t, p_t).
Similar arguments can be made for "step 2." Consider the case where p_t is written as p_t ≡ p(⋅, θ_t), where θ_t ∈ Θ_t are parameters providing a flexible representation of the penalty function p_t ∈ P, t ∈ {0, 1, … , T}. In this context, choosing p = (p_0, … , p_T) in (15) reduces to choosing the parameters (θ_0, … , θ_T).
• Step 2c: If t > 0, let t = t − 1 and go to step 2b. Otherwise, stop and record the penalty functions p^q.
Note that these steps rely on the results stated in Proposition 1: "step 2a" involves Equation (9b) while "step 2b" involves Equation (9a). These formulations constitute a version of the "sweep algorithm": the recursive formulation of "step 1" proceeds forward in time (from 0 to T), while the recursive formulation of "step 2" proceeds backward in time (from T to 0). "Step 1" solves for the forward path of the state and control variables while "step 2" solves for the backward path of the penalty functions. Under convexity, the penalty functions can be taken to be linear, their slopes being Lagrange multipliers; the above steps would then reduce to the standard sweep method commonly used in solving optimal control problems. 14,18,19 As such, the above recursive formulation is a generalization of the sweep method under nonconvexity.
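A runnable miniature can make the two steps of Algorithm A concrete. The sketch below reduces the problem to one period (T = 0) and uses a quadratic augmented-Lagrangian penalty p(a) = λa + ρa²/2; all functional forms and numbers are illustrative stand-ins, not the paper's examples. "Step 1" minimizes the generalized Lagrangian over (x, u) by gradient descent; "step 2" updates the penalty toward complementary slackness:

```python
# Miniature Algorithm A with f_0(u) = u^2/2, k(x_1) = x_1^2/2, state
# equation x_1 - x_0 = g_0(u_0) where g_0(u_0) = -u_0, and x_0 = 1.8
# (all hypothetical choices for illustration).
x0 = 1.8
rho, lam = 10.0, 0.0        # penalty p(a) = lam*a + 0.5*rho*a^2
x1, u0 = 0.0, 0.0

def resid(x1, u0):
    # state-equation residual a = x_1 - x_0 - g_0(u_0)
    return x1 - x0 + u0

for _ in range(100):                    # "step 2" loop: update the penalty
    for _ in range(2000):               # "step 1" loop: min L over (x, u)
        r = resid(x1, u0)
        gx = x1 + lam + rho * r         # dL/dx1
        gu = u0 + lam + rho * r         # dL/du0
        x1 -= 0.05 * gx
        u0 -= 0.05 * gu
    lam += rho * resid(x1, u0)          # move toward complementary slackness
```

Upon convergence the state equation holds (the residual vanishes), and the slope of the penalty at the optimum, λ, is the shadow price of the state, in line with Proposition 2.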
The third challenge comes from identifying the penalty function p under nonconvexity. Nonconvexity can require the penalty function to be nonlinear (as illustrated in Figure 1B). The difficulty is that the nonlinearity of p in general depends on the nature of nonconvexity in the optimization problem. One step in the direction of introducing nonlinearity is the augmented Lagrangian approach proposed by Hestenes 12 and Rockafellar. 21 In this context, allowing p to be a quadratic function is one possibility. But Figure 1B illustrates that choosing a quadratic function for p would fail to satisfy the global separation property. This reflects the fact that, while polynomial functions provide good local approximation properties, they may not be good choices as global penalty functions.
A more attractive possibility is to consider spline functions for p. Spline functions offer good flexibility as they can provide global approximations to any smooth function. 29 In this case, θ_t in p(⋅, θ_t) would be the spline parameters. The choice of a penalty function in "step 2" then involves a search for values of the spline parameters. For example, p_t(x_{t+1} − x_t, θ_t) could be a spline function, continuous but only piecewise differentiable in (x_{t+1} − x_t), with slopes changing across spline knots defined by θ_t. Figure 1B provides an example of a separating hypersurface AODD' which is a piecewise linear function, with changing slopes reflecting nonlinear shadow pricing of the states. Upon convergence, "step 2" would generate a penalty function with two desirable properties: (1) it would identify a global solution to the optimization problem (1) (from Proposition 3); and (2) the slopes of the penalty function would provide useful information about the shadow prices of the states (from Proposition 2).
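A piecewise-linear penalty of the kind just described can be parameterized directly by its knots and interval slopes, which play the role of the spline parameters θ_t. The sketch below uses illustrative knots and slopes:

```python
import bisect

def spline_penalty(knots, slopes):
    # knots: strictly increasing and containing 0.0 (so that p(0) = 0);
    # slopes: len(knots) + 1 slopes, one per interval
    #   (-inf, k_0], [k_0, k_1], ..., [k_last, +inf)
    i0 = knots.index(0.0)
    vals = [0.0] * len(knots)            # p evaluated at each knot
    for i in range(i0 + 1, len(knots)):
        vals[i] = vals[i - 1] + slopes[i] * (knots[i] - knots[i - 1])
    for i in range(i0 - 1, -1, -1):
        vals[i] = vals[i + 1] - slopes[i + 1] * (knots[i + 1] - knots[i])

    def p(a):
        j = bisect.bisect_left(knots, a)
        if j == 0:                        # left of the first knot
            return vals[0] + slopes[0] * (a - knots[0])
        return vals[j - 1] + slopes[j] * (a - knots[j - 1])

    return p

# illustrative kinked penalty: mild slopes for small residuals, steep beyond
p = spline_penalty([-1.0, 0.0, 1.0], [0.5, 1.0, 2.0, 4.0])
```

Here p is continuous with p(0) = 0 but only piecewise differentiable: its slope (the shadow price) changes across the knots, the kind of shape a linear hyperplane such as AOA' cannot reproduce but a hypersurface such as AODD' can.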

AUTHOR CONTRIBUTIONS
This paper was developed and written by Jean-Paul Chavas.