Factorization of completely positive matrices using iterative projected gradient steps

We aim to factorize a completely positive matrix by using an optimization approach which consists of minimizing a nonconvex smooth function over a convex and compact set. To solve this problem we propose a projected gradient algorithm with parameters that take into account the effects of relaxation and inertia. Both the projection and the gradient steps are simple in the sense that they have explicit formulas and do not require inner loops. Furthermore, no expensive procedure for finding an appropriate starting point is needed. The convergence analysis, which makes use of the Łojasiewicz inequality, shows that the whole sequence of generated iterates converges to a critical point of the objective function. The rate of convergence, expressed in terms of the Łojasiewicz exponent of a regularization of the objective function, is also provided. Numerical experiments demonstrate the efficiency of the proposed method, in particular in comparison to other factorization algorithms, and emphasize the role of the relaxation and inertial parameters.

The cone CP_n of completely positive n × n matrices can be written as CP_n = conv{ xx^T : x ∈ R^n_+ }, where conv stands for the convex hull operator. The factorization of a nonzero completely positive matrix A, that is, a representation A = XX^T with an entrywise nonnegative matrix X, is never unique. Moreover, the number of columns of the factor X can vary (see References 2,3), which gives rise to the following notion.
Let A ∈ R^{n×n}. The cp-rank of A is defined as cpr(A) := min{ r ∈ N : A = XX^T for some X ∈ R^{n×r}_+ }, with the convention cpr(A) := +∞ if no such factorization exists. The majority of the known numerical methods aiming to factorize a completely positive matrix are sensitive to its cp-rank.
The cp+-rank of A is defined as cpr+(A) := min{ r ∈ N : A = XX^T for some X ∈ R^{n×r}_{++} }, where R^{n×r}_{++} denotes the set of matrices in R^{n×r}_+ which have at least one positive column. The cp+-rank is useful when characterizing the interior of the cone CP_n of completely positive matrices; indeed, as shown by Dickinson in Reference 2, Theorem 3.8, this interior can be described in terms of factorizations with a positive column. The problem of computing the cp-rank of a matrix is in general open (see Reference 4), although upper bounds for the cp-rank have been derived by Bomze and coauthors. Notice that there exist matrices A ∈ int(CP_n) such that cpr(A) ≠ cpr+(A). Closely related to the completely positive matrices is the class of copositive matrices COP_n := { A ∈ S^{n×n} : x^T A x ≥ 0 for all x ∈ R^n_+ }, where S^{n×n} denotes the set of n × n symmetric matrices. In fact, COP_n is the dual cone of CP_n (see, for instance, Reference 1) with respect to the Frobenius inner product ⟨·, ·⟩ (see Section 2 for the precise definition). Many relaxations of combinatorial optimization problems and of nonconvex quadratic optimization problems can be formulated as linear problems over CP_n or COP_n. Since the objective function and the constraint functions are linear, the challenge when addressing these problems is entirely transferred to the proper handling of the cone constraints. Consequently, copositive and completely positive matrices have received considerable attention in recent years. The application fields where copositive and completely positive matrices appear include block designs, complementarity problems, projections in energy demand, the Markovian modelling of DNA evolution, and maximin efficiency robust tests; see Reference 1 and the references therein.
We illustrate this approach for the nonconvex quadratic programming problem min_{x∈R^n} x^T M x subject to j_n^T x = 1, x ≥ 0, (3)
where M ∈ S^{n×n} and j_n denotes the all-ones vector in R^n. If M is not a positive semidefinite matrix, then (3) is a nonconvex optimization problem which is usually NP-hard and exhibits numerous local minima. Observe that the objective function of (3) can be rewritten in terms of the Frobenius inner product: for X = xx^T it holds x^T M x = ⟨M, xx^T⟩ = ⟨M, X⟩. In the same fashion, the constraint j_n^T x = 1 implies ⟨j_n j_n^T, X⟩ = 1. Therefore, the optimization problem min_{X∈R^{n×n}} ⟨M, X⟩ subject to ⟨j_n j_n^T, X⟩ = 1, X ∈ CP_n, (4)
is a convex relaxation of the nonconvex quadratic problem (3). In Reference 9 it has been shown how optimal solutions of (4) can be related to optimal solutions of (3). Let X* be an optimal solution of (4). If X* is of rank one, then it can be expressed as X* = x_* x_*^T and therefore x_* is an optimal solution of (3). If rank(X*) > 1, then X* can be factorized as X* = ∑_{i=1}^r x_i x_i^T and it can be shown that an appropriately scaled version of each x_i is an optimal solution of (3). Another class of matrices related to the completely positive matrices is that of doubly nonnegative matrices, which are real positive semi-definite square matrices with nonnegative entries; the corresponding cone is the intersection of the positive semi-definite cone with R^{n×n}_+. One has in general that every completely positive matrix is doubly nonnegative, and, for n ≤ 4, the two classes coincide (see References 1,8). However, for n ≥ 5, the inclusion is strict; an explicit doubly nonnegative matrix which is not completely positive can be found in Reference 4. Checking membership of a matrix in the completely positive cone is known to be an NP-hard problem, as proved by Dickinson and Gijben in Reference 10. Jarre and Schmallowsky 11 proposed a method which provides a certificate for a given matrix to be completely positive. Their approach is based on an augmented primal-dual method and aims to solve a certain second-order cone problem. Eventually, it is necessary to solve Lyapunov equations to obtain a completely positive factorization.
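The lifting identity behind the relaxation above, x^T M x = ⟨M, xx^T⟩, together with the fact that j_n^T x = 1 yields ⟨j_n j_n^T, xx^T⟩ = 1, can be checked numerically. The following sketch (the data and variable names are illustrative, not from the paper) verifies both identities for a random symmetric M and a random point of the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
M = (M + M.T) / 2            # symmetric, not necessarily positive semidefinite
x = rng.random(n)
x = x / x.sum()              # feasible for j_n^T x = 1, x >= 0

X = np.outer(x, x)           # the lifted rank-one matrix X = x x^T
J = np.ones((n, n))          # j_n j_n^T

# x^T M x = <M, x x^T>, and j_n^T x = 1 implies <j_n j_n^T, X> = 1
assert np.isclose(x @ M @ x, np.sum(M * X))
assert np.isclose(np.sum(J * X), 1.0)
```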
One of the main challenges when dealing with completely positive matrices is their efficient factorization. 1,3,12 This is a question of high relevance in many applications, as, for example, in the statistics of multivariate extremes. Cooley and Thibaud 13 have shown that the tail dependence of a multivariate regularly varying random vector can be summarized in a so-called tail pairwise dependence matrix Σ of pairwise dependence metrics. This matrix Σ can be shown to be completely positive, and a nonnegative factorization of it can be used to estimate probabilities of extreme events or to simulate realizations with pairwise dependence summarized by Σ. This approach is used in Reference 13 to study data describing daily precipitation measurements. Further applications of the nonnegative factorization of completely positive matrices can be found in data mining and clustering, 14 and in automatic control. 15,16 Sponsel and Dür developed in Reference 17 an algorithm for determining the projection of a matrix onto the copositive cone. This method can also be used to compute completely positive factorizations; however, for reasonably large input matrices, the algorithm runs into memory problems. In Reference 18, Nie treats the completely positive factorization problem as a special case of an A-truncated K-moment problem, for which an algorithm is developed based on solving a sequence of semi-definite optimization problems. From the numerical point of view this method is expensive; the reported numerical experiments demonstrate the factorization of completely positive matrices only up to order 8 × 8.
Recently, Groetzner and Dür proposed in Reference 3 a novel approach to the nonnegative factorization problem which consists of formulating it as a nonconvex split feasibility problem and, consequently, of solving it via the method of alternating projections. It is known that when the initial point is sufficiently close to the feasible set, the sequence generated by the nonconvex method of alternating projections converges to a feasible element. The drawback of this algorithm is that it requires in every iteration two projections, which in general both have to be calculated approximately via inner loops, since they amount to solving a second-order cone problem (SOCP) and to finding a singular value decomposition of a matrix, respectively. In the same article, a modification of this method has been suggested, which replaces the solving of the SOCP by a simple projection onto the nonnegative orthant, but keeps the singular value decomposition, however, without theoretical evidence of its convergence. Also very recently, Chen, Pong, Tan and Zeng proposed in Reference 19 another approach which consists of reformulating the split feasibility problem as a difference-of-convex optimization problem and, consequently, of solving it via a specific algorithm, which also requires the singular value decomposition of a matrix in every iteration. We will present these approaches in more detail later.
In this paper we develop a different approach for the nonnegative factorization of a completely positive matrix, which amounts to the minimization of a nonconvex smooth function over a convex and compact set. To solve this problem we propose a projected gradient algorithm with parameters that take into account the effects of relaxation and inertia. The gradient and the projection steps are expressed by simple explicit formulas and thus do not require any inner loops. We prove the global convergence of the generated sequence for any starting point, which is another advantage over the methods discussed above, which require expensive procedures to compute suitable starting points. We provide rates of convergence for both the sequence of objective function values and the sequence of iterates in terms of the Łojasiewicz exponent of a regularization of the objective function. Numerical experiments show that our algorithm outperforms the other iterative factorization methods and emphasize the influence of the relaxation and inertial parameters on its performance.
Relaxation techniques have been introduced to provide more flexibility to iterative schemes, 20 while inertial effects serve to accelerate the convergence of numerical methods [21][22][23] and to allow the detection of different critical points. 24 Inertial proximal gradient algorithms for nonconvex optimization problems have been proposed and studied in References 25-29; their global convergence has been shown in the framework of the Kurdyka-Łojasiewicz property. [30][31][32][33][34][35] For convex optimization problems, relaxed inertial algorithms have been proved to combine the advantages of both relaxation techniques and inertial effects (see the corresponding literature). One of the aims of this work is to investigate, also in the nonconvex setting, to which extent the interplay between relaxation and inertial parameters influences the numerical performance of projected/proximal gradient algorithms.
Solution methods for nonsmooth nonconvex optimization problems have already been used in the literature for nonnegative matrix factorization. We recall here PALM, introduced by Bolte, Sabach, and Teboulle in Reference 33, which is a block coordinate projection gradient method, and its inertial variant studied by Pock and Sabach in Reference 29, for which numerical experiments were also reported. Recently, the symmetric nonnegative matrix factorization has been addressed by Dragomir, d'Aspremont and Bolte in Reference 39 from the perspective of a non-Euclidean first-order method. We also want to mention Reference 40, where nonnegative matrix factorizations are computed in an alternating manner.
As concerns sparse matrix factorization, 41 this problem can also be formulated as the feasibility problem of finding an element in the intersection of two nonconvex sets. In Reference 19 a difference-of-convex optimization approach was used to develop a solution method for it; as alternatives, one could use approaches based on the method of alternating projections 42 or on the Douglas-Rachford algorithm for feasibility problems. 43,44

Notations
We will write X := (x_{i,j})_{1≤i≤n,1≤j≤r} for an n × r matrix X if we want to specify its elements, and neglect the subscripts if there is no risk of confusion. The Frobenius inner product of X, Y ∈ R^{n×r} is defined by ⟨X, Y⟩ := ∑_{i=1}^n ∑_{j=1}^r x_{i,j} y_{i,j}. Due to the definition of the trace operator it holds ⟨X, Y⟩ = trace(X^T Y). For X ∈ R^{n×r} we will denote its Frobenius norm by ‖X‖_F := √⟨X, X⟩ and its 2-norm by ‖X‖_2 := max_{‖v‖=1} ‖Xv‖, where ‖·‖ denotes the usual Euclidean norm of a vector. If X := [X_1 | … | X_r] is the column representation of the matrix X, then we have ‖X‖_F^2 = ∑_{j=1}^r ‖X_j‖^2. For every X, Y ∈ R^{n×r} the standard compatibility and triangle inequalities between these norms hold, and for every λ ∈ R it holds ‖λX‖_F = |λ| ‖X‖_F. For a symmetric positive semi-definite matrix A ∈ R^{n×n} we denote by λ_1(A) ≥ … ≥ λ_n(A) ≥ 0 its eigenvalues, so that trace(A) = ∑_{i=1}^n λ_i(A). The following two estimates, which we also prove for the sake of completeness, will be useful later on.
2. If A ∈ R^{n×n} is a symmetric positive semi-definite matrix, then for every X ∈ R^{n×r} it holds trace(AXX^T) ≤ λ_1(A) ‖X‖_F^2. (11) Proof.
1. Using the column representation of Y one obtains the first estimate; notice that, in view of (7b), inequality (10) is sharper than (7d). 2. For two positive semi-definite matrices A, B ∈ R^{n×n} we have the following consequence of von Neumann's trace inequality (see Reference 45, pp. 340-341): trace(AB) ≤ ∑_{i=1}^n λ_i(A) λ_i(B). (12) The inequality (11) follows by applying (12) to the positive semi-definite matrices A and XX^T, and by noticing further that ∑_{i=1}^n λ_i(XX^T) = trace(XX^T) = ‖X‖_F^2. The open ball around Z ∈ R^{n×r} with radius ε > 0 is denoted by B̊_F(Z, ε) := {X ∈ R^{n×r} : ‖X − Z‖_F < ε} and the closed ball by B_F(Z, ε) := cl B̊_F(Z, ε), where the closure is taken with respect to the topology induced by the Frobenius norm.
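The von Neumann trace bound used in the proof above, trace(AB) ≤ ∑_i λ_i(A) λ_i(B) for positive semi-definite A and B with eigenvalues sorted in decreasing order, is easy to sanity-check numerically. A small sketch (all names and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

def random_psd(n):
    """A random symmetric positive semi-definite matrix G G^T."""
    G = rng.standard_normal((n, n))
    return G @ G.T

A, B = random_psd(n), random_psd(n)

lhs = np.trace(A @ B)
# eigenvalues of both matrices sorted in the same (decreasing) order
lam_A = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]
rhs = np.dot(lam_A, lam_B)

assert lhs <= rhs + 1e-9     # von Neumann trace inequality for PSD matrices
assert lhs >= -1e-9          # trace(AB) is also nonnegative for PSD A, B
```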
The indicator function of a set 𝒞 ⊆ R^{n×r} is defined as δ_𝒞(X) := 0, if X ∈ 𝒞, and δ_𝒞(X) := +∞, otherwise. We say that an element Z ∈ 𝒞 is a projection of an element X onto a nonempty closed subset 𝒞 of R^{n×r} if ‖X − Z‖_F ≤ ‖X − Y‖_F for all Y ∈ 𝒞. If the set 𝒞 is also convex, then the projection of an element X is uniquely defined and we will denote it by Pr_𝒞(X). It is characterized by Z = Pr_𝒞(X) if and only if ⟨X − Z, Y − Z⟩ ≤ 0 for all Y ∈ 𝒞. (13) Two instances with explicit formulas are: 1. if 𝒞 := R^{n×r}_+, then Pr_𝒞(X) = [X]_+ := max{X, 0}, where the max operator is understood entrywise; 2. if 𝒞 := B_F(0, ρ) for ρ > 0, we have Pr_𝒞(X) = X whenever ‖X‖_F ≤ ρ, and Pr_𝒞(X) = (ρ/‖X‖_F) X otherwise. In general, it is challenging to compute the projection onto the intersection of two sets, even if both are convex and explicit forms for the projections onto the individual sets are available. To this end one can either use the method of alternating projections 42,46 or the Douglas-Rachford algorithm for feasibility problems. 43,44 Cyclic formulations of these iterative methods can be used to determine the projection onto the intersection of more than two sets.
In the following example we provide one particular pair of convex sets for which the projection onto their intersection can be expressed by a closed formula. We will make use of this formula in our algorithm, which means that calculating this projection will require a reduced computational effort.

Example 2.
Let ρ > 0 and K be a nonempty closed convex cone in R^{n×r}. Then the projection onto the intersection K ∩ B_F(0, ρ) is given by (see Reference 47, Theorem 7.1) Pr_{K ∩ B_F(0, ρ)}(X) = Pr_{B_F(0, ρ)}(Pr_K(X)) for all X ∈ R^{n×r}. Notice that in general Pr_{B_F(0, ρ)}(Pr_K(X)) ≠ Pr_K(Pr_{B_F(0, ρ)}(X)) (see Reference 47, Example 7.5).
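Example 2 says that projecting onto K ∩ B_F(0, ρ) can be done by first projecting onto the cone and then onto the ball, in that order. A minimal sketch for K = R^{n×r}_+, the case used later in the algorithm (function names are ours, not from the paper):

```python
import numpy as np

def proj_cone(X):
    """Pr_K for the cone K = R^{n x r}_+ (entrywise clipping at zero)."""
    return np.maximum(X, 0.0)

def proj_ball(X, rho):
    """Pr_{B_F(0, rho)}: rescale onto the Frobenius ball if necessary."""
    nrm = np.linalg.norm(X, "fro")
    return X if nrm <= rho else (rho / nrm) * X

def proj_intersection(X, rho):
    # cone first, then ball -- this is the order covered by the formula above;
    # composing in the reverse order gives a different point in general
    return proj_ball(proj_cone(X), rho)

rng = np.random.default_rng(3)
X = 5.0 * rng.standard_normal((4, 3))
P = proj_intersection(X, 1.0)
assert np.all(P >= 0.0)                              # P lies in the cone
assert np.linalg.norm(P, "fro") <= 1.0 + 1e-12       # and in the ball
```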
For later comparison we discuss two more examples of projections onto particular sets which were used in the nonnegative factorization of completely positive matrices.

Example 3.
Let B ∈ R^{n×r} and consider the following set associated to B: 𝒫(B) := { Q ∈ R^{r×r} : BQ ∈ R^{n×r}_+ }. (15) The set 𝒫(B) is a polyhedral cone and thus a closed convex subset of R^{r×r}. The projection of X ∈ R^{r×r} onto the set 𝒫(B) is the unique solution of the optimization problem min_{Z∈𝒫(B)} ‖X − Z‖_F. (16) It was shown in Reference 3 that (16) is equivalent to the second-order cone problem min_{t∈R, Z∈R^{r×r}} t subject to constraints ensuring Z ∈ 𝒫(B) and ‖X − Z‖_F ≤ t. (SOCP) Second-order cone problems have been intensively studied in the literature from both theoretical and numerical perspectives.
Example 4.

Consider the set of r × r orthogonal matrices 𝒪_r := { Q ∈ R^{r×r} : QQ^T = I_r }, (17) where I_r denotes the r × r identity matrix. The set 𝒪_r is compact but nonconvex, so projections onto this set always exist, but may not be unique. A projection of an element X ∈ R^{r×r} onto 𝒪_r can be found by the polar decomposition of X (see, for instance, Reference 3, Lemma 4.1). In particular, for every X ∈ R^{r×r}, there exist a positive semi-definite matrix T ∈ R^{r×r} and an orthogonal matrix Y ∈ R^{r×r} such that X = TY. Therefore, the matrix Y is a projection of X onto 𝒪_r and it can be computed by means of the singular value decomposition X = UΣV^T. Indeed, for T := UΣU^T and Y := UV^T it holds X = UΣV^T = UΣU^T UV^T = TY.
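The polar-decomposition recipe above is a one-liner given an SVD routine: the orthogonal polar factor Y = UV^T is a nearest orthogonal matrix to X in the Frobenius norm. A sketch using NumPy (the helper name is ours):

```python
import numpy as np

def proj_orthogonal(X):
    """A projection of a square matrix X onto the set of orthogonal matrices.

    Via the SVD X = U S V^T, the orthogonal polar factor Y = U V^T minimizes
    ||X - Q||_F over all orthogonal Q."""
    U, _, Vt = np.linalg.svd(X)
    return U @ Vt

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 5))
Y = proj_orthogonal(X)
assert np.allclose(Y @ Y.T, np.eye(5))          # Y is orthogonal
# Y is at least as close to X as any other orthogonal matrix, e.g. a random one
Z = proj_orthogonal(rng.standard_normal((5, 5)))
assert np.linalg.norm(X - Y) <= np.linalg.norm(X - Z) + 1e-9
```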

Variational analysis tools
In the following we will introduce some tools from variational analysis which will play an important role in the convergence analysis. Let Ψ : R^{n×r} → R ∪ {+∞} be a proper and lower semi-continuous function and X an element of its effective domain dom Ψ := { X ∈ R^{n×r} : Ψ(X) < +∞ }. The Fréchet subdifferential of Ψ at X is ∂̂Ψ(X) := { V ∈ R^{n×r} : liminf_{Y→X} (Ψ(Y) − Ψ(X) − ⟨V, Y − X⟩)/‖Y − X‖_F ≥ 0 }, and the limiting (Mordukhovich) subdifferential 48,49 of Ψ at X is ∂Ψ(X) := { V ∈ R^{n×r} : there exist sequences X_k → X and V_k → V such that Ψ(X_k) → Ψ(X) as k → +∞ and V_k ∈ ∂̂Ψ(X_k) for any k ≥ 0 }.
The normal cone to a nonempty convex subset 𝒞 of R^{n×r} is defined as N_𝒞(X) := { V ∈ R^{n×r} : ⟨V, Y − X⟩ ≤ 0 for all Y ∈ 𝒞 } for X ∈ 𝒞, and as N_𝒞(X) := ∅ for X ∉ 𝒞. It holds N_𝒞(X) = ∂δ_𝒞(X) for every X ∈ R^{n×r}. If 𝒞 ⊆ R^{n×r} is a nonempty convex closed set and X ∈ R^{n×r}, then Z = Pr_𝒞(X) if and only if X − Z ∈ N_𝒞(Z). (18)

Nonnegative factorization of completely positive matrices via projection onto the orthogonal set 𝒪_r
In the following we will revisit some recent iterative approaches for finding a nonnegative factorization of completely positive matrices.
In Reference 3 this problem was reformulated as a feasibility problem. For a given matrix A ∈ R^{n×n}, in a first step, a not necessarily entrywise nonnegative matrix B ∈ R^{n×r} such that A = BB^T was considered. The aim was then to find an r × r square matrix Q such that Q ∈ 𝒫(B) ∩ 𝒪_r, (19) where 𝒫(B) and 𝒪_r are the polyhedral cone associated to B and the set of r × r orthogonal matrices given in (15) and (17), respectively. This approach was motivated by the observation that, for every B_1, B_2 ∈ R^{n×r}, it holds B_1 B_1^T = B_2 B_2^T if and only if there exists Q ∈ 𝒪_r such that B_1 Q = B_2 (see Reference 3, Lemma 2.6).
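The observation from Reference 3, Lemma 2.6 is easy to illustrate in one direction: multiplying a factor by an orthogonal matrix does not change the product BB^T. A small numerical sketch (the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 5, 4
B1 = rng.random((n, r))                            # some factor of A = B1 B1^T
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # a random orthogonal matrix
B2 = B1 @ Q                                        # another factor of the same A

assert np.allclose(Q @ Q.T, np.eye(r))             # Q is orthogonal
assert np.allclose(B1 @ B1.T, B2 @ B2.T)           # B1 B1^T = B2 B2^T
```

Note that B2 need not be entrywise nonnegative even when B1 is, which is precisely why the feasibility problem (19) searches for a Q in 𝒫(B) ∩ 𝒪_r.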
To solve (19), naturally, the method of alternating projections was used.
(The Method of Alternating Projections 3 ). Let A be a completely positive n × n matrix and let r be a positive integer.

Input:
• a given B ∈ R^{n×r} such that A = BB^T; • a given starting point Q_0 ∈ 𝒪_r.
Step 1: Compute P_k := Pr_{𝒫(B)}(Q_k) and choose Q_{k+1} ∈ Pr_{𝒪_r}(P_k). (MAP) Step 2: If a stopping criterion is not met, then set k := k + 1 and go to Step 1.
The stopping criterion used in Reference 3 for this scheme, as well as for the other two methods that will be described later in this section, reads min_{1≤i≤n, 1≤j≤r} (BQ_k)_{i,j} > −Tol_fea, where Tol_fea is a very small positive tolerance. The nonconvex method of alternating projections is known to converge only locally, which means that convergence can be guaranteed if the initial point is sufficiently close to 𝒫(B) ∩ 𝒪_r.
As noticed in Example 3, the first step in (MAP) amounts to solving a second-order cone problem, which usually can be done only approximately and requires an inner loop. To avoid this drawback, another algorithm was proposed in Reference 3, which, in every iteration, calculates an approximation of Pr_{𝒫(B)}(Q_k). This is done by using the projection onto R^{n×r}_+, for which an exact formula exists, and an update step which uses the Moore-Penrose pseudoinverse of B. This second algorithm has the following formulation.
(The second Method of Alternating Projections 3 ). Let A be a completely positive n × n matrix and let r be a positive integer.

Input:
• a given B ∈ R^{n×r} such that A = BB^T; • a given starting point Q_0 ∈ 𝒪_r.
Step 1: Compute (ModMAP) Step 2: If a stopping criterion is not met, then set k : = k + 1 and go to Step 1.
In Reference 19, an alternative approach to (19) was considered, by reformulating the nonnegative factorization problem as a difference-of-convex optimization problem and by solving the latter via a nonmonotone linesearch algorithm. This algorithm can be found in Reference 19, Section 6.1; here we present, for easy reference, the iterative scheme with a fixed stepsize.
(The Difference-of-Convex Approach with fixed stepsize 19 ). Let A be a completely positive n × n matrix and let r be a positive integer.

Input:
• a given B ∈ R^{n×r} such that A = BB^T; • a given starting point Q_0 ∈ 𝒪_r.
Step 1: Compute ) . (SpFeasDC) Step 2: If a stopping criterion is not met, then set k : = k + 1 and go to Step 1.
One can notice that all three iterative schemes require in every iteration the calculation of a projection onto the orthogonal set 𝒪_r. To do this one basically needs to carry out a singular value decomposition of a matrix, as discussed in Example 4, which can be done in a subroutine that needs O(r^3) operations. Furthermore, all three algorithms ask for finding a matrix B ∈ R^{n×r} such that A = BB^T. This can be done, for instance, by the Cholesky decomposition of A, in which case B is a lower triangular matrix, or by the spectral decomposition A = VΣV^T and then by setting B := VΣ^{1/2}. In either case, one needs an additional procedure to find an appropriate initial matrix B.
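Both ways of producing an initial factor B with A = BB^T can be sketched as follows; the Cholesky route requires A to be positive definite, while the spectral route also handles rank-deficient A (function names and the test matrix are ours):

```python
import numpy as np

def initial_factor_cholesky(A):
    """Lower-triangular B with A = B B^T (requires A positive definite)."""
    return np.linalg.cholesky(A)

def initial_factor_spectral(A, r=None):
    """B := V Sigma^{1/2} from the spectral decomposition A = V Sigma V^T."""
    lam, V = np.linalg.eigh(A)
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative round-off
    B = V @ np.diag(np.sqrt(lam))
    return B if r is None else B[:, -r:]     # eigh sorts ascending: last r columns
                                             # carry the largest eigenvalues

# a small completely positive matrix: A = C C^T with C entrywise nonnegative
C = np.array([[1.0, 2.0], [0.5, 1.0], [2.0, 0.0]])
A = C @ C.T                                  # rank 2, hence singular
B = initial_factor_spectral(A)
assert np.allclose(B @ B.T, A)

Apd = A + np.eye(3)                          # shifted to be positive definite
L = initial_factor_cholesky(Apd)
assert np.allclose(L @ L.T, Apd)
```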

AN OPTIMIZATION MODEL WITH CONVERGENCE GUARANTEES
In this section we will propose a new approach for the nonnegative factorization of completely positive matrices, which consists of solving a nonconvex optimization problem by means of a projected gradient algorithm. We will also carry out a comprehensive convergence analysis for the iterative method, and even derive convergence rates.

The optimization model
For a given nonzero completely positive matrix A ∈ R^{n×n}, finding a factorization A = XX^T, where X ∈ R^{n×r}_+, can be cast as the optimization problem min_{X∈D} F(X) := (1/2) ‖A − XX^T‖_F^2. (P_mod) Denoting by F* := inf_{X∈D} F(X) the optimal objective value of (P_mod), it holds F* = 0 if and only if A admits a factorization A = XX^T with X ∈ D. Notice that F is a nonconvex objective function with continuous gradient ∇F(X) = −2(A − XX^T)X, which is not Lipschitz continuous, but only locally Lipschitz continuous. In order to handle this situation properly in the convergence analysis, we minimize the objective function F(X) over a meaningfully chosen bounded set, which, however, does not pose any restriction on the model. Indeed, if X satisfies A = XX^T, then ‖X‖_F = √trace(A), since, according to the definition of the Frobenius norm and (5) and (6), we have ‖X‖_F^2 = trace(XX^T) = trace(A). This explains the choice of D as the intersection D := R^{n×r}_+ ∩ B_F(0, √trace(A)).
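Under the reading F(X) = (1/2)‖A − XX^T‖_F^2, which is consistent with the stated gradient ∇F(X) = −2(A − XX^T)X, the gradient formula can be validated by finite differences. A sketch (all names and the random data are illustrative):

```python
import numpy as np

def F(X, A):
    """Least-squares factorization objective; its gradient is -2(A - X X^T) X."""
    R = A - X @ X.T
    return 0.5 * np.sum(R * R)

def grad_F(X, A):
    return -2.0 * (A - X @ X.T) @ X

rng = np.random.default_rng(6)
n, r = 4, 3
C = rng.random((n, r))
A = C @ C.T                                  # completely positive by construction
X = rng.random((n, r))

# central finite-difference check of the directional derivative <grad, D>
G = grad_F(X, A)
eps = 1e-6
for _ in range(5):
    D = rng.standard_normal((n, r))
    fd = (F(X + eps * D, A) - F(X - eps * D, A)) / (2 * eps)
    assert abs(fd - np.sum(G * D)) < 1e-4 * max(1.0, abs(fd))
```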
1. The set D is nonempty, convex, and closed, and for any X ∈ R^{n×r} it holds Pr_D(X) = Pr_{B_F(0, √trace(A))}([X]_+), (21) where [X]_+ := max{X, 0} and the max operator is understood entrywise. 2. For X, Y ∈ R^{n×r}, the two-sided estimate (22) holds, where the constant L(X, Y) is given by (23). Proof.
, it follows from (14) that the asserted identity holds. Moreover, if Z is symmetric, then so is ∇F(Z). Let X, Y ∈ R^{n×r} be fixed. Applying (24) with W := XX^T and Z := YY^T, and taking (25) into consideration, we get (26); it remains to estimate the last two terms in (26). The last equation in (27) comes from the fact that the trace operator is invariant under cyclic permutations. Notice that, thanks to (11), the corresponding term can be estimated; plugging this estimate into (27), and neglecting the last two nonnegative terms, we obtain the left-hand-side inequality in (22).
By applying (10) we can derive an upper bound for the last term in (27). By plugging (28) into (27) and recalling the inequalities (11) and (10), we get the right-hand-side inequality in (22) with L(X, Y) defined as in (23). ▪

A projected gradient algorithm with relaxation and inertial parameters
We are now in the position to formulate the projected gradient algorithm we propose in this paper to solve (P mod ).

Algorithm 1. Let A be a completely positive n × n matrix and let r be a positive integer.
Input: • given starting points X_1 := X_0 ∈ D; • a sequence of inertial parameters {β_k}_{k≥1}; • a relaxation parameter τ ∈ (0, 1] chosen such that condition (30) holds. Main iterate: Set k := 1.
Step 1: Compute Y_k := X_k + β_k (X_k − X_{k−1}), Z_{k+1} := Pr_D ( Y_k − (1/L_F) ∇F(Y_k) ), X_{k+1} := (1 − τ) X_k + τ Z_{k+1}. (31a)-(31c) Step 2: If a stopping criterion is not met, then set k := k + 1 and go to Step 1.
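The overall structure of Algorithm 1 can be sketched as follows: an inertial extrapolation step, a projected gradient step onto D = R^{n×r}_+ ∩ B_F(0, √trace(A)), and a relaxation step. The inertial parameter `beta`, relaxation parameter `tau`, and constant step size below are illustrative placeholders, not the values prescribed by conditions (29)-(30):

```python
import numpy as np

def grad_F(X, A):
    """Gradient of F(X) = 0.5 * ||A - X X^T||_F^2."""
    return -2.0 * (A - X @ X.T) @ X

def proj_D(X, rho_ball):
    """Projection onto D = R^{n x r}_+ ∩ B_F(0, rho_ball): clip, then rescale."""
    Z = np.maximum(X, 0.0)
    nrm = np.linalg.norm(Z, "fro")
    return Z if nrm <= rho_ball else (rho_ball / nrm) * Z

def factorize(A, r, beta=0.3, tau=1.0, step=None, iters=5000, seed=0):
    """Projected gradient with inertial parameter beta and relaxation tau.

    beta, tau and the constant step size are placeholders; the paper ties them
    to the local Lipschitz constant L_F via its conditions (29)-(30)."""
    rng = np.random.default_rng(seed)
    rho_ball = np.sqrt(np.trace(A))        # feasible factors satisfy ||X||_F^2 = trace(A)
    X_prev = X = proj_D(rng.random((A.shape[0], r)), rho_ball)
    if step is None:
        step = 1.0 / (8.0 * np.trace(A))   # conservative placeholder step size
    for _ in range(iters):
        Y = X + beta * (X - X_prev)                    # inertial extrapolation
        Z = proj_D(Y - step * grad_F(Y, A), rho_ball)  # projected gradient step
        X_prev, X = X, (1.0 - tau) * X + tau * Z       # relaxation step
    return X

C = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
A = C @ C.T                                # completely positive by construction
X = factorize(A, r=3)
assert np.all(X >= -1e-12)                 # the iterate stays in D
assert np.linalg.norm(X, "fro") <= np.sqrt(np.trace(A)) + 1e-9
```

No claim is made here about the quality of the computed factor; the convergence guarantees of the actual method rest on the parameter conditions analyzed in the paper.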
1. In the analysis we will use, to ease the reading, L_F instead of L_F( + ); however, we will return to this notation in Section 4.1, where we will consider some particular choices of the sequence of inertial parameters, which will give different values for +. To help the reader understand the choice of the parameters, and the motivation behind the construction of L_F in (29), we formulate and prove Lemmas 2 and 3 first and postpone the discussion on the feasibility of the condition (30) to Remark 3. 2. In the following theoretical investigations, we are interested in the convergence behaviour of the generated sequences as k → +∞; however, in the numerical experiments we will use as stopping criterion a condition on the objective function value, where Tol_val is a very small positive tolerance. 3. Recall that an explicit formula for the projection operator onto D has been given in (21). 4. For any k ≥ 1, the equivalent reformulation of (31c) in terms of the difference Z_{k+1} − X_k will be useful in the analysis. The following result proves that the sequence {X_k}_{k≥0} generated by Algorithm 1 belongs to D and that L_F( + ) is an upper bound for the corresponding sequence of constants. Proof.
1. Notice that {Z k } k≥2 ⊆  due to (31b). If we assume that X 1 ∈ , then, by induction arguments, X k+1 ∈ , since it is a convex combination of X k and Z k + 1 . Consequently, for any ▪ Remark 2. In the nonconvex setting, the boundedness of the sequence of iterates plays an important role in the convergence analysis. As seen in Lemma 2 (i), the nature of Algorithm 1 ensures that X k ∈  for every k ≥ 0, and thus the sequence {X k } k≥0 is bounded.
For the reader's convenience we denote the objective function of (P_mod) by Ψ := F + δ_D.
where ∶= max Proof. Let k ≥ 2 be fixed. We first show that The characterization of the projection (13) ensures that In view of (31b), it is clear that Z k ∈ , thus, setting X : The left-hand side inequality in (22) implies that while the right-hand side inequality in (22) and (33) imply Summing up (38), (40) and (39), and noticing that  (Z k+1 ) =  (Z k ) = 0, yield (36). Next we will study the term Then, by using identity (8), it holds Combining (31a) and (32b) gives us further By plugging (42) and (43) into (36), we get which is nothing else than (34) with the constants and as defined in (35). Notice that (44) is true since is an upper bound for Otherwise, we have which leads to the desired statement. ▪ The estimate above remains true if we replace Ψ by . In fact, the indicator function was artificially inserted in the decreasing property (34), as it will help us to prove the convergence of the iterates later on. Now, with ≥ 0 introduced in (35b), we define the following function The objective function Ψ of (P mod ) is closely related to Ψ in terms of their critical point. Indeed, if = 0, which is the case when = 1 and + = 0, then Ψ (Z, X) = Ψ(Z) for any (Z, X) ∈ R n×r × R n×r , thus X * ∈ critΨ if and only if (X * , X * ) ∈ critΨ for X * ∈ R n×r . On the other hand, one can easily verify that for every > 0 we have Remark 3. In the view of (31c), it holds X k+1 − X k = (Z k+1 − X k ) for every k ≥ 1 . Therefore, using the definition (45), the inequality (34) can be rewritten for any k ≥ 2 as We will show that C 0 > 0. It holds On the one hand, since 0 < ≤ 1, we have This is further equivalent to On the other hand, by setting ∶= 1 > 0, the second inequality in (48) can be equivalently expressed as Its reduced discriminant reads Thus, the inequality (50) is equivalent to Combining (49) and (51), we observe further that Thus, in view of (30), C 0 > 0.
A direct consequence of Lemma 3 follows. Proof. Let k ≥ 2 be fixed. In view of (47) we have

Proposition 2. Let
It is clear that the sequence {Ψ (Z_k, X_{k−1})}_{k≥2} is monotonically decreasing and, since it is nonnegative, convergent. The fact that C_0 > 0 together with telescoping arguments (see, for instance, Reference 20, Lemma 5.31) gives ∑_{k≥1} ‖X_{k+1} − X_k‖^2 < +∞, thus X_{k+1} − X_k → 0 as k → +∞. By taking (41) into consideration, we deduce that Z_{k+1} − Z_k → 0 as k → +∞. Using further (32a) and (31a), we have Z_{k+1} − Y_k → 0 as k → +∞. According to (32), (31a), and (31c), the conclusion follows. Proof. Let X̄ be a cluster point of {X_k}_{k≥0}, which means that there exists a subsequence {X_{k_i}}_{i≥1} such that X_{k_i} → X̄ as i → +∞. We deduce further that Z_{k_i} → X̄ as i → +∞, due to (32b). By the characterization of the projection (18) and (31b), we get a corresponding inclusion for every i ≥ 1. By passing to the limit as i → +∞, and by taking into consideration the continuity of ∇F and the fact that Z_{k+1} − Y_k → 0 as k → +∞ (see Proposition 2 2), we get the limiting inclusion; the closedness of the graph of the normal cone gives −∇F(X̄) ∈ N_D(X̄). In other words, X̄ ∈ critΨ. ▪

Global convergence thanks to the Łojasiewicz property
In this subsection we will prove that the whole sequence of iterates {X_k}_{k≥0} generated by Algorithm 1 converges to a critical point of the objective function Ψ, and we will even establish its rate of convergence. To this end we will use the fact that the regularized objective function Ψ fulfills the Łojasiewicz property (see Reference 35), since it is a semi-algebraic function (see Reference 30, Example 1 and Reference 32). Recall that a function is called semi-algebraic if its graph can be expressed as a semi-algebraic set. The fact that Ψ satisfies the Łojasiewicz property means that for any critical point of Ψ there exist constants C_L, ε_L > 0 and an exponent in [0, 1) such that the corresponding Łojasiewicz inequality holds in a neighbourhood of this critical point.
If Ω is a connected and compact subset of critΨ, then, according to Reference 30, Lemma 1, Ψ fulfills the uniform Łojasiewicz property, which means that there exist global constants C, ε > 0 and an exponent in [0, 1) such that the Łojasiewicz inequality holds uniformly on a neighbourhood of Ω. Next we will see that, for Ω taken as the set of cluster points of the sequence {(Z_k, X_{k−1})}_{k≥2}, we are indeed in the setting of the uniform Łojasiewicz property. Notice that Ω ≠ ∅ thanks to the boundedness of the sequences {X_k}_{k≥0} and {Z_k}_{k≥2}.
Proof. Item 1 follows from Theorem 1 and (46). The proof of 2-3 follows along the lines of Reference 33, Theorem 5 (ii)-(iii), by taking into consideration Reference 33, Remark 5, according to which the properties in 2-3 are generic for sequences satisfying Z_k − Z_{k−1} → 0 and X_k − X_{k−1} → 0 as k → +∞, which is indeed our case due to Proposition 2 2.
Finally, to prove 4, we consider an arbitrary element of Ω. As a consequence, since {Ψ (Z_k, X_{k−1})}_{k≥2} converges due to Proposition 2 1, it follows that Ψ is constant on Ω. As a last preparatory step we derive an upper bound for a subgradient of Ψ.

Lemma 5.
Let {X_k}_{k≥0} be a sequence generated by Algorithm 1. For any k ≥ 2 we have the estimate (52) and, in addition, the estimate (53), with constants depending on trace(A). Proof. Let k ≥ 2 be fixed. The calculus rules of the limiting subdifferential guarantee the corresponding sum rule for every (Z, X) ∈ R^{n×r} × R^{n×r}. By the characterization of the projection (18) and (31b), we obtain an inclusion from which we deduce (52). Further, we observe that the last inequality follows from (7b)-(7c) and Lemma 2. From here we derive an estimate which holds for all k ≥ 2 and which yields the inequality (53). ▪ To simplify the notation, let us define for every k ≥ 2 the quantity in (54), where Ψ* = lim_{k→+∞} Ψ (Z_k, X_{k−1}). According to Proposition 2 1, the sequence { k }_{k≥0} decreases monotonically to 0. We are now in the position to prove the global convergence of the sequence generated by Algorithm 1. For a cluster point (X̄, X̄) it holds Ψ (X̄, X̄) = Ψ* and, for every k ≥ 2, we have Ψ (Z_k, X_{k−1}) − Ψ (X̄, X̄) = k. We will show that {X_k}_{k≥0} has finite length, namely, that (55) holds. From here it will follow that {X_k}_{k≥0} is a Cauchy sequence, thus it converges to some X*, which, according to Theorem 1, will be a critical point of (P_mod).
In order to prove (55) we will consider two cases: The monotonicity of { k } k≥0 implies that k = 0 for all k ≥ k 1 and, further, in view of (47) and (30), that X k + 1 − X k = 0 for all k ≥ k 1 . Hence Case 2. It holds k > 0 for every k ≥ 2. As Ψ fulfills the uniform Łojasiewicz property, there exist C, > 0 and ∈ [0, 1) such that Combining (56) and (57), we deduce that for every k ≥ k 2 it holds where the last two inequalities follow from Lemma 5. For the given exponent ∈ [0, 1), we define which is a nondecreasing function as ′ (s) = s − 1− > 0. The concavity of gives, by taking into consideration (47), for all k ≥ 2 From here we get that for every k ≥ k 2 By setting for every k ≥ k 2 the inequality (60) becomes ∈ (0, 1) and 1 ∶= C 2 2 (C 1 + C 2 ) ∈ [0, 1) .

for instance, Reference 25, Lemma 2.3 or Reference 26, Lemma 3). This leads to (55) and the proof is completed. ▪
We will close this section by discussing the rates of convergence of the projected gradient algorithm with relaxation and inertial parameters. The nature of the rates is determined by the Łojasiewicz exponent of the function Ψ, which in general cannot be calculated exactly. This is why our statements cover all possible situations. Further discussion of the values which the Łojasiewicz exponent takes will be provided in the last section of the paper.
An essential tool for deriving the rates of convergence is the following lemma, the proof of which can be found in Reference 50, Lemma 15.

Lemma 6.
Let { k }_{k≥0} be a monotonically decreasing sequence of nonnegative numbers converging to 0. Assume further that there exists a natural number k̄ ≥ 2 such that for any k ≥ k̄ the recursion (61) holds, where C > 0 and the exponent lies in [0, 1). The following statements are true: 1. if the exponent equals 0, then { k }_{k≥0} converges to 0 in finitely many steps; 2. if the exponent lies in (0, 1∕2], then there exist C_{,0} > 0 and Q ∈ [0, 1) such that { k }_{k≥0} converges linearly to 0 with rate Q for any k ≥ k̄; 3. if the exponent lies in (1∕2, 1), then there exists C_{,1} > 0 such that { k }_{k≥0} converges to 0 at a sublinear rate for any k ≥ k̄ + 2. We will show that the sequence { k }_{k≥0} defined in (54) satisfies the recursion inequality (61) in Lemma 6.

Lemma 7.
Let {X k } k≥0 be the sequence generated by Algorithm 1 and {ε k } k≥2 the sequence defined in (54). Then there exists k 3 ≥ 2 such that the recursion (61) holds for any k ≥ k 3 .

Proof. From (47) we get, for any k ≥ 4, an estimate in which V k ∈ ∂Ψ (Z k , X k−1 ) is the element defined in Lemma 5; here (62) holds true by taking further into account that 0 ≤ β + ≤ 1. By the same argument as in the proof of Theorem 2, if we take k 3 ∶= k 2 ≥ 2 for which (57) holds, then according to (56) the following inequality holds for every k ≥ k 3 :

The desired statement is a combination of this estimate and (63). ▪
In order to transfer the convergence rates from { k } k≥0 to the sequence {X k } k≥0 , we will need the following lemma.

Lemma 8.
Let {X k } k≥0 be the sequence generated by Algorithm 1 and {ε k } k≥2 the sequence defined in (54). Let X * be the critical point of (P mod ) to which the sequence {X k } k≥0 converges as k → +∞ and φ ∶ R + → R, φ(s) = s 1−θ . Then there exists k 3 ≥ 2 such that the estimate (64) holds for any k ≥ k 3 .

Proof. By using the same arguments as in the proof of Theorem 2, there exists k 3 ≥ 2 such that for any k ≥ k 3 the following inequality is true. Let k ≥ k 3 be fixed. By an induction argument one can prove (66). For any K ≥ k + 2 ≥ k 3 , by summing up (65) for i = k + 2, … , K, we get (67). Plugging the corresponding relations into (67) and neglecting the last two negative terms in (68b) and (68c), we get (69). Thanks to (47) we can deduce a further estimate. The fact that {ε k } k≥2 is monotonically decreasing implies √ε k+1 ≤ √ε k and φ(ε k+1 ) ≤ φ(ε k ). By passing K → +∞ in (69) and by using (66), we get the desired statement. ▪

We can now formulate the rates of convergence for the sequences of objective function values and iterates.

Theorem 3.
Let {X k } k≥0 be the sequence generated by Algorithm 1 and {ε k } k≥2 the sequence defined in (54). Let X * be the critical point of (P mod ) to which the sequence {X k } k≥0 converges as k → +∞. Then there exists k 4 ≥ 2 such that the following statements are true: 3. if θ ∈ (1∕2, 1), then there exist C ′ 3 , C ′ 4 > 0 such that the corresponding estimates hold for any k ≥ k 4 + 2.

Proof. Let k 3 ≥ 2 be the index provided by the previous lemma, with the property that (64) holds for any k ≥ k 3 . Since {ε k } k≥2 converges to 0, there exists k 4 ≥ k 3 such that (71) holds for any k ≥ k 4 . Moreover, as 1 − 2θ ≥ 0, due to (71) the corresponding bound follows. Consequently, Lemma 8 implies an estimate which is nothing else than the second inequality of statement 2 with C ′ 2 ∶= C 5 . If θ ∈ (1∕2, 1), then we can use Lemma 6(3) to ensure that there exists C ′ 3 > 0 such that (70) holds for any k ≥ k 4 . Since 2θ − 1 > 0 and ε k ≤ 1 due to (71), the second statement follows from (70) with C ′ 4 ∶= C 5 C 3 1−θ > 0. ▪

Some particular instances of Algorithm 1
In the following we discuss some particular instances of Algorithm 1. To this end we will again use the notation L F (β + ), which allows us to better underline the dependence of the step size on the inertial parameters.

Example 5.
Choosing β k = 0 for all k ≥ 1, Algorithm 1 reduces to the relaxed projected gradient algorithm. In this case, β + = 0 and condition (30) becomes (72). Notice that, according to (72), the choice ρ = 1 is allowed, which leads to the classical projected gradient algorithm.
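The relaxed projected gradient iteration of this example can be sketched in a few lines. This is a minimal illustration, not the exact Algorithm 1: the names `grad_F` and `proj_C`, the step size, and the relaxation parameter `rho` are generic placeholders, and the toy objective is chosen only so that the fixed point is known in closed form.

```python
import numpy as np

def relaxed_projected_gradient(grad_F, proj_C, X0, step, rho=1.0, iters=100):
    """Sketch of a relaxed projected gradient iteration: a full projected
    gradient step followed by a convex combination (relaxation) with the
    current iterate.  With rho = 1 it reduces to the classical projected
    gradient method."""
    X = X0.copy()
    for _ in range(iters):
        Z = proj_C(X - step * grad_F(X))   # projected gradient step
        X = (1.0 - rho) * X + rho * Z      # relaxation step
    return X

# Toy instance: minimize 0.5*||X - B||_F^2 over the nonnegative orthant.
B = np.array([[1.0, -2.0], [3.0, -0.5]])
grad_F = lambda X: X - B                   # gradient of the quadratic
proj_C = lambda X: np.maximum(X, 0.0)      # projection onto the orthant
X_star = relaxed_projected_gradient(grad_F, proj_C, np.zeros((2, 2)),
                                    step=1.0, rho=1.0, iters=50)
```

For this toy problem the limit is the entrywise positive part of B, which the iteration reaches after a single step, matching the remark that ρ = 1 recovers the classical method.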
Example 6.
In the nonconvex setting, algorithms with inertial effects have proved helpful in detecting critical points of minimization problems which cannot be found by their non-inertial variants (see, for instance, References 26,51). For constant inertial parameters β k = β + ∈ [0, 1] for any k ≥ 1, condition (30) is equivalent to (73). Condition (73) is in implicit form; however, one can show that it is satisfied for every 0 < β + ≤ 0.967. In order to find a larger β + which fulfills (73), one can use a bisection routine starting from 0.967, as we did in our numerical experiments and explain in the next subsection.
In order to see that for every 0 < β + ≤ 0.967 the inequality (73) always holds true, one can rewrite (73) equivalently as (74). Relation (74) is certainly fulfilled if a simpler one-dimensional inequality holds, and this is why we solve a more restrictive yet easier inequality ψ(β) ≤ 0 instead of (74). The derivative of ψ reads ψ ′ (β) = 24β 3 + 24β 2 − 4β − 8, and has exactly one root β̄ ∈ (0, 1). Since ψ ′ (0) = −8 < 0 and ψ ′ (1) = 36 > 0, we have that ψ is decreasing on (0, β̄) and increasing on (β̄, 1). Moreover, as ψ(0.967) = −0.00458574 < 0, ψ(0) = −3 < 0 and ψ(1) = 1 > 0, we can conclude that ψ(β) < 0 for every β ∈ [0, 0.967], which implies that (74) is fulfilled as a strict inequality for every β + ∈ [0, 0.967] as well. Since in the above approach we weakened (74) in order to simplify the computations, one cannot expect 0.967 to be the largest value for which this inequality is fulfilled. However, we will use 0.967 in our numerical experiments as the starting point for a bisection procedure aimed at finding larger values of β + which fulfill (74).

Example 7.
An interesting choice of the variable inertial parameters {β k } k≥1 in the context of the inertial projected gradient algorithm discussed in Example 6 is given by (75). Notice that, for σ ∶= 1, this is exactly the update rule of the celebrated Nesterov/FISTA algorithm. 21,22 This iterative scheme has attracted the interest of the optimization community and of many practitioners due to the fact that, in the convex setting, it improves the convergence rate of the sequence of objective function values over that of the standard non-inertial method. In the nonconvex setting, no theoretical results which establish an improvement in the convergence behaviour through this update rule have been obtained so far; however, some empirical studies suggest that this might be the case (see, for instance, Reference 29).
Since β + = sup k≥0 β k = σ, one can find σ such that (30) holds by solving (see (73)) the corresponding inequality. If one wants to choose larger values for σ, for instance to take σ close to 1, a restart mechanism can be incorporated into the scheme (75), as done, for example, in Reference 52.
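As a concrete illustration of the update rule discussed in this example, the following sketch computes the standard Nesterov/FISTA t-sequence scaled by a factor σ (`sigma` below). Since the exact formula (75) is not reproduced above, the rule β_k = σ (t_k − 1)/t_{k+1} with t_{k+1} = (1 + √(1 + 4 t_k²))/2 is an assumption based on the classical scheme.

```python
import numpy as np

def fista_inertial_parameters(sigma, n_steps):
    """Generate inertial parameters from the standard Nesterov/FISTA
    t-sequence, scaled by sigma (sigma = 1 recovers the classical rule):
        t_{k+1} = (1 + sqrt(1 + 4*t_k^2)) / 2,
        beta_k  = sigma * (t_k - 1) / t_{k+1}."""
    t = 1.0
    betas = []
    for _ in range(n_steps):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        betas.append(sigma * (t - 1.0) / t_next)
        t = t_next
    return betas

betas = fista_inertial_parameters(sigma=1.0, n_steps=1000)
```

The parameters start at 0 and increase monotonically toward σ, which is consistent with β⁺ = sup_k β_k = σ.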
Example 8. If we set, again in the context of the inertial projected gradient algorithm, constant inertial parameters, then it holds β + = β. This is the setting considered by László 51 for the inertial gradient algorithm, which is the scheme in Example 6 without the projection step. Our algorithm can thus be considered an extension of the one in Reference 51. To guarantee convergence, Reference 51 requires that the step size fulfills a condition in which L F denotes the Lipschitz constant of the gradient of the objective function. This condition excludes the case β = 1 and allows the step size 1∕L F when β = 1∕2. In our setting, we can combine larger values of β with the step size 1∕L F , namely, those for which (76) is fulfilled (see also the discussion at the end of Example 6).

Example 9.
Unlike the classical inertial algorithms for convex optimization problems and monotone inclusions, for which the inertial parameters are not allowed to take values greater than 1/3, the interplay between relaxation and inertia gives us much more freedom in the choice of the latter. We have seen that as long as β + satisfies (73) we can choose ρ = 1. For β + close to 1 such that (73) is not satisfied, the relaxation parameter has to be chosen according to (77). This applies also to the case when β k = 1 for any k ≥ 1, and thus β + = 1, for which Algorithm 1 takes a correspondingly simplified form. As we will see in the numerical results, the strategy of choosing β + close to 1 and ρ according to (77) yields the best performance of the algorithm.

Numerical experiments
The aim of the numerical experiments presented in this subsection is twofold: to compare the performance of our algorithm with that of other numerical methods for the nonnegative factorization of completely positive matrices, such as ModMAP and SpFeasDC from References 3 and 19, respectively, and to show in which way and to what extent the algorithm parameters influence this performance.
Particular attention will be paid to the nonnegative factorization of matrices not belonging to the interior of  n , for which the algorithms in References 3,19 perform rather poorly.

Number of runs and starting points.
In every numerical experiment, for A ∈ R n×n with n < 100, we run Algorithm 1 100 times for 100 randomly chosen initial matrices in  (for instance, by choosing a random matrix in R n×r and then applying the projection formula (21)), and we likewise run the algorithms ModMAP and SpFeasDC 100 times for 100 randomly chosen initial matrices in  r (for instance, by choosing a random matrix in R r×r and computing an SVD decomposition); if n ≥ 100, then we do this 10 times for each of the algorithms. As noticed in Section 2.3, the algorithms ModMAP and SpFeasDC require, in addition, a matrix B, which we compute by the Cholesky decomposition. If the Cholesky decomposition fails, then we use the eigenvalue decomposition instead. Here we follow the approach described in Reference 3, Section 3; see also Reference 19, Section 6.

Parameter choice.
We choose the constant β + , which then determines the sequence of inertial parameters {β k } k≥1 , with two different aims: • by running a simple bisection routine which starts at 0.967 in order to find greater values for β + that satisfy (73). Instead of the midpoint rule, we use as update rule β + ∶= (3β + + 1)∕4 for the bisection routine, which seemingly gives better results. We then choose β + ∶= β̂ + , the last value at which (73) holds. As seen in the previous subsection, as long as (73) is fulfilled, we can and do choose ρ = 1.
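The modified bisection routine described above can be sketched as follows. The predicate `condition`, standing in for inequality (73), is a hypothetical placeholder, and the tolerance-based stopping rule is an assumption added for the sketch.

```python
def find_inertial_parameter(condition, beta0=0.967, tol=1e-6, max_iter=200):
    """Search for a larger admissible inertial parameter beta_plus,
    starting from 0.967 (for which condition (73) is known to hold) and
    pushing toward 1 with the update beta := (3*beta + 1)/4 instead of
    the plain midpoint rule.  Returns the last value at which the
    condition still held."""
    beta = beta0
    last_valid = beta0
    for _ in range(max_iter):
        candidate = (3.0 * beta + 1.0) / 4.0   # move 3/4 of the way to 1
        if condition(candidate):
            last_valid = candidate
            beta = candidate
        else:
            break
        if 1.0 - beta < tol:
            break
    return last_valid

# Illustrative predicate: suppose (73) held exactly for beta <= 0.999.
beta_plus = find_inertial_parameter(lambda b: b <= 0.999)
```

Each accepted step moves three quarters of the remaining distance to 1, so the routine approaches the boundary of admissibility much faster than plain halving.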

Stopping criteria.
For A ∈ R n×n , we run each of the algorithms for at most 10,000 iterations if n < 100 and 50,000 otherwise. We count the algorithms ModMAP and SpFeasDC as a "success" if the stopping criterion is reached before the maximal number of iterations is attained. This is nothing else than the stopping criterion used in References 3,19. For SpFeasDC, we set Tol fea ∶= 10 −16 if the matrix A belongs to int ( n ), and Tol fea ∶= 10 −7 otherwise. For ModMAP we take as threshold 10 × Tol fea . For all instances of Algorithm 1 we use as stopping criterion the relative error condition on the objective values. Here too, we set Tol val ∶= 10 −16 if A belongs to int ( n ), and Tol val ∶= 10 −7 otherwise.
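A relative error condition on the objective values could look as follows. The exact criterion used in the paper is not reproduced above, so the scaling by max(1, |f_curr|) is an assumed, common variant.

```python
def relative_error_stop(f_prev, f_curr, tol):
    """Assumed form of a relative-error stopping condition on successive
    objective values:
        |f_curr - f_prev| <= tol * max(1, |f_curr|).
    The max(1, .) guard avoids division-like blow-ups when the objective
    value approaches zero."""
    return abs(f_curr - f_prev) <= tol * max(1.0, abs(f_curr))
```

With Tol val = 1e-7, the iteration stops once consecutive objective values agree to roughly seven digits (relative to the magnitude of the current value).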

Tables.
In the tables with numerical results, we report the (rounded) success rate over the total number of trials (Rate), the average CPU time in seconds for both successful (Time (s)) and failed (Time (f)) trials, and the average number of iterations (Iter.) needed to reach the stopping criterion in the successful trials. We use boldface to highlight the best results among all methods with a success rate of 1.

Plots.
In Figures 1, 2 and 9-12 we plot, for some particular instances of the matrix A, the sequences of function values { (Z k ) −  min } k≥2 and of distances { 1 2 ‖X k − X sol ‖ 2 F } k≥0 on a logarithmic scale, where  min denotes the smallest objective function value over all methods and X sol is the last iterate X k of each method. With these plots we want to emphasize that the sequences of both function values and iterates exhibit linear rates of convergence.
In Figures 3-8 we show to what extent the performance of the algorithms is sensitive to the number of columns r of the factor.
We summarize here the different variants of Algorithm 1 with corresponding parameter choices we will use in the numerical experiments:

F I G U R E 8 Comparison of the number of iterates required by IPG-sFISTA and RIPG-sFISTA for the factorization of a randomly generated matrix A ∈  100 for different values of r and random initial points

4. IPG-sFISTA: the inertial projected gradient algorithm formulated in Example 7 (for ρ = 1) with inertial parameters fulfilling (75) for σ ∶= β̂ + ;
5. IPG-mod: the modification of Nesterov's scheme from Reference 51 discussed in Example 8 with β ∶= β̂ + and step size 1∕L F . This setting goes beyond the one in which convergence was proved in Reference 51, but it is within the one for which our convergence result holds.
6. RIPG-const, RIPG-sFISTA and RIPG-mod: the relaxed versions of IPG-const, IPG-sFISTA and IPG-mod, respectively, for different values of β + that violate (73), as in Example 9, and with corresponding relaxation parameters satisfying (77).
Numerical experiment 1. In our first experiment, we use randomly generated completely positive matrices as in Reference 3, Section 7.8. Precisely, in each test we generate a random n × 2n matrix B 0 and then set A ∶= |B 0 | |B 0 | T ; here the absolute value operator |⋅| is understood entrywise. We test the algorithms on 50 randomly generated 40 × 40 matrices, 10 randomly generated 100 × 100 matrices, and 10 randomly generated 500 × 500 matrices, all via the approach described above. For the nonnegative factorization we choose in each case r ∶= 1.5n + 1 and r ∶= 3n + 1, which have also been used in Reference 19. Notice that the choice r ∶= 1.5n + 1 may not be sufficient, since the test instances are generated using matrices of order n × 2n. Nevertheless, we obtain satisfactory results for these choices of r. The performances of the different numerical methods on the different instances are captured in Tables 1-6. One can notice that SpFeasDC outperforms the other methods with respect to the number of iterations, which is due to the fact that SpFeasDC uses a linesearch routine to improve the step size, while the other methods have quite conservative step size rules. However, some of the instances of Algorithm 1 can compete with SpFeasDC in terms of computational time. This is due to the fact that the latter runs an SVD routine in every iteration, which is much more time-consuming than the simple projection step made by Algorithm 1. In particular, with growing dimension our algorithm becomes faster than SpFeasDC.
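The construction of the random test matrices A := |B₀| |B₀|ᵀ described above can be sketched directly; the random seed and generator choice are illustrative.

```python
import numpy as np

def random_cp_matrix(n, rng=None):
    """Generate a random completely positive test matrix as in the
    experiment: draw a random n x 2n matrix B0 and set
        A := |B0| |B0|^T,
    with the absolute value taken entrywise.  By construction A = C C^T
    with C = |B0| >= 0, so A is completely positive with an explicit
    factorization of width 2n."""
    rng = np.random.default_rng(rng)
    B0 = rng.standard_normal((n, 2 * n))
    C = np.abs(B0)
    return C @ C.T

A = random_cp_matrix(40, rng=0)
```

The factorization algorithms are then asked to recover a nonnegative factor of width r from A alone, without access to the generating matrix C.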
Numerical experiment 2. In the second numerical experiment we examine the efficiency of the factorization algorithms when the number of columns r of the factor varies. We first consider the factorization of a matrix A n of a special form, and recall that I n and j n denote the n × n identity matrix and the all-ones vector in R n , respectively. It has been observed in Reference 3 that for such matrices the rate of success of ModMAP, when applied with random initial points, increases with higher values of r. We notice that A n ∈ int ( n ) for any n ≥ 2 (see Reference 53).

TA B L E 3
The nonnegative factorization of random completely positive matrices for n = 100 and r = 151. A similar behavior can be noticed for various instances of Algorithm 1, as can be seen in Figure 3. There we plot the rate of success with which the algorithms ModMAP, PG, IPG-sFISTA with σ = β̂ + , and RIPG-sFISTA with (β + , ρ) = (β̂ 3 , ρ(β̂ 3 )) provided a factorization of the matrix A 40 for r ∈ {40, 51, 61, 71, 81, 101, 121} and 100 random initial points. The number of required iterations is plotted in Figure 4. For the other variants of Algorithm 1 similar results can be reported.
We repeat the experiment for a completely positive matrix A ∶= |B 0 | |B 0 | T constructed, as in the previous numerical experiment, from a randomly generated 100 × 200 matrix B 0 , for random initial points and various r ∈ {151, 176, 201, 251, 301}. As can be seen in Figure 5, the rate of success for the different variants of Algorithm 1 increases with higher values of r; however, this is no longer the case for ModMAP. Moreover, according to Figure 4, the number of iterations required by the former to provide a factorization is more stable than the one required by ModMAP. With Figures 6 and 8 we want to underline that, while IPG-sFISTA requires fewer iterations than RIPG-sFISTA for all instances of r when factorizing A 40 given by (80), the situation is reversed when it comes to the factorization of the randomly generated matrix A ∈  100 .
Numerical experiment 3. In this numerical experiment we address the factorization of a matrix from the literature which is known to be completely positive with cpr(A) = rank(A) = 3 (see Reference 54, Theorem 2.5). However, at the time this matrix was introduced no explicit completely positive factorization of it was known. A completely positive factorization of this matrix with two zero columns was obtained for the first time in Reference 3. Notice that the matrix X 1 is obtained with IPG-const, X 2 with IPG-sFISTA, and X 3 with RIPG-mod.

Numerical experiment 4. In this numerical experiment, we consider the perturbed matrix A ε . Both A and A ε belong to  5 for all ε ∈ [0, 1]. Furthermore, A ε ∈ int ( 5 ) whenever 0 ≤ ε < 1, since P = [ j 5 |I 5 ] [ j 5 |I 5 ] T ∈ int ( 5 ), while A ∈  5 ⧵ int ( 5 ). It has been observed in References 3,19 that it is much more difficult to perform a nonnegative factorization of A than of A ε when ε < 1. In particular, the rate of success for ModMAP and SpFeasDC decreases to zero as ε increases to 1, that is, as A ε becomes nearly identical to A. For this experiment, we set, according to the lower bounds for the cp-rank derived in Reference 5, r ∶= 11 for ε ∶= 1 and r ∶= 12 otherwise. We present in Tables 7 and 8 the numerical performances of the algorithms applied to the nonnegative factorization of the matrices A 0.99 and A 1.00 = A, respectively. One can see that both ModMAP and SpFeasDC practically fail to factorize the two matrices, a fact which was already noticed in References 3,19. As for the inertial methods IPG-const, IPG-sFISTA, and IPG-mod, they also seem to face some difficulties in factorizing these matrices, as the rate of success is not equal to 1 for every initial matrix. On the other hand, the methods RIPG-sFISTA and RIPG-mod, which combine inertial and relaxation parameters, always return nonnegative factorizations for β + taken equal to β̂ 3 and ρ chosen accordingly.
This emphasizes the importance of the interplay between the inertial and relaxation parameters, as mentioned in Example 9, and provides a strong motivation for the approach proposed in this paper.
Numerical experiment 5. Let I n and J n denote the identity matrix and the all-ones matrix in R n×n , respectively, and define the corresponding family of test matrices. This family of matrices, which lie on the boundary of  2n , has also been considered in Reference 3. The authors report that the algorithms they propose fail to factorize matrices in this family, which is also the case for SpFeasDC, as we have seen in our experiments. We exemplify this in Table 9 for n = 15 and r = 30. On the other hand, the methods RIPG-sFISTA and RIPG-mod, which combine inertial and relaxation parameters, provide a factorization in reasonable time, as is also the case for n = 50 and r = 100, on which we report in Table 10. It is also interesting to notice that, for this family of matrices, FISTA outperforms all the other methods, despite the fact that the parameter choice for this method does not fall into the setting for which convergence was proved.

FURTHER PERSPECTIVES
Numerical evidence (see Figures 1, 2, and 9-12) suggests that the convergence rates of our method are linear, which in turn suggests that the Łojasiewicz exponent of the function Ψ is at most 1/2. Even though the Łojasiewicz exponent has played an important role in the derivation of many convergence rate results, too little is known about the calculation of its exact values for functions with complex structure. Some calculus rules for the Łojasiewicz exponent have been provided in References 55 and 56 for some simple models; however, it is not yet clear how to calculate it for Ψ . This is an interesting topic of future research.

TA B L E 10 The nonnegative factorization of A 100 given by (84) for r = 100

F I G U R E 9 The sequence  (Z k ) −  min for the factorization of A 0.99 given by (82) and (83)

F I G U R E 10 The sequence 1 2 ‖X k − X sol ‖ 2 F for the factorization of A 0.99 given by (82) and (83)
The empirical evidence on the benefit of line search techniques raises the interesting question of studying the theoretical convergence guarantees of the iterates generated by Algorithm 1 enhanced with such a procedure. Another topic of further research is the extension of the convergence analysis beyond the current setting, in order to cover the parameter choice of the FISTA method, which, in numerical experiments 4 and 5, proves to have excellent numerical performance.

F I G U R E 11
The sequence  (Z k ) −  min for the factorization of A 1 = A given by (82) and (83)

F I G U R E 12
The sequence 1 2 ‖X k − X sol ‖ 2 F for the factorization of A 1 = A given by (82) and (83)

Further, one can replace in the formulation of the optimization problem (P mod ) the closed ball of radius √ trace (A) by the sphere of the same radius, formulate a projected gradient algorithm with relaxation and inertial parameters (by using the formula for the projection onto the intersection of a cone and a sphere from Reference 47), determine a parameter setting for which convergence can be guaranteed and convergence rates can be derived (in the spirit of the analysis for inertial proximal gradient algorithms in the fully nonconvex setting from Reference 26), and, of course, investigate its numerical performance.