The proximal map of the weighted mean absolute error

We investigate the proximal map for the weighted mean absolute error function. An algorithm for its efficient and vectorized evaluation is presented. As a demonstration, this algorithm is applied to a nonsmooth energy minimization problem.


Introduction
The proximity operator, or proximal map, plays a fundamental role in non-smooth optimization; see for instance Chambolle, Pock; Combettes, Pesquet; Parikh, Boyd. Given a function f : Rⁿ → R ∪ {+∞}, the proximal map prox_{γf} : Rⁿ → Rⁿ is defined as the solution of the problem

    Minimize  γ f(x) + (1/2) ‖x − q‖²  over x ∈ Rⁿ,  where q ∈ Rⁿ.

Under the mild condition that f is proper, lower semicontinuous and convex, prox_{γf} is well-defined. We refer the reader to Bauschke, Combettes for details and further properties.
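To make the definition concrete, here is a small numerical sanity check (an illustrative sketch; the function names are ours, not the paper's): for f(x) = x² in one dimension, the proximal objective γx² + (1/2)(x − q)² is smooth, and setting its derivative 2γx + (x − q) to zero gives the closed form prox_{γf}(q) = q/(1 + 2γ), which we verify against a brute-force grid minimization.

```python
import numpy as np

def prox_quadratic(q, gamma):
    # closed-form proximal point of f(x) = x^2: solve 2*gamma*x + (x - q) = 0
    return q / (1.0 + 2.0 * gamma)

def prox_by_grid(q, gamma, xs=np.linspace(-10.0, 10.0, 200001)):
    # brute-force argmin of the proximal objective gamma*f(x) + 0.5*(x - q)^2
    phi = gamma * xs**2 + 0.5 * (xs - q) ** 2
    return xs[np.argmin(phi)]
```

For instance, with γ = 1/2 and q = 3 both routines agree on the value 3/2 up to the grid resolution.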
In this paper, we are interested in the proximal map of the function

    f(x) = Σ_{i=1}^{n} w_i |x − d_i|,  x ∈ R.

Here w_i > 0 are given positive weights and d_i ∈ R are given data points, i = 1, …, n, for some n ∈ N. We refer to f as the weighted mean absolute error. Any of its minimizers is known as a weighted median of the data {d_i}. Clearly, f is proper, continuous and convex, and so is γf for any γ > 0.
By definition, the proximal map prox_{γf} : R → R for f as above is given by

    prox_{γf}(q) = argmin_x { γ Σ_{i=1}^{n} w_i |x − d_i| + (1/2)(x − q)² }

with γ > 0 and q ∈ R. In the special case n = 1, w_1 = 1 and d_1 = 0, the unique solution is explicitly given in terms of the soft-thresholding operator

    S_λ(q) = max{0, |q| − λ} sgn(q),

namely prox_{γ|·|}(q) = S_γ(q). This map arises in many iterative schemes for the solution of problems involving the 1-norm; see for instance Daubechies, Defrise, De Mol; Goldstein, Osher. We can therefore view the proximal map of the weighted mean absolute error as a multi-thresholding operation.
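The soft-thresholding operator and its shifted variant can be sketched in a few lines of NumPy (an illustrative sketch; the function names are ours). The shifted case follows from the substitution y = x − d in the proximal objective of x ↦ γ|x − d|.

```python
import numpy as np

def soft_threshold(q, lam):
    # S_lam(q) = max{0, |q| - lam} * sgn(q), vectorized over q
    q = np.asarray(q, dtype=float)
    return np.sign(q) * np.maximum(np.abs(q) - lam, 0.0)

def prox_shifted_abs(q, d, gamma):
    # proximal point of x -> gamma*|x - d| via the substitution y = x - d
    return d + soft_threshold(q - d, gamma)
```

For example, S_1(3) = 2, S_1(−3) = −2 and S_1(0.5) = 0, and the proximal point of γ|x − 2| with γ = 1 at q = 5 is 4.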
We wish to point out that our problem of interest is different from the LASSO problem

    Minimize  (1/2) ‖Ax − b‖² + λ ‖x‖₁  over x ∈ Rⁿ;

see Tibshirani; Chen, Donoho, Saunders. In the latter, x is multi-dimensional and the deviation of its image under a linear map A from a data vector b is measured in the 2-norm. By contrast, we measure the deviation of a scalar x from multiple data points d_i. Moreover, the roles of the 1-norm and the 2-norm are reversed between the two problems.
We also demonstrate the utility of our implementation by solving, similarly as in Li, Osher, an image denoising problem using a block coordinate descent (checkerboard) algorithm. In order to overcome the generic failure of convergence of such a method to the unique minimizer, we combine it with restarts based on the steepest descent direction. The emphasis in this paper, however, is on the efficient evaluation of the proximal map. This paper is structured as follows: We establish an algorithm for the evaluation of the proximal map of the weighted mean absolute error in the following section and prove its correctness. We then briefly discuss the structural properties of the proximal map. We conclude by showing two applications of the proposed algorithm. The first application is an image denoising problem using the ROF model of Rudin, Osher, Fatemi. Finally, we address a non-smooth energy minimization problem.

An Algorithm for the Evaluation of the Proximal Map
In this section, we derive an efficient algorithm for the evaluation of the proximal map prox_{γf}(q) and prove its correctness. To this end, we assume that the points d_i have been sorted and that duplicates have been removed, with their weights added. As a result, we can assume d_1 < d_2 < … < d_n. Moreover, we assume γ > 0, n ≥ 1 and w_i > 0 for all i = 1, …, n. Summands with w_i = 0 can obviously be dropped from the sum.
We divide the real line into the intervals

    I_i = (d_{i−1}, d_i),  i = 1, …, n + 1,

where we extend the definitions by setting d_0 = −∞ and d_{n+1} = +∞. Using this notation, we can write the derivative of f as

    f′(x) = Σ_{j<i} w_j − Σ_{j≥i} w_j  for x ∈ int I_i.

This formula reflects the fact that f is piecewise linear and convex, since f′ is monotone increasing with i. Moreover, f′(x) = −Σ_{j=1}^{n} w_j < 0 holds for all x ∈ int I_1 (points to the left of the smallest data point d_1), and f′(x) = Σ_{j=1}^{n} w_j > 0 holds for all x ∈ int I_{n+1} (points to the right of the largest data point d_n). At x = d_i, i = 1, …, n, f is non-differentiable but we can specify its subdifferential, which is

    ∂f(d_i) = [ Σ_{j<i} w_j − Σ_{j≥i} w_j ,  Σ_{j≤i} w_j − Σ_{j>i} w_j ].

The objective

    Φ(x) = γ f(x) + (1/2)(x − q)²

of the proximal problem is piecewise quadratic and strongly convex. Its derivative is thus strongly monotone and it satisfies Φ′(x) < 0 for all x < min{d_1, q} and Φ′(x) > 0 for all x > max{d_n, q}. Consequently, the unique minimizer of Φ lies between these bounds.
The idea for finding the unique minimizer x* of the proximal problem is to locate the smallest index 1 ≤ i ≤ n + 1 such that x* ≤ d_i holds, i. e., the nearest data point to the right of x*. In other words, we need to find the smallest index i such that

    lim_{x ↘ d_i} Φ′(x) ≥ 0

holds.

Figure: Visualization of the subdifferential ∂Φ of the objective Φ. In the first (upper) case, x* = d_i holds. In the second (lower) case, we have x* ∈ int I_i.

Now we can distinguish two cases: x* = d_i and x* < d_i. The first case applies if and only if

    lim_{x ↗ d_i} Φ′(x) ≤ 0 ≤ lim_{x ↘ d_i} Φ′(x),

i. e., 0 ∈ ∂Φ(d_i). Otherwise, x* lies in int I_i and thus it is the unique minimizer

    x* = q − γ ( Σ_{j<i} w_j − Σ_{j≥i} w_j )

of the locally quadratic objective Φ. In either case, once the index 1 ≤ i ≤ n + 1 has been identified, x* is readily given. Both cases are also depicted in the figure.
The considerations above lead to the algorithm below. In our implementation, we evaluate the one-sided derivative limits of Φ for all indices i simultaneously and benefit from these quantities being monotone increasing with i when finding the first non-negative entry.
Algorithm: Evaluation of the proximal map of the weighted mean absolute error.
Proof. Let 1 ≤ i ≤ n + 1 be the index found in the search step of the algorithm. First suppose 2 ≤ i ≤ n. Then d_{i−1} and d_i are both finite, and the formulas above imply

    max ∂Φ(d_{i−1}) = lim_{x ↘ d_{i−1}} Φ′(x) < 0 ≤ lim_{x ↘ d_i} Φ′(x) = max ∂Φ(d_i).

Owing to the properties of the subdifferential of strongly convex functions, there exists a unique point x* ∈ (d_{i−1}, d_i] such that 0 ∈ ∂Φ(x*), i. e., x* is the unique minimizer of the proximal problem. This point either belongs to int I_i, or else x* = d_i holds. In the first case, Φ is differentiable at x*, so that Φ′(x*) = 0 holds, yielding

    x* = q − γ ( Σ_{j<i} w_j − Σ_{j≥i} w_j ).

Otherwise, we have x* = d_i, and 0 ∈ ∂Φ(d_i) implies

    lim_{x ↗ d_i} Φ′(x) ≤ 0 ≤ lim_{x ↘ d_i} Φ′(x).

In either case, the unique solution x* of the proximal problem is determined by the quantity returned by the algorithm.
It remains to verify the marginal cases i = 1 and i = n + 1. In case i = n + 1, we have lim_{x ↘ d_n} Φ′(x) < 0 due to the minimality of i. Hence,

    x* = q − γ Σ_{j=1}^{n} w_j ∈ int I_{n+1}.

A similar reasoning applies in the case i = 1.
We provide an efficient and vectorized Python implementation of the algorithm in Baumgärtner et al. It allows the simultaneous evaluation of the proximal map for multiple values of q, provided that each instance has the same number n of data points. The weights w_i and data points d_i as well as the prox parameter γ may vary between instances. The discussion so far assumed positive weights for simplicity, but the case w_i = 0 is a simple extension and it is allowed in our implementation. This is convenient in order to simultaneously solve problem instances which differ with respect to the number of data points n. In this case, we can easily pad all instances to the same number of data points using zero weights. In addition, data points are allowed to be duplicated, i. e., we only require d_1 ≤ d_2 ≤ … ≤ d_n and n ≥ 1. Notice that since the data points are assumed to be sorted, finding the index i by bisection is of complexity O(log n).
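The following self-contained NumPy sketch implements the scalar case distinction above for a single instance (the implementation in Baumgärtner et al. is vectorized over many instances; all names here are ours). It assumes strictly increasing data points and positive weights.

```python
import numpy as np

def prox_wmae(q, d, w, gamma):
    """Proximal point of f(x) = sum_i w[i]*|x - d[i]| at q (single instance).

    Assumes d strictly increasing and w > 0.
    """
    d = np.asarray(d, dtype=float)
    w = np.asarray(w, dtype=float)
    W = w.sum()
    # s[i] = sum_{j<i} w_j - sum_{j>=i} w_j: slope of f on the i-th interval
    # (d[i-1], d[i]), with the conventions d[-1] = -inf and d[n] = +inf
    s = 2.0 * np.concatenate(([0.0], np.cumsum(w))) - W
    # one-sided derivative of Phi just to the right of each data point d[i];
    # this sequence is increasing, so a binary search locates the index
    r = gamma * s[1:] + d - q
    idx = np.searchsorted(r, 0.0, side="left")   # first index with r >= 0
    if idx == len(d):
        return q - gamma * W          # minimizer lies right of the largest d[i]
    left = gamma * s[idx] + d[idx] - q  # derivative just to the left of d[idx]
    if left <= 0.0:
        return d[idx]                  # 0 in the subdifferential: kink optimal
    return q - gamma * s[idx]          # interior minimizer of quadratic piece
```

For n = 1 this reproduces soft thresholding, and for large γ the result approaches a weighted median of the data.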

Structure of prox_{γf}
In this section, we briefly discuss the structure of the map q ↦ prox_{γf}(q). Since it generalizes the soft-thresholding operation, it is not surprising that we obtain a graph which features a staircase pattern. An illustrative plot for certain choices of weights w_i, data points d_i, and prox parameter γ > 0 is shown in the accompanying figure. Two alternating regimes occur for x* = prox_{γf}(q) as q ranges over R. First, when x* ∈ int I_i holds, then x* = q − γ(Σ_{j<i} w_j − Σ_{j≥i} w_j) is an affine function of q with slope 1. This is the case for q whose associated index i is constant, i. e., for

    q ∈ ( d_{i−1} + γ (Σ_{j<i} w_j − Σ_{j≥i} w_j),  d_i + γ (Σ_{j<i} w_j − Σ_{j≥i} w_j) ).

As q increases beyond the upper bound of the interval above, x* enters a constant regime x* = d_i, which applies to

    q ∈ [ d_i + γ (Σ_{j<i} w_j − Σ_{j≥i} w_j),  d_i + γ (Σ_{j≤i} w_j − Σ_{j>i} w_j) ].

Notice that the case n = 1 reduces to the soft-thresholding map with only one plateau.
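The staircase structure can be observed numerically by brute-force minimization of the proximal objective on a fine grid. The data below are our own illustrative choice (d = (0, 2), w = (1, 1), γ = 1/2), for which plateaus occur at x* = 0 and x* = 2 and the slope-1 regime lies in between and outside.

```python
import numpy as np

# illustrative instance (our own choice of data, not taken from the paper)
d = np.array([0.0, 2.0])
w = np.array([1.0, 1.0])
gamma = 0.5
xs = np.linspace(-3.0, 5.0, 160001)   # fine grid covering the minimizer range

def prox_grid(q):
    # brute-force argmin of Phi(x) = gamma*f(x) + 0.5*(x - q)^2 on the grid
    phi = gamma * (w * np.abs(xs[:, None] - d)).sum(axis=1) + 0.5 * (xs - q) ** 2
    return xs[np.argmin(phi)]
```

Evaluating prox_grid on a range of q values traces out the staircase: q = −0.5 and q = 2.5 land on the plateaus at 0 and 2, while q = 1 and q = 4 lie in the affine regimes.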

Application To Image Denoising
In this section, we present the application of the algorithm to a classical total-variation (ROF) image denoising problem going back to Rudin, Osher, Fatemi. Given noisy image data f of size M × N, we seek an image u of the same dimension which solves

    Minimize  H(u) = (1/2) ‖u − f‖² + γ Σ_{i,j} ( |u_{i+1,j} − u_{i,j}| + |u_{i,j+1} − u_{i,j}| )

(here in its discrete, anisotropic form). Well-known solution approaches include the primal-dual hybrid gradient method of Chambolle, Pock and the split Bregman iteration of Goldstein, Osher. The latter requires the solution of a Laplacian problem for u in each iteration. A simpler approach, considered in Li, Osher, is to partition the unknowns into two disjoint subsets according to a checkerboard pattern. Fixing one subset, the problem in the remaining unknowns decouples into independent scalar problems, each of which is a proximal problem of the weighted mean absolute error type with data points given by the (fixed) neighboring pixel values, weights w_i = γ and q = f_{i,j}, and each can be solved efficiently and in parallel using our algorithm. We give further implementation details below.
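The checkerboard partition and the zero-weight padding at the image boundary can be sketched as follows (a sketch with our own names; each pixel collects its four neighbor values as data points, and missing neighbors at the boundary receive weight zero).

```python
import numpy as np

def checkerboard_data(u):
    """Checkerboard masks plus the four neighbor values/weights per pixel."""
    M, N = u.shape
    ii, jj = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    white = (ii + jj) % 2 == 0           # the two interleaved pixel subsets
    up = np.pad(u, 1, mode="edge")       # replicate-pad so indexing is uniform
    nbrs = np.stack([up[:-2, 1:-1],      # north neighbor of each pixel
                     up[2:, 1:-1],       # south
                     up[1:-1, :-2],      # west
                     up[1:-1, 2:]])      # east
    wts = np.ones((4, M, N))             # weight 0 marks a missing neighbor
    wts[0, 0, :] = 0.0                   # no north neighbor in the first row
    wts[1, -1, :] = 0.0                  # no south neighbor in the last row
    wts[2, :, 0] = 0.0                   # no west neighbor in the first column
    wts[3, :, -1] = 0.0                  # no east neighbor in the last column
    return white, nbrs, wts
```

A half-sweep then updates all "white" pixels at once by evaluating the weighted mean absolute error prox with these per-pixel data and weights, and the roles are swapped for the "black" pixels.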
Alternating over both subsets of unknowns, one obtains a block-coordinate descent method as proposed in Li, Osher. Unfortunately, such a method does not necessarily converge towards the global minimizer for non-smooth objectives; see for instance Friedman et al. Therefore, Li, Osher proposed to restart the block-coordinate descent algorithm using a random perturbation of the final iterate upon stagnation.
We depart from this restarting strategy in the following way. Upon stagnation of the block-coordinate descent method, we evaluate the steepest descent direction s by orthogonally projecting (w.r.t. the Euclidean norm ‖·‖) the zero vector onto the subdifferential of the objective H; that is, −s is the minimum-norm element of ∂H at the current iterate. This projection problem is tractable due to the structure of the subdifferential of the absolute value function. Notice that the checkerboard sweeps require the evaluation of the proximal map for many arguments in parallel, where we benefit from our vectorized implementation. All norms for matrices are Frobenius norms. The line search along s is implemented using a backtracking strategy, starting with an initial step size t which is then halved until the condition H(u + t s) < H(u) is met.
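For a separable ℓ1 term the projection of zero onto the subdifferential has a closed form, which we sketch below for intuition (our own simplified analogue; the ROF subdifferential above is coupled through the difference operator, so the actual projection in our method requires solving a quadratic program).

```python
import numpy as np

def min_norm_subgradient(grad_g, x, lam):
    """Minimum-norm element of the subdifferential of h(x) = g(x) + lam*||x||_1.

    grad_g is the gradient of the smooth part g at x. For x_i != 0 the
    subdifferential component is the singleton grad_g_i + lam*sign(x_i);
    for x_i == 0 it is the interval grad_g_i + [-lam, lam], whose
    minimum-norm element is the soft-thresholded gradient.
    """
    g = grad_g + lam * np.sign(x)
    zero = x == 0
    g[zero] = np.sign(grad_g[zero]) * np.maximum(np.abs(grad_g[zero]) - lam, 0.0)
    return g
```

The steepest descent direction is then the negative of this vector; it vanishes exactly at a minimizer.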
We wish to emphasize that we do not propose the restarted block-coordinate descent method as a novel solver for image denoising problems. We rather consider it here as a source of problems of the weighted mean absolute error type, which can be solved efficiently by our proposed algorithm. In practice, the block-coordinate descent method can also be used effectively as a preliminary solver stage before one switches to, e. g., the split Bregman or Chambolle–Pock iteration.
For verification purposes, we show the outcome of the block-coordinate descent algorithm. As a test case, we choose the well-known cameraman image. We add zero-mean Gaussian noise independently pixel by pixel and then truncate the values to the admissible range to obtain the noisy image shown in the figure. We apply the algorithm to the image denoising problem with an inner tolerance TOL_inner. We observe that even for a coarse outer tolerance TOL_outer, a good reconstruction is obtained; see the figure. This tolerance is reached after a finite number of iterations, split between sweeps of the block-coordinate descent loop and executions of the restart steps. In particular, the subdifferential projection step, which amounts to solving a quadratic optimization problem, is carried out only upon stagnation. For comparison, we include an "exact" solution obtained by a split Bregman iteration with very tight tolerances. Each checkerboard sweep evaluates the proximity operator in parallel for half the number of pixels. Using padding with zero weights for instances pertaining to pixels on the boundary, all instances have n = 4 data points d_i (the current values of the neighbors to the north, south, east and west) and weights w_i ∈ {0, γ}. Timing results are reported in the table. They were obtained on a laptop with a multi-core Intel Core i CPU running Ubuntu and Python.

A Deflected Elastic Membrane Problem
In this section, we consider the minimization of the non-smooth energy J of a deflected membrane.
We apply an ADMM (alternating direction method of multipliers) scheme which requires the parallel evaluation of the proximal map of the weighted mean absolute error. Consider a bounded domain Ω ⊂ R² occupied by a thin membrane. In our model, the energy of this membrane is given in terms of its unknown deflection (displacement) function u : Ω → R and it takes the following form:

    J(u) = (k/2) ∫_Ω |∇u|² dx − ∫_Ω F u dx + Σ_{i=1}^{m} c_i ∫_Ω max{u − ψ_i, 0} dx + (β/2) ∫_Γ u² ds.

The first term describes the potential energy of the membrane, where k ∈ R is the stiffness constant. The second term accounts for an external area force density F, which we assume to be a non-negative function on Ω. Consequently, the deflection will be non-negative as well. The specialty of our model is the third term, which can be interpreted as follows. The non-negative constants ψ_1, …, ψ_m serve as thresholds. Once the deflection u at a point in Ω exceeds any of these thresholds ψ_i, an additional downward force of size c_i > 0 activates. Finally, the fourth term models an additional potential energy due to springs along the boundary Γ of the domain with stiffness constant β.
A related problem has been studied in Bostan, Han, Reddy with an emphasis on a posteriori error analysis and adaptive solution.
Notice that the minimization of the convex, non-smooth energy J among all displacements u ∈ H¹(Ω) corresponds to the weak form of a partial differential equation (PDE), or rather a variational inequality. In fact, the necessary and sufficient optimality condition for the minimization of J is a variational inequality involving the subdifferential of the terms ∫_Ω max{u − ψ_i, 0} dx. Let χ_i : Ω → R denote the characteristic function of the set {x ∈ Ω | u(x) > ψ_i}, with values in {0, 1}. Provided that each set {x ∈ Ω | u(x) = ψ_i} is of Lebesgue measure zero, the variational inequality becomes an equation,

    ∫_Ω k ∇u · ∇v dx + Σ_{i=1}^{m} c_i ∫_Ω χ_i v dx + ∫_Γ β u v ds = ∫_Ω F v dx  for all v ∈ H¹(Ω),

whose strong form (using integration by parts) can be seen to be

    −k Δu + Σ_{i=1}^{m} c_i χ_i = F in Ω,  k ∂u/∂n + β u = 0 on Γ.

From here we also learn that the boundary term in the energy leads to boundary conditions of Robin type. (We could also allow ψ_1, …, ψ_m to be non-negative functions on Ω, with minor modifications in what follows.)
We employ a standard Galerkin approach to numerically discretize the energy J. To this end, we replace the displacement space H¹(Ω) by a finite dimensional subspace of piecewise linear, globally continuous finite element functions defined over a triangulation of Ω. Denoting the associated nodal basis by {φ_j}, we define the stiffness matrix K (which includes the Robin boundary contribution) as well as the lumped mass matrix M, which satisfies M_{jk} = 0 when j ≠ k. The reason for using mass lumping to approximate all area integrals in the energy which do not involve derivatives is to translate the pointwise maximum operator into a coefficientwise one. This technique is crucial for an efficient numerical realization and has been used before, e. g., in Wachsmuth, Wachsmuth; Casas, Herzog, Wachsmuth.
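Row-sum lumping can be illustrated on a one-dimensional P1 mesh (the paper uses triangles in two dimensions, but the lumping principle, replacing the consistent mass matrix by the diagonal matrix of its row sums, is the same; names are ours). With a diagonal mass matrix, a lumped integral of max{u − ψ, 0} reduces to a weighted sum of coefficientwise maxima.

```python
import numpy as np

def lumped_mass_1d(n, h):
    """Consistent and row-sum-lumped P1 mass matrices on a uniform 1-D mesh.

    n elements of length h, n + 1 nodes.
    """
    M = np.zeros((n + 1, n + 1))
    Me = (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])  # element mass matrix
    for e in range(n):
        M[e:e + 2, e:e + 2] += Me                        # standard assembly
    return M, np.diag(M.sum(axis=1))                     # consistent, lumped
```

Lumping preserves the total mass (the sum of all entries), since the diagonal collects exactly the row sums of the consistent matrix.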
We continue to use u to denote the vector of nodal values of the finite element discretization of the deflection and thus obtain the discrete energy

    J_h(u) = (1/2) uᵀ K u − (M F)ᵀ u + Σ_{i=1}^{m} c_i 1ᵀ M max{u − ψ_i 1, 0}.

Here, 1 denotes the vector of all ones and the max operation is understood coefficientwise.
The minimization of the convex but non-smooth discrete energy is not straightforward. Our method of choice here is an ADMM scheme, which moves the non-smooth terms into a separate subproblem, which can then be efficiently solved using our algorithm. We refer the reader to Boyd et al. and the references therein for an overview of ADMM.
In order to obtain a subproblem of the weighted mean absolute error type, we use the identity max{a, b} = (1/2)(a + b + |a − b|) to rewrite the coefficientwise maximum in J_h in terms of absolute values, and we introduce a splitting variable z = u for the resulting non-smooth terms. The augmented Lagrangian of the split problem involves the squared norm ‖·‖²_M induced by the positive diagonal matrix M.
The ADMM computes iterates u^(k), z^(k), μ^(k) according to an update scheme whose z-step decouples into independent scalar problems, one for each component z_j of z. Each of these is a proximal problem for a weighted mean absolute error with data points given by the thresholds ψ_i, and thus it can be solved efficiently and simultaneously for all components z_j by our algorithm.
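A minimal finite-dimensional ADMM sketch with a single threshold per component shows the mechanics (all names and data are ours; with a single threshold the z-step even has a closed form, a one-sided shrinkage, while with several thresholds ψ_1, …, ψ_m per component it becomes exactly the weighted mean absolute error proximal problem treated by our algorithm).

```python
import numpy as np

def admm_threshold_energy(A, b, c, psi, rho=1.0, iters=500):
    """ADMM for  min_x 0.5*x^T A x - b^T x + sum_j c_j * max(x_j - psi_j, 0),

    with A symmetric positive definite, using the splitting z = x.
    """
    n = len(b)
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # x-step: strongly convex quadratic -> linear system
        x = np.linalg.solve(A + rho * np.eye(n), b + rho * (z - u))
        # z-step: min_z c*max(z - psi, 0) + (rho/2)*(z - v)^2, componentwise;
        # the solution is a one-sided shrinkage of v towards psi
        v = x + u
        e = v - psi
        z = psi + np.where(e < 0.0, e, np.maximum(e - c / rho, 0.0))
        # dual (scaled multiplier) update
        u = u + x - z
    return z
```

For the scalar toy problem (1/2)x² − 2x + max{x, 0} the iterates converge to the minimizer x = 1, and components whose optimal value sits exactly at a threshold remain pinned there.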
We use triangular meshes for two different domains Ω. We choose the stiffness constant k, force density F, boundary stiffness constant β, additional downward forces c_i and threshold deflections ψ_i to be constant over the domains in each case, for i = 1, …, m.

Figure: Example of the proximal map prox_{γf}.

Figure: "Exact" solution of the denoising problem obtained by a split Bregman iteration.

Figure: Numerical results used to obtain timings for the algorithm.

Figure: Resulting deflections u which minimize the discrete energy functional J_h over two different domains. Deflections were computed with the ADMM scheme and the stated stopping criterion. The subdomains are colored according to the number of active terms (u ≥ ψ_i) in the energy.

Table: Timing results.

In the checkerboard sweeps, we use subvector indexing. That is, u_W refers to the vector u with "white" indices copied and "black" indices zeroed. The roles are reversed for u_B. Consequently, u = u_W + u_B holds. Subvector indexing can be conveniently done in Python using logical indexing.