The proximal Robbins–Monro method

The need for statistical estimation with large data sets has reinvigorated interest in iterative procedures and stochastic optimization. Stochastic approximations are at the forefront of this recent development as they yield procedures that are simple, general and fast. However, standard stochastic approximations are often numerically unstable. Deterministic optimization, in contrast, increasingly uses proximal updates to achieve numerical stability in a principled manner. A theoretical gap has thus emerged. While standard stochastic approximations are subsumed by the framework of Robbins and Monro (The Annals of Mathematical Statistics, 1951, pp. 400–407), there is no such framework for stochastic approximations with proximal updates. In this paper, we conceptualize a proximal version of the classical Robbins–Monro procedure. Our theoretical analysis demonstrates that the proposed procedure has important stability benefits over the classical Robbins–Monro procedure, while it retains the best known convergence rates. Exact implementations of the proximal Robbins–Monro procedure are challenging, but we show that approximate implementations lead to procedures that are easy to implement, and still dominate standard procedures by achieving numerical stability, practically without trade‐offs. Moreover, approximate proximal Robbins–Monro procedures can be applied even when the objective cannot be calculated analytically, and so they generalize stochastic proximal procedures currently in use.

Recently, the Robbins-Monro procedure has attracted considerable interest in machine learning with large data sets (Bottou, 2010; Bottou et al., 2016; Moulines & Bach, 2011; Zhang, 2004) and in scalable statistical inference (Chen et al., 2016; Li et al., 2017; Su & Zhu, 2018; Toulis & Airoldi, 2015, 2017). In this context, given a data set D, the Robbins-Monro procedure in Equation (1) can be applied with h(θ) being the gradient of the negative log-likelihood of θ given D, and H(θ, ξ) being the gradient of the negative log-likelihood of θ calculated at a single data point sampled with replacement from D. Standard theory then implies that θ_n converges to a point θ_∞ for which h(θ_∞) = 0. In other words, θ_n converges to the maximum-likelihood estimator (or maximum a posteriori estimator, if regularization is used) given the data set D. In this context, h is the gradient of a convex scalar potential, and the Robbins-Monro procedure is commonly referred to as stochastic gradient descent (SGD).
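To make this setup concrete, the SGD instance of the Robbins-Monro procedure in Equation (1) can be sketched as follows; the Gaussian linear model, the data set and all constants below are illustrative choices of ours, not examples from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a Gaussian linear model y = x'theta + noise, so the
# negative log-likelihood gradient at one data point xi = (y, x) is
# H(theta, xi) = -(y - x'theta) x.
p, N = 3, 2000
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(scale=0.1, size=N)

theta = np.zeros(p)                       # initial estimate theta_0
for n in range(1, 50_001):
    i = rng.integers(N)                   # one data point, sampled with replacement
    gamma_n = 1.0 / n                     # learning rate gamma_n = gamma_1 / n
    grad = -(y[i] - X[i] @ theta) * X[i]  # stochastic gradient H(theta_{n-1}, xi_n)
    theta = theta - gamma_n * grad        # Robbins-Monro / SGD update

print(np.linalg.norm(theta - theta_star))
```

With a well-specified learning rate, the final iterate lands close to the least-squares/maximum-likelihood solution for this data set.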
A well-known issue with the Robbins-Monro procedure, however, is its sensitivity to the specification of hyperparameters, especially the learning rate parameter γ_1. For instance, the procedure can be arbitrarily slow if γ_1 is even slightly misspecified. To illustrate, suppose that γ_n = γ_1/n, and that there exists a scalar potential F such that ∇F(θ) = h(θ), for all θ ∈ Θ. If F is strongly convex with parameter μ, then E‖θ_n − θ*‖² = O(n^{−ε}) if ε = 2μγ_1 < 1 (Nemirovski et al., 2009, Section 1; Moulines & Bach, 2011, Section 3.1). In contrast, the procedure can diverge even in the first few iterations if the learning rate is too large, especially with non-Lipschitz likelihoods as in Poisson regression. In summary, small learning rates can make the Robbins-Monro iterates converge very slowly, whereas large learning rates can make the iterates diverge numerically. Importantly, the requirements for numerical stability and fast convergence are very hard to reconcile in practice, especially in large-scale problems, which renders the Robbins-Monro method, and all its derived procedures, inapplicable without extensive heuristic modifications (Bottou, 2012).
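The slowdown from a too-small γ_1 is visible even without noise. The following toy iteration, our own illustration with h(θ) = θ (so μ = 1 and θ* = 0), contrasts a misspecified γ_1 with a well-specified one:

```python
# Toy illustration of learning-rate sensitivity (not an example from the paper):
# iterate theta_n = theta_{n-1} - (gamma_1 / n) * h(theta_{n-1}) with
# h(theta) = theta, so theta_n is roughly theta_0 * n**(-gamma_1) after n steps.
def run(gamma_1, theta_0=1.0, n_iter=10_000):
    theta = theta_0
    for n in range(1, n_iter + 1):
        theta -= (gamma_1 / n) * theta
    return theta

slow = run(gamma_1=0.05)   # exponent -0.05: after 10,000 steps, still ~0.6
fast = run(gamma_1=0.9)    # exponent -0.9: error is tiny
print(slow, fast)
```

A ten-fold downward misspecification of γ_1 thus changes the error exponent itself, not just a constant. Conversely, with non-Lipschitz gradients (as in Poisson regression), a γ_1 that is too large can overflow numerically within a few iterations.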

2 | THE PROXIMAL ROBBINS-MONRO PROCEDURE: AN OVERVIEW
In this paper, our idea to improve the stability of the Robbins-Monro procedure is to leverage the proximal point algorithm of Rockafellar (1976). Assuming a known convex potential function F such that ∇F = h, and a current iterate θ_{n−1}, the proximal point update is defined as follows:

(2) θ_n^+ = prox_{γ_n F}(θ_{n−1}) = arg min_{θ ∈ Θ} { F(θ) + (1/(2γ_n)) ‖θ − θ_{n−1}‖² }.

The intuition behind the update in Equation (2) is to penalize values of θ_n^+ that are far away from θ_{n−1}: the smaller γ_n is, the stronger the penalization. In recent years, interest in optimization through proximal operators (i.e. the function prox_{γ_n F} above) has exploded because the resulting proximal procedures are stable and converge under minimal assumptions (Bauschke & Combettes, 2011; Parikh & Boyd, 2013). In addition, they can be applied to settings where the objective function is the sum of a smooth and a non-smooth function (as is common when using regularization), and they often lead to efficient, parallelizable algorithms.
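When F happens to be quadratic, the minimization in Equation (2) has a closed form, which makes the proximal point iteration easy to sketch; the matrix A and vector b below are our own illustrative choices:

```python
import numpy as np

# For the illustrative quadratic F(theta) = 0.5 theta'A theta - b'theta
# (A symmetric positive-definite), the proximal update
#   prox_{gamma F}(v) = argmin_theta { F(theta) + ||theta - v||^2 / (2 gamma) }
# has a closed form from the first-order condition
#   A theta - b + (theta - v)/gamma = 0  =>  theta = (I + gamma A)^{-1} (v + gamma b).
def prox_quadratic(v, gamma, A, b):
    return np.linalg.solve(np.eye(len(v)) + gamma * A, v + gamma * b)

A = np.diag([2.0, 0.5])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)       # minimizer of F: h(theta*) = A theta* - b = 0

theta = np.array([5.0, -5.0])
for n in range(1, 200):
    theta = prox_quadratic(theta, 2.0 / n, A, b)   # proximal point iteration

print(theta, theta_star)
```

Each step contracts the error by a factor (1 + γ_n λ)^{−1} along each eigendirection of A, with no upper bound on the admissible step size; this is the stability property discussed next.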
To illustrate the stability of proximal updates, we can rewrite (2) as:

θ_n^+ − θ* = θ_{n−1} − θ* − γ_n h(θ_n^+),

where θ* satisfies h(θ*) = 0. Then, we take ℓ_2 norms on both sides to obtain:

‖θ_n^+ − θ*‖² = ‖θ_{n−1} − θ*‖² − 2γ_n h(θ_n^+)^⊤(θ_n^+ − θ*) − γ_n² ‖h(θ_n^+)‖².

By convexity of F, we have h(θ)^⊤(θ − θ*) ≥ 0 for any θ, and so, unless h(θ_n^+) = 0, we obtain ‖θ_n^+ − θ*‖² < ‖θ_{n−1} − θ*‖². This shows that the proximal update in Equation (2) contracts the iterates towards the optimal solution θ*, hence the procedure converges. In practice, however, the proximal point algorithm is infeasible, since solving the minimization problem (2) is, in general, as hard as minimizing F, or even impossible if F is unknown. Fortunately, classical results show that approximate solutions to Equation (2) are still stable if the approximation errors are small enough (Rockafellar, 1976).
In this paper, we first introduce stochastic errors in the proximal point updates, and study the properties of the resulting procedure from a probabilistic viewpoint. This procedure follows a stylized model at first, but will later serve as a template for concrete, approximate implementations. To that end, we begin with the following stylized model: an agent has an initial estimate θ_0; an oracle then calculates the proximal update θ_1^+ according to Equation (2), which the agent observes with error ε_1, as θ_1 = θ_1^+ − γ_1 ε_1; the oracle then computes the proximal update θ_2^+ given θ_1, and so on. This procedure is depicted in Table 1, and is also summarized below:

(3) θ_n^+ = θ_{n−1} − γ_n h(θ_n^+),
(4) θ_n = θ_n^+ − γ_n ε_n.

An immediate concern with the stochastic proximal point algorithm of Equation (4) is whether it inherits the stability properties of the classical proximal point algorithm. Indeed, in Section 3, we show that when γ_n decreases at a proper rate, and ε_n is a random error with uniformly bounded variance, then the stochastic proximal point algorithm converges, and is numerically stable. In particular, we prove results on almost sure convergence (Theorem 1), and derive error bounds for convex (Theorem 2) and strongly convex objectives (Theorem 3). We then discuss the stability of the algorithm by analysing the dependence of the expected errors, E(‖θ_n − θ*‖²), on the initial error, E(‖θ_0 − θ*‖²), and the learning rate parameter, γ_1; see Section 3.2 for details. Towards concrete implementations of the stochastic proximal point algorithm (4), recall our key assumption that h(θ) can be estimated by some random variable H(θ, ξ), such that E(H(θ, ξ)) = h(θ). Thus, using the same notation as in Equation (1), we can define ε_n = H(θ_n^+, ξ_n) − h(θ_n^+), whose expected value is zero conditional on {ξ_i}_{1≤i<n}.
This lets us rewrite Equations (3) and (4) as:

(5) θ_n = θ_{n−1} − γ_n H(θ_n^+, ξ_n).

This is a form of a proximal Robbins-Monro procedure, since its update in Equation (5) differs from the classical Robbins-Monro update of Equation (1) only in using the proximal update θ_n^+ instead of θ_{n−1} in H(θ, ξ). As a special case of the stochastic proximal point algorithm defined above, the proximal Robbins-Monro procedure is also numerically stable. Its exact implementation, however, remains problematic due to the presence of the proximal term θ_n^+ in Equation (5). We thus propose and study two different ways to implement the proximal Robbins-Monro procedure, depending on whether we can observe ξ directly or not. First, we consider settings where ξ can be observed directly. For example, in the context of stochastic gradient descent (see Section 4.1), ξ is a random data point from the data set, and H(θ, ξ) is the corresponding stochastic gradient calculated at parameter value θ. In such settings, we can perform an interesting, and perhaps counterintuitive, application of the 'plug-in principle'.
In particular, taking expectations in Equation (5) yields E(θ_n | ℱ_{n−1}) = θ_n^+, where ℱ_{n−1} is the natural filtration, σ(ξ_1, …, ξ_{n−1}). Since θ_n is an unbiased estimator of θ_n^+, we can simply plug in θ_n on the right-hand side of Equation (5) to obtain:

(7) θ_n = θ_{n−1} − γ_n H(θ_n, ξ_n).

Equation (7) describes a wide family of stochastic optimization methods known as incremental proximal methods, or implicit stochastic gradient descent, depending on the field of application. We discuss related work in Section 2.1, and provide details and practical examples in Section 4.
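For a concrete sense of how the implicit update in Equation (7) is solved in practice, consider logistic regression (our own choice of model for illustration). Because H(θ, (y, x)) = −(y − sigmoid(x^⊤θ))x points along x, the p-dimensional implicit update collapses to one scalar equation, which bisection solves reliably:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Implicit (plug-in) update: theta_n = theta_{n-1} + gamma * (y - sigmoid(x'theta_n)) x.
# Writing theta_n = theta_{n-1} + gamma * lam * x reduces it to the scalar equation
#   lam = y - sigmoid(x'theta_{n-1} + gamma * lam * ||x||^2),
# whose right-hand side minus lam is decreasing, with a sign change on [y - 1, y].
def isgd_update(theta, y, x, gamma):
    u, c = x @ theta, x @ x
    f = lambda lam: y - sigmoid(u + gamma * lam * c) - lam
    lo, hi = y - 1.0, y          # f(lo) >= 0 >= f(hi), so bisection applies
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return theta + gamma * lam * x

# Even an aggressive step size does not blow the iterate up:
theta0 = np.zeros(2)
x_obs, y_obs = np.array([1.0, 2.0]), 1.0
theta1 = isgd_update(theta0, y_obs, x_obs, gamma=100.0)
print(theta1)
```

The update damps itself: the fitted value x^⊤θ_1 moves towards the observation without overshooting it, which previews the numerical-stability property discussed in Section 4.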
Second, we consider the more novel and challenging setting where ξ cannot be observed directly, but is only observed through the stochastic function H. In particular, we no longer assume that the form of the dependency of H on ξ is known, and thus solving the implicit Equation (7) becomes infeasible. This is the case, for example, when H(θ, ξ) can only be sampled successively in the context of a sequential experiment (see Section 6).
TABLE 1 Stylized model of the stochastic proximal point algorithm. The update from θ_{n−1} to θ_n^+ (oracle) is deterministic, and from θ_n^+ to θ_n (agent) it is stochastic.

(5) θ_n = θ_{n−1} − γ_n H(θ_n^+, ξ_n).   Proximal Robbins-Monro procedure
(7) θ_n = θ_{n−1} − γ_n H(θ_n, ξ_n).   Incremental proximal/implicit methods

To address this challenge, we first take a full expectation in Equation (5) to obtain E(θ_n^+ − θ_{n−1} + γ_n H(θ_n^+, ξ_n) | ℱ_{n−1}) = 0. This implies that, conditional on ℱ_{n−1}, we can view the proximal iterate θ_n^+ as a solution to the following characteristic equation:

x − θ_{n−1} + γ_n h(x) = 0.

The key idea is then to apply the classical Robbins-Monro procedure directly to this characteristic equation, leading to the following stochastic fixed-point procedure:

(8) θ_{n,k} = θ_{n,k−1} − a_k (θ_{n,k−1} − θ_{n−1} + γ_n H(θ_{n,k−1}, ξ_{n,k})),

where (a_k) is a learning rate sequence for the inner iterations, and θ_n is obtained from the inner iterates θ_{n,k}. At first, it may seem that the proximal stochastic fixed-point procedure described in Equation (8) may have the same stability issues as classical stochastic approximation, since it uses classical updates in the inner loop. To investigate this, in Section 4.2 we analyse the convergence properties of this procedure, which is particularly challenging due to its nested structure. Our analysis reveals conditions under which the procedure can be more stable than classical Robbins-Monro. In Section 6, we also show significant benefits in numerical stability using the classical quantile regression example of Robbins and Monro (1951). The proximal stochastic fixed-point procedure of Equation (8) and its theoretical analysis therefore constitute a key contribution of this paper. We are unaware of other stochastic proximal procedures where the random components, ξ, cannot be observed directly, or where the underlying procedure is comprised of nested stochastic fixed points.
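To convey the nested structure, here is a sketch of how such a procedure might be organized on a quantile-estimation problem in the spirit of Robbins and Monro (1951); the inner schedule a_k, the inner budget, and all constants are our own illustrative choices rather than the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

# Median estimation: h(theta) = E(H(theta, xi)) with
# H(theta, xi) = 1{xi <= theta} - 0.5 and xi ~ N(2, 1), so theta* = 2.
# Only noisy draws of H are available; xi is never observed in a form that
# would let us solve the implicit update directly.
def H(theta):
    return float(rng.normal(loc=2.0) <= theta) - 0.5

theta, gamma_1 = 0.0, 4.0
for n in range(1, 2001):
    gamma_n = gamma_1 / n
    # Inner Robbins-Monro loop on the characteristic equation
    # x - theta_{n-1} + gamma_n * h(x) = 0, using noisy evaluations H(x):
    x = theta
    for k in range(1, 21):
        a_k = 1.0 / k                  # illustrative inner learning rate
        x -= a_k * (x - theta + gamma_n * H(x))
    theta = x                          # accept the approximate proximal point

print(theta)
```

Note that the outer loop never evaluates h directly; the inner loop both approximates the proximal point and averages out part of the sampling noise.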
The rest of this paper is structured as follows. In Section 2.1, we discuss related work. In Section 3, we study the stochastic proximal point algorithm, and show theoretically its appealing statistical and stability properties. We then focus on the proximal Robbins-Monro procedure as a natural instance of the stochastic proximal point algorithm. As mentioned earlier, proximal Robbins-Monro is also an 'idealized' procedure; that is, it is well-defined mathematically but, in general, it cannot be computed directly. We thus explore the extent to which its theoretical properties carry through to the two approximate implementations outlined above. Specifically, in Section 4, we discuss the implicit procedures described in Equation (7), with concrete examples, and in Section 4.2, we analyse the stochastic fixed-point procedure of Equation (8) with a full convergence analysis. To illustrate our theory, we present empirical results on both approximate implementations in Sections 5 and 6, all showing the stability benefits of our approach.
Remark 2.1 Technically speaking, the approximate procedure in Equation (7) is not subsumed by the idealized procedure of Equation (4), since H(θ_n, ξ_n) in the former equation is not an unbiased estimator of h(θ_n^+) in the latter, as θ_n and ξ_n are dependent in Equation (7). The convergence analysis of the approximate procedures in Equations (7) and (8) therefore requires special technical results that extend beyond the idealized procedure; see Sections 4.1 and 4.2, respectively. This analysis can be conceptually understood as quantifying the bias in the errors between the approximate procedures and the idealized procedure, and arguing that this bias converges to zero sufficiently quickly. As such, the idealized procedure in Equation (4) can be thought of as describing the 'limiting behaviour' of the two approximate procedures. An interesting question for future work is to identify general conditions under which convergence of (4) can be obtained when the errors ε_n have non-zero mean.

2.1 | Related work and contributions
There is a voluminous literature on classical stochastic approximation. The early mathematical work by Robbins and Monro (1951), Sacks (1958), Fabian (1968), Nevel'son et al. (1973), Robbins and Siegmund (1985), and Wei (1987) established the fundamental properties, including convergence and asymptotic laws. This work was subsequently pivotal in engineering and, particularly, in systems identification and tracking (Benveniste et al., 1990; Ljung et al., 1992); see also the excellent review by Lai (2003). More recently, there have been important developments in studying stochastic approximations through the lens of dynamical systems theory, spearheaded by Kushner and Yin (2003) and Borkar (2008). Roughly at the same time, stochastic approximations appeared in machine learning, usually in the form of stochastic gradient descent (SGD) methods, and especially in applications with large data sets and complex models (Bottou, 2010; Zhang, 2004).
There are mainly two lines of literature that are directly related to our work, as depicted in Table 2. In one line of work, the proximal update is deterministic and is performed after a classical stochastic update. For example, the forward-backward procedure of Singer and Duchi (2009) and the proximal stochastic gradient procedures studied by Rosasco et al. (2014, 2016) and Bianchi and Hachem (2016) fall into this category; see Section 3 for a related discussion. Such procedures first update θ̃_n = θ_{n−1} − γ_n H(θ_{n−1}, ξ_n), and then define θ_n = prox_{γ_n f}(θ̃_n), where f is some convex regularization function. In our work, we wish to avoid making an explicit update to ensure stability. A notable exception is presented in Section 4.2, where we discuss the stochastic fixed-point procedure in Equation (8). This procedure involves multiple explicit updates within a nested procedure, which, however, do not introduce instability thanks to the problem structure.
Another line of work involves procedures as in Equation (7), where implicit updates are directly used in the update equation. Incremental proximal procedures (Bertsekas, 2011) and implicit SGD (Toulis & Airoldi, 2017; Toulis et al., 2014; Tran et al., 2016) fall into this category. The update in (7) can be solved efficiently in many statistical models, including generalized linear models. In numerical optimization and engineering, the stochastic proximal point algorithms studied by Bianchi (2016), Patrascu and Necoara (2017) and Patrascu and Irofti (2019) are closely related. Interestingly, all such procedures can be viewed as plug-in versions of the proposed proximal Robbins-Monro procedure in Equation (5).
TABLE 2 Depiction of related work. Modern procedures, such as SGD, are instantiations of the classical procedure of Robbins and Monro (1951). The proximal Robbins-Monro procedure proposed in this paper leads to well-known implicit procedures, and also to novel procedures which can work even when the random component ξ cannot be observed directly. The common goal is to solve E(H(θ, ξ)) = 0.

Can observe ξ directly.
  Classical Robbins-Monro: stochastic gradient descent (Coraluppi & Young, 1969; Zhang, 2004; Bottou, 2010); natural gradients (Amari, 1998); adaptive gradients (Duchi et al., 2011).
  Proximal Robbins-Monro: implicit stochastic gradients (Bertsekas, 2011; Bianchi, 2016; Toulis & Airoldi, 2017); stochastic proximal gradients (Singer & Duchi, 2009; Rosasco et al., 2014).
Cannot observe ξ directly.
  Classical Robbins-Monro: structural breaks/tracking, quantile estimation (Benveniste et al., 1990; Robbins & Monro, 1951).
  Proximal Robbins-Monro: prox-stochastic fixed point (Equation (8), Section 4.2).

From a theoretical perspective, the central contribution of this paper is the introduction of the proximal Robbins-Monro procedure as the stochastic analogue of the classical proximal point algorithm. This procedure is similar to classical Robbins-Monro procedures, but differs by using the proximal point, θ_n^+, in its iterations. We provide a full convergence analysis of the new procedures in Section 3. This fills a gap in the literature that has remained open since classical stochastic approximation was introduced by Robbins and Monro (1951) as the stochastic analogue of gradient descent. Our analysis shows that the proximal Robbins-Monro procedure is more stable numerically than classical Robbins-Monro, and is also less sensitive to hyperparameter tuning.
The key practical novelty of our work is therefore an implementation of the proximal Robbins-Monro procedure based on the fixed point procedure of Equation (8). This procedure can operate even when we cannot observe ξ directly, including settings where h(θ) is not known analytically. In Section 4.2, we present a full convergence analysis of the procedure, which is particularly challenging due to its nested structure. In Section 6, we also illustrate significant benefits in numerical stability through the classical quantile regression example of Robbins and Monro (1951). As mentioned earlier, the fixed-point procedure and its theoretical analysis constitute a key contribution of this paper. We are unaware of other proximal methods, exact or approximate, that can be applied to pure stochastic approximation settings. Moreover, while stochastic fixed point procedures exist in the classical literature (Borkar, 2008, section 10.2), they usually lack non-asymptotic error analysis (similar to Theorem 6), and so their stability properties are unknown.

3 | THE STOCHASTIC PROXIMAL POINT ALGORITHM
In this section, we analyse theoretically the stochastic proximal point algorithm in Equation (4). Specifically, we study convergence (Section 3.1), non-asymptotic convergence rates (Section 3.2) and asymptotic normality (Section 3.3). We emphasize that the analysis here also applies directly to the proximal Robbins-Monro procedure of Equation (5). Later, we show that the theoretical properties studied here, and especially those that relate to numerical stability, pass on to the approximate implementations of proximal Robbins-Monro. All proofs can be found in Section 1 of the supplementary material.
We need some notation first. The symbol ‖ · ‖ denotes the ℓ_2 vector/matrix norm. The parameter space for θ is Θ ⊆ ℝ^p, and is convex. For positive scalar sequences (a_n) and (b_n), we write b_n = O(a_n) to express that b_n ≤ c a_n, for some fixed c > 0 and every n = 1, 2, …; we write b_n = o(a_n) to express that b_n/a_n → 0 in the limit where n → ∞. The notation b_n ↓ 0 means that b_n is positive and decreasing towards zero.
Existence and uniqueness of θ_n^+ as a solution of Equation (3) are guaranteed by the following assumption, which we make throughout the paper without further mention: there exists a convex potential F such that ∇F = h.

TOULIS eT aL.
This assumption is not strictly necessary, but it covers most applications considered in this paper, including settings where stochastic gradient descent is applied. In Section 6, for instance, we study a quantile regression problem where h is scalar-valued and non-decreasing, which ensures the existence of F and θ_n^+.
Depending on which result we state, the stochastic proximal point algorithm operates under a combination of the following assumptions.
Assumption 1 It holds that γ_n = γ_1 n^{−γ}, with γ_1 > 0 and γ ∈ (0, 1].

Assumption 2 The function h is Lipschitz with parameter L; that is, for all θ_1, θ_2 ∈ Θ, ‖h(θ_1) − h(θ_2)‖ ≤ L‖θ_1 − θ_2‖.

Assumption 3 The function h satisfies at least one of the following conditions:
(a) h(θ)^⊤(θ − θ*) ≥ 0, for all θ ∈ Θ;
(b) h(θ)^⊤(θ − θ*) > 0, for all θ ∈ Θ with θ ≠ θ*;
(c) h(θ)^⊤(θ − θ*) ≥ μ‖θ − θ*‖², for all θ ∈ Θ and some fixed μ > 0.

Assumption 4 There exists a fixed σ² > 0 such that, for all n = 1, 2, …, E(ε_n | ℱ_{n−1}) = 0 and E(‖ε_n‖² | ℱ_{n−1}) ≤ σ².

Assumption 3(a) is implied by convexity of F but is weaker, since monotonicity of its gradient h is only required to hold at θ*. Assumption 3(b) states that F is strictly convex at θ* (in particular, it implies that θ* is unique). Assumption 3(c) is implied by strong convexity of F but is weaker since, similarly to Assumption 3(a), the quadratic lower bound is only required to hold with respect to θ*. Assumption 4 was introduced by Robbins and Monro (1951), and has since been standard in stochastic approximation analysis. It simply states that the stochastic errors in the observations of h have zero mean and uniformly bounded variances. It could be weakened to include slowly growing error variances, σ_n², provided that ∑_n γ_n² σ_n² < ∞. Assumption 5 is the typical Lindeberg condition that is used to prove asymptotic normality of θ_n, later in this section.
Overall, our assumptions are weaker than the assumptions in classical stochastic approximation because they refer to the idealized procedures of Equations (4) and (5); compare, for example, Assumptions 1-5 with assumptions (A1)-(A4) of Borkar (2008, Section 2.1) or the assumptions of Benveniste et al. (1990, Theorem 15). In comparison to forward-backward procedures (e.g. Bianchi & Hachem, 2016; Rosasco et al., 2016), we share common assumptions on the Lipschitzness of the regression function h (Assumption 2) and bounded second moments for the noise term (Assumption 4). The main difference is that forward-backward procedures require certain 'fine-tuning' conditions for the learning rate. For example, assumption (A2) of Rosasco et al. (2016) requires that γ_n decays sufficiently fast with respect to the noise level in ε. Our procedure does not require such assumptions because the forward-backward steps are transposed (the implicit step happens first), which adds numerical stability, as shown in Theorem 3. In some sense, our procedure is a form of 'backward-forward' splitting.

3.1 | Convergence of stochastic proximal points
In Theorem 1, we prove almost sure convergence of the stochastic proximal point algorithm, using the supermartingale lemma of Robbins and Siegmund (1985).
Theorem 1 Suppose that Assumptions 1, 3(b) and 4 hold with γ ∈ (1/2, 1]. Then, the iterates θ_n of the stochastic proximal point algorithm of Equation (4) converge almost surely to θ*; that is, θ_n → θ* almost surely as n → ∞.

The conditions for almost sure convergence of the stochastic proximal point algorithm are weaker than in classical stochastic approximation. For example, in classical stochastic approximation, it is typically assumed that the iterates θ_n are almost surely bounded; see, for example, Assumption (A4) of Borkar (2008).

3.2 | Non-asymptotic analysis
In this section, we derive upper bounds for the deviance of the potential function, E(F(θ_n) − F(θ*)), and for the mean squared errors, E‖θ_n − θ*‖². This provides information on the rate of convergence, as well as on the stability of the stochastic proximal point algorithm. Theorem 2 on the deviance assumes convexity but not strong convexity of F, whereas Theorem 3 on the squared error assumes strong convexity.

Theorem 2 Suppose that Assumptions 1, 2, 3(a) and 4 hold. Let δ_n = E(F(θ_n) − F(θ*)) denote the expected deviance of the iterate θ_n of the stochastic proximal point algorithm of Equation (4). Then, if γ ∈ (2/3, 1], there exists n_{0,1} < ∞ such that, for all n > n_{0,1}, δ_n = O(n^{−1+γ}). If γ ∈ (1/2, 2/3), there exists n_{0,2} < ∞ such that, for all n > n_{0,2}, δ_n = O(n^{−γ/2}). Otherwise, γ = 2/3 and there exists n_{0,3} < ∞ such that, for all n > n_{0,3}, δ_n = O(n^{−1/3}).

There are two main results in Theorem 2. First, the rates of convergence for the deviance of the stochastic proximal point algorithm are either O(n^{−1+γ}) or O(n^{−γ/2}), depending on the learning rate parameter γ. Second, there is a uniform decay of the expected deviance towards zero, whereas in standard stochastic approximation under non-strong convexity there is a term of the form exp(4L²γ_1² n^{1−2γ}) (Moulines & Bach, 2011, Theorem 4), which can amplify the initial conditions arbitrarily. Thus, the stochastic proximal point algorithm, and consequently the proximal Robbins-Monro procedure, have similar asymptotic properties to classical stochastic approximation, but they are more stable numerically, and less sensitive to initial conditions or hyperparameter tuning.
Remark 3.1 The best rate of convergence for the proximal Robbins-Monro procedure, as shown in Theorem 2, is O(n^{−1/3}), which matches the best known rate for classical stochastic approximations with a non-strongly convex objective (Moulines & Bach, 2011, Theorem 4). This rate is suboptimal since it is worse than the minimax rate of O(n^{−1/2}) that is achieved through Polyak-Ruppert averaging (Ruppert, 1988). We conjecture that our proposed procedure can also achieve the minimax rate through averaging, but we leave this for future work.

Remark 3.2 The proof of Theorem 2 presents some unique technical challenges, including an implicit inequality of the form b_n + g(b_n) ≤ b_{n−1}, with g being a non-explicit, non-decreasing function. Our strategy is to solve the reverse recursive inequality, b_n(β) + g(b_n(β)) ≥ b_{n−1}(β), in some parametric family, such as b_n(β) = O(n^{−β}), which is more tractable. Then, it is easy to show that b_n(β) is an upper bound for b_n, for any β. Thus, a natural upper bound for b_n is given by b_n ≤ min_β b_n(β). This solution strategy is reminiscent of the majorization-minimization idea (Lange, 2010), and may be more broadly useful.

Theorem 3 Suppose that Assumptions 1, 3(c) and 4 hold. Let δ_n = E(‖θ_n − θ*‖²), where θ_n is the n-th iterate of the stochastic proximal point algorithm of Equation (4), and define ρ = 1 + 2μγ_1. If γ < 1, then, for every n > 1, it holds that δ_n ≤ exp(−(log ρ) n^{1−γ}) δ_0 + O(n^{−γ}). Otherwise, if γ = 1, it holds that δ_n ≤ n^{−log ρ} δ_0 + O(max{n^{−1}, n^{−log ρ}}).

There are two main results presented in Theorem 3. First, the rate of convergence for the expected errors, E(‖θ_n − θ*‖²), is O(n^{−γ}), which matches the rate of convergence for classical stochastic approximation under strong convexity (Benveniste et al., 1990, Theorem 22). The best possible rate here is O(1/n), which is also the minimax rate with strongly convex objectives. Second, there is an exponential discounting of the initial conditions, δ_0, regardless of the specification of the learning rate parameter γ_1 and the Lipschitz parameter L.
Another way to express this is to consider the function φ_n = log(δ_n/δ_0) under a noise-free setting (σ² = 0). By studying this function with respect to γ_1 (and other problem parameters, such as convexity), we can study stability. In particular, Theorem 3 shows that φ_n = −log(1 + 2μγ_1) n^{1−γ}. In contrast, in classical stochastic approximation, φ_n = L²γ_1² n^{1−2γ} − O(n^{1−γ}), which can make the approximation diverge numerically if γ_1 is even slightly misspecified with respect to L (Moulines & Bach, 2011, Theorem 1). Thus, as in the non-strongly convex case of Theorem 2, the stochastic proximal point algorithm has similar asymptotic rates to classical stochastic approximation and is also more stable.

Remark 3.3 When γ = 1, misspecification of the learning rate parameter can indeed lead to an arbitrary slowdown, to a rate O(max{n^{−1}, n^{−log ρ}}). This is also true for classical stochastic approximation (Moulines & Bach, 2011, Theorem 1), and is generally a feature of stochastic first-order methods. The key difference between the two procedures, as described above, is numerical stability.

3.3 | Asymptotic normality
Asymptotic distributions are well studied in classical stochastic approximation. Starting from Fabian (1968), there has been extensive work in identifying asymptotic distribution laws in stochastic approximation. In this section, we leverage this theory to show when iterates from stochastic proximal point procedures can also be asymptotically normal. The following theorem establishes this result using Theorem 1 of Fabian (1968); see also (Ljung et al., 1992, Chapter II.8).
Theorem 4 Suppose that Assumptions 1, 2, 3(a), 4 and 5 hold, and that (2γ_1 J_h(θ*) − I) is positive-definite, where J_h(θ) is the Jacobian of h at θ, and I is the p × p identity matrix. Then, the iterate θ_n of the stochastic proximal point algorithm of Equation (4) is asymptotically normal:

n^{γ/2} (θ_n − θ*) → N_p(0, Σ) in distribution.

The covariance matrix Σ is the unique solution of a Lyapunov equation determined by γ_1, J_h(θ*) and the limiting covariance of the errors ε_n.

Theorem 4 shows that the asymptotic distribution of θ_n is identical to the asymptotics of the classical Robbins-Monro procedure (Fabian, 1968, for example). Intuitively, in the limit as n grows, we have that θ_n^+ ≈ θ_{n−1} + O(γ_n) with high probability, and thus the stochastic proximal point algorithm behaves like the classical approximation procedure.

4 | THE PROXIMAL ROBBINS-MONRO PROCEDURE
In the following sections, we focus on the special case of the stochastic proximal point algorithm where an unbiased estimate H(θ, ξ) of h(θ) is available, such that E(H(θ, ξ)) = h(θ). This leads to the proximal Robbins-Monro procedure introduced in Equation (5), which we repeat here:

(10) θ_n = θ_{n−1} − γ_n H(θ_n^+, ξ_n).

This procedure is still infeasible, in general, due to the proximal term, θ_n^+, and so we consider approximate implementations. Specifically, we consider two different implementations, depending on whether we have direct access to samples of ξ or not. The former leads to well-known stochastic procedures, and so our discussion will be relatively short. Later, in Section 4.2, we focus on the more challenging setting where we cannot directly sample (or observe) ξ, and analyse the resulting procedures in more detail.

4.1 | Observable ξ: Approximate implementation with the plug-in principle
As mentioned earlier, when we can observe ξ directly, we can apply the plug-in principle to implement the proximal Robbins-Monro update in (10). Specifically, by definition of the proximal update in Equation (4) and Assumption 4, we have E(θ_n | ℱ_{n−1}) = θ_{n−1} − γ_n h(θ_n^+) = θ_n^+. Plugging in θ_n for θ_n^+ in Equation (10) yields:

(11) θ_n = θ_{n−1} − γ_n H(θ_n, ξ_n).

We see that the iterate θ_n appears on both sides of Equation (11), and the resulting implicit update can be solved, in principle, since H is known analytically. The substitution of θ_n^+ with θ_n may naturally cause concerns about whether the stability properties of the proximal Robbins-Monro procedure carry over to the approximate implementation through the implicit procedure of Equation (11). All related work, which we summarize below, generally points to the same fact: the implicit procedure indeed remains stable, and shows superior performance to classical procedures, such as stochastic gradient descent, both in theory and in practice.
Specifically, one of the most popular applications of the procedure in Equation (11) is in iterative statistical estimation, where H(θ, ξ) = −∇ log ℓ(Y; X, θ), and ℓ corresponds to the likelihood of a random data point ξ = (Y, X) at parameter value θ. For example, if in Equation (11) we use H(θ_{n−1}, ξ_n) instead of H(θ_n, ξ_n), this amounts to classical stochastic gradient descent (SGD), which is widely popular in optimization and signal processing (Coraluppi & Young, 1969), and has been fundamental in modern machine learning with large data sets (Amari, 1998; Bottou, 2010; Bottou et al., 2016; Zhang, 2004). When we use the implicit update, as originally described in Equation (11), the resulting procedure is known as the incremental proximal method in optimization (Bertsekas, 2011), or as implicit stochastic gradient descent (ISGD) in statistics and machine learning. We refer readers to Bertsekas (2011) and Toulis and Airoldi (2017) for two complementary analyses of implicit SGD, including asymptotic and non-asymptotic errors; see also Bianchi (2016), Bianchi et al. (2018) and Salim et al. (2019) for related analyses and stronger theoretical results using monotone operator theory. One way to summarize these results on ISGD is through the following 'meta-theorem'.
Theorem 5 (Patrascu & Necoara, 2017; Ryu & Boyd, 2015; Toulis & Airoldi, 2017) Let δ_n = E(‖θ_n − θ*‖²), where θ_n is the n-th iterate of ISGD in Equation (11). Then, there exist constants A, B > 0, both increasing in γ_1, such that δ_n = O(n^{−A}) δ_0 + O(B n^{−1}).

There are two main results shown in meta-Theorem 5. First, we see the inherent trade-off in first-order stochastic procedures: a larger learning rate γ_1 helps to discount the bias term, O(n^{−A}) δ_0, faster, at the expense of a larger variance term, O(B n^{−1}), and vice versa. Second, we see that ISGD can navigate this trade-off in a numerically stable way, since the initial conditions, δ_0, cannot be amplified even when the learning rate is misspecified. In contrast, classical SGD has the same bias-variance trade-off, but the term in front of δ_0 involves additional quantities growing with n. This means that the initial conditions can be amplified arbitrarily when the learning rate is even slightly misspecified, leading to numerical divergence (Moulines & Bach, 2011, Theorem 1); for more detailed results and discussion, see (Toulis & Airoldi, 2017, Theorem 2.1), (Ryu & Boyd, 2015, Theorem 4), (Patrascu & Necoara, 2017, Theorem 12) and (Pătraşcu, 2020). Comparisons in statistical efficiency between ISGD, standard SGD and maximum likelihood estimators in regular models are also available in the literature.
To illustrate these stability advantages of ISGD, we present two examples from the literature: one on a linear normal model, where the theoretical assumptions in Section 3 of this paper hold, and another on a Poisson model, where the assumptions do not hold because the objective is non-Lipschitz. Recall the implicit update of Equation (11):

θ_n = θ_{n−1} − γ_n H(θ_n, ξ_n). (11)

| Example: Linear normal model
Let θ_* ∈ ℝ^p be the true parameters of a normal model, y|x ∼ N(x^⊤θ_*, σ²), where x ∈ ℝ^p and y ∈ ℝ. Let ξ = (y, x) denote one data point, and define H(θ, ξ) = −(y − x^⊤θ)x as above. Then, classical Robbins–Monro reduces to:

θ_n = θ_{n−1} + γ_n(y_n − x_n^⊤θ_{n−1})x_n = (I − γ_n x_n x_n^⊤)θ_{n−1} + γ_n y_n x_n. (12)

Procedure (12) is equivalent to classical SGD on the least-squares objective. It is also known as the least mean squares (LMS) filter in signal processing, or as the Widrow–Hoff algorithm (Widrow & Hoff, 1960). From Equation (11), the ISGD update for this problem can be solved in closed form; applying the Sherman–Morrison formula and simplifying, we obtain:

θ_n = θ_{n−1} + [γ_n/(1 + γ_n‖x_n‖²)](y_n − x_n^⊤θ_{n−1})x_n. (13)

Observe that Procedure (13) corresponds to Procedure (12) with the step size γ_n replaced by γ_n/(1 + γ_n‖x_n‖²). This important normalization of the step size is known as the normalized least mean squares (NLMS) filter in signal processing (Nagumo & Noda, 1967). From Equation (12), we see that it is crucial for classical SGD to have a well-specified learning rate parameter γ_1. For instance, assume fixed ‖x_n‖² = c² for simplicity; then, if γ_1 c² ≫ 1, the iterate θ_n of classical SGD will diverge to a value O(2^{γ_1 c²}/√(γ_1 c²)) (Toulis et al., 2014, for example). In contrast, a very large γ_1 will not cause divergence in ISGD, but will simply put more weight on the n-th observation, y_n x_n, as can be seen in Equation (13). Intuitively, from a statistical perspective, ISGD specifies an averaging of old and new information, weighing the estimate and the observation according to the inverse of the information, (1 + γ_n‖x_n‖²).
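The closed-form ISGD update can be verified numerically: the implicit equation defines a linear system in θ_n, and its solution coincides with the NLMS step. The following sketch (all numeric values are illustrative) checks this equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
theta_prev = rng.normal(size=p)     # current iterate theta_{n-1}
x = rng.normal(size=p)              # covariate x_n
y = 1.5                             # response y_n (any value works)
gamma = 10.0                        # deliberately large learning rate

# Implicit update of Equation (11) for the linear model: rearranging
# theta_n = theta_prev + gamma*(y - x'theta_n)*x gives the linear system
# (I + gamma * x x') theta_n = theta_prev + gamma * y * x.
A = np.eye(p) + gamma * np.outer(x, x)
theta_implicit = np.linalg.solve(A, theta_prev + gamma * y * x)

# Closed form via Sherman-Morrison: the NLMS step with normalized step size.
step = gamma / (1.0 + gamma * np.dot(x, x))
theta_nlms = theta_prev + step * (y - np.dot(x, theta_prev)) * x
```

The two updates agree to machine precision, and the normalized step is bounded by 1/‖x‖² regardless of how large γ is, which is the source of ISGD's stability here.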
The stability advantages of ISGD over classical SGD in the normal model are further illustrated in the numerical simulations of Section 5.

| Example: Poisson regression
Following the setup in Section 4.1.1, now let y|x ∼ Pois(e^{x^⊤θ_*}), where 'Pois' denotes the Poisson distribution. Then, the classical SGD procedure reduces to:

θ_n = θ_{n−1} + γ_n(y_n − e^{x_n^⊤θ_{n−1}})x_n, (14)

and the implicit SGD procedure for this problem is equivalent to:

θ_n = θ_{n−1} + γ_n(y_n − e^{x_n^⊤θ_n})x_n. (15)
The implementation of such an implicit update may seem intractable, but it is actually simple and efficient. The key observation is that the implicit update θ_n defined in Equation (15) is still of the form θ_n = θ_{n−1} + λx_n, for some scalar λ; the goal is therefore to identify this scalar parameter. Indeed, equating this expression of the parameter update with the expression in Equation (15) implies that λ solves the fixed-point equation

λ = f_n(λ), where f_n(λ) = γ_n(y_n − e^{x_n^⊤θ_{n−1} + λ‖x_n‖²}).

Since f_n is non-increasing, the fixed point lies in [0, f_n(0)] or [f_n(0), 0], depending on whether f_n(0) > 0 or f_n(0) < 0, respectively. The search can therefore be carried out efficiently via a generic root-finding procedure. Moreover, the idea extends to the family of generalized linear models, survival models based on the Cox proportional hazards model, and M-estimation (Toulis & Airoldi, 2017; Toulis et al., 2014).
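The root-finding step can be sketched as follows. This is an illustrative implementation, not the authors' code; the function g(λ) = λ − f_n(λ) is strictly increasing, so simple bisection on the bracket above suffices, and the test values are arbitrary.

```python
import numpy as np

def isgd_poisson_update(theta_prev, x, y, gamma, n_bisect=100):
    """One implicit SGD update for Poisson regression (Equation 15).

    Writing theta_n = theta_prev + lam * x, the scalar lam solves
    lam = f(lam), with f(lam) = gamma * (y - exp(x'theta_prev + lam * ||x||^2)).
    We bisect on g(lam) = lam - f(lam), which is strictly increasing."""
    u = float(x @ theta_prev)
    xx = float(x @ x)
    f = lambda lam: gamma * (y - np.exp(u + lam * xx))
    # Bracket the fixed point: [0, f(0)] if f(0) > 0, else [f(0), 0].
    lo, hi = (0.0, f(0.0)) if f(0.0) > 0 else (f(0.0), 0.0)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if mid - f(mid) < 0:   # g(mid) < 0: the root lies to the right
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return theta_prev + lam * x
```

The same bracketing argument applies to other generalized linear models with a monotone inverse link, which is what makes the implicit update practical in that family.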
Regarding stability, the classical SGD updates are extremely sensitive to the specification of γ_n, due to the non-Lipschitzness of the objective in the Poisson model, while implicit SGD is not as sensitive. For example, when we start at θ_0 = 0 and ‖x_1‖ = O(1), the first iterate of classical SGD is θ_1 = O(γ_1 y_1), so that the exponent in the following update is of order e^{γ_1 y_1}, which diverges arbitrarily fast as γ_1 increases. On the other hand, implicit SGD behaves very differently thanks to the implicit update in Equation (15): when γ_1 is small, such that γ_1 y_1 ≪ 1, then θ_1 is O(γ_1 y_1); but when γ_1 is large, θ_1 asymptotes to O(log y_1).
These stability advantages of ISGD over classical SGD in the Poisson model are further illustrated in the numerical experiments of Section 5.2.

| Non-observable ξ: Approximate implementation with proximal stochastic fixed points
In this section, we consider cases where we cannot observe directly the random component ξ of H(θ, ξ). As mentioned earlier, this includes cases where the analytic form of h(θ) or H(θ, ξ) is unknown, and may only be sampled through, say, a sequential experiment. We thus present an approximate implementation of the proximal Robbins-Monro procedure based on nested stochastic approximations that can be used without any auxiliary knowledge of the estimation problem. The nested procedure is in fact a proximal form of a fixed-point stochastic approximation procedure (Borkar, 2008, section 10.2), which, however, is run only for a finite number of steps. Section 6 illustrates the benefits of the nested procedure in quantile estimation.
To begin, we take expectations in the proximal Robbins–Monro iteration, so that the exact proximal update θ_n^+ satisfies

θ_n^+ = θ_{n−1} − γ_n E(H(θ_n^+, ξ)).

The key idea is then to treat θ_n^+ as the solution of E(θ − θ_{n−1} + γ_n H(θ, ξ)) = 0, and to solve this characteristic equation through a separate, standard stochastic approximation procedure. At every n-th iteration, we therefore run a Robbins–Monro procedure, w_k, for K steps as follows:

w_k = w_{k−1} − a_k(w_{k−1} − θ_{n−1} + γ_n H(w_{k−1}, ξ_k)), for k = 1, …, K, and set θ_n = w_K. (16)

TOULIS ET AL.
At first, it may seem that this procedure is affected by the same stability issues as classical stochastic approximation. However, the convergence result that follows shows that this is not the case. For intuition, note that for fixed n the sequence (w_k)_{k≥1} is a standard Robbins–Monro procedure applied to a different minimization problem:

min_θ { (1/2)‖θ − θ_{n−1}‖² + γ_n F(θ) }. (17)

What we gain, compared to applying the classical Robbins–Monro method to h directly, is that the objective function in Equation (17) is strongly convex, even when F is not. With this formulation, it is easy to verify that θ_n^+ is the solution of this optimization problem, so that w_k → θ_n^+. Therefore, the problem structure that we designed allows the application of explicit updates without compromising numerical stability. We illustrate this point in Section 6.

Theorem 6 Suppose that Assumptions 2, 4 and 3(c) hold. Then the proximal stochastic fixed-point procedure in Equation (16), with parameters γ_n = γ and a_k = 2a/K, such that e^{−a} < μ/L and K ≥ 2a(1 + γL)², satisfies an error bound in which the initial conditions are discounted by the factor C^n, where C = (1 + e^{−a}γL)/(1 + γμ) < 1.

Theorem 6 shows two key results. First, the initial conditions of the nested procedure are forgotten exponentially fast, at a rate which can be made arbitrarily close to (1 + γμ)^{−n}; this was also true of the idealized procedure. Second, an approximation error smaller than ε can be obtained by choosing n = O(log(1/ε)) and K = O(1/ε²), where K is the number of iterations of the inner procedure. Taken together, these choices imply a total number of gradient observations of order O((1/ε²) log(1/ε)). In comparison, under the same assumptions, the stochastic proximal point algorithm (Section 3) and the standard Robbins–Monro procedure achieve an approximation error smaller than ε using O(1/ε²) observations. Hence, the approximate implementation of Equation (16) incurs a small (logarithmic) overhead in the number of observations required to achieve a given level of accuracy. Experiments in Section 6, however, show that this overhead may be negligible in practice, and that the stability benefit of procedure (16) is preserved without sacrificing accuracy, even when restricted to run for the same amount of time as other procedures.
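The nested procedure can be sketched on the simplest possible problem: estimating the mean μ_0 of N(μ_0, 1), where H(θ, ξ) = θ − ξ and hence L = μ = 1. The inner recursion below is our reading of Equation (16), and all constants (a, γ, the number of outer iterations) are illustrative choices satisfying the conditions of Theorem 6, not values from the paper.

```python
import random

def nested_prox_rm(mu0=10.0, gamma=1.0, a=1.0, n_outer=200, seed=1):
    """Sketch of the nested stochastic fixed-point procedure (Equation 16)
    for estimating the mean mu0 of N(mu0, 1), where H(theta, xi) = theta - xi.
    Inner loop: w_k = w_{k-1} - a_k * (w_{k-1} - theta_{n-1} + gamma * H(w_{k-1}, xi_k))."""
    rng = random.Random(seed)
    K = int(2 * a * (1 + gamma) ** 2)   # K >= 2a(1 + gamma*L)^2 with L = 1
    a_k = 2 * a / K
    theta = 0.0                         # deliberately far initial condition
    for _ in range(n_outer):
        w = theta                       # warm-start the inner loop
        for _ in range(K):
            xi = rng.gauss(mu0, 1.0)
            w -= a_k * (w - theta + gamma * (w - xi))
        theta = w                       # theta_n = w_K
    return theta

estimate = nested_prox_rm()
```

Each outer step approximates the exact proximal update (θ + γμ_0)/(1 + γ), so the initial condition is contracted geometrically by 1/(1 + γ) per outer iteration, using only explicit inner updates.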
Remark 4.1 The proof of Theorem 6 is technically challenging due to the nested nature of the procedure. It requires carefully balancing the accumulation of approximation errors from the inner iteration jointly with the rate of convergence of the idealized procedure. To the best of our knowledge, there are no such non-asymptotic analyses of stochastic fixed-point procedures in the literature. The proof of Theorem 6 therefore applies novel techniques, which may be of general interest. We also note that convergence of the nested procedure when F is not strongly convex is an open question, which we leave for future work.

Remark 4.2 The nested nature of the procedure described in Equation (16) is reminiscent of the Catalyst scheme of Lin et al. (2015), which is a general acceleration technique for first-order optimization methods. Similar to the Catalyst scheme, our procedure (16) approximately computes a proximal update at each iteration. The key difference is that we analyse how to perform this approximate computation, whereas the Catalyst scheme assumes oracle access to it. Furthermore, the main focus of the Catalyst scheme is to achieve acceleration à la Nesterov with the use of a momentum term, while our focus is to analyse the stability of proximal updates.
In the following sections, we illustrate the use of the nested procedure of Equation (16) and the use of Theorem 6 through a simulated study on the normal model of Section 4.1.1, and the classical quantile estimation problem of Robbins and Monro (1951).

| SIMULATED STUDIES ON STABILITY
Here, we investigate empirically the stability of plug-in implementations of the proximal Robbins-Monro procedure presented in Section 4. Specifically, we present results for the normal linear model of Section 4.1.1, and the Poisson model of Section 4.1.2. Both experiments highlight the stability benefits of proximal Robbins-Monro procedures.

| Normal linear model
Our first simulation setting, with the normal linear model, is defined as follows. We define parameters θ_* ∈ ℝ^p, p = 6, such that θ_{*,j} = e^{−j}(−1)^j, for j = 1, …, 6; and y|x ∼ N(x^⊤θ_*, σ²), where x ∼ N_p(0, Σ) is a p-variate normal with Σ = 2I + uu^⊤, with u a column vector of p i.i.d. uniform random variables, U(0, 1); we set σ² = 4. We estimate θ_* recursively using SGD and ISGD, as introduced in Equations (12) and (13), respectively. We also use the stochastic fixed-point method of Equation (16) to estimate θ_*. To satisfy the conditions of Theorem 6, we set:

a = log(L/μ), K = 2a(1 + γ_1 L)², a_k = 2a/K. (18)

The parameters L, μ are estimated directly from the data (in the linear model, these values correspond to the maximum and minimum eigenvalues of Σ, respectively). Finally, the learning rate for all procedures is set as γ_n = γ_1/n across iterations, and we vary γ_1 in the experiment to check the sensitivity of the procedures to the specification of the learning rate.

As we vary γ_1, we replicate data sets of size N = 10,000 based on the above model setup, and run the three procedures for a total wall-clock time of 1 s. For every replication, we calculate the trajectory of the log mean squared error (log-MSE) of each procedure, m_{j,n} = log‖θ_{j,n} − θ_*‖², where θ_{j,n} is the n-th iterate of procedure j within the data replication. From this series, we are interested in two summary statistics. First, the 'mean level' of the log-MSE, which corresponds to the level around which the log-MSE 'settles'. To calculate this number, we fit an AR(1) model to the series m_{j,n} and compute the stationary mean b_0/(1 − b_1), where b_0 is the estimated intercept and b_1 is the estimated slope coefficient in the model. Second, we are interested in the maximum value, max_{i: t_{j,i} ≤ 1} m_{j,i}, of the log-MSE across iterations, where t_{j,i} is the wall-clock time until iteration i for method j. This acts as a proxy for the sensitivity of the procedure.

Figure 1 shows the results of this experiment. For each value of γ_1, Figure 1 (top) shows the boxplot of the mean-level log-MSE for each procedure. We see that each procedure behaves differently. Across all γ_1 values, ISGD performs best and also remains robust. The stochastic fixed-point procedure (SFP) starts

from its worst performance when γ_1 is small. However, as γ_1 increases, SFP keeps improving in MSE, although its variance increases as well. This can be explained by the SFP specification in Equation (18), where a larger γ_1 leads to a larger K. Since we keep the computation budget (measured in wall-clock time) fixed, this means that as γ_1 increases, SFP performs more inner iterations (larger K) but fewer outer iterations (smaller n). In contrast, classical SGD is the most unstable procedure. We see that its MSE steadily increases, while the maximum MSE (Figure 1, bottom) varies widely, as γ_1 increases. This experiment illustrates a key point of our paper: classical Robbins–Monro methods are sensitive to parameter specifications, while proximal Robbins–Monro methods, and even approximate implementations thereof, remain stable over a wide range of specifications.
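The mean-level summary statistic used here reduces to an ordinary least-squares fit of m_n on m_{n−1}. A minimal sketch with synthetic data (the AR(1) coefficients below are illustrative, not from the experiment):

```python
import random

def ar1_stationary_mean(series):
    """Fit m_n = b0 + b1 * m_{n-1} + noise by least squares and
    return the stationary mean b0 / (1 - b1)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0 / (1.0 - b1)

# Synthetic check: an AR(1) series with intercept 2 and slope 0.5
# settles around the stationary mean 2 / (1 - 0.5) = 4.
rng = random.Random(0)
m = [0.0]
for _ in range(2000):
    m.append(2.0 + 0.5 * m[-1] + rng.gauss(0.0, 0.05))
est = ar1_stationary_mean(m)
```

This summary is preferable to a plain average of the series because it discounts the initial transient before the log-MSE settles.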

| Poisson model
In this section, we investigate empirically the stability of plug-in implementations of proximal Robbins–Monro for the Poisson model of Section 4.1.2. This model has a non-Lipschitz likelihood, and so it is not covered by our theory. The experiment is therefore meant to test the robustness of our theory and methods when our working assumptions do not hold.

Our simulation setup is similar to that of the previous section. As before, we consider the true parameter vector θ_* ∈ ℝ^p, p = 6, such that θ_{*,j} = e^{−j}, for j = 1, …, 6; and y|x ∼ Pois(e^{x^⊤θ_*}), where each component of x takes the values {0, 1, 2, 3} with probabilities {0.4, 0.4, 0.15, 0.05}, respectively. We estimate θ_* recursively using SGD and ISGD, as introduced in Equations (12) and (13), respectively. We also employ the stochastic fixed-point method of Equation (16) to estimate θ_*. As before, we set a = log(L/μ), K = 2a(1 + γ_1 L)², a_k = 2a/K. The parameters L, μ are estimated directly from simulated data. As expected, the estimated L is larger in the Poisson model than in the normal model, and the estimated value increases with more observations. This leads to larger values of K, and fewer outer iterations for the stochastic fixed-point procedure. The learning rate for all methods is set as γ_n = γ_1/n across iterations. As we vary γ_1, we replicate data sets of size N = 10,000 based on the above model setup, and run each of the three procedures for a total wall-clock time of 1 s.

Figure 2 shows the results of this experiment with the Poisson model. For each value of γ_1, Figure 2 (top) shows the boxplot of the mean-level log-MSE, as defined in the previous section. Since the standard SGD method is extremely unstable in this model and regularly diverges, Figure 2 reports results only for the two proximal procedures, namely ISGD and the stochastic fixed-point procedure (SFP). We see that both ISGD and SFP are clearly more stable than classical SGD.
This experiment suggests that proximal Robbins–Monro methods, including the approximate implementations discussed in this paper, are generally more stable than their classical counterparts, even when our theoretical assumptions do not hold. Specifically, in this example the likelihood is non-Lipschitz, and classical SGD diverges even under slight misspecifications of the learning rate. While ISGD is known to be stable in generalized linear models, it is remarkable that the SFP procedure, initialized only with the theoretical values suggested by Theorem 6 and without any fine-tuning, is stable in this highly non-linear model as well.

| APPLICATION: ITERATIVE QUANTILE ESTIMATION
In their seminal paper, Robbins and Monro (1951) applied stochastic approximations to iterative quantile estimation. In this problem, ξ corresponds to a sample drawn from a distribution with cumulative distribution function Q, and the goal is to estimate θ_* such that Q(θ_*) = α, for given α ∈ (0, 1). A relevant application from medicine and toxicology is the estimation of the dose that is lethal to 50% of experimental subjects, known as LD50 (Grieve, 1996).
In more detail, consider a random variable ξ with cumulative distribution function Q. An experimenter wants to find the point θ_* for which Q(θ_*) = α, for some fixed α ∈ (0, 1). Let h(θ) = Q(θ) − α. The experimenter cannot observe ξ directly, but only has access to the outcome of an experiment, 1{ξ ≤ θ}, for any value of θ specified in the experiment. Robbins and Monro (1951) showed that the iterative procedure

θ_n = θ_{n−1} − γ_n H(θ_{n−1}, ξ_n), where H(θ, ξ) = 1{ξ ≤ θ} − α, (19)

converges to a point θ_∞ for which E(H(θ_∞, ξ)) = 0. Consequently, it solves the characteristic equation E(1{ξ ≤ θ_∞}) − α = Q(θ_∞) − α = 0. By monotonicity of Q, we obtain θ_∞ = θ_*. Despite this theoretical convergence, the numerical stability of the Robbins–Monro procedure is challenged by the following result.
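In a well-specified regime the classical procedure works as advertised. The sketch below runs Equation (19) for a standard normal ξ with a moderate quantile level and learning rate (both values are illustrative choices, not from the paper, where a hard regime with α = 0.999 is studied next).

```python
import random

def rm_quantile(alpha, gamma1, n_iter, seed=0):
    """Classical Robbins-Monro for the alpha-quantile of a standard normal:
    theta_n = theta_{n-1} - (gamma1/n) * (1{xi_n <= theta_{n-1}} - alpha)."""
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, n_iter + 1):
        xi = rng.gauss(0.0, 1.0)
        theta -= (gamma1 / n) * ((1.0 if xi <= theta else 0.0) - alpha)
    return theta

# Well-specified case: alpha = 0.7, whose true normal quantile is ~0.524.
est = rm_quantile(alpha=0.7, gamma1=3.0, n_iter=200_000)
```

Note that only the binary outcome 1{ξ ≤ θ} is used at each step, which is exactly the experimental setting described above; Proposition 1 then shows how badly this procedure can behave once α is extreme and γ_1 is large.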
Proposition 1 Assume that θ_0 < θ_* and θ_0 + γ_1α > θ_*. Then, for any ε > 0 such that θ_0 + γ_1α > θ_* + ε, with probability 1 − Q(θ_0) the number of iterations N of procedure (19) required to approximate θ_* within accuracy ε is lower-bounded as

N ≥ exp( (θ_0 + γ_1α − θ_* − ε) / (γ_1(1 − α)) ).

Proof With probability 1 − Q(θ_0), the first iterate of Equation (19) is θ_1 = θ_0 + γ_1α > θ_*, where the inequality holds by assumption. Conditioned on this event, the progress in each subsequent iteration, θ_{n−1} − θ_n, is upper-bounded by γ_n(1 − α) with probability 1, as long as θ_n > θ_*. Summing these maximal decreases gives θ_N ≥ θ_1 − (1 − α)∑_{n=2}^N γ_n ≥ θ_1 − γ_1(1 − α) log N, and requiring θ_N ≤ θ_* + ε yields the stated bound. □

Proposition 1 essentially shows that there are values of the learning rate parameter γ_1 and of the initial estimate θ_0 for which the classical Robbins–Monro procedure may remain stuck indefinitely. For example, let Q be the standard normal distribution function, and let α = 0.999, so that θ_* = 3.09 is the solution.
Suppose also that γ_1 = Q′(θ_*)^{−1} ≃ 297, which is the learning rate value suggested by standard theory (Nemirovski et al., 2009). Let θ_0 = −10 and suppose that H(θ_0, ξ_1) = −α. It follows that θ_1 = θ_0 + γ_1α ≈ 287. From there, the Robbins–Monro procedure makes progress by at most γ_i(1 − α) ≃ (297/i) · 10^{−3} at each step. Thus, the number of iterations required to return from θ_1 to a region near θ_* is of the order of e^{956}. In other words, the procedure gets stuck at large values of θ, where the derivative of the objective is negligible.
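The back-of-the-envelope arithmetic of this example can be checked directly; the exponent in the lower bound of Proposition 1 is computed below (we do not exponentiate it, since e^{955} overflows double precision).

```python
alpha = 0.999
gamma1 = 297.0      # ~ 1 / Q'(theta_star) for the standard normal
theta0 = -10.0
theta_star = 3.09   # 0.999-quantile of the standard normal

# One bad first draw (H(theta_0, xi_1) = -alpha) sends the iterate to:
theta1 = theta0 + gamma1 * alpha

# The maximal per-step decrease is gamma1*(1-alpha)/i, so the harmonic sum
# gamma1*(1-alpha)*log(N) must cover theta1 - theta_star, giving
# log(N) >= (theta1 - theta_star) / (gamma1 * (1 - alpha)).
exponent = (theta1 - theta_star) / (gamma1 * (1.0 - alpha))
```

The computed exponent is roughly 955, matching the e^{956} figure quoted above up to the ε slack.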
This numerical example illustrates that a misspecification of 1 can dramatically amplify the initial conditions in classical stochastic approximation, and affect convergence. It is therefore interesting to investigate whether the proximal Robbins-Monro method offers an improvement.

| Stability of the proximal stochastic fixed points
In the context of quantile estimation, the stochastic fixed-point procedure of Equation (16) can be written as follows:

w_k = w_{k−1} − a_k(w_{k−1} − θ_{n−1} + γ_1 H(w_{k−1}, ξ_k)), for k = 1, …, K, with θ_n = w_K, (21)

where H(θ, ξ) = 1{ξ ≤ θ} − α, γ_n = γ_1, a_k = 2a/K, and γ_1, a and K are constants to be defined.
Before presenting our numerical experiments, we discuss intuitively why the nested procedure in Equation (21) improves upon the classical Robbins–Monro method in Equation (19), and how to define the constants according to Theorem 6. We address these two issues in turn. First, consider the idealized case where K = ∞. In this case, the iteration in Equation (21) converges to the solution of the fixed-point equation

w = θ_{n−1} − γ_1(Q(w) − α),

and the next iterate is simply defined as θ_n = w_∞. It is easy to verify the stability of this fixed point. For example, if θ_{n−1} < θ_*, then θ_{n−1} < θ_n < θ_*; and, conversely, if θ_{n−1} > θ_*, then θ_* < θ_n < θ_{n−1}. That is, the idealized procedure with K = ∞ always pulls back in the right direction, towards θ_*, and thus always makes progress towards the global solution. Convergence is also extremely fast, as shown in the proof of Theorem 6. To illustrate numerically, consider the example of the previous section, where the classical Robbins–Monro procedure did not converge. Using the same numbers, at the first iteration the idealized procedure computes θ_1 as the solution of θ + 297·Q(θ) = −10 + 297 × 0.999, which gives θ_1 ≈ 1.74; if we keep iterating, the idealized procedure will be 0.01-close to θ_* by the hundredth iteration. This is a vast improvement over the classical Robbins–Monro method, which remains stuck practically forever.
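The idealized (K = ∞) step is a one-dimensional root-finding problem in a strictly increasing function, so it can be sketched with plain bisection; the bracket width below is an arbitrary illustrative choice.

```python
import math

def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def prox_quantile_step(theta_prev, gamma, alpha, lo=-50.0, hi=50.0):
    """Idealized proximal Robbins-Monro step for quantile estimation:
    solve theta + gamma * Phi(theta) = theta_prev + gamma * alpha by bisection
    (the left-hand side is strictly increasing in theta)."""
    target = theta_prev + gamma * alpha
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid + gamma * Phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# First idealized step in the hard regime of the previous section.
theta1 = prox_quantile_step(theta_prev=-10.0, gamma=297.0, alpha=0.999)
```

Starting from θ_0 = −10 with γ_1 = 297, this returns θ_1 ≈ 1.74 in a single step, in contrast to the classical iterate θ_1 ≈ 287; iterating the map converges rapidly to θ_* ≈ 3.09.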
Second, consider the actual nested procedure in Equation (21), where K is finite. Theorem 6 shows that the procedure maintains the convergence and stability properties of the idealized procedure under certain assumptions. The assumptions can be greatly simplified here by noting that, for the normal distribution, the probability density function is upper-bounded. Hence, L ≤ 1, and Theorem 6 suggests the choice of hyperparameters given in Equation (22). Note in particular that this choice satisfies K ≥ 2a(1 + γ_1 L)², as required by the theorem. The constants can be defined in a similar manner for arbitrary distributions, given an upper bound on the probability density function. Next, we evaluate numerically the (approximate) proximal Robbins–Monro procedure resulting from this choice of hyperparameters.

| Numerical evaluation
Here, we conduct a numerical evaluation of our proposed nested procedure in Equation (21), using the parameter settings of Equation (22), and compare it with the classical Robbins–Monro procedure in Equation (19). For a fair comparison, both methods run for a total of 1 s of wall-clock time. As the iteration complexity is similar for both methods, the classical Robbins–Monro procedure runs for N iterations, whereas the stochastic fixed-point procedure runs for roughly N/K outer iterations of K inner iterations each. This way, the total number of random samples (gradient observations) used by our procedure is similar to that used by the classical procedure.
As mentioned before, Q is here the cumulative distribution function of the standard normal, α = 0.999 and θ_0 = −10. The quantity to be estimated is θ_* ≈ 3.09, for which Q(θ_*) = α. For different values of γ_1, we compare the Robbins–Monro procedure to our proposed fixed-point procedure in Equation (21), with K = 50. For each value of γ_1, the experiment is replicated 100 times, and we report the average of all final estimates from both procedures. The results of this experiment are shown in Figure 3.
In the left plot, we observe that the classical Robbins–Monro procedure indeed suffers from numerical instability. In particular, as predicted by Proposition 1, when γ_1 increases beyond θ_* − θ_0 ≃ 13.1, the iterates overshoot and remain virtually stuck for all subsequent iterations. In fact, there is only a small range of values for γ_1 (visually, between 11 and 15) for which γ_1 is large enough to allow convergence, yet small enough to prevent the aforementioned numerical instability. Not shown in the figure, the estimates from the Robbins–Monro procedure are negative for very small learning rates; for example, when γ_1 = 0.1 the average estimate is −8.8. This is close to the starting point, θ_0 = −10, and indicates that the classical procedure makes little progress when the learning rate is very small. Overall, these results show that classical Robbins–Monro approximations are extremely sensitive to the specification of the learning rate.
The results for proximal Robbins–Monro, as approximately implemented by the fixed-point procedure of Equation (21), are drastically different. In the left subplot of Figure 3, we see that proximal Robbins–Monro neither overshoots nor undershoots, in contrast to the classical procedure. The proximal procedure maintains remarkable numerical stability across the entire range of learning rate values. The procedure is also statistically efficient, in that the final iterates are centred around the true value (red dashed line) with small variance. This is shown more clearly in the right subplot of Figure 3, which focuses only on the estimates of proximal Robbins–Monro. We note that a slight bias exists for very small or very large values of the learning rate; for example, the average parameter estimate is roughly 3.04 when γ_1 = 0.1. The bias goes away, however, with increased sample sizes.

We emphasize again that, similar to the simulation studies of Section 5, the stochastic fixed point procedure is implemented in a fully data-driven way, by choosing its parameters using Equation (22), as prescribed by Theorem 6.

| CONCLUDING REMARKS
The theoretical and empirical results presented in this paper point to key advantages of the proposed proximal Robbins-Monro procedure, as defined in Equation (5), over the classical procedure of Robbins-Monro. One such advantage is numerical stability. Our theoretical analysis showed that such stability is obtained without sacrificing convergence or efficiency. However, the proposed method is idealized because it can only be approximately implemented.
While in this paper we propose two approximate implementations that work well in general settings, several open questions remain. First, although the implicit stochastic gradient methods described in Equation (11) are easy to implement in a wide class of models (e.g. generalized linear models, M-estimation), their application to large-scale non-convex settings, such as neural networks, has only started to emerge (Fagan & Iyengar, 2018). In this context, the stability of proximal Robbins–Monro approximations appears to be beneficial, as predicted by the theory in this paper. More work needs to be done, however, to analyse these settings theoretically, and to leverage the added flexibility in designing the learning rate sequence.

FIGURE 3 Left: last iterates of classical Robbins–Monro (RM), Equation (19), and proximal RM, Equation (21); averages of last iterates for RM and prox-RM are indicated as triangles and circles, respectively. Each procedure runs for a total of 1 s of wall-clock time. Right: zoom in to proximal RM (note the different scale on the y-axis). The left plot is in log-scale. The dashed horizontal line depicts the true value, θ_* = 3.09. Both procedures start from θ_0 = −10, and prox-RM is implemented following Equation (22). We see that prox-RM is more stable to the specification of γ_1 than classical RM

Second, extending the scope of nested, fixed-point implementations of proximal Robbins–Monro, as in Equation (16), is interesting, especially because the procedure can operate even when only samples from the objective are available. This introduces minimal modelling assumptions, which may be desirable in many settings, such as econometric models or the sequential experimentation of clinical trials. It is also an open question whether the substantive results of the quantile estimation example of Robbins and Monro presented in Section 6.1 extend to broader applications and domains. We provided positive empirical evidence in the simulations of Section 5, and conjecture that this holds true more generally.