Polynomial‐time universality and limitations of deep learning

The goal of this paper is to characterize function distributions that general neural networks trained by descent algorithms (GD/SGD) can or cannot learn in polytime. The results are: (1) The paradigm of general neural networks trained by SGD is poly-time universal: any function distribution that can be learned from samples in polytime can also be learned by a poly-size neural net trained by SGD with polynomial parameters. In particular, this can be achieved despite polynomial noise on the gradients, implying a separation result between SGD-based deep learning and statistical query algorithms, as the latter are not comparably universal due to cases like parities. This also shows that deep learning does not suffer from the limitations of shallow networks. (2) The paper further gives a lower-bound on the generalization error of descent algorithms, which relies on two quantities: the cross-predictability, an average-case quantity related to the statistical dimension, and the null-flow, a quantity specific to descent algorithms. The lower-bound implies in particular that for functions of low enough cross-predictability, the above robust universality breaks down once the gradients are averaged over too many samples (as in perfect GD) rather than fewer (as in SGD). (3) Finally, it is shown that if larger amounts of noise are added on the initialization and on the gradients, then SGD is no longer comparably universal due again to distributions having low enough cross-predictability.


Context and contributions
It is known that the class of neural networks (NNs) with polynomial network size can express any function that can be implemented in a given polynomial time [28,33], and that their sample complexity scales polynomially with the network size [3]. Thus, NNs have favorable approximation and estimation errors. However, there is no known efficient training algorithm for NNs with general and provable guarantees; in particular, it is NP-hard to implement the ERM rule [9,21]. The success of deep learning rests on training deep NNs with stochastic gradient descent or the like, which gives record performances in various applications [15,18,22-24]. It is thus natural to ask whether SGD can also control efficiently the third pillar of statistical learning, that is, the optimization error, turning deep learning into a universal learning paradigm that can learn efficiently any efficiently learnable class; see [30] for further discussions on this question. This paper answers this question in the affirmative when enough freedom is given to the types of neural networks that can be used (still with polynomial size and with a valid polytime initialization). The following contributions and implications are obtained: (1) It is shown that poly-networks, that is, poly-size neural nets initialized in polytime, trained by SGD with poly-many steps can learn any function class that is learnable by an algorithm that runs in polytime and with poly-many samples; see Theorem 2.5. This part is obtained using a net initialization that is implemented in polytime (and not dependent on the function to be learned nor the data) and that emulates with SGD any efficient learning algorithm. This shows in particular that SGD-based deep learning is P-complete: any algorithm in P can be reduced to training with SGD a neural net initialized in polytime with a proper non-linearity and evaluating the net (see Remark 2.10).
(2) We further show that this positive result is achieved with robustness: polynomial noise can be added to the gradients and weights can be of polynomial precision and the result still holds; see Theorem 2.8. Therefore, in a learning theoretic sense, deep learning gives a universal learning paradigm: approximation, estimation and also optimization errors are all controllable with polynomial parameters, and this can be implemented with all algorithm parameters being polynomial. This also creates a separation between deep learning and statistical query algorithms, as the latter are not comparably universal due to cases like parities [19]. (3) Parities have been known to be challenging since the work of Minsky-Papert on the perceptron [25], and our positive result indeed requires more than a single hidden layer to succeed, that is, O(log n) layers^1 (see Example 4.1). In particular, our universality result together with [29] implies that there exist function classes that require large enough nets to be learned with SGD under memory constraints: we know from [29] that a net with O(n^2 / log(n)) edges of polynomial precision^2 cannot learn parities with poly-many samples and thus with SGD in polytime (even though one can represent parity functions with such a size and depth 2), but our result shows that a net of size O(n^2) and O(log n) layers can learn parities with SGD in polytime. (4) A lower-bound is derived for descent algorithms on neural nets that shows that learning is impossible with polynomial precision if the null-flow does not overcome the cross-predictability; see definitions in Section 2 and Theorem 2.17. The cross-predictability corresponds to an inverse average-case notion of statistical dimension that is classical for SQ algorithms [5,6,10,19], and the null-flow is a quantity that is specific to descent algorithms.
This shows in particular that the above robust universality does not hold when replacing the stochastic gradients with perfect gradients on the entire population distribution (or large enough averages), in alignment with the results from SQ algorithms [5,19]. Therefore, some small amount of stochasticity^3 is needed to obtain the robust universality in our setting with polynomial noise. The difference is that SGD lets us get the details of single samples, which is needed for algorithms like Gaussian elimination on parities. If instead one uses GD on the entire population, the average over all possible samples mostly cancels out in cases like parities, so that any polynomial amount of noise drowns out whatever signal is left from the target parity function. Furthermore, the null-flow also gives a tool for deriving lower-bounds that are specific to gradient descent algorithms. (5) Finally, we show that if a large enough amount of noise is added on the initialization and the gradients, then SGD is no longer universal as it also fails at learning parities; see Theorem 2.24.
In a practical setting, there may be no reason to use our SGD replacement of a general learning algorithm, but this universality result emphasizes the breadth of deep learning in a computational learning context where neural networks are broadly defined, and the fact that negative results specific to deep learning cannot be obtained without further constraints on these networks.

Setting
We focus on Boolean functions to simplify the setting. Since it is known that any Boolean function that can be computed in time O(T(n)) can also be expressed by a neural network of size O(T(n)^2) [28,33], it is not meaningful to ask whether any such function f_0 can be learned with a poly-size NN and a descent algorithm that can depend on f_0; one can simply pre-set the net to express f_0. Two more meaningful questions are: (1) Can one learn a given function with an agnostic/random initialization? (2) Can one learn an unknown function from a class or distribution with a proper initialization? For the second question, one is not given a specific function f_0 but a class of functions, or more generally, a distribution on functions. Therefore, one can no longer preset the net as desired in an obvious way. We focus here mainly on question 2, which is classical in statistical learning [30], and which gives a more general framework than restricting the initialization to be random. Moreover, in the case of symmetric function distributions, such as the parities discussed below, failure at 2 implies failure at 1. Namely, if we cannot learn a parity function for a random selection of the support (see definitions below), we cannot learn any given parity function on a typical support S_0 with a random initialization of the net, because the latter is symmetrical.

^1 One can reduce the number of layers by using threshold gates of arbitrary fan-in in the computation component; see Section 4.
^2 It remains interesting to investigate the regime with sub-polynomial noise and super-polynomial quantization, which can be implemented by rounding gradients with polynomial memory.
^3 The stochasticity of SGD has also been advocated in contexts such as stability, implicit regularization or to avoid bad critical points [16,20,27,39].

We thus have the following setting:
• Let B = {+1, −1} and X = B^n be the data domain and let Y = {+1, −1} be the label domain.^4
(^4 We work with binary vectors and binary labels for convenience; several of the results extend beyond this setting with appropriate reformulations of the definitions.)
• Let P_X be a probability distribution on the data domain X and P_F be a probability distribution on Y^X (the set of functions from X to Y). We also assume for convenience that these distributions lead to balanced classes, that is, that P(F(X) = 1) = 1/2 + o(1) when (X, F) ∼ P_X × P_F (non-balanced cases require adjustments of the definitions).
• Our goal is to learn a function F drawn under P_F by observing labeled examples (X, Y) with X ∼ P_X, Y = F(X).
• In order to learn F we can train our algorithm on labeled examples with a descent algorithm starting with an initialization W^(0) and running for a number of steps T = T(n) (other parameters of the algorithm such as the learning rate are also specified). In the case of perfect GD, each step accesses the full distribution of labeled examples, while for SGD, it only accesses a single labeled example per step (see definitions below). In all cases, after the training with (W^(0), T), the algorithm produces an estimator F̂_{W^(0),T} of F. We say that an algorithm achieves an accuracy of acc in T time steps for the considered (P_X, P_F), if a net with initialization W^(0) can be constructed such that
P{ F̂_{W^(0),T}(X) = F(X) } ≥ acc,
where the above probability is over (X, F) ∼ P_X × P_F and any randomness potentially used by the algorithm. We refer to typical-weak learning when acc = 1/2 + Ω(1), in other words, when we can predict the label of a fresh sample from P_X with accuracy strictly better than random guessing.
Failing at typical-weak learning implies failing at most other learning requirements, such as PAC learning a class for the case of a uniform distribution on a certain class of functions [6,26]. For our positive results with SGD, we will not only show that one can efficiently typically weakly learn any function distribution that is efficiently typically weakly learnable, but that we can in fact reproduce whatever accuracy an algorithm can achieve for the considered distribution. We also shorten 'typical-weak learning' to simply 'learning' and talk about learning a 'function distribution' or a 'distribution' when referring to learning a pair (P_X, P_F). A running example is that of parities: P_X is uniform on {+1, −1}^n and P_F is uniform on the parity functions x ↦ ∏_{i∈S} x_i for S ⊆ [n]. So nature picks S uniformly at random, and with knowledge of P_F but not S, the problem is to learn which set was picked from samples (X, χ_S(X)).
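The parity learning problem above can be made concrete with a short sketch of its sampling process (illustrative Python; the helper name `sample_parity_instance` is our own, not the paper's notation):

```python
import random

def sample_parity_instance(n, num_samples, rng=random):
    """Nature picks a hidden support S uniformly among subsets of [n],
    then emits i.i.d. uniform inputs in {+1,-1}^n labeled by the parity
    chi_S(x) = prod_{i in S} x_i (labels in {+1,-1})."""
    S = [i for i in range(n) if rng.random() < 0.5]  # uniform subset of [n]
    samples = []
    for _ in range(num_samples):
        x = [rng.choice([+1, -1]) for _ in range(n)]
        y = 1
        for i in S:
            y *= x[i]
        samples.append((x, y))
    return S, samples
```

The learner sees only the samples, not S; guessing the label of a fresh input with probability 1/2 + Ω(1) is exactly typical-weak learning here.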

Definitions and models
We use a fairly generic notion of neural nets, simply weighted directed acyclic graphs with a special vertex for the output, a special set of vertices for the inputs, and a non-linearity at the other vertices.

Definition 2.1.
A neural net is defined by a pair of a non-linearity function σ: ℝ → ℝ and a weighted directed graph G with some special vertices and the following properties. G does not contain any cycle and there exists n > 0 such that G has exactly n + 1 vertices that have no edges ending at them, v_0, v_1, …, v_n. We refer to n as the input size, v_0 as the constant vertex and v_1, v_2, …, v_n as the input vertices. Further, there exists a vertex v_out such that for any other vertex v′, there is a path from v′ to v_out in G. We also denote by W(G) the weights on the edges of G. We denote by eval_{(G,σ)}(x) the evaluation of the neural net (G, σ) at an input x (or eval_G(x) if σ is implicit). We also sometimes use a shortcut notation: for a neural net with a set of weights W, we sometimes use W(x) for eval_W(x).
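Evaluation of such a DAG net can be sketched by memoized recursion from the output vertex (illustrative Python; the name `eval_net` and the edge-list representation of G are our own assumptions):

```python
from collections import defaultdict

def eval_net(edges, weights, sigma, inputs, out):
    """Evaluate a DAG neural net as in Definition 2.1 (sketch).
    edges: list of (u, v) pairs; weights: dict mapping (u, v) -> float;
    inputs: dict mapping input-vertex name -> value (the constant vertex
    'v0' is fixed to 1); out: name of the output vertex. Every non-input
    vertex applies sigma to the weighted sum of its in-neighbors."""
    preds = defaultdict(list)
    for (u, v) in edges:
        preds[v].append(u)
    memo = dict(inputs)
    memo["v0"] = 1.0  # constant vertex

    def value(v):
        if v in memo:
            return memo[v]
        s = sum(weights[(u, v)] * value(u) for u in preds[v])
        memo[v] = sigma(s)
        return memo[v]

    return value(out)
```

Acyclicity guarantees the recursion terminates, and the path condition guarantees every vertex influences the output.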
Remark 2.2. Note that we do not specify the non-linear activations used. Our negative results hold for any choice of activations, and our positive results hold for the specific activation given in Section 4. However, we believe that it is possible to derive the positive results with more standard activations like sigmoids, using parts of the sigmoids that are 'sufficiently flat' as a replacement for exact flat regions (taking into account the magnitude of the noise). We believe that the same holds for ReLUs, which already possess flat regions, but complications would arise from the fact that the ReLU has a single flat region. These would require significant additional technical work that is currently alleviated by our choice of activation.
Note that as we have defined them, neural nets generally give outputs in ℝ rather than {0, 1}. As such, when talking about whether training a neural net by some method learns Boolean functions, we will implicitly be assuming that the output of the net on the final input is thresholded at some predefined value or the like. None of our results depend on exactly how we deal with this part.

Definition 2.3. Let n > 0, acc ∈ [0, 1], P_X be a probability distribution on {0, 1}^n, and P_F be a probability distribution on the set of functions from {0, 1}^n to {0, 1}. Also, let X_0, X_1, … be independently drawn from P_X and F ∼ P_F. An algorithm learns (P_X, P_F) with accuracy acc in T time steps if the algorithm is given the value of (X_t, F(X_t)) for each t < T and, when given the value of X ∼ P_X drawn independently of (X_0, …, X_{T−1}), it returns Y such that P(F(X) = Y) ≥ acc.
Remark 2.4. Due to the way our neural nets are designed, they will actually have inputs and labels in {±1} instead of {0, 1}. As such, whenever we talk about using a neural net to learn a function, we will be implicitly assuming that the Boolean values are translated appropriately.
Algorithms such as SGD (or Gaussian elimination from samples) fit under this definition. For SGD, the algorithm starts with an initialization W^(0) of the neural net weights, and updates it sequentially with each sample (X_t, F(X_t)) as W^(t) = h(X_t, F(X_t), W^(t−1)), where h(X_t, F(X_t), W^(t−1)) = W^(t−1) − γ ∇_W L(eval_{W^(t−1)}(X_t), F(X_t)), t < T. It then outputs F̂ = eval_{W^(T−1)}. For SGD with batch-size m and fresh samples, one has to update the previous definition with not a single sample at each time step but m i.i.d. samples at each time step, computing the empirical average of the query. The extreme case of perfect-GD corresponds to m being infinity, that is, an expectation over the entire population.
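The single-sample and batch-averaged updates above can be sketched as follows (illustrative Python; `grad_loss` stands for the per-sample gradient ∇_W L, and the function names are our own):

```python
def sgd_step(w, x, y, grad_loss, lr):
    """One SGD step on a single labeled example: w <- w - lr * grad."""
    g = grad_loss(w, x, y)
    return [wi - lr * gi for wi, gi in zip(w, g)]

def batch_gd_step(w, batch, grad_loss, lr):
    """GD step with batch size m = len(batch): average the per-sample
    gradients. As m grows, this approaches the population expectation,
    i.e., the perfect-GD update."""
    m = len(batch)
    avg = [0.0] * len(w)
    for (x, y) in batch:
        g = grad_loss(w, x, y)
        avg = [a + gi / m for a, gi in zip(avg, g)]
    return [wi - lr * gi for wi, gi in zip(w, avg)]
```

For instance, with the squared loss on a linear model, `grad_loss(w, x, y)` returns `[2 * (⟨w, x⟩ − y) * x_i for each i]`.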
So GD proceeds successively with the following (F, P_X)-dependent updates W^(t) = E_{X∼P_X} h(X, F(X), W^(t−1)) for t < T, for the same function h as in SGD. We also consider noisy versions of the above, to ensure that the algorithm is not succeeding due to infinite precision but with robustness. We use adversarial (i.e., more powerful) noise in our 'perturbed' SGD algorithm for the positive results, and statistical noise for our 'noisy' SGD/GD algorithms in our negative results. We now give formal definitions of these algorithms.
NoisySGDAlgorithm(G, W_0, γ, σ, B, T, …):
(1) Set W_0 to the given initialization.
(2) If any of the edge weights in W_0 are less than −B, set all such weights to −B. If any of the edge weights in W_0 are greater than B, set all such weights to B.
(3) For each 0 < t < T, set W_t = NoisySGDStep(G, W_{t−1}, X_t, F(X_t), γ, σ, B).

NoisySGDStep(G, W, X, Y, γ, σ, B):
(1) For each edge e of G with weight w_e, compute w′_e = w_e − γ (∂/∂w_e L(eval_W(X), Y) + Z_e), where the Z_e are i.i.d. noise variables of variance σ^2.
(2) Return the graph that is identical to G except that its edge weights are given by the w′_e.

NoisyGDAlgorithm(G, W_0, P_X, F, γ, σ, Δ, T):
(1) Set W_0 to the given initialization (clipped to [−B, B] as above).
(2) For each 0 < t < T, set W_t = NoisyGDStep(G, W_{t−1}, P_X, F, γ, σ).

NoisyGDStep(G, W, P_X, F, γ, σ):
(1) For each edge e of G with weight w_e, compute w′_e = w_e − γ (E_{X∼P_X} ∂/∂w_e L(eval_W(X), F(X)) + Z_e), where the Z_e are i.i.d. noise variables of variance σ^2.
(2) Return the graph that is identical to G except that its edge weights are given by the w′_e.
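A minimal sketch of the weight clamping and the noisy per-edge update (illustrative Python with Gaussian noise, as in the statistical-noise variant; the names and the dict representation of the weights are our own assumptions):

```python
import random

def clamp_weights(w, B):
    """Project initial edge weights into [-B, B], as in the algorithm's
    initialization step."""
    return {e: max(-B, min(B, we)) for e, we in w.items()}

def noisy_sgd_step(w, grads, lr, noise_std, rng):
    """One noisy step: each edge weight moves against its gradient on the
    current sample, plus independent Gaussian noise on every coordinate.
    grads: dict edge -> dL/dw_e on the current sample (assumed given)."""
    return {e: w[e] - lr * (grads[e] + rng.gauss(0.0, noise_std))
            for e in w}
```

Setting `noise_std = 0` recovers plain SGD, which is how the noisy definitions generalize the ones above.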

Positive results
Our first result shows that for any distribution that can be learned by some algorithm in polytime, with poly-many samples and with accuracy acc, there exists an initialization (which means a neural net architecture with an initial assignment of the weights) that is constructed in polytime and that is agnostic to the function to be learned, such that training this neural net with SGD and possibly poly-noise learns this distribution in poly-steps with accuracy acc − o(1).

Theorem 2.5. Let (P_X, P_F) be learnable with accuracy acc by an algorithm running in polynomial time with polynomially many samples. Then there exist a learning rate γ and a polynomial-size neural net (G, W), constructible in polynomial time, such that using stochastic gradient descent with learning rate γ to train (G, W) on samples ((X, R, R′), F(X)), where (X, R, R′) ∼ P_X × Ber(1/2)^2, learns (P_X, P_F) with accuracy acc − o(1).
Remark 2.6. The extra bits and ′ are used to provide randomness to the net. This is needed for two reasons. First of all, we might want to emulate a randomized algorithm, in which case we need a source of randomness. Secondly, while the net is learning we want it to give a random output in ±1 in each timestep independently of the sample so that every sample has an equal probability of resulting in a nonzero gradient.
The noise-tolerant version of this theorem below (Theorem 2.8) does not need to have extra random bits appended to its input for the following reasons. First of all, the size of this net increases with the number of timesteps it is intended to be trained for, so if we want to emulate a randomized algorithm with it we can encode a sufficient number of random bits in its initial weights. Secondly, during the training phase this net always returns 0, so we do not need to randomly select an output. Remark 2.7. As a special case, one can construct in polynomial time a net (G, W) that has polynomial size such that for a learning rate γ and an integer T that are at most polynomial, (G, W) trained by SGD with learning rate γ and T time steps learns parities with accuracy 1 − o(1). In other words, random bits are not needed for parities, as parities can be learned with a deterministic algorithm using only samples of a single label without producing bias (see Section 4).
We now show that the previous result extends when sufficiently low amounts of inverse-polynomial noise are added to the weight of each edge in each time step. Therefore the previous theorem is not a degeneracy due to infinite precision.

Theorem 2.8. For each n > 0, let P_X be a probability measure on {0, 1}^n, and P_F be a probability measure on the set of functions from {0, 1}^n to {0, 1}. Let T be polynomial in n. Next, define acc = acc(n) such that there is some algorithm that takes T samples (X_t, F(X_t)), where the X_t are independently drawn from P_X and F ∼ P_F, runs in polynomial time, and learns (P_X, P_F) with accuracy acc. Then there exist γ = Θ(1) and a polynomial-sized neural net (G, W) such that using perturbed stochastic gradient descent with inverse-polynomially bounded precision noise Z ∈ [−1/p(n), 1/p(n)]^{|E(G)|} for a suitable polynomial p, learning rate γ, and loss function L(x) = x^2 to train (G, W) on samples ((X, R), F(X)), where (X, R) ∼ P_X × Ber(1/2), learns (P_X, P_F) with accuracy acc − o(1).

Corollary 2.9. For any acc > 0, there exists a universal polytime initialization of a poly-size neural net, such that if samples are produced from a distribution that is learnable with accuracy acc by some algorithm working with an upper bound on the number of samples and the time needed per sample, then SGD run in polytime with poly-many samples and possibly inverse-poly noise will succeed in learning the distribution with accuracy acc − o(1).

Remark 2.10. More generally, the process of training a neural net with noisy SGD is P-complete in the following sense. Let A be a polynomial time algorithm that receives a binary string as input and then returns a value in {0, 1}. For every T polynomial in n there exist a neural net (G, W), a learning rate γ and an inverse-polynomial level of noise such that when this net is trained for T time steps on (x_0, y_0), …, (x_{T−1}, y_{T−1}) using noisy SGD and then run on x, it returns A(x_0, y_0, x_1, y_1, …, y_{T−1}, x) for all possible (x_0, y_0), …, (x_{T−1}, y_{T−1}), x (with high probability if the gradient noise is statistical).
The previous theorem is simply the special case where A is an algorithm that learns a function from random samples. Furthermore, the emulation would also apply to distribution learning under label-noise, that is, if there is a poly-time algorithm that learns a distribution class in the presence of label noise, then SGD on poly-size neural nets as in the theorem can emulate this algorithm.
Remark 2.11. While the learning algorithm used in Theorem 2.8 does not put a bound on how large the edge weights can get during the learning process, we can do this in such a way that there is a constant B that the weights will never exceed.
Remark 2.12. Theorem 2.8 shows in particular that parities can be learned efficiently by SGD on neural nets (see Example 4.1 for more details), even with an amount of noise that would prevent SQ algorithms from learning parities. Thus, Theorem 2.8 shows a separation between SGD-based deep learning and SQ algorithms. We further discuss this phenomenon in the next section.

Negative results

GD and large averages
We saw that training neural nets with SGD and polynomial parameters is universal in that it can learn any efficiently learnable distribution. We now give a lower-bound for learning with a family of 'descent algorithms' which includes GD and SGD. This implies in particular that the universality is lost once perfect gradients are used, or once a large number of fresh samples are used to average each gradient, in agreement with the bounds from SQ algorithms [5, 6, 10-12, 19, 35, 38]. The theorem also gives a new quantity, the 'null-flow', which can be used to lower bound the performance of 'descent algorithms' beyond the number of queries.
Definition 2.13 (Descent algorithms). Consider for each n > 0 a neural net of size |E(n)| initialized with weights W^(0). A descent algorithm running for T time steps is defined by a sequence of query functions {Q_t}_{t∈[T]} that rely at each time step on m samples, a query range A, and a parameter σ^2 for the noise variance, and operates by updating at each iterate the weights by

W^(t) = W^(t−1) − (Q_t(W^(t−1)) + Z^(t)),   (2.1)

where Q_t(W^(t−1)) is evaluated on the m samples of step t and Z^(t) is noise of variance σ^2 per component.

Remark 2.14. Descent algorithms are restricted to make edits on the memory (i.e., the neural net) by making sequential linear corrections as in (2.1), whereas SQ algorithms can store and adapt the queries as desired. Note also that for a differentiable loss L, Q_t = Ê_m[γ∇L] gives gradient descent with fresh batches, and we refer to the case where the average is taken over the true data distribution, that is, m = ∞, as perfect-GD.

Definition 2.15 (Null-flow (or junk-flow)). Using the notation in the previous definition, define the null-flow NF(W^(0), P_X, {Q_t}_{t∈[T]}, T) of an initialization W^(0) with data distribution P_X, T steps and queries {Q_t} as the sum over the time steps t ∈ [T] of the square root of the expected squared norm of the query at step t, when the algorithm is run on samples (X, Ỹ) where X ∼ P_X and Ỹ is a uniformly random label independent of all other random variables, the expectation being taken over the samples, the noise and the previous iterates. That is, the null-flow is the sum over all time steps of the root of the expected gradient squared-norm when running GD on random samples with completely random labels.
Definition 2.16 (Cross-predictability). For a positive integer m, a probability measure P_X on the data domain X, and a probability measure P_F on the class of functions from X to Y = {+1, −1}, we define the cross-predictability (of order m) by

CP_m(P_X, P_F) = E_{F, F′, X} ( E_{X̂} F(X̂) F′(X̂) )^2,

where X = (X_1, …, X_m) has i.i.d. components under P_X, F, F′ are independent of X and i.i.d. under P_F, and X̂ is drawn independently of (F, F′) under the empirical measure of X, that is, (1/m) Σ_{i∈[m]} δ_{X_i}. This admits equivalent representations in terms of the Fourier-Walsh transform of F with respect to the measure P_X. It measures how predictable a sampled function is from another one on a typical data point, or equivalently, how predictable a sampled data label is from another one on a typical function. Equivalently, this measures the typical correlation among functions, similarly to the average statistical dimension [11,13]. Note that the data point is drawn from the empirical distribution on m samples, where m will refer to the batch-size of a set of fresh samples in the context of GD (i.e., how many samples are used to compute gradients). For m = ∞, that is, perfect statistics, we have for example that if P_F is a delta function, CP_∞ achieves the largest possible value of 1, and for purely random inputs and purely random functions, CP_∞ is 2^{−n}, the lowest possible value. For random degree-d monomials and uniform inputs, CP_∞ ≍ (n choose d)^{−1}. We now present the lower-bound.
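For intuition, the cross-predictability of random degree-d parities under uniform inputs can be checked numerically: E_X[χ_S(X) χ_{S′}(X)] is 1 if S = S′ and 0 otherwise, so CP_∞ reduces to the collision probability 1 / C(n, d) of two independent uniform supports. A Monte Carlo sketch (illustrative code, not from the paper):

```python
import itertools
import random

def cp_parities(n, d, trials=20000, rng=None):
    """Monte Carlo estimate of CP_infty for uniform degree-d monomials on
    uniform {+1,-1}^n inputs. Since the inner correlation is the indicator
    of S = S', CP_infty equals P(S = S') = 1 / C(n, d)."""
    rng = rng or random.Random(0)
    supports = list(itertools.combinations(range(n), d))
    hits = 0
    for _ in range(trials):
        if rng.choice(supports) == rng.choice(supports):
            hits += 1
    return hits / trials
```

For n = 6, d = 2 this concentrates around 1/15, matching `1 / math.comb(6, 2)` and the (n choose d)^{−1} scaling stated above.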
Theorem 2.17. Let P_X with X = S^n for some finite set S and P_F such that the output distribution is balanced,^11 that is, P{F(X) = 0} = P{F(X) = 1} + o(1) when (X, F) ∼ P_X × P_F. Using the previous definitions for CP = CP_m(P_X, P_F) and NF = NF(W^(0), P_X, {Q_t}_{t∈[T]}, T), the generalization error of a descent algorithm as in Definition 2.13 is lower-bounded in terms of the ratio NF/σ and CP^{1/4}: learning fails unless the null-flow overcomes the cross-predictability.

In Section 5, we present a stronger version of Theorem 2.17 for the specific case of parities, with a tighter bound obtained that results in the term CP^{1/2} rather than CP^{1/4}. Note that NF depends also on P_X (in addition to the dependence on the initialization and the queries), but it admits a P_F-independent upper-bound as stated below; on the contrary, CP depends crucially on P_F. Remark 2.19. Using a noise distribution that is uniform with variance σ^2 rather than Gaussian, one obtains a similar statement with |E| replacing √|E|. The above corollary gives a bound similar to those that can be obtained using SQ algorithms. Various results from SQ need to be combined to obtain comparable bounds. One needs to account for the statistical nature of the noise that we consider here, which has fewer degrees of freedom compared to the adversarial noise of standard SQ, and one needs to account for the weak and typical learning requirements (Equation 1.1). These can be addressed (at least in the case where the distribution P_F is uniform on a set) using concentration and coupling arguments and combining for example results from [10,13,36]. We refer to [7] for a detailed discussion on how this is done, which gives a comparable lower bound on P{F̂(X) ≠ F(X)} by using an L1-notion of cross-predictability C̃P (replacing the square of the inner-product by the absolute value). Such bounds are slightly weaker than the one of Corollary 2.18.
Note that these use a coupling argument to handle the adversarial versus statistical noise; if the latter were considered directly, the bound would instead feature the L1-cross-predictability raised to the power 1/2 (note that we can also obtain an exponent of 1/2 on the cross-predictability with our Theorem 5.1). Further improvements may be obtained using [38] for the statistical noise, but all together these bounds are of a similar kind. We next discuss how the null-flow could lead to different kinds of bounds in the context of neural networks.

^11 Non-balanced cases can be handled by modifying definitions appropriately.
^12 The positive statement uses the fact that it is easy to learn random degree-d monomials when d is finite; see for example [4] for a specific implementation.
Remark 2.20. The upper bound on the null-flow in Corollary 2.18 uses a simple upper bound on the derivative of the loss function. In cases where the derivatives of the output with respect to the edge weights are consistently much smaller than this generic bound, one can prove tighter bounds on the null-flow, leading to a lower probability of learning the function. One could also obtain comparable improvements in the SQ bounds by adjusting the SQ algorithm to take these bounds on the derivatives into account. However, in cases where there are some inputs for which the derivative of the output with respect to the edge weights is much larger than it is for typical inputs, one could obtain tighter bounds on the null-flow that do not necessarily have analogous tighter SQ bounds. We leave it to future work to investigate such cases.
Remark 2.21. Note that our positive results show that we could learn a random parity function using stochastic gradient descent. The difference is that SGD lets us get the details of individual samples, which is needed for instance to run algorithms like Gaussian elimination for parities. If instead one uses GD on the entire population, the signal about the target function is drowned out. This does not take place if GD is used with a sub-polynomial batch-size.
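The role of individual samples can be made concrete: each labeled example (x, χ_S(x)) yields one linear equation over GF(2) in the indicator vector of S, so Gaussian elimination recovers S from O(n) samples, which is exactly the per-sample access SGD retains and population averages destroy. An illustrative implementation (our own sketch; assumes enough samples for the system to determine S):

```python
import random

def solve_parity(samples, n):
    """Recover the support S of a parity via Gaussian elimination over
    GF(2). Each sample (x, y) with x in {+1,-1}^n and y = prod_{i in S} x_i
    gives one linear equation in the 0/1 indicator of S."""
    rows = []
    for x, y in samples:
        bits = [(1 - xi) // 2 for xi in x]  # +1 -> 0, -1 -> 1
        rhs = (1 - y) // 2                  # y = (-1)^(sum_{i in S} bits_i)
        rows.append((bits, rhs))
    # Forward elimination: reduce each row by known pivots, register new ones.
    pivots = {}
    for bits, rhs in rows:
        bits = bits[:]
        for col in range(n):
            if bits[col]:
                if col in pivots:
                    pbits, prhs = pivots[col]
                    bits = [a ^ b for a, b in zip(bits, pbits)]
                    rhs ^= prhs
                else:
                    pivots[col] = (bits, rhs)
                    break
    # Back-substitution; free variables default to 0.
    s = [0] * n
    for col in sorted(pivots, reverse=True):
        bits, rhs = pivots[col]
        val = rhs
        for j in range(col + 1, n):
            val ^= bits[j] & s[j]
        s[col] = val
    return {i for i in range(n) if s[i]}
```

A query that only sees gradient averages over many samples cannot form these per-sample equations, which is the informal content of the separation.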
Remark 2.22. Our lower-bounds show that non-constant degree monomials are hard to learn with GD, and in [1] we provide further examples of such types of functions in the context of arithmetic learning and connectivity/community detection. In particular, we define the pruned-SBM, which corresponds to the classical SBM in the connectivity regime where cycles of constant length are pruned down. It is conjectured that GD on a random regular neural net will not be able to distinguish in polytime such pruned-SBMs from their corresponding Erdős-Rényi random graph with matching average degree.

SGD with additional randomness
In the case of perfect gradient descent and low cross-predictability, the gradients of the losses with respect to different inputs mostly cancel out, so an exponentially small amount of noise is enough to drown out whatever is left. With stochastic gradient descent, that does not happen, as we saw with the positive result, but we can still get a negative result for SGD if further noise is added to the initialization and the stochastic gradients, as discussed next.

Theorem 2.24. For each n > 0, let (G, W) be a neural net with size polynomial in n, and let c, γ, ε > 0. There exist a number of time steps T, an initialization-noise variance σ and a gradient-noise variance σ′, each polynomial in (n, c, γ, 1/ε), such that the following holds. Perturb the weight of every edge in the net by a Gaussian distribution of variance σ and then train it with a noisy stochastic gradient descent algorithm with learning rate γ, T time steps, and Gaussian noise with variance σ′. Also, let p be the probability that at some point in the algorithm, there is a neural net (G, W′) near (G, W) such that at least one of the first three derivatives of the loss function on the current sample with respect to some edge weight(s) of (G, W′) has absolute value greater than c. Then this algorithm fails to learn parities with an accuracy greater than 1/2 + 2ε, up to additive error terms that are polynomially small in n and polynomial in p.
Remark 2.25. Normally, we would expect that if training a neural net by means of SGD works, then the net will improve at a rate proportional to the learning rate, as long as the learning rate is small enough. As such, we would expect that the number of time steps needed to learn a function would be inversely proportional to the learning rate. This theorem shows that if we set T = t_0/γ for any constant t_0 and slowly decrease γ, then the accuracy will approach 1/2 + 2ε or less. If we also let t_0 slowly increase, we would expect that ε will go to 0, so the accuracy will go to 1/2. It is also worth noting that as γ decreases, the typical size of the noise terms will scale as γ^{3/2}. So, for sufficiently small values of γ, the noise terms that are added to edge weights will generally be much smaller than the signal terms.
Remark 2.26. Unlike Corollary 2.18 from the negative result of the previous subsection, this result is fairly different from the results that one could obtain using SQ arguments. As mentioned in the previous remark, as we decrease the learning rate the noise terms will scale as γ^{3/2}. So, for sufficiently small values of γ it will be possible to determine the most recent input/output pair the net received from its previous edge weights and new edge weights. That in turn means that it is in principle possible to learn the function in question from the series of changes in edge weights the net experiences. So, this negative result is showing that noisy SGD fails to use the information contained in the gradient effectively, rather than showing that the gradient is too noisy to provide the information needed to learn the function.
Remark 2.27. The bound on the derivatives of the loss function is essentially a requirement that the behavior of the net be stable under small changes to the weights. It is necessary because otherwise one could effectively multiply the learning rate by an arbitrarily large factor simply by ensuring that the derivative is very large. Alternately, excessively large derivatives could cause the probability distribution of the edge weights to change in ways that disrupt our attempts to approximate this probability distribution using Gaussian distributions. For any given initial value of the neural net, any given smooth activation function, and any given bound B > 0, there must exist some c such that as long as none of the edge weights become larger than B, this condition will always hold. However, that c could be very large, especially if the net has many layers.
Remark 2.28. The positive results show that it is possible to learn a random parity function using a polynomial-sized neural net trained by stochastic gradient descent with inverse-polynomial noise for a polynomial number of time steps. Furthermore, this can be done with a constant learning rate, a constant upper bound on all edge weights, a constant perturbation radius, and a bound polynomial in n such that none of the first three derivatives of the loss function of any net within that radius of ours exceeds the bound at any point. So, this result would not continue to hold for all choices of exponents.

Positive results
For the positive results, we emulate any learning algorithm using poly-many samples and running in polytime with poly-size neural nets trained by poly-step SGD. This requires emulating any poly-size circuit implementation with free access to reading and writing in memory using a particular computational model that computes, reads and writes memory solely via SGD steps on a fixed neural net. In particular, this requires designing subnets that perform arbitrary efficient computations in such a way that SGD does not alter them and subnet structures that cause SGD to change specific edge weights in a manner that we can control. Note that any algorithm that learns a function from samples must repeatedly get a new sample and then change some of the values in its memory in a way that is determined by the current values in its memory and the value of the sample. Eventually, it must also attempt to compute the function's output based on its input and the values in memory. If the learning algorithm is efficient, then there must be a polynomial-sized circuit that computes the values in the algorithm's memory in the next timestep from the sample it was given and its memory values in the current timestep. Likewise, there must be a polynomial-sized circuit that computes its guesses of the function's output from the function's input and the values in its memory. Since any polynomial-sized circuit can be translated into a neural net of polynomial size, we can encode the desired circuit in a preset format. However, once we run SGD on it, we would a priori completely alter the weights of the edges in the net, which would cause the net to stop performing the intended calculations.
To prevent this, we use an activation function that is constant in some areas (we will use a sigmoid-like non-linearity with flats), and ensure that the nodes in the translated circuit are flat nodes, that is, that they always get inputs in that flat range. That way, the derivatives of their activation levels with respect to the weights of any of the edges leading to them are 0, so backpropagation will never change the edge weights in the net. That allows us to construct a portion of the net called the computation component that performs the desired computations in a backpropagation-proof way. This computation component can in particular output values, the computation outputs, that are responsible for editing the memory of the algorithm. However, we still need a mechanism to decide how to store and edit the memory using only a neural net trained with SGD. This is the most challenging part.
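As a minimal numeric sketch of this backpropagation-proofing (our code, not the paper's; the activation is a stand-in whose flat regions match the one used later, with a linear bridge on [1, 3/2] purely for concreteness):

```python
import math

def sigma(z):
    """Activation with flats: +/-2 outside [-3/2, 3/2], z^3 on (-1, 1).
    The paper only requires a smooth nondecreasing bridge on the remaining
    intervals; a linear stand-in is used here since its shape never matters."""
    s = math.copysign(1.0, z)
    a = abs(z)
    if a >= 1.5:
        return 2.0 * s
    if a < 1.0:
        return z ** 3
    return s * (1.0 + 2.0 * (a - 1.0))  # bridge from 1 up to 2

def dsigma(z):
    a = abs(z)
    if a >= 1.5:
        return 0.0          # flat region: derivative identically zero
    if a < 1.0:
        return 3.0 * z ** 2
    return 2.0              # slope of the linear bridge

def grads(w1, w2, x, y):
    """d/dw of the squared loss for out = sigma(w2 * sigma(w1 * x))."""
    z1 = w1 * x; a1 = sigma(z1)
    z2 = w2 * a1; out = sigma(z2)
    dl = 2.0 * (out - y)                        # dL/dout
    g2 = dl * dsigma(z2) * a1                   # dL/dw2
    g1 = dl * dsigma(z2) * w2 * dsigma(z1) * x  # dL/dw1
    return g1, g2, out

# A "flat node": both pre-activations land in the flat range, so SGD never
# moves the weights feeding it, no matter what the loss is.
g1, g2, out = grads(w1=2.0, w2=1.0, x=1.0, y=-1.0)
print(g1, g2, out)  # both gradients are exactly 0.0, output stays 2.0
```

By contrast, calling `grads(0.5, 0.5, 1.0, -1.0)` keeps both pre-activations in the cubic region and yields nonzero gradients, so only nodes deliberately kept in the flat range are protected.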
The neural net's memory takes the form of its edge weights. We will encode the algorithm's memory in the edges leaving the constant vertex. Normally, we would not be able to precisely control how SGD would alter these weights. However, it is possible to design components of the net (the components in Figures 1-3 defined in Definition 4.3) in such a way that if certain vertices called the control vertices output certain values, then every path to the output through a designated edge will pass through a flat vertex. So, if those vertices are set that way, the derivative of the loss function with respect to the edge weight in question will be 0, and the weight will not change. That allows us to control whether or not the edge weight changes, and by appropriately setting up the values of the initial net and the learning rate, we can ensure that the changes will always translate into the desired bit flips. This gives us a way to construct a net portion that can set the values in memory; we call this the memory component. See Figure 3 for a representation of the overall net.
One difficulty encountered with such an SGD implementation is that no update of the weights will take place when given a sample that is correctly predicted by the net. If one does not mitigate this, the net may end up being trained on a sample distribution that is mismatched to the original one, which can have unexpected consequences. A randomization mechanism is thus used to circumvent this issue, but this mechanism is not necessary for functions like parities, as one can learn parities from only samples having the same label.
In summary, we can create a neural net with a poly-size architecture and a polytime initialization, that carries out both the computation and memory updates of an algorithm when trained by SGD. See Section 4.3 for additional implementation considerations.

FIGURE 1
The above figure illustrates the main results of this paper. Define Poly_mSGD_NN to be the family of distributions (X, f(X)) such that there exist polynomials bounding the hyper-parameters, a differentiable loss computable in polytime, and a neural net of polynomial size with a polytime initialization W(0) such that the output f̂_mSGD obtained by running noisy SGD with the above hyper-parameters, m fresh samples to compute the average gradient per time-step and polynomial precision (i.e., the ratio of the gradients' amplitude per the noise magnitude) satisfies ℙ(f(X_{n+1}) ≠ f̂_mSGD(X_{n+1})) = o(1). In particular, define Poly_SGD_NN and Poly_GD_NN to be Poly_mSGD_NN when m is, respectively, 1 (i.e., SGD with a single sample per time-step) and ∞ (i.e., population or perfect gradient averages). Define also Poly_PAC as the family of distributions such that there exists a poly-time algorithm f̂_PAC that uses a polynomial number of samples such that ℙ(f(X_{n+1}) ≠ f̂_PAC(X_{n+1})) = o(1). Then the results in this paper imply that Poly_SGD_NN and Poly_PAC are equivalent (the universality) and that Poly_mSGD_NN may or may not match Poly_PAC depending on the value of m: it matches for m = 1 (from the previous result) but it is a strict subset when 1/m and the cross-predictability CP are small enough (sub-polynomial); this gives an average-case requirement analogous to having super-polynomial statistical dimension in the SQ framework.

FIGURE 2
The emulation net; γ = (2^{−243} · 3^{−1641/2} / b′)^{1/364}, b′ is the maximum between b and ⌈2^{−243} · 3^{−1641/2} · (18√3)^{364}⌉, and b is the number of bits required to perform the computation from the computation component. In this illustration, we considered only two copies of the memory unit from Definition 4.3; one copy is highlighted in red. The magenta dashed edges are the memory read edges and the blue dashed edges are the memory write edges. The latter allow changing the controller vertices c, c′ that act on the memory unit to edit the memory. Random bit inputs are omitted in this figure, and the information flows from left to right in all edges.

Negative results
Our main approach to showing the failure of an algorithm (e.g., noisy GD) using data from a model (e.g., parities) for a desired task (e.g., typical weak learning), will be to show that under limited resources (e.g., limited number of time steps), the output of the algorithm trained on the true model is statistically indistinguishable from the output of the algorithm trained on a null model, where the null model fails to provide the desired performance for trivial reasons. This gives a computational lower-bound out of a statistical estimate. The indistinguishability to null condition (INC) is obtained by manipulating information measures, bounding the total variation distance between the posterior measures of the test and null models. More specifically, for Theorem 2.17, we show a subadditivity property of the TV using the data processing inequality, use the fact that we work with a descent algorithm that updates the weights by 'subtractions' of queries and not general statistical queries, and bound the one-step total variation distance with the KL divergence (Pinsker's inequality), which in the Gaussian case can be computed in closed form. In the case of Theorem 2.24 for SGD, we rely on a more sophisticated version of the above program. We use again a step used for GD that consists in showing that the average value of any function on samples generated by a random parity function will be approximately the same as the average value of the function on truly random samples. 13 This is essentially a consequence of the low cross-predictability. Most of the work is then to use this to show that if we draw a vector of weights from a sufficiently noisy probability distribution and then perturb it slightly in a manner dependent on a sample generated by a random parity function, the probability distribution of the result is essentially indistinguishable from what it would be if the samples were truly random.
Then, we argue that if we do this repeatedly and add in some extra noise after each step, the probability distribution stays noisy enough that the previous result continues to apply.
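The chain of inequalities behind this program can be sketched as follows (schematic, in our notation rather than the paper's; P_t denotes the distribution of the weights after t steps):

```latex
% Subadditivity of total variation over the T descent steps
% (via the data processing inequality):
\mathrm{TV}\!\left(P_T^{\mathrm{true}},\,P_T^{\mathrm{null}}\right)
  \;\le\; \sum_{t=1}^{T} \mathrm{TV}_t ,
% Pinsker's inequality for each one-step term:
\mathrm{TV}_t \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}_t},
% and, when the update noise is Gaussian with variance \sigma^2,
% the KL divergence has the closed form
\mathrm{KL}\!\left(\mathcal{N}(\mu_1,\sigma^2 I)\,\big\|\,\mathcal{N}(\mu_2,\sigma^2 I)\right)
  \;=\; \frac{\lVert \mu_1-\mu_2\rVert_2^2}{2\sigma^2}.
```

Here μ_1 and μ_2 stand for the means of the one-step updates under the true and null models; a small gradient difference relative to the noise level thus forces many steps before the models become distinguishable.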

FURTHER RELATED LITERATURE
In [8] a different emulation argument is shown for gradient descent; the similarity with our result is that both encode a calculation using a form of GD, but in very different settings and with very different conclusions. [8] shows that one can implement an arbitrary algorithm using GD by presenting the correct series of loss functions, with the purpose of showing that it is difficult to predict the long-term results of running online GD on an arbitrary known series of loss functions.
Our emulation shows that one can encode an arbitrary computation on samples drawn from an unknown distribution by training a net with SGD. Our purpose is to prove that a properly initialized net trained by SGD can learn any function learnable from samples. Finally, a key component of our result is that SGD can handle an amount of noise that goes beyond what SQ algorithms can handle, which is unrelated to [8].
The difficulty of learning functions like parities with NNs is not new. Together with the connectivity case, the difficulty was one of the central foci in the perceptron book of Minsky and Papert [25]. The sensitivity of parities is also well-studied in the theoretical computer science literature, with the relation to circuit complexity, in particular the computational limitations of small-depth circuits [2,17]. The seminal paper of Kearns on statistical query learning algorithms [19] brings up the difficulties in learning parities with such algorithms. As mentioned earlier, there have been numerous works extending the work of Kearns for parities to more general cases of high statistical dimension, such as [5, 6, 10-13, 19, 35, 36, 38] and [34,37] for specific neural networks. While the statistical dimension was initially derived with a worst-case requirement on the class of functions, it was generalized to average-case notions in [10,11,13] and to statistical noise [38]. Information measure manipulations as used in our lower-bound were also used in [35] to obtain SQ bounds under memory constraints. We refer to [7] for further comparisons on SQ algorithms. Finally, [31], with an earlier version in [32] from the first author, also investigates the hardness of learning certain function classes with respect to a continuous input distribution on Euclidean space, despite regularity assumptions being made.
In particular, [31] proves that the gradient of the loss function of a neural network will be essentially independent of the parity function used, which gives strong indications for the failure of GD. This is achieved in [31] under the requirement that the loss function is 1-Lipschitz, an assumption that is not needed in our Lemma 5.2.

Emulation of arbitrary algorithms
Any algorithm that learns a function from samples must repeatedly get a new sample and then change some of the values in its memory in a way that is determined by the current values in its memory and the value of the sample. Eventually, it must also attempt to compute the function's output based on its input and the values in memory. If the learning algorithm is efficient, then there must be a polynomial-sized circuit that computes the values in the algorithm's memory in the next timestep from the sample it was given and its memory values in the current timestep. Likewise, there must be a polynomial-sized circuit that computes its guesses of the function's output from the function's input and the values in its memory. Any polynomial-sized circuit can be translated into a neural net of polynomial size. Normally, stochastic gradient descent would tend to alter the weights of edges in that net, which might cause it to stop performing the calculations that we want. However, we can prevent its edge weights from changing by using an activation function that is constant in some areas, and ensuring that the nodes in the translated circuit always get inputs in that range. That way, the derivatives of their activation levels with respect to the weights of any of the edges leading to them are 0, so backpropagation will never change the edge weights in the net. That leaves the issue of giving the net some memory that it can read and write. A neural net's memory takes the form of its edge weights. Normally, we would not be able to precisely control how stochastic gradient descent would alter these weights. However, it is possible to design the net in such a way that if certain vertices output certain values, then every path to the output through a designated edge will pass through a vertex that has a total input in one of the flat parts of the activation function. 
So, if those vertices are set that way, the derivative of the loss function with respect to the edge weight in question will be 0, and the weight will not change. That would allow us to control whether or not the edge weight changes, which gives us a way of setting the values in memory. As such, we can create a neural net that carries out this algorithm when it is trained by means of stochastic gradient descent with appropriate samples and learning rate. This net will contain the following components: (1) The output vertex. This is the output vertex of the net, and the net will be designed in such a way that it always has a value of ±1. (2) The input bits. These will include the regular input vertices for the function in question. However, there will also be a couple of extra input bits that are to be set randomly in each timestep. They will provide a source of randomness that is necessary for the net to run randomized algorithms,14 in addition to some other guesswork that will turn out to be necessary (see more on this below). (3) The memory component. For each bit of memory that the original algorithm uses, the net will have a vertex with an edge from the constant vertex that will be set to either a positive or negative value depending on whether that bit is currently set to 0 or 1. Each such vertex will also have an edge leading to another vertex which is connected to the output vertex by two paths. The middle vertex in each of these paths will also have an edge from a control vertex. If the control vertex has a value of 2, then that vertex's activation will be 0, which will result in all subsequent vertices on that path outputting 0, and none of the edge weights on that path changing as a result of backpropagation along that path. On the other hand, if the control vertex has a value of 0, then that vertex will have a nonzero activation, and so will all subsequent vertices on that path. The learning rate will be chosen so that in this case, if the net gives the wrong output, the weight of every edge on this path will be multiplied by −1. If the net's output is right, the derivative of the loss function with respect to any edge weight will be 0, so the entire net will not change. (4) The computation component. This component will be constructed in such a way that the derivative of the loss function with respect to the weights of its edges will always be 0. As a result, none of the edge weights in the computation component will ever change, as explained in Lemma 4.2. This component will also decide whether or not the net has learned enough about the function in question based on the values in memory. If it thinks that it still needs to learn (for instance, if the number of timesteps in which the net's output differed from the label is below some preset threshold), then it will have the net output a random value and attempt to set the values in memory to whatever they should be set to if that guess is wrong (i.e., it will set the memory component's control vertices in such a way that if the label differs from the net's output then the values in memory will change to the appropriate values). If it thinks that it has learned enough, then it will try to get the output right (i.e., it will output the value that the emulated algorithm would output on that input if its memory matched the values encoded in the memory component) and leave the values in memory unchanged.

14 Two random bits will always be sufficient because the algorithm can spend as many timesteps as it needs copying random bits into memory and ignoring the rest of its input.
See Figure 3 for a representation of the overall net. One complication that this approach encounters is that if the net outputs the correct value, then the derivative of the loss function with respect to any edge weight is 0, so the net cannot learn from that sample.15 Our approach to dealing with that is to have a learning phase where we guess the output randomly and then have the net output the opposite of our guess. That way, if the guess is right the net learns from that sample, and if it is wrong it stays unchanged. Each guess is right with probability 1/2 regardless of the sample, so the probability distribution of the samples it is actually learning from is the same as the probability distribution of the samples overall, and it only needs (2 + o(1)) times as many samples as the original algorithm in order to learn the function. Once it thinks it has learned enough, such as after learning from a designated number of samples, it can switch to attempting to compute the function it has learned on each new input.

15 This holds for any loss function that has a minimum when the output is correct, not just the L2 loss function that we are using. We could avoid this by having the net output ±1/2 instead of ±1. However, if we did that then the change in each edge weight if the net got the right output would be −1/3 of the change in that edge weight if it got the wrong output, which would be likely to result in an edge weight that we did not want in at least one of those cases. There are ways to deal with that, but they do not seem clearly preferable to the current approach.
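The key property of the guessing phase, that updates happen with probability exactly 1/2 regardless of the sample, so the learned samples keep the original distribution, can be checked with a small simulation (our toy code, with a deliberately label-skewed stream as the assumption):

```python
import random

random.seed(0)

def guess_phase(labels):
    """For each sample, guess the label uniformly at random and have the
    net output the opposite of the guess. The net updates exactly when its
    output is wrong, i.e. when the guess was right: probability 1/2,
    independently of the sample's content."""
    learned = []
    for y in labels:
        g = random.choice((-1, 1))  # the random guess
        if -g != y:                 # net output -g was wrong: update happens
            learned.append(y)
    return learned

# A label-skewed stream: 70% of labels are +1.
stream = [1 if random.random() < 0.7 else -1 for _ in range(100_000)]
kept = guess_phase(stream)
frac_kept = len(kept) / len(stream)
frac_pos = kept.count(1) / len(kept)
print(round(frac_kept, 2), round(frac_pos, 2))  # ~0.50 and ~0.70
```

About half of the stream is learned from, and the learned samples retain the 70/30 label split, which is why the emulated algorithm only needs roughly twice as many samples.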
Remark 4.1. We now give an illustration of how the previous components would run and interact for learning parities. One can learn an unknown parity function by collecting samples until one has a set that spans the space of possible inputs, at which point one can compute the function by expressing any new input as a linear combination of those inputs and returning the corresponding linear combination of their outputs. As such, if we wanted to design a neural net to learn a parity function this way, the memory component would have n(n + 1) bits designated for remembering samples, and log2(n + 1) bits to keep a count of the samples it had already memorized. Whenever it received a new input x, the computation component would get the value of x from the input nodes and the samples it had previously memorized, (x_1, y_1), …, (x_k, y_k), from the memory component. Then it would check whether or not x could be expressed as a linear combination of x_1, …, x_k. If it could be, then the computation component would compute the corresponding linear combination of y_1, …, y_k and have the net return it. Otherwise, the computation component would take a random value y′ that it got from one of the extra input nodes. Then, it would attempt to have the memory component add (x, y′) to its list of memorized samples and have the net return −y′. That way, if the correct output was y′, then the net would return the wrong value and the edge weights would update in a way that added the sample to the net's memory. If the correct output was −y′, then the net would return the right value, and none of the edge weights would change. As a result, it would need about 2n samples before it succeeded at memorizing a list that spanned the space of all possible inputs, at which point it would return the correct outputs for any subsequent inputs.
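The span-and-solve strategy from this remark can be sketched directly (our code; for simplicity bits are 0/1 and the label is the parity bit mod 2 rather than ±1, and the basis is kept in reduced echelon form over GF(2)):

```python
import random

def learn_parity(n, sample):
    """Memorize samples (x, y) of an unknown parity until the stored inputs
    span GF(2)^n; afterwards answer any query by expressing it as a GF(2)
    combination of stored inputs."""
    basis = {}  # pivot index -> (vector, label), kept in reduced echelon form

    def reduce(x, y):
        x = list(x)
        for p, (bx, by) in basis.items():
            if x[p]:
                x = [a ^ b for a, b in zip(x, bx)]
                y ^= by
        return x, y

    while len(basis) < n:
        x, y = sample()
        rx, ry = reduce(x, y)
        if any(rx):                 # new direction: memorize it
            p = rx.index(1)
            for q, (bx, by) in basis.items():  # keep the basis reduced
                if bx[p]:
                    basis[q] = ([a ^ b for a, b in zip(bx, rx)], by ^ ry)
            basis[p] = (rx, ry)

    def predict(x):
        return reduce(x, 0)[1]      # x is in the span, so this is exact
    return predict

# Hidden parity on coordinates {0, 2} of n = 8 bits (our choice of example).
random.seed(1)
n = 8
def sample():
    x = [random.randint(0, 1) for _ in range(n)]
    return x, (x[0] + x[2]) % 2

predict = learn_parity(n, sample)
print(predict([1, 1, 1, 0, 0, 0, 0, 0]))  # 0, since x[0] ^ x[2] = 1 ^ 1 = 0
```

Once the basis is full, every query reduces to the all-zero vector and the accumulated label is the exact parity, matching the remark's claim that all subsequent outputs are correct.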
Note that the depth of the emulation net is at least 7 in general due to the memory component, and here for parities, it can be done with depth O(log n) as the computation component can be implemented with this depth. 16 Before we can prove anything about how our net learns, we will need to establish some properties of our activation function. Throughout this section, we will use an activation function σ : ℝ → ℝ such that σ(x) = 2 for all x > 3/2, σ(x) = −2 for all x < −3/2, and σ(x) = x^3 for all −1 < x < 1. There is a way to define σ on [−3/2, −1] ∪ [1, 3/2] such that σ is smooth and nondecreasing. The details of how this is done will not affect any of our arguments, so we pick some assignment of values to σ on these intervals with these properties. This activation function has the important property that its derivative is 0 everywhere outside of [−3/2, 3/2]. As a result, if we use SGD to train a neural net using this activation function, then in any given time step, the weights of the edges leading to any vertex that had a total input that is not in [−3/2, 3/2] will not change. This allows us to create sections of the net that perform a desired computation without ever changing. In particular, it will allow us to construct the net's computation component in such a way that it will perform the necessary computations without ever getting altered by SGD. More formally, we have the following.
16 Unless one uses more than two fan-in in the computation nodes, which can reduce the depth. 17 Note that these will not be the input of the general neural net that is being built, but all the input entering the computation component besides from the constant vertex.
Lemma 4.2. Let h : {0,1}^m → {0,1}^{m′} be computable by a polynomial-sized circuit, and suppose the net contains vertices v′_1, …, v′_m (17) whose outputs encode binary values, the output of v′_i being a_i(0) or a_i(1) with a_i(0) < a_i(1). It is possible to add a set of at most polynomially many new vertices to the net, including output vertices v″_1, …, v″_{m′}, along with edges leading to them such that for any possible addition of edges leading from the new vertices to old vertices, if the net is trained by SGD and the output of v′_i is either a_i(0) or a_i(1) for every i in every timestep, then the following hold: (1) None of the weights of the edges leading to the new vertices ever change, and no paths through the new vertices contribute to the derivative of the loss function with respect to edges leading to the v′_i. (2) The outputs of v″_1, …, v″_{m′} encode the value of h applied to the bits encoded by v′_1, …, v′_m.
Proof. Such a result follows from Chapter 4 in [28], which shows that AND/OR/NOT gates can be simulated with threshold gates. We provide here a self-contained argument for our specific setting. First, add one new vertex for each gate in a circuit that computes h. When the new vertices are used to compute h, we want each vertex to output 2 if the corresponding gate outputs a 1 and −2 if the corresponding gate outputs a 0. In order to make one new vertex compute the NOT of another new vertex, it suffices to have an edge of weight −1 to the vertex computing the NOT and no other edges to that vertex. We can compute an AND of two new vertices by having a vertex with two edges of weight 1 from these vertices and an edge of weight −2 from the constant vertex. Similarly, we can compute an OR of two new vertices by having a vertex with two edges of weight 1 from these vertices and an edge of weight 2 from the constant vertex. For vertices corresponding to gates that act directly on the inputs, we have the complication that their vertices do not necessarily encode 0 and 1 as ±2, but we can compensate for that by changing the weights of the edges from these vertices, and the edges to these gates from the constant vertices, appropriately. This ensures that if the outputs of the v′_i encode binary values x_1, …, x_m appropriately, then each of the new vertices will output the value corresponding to the output of the appropriate gate. So, these vertices compute h(x_1, …, x_m) correctly. Furthermore, since the input to each of these vertices is outside of [−3/2, 3/2], the derivatives of their activation functions with respect to their inputs are all 0. As such, none of the weights of the edges leading to them ever change, and paths through them do not contribute to changes in the weights of edges leading to the v′_i. □ Note that any efficient learning algorithm will have a polynomial number of bits of memory.
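The gate gadgets in the proof above can be checked numerically (a sketch in our notation; the ±2 encoding and the constant vertex outputting 1 are as described in the proof, and every gate input stays in the flat regions of the activation):

```python
def sigma(z):
    """Stand-in for the paper's activation; only its flat parts matter here:
    sigma(z) = 2 for z >= 3/2 and sigma(z) = -2 for z <= -3/2."""
    if z >= 1.5:
        return 2.0
    if z <= -1.5:
        return -2.0
    raise ValueError("gate inputs should stay in the flat regions")

# Encoding: boolean 1 -> +2, boolean 0 -> -2; the constant vertex outputs 1.
def NOT(a):    return sigma(-1.0 * a)                    # single edge, weight -1
def AND(a, b): return sigma(1.0 * a + 1.0 * b - 2.0 * 1.0)  # bias -2 from constant
def OR(a, b):  return sigma(1.0 * a + 1.0 * b + 2.0 * 1.0)  # bias +2 from constant

T, F = 2.0, -2.0
print(AND(T, T), AND(T, F), OR(F, F), NOT(T))  # 2.0 -2.0 -2.0 -2.0
```

Since every total input has absolute value at least 2, each gate vertex sits in a flat region, so the gadgets are simultaneously correct and backpropagation-proof; composing them (e.g., XOR(a, b) = AND(OR(a, b), NOT(AND(a, b)))) preserves both properties.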
In each time step, it might compute an output from its memory and sample input, and it will compute which memory values it should change based on its memory, sample input, and sample output. All of these computations must be performable in polynomial time, so there is a polynomial-sized circuit that performs them. Therefore, by the lemma it is possible to add a polynomial-sized component to any neural net that performs these calculations, and as long as the inputs to this component always take on values corresponding to 0 or 1, backpropagation will never alter the weights of the edges in this component. That leaves the issue of how the neural net can encode and update memory bits. Our plan for this is to add in a vertex for each memory bit that has an edge with a weight encoding the bit leading to it from a constant bit and no other edges leading to it.18 We will also add in paths from these vertices to the output that are designed to allow us to control how backpropagation alters the weights of the edges leading to the memory vertices. More precisely, we define the following.

18 It would be convenient if v′_1, …, v′_m all used the same encoding. However, the computation component will need to get inputs from the net's input vertices and from the memory component. The input vertices encode 0 and 1 as ±1, while the memory component encodes them as ±c′ for some small c′. Therefore, it is necessary to be able to handle inputs that use different encodings.
We refer to Figure 3 to visualize the memory unit. The idea is that this structure can be used to remember one bit, which is encoded in the current weight of the edge from v_0 to v_1. A weight of 9√3 encodes a 0 and a weight of −9√3 encodes a 1. In order to set the value of this bit, we will use the control vertices c and c′, which will be controlled by the computation component. If we want to keep the bit the same, then we will have them both output 2, in which case v_4 and v′_4 will both output 0, with the result that the derivative of the loss function with respect to any of the edge weights in this structure will be 0. However, if we want to change the value of this bit, we will have one of c and c′ output 0. That will result in a nonzero output from v_4 or v′_4, which will lead to the net's output having a nonzero derivative with respect to some of the edge weights in this structure. Then, if the net gives the wrong output, the weights of some of the edges in the structure will be multiplied by −1, including the weight of the edge from v_0 to v_1. Unfortunately, if the net gives the right output then the derivative of the loss function with respect to any edge weight will be 0, which means that any attempt to change a value in memory on that timestep will fail.
More formally, we have the following. Proof. More precisely, we claim that the weight of the edge from c to v_4 and the weight of the edge from c′ to v′_4 never change, and that all of the other edges in the memory unit only ever change by switching signs. Also, we claim that at the end of any time step, either all of the edges on the path from v_0 to v_2 have their original weights, or all of them have weights equal to the negatives of their original weights. Furthermore, we claim that the same holds for the edges on each path from v_2 to v_6.
In order to prove this, we induct on the number of time steps. It obviously holds after 0 time steps. Now, assume that it holds after ′ − 1 time steps, and consider time step ′ . If the net gave the correct output, then the derivative of the loss function with respect to the output is 0, so none of the weights change.
Now, consider the case where the net outputs 1 and the correct output is −1. By assumption, c′ outputs 2 in this time step, so v′_4 gets an input of 2^{27} · 3^{91} · γ^{40} from v′_3 and an input of −2^{27} · 3^{91} · γ^{40} from c′. So, both its output and the derivative of its output with respect to its input are 0. That means that the same holds for v′_5, which means that none of the edge weights on this path from v_2 to v_6 change this time step, and nothing backpropagates through this path. If c also outputs 2, then v_4 and v_5 output 0 for the same reason, and none of the edge weights in this copy of the memory unit change. On the other hand, if c outputs 0, then the output vertex gets an input of 2^{243} · 3^{1641/2} · γ^{364} from v_5. The derivative of this input with respect to the weight of the edge from v_{i−1} to v_i is 2^{243} · 3^{1641/2} · γ^{364} · [3^{6−i}/(3^{3−i} γ/2)] if these weights are positive, and the negative of that if they are negative. Furthermore, the derivative of the loss function with respect to the input to the output vertex is 12. So, the algorithm reduces the weights of all the edges on the path from v_0 to v_6 that goes through v_4 exactly enough to change them to the negatives of their former values. Also, since c output 0, the weight of the edge from c to v_4 had no effect on anything this time step, so it stays unchanged.
The case where the net outputs −1 and the correct output is 1 is analogous, with the modification that the output vertex gets an input of −2^{243} · 3^{1641/2} · γ^{364} from v′_5 if c′ outputs 0 and the edges on the path from v_0 to v_6 that goes through v′_4 are the ones that change signs. So, by induction, the claimed properties hold at the end of every time step. Furthermore, this argument shows that the sign of the edge from v_0 to v_1 changes in exactly the time steps where the net outputs the wrong value and c and c′ do not both output 2. □ So, the memory unit satisfies some but not all of the properties we would like a memory component to have. We can read the bit it is storing, and we can control which time steps it might change in by controlling the inputs to c and c′. However, for it to work we need the output of the overall net to be ±1 in every time step, and each such memory component will input ±2^{243} · 3^{1641/2} · γ^{364} to the output vertex every time we try to flip it. More problematically, the values these components are storing can only change when the net gets the output wrong. We can deal with the first issue by choosing parameters such that 2^{243} · 3^{1641/2} · γ^{364} is the inverse of an integer that is at least as large as the number of bits that we want to remember, and then adding some extraneous memory components that we can flip in order to ensure that exactly 1/(2^{243} · 3^{1641/2} · γ^{364}) memory components get flipped in each time step. We cannot change the fact that the net will not learn from samples where it got the output right, but we can use this to emulate any efficient learning algorithm that only updates when it gets something wrong. More formally, we have the following. We construct the net as follows. First, we take b + b′ copies of the memory unit, merge all of the copies of v_6 to make an output vertex, and merge all of the copies of v_0. Then we add in input vertices and a constant vertex and add an edge of weight 2 from the constant vertex to v_0.
Next, define g : {0,1}^{n+b} → {0,1}^{1+2b+2b′} such that given x ∈ {0,1}^n and y ∈ {0,1}^b, g(x, y) lists h(x, y) and one half the values of the c and c′ necessary to change the values stored by the first b memory units in the net from y to u(x, y) and then flip the next b′ − |{i : y_i ≠ (u(x, y))_i}| memory units, provided the net outputs 2h(x, y) − 1. Then, add a section to the net that computes g on the input bits and the bits stored in the first b memory units, and connect each copy of c or c′ to the appropriate output by an edge of weight 1/2 and the constant bit by an edge of weight 1.
In order to show that this works, first observe that since ℎ and can be computed efficiently, so can . So, there exists a polynomial sized subnet that computes it correctly by Lemma 4.2. That lemma also shows that this section of the net will never change as long as all of the inputs and all of the memory bits encode 0 or 1 in every time step. Similarly, in every time step 0 will have an input of 2 and all of the copies of and ′ will have inputs of 0 or 2. So, the derivatives of their outputs with respect to their inputs will be 0, which means that the weights of the edges leading to them will never change. That means that the only edges that could change in weight are those in the memory components. In each time step, ′ memory components each contribute (2ℎ ( , −1 ) − 1)∕ ′ to the output vertex, so it takes on a value of (2ℎ ( , −1 ) − 1), assuming that the memory components were storing −1 as they were supposed to. As such, the net outputs ⋆ , the memory bits stay the same if ⋆ = , and the first memory bits get changed to ( , −1 ) otherwise, with some irrelevant changes to the rest. Therefore, by induction on the time step, this net performs correctly on all time steps. □

Remark 4.6. With the construction in this proof, ′ will always be at least 10^79, which ensures that this net will be impractically large. This is a result of the fact that the only edges going to the output vertex are those contained in the memory component, and the paths in the memory component take a small activation and repeatedly cube it. If we had chosen an activation function that raises its input to the power 11∕9 when its absolute value is less than 1, instead of cubing it, the minimum possible value of ′ would have been on the order of 1000.
In other words, we can train a neural net with SGD in order to duplicate any efficient algorithm that takes bits as input, gives 1 bit as output, and only updates its memory when its output fails to match some designated 'correct' output. The only problematic part is the restriction that it cannot update its memory in steps when it gets the output right. As a result, the probability distribution of the samples that the net actually learns from could be different from the true probability distribution of the samples. We do not know how an algorithm that we are emulating will behave if we draw its samples from a different probability distribution, so this could cause problems. Our solution to that will be to have a training phase where the net gives random outputs so that it will learn from each sample with probability 1∕2, and then switch to attempting to compute the actual correct output rather than learning. That allows us to prove the following (re-statement of Theorem 2.5).

Proof. We can assume that the algorithm counts the samples it has received, learns from the designated number, and then stops learning if it receives additional samples. The fact that the algorithm learns in polynomial time also means that it can only update a polynomial number of locations in memory, so it only needs a polynomial number of bits of memory, . Also, its learning process can be divided into steps which each query at most one new sample ( , ( )) and one new random bit. So, there must be an efficiently computable function such that if is the value of the algorithm's memory at the start of a step, and it receives ( , ) as its sample (if any) and as its random bit (if any), then it ends the step with its memory set to ( , , , ).
Next, let 0 be the initial state of the algorithm's memory, and consider setting = ′ ( −1 , , ( ), , ′ ) for each > 0. We know that ′ is equally likely to be 0 or 1 and independent of all other components, so is equal to ( −1 , , ( ), ) with probability 1∕2 and −1 otherwise. Furthermore, the probability distribution of ( −1 , , ( ), ) is independent of whether or not = ′ . Also, if we set ′ = 0 and then repeatedly replace ′ with ( ′ , , ( ), ), then there is some polynomial number of times we need to do that before ′ stops changing because the algorithm has enough samples and is no longer learning. So, with probability 1 − o(1), the value of will stabilize by the time it has received times that many samples. Furthermore, the probability distribution of the value stabilizes at is exactly the same as the probability distribution of the value the algorithm's memory stabilizes at because the probability distribution of tuples ( −1 , , ( )) that actually result in changes to is exactly the same as the overall probability distribution of ( −1 , , ( )). So, given the final value of , one can efficiently compute with an expected accuracy of at least . Now, let ( , ) be the value the algorithm outputs when trying to compute ( ) if its memory has a value of after training. Then, define ′′ such that By the previous lemma, there exists a polynomial sized net ( , ) and > 0 such that if we use SGD to train ( , ) on ((2 − 1, 2 − 1, 2 ′ − 1), ( )) with a learning rate of then the net outputs 2 ′′ ( −1 , , , ′ ) − 1 for all . By the previous analysis, that means that after a polynomial number of steps, the net will compute with an expected accuracy of − o(1). □

Remark 4.8. This net uses two random bits because it needs one in order to randomly choose outputs during the learning phase and another to supply randomness in order to emulate randomized algorithms.
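The mistake-driven emulation with a random-output training phase can be sketched as a toy model. All names here are illustrative stand-ins (the paper's construction is a neural net, not a Python class); the point is only that random ±1 guesses during the training phase make the learner update on each sample with probability 1∕2, independently of the sample itself.

```python
import random

class MistakeDrivenEmulator:
    """Toy model of the construction: the emulated learner can only update
    its memory on steps where the output differs from the label.  During the
    training phase the net outputs a uniformly random guess, so each sample
    is learned from with probability 1/2; afterwards it predicts for real."""

    def __init__(self, update_rule, init_memory, train_steps):
        self.update = update_rule      # (memory, sample) -> new memory
        self.memory = init_memory
        self.train_steps = train_steps
        self.t = 0

    def step(self, x, label, predict):
        self.t += 1
        if self.t <= self.train_steps:
            out = random.choice([-1, 1])   # random guess: wrong w.p. 1/2
        else:
            out = predict(self.memory, x)  # computation phase
        if out != label:                   # memory can only change on mistakes
            self.memory = self.update(self.memory, (x, label))
        return out
```

Because the update decision depends only on an independent coin flip, the sub-stream of samples actually learned from has the same distribution as the full stream, which is the key point of the argument above.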
If we let be the minimum number of gates in a circuit that computes the algorithm's output and the contents of its memory after the current timestep from its input, its current memory values, and feedback on what the correct output was, then the neural net in question will have ( ) vertices and = ( 362∕364 ). If the algorithm that we are emulating is deterministic, then will be approximately twice the number of samples the algorithm needs to learn the function; if it is randomized it might need a number of additional samples equal to approximately twice the number of random bits the algorithm needs.
So, for any distribution of functions from {0, 1} to {0, 1} that can be learned in polynomial time, there is a neural net that learns it in polynomial time when it is trained by SGD.
Proof of Corollary 2.9. The previous theorem shows that each efficiently learnable (  ,  ) has some neural net that learns it efficiently. We will next use a Kolmogorov complexity-like argument to emulate a metaalgorithm as follows.
Learning-Metaalgorithm(c):
(1) List every algorithm that can be written in at most log(log( )) bits.
(2) Get samples from the target distribution, and train each of these algorithms on them in parallel. If any of these algorithms takes more than time steps on any sample, then interrupt it and skip training it on that sample.
(3) Get more samples, have all of the aforementioned algorithms attempt to compute the function on each of them, and record which of them was most accurate. Again, if any of them takes more than steps on one of these samples, interrupt it and consider it as having computed the function incorrectly on that sample.
(4) Return the function that resulted from training the most accurate algorithm.
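The steps above amount to a train-all, validate, select-best loop. A minimal sketch, with hypothetical `train`/`predict` interfaces standing in for the enumerated algorithms and without the time-limit interrupts:

```python
def learning_metaalgorithm(algs, train_samples, test_samples):
    """Train every candidate algorithm on the same samples, then keep the one
    with the best empirical accuracy on fresh samples.  In the metaalgorithm,
    `algs` is every algorithm writable in at most log(log(n)) bits, and any
    training or evaluation step exceeding the time budget is interrupted."""
    for alg in algs:
        for sample in train_samples:
            alg.train(sample)

    def accuracy(alg):
        return sum(alg.predict(x) == y for x, y in test_samples) / len(test_samples)

    return max(algs, key=accuracy)
```

Since only O(log( )) candidates are compared, a polynomial number of validation samples suffices to rank them accurately, which is what the concentration argument below exploits.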
Given any distribution that is efficiently learnable, there exist , > 0 such that there is some algorithm that learns (  ,  ) with accuracy 1∕2 + − o(1), needs at most samples in order to do so, and takes a maximum of time steps on each sample. For all sufficiently large , this algorithm will be less than log(log( )) bits long, so Learning-Metaalgorithm(c) will consider it. There are only (log( )) algorithms that are at most log(log( )) bits long, so in the testing phase all of them will have observed accuracies within ( − ∕2 log( )) of their actual accuracies with high probability. That means that the function that Learning-Metaalgorithm(c) judges as most accurate will be at most ( − ∕2 log( )) less accurate than the true most accurate function considered. So, Learning-Metaalgorithm(c) learns (  ,  ) with accuracy 1∕2 + − o(1). More precisely, this shows that for any efficiently learnable distribution, there exists 0 such that for all > 0 , Learning-Metaalgorithm(c) learns (  ,  ).

Now, if we let ( , ) be a neural net emulating Learning-Metaalgorithm(c), then ( , ) has polynomial size and can be constructed in polynomial time for any fixed . Any efficiently learnable distribution can be learned by training ( , ) with stochastic gradient descent with the right and the right learning rate, assuming that random bits are appended to the input. Furthermore, the only thing we need to know about the data distribution in order to choose the net and learning rate is some upper bound on the number of samples and amount of time needed to learn it. □

Remark 4.9. The previous remark shows that for any > 0, there is a polynomial sized neural net that learns any (  ,  ) that can be learned by an algorithm that uses samples and time per sample. However, that is still more restrictive than we really need to be.
It is actually possible to build a net that learns any (  ,  ) that can be efficiently learned using memory, and then computed in time once the learning process is done. In order to show this, first observe that any learning algorithm that spends more than time on each sample can be rewritten to simply get a new sample and ignore it after every steps. That converts it to an algorithm that spends time after receiving each sample while multiplying the number of samples it needs by an amount that is at most polynomial in .
The fact that we do not know how many samples the algorithm needs can be dealt with by modifying the metaalgorithm to find the algorithm that performs best when trained on 1 sample, then the algorithm that performs best when trained on 2, then the algorithm that performs best when trained on 4, and so on. That way, after receiving any number of samples, it will have learned to compute the function with an accuracy that is within o(1) of the best accuracy attainable after learning from 1∕4 of that number of samples. The fact that we do not know how many samples we need also renders us unable to have a learning phase, and then switch to attempting to compute the function accurately after we have seen enough samples. Instead, we need to have it try to learn from each sample with a gradually decreasing probability and try to compute the function otherwise.

For instance, consider designing the net so that it keeps a count of exactly how many times it has been wrong. Whenever that number reaches a perfect square, it attempts to learn from the next sample; otherwise, it tries to compute the function on that input. If it takes the metaalgorithm ′ samples to learn the function with accuracy 1 − , then it will take this net roughly ( ′)² samples to learn it with the same accuracy, and by that point the steps where it attempts to learn the function rather than computing it will only add another o(1) to the error rate. So, if there is any efficient algorithm that learns (  ,  ) with memory and computes it in time once it has learned it, then this net will learn it efficiently.
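The perfect-square schedule can be written down directly; `should_learn` is a hypothetical helper name. After the mistake counter passes m², roughly m learning attempts have been made, so the fraction of steps spent learning rather than computing decays on the order of 1∕√count, contributing only o(1) to the error rate.

```python
import math

def should_learn(mistake_count):
    """Attempt to learn from the next sample exactly when the running count
    of wrong outputs is a perfect square; otherwise try to compute the label."""
    r = math.isqrt(mistake_count)
    return r * r == mistake_count
```

For example, among the first 10000 values of the counter, only the 100 perfect squares trigger a learning step.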
Finally, it is necessary to know in order to obtain such a universality result, since given a neural net of size ( ′ ), one could simply pick a function that requires a net of size Ω( ′ +1 ) to compute.

Noisy emulation of arbitrary algorithms
So far, our discussion of emulating arbitrary learning algorithms using SGD has assumed that we are using SGD without noise. It is of particular interest to ask whether there are efficiently learnable functions that noisy SGD can never learn with inverse-polynomial noise, as perfect GD or SQ algorithms break in such cases (e.g., parities). It turns out that the emulation argument can be adapted to sufficiently small amounts of noise. The computation component is already fairly noise tolerant because the inputs to all of its vertices will normally have absolute values of at least 2. If these are changed by less than 1∕2, these vertices will still have activations of ±2 with the same signs as before, and the derivatives of their activations with respect to their inputs will remain 0.
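This tolerance margin is easy to check numerically. Below is an assumed stand-in for the activation used in the construction, which is constant at ±2 (hence has derivative 0) once its input's absolute value reaches 3∕2: perturbing an input of absolute value at least 2 by less than 1∕2 changes neither the output nor the local derivative.

```python
def saturating_act(x):
    """Constant at +-2 once |x| >= 3/2; the shape below 3/2 is immaterial here."""
    return 2.0 if x >= 1.5 else (-2.0 if x <= -1.5 else x)

def unaffected_by_noise(x, delta):
    """True iff the perturbed input gives the same saturated output as the
    clean input.  Requires |x| >= 2 and |delta| < 1/2, as in the text."""
    assert abs(x) >= 2 and abs(delta) < 0.5
    return saturating_act(x + delta) == saturating_act(x)
```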
However, the memory component has more problems handling noise. In the noise-free case, whenever we do not want the value it stores to change, we arrange for some key vertices inside the component to receive input 0 so that their outputs and the derivatives of their outputs with respect to their inputs will both be 0. However, once we start adding noise we will no longer be able to ensure that the inputs to these vertices are exactly 0. This could result in a feedback loop where the edge weights shift faster and faster as they get further away from their desired values. In order to avoid this, we will use an activation function designed to have output 0 whenever its input is sufficiently close to 0. More precisely, in this section we will use an activation function ⋆ ∶ ℝ → ℝ chosen so that ⋆ (x) = 0 whenever |x| ≤ 2^−121 ⋅ 3^−9 , ⋆ (x) = x³ whenever 2^−120 ⋅ 3^−9 ≤ |x| ≤ 1, and ⋆ (x) = 2 sign(x) whenever |x| ≥ 3∕2. There must be a way to define ⋆ on the remaining intervals such that it is smooth and nondecreasing. The details of how this is done will not affect our argument, so we pick some such assignment.
FIGURE 4. The noise-tolerant memory component ′ .
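A concrete instance of such an activation can be sketched as follows. The constants are those stated in the text; the two bridging intervals are filled with arbitrary placeholder interpolations (the text only requires some smooth nondecreasing choice there, and the piecewise-linear bridges below are continuous and monotone but not smooth).

```python
def sigma_star(x, a=2**-121 * 3**-9, b=2**-120 * 3**-9):
    """Sketch of the noise-tolerant activation: exactly 0 on a neighbourhood
    of 0 (so small input noise yields zero output and zero derivative),
    cubing on [b, 1], and saturating at +-2 for |x| >= 3/2."""
    s = 1.0 if x >= 0 else -1.0
    t = abs(x)
    if t <= a:
        return 0.0                           # dead zone: output and derivative 0
    if t <= b:
        return s * b**3 * (t - a) / (b - a)  # placeholder bridge on (a, b)
    if t <= 1.0:
        return s * t**3                      # cubing regime
    if t < 1.5:
        return s * (1.0 + 2.0 * (t - 1.0))   # placeholder bridge on (1, 3/2)
    return s * 2.0                           # saturated regime
```

The dead zone is what prevents the feedback loop described above: inputs pushed slightly away from 0 by noise still produce output 0 and gradient 0, so the surrounding edge weights do not drift.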
The memory component also has trouble handling bit flips when there is noise. Any time we flip a bit stored in memory, any errors in the edge weights of the copy of storing that bit are likely to get worse. As a result, making the memory component noise tolerant requires a fairly substantial redesign. First of all, in order to prevent perturbations in its edge weights from being amplified until they become major problems, we will only update each value stored in memory once. That still leaves the issue that due to errors in the edge weights, we cannot ensure that the output of the net is exactly ±1. As a result, even if the net gets the output right, the edge weights will still change somewhat. That introduces the possibility that multiple unsuccessful attempts at flipping a bit in memory will eventually cause major distortions to the corresponding edge weights. In order to address that, we will have our net always give an output of 1∕2 during the learning phase so that whenever we try to change a value in memory, it will change significantly regardless of what the correct output is. Of course, that leaves each memory component with 3 possible states: the state it is in originally, the state it changes to if the correct output is 1, and the state it changes to if the correct output is −1. More precisely, each memory value will be stored in a copy of the following.

Definition 4.10. Let ′ be the weighted directed graph with 9 vertices, 0 , 1 , 2 , 3 , 4 , 5 , , ′ , and and the following edges:
(1) An edge of weight 3^− ∕2 ∕4 from to +1 for each
(2) An edge of weight 128 from 1 to
(3) An edge of weight −2^−81 ⋅ 3^−9 from to 4
(4) An edge of weight −2^−41 ⋅ 3^−9 from ′ to 4
See Figure 4 for a representation of ′ . The idea is that by controlling the values of and ′ we can either force 4 to have an input of approximately 0 in order to prevent any of the weights from changing, or allow it to have a significant value, in which case the weights will change.
With the correct learning rate, if the correct output is 1 then the weights of the edges on the path from 0 to 5 will double, while if the correct output is −1 then these weights will multiply by −2. That means that 2 will have an output of approximately 2^−24 ⋅ 3^−3∕2 if this has never been changed, and an output of approximately 2^−12 ⋅ 3^−3∕2 if it has. Meanwhile, will have an output of −2 if it was changed when the correct output was −1 and a value of 2 otherwise. More formally, we have the following.

Lemma 4.11 (Editing memory using noisy SGD). Let = 2^716∕3 ⋅ 3^24 , and ( ) = 2 for all . Next, let 0 , ∈ ℤ + and 0 < , ′ such that ≤ 2^−134 ⋅ 3^−11 , ′ ≤ 2^−123 ⋅ 3^−11 . Also, let ( ⋆ , ) be a neural net such that contains ′ as a subgraph with 5 as 's output vertex, 0 as the constant vertex, and no edges from vertices outside this subgraph to vertices in the subgraph other than and ′ . Now, assume that this neural net is trained using noisy SGD with learning rate and loss function for − 1 time steps, and then evaluated on an input, and the following hold:
(1) The sample label is always ±1.

Proof. First of all, we define the target weight of an edge to be what we would like its weight to be. More precisely, the target weights of ( , 4 ), ( ′ , 4 ), and ( 1 , ) are defined to be equal to their initial weights at all time steps. The target weights of the edges on the path from 0 to 5 are defined to be equal to their initial weights until step 0 . After step 0 , these edges have target weights that are equal to double their initial weights if the sample label at step 0 was 1 and −2 times their initial weights if the sample label at step 0 was −1.
Next, we define the primary distortion of a given edge at a given time to be the sum of all noise terms added to its weight by noisy SGD up to that point. Then, we define the secondary distortion of an edge to be the difference between its weight, and the sum of its target weight and its primary distortion. By our assumptions, the primary distortion of any edge always has an absolute value of at most . We plan to prove that the secondary distortion stays reasonably small by inducting on the time step, at which point we will have established that the actual weights of the edges stay reasonably close to their target weights.

Now, for all vertices and ′ , and every time step , let ( , ′ )[ ] be the weight of the edge from to ′ at the start of step , [ ] be the output of on step , [ ] be the derivative of the loss function with respect to the output of on step , and ′ [ ] be the derivative of the loss function with respect to the input of on step . Next, consider some < 0 and assume that the secondary distortion of every edge in ′ is 0 at the start of step . In this case, 1 has an activation in [(1∕4 − )³ , (1∕4 + )³], so has an activation of 2 and the derivative of the loss function with respect to ( 1 , ) is 0. Also, the activation of 2 is between 2^−25 ⋅ 3^−3∕2 and 2^−23 ⋅ 3^−3∕2 . On another note, the total input to 4 on step is On the flip side, the total input to 4 on step is at least So, | 4 [ ] | = 0, and the edge from 4 to 5 provides an input of 0 to the output vertex on step . The derivative of this contribution with respect to the weights of any of the edges in ′ is also 0. So, if all of the secondary distortions are 0 at the beginning of step , then all of the secondary distortions will still be 0 at the end of step . The secondary distortions start at 0, so by induction on , the secondary distortions are all 0 at the end of step for every < min( 0 , ).
This also implies that the edge from 4 to 5 provides an input of 0 to the output, [ ] = 2, and 2 [ ] ∈ [2^−25 ⋅ 3^−3∕2 , 2^−23 ⋅ 3^−3∕2 ] for every < 0 . Now, consider the case where = 0 ≤ . In this case, has an activation of 2 and the derivative of the loss function with respect to ( 1 , ) is 0 for the same reasons as in the last case. Also, the activation of 2 is still between 2^−25 ⋅ 3^−3∕2 and 2^−23 ⋅ 3^−3∕2 .
On this step, the total input to 4 is On the flip side, the total input to 4 is at least ⋅ ′ [5] which is between 2^−240 ⋅ 3^−29 ⋅ (1 − 7200 ) ⋅ 3^(4− ∕2) ⋅ ′ [5] and 2^−240 ⋅ 3^−29 ⋅ (1 + 7200 ) ⋅ 3^(4− ∕2) ⋅ ′ [5] So, if the sample label is 1, then on this step gradient descent increases the weight of each edge on the path from 0 to 5 by an amount that is within 3600 + 2 ′ of its original value. If the sample label is −1, then on this step gradient descent decreases the weight of each edge on this path by an amount that is within 10800 + 6 ′ of thrice its original value. Either way, it leaves the weight of the edge from 1 to unchanged. So, all of the secondary distortions will be at most 10800 + 6 ′ at the end of step 0 if 0 < .
Finally, consider the case where > 0 and assume that the secondary distortion of every edge in ′ is at most 10800 + 6 ′ at the start of step . Also, let ′′ = 10801 + 6 ′ , and 0 be the sample label from step 0 . In this case, 1 has an activation between (1∕2 − ′′ )³ 0 and (1∕2 + ′′ )³ 0 , so has an activation of 2 0 and the derivative of the loss function with respect to ( 1 , ) is 0. Also, the activation of 2 is between 2^−13 ⋅ 3^−3∕2 and 2^−11 ⋅ 3^−3∕2 . On another note, the total input to 4 on step is On the flip side, the total input to 4 on step is at least So, 4 [ ] = 0, and the edge from 4 to 5 provides an input of 0 to the output vertex on step . The derivatives of this contribution with respect to the weights of any of the edges in ′ are also 0. So, if all of the secondary distortions are at most 10800 + 6 ′ at the beginning of step , then all of the secondary distortions will still be at most 10800 + 6 ′ at the end of step . We have already established that the secondary distortions will be in that range at the end of step 0 , so by induction on , the secondary distortions are all at most 10800 + 6 ′ at the end of step for every 0 < < ′ . This also implies that the edge from 4 to 5 provides an input of 0 to the output, □

Now that we have established that we can use ′ to store information in a noise tolerant manner, our next order of business is to show that we can make the computation component noise-tolerant. This is relatively simple because all of its vertices always have inputs of absolute value at least 2, so changing these inputs by less than 1∕2 has no effect. We have the following.

′ be a function that can be computed by a circuit made of AND, OR, and NOT gates with a total of gates. Also, consider a neural net with input vertices ′ 1 , … , ′ , and choose real numbers (0) < (1) .
It is possible to add a set of at most new vertices to the net, including output vertices ′′ 1 , … , ′′ ′ , along with edges leading to them such that for any possible addition of edges leading from the new vertices to old vertices, if the net is trained by noisy SGD, the output of ′ is either less than (0) or more than (1) for every in every timestep, and for every edge leading to one of the new vertices, the sum of the absolute values of the noise terms applied to that edge over the course of the training process is less than 1∕12, then the following hold: with values less than (0) and values greater than (1) representing 0 and 1 respectively for each , then the output of ′′ encodes ℎ ( 1 , … , ) for each with −2 and 2 encoding 0 and 1 respectively.
Footnote 19: Note that these will not be the data input of the general neural net that is being built; these input vertices take both the data inputs and some inputs from the memory component.
Footnote 20: This time we can use the same values of (0) and (1) for all ′ because we just need them to be between whatever the vertex encodes 0 as and whatever it encodes 1 as for all vertices.
Proof. In order to do this, we will add one new vertex for each gate and each input in a circuit that computes ℎ. When the new vertices are used to compute ℎ, we want each vertex to output 2 if the corresponding gate or input outputs a 1 and −2 if the corresponding gate or input outputs a 0. In order to do that, we need the vertex to receive an input of at least 3∕2 if the corresponding gate outputs a 1 and an input of at most −3∕2 if the corresponding gate outputs a 0. No vertex can ever give an output with an absolute value greater than 2, and by assumption none of the edges leading to the new vertices will have their weights changed by 1∕12 or more by the noise. As such, any noise terms added to the weights of edges leading to a new vertex will alter its input by at most 1∕6 of its in-degree. So, as long as its input without these noise terms has the desired sign and an absolute value of at least 3∕2 plus 1∕6 of its in-degree, it will give the desired output.
In order to make one new vertex compute the NOT of another new vertex, it suffices to have an edge of weight −1 to the vertex computing the NOT and no other edges to that vertex. We can compute an AND of two new vertices by having a vertex with two edges of weight 1 from these vertices and an edge of weight −2 from the constant vertex. Similarly, we can compute an OR of two new vertices by having a vertex with two edges of weight 1 from these vertices and an edge of weight 2 from the constant vertex. For each , in order to make a new vertex corresponding to the th input, we add a vertex and give it an edge of weight 4∕( (1) − (0) ) from the associated ′ and an edge of weight −(2 (1) + 2 (0) )∕( (1) − (0) ) from the constant vertex. These provide an overall input of at least 2 to the new vertex if ′ has an output greater than (1) and an input of at most −2 if ′ has an output less than (0) . This ensures that if the outputs of the ′ encode binary values 1 , … , appropriately, then each of the new vertices will output the value corresponding to the output of the appropriate gate or input. So, these vertices compute ℎ( 1 , … , ) correctly. Furthermore, since the input to each of these vertices is outside of (−3∕2, 3∕2), the derivatives of their activation functions with respect to their inputs are all 0. As such, the derivative of the loss function with respect to any of the edges leading to them is always 0, and paths through them do not contribute to changes in the weights of edges leading to the ′ . □ Now that we know that we can make the memory component and computation component work, it is time to put the pieces together. We plan to have the net simply memorize each sample it receives until it has enough information to compute the function. 
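The three gadgets can be checked end-to-end with the ±2 encoding (2 = true, −2 = false). The activation below is an assumed stand-in that saturates at ±2 for inputs of absolute value at least 3∕2, and the constant vertex is assumed to output 1, so that the weight-±2 edges from it act as biases of ±2:

```python
def act(x):
    """Saturating stand-in activation: +-2 once |x| >= 3/2."""
    return 2.0 if x >= 1.5 else (-2.0 if x <= -1.5 else x)

def NOT(a):
    return act(-1.0 * a)        # single edge of weight -1

def AND(a, b):
    return act(a + b - 2.0)     # weights 1, 1 and a -2 bias from the constant vertex

def OR(a, b):
    return act(a + b + 2.0)     # weights 1, 1 and a +2 bias from the constant vertex
```

Note that in every case the pre-activation input to a gate vertex has absolute value at least 2, which is exactly the margin the noise-tolerance argument exploits.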
More precisely, if there is an algorithm that needs samples to learn functions from a given distribution, our net will have 2 copies of ′ corresponding to every combination of a timestep 1 ≤ ≤ , an input bit, and a value for said bit. Then, in step it will set the copies of ′ corresponding to the inputs it received in that time step. That will allow the computation component to determine what the current time step is, and what the inputs and labels were in all previous time steps by checking the values of the copies of 2 and . That will allow it to either determine which copies of ′ to set next, or attempt to compute the function on the current input and return it. This design works in the following sense. to train ( , ) on (2 ( ) − 1, 2 ( ) − 1) for 0 ≤ < and then run the resulting net on 2 , − 1, we will get an output within 1∕2 of 2ℎ( (0) , (0) , (1) , (1) , … , ( ) ) − 1 with probability 1 − o(1).
Proof. We construct as follows. We start with a graph consisting of input vertices. Then, we take 2 copies of ′ , merge all of the copies of 0 to make a constant vertex, and merge all of the copies of 5 to make an output vertex. We assign each of these copies a distinct label of the form ′ ′ , , , where 0 ≤ ′ < , 0 < ≤ , and ∈ {0, 1}. We also add edges of weight 1 from the constant vertex to all of the control vertices. Next, for each 0 ≤ ′ < , we add an output control vertex [ ′ ] . For each such ′ , we add an edge of weight 1 from the constant vertex to [ ′ ] , and the edge from [ ′ ] to the output vertex has weight 49∕100. Finally, we use the construction from the previous lemma to build a computation component. This component will get input from all of the input vertices and every copy of and 2 in any of the copies of ′ , interpreting anything less than 2^−21 ⋅ 3^−3∕2 as a 0 and anything more than 2^−15 ⋅ 3^−3∕2 as a 1. This should allow it to read the input bits, and determine which of the copies of ′ have been set and what the sample outputs were when they were set. For each control vertex from a copy of ′ and each of the first output control vertices, the computation component will contain a vertex with an edge of weight 1∕2 leading to that vertex. It will contain two vertices with edges of weight 1∕2 leading to [ ] . This should allow it to set each control vertex or output control vertex to 0 or 2, and to set [ ] to −2, 0, or 2.
The computation component will be designed so that in each time step it will do the following, assuming that its edge weights have not changed too much and the outputs of the copies of and 2 are in the ranges given by Lemma 4.11. First, it will determine the smallest 0 ≤ ≤ such that ′ ( ′ , , ) has not been set for any ′ ≥ , 0 < ≤ , and ∈ {0, 1}. That should equal the current timestep. If < , then it will do the following. For each 0 < ≤ , it will use the control vertices to set ′ ( , ,[ ′ +1]∕2) , where ′ is the value it read from the th input vertex. It will keep the rest of the copies of ′ the same. It will also attempt to make [ ] output 2 and the other output control vertices output 0. If = , then for each 0 ≤ ′ < and 1 ≤ ≤ , the computation component will set ⋆( ′ ) to 1 if ′ ( ′ , ,1) has been set, and 0 otherwise. It will set ⋆( ′ ) to 1 if either ′ ( ′ ,1,0) or ′ ( ′ ,1,1) has been set in a timestep when the sample label was 1, and 0 otherwise. It will also let ⋆( ) be the values of ( ) inferred from the input. Then it will attempt to make [ ] output 4ℎ( ⋆(0) , ⋆(0) , … , ⋆( ) ) − 2 and the other output control vertices output 0. It will not set any of the copies of ′ in this case.
In order to prove that this works, set = min(2^−134 ⋅ 3^−11 , 2^77 ⋅ 3^15 ∕ ) and ′ = 2^−123 ⋅ 3^−11 . The absolute value of the noise term applied to every edge in every time step is at most 1∕ ² , so the sums of the absolute values of the noise terms applied to every edge over the course of the algorithm are at most if > 2^67 ⋅ 3^6 . For the rest of the proof, assume that this holds. Now, we claim that for every 0 ≤ ′ < , all of the following hold:
(1) Every copy of or 2 in the memory component outputs a value that is not in
(4) The weight of every edge leading to an output control vertex ends step ′ with a weight that is within of its original weight.
(5) For every ′′ > ′ , the weight of the edge from [ ′′ ] to the output vertex has a weight within of its original weight at the end of step ′ .
In order to prove this, we use strong induction on ′ . So, let 0 ≤ ′ < , and assume that this holds for all ′′ < ′ . By assumption, the conditions of Lemma 4.11 were satisfied for every copy of ′ in the first ′ timesteps. So, the outputs of the copies of and 2 encode information about their copies of ′ in the manner given by this lemma. In particular, that means that their outputs are not in [2^−21 ⋅ 3^−3∕2 , 2^−15 ⋅ 3^−3∕2 ] on timestep ′ . By the previous lemma, the fact that this holds for timesteps 0 through ′ means that the computation component will still be working properly on step ′ , it will be able to interpret the inputs it receives correctly, and its output vertices will take on the desired values. The assumptions also imply that every copy of or ′ took on values of 0 or 2 in step ′′ for every ′′ < ′ . That means that the derivatives of the loss function with respect to the weights of the edges leading to these vertices were always 0, so their weights at the start of step ′ were within of their initial weights. That means that the inputs to these copies will be in [−4 , 4 ] for ones that are supposed to output 1 and in [2 − 4 , 2 + 4 ] for ones that are supposed to output 2. Between this and the fact that the computation component is working correctly, we have that for each ( ′′ , , ), the copies of and ′ in ′ ( ′′ , , ) will have taken on values satisfying the conditions of Lemma 4.11 in timesteps 0 through ′ with 0 set to ′′ if ( ′′ ) = and + 1 otherwise.
Similarly, the fact that the weights of the edges leading to the output control vertices stay within of their original values for the first ′ − 1 steps implies that [ ′′ ] outputs 2 and all other output control vertices output 0 on step ′′ for all ′′ ≤ ′ . That in turn implies that the derivatives of the loss function with respect to these weights were 0 for the first ′ + 1 steps, and thus that their weights are still within of their original values at the end of step ′ . Now, observe that there are exactly copies of ′ that get set in step ′ , and each of them provides an input to the output vertex in 4∕2. So, the net gives an output in [1∕2 − ′ , 1∕2 + ′ ] on step ′ , as desired. This also implies that the derivative of the loss function with respect to the weights of the edges from all output vertices except [ ′ ] to the output vertex are 0 on step ′ . So, for every ′′ > ′ , the weight of the edge from [ ′′ ] to the output vertex is still within of its original value at the end of step ′ . This completes the induction argument.

This means that on step , all of the copies of and will still have outputs that encode whether or not they have been set and what the sample output was on the steps when they were set in the manner specified in Lemma 4.11, and that the computation component will still be working. So, the computation component will set ⋆( ′ ) = ( ′ ) and ⋆( ′ ) = ( ′ ) for each ′ < . It will also set ⋆( ) = ( ) , and then it will compute ℎ( (0) , (0) , (1) , (1) , … , ( ) ) correctly. Call this expression ′ . All edges leading to the output control and control vertices will still have weights within of their original values, so it will be able to make [ ] output 4 ′ − 2, all other output control vertices output 0, and none of the copies of ′ provide a nonzero input to the output vertex. The output of [ ] is 0 in all timesteps prior to , so the weight of the edge leading from it to the output vertex at the start of step is within of its original value.
So, the output vertex will receive a total input that is within 2 of 49 50 (2 ′ − 1), and give an output that is within 6 of 49 3 50 3 (2 ′ − 1). That is within 1∕2 of 2 ′ − 1, as desired. □ This allows us to prove that we can emulate an arbitrary algorithm, using the fact that the output of any efficient algorithm can be expressed as an efficiently computable function of its inputs and some random bits. More formally, we have the following (restatement of Theorem 2.8). Proof. Let be an efficient algorithm that learns (  ,  ) with accuracy , and be a polynomial in such that uses fewer than samples and random bits with probability 1 − o(1). Next, define ℎ {0, 1} ( +1) + + →{0,1} such that the algorithm outputs ℎ ( 1 , … , , 1 , … , , ′ ) if it receives samples 1 , … , , random bits 1 , … , and final input ′ . There exists a polynomial ⋆ such that computes ℎ ( 1 , … , , 1 , … , , ′ ) in ⋆ or fewer steps with probability 1 − o(1) given samples 1 , … , generated by a function drawn from (  ,  ), random bits 1 , … , , and ′ ∼  . So, let ℎ ′ ( 1 , … , , 1 , … , , ′ ) be ℎ ( 1 , … , , 1 , … , , ′ ) if computes it in ⋆ or fewer steps and 0 otherwise. The function ℎ ′ can always be computed in polynomial time, so by the previous lemma there exists a polynomial-sized neural net ( , ) that gives an output within 1∕2 of 2ℎ ′ (( 1 , 1 ), … , ( , ), 1 , … , , ′ ) − 1 with probability 1 − o(1) when it is trained using noisy SGD with noise Δ, learning rate 2 716∕3 3 24 , and loss function on ((2 − 1, 2 − 1), 2 ( ) − 1) and then run on 2 ′ − 1. When the ( , ) are generated by a function drawn from (  ,  ), and ′ ∼  , using to learn the function and then compute it on ′ yields ℎ ′ ( 1 , … , , 1 , … , , ′ ) with probability 1 − o(1). Therefore, training this net with noisy SGD in the manner described learns (  ,  ) with accuracy − o(1). □ Remark 4.15.
As in the noise-free case, it would be possible to emulate a meta-algorithm that learns any function that can be learned from samples in time instead of an algorithm for a specific distribution. However, unlike in the noise-free case, there is no easy way to adapt the meta-algorithm to cases where we do not have an upper bound on the number of samples needed.
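The emulation hinges on the fact that a (possibly randomized) learning algorithm, once its random bits are fixed, is a deterministic function ℎ of its samples, its random bits, and the final input. A minimal Python sketch of this viewpoint, with a toy majority-vote learner standing in for the algorithm (all names here are hypothetical illustrations, not the paper's construction):

```python
def majority_learner(samples, random_bits, x_final):
    """Toy stand-in for an efficient learning algorithm: given its
    samples and random bits, its output is fully determined."""
    # This 'learned' rule ignores random_bits and x_final and simply
    # predicts the majority label among the samples (labels in {0, 1}).
    votes = sum(label for _, label in samples)
    return 1 if 2 * votes >= len(samples) else 0

def h(samples, random_bits, x_final, learner=majority_learner):
    """The Boolean function h(samples, random bits, final input)
    whose value the emulating network is trained to reproduce."""
    return learner(samples, random_bits, x_final)

samples = [((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
bits = [0, 1, 1, 0]
# With the random bits fixed, h is an ordinary deterministic function,
# so it can be computed by a polynomial-size circuit.
print(h(samples, bits, (0, 0)))   # 1 (majority of the labels is 1)
```

The point of the sketch is only that fixing the randomness turns the algorithm into a plain function of its inputs, which is the object the network emulates.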
Remark 4.16. Throughout the learning process used by the last theorem and lemma, every control vertex, output control vertex, and vertex in the computation component always takes on a value where the activation function has derivative 0. As such, the weights of any edges leading to these vertices stay within of their original values. Also, the conditions of lemma 4.11 are satisfied, so none of the edge weights in the memory component go above ′ more than double their original values. That leaves the edges from the output control vertices to the output vertex. Each output vertex only takes on a nonzero value once, and on that step it has a value of 2. The derivative of the loss function with respect to the input to the output vertex is at most 12, so each such edge weight changes by at most 24 + over the course of the algorithm. So, none of the edge weights go above a constant (i.e., 2 242 3 25 ) during the training process.

Additional comments on the emulation
The previous result uses a neural net and SGD parameters that are in many ways unreasonable: the activation function is not one used in practice, many of the vertices do not have edges from the constant vertex, and the learning rate is deliberately chosen so high that it keeps overshooting the minima. Nobody designing a neural net for a practical task would make these choices, and using such a net to emulate an algorithm is much less efficient than simply running the algorithm directly, so this construction is unlikely to arise in practice. In order to emulate a learning algorithm with a more reasonable neural net and choice of parameters, we will need the following ideas in addition to those from the previous result. First, we can control which edges tend to have their weights change significantly by giving the edges that we want to change a very low starting weight and then placing high-weight edges after them to increase the derivative of the output with respect to them. Second, rather than viewing the algorithm we are trying to emulate as a fixed circuit, we will view it as a series of circuits that each compute a new output and new memory values from the previous memory values and the current inputs. Third, a lower learning rate and tighter restrictions on how quickly the network can change prevent us from setting memory values in one step. Instead, we initialize the memory values to a local maximum so that once we perturb them, even slightly, they will continue to move in that direction until they take on their final value. Fourth, in most steps the network will not try to learn anything, so that with high probability all memory values that were set in one step will have enough time to stabilize before the algorithm tries to adjust anything else.
Finally, once the algorithm is ready to approximate the function, its estimates are connected to the output vertex, and the network's output gradually becomes more influenced by them over time as a basic consequence of SGD.
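The third idea, initializing memory values at a local maximum so that a small perturbation determines which stable value they drift to, can be illustrated with a hypothetical one-dimensional potential (not the paper's actual loss):

```python
def grad(w):
    # Gradient of the toy potential L(w) = w**4/4 - w**2/2, which has a
    # local maximum at w = 0 and stable minima at w = -1 and w = +1.
    return w ** 3 - w

def settle(w0, lr=0.1, steps=300):
    """Plain gradient descent on the toy potential, started at w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Even a tiny perturbation away from the local maximum at 0 keeps
# growing in the same direction until the value locks in at +1 or -1.
print(round(settle(+1e-3), 3))   # 1.0
print(round(settle(-1e-3), 3))   # -1.0
```

This is the mechanism that lets a low learning rate still drive a memory value all the way to its target: the sign of the initial nudge, not its magnitude, determines the final state.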

Proof of Theorem 2.17
Recall that for a sample set Proof of Theorem 2.17. Consider running the descent algorithm on either true data labeled with or random data labeled with random labels, that is, (5.4) Denote by ( ) the probability distribution of ( ) and let (≤ ) ∶= ( (1) , … , ( ) ). We then have the following.
) (5.5) ) . ) and denote by ,ℎ the distribution of ( −1) ,ℎ . Note that the above corresponds to taking one step using the data from ℎ after − 1 steps using the data from .
Using the triangle and data-processing inequalities, we have  and by Cauchy-Schwarz, We now investigate a single component ∈ ( ) in the above norm, that is, We have ) (5.33) and thus ) (5.37) We now show the tighter bound with the term CP 1∕2 rather than CP 1∕4 for parities, which implies the following result, a variant of the lower bound from [19] with slightly different exponents. We first need the following inequalities. As mentioned earlier, this is similar to Theorem 1 in [31], which additionally requires the function to be the gradient of a 1-Lipschitz loss function. We also mention the following corollary of Lemma 5.2, which results from Cauchy-Schwarz. In other words, the expected value of any function on an input generated by a random parity function is approximately the same as the expected value of the function on a truly random input.
Proof of Theorem 5.1. We follow the proof of Theorem 2.17 until (5.18), where we instead use Lemma 5.2 to write (for = ∞) where 2 − is the CP for parities. Thus, in the case of parities, we can remove a factor of 1∕2 on the exponent of the CP. Furthermore, the Cauchy-Schwarz inequality in (5.35) is no longer needed, and the null-flow can be defined in terms of the sum of gradient norms, rather than taking squared norms and having a root on the sum; this does not, however, change the scaling of the null-flow. The theorem follows by plugging in the parameters of the statement. □

Proof of Theorem 2.24
Our next goal is to make a similar argument for stochastic gradient descent. We argue that if we use noisy SGD to train a neural net on a random parity function, the probability distribution of the resulting net is similar to the probability distribution of the net we would get if we trained it on random values in +1 . This will be significantly harder to prove than in the case of noisy gradient descent, because while the difference in the expected gradients is exponentially small, the gradient at a given sample may not be. As such, drowning out the signal will require much more noise.
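The gap between per-sample and expected gradients can be seen already in a toy calculation (the support set and the feature here are arbitrary choices for illustration): a single-sample correlate is always ±1, while its average over inputs vanishes, which is why much more noise is needed to drown out SGD than GD.

```python
import itertools

n = 8
S = (0, 2, 5)   # hypothetical support of the parity function

def parity(x):
    # chi_S(x) in the ±1 convention
    p = 1
    for i in S:
        p *= x[i]
    return p

# A single sample gives a per-coordinate signal of magnitude 1 ...
x = (1, -1, 1, 1, -1, -1, 1, 1)
print(parity(x) * x[0])           # -1

# ... but averaged over all inputs the signal cancels exactly
# (for |S| > 1 this average is 0; more generally the expected
# gradient is exponentially small, which is what GD averages toward).
total = sum(parity(x) * x[0]
            for x in itertools.product((-1, 1), repeat=n))
print(total)                      # 0
```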

Gaussian noise, noise accumulation, and blurring
While the previous result works, it requires more noise than we would really like. The biggest problem with it is that it ultimately argues that even given a complete list of the changes in all edge weights at each time step, there is no way to determine the parity function with nontrivial accuracy, and this requires a lot of noise. However, in order to prove that a neural net optimized by noisy SGD (NSGD) cannot learn to compute the parity function, it suffices to prove that one cannot determine the parity function from the edge weights at a single time step. Furthermore, in order to prove this, we can use the fact that noise accumulates over multiple time steps and argue that the amount of accumulated noise is large enough to drown out the information on the function provided by each input. More formally, we plan to do the following. First of all, we will be running NSGD with a small amount of Gaussian noise added to each weight in each time step, and a larger amount of Gaussian noise added to the initial weights. Under these circumstances, the probability distribution of the edge weights resulting from running NSGD on truly random input for a given number of steps will be approximately equal to the convolution of a multivariate Gaussian distribution with something else. As such, it would be possible to construct an oracle approximating the edge weights such that the probability distribution of the edge weights given the oracle's output is essentially a multivariate Gaussian distribution. Next, we show that given any function on +1 , the expected value of the function on an input generated by a random parity function is approximately equal to its expected value on a truly random input. Then, we use that to show that given a slight perturbation of a Gaussian distribution for each ∈ +1 , the distribution resulting from averaging together the perturbed distributions generated by a random parity function is approximately the same as the distribution resulting from averaging together all of the perturbed distributions. Finally, we conclude that the probability distribution of the edge weights after this time step is essentially the same when the input is generated by a random parity function as it is when the input is truly random.
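The accumulation step can be checked numerically: after T steps of per-weight Gaussian noise, a single edge weight is its deterministic trajectory convolved with a Gaussian of variance sigma0**2 + T * sigma**2. A sketch with hypothetical parameters:

```python
import random
import statistics

random.seed(0)

sigma0 = 0.5    # std of the noise added to the initial weight
sigma = 0.05    # std of the noise added at each step
T = 100         # number of training steps

def final_weight(updates):
    """One edge weight after T noisy steps: deterministic updates plus
    independent Gaussian noise at initialization and at every step."""
    w = random.gauss(0.0, sigma0)
    for u in updates:
        w += u + random.gauss(0.0, sigma)
    return w

# Holding the deterministic updates fixed, the final weight is Gaussian
# with variance sigma0**2 + T * sigma**2 = 0.25 + 0.25 = 0.5: the
# per-step noise accumulates, which is what drowns out the signal.
updates = [0.01] * T
ws = [final_weight(updates) for _ in range(20000)]
print(round(statistics.pvariance(ws), 2))   # ≈ 0.5
```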
Our first order of business is to establish that the probability distribution of the weights will be approximately equal to the convolution of a multivariate Gaussian distribution with something else, and to do that we will need the following definition. In this situation we also say that is a ( , )-blurring. If ≤ 0, we consider every probability distribution to be a ( , )-blurring for all .
The following are obvious consequences of this definition: Lemma 5.5. Let  be a collection of ( , )-blurrings for some given and . Now, select ∼  according to some probability distribution, and then randomly select ∼ . The probability distribution of is also a ( , )-blurring. Lemma 5.6. Let be a ( , )-blurring and ′ > 0. Then *  (0, ′ ) is a ( + ′ , )-blurring We want to prove that if the probability distribution of the weights at one time step is a blurring, then the probability distribution of the weights at the next time step is also a blurring. In order to do that, we need to prove that a slight distortion of a blurring is still a blurring. The first step towards that proof is the following lemma: Proof. First, note that for any with || || 1 < and any and , it must be the case that | ( )| ≤ || || 1 < . That in turn means that for any , ′ with | || 1 , || ′ || 1 < and any , it must be the case that | ( ) − ( ′ ) | ≤ || − ′ || 1 with equality only if = ′ . In particular, this means that for any such , ′ , it must be the case that || ( ) − ( ′ )|| 1 ≤ || − ′ || 1 ≤ || − ′ || 1 with equality only if = ′ . Thus, + ( ) ≠ ′ + ( ′ ) unless = ′ . Also, note that the bound on the second derivatives of implies that | ( )| ≤ || || 2 1 ∕2 for all || || 1 < and all . This means that Next, observe that for any ≥ 0, it must be the case that In particular, if we set = ∕ − √ 2 ∕ , this shows that Proof. First, define ℎ ∶ ℝ → ℝ such that ℎ( ) = (0) + + [∇ ( ) ] (0) for all . Every eigenvalue of [∇ ](0) has a magnitude of at most 1 , so ℎ is invertible. Next, define ⋆ ∶ ℝ → ℝ such that ⋆ ( ) = ℎ −1 ( + ( )) − for all . Clearly, ⋆ (0) = 0, and ⋆ (0) = 0 for all and . Furthermore, for any given it must be the case that max , , ′ | Now, let⋆ be a probability distribution such that ⋆ is a ( , )-blurring of⋆. Next, letˆbe the probability distribution of ℎ( ) when is drawn from⋆. Also, let = ( + [∇ ] (0))( + [∇ ](0)). 
The fact that || ⋆ −⋆ *  (0, )|| 1 ≤ 2 implies that For any ∈ ℝ , it must be the case that That in turn means that Any blurring is approximately equal to a linear combination of Gaussian distributions, so this should imply a similar result for drawn from a ( , ) blurring. However, we are likely to use functions that have derivatives that are large in some places. Not all of the Gaussian distributions that the blurring combines will necessarily have centers that are far enough from the high derivative regions. As such, we need to add an assumption that the centers of the distributions are in regions where the derivatives are small. We formalize the concept of being in a region where the derivatives are small as follows.
Then, the desired conclusion follows by the previous lemma. □ This lemma could be relatively easily used to prove that if we draw from a ( , )-blurring instead of drawing it from  (0, ) and is stable at with high probability then the probability distribution of + ( ) will be a ( ′ , ′ )-blurring for ′ ≈ and ′ ≈ . However, that is not quite what we will need. The issue is that we are going to repeatedly apply a transformation along these lines to a variable. If all we know is that its probability distribution is a ( ( ) , ( ) )-blurring in each step, then we potentially have a probability of ( ) each time step that it behaves badly in that step. That is consistent with there being a probability of ∑ ( ) that it behaves badly eventually, which is too high. In order to avoid this, we will think of these blurrings as approximations of a ( , 0) blurring. Then, we will need to show that if is good in the sense of being present in the idealized form of the blurring then + ( ) will also be good. In order to do that, we will need the following definition.
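The convolution fact behind Lemma 5.6 (adding independent Gaussian noise to a blurring only increases its Gaussian part) can be sanity-checked on a toy mixture; the centers and variances here are arbitrary illustration values:

```python
import random
import statistics

random.seed(1)

# Toy "blurring": a mixture of Gaussians with a common variance,
# i.e. some center distribution convolved with N(0, var).
centers = [-2.0, 0.5, 3.0]

def sample_blurring(var):
    mu = random.choice(centers)
    return random.gauss(mu, var ** 0.5)

s, s_extra = 0.09, 0.16

# Convolving an (s, .)-blurring with N(0, s_extra) gives an
# (s + s_extra, .)-blurring: independent Gaussian variances add.
xs = [sample_blurring(s) + random.gauss(0.0, s_extra ** 0.5)
      for _ in range(50000)]
ys = [sample_blurring(s + s_extra) for _ in range(50000)]
print(round(statistics.pvariance(xs), 1),
      round(statistics.pvariance(ys), 1))   # both ≈ 4.4
```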

Means and Gaussian distributions
Our plan now is to consider a version of NoisyStochasticGradientDescent in which the edge weights get revised after each step and then to show that under suitable assumptions when this algorithm is executed none of the revisions actually change the values of any of the edge weights. Then, we will show that whether the samples are generated randomly or by a parity function has minimal effect on the probability distribution of the edge weights after each step, allowing us to revise the edge weights in both cases to the same probability distribution. That will allow us to prove that the probability distribution of the final edge weights is nearly independent of which probability distribution the samples are drawn from. The next step towards doing that is to show that if we run NoisySGDStep on a neural network with edge weights drawn from a linear combination of Gaussian distributions, the probability distribution of the resulting graph is essentially independent of what parity function we used to generate the sample. In order to do that, we are going to need some more results on the difficulty of distinguishing an unknown parity function from a random function. First of all, recall that corollary 5.3 says that We can apply this to probability distributions to get the following.
Theorem 5.13. Let > 0, and for each ∈ +1 , let be a probability distribution on ℝ with probability density function . Now, randomly select ∈ +1 and ∈ uniformly and independently. Next, draw from and ′ from ( , ( )) for each ⊆ [ ]. Let ⋆ be the probability distribution of and ⋆ be the probability distribution of ′ for each . Then be the probability density function of ⋆ , and for each ⊆ [ ], let ⋆ = 2 − ∑ ∈ ( , ( )) be the probability density function of ⋆ . For any ∈ ℝ , we have that ∑ In particular, if these probability distributions are the result of applying a well-behaved distortion function to a Gaussian distribution, we have the following.
Theorem 5.14. Let , 0 , 1 > 0, and and be positive integers with < 1∕ 1 . Also, for every ∈ +1 , let ( ) ∶ ℝ → ℝ be a function such that | ( ) ( )| ≤ 0 for all and and | ( ) ( )| ≤ 1 for all , , and . Now, randomly select ∈ +1 and ∈ uniformly and independently. Next, draw 0 from  (0, ), set = 0 + ( ) ( 0 ) and ′ = 0 + ( , ( )) ( 0 ) for each ⊆ [ ]. Let ⋆ be the probability distribution of and ⋆ be the probability distribution of ′ for each . Then Proof. First, note that the bound on | By the previous theorem, that implies that The problem with this result is that it requires to have values and derivatives that are bounded everywhere, and the functions that we will encounter in practice will not necessarily have that property. We can reasonably require that our functions have bounded values and derivatives in the regions we are likely to evaluate them on, but not in the entire space. Our solution will be to replace the functions with new functions that agree with them on the small regions where we are likely to evaluate them, and that obey the desired bounds everywhere. The fact that we can do so is established by the following theorem.
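The replacement idea can be illustrated in one dimension: multiply the original function by a smooth cutoff that is 1 on the region of interest and 0 far away, so the new function agrees with the old one where it matters and is bounded everywhere. This is a hypothetical illustration, not the construction used in the proof:

```python
import math

r, R = 1.0, 2.0   # inner region where f is kept; outer cutoff radius

def bump(x):
    """Smooth cutoff: 1 on [-r, r], 0 outside [-R, R]."""
    a = abs(x)
    if a <= r:
        return 1.0
    if a >= R:
        return 0.0
    # C^1 smoothstep on the transition band (a C-infinity bump
    # would be used to also control higher derivatives)
    t = (R - a) / (R - r)
    return t * t * (3.0 - 2.0 * t)

def f(x):
    return math.exp(x)   # unbounded, with unbounded derivatives

def f_trunc(x):
    # Agrees with f on [-r, r]; bounded by max over [-R, R] of |f|.
    return f(x) * bump(x)

print(f_trunc(0.5) == f(0.5))   # True: unchanged in the inner region
print(f_trunc(3.0))             # 0.0: vanishes far away
```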
for each 0 < ≤ , if [ ] is quasistable at˜( −1) , set In order to analyze the behavior of these variables, we will need to make a series of observations. First, note that for every , , and the probability distribution of˜( −1)′ given that = and ( −1) = is  ( , ). Also, either [ ] is quasistable at or 0 is quasistable at . Either way, the probability distribution of˜( ) under these circumstances must be a ( , )-blurring by Lemma 5.10 and Lemma 5.6. That in turn means that ( ) is a ( , )-blurring for all and , and thus that ( ) ⋆ must be a ( , )-blurring ofˆ( ) . Furthermore, by the previous corollary, If [ ] were quasistable at ( −1) , that is exactly the formula that would be used to calculate˜( ) , so [ ] must be quasiunstable at ( −1) . That in turn requires that either [ ] is ( , 1 , 2 )-unstable at˜( −1)′ = ( −1) , || [ ] ( ( −1) )|| ∞ > 0 , or ||˜( −1)′ − ( −1) || 1 > . With probability at least 1 − , neither of the first two scenarios occur for any , while for any given Let (ℎ, ) be a neural net with inputs and edges, be with its edge weights changed to the elements of , be a loss function, and > 0. Next, define such that 0 < ≤ ∕80 2 , and let be a positive integer. Then, let ⋆ be the uniform distribution on +1 , and for each ⊆ [ ], let be the probability distribution of ( , ( )) when is chosen randomly from . Next, let  be a probability distribution on +1 that is chosen by means of the following procedure. First, with probability 1∕2, set  = ⋆. Otherwise, select a random ⊆ [ ] and set  = .
Next, set = ( 40 ) 2 ∕2 . Now, let ′ be with each of its edge weights perturbed by an independently generated variable drawn from  (0, ) and run ℎ (ℎ, ′ ,  , , , ∞,  (0, [2 − 2 2 2 ] ), ). Let be the probability that there exists 0 ≤ < such that there exists a perturbation ′ of with no edge weight changed by more than 160 2 ∕ such that one of the first three derivatives of ( (ℎ, ) ( ) − ) with respect to the edge weights has magnitude greater than . Finally, let be the probability distribution of the final edge weights given that  = ⋆ and ′ be the probability distribution of the final edge weights given that  = . Then