Approximation error of Fourier neural networks

The paper investigates the approximation error of two-layer feedforward Fourier Neural Networks (FNNs). Such networks are motivated by the approximation properties of Fourier series. Several implementations of FNNs have been proposed since the 1980s: by Gallant and White, Silvescu, Tan, Zuo and Cai, and Liu. The main focus of our work is Silvescu's FNN, because its activation function does not fit into the category of networks where a linearly transformed input is passed to the activation; networks of the latter type were extensively described by Hornik. For the non-trivial Silvescu FNN, the convergence rate is proven to be of order O(1/n). The paper then investigates the classes of functions approximated by the Silvescu FNN, which turn out to include the Schwartz space and the space of positive definite functions.


INTRODUCTION
Artificial neural networks have been widely used in machine learning and acquired their popularity during the 1990s. A second burst occurred in recent years due to the rapid increase of processable data ("Big Data") and the availability of powerful computing machinery. Modern trends in deep neural networks have helped to achieve superior results in pattern recognition, text processing, and other fields of machine learning. However, this paper is focused on "shallow" 2-layer neural nets, the classic case of the multilayer perceptron. Specifically, the study assesses the convergence rate of a two-layer neural network whose activations are composed of products of cosine functions with different frequencies. The architecture was proposed by Silvescu in 1999 [1]. Since this idea is inspired by Fourier series and the Fourier transform, such networks are referred to as "Fourier Neural Networks" (FNNs).
While most of the theory behind neural networks with standard activation functions is well established, peculiar activations such as the one above leave room for research. For this purpose, we adapt the error-estimation method used in Reference [2]. The paper proceeds with identifying classes of functions for which the approximation error bound can be computed.
Besides theoretical findings, the paper attempts to confirm them with computational results. The experiments include implementation of the Silvescu Fourier Neural Network in Python's TensorFlow library, generating synthetic data according to classes of functions obtained previously, training the model, and assessing the approximation error.

General model
Before discussing FNN implementations, we give some context on feedforward neural nets in general. The name "feedforward" comes from passing the weighted input vector forward, to the hidden layer, where it undergoes a non-linear transformation called "activation". This helps to capture non-linear behavior of the input data. The obtained value is then passed further, to the output layer. Mathematically, the general model is the mapping

f_n(x) = ∑_{k=1}^{n} v_k σ(w_k ⋅ x + b_k),

where x ∈ R^d represents a multidimensional input, w_k ∈ R^d are weight vectors, b_k ∈ R are biases, σ is called an "activation function", and v_k ∈ R are output weights. Such a model falls into the paradigm of a "weighted sum of input characteristics". The standard choices for the activation function are "sigmoidal" functions (sometimes referred to as "squashers"), which have a bounded output. For example, the logistic activation function has the expression

σ(z) = 1 ∕ (1 + e^{−z}),

where z ∈ R and the output lies in the interval (0, 1). The hyperbolic tangent function is given by

σ(z) = (e^{z} − e^{−z}) ∕ (e^{z} + e^{−z}).

In general, most of them have a shape similar to the Heaviside step function. The values of these activations are stored in so-called "hidden neurons", a transitional state between input and output. Several such neurons can be placed in parallel in a feedforward network, composing a hidden layer. The anatomy of the network is illustrated in Figure 1.
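As an illustration, the general model above can be sketched in a few lines of NumPy (a minimal sketch; the function names and array shapes are ours, not from the paper):

```python
import numpy as np

def logistic(z):
    """Logistic sigmoid: maps R into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_net(x, W, b, v):
    """f(x) = sum_k v_k * sigma(w_k . x + b_k).

    x: (d,) input, W: (n, d) hidden weights,
    b: (n,) biases, v: (n,) output weights.
    """
    hidden = logistic(W @ x + b)   # (n,) hidden-neuron activations
    return float(v @ hidden)
```

With all hidden weights and biases at zero, every hidden neuron outputs σ(0) = 0.5, so the network returns half the sum of the output weights.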
Having sigmoidal activation and a sufficient number of neurons, a two-layer neural network can approximate almost any continuous function [3,4]. Studies show that the accuracy of approximation increases with the number of hidden layers [5] and with the number of hidden neurons in general [4].

Existing implementations
Fourier Neural Networks were inspired in some way by Fourier Series and Fourier Transform. For this reason, activations used in these nets contain cosine transformations. Several implementations with different activation functions have been proposed starting from the late 1980s.

Gallant and White FNN
One of the earliest examples is the network of Gallant and White [6]. The suggested model uses a "cosine squasher" instead of a standard sigmoid activation. The function is monotone and bounded.
The main characteristic of the activation is that it cuts off frequencies higher than π∕2. The network then represents the general two-layer mapping with the cosine squasher as its activation, n being the number of hidden-layer units. According to Gallant and White, the network possesses the approximation properties of Fourier series. That is, if a periodic function to be approximated is in L², the network converges in the L² sense as n grows; and if it is continuous, the network converges uniformly.
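For reference, the cosine squasher is commonly given piecewise as 0 below −π∕2, (cos(z + 3π∕2) + 1)∕2 on [−π∕2, π∕2], and 1 above π∕2; the sketch below encodes this form (our reconstruction — consult [6] for the exact definition):

```python
import numpy as np

def cosine_squasher(z):
    """Gallant-White-style cosine squasher: monotone, bounded in [0, 1],
    and saturating exactly at its bounds outside [-pi/2, pi/2]."""
    z = np.asarray(z, dtype=float)
    core = 0.5 * (np.cos(z + 1.5 * np.pi) + 1.0)  # equals (sin z + 1)/2
    return np.where(z <= -np.pi / 2, 0.0,
                    np.where(z >= np.pi / 2, 1.0, core))
```

Unlike the logistic function, which only approaches 0 and 1 asymptotically, this squasher reaches both bounds at finite arguments.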

Tan, Zuo and Cai, Liu FNN
Another implementation was proposed by several authors: by Tan [7], and by Zuo and Cai [8]. In general, their networks can be written as a sum of paired sine and cosine terms; note that the same weights and biases appear in both the sin and the cos. A slightly different approach is to set the parameters of sin and cos independently of each other; this was suggested by Liu [9]. These two implementations resemble Fourier series, with the exception that the coefficients are not fixed, as is the case with Fourier partial sums, but are trained on data.

FIGURE 1 Feedforward neural network with one hidden layer and one-dimensional output

Silvescu FNN
Silvescu proposed a different implementation of an FNN [1]. It has the most exotic activation function, resulting from a non-standard way of discretizing the problem. The activation is as follows:

cosSilv(x; w, b) = ∏_{i=1}^{d} cos(w_i x_i + b_i),

where d is the dimension of the input. The network formula is then

f_n(x) = ∑_{k=1}^{n} c_k ∏_{i=1}^{d} cos(w_{ki} x_i + b_{ki}).

Instead of a nonlinear transformation of a weighted sum of inputs, here we take a product of nonlinear filters, one for each weighted input component. This makes Silvescu's implementation the most interesting and challenging to analyze among existing FNNs. Therefore, the paper focuses on the convergence properties of this particular FNN.
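A direct NumPy transcription of the activation and network formulas above (function names are ours):

```python
import numpy as np

def silvescu_activation(x, w, b):
    """One hidden neuron: prod_i cos(w_i * x_i + b_i)."""
    return float(np.prod(np.cos(w * x + b)))

def silvescu_net(x, W, B, c):
    """f(x) = sum_k c_k * prod_i cos(W[k,i] * x_i + B[k,i]).

    x: (d,) input; W, B: (n, d) frequencies and phases; c: (n,) weights.
    """
    return float(c @ np.prod(np.cos(W * x + B), axis=1))
```

For d = 1 and a single neuron, the network reduces to an ordinary weighted cosine c₁ cos(w x + b).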

Discussion
Despite there being several FNN implementations, this study is concerned with Silvescu's FNN: it does not fit into the standard way of constructing two-layer feedforward neural networks, and it appears to have the least trivial proof of a convergence rate. As one may have noticed, the remaining FNNs (including standard neural nets) can be rewritten in the form ∑_k v_k φ(w_k ⋅ x + b_k). All such neural networks were studied by Hornik [10] in 1989, and it has been shown that these networks converge at a rate of O(1/n) as the hidden layer size n increases. Silvescu's activation function does not take an affine transformation of the input w_k ⋅ x + b_k as its argument. Instead, it is nonlinear in each spatial component of the input: Silv(x) = ∏_i cos(w_{ki} x_i + b_{ki}). Since such networks fall outside the paradigm of classic neural nets, little is known about their convergence properties. Our research therefore attempts to fill some of the gaps in neural network approximation theory. In particular, we show that for a smooth enough function f there exists a network f_n of the Silvescu form whose error is bounded as

∫_B (f(x) − f_n(x))² μ(dx) ≤ (4^d − 1) C² ∕ n,

where B ⊂ R^d is a bounded set, C is a constant, and μ is a probability measure. That is, the order of convergence for the Silvescu FNN is O(1/n), given a fixed dimension d and a bound C on the Fourier transform. Convergence rates for all mentioned FNN implementations are also expected to be of order O(1/n). A key method for deriving the convergence rate is the one used by Barron in his investigation of sigmoidal activation functions [2]. The Gallant and White FNN falls under the scope of Barron's work. That study shows that for functions with an existing Fourier integral f(x) = ∫ e^{iw⋅x} f̂(w)dw and a finite first moment ∫ |w||f̂(w)|dw < C, two-layer neural networks with sigmoidal activation have an error of O(1/n). Indeed, the way the Gallant-White squasher is defined lets us treat it as a sigmoidal function, in which case Barron's results apply directly.
In the course of our paper, we utilize the main idea of Barron's error estimation and derive our own proof for Silvescu FNN. The main difference between these two results is in assumptions on the moments of Fourier transform of a function. While Barron sets the first moment ∫ |w||f (w)|dw < C, we choose functions with finite zeroth moment ∫ |f (w)|dw < C. This is, in fact, equivalent to demanding Fourier integral to be in L 1 (R d ).
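The gap between the two assumptions can be seen numerically. As an illustrative example (ours, not from the paper), a one-dimensional magnitude profile decaying like 1/(1 + w²) has a finite zeroth moment but a divergent first moment:

```python
import numpy as np

def moment(k, cutoff, num=200_001):
    """Riemann-sum estimate of int |w|^k * g(w) dw over [-cutoff, cutoff]
    for the illustrative magnitude profile g(w) = 1 / (1 + w^2)."""
    w = np.linspace(-cutoff, cutoff, num)
    vals = np.abs(w) ** k / (1.0 + w ** 2)
    return float(np.sum(vals) * (w[1] - w[0]))

# Zeroth moment converges (to pi); the first moment grows like 2*log(cutoff).
```

So such a function would satisfy our zeroth-moment condition but not Barron's first-moment condition.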
Such an assumption was also used by Jones [11] to show a convergence rate of order O(1/n) for networks with an "affine transformation of an input" mentioned above. The FNNs used in the works of Tan, Zuo and Cai, and Liu come from this category. Proper modifications to Barron's proof would also confirm the results of Jones and Hornik [10,11]. The consequent analysis of the assumption helps to identify suitable classes of functions for which the error bound can be computed. In experiments, the error levels off for large hidden layers; however, for smaller numbers of neurons, the dependence of the error on the hidden layer size seems to experimentally confirm the convergence rate of O(1/n).

Proposition
Let us consider functions on R^d whose Fourier representation has the form

f(x) = ∫ cos(w ⋅ x + θ(w)) F(dw),

where F(dw) is the magnitude distribution and θ(w) is the phase. Define the class Γ of functions which can be represented in the form above for some F(dw) with finite ∫ F(dw). Then, for each C > 0, we can define the classes

Γ_C = { f ∈ Γ : ∫ F(dw) ≤ C }.

Now consider a bounded set B ⊂ R^d, and define Γ_B as the class of functions for which the representation in Equation (13) holds for x ∈ B for some F(dw) with finite magnitude measure. Finally, for each C > 0, let

Γ_{B,C} = { f ∈ Γ_B : ∫ F(dw) ≤ C }.

For every function f ∈ Γ_{B,C} and any probability measure μ, there exists f_n(x), n ≥ 1, of the form

f_n(x) = ∑_{k=1}^{n} c_k ∏_{i=1}^{d} cos(w_{ki} x_i + b_{ki}),

such that

∫_B (f(x) − f_n(x))² μ(dx) ≤ (4^d − 1) C² ∕ n.

Proof
The proof of Theorem 1 uses several lemmas, which we specify below.
Lemma 1. If f is in the closure of the convex hull of a set G in a Hilbert space, with ∥g∥ ≤ b for each g ∈ G, then for every n ≥ 1 and every c′ > b² − ∥f∥², there exists f_n in the convex hull of n points of G such that ∥f − f_n∥² ≤ c′ ∕ n. The proof of this lemma can be found in Barron's paper [2].

Lemma 2.
Every function f ∈ Γ_{C,B} is in the closure of the convex hull of the set

G_cos = { γ cos(w ⋅ x + b) : w ∈ R^d, b ∈ R, |γ| ≤ C },

where the closure is taken in L²(μ, B).

Proof Since the function f we are considering is real-valued, from Equation (13) we deduce that, for x ∈ B,

f(x) = ∫ C′ cos(w ⋅ x + θ(w)) Λ(dw),

where C′ = ∫ F(dw) ≤ C and Λ(dw) = F(dw)∕C′ is a probability distribution. Most importantly, this represents f(x) as an infinite convex combination of functions g(x, w) = C′ cos(w ⋅ x + θ(w)) from the set G_cos. To show this, let w_1, w_2, …, w_n ∈ R^d be drawn independently from the distribution Λ, and consider g(x, w_1), g(x, w_2), …, g(x, w_n). The sample mean f_n(x) = (1∕n) ∑_k g(x, w_k) is an unbiased estimate of f(x), and at the same time it is a convex combination of functions from G_cos. We obtain the desired result by applying Fubini's theorem to the expected value of the squared L²(μ, B) norm.
The expected squared L²(μ, B) norm above is at most C²∕n, so the norm converges to zero in mean. Taking into account that the norm is non-negative, there is a sequence w_1, …, w_n for which the norm converges to zero. Thus, there exists a convex combination f_n(x; w_1, …, w_n) of functions from G_cos that converges to f in L²(μ, B). Therefore, f is in the closure of the convex hull of G_cos in L²(μ, B).

Lemma 3.
For every dimension d, G_cos ⊂ co G_Silv, where co denotes the convex hull and G_Silv = { γ ∏_{i=1}^{d} cos(a_i x_i + b_i) : a, b ∈ R^d, |γ| ≤ (2^d − 1)C }.

Proof The proof is by induction. First, we consider the base case d = 1. Each function g ∈ G_cos is then a single-variable function g(x) = γ cos(w_1 x_1 + b), which is itself a (trivial) convex combination of 2¹ − 1 = 1 functions in the class G_Silv with a_{kj} = w_1. Thus, when the dimension is d = 1, G_cos ⊂ co G_Silv.
Next, let us prove the general case for d = m + 1.
By the induction hypothesis, every g ∈ G_cos on R^m can be written as a convex combination of products of cosines ∏_j cos(a_{kj} x_j + b_{kj}), where x, w ∈ R^m and a_{kj}, b, b_{kj} ∈ R. Now consider g̃(x̃) ∈ G_cos, where x̃ ∈ R^{m+1} and g̃ is a function on R^{m+1}; correspondingly, w̃ ∈ R^{m+1}. We can write x̃ = (x_1, x_2, …, x_m, x_{m+1}), where the first m coordinates are those of x. Define ŵ_m x_m ≔ w_m x_m + w_{m+1} x_{m+1}, so that the argument of g̃ takes an m-dimensional form, and apply the result for the m-dimensional case to this identity. Expanding the extra cosine via the product-to-sum identity, the resulting function g̃ is a convex combination of 2^m functions from G_Silv. Then G_cos ⊂ co G_Silv when d = m + 1, and therefore, by induction, for any dimension d. Note that expanding the cosine results in two cosine products with the same frequencies. This suggests that almost every a_{kj} is repeated twice along the k-index.
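The expansion used in the induction step can be checked numerically in d = 2: by cos(u + v) = cos u cos v − sin u sin v and sin t = cos(t − π∕2), a G_cos element equals a signed combination of two products of cosines (our sketch of the identity, not the paper's code):

```python
import numpy as np

def gcos(x, w, b):
    """Element of G_cos in d = 2: cos(w . x + b)."""
    return np.cos(w[0] * x[0] + w[1] * x[1] + b)

def as_silvescu_combo(x, w, b):
    """Rewrite cos(u + v) with u = w1*x1 + b (phase absorbed into the
    first factor) and v = w2*x2, using sin t = cos(t - pi/2), so each
    term is a product of cosines, that is, a G_Silv shape."""
    u = w[0] * x[0] + b
    v = w[1] * x[1]
    return (np.cos(u) * np.cos(v)
            - np.cos(u - np.pi / 2) * np.cos(v - np.pi / 2))
```

Both cosine products share the same frequencies w_1, w_2 and differ only in their phases, which is exactly the duplication of the a_{kj} noted above.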

Lemma 4.
Every function f ∈ Γ_{C,B} is in the closure of the convex hull of G_Silv, where the closure is taken in L²(μ, B).
Proof We will use Lemma 2 and Lemma 3 to show that f is in the closure of the convex hull of G_Silv. From Lemma 2 it follows that f ∈ Γ_{C,B} ⊂ c̄o G_cos, while from Lemma 3 we obtain G_cos ⊂ co G_Silv. The convex hull of G_cos is then also a subset of the convex hull of G_Silv, that is, co G_cos ⊂ co G_Silv. By taking closures, one can show that c̄o G_cos ⊂ c̄o G_Silv also holds. Thus, f ∈ c̄o G_Silv.
Functions in this class are measurable and bounded by b = (2^d − 1)C. Then, by Lemma 1, there exists f_n in the convex hull of n points of G_Silv such that ∥f − f_n∥² ≤ c′ ∕ n for every n ≥ 1 and every c′ > b² − ∥f∥², where f_n = ∑_k t_k g_k with g_k ∈ G_Silv, t_k > 0 and ∑ t_k = 1. We can therefore restrict the coefficients in our network to satisfy the bound ∑ |c_k| ≤ (2^d − 1)C.

Functions with absolutely integrable Fourier transform
The goal is to identify functions in the class Γ_C (or Γ_{C,B}), that is, functions for which ∫ F(dw) < C holds. This condition is equivalent to demanding that the Fourier transform be in L¹, that is, ∫ |F(dw)| < C (or, alternatively, ∫ |f̂(w)|dw < C). We also give formulas for computing the constant C, which we use later to compare the theoretical error bound with experimental findings.
There are two main classes of functions which satisfy the mentioned conditions on the transform:

1. Positive definite functions
2. Schwartz functions
First, we consider positive definite functions. By definition, these are very similar to positive definite kernels, which are important tools in statistical learning theory. Let f be a positive definite function, which requires ∑_{i,j} c_i c_j f(x_i − x_j) to be non-negative for any choice of points x_i and coefficients c_i [12]. Alternatively, the matrix A with elements a_{ij} = f(x_i − x_j) has to be positive semi-definite [2]. Then, if the function is continuous on R^d, by Bochner's theorem [13, p. 144] it can be represented as f(x) = ∫ e^{iw⋅x} F(dw), where F(dw) is a positive measure and thus coincides with |F(dw)|. Then the constant in the error bound can be computed as

C = ∫ F(dw) = f(0).

We can observe that for positive definite functions, the L¹-norm of the Fourier transform is determined by the behavior of f at the origin. That is, one does not even need to know the transform in order to compute the constant C.
There exists another class of functions with absolutely integrable Fourier transform: the Schwartz space 𝒮(R^d). This is the space of rapidly decreasing functions on R^d (functions whose derivatives decrease faster than the inverse of any polynomial). The Schwartz space is widely used in Fourier analysis, with applications in the field of partial differential equations. The definition is given as follows [14, p. 134]:

𝒮(R^d) = { f ∈ C^∞(R^d) : sup_{x∈R^d} |x^α D^β f(x)| < ∞ for all multi-indices α, β },

where α, β are multi-indices and C^∞(R^d) is the space of infinitely differentiable functions. It is known that the Fourier transform maps the Schwartz space onto itself, that is, if f ∈ 𝒮(R^d), then f̂ ∈ 𝒮(R^d) [15, p. 331]. Since we also know that 𝒮(R^d) ⊂ L^p(R^d) [16], the transform then satisfies f̂ ∈ L¹(R^d).

Properties of functions in Γ_C
Of course, functions obtained from the Schwartz space by translation (shift), dilation (scaling), summation, and consequently by linear combination are also in the Schwartz space; this can be seen directly from the definition. By Reference [14, p. 142], products and convolutions of Schwartz functions are also in the same space. The same holds for some of these operations applied to functions from Γ_C. To show this, let f be in the class Γ_C. Functions obtained by applying the following operations are also in Γ (specifically in Γ_{C′}, where C′ is a different constant).
(b) Dilation (time scaling). Let g(x) = f(ax) for a > 0. If we use the substitution x = y∕a, we obtain the transform ĝ(w) = (1∕2π)^d ∫ e^{−i(w∕a)⋅y} f(y) a^{−d} dy = f̂(w∕a)∕a^d. Now, by the change of variables u = w∕a, we get ∫ |ĝ(w)|dw = ∫ |f̂(w∕a)∕a^d| dw = ∫ |f̂(u)|du < C. Thus, the constant is invariant to time scaling, and f(ax) ∈ Γ_C. (c) Linear transformation (of the time domain). Let f ∈ Γ_C, let A be an invertible matrix, and consider the function g(x) = f(Ax). Then its Fourier transform is given by ĝ(w) = f̂((A^T)^{−1} w)∕|det A|. However, the determinant term disappears during the change of variables when we bound the L¹-norm of ĝ: the Jacobian of the substitution cancels with the denominator, so g ∈ Γ_C.
(e) Linear combination. If f_i ∈ Γ_{C_i} and g = ∑_i f_i, then by the linearity of the Fourier transform and the previous summation property, ĝ(w) = ∑_i f̂_i(w). Consequently, ∫ |ĝ(w)|dw ≤ ∑_i C_i, and g ∈ Γ_{∑_i C_i}. (f) Product. If f_1 ∈ Γ_{C_1} and f_2 ∈ Γ_{C_2}, then the transform of g = f_1 f_2 is the convolution ĝ = f̂_1 ∗ f̂_2, whose L¹-norm is bounded by the product of the L¹-norms: ∫ |ĝ(w)|dw < C_1 C_2. Then, g(x) ∈ Γ_{C_1 C_2}. (g) Composition with polynomials. We will use the results for the linear combination and the product mentioned above. Consider f ∈ Γ_C; then by the product property, f^k ∈ Γ_{C^k}. Now, let g(z) be a univariate polynomial, so that g(f(x)) = ∑_{k=1}^{n} α_k (f(x))^k. By the linear combination property, we obtain g(f(x)) ∈ Γ_{∑_{k=1}^{n} |α_k| C^k}. (h) Composition with analytic functions. Let f(x) ∈ Γ_C and let g(z) be an analytic function g(z) = ∑_{k=0}^{∞} α_k z^k whose radius of absolute convergence is r > C. By a similar approach as with polynomials, we get g(f(x)) ∈ Γ_{∑_{k=0}^{∞} |α_k| C^k}. Here, the series in the subscript of Γ is absolutely convergent since C < r. (i) Ridge functions. Let g(x) = f(a ⋅ x) for some a ∈ R^d with |a| = 1, where f(z) is a univariate function from Γ_C. The transform of g can be viewed as a Fourier transform concentrated on the set of points w ∈ R^d lying on the line along the direction a. Then, the constant can be taken as C_g = C_f = ∫ |f̂(t)|dt.
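The product property, for instance, can be checked numerically in one dimension with a brute-force discretized Fourier transform (our sketch, using the convention f̂(w) = (1∕2π)∫e^{−iwx}f(x)dx; the grid sizes are arbitrary):

```python
import numpy as np

def l1_fourier(f, x_max=10.0, w_max=10.0, n=1001):
    """Numerically estimate C = int |f_hat(w)| dw in d = 1, with the
    convention f_hat(w) = (1/2pi) int exp(-i w x) f(x) dx."""
    x = np.linspace(-x_max, x_max, n)
    w = np.linspace(-w_max, w_max, n)
    dx, dw = x[1] - x[0], w[1] - w[0]
    # (n_w, n_x) phase matrix; fine for this small demo grid
    fhat = (np.exp(-1j * np.outer(w, x)) @ f(x)) * dx / (2 * np.pi)
    return float(np.sum(np.abs(fhat)) * dw)
```

For the Gaussian f(x) = e^{−x²∕2} this gives C_f ≈ 1, and for the pointwise product f·f = e^{−x²} the estimated constant stays below C_f², as the product property predicts.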
Note that sigmoidal functions of the form f(x) = σ(a ⋅ x + b) do not have an integrable Fourier transform.

Examples of functions in Γ_C
Let us now look at some examples of functions mentioned above.

Gaussian function
The simplest example is the Gaussian function f(x) = e^{−∥x∥²∕2}. It is in the class of both positive definite functions and Schwartz functions, and its Fourier transform is f̂(w) = (2π)^{−d∕2} e^{−∥w∥²∕2}. Since the Fourier transform is a real and positive function of w, C = ∫ f̂(w)dw = f(0) = 1. Alternatively, we can use the fact that the function is positive definite (although proving positive definiteness directly can be hard in practice): evaluating the Gaussian at the origin gives C = f(0) = 1. Shifted versions of the Gaussian are also in the class Γ_C, according to the translation property. Therefore, we include functions of the form f(x) = e^{−∥x−μ∥²∕2}, where μ ∈ R^d is a shift.
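A quick numerical check of C = f(0) = 1 for the one-dimensional Gaussian, using its closed-form transform (our sketch):

```python
import numpy as np

# d = 1 Gaussian: f(x) = exp(-x^2/2) has f_hat(w) = exp(-w^2/2)/sqrt(2*pi),
# which is real and positive, so C = int f_hat(w) dw should equal f(0) = 1.
w = np.linspace(-12.0, 12.0, 100_001)
fhat = np.exp(-w ** 2 / 2) / np.sqrt(2 * np.pi)
C = float(np.sum(fhat) * (w[1] - w[0]))  # Riemann-sum quadrature
```

The truncation to [−12, 12] loses only a tail of order e^{−72}, so the quadrature reproduces C = 1 to high accuracy.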

Multivariate Gaussian distributions
These functions arise in almost every field of probability and statistics. Multivariate Gaussians are also in the Schwartz space: this can be shown by a change of variables (first a shift, then a linear transformation). However, we now want to find the constant C for this function using our observations above. The multivariate Gaussian is given by the formula [19, p. 197]

f(x) = (2π)^{−d∕2} (det Σ)^{−1∕2} exp(−(x − μ)^T Σ^{−1} (x − μ)∕2),

where Σ is a symmetric positive definite matrix (the covariance matrix) and μ is the mean of the distribution, acting as a shift in x. First, we can define a matrix A = Σ^{−1∕2} = Q^T D^{−1∕2} Q, so that A^T A = Σ^{−1}; the matrix Q is orthogonal, and A is symmetric, positive definite, and invertible. The function is now of the form f(x) = (2π)^{−d∕2}(det Σ)^{−1∕2} g(A(x − μ)), where g(y) = e^{−∥y∥²∕2}, for which ĝ(u) = (2π)^{−d∕2} e^{−∥u∥²∕2}. Using the result for a linear transformation of the time domain and the translation property, the L¹-norm of f̂ is bounded by the peak value of the density, and the constant can then be taken as C = (2π)^{−d∕2}(det Σ)^{−1∕2} = f(μ).
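Under this reading of the formula, the constant is simply the peak value of the density, which is easy to compute (a sketch; the function names are ours):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate Gaussian density at x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return float(norm * np.exp(-0.5 * quad))

def gaussian_constant(Sigma):
    """C for a multivariate Gaussian: its peak density value f(mu)."""
    d = Sigma.shape[0]
    return float((2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5))
```

Note that C grows without bound as det Σ → 0, i.e., for very concentrated Gaussians.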

Gaussian mixture models
This is another important class of functions in probability, statistics, and machine learning. A Gaussian mixture can be defined as a linear combination of different multivariate Gaussians [20, p. 425]:

f(x) = ∑_k a_k f_k(x),

where each f_k is a multivariate Gaussian with a different mean μ_k and a different covariance matrix Σ_k, and the a_k are the corresponding mixture weights, for which ∑ a_k = 1 and a_k > 0. Using the linear combination property, we can verify that the constant is C = ∑_k a_k C_k, where C_k is the constant of the k-th component. Many other functions can be derived from the Gaussian using the properties we have obtained. We will use some of them in the next section to assess the approximability of functions in Γ by a Silvescu Fourier Neural Network.
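Combining the two results, the mixture constant can be computed componentwise (our sketch, assuming each C_k is the component's peak density value as derived above):

```python
import numpy as np

def mixture_constant(weights, Sigmas):
    """C = sum_k a_k C_k for a Gaussian mixture, where C_k is the
    constant of the k-th component (its peak density value)."""
    d = Sigmas[0].shape[0]
    return float(sum(a * (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** (-0.5)
                     for a, S in zip(weights, Sigmas)))
```

A single-component "mixture" recovers the constant of that component, as expected.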

Generating data for functions in Γ_C
For the experimental part, we fix the dimension d = 3, because setting the dimension too large makes the error bound enormous. Our goal, however, is to assess whether the convergence is of order O(1/n), where n is the number of hidden neurons. Therefore, we test the functions obtained in the previous section for different values of n, namely n ∈ {1, 4, 16, 64, 256, 1024}. Regarding the classes of functions in Γ_C, the following functions were chosen:

10. Silvescu activation function. It was also interesting to check whether the network can approximate the Silvescu activation function itself adequately.

Software
All parameters, such as constants, coefficients, mean vectors, and covariance matrices, were chosen randomly. The datasets, with training size 10,000 and test size 5,000, were then generated according to the chosen functions, and a small normal error 𝒩(0, 0.03) was added on top. The code was written with the help of Python's NumPy library [21]. The Silvescu FNN itself was implemented and run using the TensorFlow library [22]. The network used the Adam optimizer [23] for solving the minimization problem; Adam is a stochastic gradient descent (SGD) algorithm with momentum and an adaptive learning rate. The momentum speeds up convergence by adding a portion of the previous gradient to the current one. With the momentum and the adaptive learning rate combined, the algorithm is less likely to slow down as the gradient approaches zero near a minimum. The initial learning rate was set to η = 0.01, and the momentum constants were set to the defaults β₁ = 0.9 and β₂ = 0.999.
Since the implementation uses SGD, we needed to divide the dataset into batches. With a batch size of 100 and 10,000 observations in total, we obtained 100 batches. The network training was 50 epochs long (50 passes over the same dataset). After each epoch (100 processed batches), the data was shuffled to avoid repetition. Then, the minimum mean squared error (MSE) across the epochs was chosen. The reason we use the minimum MSE is that our main theorem proves the existence of a network with the indicated MSE.

FIGURE 2 MSEs for different functions
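The paper's experiments use TensorFlow with Adam; as a self-contained illustration of the same model and loss, the sketch below implements the Silvescu network's forward pass and analytic MSE gradients in plain NumPy with ordinary gradient descent (all names and hyperparameters are ours, not the authors' code):

```python
import numpy as np

def silv_forward(X, W, B, c):
    """Batch forward pass of the Silvescu net.
    X: (N, d) inputs; W, B: (n, d) frequencies and phases; c: (n,) weights."""
    cosA = np.cos(X[:, None, :] * W[None, :, :] + B[None, :, :])  # (N, n, d)
    return np.prod(cosA, axis=2) @ c                              # (N,)

def silv_mse_grads(X, y, W, B, c):
    """Analytic gradients of the MSE loss. Uses dP/darg_i = -P * tan(arg_i),
    valid away from zeros of the cosine factors."""
    arg = X[:, None, :] * W[None, :, :] + B[None, :, :]   # (N, n, d)
    P = np.prod(np.cos(arg), axis=2)                      # (N, n)
    r = P @ c - y                                         # residuals (N,)
    N = len(y)
    gc = 2.0 / N * (P.T @ r)
    dP = -P[:, :, None] * np.tan(arg)                     # (N, n, d)
    common = 2.0 / N * r[:, None, None] * c[None, :, None] * dP
    gB = common.sum(axis=0)
    gW = (common * X[:, None, :]).sum(axis=0)
    return gW, gB, gc
```

A few full-batch steps of `W -= lr * gW; B -= lr * gB; c -= lr * gc` already reduce the training MSE on small synthetic datasets; the actual experiments instead rely on TensorFlow's Adam with mini-batches.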

Hardware
The TensorFlow package makes it possible to use Graphics Processing Unit (GPU) acceleration for training neural networks. This requires GPUs that support Nvidia's CUDA Toolkit for accelerated computing. For running the Python code, we used an Nvidia GeForce GTX 1050 Ti. The computations for Silvescu FNNs were on average about 30% slower than for standard feedforward neural nets with a sigmoid activation function. This is due to the larger number of floating-point operations in Silvescu's activation (≈30d flops) compared with the sigmoid activation (≈2d + 25 flops). With increasing dimensionality of the problem, that gap is expected to grow as well.

Results
The results of the experimental part were good for small values of n (the number of neurons); in the interval 16 < n < 1024 the error leveled off, and for larger values the error is expected to lie above the indicated error bound. Although the theoretical results provide an upper bound for the error, there must also be a lower bound, for computational reasons. According to the graphs (see Figure 2), the error decay may be described as linear for n ≤ 16, that is, of order O(1/n), whereas in the right tail of the MSEs it is best described as O(1). This can possibly be explained by the large number of neurons, and consequently the large number of parameters: for the Silvescu NN we have (2d + 1)n = 7n parameters, which can potentially cause over-parametrization during training without regularization. Another reason could be the existence of a lower bound for approximation by the Silvescu FNN. This is also plausible, because the error stops decaying around 10⁻³, which is even smaller than the noise added to our generated functions (σ = 0.03). The small fluctuations in the right tail can then be explained by the Adam optimizer finding different local minima during training, since we redefine the network each time we change the hidden layer size n.

CONCLUSION
Theoretically, we have proven that there exist parameters of the FNN for which the network's error is bounded by a constant divided by the number of hidden neurons. However, the existence of such parameters does not imply that the bound holds for every trained model: it is a best-case scenario, and obtaining such a solution is not guaranteed in practice. The main experimental concern is the algorithm for solving the minimization problem. Additional regularization techniques and constraining the network weights as in Section 3 could help mitigate these challenges. Nevertheless, the importance of this paper lies in obtaining error bounds for networks that fall outside the paradigm of a "weighted sum of activations of a linearly transformed input". All standard models were described by Hornik [10] and Cybenko [3], and Barron [2] provided lower (Kolmogorov n-width) and upper bounds for approximation by sigmoidal neural networks. The Silvescu FNN, on the other hand, has a non-trivial activation, which makes it harder to analyze. The constant term 4^d − 1 in the error bound also appears because of the way the activation is constructed; to counterbalance this term, the number of neurons n has to be exponential in the dimensionality d.
A first possible improvement of this work would be extending the results to linear and polynomial functions. For these two cases, as well as for functions defined on integers and functions on other bounded domains, Barron used suitable spline extrapolations to satisfy his assumption ∫ |w||f̂(w)|dw < C [2]. This would allow us to consider more real-life data of this nature. A lower bound for the approximation could also be derived: Barron obtains his by considering Kolmogorov's n-width, the closest distance between the function and a collection of basis functions. More research in this direction could shed light on the analysis of approximation by linear spans of Silvescu functions.

ACKNOWLEDGMENTS
This work was supported by the Nazarbayev University faculty-development competitive research grants program, Grant Number 240919FD3921, and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan, IRN AP05133700.

DATA AVAILABILITY
The data that support the findings of this study are openly available at https://github.com/zh3nis.