A literature survey of matrix methods for data science

Efficient numerical linear algebra is a core ingredient in many applications across almost all scientific and industrial disciplines. With this survey we want to illustrate that numerical linear algebra has played and is playing a crucial role in enabling and improving data science computations with many new developments being fueled by the availability of data and computing resources.

1. Introduction. The study of extracting information from data has become a main driver of business, engineering, fundamental research and even culture. We here view data science as drawing on elements of machine learning, data mining and many other mathematical fields such as optimization or statistics [76]. Also, we want to point out that in order to obtain information from data it is not necessarily implied that the data are big but often they are.
We here view the data as being represented in a matrix with the dimensions m, n related to the underlying data. Here m is the dimension of the feature space with feature vectors a i ∈ R m . The dimension n is the number of data points and thus possibly very large. We often assume that n > m but this is not necessary. Additionally, the data can also arise in the form of a tensor of order d The multitude of applications where such data occur and need to be studied goes beyond the scope of this paper. Naturally, matrices arise as part of spatial data analysis [25,196,112] or time-series data analysis [2,146,278] but can also be found in many more disciplines [267].
The tasks of extracting meaningful information from the collected data varies between application areas. Traditional categories for learning are so-called unsupervised learning, semi-supervised learning, and supervised learning (cf. [145,129]).
Before starting the detailed discussion, we would like to point out a particular example that is a core problem in data science, statistics, numerical linear algebra and computer science alike. This is the least squares problem [105], where in brief one wants to fit a linear model to known data. Let us assume that we are given n data points a i ∈ R m and typically outputs y i , e.g. a value of either −1 or 1 for a classification problem. We are then interested in finding a weight vector w ∈ R m such that the residual r i = a T i w − y i is minimized. It is well known that this would lead to a linear regression problem (1.3) min w Aw − y 2 2 + λ w 2 2 with A as above. Here, we also introduce λ > 0 as a regularization or ridge parameter for the regularization term w 2 2 . The regularization term could also be measured in several different norms such as the l 1 -norm [261] or the total variation norm [272]. The solution to this problem is then given by (1.4) w * = A A + λI −1 A T y, a prototypical linear system of equations with a symmetric and positive definite matrix as long as λ > 0. This is both a core problem in numerical linear algebra and data science alike. The goal of this survey is to show how information extraction from data, data modeling and pattern finding relies on efficient techniques from numerical linear algebra that not only enable computations but also reveal hidden information. The paper is structured as follows. We first illustrate the use of classical factorizations such as the QR and singular value decomposition (SVD) and then introduce interpretable factorizations such as the non-negative matrix factorization (NMF) or the CUR decomposition. We additionally discuss literature devoted to kernel methods with special attention given to the graph Laplacian. Randomization as well as functions of matrices are discussed next. We close with a brief discussion of recent results for high-dimensional problems and deep learning applications.
As a word of caution, we want to remark that the field of numerical linear algebra is vast and the analysis of data via techniques from numerical linear algebra is not new. The goal of this survey is to point to recent trends and we apologize to the authors whose results we missed while writing this. In particular we want to refer to the beautiful books by Eldén [88] and Strang [253] that provide general introductions to linear algebra for data science applications.
2. Data matrices and factorizations. The decompositional approach to matrix computations has been named one of the top 10 algorithms of the 20th century [75]. We here review some important matrix factorizations, their applications as well as tailored factorizations used for data science. Matrix factorizations are an ubiquitous tool in data science and have received much attention over the last years. A great example is the use of matrix factorization techniques for recommender systems such as the Netflix challenge 1 [161]. We split the discussion into classical factorizations, which have been the workhorse of many applications such as engineering or fluid mechanics, and factorizations that are designed to more closely resemble the nature of the data.
with U ∈ R n,n and V ∈ R m,m being orthogonal matrices. The matrix S ∈ R n,m is of the following form with singular values σ 1 ≥ σ 2 . . . ≥ σ m ≥ 0. An equivalent and often very useful representation is the outer product form of the SVD as where u i and v i are the columns of U and V , respectively. This allows for a natural interpretation of the columns u i of U as a basis for the column space of A and correspondingly the row-space by the columns v i of V. One of the magic facts about the SVD, easily motivated by the outer product form, is that the best rank-k approximation to A is hereby ignoring the smaller singular values in the summation. This is known as the truncated singular value decomposition and is written in matrix form as where the optimality of this approximation is theoretically underpinned by the Eckart-Young-Mirsky theorem [106]. The SVD is one of the most crucial tools in the complexity reduction of large-scale problems. The singular value decomposition is naturally well-suited to solve the above given least squares problem of the form (2.1) This is expensive due to the cost of computing the full SVD but the truncated SVD can be exploited for solving least squares problems as discussed in [127]. There have been many algorithmic updates in the computation of the (truncated) SVD that deal with the numerical difficulties of large-scale problems, rounding errors, locking of wanted singular vectors and purging of unwanted information [12,140,251,133]. At the heart often lies the Lanczos bidiagonalization and for large scale problems incorporating implicit restarts is mandatory (cf. [27,134,12,251]). The importance of the truncated SVD is unquestioned as also well-known techniques in the statistics community such as the NIPALS (Nonlinear Iterative Partial Least Squares) method have been shown to be mathematically equivalent to the Lanczos bidiagonalization [87,27]. To the best of our knowledge, the algorithmic improvements developed for the Lanczos-bidiagonalization have not been exploited within the NIPALS framework. Nevertheless, NIPALS has become a crucial tool in the analysis of problems from economics applications [143,123,95].
The SVD is also a key ingredient in model order reduction 2 [17,46] classically used for reducing the dimensionality of physics-related models based on differential equations. Recently, model order reduction has found more and more applications in machine learning [163,262]. One of the most important applications that directly mirrors the use of the SVD in computational science and engineering is the use of creating reduced representations of data. Here, applications include text mining [3], face recognition [296], medicine [293] and many more.
The SVD also comes in many disguises among the different disciplines of statistics, engineering, and applied mathematics. Given a data matrix A applying principal component analysis (PCA) has been a key tool for understanding the structure of the data. In PCA, the column mean within A is zero and then the right singular vectors v i are called the principal component directions of A and the left singular vectors are the principal components of A. More information and applications are given in [147,148,193,287].
In order to compute the singular value decomposition for data matrices that originate from massive datasets one often has to resort to techniques from high performance computing [8,254,202]. Additionally, it is possible to rely on randomized algorithms that we discuss in Section 4.
The SVD is also a crucial ingredient in many algorithms for high-dimensional data analysis, see Section 6 on tensor factorizations.
The QR factorization. Given a set of vectors collected in the matrix A and considering the span of the columns of A it is clear that the vectors themselves can potentially be a terribly conditioned basis. Hence, one wants to find a well-conditioned basis, ideally a basis with orthogonal vectors. This task is achieved by computing the QR factorization, i.e., A = QR with Q ∈ R n,n an orthogonal matrix and R ∈ R n,m an upper triangular matrix of the form where the matrixR ∈ R m,m is invertible only if the matrix A has full column rank m. From the representation it is clear that one can also work with A =QR whereQ only contains the first m columns of Q. The QR factorization is another ubiquitous factorization in applied mathematics. Its computation is typically rather expensive and for the details we refer to [105]. In particular, the reduction of A to triangular form via so-called Householder reflectors is a de-facto standard. Also, the solution of the above least squares problem via the QR factorization is well-known [105], using no regularization, that Truncated QR decompositions in disguise are also at the heart of the Lanczos [164] and Arnoldi [9] methods, which are the key algorithms of Krylov subspace methods. They are based on computing the columns of a matrix Q spanning the basis of the Krylov matrix.
Additionally, the pivoted QR factorization [249,248] computes the factorization with P a permutation matrix. This QR factorization can be computed maintaining the sparsity of the matrix A (cf. [248]) and it is also closely related to interpretable factorizations, which we discuss next.

Interpretable Factorizations.
Despite the beautiful mathematical properties of both QR and SVD they sometimes do not provide a representation that allows an easy interpretation of the results for practitioners. For example, while the singular vectors are often found to contain negative values this is not found in the original data. Mathematically this means that a given data matrix A ∈ R n,m is well approximated by the truncated SVD A ≈ U k S k V k but the singular vectors contained in U k (:, l), are in general not sparse or non-negative even though this holds for the data. To preserve the properties inherent in the application while maintaining a good approximation of the original data with reduced complexity, many interpretable factorizations have been developed over recent years (cf. [184,280,80,35,167,77,99] for some of them). Figure 1 illustrates how the CUR approximation, which we introduce in this section, naturally represents the original data. We now briefly introduce these methods and comment on applications and new developments. We here follow the notation of [274] and start with the CX decomposition defined by A = CX where C ∈ R n,k and X ∈ R k,m . Here the matrix C is a matrix consisting of k columns of the data matrix A, i.e., C = A(:, J) for an index set 3 J of order k. The important feature of the matrix C is that its columns are interpretable as they are taken from the original data. Since the data matrix is often sparse this is inherited by C and then requires less memory than storing the singular vector matrix U k . The computation of C and X is done via the minimization of min A − CX , where · may be the 2-norm or the Frobenius norm. This problem is known as the column subset selection problem [34] and its NP-completeness is discussed in [239]. Typically, more structure is required than just a general matrix X and one quickly moves to the interpolative decomposition (ID) [274] (2.2) A ≈ CV where C ∈ R n,k is as before but the matrix V ∈ R m,k is constructed such that it contains a k × k identity matrix and that max i,j |v ij | ≤ 1. A procedure to obtain the low-rank ID via a pivoted QR is given in [274] returning a V T = [I k T l ] P T with P a permutation matrix and T l a solution related to the upper triangular factors of the pivoted QR. This approach works analogously when a one-sided representation in terms of matrix rows is desired. One can also obtain a two-sided interpolative decomposition A ≈ W A(I, J)V T . Here, the matrix W and row-indices I are obtained from first considering a one-sided interpolative decomposition A ≈ CV , which gives columns indices J. One can then compute (2.2) for the matrix C T ≈CṼ T whereC contains columns of C T , i.e., rows of C and thus elements of the rows of the original matrix A. We further have W =Ṽ and I, J the corresponding index sets. Note that W and V do not contain matrix columns and rows of the original matrix A, respectively, and hence properties such as sparsity and non-negativity found in the data are not necessarily carried over. As a result, we can consider a factorization that does not only work on the column space but also incorporate the row space of A and bears a lot of similarity to the two-sided ID. This is achieved by the CUR decomposition where C ∈ R n,k , U ∈ R k,k , and R ∈ R k,m . Here, C and R both are columns respectively rows of the original data matrix A and inherit properties such as non-negativity, sparsity, and finally interpretability. The CUR decomposition is also known as the skeleton decomposition [156,98,264,210,109] and is closely related to a rankrevealing QR factorization [274,114]. The CUR decomposition can be computed from the above mentioned two-sided interpolative decomposition [274] The starting point for the computation of the CUR decomposition is often based on another low-rank factorization of the matrix A, such as the truncated SVD or the two-sided ID discussed above. One typically obtains the intersection matrix U as U = A(I, J) −1 where I and J are index vectors corresponding to the selected rows and columns in C respectively R. The other possibility is U = C † AR † , where † indicates the Moore-Penrose inverse. The crucial point is how to select the index sets I and J. For this one typically computes leverage scores as sums over the rows of the matrices U k and V k with these coming from the truncated SVD A ≈ U k S k V k with j = k i=1 U j,i for j = 1, . . . , n. In [184] the authors provide a statistical interpretation of the leverage scores as a probability distribution. More recently, Embree and Sorensen [245] have introduced a procedure based on the discrete empirical interpolation method (DEIM) [46] where the selection of the column and row indices is based on a greedy projection technique resembling a pivoting strategy within the LU factorization. In [69] the authors compute the column and row subset using conditional expectations with a more efficient numerical realization being recently introduced in [60].
Another important interpretable factorization of the matrix A is the so-called non-negative matrix factorization (NMF) [100,88,73,73,21] which is a low-rank approximation using W ∈ R m,k and H ∈ R k,n with componentwise non-negativity written as W ≥ 0, H ≥ 0. The interpretation of the columns of W (:, j) ∈ R m is that they form a basis of order k that best approximates the data points a j , i.e., the columns of A T via The coefficients of how the data are expanded in the basis defined by W are stored in H. Such linear dimension reduction frameworks are found in various data tasks within image processing or text mining. Here again the non-negativity of the elements in W means that the resulting matrix is more sparse than an SVD-based approach and provides an interpretable feature representation [178,176]. In order to compute the NMF one has to note that the problem is NP hard [270] and ill-posed. The computation of the NMF factorization is typically based on solving the minimization problem min W,H≥0 Alternating minimization procedures, which consist of an alternating update of the factors W and H, are typically employed and we refer to [99,124,181] for more details. In [71] the authors analyze the relationship between the NMF factorization of a kernel matrix and spectral clustering discussed later. For further improving the performance the authors in [43] include a kernel matrix as a regularization term for the objective function. This shows that kernel matrices, which are matrices changing the representation of the data, are often very useful and we discuss these next.
3. Changing the data representation. So far we focused on methods that directly utilize the data A as a matrix. Often it is necessary to transform the data to a different representation. The goal is that for the transformed data the learning task is easier and we now describe several approaches designed for that purpose.
3.1. The graph Laplacian operator. The data encoded in A ∈ R n,m either have a natural representation as a graph with the nodes v j representing the associated feature vectors a j ∀j or they can be modeled that way. The result is a graph G = (V, E) consisting of the nodes v j ∈ V and edges e ∈ E, where an edge e consists of a pair of nodes. We here consider undirected graphs where an edge is typically equipped with an edge weight representing the strength of the connection between the corresponding nodes. A practically very relevant example is given by the Gaussian weight function In [273, Section 2.2] it is shown how data can be converted into the form of a graph. One example is the fully connected graph, where all nodes are connected to each other, not neglecting any information but leading to dense matrices. The weights are then collected into a matrix W ∈ R n,n where the diagonal is set to zero. If two nodes are not connected in the graph the associated matrix entry is set to zero. As a second . As a result the graph Laplacian L = D − W is obtained or the symmetric normalized Laplacian L sym = I − D −1/2 W D −1/2 . The properties of the graph Laplacian are discussed in [273,51,51]. It is easy to see that the matrix is symmetric and the positive semi-definiteness can be seen from the relation In the context of data science many of the properties of the graph Laplacian can be utilized for tasks such as clustering. In particular, the eigeninformation of L and L sym are crucial. For example, the number of zero eigenvalues gives information about the number of connected components of the underlying graph. The eigenvector corresponding to the first non-zero eigenvalue is known as the Fiedler vector. As the eigenvector corresponding to the zero eigenvalue is the constant vector c 1, . . . , 1 T with c ∈ R and since the Fiedler eigenvector is orthogonal to it, we must have sign changes in the Fiedler vector. This makes the Fiedler a first candidate to perform clustering simply by the sign of its entries 4 [266,72,24,45].
In fact, one of the classical tasks that is performed using the eigeninformation of the graph Laplacian is spectral clustering [273], where one computes the first 5 k eigenvectors φ 1 , . . . , φ k . As the graph Laplacian translates the original data A into an alternative space encoded into new matrices W and D, its eigeninformation will lead to a different clustering behavior than traditional methods such as k-means [247]. It can be seen from Figure 2 that standard k-means clustering based on A [128,150] shows poorer performance when compared against spectral clustering [273]. In more detail, the most common spectral clustering methods proceed by using k-means on the rows of the eigenvector matrix In [273] the author illustrates that performing spectral clustering solves relaxed versions of known graph cut problems. If instead of the graph Laplacian the two normalized versions are used, one obtains different results, for the random walk Laplacian L rw = D −1 L see [238] and for L sym [206]. The hyperparameter 6 σ in the denominator of the weight function can be replaced by a local scaling with an additional hyperparameter describing the locality [294]. Often the parameter is chosen in a heuristic way according to the performance of the algorithm using the graph Laplacian [206]. Note that when interpreted in terms of Gaussian processes the parameter σ is related to the characteristic length-scale of the process [219].
As the foundation of the spectral clustering method is the computation of k eigenvectors of the (normalized) graph Laplacian, it is important to be able to compute these eigenvectors efficiently. Be reminded that the matrices L, L rw , and L sym = I − D −1/2 W D −1/2 are singular and the multiplicity of the zero-eigenvalue corresponds to the number of connected components in the graph. As we are interested in computing the smallest eigenvalues, we typically would need to invert the matrix L, which is not possible. Focusing on L sym it is obvious that the smallest eigenvalues of L sym can be computed from the largest eigenvalues of D −1/2 W D −1/2 . Methods for efficiently computing the eigeninformation of such a matrix often rely on Krylov subspaces [106,250,168,225], where it is crucial to perform the matrix vector products with D −1/2 W D −1/2 efficiently. If we assume that the graph consists of one connected component D is diagonal and invertible. The main cost multiplying with D −1/2 W D −1/2 comes out of computing matrix vector products with W . For sparse graphs the matrix W will itself be sparse and matrix vector products will be inexpensive. The main computational challenge is then encountered for non-sparse or fully connected graphs. For large data-sets this is often infeasible and hence more sophisticated techniques are needed. Recently, methods based on the non-equispaced Fourier transform [5], the (improved) fast Gauss transform [286,199], or algebraic fast multipole methods [290,186] have shown great potential resulting in a complexity of O(n log n) for the matrix vector products with fixed m. While these methods provide great speed-ups the dimensionality m of the feature vectors still provides a significant challenge in computations and we return to this point in the next section.
The basis Φ k = [φ 1 , . . . , φ k ] is not only important for spectral clustering but also for a group of methods recently introduced for semi-supervised learning, i.e., given a small set of labeled data the remaining unclassified data points are classified simultaneously. The methods are based on partial differential equation techniques from material science modeling [42,7], namely diffuse-interface methods, and have also been used in image inpainting [22]. The graph Laplacian then replaces the infinitedimensional Laplacian to obtain a differential equation based on the graph data via with ψ(u) a potential enforcing two-classes, ω is a penalization term, and f containing the training data. The variable ε is a hyperparameter related to the thickness of the interface region. The dimensionality of this equation is still vast and a projection using Φ k reduces the PDE to a k-dimensional equation. Note that similar to model order reduction the nonlinearity still needs to be evaluated in the large-dimensional space [46].
We briefly want to comment on the use of the graph Laplacian in image processing, where the pixels are often represented as the nodes in a graph, which would then lead to a fully connected graph. In this field the graph Laplacian is often used as a regularizer for denoising [151,180,220] or image restoration [152]. In [195] the construction of the Laplacian is performed patch-wise in both a local and a non-local fashion. A more in-depth discussion for applications in image processing can also be found in [49].
The graph Laplacian has also recently enjoyed wide applicability within deep learning. Namely, as an essential ingredient within so-called Graph Convolutional Networks [132,38,154] where the usual layer equation for semi-supervised learning becomes with σ l an activation function and weights Θ that need to be learned. The crucial filter matrices K (k) are computed using the graph Laplacian associated with the input X (0) . Mainly, the kernel matrices K (k) are composed from a k-dimensional filter space span {φ 1 , . . . , φ k } , typically of the form K (k) = U φ k (Λ)U T , where U and Λ are the eigenvector and eigenvalue matrix of the graph Laplacian or a slight modification of it. The name convolutional network stems from the fact that the transformation U φ k (Λ)U T can be interpreted as graph Fourier transform [240,66]. In more detail, for a signal x ∈ R n , with n the number of nodes in the graph,x = U T x is the graph Fourier transform and its inverse is given by x = Ux. It is clear that polynomial filters φ j are easy to apply either directly by multiplying with the matrices or via an (approximate) eigendecomposition of the Laplacian. Many filters and efficient methods for their computations have been suggested and we refer to [154,132,6,172,241,175,300,281] for some of them.
There have also been generalizations of the graph Laplacian to other settings. We in particular want to mention the case of hypergraphs [299,298], where an edge is now a collection of possibly many nodes. Again, one can obtain a normalized Laplacian operator of the form L = I −Θ and perform spectral clustering based on the eigeninformation of this matrix. Hypergraphs are encountered in many applications such as biological networks [158], image processing [292], social networks [297] or music recommendation [39]. Hypergraphs have also been used in the context of semisupervised learning [169,218,31] and in particular convolutional neural networks based on hypergraphs [6,285].
Graph Laplacians have also been used to analyze multilayer networks [157,28] where a set of nodes can be connected in various layers (cf. Figure 3 for an illustration). The connections between the nodes and the various layers can be represented as a tensor but also using a (supra)-Laplacian [244] L = L (L) + L (I) with the intra-layer-supra Laplacian L (L) = blkdiag(L 1 , . . . , L K ) and the interlayersupra Laplacian L (I) = L I ⊗ I with L I the interlayer Laplacian. Again, the eigeninformation of the supra-Laplacian provide rich information about the network [244,216,258]. One can also obtain networks where the nodes are not connected across layers but rather only have intra-layer connections. We do not discuss the full details of the various different network structures here but identify this as a very exciting area of future research. In the context of analyzing social relationships we want to mention signed networks, which are graphs with positive and negative edge weights. These networks are used to model friend and foe type relationships and we refer to [171,235,122,170,255] and the references mentioned therein for an overview of some of the crucial applications. Again techniques such as spectral clustering [235,190,192], semi-supervised learning [188], convolutional networks [68] are available to extract further information from the data. The difficulty for signed networks is that the classical graph Laplacian is not sensible and several competing versions are possible (see [97,255] for an overview).
The graph Laplacian is also an essential tool analyzing complex networks via network motifs [18,211], graph centralities [92,94,215], or community detection [204,205]. It has also been suggested to replace the graph Laplacian by a deformed Laplacian or Bethe Laplacian [37,226,200] as H(s) = (s 2 − 1)I − sW + D for a parameter s ∈ R and W the adjacency matrix of the graph. It has been shown that it corresponds to a non-backtracking random walk [162] and shows better performance when community detection is desired in very sparse graphs generated by the stochastic block model.
In the next section, we turn our attention to the case when the kernel, i.e. (3.1) is not only part of the graph Laplacian, but is viewed as the defining element of embedding the data into a high-dimensional space.

Kernel methods.
The transformation of the data via a so-called kernel function, like the Gaussian used for the graph Laplacian, is a technique that has been successfully applied in many data science tasks [237,234,201,141]. In Figure 4we illustrate a dataset that is difficult to linearly separate in two dimensions shown on the left. When the data is transformed via kernelization to three dimensional space it is easily separable. The derivation of kernel methods follows from the evaluation of the function f (x) = w T x for a new data point x uses the computed weights where λ is a regularization parameter and this expression is derived using Lagrangian duality [234]. Inserting the new point x gives and the evaluation of a T i x is now replaced by a more general kernel function such as k(a i , x) = a T i x. We then write the above as To understand the role of the kernel function let us look at the kernel [234] k(v, x) = (x T v) 2 and consider the feature mapping φ that maps the two-dimensional data into threedimensional space via This even comes at a computational advantage as we do not need to compute the higher-dimensional φ vectors. This technique avoiding the direct computation of the higher-dimensional nonlinear relation via the evaluation of a kernel function is known as the kernel trick [230,234,231]. It is clear now that the adjacency matrix of the graph Laplacian is also a kernel matrix for the particular choice of the Gaussian kernel. The assembly of the kernel for all data points a i into a matrix leads to a positive semi-definite Gram/kernel matrix K. In fact, under the name of kernel PCA the leading (now largest) eigenvectors of K are computed to obtain principal components/directions in the data (cf. [233,232]). The use of the kernel trick in machine learning is omnipresent. One of the simplest but powerful methods is the so-called kernel ridge regression (KRR), where the core problem is to minimize the function with b a vector encoding the training data and u a vector of weights. This problem is then tackled via kernelization of the matrix A T A as K to the solution of a linear system [5] (K + λI) w = b where the system matrix is symmetric and positive definite for a positive regularization or ridge parameter λ ∈ R. Such a system is typically solved numerically with the use of a preconditioned iterative solver such as the conjugate gradient method [106,224,135]. The key ingredients are the matrix vector products with K + λI and developing a preconditioner P ≈ K + λI. For the matrix vector product the nonequispaced fast Fourier transform (NFFT) can be used for a variety of different kernels [5] and in the case of Gaussian kernels many bespoke methods exist such as [286,290,186]. A method based on sketching and the random feature method [217] is introduced in [10], which can also be applied to various different kernel functions. A flexible preconditioner, which changes in every iteration, combined with a suitable Krylov solver was presented in [246] whereas the authors in [236] construct a lowrank approximation preconditioner based on an interpolative decomposition and fast matrix vector products. A more difficult scenario arises when the feature vectors a i compromising the kernel matrix live in a higher-dimensional space. In this case, fast matrix vector multiplication becomes more difficult and suffers from the curse of dimensionality unless certain decay rates are imposed on the matrix entries [199].
In [288] the authors consider an example with m = 90 where they use a divide-andconquer parallel algorithm that also comes with communication avoidance. For more details we refer to [288] and also the mentioned literature for competing methods. In [221] the authors use a technique based on randomization (cf. Section 4) for both the matrix vector product and the preconditioner in KRR while a similar problem is solved in [291] via high performance computing approaches. The power of the kernelization has been exploited as an essential ingredient of support vector machines (SVM) [234,268,59]. Support vector machines are derived from a margin maximization and a kernelization of the inner products. The resulting kernel matrix is then at the heart of the quadratic program that needs to be solved for determining the support vectors [234] but the computation suffers from having to deal with dense kernel matrices. In [212] the sequential minimal optimization (SMO) method is introduced, which breaks the problem into smaller analytically solvable problems. Kernel matrices are also crucial when solving the closely related support vector regression (SVR) problem [243,84]. While both SVM and SVR remain popular methods, deep learning techniques have recently gained more popularity and have been combined with kernel techniques via graph convolution networks [132,154] or as loss functions in convolutional networks [256,183].
Often the linear algebra tasks associated with large scale kernel methods will involve randomization, which we discuss next.

Randomization.
In their seminal review paper [126] the authors discuss the importance of the decompositional approach to matrix computations [75] discussed earlier. Due to the computational complexity it is not always straightforward to efficiently compute matrix factorizations since in many data science applications the matrix A is so large that the techniques presented earlier cannot be applied. Additionally, the data might be corrupted or it is desirable to avoid many passes over the data so that classical methods need further thought.
One of the key ingredients in allowing efficient numerical linear algebra is the process of randomization. Randomization has proven to be a valuable tool in various matrix computation tasks such as matrix vector products [58,78] or low-rank approximations [26,79]. These techniques typically follow the scheme of producing a skinny matrix Q ∈ R n,k+p such that we obtain the orthogonal-projection-approximation onto the subspace spanned by Q via A ≈ QQ T A.
From the information contained in Q one then computes standard factorizations such as QR or SVD decompositions. In more detail, we form the matrix B = Q T A ∈ R k+p,m , which then gives the low-rank approximation A ≈ QB. Plugging in the SVD of B results in an approximate SVD of A via The first stage, i.e, the computation of the matrix Q, follows from the prototype algorithm illustrated in Algorithm 4.1. Note that we are typically only interested in a Algorithm 4.1 Prototype algorithm from [126].
decomposition of order k when proceeding to the second stage of the method but that G is of dimension k + p where p is the oversampling parameter. The dimensionality of G is crucial as our aim is to reduce the complexity as much as possible. Theoretical bounds for the approximation quality are given in [126] and depend on the parameters k and p as well as the matrix size and the (k + 1)-st singular value of A. Such a randomized SVD has become the an essential tool in many scientific computing and data science applications in areas such as uncertainty quantification [173], optimal experimental design [4], or computer vision [89]. One of the key applications of randomized methods within data science is the use for approximation of kernel matrices [187,81,295,174,142,23] where the desire is to avoid the computation of the full kernel matrix since the storage demand can be too large. In particular, the kernel matrix is approximated via where in the traditional setup the matrix Q contains columns of the identity matrix drawing columns from the matrix A and we obtain the so-called Nyström scheme This means in order to approximate a kernel matrix A one only needs to compute a small number of columns and rows, which has been shown to be very successful (cf. [214,81]). In [187] the author suggests to use a Q obtained via Algorithm 4.1 and thus obtain a variant of the traditional Nyström method.
We already pointed out in Section 2.2 that interpretable decompositions can be obtained via QR or SVD approximations to A. As the author in [187] points out computing a matrix that holds a basis for the column space is crucial. For this one computes Y = AG and via a subspace iteration proceeds to iteratively refine the matrix Y using the update Y = AA T Y . The resulting Y is then used to identify the relevant row-indices for the ID. In the same way one can obtain the two-sided ID and also a randomized CUR decomposition, where the details are given in [187]. Note that algorithms for directly computing the CUR decomposition compute some sort of statistical score for the importance of the columns/rows via the truncated SVD [184,276,34]. When obtaining the CUR from the two-sided ID we get the index sets from the ID matrices. The usefulness of the importance sample or sampling statistics is illustrated in [36] where the author shows that operations from numerical linear algebra such as inner products or matrix vector products can be performed using randomized numerical linear algebra. The key ingredients when multiplying AB or a T b is the sampling strategy for producing the approximations. The strategy is often based on an importance sampling strategy [80,82] to draw elements or rows/columns of a matrix for further processing. For example, the authors in [90] use the vector q , j = 1, . . . , n of probabilities 7 to produce an approximation C ≈ AB based on rows of A and columns of B. For the solution of a least squares problem [83,36] the (random) selection process is decoupled from from a deterministic computation via a traditional numerical linear algebra method. In particular, the first stage computes a score based on e.g. Euclidean norms, a truncated SVD matrix, or columns of a truncated Hadamard matrix. This means that the original problem is separated into a random sampling procedure leading to a reduced formulation that is then solved via a deterministic NLA approach. Due to the success in data science applications randomized methods have also penetrated classical problems in scientific computing such as solving linear systems of equations [110,263,203], eigenvalue problems [229,113] or inverse problems [283,282,269]. One area where randomization has helped greatly with the evaluation of complex mathematical expressions is the computations of functions of matrices, which we want to describe now.

Functions of matrices. The concept of evaluating a function of a matrix
f : R n,n → R n,n , where f is not meant to be evaluated element-wise, is an old but still very relevant one and we refer to [137] for an excellent introduction to the topic. One can define the matrix function via the Cauchy integral theorem as where f is analytic on and inside a closed contour Γ ⊂ C that encloses the spectrum of A. There are other definitions that are equivalent for analytic functions (cf. [137]).
Matrix functions provide a large number of challenges for numerical methods and an early discussion with 19 dubious methods for the matrix exponential can be found in [197] with an update provided in [198] adding Krylov subspace methods as the twentieth method. One typically considers two different setups, the first is the computation or approximation of the matrix f (A) explicitly and the other is the evaluation of f (A)b where the explicit evaluation f (A) is avoided and we only focus on the application of this matrix to a vector b ∈ R n . For a recent survey on evaluating f (A)b we refer to [120]. In the this case the computation of f (A) and then applying it to b is obviously avoided. Efficient techniques typically depend on the size of the data matrix A. The approximation of f (A)b approximating the Cauchy integral formula (5.1) via contour integration was given in [125]. Often the main work goes into solving linear systems with (zI − A) and once direct solvers become infeasible approaches based on Krylov subspace methods have shown very good performance for approximating the evaluation of f (A)b [139,159,85,119,159]. For the more complex case of computing f (A) we again refer to [137] for an overview of suitable methods. We here want to mention certain scenarios in data science applications that lead us to such matrix functions. Throughout scientific computing and engineering the fractional Laplacian, where instead of second order derivatives one considers derivatives of arbitrary orders, has gained much popularity in recent years [213] and one way to evaluate this efficiently is using matrix functions, e.g. [41]. Fractional powers of the Laplacian have recently become more popular for semi-supervised learning [15,65], where to the best of our knowledge techniques based on the full eigendecomposition have been employed, which for large graphs quickly becomes infeasible. In [189] a multilayer graph is considered where arbitrary matrix powers of the layer Laplacians are evaluated using the polynomial Krylov subspace method (PKSM) [137] and using contour integrals [191].
For this we return again to the study of complex networks and are now interested in the network centrality [93] where we assume that the network is understood as an undirected graph. One measure of centrality is the degree matrix D of the graph Laplacian assigning the degree d ii to every node. Other centrality measures are of great use for understanding the underlying data. For example, the eigenvector centrality [29] defined as with W the adjacency matrix of the network and (λ 1 , φ 1 ) the Perron-Frobenius eigenvalue and eigenvector, respectively. The authors in [93] illustrate that powers of the adjacency matrix W provide essential information about the graph. In particular, studying the (i, j)-th entry of W n allows to count the number of different walks of length n getting from node i to node j. For example the diagonal entries of W 2 give the degrees of the nodes in the network. Note that powers of the graph Laplacian are considered in [1] for the purpose of spectral clustering and within graph convolutional networks for semi-supervised learning in [179]. The use W 2 suggests to consider higher powers of the adjacency matrix for longer walks and one obtains as the so-called subgraph centrality for node i, (cf. [91,94]). The sum over all those entries is then known as the Estrada index [101,67,62]. Due to the high complexity the computation of the matrix function exp(A) should be avoided, especially if complex networks are considered. It is well known that expressions of the form u T f (A)u can be well approximated by e T 1 f (T k )e 1 , where T k is the tridiagonal matrix coming from the Lanczos process applied A, u is assumed to have norm one, and e 1 is the first unit vector. As the authors in [103,102,103] show there is a beautiful relation between u T f (A)u and Gauss-quadrature with first results going back to [107]. In particular, the authors in [19] use Gauss quadrature estimates coming from an efficient computation of the Krylov basis via the Lanczos process. In [30] the authors also rely on the power of Gauss quadrature to compute Katz scores and commute times between nodes in a network.
Using the Lanczos process or other Krylov subspace methods [103,104,252] to approximate quantities of the form u T f (A)u also proves essential for applications when the trace of a matrix has to be estimated by where the v i are carefully chosen random vectors. One such estimator is the so-called Hutchinson estimator [11] and the numerical computation again relies on the Lanczos process and its efficient approximation of the spectrum of A [265,194,16]. Gaussian processes [219,182,223] are another essential tool that is heavily used in data science providing significant numerical challenges evaluating matrix functions as we will illustrate now. Following [74], a Gaussian process is a collection of random variables with a joint probability distribution. Let us consider X = {x 1 , . . . , x n } with all the x i ∈ R d . The Gaussian process can then define a distribution over functions f (x) ∼ GP(µ(x), k(x, x )) with mean µ(x) : R d → R and covariance function k(x, x ) : R d × R d → R. Now f X ∈ R n represents the vector of the function values for f (x i ), µ X ∈ R n the evaluation of µ(x i ), and the matrix K XX ∈ R n,n represents the evaluation of k(x i , x j ) ∀i, j. Then it holds that f X ∼ N (µ X , K XX ). The chosen covariance kernel depends on hyperparameters θ and we assume that we are given a set of noisy function values y ∈ R n with variance σ 2 . Assuming a Gaussian prior distribution depending on θ one obtains the log marginal likelihood as It is obvious that for the numerical solution of the minimization of the log marginal likelihood the evaluation of the log-determinant is crucial. If the matrix K XX is small then we can afford the computation via the Cholesky decomposition, which is an O(n 3 ) computation. Then for where W = LL T is the Cholesky decomposition of the symmetric positive definite matrix W . In [222,223] the authors show that the methods become more efficient if one works with Markovian fields (cf. [242] for approaches beyond Cholesky for Markovian fields). In general the Cholesky decomposition will be too expensive and again Krylov subspace methods such as the Lanczos process and its connection to Gauss-quadrature are exploited. From the relation exp(trace(log(W ))) = exp(trace(Xlog(Λ)X −1 ) = exp we obtain the following beautiful relationship log(det(W )) = trace(log(W )).
This shows that in order to approximate the log-determinant, it is again crucial to employ efficient trace estimators. The Lanczos-based approximation has been used in [177,13,265] and the authors in [74] combine the Lanczos-based ideas with fast matrix vector products by modified structured kernel interpolation [279] allowing a good approximation of the log-determinants and also its derivatives. We refer to [138] for an overview of software for the computation of matrix functions in various programming languages.
6. High-dimensional problems. All of the above problems are tailored to when the data is given as a matrix A ∈ R n,m but often it is also natural for the data to be given as a tensor (6.1) A ∈ R n1×n2×...×n d with the dimensions n i ∈ N for all d modes. It is obvious that the storage requirements for a full tensor A quickly surpass the available resources in many applications. This exponential increase in relation to the parameter d is often referred to as the curse of dimensionality. This problem is closely related to the approximation of a nonlinear function in a high-dimensional space [70] where the choice of functions φ is crucial and these could be a basis, a frame, or a dictionary of functions [259,40,57,86,149]. In this case it is desirable to find an approximation with some form of sparsity, i.e., a small as possible number of terms l. Alternatively, one can interpret the function as being defined on a tensor product space and aiming at designing a low-rank approximation to f (cf. [207] for example).
The approximation of f in (6.2) is then performed relying on similar formats and tool as the ones that are used for the approximation of the given tensor A in (6.1), which we briefly describe now. For this reason approximating the tensor has a long tradition in computational mathematics, physics, chemistry, and other disciplines. While being already 10 years old we would like to point to [160] for a seminal overview of numerical tensor methods. There are several low-rank approximations to the full tensor A that correspond to different tensor formats such as the classical CANDECOMP/PARAFAC(CP) formulation [153] where with • being the dyadic product. While this format is the most natural, it has properties that make it not ideal for (many) applications, e.g. the set of rank R tensors is not closed, the computation ill-posed, etc. (cf. [111] for more details). Alternatively, one can consider the Tucker format/HOSVD [63,64], which somewhat generalizes the SVD to the multidimensional case but still suffers from the curse of dimensionality due to the d-dimensional core tensor needed. This is overcome by the tensor-train format [209], which consists of d core tensors of maximal dimension 3.
In the context of data science tensors have long been an essential tool and to list all applications and techniques goes beyond the scope of this paper. We here only list a few results that address similar questions to the above mentioned matrix cases: nonnegative factorizations [55,56,50], interpolative decomposition [228], CUR [60,185], kernel methods [257,131,130], semi-supervised learning [115], and randomization [14,277]. This list is far from complete and for more detailed reviews on tensor methods for data science we refer to the recent surveys [53,54,52] and the references given therein. For a very broad overview of recent techniques and literature for numerical tensor methods beyond data science we refer to [111].
One of the key areas that tensor prove very powerful tools is the field of deep learning, which we now briefly review.
7. Numerical linear algebra in deep learning. The study of computational and mathematical aspects of deep learning is glowing white hot with activity and we only want to point to number of places where linear algebra is used for better understanding or improving performance of deep learning methods.
In the most generic setup, the task in an artificial neural networks is to determine the weight matrices W (j) and bias vectors b (j) from minimizing a loss function to obtain the function see [136,108,166] for more information and further references. The loss function typically consists of a sum of many terms due to the large number of training data and as a result the optimization is based on gradient descent schemes [33]. For classical optimization problems it is often desirable to use second order methods due their superior convergence properties. These typically require the use of direct or iterative solvers to compute the step towards optimality, e.g. solving with the Hessian matrix in a Newton method. As a result the applicability of Newton-type schemes for the computation of W (j) and b (j) has received more attention with a strong focus on exploiting the structure of the Hessian matrix [32,61,275,47,227,271]. The computational complexity of neural networks is challenging on many levels. In order to reduce the numerical effort the authors in [260,284] use the SVD to analyze the weight matrices and then restructure the neural network accordingly, while the authors in [289] use the condition numbers of the weight matrices for such a restructuring. The weight matrices have received further attention as in [208] the authors compress the weight matrices of fully connected layers using a low-rank tensor approximation, namely, the tensor train format [209]. Additionally, low-rank approximations are also useful for speeding up the evaluation of convolutional neural networks [144] by using a low-rank representation of the filters, which are used for detecting image features. For a very similar task the authors in [165,117,118] rely on optimized tensor decompositions. Note that recently these techniques have also been applied to adversarial networks [44].
The connection of neural network architectures to optimal control problems as introduced in [121,116] makes heavy use of PDE-constrained optimization techniques including efficient matrix vector products. This topic has also recently received more attention from within the machine learning community [48] and promises to be a very interesting field for combining traditional methods from numerical analysis with deep learning.
As already pointed out in Section 3.1 graph convolutional networks (GCNs) have shown great potential and received much popularity recently as tools for semisupervised learning [154]. The expressive power of representing data as graphs that is the basis of GCNs has led to using Adversarial Networks [96], matrix completion [20], or graph auto encoders [155]. This is only a brief glimpse where numerical linear algebra techniques have enhanced deep learning methodology. We believe that this will be an area of intense research in the future.

Conclusion.
We have illustrated with this literature review that numerical linear algebra is alive and kicking. In a sense the rise of machine learning, big data, and data science shows that old methods are still very much in fashion but also that new techniques are required to either meet the demands of practioners or account for the complexity and size of the data.