Non-parametric regression for networks

Network data are becoming increasingly available, and so there is a need to develop suitable methodology for statistical analysis. Networks can be represented as graph Laplacian matrices, which are a type of manifold-valued data. Our main objective is to estimate a regression curve from a sample of graph Laplacian matrices conditional on a set of Euclidean covariates, for example in dynamic networks where the covariate is time. We develop an adapted Nadaraya-Watson estimator which has uniform weak consistency for estimation using Euclidean and power Euclidean metrics. We apply the methodology to the Enron email corpus to model smooth trends in monthly networks and highlight anomalous networks. Another motivating application is given in corpus linguistics, which explores trends in an author's writing style over time based on word co-occurrence networks.


Introduction
Networks are of wide interest and able to represent many different phenomena, for example social interactions and connections between regions in the brain (Kolaczyk, 2009;Ginestet et al., 2017). The study of dynamic networks has recently increased as more data of this type is becoming available (Rastelli et al., 2018), where networks evolve over time. In this paper we develop some non-parametric regression methods for modelling and predicting networks where covariates are available. An application for this work is the study of dynamic networks derived from the Enron email corpus, in which each network corresponds to communications between employees in a particular month (Diesner et al., 2005). Another motivating application is the study of evolving writing styles in the novels of Jane Austen and Charles Dickens, in which each network is a representation of a novel based on word co-occurences, and the covariate is the time that writing of the novel began (Severn et al., 2020). In both applications, the goal is to model smooth trends in the structure of the dynamic networks as they evolve over time.
An example of previous work on dynamic network data is Friel et al. (2016) who embedded nodes of bipartite dynamic networks in a latent space, motivated by networks representing the connection of leading Irish companies and board directors. We will also use embeddings in our work, although the bipartite constraints are not present. Further approaches include object functional principal components analysis (Dubey and Mueller, 2019) applied to time-varying networks from New York taxi trip data; multi-scale time-series modelling (Kang et al., 2017) applied to magnetoencephalography data in neuroscience; and quotient space space methodology applied to brain arterial networks (Guo and Srivastava, 2020).
The analysis of networks is a type of object oriented data analysis (Marron and Alonso, 2014), and important considerations are to decide what are the data objects and how are they represented. We consider datasets where each observation is a weighted network, denoted G m = (V, E), comprising a set of nodes, V = {v 1 , v 2 , . . . , v m }, and a set of edge weights, E = {w ij : w ij ≥ 0, 1 ≤ i, j ≤ m}, indicating nodes v i and v j are either connected by an edge of weight w ij > 0, or else unconnected (if w ij = 0). An unweighted network is the special case with w ij ∈ {0, 1}. We restrict attention to networks that are undirected and without loops, so that w ij = w ji and w ii = 0, then any such network can be identified with its graph Laplacian matrix L = (l ij ), defined as The graph Laplacian matrix can be written as L = D − A, in terms of the adjacency matrix, A = (w ij ), and degree matrix D = diag( m j=1 w 1j , . . . , m j=1 w mj ) = diag(A1 m ), where 1 m is the m-vector of ones. The ith diagonal element of D equals the degree of node i. The space of m × m graph Laplacian matrices is of dimension m(m − 1)/2 and is where 0 m is the m-vector of zeroes. In fact the space L m is a closed convex subset of the cone of centred symmetric positive semi-definite m × m matrices and L m is a manifold with corners (Ginestet et al., 2017).
Since the sample space L m for graph Laplacian data is non-Euclidean, standard approaches to nonparametric regression cannot be applied directly. In this paper, we use the statistical framework introduced in Severn et al. (2020) for extrinsic analysis of graph Laplacian data, in which "extrinsic" refers to the strategy of mapping data into a Euclidean space, where analysis is performed, before mapping back to L m . The choice of mapping enables freedom in the choice of metric used for the statistical analysis, and in various applications with manifold-valued data analysis there is evidence of advantage in using non-Euclidean metrics (Dryden et al., 2009;Pigoli et al., 2014).
A summary of the key steps in the extrinsic approach is: i) embedding to another manifold M m by raising the graph Laplacian matrix to a power, ii) mapping from M m to a (Euclidean) tangent space T ν (M m ) in which to carry out statistical analysis, iii) inverse mapping from the tangent space T ν (M m ) to the embedding manifold M m , iv) reversing the powering in i), and projecting back to graph Laplacian space L m , which we explain in more detail as follows. First write L = UΞU T by the spectral decomposition theorem, with Ξ = diag(ξ 1 , . . . , ξ m ) and U = (u 1 , . . . , u m ), where {ξ i } i=1,...,m are the eigenvalues, which are non-negative as any L is positive semi-definite, and {u i } i=1,...,m are the corresponding eigenvectors of L. We consider the following map which raises the graph Laplacian to the power α > 0: In this paper we take M m to be the Euclidean space of symmetric m × m matrices. In terms of F α (·) we define the power Euclidean distance between two graph Laplacians as where X = trace(X T X) is the Frobenius norm, also known as the Euclidean norm. For the special case α = 1, (3) is just the Euclidean distance. Severn et al. (2020) further considered a Procrustes distance which includes minimization over an orthogonal matrix, approximately allowing relabelling of the nodes, and in this case the embedding manifold M m is a Riemannian manifold known as the size-and-shape space (Dryden and Mardia, 2016, p.99). However, in this paper for simplicity and because the labelling is known, we shall just consider the power Euclidean metric.
After the embedding the data are mapped to a tangent space T ν (M m ) of the embedding manifold at ν using the bijective transformation . In our applications we take ν = 0 and the tangent co-ordinates are v = vech(HF α (L)H T ), where vech() is the vector of elements above and including the diagonal and H is the Helmert sub-matrix (Dryden and Mardia, 2016, p49). The main point of note is that the tangent space is a Euclidean space of dimension m(m − 1)/2. Hence, multivariate linear statistical analysis can be carried out in T ν (M m ); for example, Severn et al. (2020) considered estimation, two-sample hypothesis tests, and linear regression.
After carrying out statistical analysis, the fitted values are then transformed back to the graph Laplacian space as follows: where G α : M m → M m is a map that reverses the power, and P L : M m → L m is the projection to the closest point in graph Laplacian space using Euclidean distance. The projection P L is obtained by solving a convex optimization problem using quadratic programming, and the solution is therefore unique. See Severn et al. (2020) for full details of this framework.
For visualising results, it is useful to map the data and fitted regression lines in L m into R 2 . In this paper we do so using principal component analysis (PCA) such that the two plotted dimensions reflect the two orthogonal dimensions of greatest sample variability in the tangent space (Severn et al., 2020).
2 Nadaraya-Watson estimator for network data

Nadaraya-Watson estimator
We first review the classical Nadaraya-Watson estimator (Watson, 1964;Nadaraya, 1965) before defining an analogous version for data on L m . Consider the regression problem where we want to predict an unknown variable y(x) ∈ R with known covariate x ∈ R p for the dataset of independent and identically distributed random variables ({Y 1 , X 1 }, . . . , {Y n , X n }) observed at ({y 1 , x 1 }, . . . , {y n , x n }) with E{|Y |} < ∞. The aim is to estimate the regression function The Nadaraya-Watson estimator ism where K h ≥ 0 is a kernel function with bandwidth h > 0.
Consider now a version of the regression problem with y(x) replaced with an m × m graph Laplacian matrix L(x) ∈ L m with known covariates x ∈ R p and dataset We wish to estimate the regression function A natural analogue ofm(x) in (5) for graph Laplacian data given covariate, x ∈ R p , iŝ where which is bounded above and strictly positive for all u. We use a truncated version of (7) such that K h (u) = 0 for u > c (with c large) in order that this truncated kernel has compact support, as required by theoretical results presented later. WhereverL N W is defined (meaning that at least one of the K h (x − x i ) is non-zero) it is a sum of positively weighted graph Laplacians. Since the space L m is a convex cone (Ginestet et al., 2017), itself defined as the sum of positively weighted graph Laplacians, thusL N W (t) ∈ L m as required.
The estimator in (6) can equivalently be written as the graph Laplacian that minimises a weighted sum of squared Euclidean, d 1 , distances to the sample datâ using weighted least squares. In principle, the Euclidean distance d 1 in (8) can be replaced with a different distance metric, d, though solving for the estimator entails an optimisation on the manifold L m , which can be theoretically and computationally challenging. Hence instead we generalise to other distances via an extrinsic approach, and definê which is simpler provided, as here, the embedding manifold M m is chosen such that the optimisation is straightforward. The projection is needed to map back to graph Laplacian space L m .
For the power Euclidean metric, d α , consider the Nadaraya-Watson estimator in the tangent space in terms of which, after mapping back to L m using (4), the resulting Nadaraya-Watson estimator in the graph Laplacian space iŝ When α = 1 this simplifies to (6).

Uniform weak consistency
First we show that the Nadaraya-Watson estimator for graph Laplacians (6) is uniformly weakly consistent. Let where µ is the probability measure of x. Proposition 1. Suppose the kernel function K h is non-negative on R p , bounded above, has compact support and is strictly positive in a neighbourhood of the origin.
is uniformly weakly consistent for the true regression function Λ(x).
Proof. Consider the univariate regression problem for the (i, j)th element ofL N W (x). From Devroye and Wagner (1980) we know that under the conditions of the proposition we have as h → 0 and nh d → ∞; and since The result can be extended to the power Euclidean distance based Nadaraya-Watson estimator (11). Let where µ is the probability measure of X. Proposition 2. Under the conditions of Proposition 1 it follows that J n,α → 0. Hence the power Euclidean Nadaraya-Watson estimatorL α (x) is uniformly weakly consistent for the true regression function Λ(x).
Proof. First embed the graph Laplacians in the Euclidean manifold M m and map to a tangent space T ν (M m ). Consider the univariate regression problem for the (i, j)th element of π −1 ν (F α (L(x))). Again from Devroye and Wagner (1980); Spiegelman and Sacks (1980) we know that under the conditions of the proposition we have uniform weak consistency in the tangent space.
as h → 0 and nh d → ∞. Also, using the continuous mapping theorem and Pythagorean arguments as in Severn et al. (2020), we have as h → 0 and nh d → ∞.

Bandwidth selection
The result that the power Euclidean Nadaraya-Watson estimator is uniformly weakly consistent gives reassurance that the method is a sensible practical approach to non-parametric regression for predicting networks. The result is asymptotic, however, which leaves open the question of how to choose the bandwidth, h, in practice. One way to do so is to select it via cross validation (Efron and Tibshirani, 1993) as follows. Denote byL −i (x; h) a Nadaraya-Watson estimator at x, based on distance metric d, with bandwidth h, trained on all the training observations excluding the ith. Selection of bandwidth by cross validation then involves choosing h to minimise the criterion 3 Application: Enron email corpus The Enron dataset was made public during the legal investigation of Enron by the Federal Energy Regulatory Commission (Klimt and Yang, 2004) and an overview can be found in Diesner et al. (2005). Similar to Shetty and Adibi (2004) we use this data to form social networks between m = 151 employees and the data are available from Rossi and Ahmed (2015). For each month we create a network with employees as nodes. The edges between nodes have weights that are the number of emails exchanged between the two employees in the given month. The networks we consider are for the whole months from June 1999 (month 1) to May 2002 (month 36), and we standardise by dividing by the trace of the graph Laplacian for each month. The aim is to model smooth trends in the structure of the dynamic networks as they evolve over time, and we also wish to highlight anomalous months where the network is rather unusual compared to the fitted trend.
In Figure 1 we plot the distances between consecutive monthly graph Laplacians using Euclidean distance (a) and square root Euclidean distance (b). Some of the largest successive distances are We provide a PCA plot of the first two PC scores in Figure 2(a),(b) and include the Nadaraya-Watson estimator projected into the space of the first two PCs. Here the bandwidth has been chosen by cross-validation as h = 2 for the Euclidean case and h = 1 for the square root metric. The Nadaraya-Watson estimator provides a smooth path through the data, and the structure is clearer in the square root metric plot.
We are interested in finding anomalies in the Enron dynamic networks and so we compute the distances from each network to the fitted value from the Nadaraya-Watson estimate. Figure 2 shows these residual distances of each graph Laplacian to the fitted Nadaraya-Watson values for (c) the Euclidean metric and (d) the square root metric. Some of the largest residuals are months 1,7,35 for Euclidean and 7,33,34,35 for the square root metric, and these are candidates for anomalies.
From Figure 2(b) it looks like there is an approximate horseshoe shape in the PC score plot which could be an example of the horseshoe effect (Kendall, 1971;Diaconis et al., 2008;Morton et al., 2017). We might conclude there is a change point in the data around months 20-26 from these plots but this may be misleading (Kendall, 1970). Explained in Mardia et al. (1979, page 412), the horseshoe effect occurs when the distances which are "large", between data points, appear the same as those that are "moderate". Morton et al. (2017) described this as a "saturation property" of the metric, and so on the PCA plot the point corresponding to a 'large' time is pulled in closer to time 1 than we intuitively would expect.
As an alternative to PCA, which seeks to address this horseshoe effect, we consider multidimensional scaling (MDS) with a Mahalanobis metric in the tangent space (Mardia et al., 1979, p  between two graph Laplacians L k and L l , at times k and l respectively, which is: where µ and Σ kl are the mean and covariance matrix of π −1 0 (F α (L k )) − π −1 0 (F α (L l )) respectively. Here we take µ as zero and consider an isotropic AR(1) model which has covariance matrix which is a diagonal matrix where the diagonal elements are the variance of elements and we have assumed a 0 covariance between any other elements. Writing y k = π −1 0 (F α (L k )) and v k = π −1 0 (F α (L k−1 )) we estimate ρ by least squares ρ = , and we take σ = 1 as this is just an overall scale parameter. The Mahalanobis metric between graph Laplacians, L k and L l , can now be written as The plot of MDS with the Mahalanobis distance are given in Figure 3(a)-(b). In both plots there are large distances between the first few and last few observations compared to the central observations, which is broadly in keeping with Figure 1(a),(b) although the middle observations do seem too close together in the MDS plots. We consider an alternative estimate in choosing ρ that maximises the variance explained by the first PC scores for each example, shown in Figure 3(c)-(d).
These final MDS plots are more in agreement with the distance plots of Figure 1. In particular in Figure 3(d) we see that months 7, 34, 35, 36 look rather different from the rest.
Finally we consider the main features of all the results from Figure 1- Figure 3 and we see that the 7th, 34th and 35th months stand out as strong anomalies. The 7th month corresponds to December 1999, and this is picked out to be an anomaly in Wang et al. (2014), believed to coincide with Enron's tentative sham energy deal with Merrill Lynch created to meet profit expectations and boost the stock price. Month 34 and 35 correspond to March and April 2002 these correspond to the former Enron auditor, Arthur Andersen, being indicted for obstruction of justice (The Guardian, 2006).

Application: 19th century novel networks
We consider an application where it is of interest to analyze dynamic networks from the novels of Jane Austen and Charles Dickens. The 7 novels of Austen and 16 novels of Dickens were represented as samples of network data by Severn et al. (2020). Each novel is represented by a network where each node of the network is a word, and edges are formed with weights proportional to the number of times a pair of words co-occurs closely in the text. For each novel we produce a network counting pairwise word co-occurrences, and words are said to co-occur if they appear within five words of each other in the text. A choice that needs to be made is if we allow cooccurrences over sentence boundaries and chapter boundaries (Evert, 2008, Section 3), and for this dataset we allow it. The data are obtained from CLiC (Mahlberg et al., 2016).
We take the node set V as the m = 1000 most common words across all the novels of Austen and Dickens. A pre-processing step for the novels is to normalise each graph Laplacian, in order to remove the gross effects of different lengths of the novels, by dividing each graph Laplacian by its own trace, resulting in a trace of 1 for each novel. Our key statistical goal is to investigate the authors' evolving writing styles, by carrying out non-parametric regression with a graph Laplacian response on the year t that each novel was written.
We apply the Nadaraya-Watson regression to the Jane Austen and Charles Dickens networks separately to predict their writing styles at different times. The response is a graph Laplacian and the covariate is time t for each novel, with a separate regression for each novelist. We compared using the metrics d 1 and d 1 2 . For each author a Nadaraya-Watson estimate was produced for every 6 months within the period the author was writing. We compared different bandwidths, h, in the Gaussian kernel. The results are shown in Figure 4 plotted on the first and second principal component space for all the novels.
For both metrics for Dickens when h = 1 the regression lines are not at all smooth. For both metrics with h = 2 the regression line for Dickens appears to show an initial smooth trend, then a turning point around the years 1850 and 1851 (between David Copperfield and Bleak House which are novels 16 and 17 in Figure 4). After 1851 there is much less dependence on time. This change in structure is especially evident in the h = 4 plot for both metrics, which has the smallest value of the cross-validation criterion (14) out of these choices h ∈ {1, 2, 4}. In the year 1851 Dickens had a tragic year including his wife having a nervous breakdown, his father dying and his youngest child dying (Charles Dickens Info, 2020). We see that the possible turning point is around the same time as these significant events.
As there are far fewer novels written by Austen it is less obvious if there is any turning point in her writing, however it is clear that Lady Susan (novel 1) is an anomaly, not fitting with the regression curve that does follow Austen's other works more closely. Lady Susan is Austen's earliest work, and is a short novella published 54 years after Austen's death.

Discussion
The two applications presented involve a scalar covariate, but the Nadaraya-Watson estimator is appropriate to more general covariates, e.g. spatial covariates. A further extension would be to adapt the method of kriging, also referred to as Gaussian process prediction. Kriging is a geospatial method for prediction at points on a random field (e.g. see Cressie, 1993), and Pigoli et al. (2016) considered kriging for manifold-valued data. The kriging predictor of an unknown graph Laplacian L(x) on a random field with known coordinates x for the dataset ({L 1 , x 1 }, . . . , {L n , x n }) is of the form Z(x) = n i=1 b(x i )L i , where the weights, b(x i ), are determined by minimizing the mean square prediction error for a given covariance function.
The Nadaraya-Watson estimator can also be applied in a reverse setting where some variable t i is dependent on the graph Laplacian L i , this can be written as t i = t(L i ). This could be used if, for example, one had the times networks were produced and then wanted to predict the time a new network was produced. In this case the Nadaraya-Watson estimator is a linear combination of known t i values, weighted by the graph Laplacian distances, given bŷ where d can be any metric between two graph Laplacians. Severn (2019) provided an application of this approach using the Gaussian kernel defined in (7), predicting the time that a novel was written using the network graph Laplacian as a covariate.
Other metrics could also be used, for example the Procrustes metric of Severn et al. (2020). To solve (9) for the Procrustes metric, the algorithm for obtaining a weighted generalised Procrustes mean given in Dryden and Mardia (2016, Chapter 7) can be implemented.
In Euclidean space there are more general results for the Nadaraya-Watson estimator including weak convergence in L p norm, rather than p = 2 results that we have used (Devroye and Wagner, 1980;Spiegelman and Sacks, 1980). More general results also exist, e.g. see Walk (2002), including strong consistency. It will be interesting to explore which of these results can be extended to graph Laplacians, although the additive properties of the p = 2 case have been particularly important in our work.