Fallacy of data-selective inference in modelling networks

Recent years have seen a growing array of activities in developing statistical models for real-life networks. Since many of these networks are sparse, an all too common practice in the literature is to apply a developed model to a subnetwork, typically obtained by discarding nodes due to their lack of connectivity. In this note, we provide the first result highlighting issues with this practice, which we call the fallacy of data-selective inference. We demonstrate this fallacy by examining the estimation bias in the Erdős–Rényi model theoretically and in the stochastic block model empirically.

236 nodes (7% of all nodes), in which a link is formed between two authors if they wrote at least two papers together. See also Jin et al. (2021). The second one is another co-authorship network with 2,236 nodes (63% of all nodes), in which a link is formed between two authors if they wrote at least one paper together. The third one is a directed citation network with 2,654 authors (74% of all nodes). See also Zhang et al. (2021). Other attempts to use this dataset include Li et al. (2020), in which a network with 706 authors (20% of all nodes) was formed by repeatedly deleting nodes with fewer than 15 mutual citations and their corresponding edges. Jin et al. (2021) examined a citee network with 1,790 nodes (50% of all nodes) constructed by tying an edge between two authors if they have been cited at least once by the same author other than themselves.
In addition to the datasets above, there is a large and growing body of work opting for data-selective inference that removes nodes before analysis. Among many others, see Chen et al. (2018) and Ma, Ma & Yuan (2020) for the Simmons College and Caltech data, two datasets on friendship networks in universities, Sengupta and Chen (2018) for the British MPs network (where 329 out of 360 MPs belonging to the giant component were analysed), Ma, Su & Zhang (2020) for the Pokec social network (for which only those nodes with no fewer than 10 links were retained for analysis), and Yan et al. (2019) for using a directed β-model to analyse the Lawyer dataset (in which 8 out of 71 nodes were removed). A notable feature of the analyses in these papers is that nonnegligible portions of the nodes are excluded.
Statistical models are inherently motivated by data in real life. Real-life networks are known to be typically sparse in the sense that the total number of observed edges does not scale proportionally to the total number of possible edges. As a result, there often exists a nonnegligible fraction of nodes either disconnected from the largest connected component or not having enough links, prompting the use of data-selective inference. We have seen several examples discussed already. There are various reasons why this inference is employed. Chief among them is convenience, but more significant is the fact that a developed model and its associated algorithms may simply not work for disconnected or very sparse networks. For example, the degree-corrected stochastic block model (Karrer & Newman, 2011) and the β-model (Chatterjee et al., 2011) cannot handle nodes with zero degree for obvious reasons. On the other hand, it is frequently argued that those nodes and edges in the giant component, or those nodes that are better connected, are the only nodes and edges that matter.
However, this data-selective inference selects nodes via what is known as nonrandom sampling in statistics, since nodes in a giant component or well-connected nodes are systematically favoured over other nodes. This discussion immediately raises the following fundamental question: Does data-selective inference provide valid inference?
Before we provide an answer, we note that one way to avoid the question above would be to simply assume that the intended statistical model only works for the nodes in a giant component or those nodes that are well connected. While this argument is acceptable for mathematical convenience, it is not logically coherent or correct from a statistical or practical point of view. The selection of nodes is based entirely on the links (the response variable in a network model) and thus is nonrandom. Intuitively, this type of biased sampling may produce artificial signal that does not exist at all, attenuate existing signals, or both, leading to problematic or even completely wrong findings. The fact that a nonnegligible fraction of nonrandom nodes is removed from analysis suggests that systematic bias will occur as a result.
The practice of ignoring selected nodes when modelling a network appears to originate from the physics and computer science communities, where the intention was to find meaningful clusters of nodes and hence was not statistical model-based (Girvan & Newman, 2002; Newman, 2006). Later, statisticians injected rigour into this line of research, notably by introducing likelihood-based estimators for statistical network models (Bickel & Chen, 2009). On the one hand, statistical modelling is extremely attractive because it provides a proper probabilistic framework for statistical inference and allows easy generalization of a model to more complex situations. On the other hand, however, issues such as sampling and asymptotics, including consistency and limiting distributions as the size of a network grows, inevitably arise. In particular, it is no longer appropriate for a statistical framework to ignore the nonrandom sampling issue in data-selective inference that removes nodes based on their links.
In the next section, we highlight the problem of data-selective inference by quantifying the bias of the parameter estimates for the simplest network model theoretically and that for the stochastic block model via simulation. Since consistency of parameter estimation is a basic requirement for any statistical model, our rationale for examining estimation consistency is obvious. In particular, any systematic bias incurred in this data-selective inference will have knock-on effects on all aspects of any downstream analysis, including goodness-of-fit measures of a model, hypothesis testing and model selection; see, for example, Bickel and Sarkar (2016), Lei (2016), Wang and Bickel (2017), and Hu et al. (2020) for additional use of data-selective inference for data analysis.

| DATA-SELECTIVE INFERENCE PRODUCES BIAS
We now quantify the bias caused by data-selective inference in some simple network models when only those nodes in a giant component are used to fit a model. If data-selective inference poses problems for these simple models, it will not work for more complex models. In what follows, we assume that an observed network is the realization of some statistical network model f ∈ F, with F a family of candidate models. Crucially, we will assume that f would have produced the whole network, including any isolated vertices or small components. Given a realized network from the model, we want to quantify the bias of the estimator of the unknown parameter(s) in f if we only use those nodes in the giant component.
Motivated by several widely used datasets in the literature, as discussed in Section 1, we will specify the parameter(s) in f such that a fixed proportion of nodes is not in the giant component.
We will derive theoretically the bias of data-selective inference in the Erdős–Rényi model (Erdős & Rényi, 1959, 1960) and use simulation to study the bias in estimating the parameters in a simple stochastic block model. The Erdős–Rényi model is interesting because the methods and approaches developed for understanding this model, and the insight it brings, provide the foundation for the study of more general graphs. On the other hand, the stochastic block model is one of the most popular network models and is widely applied, studied and extended in the literature.

| The Erdős–Rényi model
As a reminder, the Erdős–Rényi model deals with undirected graphs in which edges are independently formed with the same probability p. A sufficient condition for a realized network from this model to have a giant component and smaller components with probability tending to one is to take p = p(n) = λ/n for some fixed λ > 1 (e.g., Chapter 4, van der Hofstad, 2016). We will denote this family of models by ER(λ/n), and we are interested in estimating p given a network generated from this model. Notice that λ is very close to the expected degree of each node, which is given by λ·(n − 1)/n. Consequently, ER(λ/n) produces sparse networks with the expected total degree scaling linearly in n.
Denote by η_λ the unique solution with η_λ < 1 of the equation

η_λ = exp(−λ(1 − η_λ)),  (1)

which exists if and only if λ > 1. It is known that in this regime, ER(λ/n) will produce a giant component with size tightly concentrating around (1 − η_λ)n. Writing C_max for the node set of the giant component and E(C_max) for its edge set,

p̂_max = |E(C_max)| / (|C_max|(|C_max| − 1)/2)  (2)

is the maximum likelihood estimator using data-selective inference.
Proposition 1. Fix any λ > 1 and consider the models ER(λ/n). Let G_max = (C_max, E(C_max)) be the induced subgraph of ER(λ/n) consisting only of the giant component. Let p̂_max be as in (2), p = p(n) = λ/n and η_λ as in (1). Then,

p̂_max / p → (1 + η_λ)/(1 − η_λ) in probability as n → ∞.

A few remarks are in order. Firstly, the estimator p̂_max is asymptotically unbiased only when (1 + η_λ)/(1 − η_λ) = 1, which never holds for any fixed λ > 1. Thus, the incurred bias, that is, the asymptotic factor by which we are overestimating p, will not disappear as n grows large. Figure 1 shows how the asymptotic bias (1 + η_λ)/(1 − η_λ) deviates from one as a function of λ. Secondly, we can see that the bias increases when λ decreases. Indeed, the larger λ, the more nodes in the giant component, the smaller the probability η_λ and thus the smaller the bias. On the other hand, as λ → 1, it is easy to see that η_λ → 1, making the bias in Proposition 1 approach +∞. For the limit case λ = 1, C_max will have size of order n^{2/3} (Chapter 5, van der Hofstad, 2016), which in light of the proposition means that we must abandon all hope of recovering p if we only focus on the giant component.
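To make the bias factor concrete, the fixed point in (1) can be computed numerically. The following Python sketch (ours, not from the paper; the function names are illustrative) solves η_λ = exp(−λ(1 − η_λ)) by fixed-point iteration and evaluates the asymptotic overestimation factor (1 + η_λ)/(1 − η_λ) of Proposition 1:

```python
import math

def eta(lam, iters=200):
    """Fixed-point iteration for eta = exp(-lam * (1 - eta)), lam > 1.

    The map is a contraction at the subcritical solution, so iterating
    from 0 converges to the unique fixed point below one."""
    e = 0.0
    for _ in range(iters):
        e = math.exp(-lam * (1.0 - e))
    return e

def bias_factor(lam):
    """Asymptotic factor by which p_hat_max overestimates p (Proposition 1)."""
    e = eta(lam)
    return (1.0 + e) / (1.0 - e)

for lam in (1.5, 2.0, 4.0, 8.0):
    print(lam, eta(lam), bias_factor(lam))
```

For instance, λ = 2 gives η_λ ≈ 0.203 and a bias factor of roughly 1.51; the factor shrinks towards one as λ grows, matching the qualitative picture in Figure 1.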
While Proposition 1 follows readily from results in the random network literature, to the best of our knowledge, it has not been stated in this form before. In particular, the results we draw upon for its proof are mostly rooted in probability theory and, as far as we know, have not been used before to explicitly quantify the biases incurred by statistical procedures.
We have chosen to focus on the regime p ∼ λ/n with a fixed λ > 1, since this regime seems most appropriate for modelling many networks observed in practice; see, for example, the discussion in Section 2.3. Nonetheless, it can be instructive to consider other regions of the parameter space for the Erdős–Rényi model, especially the scenarios in which pn = λ = λ_n → ∞. When λ > (1 + ϵ)log(n), for an arbitrary ϵ > 0, the Erdős–Rényi random graph will be connected with high probability (Theorem 5.8, van der Hofstad, 2016). As long as the only requirement for the intended model is connectivity of the observed network, data-selective inference will not be an issue in this regime. On the other hand, if 1 ≪ λ < (1 − ϵ)log(n) for any ϵ > 0, not only will the network be disconnected with high probability, it will also contain isolated nodes, making this regime susceptible to data-selective inference (Theorem 5.8, van der Hofstad, 2016). More precisely, for λ = α log(n), α > 0, the expected number of isolated vertices is roughly n^{1−α} (Proposition 5.9, van der Hofstad, 2016). An interesting behaviour can be observed in the critical regime between these two cases, when λ = log(n) + β for some fixed β ∈ ℝ. In this case, the number of isolated vertices converges in distribution to a Poisson random variable with mean e^{−β}, meaning that asymptotically we expect to observe e^{−β} isolated vertices, and we will observe no isolated vertices with probability e^{−e^{−β}} + o(1) (equation (5.3.29), van der Hofstad, 2016). To give a concrete example, suppose a scaling of λ = log(n), thus β = 0.
Then, there would only be a chance of roughly e^{−1} ≈ 0.37 that the observed network is connected, and the number of isolated nodes that would be discarded by data-selective inference would be of constant order.
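The Poisson limit in this critical regime is easy to probe by simulation. The sketch below is ours, not the paper's; n is kept small so it runs quickly, at the cost of some finite-size error. It samples ER(λ/n) with λ = log(n) + β pair by pair and counts isolated vertices:

```python
import math
import random

def n_isolated(n, p, rng):
    """Sample ER(n, p) one vertex pair at a time; count vertices of degree zero."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return sum(d == 0 for d in degree)

rng = random.Random(1)
n, beta = 300, 0.0                       # lambda = log(n) + beta, here beta = 0
p = (math.log(n) + beta) / n
counts = [n_isolated(n, p, rng) for _ in range(100)]
mean = sum(counts) / len(counts)
print(mean)  # the limit theory suggests a value near exp(-beta) = 1
```

The empirical mean of the isolated-vertex count should hover around e^{−β} = 1, and the fraction of runs with no isolated vertex around e^{−1} ≈ 0.37, in line with the discussion above.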

| The stochastic block model
This model postulates that nodes in a network can be grouped into communities, where the probability of any pair of nodes making connections depends only on their community memberships. We here focus on what is called the symmetric stochastic block model with two communities, for which the probability matrix of making connections is

P = (1/n) [ a  b ; b  a ],

where a, b > 0 are constants. For this model, a pair of nodes link with probability a/n within the same community and with probability b/n between communities. The scaling 1/n ensures that a resulting network from this model will have a giant component containing a fixed, smaller-than-one proportion of the nodes with high probability. See Figure 2 for the proportion of the nodes in the giant components produced under this parametrization. The symmetric stochastic block model is widely studied and relatively well understood (Abbe, 2018). Because of the sparsity of any resulting network, many clustering methods for community detection, including spectral methods based on the adjacency matrix or the graph Laplacian as well as their semidefinite relaxations, do not work well under this parametrization. Indeed, Zhang and Zhou (2016) showed that under our scaling, no consistent algorithm exists, that is, no method achieves a vanishingly small misclassification rate. In view of this, we take an oracle approach by assuming that the community membership of each node is known a priori and focus on what happens if we estimate P when nodes not in the giant component are removed, as in data-selective inference.
In our simulation, we fix the number of nodes at n = 10,000 and set the size of each community to n/2. We consider a fine grid of values (a, b) by taking their values from 0.05 to 8.05 in steps of 0.05, resulting in 25,921 distinct combinations of (a, b). For each such pair (a, b), we sample a network from the symmetric stochastic block model and calculate the maximum likelihood estimates of a and b using only the nodes in the giant component. We repeat this process M = 1,000 times for every pair (a, b).
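As a rough illustration of what one replication of this experiment involves, the following Python sketch (ours, not the paper's code; n is reduced from 10,000 to keep the run fast, and all names are illustrative) generates one symmetric two-block network, extracts the giant component with a union-find structure, and computes the oracle maximum likelihood estimates of a and b from the retained nodes:

```python
import random
from collections import Counter

def sbm_giant_mle(n, a, b, rng):
    """One draw of the symmetric two-block SBM; oracle MLE on the giant component."""
    half = n // 2
    block = [0] * half + [1] * (n - half)   # known community labels (oracle)
    parent = list(range(n))

    def find(x):                            # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < (a / n if block[i] == block[j] else b / n):
                edges.append((i, j))
                parent[find(i)] = find(j)

    giant_root, giant_size = Counter(find(v) for v in range(n)).most_common(1)[0]
    in_giant = [find(v) == giant_root for v in range(n)]

    # oracle MLE: observed edges over retained within-/between-community pairs
    n0 = sum(1 for v in range(n) if in_giant[v] and block[v] == 0)
    n1 = giant_size - n0
    within_pairs = n0 * (n0 - 1) // 2 + n1 * (n1 - 1) // 2
    between_pairs = n0 * n1
    e_within = sum(1 for i, j in edges if in_giant[i] and block[i] == block[j])
    e_between = sum(1 for i, j in edges if in_giant[i] and block[i] != block[j])
    a_hat = n * e_within / within_pairs
    b_hat = n * e_between / between_pairs
    return a_hat, b_hat, giant_size / n

rng = random.Random(0)
a_hat, b_hat, frac = sbm_giant_mle(2000, 3.0, 1.0, rng)
print(a_hat, b_hat, frac)  # both estimates tend to overshoot a = 3 and b = 1
```

With a + b = 4, the giant component retains roughly 80% of the nodes and both estimates overshoot their targets by a factor close to the ρ ≈ 1.5 reported below; averaging many such replications over a grid of pairs (a, b) reproduces the experiment behind Figures 2 and 3.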
Denote the resulting estimators by â, b̂ and P̂. We measure the accuracy of the estimation by calculating the ratio of the spectral norm of the estimated probability matrix to that of the true probability matrix, defined as

ρ = ‖P̂‖₂ / ‖P‖₂,

where ‖P‖₂ = (a + b)/n for the 2 × 2 matrix P. Note that we have purposefully chosen large n and M so that the resulting averages of the estimates will be close to their true limits. Figure 2 shows the average proportion of nodes in the giant component for each pair (a, b) and Figure 3 the average value of ρ. When the proportion is one, the giant component contains all the nodes. The closer the average value of ρ is to one, the less biased the estimates are. In addition to the figures, Table 1 provides the average values of the size of the giant component and of ρ, together with their standard deviations, for selected values of ‖P‖₂.

F I G U R E 1 The asymptotic bias (1 + η_λ)/(1 − η_λ) as a function of λ. For better visibility, we only display values of λ ranging from 1.3 to 7, since the bias diverges to +∞ when λ approaches 1
The simulations show that the incurred bias and the size of the giant component behave similarly when a + b is held constant. We now examine the bias of the parameter estimates more closely. When a + b = 2.5, the giant component on average contains 37% of all nodes and we overestimate ‖P‖₂ by a factor of 4.4. This may not exactly come as a surprise: if we discard a large proportion of nodes, it can be expected that estimates are inaccurate. A more interesting and more critical behaviour is observed as we make our way from the bottom left to the top right corner of the plots in Figures 2 and 3.
For a + b = 4 (bottom black line in Figure 3), the giant component on average contains around 80% of all nodes, with an average bias ρ = 1.5.
For a + b = 6 (second black line in Figure 3), the giant component contains on average 94% of all nodes, while the average estimated ρ is 1.13, which is still significantly larger than one. Even for a + b = 8 (middle black line in Figure 3), when the giant component contains on average 98% of all nodes, we overestimate ‖P‖₂ by a factor of 1.04. Only once a + b ≥ 10.70, where the giant component contains 99.5% of all nodes, is the incurred bias smaller than 1%.
The above results illustrate that even in the idealistic scenario where the statistician has perfect knowledge of the underlying communities, and even if only a seemingly insignificant fraction of, say, 0.5% of the nodes is removed, parameter estimation based solely on the giant component will be biased regardless of the network size. This casts severe doubt on the suitability of the stochastic block model and its variants for fitting many popular datasets if only the nodes in giant components are retained. As we have argued, having biased estimators will have consequences for all aspects of any downstream statistical inference, including consistency, model selection, hypothesis testing and so on.

| A quick illustration of the fallacy
We now illustrate the fallacy of data-selective inference with the stochastic block model when it is applied to detecting communities in the political blog data network, to highlight a wider problem in statistical modelling of networks. If one assumes that this model generates all the 1,494 nodes, the mere existence of more than 200 nodes not in the giant component suggests imposing a scaling of 1/n on the connectivity probability matrix of the data generating process. This is similar to what was done previously to ensure that the resulting giant component contains a positive fraction of the vertices. Under this regime, however, there is no way to separate all the vertices and thus no algorithm can provide consistent community detection or parameter estimation (Abbe, 2018). If instead one focuses on the giant component and assumes consistent community estimation for the nodes in the giant component, the connectivity probability matrix will be overestimated with bias, as we have illustrated. By focusing on a nonrandom sample, we have no idea whether an intended model truly reflects the data generating process or rather is merely an artefact of biased sampling. The estimation bias of course is just the tip of a larger problem with biased sampling, as this bias will propagate through all stages of the modelling process, including model selection and inference.

F I G U R E 2 Mean proportion of nodes in the giant component for each pair (a, b), averaged over M = 1,000 repetitions. On the border a + b = 2.5, the proportion of nodes in the giant component is 37%. For better visibility, we have truncated by only including points for which a + b ≥ 2.5. Also note the use of an exponential colour scaling for better visibility
Thus, for this dataset and many other widely used network datasets, our arguments lead to a fundamental choice that we statisticians need to make. If we assume that the stochastic block model generates the whole network, including all the nodes, consistent community detection and parameter estimation will be impossible. On the other hand, if we make the unrealistic assumption that the model only generates a subnetwork consisting only of those nodes in the giant component, we face the problem of data-selective inference.
F I G U R E 3 Mean spectral ratio ρ when estimates are based on the giant component only, averaged over M = 1,000 repetitions. The red line corresponds to a + b = 2 and an average bias factor of 60.57. The black lines correspond to a + b = 4, 6, 8, 10, 12, with bias factors ρ = 1.51, 1.13, 1.04, 1.014, 1.005, respectively. For better visibility, we have truncated by only including points for which a + b ≥ 2.5. Also note the use of an exponential colour scaling for better visibility

Note to Table 1: The respective standard deviations, rounded to four significant digits, are given in parentheses.
We now prove Proposition 1. Proposition 1 follows from the following deep result on the phase transition of ER(λ/n), which can be found in van der Hofstad (2021).
Theorem 1 (Phase transition in Erdős–Rényi random graphs; Theorem 2.33 in van der Hofstad (2021), abbreviated). Fix λ > 0 and let C_max be the largest connected component of the Erdős–Rényi graph ER(λ/n). Then,

|C_max| / n → ζ_λ in probability,  (A2)

where ζ_λ is the survival probability of a Poisson branching process with mean offspring λ. In particular, ζ_λ > 0 precisely when λ > 1.
Further, for λ > 0, with η_λ = 1 − ζ_λ,

|E(C_max)| / n → (λ/2)(1 − η_λ²) in probability.  (A3)

Proof of Proposition 1. Proposition 1 follows from Theorem 1 by repeated application of Slutsky's theorem. First, by (A2) and Slutsky,

|C_max|(|C_max| − 1) / n² → (1 − η_λ)² in probability.

Thus, by Slutsky,

n² / (|C_max|(|C_max| − 1)) → 1/(1 − η_λ)² in probability.

Now, a final application of Slutsky's theorem together with (A2) and (A3) yields

p̂_max / p = (2/λ) · (|E(C_max)|/n) · (n²/(|C_max|(|C_max| − 1))) → (2/λ) · (λ/2)(1 − η_λ²) · 1/(1 − η_λ)² = (1 + η_λ)/(1 − η_λ) in probability.

T A B L E 1 Average size of the giant component and biases ρ for different values of ‖P‖₂
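The limit in Proposition 1 can also be checked by direct simulation. The sketch below is ours and purely illustrative; n = 2,000 is modest, so some finite-size deviation remains. It draws one ER(λ/n) network, computes p̂_max from the giant component as in (2), and compares the ratio p̂_max/p with the theoretical limit (1 + η_λ)/(1 − η_λ):

```python
import math
import random
from collections import Counter

def er_giant_ratio(n, lam, rng):
    """Draw ER(lam/n); return p_hat_max / p based on the giant component."""
    p = lam / n
    parent = list(range(n))

    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                edges.append((i, j))
                parent[find(i)] = find(j)

    giant_root, c_max = Counter(find(v) for v in range(n)).most_common(1)[0]
    e_max = sum(1 for i, _ in edges if find(i) == giant_root)
    p_hat_max = e_max / (c_max * (c_max - 1) / 2)   # estimator (2)
    return p_hat_max / p

# eta_lambda from equation (1) by fixed-point iteration, then one simulated ratio
lam, eta = 2.0, 0.0
for _ in range(200):
    eta = math.exp(-lam * (1.0 - eta))
ratio = er_giant_ratio(2000, lam, random.Random(0))
print(ratio, (1 + eta) / (1 - eta))  # the two numbers should be close
```

For λ = 2, the theoretical factor is about 1.51, and a single network of this size already lands near it.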