Vertex Nomination Via Seeded Graph Matching

Consider two networks on overlapping, non-identical vertex sets. Given vertices of interest in the first network, we seek to identify the corresponding vertices, if any exist, in the second network. While in moderately sized networks graph matching methods can be applied directly to recover the missing correspondences, herein we present a principled methodology appropriate for situations in which the networks are too large for brute-force graph matching. Our methodology identifies vertices in a local neighborhood of the vertices of interest in the first network that have verifiable corresponding vertices in the second network. Leveraging these known correspondences, referred to as seeds, we match the induced subgraphs in each network generated by the neighborhoods of these verified seeds, and rank the vertices of the second network in terms of the most likely matches to the original vertices of interest. We demonstrate the applicability of our methodology through simulations and real data examples.


INTRODUCTION AND BACKGROUND
In this paper, we address the problem of nominating vertices across a pair of networks: given vertices of interest (VOIs) in a network G = (V, E), our task is to identify the corresponding vertices of interest, if they exist, in a second network G′ = (V′, E′). Our methods will leverage vertices in the neighborhood of the VOIs in G that have verifiable matches in G′ to (ideally) create local neighborhoods of the VOIs in both G and G′. These neighborhoods are then soft-matched (see Algorithm 1, adapted here from [11]) across networks, yielding a nomination list for each VOI in G; i.e., a ranking of the vertices in the local neighborhood of the seeds in G′, ideally with the corresponding VOIs in G′ concentrating at the top of the list. While global methods can be (and have been) applied to identify the VOIs in G′ directly, the performance of these methods can suffer from the noise induced by vertices without correspondences across networks [28]. Localization is a prominent tool used across various fields such as machine learning (see, for example, [41,53] on locality-based anomaly detection in time series of graphs and [16] on localized multiple kernel learning), pattern recognition (including clustering and classification algorithms that have used localization for many years, such as nearest-neighbor-based classification rules; see, for example, [47,15,23,37,8]), and object recognition (see, for example, [46] on using convolutional networks for localization and object boundary detection and [4] on local algorithms for geometric object recognition). Inspired by the many successes localization has seen in other fields of research, we bring the concept of localization to the forefront of network alignment. Our methods are inherently local, leveraging recent advancements in both graph matching [11,29] and vertex nomination [7,12] to nominate across essentially arbitrarily large networks.
Formally, suppose we are given two networks G = (V, E) and G′ = (V′, E′) on overlapping but not necessarily identical vertex sets V and V′, respectively. For simplicity, we will presently restrict our attention to the case of a single VOI in G (as the case of multiple VOIs is an immediate extension of our methodology for a single VOI), and we write V = {x} ∪ S ∪ W ∪ J and V′ = {x′} ∪ S′ ∪ W′ ∪ J′, where x and x′ represent the VOI in G and G′, respectively; S and S′ represent the seeded vertices across networks (those vertices that appear in both vertex sets whose correspondence across networks, i.e., the seeding S ↔ S′, is known a priori) and necessarily satisfy s ∶= |S| = |S′|; W and W′ are the shared non-seed vertices (those vertices that appear in both vertex sets whose correspondence across networks is unknown a priori) with |W| = |W′| = m; and J and J′ are the unshared vertices (those vertices that appear in only one or the other vertex set, without correspondences across networks) with |J| = j and |J′| = j′. Thus, we can write n ∶= |V| = 1 + s + m + j and n′ ∶= |V′| = 1 + s + m + j′.
While the correspondence between the shared non-seed vertices W and W′ is unknown a priori, we will further assume that it is unknown which non-seed vertices of G other than the VOI x are in W (shared) versus J (unshared), as are the sizes of the shared non-seed and unshared vertex sets. Our inference task is to identify x′ ∈ V′ ⧵ S′ (i.e., the corresponding VOI in G′, which is not among the seeds S′) using only the knowledge of the graph structures and the seed correspondence S ↔ S′. For the purposes of this paper, we will assume that the corresponding vertex x′ does exist in G′, else our task is impossible. Our goal will be to nominate vertices in G′ in a principled manner so that the true match is high in the nomination list, thus saving the end-user time in searching for this true match. While this core-junk network framework has appeared often in the literature (see, for example, [22]), herein we will consider a more general random graph model that allows for heterogeneity in vertex degree and behavior (see Section 3).
Our approach to this inference task lies on the boundary between Graph Matching and Vertex Nomination. Stated simply, the formulation of the graph matching problem (GMP) considered herein seeks to align the vertices in two networks so as to minimize the number of induced edge disagreements between the aligned networks. Graph matching has been extensively studied in the literature (for an excellent survey of the literature, see [5,13]) with applications across various fields including pattern recognition (see, for example, [2,54,60]), computer vision (see, for example, [59,26,55]), and biology (see, for example, [57,36,24]), among others. The seeded graph matching (SGM) algorithm on which we base our primary algorithm has run-time O(n³) at worst, where n is the number of vertices to be matched, which has been shown to be reasonable in comparison to other state-of-the-art algorithms (such as the PATH algorithm of [58]; see [52] and [11] for more information on the computational complexity of this algorithm). Furthermore, the authors of [30,29] show that it has theoretical guarantees for converging to the correct solution under reasonable model assumptions.
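To make the graph matching objective concrete, the number of induced edge disagreements under a candidate alignment can be counted directly. The following is a minimal sketch, assuming simple undirected graphs stored as symmetric 0/1 NumPy adjacency matrices; the function name `edge_disagreements` and the toy graphs are illustrative only and are not taken from [11].

```python
import numpy as np

def edge_disagreements(A, B, perm):
    """Count edge disagreements when vertex i of A is aligned to vertex perm[i] of B.

    A and B are symmetric 0/1 adjacency matrices of the same size (an assumption
    of this sketch); perm is a permutation of range(n).
    """
    A = np.asarray(A)
    B = np.asarray(B)
    perm = np.asarray(perm)
    # Relabel B so that row/column i of B_aligned is the vertex matched to i.
    B_aligned = B[np.ix_(perm, perm)]
    # Each undirected disagreement shows up twice in a symmetric matrix.
    return int(np.sum(A != B_aligned) // 2)

# Toy example: the path 0-1-2 and the same path relabelled as 0-2-1.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
B = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 1, 0]])

identity_cost = edge_disagreements(A, B, [0, 1, 2])  # naive alignment
aligned_cost = edge_disagreements(A, B, [0, 2, 1])   # centre matched to centre
```

Graph matching searches over all such alignments for one with the fewest disagreements; the search space grows factorially in the number of vertices, which is why relaxations such as the one underlying SGM are used in practice.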
The classical formulation of the vertex nomination (VN) inference task [34,6,48,50,12,31] can be stated as follows: given a network with latent community structure in which one of the communities is of particular interest and given a few vertices from the community of interest, the task in vertex nomination is to order the remaining vertices in the network into a nomination list, with the aim of having vertices from the community of interest concentrate at the top of the list. Thus, vertex nomination can also be thought of as a method for inferring missing vertex labels, and is related to the class/labeled instances acquisition task and collective classification methods of [3,51,45]. The goal of vertex nomination is similar in spirit to popular network-based information retrieval (IR) procedures such as PageRank [39] and personalized recommender systems on graphs [19]. However, this formulation of VN is distinguished from other supervised network mining tasks in both the generality of what defines a vertex of interest [43,32] and the (often) limited nature of the available training data (i.e., known vertices of interest) in the network. Our present task can be viewed as vertex nomination across networks: for a vertex of interest in the first network G, we use graph matching methodologies to order the vertices in the second network G′ into a nomination list, with the aim of having the corresponding vertex of interest in G′ near the top of the list.

Our contributions:
In summary, our contributions are as follows:
• Leveraging principled sub-sampling of a graph, we reduce the time complexity of matching two graphs via localization.
• We connect the task of vertex nomination to across-graph nomination tasks.
• We extend the SoftSGM algorithm of [11] to the task of vertex nomination.
• We conduct an out-of-the-box, large-scale evaluation of our VNmatch algorithm on two real-world graph data sets.
The remainder of the paper is laid out as follows: In Section 2, we give an overview of some related work, after which, in Section 3, we introduce a formal definition of what we mean by "corresponding vertices." Following, in Section 4, we introduce our across graph VN scheme, VNmatch, along with a brief mathematical description of the utilized subroutines including the soft seeded graph matching algorithm (SoftSGM, Algorithm 1), introduced in [11]. In Sections 5.1 and 5.2, we explore applications of VNmatch on both synthetic and real data, including a pair of high school friendship networks and a pair of online social networks. We conclude with an overview of our findings and a discussion of potential extensions in Section 6.
We employ the SoftSGM Algorithm of [11] as a means by which to nominate vertices in the VNmatch algorithm so as to introduce this algorithm as a useful tool in the across graph vertex nomination task. However, other vertex nomination schemes exist which could also be adapted and utilized in the VNmatch algorithm (in particular Steps 3 and 4). For example, the use of spectral methods, which tend to work well for matching larger graphs, may be desirable when the original graphs are on the order of millions of vertices and localization trims the networks down to only thousands of vertices. For details regarding adjacency or Laplacian spectral embedding, see [42].

RELATED WORK
A number of inexact graph matching algorithms have recently been developed or extended to match graphs with overlapping, nonidentical vertex sets. Two such families are percolation-based algorithms (see, for example, [22,9,21]) and Bayesian algorithms (see, for example, [40]). In [22] and [21], the authors work under the independent-edge-sampling model, in which a graph G is generated from an Erdős-Rényi model and two subgraphs of G, namely G_1 and G_2, are generated so that each node of G belongs to G_i with a fixed probability, independently for i = 1, 2, and similarly for edges (with a second fixed probability). Under the independent-edge-sampling model, it is shown in [22] that for sufficiently large graphs the true partial matching is recoverable under particular model assumptions and for some formulation of their objective; however, as the authors admit, the optimization formulation proposed is not scalable, and there is no mention of how the correct formulation of the objective is to be obtained.
Using the same independent-edge-sampling model, the authors of [21] introduce an iterative percolation-based graph matching method for seed-based graph matching, demonstrating that their method (under this model and particular assumptions) matches nearly all overlapping nodes correctly. In [9], the authors introduce a degree-driven percolation-based graph matching algorithm which iteratively matches nodes in order from higher degree to lower degree. For scale-free networks, the authors show that, under particular model assumptions, their method, which does not aim to match all nodes but rather subsets of nodes from the two graphs, correctly matches nearly all vertices which have a match and does not match any nodes incorrectly. While this works well for scale-free networks, the advantages of this method would be more limited on graphs with more block structure and without the higher-degree nodes which help seed the rest of the algorithm. The authors point out that when seeds are chosen uniformly at random, on the order of n^(1/2+ε) seeds are needed to match most vertices correctly, but allowing for more intelligent seed selection based on vertex degree, as few as n^ε seeds may be sufficient, for some arbitrarily small ε > 0.
Each of the above approaches is theoretically based in relatively simple random graph models (ER for [22] and the Chung-Lu model in [9]), while also demonstrating good performance in more complex real data settings. Our present approach is naturally situated in the more general Random Dot Product Graph setting of [56]. While still not able to capture all the intricacies of real network data, the random dot product graph is quite flexible and encompasses numerous other common random graph models (ER, Chung-Lu, positive definite stochastic blockmodel, etc.). In addition, we also demonstrate the effectiveness of our method on more complex real data networks as well.
Percolation based algorithms could certainly be used for vertex nomination in a similar way that we present vertex nomination based on the seeded graph matching algorithm of [11] (which is based on a fast approximate quadratic programming algorithm of [52]). One of the advantages of the present optimization based approach is the ability to efficiently explore the space of locally optimal solutions near the global optimum. Practically, the graphs to be matched in real data are much more messy than theory would allow, and the variations that can be obtained from these local optima provide a degree of robustness to model misspecification. Furthermore, the SGM algorithm itself runs quickly on modestly sized networks and has asymptotic guarantees for particular models and conditions (see for example [52,11] and [30,29]).
In [20] the authors focus on the task of de-anonymizability, and explore a method for matching nodes based on node-degree; that is, the authors consider two graphs drawn in some manner from a larger graph and attempt to de-anonymize (match) the vertices in the two graphs which have the highest degrees. We are not concerned with matching vertices based on their degree, since a vertex of interest is based on an external characteristic that is not necessarily related to the degree distribution of the two graphs.
Another technique for approximate graph matching relies on Bayesian methods [40]. The authors of [40] introduce a method which relies on estimating the posterior probability that two nodes should be matched based on a particular prior. In the aforementioned paper, the authors rely on node attributes, such as vertex degree, mapping a few nodes at a time in an iterative manner until all nodes are matched; any nodes matched in one iteration will be used as seeds (referred to as anchors) in the next iteration. In the end, the authors seek to obtain a hard matching of the nodes that maximizes the sum of the log-posteriors for all node pairs. While the idea of a posterior probability that two nodes should be matched is a similar idea to what we present, the purpose of our more frequentist method is to utilize a soft matching of the nodes in order to rank them in order from most to least likely matches to the vertex (or vertices) of interest.

CORRESPONDING VERTICES
Consider two social networks in which vertices represent users/accounts and edges represent whether or not two accounts are linked in some way. An individual may have an account on one network or the other or both. We would say that two accounts across the platforms correspond to each other if the same individual runs both accounts; that is, both nodes correspond to the same individual although with possibly different node labels. Arguably an individual who has an account on both networks will have similar, though not identical, behavior across the two networks. Consider an email network in which nodes are email addresses and two email addresses share an edge (directed or not) if they send correspondence to one another, and a phone network in which nodes are phone numbers and edges represent whether or not one of the numbers calls the other. In this example, a vertex in the email network will correspond to a vertex in the call network if the email and phone number belong to the same individual. An individual who uses both email and phone correspondence may communicate with individuals in the two networks in a similar, though not identical, manner. Thus, if there is a connection between two individuals in one network and those same individuals exist in the second network, one would think that it is more likely that there exists a connection between these individuals in the second network, i.e. there is a positive correlation between the edges between these vertices across the networks. To model this correspondence, we proceed as follows.
With notation as above, let C = {x} ∪ S ∪ W and C′ = {x′} ∪ S′ ∪ W′ denote the sets of shared vertices between G and G′ (the VOI, the seeds, and the shared non-seed vertices), and let G_c (resp., G′_c) denote the induced subgraph of G (resp., G′) on the vertex set C (resp., C′). As G_c and G′_c are graphs on the same (though potentially differently labeled) vertex sets, we model a shared structure present across G_c and G′_c as (G_c, G′_c) ∼ ρ-RDPG(X). Before defining the ρ-RDPG model, we first recall the definition of a random dot product graph (RDPG); see [56].
Definition 1. We say that a graph G with adjacency matrix A is distributed as a random dot product graph with parameter X = [X_1, …, X_n]ᵀ ∈ ℝ^(n×d) (abbreviated G ∼ RDPG(X)) if, given X,

$$\mathbb{P}(A \mid X) = \prod_{i<j} \left(X_i^\top X_j\right)^{A_{ij}} \left(1 - X_i^\top X_j\right)^{1 - A_{ij}}.$$

Conditioning on X, this is an independent-edge random graph model where vertex i is associated with a latent position vector X_i ∈ ℝ^d, and the probability of an edge between any two vertices is determined by the dot product of their associated latent position vectors.
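As an illustration of the RDPG model, the sketch below samples one undirected, loop-free graph from a matrix of latent positions. It is a minimal sketch assuming NumPy; the name `sample_rdpg` and the example latent positions are hypothetical, and the code assumes that all pairwise dot products of the latent positions lie in [0, 1].

```python
import numpy as np

def sample_rdpg(X, rng):
    """Sample a symmetric, hollow 0/1 adjacency matrix with P(edge ij) = <X_i, X_j>."""
    X = np.asarray(X, dtype=float)
    P = X @ X.T  # edge probability matrix
    if P.min() < 0.0 or P.max() > 1.0:
        raise ValueError("all dot products of latent positions must lie in [0, 1]")
    n = P.shape[0]
    # One independent Bernoulli draw per unordered pair i < j.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(0)
# Hypothetical 2-dimensional latent positions for six vertices.
X = np.array([[0.8, 0.1], [0.7, 0.2], [0.6, 0.3],
              [0.2, 0.7], [0.1, 0.8], [0.3, 0.6]])
A = sample_rdpg(X, rng)
```

Vertices with similar latent positions (here, the first three and the last three) tend to have similar connection patterns, which is the heterogeneity in vertex behavior that the model is meant to capture.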
To imbue multiple random dot product graphs with a notion of vertex correspondence, we correlate the behavior of nodes across networks. We call this new model the ρ-RDPG model, which is defined as follows.
Definition 2. The bivariate graph-valued random variables (G, G′), with respective adjacency matrices A and A′, are said to be distributed as a pair of ρ-correlated random dot product graphs with parameter X (abbreviated (G, G′) ∼ ρ-RDPG(X)) if
1. marginally, G ∼ RDPG(X) and G′ ∼ RDPG(X); and
2. the edge indicators {A_ij, A′_ij} are collectively independent, except that corr(A_ij, A′_ij) = ρ for each pair of vertices {i, j}.

Our framework posits (G_c, G′_c) ∼ ρ-RDPG(X) for the subgraphs G_c and G′_c induced by the shared vertices, with a latent position matrix X ∈ ℝ^((1+s+m)×d) (one row for the VOI, s rows for the seeds, and m rows for the shared non-seed vertices). In order to generate the full graphs G and G′, which also have unshared vertices, we generate G ∼ RDPG([X; Y]) and G′ ∼ RDPG([X; Y′]), where [X; Y] denotes the latent positions X and Y stacked row-wise, so that the induced subgraphs satisfy (G_c, G′_c) ∼ ρ-RDPG(X) and the remaining edges of G and G′ are formed independently as in the case of the general RDPG. Thus, the first 1 + s + m vertices in the two graphs correspond to one another via the identity map, and the remaining j and j′ vertices of G and G′, respectively, represent the unshared vertices J and J′. Here, Y ∈ ℝ^(j×d) and Y′ ∈ ℝ^(j′×d) represent the respective latent positions for the unshared vertices in G and G′. The model is thus realized as two graphs: G on n = 1 + s + m + j vertices {x} ∪ S ∪ W ∪ J and G′ on n′ = 1 + s + m + j′ vertices {x′} ∪ S′ ∪ W′ ∪ J′.

If G and G′ exhibit latent community structure, it can be fruitful to model them as stochastic block model (SBM) random graphs [18]. SBMs have been extensively studied in the literature and have been shown to provide a useful and theoretically tractable model for more complex graphs with underlying community structure [44,49,1,38]. We define the stochastic block model as follows.

Definition 3. We say that a graph G = (V, E) with adjacency matrix A is distributed as a stochastic block model random graph with parameters K, b, and Λ (abbreviated G ∼ SBM(K, b, Λ)) if
1. V is partitioned into K blocks, V = V_1 ∪ V_2 ∪ ⋯ ∪ V_K;
2. b ∶ V → {1, …, K} is a map such that b(i) denotes the block label of the ith vertex;
3. Λ ∈ [0, 1]^(K×K) is a symmetric matrix of block connection probabilities, and each pair of vertices {i, j} is joined by an edge independently with probability Λ_(b(i), b(j)).

Recall that the edge probability matrix for a random dot product graph model with parameter X ∈ ℝ^(n×d) is equal to XXᵀ, which is positive semidefinite. If X consists of precisely K distinct rows, then a graph generated via RDPG(X) can also be said to be generated from a stochastic block model having K blocks, block assignment vector b assigning vertices with the same latent position to the same block, and probability matrix Λ = X^(K) (X^(K))ᵀ, where X^(K) here refers to the K × d matrix of distinct rows of X. Moreover, if Λ is positive semidefinite, then G ∼ SBM(K, b, Λ) can be realized as an RDPG with appropriately defined X. Thus, there is an overlap in the set of random dot product graph models and stochastic block models.
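One standard way to realize edge-wise correlation between two graphs that share an edge probability matrix is to draw the second graph's edges conditionally on the first's. The sketch below is a minimal illustration, assuming NumPy, a nonnegative correlation, and a hypothetical three-block probability matrix; the name `sample_correlated_pair` and all numeric values are assumptions made for this example and are not the simulation settings of this paper.

```python
import numpy as np

def sample_correlated_pair(P, rho, rng):
    """Sample two graphs whose edges are each Bernoulli(P_ij) with correlation rho.

    For each pair i < j: draw a ~ Bernoulli(p); then draw a' ~ Bernoulli(p + rho*(1-p))
    if a = 1 and a' ~ Bernoulli(p*(1-rho)) if a = 0. This keeps the marginal of a'
    equal to p and gives corr(a, a') = rho for 0 <= rho <= 1.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    iu = np.triu_indices(n, k=1)
    p = P[iu]
    a = rng.random(p.size) < p
    p_cond = np.where(a, p + rho * (1.0 - p), p * (1.0 - rho))
    a_prime = rng.random(p.size) < p_cond

    def to_adjacency(bits):
        M = np.zeros((n, n), dtype=int)
        M[iu] = bits
        return M + M.T  # symmetric with zero diagonal

    return to_adjacency(a), to_adjacency(a_prime)

rng = np.random.default_rng(0)
b = np.repeat([0, 1, 2], 4)               # block labels for 12 vertices (hypothetical)
Lam = np.array([[0.70, 0.30, 0.20],
                [0.30, 0.60, 0.25],
                [0.20, 0.25, 0.50]])      # hypothetical block probability matrix
P = Lam[np.ix_(b, b)]                     # vertex-level edge probabilities
A, A_prime = sample_correlated_pair(P, rho=0.6, rng=rng)
```

With ρ = 1 the two graphs coincide, and with ρ = 0 they are independent draws from the same model, which matches the role the correlation plays in the simulations of Section 5.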
We can then define the ρ-SBM model as follows.

VERTEX NOMINATION VIA SEEDED GRAPH MATCHING
With this notion of corresponding vertices, we next introduce our proposed algorithm for finding the vertex x′ ∈ V′ corresponding to a particular vertex of interest x ∈ V. Again, we assume a single vertex of interest for simplicity, as the extension to multiple vertices of interest follows immediately. Before presenting our main algorithm, VNmatch (Algorithm 2), we first provide the necessary details for the subroutine of Algorithm 2 we employ, namely the SoftSGM algorithm of [11]. The easy interpretability and simple extension of the SoftSGM algorithm to generating nomination lists for vertices of interest make it a natural candidate for the vertex nomination subroutine task of Algorithm 2; however, other methods of graph matching, such as spectral-based methods, for which extension to vertex nomination is possible could also be used during this step of the algorithm.
The SGM algorithm of [11] approximately solves an indefinite relaxation of the seeded graph matching problem (SGMP) using the Frank-Wolfe algorithm [14], and then projects the obtained doubly stochastic solution onto the set of permutation matrices on the n − s non-seed vertices. The algorithm performs excellently in practice in both synthetic and real data settings, with an O((n − s)³) runtime allowing for its efficient implementation on modestly sized networks. Since we ultimately aim to create a nomination list (and not necessarily a 1-to-1 correspondence) for the VOI of likely matches among the non-seed vertices of the second network, we use the SoftSGM algorithm of [11], a stochastic averaging of the original SGM procedure over multiple random restarts, in order to softly match the graphs. Rather than the 1-to-1 correspondence output from SGM, SoftSGM (pseudocode provided in Algorithm 1 for completeness) outputs a soft-matching score function on pairs of vertices, one from each network, taking values in [0, 1]; its value at (v, v′) represents the likelihood that vertex v′ in G′ matches to vertex v in G. As noted in Table 1, ⊕ denotes the direct sum of two matrices, and 0 with a size subscript denotes the square all-zeroes matrix of that size. Also, the objective function in Algorithm 1 is the one given in Equation 4.

Algorithm 2 (VNmatch), Step 4: create the nomination list for the VOI by ranking the candidate vertices in the second network by decreasing value of the soft-matching score for the VOI.
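The soft-matching idea can be sketched with off-the-shelf tools. The example below is not the SoftSGM implementation of [11]; it is a minimal stand-in that assumes SciPy's `quadratic_assignment` with `method="faq"` (a Frank-Wolfe-based approximate solver that accepts seeds through its `partial_match` option) as the hard matcher, and averages the resulting permutations over random restarts. The name `soft_sgm`, the number of restarts, and the toy graph are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def soft_sgm(A, B, seeds, restarts=20, rng=None):
    """Average seeded matchings over random restarts.

    A, B  : square adjacency matrices of equal size (pad beforehand if sizes differ).
    seeds : integer array of shape (s, 2); row (a, b) fixes vertex a of A to vertex b of B.
    Returns an n x n matrix D where D[i, j] is the fraction of restarts in which
    vertex i of A was matched to vertex j of B.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    D = np.zeros((n, n))
    for _ in range(restarts):
        res = quadratic_assignment(
            A, B, method="faq",
            options={
                "maximize": True,          # maximise edge agreements
                "partial_match": seeds,    # seeds stay fixed in every restart
                "P0": "randomized",        # random doubly stochastic starting point
                "rng": rng,
            },
        )
        # res.col_ind[i] is the vertex of B assigned to vertex i of A.
        D[np.arange(n), res.col_ind] += 1.0
    return D / restarts

# Toy usage: match a small random graph to a copy of itself using three seeds.
gen = np.random.default_rng(0)
n = 12
U = np.triu(gen.random((n, n)) < 0.4, k=1)
A = (U | U.T).astype(float)
seeds = np.array([[0, 0], [1, 1], [2, 2]])
D = soft_sgm(A, A.copy(), seeds, restarts=5, rng=1)
```

Row i of `D` plays the role of the soft-matching scores for vertex i: ranking the columns of the row belonging to the VOI by decreasing value gives a nomination list, with seeds always receiving their known match.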

VNmatch
We consider two graphs G and G′ with vertex sets V = {x} ∪ S ∪ W ∪ J and V′ = {x′} ∪ S′ ∪ W′ ∪ J′ (the VOI, the seeds, the shared non-seed vertices, and the unshared vertices, respectively), where the vertices in V ⧵ J and V′ ⧵ J′ are shared between the two graphs. As stated previously, our task is to leverage an observed one-to-one correspondence S ↔ S′ to find the vertex x′ ∈ V′ corresponding to a particular vertex of interest x ∈ V. If G and G′ are modestly sized (on the order of thousands of vertices), we could use Algorithm 1, the SoftSGM algorithm of [11], to soft match G and G′, padding G or G′ as necessary when the two graphs have different numbers of vertices. As the purpose of matching the graphs in this inference task is to identify the vertex x′ ∈ V′, we create a ranked nomination list for x, which we denote by Φ_x, by ordering the vertices in V′ by decreasing value of the soft-matching score for x (with ties broken uniformly at random).

In practice, however, the networks under consideration may be too large to directly apply SoftSGM or similar global graph matching procedures. For example, many of the partially crawled social networks found at [27] contain tens of millions of vertices or more. Therefore, rather than applying SoftSGM globally, we reduce the size of the problem through localization. In our underlying network model, the local structure around a vertex in one graph will be similar to the local structure around the corresponding vertex in the second graph. With this in mind, given h ∈ ℕ and a set Υ ⊂ V, we define the h-neighborhood of Υ in G via N_h(Υ) ∶= {v ∈ V ∶ there exists a path of length ≤ h in G from v to a vertex in Υ}.
Note that, by convention, Υ ⊂ N_h(Υ). We denote by S_x = S_(x,h) ∶= S ∩ N_h(x) the set of seeded vertices in G with shortest path distance to the VOI x less than or equal to h, and we define S′_x to be the corresponding seeds in G′, with |S_x| = s_x = |S′_x|. Notionally, as h → ∞, N_h(x) tends towards the connected component of G containing x, and we say that h = ∞ yields N_h(x) to be the entire vertex set of G.
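The h-neighborhood and the local seed set can be computed with a breadth-first search that stops after h steps. The following is a minimal sketch assuming the graph is stored as a dictionary mapping each vertex to an iterable of its neighbors; the names `h_neighborhood` and `local_seeds` and the toy path graph are illustrative.

```python
from collections import deque

def h_neighborhood(adj, sources, h):
    """Return all vertices within path length <= h of any vertex in `sources`."""
    dist = {v: 0 for v in sources}   # sources are in their own neighborhood
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == h:             # do not expand beyond h steps
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def local_seeds(adj, voi, seeds, h):
    """Seeds lying within an h-path of the vertex of interest."""
    return set(seeds) & h_neighborhood(adj, [voi], h)

# Toy path graph 0 - 1 - 2 - 3 - 4 with the VOI at vertex 0 and seeds {2, 4}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
near = h_neighborhood(adj, [0], 2)
seeds_near_voi = local_seeds(adj, 0, {2, 4}, 2)
```

Because the search never expands past distance h, its cost depends on the size of the neighborhood rather than on the size of the full network, which is the point of localizing before matching.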

FIGURE 1
Average size of S_x (the set of seeds within an h-path of the VOI x) over 50 Monte Carlo simulations as a function of (h, s), where s is the total number of seeds; seed sets and vertex of interest selected randomly. As h increases, more vertices in the graph are within an h-path of the randomly chosen VOI, and therefore more of the seed vertices are in the h-neighborhood N_h(x).
Remark 1. Figure 1 demonstrates how |S_x|, the number of seeds within an h-path of the VOI x, depends on the total number of seeds |S| and on h for graphs generated from a stochastic blockmodel (Figure 1a, model described in Definition 3) and for the Facebook network of [35], which we consider in detail in Section 5.2.1. In both cases, the seed sets and VOI are chosen uniformly at random. As expected, as h increases, |S_x| approaches |S|. It is important to keep in mind that increasing h also increases |S_x| and, consequently, the sizes of the local neighborhoods around the seeds that are matched in G and G′, increasing computational complexity. In both the simulated and Facebook examples, h = 2 seems an appropriate choice, and is the value we use for the networks in further exploration (see Section 5).

SIMULATIONS AND REAL DATA EXPERIMENTS
Note here that all necessary code and data needed to produce the figures in this section can be found at http://www.cis.jhu.edu/~parky/D3M/VNSGM/.
We will measure the performance of VNmatch via rank(x′), the expected rank of the true match x′ in the nomination list Φ_x when ties are broken uniformly at random. Since the size of the set of candidate matches (the non-seed vertices of the matched neighborhood in G′; seeds in G′ will never be matched to x by SoftSGM) varies greatly from experiment to experiment, we will compare across experiments by computing the normalized rank of x′, denoted τ(x′). Note that τ(x′) = 0 (resp., τ(x′) = 0.5 or τ(x′) = 1) implies that the VOI was first (resp., half-way down or effectively last) in the nomination list. A low value of τ(x′) corresponds to x′ appearing near the top of the nomination list output by the VNmatch algorithm, and measures how much time is saved (versus a uniformly random search) by the end-user when searching through the candidate set of vertices for the true match x′. We view a score of τ(x′) = 5/100 as better than a score of τ(x′) = 5/10 since the amount of time saved by the end-user is greater in the first case.
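The ranking and its normalization can be sketched as follows. The expected rank under uniformly random tie-breaking follows directly from the description above; the specific normalization (rank − 1)/(N − 1) for N candidates is an assumption of this sketch, chosen because it equals 0 for the first position and 1 for the last, and it may differ in detail from the formula used in the experiments. The function names and example scores are hypothetical, and NumPy is assumed.

```python
import numpy as np

def nomination_list(scores, rng):
    """Candidate indices ordered by decreasing score, ties broken uniformly at random."""
    scores = np.asarray(scores, dtype=float)
    tiebreak = rng.random(scores.size)
    # lexsort uses the last key as the primary key: sort by -scores, then by tiebreak.
    return np.lexsort((tiebreak, -scores))

def expected_rank(scores, target):
    """Expected 1-based rank of `target` when ties are broken uniformly at random."""
    scores = np.asarray(scores, dtype=float)
    higher = np.sum(scores > scores[target])
    tied = np.sum(scores == scores[target])   # includes the target itself
    return float(higher + (tied + 1) / 2)

def normalized_rank(scores, target):
    """Assumed normalization: 0 if first in the list, 1 if last (needs >= 2 candidates)."""
    n = len(scores)
    return (expected_rank(scores, target) - 1.0) / (n - 1.0)

scores = [0.1, 0.9, 0.5]                      # hypothetical soft-matching scores
order = nomination_list(scores, np.random.default_rng(0))
tau_top = normalized_rank(scores, 1)          # highest score
tau_mid = normalized_rank(scores, 2)
tau_last = normalized_rank(scores, 0)
tau_tied = normalized_rank([0.5, 0.5, 0.1], 0)
```

Under this normalization, a uniformly random nomination list has expected normalized rank 0.5, so values below 0.5 indicate that the end-user would find the true match sooner than by random search.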

FIGURE 3
For pairs of 300-node graphs generated from a ρ-SBM(3, b, Λ), we plot the average normalized rank, τ(x′), as a function of the number of seeds, the correlation ρ, and the relative sizes of the two graphs.

Simulation experiments
We first explore the performance of Algorithm 2 in the ρ-RDPG setting, followed by the ρ-SBM setting (see Section 3 for descriptions of these models). To wit, we first generate pairs of graphs from a ρ-RDPG(X), where the latent positions X are uniformly chosen so that each row of X is a unit vector and, for any two rows X_i and X_j of X, the dot product of X_i and X_j lies in (0, 1). In Figure 2 we explore how the normalized rank τ(x′) is affected by the number of seeds used in the matching as compared against various correlation values ρ = 0, 0.3, 0.5, 0.7, 1 (2a) and disparities in the sizes of the graphs to be matched when ρ = 0.6 (2b). Next, we generate pairs of graphs from a ρ-SBM(3, b, Λ). In Figure 3, we explore how τ(x′) is affected by the number of seeds used in the matching as compared against correlation values ρ = 0, 0.3, 0.5, 0.7, 1 (3a) and disparities in the sizes of the graphs to be matched when ρ = 0.6 (3b). In order to explore how the number of seeds used in matching, s, affects the location of the VOI in the nomination list, in both the RDPG and SBM settings, we vary s from 1 to 9, and run 100 Monte Carlo replicates using VNmatch, with both the VOI and the seeds chosen uniformly at random in each Monte Carlo replicate. In Figures 2a and 3a, we record the average normalized rank of the VOI in the nomination list (±2 s.e.) for the RDPG and SBM settings, respectively. It is apparent that for sufficiently correlated networks, as the number of seeds increases, our proposed nomination scheme becomes more accurate; i.e., the location of the VOI in the nomination list is closer to the top of the list. For graphs with very low correlation, the uniformly poor performance can be attributed to both the lack of much common structure between G and G′ and the failure of SoftSGM to tease out this common structure. Since both G and G′ are dense networks, the local neighborhoods matched by VNmatch generally contained between 250 and 300 vertices each. Thus, the proportion of shared vertices in the matched neighborhoods is rather high for this example. To explore how the normalized rank of the VOI is influenced by matching graphs which differ in size, we next consider pairs of graphs on different sized vertex sets. We will set the number of vertices in the smaller graph, G′, to be a fraction of the 300 vertices of G, with the fraction ranging over 0.25, 0.30, …, 1. Let every vertex of G′ be shared, and suppose there exists an induced subgraph of G that, together with G′, is distributed as a ρ-RDPG(X) pair with ρ = 0.6 for Figure 2b and as a ρ-SBM(3, b, Λ) pair with ρ = 0.6 for Figure 3b. For each size fraction, we plot the average τ(x′) (±2 s.e.) over 100 Monte Carlo replicates for fixed s = 4. As can be seen, under this model, when the original networks G and G′ have a large discrepancy between the sizes of their vertex sets there is less accuracy in the VNmatch algorithm. Furthermore, the more obvious community structure present in the SBM setting contributes to better performance of the VNmatch algorithm. Although the neighborhoods actually matched by VNmatch need not have exactly the nominal size ratio in every replicate, the connectivity of the vertices is high, so the matched neighborhoods do not deviate much from being the full graphs.

FIGURE 4
The induced subgraphs for the High School Facebook and Survey networks generated by the shared vertices [35].
Li and Campbell explore the effects of utilizing seeds in graph matching problems in [28]. They found that although a small number of seeds can greatly increase the number of correctly matched vertices, as the number of shared users decreases so does the ability to find a good match. As might be expected, since the number of potential mismatches increases as the number of shared users decreases, Figures 2b and 3b are consistent with Li and Campbell's results.

Real data experiments
In this section, we explore two applications of VNmatch on real data. Section 5.2.1 explores a pair of high-school networks obtained from [35] in which the first graph is created based on student responses to a 'who-knows-who' survey and the second is a Facebook friendship network involving some of the same students. In Section 5.2.2, we consider Instagram and Twitter networks having overlapping vertex sets in which we would like to identify which Instagram profile corresponds to a particular Twitter profile.

Finding friends in high school networks
We consider two High School friendship networks on overlapping vertex sets published in [35]. The first network, having 156 vertices, represents a Facebook network of profiles in which two vertices are adjacent if the pair of individuals were friends on Facebook. The second network consists of 134 vertices, each representing a particular student, and two vertices are adjacent if one of the students reported that they are friends with the other student. There are 82 shared vertices across the two networks for which we know the bijection between the two vertex sets, and the remaining vertices are known to have no such correspondence. In the language of Section 1, n = 156, n′ = 134, s + m = 81 (the shared vertices other than the VOI), j = 74, and j′ = 52.
Due to the large number of unshared vertices (nearly 40% and 50% for the Survey and Facebook networks, respectively), for illustrative purposes we perform our analysis of this data set by looking at the induced subgraphs generated by the shared vertices. A brief glimpse into the effects of the unshared vertices can be found in the supplemental material accompanying this article. This step is purely for exploratory analysis and would not be feasible in practice, as we would not have prior knowledge about which vertices in the networks are shared as opposed to unshared. At the same time, immediate success of VNmatch is still not guaranteed since the structure of the two graphs is very different; see Figure 4. Furthermore, there appears to be a 2-block structure for each of the (shared) networks, although, if we were to model these networks, the block probability matrices for the two networks appear to differ (unlike in our simulation examples).
We first explore how VNmatch performs when finding the VOI using a single seed. Let G and G′ denote, respectively, the induced subgraphs of the High School Facebook and Friendship-Survey networks generated by the 82 shared vertices. We run 82 experiments, one for each vertex x of G considered as the VOI, and for each VOI we consider using each vertex in its 2-neighborhood N_2(x) as our single seed for VNmatch. In Figure 5, for each x, we plot how often the normalized rank τ(x′) falls in each of {0, (0, 0.5), [0.5, 1], ∅} in light green, dark green, light purple, and dark purple, respectively (colors listed in order as they appear in Figure 5 from bottom to top): when τ(x′) = 0, the true match x′ is at the top of the nomination list, which is the best case possible; when τ(x′) ∈ (0, 0.5), x′ is somewhere between the top of the nomination list and half-way down (i.e., better than chance, but not first); when τ(x′) ∈ [0.5, 1], the nomination list from VNmatch is worse than a uniformly random nomination list; and finally τ(x′) = ∅ means that x′ is not in the matched neighborhood in G′ and our algorithm cannot hope to nominate the correct vertex. The height of the stack represents the total number of vertices in N_2(x). While beyond the scope of this work, this figure points to the impact of seed selection, as well-chosen seeds can be the difference between perfect algorithmic performance and performance worse than chance. Note also that for vertices 6, 31, 36, and 49, x′ is not in the matched neighborhood in G′ for any choice of seed in N_2(x), so matching the two neighborhoods for these vertices would never be successful when the neighborhoods are restricted to two hops (h = 2). We next consider the effects of increasing the number of seeds s. For simplicity, we present our findings while considering vertex 27 to be the VOI. Vertex 27 shows moderately good performance using 1 seed in Figure 5, although not the best. We expect VNmatch to work equally well on any other vertex with similar (or better) performance to vertex 27 as noted in Figure 5.
With vertex = 27 as the VOI in , for each increasing from 2 to 9 we uniformly at random generate 100 seed sets from 2 ( ) and apply VNmatch to match and ′ using these seed sets. For = 1, rather than having 100 Monte Carlo replicates, we consider only the 47 possible seed sets of size 1 in 2 (27). Figure 6 displays ( ′ ) as a function of , with Figure 6a showing the general performance of ( ′ ) with respect to and Figure 6b displaying a frequency histogram (conditioned on ( ′ ) ≠ ) of ( ′ ) for each ∈ {1, … , 9}.
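The seed-set generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the graph is assumed to be a plain adjacency dictionary, and the helper names `two_hop_neighborhood` and `seed_sets` are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple


def two_hop_neighborhood(adj: Dict[Hashable, Iterable[Hashable]],
                         v: Hashable) -> Set[Hashable]:
    """Vertices within two hops of v, excluding v itself.

    ASSUMPTION: adj maps each vertex to an iterable of its neighbors.
    """
    one_hop = set(adj.get(v, ()))
    result = set(one_hop)
    for u in one_hop:
        result.update(adj.get(u, ()))
    result.discard(v)
    return result


def seed_sets(candidates: Iterable[Hashable], s: int,
              n_replicates: int = 100,
              rng: Optional[random.Random] = None) -> List[Tuple]:
    """Seed sets of size s drawn from the candidate vertices.

    For s == 1 every singleton is enumerated (no Monte Carlo sampling);
    otherwise n_replicates sets are drawn uniformly at random, each
    without replacement within the set.
    """
    pool = sorted(candidates)
    if s == 1:
        return [(c,) for c in pool]
    rng = rng or random.Random()
    return [tuple(rng.sample(pool, s)) for _ in range(n_replicates)]
```

Each returned seed set would then be passed to a VNmatch run. That step is omitted here because the algorithm's interface is not specified in this section.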

FIGURE 6
Using = 27 as the VOI, vary from 1 to 9 in VNmatch. For > 1, uniformly at random generate 100 seed sets from 2 ( ). For = 1, consider all 47 possible seed sets of size 1.

FIGURE 7
Graphs of a particular friend of the VOI for both Twitter and Instagram; VOI in red and seeds in pink.

Finding Friends on Instagram from Twitter
We next consider nominating across two publicly available social network datasets, one derived from Twitter and one derived from Instagram, where there is an edge between two vertices if one vertex follows the other in the respective social network. We consider a single vertex present in both the Twitter and Instagram networks and construct the two-hop neighborhoods of this vertex in each network, yielding a 163-vertex Twitter graph (Figure 7a) and a 28-vertex Instagram graph (Figure 7b). After identifying a VOI in each network, a simple metadata analysis of vertex features yields 10 potential seeds. In Figure 8, we plot the average value of ( ′ ) (±2 s.e.) when using a seed set of size = 2, 4, 6, 8, 10. To avoid pathologies arising from ′ ∉ ′ , we use vertex 8 as a seed in each experiment. As there are few seeds here, we average over all possible sets of seeds of size in each example.
There are a few takeaways from this figure. First, note that as the number of seeds increases, the performance of VNmatch improves significantly (i.e., the rank of ′ in Φ is closer to the top). In fact, we find that there are two vertices (including the central vertex in both graphs) whose presence in the seed set is crucial: if they are in the seed set then ( ′ ) = 0 every time, and if not then ( ′ ) > 0.5. Thus, the improvement in ( ′ ) in Figure 8 is due to the increased proportion of seed sets that contain the two crucial seeds for identifying the true match. Furthermore, these are the only two seeds that are adjacent to the vertex of interest. This indicates that in the future it may be beneficial to focus on which vertex properties affect seed usefulness in terms of assistance with matchability. Also note that these graphs are quite local (the full Twitter and Instagram networks would have far more than 10^7 vertices), yet our algorithm still performed quite well while considering only ≈ 10^2 vertices. Indeed, by whittling the networks down into local neighborhoods, we are able to leverage the rich local signal present across networks without the computational burden induced by working with the full, often massive, networks themselves.
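The averaging behind a plot like Figure 8 (all seed sets of a given size that contain one fixed seed, summarized as a mean with a band of two standard errors) can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, and the normalized ranks passed to `mean_with_two_se` would come from VNmatch runs that are not reproduced here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev
from typing import Hashable, Iterable, List, Sequence, Tuple


def seed_sets_containing(seeds: Iterable[Hashable], s: int,
                         required: Hashable) -> List[Tuple]:
    """All size-s subsets of seeds that include the required seed."""
    others = [x for x in seeds if x != required]
    return [(required,) + combo for combo in combinations(others, s - 1)]


def mean_with_two_se(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, mean - 2*s.e., mean + 2*s.e.).

    With fewer than two values the standard error is undefined, so the
    band collapses to the mean (e.g., when only one seed set exists).
    """
    m = mean(values)
    if len(values) < 2:
        return m, m, m
    se = stdev(values) / sqrt(len(values))
    return m, m - 2 * se, m + 2 * se
```

With 10 candidate seeds and one of them always required, this enumeration gives 9 seed sets of size 2 and a single seed set of size 10, which is why exhaustive averaging is practical in this example.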

CONCLUSIONS
In this paper, we introduce an across-graph vertex nomination scheme based on local neighborhood alignment for identifying a vertex of interest. Our algorithm operates locally within much larger networks, and can scale to be implemented in the very large networks ubiquitous in this age of big data. We demonstrated the efficacy of our principled methodology on both simulated and real data networks, including an application to networks from Twitter and Instagram.
In this paper, we have focused on finding the vertex in a second network that corresponds to the VOI in the first network; in our real-data examples, correspondence means that two nodes across the networks represent the same individual. Another application of this algorithm would be finding vertices, either across two networks or across two subnetworks of one larger network, that have a similar structural role across the two networks. Since the nomination list produced by the VNmatch algorithm already orders the vertices in a localized version of the second network by how similar their localized structural role is to that of the VOI in the first network, this extension follows immediately.
In the future, we would like to theoretically and empirically explore the impacts of network correlation and errors on VNmatch for various random graph models. We are also actively seeking to understand the effects of different types of seeds and what makes a "good" seed. The impact of unshared vertices and their connections on the performance of the VNmatch algorithm is still an open area of investigation. Applying VNmatch to multiple VOIs could be done either iteratively or simultaneously. Other questions to explore include the addition of vertex attributes and how to apply VNmatch simultaneously across multiple (more than 2) networks.
contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and DARPA, or the U.S. Government. The authors would also like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK, for support and hospitality during the program for Theoretical Foundations for Statistical Network Analysis where a portion of work on this paper was undertaken. The authors would also like to thank Jason Matterer for his helpful comments and suggestions.

Author contributions
Conceived and designed the methodology: HP VL CP. Performed the experiments: HP YP. Analyzed the data: HP YP. Wrote the paper: HP VL CP.

Financial disclosure
None reported.